awk to skip lines find text and add text based on number


 
# 1  
Old 02-15-2016

I am trying to use awk to skip each line that starts with ## or #, then check each remaining line for STB=. If that value is greater than or equal to 0.8, the text "STRAND BIAS" should be appended to the end of the line; otherwise "GOOD".

The attached file has 4 entries.

awk tried:
Code:
 awk NR > "##"' "#" -F"STB=" '{print $NF}' file

desired output:
Code:
##
##
##
....
....
....
#CHROM    POS    ID    REF    ALT    QUAL    FILTER
..... GOOD
..... GOOD
..... GOOD
..... STRAND BIAS

file:
Code:
##fileformat=VCFv4.1
##FILTER=<ID=NOCALL,Description="Generic filter. Filtering details stored in FR info tag.">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele frequency based on Flow Evaluator observation counts">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=FAO,Number=A,Type=Integer,Description="Flow Evaluator Alternate allele observation count">
##FORMAT=<ID=FDP,Number=1,Type=Integer,Description="Flow Evaluator Read Depth">
##FORMAT=<ID=FRO,Number=1,Type=Integer,Description="Flow Evaluator Reference allele observation count">
##FORMAT=<ID=FSAF,Number=A,Type=Integer,Description="Flow Evaluator Alternate allele observations on the forward strand">
##FORMAT=<ID=FSAR,Number=A,Type=Integer,Description="Flow Evaluator Alternate allele observations on the reverse strand">
##FORMAT=<ID=FSRF,Number=1,Type=Integer,Description="Flow Evaluator reference observations on the forward strand">
##FORMAT=<ID=FSRR,Number=1,Type=Integer,Description="Flow Evaluator reference observations on the reverse strand">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=SAF,Number=A,Type=Integer,Description="Alternate allele observations on the forward strand">
##FORMAT=<ID=SAR,Number=A,Type=Integer,Description="Alternate allele observations on the reverse strand">
##FORMAT=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
##FORMAT=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency based on Flow Evaluator observation counts">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
##INFO=<ID=FAO,Number=A,Type=Integer,Description="Flow Evaluator Alternate allele observations">
##INFO=<ID=FDP,Number=1,Type=Integer,Description="Flow Evaluator read depth at the locus">
##INFO=<ID=FR,Number=.,Type=String,Description="Reason why the variant was filtered.">
##INFO=<ID=FRO,Number=1,Type=Integer,Description="Flow Evaluator Reference allele observations">
##INFO=<ID=FSAF,Number=A,Type=Integer,Description="Flow Evaluator Alternate allele observations on the forward strand">
##INFO=<ID=FSAR,Number=A,Type=Integer,Description="Flow Evaluator Alternate allele observations on the reverse strand">
##INFO=<ID=FSRF,Number=1,Type=Integer,Description="Flow Evaluator Reference observations on the forward strand">
##INFO=<ID=FSRR,Number=1,Type=Integer,Description="Flow Evaluator Reference observations on the reverse strand">
##INFO=<ID=FWDB,Number=A,Type=Float,Description="Forward strand bias in prediction.">
##INFO=<ID=FXX,Number=1,Type=Float,Description="Flow Evaluator failed read ratio">
##INFO=<ID=HRUN,Number=A,Type=Integer,Description="Run length: the number of consecutive repeats of the alternate allele in the reference genome">
##INFO=<ID=HS,Number=0,Type=Flag,Description="Indicate it is at a hot spot">
##INFO=<ID=LEN,Number=A,Type=Integer,Description="allele length">
##INFO=<ID=MLLD,Number=A,Type=Float,Description="Mean log-likelihood delta per read.">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=PB,Number=A,Type=Float,Description="Bias of relative variant position in reference reads versus variant reads. Equals Mann-Whitney U rho statistic P(Y>X)+0.5P(Y=X)">
##INFO=<ID=PBP,Number=A,Type=Float,Description="Pval of relative variant position in reference reads versus variant reads.  Related to GATK ReadPosRankSumTest">
##INFO=<ID=QD,Number=1,Type=Float,Description="QualityByDepth as 4*QUAL/FDP (analogous to GATK)">
##INFO=<ID=RBI,Number=A,Type=Float,Description="Distance of bias parameters from zero.">
##INFO=<ID=REFB,Number=A,Type=Float,Description="Reference Hypothesis bias in prediction.">
##INFO=<ID=REVB,Number=A,Type=Float,Description="Reverse strand bias in prediction.">
##INFO=<ID=RO,Number=1,Type=Integer,Description="Reference allele observations">
##INFO=<ID=SAF,Number=A,Type=Integer,Description="Alternate allele observations on the forward strand">
##INFO=<ID=SAR,Number=A,Type=Integer,Description="Alternate allele observations on the reverse strand">
##INFO=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
##INFO=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
##INFO=<ID=SSEN,Number=A,Type=Float,Description="Strand-specific-error prediction on negative strand.">
##INFO=<ID=SSEP,Number=A,Type=Float,Description="Strand-specific-error prediction on positive strand.">
##INFO=<ID=SSSB,Number=A,Type=Float,Description="Strand-specific strand bias for allele.">
##INFO=<ID=STB,Number=A,Type=Float,Description="Strand bias in variant relative to reference.">
##INFO=<ID=STBP,Number=A,Type=Float,Description="Pval of Strand bias in variant relative to reference.">
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
##INFO=<ID=VARB,Number=A,Type=Float,Description="Variant Hypothesis bias in prediction.">
##LeftAlignVariants="analysis_type=LeftAlignVariants bypassFlowAlign=true kmer_len=19 min_var_count=5 short_suffix_match=5 min_indel_size=4 max_hp_length=8 min_var_freq=0.15 min_var_score=10.0 relative_strand_bias=0.8 output_mnv=0 sse_hp_size=0 sse_report_file= target_size=1.0 pref_kmer_max=3 pref_kmer_min=0 pref_delta_max=2 pref_delta_min=0 suff_kmer_max=3 suff_kmer_min=0 suff_delta_max=2 suff_delta_min=0 motif_min_ppv=0.2 generate_flow_position=0 analyze_missmatches=0 sse_rate=0.07 input_file=[] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL reference_sequence=/results/referenceLibrary/tmap-f3/hg19/hg19.fasta rodBind=[] nonDeterministicRandomSeed=false downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false BQSR=null defaultBaseQualities=-1 validation_strictness=SILENT unsafe=null num_threads=1 combined_sample_name= num_cpu_threads=null num_io_threads=null num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false logging_level=INFO log_to_file=null help=false variant=(RodBinding name=variant source=/results/analysis/output/Home/Auto_user_Proton-32-Lurie_Inh_Disease_151029_79_081/plugin_out/variantCaller_out.125/IonXpress_005/small_variants.sorted.vcf) out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub filter_mismatching_base_and_quals=false"
##basecallerVersion="4.6-11/0c0ef91"
##contig=<ID=chr1,length=249250621,assembly=hg19>
##contig=<ID=chr10,length=135534747,assembly=hg19>
##contig=<ID=chr11,length=135006516,assembly=hg19>
##contig=<ID=chr12,length=133851895,assembly=hg19>
##contig=<ID=chr13,length=115169878,assembly=hg19>
##contig=<ID=chr14,length=107349540,assembly=hg19>
##contig=<ID=chr15,length=102531392,assembly=hg19>
##contig=<ID=chr16,length=90354753,assembly=hg19>
##contig=<ID=chr17,length=81195210,assembly=hg19>
##contig=<ID=chr18,length=78077248,assembly=hg19>
##contig=<ID=chr19,length=59128983,assembly=hg19>
##contig=<ID=chr2,length=243199373,assembly=hg19>
##contig=<ID=chr20,length=63025520,assembly=hg19>
##contig=<ID=chr21,length=48129895,assembly=hg19>
##contig=<ID=chr22,length=51304566,assembly=hg19>
##contig=<ID=chr3,length=198022430,assembly=hg19>
##contig=<ID=chr4,length=191154276,assembly=hg19>
##contig=<ID=chr5,length=180915260,assembly=hg19>
##contig=<ID=chr6,length=171115067,assembly=hg19>
##contig=<ID=chr7,length=159138663,assembly=hg19>
##contig=<ID=chr8,length=146364022,assembly=hg19>
##contig=<ID=chr9,length=141213431,assembly=hg19>
##contig=<ID=chrM,length=16569,assembly=hg19>
##contig=<ID=chrX,length=155270560,assembly=hg19>
##contig=<ID=chrY,length=59373566,assembly=hg19>
##fileDate=20151029
##fileUTCtime=2015-10-29T22:48:03
##parametersDetails="germline_low_stringency_proton, TS version: 4.6"
##parametersName="Generic - Proton - Germ Line - Low Stringency"
##phasing=none
##reference=/results/referenceLibrary/tmap-f3/hg19/hg19.fasta
##reference=file:///results/referenceLibrary/tmap-f3/hg19/hg19.fasta
##source="tvc 4.6-11 (0c0ef91) - Torrent Variant Caller"
##tmapVersion="4.6.11 (0c0ef91) (201506161725)"
##INFO=<ID=OID,Number=.,Type=String,Description="List of original Hotspot IDs">
##INFO=<ID=OPOS,Number=.,Type=Integer,Description="List of original allele positions">
##INFO=<ID=OREF,Number=.,Type=String,Description="List of original reference bases">
##INFO=<ID=OALT,Number=.,Type=String,Description="List of original variant bases">
##INFO=<ID=OMAPALT,Number=.,Type=String,Description="Maps OID,OPOS,OREF,OALT entries to specific ALT alleles">
##deamination_metric=0.23163526491
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    NA12878chr1    977330    .    T    C    519.68    PASS    F=1;AO=55;DP=55;FAO=55;FDP=55;FR=.;FRO=0;FSAF=37;FSAR=18;FSRF=0;FSRR=0;FWDB=-0.0448496;FXX=0;HRUN=1;LEN=1;MLLD=88.0543;PB=0.5;PBP=1;QD=37.7947;RBI=0.0449422;REFB=0;REVB=0.0028832;RO=0;SAF=37;SAR=18;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=3.85968e-08;STB=0.5;STBP=1;TYPE=snp;VARB=-0.00027863;OID=.;OPOS=977330;OREF=T;OALT=C;OMAPALT=C    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:25:55:55:0:0:55:55:1:18:37:0:0:18:37:0:0
chr1    981931    .    A    G    1169.7    PASS    AF=0.984375;AO=125;DP=131;FAO=126;FDP=128;FR=.;FRO=2;FSAF=67;FSAR=59;FSRF=2;FSRR=0;FWDB=-0.000335669;FXX=0.022899;HRUN=1;LEN=1;MLLD=58.6432;PB=0.5;PBP=1;QD=36.5532;RBI=0.0593811;REFB=-0.0178713;REVB=-0.0593801;RO=2;SAF=66;SAR=59;SRF=2;SRR=0;SSEN=0;SSEP=0;SSSB=-0.0141789;STB=0.507352;STBP=0.247;TYPE=snp;VARB=-0.000577363;OID=.;OPOS=981931;OREF=A;OALT=G;OMAPALT=G    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:41:131:128:2:2:125:126:0.984375:59:66:2:0:59:67:2:0
chr1    982994    .    T    C    3016.4    PASS    AF=1;AO=317;DP=318;FAO=317;FDP=317;FR=.;FRO=0;FSAF=114;FSAR=203;FSRF=0;FSRR=0;FWDB=-0.0880862;FXX=0.00314456;HRUN=4;LEN=1;MLLD=48.0245;PB=0.5;PBP=1;QD=38.0619;RBI=0.13654;REFB=0.0494458;REVB=-0.104326;RO=1;SAF=114;SAR=203;SRF=0;SRR=1;SSEN=0;SSEP=0;SSSB=0.00234416;STB=0.5;STBP=1;TYPE=snp;VARB=-0.000568883;OID=.;OPOS=982994;OREF=T;OALT=C;OMAPALT=C    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:99:318:317:1:0:317:317:1:203:114:0:1:203:114:0:0
chr1    981931    .    A    C    1169.7    PASS    AF=0.984375;AO=125;DP=131;FAO=126;FDP=128;FR=.;FRO=2;FSAF=67;FSAR=59;FSRF=2;FSRR=0;FWDB=-0.000335669;FXX=0.022899;HRUN=1;LEN=1;MLLD=58.6432;PB=0.5;PBP=1;QD=36.5532;RBI=0.0593811;REFB=-0.0178713;REVB=-0.0593801;RO=2;SAF=66;SAR=59;SRF=2;SRR=0;SSEN=0;SSEP=0;SSSB=-0.0141789;STB=0.507352;STBP=0.247;TYPE=snp;VARB=-0.000577363;OID=.;OPOS=981931;OREF=A;OALT=G;OMAPALT=G    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:41:131:128:2:2:125:126:0.984375:59:66:2:0:59:67:2:0
chr1    982994    .    -    C    3016.4    PASS    AF=1;AO=317;DP=21;FAO=317;FDP=20;FR=.;FRO=0;FSAF=114;FSAR=203;FSRF=0;FSRR=0;FWDB=-0.0880862;FXX=0.00314456;HRUN=4;LEN=1;MLLD=48.0245;PB=0.5;PBP=1;QD=38.0619;RBI=0.13654;REFB=0.0494458;REVB=-0.104326;RO=1;SAF=114;SAR=203;SRF=0;SRR=1;SSEN=0;SSEP=0;SSSB=0.00234416;STB=0.9;STBP=1;TYPE=snp;VARB=-0.000568883;OID=.;OPOS=982994;OREF=T;OALT=C;OMAPALT=C    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:99:318:317:1:0:317:317:1:203:114:0:1:203:114:0:0


Last edited by cmccabe; 02-16-2016 at 05:15 PM.. Reason: updated missing line, updated FDP= value
# 2  
Old 02-15-2016
Perhaps Perl?
Code:
perl -ple '/^[^#].*STB=(\d+\.\d+);/ and $_.=$1 >= 0.8?" STRAND BIAS":" GOOD"'


Last edited by Aia; 02-15-2016 at 10:32 PM..
# 3  
Old 02-16-2016
awk does not have the nice ( ) capture groups in a match that perl has.
Code:
awk -v M=";STB=" '/^[^#]/ && match($0,M"[^;]*") {LM=length(M); print $0, (substr($0,RSTART+LM,RLENGTH-LM)>=0.8 ? "STRAND BIAS" : "GOOD")}' awk_test.txt

---------- Post updated at 02:51 PM ---------- Previous update was at 02:38 PM ----------

Using -F ";STB=" as the field separator simplifies it a bit:
Code:
awk -F ";STB=" '/^[^#]/ && match($2,"[^;]*") {print $0, (substr($2,RSTART,RLENGTH)>=0.8 ? "STRAND BIAS" : "GOOD")}' awk_test.txt

---------- Post updated at 03:01 PM ---------- Previous update was at 02:51 PM ----------

In case you want to keep the comments:
Code:
awk -F ";STB=" '/^#/ {print; next} match($2,"[^;]*") {print $0, (substr($2,RSTART,RLENGTH)>=0.8 ? "STRAND BIAS" : "GOOD")}' awk_test.txt
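
The same match()/substr() idea can be wrapped in a small awk function so that other tags are easy to look up as well. This is only a sketch: the names tag_variants and info_value are made up for illustration, and it assumes each tag is preceded by a ";", a tab, or a space (or starts the line) and that its value ends at the next ";" or whitespace.

```shell
# Hypothetical helper: reads VCF-like text on stdin, writes tagged lines to stdout.
tag_variants() {
  awk '
    # Return the value that follows "key=" in string s, or "" if the key is absent.
    function info_value(s, key,    pat) {
      pat = "(^|[;\t ])" key "=[^;\t ]*"
      if (!match(s, pat)) return ""
      s = substr(s, RSTART, RLENGTH)   # e.g. ";STB=0.5"
      sub(/^[^=]*=/, "", s)            # drop everything up to and including "="
      return s
    }
    /^#/ { print; next }               # header and comment lines pass through unchanged
    {
      stb = info_value($0, "STB")
      if (stb == "") { print; next }   # no STB= on this line: leave it alone
      print $0, (stb + 0 >= 0.8 ? "STRAND BIAS" : "GOOD")
    }
  '
}
```

Usage would be something like: tag_variants < awk_test.txt. The stb + 0 forces a numeric comparison, and requiring "=" directly after the key keeps STB= from matching STBP=.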

# 4  
Old 02-16-2016
In addition to capturing the STB= value, how can I also capture the FDP= value, so that the FDP= value followed by "reads" appears next to the text "STRAND BIAS" or "GOOD"? Thank you.

Code:
perl -ple '/^[^#].*FDP=(\d+);*STB=(\d+\.\d+);/ and $1_= <30 $_.=$2 >= 0.8?" STRAND BIAS":" GOOD""$1 "reads""'

desired output:
Code:
##
##
##
....
....
....
#CHROM    POS    ID    REF    ALT    QUAL    FILTER
..... GOOD  128 reads
..... GOOD  317 reads
..... GOOD  128 reads
..... STRAND BIAS  20 reads

# 5  
Old 02-16-2016
Please try:
Code:
perl -ple '/^[^#].*FDP=(\d+);.*STB=(\d+\.\d+);/ and $_.=($2 >= 0.8?" STRAND BIAS ":" GOOD ").$1." reads"'
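
For comparison, a rough awk counterpart. It is a sketch rather than a drop-in replacement: it assumes the INFO string is the 8th whitespace-separated column (as in the sample file), and the function name tag_with_reads is invented here. Instead of a regex, it splits INFO on ";" into a key/value array, so any tag can be read by name.

```shell
# Hypothetical helper: reads VCF-like text on stdin, appends label and read count.
tag_with_reads() {
  awk '
    /^#/ { print; next }                 # keep header and comment lines as they are
    {
      split("", info)                    # portable way to empty the array
      n = split($8, pair, ";")           # assumption: INFO is column 8
      for (i = 1; i <= n; i++) {
        eq = index(pair[i], "=")
        if (eq) info[substr(pair[i], 1, eq - 1)] = substr(pair[i], eq + 1)
      }
      if (!("STB" in info) || !("FDP" in info)) { print; next }
      label = (info["STB"] + 0 >= 0.8) ? "STRAND BIAS" : "GOOD"
      print $0, label, info["FDP"], "reads"
    }
  '
}
```

Usage would be something like: tag_with_reads < awk_test.txt. Lines that lack either STB= or FDP= are printed unchanged rather than tagged.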

# 6  
Old 02-17-2016
Sometimes a multi-liner is easier to understand and expand:
Code:
perl -ple '
/^#/ and next;
/;STB=([^;]+)/ and $_.=($1 >= 0.8 ? " STRAND BIAS " : " GOOD ");
/;FDP=([^;]+)/ and $_.=$1." reads";
' awk_test.txt

The text captured by ( ) is referred to as $1.
$_ is the input line. /string/ is short for $_ =~ m/string/
The .= appends the string to $_. It is short for $_ = $_ . string
The perl -p option loops over the input and prints $_ at the end of each cycle. (The -n option only loops.)
In loop mode the next statement jumps to the next cycle. (Like in awk, which is always in loop mode.)
# 7  
Old 02-17-2016
Thank you both very much.