awk RS/ORS error


 
# 1  
Old 03-08-2016
awk RS/ORS problem

Hello,
I am trying to filter a FASTQ file (in short, every 4 lines form one record) based on the GC content of the sequence (i.e. field 2), which is the fraction of G/C characters in the string. The example script should pick the records whose sequence (second field) has a GC content > 0.6.
One special point: the "@" symbol is always the first character of the first line in each record, but it may also appear in the third field, anywhere except the first position.
A sample input.file is:
Code:
@HWI-ST1410:193:C7847ANXX:3:1101:3144:2591
CCGCTTGGAGCGGATCAGGTAGTCGACCTGCTTAAGGAGGGC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@HWI-ST1410:193:C7847ANXX:3:1101:3050:2607
CAAAAAAAATTTTCTATTTTACATATACAATGAAGAACGTCACTG
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFHHH
@HWI-ST1410:193:C7847ANXX:3:1101:3075:2609
CACTGTACTAAGCTTTGGCGCTGATTCCATAATTTCTTTCTC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@HWI-ST1410:193:C7847ANXX:3:1101:3098:2622
GGTACGTACACATAATCCGTTGACTAGCTCGATACGATTACG
+
BBBBBFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFF
@HWI-ST1410:193:C7847ANXX:3:1101:3097:2667
CCCGGCGGGAGAGGGACGGCAGGCTCGTCGGCGCCACAATCG
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

So far, my script is:
Code:
awk 'BEGIN{RS="\n@"; FS="\n"; OFS="\n"} {s=$2; if (gsub(/[GC]/, "x", $2)/length($2)>0.6) print "@"$1, s, $3, $4}' input.file

My script does not do what I want: the first record always gets a double "@@" in front of its record/sequence name.
How can I deal with the first record, which has no "\n@" in front of it to act as the RS? Also, sometimes the "@" symbol is NOT put back in front of $1 to restore the original string.
I am using GNU Awk 4.0.1 under Linux 3.19.0-32-generic ~14.04.1 Ubuntu.
Thanks a lot for any clue!

Last edited by yifangt; 03-08-2016 at 06:59 PM.. Reason: typos
# 2  
Old 03-08-2016
try:
Code:
awk '
/^@/ && m && c > .6 {printf "%s", m;}
/^@/ {r=NR+1; m=""; c=0;}
{m=m $0 "\n";}
r==NR {c=gsub(/[GC]/, "x")/length($0);}
END { if (m && c > .6) printf "%s", m;}
' input.file

This User Gave Thanks to rdrtx1 For This Post:
# 3  
Old 03-08-2016
FWIW, the first record gets a "@@" because it still starts with "@": there is no newline (\n) before the first record, so the separator specified by RS="\n@" never matches there and the leading "@" is not stripped.
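To see the effect, here is a minimal sketch on a made-up two-record input (the names r1 and r2 are only for illustration). It assumes gawk or mawk, because a multi-character RS is an extension; POSIX awk only uses the first character of RS.

```shell
# Print the first field of each record when RS is "\n@".
# Only the first record keeps its leading "@"; later ones lose it to RS.
demo=$(printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' |
  awk 'BEGIN { RS = "\n@"; FS = "\n" } { print NR ": " $1 }')
printf '%s\n' "$demo"
```

With gawk this should show "1: @r1" and "2: r2", so blindly prepending "@" to $1 doubles it on the first record only.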

So this could be corrected by cutting the leading @ from the first record, using gawk 4:
Code:
awk 'BEGIN{RS="\n@"; FS="\n"; OFS="\n"} FNR==1{sub(/^@/,x)} { s=$2; if (gsub(/[GC]/, "x", $2)/length($2)>0.6) print "@"$1, s, $3, $4}' file


Are you sure that what is now $2 in the sample is never more than one line?

Last edited by Scrutinizer; 03-08-2016 at 11:17 PM..
This User Gave Thanks to Scrutinizer For This Post:
# 4  
Old 03-09-2016
Thanks Scrutinizer!
"Are you sure that what is now $2 in the sample is never more than one line?"
If I understand your question correctly, then yes: $2 is never more than one line, because every record is exactly 4 lines.
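Given that every record is exactly 4 lines, the filter can also be sketched without touching RS at all, by counting lines with NR. This is only an illustration under that 4-line assumption; the variable names (min, name, seq, plus, gc) are made up here, and the input is two records taken from the sample above.

```shell
# Two records from the sample: GC fractions are 26/42 and 17/42.
fastq='@HWI-ST1410:193:C7847ANXX:3:1101:3144:2591
CCGCTTGGAGCGGATCAGGTAGTCGACCTGCTTAAGGAGGGC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@HWI-ST1410:193:C7847ANXX:3:1101:3075:2609
CACTGTACTAAGCTTTGGCGCTGATTCCATAATTTCTTTCTC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF'

kept=$(printf '%s\n' "$fastq" | awk -v min=0.6 '
  NR % 4 == 1 { name = $0 }            # line 1 of a record: header
  NR % 4 == 2 {                        # line 2: sequence
    seq = $0
    n = gsub(/[GC]/, "&")              # "&" re-inserts the match, so $0 is unchanged
    gc = length(seq) ? n / length(seq) : 0
  }
  NR % 4 == 3 { plus = $0 }            # line 3: separator
  NR % 4 == 0 && gc > min {            # line 4: quality, record is complete
    print name; print seq; print plus; print $0
  }
')
printf '%s\n' "$kept"
```

Because the "@" is never consumed as a separator, nothing has to be put back, and an "@" inside a later line cannot start a new record.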

One more thing to confirm about sub(/^@/, x): I was expecting an 'x' character at the beginning of the first record, but it is not there. The GNU manual says this about the sub() function:
Code:
.....Some versions of awk allow the third argument to be an expression that is not an lvalue.  
In such a case, sub() still searches for the pattern and returns zero or one, 
but the result of the substitution (if any) is thrown away because there is no place to put it.

So, in your script, the substituted 'x' was thrown away, and the original "@" is put back by print "@"$1. Am I right?

Last edited by yifangt; 03-09-2016 at 11:45 AM..
# 5  
Old 03-09-2016
Hi yifangt, x is an unset variable (it is not a string, since there are no quotes around it), so it evaluates to "".
The call can therefore also be written as sub(/^@/, ""), which means: delete the leading @ of the record.

Using "" is probably clearer, so I would use that...
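A quick way to check this, as a small sketch (the string "@name" and the variable s are made up for the test):

```shell
# x is never assigned, so awk evaluates it as "" in a string context.
res_unset=$(awk 'BEGIN { s = "@name"; sub(/^@/, x, s); print s }')
# The explicit empty string does exactly the same thing.
res_empty=$(awk 'BEGIN { s = "@name"; sub(/^@/, "", s); print s }')
printf '%s %s\n' "$res_unset" "$res_empty"   # both forms print: name
```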

Last edited by Scrutinizer; 03-09-2016 at 03:45 PM..
This User Gave Thanks to Scrutinizer For This Post:
# 6  
Old 03-09-2016
Hi Scrutinizer, thanks a lot for your explanation!
rdrtx1, could you, or anybody else, please explain your script? I am having a hard time understanding it.
Thanks again!
# 7  
Old 03-09-2016
Code:
awk '
/^@/ && m && c > .6 {printf "%s", m;}    # if the line starts with @, string m is non-empty and the GC fraction c is > .6, print string m
/^@/ {r=NR+1; m=""; c=0;}                # if the line starts with @, store the number of the next line (the sequence) in r; clear m and c
{m=m $0 "\n";}                           # append the current line to string m
r==NR {c=gsub(/[GC]/, "x")/length($0);}  # on the sequence line, compute the GC fraction and store it in c
END { if (m && c > .6) printf "%s", m;}  # at end of input, print the last record if it exists and c is > .6
' input.file
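A note on the printf calls: printf treats its first argument as a format string, and FASTQ quality lines may legitimately contain a "%" character. Passing the stored record as an argument to "%s" prints it literally. A minimal sketch with a made-up record (the quality string I%I5 is invented for the example):

```shell
# The record text goes in as data, not as the format, so "%" is left alone.
safe=$(awk 'BEGIN { m = "@r1\nACGT\n+\nI%I5\n"; printf "%s", m }')
printf '%s\n' "$safe"
```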
