awk getline problem


 
# 1  
Old 12-13-2012

Hello,
I want to print out the DNA sequence entries (tens of thousands!) that are longer than a certain value (i=200) from a FASTA file like this:
Code:
>S94D_ctg_8004 Average coverage: 402.95
ATAATGCCTGTGAATATGACATGTGTTCCTGTTTCTACATCAGACTACTATTCTTGCATA
TCCCGCATCAGTCACTTATGATAAGGCACCTTTAAGTCGCCACTGTGATATTGGTTGTGT
CTTGCATATGATTTCTAGCATGTACTGCTTCTAATTTGTTGAAACGCCATACAGTTTGAA
CTTGGCAGAGTGATATTGGTTGCATCTTTCATATGGTGTCTAGCTTGCAGTGCTTCTAAT
TT
>S94D_ctg_8022 Average coverage: 212.74
CGTTGTCTGTGTGACATTTAAGAGTGTTTTGTGCAGTGCAACAAAAGGTCACAATGGCAA
CGAGGTGAAGTAGATAACCCTGTTTACACAGAGGTGCAAAGGAAACAACTAGTGGTCAAT
AACCCTGTTTACCAGCACCATGTGCATCTACCGATGTACTGGGGTGATGCACAGTCTGTT
G
>S94D_ctg_8062 Average coverage: 710.87
ATGAATTTCAGACGAGTTTTGGATTTACTAGAATTTAAAAACCAGGCATCTCAATGTTTT
GCCGGCAATCAACGGTGCCCTGGTGTTTGAAATTCATTCCCATTTCTTGCATGGGACCTA
AGCATGCACCCAAGGACA
>S94D_ctg_9034 Average coverage: 37.84
AGATTATTTTGTCTGCCATGTATAATTTTGGTTGATGTTTAGCCTGTTGTGCTTAACATG
CTTCTCGACGTACCTACACAGGACAATTTGGGAACGACTGCTGTTTTCCATCGAGGTTAG
TTTCATCCCATGGCTTATATCTGCTCAATGTTCAGGATATCGGTAGCCGGTACCATATAG
GCCGGCGGCTGATAGGAGACTAATCGGTGAATCGGACT

I tried:
Code:
awk '! /^>S/ { next } { getline seq } (length(seq) >= i) { print $0 "\n" seq }' i=200 infile.fasta

Nothing was printed.
I noticed that the DNA sequence lines are 60 bp wide. If i is less than 60, only the header (the line starting with ">") and the first sequence line are printed, but I expect all the sequence lines under each header. If i is bigger than 60, nothing is printed. Most of the entries in infile.fasta are certainly longer than 60 bp. Did I miss anything?
Thanks a lot!
# 2  
Old 12-13-2012
getline gets lines, it doesn't paste them together.

I don't think you need getline here in any case; awk doesn't need getline's help to read lines one at a time from the default source.

It'd be helpful to see what this program should be doing, not just what it isn't.
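
To see why the length test can never get past the line width: `getline seq` stores only the single next input line in `seq`, so `length(seq)` is at most one line's worth of characters. A small sketch of that behaviour, keeping the structure of the original one-liner; the sample file contents and the `out` variable are made up for illustration:

```shell
# Made-up sample: one header followed by three short sequence lines.
sample=$(mktemp)
printf '>id1 demo\nAAAAA\nCCCCC\nGG\n' > "$sample"

# Non-header lines are skipped; on a header line, getline stores ONE line in seq.
out=$(awk '!/^>/ { next }
           { getline seq; print $0 " -> seq=" seq " (length " length(seq) ")" }' "$sample")
printf '%s\n' "$out"   # prints: >id1 demo -> seq=AAAAA (length 5)
rm -f "$sample"
```

The remaining lines (CCCCC and GG) never reach `seq`; they are read as ordinary records and dropped by `next`.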
This User Gave Thanks to Corona688 For This Post:
# 3  
Old 12-13-2012
I *think* this is what you want.
awk -v len=200 -f yi.awk myFile
yi.awk:
Code:
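# Print header h and sequence r when the sequence, not counting the
# newlines that join its lines, is longer than len.
function doPrint(r,h,   n)
{
     n=gsub(RS,RS,r)        # n = number of newlines embedded in r
     if (length(r)-n>len)
       print h RS r
}
# Header line: flush the previous entry, then start a new one.
/^>S/ {
   if (h) {
     doPrint(r,h)
     r=""
   }
   h=$0
   next
}
# Sequence line: append it to the current entry, joined by RS.
{r=(r)?r RS $0:$0}
# Flush the last entry.
END {
  doPrint(r,h)
}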
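
The `gsub(RS,RS,r)` call in `doPrint` can look odd at first: it replaces every newline with itself, so `r` is unchanged, and the return value is simply the number of newlines in `r`. A minimal sketch of just that step, using a made-up three-line sequence:

```shell
out=$(awk 'BEGIN {
  r = "AAAAA" RS "CCCCC" RS "GG"   # three sequence lines joined with RS, as in the script
  n = gsub(RS, RS, r)               # newline replaced by newline; n is the match count
  print n, length(r), length(r) - n
}')
printf '%s\n' "$out"   # prints: 2 14 12
```

Here 14 is the length including the two embedded newlines, and 12 is the number of bases that gets compared against `len`.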

This User Gave Thanks to vgersh99 For This Post:
# 4  
Old 12-13-2012
The interesting part is that the script works with another FASTA file in which each sequence line is wider, at 120 bp:
Code:
>S94D_ctg_8004 Average coverage: 402.95
ATAATGCCTGTGAATATGACATGTGTTCCTGTTTCTACATCAGACTACTATTCTTGCATATCCCGCATCAGTCACTTATGATAAGGCACCTTTAAGTCGCCACTGTGATATTGGTTGTGT
CTTGCATATGATTTCTAGCATGTACTGCTTCTAATTTGTTGAAACGCCATACAGTTTGAACTTGGCAGAGTGATATTGGTTGCATCTTTCATATGGTGTCTAGCTTGCAGTGCTTCTAAT
TT
>S94D_ctg_8022 Average coverage: 212.74
CGTTGTCTGTGTGACATTTAAGAGTGTTTTGTGCAGTGCAACAAAAGGTCACAATGGCAACGAGGTGAAGTAGATAACCCTGTTTACACAGAGGTGCAAAGGAAACAACTAGTGGTCAAT
AACCCTGTTTACCAGCACCATGTGCATCTACCGATGTACTGGGGTGATGCACAGTCTGTTG
>S94D_ctg_8062 Average coverage: 710.87
ATGAATTTCAGACGAGTTTTGGATTTACTAGAATTTAAAAACCAGGCATCTCAATGTTTTGCCGGCAATCAACGGTGCCCTGGTGTTTGAAATTCATTCCCATTTCTTGCATGGGACCTA
AGCATGCACCCAAGGACA
>S94D_ctg_9034 Average coverage: 37.84
AGATTATTTTGTCTGCCATGTATAATTTTGGTTGATGTTTAGCCTGTTGTGCTTAACATGCTTCTCGACGTACCTACACAGGACAATTTGGGAACGACTGCTGTTTTCCATCGAGGTTAG
TTTCATCCCATGGCTTATATCTGCTCAATGTTCAGGATATCGGTAGCCGGTACCATATAGGCCGGCGGCTGATAGGAGACTAATCGGTGAATCGGACT

I thought the problem might be related to the newlines, but I checked and could not find anything abnormal. I am not sure the line endings (EOL) are really the same in the two files. I will check. Thank you both!
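
One quick way to compare line endings between two files is to count the lines that end in a carriage return (DOS-style CRLF). A minimal sketch, assuming a made-up sample in which two of three lines carry a CR; for a real check the temporary file would be replaced by the FASTA file itself:

```shell
# Made-up sample: the first two lines end in CRLF, the last in LF only.
sample=$(mktemp)
printf '>a\r\nAAAAA\r\nCC\n' > "$sample"

# Count lines whose last character is a carriage return.
out=$(awk '/\r$/ { n++ } END { printf "%d of %d lines end in CR\n", n, NR }' "$sample")
printf '%s\n' "$out"   # prints: 2 of 3 lines end in CR
rm -f "$sample"
```

A stray CR adds one character per line to `length()`, so it is worth ruling out when counts look slightly off.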
# 5  
Old 12-13-2012
A program which doesn't do what you want really doesn't help explain what you do want, and you still haven't shown output demonstrating what you actually want. So we're still a bit in the dark here.
# 6  
Old 12-13-2012
Hi, Corona:
For example, infile:
Code:
>S94D_ctg_8004 Average coverage: 402.95
ATAATGCCTGTGAATATGACATGTGTTCCTGTTTCTACATCAGACTACTATTCTTGCATA
TCCCGCATCAGTCACTTATGATAAGGCACCTTTAAGTCGCCACTGTGATATTGGTTGTGT
CTTGCATATGATTTCTAGCATGTACTGCTTCTAATTTGTTGAAACGCCATACAGTTTGAA
CTTGGCAGAGTGATATTGGTTGCATCTTTCATATGGTGTCTAGCTTGCAGTGCTTCTAAT
TT
>S94D_ctg_8022 Average coverage: 212.74
CGTTGTCTGTGTGACATTTAAGAGTGTTTTGTGCAGTGCAACAAAAGGTCACAATGGCAA
CGAGGTGAAGTAGATAACCCTGTTTACACAGAGGTGCAAAGGAAACAACTAGTGGTCAAT
AACCCTGTTTACCAGCACCATGTGCATCTACCGATGTACTGGGGTGATGCACAGTCTGTT
G
>S94D_ctg_8062 Average coverage: 710.87
ATGAATTTCAGACGAGTTTTGGATTTACTAGAATTTAAAAACCAGGCATCTCAATGTTTT
GCCGGCAATCAACGGTGCCCTGGTGTTTGAAATTCATTCCCATTTCTTGCATGGGACCTA
AGCATGCACCCAAGGACA
>S94D_ctg_9034 Average coverage: 37.84
AGATTATTTTGTCTGCCATGTATAATTTTGGTTGATGTTTAGCCTGTTGTGCTTAACATG
CTTCTCGACGTACCTACACAGGACAATTTGGGAACGACTGCTGTTTTCCATCGAGGTTAG
TTTCATCCCATGGCTTATATCTGCTCAATGTTCAGGATATCGGTAGCCGGTACCATATAG
GCCGGCGGCTGATAGGAGACTAATCGGTGAATCGGACT

I want all the entries with sequences shorter than 200 bp filtered out, to get an outfile like this:
Code:
>S94D_ctg_8004 Average coverage: 402.95
ATAATGCCTGTGAATATGACATGTGTTCCTGTTTCTACATCAGACTACTATTCTTGCATA
TCCCGCATCAGTCACTTATGATAAGGCACCTTTAAGTCGCCACTGTGATATTGGTTGTGT
CTTGCATATGATTTCTAGCATGTACTGCTTCTAATTTGTTGAAACGCCATACAGTTTGAA
CTTGGCAGAGTGATATTGGTTGCATCTTTCATATGGTGTCTAGCTTGCAGTGCTTCTAAT
TT
>S94D_ctg_9034 Average coverage: 37.84
AGATTATTTTGTCTGCCATGTATAATTTTGGTTGATGTTTAGCCTGTTGTGCTTAACATG
CTTCTCGACGTACCTACACAGGACAATTTGGGAACGACTGCTGTTTTCCATCGAGGTTAG
TTTCATCCCATGGCTTATATCTGCTCAATGTTCAGGATATCGGTAGCCGGTACCATATAG
GCCGGCGGCTGATAGGAGACTAATCGGTGAATCGGACT

There are some BioPerl scripts for this job, but I like awk, which can do it in a flash. Thanks!
# 7  
Old 12-13-2012
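Do you have GNU awk? You can change the meaning of 'record' with RS, split on "\n>" instead of "\n", and count the length more directly. The two sub() calls strip the leading ">" from the first record and the trailing newline from the last one, so the first header isn't printed with a doubled ">":

Code:
awk -v RS="\n>" -F"\n" '{ sub(/^>/, ""); sub(/\n$/, ""); V=$0; gsub(/\n/, "", V); if((length(V)-length($1)) > 200) print ">"$0 }' infile.fasta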
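
If GNU awk isn't available, a single-character RS is a more portable variation on the same idea, since POSIX leaves multi-character RS unspecified. This is only a sketch: it assumes ">" occurs nowhere except at the start of header lines, and the sample data, the `out` variable, and the threshold of 6 are made up for illustration.

```shell
# Made-up sample: entries a (10 bases), b (2 bases), c (7 bases).
sample=$(mktemp)
printf '>a\nAAAAA\nCCCCC\n>b\nGG\n>c\nTTTTT\nTT\n' > "$sample"

# Each record is one entry; field 1 is the header, fields 2..NF the sequence lines.
out=$(awk -v RS='>' -F'\n' -v len=6 '
  NF > 1 {                                  # skips the empty record before the first ">"
    n = 0
    for (i = 2; i <= NF; i++) n += length($i)
    if (n > len) printf ">%s", $0           # $0 already ends with its own newline
  }' "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

With this sample, entries a and c are printed and b is dropped. For the FASTA file in this thread, `len=200` would take the place of 6.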
