Help with BASH/AWK queries ....


 
# 1  
Old 12-05-2010

Hi Everyone,

I have an input file in the following format:
score.file1.txt
Code:
contig00045 length=566   numreads=19    1047    0.0
contig00055 length=524   numreads=7    793    0.0
contig00052 length=535   numreads=10    607    e-176
contig00072 length=472   numreads=46    571    e-165
contig00019 length=667   numreads=5    474    e-136
I've a second file:
data.file1.txt
Code:
>contig00045 length=566   numreads=19
GGGCTGACGTGCCGCTAATACGACTCACTATAGGGAGAGCATAAAACACG
CCTCCTGAGCTGCAGCAGAAAAAGAGACTCCCCTTGAGCTTTCAGATTGA
>contig00055 length=524   numreads=7
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGGAGGGAGGAT
GCTGGAC
>contig00052 length=535   numreads=10
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGGGATGTCCAC
AGGCAGAGGGATGTCCACAGGCAGAGGGATGCCACAGGCA
>contig00072 length=472   numreads=46
TTTAGCTGCTTTCCCCCGGAGGAGATTTGAATTCCGGTGAAATCCAGGCT
TTGTTCATTTTAATAAGCGTCAGCCTGTCAGCGCTGTCAGTTGACAGGCG
>contig00019 length=667   numreads=5
TATAGGGAGAGTGGCATTCTAATAACAGGGGACGGGGGCAGAGGACTCTC
GCTGACCGTCCCATGTAAGGGTGGTGTCGGAT
This file contains a header (>contig00045 length=566 numreads=19) followed by a few lines of sequence.

In the first file (score.file1.txt), the fourth column of each row is score1 (1047, 793, 607, 571, etc.) and the fifth column is score2 (0.0, 0.0, e-176, e-165, etc.).
I would like to extract from data.file1.txt the TWO records with the TOP score1 values, considering only records whose score2 is NOT 0.0.

For example, based on the above data, my desired output is:
output.file1.txt
Code:
>contig00052 length=535   numreads=10
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGGGATGTCCAC
AGGCAGAGGGATGTCCACAGGCAGAGGGATGCCACAGGCA
>contig00072 length=472   numreads=46
TTTAGCTGCTTTCCCCCGGAGGAGATTTGAATTCCGGTGAAATCCAGGCT
TTGTTCATTTTAATAAGCGTCAGCCTGTCAGCGCTGTCAGTTGACAGGCG
Thanks in advance.
# 2  
Old 12-05-2010
Code:
 
>contig00019 length=667   numreads=5
TATAGGGAGAGTGGCATTCTAATAACAGGGGACGGGGGCAGAGGACTCTC
GCTGACCGTCCCATGTAAGGGTGGTGTCGGAT

Shouldn't this record also be in the output.file1.txt file?
# 3  
Old 12-05-2010
Thanks for your reply. No, the one you mentioned will not be in the output file. If we look at the 'score.file1.txt' file:
Code:
contig00052 length=535   numreads=10    607    e-176
contig00072 length=472   numreads=46    571    e-165
contig00019 length=667   numreads=5    474    e-136

The first one has a score of 607, the second 571, and the third 474. As we are taking only the top TWO scores, only the following should be in the output:
Code:
>contig00052 length=535   numreads=10
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGGGATGTCCAC
AGGCAGAGGGATGTCCACAGGCAGAGGGATGCCACAGGCA
>contig00072 length=472   numreads=46
TTTAGCTGCTTTCCCCCGGAGGAGATTTGAATTCCGGTGAAATCCAGGCT
TTGTTCATTTTAATAAGCGTCAGCCTGTCAGCGCTGTCAGTTGACAGGCG

Thanks.

Last edited by Scott; 12-12-2010 at 06:40 AM..
# 4  
Old 12-05-2010
Code:
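awk '
# First file (scores): track the two highest score1 values ($4)
# among rows whose score2 ($5) is not 0.0.
NR == FNR && $5 != 0 {
  if ($4 > m1) {
    m2 = m1
    c2 = c1
    m1 = $4
    c1 = $1
  } else if ($4 > m2) {
    m2 = $4
    c2 = $1
  }
}
# Second file (FASTA): with RS=">" each record is a header plus its sequence.
# Print the records whose ID matches either winner. RS consumes the ">",
# so the printed headers do not start with it.
NR > FNR && ($1 == c1 || $1 == c2)
' score.txt RS='>' ORS='' data.txt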
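
A note on the one-liner above: because RS='>' treats ">" as the record separator, the separator is not part of the record, and the printed headers lose their leading ">". If the ">" has to stay, as in the desired output of post #1, a variant along these lines might do. This is only a sketch: the function name top2 is made up for illustration, and it assumes the same five-column score layout and FASTA-style data file shown earlier.

```shell
# Hypothetical wrapper around the two-file awk idiom; the name top2 is illustrative.
# Usage: top2 SCORE_FILE DATA_FILE
top2() {
  awk '
    # First file (scores): remember the IDs of the two highest score1 values
    # among rows whose score2 (field 5) is not 0.0.
    NR == FNR {
      if ($5 != 0) {
        if ($4 > m1)      { m2 = m1; c2 = c1; m1 = $4; c1 = $1 }
        else if ($4 > m2) { m2 = $4; c2 = $1 }
      }
      next
    }
    # Second file (FASTA): RS=">" makes each record one header plus its sequence.
    # NF skips the empty record that precedes the first ">".
    # printf puts back the ">" that RS consumed; the record keeps its own newlines.
    NF && ($1 == c1 || $1 == c2) { printf ">%s", $0 }
  ' "$1" RS='>' "$2"
}
```

It would be called as `top2 score.txt data.txt > output.txt`. The records come out in the order they appear in the data file, not in score order.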

# 5  
Old 12-09-2010
Hi binlib,

Thanks.
The code is working fine.
I just wonder how I could get it working for multiple files, as I have several 'score' and 'data' files.
For example:
Code:
score1.txt    data1.txt
score2.txt    data2.txt
....
....
etc.

Thanks once again.

Last edited by Scott; 12-12-2010 at 06:41 AM..
# 6  
Old 12-09-2010
If you need to process one pair of score and data files at a time, just loop through them:
Code:
for i in 1 2 3; do
awk '{...}' score$i.txt ... data$i.txt
done
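
Spelled out in full, the loop might look like the sketch below. It reuses the awk program from post #4 unchanged, so the headers are printed without the leading ">". The function name process_pairs and the outfileN.txt output names are assumptions, and the pairing relies on the scoreN.txt / dataN.txt naming from post #5.

```shell
# The awk program from post #4, kept in a variable so the loop stays readable.
prog='
NR == FNR && $5 != 0 {
  if ($4 > m1)      { m2 = m1; c2 = c1; m1 = $4; c1 = $1 }
  else if ($4 > m2) { m2 = $4; c2 = $1 }
}
NR > FNR && ($1 == c1 || $1 == c2)
'

# Hypothetical helper: process every scoreN.txt / dataN.txt pair in the
# current directory and write the result to outfileN.txt.
process_pairs() {
  for score in score*.txt; do
    [ -e "$score" ] || continue       # glob matched nothing
    i=${score#score}                  # strip the "score" prefix
    i=${i%.txt}                       # strip the ".txt" suffix, leaving N
    data="data$i.txt"
    if [ ! -e "$data" ]; then
      echo "skipping $score: $data not found" >&2
      continue
    fi
    awk "$prog" "$score" RS='>' ORS='' "$data" > "outfile$i.txt"
  done
}
```

Globbing on score*.txt avoids hard-coding the list 1 2 3, and each awk run starts with fresh variables, so the top two are picked per pair.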

On the other hand, if you need a single run that treats all the score files as one score file and all the data files as one data file, you can either cat the score files together into one big score file (and do the same for the data files), or change the code a little:
Code:
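awk '
# RS is still the default newline while the score files are being read.
RS != ">" && $5 != 0 {
  if ($4 > m1) {
    m2 = m1
    c2 = c1
    m1 = $4
    c1 = $1
  } else if ($4 > m2) {
    m2 = $4
    c2 = $1
  }
}
# The RS=">" operand is applied before the first data file is opened.
RS == ">" && ($1 == c1 || $1 == c2)
' score*.txt RS='>' ORS='' data*.txt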
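
For the other route, concatenating first, a rough sketch could be the following. The function name merge_and_pick and the merged file names all_scores.txt and all_data.txt are made up for illustration. Either way, this only gives sensible results if contig IDs are unique across the files, since records are matched on the ID alone.

```shell
# Hypothetical helper: merge every score*.txt and data*.txt in the current
# directory, then pick the overall top two with the program from post #4.
merge_and_pick() {
  # The merged names start with "all_", so they are not matched by the globs.
  cat score*.txt > all_scores.txt || return 1
  cat data*.txt  > all_data.txt   || return 1
  awk '
    NR == FNR && $5 != 0 {
      if ($4 > m1)      { m2 = m1; c2 = c1; m1 = $4; c1 = $1 }
      else if ($4 > m2) { m2 = $4; c2 = $1 }
    }
    NR > FNR && ($1 == c1 || $1 == c2)
  ' all_scores.txt RS='>' ORS='' all_data.txt
}
```

With only two input files after merging, the NR == FNR test from post #4 works again, so the RS-based test is not needed here.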

# 7  
Old 12-10-2010
Thanks binlib!!!

Actually the first solution (the for loop option) works better, and I've used that.
I just need a bit more work to wrap it up.
Now each of the output files looks like this:

outfile1.txt
Code:
contig00052  length=535   numreads=10
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGGGATGTCCACAGGCAGAGGg
ATgtCCAca
contig00065  length=713   numreads=27
GGGgCTGACGTGgCCGCTAATACGACTCACTATAGGgAGAGGTTACATTGTCTTTGGAGT
GTATTGTT
contig00038  length=622   numreads=32
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGAAGCACGCTGGGAAGGGATA
GAAATTGCTAAAC

Now I want to replace the header part so that:
Code:
'contig00052  length=535   numreads=10'  will become  '>Header_1'
'contig00065  length=713   numreads=27'  will become  '>Header_2'
'contig00038  length=622   numreads=32'  will become  '>Header_3'

and the final output would look like:
Code:
>Header_1
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGGGATGTCCACAGGCAGAGGg
ATgtCCAca
>Header_2
GGGgCTGACGTGgCCGCTAATACGACTCACTATAGGgAGAGGTTACATTGTCTTTGGAGT
GTATTGTT
>Header_3
GGGCTGACGTGGCCGCTAATACGACTCACTATAGGGAGAGAAGCACGCTGGGAAGGGATA
GAAATTGCTAAAC

It would be good if I could choose the starting 'Header Number'. For example, instead of Header_1, Header_2, Header_3 ... I could start from Header_50, Header_51, Header_52 ... i.e. they would follow an incremental order from the starting 'Header number'.

Last edited by Scott; 12-12-2010 at 06:41 AM.. Reason: Please use code tags
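
One possible way to do the renumbering described above is sketched here. It is not from the thread: the function name renumber is hypothetical, and it assumes every header line starts with "contig" (with or without a leading ">"), while every other line is sequence.

```shell
# Hypothetical helper: replace each header line with ">Header_N".
# Usage: renumber START FILE   (START is the first header number)
renumber() {
  awk -v n="$1" '
    # Assumption: header lines begin with "contig", optionally preceded by ">".
    /^>?contig/ { print ">Header_" n; n++; next }
    # Everything else is sequence and passes through untouched.
    { print }
  ' "$2"
}
```

For example, `renumber 50 outfile1.txt > final1.txt` would start at Header_50, and `renumber 1 outfile1.txt` at Header_1. If the headers in your files do not all start with "contig", the pattern needs adjusting, for instance to match lines containing "length=".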
 