Filter and merge 2 files problem


 
# 1  
Old 08-09-2015

Hi,

I'm trying to combine two files which have 1 column in common and filter out rows I don't need.

File 1:
Code:
ID       Start       End       Matched       Coverage
1       1       254     1515    5.96
2       1       135     402     2.98


File 2 (has 2 rows per entry):
Code:
>1 254:17:30:28.6351:1.62947
AAAAAAAAAAAAAAAAAAAAAACAGCTAAAGTTGAGGATTTCAAACAGAAAAGCAACA
>2 135:11:15:14.3786:0.609257
AAAAAAAAAAAAAAAAAAAAAACCTTCCCTCGGTCTGATATGTCTTCATTTACAATGCT

So I filter File 1 with awk (awk '$5>0 {print}' file1 > file1_filtered) and want to merge file1_filtered with File 2, retaining only the sequences whose IDs pass the filter. Ideally I want a file with 3 columns: ID, Sequence, Coverage:

Code:
ID       Sequence       Coverage
1       AAAAAAA......    5.96
2       AAAAAAAA....     2.98

I am a biologist taking my very first steps in bash scripting, so I would greatly appreciate any comments or explanations of how it should work.

I use Cygwin on Windows.

Many thanks!

Moderator's Comments:
Mod Comment edit by bakunin: please use CODE-tags for file content too. Thank you.

Last edited by bakunin; 08-09-2015 at 09:17 PM..
# 2  
Old 08-09-2015
The initial filtering can be done at the same time as the merge.

Code:
awk '
    FNR==NR {                    # first file (file1) only
        if ($5+0 > 0)            # numeric filter on field 5
            s[$1] = $5           # keep id and value of field 5
        next                     # no lines from file1 will pass beyond here
    }
    $1 ~ /^>/ {                  # header lines of file2 start with >
        sub(/^>/, "", $1)        # strip the > so that $1 is just the id
        if ($1 in s) {           # check if the id passed the filter in file1
            getline G            # read the next line (the sequence) into G
            print $1, G, s[$1]   # display id, sequence and coverage
        }
    } ' file1 file2
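
A side note on why the script tests $5+0>0 rather than $5>0: the header line of file1 has the text "Coverage" in field 5, and awk falls back to a string comparison when a field does not look like a number. The following is only a minimal sketch, assuming file1 looks like the sample in the first post; the scratch directory and the variable names plain and numeric are made up for the illustration.

```shell
# Work in a scratch directory (illustrative; any location will do).
tmp=$(mktemp -d)

# file1 as shown in the first post, header line included.
cat > "$tmp/file1" <<'EOF'
ID       Start       End       Matched       Coverage
1       1       254     1515    5.96
2       1       135     402     2.98
EOF

# $5 > 0   : "Coverage" is not a number, so awk compares strings
#            and the header line passes the test.
plain=$(awk '$5 > 0 {n++} END {print n+0}' "$tmp/file1")

# $5+0 > 0 : adding 0 forces a numeric comparison; "Coverage"+0 is 0,
#            so the header line is dropped.
numeric=$(awk '$5+0 > 0 {n++} END {print n+0}' "$tmp/file1")

echo "lines kept with \$5>0: $plain; with \$5+0>0: $numeric"
```

With the sample data the first count is 3 (header plus two rows) and the second is 2, so the numeric form keeps the header out of the lookup array.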


Last edited by Aia; 08-09-2015 at 11:34 PM..
# 3  
Old 08-09-2015
Thanks a lot, Aia!

The script works perfectly, and thank you very much for the comments! I have only now noticed that some of my sequences are much longer than the others and span more than one line, and the number of lines per sequence varies. I would be extremely grateful if you could point out how to fix that in the script.

Many thanks!
# 4  
Old 08-10-2015
Try this, based on Aia's proposal:
Code:
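awk '
FNR==NR         {if ($5+0 > 0) s[$1] = $5   # file1: remember coverage of rows that pass the filter
                 next
                }
$1 in s         {TMPA = $1                  # file2: RS=">" makes each entry one record, $1 is its id
                 TMPB = s[$1]
                 $1 = $2 = ""               # drop id and header info; OFS="" joins the sequence lines
                 printf "%s\t%s\t%s\n", TMPA, TMPB, $0
                }
' file1 RS=">" ORS="\n" FS=" " OFS="" file2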

I reversed the output order to make it more readable because the sequence length is unpredictable.
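
If the column order from the first post (ID, Sequence, Coverage) is still wanted with multi-line sequences, one possible variant is to collect the sequence lines per entry and print when the next ">" header (or the end of the file) is reached. This is only a sketch and not the method posted above: the array name cov, the function name emit, the scratch directory and the short made-up sequences are all assumptions for the illustration, and the extra row with ID 3 is invented to show the filter at work.

```shell
tmp=$(mktemp -d)

# Coverage table; row 3 is a made-up entry with zero coverage.
cat > "$tmp/file1" <<'EOF'
ID       Start       End       Matched       Coverage
1       1       254     1515    5.96
2       1       135     402     2.98
3       1       100     0       0
EOF

# Sequence file; entry 1 spans two lines (sequences shortened and invented).
cat > "$tmp/file2" <<'EOF'
>1 254:17:30:28.6351:1.62947
AAAA
CCCC
>2 135:11:15:14.3786:0.609257
GGGG
>3 100:0:0:0:0
TTTT
EOF

merged=$(awk -v OFS='\t' '
    # Print the entry collected so far if its id passed the filter.
    function emit() { if (id in cov) print id, seq, cov[id] }

    FNR == NR {                      # first file: coverage table
        if ($5+0 > 0) cov[$1] = $5   # keep only rows with coverage > 0
        next
    }
    /^>/ {                           # header line starts a new entry
        emit()                       # finish the previous entry first
        id  = substr($1, 2)          # id without the leading >
        seq = ""
        next
    }
    { seq = seq $0 }                 # append every sequence line
    END { emit() }                   # do not forget the last entry
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$merged"
```

The output is tab-separated, one line per entry that passed the filter, with the sequence lines joined into a single field.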
# 5  
Old 08-10-2015
Thanks a lot, that solves my problem perfectly!
 