Join lines from two files based on match


 
# 1  
Old 08-18-2013

I have two files.
File1
Code:
>gi|11320906|gb|AF197889.1|_Buchnera_aphidicola
ATGAAATTTAAGATAAAAAATAGTATTTT
>gi|11320898|gb|AF197885.1|_Buchnera_aphidicola
ATGAAATTTAATATAAACAATAAAA
>gi|11320894|gb|AF197883.1|_Buchnera_aphidicola
ATGAAATTTAATATAAACAATAAAATTTTT

File2
Code:
AF197885	Uroleucon aeneum
AF197886	Uroleucon jaceae
AF197889	Uroleucon obscurum
AF197883	Uroleucon astronomus
AF197893	Uroleucon erigeronense

For all lines in file1, I want to match the term bracketed by "gb|" and "." (i.e. AF197889 in the first line) to a line in file2. In this example of file1, all terms of interest start with "AF", but this isn't always the case.

If there's a match, I'd like to append the species name from file2, preceded by "_host_", to the matching line in file1, using underscores instead of spaces. Desired output:
Code:
>gi|11320906|gb|AF197889.1|_Buchnera_aphidicola_host_Uroleucon_obscurum
ATGAAATTTAAGATAAAAAATAGTATTTT
>gi|11320898|gb|AF197885.1|_Buchnera_aphidicola_host_Uroleucon_aeneum
ATGAAATTTAATATAAACAATAAAA
>gi|11320894|gb|AF197883.1|_Buchnera_aphidicola_host_Uroleucon_astronomus
ATGAAATTTAATATAAACAATAAAATTTTT

With the meager skills I have, I could use "|" as a field separator for file1 and use awk to fill an array to find matches. But I'm not sure how to append the file2 data, or how to accomplish it in one step. Can anyone help?

Last edited by Don Cragun; 08-18-2013 at 02:56 PM.. Reason: CODE tags; not QUOTE tags for input, output, and code samples.
# 2  
Old 08-18-2013
You could try something like:
Code:
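awk '
# First file (File2): build the suffix to append, keyed by accession.
FNR == NR {
        x[$1] = "_host"
        # Join the species words with underscores instead of spaces.
        for(i = 2; i <= NF; i++)
                x[$1]=x[$1] "_" $i
        next
}
# Second file (File1): with FS set to "|" or ".", the accession is field 4.
# Lines with no match (including sequence lines) are printed unchanged.
{       print $0 x[$4]
}' File2 FS='[|.]' File1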

If you are using a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of awk.
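For anyone who wants to try this without creating the files by hand, here is a minimal sketch that rebuilds part of the sample data from the first post in a scratch directory and runs the same awk program on it. The `mktemp -d` scratch directory, the `result` variable, and the extra XY000001 record (a made-up accession with no entry in File2) are assumptions added for illustration only.

```shell
#!/bin/sh
# Hypothetical scratch directory; not part of the original thread.
tmp=$(mktemp -d)

# Two records from the first post, plus one made-up record (XY000001)
# that has no entry in File2.
cat > "$tmp/File1" <<'EOF'
>gi|11320906|gb|AF197889.1|_Buchnera_aphidicola
ATGAAATTTAAGATAAAAAATAGTATTTT
>gi|11320898|gb|AF197885.1|_Buchnera_aphidicola
ATGAAATTTAATATAAACAATAAAA
>gi|1|gb|XY000001.1|_Buchnera_aphidicola
ACGT
EOF

# Tab-separated accession and species name, as in the first post.
printf 'AF197885\tUroleucon aeneum\nAF197889\tUroleucon obscurum\n' > "$tmp/File2"

# Same program as above: load File2 into x[], then append to File1 lines.
result=$(awk '
FNR == NR {
        x[$1] = "_host"
        for (i = 2; i <= NF; i++)
                x[$1] = x[$1] "_" $i
        next
}
{       print $0 x[$4]
}' "$tmp/File2" FS='[|.]' "$tmp/File1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The two headers with an entry in File2 gain the _host_ suffix. The made-up XY000001 header and all sequence lines come out unchanged, because x[$4] is empty for them.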
This User Gave Thanks to Don Cragun For This Post:
# 3  
Old 08-18-2013
Small variation in the first part:
Code:
awk 'NR==FNR{i=$1; $1="_host"; A[i]=$0; next} {print $0 A[$4]}' OFS=_ file2 FS='[|.]' file1
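The one-liner works because assigning to a field makes awk rebuild $0 with the fields joined by OFS. A small sketch of just that step, run on a single File2-style line from standard input (the `key` and `rebuilt` names are made up for this illustration, and OFS is set with -v here rather than as an operand between file names):

```shell
# Feed one tab-separated File2-style line to awk on standard input.
# Assigning $1 forces awk to rebuild $0, joining the fields with OFS ("_").
rebuilt=$(printf 'AF197889\tUroleucon obscurum\n' |
        awk -v OFS=_ '{ key = $1; $1 = "_host"; print key " -> " $0 }')

printf '%s\n' "$rebuilt"
```

This prints AF197889 -> _host_Uroleucon_obscurum, which is the string stored in A[i] and later appended to the matching header.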

This User Gave Thanks to Scrutinizer For This Post: