merge two text files of different size on common index


Login or Register for Dates, Times and to Reply

 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting merge two text files of different size on common index
# 1  
merge two text files of different size on common index

I have two text files.

text file 1:
Code:
ID  filePath       col1      col2      col3
1   10584588.mol   269.126   190.958   23.237
2   10584549.mol   281.001   200.889   27.7414
3   10584511.mol   408.824   158.316   29.8561
4   10584499.mol   245.632   153.241   25.2815
5   10584459.mol   290.476   133.699   28.631
6   10584426.mol   440.552   150.846   30.1827
7   10584298.mol   243.248   164.409   21.5715
8   10584286.mol   283.078   230.034   24.3697
9   10584278.mol   287.807   198.625   27.7414
10  10584197.mol   224.356   184.317   24.3616

text file 2:
Code:
ID   filePath       SUB_ID     ChBrg_REGID
1    10584588.mol   10584588   9070369
2    10584549.mol   10584549   9070193
3    10584499.mol   10584499   9069982
4    10584459.mol   10584459   9069773
5    10584426.mol   10584426   9069641
6    10584278.mol   10584278   9069060
7    10584197.mol   10584197   9068744

I need to merge the two, keeping only the rows that appear in both files (the shorter list could be the index). The column filePath is the index, so the final file should look like.
Code:
ID  filePath       SUB_ID     ChBrg_REGID   col1      col2      col3
1   10584588.mol   10584588   9070369       269.126   190.958   23.237
2   10584549.mol   10584549   9070193       281.001   200.889   27.7414
4   10584499.mol   10584499   9069982       245.632   153.241   25.2815
5   10584459.mol   10584459   9069773       290.476   133.699   28.631
6   10584426.mol   10584426   9069641       440.552   150.846   30.1827
9   10584278.mol   10584278   9069060       287.807   198.625   27.7414
10  10584197.mol   10584197   9068744       224.356   184.317   24.3616

I am guessing this could be done in awk, and certainly in perl, but I'm not sure how do to the alignment by the index.

LMHmedchem
# 2  
Hi
Assuming your input files are a1 and a2:


Code:
awk 'NR==FNR{a[$2]=$0;next;}{if ($2 in a){split(a[$2],b," *");printf "%-2s %-15s %-10s %-15s %-10s %-10s %-10s\n",b[1],b[2],$3,$4,b[3],b[4],b[5]}}' a1 a2  
ID filePath        SUB_ID     ChBrg_REGID     col1       col2       col3
1  10584588.mol    10584588   9070369         269.126    190.958    23.237
2  10584549.mol    10584549   9070193         281.001    200.889    27.7414
4  10584499.mol    10584499   9069982         245.632    153.241    25.2815
5  10584459.mol    10584459   9069773         290.476    133.699    28.631
6  10584426.mol    10584426   9069641         440.552    150.846    30.1827
9  10584278.mol    10584278   9069060         287.807    198.625    27.7414
10 10584197.mol    10584197   9068744         224.356    184.317    24.3616




Guru
# 3  
That worked great, except the header row never made it to the output file.

---------- Post updated at 02:16 PM ---------- Previous update was at 12:24 AM ----------

I have been working on the header row. If I do,
Code:
   awk 'NR==1 {printf "%s\t%s\t%s\t%s\t%s\t",  $1, $2, $3, $4, $5 }' $a2 > temp.txt
   awk 'NR==1 {$1=$2=""}1' $a1 >> temp.txt

This comes close, but prints the entire a1 file to temp.txt, not just the first row. This takes the first 5 fields from file a2 and then is supposed to add from field 3 to the last field of file a1. This will gob together the header row, and then I can use the command above to fill in the rest of the file.

---------- Post updated at 02:20 PM ---------- Previous update was at 02:16 PM ----------

This seems to work, but overall this seems an odd way of adding the header row.
Code:
   awk 'NR==1 {printf "%s\t%s\t%s\t%s\t%s\t",  $1, $2, $3, $4, $5 }' $a2 > temp.txt
   awk '{$1=$2=""} NR==1{print $0}' $a1 >> temp.txt

LMHmedchem
# 4  
I'm no awk expert but..
Everything works fine here with your input files and guru's awk.
I'm getting header as it should be or ? :

After running the code vs your input (a1 and a2) i'm getting.

Code:
ID  filePath        SUB_ID     ChBrg_REGID     col1       col2       col3      
1   10584588.mol    10584588   9070369         269.126    190.958    23.237    
2   10584549.mol    10584549   9070193         281.001    200.889    27.7414   
4   10584499.mol    10584499   9069982         245.632    153.241    25.2815   
5   10584459.mol    10584459   9069773         290.476    133.699    28.631    
6   10584426.mol    10584426   9069641         440.552    150.846    30.1827   
9   10584278.mol    10584278   9069060         287.807    198.625    27.7414

Also for generating headers via awk, remove the header from the data and use BEGIN block or put another pair of { } bracers infront of guru's code (you will need to remove header from data first)
Something like :
Code:
{
if ( NR == 1 )
 print $1 $2 $3 $4 $5 ## use printf to format as you wish
}

or
Code:
BEGIN {
printf ... ## you will need to write headers here yourself, since you can't use them from input file(s)
}
<other code>

# 5  
I found a bug in my data where the first two files had a different header name for one header. The header row is now correct, more or less.

There still seems to be an issue in that the last column has three columns of space delimited data in it.
Code:
sumSO2Am                 SUB_ID     SOURCE                                          
0                 10584046   ChemBridge                                      
0                 10580948   ChemBridge                                      
0                 10580812   ChemBridge                                      
0                 10580337   ChemBridge                                      
0                 10579979   ChemBridge                                      
0                 10579233   ChemBridge

The last two, SUB_ID and SOURCE are duplicate cols (already occur at $3,$4). These come from $3, $4 in a2. Each row should end with the sumSO2Am field.

I don't see where that is happening in the command, or I just don't get it. I see how the first 7 fields are being printed, but not the rest of each row. I can post some short test files if that would help.

LMHmedchem
# 6  
you mean you have SUB_ID and SOURCE in both the file?
sample input files with which you tried the code would help.

regards,
Ahamed
# 7  
Yes, apparently, my input files were not quite what I thought, so the field identifiers were not quite right. I have everything corrected now. The test files I am using have more cols than the sample I posted. I thought I had this working a bit ago, but now it seems it's not working again.

I have attached .zip with 4 files. There are the a1 and a2 input files, the output I am hoping for, and the output (incorrect) I am now getting. I am just trying to add the values from the cols SUB_ID, SOURCE, ChBrg_REGID from file a2 to file a1. File a1 has fewer rows, so it is necessary to look for the values in a1"filePath" to match the right row. This is basically looking up the values for the 3 cols in the a2 file and adding them in right after the a1"filePath" col. I won't always want all of the rest of the cols from a1, but it's just as well to leave them in, since I can edit the file further with cut, etc.

This is the command I am using,
Code:
awk 'NR==FNR{a[$2]=$0;next;}{if ($2 in a){split(a[$2],b," *");printf "%-2s %-15s %-10s %-15s %-10s %-10s %-10s\n",b[1],b[2],$3,$4,b[3],b[4],b[5]}}'  a1_temp.txt a2_temp.txt > output_temp.txt

This is part of a more involved script, but I'm just trying to get this part working.

I really thought I had it for a bit, but I guess not.

LMHmedchem
Login or Register for Dates, Times and to Reply

Previous Thread | Next Thread
Thread Tools Search this Thread
Search this Thread:
Advanced Search

Test Your Knowledge in Computers #262
Difficulty: Easy
During the Second World War, Alan Turing worked for the Government Code and Cypher School (GC&CS) at Bletchley Park, Britain's codebreaking centre that produced Ultra intelligence.
True or False?

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Merge multiple tab delimited files with index checking

Hello, I have 40 data files where the first three columns are the same (in theory) and the 4th column is different. Here is an example of three files, file 2: A_f0_r179_pred.txt Id Group Name E0 1 V N(,)'1 0.2904 2 V N(,)'2 0.3180 3 V N(,)'3 0.3277 4 V N(,)'4 0.3675 5 V N(,)'5 0.3456 ... (8 Replies)
Discussion started by: LMHmedchem
8 Replies

2. UNIX for Dummies Questions & Answers

Merge selective columns from files based on common key

Hi, I am trying to selectively merge two files based on keys reported in the 1st column. File1: #file1-header1 file1-header2 111 qwe rtz uio 198 asd fgh jkl 165 yxc 789 poi uzt rew 89 lkj File2: #file2-header2 file2-header2 165 ghz nko2 ... (2 Replies)
Discussion started by: dovah
2 Replies

3. Shell Programming and Scripting

Merge multiple files with common header

Hi all, Say i have multiple files x1 x2 x3 x4, all with common header (date, time, year, age),, How can I merge them to one singe file "X" in shell scripting Thanks for your suggestions. (2 Replies)
Discussion started by: msarguru
2 Replies

4. Shell Programming and Scripting

Find matched patterns in a column of 2 files with different size and merge them

Hi, i have input files like below:- input1 Name Seq_ID NewID Scores MT1 A0QZX3 1.65 277.4 IVO A0QZX3 1.65 244.5 HPO A0QZX3 1.65 240.5 RgP A0Q3PP 5.32 241.0 GX1 LPSZ3S 96.1 216.9 MEL LPSS3X 4.23 204.1 LDD LPSS3X 4.23 100.2 input2 Fac AddName NewID ... (9 Replies)
Discussion started by: redse171
9 Replies

5. Shell Programming and Scripting

Merge files based on both common and uncommon rows

Hi, I have two files A (2190 rows) and file B (1100 rows). I want to merge the contents of two files based on common field, also I need the unmatched rows from file A file A: ABC XYZ PQR file B: >LMN|chr1:11000-12456: >ABC|chr15:176578-187678: >PQR|chr3:14567-15866: output... (3 Replies)
Discussion started by: Diya123
3 Replies

6. Shell Programming and Scripting

script to merge two files on an index

I have a need to merge two files on the value of an index column. input file 1 id filePath MDL_NUMBER 1 MFCD00008104.mol MFCD00008104 2 MFCD00012849.mol MFCD00012849 3 MFCD00037597.mol MFCD00037597 4 MFCD00064558.mol MFCD00064558 5 MFCD00064559.mol MFCD00064559 input file 2 ... (9 Replies)
Discussion started by: LMHmedchem
9 Replies

7. UNIX for Dummies Questions & Answers

Merge two files with common IDs but unequal number of rows

Hi, I have two files that I would like to merge and think that there should be a solution using awk. The files look something like this: file 1 IDX1 IDY1 IDX2 IDY2 IDX3 IDY3 file 2 IDY1 dataA data1 IDY2 dataB data2 IDY3 dataC data3 Desired output IDX1 IDY1 dataA data1 IDX2 ... (5 Replies)
Discussion started by: katie8856
5 Replies

8. UNIX for Dummies Questions & Answers

Writing a loop to merge multiple files by common column

I have 100 data files labelled 250.1.txt through 250.100.txt. The second column of the data files partially match (there is about %90 overlap). Each data file has 4 columns. I want the merge all these text files by the matching values in the second column. In the output, the first column should... (1 Reply)
Discussion started by: evelibertine
1 Replies

9. Shell Programming and Scripting

How to remove common file names from text files

I'm running on freebsd -- with a default shell of csh. I have two files named A and B. Each line of each file contains a file name. How can I write a script that removes all the file names in file B from A. I tried to use perl to create a huge regular expression with "|" separating the file... (2 Replies)
Discussion started by: siegfried
2 Replies

10. Shell Programming and Scripting

Merge files of differrent size with one field common in both files using awk

hi, i am facing a problem in merging two files using awk, the problem is as stated below, file1: A|B|C|D|E|F|G|H|I|1 M|N|O|P|Q|R|S|T|U|2 AA|BB|CC|DD|EE|FF|GG|HH|II|1 .... .... .... file2 : 1|Mn|op|qr (2 Replies)
Discussion started by: shashi1982
2 Replies

Featured Tech Videos