merge two text files of different size on common index

04-30-2011

I have two text files.

text file 1:

Code:

ID  filePath       col1      col2      col3
1   10584588.mol   269.126   190.958   23.237
2   10584549.mol   281.001   200.889   27.7414
3   10584511.mol   408.824   158.316   29.8561
4   10584499.mol   245.632   153.241   25.2815
5   10584459.mol   290.476   133.699   28.631
6   10584426.mol   440.552   150.846   30.1827
7   10584298.mol   243.248   164.409   21.5715
8   10584286.mol   283.078   230.034   24.3697
9   10584278.mol   287.807   198.625   27.7414
10  10584197.mol   224.356   184.317   24.3616

text file 2:

Code:

ID   filePath       SUB_ID     ChBrg_REGID
1    10584588.mol   10584588   9070369
2    10584549.mol   10584549   9070193
3    10584499.mol   10584499   9069982
4    10584459.mol   10584459   9069773
5    10584426.mol   10584426   9069641
6    10584278.mol   10584278   9069060
7    10584197.mol   10584197   9068744

I need to merge the two, keeping only the rows that appear in both files (the shorter list could serve as the index). The filePath column is the index, so the final file should look like this:

Code:

ID  filePath       SUB_ID     ChBrg_REGID   col1      col2      col3
1   10584588.mol   10584588   9070369       269.126   190.958   23.237
2   10584549.mol   10584549   9070193       281.001   200.889   27.7414
4   10584499.mol   10584499   9069982       245.632   153.241   25.2815
5   10584459.mol   10584459   9069773       290.476   133.699   28.631
6   10584426.mol   10584426   9069641       440.552   150.846   30.1827
9   10584278.mol   10584278   9069060       287.807   198.625   27.7414
10  10584197.mol   10584197   9068744       224.356   184.317   24.3616

I am guessing this could be done in awk, and certainly in perl, but I'm not sure how to do the alignment by the index.

LMHmedchem

05-01-2011

Hi
Assuming your input files are a1 and a2:

Code:

awk 'NR==FNR{a[$2]=$0;next;}{if ($2 in a){split(a[$2],b," *");printf "%-2s %-15s %-10s %-15s %-10s %-10s %-10s\n",b[1],b[2],$3,$4,b[3],b[4],b[5]}}' a1 a2  
ID filePath        SUB_ID     ChBrg_REGID     col1       col2       col3
1  10584588.mol    10584588   9070369         269.126    190.958    23.237
2  10584549.mol    10584549   9070193         281.001    200.889    27.7414
4  10584499.mol    10584499   9069982         245.632    153.241    25.2815
5  10584459.mol    10584459   9069773         290.476    133.699    28.631
6  10584426.mol    10584426   9069641         440.552    150.846    30.1827
9  10584278.mol    10584278   9069060         287.807    198.625    27.7414
10 10584197.mol    10584197   9068744         224.356    184.317    24.3616
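
The one-liner above packs the usual awk two-file lookup into a single line. A more readable sketch of the same idea follows, assuming whitespace-separated files with filePath in field 2, as in the samples. The function name merge_on_filepath is made up for illustration, and split() is called without a separator so that it falls back to awk's default field splitting.

```shell
# merge_on_filepath FILE1 FILE2
# Hypothetical helper: prints the rows whose filePath (field 2) occurs in both files.
merge_on_filepath() {
    awk '
        # First file: remember each whole line, keyed by filePath.
        NR == FNR { row[$2] = $0; next }

        # Second file: print only when the filePath was also seen in the first file.
        ($2 in row) {
            split(row[$2], b)    # default split: any run of blanks or tabs
            printf "%-2s %-15s %-10s %-15s %-10s %-10s %-10s\n",
                   b[1], b[2], $3, $4, b[3], b[4], b[5]
        }
    ' "$1" "$2"
}
```

It would be called as `merge_on_filepath a1 a2 > merged.txt`. The header row comes through only because both headers carry the same text, filePath, in field 2; if the two header names differ, the header line is treated as a non-match and dropped.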

Guru

guruprasadpr

05-01-2011

That worked great, except the header row never made it to the output file.

---------- Post updated at 02:16 PM ---------- Previous update was at 12:24 AM ----------

I have been working on the header row. If I do,

Code:

   awk 'NR==1 {printf "%s\t%s\t%s\t%s\t%s\t",  $1, $2, $3, $4, $5 }' $a2 > temp.txt
   awk 'NR==1 {$1=$2=""}1' $a1 >> temp.txt

This comes close, but it prints the entire a1 file to temp.txt, not just the first row. The idea is to take the first 5 header fields from file a2 and then append field 3 through the last field from the header of file a1. That would cobble together the header row, and then I can use the command above to fill in the rest of the file.

---------- Post updated at 02:20 PM ---------- Previous update was at 02:16 PM ----------

This seems to work, but overall this seems an odd way of adding the header row.

Code:

   awk 'NR==1 {printf "%s\t%s\t%s\t%s\t%s\t",  $1, $2, $3, $4, $5 }' $a2 > temp.txt
   awk '{$1=$2=""} NR==1{print $0}' $a1 >> temp.txt
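
For what it's worth, the first attempt prints the whole file because the `NR==1` pattern only guards the `{$1=$2=""}` action, while the trailing `1` is a separate always-true pattern that prints every line. A sketch of building the header in one pass is below. It assumes whitespace-separated headers with ID and filePath as the first two fields of both files; the function name merged_header and the tab-separated output are assumptions, and it takes every header field of the first file rather than just the first five.

```shell
# merged_header FILE_A2 FILE_A1
# Hypothetical helper: prints all header fields of FILE_A2, followed by the
# header fields of FILE_A1 from the third onward, separated by tabs.
merged_header() {
    { head -n 1 "$1"; head -n 1 "$2"; } |
    awk '
        {
            start = (NR == 1) ? 1 : 3    # second line is the a1 header: skip ID and filePath
            for (i = start; i <= NF; i++)
                out = out (out == "" ? "" : "\t") $i
        }
        END { print out }
    '
}
```

Something like `merged_header "$a2" "$a1" > temp.txt` would then replace the two separate awk calls.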

LMHmedchem

05-01-2011

I'm no awk expert, but everything works fine here with your input files and guru's awk.
I'm getting the header as it should be, or am I missing something?

After running the code against your input (a1 and a2), I'm getting:

Code:

ID  filePath        SUB_ID     ChBrg_REGID     col1       col2       col3      
1   10584588.mol    10584588   9070369         269.126    190.958    23.237    
2   10584549.mol    10584549   9070193         281.001    200.889    27.7414   
4   10584499.mol    10584499   9069982         245.632    153.241    25.2815   
5   10584459.mol    10584459   9069773         290.476    133.699    28.631    
6   10584426.mol    10584426   9069641         440.552    150.846    30.1827   
9   10584278.mol    10584278   9069060         287.807    198.625    27.7414

Also, to generate headers via awk, remove the header from the data and use a BEGIN block, or put another pair of { } braces in front of guru's code (you will need to remove the header from the data first).
Something like:

Code:

{
if ( NR == 1 )
 print $1, $2, $3, $4, $5 ## commas put the output separator between fields; use printf to format as you wish
}

Code:

BEGIN {
printf ... ## you will need to write headers here yourself, since you can't use them from input file(s)
}
<other code>
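
A sketch of the BEGIN-block variant, combined with guru's lookup, is below. Instead of removing the header lines from the data beforehand, it skips line 1 of each file inside awk, which is a variation on the suggestion above rather than exactly what was described. The function name is made up, and the header names are copied from the sample files, so they would need editing for other data.

```shell
# merge_with_fixed_header FILE1 FILE2
# Hypothetical helper: writes a hand-made header, then the rows whose
# filePath (field 2) occurs in both files.
merge_with_fixed_header() {
    awk '
        BEGIN {
            # Header typed in by hand; it is not read from the input files.
            printf "%-2s %-15s %-10s %-15s %-10s %-10s %-10s\n",
                   "ID", "filePath", "SUB_ID", "ChBrg_REGID", "col1", "col2", "col3"
        }
        FNR == 1  { next }                   # drop the header line of each file
        NR == FNR { row[$2] = $0; next }     # first file: store rows by filePath
        ($2 in row) {
            split(row[$2], b)
            printf "%-2s %-15s %-10s %-15s %-10s %-10s %-10s\n",
                   b[1], b[2], $3, $4, b[3], b[4], b[5]
        }
    ' "$1" "$2"
}
```

Because the header no longer depends on the input, a mismatch between the header names in the two files does not affect the output.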

Peasant

05-01-2011

I found a bug in my data: the first two files had different names for one header column. The header row is now correct, more or less.

There still seems to be an issue: the last column has three columns of space-delimited data in it.

Code:

sumSO2Am                 SUB_ID     SOURCE                                          
0                 10584046   ChemBridge                                      
0                 10580948   ChemBridge                                      
0                 10580812   ChemBridge                                      
0                 10580337   ChemBridge                                      
0                 10579979   ChemBridge                                      
0                 10579233   ChemBridge

The last two, SUB_ID and SOURCE, are duplicate cols (they already occur at $3, $4). These come from $3, $4 in a2. Each row should end with the sumSO2Am field.

I don't see where that is happening in the command, or I just don't get it. I see how the first 7 fields are being printed, but not the rest of each row. I can post some short test files if that would help.
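
One possible explanation, offered as a guess because the real files are not shown in the thread: if they are tab-delimited, awk's default field splitting still finds $2, $3 and $4, but the `" *"` separator passed to split() only matches spaces. In gawk, for example, that may leave the whole stored line in b[1], so the printf would emit the entire a1 row, an empty b[2], and then $3 and $4 from a2, which is the pattern shown above. A small check using a made-up tab-delimited line:

```shell
# Made-up tab-delimited row standing in for one line of the real a1 file.
line=$(printf '1\t10584588.mol\t269.126\t0')

# Default split(): fields are separated by any run of blanks or tabs.
n_default=$(printf '%s\n' "$line" | awk '{ print split($0, b) }')

# split() with " *": the result depends on the awk in use; compare it with the above.
n_spaces=$(printf '%s\n' "$line" | awk '{ print split($0, b, " *") }')

echo "default split: $n_default fields; \" *\" split: $n_spaces fields"
```

If the second count comes out smaller than the first, dropping the third argument of split(), or passing the real delimiter, would be the thing to try.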

LMHmedchem

05-01-2011

Do you mean you have SUB_ID and SOURCE in both files?
The sample input files you tried the code with would help.

regards,
Ahamed

ahamed101

05-01-2011

Yes, apparently, my input files were not quite what I thought, so the field identifiers were not quite right. I have everything corrected now. The test files I am using have more cols than the sample I posted. I thought I had this working a bit ago, but now it seems it's not working again.

I have attached a .zip with 4 files. There are the a1 and a2 input files, the output I am hoping for, and the (incorrect) output I am now getting. I am just trying to add the values from the cols SUB_ID, SOURCE, ChBrg_REGID from file a2 to file a1. File a1 has fewer rows, so it is necessary to use the values in the a1 "filePath" col to match the right row. This is basically looking up the values for the 3 cols in the a2 file and adding them in right after the a1 "filePath" col. I won't always want all of the rest of the cols from a1, but it's just as well to leave them in, since I can edit the file further with cut, etc.

This is the command I am using,

Code:

awk 'NR==FNR{a[$2]=$0;next;}{if ($2 in a){split(a[$2],b," *");printf "%-2s %-15s %-10s %-15s %-10s %-10s %-10s\n",b[1],b[2],$3,$4,b[3],b[4],b[5]}}'  a1_temp.txt a2_temp.txt > output_temp.txt

This is part of a more involved script, but I'm just trying to get this part working.

I really thought I had it for a bit, but I guess not.
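
A sketch of the lookup described above, under several assumptions that are not confirmed in the thread: both files are tab-delimited, filePath is field 2 in both, and SUB_ID, SOURCE and ChBrg_REGID sit in fields 3 to 5 of a2. The function name add_a2_columns is made up. It reads a2 first to build the lookup table, then streams a1 and keeps every remaining a1 column.

```shell
# add_a2_columns A2 A1
# Hypothetical helper. Assumes tab-delimited input, filePath in field 2 of both
# files, and the wanted a2 columns in fields 3 to 5 (a guess about the layout).
add_a2_columns() {
    awk -F '\t' -v OFS='\t' '
        # First file (a2): keep the three lookup columns, keyed by filePath.
        NR == FNR { extra[$2] = $3 OFS $4 OFS $5; next }

        # Second file (a1): print ID, filePath, the a2 columns, then the rest of the row.
        ($2 in extra) {
            out = $1 OFS $2 OFS extra[$2]
            for (i = 3; i <= NF; i++)
                out = out OFS $i
            print out
        }
    ' "$1" "$2"
}
```

Usage would be along the lines of `add_a2_columns a2_temp.txt a1_temp.txt > output_temp.txt`. The header line is merged the same way as the data, provided field 2 has the same name in both headers, and the loop avoids hard-coding how many columns a1 has.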

LMHmedchem

test_files.zip (3.1 KB)
