script to merge two files on an index

05-13-2012

I have a need to merge two files on the value of an index column.

input file 1

Code:

id filePath MDL_NUMBER
1 MFCD00008104.mol MFCD00008104
2 MFCD00012849.mol MFCD00012849
3 MFCD00037597.mol MFCD00037597
4 MFCD00064558.mol MFCD00064558
5 MFCD00064559.mol MFCD00064559

input file 2

Code:

MDL_NUMBER RI3_1 fw
MFCD00008104 114.901 31.0572
MFCD00012849 114.901 31.0572
MFCD00037597 114.901 31.0572
MFCD00064558 114.901 31.0572
MFCD00064559 114.901 31.0572

output file

Code:

id filePath MDL_NUMBER RI3_1 fw
1 MFCD00008104.mol MFCD00008104 114.901 31.0572
2 MFCD00012849.mol MFCD00012849 114.901 31.0572
3 MFCD00037597.mol MFCD00037597 114.901 31.0572
4 MFCD00064558.mol MFCD00064558 114.901 31.0572
5 MFCD00064559.mol MFCD00064559 114.901 31.0572

I could probably do this in awk, or even with join, but I need to add logic to check each pair of index values to make sure that the data stays in registration. I think this means a higher level language like ruby, python, or perl, but I am not very good with any of those. In most cases, the files will match correctly, but I think I need to add exception handling to check that the files have the same number of rows and that the index values are in the right order.

Can someone point me to a tutorial for one of these languages that shows sample code for loading two files and merging the output? I would guess you would hash the column header so you can specify which column is the index. I can do this in Excel, but there are 150,000 lines in some of these files and I have a lot of them to do.

Suggestions would be very helpful.

LMHmedchem

05-13-2012

You could start with something like this:

Code:

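awk 'END {
  # after the last file: compare the row counts of file1 and file2
  if (FNR != fnr)
    print "record count mismatch:", FNR, fnr
  }
NR == FNR {
  # first file: save the header fields, then index each row by id and MDL_NUMBER
  FNR == 1 && split($0, h)
  idx[$1, $3] = $0
  fnr = FNR
  next
  }
FNR == 1 {
  # second file, header line: print the combined header
  print h[1], h[2], h[3], $2, $3
  next
  }
{
  # second file, data rows: this assumes id equals the row number (FNR - 1)
  print ((FNR - 1, $1) in idx ? idx[FNR - 1, $1] : "unknown"), $2, $3
  }'  file1 file2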

It should be quite easy to add some logic.

radoulov

05-13-2012

Thanks for the post. One of the issues I am looking at here is that I use things like this fairly often, but the formats of the files can be very different. They could have hundreds of columns, and the header name and location of the index column can vary quite a bit. In most cases, it is easier to use a header name to specify the index column, and it's harder to use a simple tokenizer where the number of columns and location of the index is not constant. I am trying to get away from scripts with significant hard coding that needs to be changed.

In C++, I might create a hash table using the index value of the second file as the key and the remainder of the row as the value. I would read in the second file and store it on the index value, and then loop through the first file looking up the file 2 data that goes with each file 1 row. This would also mean that the files wouldn't have to be in the same order. That wouldn't be super fast, but processing files with 150,000+ rows will take some time no matter what the method. I don't really know how to do that sort of thing in a scripting language.
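
For what it's worth, the hash-table approach described above does not need C++; awk's associative arrays can do the same job. This is only a minimal sketch, assuming whitespace-delimited files with one header line each and a key column with the same name in both. The function name `merge_by_key` is made up for illustration.

```shell
# merge_by_key KEY_NAME FILE1 FILE2   (hypothetical helper, not an existing tool)
# Stores FILE2 in an awk array keyed on the index value, then walks FILE1.
merge_by_key() {
  awk -v key="$1" '
    # Header line of either file: find the column whose name equals key.
    FNR == 1 {
      pos = 0
      for (i = 1; i <= NF; i++) if ($i == key) pos = i
      if (!pos) {
        print "no column named " key " in " FILENAME > "/dev/stderr"
        err = 1; exit 1
      }
    }
    # First file on the command line (FILE2): keep every field except
    # the key, stored under the key value. The header row is stored too.
    NR == FNR {
      rest = ""
      for (i = 1; i <= NF; i++) if (i != pos) rest = rest OFS $i
      row[$pos] = rest
      next
    }
    # Second file on the command line (FILE1): one lookup per line.
    ($pos in row) { print $0 row[$pos]; next }
    { print "no match for key " $pos > "/dev/stderr"; err = 1 }
    END { exit err }
  ' "$3" "$2"
}
```

Called as `merge_by_key MDL_NUMBER file1 file2 > output_file`. Because file 2 is looked up by key, the two files do not have to be in the same order. A file 1 row with no partner is reported on stderr and left out of the output, and the function then returns a non-zero status.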

I don't think that something of that sophistication is really necessary; if the rows are not in the same order, another script could be used to sort them. I think that all that is really necessary is to check that the number of rows is the same and then test each row for matching key values, but I would like to have a script that will just take arguments for the filenames and index header name and have the rest work for pretty much any pair of files.
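
The simpler check described here (same number of rows, same key value on each pair of rows) could be sketched roughly as below. It assumes whitespace-delimited files with one header line each. The function name `merge_checked` and the error messages are invented for the example.

```shell
# merge_checked KEY_NAME FILE1 FILE2   (hypothetical helper)
# Reads both files in lockstep and stops at the first mismatch.
merge_checked() {
  awk -v key="$1" -v f2="$3" '
    # Return the position of the column called name, or 0 if absent.
    function find(name, hdr, n,   i) {
      for (i = 1; i <= n; i++) if (hdr[i] == name) return i
      return 0
    }
    {
      # Fetch the partner line from FILE2; failure means FILE2 is shorter.
      if ((getline line2 < f2) <= 0) {
        print "row count mismatch: " f2 " ends before line " FNR > "/dev/stderr"
        bad = 1; exit 1
      }
      n2 = split(line2, b)
    }
    FNR == 1 {
      # Header line: locate the key column in each file by name.
      n1 = split($0, a)
      k1 = find(key, a, n1); k2 = find(key, b, n2)
      if (!k1 || !k2) {
        print "key column " key " not found" > "/dev/stderr"
        bad = 1; exit 1
      }
    }
    {
      # Compare the keys as strings, then append FILE2 minus its key.
      if (($k1 "") != (b[k2] "")) {
        print "key mismatch at line " FNR ": " $k1 " vs " b[k2] > "/dev/stderr"
        bad = 1; exit 1
      }
      out = $0
      for (i = 1; i <= n2; i++) if (i != k2) out = out OFS b[i]
      print out
    }
    END {
      # Anything still unread in FILE2 means it is longer than FILE1.
      if (!bad && (getline line2 < f2) > 0) {
        print "row count mismatch: " f2 " has extra lines" > "/dev/stderr"
        bad = 1
      }
      exit bad
    }
  ' "$2"
}
```

Called as `merge_checked MDL_NUMBER file1 file2 > output_file`. The rows are written as they are read, so after a mismatch the output file is incomplete; the non-zero exit status is the signal to discard it.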

LMHmedchem

05-13-2012

Code:

paste file1 file2 | awk '{$3=""}1'

complex.invoke

05-13-2012

Quote:

Originally Posted by huaihaizi3

Code:

paste file1 file2 | awk '{$3=""}1'

The main point here is to do some checking to make sure that the index values match. There are a lot of ways to stuff two files together, but some of these files are massive and have had quite a bit of processing done on them. There is absolutely no guarantee that they will match up, or that the index key will always be in the same column. The value of the index needs to be checked for each pair of rows to make sure that matching data is merged in the output.

LMHmedchem

05-13-2012

Here is a configurable solution using awk. Specify the key field name as IDX; O is the list of field names to output.

This should work with the data spread across any number of files, as long as the field name of the key is the same in each file.

Code:

awk -v IDX="MDL_NUMBER" -v O="id filePath MDL_NUMBER RI3_1 fw" '
FNR==1 {
   # header line of each file: map the wanted names to column positions
   headers=split(O, htxt)
   split("", o)
   for(hd in htxt) p[htxt[hd]]=hd
   for(i=1;i<=NF;i++) {
       if ($i==IDX) keypos=i
       if ($i in p) o[p[$i]]=i
   }
   next;
}
# data rows: store each wanted field under its key value
{ for(c in o) {K[$keypos]; OUT[$keypos,c]= $(o[c]) } }
END {
    $0=""
    for(i=1;i<=headers;i++)$i=htxt[i];
    print
    $0=""
    # for-in visits the keys in an unspecified order
    for(key in K) {
    for(i=1;i<=headers;i++)
        if(i in htxt) $i=OUT[key,i]
    print
    }
}' file1 file2

Chubler_XL

05-13-2012

Thanks for the post, that seems to get me a lot of the way there.

I modified it for use in a bash script:

Code:

#!/usr/bin/bash

# arguments: index header name, index file, file to merge
INDEX=$1
INDEX_FILE=$2
MERGE_FILE=$3

awk -v IDX="$INDEX" -v O="id filePath MDL_NUMBER RI3_1 fw" '
FNR==1 {
   headers=split(O, htxt)
   split("", o)
   for(hd in htxt) p[htxt[hd]]=hd
   for(i=1;i<=NF;i++) {
       if ($i==IDX) keypos=i
       if ($i in p) o[p[$i]]=i
   }
   next;
}
{ for(c in o) {K[$keypos]; OUT[$keypos,c]= $(o[c]) } }
END {
    $0=""
    for(i=1;i<=headers;i++)$i=htxt[i];
    print
    $0=""
    for(key in K) {
    for(i=1;i<=headers;i++)
        if(i in htxt) $i=OUT[key,i]
    print
    }
}' "$INDEX_FILE" "$MERGE_FILE"

run as,
./data_merge_awk.sh MDL_NUMBER index_file merge_file > output_file

The only issue is that I most often use this on tab delimited data. I tried changing the split argument from " " to "/t", but that doesn't do it.
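
On the tab question: awk writes a tab as "\t" (backslash), and the way input lines are cut into fields is controlled by FS (the -F option), not by the split() call, which in that script only breaks up the list of names in O. A small sketch of the idea, assuming tab-delimited input with one header line; `pick_tsv` is a made-up helper name:

```shell
# pick_tsv "NAME1 NAME2 ..." FILE   (hypothetical helper)
# Prints the named columns of a tab-delimited file, tab-delimited.
pick_tsv() {
  awk -F'\t' -v OFS='\t' -v O="$1" '
    FNR == 1 {
      # O is blank-separated, so give split() an explicit separator;
      # with only two arguments it would split on FS, now a tab.
      n = split(O, want, " ")
      for (i = 1; i <= NF; i++) pos[$i] = i
      for (j = 1; j <= n; j++)
        if (!(want[j] in pos)) {
          print "no column named " want[j] > "/dev/stderr"
          exit 1
        }
    }
    {
      # Build the output line from the wanted columns, in the order asked.
      out = ""; sep = ""
      for (j = 1; j <= n; j++) { out = out sep $(pos[want[j]]); sep = OFS }
      print out
    }
  ' "$2"
}
```

With -F'\t' a field that contains spaces stays in one piece, and setting OFS the same way keeps the output tab-delimited.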

The only downside I can see is having to hard code all of the columns I need to keep for each separate use. Some of the files I use this kind of thing for would have hundreds of columns. I'm not familiar with specifying the field names like you did. Is there a syntax for specifying "all columns but this one", etc?
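
As far as I know, awk has no built-in syntax for "all columns but this one", but a loop over the fields that skips one position gets the same result without listing any names. A rough sketch, assuming whitespace-delimited input with one header line; `drop_column` is a hypothetical name:

```shell
# drop_column NAME FILE   (hypothetical helper)
# Prints every column of FILE except the one whose header is NAME.
drop_column() {
  awk -v name="$1" '
    # Header line: remember the position of the column to leave out.
    FNR == 1 {
      skip = 0
      for (i = 1; i <= NF; i++) if ($i == name) skip = i
    }
    {
      # Rebuild the line from every field except the skipped one.
      out = ""; sep = ""
      for (i = 1; i <= NF; i++)
        if (i != skip) { out = out sep $i; sep = OFS }
      print out
    }
  ' "$2"
}
```

If no header matches NAME, every column is printed unchanged. For tab-delimited data, the same awk program should work with -F'\t' -v OFS='\t' added to the awk call.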

LMHmedchem
