Match tab-delimited files based on key

05-31-2018

Registered User

9, 0

Join Date: Mar 2009

Last Activity: 5 June 2018, 9:44 AM EDT

Posts: 9

Thanks Given: 5

Thanked 0 Times in 0 Posts

I thought I had this figured out but was wrong, so I am humbly asking for help.
The task is to add a column to FILE 1 based on records in FILE 2.
The key is in column 1 of FILE 1 and in either column 1 or column 2 of FILE 2.

I want to append the third column of FILE 2 to the matching row of FILE 1, so that the new file shows, for example:

DESIRED FILE 3

Code:

1:13109    G_T    t    g    -0.4127    0.1042    7.52e-05    ?---??????-?  rs540538026

FILE 1

Code:

1:1057989    G_T    t    g    0.3000    0.0662    5.909e-06    ??++++++???+
1:11007      C_T    t    c    0.2874    0.0710    5.19e-05    ?????+++???+
1:2190612    A_G    a    g    1.1252    0.2605    1.561e-05    ???????????+
1:13109    G_T    t    g    -0.4127    0.1042    7.52e-05    ?---??????-?
1:3674534    G_T    t    g    -0.4187    0.1073    9.559e-05    ?---??????-?
1:6932407    A_G    a    g    1.4977    0.3322    6.535e-06    ???????????+
1:6938780    C_T    t    c    -1.3632    0.3274    3.135e-05    ???????????-
1:7171050    A_G    a    g    0.0537    0.0134    6.091e-05    ?+++?-++++++
1:8960594    C_T    t    c    -0.9273    0.2319    6.344e-05    ???????????-
1:12203508    C_T    t    c    -1.4228    0.3469    4.111e-05    ???????????-

FILE 2

Code:

1:40370176 1:40370176 rs564192510
1:61341695 1:61341699 rs146746778
1:180879355 1:180879367 rs142596889
1:10177 1:10177 rs367896724
1:10352 1:10352 rs555500075
1:11007 1:11008 rs575272151
1:11011 1:11012 rs544419019
1:13109 1:13110 rs540538026
1:13115 1:13116 rs62635286

I have tried the join command, but it cannot search both column 1 and column 2 of FILE 2 for a match.

If the column 1 value from FILE 1 has no record in FILE 2, I would like to put "NA" in the new column.

I have tried awk solutions to the problem but am not skilled enough to produce the desired output. Any help would be much appreciated.

andmal

05-31-2018

Moderator

8,825, 1,112

Join Date: Feb 2005

Last Activity: 23 August 2021, 11:26 AM EDT

Location: Foxborough, MA

Posts: 8,825

Thanks Given: 579

Thanked 1,112 Times in 1,003 Posts

to start with:

Code:

awk '
  FNR==NR {
    f21[$1]=$3
    f22[$2]=$3
    next
  }
  $1 in f21 { print $0, f21[$1];next }
  $1 in f22 { print $0, f21[$1] }
' file2 file1

vgersh99

05-31-2018

Thank you, this is closer to a solution than I've been in several days.

I tried out the script and did some manual sanity checks.

The code correctly finds FILE 1 keys in both column 1 and column 2 of FILE 2. However, it only seems to add the column 3 value from FILE 2 if the match was found in column 1.

I am wondering if this part of the code needs to be modified? Does $1 in f22 refer to the first column in the created matrix?

Code:

  $1 in f21 { print $0, f21[$1];next }
  $1 in f22 { print $0, f21[$1] }

If there is no match in either column, the row is dropped from the output, which actually isn't much of a problem.

andmal
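
On the `$1 in f22` question: in awk, `key in array` only tests whether that index exists in the array; it does not refer to a column. Here `f21` is indexed by FILE 2's column 1 values and `f22` by its column 2 values, and `$1` is the first field of the FILE 1 line being read. A small sketch of that behaviour, using a made-up two-line lookup file in a temporary directory (the file names and the `$tmp` variable are illustrative assumptions, not part of the thread):

```shell
tmp=$(mktemp -d)
# Hypothetical two-line lookup file laid out like FILE 2.
printf '1:11007\t1:11008\trs575272151\n1:13109\t1:13110\trs540538026\n' > "$tmp/demo_file2.txt"

awk '
  {
    f21[$1] = $3   # index: column 1 value -> rs ID
    f22[$2] = $3   # index: column 2 value -> rs ID
  }
  END {
    # "key in array" is true when that index exists; it creates nothing.
    if ("1:13109" in f21)    print "col1 hit:", f21["1:13109"]
    if ("1:13110" in f22)    print "col2 hit:", f22["1:13110"]
    if (!("1:13110" in f21)) print "1:13110 is not a column 1 key"
  }
' "$tmp/demo_file2.txt" > "$tmp/demo_out.txt"
cat "$tmp/demo_out.txt"
```

So a value printed from `f21` for a key that exists only in `f22` comes out empty, which matches the symptom described above.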

05-31-2018

Sorry, fat-fingering on my part.
Try this:

Code:

awk '
  FNR==NR {
    f21[$1]=$3
    f22[$2]=$3
    next
  }
  $1 in f21 { print $0, f21[$1];next }
  $1 in f22 { print $0, f22[$1] }
' file2 file1
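
The original request also asked for "NA" when a FILE 1 key has no record in FILE 2. One possible variant is sketched below; it is an illustration rather than the poster's code. It assumes the new value should be appended after a tab, and it uses small made-up excerpts of the two files in a temporary directory (`$tmp`, `file1`, `file2`, `file3` are hypothetical names).

```shell
tmp=$(mktemp -d)
# Hypothetical excerpts: the third file1 key exists only in column 2 of file2.
printf '1:1057989\tG_T\tt\tg\n1:11007\tC_T\tt\tc\n1:13110\tG_T\tt\tg\n' > "$tmp/file1"
printf '1:11007\t1:11008\trs575272151\n1:13109\t1:13110\trs540538026\n' > "$tmp/file2"

awk '
  FNR == NR {            # first file on the command line: the lookup table
    f21[$1] = $3
    f22[$2] = $3
    next
  }
  {                      # every line of the second file gets exactly one ID
    if ($1 in f21)      id = f21[$1]
    else if ($1 in f22) id = f22[$1]
    else                id = "NA"
    print $0 "\t" id     # keep the line as is, append a tab and the ID
  }
' "$tmp/file2" "$tmp/file1" > "$tmp/file3"
cat "$tmp/file3"
```

Unlike the two pattern rules above, this keeps unmatched rows instead of dropping them.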

vgersh99

06-01-2018

And now it works like a charm and produces exactly the output I was looking for. Thanks!!

---------- Post updated at 01:01 PM ---------- Previous update was at 12:50 PM ----------

The main question is solved thanks to vgersh99. I have a bonus question: if I want to run this awk command on multiple FILE 1 inputs against the same reference FILE 2, would something along these lines do the trick?

Code:

for i in *.txt ; do
awk '
  FNR==NR {
    f21[$1]=$3
    f22[$2]=$3
    next
  }
  $1 in f21 { print $0, f21[$1];next }
  $1 in f22 { print $0, f22[$1] }
' FILE2.txt $i 
$i > $i.pruned
done

andmal

06-01-2018

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

With a VERY sloppy interpretation of "along these lines" you might come close to the desired result, once you have corrected the syntax / redirection error in the second-to-last line and accepted the higher resource cost of running the script once per file.
Why not something along THIS line:

Code:

awk '                      
FNR==NR         {f21[$1]=$3
                 f22[$2]=$3
                 next
                }
$1 in f21       {print $0, f21[$1] > (FILENAME ".pruned")
                 next
                }
$1 in f22       {print $0, f22[$1] > (FILENAME ".pruned")
                }
' file2.txt file[^2].txt
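
For comparison, the shell-loop approach from the question could be repaired roughly as follows. This is only a sketch: it assumes the FILE 1 inputs end in .tab, so that the glob does not also pick up the reference file (the original `*.txt` would have matched FILE2.txt as well), and the sample files and `$tmp` directory are made up for illustration.

```shell
tmp=$(mktemp -d)
# Hypothetical reference file and two small inputs.
printf '1:11007\t1:11008\trs575272151\n' > "$tmp/FILE2.txt"
printf '1:11007\tC_T\n1:999\tA_G\n' > "$tmp/a.tab"
printf '1:11008\tG_T\n' > "$tmp/b.tab"

for i in "$tmp"/*.tab; do
  awk '
    FNR == NR { f21[$1] = $3; f22[$2] = $3; next }
    $1 in f21 { print $0, f21[$1]; next }
    $1 in f22 { print $0, f22[$1] }
  ' "$tmp/FILE2.txt" "$i" > "$i.pruned"   # redirect the output of awk itself
done
cat "$tmp"/*.pruned
```

The redirection belongs on the awk command; the loop still rereads FILE2.txt once per input, which is the extra cost mentioned above.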

RudiC

06-01-2018

I suspected the code I suggested was sloppy; I'm a newbie.

As I understand it, this part of your suggestion looks up column 1 in the f21 table. Does it then add the .pruned extension to the output file? Or does it process all files with the .pruned extension?

Code:

$1 in f21       {print $0, f21[$1] > (FILENAME ".pruned")
                 next
                }

To be clearer, I have 400 versions of FILE 1 that should be matched against the FILE 2 table, of which there is only one. The file names look like the examples below. I would like to process all of them without having to run them one at a time. They all have the same file extension. The resulting files should get an additional extension, .pruned.

Code:

FILE1_VEGF.tbl.filtered.tab
FILE1_TL1A.tbl.filtered.tab
FILE1_MMP13.tbl.filtered.tab
FILE1_KYNUR.tbl.filtered.tab
+398 more files

I also don't understand this part:

Code:

file2.txt file[^2].txt

Does the ^ mean that the files are combined?
Is it possible to use a wildcard pattern, e.g. *.tab, to process many different versions of FILE 1?

andmal
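
On the last two questions: `[^2]` is a shell bracket pattern, not an awk operator. In bash it matches any single character other than 2, so `file[^2].txt` expands to file1.txt, file3.txt and so on while skipping file2.txt (`[!2]` is the POSIX spelling). A plain `*.tab` wildcard should work just as well, provided the reference file does not match it. The sketch below assumes file names like the ones listed above, uses made-up contents in a temporary directory, and adds a `close()` call that is not in the thread, so that awk does not keep hundreds of output files open at once.

```shell
tmp=$(mktemp -d)
# Hypothetical reference file and two inputs named like the examples.
printf '1:11007\t1:11008\trs575272151\n' > "$tmp/FILE2.txt"
printf '1:11007\tC_T\n' > "$tmp/FILE1_VEGF.tbl.filtered.tab"
printf '1:11008\tG_T\n' > "$tmp/FILE1_TL1A.tbl.filtered.tab"

awk '
  FNR == NR { f21[$1] = $3; f22[$2] = $3; next }   # load the lookup table once
  FNR == 1  {                                      # first line of each input file
    if (out != "") close(out)                      # finish the previous output
    out = FILENAME ".pruned"
  }
  $1 in f21 { print $0, f21[$1] > out; next }
  $1 in f22 { print $0, f22[$1] > out }
' "$tmp/FILE2.txt" "$tmp"/*.tab
ls "$tmp"
```

FILENAME is the name of the input file awk is currently reading, so each input gets its own output file with .pruned appended; an input with no matching rows produces no .pruned file.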
