UNIX - 2 tab delimited files, conditional column extraction

03-25-2018

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

OK; and what are your results when applying either of the above proposals?

RudiC


03-25-2018

Registered User

6, 0

Join Date: Mar 2018

Last Activity: 26 March 2018, 6:06 AM EDT

Posts: 6

Thanks Given: 2

Thanked 0 Times in 0 Posts

Quote:

Originally Posted by RudiC

OK; and what are your results when applying either of the above proposals?

With the second set of test data:

Scrutinizer's script incorrectly identifies the second line of file 2 as a match:

Code:

100
50 - This one should be NA
25
40
NA
NA
22
NA
NA
21
12
35
90
80
NA

Your script correctly identifies all records:

Code:

100
NA
25
40
NA
NA
22
NA
NA
21
12
35
90
80
NA

If you have the time, could you please help me understand the code you've very kindly provided? Hopefully I can then write my own for similar tasks in the future.
Cheers

---------- Post updated at 11:31 AM ---------- Previous update was at 10:59 AM ----------

Quote:

Originally Posted by Scrutinizer

Hi, see if this works:

Code:

awk -F'\t' '
  NR==FNR {
    if(!($1 in L))
      L[$1]=$2
    R[$1]=$3
    next
  }
  {
    print ($2>=L[$1] && $2<R[$1])?$4:"NA"
  }
' file1 file2

Hi - With an expanded dataset this script incorrectly matched the second line in file 2.
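
A likely reason, sketched with made-up data (the key `chr1`, the numbers, and the temp file names are assumptions, not the real dataset): the quoted script keeps only the first left border and the last right border per key, so several intervals for one key collapse into a single wide interval. A value that falls in the gap between two intervals then still counts as a match. The awk body below restates the quoted logic with an explicit variable instead of the inline ternary.

```shell
# Hypothetical data: one key with two separate intervals, [10,20) and [50,60).
tmp=$(mktemp -d)
printf 'chr1\t10\t20\nchr1\t50\t60\n' > "$tmp/file1"
printf 'chr1\t15\tx\t100\nchr1\t30\tx\t50\nchr1\t55\tx\t25\n' > "$tmp/file2"

# Same logic as the quoted script: L keeps the FIRST left border per key,
# R is overwritten each time and so ends up holding the LAST right border.
collapsed=$(awk -F'\t' '
  NR==FNR {
    if(!($1 in L))
      L[$1]=$2
    R[$1]=$3
    next
  }
  {
    out = ($2>=L[$1] && $2<R[$1]) ? $4 : "NA"
    print out
  }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$collapsed"
# Prints:
# 100
# 50    <- 30 lies between the two intervals, but is tested against [10,60)
# 25

rm -rf "$tmp"
```

With this toy input the second line comes out as 50 instead of NA, which is the same kind of false match as reported above.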


GTed


03-26-2018


Quote:

Originally Posted by GTed

. . . help me understand the code you've very kindly provided? Hopefully then I can write my own for similar tasks in the future. . . .

THAT's the right spirit that we are after in these fora!

Here you go; further questions welcome (after having read the man page); have fun:

Code:

awk -F"\t" '                                                                    # start awk with TAB as the field separator
NR == FNR       {INT[$1] = INT[$1] $2 "-" $3 FS                                 # first file only (total record number NR equals the
                                                                                # per-file record number FNR): append this interval
                                                                                # to a list indexed by $1, stored as L-R L-R L-R etc.
                                                                                # with FS (TAB) after each entry
                 next                                                           # stop processing this line, start over with the next
                }
                                                                                # everything below runs for the second file only
                {split (INT[$1], T)                                             # split the interval list for this key into individual
                                                                                # L-R entries in temp array T
                 OUT = "NA"                                                     # default OUT in case no match is found
                 for (t in T)   {split (T[t], LM, "-")                          # loop across all individual L-R entries, split each
                                                                                # one into limits array LM, with LM[1] holding the
                                                                                # L(eft) and LM[2] the R(ight) border
                                 if ($2 >= LM[1] && $2 < LM[2]) OUT = $4        # if $2 lies in [L, R), set OUT to $4
                                }
                 print OUT                                                      # print $4 or "NA"
                }
' file1 file2                                                                   # specify input files
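
To see the two stages in isolation, here is a small sketch on made-up data (the key `chr1`, the numbers, and the temp file names are assumptions for illustration only). The first command dumps the per-key interval list built from file1, with the TAB separators shown as `|`; the second runs the same matching logic on file2.

```shell
# Hypothetical data: one key with two separate intervals, [10,20) and [50,60).
tmp=$(mktemp -d)
printf 'chr1\t10\t20\nchr1\t50\t60\n' > "$tmp/file1"
printf 'chr1\t15\tx\t100\nchr1\t30\tx\t50\nchr1\t55\tx\t25\n' > "$tmp/file2"

# Stage 1: what the interval list looks like after reading file1.
lists=$(awk -F'\t' '
  { INT[$1] = INT[$1] $2 "-" $3 FS }            # append L-R plus a TAB per line
  END { for (k in INT) { s = INT[k]; gsub(FS, "|", s); print k ": " s } }
' "$tmp/file1")
printf '%s\n' "$lists"
# Prints: chr1: 10-20|50-60|

# Stage 2: the matching logic, testing $2 against every interval of its key.
result=$(awk -F'\t' '
  NR == FNR { INT[$1] = INT[$1] $2 "-" $3 FS; next }
  {
    split(INT[$1], T)                           # one L-R entry per element
    OUT = "NA"
    for (t in T) {
      split(T[t], LM, "-")                      # LM[1] = left, LM[2] = right
      if ($2 >= LM[1] && $2 < LM[2]) OUT = $4
    }
    print OUT
  }
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
# Prints:
# 100
# NA
# 25

rm -rf "$tmp"
```

Because each interval is tested on its own, the value 30 in the gap between the two intervals yields NA.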

RudiC


03-26-2018



Hugely appreciate the time you've taken to help me out. I'll now take some time to break this down, read around, and hopefully digest.

It runs in about 3 hours on the 'real' dataset.
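
If the runtime becomes a problem, one idea that might help (not measured on the real dataset) is to store the borders in numbered arrays while reading file1, so that nothing has to be re-split for every line of file2, and to leave the loop at the first match. The array names `N`, `LO` and `HI` and the test data below are assumptions for illustration. Since any match prints the same `$4`, stopping early should not change the output.

```shell
# Hypothetical data: one key with two separate intervals, [10,20) and [50,60).
tmp=$(mktemp -d)
printf 'chr1\t10\t20\nchr1\t50\t60\n' > "$tmp/file1"
printf 'chr1\t15\tx\t100\nchr1\t30\tx\t50\nchr1\t55\tx\t25\n' > "$tmp/file2"

fast=$(awk -F'\t' '
  NR == FNR {                     # first file: number the intervals per key
    n = ++N[$1]
    LO[$1, n] = $2                # left border of the n-th interval
    HI[$1, n] = $3                # right border of the n-th interval
    next
  }
  {                               # second file: test $2 against each interval
    OUT = "NA"
    for (i = 1; i <= N[$1]; i++)
      if ($2 >= LO[$1, i] && $2 < HI[$1, i]) { OUT = $4; break }
    print OUT
  }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$fast"
# Prints:
# 100
# NA
# 25

rm -rf "$tmp"
```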

You're a legend

GTed

