UNIX - 2 tab delimited files, conditional column extraction


 
# 8  
Old 03-25-2018
OK; and what are your results when applying either of the above proposals?
# 9  
Old 03-25-2018
Quote:
Originally Posted by RudiC
OK; and what are your results when applying either of the above proposals?
With the second set of test data:

Scrutinizer's script incorrectly identifies the second line of file 2 as a match:
Code:
100
50 - This one should be NA
25
40
NA
NA
22
NA
NA
21
12
35
90
80
NA


Your script correctly identifies all records :)
Code:
100
NA
25
40
NA
NA
22
NA
NA
21
12
35
90
80
NA

If you have the time could you please help me understand the code you've very kindly provided? Hopefully then I can write my own for similar tasks in the future.
Cheers

---------- Post updated at 11:31 AM ---------- Previous update was at 10:59 AM ----------

Quote:
Originally Posted by Scrutinizer
Hi, see if this works:
Code:
awk -F'\t' '
  NR==FNR {
    if(!($1 in L))
      L[$1]=$2
    R[$1]=$3
    next
  }
  {
    print ($2>=L[$1] && $2<R[$1])?$4:"NA"
  }
' file1 file2

Hi - With an expanded dataset this script incorrectly matched the second line in file 2.
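
Reading the quoted script, a likely cause is that L[$1] keeps only the first left limit seen for a key and R[$1] only the last right limit, so several intervals for one key collapse into a single span and a value in the gap between them still passes the test. A minimal sketch with made-up data (the key chr1, the numbers, and the labels GAP and HIT are assumptions for illustration, not the thread's test files):

```shell
#!/bin/sh
# Hypothetical data: two separate intervals for the same key.
tmp=$(mktemp -d)
printf 'chr1\t10\t20\nchr1\t30\t40\n' > "$tmp/file1"
# 25 lies in the gap between the two intervals; 35 lies inside the second one.
printf 'chr1\t25\tx\tGAP\nchr1\t35\tx\tHIT\n' > "$tmp/file2"

# Same logic as the quoted script: first left limit, last right limit per key.
collapsed=$(awk -F'\t' '
  NR==FNR { if (!($1 in L)) L[$1]=$2; R[$1]=$3; next }
  { print (($2>=L[$1] && $2<R[$1]) ? $4 : "NA") }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$collapsed"
rm -rf "$tmp"
```

With this data the script prints GAP and then HIT, although 25 falls in neither interval and the first line should be NA.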

Last edited by Scrutinizer; 03-25-2018 at 11:18 PM.. Reason: Code Tags
# 10  
Old 03-26-2018
Quote:
Originally Posted by GTed
. . . help me understand the code you've very kindly provided? Hopefully then I can write my own for similar tasks in the future.
.
.
.
THAT's the right spirit that we're after in these fora!

Here you go; further questions welcome (after having read the man page); have fun:

Code:
awk -F"\t" '                            # start awk and set the field separator to tab
NR == FNR {                             # first file only: total record number (NR) equals
                                        # the per-file record number (FNR)
    INT[$1] = INT[$1] $2 "-" $3 FS      # append this interval to a list indexed by $1,
                                        # stored as L-R L-R L-R etc., separated by FS
    next                                # stop processing this line, start over with the next one
}
{                                       # this block runs for the second file only
    split(INT[$1], T)                   # split the interval list into individual L-R entries
                                        # in temp array T
    OUT = "NA"                          # default for OUT in case no match is found
    for (t in T) {                      # loop across all individual L-R entries
        split(T[t], LM, "-")            # split each one into limits array: LM[1] holds the
                                        # L(eft) and LM[2] the R(ight) border
        if ($2 >= LM[1] && $2 < LM[2])  # if $2 fits between the limits,
            OUT = $4                    # set OUT to $4
    }
    print OUT                           # and print it
}
' file1 file2                           # specify input files
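
To see the commented script at work, here is a small sketch on made-up data (the keys chr1 and chr2, the numbers, and the labels are assumptions for illustration only, not the thread's files). It keeps each interval separately, so a value in the gap between two intervals, or a key that is missing from file1, comes out as NA:

```shell
#!/bin/sh
# Hypothetical data: two intervals for chr1, none for chr2.
tmp=$(mktemp -d)
printf 'chr1\t10\t20\nchr1\t30\t40\n' > "$tmp/file1"
printf 'chr1\t25\tx\tGAP\nchr1\t35\tx\tHIT\nchr2\t5\tx\tNOKEY\n' > "$tmp/file2"

result=$(awk -F'\t' '
NR == FNR { INT[$1] = INT[$1] $2 "-" $3 FS; next }   # chr1 -> "10-20<TAB>30-40<TAB>"
{
    split(INT[$1], T)                 # split on FS (tab) into individual L-R entries
    OUT = "NA"
    for (t in T) {
        split(T[t], LM, "-")          # LM[1] = left limit, LM[2] = right limit
        if ($2 >= LM[1] && $2 < LM[2]) OUT = $4
    }
    print OUT
}
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

This prints NA, HIT, NA: 25 fits neither 10-20 nor 30-40, 35 fits 30-40, and chr2 has no intervals at all.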

# 11  
Old 03-26-2018
Quote:
Originally Posted by RudiC
THAT's the right spirit that we're after in these fora!

Hugely appreciate the time you've taken to help me out. I'll now take some time to break this down, read around, and hopefully digest :)

It runs in about 3 hours on the 'real' dataset.

You're a legend :)