UNIX - 2 tab delimited files, conditional column extraction


 
Thread Tools Search this Thread
Top Forums UNIX for Beginners Questions & Answers UNIX - 2 tab delimited files, conditional column extraction
# 1  
Old 03-24-2018
UNIX - 2 tab delimited files, conditional column extraction

Please know that I am very new to unix and trying to learn 'on the job'. I'm only manipulating large tab-delimited files (millions of rows), but I'm stuck and don't know how to proceed with the following. Hoping for some friendly advice Smilie

I have 2 tab-delimited files - with differing column & row numbers in each.

Please note: some of the 'names' in column 1 (in each file) are repeated a varying number of times. Each of the real files are millions of rows each.

File 1:
Code:
Sc1 10  20
Sc1 20  30
Sc1 30  40
Sc2 40  50
Sc2 50  60
Sc2 60  70
Sc3 70  80
Sc3 80  90
Sc3 90  100

File 2:
Code:
Sc1 22  23  100
Sc1 44  45  50
Sc2 60  61  25
Sc3 79  80  40
Sc3 100 101 40
Sc4 60 61 25

I would like to test the following:

IF the value in the first column & first row of File2 matches the value anywhere in the first column of File1, AND IF the second column & first row value of File2 is >= to the second column File1 but < the third column File1 (these would be on the corresponding row of File1 after the match established in the first 'IF'), THEN write the value from column 4 row 1 File2 to File3, ELSE write NA to File3.

After testing for every File2 column 1 values, File3 should look like the following:

Code:
100
NA
25
40
NA
NA

To help understand, I can achieve this within Excel by doing the following

Paste File 1 from cells A1:C9

Paste File 2 from cells D1:G5

In H1 write the following formula:

Code:
=IF(COUNTIFS($A$1:$A$9,D1, $B$1:$B$9,"<="&E1,$C$1:$C$9,">"&E1)>0,G1,"NA")

Copy formula down to H5.

But – the real world files exceed the size of Excel, so transferring this to Unix is my problem.

I have tried combining the above File1 and File2 data into one file and then using the following command:
Code:
awk '{if ($4=$1 && $5>=$2 && $5<$3) {print $7}}' file3.txt > file4.txt

But that only tests on a row level.

I feel like I need to use awk with NR==FNR but ensure where to start?

I am hoping I have made that clear, and that someone can kindly provide some assistance Smilie

Many thanks

Ted (Completely new to unix and learning everyday)

Last edited by Scrutinizer; 03-25-2018 at 03:23 AM.. Reason: additional code tags
# 2  
Old 03-25-2018
Hi, see if this works:
Code:
awk -F'\t' '
  NR==FNR {
    if(!($1 in L))
      L[$1]=$2
    R[$1]=$3
    next
  }
  {
    print ($2>=L[$1] && $2<R[$1])?$4:"NA"
  }
' file1 file2

This User Gave Thanks to Scrutinizer For This Post:
# 3  
Old 03-25-2018
Hey thanks for the help, I'm away from a unix box to test at the moment....... I'll try tomorrow and get back to you.

You're a legend! If this works hopefully you can help me understand the code lines Smilie
# 4  
Old 03-25-2018
May I ask why file1 defines multiple but contiguous intervals which could be done in a single line defining min and max? E.g. 10 - 20, 20 - 30, and 30 - 40 could be condensed to 10 - 40 for Sc1 (which is what Scrutinizer's code does internally, assuming as well that intervals come in ascending order).
Would it be correct to assume that multiple interval definitions are to allow gaps between intervals? If yes, try (with e.g. the Sc1 20 30 line missing in the sample file)
Code:
awk -F"\t" '
NR == FNR       {INT[$1] = INT[$1] $2 "-" $3 FS
                 next
                }
                {split (INT[$1], T)
                 OUT = "NA"
                 for (t in T)   {split (T[t], LM, "-")
                                 if ($2 >= LM[1] && $2 < LM[2]) OUT = $4
                                }
                 print OUT
                } 

' file1 file2
NA
NA
25
40
NA
NA

With file1 from post#1, it's
Code:
100
NA
25
40
NA
NA

This User Gave Thanks to RudiC For This Post:
# 5  
Old 03-25-2018
In reality the files I have, that are millions of rows in size, are not contiguous or even nearly as uniform as File1 (many gaps exist). The data sets (and results) would not make sense if File1 was collapsed to min-max for each Scxxx name. I just produced this one so that the boundaries were easy to see for everyone.

The output (file3) definitely needs to be the below from the test files supplied:

100
NA
25
40
NA
NA

Last edited by GTed; 03-25-2018 at 05:36 AM..
# 6  
Old 03-25-2018
Yes. How about posting a decent, representative sample of your data including gaps? So proposals given could be tested against samples reflecting reality?
This User Gave Thanks to RudiC For This Post:
# 7  
Old 03-25-2018
Quote:
Originally Posted by RudiC
Yes. How about posting a decent, representative sample of your data including gaps? So proposals given could be tested against samples reflecting reality?
Apologies, if there was any confusion. Please find below some data which matches the reality of the files.

Note:
Column 1 in File 1 & 2 is not a complete list of Scxxxx names, some are missing.

The ranges within File 1 columns 2&3 are no longer contiguous although they do not overlap for each Scxxxxx name.

Code:
File 1		
Sc1	1	11
Sc1	18	36
Sc1	45	48
Sc2	23	50
Sc2	67	112
Sc2	235	245
Sc3	2	23
Sc3	29	40
Sc3	90	100
Sc5	44	99
Sc8	12	16
Sc8	18	22
Sc8	66	88
Sc8	111	119
Sc8	555	578
Sc10	20	27
Sc10	40	54
Sc10	80	88
Sc10	120	156
Sc10	170	222

Code:
File 2			
Sc1	22	23	100
Sc1	44	45	50
Sc2	238	239	25
Sc3	38	39	40
Sc3	100	101	40
Sc4	90	91	10
Sc5	46	47	22
Sc6	44	45	99
Sc7	1	2	100
Sc8	12	13	21
Sc8	84	85	12
Sc8	561	562	35
Sc10	42	43	90
Sc10	124	125	80
Sc11	999	1000	100

Desired output from the above files should be:

Code:
100
NA
25
40
NA
NA
22
NA
NA
21
12
35
90
80
NA

File 2 has 15 lines to test in this example, so I would expect 15 lines in the output.

Your help is very much appreciated.
Cheers

Last edited by GTed; 03-25-2018 at 09:42 PM..
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Replace a column in tab delimited file with column in other tab delimited file,based on match

Hello Everyone.. I want to replace the retail col from FileI with cstp1 col from FileP if the strpno matches in both files FileP.txt ... (2 Replies)
Discussion started by: YogeshG
2 Replies

2. UNIX for Dummies Questions & Answers

awk - Extract 4 lines in Column to Rows Tab Delimited between tags

I have tried the following to no avail. xargs -n8 < test.txt awk '{if(NR%6!=0){p=""}else{p="\n"};printf $0" "p}' Mod_Alm_log.txt > test.txt I have tried different variations of the above, the problem is mixes lines together. And it includes the tags "%a and %A" I need them to be all tab... (16 Replies)
Discussion started by: mytouchsr
16 Replies

3. Shell Programming and Scripting

Delete an entire column from a tab delimited file

Hi, Can anyone please tell me about how we can delete an entire column from a tab delimited file? Mu input_file.txt looks like this: And I want the output as: I used the below code nawk -v d="1" 'BEGIN{FS=OFS="\t"}{$d=""}{print}' input_file.txtBut in the output, the first column is... (5 Replies)
Discussion started by: sampoorna
5 Replies

4. Shell Programming and Scripting

Convert a 3 column tab delimited file to a matrix

Hi all, I have a 3 columns input file like this: CPLX9PC-4943 CPLX9PC-4943 1 CPLX9PC-4943 CpxID123 0 CPLX9PC-4943 CpxID126 0 CPLX9PC-4943 CPLX9PC-5763 0.5 CPLX9PC-4943 CpxID13 0 CPLX9PC-4943 CPLX9PC-6163 0 CPLX9PC-4943 CPLX9PC-6164 0.04... (7 Replies)
Discussion started by: AshwaniSharma09
7 Replies

5. UNIX for Dummies Questions & Answers

add (append) a column in a tab delimited file

I have a file having the following entries: test1 test2 test3 11 22 33 22 44 66 99 99 44 --- I want to add a column so that the above file becomes: test1 test2 test3 notest 11 22 33 * 22 44 66 * 99 99 44 * --- Thanks (6 Replies)
Discussion started by: mary271
6 Replies

6. Shell Programming and Scripting

Extract second column tab delimited file

I have a file which looks like this: 73450 articles and news developmental psychology 2006-03-30 16:22:40 1 http://www.usnews.com 73450 articles and news developmental psychology 2006-03-30 16:22:40 2 http://www.apa.org 73450 articles and news developmental psychology 2006-03-30... (1 Reply)
Discussion started by: shoaibjameel123
1 Replies

7. UNIX for Dummies Questions & Answers

Using awk to log transform a column in a tab-delimited text file?

How do I use awk to log transform the fifth column of a tab-delimited text file? Thanks! (1 Reply)
Discussion started by: evelibertine
1 Replies

8. Shell Programming and Scripting

Using sed on 1st column of tab delimited file

Hi all, I'm new to Unix and work primarily in bioinformatics. I am in need of a script which will allow me to replace "1" with "chr1" in only the first column of a file which looks like such: 1 10327 rs112750067 T C . PASS ASP;RSPOS=10327;... (4 Replies)
Discussion started by: Hkins552
4 Replies

9. UNIX for Dummies Questions & Answers

Add a new column to a tab delimited text file

I want to add a new column to a tab delimited text file. It will be the first column and it will just be 1's. How do I go about doing that? Thanks! (1 Reply)
Discussion started by: evelibertine
1 Replies

10. Shell Programming and Scripting

Delete first column in tab-delimited text-file

I have a large text-file with tab-delimited genetic data that looks like: KSC112 KSC234 0 0 1 1 A G C T I simply wan to delete the first column, but since the file has 600 000 columns, it is not possible with awk (seems to be limited at 32k columns). Does anyone have an idea how to do this? (2 Replies)
Discussion started by: andmal
2 Replies
Login or Register to Ask a Question