Please know that I am very new to unix and trying to learn 'on the job'. I'm only manipulating large tab-delimited files (millions of rows), but I'm stuck and don't know how to proceed with the following. Hoping for some friendly advice
I have 2 tab-delimited files - with differing column & row numbers in each.
Please note: some of the 'names' in column 1 (in each file) are repeated a varying number of times. Each of the real files are millions of rows each.
File 1:
File 2:
I would like to test the following:
IF the value in the first column & first row of File2 matches the value anywhere in the first column of File1, AND IF the second column & first row value of File2 is >= to the second column File1 but < the third column File1 (these would be on the corresponding row of File1 after the match established in the first 'IF'), THEN write the value from column 4 row 1 File2 to File3, ELSE write NA to File3.
After testing for every File2 column 1 values, File3 should look like the following:
To help understand, I can achieve this within Excel by doing the following
Paste File 1 from cells A1:C9
Paste File 2 from cells D1:G5
In H1 write the following formula:
Copy formula down to H5.
But – the real world files exceed the size of Excel, so transferring this to Unix is my problem.
I have tried combining the above File1 and File2 data into one file and then using the following command:
But that only tests on a row level.
I feel like I need to use awk with NR==FNR but ensure where to start?
I am hoping I have made that clear, and that someone can kindly provide some assistance
Many thanks
Ted (Completely new to unix and learning everyday)
Last edited by Scrutinizer; 03-25-2018 at 03:23 AM..
Reason: additional code tags
May I ask why file1 defines multiple but contiguous intervals which could be done in a single line defining min and max? E.g. 10 - 20, 20 - 30, and 30 - 40 could be condensed to 10 - 40 for Sc1 (which is what Scrutinizer's code does internally, assuming as well that intervals come in ascending order).
Would it be correct to assume that multiple interval definitions are to allow gaps between intervals? If yes, try (with e.g. the Sc1 20 30 line missing in the sample file)
With file1 from post#1, it's
In reality the files I have, that are millions of rows in size, are not contiguous or even nearly as uniform as File1 (many gaps exist). The data sets (and results) would not make sense if File1 was collapsed to min-max for each Scxxx name. I just produced this one so that the boundaries were easy to see for everyone.
The output (file3) definitely needs to be the below from the test files supplied:
Yes. How about posting a decent, representative sample of your data including gaps? So proposals given could be tested against samples reflecting reality?
Yes. How about posting a decent, representative sample of your data including gaps? So proposals given could be tested against samples reflecting reality?
Apologies, if there was any confusion. Please find below some data which matches the reality of the files.
Note:
Column 1 in File 1 & 2 is not a complete list of Scxxxx names, some are missing.
The ranges within File 1 columns 2&3 are no longer contiguous although they do not overlap for each Scxxxxx name.
Desired output from the above files should be:
File 2 has 15 lines to test in this example, so I would expect 15 lines in the output.
I have tried the following to no avail.
xargs -n8 < test.txt
awk '{if(NR%6!=0){p=""}else{p="\n"};printf $0" "p}' Mod_Alm_log.txt > test.txt
I have tried different variations of the above, the problem is mixes lines together.
And it includes the tags "%a and %A" I need them to be all tab... (16 Replies)
Hi,
Can anyone please tell me about how we can delete an entire column from a tab delimited file?
Mu input_file.txt looks like this:
And I want the output as:
I used the below code
nawk -v d="1" 'BEGIN{FS=OFS="\t"}{$d=""}{print}' input_file.txtBut in the output, the first column is... (5 Replies)
I have a file having the following entries:
test1 test2 test3
11 22 33
22 44 66
99 99 44
---
I want to add a column so that the above file becomes:
test1 test2 test3 notest
11 22 33 *
22 44 66 *
99 99 44 *
---
Thanks (6 Replies)
I have a file which looks like this:
73450 articles and news developmental psychology 2006-03-30 16:22:40 1 http://www.usnews.com
73450 articles and news developmental psychology 2006-03-30 16:22:40 2 http://www.apa.org
73450 articles and news developmental psychology 2006-03-30... (1 Reply)
Hi all,
I'm new to Unix and work primarily in bioinformatics. I am in need of a script which will allow me to replace "1" with "chr1" in only the first column of a file which looks like such:
1 10327 rs112750067 T C . PASS ASP;RSPOS=10327;... (4 Replies)
I want to add a new column to a tab delimited text file. It will be the first column and it will just be 1's. How do I go about doing that? Thanks! (1 Reply)
I have a large text-file with tab-delimited genetic data that looks like:
KSC112 KSC234 0 0 1 1 A G C T
I simply wan to delete the first column, but since the file has 600 000 columns, it is not possible with awk (seems to be limited at 32k columns).
Does anyone have an idea how to do this? (2 Replies)