Merge multiple tab delimited files with index checking
Hello,
I have 40 data files in which the first three columns are the same (in theory) and the 4th column differs. Here are three example file names,
file 1: A_f0_r179_pred.txt
file 2: A_f1_r173_pred.txt
file 3: A_f3_r243_pred.txt
In reality these files could have any number of rows.
What I need to do is aggregate the E0 columns into a single file, along with the Id and Name columns.
The trick is that I want to check the "Name" column value of each row every time a new column is added; it is very important that this data stays in registration. It would also help to take something from each file name to use as a header in place of E0, because I think having all of the columns named the same is asking for trouble. It would be very easy to have the script change this in each file beforehand if that would make more sense.
My current thought was to use cut or paste to merge all of the columns I want, including the Name columns, into one file like,
Then I could use IFS=$'\t' read -a to grab each line into an array and test the Name fields to make sure they are all the same for each row. If they are, I could output the data columns to a new file. I think that would work, but it would be pretty awkward.
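That paste-then-verify idea can be sketched roughly as below. This is only a sketch and assumes bash; the sample file contents, and the column positions (Name in field 2, E0 in field 3 here), are assumptions since the real layout isn't shown.

```shell
#!/bin/bash
# Sketch: merge the Name and E0 columns of each input file side by side,
# then verify that the Name fields agree on every row.
# The sample data below is made up so the sketch is self-contained.
printf 'Id\tName\tE0\n1\taaa\t0.11\n2\tbbb\t0.22\n' > A_f0_r179_pred.txt
printf 'Id\tName\tE0\n1\taaa\t0.33\n2\tbbb\t0.44\n' > A_f1_r173_pred.txt

# pull Name and E0 (fields 2,3 here) out of each file and paste side by side
paste <(cut -f2,3 A_f0_r179_pred.txt) <(cut -f2,3 A_f1_r173_pred.txt) > merged.tmp

# walk the merged rows; array slots 0,2,4,... hold the Name copies
while IFS=$'\t' read -r -a f; do
    for ((i = 2; i < ${#f[@]}; i += 2)); do
        [[ ${f[i]} == "${f[0]}" ]] || echo "row mismatch: ${f[0]} vs ${f[i]}" >&2
    done
done < merged.tmp
```

Note the tab separator has to be written IFS=$'\t'; IFS='\t' would set IFS to the two literal characters backslash and t.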
At some point, I also need to create a new column with the average of all of the data columns for each row.
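The per-row average is a natural fit for awk. A minimal sketch, assuming the data columns start at field 3 of the merged file (the sample data and the FIRST setting are assumptions):

```shell
# Sketch: append a per-row mean of the data columns (fields FIRST..NF);
# adjust FIRST to wherever the data columns actually start.
printf '1\taaa\t0.10\t0.30\n2\tbbb\t0.20\t0.40\n' > data.tmp

awk -F'\t' -v OFS='\t' -v FIRST=3 '{
    sum = 0
    for (i = FIRST; i <= NF; i++) sum += $i
    print $0, sum / (NF - FIRST + 1)      # original row plus its mean
}' data.tmp
# first row becomes: 1  aaa  0.10  0.30  0.2
```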
The final script didn't actually work, but I thought I would post it anyway in case it is helpful. It was supposed to allow the header value of the index key to be passed in the call to the script, along with the header names of the columns to be output.
There are a great many ways to do this, so suggestions are greatly appreciated.
LMHmedchem
The script below was kindly suggested by Chubler_XL. I believe it would work for what I need, but the output has the id column out of order and includes many blank rows interspersed with the data.
What if you pipe the output through a sort operation?
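Something along these lines would reorder the rows and drop the blank ones at the same time (a sketch; it assumes the id is a numeric first field and that the file has a header row to keep in place):

```shell
# Sketch: keep the header line first, drop blank lines from the rest,
# and sort the data rows numerically on the id column (field 1 here).
printf 'Id\tVal\n3\tc\n\n1\ta\n\n2\tb\n' > unsorted.tmp

{ head -1 unsorted.tmp
  tail -n +2 unsorted.tmp | awk 'NF' | sort -t $'\t' -k1,1n
} > sorted.tmp
cat sorted.tmp
```

awk 'NF' prints only lines that have at least one field, which removes the interspersed blanks before sorting.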
Hi.
It looks like you have a number of requests for help / requirements:
1) aggregate the E0 fields into a single file along with the Id and Name columns -- for 40 files -- a join operation
2) create a new column with the average of all of the data columns for each row
3) take something from each file name to use for a header in place of E0
You seem to like to use awk, but given your heavy use of what are essentially csv files (with TABs in place of commas), I think acquiring and learning a csv-specific tool would be useful. That's up to you, of course.
I found that I could use csvtool to at least start on this. Its join is far better than the system join, which deals with only two files. So here is, without the supporting scaffolding, what csvtool could easily do with your three sample files.
producing:
However, csvtool does not do arithmetic directly, and incorporating the file name (or some other distinguishing feature) to replace the E0 header also does not seem to be doable. I may look at csvfix, ffe, CRUSH, etc., to see how they might apply.
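For comparison, the join plus the Name check can also be done in one awk pass over all of the files. This is only a sketch, not csvtool's method: the sample files, and the assumption that Id, Name, and E0 sit in fields 1, 2, and 4, are mine.

```shell
# Sketch: collect Id, Name, and E0 (assumed field 4) from every file
# into one table keyed by Id, verifying Name agrees across files.
printf 'Id\tName\tX\tE0\n1\taaa\t9\t0.1\n2\tbbb\t9\t0.2\n' > A_f0_r179_pred.txt
printf 'Id\tName\tX\tE0\n1\taaa\t9\t0.3\n2\tbbb\t9\t0.4\n' > A_f1_r173_pred.txt

awk -F'\t' -v OFS='\t' '
FNR == 1 { nf++; next }                  # skip each header line
{
    if (name[$1] == "") name[$1] = $2
    else if (name[$1] != $2)
        print "Name mismatch for id " $1 " in " FILENAME > "/dev/stderr"
    val[$1] = val[$1] OFS $4             # append this file'\''s E0 value
    if (nf == 1) order[++n] = $1         # remember row order from file 1
}
END {
    for (i = 1; i <= n; i++)
        print order[i], name[order[i]] val[order[i]]
}' A_f0_r179_pred.txt A_f1_r173_pred.txt > merged.txt
cat merged.txt
```

Because everything is keyed on Id and emitted in the first file's row order, the output stays in registration without a separate sort step.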
Best wishes ... cheers, drl
The runtime was ~40 seconds for 40 input files of 2500 rows each. That's not too awful, but I think this code is a bit ghastly. It would be faster if I collected all of the data in memory instead of writing it to a file and then reading it back in.
This solution also uses sed in the pipe to replace the E0 values with a value read from the file name as the data is passed to the new file. That is almost the only thing about this script that I like. The code is not generalized, but it could be made a bit more so in a few places.
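The sed-in-the-pipe header swap might look something like this. It is a sketch, not the actual script: the E0 header and the _pred.txt file naming come from the thread, but the rest is assumed.

```shell
# Sketch: derive a header token from the file name ("A_f0_r179" from
# "A_f0_r179_pred.txt") and substitute it for E0 on the header line only.
printf 'Id\tName\tE0\n1\taaa\t0.1\n' > A_f0_r179_pred.txt

for f in A_f0_r179_pred.txt; do
    stem=${f%_pred.txt}                     # strip the fixed suffix
    sed "1s/E0/${stem}/" "$f" > "${stem}_renamed.txt"
done
cat A_f0_r179_renamed.txt
```

The 1s/ address restricts the substitution to the header line, so E0-like strings in the data rows are left alone.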
RudiC, I will check out your latest post in a few minutes.
LMHmedchem
I made a few modifications to the script posted by RudiC.
This just changes the code that creates the substitute header from

    HD = HD OFS $3 OFS $4 "_" T[2]

to

    HD = HD OFS $3 OFS T[1] "_" T[2] "_" T[3]
For the file name "A_f0_r179_pred.txt", this results in the header "A_f0_r179" instead of the header "E0_f0".
It also changes the input file glob from

    A_*_pred.txt

to

    *_*_pred.txt

because there are file names that start with letters other than A.
This runs in 0.2 seconds (compared to 40 seconds for my script). The only remaining issue is that the Name columns all still appear in the final output, and I only need Name once.
I could add more code to process the output and remove all of the "Name" columns except the first one.
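That post-processing step could be a small awk filter like the one below. This is a sketch and assumes the repeated key columns are all literally headed "Name" in the output's header row.

```shell
# Sketch: keep the first column headed "Name" and drop later repeats,
# deciding which columns to keep from the header row alone.
printf 'Id\tName\tE0a\tName\tE0b\n1\taaa\t0.1\taaa\t0.2\n' > merged_all.tmp

awk -F'\t' -v OFS='\t' '
NR == 1 {
    m = 0; seen = 0
    for (i = 1; i <= NF; i++)
        if ($i != "Name" || !seen++)    # keep non-Name cols + first Name
            keep[++m] = i
}
{
    line = $(keep[1])
    for (j = 2; j <= m; j++) line = line OFS $(keep[j])
    print line
}' merged_all.tmp > merged_one_name.tmp
cat merged_one_name.tmp
```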
Changing the header line from

    HD = HD OFS $4 "_" T[1] "_" T[2] "_" T[3]

to

    HD = HD OFS T[1] "_" T[2] "_" T[3]

skips the original "E0" in the new header name.
Run time was 0.2 seconds to process 40 files with 2500 rows and 43 columns.
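In isolation, the filename-to-header step from the quoted line behaves like this (a sketch showing only the split/concatenate logic, not the full script):

```shell
# Sketch: split one file name on "_" and build the header piece the way
# the modified line does: T[1]="A", T[2]="f0", T[3]="r179".
echo "A_f0_r179_pred.txt" | awk -v OFS='\t' '{
    split($0, T, "_")
    HD = HD OFS T[1] "_" T[2] "_" T[3]
    print HD
}'
# prints a leading TAB followed by A_f0_r179
```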
I can more or less follow what this script is doing. I guess you could make it more general by using variables for the columns you are checking and outputting?
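Passing the column positions in with -v is one way to do that (a sketch; the variable names and sample layout are assumptions):

```shell
# Sketch: supply the key and data column numbers on the command line
# so the same filter works on differently shaped files.
printf 'Id\tName\tX\tE0\n1\taaa\t9\t0.1\n' > sample.tmp

awk -F'\t' -v OFS='\t' -v keycol=2 -v datacol=4 '
NR > 1 { print $keycol, $datacol }' sample.tmp
# prints: aaa  0.1  (tab-separated)
```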