I have a need to merge two files on the value of an index column.
input file 1
input file 2
output file
I could probably do this in awk, or even with join, but I need to add logic to check each pair of index values to make sure that the data stays in registration. I think this means a higher level language like ruby, python, or perl, but I am not very good with any of those. In most cases, the files will match correctly, but I think I need to add exception handling to check that the files have the same number of rows and that the index values are in the right order.
Can someone point me to a tutorial for one of these languages that shows sample code for loading two files and merging output. I would guess you would hash the column header so you can specify which column is the index. I can do this in excel, but there are 150,000 lines in some of these files and I have allot of them to do.
Thanks for the post. One of the issues I am looking at here is that I use things like this fairly often, but the formats of the files can be very different. They could have hundreds of columns and the header name and location of the index column can vary quire a bit. In most cases, it is easier to use a header name to specify the index column and it's harder to use a simple tokenizer where the number of columns and location of the index is not constant. I am trying to get away from scripts with significant hard coding that needs to be changed.
In c++, I might create a hash table using the index value of the second file ask the key and the remainder of the row as the value. I would read in the second file and store it on the index value and then loop through the first file looking up the file 2 data that goes with each file 1 row. This would also mean that the files wouldn't have to be in the same order. That wouldn't be super fast, but processing files with 150,000+ rows will take some time no matter what the method. I don't really know how to do that sort of thing in a scripting language.
I don't think that something of that sophistication is really necessary, if the rows are not in the same order, another script could be used to sort them. I think that all that is really necessary is to check that the number of rows is the same and then test each row for matching key values, but I would like to have a script that will just take arguments for the filenames and index header name and have the rest work for pretty much any pair of files.
The main point here is to do some checking to make sure that the index values match. There are allot of ways to stuff two files together, but some of these files are massive and have had quite a bit of processing done on them. There is absolutely no guarantee that they will match up, or that the index key will always be in the same column. The value of the index needs to be checked for each pair of rows to make sure that matching data is merged in the output.
Thanks for the post, that seems to get me allot of the way there.
I modified as I would use it in a bash script,
run as,
./data_merge_awk.sh MDL_NUMBER index_file merge_file > output_file
The only issue is that I most often use this on tab delimited data. I tried changing the split argument from " " to "/t", but that doesn't do it.
The only downside I can see is having to hard code all of the columns I need to keep for each separate use. Some of the files I use this kind of thing for would have hundreds of columns. I'm not familiar with specifying the field names like you did. Is there a syntax for specifying "all columns but this one", etc?
Hello,
I have 40 data files where the first three columns are the same (in theory) and the 4th column is different. Here is an example of three files,
file 2: A_f0_r179_pred.txt
Id Group Name E0
1 V N(,)'1 0.2904
2 V N(,)'2 0.3180
3 V N(,)'3 0.3277
4 V N(,)'4 0.3675
5 V N(,)'5 0.3456
... (8 Replies)
Hi friends,
I have implemented the merge sort algorith in c, before I put forward my question, you please have a look at my code.
// The array is sorted, as 1234
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int A = {4, 3, 2, 1};
void Merge_Sort(int , int, int);
void... (0 Replies)
Hi All,
I have a file (FileNames.txt) which contains the following data in it.
$ cat FileNames.txt
MYFILE17XXX208Sep191307.csv
MYFILE19XXX208Sep192124.csv
MYFILE20XXX208Sep192418.csv
MYFILE22XXX208Sep193234.csv
MYFILE21XXX208Sep193018.csv
MYFILE24XXX208Sep194053.csv... (5 Replies)
Hi,
I have a very basic knowledge of shell scripting & would like some help with a little problem I have. I sometimes use a program calle phronix & sometimes like to compare its results which are *.xml files. Which is easy enough but a friend wants to avoid typing the path to the files.... (2 Replies)
Hello,
I have two txt files that look like this:
db.0.0.0.0:
Total number of NS records = 1
db.127.0.0.0:
Total number of NS records = 1
Total number of PTR records = 1
db.172.19.0.0:
Total number of NS records = 1
Total number of PTR records = 3
db.172.19.59.0:
Total... (8 Replies)
why do inode indices starts from 1 unlike array indexes which starts from 0
its a question from "the design of unix operating system" of maurice j bach
id be glad if i get to know the answer quickly
:) (0 Replies)
Hi
I want to merge two or more files using perl in windows only(Just like Paste command in Unix script) . How can i do this.Is ther any single command to do this?
Thanks
Kunal (1 Reply)
Can anybody help me out with this problem
" a shell program that takes one or any number of file names as input; sorts the lines of each file in ascending order and displays the non blank lines of each sorted file and merge them as one combined sorted file. The program generates an error... (1 Reply)