Hello;
I am posting to get any help on my code that I have been struggling for some time. The project is to join two files each with 80k~180k rows. I want to merge them together by the shared common column. The problem of the shared column is partially matching, not exactly the same.
File1:
File2:
I want to merge the two file to get like:
Here is my code:
Note the first column of the File2 contains only part of the first column of File1 before the first vertical bar "|". And not all of the rows of File1 has a match in File2, may be 80k out of 180k. They are big files.
It was running, but did not append the matched part of File2 to File1. Could anyone give me some clue?
I found this join/merge problem is quite common in my work, and I do not have database like MySQL. It would be great for me to catch the spirit of the coding for this.
Thanks a lot!
Yifang
The two files do not seem to match even partially!
The string "mira_assembly_rep_c" is present in line 1 of File 1, and it is present in at the beginning of all lines of File 2.
Other than there, there is nothing in common.
What's the logic for the merged file then?
Why is the 2nd line of merged file as follows?
And why is the 3rd line of the merged file like so ?
Maybe you could explain how you derived the merged file.
I could not show either file as they are big (180000 rows and 90000rows). That's why I need some help here. What I posted were just part of the files and they do not have common columns at first sight. But, if I search through the rest of the file, for sure there are matches between them. I should have be more careful with the example part. Here I rearranged the first several rows of each file.
File1:
Note the "mira_assembly_rep_c3140" are the shared part in columns1 of File1.
In File2:
In File2, the ID_Part column is partially matching the Column1 (ID) of File1. in my code; I used condition to test the match:
The merged file contains the full ID, sequence and gene symbol and function description of the gene.
The merged file will be appended to another database with the ID as identifier. That's why I need the two files merged.
Quote:
The two files do not seem to match even partially!
The string "mira_assembly_rep_c" is present in line 1 of File 1, and it is present in at the beginning of all lines of File 2.
Other than there, there is nothing in common.
Actually "mira_assembly_rep_c3" instead of "mira_assembly_rep_c" was used for matching test, that's why I append a vertical bar "|" at the end of the string
Quote:
What's the logic for the merged file then?
The logic is tricky as the match is the middle part of the string.
Quote:
Why is the 2nd line of merged file as follows?
Maybe it is a typo because the row became too long to see, and was wrapped. Anyway, there are five fields:
ID Sequence Part_ID gene_symbol Description.
A common bioinformatics problem, joining two tables :P I wonder why you posted your question to the "Web Development" section but it happens to be the only forum I subscribe to
Some comments about your code:
-w has been superseded by use warnings.
The chomp() function removes return carriages and newlines.
The strategy I would use is to find some way to just capture the assembly information and using the assembly information to store the information on the line:
Then read file2 like you did before and get the required information from your %line2 hash.
Hi Dave:
Thanks for your comments! Actually I prefer your style too, e.g. use warnings etc. I scripted another codes as:
I used the first column $AAA[0] of file1 as key of the hash, and then compare with the first column $BBB[0] of file2. If $AAA[0] contains the string $BBB[0], it means a match, as "mira_" is not the only assembly marker.
It seems running except a small bug for
which caused the warning and stopped the program. So that I have to comment the use warnings.
The code takes ~6 hours for my 2.3Ghz dual CPU + 3GB RAM (compaq machine) to run. Not sure if this could be improved for file1 has 147478 rows (15.2MB) and file2 86837 rows(7.2MB).
Actually I have another idea in my mind to reduce the work load because the iteration is 147478x86837 times. If a match is found in file1, then the matched row in file1 can be deleted so that for the next $BBB[0] in file2 does not need to search this row again. ... so that the last search is 86838 instead of 147478 loops ( when the match is in the last row, worst scenario!). The reason is each row is unique in both file. Could not figure out this by myself. Any clue is highly appreciated!
Yifang
No problems. chomp by the way operates on the default input ($_), so you can just specify chomp instead of chomp $_
The computation will take a while as you pointed out. I think it might be worth fixing up the first file so that everything is systematic e.g. having a standardised assembly notation, so you don't need to use a regular expression. Once that is fixed up, you can just use a hash to see if the key exists.
As for your second approach of deleting elements in the hash, look up the delete() function.
Hello,
This post is already here but want to do this with another way
Merge multiples files with multiples duplicates keys by filling "NULL" the void columns for anothers joinning files
file1.csv:
1|abc
1|def
2|ghi
2|jkl
3|mno
3|pqr
file2.csv:
1|123|jojo
1|NULL|bibi... (2 Replies)
Dear Ladies & Gents,
I have a requirement to delete all the log files in /var/log/test directory that are older than 10 days and their first line begin with "MSH" or "<?xml" or "FHS". I've put together the following BASH script, but it's erroring out:
for filename in $(find /var/log/test... (2 Replies)
Hi all,
I'm trying to join two .txt file tab delimitated based on a common column.
File 1
transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct
comp1000201_c0_seq1 comp1000201_c0 337 183.51 0.00 0.00 0.00 0.00
comp1000297_c0_seq1 ... (1 Reply)
Hi,
I have 20 tab delimited text files that have a common column (column 1). The files are named GSM1.txt through GSM20.txt. Each file has 3 columns (2 other columns in addition to the first common column).
I want to write a script to join the files by the first common column so that in the... (5 Replies)
Hi,
I'm dealing with an issue and losing a lot of hours figuring out how i would solve this.
I have an input file which looks like this:
('BLABLA +200-GRS','Serviço ','TarifaçãoServiço','wap.bla.us.0000000121',2985,0,55,' de conversão em escada','Dia','Domingos')
('BLABLA +200-GRR','Serviço... (6 Replies)
Hi,
I have line in input file as below:
3G_CENTRAL;INDONESIA_(M)_TELKOMSEL;SPECIAL_WORLD_GRP_7_FA_2_TELKOMSEL
My expected output for line in the file must be :
"1-Radon1-cMOC_deg"|"LDIndex"|"3G_CENTRAL|INDONESIA_(M)_TELKOMSEL"|LAST|"SPECIAL_WORLD_GRP_7_FA_2_TELKOMSEL"
Can someone... (7 Replies)
I have n files (for ex:64 files) with one similar column. Is it possible to combine them all based on that column ?
file1
ax100 20 30 40
ax200 22 33 44
file2
ax100 10 20 40
ax200 12 13 44
file2
ax100 0 0 4
ax200 2 3 4 (9 Replies)
Hi All,
I have working (Perl) code to combine 2 input files into a single output file using the join function that works to a point, but has the following limitations:
1. I am restrained to 2 input files only.
2. Only the "matched" fields are written out to the "matched" output file and... (1 Reply)