Hi, this is about sorting a very large file (about 10 GB) to keep only the lines that are unique across SOME of the columns.
The line originally looked like this:
Please note the -u flag.
The problem is that this single command is taking more than 12 hours with 10 GB of memory, and I am looking for a way to speed things up.
I have heard that splitting a large file into subfiles, sorting each subfile, and then merging them back together with the sort command can work, but I imagine this will not work because I'm using the -u flag to keep only unique rows.
Maybe split and then sort each subfile first WITHOUT the -u flag, and then run sort -m -u -k2,2 -k3,3n -k4,4n -k5,5n -k6,6n subfiles?
I'm thinking of ways to test that, but please let me know if you have any ideas for me!
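One way to test the idea on a small scale is sketched below. It compares a direct sort -u against split, per-chunk sort, and sort -m -u, using the key options from the merge command above. The sample rows, the file names, and the assumption that fields are separated by single spaces are all made up for illustration; the real file may differ.

```shell
#!/bin/sh
# Sketch: does split + per-chunk sort + `sort -m -u` give the same unique keys
# as one big `sort -u`? Assumptions: space-separated fields, keys in columns
# 2-6, made-up sample data and file names.
set -eu
export LC_ALL=C            # byte-wise comparison, usually faster than UTF-8 collation

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Made-up sample: column 1 always differs, some rows share key columns 2-6.
cat > "$work/sample.txt" <<'EOF'
a chr1 10 20 1 5
b chr2 5 9 2 7
c chr1 10 20 1 5
d chr1 2 20 1 5
e chr2 5 9 2 7
f chr3 1 1 1 1
EOF

keys='-k2,2 -k3,3n -k4,4n -k5,5n -k6,6n'

# 1) Reference run: a single sort with -u ($keys is left unquoted on purpose
#    so that it splits into separate options).
sort -u $keys "$work/sample.txt" > "$work/direct.txt"

# 2) Split into 2-line chunks, sort each chunk WITHOUT -u, then merge WITH -u.
split -l 2 "$work/sample.txt" "$work/chunk."
for f in "$work"/chunk.??; do
    sort $keys "$f" > "$f.sorted"
done
sort -m -u $keys "$work"/chunk.??.sorted > "$work/merged.txt"

# Which line of a duplicate group survives may differ between the two runs,
# so only the key columns are compared.
direct_keys=$(cut -d' ' -f2-6 "$work/direct.txt")
merged_keys=$(cut -d' ' -f2-6 "$work/merged.txt")

if [ "$direct_keys" = "$merged_keys" ]; then
    echo "same unique keys"
else
    echo "key sets differ"
fi
```

On this sample both approaches leave the same four keys. If the sort in use is GNU sort, the -S (buffer size), -T (temporary directory) and --parallel options may also be worth trying on the full file.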
Thank you kindly.
Jonathan
Last edited by radoulov; 08-31-2011 at 06:01 PM..
Reason: Please use code tags for code and data samples, thank you
Hi,
May I know, if a pipe-separated file is large, what is the best method to calculate the unique row count of the 3rd column and get a list of the unique values of the 3rd column?
Thanks in advance! (20 Replies)
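A minimal sketch of one common approach, assuming '|' is the only field separator and never appears inside a value. The sample file data.psv and its contents are made up.

```shell
#!/bin/sh
# Sketch: unique values (and their count) in column 3 of a pipe-separated file.
# Assumption: data.psv is a made-up sample; '|' never occurs inside a field.
set -eu
export LC_ALL=C

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf '%s\n' '1|x|red' '2|y|blue' '3|z|red' '4|w|green' > "$tmp/data.psv"

# Extract column 3, then sort and drop duplicates.
unique_values=$(cut -d'|' -f3 "$tmp/data.psv" | sort -u)

# Count the unique values (tr strips the padding some wc versions add).
unique_count=$(printf '%s\n' "$unique_values" | wc -l | tr -d ' ')

printf '%s\n' "$unique_values"
echo "unique count: $unique_count"
```

Cutting the column out before sorting means sort only has to handle the third field rather than whole rows, which may matter for a large file.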
This may sound like a trivial problem, but I still need some help:
I have a file with ids and I want to split it 'n' ways (could be any number) into files:
1
1
1
2
2
3
3
4
5
5
Let's assume 'n' is 3, and we cannot have the same id in two different partitions. So the partitions may... (8 Replies)
Input file
---------
12:name1:|host1|host1|host2|host1
13:name2:|host1|host1|host2|host3
14:name3:
......
Required output
---------------
12:name1:host1(3)|host2(1)
13:name2:host1(2)|host2(1)|host3(1)
14:name3:
where (x) is the number of times the host appears in the last column
... (3 Replies)
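A possible awk sketch for this kind of per-row counting, assuming ':' separates the three main fields and '|' separates hosts inside the last one. The variable names are made up. For the first sample row it prints host1(3)|host2(1), since host1 occurs three times in that row.

```shell
#!/bin/sh
# Sketch: per row, count how often each host occurs in the last ':' field.
# Assumptions: ':' separates the main fields, '|' separates hosts, and hosts
# are reported in order of first appearance.
set -eu

input=$(printf '%s\n' \
    '12:name1:|host1|host1|host2|host1' \
    '13:name2:|host1|host1|host2|host3' \
    '14:name3:')

host_counts=$(printf '%s\n' "$input" | awk -F: '{
    n = split($3, hosts, "|")        # n is 0 when the last field is empty
    split("", count)                 # portable way to clear the count array
    m = 0
    for (i = 1; i <= n; i++) {
        h = hosts[i]
        if (h == "") continue        # skip the empty piece before the first "|"
        if (!(h in count)) order[++m] = h
        count[h]++
    }
    out = ""
    for (j = 1; j <= m; j++)
        out = out (j > 1 ? "|" : "") order[j] "(" count[order[j]] ")"
    print $1 ":" $2 ":" out
}')

printf '%s\n' "$host_counts"
```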
Hi. I am not sure the title gives an optimal description of what I want to do.
I have several text files that contain data in many columns. All the files are organized the same way, but the data in the columns might differ. I want to count the number of times data occur in specific columns,... (0 Replies)
I would like to print unique lines without sort or uniq. Unfortunately, the server I am working on has neither sort nor uniq, and for several weeks I have not been able to contact the server's administrator to ask for them to be added. (7 Replies)
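If awk happens to be available on that server (an assumption, since the post only says what is missing), a widely used idiom prints each line the first time it is seen and keeps the original order. A minimal sketch with made-up sample lines:

```shell
#!/bin/sh
# Sketch: unique lines without sort or uniq.
# Assumption: awk is installed even though sort and uniq are not.
set -eu

input=$(printf '%s\n' apple banana apple cherry banana)

# seen[$0]++ evaluates to 0 the first time a line appears, so !seen[$0]++ is
# true only for first occurrences; awk then prints the line by default.
unique_lines=$(printf '%s\n' "$input" | awk '!seen[$0]++')

printf '%s\n' "$unique_lines"
```

The array holds every distinct line in memory, so this suits inputs whose set of unique lines fits in RAM.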
Hi,
I have an input file that I sorted in a previous stage by $1 and $4. I now need something that will take the first record from each group of data, where the key is $1.
Input file
1000AAA|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
1000AAA|"ZZZ"|"Date"|"2"|"Y"|"ABC"|""|AA... (2 Replies)
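Because the file is already sorted on $1, one way to keep the first record of each group is to print a line only when its key differs from the previous line's key. A sketch, assuming '|' is the field separator; the two 2000BBB rows are made up to give the sample a second group.

```shell
#!/bin/sh
# Sketch: first record of each $1 group in a '|'-separated, pre-sorted file.
# Assumption: rows with the same $1 are adjacent; 2000BBB rows are invented.
set -eu

input=$(printf '%s\n' \
    '1000AAA|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA' \
    '1000AAA|"ZZZ"|"Date"|"2"|"Y"|"ABC"|""|AA' \
    '2000BBB|"YYY"|"Date"|"1"|"N"|"DEF"|""|BB' \
    '2000BBB|"YYY"|"Date"|"3"|"N"|"DEF"|""|BB')

# Print the very first line, and any line whose key differs from the one
# before it; then remember the current key for the next comparison.
first_per_key=$(printf '%s\n' "$input" |
    awk -F'|' 'NR == 1 || $1 != prev { print } { prev = $1 }')

printf '%s\n' "$first_per_key"
```

This keeps only one key in memory at a time, unlike a seen[] array, but it depends on the earlier sort having grouped equal keys together.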
Dear community, I am facing a problem and I kindly ask your help:
I have 4 different data sets consisting of 3 different types of array.
In each file, column 1 is chromosome position, column 2 is SNP id, etc. Let's say I have the following (bim) datasets:
x2014:
1 rs3094315... (4 Replies)