50GB: without ordering, you cannot reasonably expect to solve this problem.
You also cannot build in-memory arrays with billions of entries: a hash that large will not fit in the memory of most machines, and you would spend years searching it for matches.
Here is why pre-ordering (like a radix sort) is required:
Let's assume the data for person92 appears at offset 2312000405 and again at offset 44422000444, a gap of roughly 42GB. That span is beyond the memory of most desktop systems. You may counter with "that's not true", but you cannot prove it, simply because you cannot reasonably test it on your machine. It is faster to assume you need plan B: a radix sort that creates lots of smaller files, each holding all the records for a given range of keys.
You need to do what amounts to a radix sort. I am assuming the first column is a unique identifier, probably a number; this example uses a number. Let's assume the keys range from 1 billion to 15 billion: simply write every record whose key falls between 12 billion plus one and 13 billion to a file named twelve, and so on for each one-billion-wide bucket.
This will take a long time. I'm using filenames of one, two, three, four, five, six...fifteen, AND a lot of disk space: make sure the destination file system has 50+GB free before starting.
Note: awk does double-precision arithmetic, and doubles represent integers exactly up to 2^53, so keys up to 15 billion are handled without rounding.
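The bucketing pass can be sketched like this. The input filename bigfile.txt and the numeric bucket suffixes (instead of the spelled-out names above) are my own choices for brevity, and it assumes the key is the first whitespace-separated field:

```shell
# One pass over the big file; each record lands in the bucket
# covering its one-billion-wide key range (bucket_1 .. bucket_15).
awk '{
    bucket = int(($1 - 1) / 1000000000)
    print > ("bucket_" bucket)
}' bigfile.txt
```

Redirecting print to a computed filename keeps the file open across records, so awk only ever holds 15 file handles, not billions of array entries.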
You now have 15 files that are much more amenable to sorting and comparing. Next time, consider using a database rather than a flat file for this kind of thing.
If you have lots of time, disk space, and free tmp disk space, you could actually sort the whole file instead. It will take the better part of a day even on a really fast desktop with SATA drives. If you can possibly avoid it, do not run lots of other stuff while the sort is going, or it will take even longer.
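For the full-file sort, something like the following sort invocation would do it; the filenames and tmp directory are placeholders, and it assumes a numeric key in the first field:

```shell
# -n -k1,1 sorts numerically on the first field; -T points sort's
# temporary spill files at a filesystem with plenty of free space.
sort -n -k1,1 -T /some/big/tmp -o bigfile.sorted bigfile.txt
```

The -T option matters here: an external merge sort of a 50GB file writes roughly that much temporary data, so the tmp area needs as much free space as the input.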
Now simple awk code will remove the problem data, because all of the person1 records are adjacent: you can compare them with a few reads, park a record in a tmp variable, and write it to a new file only if it is good.
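As a sketch of that last step: the post doesn't say what makes a record "good", so this example just keeps the last record seen for each key, assuming sorted input with the key in the first field (filenames are placeholders):

```shell
# On sorted input, all records for a key are adjacent, so one
# held-back line ("saved") is enough to emit one record per key.
awk '$1 != prev { if (NR > 1) print saved }
     { prev = $1; saved = $0 }
     END { print saved }' bigfile.sorted > cleaned.txt
```

Swap in whatever comparison defines "good" for your data; the park-and-compare pattern stays the same.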
Last edited by jim mcnamara; 12-29-2014 at 11:42 PM..