I want to keep every person in the data who has at least a Firstname or a Lastname (or both) and at least one row of Details. So Person4, lacking both Firstname and Lastname, will be deleted, and Person5, with no Details, will be deleted.
Firstname, Lastname and Details are fixed keywords in column 2.
Unfortunately the data is not sorted as nicely as in the example, and it is about 50 GB, so I would prefer to read it into memory only once. Also, sorting it first may be expensive?
50 GB: without ordering, you cannot reasonably expect to solve this problem.
You also cannot create in-memory arrays that are billions of items long, because you would spend years searching for matches, and hashes that large will not fit in the memory of most machines.
Here is why pre-ordering (like a radix sort) is required:
Let's assume the data for person92 appears at byte offset 2312000405 and again at 44422000444: a gap of roughly 40 GB.
That span is beyond the memory of most available desktop systems. You may counter with "that's not true", but you cannot prove it, simply because you cannot reasonably test it on such a machine. It is faster to assume you need plan B: a radix sort that creates lots of smaller files, each containing all the data items for the persons in it.
You need to do what amounts to a radix sort. I am assuming that the first column is a unique identifier, probably a number; this example uses a number. Let's assume the numbers range from 1 billion to 15 billion: simply write every record whose number falls between 12 billion and 13 billion to a file named "12", and so on for the other ranges.
This will take a long time. I'm using filenames of one, two, three, four, five, six ... fifteen, AND a lot of disk space: make sure the destination file system has 50+ GB free before starting.
Note: awk does double-precision arithmetic, so integer ids up to about 2^53 are represented exactly and this bucketing will work.
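That bucketing pass can be sketched in a few lines of awk. This is illustrative only: it assumes a purely numeric id in column 1 and whitespace-separated fields, and the file names (`huge.txt`, `N.bucket`) are made up:

```shell
# Sketch: each record lands in a file named after its "billions" bucket.
# Assumes column 1 is a numeric id; sample data is illustrative.
cat > huge.txt <<'EOF'
2312000405 Firstname John
12345678901 Details row1
EOF

awk '{ print > (int($1 / 1000000000) ".bucket") }' huge.txt
```

With only 15 buckets awk stays well below its open-file limit; if you needed many more buckets you would have to close() files as you go.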
You now have 15 files that are much more amenable to sorting and further processing. Next time, consider using a database rather than a flat file for this kind of thing.
If you have lots of time, disk space, and free tmp disk space, you could actually sort the whole file. It will take the better part of a day on a really fast desktop with SATA drives. If you can possibly avoid it, do not run much else while the sort is going, or it will take even longer.
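An illustrative GNU sort invocation for a job like that; the `-T` directory and the `-S` buffer size are placeholders you would point at a disk with enough free space and tune to your machine's RAM:

```shell
# Illustrative only: -T redirects sort's temporary files,
# -S caps the in-memory buffer (both GNU sort options).
# Sample data stands in for the real 50 GB file.
cat > huge.txt <<'EOF'
Person2 Firstname Bob
Person1 Lastname Doe
EOF

mkdir -p ./sort_tmp
sort -k1,1 -S 64M -T ./sort_tmp huge.txt > sorted.txt
```

On the real file you would use something like `-S 4G` and a `-T` directory on a drive with 50+ GB free.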
Now simple awk code will remove the problem data: because the file is sorted, you can gather all of person1's data in a few consecutive reads, park it in a tmp variable, and write it to a new file only if it is good.
Last edited by jim mcnamara; 12-29-2014 at 11:42 PM..
Thank you for the detailed explanation. (Fortunately) I had compiled the dataset from 42 files, so I can run the awk filtering in a loop? I am sorting the datasets as you mentioned; the first column has alphanumeric values, so I am using the -k1,1 option to prepare the data for filtering. Once that is done, I will need your guidance on the simple awk script that you mentioned.
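Since the 42 source pieces exist anyway, one possible approach (a sketch; the `part*.txt` file names are made up) is to sort each piece separately and then combine the already-sorted pieces in a single pass with `sort -m`, which merges without re-sorting:

```shell
# Sketch: sort each piece individually, then merge the sorted pieces.
# Merging pre-sorted files is much cheaper than one giant sort.
printf 'b 2\na 1\n' > part1.txt
printf 'c 3\na 0\n' > part2.txt

for f in part*.txt; do
    sort -k1,1 "$f" > "$f.sorted"
done
sort -m -k1,1 part*.txt.sorted > sorted_all.txt
```

The merged output is globally sorted on column 1, ready for the filtering pass.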
Once the input file is sorted, you could run something like:
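A minimal sketch of such a filter, assuming three whitespace-separated columns (id, keyword, value) and input already sorted on column 1; the file names are illustrative:

```shell
# Sample sorted input: Person1 qualifies (name + Details),
# Person4 has no name, Person5 has no Details.
cat > sorted.txt <<'EOF'
Person1 Firstname John
Person1 Details row1
Person4 Details row1
Person5 Firstname Eve
EOF

awk '
function flush() {
    # Emit the buffered person only if a name line AND a Details line were seen.
    if (id != "" && hasname && hasdet) printf "%s", buf
    buf = ""; hasname = hasdet = 0
}
$1 != id { flush(); id = $1 }   # a new person starts: decide about the previous one
{
    buf = buf $0 ORS            # park the current record in a variable
    if ($2 == "Firstname" || $2 == "Lastname") hasname = 1
    if ($2 == "Details")                       hasdet  = 1
}
END { flush() }                 # do not forget the last person
' sorted.txt > filtered.txt
```

Only Person1's two lines survive in `filtered.txt`; Person4 and Person5 are dropped.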
This will work happily through your file without requiring too much memory; still, it may take its time.