Full Discussion: Delete incomplete data
Post 302930046 by jim mcnamara on Monday 29th of December 2014 10:36:35 PM
With 50GB of data, you cannot reasonably expect to solve the problem without ordering the data first.
You also cannot create arrays that are billions of items long, because you would spend years searching them for matches. Hashes that large will not fit in the memory of most machines.

Here is why pre-ordering (like a radix sort) is required:
Let's assume the data for person92 appears at offset position 2312000405 and again at 44422000444. That is a gap of roughly 42GB.

That span is beyond the memory of most available desktop systems. You may counter with "that's not true", but you cannot prove it, simply because you cannot reasonably test it on your computer. It is faster to assume you need plan B: a radix sort that creates lots of smaller files, each of which holds all of the data items for the people that land in it.

You need to do what amounts to a radix sort. I am assuming that the first column is a unique identifier, probably a number; this example uses a number. Let's assume the numbers range from 1 billion to 15 billion: simply write every line whose number is at least 12 billion and below 13 billion to a file named twelve, and do the same for every other billion.

This will take a long time AND a lot of disk space: make sure the destination file system has 50+GB free before starting.
I'm using filenames of one, two, three, four, five, six...fifteen.

Code:
awk 'BEGIN { arr[1]= "one"
             arr[2]= "two"
             arr[3]= "three"
             arr[4]= "four"
             arr[5]= "five"
             arr[6]= "six"
             arr[7]= "seven"
             arr[8]= "eight"
             arr[9]= "nine"
             arr[10]="ten"
             arr[11]="eleven"
             arr[12]="twelve"
             arr[13]="thirteen"
             arr[14]="fourteen"
             arr[15]="fifteen" }
      # bucket = number of whole billions in column 1, e.g. 12345678901 goes to "twelve"
      { print $0 > (arr[int($1/1000000000)]) }' infile

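Note: awk does double precision arithmetic, which represents integers exactly up to 2^53 (about 9 quadrillion), so this will work for numbers in the 15 billion range.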
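
If you want to convince yourself of the bucket arithmetic before committing to a pass over 50GB, a quick check like the one below may help. This is only a sketch: bucket_of is a hypothetical helper name, and it assumes the divisor is one billion, matching the 1 billion to 15 billion range described above.

```shell
# Hypothetical helper: print the bucket number (whole billions) for one ID.
# Assumes IDs are plain non-negative integers, as in the example above.
bucket_of() {
    awk -v id="$1" 'BEGIN { print int(id / 1000000000) }'
}

bucket_of 12000000001   # prints 12
bucket_of 12999999999   # prints 12
bucket_of 15000000000   # prints 15
```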

You now have 15 files that are much more amenable to sorting, etc. Next time, consider using a database rather than a flat file for this kind of thing.
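
One way to finish the job from the bucket files is to sort each one on its own and append the results in bucket order, which gives a fully sorted file without ever sorting 50GB in one go. A minimal sketch, assuming comma-separated data with the number in the first field and the bucket files in the current directory; sort_buckets is a hypothetical helper name, not something from the original thread.

```shell
# Hypothetical helper: sort_buckets OUTFILE
# Sorts each bucket file numerically on field 1 and appends it to OUTFILE.
# Buckets are visited in numeric order, so OUTFILE ends up fully sorted.
sort_buckets() {
    out=$1
    : > "$out"                        # start with an empty output file
    for f in one two three four five six seven eight nine ten \
             eleven twelve thirteen fourteen fifteen; do
        [ -f "$f" ] || continue       # a bucket with no rows is never created
        sort -t, -k1,1n "$f" >> "$out"
    done
}
```

Run it as, for example, sort_buckets all.sorted from the directory holding the bucket files. Each individual sort only needs temp space for one bucket, not for the whole data set.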

If you have lots of time, disk space, and free tmp disk space, you could instead sort the whole file directly. It will take the better part of a day on a really fast desktop with SATA drives. If you can possibly avoid it, do not run lots of other stuff while this process is going, or it will take even longer.

Code:
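# sort writes its temporary files under TMPDIR
export TMPDIR=/path/to/huge/filesystem/with/lots/of/free/space
export DEST=/path/bigfree/disk/newfile
# -t, : fields are comma separated; -k1,1n : numeric key limited to the first field
sort -t, -k1,1n infile > "$DEST"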

Now simple awk code will remove the problem data, because all of the person1 data is adjacent in the sorted file: you can pick it all up with a few reads, park it in a tmp variable, compare it, and write it to a new file if it is good.
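
A rough sketch of that last step follows. What counts as "good" depends on your data, so the rule here is purely an assumption for illustration: a person is complete only when exactly `need` rows share the same first field (default 3). filter_complete is a hypothetical name, and the input must already be sorted on the first comma-separated field.

```shell
# Hypothetical helper: filter_complete SORTED_FILE [NEED]
# ASSUMPTION: a person is "complete" when exactly NEED rows share the same
# first field. Replace the test in emit() with your real completeness rule.
filter_complete() {
    awk -F, -v need="${2:-3}" '
        # write out the parked rows only if the group passed the check
        function emit() { if (n == need) printf "%s", buf }

        # first field changed: decide on the previous person, then reset
        NR > 1 && $1 != prev { emit(); buf = ""; n = 0 }

        # park the current row in the tmp buffer
        { buf = buf $0 "\n"; n++; prev = $1 }

        # do not forget the last person in the file
        END { if (NR > 0) emit() }
    ' "$1"
}
```

Usage would look like filter_complete /path/bigfree/disk/newfile 3 > cleanfile. Only one person's rows are held in memory at a time, which is the whole point of sorting first.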

Last edited by jim mcnamara; 12-29-2014 at 11:42 PM..