Delete incomplete data


 
# 1  
Old 12-29-2014

Hi all,

Please help with the following example.

I want to keep every person in the data who has at least a Firstname or a Lastname (or both) and at least one row of Details. So Person4, which lacks both Firstname and Lastname, will be deleted, and Person5, which has no Details, will be deleted.
Firstname, Lastname and Details are fixed keywords in column 2.

Unfortunately the real data is not sorted as nicely as in the example, and it is about 50 GB, so I would prefer to read it into memory only once. Also, wouldn't sorting it first be expensive?

Input
Code:
Person1,Firstname,x1
Person1,Lastname,x2
Person1,Details,x3
Person1,Details,x4
Person2,Firstname,x5
Person2,Details,x6
Person2,Details,x7
Person2,Details,x8
Person4,Details,x9
Person4,Details,x11
Person4,Details,x12
Person3,Details,x9
Person3,Lastname,x10
Person3,Details,x11
Person3,Details,x12
Person5,Firstname,x15
Person5,Lastname,x26
Person6,Firstname,x5
Person6,Details,x6


Output

Code:
Person1,Firstname,x1
Person1,Lastname,x2
Person1,Details,x3
Person1,Details,x4
Person2,Firstname,x5
Person2,Details,x6
Person2,Details,x7
Person2,Details,x8
Person3,Details,x9
Person3,Lastname,x10
Person3,Details,x11
Person3,Details,x12
Person6,Firstname,x5
Person6,Details,x6

# 2  
Old 12-29-2014
50 GB: without ordering the data, you cannot reasonably expect to solve this problem.
You also cannot create arrays that are billions of items long, because you would spend years searching for matches, and hashes that large will not fit in the memory of most machines.

Here is why pre-ordering (something like a radix sort) is required:
Let's assume the data for person92 appears at offset 2312000405 and again at offset 44422000444, a gap of roughly 40 GB.

A span like that is beyond the memory of most available desktop systems. You may counter with "that's not true", but you cannot easily prove it, simply because you cannot reasonably test it on the machine. It is faster to assume you need plan B: a radix-style pass that creates lots of smaller files, each holding all of the data items for its range of persons.

You need to do what amounts to a radix sort. I am assuming that the first column is a unique identifier, probably a number; this example uses a number. Let's assume the numbers range from 1 billion to 15 billion: any record whose key falls between 12 billion and 13 billion simply gets written to the file named twelve, and so on.

This will take a long time.
I'm using file names of one, two, three, four, five, six ... fifteen, and a lot of disk space: make sure the destination file system has 50+ GB free before starting.

Code:
awk -F',' 'BEGIN { arr[1]= "one"
             arr[2]= "two"
             arr[3]= "three"
             arr[4]= "four"
             arr[5]= "five"
             arr[6]= "six"
             arr[7]= "seven"
             arr[8]= "eight"
             arr[9]= "nine"
             arr[10]="ten"
             arr[11]="eleven"
             arr[12]="twelve"
             arr[13]="thirteen"
             arr[14]="fourteen"
             arr[15]="fifteen"
           }
           # input is comma separated; bucket each record by its key (column 1) in billions
           { print $0 > arr[ int($1/1000000000) ] }' infile

Note: awk does its arithmetic in double precision, so keys in the billions are represented exactly and this will work.
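
If you want to convince yourself of the bucketing arithmetic before turning it loose on 50 GB, a quick check on a single made-up key (the value below is just an example) could look like this:

Code:
# hypothetical sample key, only to illustrate the bucketing arithmetic
echo "12345678901,Firstname,x1" | awk -F, '{ print int($1/1000000000) }'

That prints 12, i.e. the record would land in the file named twelve.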

You now have 15 files that are much more amenable to sorting, etc. Next time, consider using a database rather than a flat file for this kind of thing.
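
Once the buckets exist, each one is small enough to be sorted on its own; a minimal sketch, assuming the bucket files named above sit in the current directory:

Code:
# sort each bucket on column 1 (use -k1,1n instead if the keys are purely numeric)
for f in one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen
do
    sort -t, -k1,1 "$f" > "$f.sorted"
done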

If you have lots of time, disk space, and free tmp disk space, you could actually sort the whole file. It will take the better part of a day on a really fast desktop with SATA drives. If you can possibly avoid it, do not run lots of other work while this is going, or it will take even longer.

Code:
export TMPDIR=/path/to/huge/filesystem/with/lots/of/free/space
export DEST=/path/bigfree/disk/newfile
sort -t, -k1,1n infile > "$DEST"

Now simple awk code will remove the problem data: because the file is sorted, all of person1's rows arrive together, so you can park them in a temporary variable with a few reads and write them to a new file only if the group is good.

# 3  
Old 12-30-2014
Thank you for the detailed explanation. Fortunately, I had compiled the dataset from 42 files, so I can run the awk filtering in a loop? I am sorting the datasets as you mentioned; the first column has alphanumeric values, so I am using the -k1,1 option to prepare the data for filtering. Once that is done, I will need your guidance on the simple awk script that you mentioned.
# 4  
Old 12-30-2014
Once the input file is sorted, you could run something like
Code:
awk     'NR==1          {P=$1}                                                 # remember the first key
         P != $1        {if (N && D) print O; O=DL=""; N=0; D=0; P=$1}         # new person: print previous group if it had a name and details, then reset
                        {O=O DL $0; DL="\n"}                                   # collect the current line into the group buffer
         $2 ~ /name$/   {N=1}                                                  # group has a Firstname or Lastname
         $2 ~ /^Deta/   {D=1}                                                  # group has at least one Details row
         END            {if (N && D) print O}                                  # print the last group if complete
        ' FS="," file
Person1,Firstname,x1
Person1,Lastname,x2
Person1,Details,x3
Person1,Details,x4
Person2,Firstname,x5
Person2,Details,x6
Person2,Details,x7
Person2,Details,x8
Person3,Details,x9
Person3,Lastname,x10
Person3,Details,x11
Person3,Details,x12
Person6,Firstname,x5
Person6,Details,x6

This will work happily through your file without requiring much memory, though it may still take its time.
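
For the 42 source files mentioned earlier in this thread, one possible way to wire this together (a sketch only; filter.awk and chunk*.csv are placeholder names, not anything given in this thread) is to let sort merge all of the pieces and pipe the result straight into the script, so it does not matter if one person's rows are spread across several of the original files:

Code:
# filter.awk holds the awk program above (without the surrounding
# quotes and without the trailing FS= and file operands)
export TMPDIR=/path/to/huge/filesystem/with/lots/of/free/space
sort -t, -k1,1 chunk*.csv | awk -F, -f filter.awk > filtered.csv

If each chunk is guaranteed to contain all of a given person's rows, you could instead sort and filter the chunks one at a time in a shell loop.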