I want to keep every person in the data who has at least a Firstname or a Lastname (or both) and at least one row of Details. So Person4, lacking both Firstname and Lastname, will be deleted, and Person5, with no Details, will be deleted.
Firstname, Lastname and Details are fixed keywords in column 2.
Unfortunately the data is not sorted as nicely as in the example, and it is about 50 GB, so I would prefer to read it into memory only once. Also, sorting it first may be expensive?
50 GB: without ordering, you cannot reasonably expect to solve this problem.
You also cannot create in-memory arrays that are billions of items long, because you would spend years searching for matches, and hashes that large will not fit in the memory of most machines.
Here is why pre-ordering (like a radix sort) is required:
Let's assume the data for person92 appears at byte offset 2312000405 and again at 44422000444: a gap of roughly 40 GB.
That span is beyond the memory of most available desktop systems. You may counter with "that's not true", but you cannot prove it, simply because you cannot reasonably test it on such a machine. It is faster to assume you need plan B: a radix sort that creates lots of smaller files, each containing all the data items for the persons in it.
You need to do what amounts to a radix sort. I am assuming that the first column is a unique identifier, probably a number; this example uses a number. Let's assume the numbers range from 1 billion to 15 billion: simply write every record whose number falls between 12 billion and 13 billion to a file named "12", and so on for the other ranges.
This will take a long time. I'm using filenames of one, two, three, four, five, six ... fifteen, AND a lot of disk space: make sure the destination file system has 50+ GB free before starting.
Note: awk does double-precision arithmetic, so integer ids up to about 2^53 are represented exactly and this bucketing will work.
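That bucketing pass can be sketched in a few lines of awk. This is illustrative only: it assumes a purely numeric id in column 1 and whitespace-separated fields, and the file names (`huge.txt`, `N.bucket`) are made up:

```shell
# Sketch: each record lands in a file named after its "billions" bucket.
# Assumes column 1 is a numeric id; sample data is illustrative.
cat > huge.txt <<'EOF'
2312000405 Firstname John
12345678901 Details row1
EOF

awk '{ print > (int($1 / 1000000000) ".bucket") }' huge.txt
```

With only 15 buckets awk stays well below its open-file limit; if you needed many more buckets you would have to close() files as you go.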
You now have 15 files that are much more amenable to sorting and further processing. Next time, consider using a database rather than a flat file for this kind of thing.
If you have lots of time, disk space, and free tmp disk space, you could actually sort the whole file. It will take the better part of a day on a really fast desktop with SATA drives. If you can possibly avoid it, do not run much else while the sort is going, or it will take even longer.
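An illustrative GNU sort invocation for a job like that; the `-T` directory and the `-S` buffer size are placeholders you would point at a disk with enough free space and tune to your machine's RAM:

```shell
# Illustrative only: -T redirects sort's temporary files,
# -S caps the in-memory buffer (both GNU sort options).
# Sample data stands in for the real 50 GB file.
cat > huge.txt <<'EOF'
Person2 Firstname Bob
Person1 Lastname Doe
EOF

mkdir -p ./sort_tmp
sort -k1,1 -S 64M -T ./sort_tmp huge.txt > sorted.txt
```

On the real file you would use something like `-S 4G` and a `-T` directory on a drive with 50+ GB free.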
Now simple awk code will remove the problem data: because the file is sorted, you can gather all of person1's data in a few consecutive reads, park it in a tmp variable, and write it to a new file only if it is good.
Last edited by jim mcnamara; 12-29-2014 at 11:42 PM..
Thank you for the detailed explanation. (Fortunately) I had compiled the dataset from 42 files, so I can run the awk filtering in a loop? I am sorting the datasets as you mentioned; the first column has alphanumeric values, so I am using the -k1,1 option to prepare the data for filtering. Once that is done, I will need your guidance on the simple awk script that you mentioned.
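Since the 42 source pieces exist anyway, one possible approach (a sketch; the `part*.txt` file names are made up) is to sort each piece separately and then combine the already-sorted pieces in a single pass with `sort -m`, which merges without re-sorting:

```shell
# Sketch: sort each piece individually, then merge the sorted pieces.
# Merging pre-sorted files is much cheaper than one giant sort.
printf 'b 2\na 1\n' > part1.txt
printf 'c 3\na 0\n' > part2.txt

for f in part*.txt; do
    sort -k1,1 "$f" > "$f.sorted"
done
sort -m -k1,1 part*.txt.sorted > sorted_all.txt
```

The merged output is globally sorted on column 1, ready for the filtering pass.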
Once the input file is sorted, you could run something like:
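A minimal sketch of such a filter, assuming three whitespace-separated columns (id, keyword, value) and input already sorted on column 1; the file names are illustrative:

```shell
# Sample sorted input: Person1 qualifies (name + Details),
# Person4 has no name, Person5 has no Details.
cat > sorted.txt <<'EOF'
Person1 Firstname John
Person1 Details row1
Person4 Details row1
Person5 Firstname Eve
EOF

awk '
function flush() {
    # Emit the buffered person only if a name line AND a Details line were seen.
    if (id != "" && hasname && hasdet) printf "%s", buf
    buf = ""; hasname = hasdet = 0
}
$1 != id { flush(); id = $1 }   # a new person starts: decide about the previous one
{
    buf = buf $0 ORS            # park the current record in a variable
    if ($2 == "Firstname" || $2 == "Lastname") hasname = 1
    if ($2 == "Details")                       hasdet  = 1
}
END { flush() }                 # do not forget the last person
' sorted.txt > filtered.txt
```

Only Person1's two lines survive in `filtered.txt`; Person4 and Person5 are dropped.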
This will work happily through your file without requiring too much memory; still, it may take its time.