Matching 10 Million file records with 10 Million in other file


 
# 8  
Old 06-13-2012
Here is a little bash script to generate 10 million test records (for those wanting to test the performance of their solutions):

make_test.bash
Code:
for((i=1;i<30000000;i++)) {
   # Build a mostly-unique numeric id: loop counter plus three random blocks
   printf -v id "%s%04d%04d%03d" $i $((RANDOM%10000)) $((RANDOM%10000)) $((RANDOM%1000))
   # ~43% of the time (14000/32768) emit an input record on stderr
   [ $RANDOM -lt 14000 ] && echo "20.04.2012 11.08.44;RECV;APPNAME@HOSTNAME06:$id;processed;Location;contact;status;email_id;2" >&2
   # Always emit one status record on stdout, 50/50 SUCCESS or FAILURE
   [ $RANDOM -lt 16384 ] && echo "APPNAME@HOSTNAME06:$id;SUCCESS" || echo "APPNAME@HOSTNAME06:$id;FAILURE"
}

Call it like this: ./make_test.bash > Status.txt 2> Input.txt

Here is my solution:
Code:
awk -F'[:;]' '
NR==FNR { S[$2] = ($3=="SUCCESS"); next }   # Status.txt: S[id] = 1 on SUCCESS
{ if ($4 in S)
    # Replace the "status" placeholder (field 8 after the [:;] split)
    print $1";"$2";"$3":"$4";"$5";"$6";"$7";"(S[$4]?"SUCCESS":"FAILURE")";"$9";"$10
  else
    print $0                                # no status record: pass through
}' Status.txt Input.txt > Result.txt

Will update this post when I know the runtime.

Update: stack dump at record 20,127,745 (of 27,713,184) while reading Status.txt, after 1m36s of runtime.

Seems like it's just too much data for my PC. The good news is that reading the data in didn't take very long, so there is hope for larger (64-bit) servers.
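
For data sets that overflow an in-memory array like this, an external sort plus join might be worth a try. This is just a sketch of mine, not something tested in this thread; it assumes GNU sort/join and the file formats above:

Code:
export LC_ALL=C                              # consistent byte-order collation
sort -t';' -k1,1 Status.txt > Status.sorted  # key "APPNAME@HOST:id" is field 1
sort -t';' -k3,3 Input.txt  > Input.sorted   # the same key is field 3 here
# -a2 keeps input lines that have no status record. join's default output
# reorders fields, so a final awk pass would still be needed to restore
# the original Input.txt column order.
join -t';' -1 1 -2 3 -a2 Status.sorted Input.sorted > Joined.txt

Memory should stay flat regardless of record count, because sort spills to temporary files and join streams both inputs.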

Last edited by Chubler_XL; 06-13-2012 at 09:30 PM.
# 9  
Old 06-14-2012
Quote:
Originally Posted by Chubler_XL
Here is a little bash script to generate 10 million test records (for those wanting to test the performance of their solutions): [script in post #8 above]

Well, I've tried my solution using just one big status file (splitting it into many files is actually a very bad idea that would lead to extremely long run times: days at least). So I used:

Code:
#!/bin/bash
IFS=";:"                            # split on both ';' and ':'
for file in status.txt; do
   declare -a status
   # Load every status record: status[id] = SUCCESS or FAILURE
   while read -r x y z; do
      status[$y]=$z
   done < $file
   # Rewrite input.txt, substituting the stored status where the id matches
   while read -r a b c d e f g h i l; do
      [[ ${status[$d]} = "" ]] || h=${status[$d]}
      printf '%s;%s;%s:%s;%s;%s;%s;%s;%s;%s\n' "$a" "$b" "$c" "$d" "$e" "$f" "$g" "$h" "$i" "$l" >> output.txt
   done < input.txt
   unset status
done
exit 0

It took about 2.7GB of RAM to load the status array (about twice the file size); the files were on a slow green disk and the CPU was clocked at 1.6GHz.
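
An aside (mine, not from the original post): the indexed array only works here because the generated ids are pure digits. For arbitrary string keys you would need a bash 4+ associative array instead:

Code:
declare -A status                   # bash 4+: hash keyed by arbitrary strings
while IFS=';:' read -r app id st; do
   status[$id]=$st                  # no longer requires $id to be numeric
done < status.txt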

The script apparently works, but it's slow: 45 minutes.
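
One cheap change that might claw back some of that time (my suggestion, untested on this data set): redirect the whole inner loop once instead of re-opening output.txt for every record:

Code:
while read -r a b c d e f g h i l; do
   [[ ${status[$d]} = "" ]] || h=${status[$d]}
   printf '%s;%s;%s:%s;%s;%s;%s;%s;%s;%s\n' "$a" "$b" "$c" "$d" "$e" "$f" "$g" "$h" "$i" "$l"
done < input.txt >> output.txt      # one open/close instead of one per record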