comparing Huge Files - Performance is very bad
# 1  
Old 10-10-2006
comparing Huge Files - Performance is very bad

Hi All,

Can you please help me resolve the following problem?

My requirement is as follows:

1) I have two files, YESTERDAY_FILE and TODAY_FILE. Each one has nearly two million records.
2) I need to check each record of TODAY_FILE against YESTERDAY_FILE. If it exists there, I can skip it as a duplicate.
3) If it does not exist, I need to check the primary key fields (say, the first 3 fields). If a key match is found, I then check the data part (say, the 5th and 6th fields); if the data part does not match, I need to write the record to a new file, say OUTPUT_FILE, with a prefix of 'C' (Change). If the data part matches, I skip the record.
4) If no primary key match is found, I need to write the record with a prefix of 'A' (Append).
5) After the above process, I need to check each record of YESTERDAY_FILE against TODAY_FILE; if it does not exist there, I need to write the record with a prefix of 'D' (Delete).

I developed the following logic, but it is taking far too long to execute: it produces only about 100 records per minute. Can anyone please help me out?

My code is:
Code:
while read record
do
        primary_key_fields=`echo "$record" | cut -d "|" -f 1-${fields_to_compare}`
        data_part=`echo "$record" | cut -d "|" -f ${fields_to_skip}-`

        # skip records that exist verbatim in yesterday's file
        flag=`grep "${record}" ${yesterday_file_name}`
        if [ -z "${flag}" ]; then
                current_record=""
                flag_tmp=`grep "${primary_key_fields}" $yesterday_file_name`
                yesterday_data_part=`echo "${flag_tmp}" | cut -d "|" -f ${fields_to_skip}-`
                if [ -z "${flag_tmp}" ] ; then
                        current_record="A|${record}"
                elif [ "${yesterday_data_part}" != "${data_part}" ] ; then
                        current_record="C|"`echo "${record}" | sed "s/|I|/|U|/g"`
                fi
                # write only when the record is new or changed
                if [ -n "${current_record}" ]; then
                        echo "${current_record}" >> $delta_file_name
                fi
        fi
done < $file_name

while read record
do
        primary_key_fields=`echo "$record" | cut -d "|" -f 1-${fields_to_compare}`
        flag=`fgrep "${primary_key_fields}" ${file_name}`
        if [ -z "${flag}" ]; then
                # yesterday's key no longer exists today -> Delete
                current_record="D|"`echo "${record}" | sed "s/|I|/|D|/g"`
                echo "${current_record}" >> $delta_file_name
        fi
done < ${yesterday_file_name}

fields_to_compare, fields_to_skip and file_name are parameters passed to the script. In this case:

fields_to_compare=1 (primary key field values: aaa, bbb, etc.)
fields_to_skip=3 (everything from the 4th field on is treated as the data part)
file_name=today_file

My input is:

Yesterday's file (yester_file)

aaa|xxxxxxxxxxxxxxxxxxxxxxxxx|I|mmmmmmmmm
bbb|xxxxxxxxxxxxxxxxxxxxxxxxx|I|nnnnnnnnnnnnnn
ccc|xxxxxxxxxxxxxxxxxxxxxxxxx|I|bbbbbbbbbbbbbbb

Today's file (today_file)

aaa|xxxxxxxxxxxxxxxxxxxxxxxxx|I|vvvvvvvvvvvvvvvvvvv
bbb|xxxxxxxxxxxxxxxxxxxxxxxxx|I|kkkkkkkkkkkkkkkk
ddd|xxxxxxxxxxxxxxxxxxxxxxxxx|I|zzzzzzzzzzzzzzzzz

Output file (deltafile)

C|aaa|xxxxxxxxxxxxxxxxxxxxxxxxx|U|vvvvvvvvvvvvvvvvvvv
C|bbb|xxxxxxxxxxxxxxxxxxxxxxxxx|U|kkkkkkkkkkkkkkkk
A|ddd|xxxxxxxxxxxxxxxxxxxxxxxxx|I|zzzzzzzzzzzzzzzzz
D|ccc|xxxxxxxxxxxxxxxxxxxxxxxxx|D|bbbbbbbbbbbbbbb

# 2  
Old 10-10-2006
Code tags for code, please. That script is unreadable.

Like {code} stuff {/code} except with [ ] instead of { }
# 3  
Old 10-10-2006
I've noticed that for simple things, I can shell script something and it works fine. When things start getting complicated or there's a performance issue, I'll break out perl (or python if you like).

I'd take what you have and see if it could be done better in perl. I'm sure it'd be a lot faster and probably easier to write.

Carl
# 4  
Old 10-10-2006
I won't even try to interpret your code, because it is too difficult to read without code tags, but with that in mind, have you considered using comm, diff, cmp, etc.? Unix apps such as these were designed for problems like this. Why reinvent the wheel?
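
For example, something along these lines (a rough, untested sketch using the file names from the first post; note that comm requires both inputs to be sorted):
Code:
# sort both files once up front
sort today_file  > today.sorted
sort yester_file > yester.sorted

# records that appear only in today's file (candidates for A or C)
comm -23 today.sorted yester.sorted > only_today

# records that appear only in yesterday's file (candidates for D)
comm -13 today.sorted yester.sorted > only_yester

You would still need a pass over the two "only" files to split the A records from the C records and to fix the flag field, but sorting once and comparing whole files is vastly cheaper than running grep over a two-million-line file once per record.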

I'm also in agreement with BOFH that if raw performance is your concern, perl (or even C) might be better.

If I could read your code, I'd probably have some more insight.
# 5  
Old 10-10-2006
Look - tmarikle posted a nice little 5-line awk program that does most of what you want: print all the lines in file2 that do not exist in file1. A sort of "minus" in the result-set sense.

This is a modified version of it; change it as you want:
Code:
awk -F'|' '
    FILENAME=="file1" {
        Keys[$1,$2,$3]++            # count every key seen in file1
    }
    FILENAME=="file2" {
        if (Keys[$1,$2,$3] == 0) {  # key never appeared in file1
            print $0
        }
    }
' file1 file2 > newfile

The key is the first three fields. Add more fields or just use $0 for the whole record.
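
If you need the full A/C/D delta rather than just the "minus", the same hashing idea extends to it. Here is a rough, untested sketch, assuming (as in the sample data) pipe-delimited records where field 1 is the primary key, field 4 is the data part, and field 3 carries the I flag:
Code:
awk -F'|' -v OFS='|' '
    NR == FNR {                 # first file: yesterday
        key[$1]  = 1            # remember every key
        data[$1] = $4           # and its data part
        line[$1] = $0           # and the whole record
        next
    }
    # second file: today
    !($1 in key)   { print "A|" $0; next }      # new key -> Append
    data[$1] != $4 { $3 = "U"; print "C|" $0 }  # same key, changed data -> Change
                   { seen[$1] = 1 }             # key is still present today
    END {
        for (k in key)
            if (!(k in seen)) {                 # key gone today -> Delete
                n = split(line[k], f, "|")
                f[3] = "D"                      # flip the flag, as in the sample output
                s = f[1]
                for (i = 2; i <= n; i++) s = s "|" f[i]
                print "D|" s
            }
    }
' yester_file today_file > deltafile

Everything is held in memory and each file is read exactly once, so this avoids running grep over a two-million-line file for every single record.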
# 6  
Old 10-10-2006
Python alternative:
Code:
#!/usr/bin/python
# NOTE: assumes both files have the same number of lines and that
# matching records sit on the same line in each file
from itertools import izip

deltafile = open("delta.txt","a")
yfile = open("yester_file.txt") #open yesterday file
tfile = open("today_file.txt") #open today file

for yesterline, todayline in izip(yfile, tfile): #stream both files in step
        yesterline = yesterline.strip() #strip newline
        todayline = todayline.strip()
        y_primary, y_2nd, y_3rd, y_4th = yesterline.split("|")
        t_primary, t_2nd, t_3rd, t_4th = todayline.split("|")
        if y_primary == t_primary:
                if y_4th != t_4th:
                        print >> deltafile, "C|%s|%s|%s|%s" % (t_primary, t_2nd, "U", t_4th)
        else:
                print >> deltafile, "A|%s|%s|%s|%s" % (t_primary, t_2nd, t_3rd, t_4th)
                print >> deltafile, "D|%s|%s|%s|%s" % (y_primary, y_2nd, "D", y_4th)

deltafile.close() #close output file

Output:
/home > python test.py
C|aaa|xxxxxxxxxxxxxxxxxxxxxxxxx|U|vvvvvvvvvvvvvvvvvvv
C|bbb|xxxxxxxxxxxxxxxxxxxxxxxxx|U|kkkkkkkkkkkkkkkk
A|ddd|xxxxxxxxxxxxxxxxxxxxxxxxx|I|zzzzzzzzzzzzzzzzz
D|ccc|xxxxxxxxxxxxxxxxxxxxxxxxx|D|bbbbbbbbbbbbbbb
 