comparing Huge Files - Performance is very bad
# 1  
Old 10-10-2006
comparing Huge Files - Performance is very bad

Hi All,

Can you please help me resolve the following problem?

My requirement is as follows:

1) I have two files, YESTERDAY_FILE and TODAY_FILE. Each one has nearly two million records.
2) I need to check each record of TODAY_FILE against YESTERDAY_FILE. If it exists there, I can skip it as a duplicate.
3) If it does not exist, I need to check the primary key fields (say, the first 3 fields). If a key match is found, I then check the data part (say, the 5th and 6th fields); if the data part does not match, I need to write the record to a new file, say OUTPUT_FILE, with a prefix of 'C' (Change). If the data part matches, I skip the record.
4) If no primary key match is found, I need to write the record with a prefix of 'A' (Append).
5) After the above process, I need to check each record of YESTERDAY_FILE against TODAY_FILE; if it does not exist there, I need to write the record with a prefix of 'D' (Delete).

I developed the following logic, but it is taking far too long to execute: it produces only about 100 records per minute. Can anyone please help me out?

My code is:
Code:
while read record
do
        primary_key_fields=`echo "$record" | cut -d "|" -f 1-${fields_to_compare}`
        data_part=`echo "$record" | cut -d "|" -f ${fields_to_skip}-`

        # skip records that exist verbatim in yesterday's file
        flag=`grep "${record}" ${yesterday_file_name}`
        if [ -z "${flag}" ]; then
                current_record=""
                flag_tmp=`grep "${primary_key_fields}" $yesterday_file_name`
                yesterday_data_part=`echo "${flag_tmp}" | cut -d "|" -f ${fields_to_skip}-`
                if [ -z "${flag_tmp}" ] ; then
                        current_record="A|${record}"
                elif [ "${yesterday_data_part}" != "${data_part}" ] ; then
                        current_record="C|"`echo "${record}" | sed "s/|I|/|U|/g"`
                fi
                # write only when the record is new or changed
                if [ -n "${current_record}" ]; then
                        echo "${current_record}" >> $delta_file_name
                fi
        fi
done < $file_name

while read record
do
        primary_key_fields=`echo "$record" | cut -d "|" -f 1-${fields_to_compare}`
        flag=`fgrep "${primary_key_fields}" ${file_name}`
        if [ -z "${flag}" ]; then
                # yesterday's key no longer exists today -> Delete
                current_record="D|"`echo "${record}" | sed "s/|I|/|D|/g"`
                echo "${current_record}" >> $delta_file_name
        fi
done < ${yesterday_file_name}

fields_to_compare, fields_to_skip and file_name are parameters passed to the script. In this case:

fields_to_compare=1 (primary key field values: aaa, bbb, etc.)
fields_to_skip=3 (everything from the 4th field on is treated as the data part)
file_name=today_file

My input is:

Yesterday's file (yester_file)

aaa|xxxxxxxxxxxxxxxxxxxxxxxxx|I|mmmmmmmmm
bbb|xxxxxxxxxxxxxxxxxxxxxxxxx|I|nnnnnnnnnnnnnn
ccc|xxxxxxxxxxxxxxxxxxxxxxxxx|I|bbbbbbbbbbbbbbb

Today's file (today_file)

aaa|xxxxxxxxxxxxxxxxxxxxxxxxx|I|vvvvvvvvvvvvvvvvvvv
bbb|xxxxxxxxxxxxxxxxxxxxxxxxx|I|kkkkkkkkkkkkkkkk
ddd|xxxxxxxxxxxxxxxxxxxxxxxxx|I|zzzzzzzzzzzzzzzzz

Output file (deltafile)

C|aaa|xxxxxxxxxxxxxxxxxxxxxxxxx|U|vvvvvvvvvvvvvvvvvvv
C|bbb|xxxxxxxxxxxxxxxxxxxxxxxxx|U|kkkkkkkkkkkkkkkk
A|ddd|xxxxxxxxxxxxxxxxxxxxxxxxx|I|zzzzzzzzzzzzzzzzz
D|ccc|xxxxxxxxxxxxxxxxxxxxxxxxx|D|bbbbbbbbbbbbbbb

# 2  
Old 10-10-2006
Code tags for code, please. That script is unreadable.

Like {code} stuff {/code} except with [ ] instead of { }
# 3  
Old 10-10-2006
I've noticed that for simple things, I can shell script something and it works fine. When things start getting complicated or there's a performance issue, I'll break out perl (or python if you like).

I'd take what you have and see if it could be done better in perl. I'm sure it'd be a lot faster and probably easier to write.

Carl
# 4  
Old 10-10-2006
I won't even try to interpret your code, because it is too difficult to read without code tags, but with that in mind, have you considered using comm, diff, cmp, etc.? Unix apps such as these were designed for problems like this. Why reinvent the wheel?
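
For example, something along these lines (a rough, untested sketch using the file names from the first post; note that comm requires both inputs to be sorted):
Code:
# sort both files once up front
sort today_file  > today.sorted
sort yester_file > yester.sorted

# records that appear only in today's file (candidates for A or C)
comm -23 today.sorted yester.sorted > only_today

# records that appear only in yesterday's file (candidates for D)
comm -13 today.sorted yester.sorted > only_yester

You would still need a pass over the two "only" files to split the A records from the C records and to fix the flag field, but sorting once and comparing whole files is vastly cheaper than running grep over a two-million-line file once per record.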

I'm also in agreement with BOFH that if raw performance is your concern, perl (or even C) might be better.

If I could read your code, I'd probably have some more insight.
# 5  
Old 10-10-2006
Look - tmarikle posted a nice little 5-line awk program that does most of what you want: print all the lines in file2 that do not exist in file1. A sort of "minus" in the result-set sense.

This is a modified version of it; change it as you want:
Code:
awk -F'|' '
    FILENAME=="file1" {
        Keys[$1,$2,$3]++            # count every key seen in file1
    }
    FILENAME=="file2" {
        if (Keys[$1,$2,$3] == 0) {  # key never appeared in file1
            print $0
        }
    }
' file1 file2 > newfile

The key is the first three fields. Add more fields or just use $0 for the whole record.
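
If you need the full A/C/D delta rather than just the "minus", the same hashing idea extends to it. Here is a rough, untested sketch, assuming (as in the sample data) pipe-delimited records where field 1 is the primary key, field 4 is the data part, and field 3 carries the I flag:
Code:
awk -F'|' -v OFS='|' '
    NR == FNR {                 # first file: yesterday
        key[$1]  = 1            # remember every key
        data[$1] = $4           # and its data part
        line[$1] = $0           # and the whole record
        next
    }
    # second file: today
    !($1 in key)   { print "A|" $0; next }      # new key -> Append
    data[$1] != $4 { $3 = "U"; print "C|" $0 }  # same key, changed data -> Change
                   { seen[$1] = 1 }             # key is still present today
    END {
        for (k in key)
            if (!(k in seen)) {                 # key gone today -> Delete
                n = split(line[k], f, "|")
                f[3] = "D"                      # flip the flag, as in the sample output
                s = f[1]
                for (i = 2; i <= n; i++) s = s "|" f[i]
                print "D|" s
            }
    }
' yester_file today_file > deltafile

Everything is held in memory and each file is read exactly once, so this avoids running grep over a two-million-line file for every single record.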
# 6  
Old 10-10-2006
Python alternative:
Code:
#!/usr/bin/python
# NOTE: assumes both files have the same number of lines and that
# matching records sit on the same line in each file
from itertools import izip

deltafile = open("delta.txt","a")
yfile = open("yester_file.txt") #open yesterday file
tfile = open("today_file.txt") #open today file

for yesterline, todayline in izip(yfile, tfile): #stream both files in step
        yesterline = yesterline.strip() #strip newline
        todayline = todayline.strip()
        y_primary, y_2nd, y_3rd, y_4th = yesterline.split("|")
        t_primary, t_2nd, t_3rd, t_4th = todayline.split("|")
        if y_primary == t_primary:
                if y_4th != t_4th:
                        print >> deltafile, "C|%s|%s|%s|%s" % (t_primary, t_2nd, "U", t_4th)
        else:
                print >> deltafile, "A|%s|%s|%s|%s" % (t_primary, t_2nd, t_3rd, t_4th)
                print >> deltafile, "D|%s|%s|%s|%s" % (y_primary, y_2nd, "D", y_4th)

deltafile.close() #close output file

Output:
/home > python test.py
C|aaa|xxxxxxxxxxxxxxxxxxxxxxxxx|U|vvvvvvvvvvvvvvvvvvv
C|bbb|xxxxxxxxxxxxxxxxxxxxxxxxx|U|kkkkkkkkkkkkkkkk
A|ddd|xxxxxxxxxxxxxxxxxxxxxxxxx|I|zzzzzzzzzzzzzzzzz
D|ccc|xxxxxxxxxxxxxxxxxxxxxxxxx|D|bbbbbbbbbbbbbbb
 