Difference between two huge files


 
# 8  
Old 09-14-2008
Hi,

Yes, bdiff is not behaving as I wanted. I am using Sun Solaris.

As you suggested, the best approach will be to write a script. As I am a novice at shell scripting, could you please help me work out how to approach this problem? Are there any sample scripts available?

Thanks again

Sue
# 9  
Old 09-14-2008
Assume we have a file called "big". Then we copy "big" to a file called "little", except that we delete some lines. In that case we can display the missing lines with:
Code:
#! /usr/bin/ksh
exec < little                  # stdin = the smaller file
exec 4< big                    # fd 4  = the larger file
IFS=""                         # keep leading/trailing whitespace intact
while read line1 ; do          # for each line of "little"...
        match=0
        while ((!match)) ; do  # ...read "big" until the same line turns up
                read -u4 line2
                if [[ "$line1" = "$line2" ]] ; then
                        match=1
                else
                        echo "$line2"   # in "big" but not in "little"
                fi
        done
done
while read -u4 line2 ; do      # whatever is left of "big" is also missing
        echo "$line2"
done
exit 0
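
For anyone who wants to try it, here is a minimal test run. The script reads files literally named "big" and "little" in the current directory; the script name missing.ksh is just an example:
Code:
# hypothetical test: "little" is "big" with two lines deleted
printf 'a\nb\nc\nd\n' > big
printf 'a\nc\n'       > little
chmod +x missing.ksh
./missing.ksh              # prints the deleted lines: b and d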

# 10  
Old 09-15-2008
Hi,

Your script seems to be working correctly. The problem I am facing now is that the output file keeps growing, as there is no end-of-file check in the script.

How can we exit immediately after comparing those two files?

Thanks a ton

Sue
# 11  
Old 09-15-2008
I don't understand. It does exit upon reaching end-of-file.
# 12  
Old 09-16-2008
Hi,

Try the code below... hope it will work.

Code:
awk 'FILENAME=="file1" {arr[$0]++}
FILENAME=="file2" {if ($0 in arr) {next} else {print $0}}' file1 file2
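
As a quick sanity check (the file names and contents here are only an illustration), this prints the lines of file2 that never appear in file1:
Code:
printf 'a\nb\nc\n'     > file1
printf 'b\nx\nc\ny\n'  > file2
awk 'FILENAME=="file1" {arr[$0]++}
FILENAME=="file2" {if ($0 in arr) {next} else {print $0}}' file1 file2
# expected output:
# x
# y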
# 13  
Old 09-16-2008
Quote:
Originally Posted by pyaranoid
Hi,

Your script seems to be working correctly. The problem I am facing now is that the output file keeps growing, as there is no end-of-file check in the script.

How can we exit immediately after comparing those two files?

Thanks a ton

Sue
Looks like your "little" file is not a proper subset of your "big" file. If you have lines in your "little" file which are not in your "big" file, Perderabo's script will never exit but will continue to increase the size of the diff file. The simple fix is to check the exit status after each read, i.e.:
Code:
      read -u4 line2
      if (( $? != 0 ))
      then
          # print $line1
          break
      fi

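For clarity, here is one way that check might be spliced into the inner loop of the script from post # 9 (just a sketch; as the next reply points out, the output can still be wrong when "little" has extra lines):
Code:
        while ((!match)) ; do
                read -u4 line2
                if (( $? != 0 )) ; then    # "big" ran out before a match was found
                        break              # stop instead of looping forever
                fi
                if [[ "$line1" = "$line2" ]] ; then
                        match=1
                else
                        echo "$line2"
                fi
        done
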
# 14  
Old 09-16-2008
If the little file has any line at all which is not present in the big file, that condition will send the original script into an infinite loop. fpmurphy's fix stops the infinite loop, but the output is probably still wrong. After encountering such a line, my script will output the remainder of the big file. If the little file has a copy of the big file's final line, and all extra lines in the little file follow this final line, I guess it's ok.

BTW, if the little file has any extra lines, it is not a subset at all. A proper subset would have fewer lines. An "improper" or "non-proper" subset would be an exact copy of all lines.
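
To make that failure mode concrete (the file contents below are invented purely for illustration, and missing.ksh is the hypothetical name used earlier for the script in post # 9):
Code:
# "little" contains a line that "big" does not have
printf 'a\nb\n' > big
printf 'z\n'    > little
# ./missing.ksh        # without the exit-status check, the inner loop echoes
                       # a and b, then read -u4 keeps failing, "$line2" stays
                       # empty, never matches "z", and empty lines are printed
                       # forever -- which is why the output file kept growing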
 