The UNIX and Linux Forums  
#1 | rikxik | 01-02-2008
Do we have to loop?

Code:
$ cat f1
11111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
91111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
81111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
$
$ cat f2
11111122222222333333aaaaaaaaaabbbbbbbbbccccccccddeddd
11111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
11111122222222333333aaaaaaaaaabbbbbbbbbccccccccddeddd
91111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
$
$ diff f1 f2 | grep '^< ' | cut -c3-
81111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
HTH
#2 | stateful | 01-02-2008
I'd probably use diff too...

If the lines in the files are similar to the lines you put in your first post, meaning there are no spaces on the lines, you could:

Code:
#!/bin/sh
# Print each line of file1 that does not occur in file2.
for k in `cat file1`
do
  # -m 1 stops at the first match; -x and -F compare $k as a literal whole line.
  grep -m 1 -xF -e "$k" file2 > /dev/null
  # grep exits with status 1 when nothing matched.
  if [ $? -eq 1 ]; then echo "$k"; fi
done
The -m 1 makes grep exit after the first match is found. If no match is found, grep exits with status 1, which you can use to determine whether the line exists in file2. Keep in mind that the "for k in `cat`" construct will break if the lines in the file contain spaces.
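If the lines may contain spaces, a read loop is one way around the word-splitting problem mentioned above. This is only a minimal sketch: the function name lines_missing_from is made up for illustration, and it still starts one grep per input line, so it is not a fix for very large files.

```shell
#!/bin/sh
# Hypothetical helper (name not from the thread):
# print every line of file $1 that has no exact match in file $2.
lines_missing_from() {
  # IFS= and -r keep blanks and backslashes in each line intact.
  while IFS= read -r line; do
    # -q: no output, stop at the first match
    # -x: the whole line must match; -F: treat the line as a literal string
    # -e: protects lines that begin with a dash
    grep -qxF -e "$line" "$2" || printf '%s\n' "$line"
  done < "$1"
}
```

Called as `lines_missing_from file1 file2`, it prints the lines of file1 that are absent from file2.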
#3 | ranjithpr | 01-03-2008
You can use grep -v -f

grep -v -f file1 file2

This will give you all the lines in file2 which are not in file1
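One caveat worth noting: without further options, grep -f treats every line of the pattern file as a regular expression and also accepts partial-line matches. A hedged sketch of a stricter variant follows; the wrapper name lines_not_in is hypothetical, and the sample files are the ones from post #1, recreated in a scratch directory.

```shell
#!/bin/sh
# Hypothetical wrapper (name not from the thread):
# print the lines of file $2 that do not appear as whole lines in file $1.
lines_not_in() {
  # -F: patterns are fixed strings, not regular expressions
  # -x: a pattern must match the entire line
  grep -v -x -F -f "$1" "$2"
}

# Recreate the sample data from post #1.
dir=$(mktemp -d)
cat > "$dir/f1" <<'EOF'
11111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
91111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
81111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
EOF
cat > "$dir/f2" <<'EOF'
11111122222222333333aaaaaaaaaabbbbbbbbbccccccccddeddd
11111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
11111122222222333333aaaaaaaaaabbbbbbbbbccccccccddeddd
91111122222222333333aaaaaaaaaabbbbbbbbbccccccccdddddd
EOF

# Lines of f2 that are not in f1 (same direction as in the post).
lines_not_in "$dir/f1" "$dir/f2"

# Swap the arguments to get the lines of f1 that are not in f2.
lines_not_in "$dir/f2" "$dir/f1"
```

The first call prints the two lines ending in ddeddd; the second prints the line starting with 8.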
#4 | drl | 01-05-2008
Hi.
Quote:
Originally Posted by net_shree View Post
I have to compare two text files, very few of the lines in these files will have some difference in some column.
The files size is in GB.
By chance I am working with a text file of about this size: it is just over 1 GB and has 15 M (15,000,000) lines. Counting the lines with wc takes 15-20 seconds of real time (AMD-64/3000, SATA disk).

If this is correct, and you have 2 such files, then I think any method that reads a line from file1 and runs a program to look through file2 for it at each step will not end quickly, because there will be 15 M loads of that program involved, not to mention actually reading the file. For example, running grep on /dev/null 15,000 times takes about 10 seconds (10.2 actually) of real time. For 1,000 times that, I'd be looking at roughly 2.8 hours just to load grep from the disk and read an immediate EOF. A grep of a non-existent string takes about 18 seconds for a single search.
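The extrapolation above can be reproduced with a few lines of shell. This is just a sketch of the arithmetic, assuming the cost per grep start scales linearly; estimate_spawn_cost is a hypothetical helper name, and the inputs are the figures quoted in this post.

```shell
#!/bin/sh
# Hypothetical helper (name not from the thread).
# Args: seconds measured, runs measured, runs planned.
estimate_spawn_cost() {
  LC_ALL=C awk -v secs="$1" -v runs="$2" -v planned="$3" 'BEGIN {
    total = secs / runs * planned   # scale the per-run cost linearly
    printf "%.0f s = %.2f h\n", total, total / 3600
  }'
}

# 10.2 s for 15,000 runs, scaled up to one run per line (15,000,000 lines).
estimate_spawn_cost 10.2 15000 15000000
```

With these inputs the estimate comes to 10,200 seconds, or a bit under three hours, before any real searching is done.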

I suggest that the files be sorted and diff be run once on the two files (post #8, rikxik). That will be 2 passes across each file, a decrease of close to 100% from 15M passes over 1 file.
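As a rough sketch of the sort-then-diff idea, reusing the diff filter from post #1: the function name only_in_first is hypothetical, LC_ALL=C is an assumption made here to get a plain byte-order sort, and the sorted copies go to temporary files, so for files of several GB the temporary directory needs enough free space.

```shell
#!/bin/sh
# Hypothetical helper (name not from the thread):
# print the lines that occur in file $1 but not in file $2,
# using one sort per file and a single diff run.
only_in_first() {
  a=$(mktemp)
  b=$(mktemp)
  # Byte-order sort; both files must be sorted the same way.
  LC_ALL=C sort "$1" > "$a"
  LC_ALL=C sort "$2" > "$b"
  # In diff's default output, lines starting with "< " exist only in the first file.
  diff "$a" "$b" | grep '^< ' | cut -c3-
  rm -f "$a" "$b"
}
```

Called as `only_in_first f1 f2` on the sample files from post #1, this prints the line starting with 8.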

If my facts are wrong, then tell me where I missed something of importance or made a mistake. Otherwise, perhaps we should take a step back and you tell us what the higher purpose of the problem is -- what problem you are really trying to solve -- perhaps we can suggest some other approach ... cheers, drl
#5 | rikxik | 01-06-2008
Quote:
I suggest that the files be sorted and diff be run once on the two files (post #8, rikxik). That will be 2 passes across each file, a decrease of close to 100% from 15M passes over 1 file.
Hi drl - I was wondering whether there is any reason/performance gain (for diff) if we sort the files? Is it essential/necessary? Just thinking aloud.
#6 | drl | 01-06-2008
Hi, rikxik.

My thinking was that with sorted input, the window diff has to search for matching sequences would not be so large. However, if the files were very similar, then the sort could perhaps be skipped -- I hope for the best, but expect the worst.

It would be interesting to try it both ways, of course ... cheers, drl
#7 | net_shree | 01-10-2008
I sorted both files and then tried diff as well as grep -v -f file1 file2. Same problem: it runs for too long.