awk to compare flat files and print output to another file Post: 302432559

Post 302432559 by suhaeb on Friday 25th of June 2010 10:18:34 AM

Thanks for your time on this, it's much appreciated.

1) Do both files have exactly the same number of records and are you just looking for records which have changed? Does the order of the output into file3 matter?

file1 has 1803077 records
file2 has 1795370 records

2) If there can be more or fewer records in file2 than file1, does the order of the output into file3 matter?

I would prefer the 1st row in file3 to come from file1, the 2nd row from file2, and so on.

Are you also interested in records which exist in file1 but do not exist in file2?

Yes, and vice versa. It would be good if we could copy those records to different files, say recordsonlyonfile1.txt and recordsonlyonfile2.txt.

3) What percentage of differences do you expect? (This is really a performance question because some approaches would involve multiple lookups).

There are huge changes in the file; it could be over 50%.

4) If this proves too difficult for shell programming, do you have a mainstream database engine?

I have an Informix database, but I am not sure it would help me, as there is no unique key in the records.

---------- Post updated at 15:05 ---------- Previous update was at 14:20 ----------

One shell approach, if the order of the output does not matter.
Tried with two files of approximately 5 million records and 500 MB each. It took about 5 minutes to run, and the output only shows the mismatched records from file2. Actual performance will depend on how fast your computer is and how much memory you can give to sort.

Code:

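#!/bin/ksh
# Sort both files so comm can compare them line by line
sort file1 > sortfile1
sort file2 > sortfile2
# -13 suppresses columns 1 and 3, leaving lines found only in file2
comm -13 sortfile1 sortfile2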

When sorting large files, be sure to set $TMPDIR to somewhere with enough space for at least twice the size of the file being sorted.
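
The same sort-then-comm idea might be extended to write the two output files asked for above. This is only a sketch, not a tested solution for the real data: the three-line sample files and the mktemp working directory are made up so the script runs on its own, and LC_ALL=C is an assumption added so that sort and comm agree on ordering. It compares whole lines, which seems to fit the case where there is no unique key.

```shell
#!/bin/sh
# Hypothetical sample data so the sketch is self-contained;
# replace these two files with the real file1 and file2.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'alpha' 'bravo' 'charlie' > file1
printf '%s\n' 'bravo' 'delta' 'charlie' > file2

# Assumption: use byte-order collation so sort and comm use the same ordering.
LC_ALL=C
export LC_ALL

# comm requires both inputs to be sorted.
sort file1 > sortfile1
sort file2 > sortfile2

# -23 hides columns 2 and 3: only lines unique to file1 remain.
comm -23 sortfile1 sortfile2 > recordsonlyonfile1.txt
# -13 hides columns 1 and 3: only lines unique to file2 remain.
comm -13 sortfile1 sortfile2 > recordsonlyonfile2.txt
```

With the sample data, recordsonlyonfile1.txt should contain alpha and recordsonlyonfile2.txt should contain delta. Note that this does not interleave rows from file1 and file2 into a single file3; without a key to pair changed records, the two lists stay separate.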
