How to extract duplicate records with associated header record


 
# 1  
Old 01-12-2007

All,

I have a task to search through several hundred files and extract duplicate detail records and keep them grouped with their header record. If no duplicate detail record exists, don't pull the header. For example, an input file could look like this:

input.txt
HA
D1
D2
D2
D3
D4
D4
HB
D1
D2
HC
D1
D1
D2
D3
D3

The output would be:

output.txt
HA
D2
D4
HC
D1
D3

Would it be possible to do this with AWK? I do not know Python.

Thank you for your time.
# 2  
Old 01-15-2007
What distinguishes Header and data? Is there a fixed list of Headers or was the input file generated after pasting your several hundred files? Can you explain the exact requirements?
# 3  
Old 01-15-2007
I'd use perl for this...
Code:
$ ./input.pl 
HA
D2
D4
HC
D1
D3
$ cat ./input.pl 
#!/usr/bin/perl
# Script to print headers and duplicate items from input.txt

use warnings;
use strict;
my @records;
undef $/;

open ( INPUT, "< input.txt" ) || die "Couldn't open input file: $!\n";
# use a look-ahead assertion here
@records = split( /^(?=(?:H))/m, <INPUT> );
foreach my $record ( @records ) {
   my @lines = split( /\n/, $record );
   my $header = $lines[0];
   my %linehash;
   my $headerdone = 0;
   foreach my $line ( @lines ) {
      $linehash{$line}++;  
   }
   foreach my $key ( sort ( keys ( %linehash ) ) ) {
      my $value = $linehash{$key};
      if ( $value > 1 ) { 
         if ( $headerdone == 0 ) {
            printf( "%s\n", $header );
            $headerdone++;
         }
         printf( "%s\n", $key );
      }
   }
}
close ( INPUT );

exit ( 0 );

Cheers
ZB
# 4  
Old 01-15-2007
Thanks for the replies.
These are actually multiple files of daily extracts of expense report data from a transactional system. Each file is made up of individual expense reports (header records) and the expense line items for each report (detail records). We had a situation where some detail records, but not all, were duplicated. This occurred in some output files, but not all. My requirements are to identify, by export file, the duplicate records, attached to their respective header records. We need this information to send to the system of record to correct these errors. It (hopefully) will be a one-time fix. Also, I do not know Perl, but am willing to learn enough to use it as a solution.

Thanks again for posting a reply.
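
Because the report has to be broken down by export file, one possible approach is a small awk wrapper that takes many files at once. This is only a sketch under assumptions: header lines start with "H" as in the sample, every other line is a detail record, and the function name dups_by_file and the file names in the usage line are made up for illustration.

```shell
#!/bin/sh
# Report duplicated detail records per export file, grouped under their header.
# Assumption: header lines start with "H"; all other lines are detail records.
dups_by_file() {
    awk '
        # First line of each input file: reset all per-file state.
        FNR == 1 { header = ""; hprinted = 0; fprinted = 0; split("", seen) }

        # New header: remember it and forget the details of the previous one.
        /^H/ { header = $0; hprinted = 0; split("", seen); next }

        # Second occurrence of a detail record under the current header.
        ++seen[$0] == 2 {
            if (!fprinted) { print FILENAME ":"; fprinted = 1 }
            if (!hprinted && header != "") { print header; hprinted = 1 }
            print
        }
    ' "$@"
}
```

It could be called as `dups_by_file extract1.txt extract2.txt > duplicates_report.txt`. Files without any duplicated detail record produce no output at all, so the report only names the files that need correcting.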
# 5  
Old 01-16-2007
Code:
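#! /opt/third-party/bin/perl
use strict;
use warnings;

# Print each header that has duplicated detail records, followed by
# each duplicated detail record (once), in order of appearance.
my ($header, $headerprint, %fileHash);
open(my $fh, '<', 'a') || die "Unable to open file : $!\n";

while ( defined(my $content = <$fh>) ) {
  chomp $content;
  if ( $content =~ m/^H/ ) {
    # New header: reset the per-header state.
    $headerprint = 0;
    $header = $content;
    %fileHash = ();
  }
  elsif ( ++$fileHash{$content} == 2 ) {
    # Second occurrence under this header: print the header only now,
    # so headers without duplicates are never printed.
    if ( defined $header && !$headerprint ) {
      print "$header\n"; $headerprint = 1;
    }
    print "$content\n";
  }
}
close($fh);

exit 0;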

# 6  
Old 01-16-2007
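This shell script should do it for you.

#! /usr/bin/ksh
# Usage: pass the input file as the first argument.
r=$(sort "$1" | uniq -d)   # lines occurring more than once anywhere in the file
if [ -z "$r" ]; then
echo " No duplicate record found"
else
k=$(sort -u "$1")          # every distinct line of the file, sorted
echo "output.txt:" >>outputfile
echo "$k" >> outputfile
exit 0
fi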

Last edited by Krrishv; 01-16-2007 at 05:31 AM..
# 7  
Old 01-16-2007
But this won't categorize the output based on the header, as the OP requested.
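
Since the original question asked about AWK, here is a minimal sketch of a header-aware version for a single file. It assumes that header lines begin with "H" (as in the sample input) and that every other line is a detail record; the function name dups_by_header is hypothetical.

```shell
#!/bin/sh
# Print each header that has duplicated detail records, followed by each
# duplicated detail record once, in order of appearance.
# Assumption: header lines start with "H"; all other lines are detail records.
dups_by_header() {
    awk '
        # New header: remember it and reset the per-header state.
        /^H/ {
            header = $0
            printed = 0
            split("", seen)     # portable way to empty an array
            next
        }

        # A detail record seen for the second time under this header.
        ++seen[$0] == 2 {
            if (!printed && header != "") { print header; printed = 1 }
            print
        }
    ' "$@"
}
```

Running `dups_by_header input.txt > output.txt` on the sample input from the first post should give HA, D2, D4, HC, D1, D3, one per line. Counting per header rather than per file is what keeps HB out of the output, even though D1 and D2 also appear under other headers.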
 