The UNIX and Linux Forums  

#1
02-08-2008
Registered User
 

Join Date: Feb 2008
Posts: 6
Extracting records with unique fields from a fixed width txt file

Greetings,

I would like to extract records from a fixed width text file that have unique field elements.

Data is structured like this:

John      A Smith      NY
Mary      C Jones      WA
Adam      J Clark      PA
Mary        Jones      WA

Fieldname / start-end position
Firstname 1-10
MI 11-12
Lastname 13-23
State 24-25

I want to compare firstname and lastname fields exclusively and output the unique records to a new file:
John      A Smith      NY
Adam      J Clark      PA

Any assistance would be greatly appreciated.

Last edited by sitney; 02-08-2008 at 10:39 PM.
#2
02-09-2008
Registered User
 

Join Date: Jan 2008
Posts: 309
Your requirements are a bit vague, but here is a possible perl solution:

Code:
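#!/usr/bin/perl
use warnings;
use strict;
#use Data::Dumper; #uncomment for debugging

# Expect exactly two arguments: the input file and the output file.
unless (scalar @ARGV == 2){
   die "Usage: perl scriptname.pl inputfile outputfile\n";
}

# Take the output file off @ARGV so that <> reads only the input file.
my $outfile = pop @ARGV;
my %names = ();   # "first,last" => { count => N, name => "rebuilt record" }
my %count = ();   # "first,last" => number of times that name pair was seen

while (<>){
   chomp;
   # Split the line into fixed-width fields of 10, 2, 11 and 2 characters.
   my ($first,$mi,$last,$state) = unpack("a10a2a11a2",$_);
   # Strip leading and trailing whitespace from each field.
   (s/^\s*//, s/\s*$//) for ($first,$mi,$last,$state);
   # A later line with the same name pair overwrites the entry, so the
   # stored count is always the running total for that pair.
   $names{"$first,$last"}={count => ++$count{"$first,$last"},
                           name => "$first $mi $last $state",
                          };
}

#print Dumper \%names; #uncomment for debugging

open my $out , '>' , $outfile or die "Cannot open $outfile: $!";

# Print only the records whose name pair occurred exactly once.
# Note: keys returns the hash keys in no particular order.
foreach my $person (keys %names) {
    next if $names{$person}{count}>1;
    print $out $names{$person}{name},"\n";
}

close $out;

print STDOUT "finished\n";
exit(0);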
Usage:

perl scriptname.pl path/to/inputfile path/to/outputfile

Last edited by KevinADC; 02-09-2008 at 01:08 AM.
#3
02-09-2008
Registered User
 

Join Date: Feb 2008
Posts: 6
KevinADC - I really appreciate your response here.

It works! When I run your perl script, I get these results:
$ cat newnames.txt
John A Smith NY
Adam J Clark PA

Despite my vague requirements, you understood them perfectly.

I am trying to decipher the workhorse part of the script you wrote:

while (<>){
   chomp;
   #Assign variables to fixed width sections using unpack.
   my ($first,$mi,$last,$state) = unpack("a10a2a11a2",$_);

   #Remove whitespace from variables.
   (s/^\s*//, s/\s*$//) for ($first,$mi,$last,$state);

   #Please describe what is going on here.
   $names{"$first,$last"}={count => ++$count{"$first,$last"},
                           name => "$first $mi $last $state",
                          };
}

Thanks again.
#4
02-09-2008
Registered User
 

Join Date: Jan 2008
Posts: 309
Quote:
#Please describe what is going on here.
$names{"$first,$last"}={count => ++$count{"$first,$last"},
                        name => "$first $mi $last $state",
                       };
I'll try....

$names{"$first,$last"} uses the first and last name, joined by a comma, as a key in the %names hash.

Its value is in turn a reference to another (anonymous) hash:

Code:
$names{"$first,$last"} = {count=>'' , name => '' };
The value of the "count" key comes from a second hash, %count, which keeps track of how many times each first,last name pair has been found:

Code:
++$count{"$first,$last"}
so we can determine later whether it is a unique combination or not. If it ends up with a count of 1 (one), it is unique.

The value of the "name" key is the record rebuilt from the trimmed fields (not the untouched input line), which we print to the output file if the value of the "count" key is 1 (one).

You can uncomment the lines that say to "uncomment for debugging" and you will see the data structure of %names printed when the script finishes running.
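
For anyone who wants to see that structure without running the full script, here is a minimal sketch that builds %names from the four sample lines in post #1. The sample lines are padded to the stated column widths, which is an assumption about the real file, and the sub name build_names is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: builds the same %names structure as the script
# in post #2, but from a list of fixed-width lines instead of a file.
sub build_names {
    my @lines = @_;
    my (%names, %count);
    for my $line (@lines) {
        # Fields: first name (10), middle initial (2), last name (11), state (2).
        my ($first, $mi, $last, $state) = unpack("a10a2a11a2", $line);
        (s/^\s*//, s/\s*$//) for ($first, $mi, $last, $state);
        my $key = "$first,$last";
        # The inner hash is replaced on every occurrence of the key;
        # %count carries the running total across occurrences.
        $names{$key} = {
            count => ++$count{$key},
            name  => "$first $mi $last $state",
        };
    }
    return \%names;
}

# Sample data from post #1, padded to the stated widths (assumed layout).
my @sample = (
    "John      A Smith      NY",
    "Mary      C Jones      WA",
    "Adam      J Clark      PA",
    "Mary        Jones      WA",
);

my $names = build_names(@sample);
for my $key (sort keys %$names) {
    print "$key => count $names->{$key}{count}, name '$names->{$key}{name}'\n";
}
```

With this sample, the "Mary,Jones" entry ends up with a count of 2 and holds the fields of the last matching line, while "John,Smith" and "Adam,Clark" each have a count of 1.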
#5
02-09-2008
Registered User
 

Join Date: Jan 2008
Posts: 309
You have here:

Quote:
#Remove whitespace from variables.
(s/^\s*//, s/\s*$//) for ($first,$mi,$last,$state);
That part actually removes only the leading and trailing whitespace from each variable in the list. Internal spaces are kept because names can have spaces in them, and removing them could create false matches. For example:

John W "Van Johnson" (last name in quotes to show it is one field)

John W VanJohnson

This is probably a rare circumstance (and not a very good example), but it is possible, especially if the names are not in English.
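
A small sketch of the difference, using a made-up helper called trim_edges. It uses \s+ rather than \s*, which has the same effect here. The two last names are the ones from the example above, written as they would appear in an 11-character field.

```perl
use strict;
use warnings;

# Hypothetical helper: strips leading and trailing whitespace only,
# leaving any internal spaces alone.
sub trim_edges {
    my ($s) = @_;
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

my $spaced = trim_edges("Van Johnson");   # fills the 11-character field
my $joined = trim_edges("VanJohnson ");   # 10 characters plus one pad space

print "'$spaced' vs '$joined'\n";
print $spaced eq $joined ? "same key\n" : "different keys\n";
```

Because the internal space survives, the two names produce different hash keys and are not counted as the same person.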
#6
02-09-2008
Registered User
 

Join Date: Feb 2008
Posts: 6
You said,
Quote:
(s/^\s*//, s/\s*$//) for ($first,$mi,$last,$state);
That part actually removes leading and trailing spaces from the list of variables.
I am crystal clear with this clarification. Thanks.

However, the hash structure you used
Code:
$names{"$first,$last"}={count => ++$count{"$first,$last"},
                        name => "$first $mi $last $state",
                       };
is so compact and does so much that, even with your description, it remains beyond my full grasp at this stage of my Perl newbishness.

Even though I don't fully grasp this data structure, I can use it, modify it, and apply it. So thanks again KevinADC!
#7
02-09-2008
Registered User
 

Join Date: Jan 2008
Posts: 309
You're welcome. Actually that data structure could have been a bit simpler:

Code:
while (<>){
   chomp;
   my ($first,$mi,$last,$state) = unpack("a10a2a11a2",$_);
   (s/^\s*//, s/\s*$//) for ($first,$mi,$last,$state);
   $names{"$first,$last"}{count}++;
   $names{"$first,$last"}{name} = "$first $mi $last $state";
}
This eliminates the need for the separate hash to keep track of the counts. I like to use a separate hash for counts because real data is generally much more complex than this, and incrementing a count is often easier when it is kept separate.
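
One thing neither version controls is the output order, because keys returns hash keys in no particular order, and the printed record is rebuilt with single spaces rather than the original column widths. If either matters, a two-pass approach is one option. The sketch below is an assumption about what might be wanted, not part of the scripts above: it counts name pairs first, then returns the original lines in input order. The sub names name_key and unique_records are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the "first,last" key for one fixed-width line.
sub name_key {
    my ($line) = @_;
    # Skip the 2-character middle initial; only the names form the key.
    my ($first, undef, $last) = unpack("a10a2a11", $line);
    s/^\s+|\s+$//g for ($first, $last);
    return "$first,$last";
}

# Hypothetical helper: returns the lines whose name pair occurs exactly
# once, in their original order and with their original spacing.
sub unique_records {
    my @lines = @_;
    my %count;
    $count{ name_key($_) }++ for @lines;                  # first pass: count
    return grep { $count{ name_key($_) } == 1 } @lines;   # second pass: filter
}

# Sample data from post #1, padded to the stated widths (assumed layout).
my @sample = (
    "John      A Smith      NY",
    "Mary      C Jones      WA",
    "Adam      J Clark      PA",
    "Mary        Jones      WA",
);

print "$_\n" for unique_records(@sample);
```

With a real file, the lines would be read and chomped first, for example with chomp(my @lines = <>), and then passed to unique_records.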
All times are GMT -7.