Extracting records with unique fields from a fixed width txt file


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Extracting records with unique fields from a fixed width txt file
# 1  
Old 02-09-2008
Extracting records with unique fields from a fixed width txt file

Greetings,

I would like to extract records from a fixed width text file that have unique field elements.

Data is structured like this:

John A Smith NY
Mary C Jones WA
Adam J Clark PA
Mary Jones WA

Fieldname / start-end position
Firstname 1-10
MI 11-12
Lastname 13-23
State 24-25

I want to compare firstname and lastname fields exclusively and output the unique records to a new file:
John A Smith NY
Adam J Clark PA

Any assistance would be greatly appreciated.

Last edited by sitney; 02-09-2008 at 02:39 AM..
# 2  
Old 02-09-2008
Your requirements are a bit vague, but here is a possible perl solution:

Code:
#!/usr/bin/perl
use warnings;
use strict;
#use Data::Dumper; #uncomment for debugging
unless (scalar @ARGV == 2){
   die "Usage: perl scriptname.pl inputfile outputfile\n";
}

my $outfile = pop @ARGV;
my %names = ();
my %count = ();

while (<>){
   chomp;
   my ($first,$mi,$last,$state) = unpack("a10a2a11a2",$_);
   (s/^\s*//, s/\s*$//) for ($first,$mi,$last,$state);
   $names{"$first,$last"}={count => ++$count{"$first,$last"},
                           name => "$first $mi $last $state",
                          };
}

#print Dumper \%names; #uncomment for debugging  

open my $out , '>' , $outfile or die "$!"; 

foreach my $person (keys %names) {
    next if $names{$person}{count}>1;
    print $out $names{$person}{name},"\n";
}

close $out;

print STDOUT "finished";
exit(0);

Usage:

perl scriptname.pl path/to/inputfile path/to/outputfile

Last edited by KevinADC; 02-09-2008 at 05:08 AM..
# 3  
Old 02-09-2008
KevinADC - I really appreciate your response here.

It works! When I run your perl script, I get these results:
$ cat newnames.txt
John A Smith NY
Adam J Clark PA

Despite my vague requirements, you understood them perfectly.

I am trying to decipher the workhorse part of the script you wrote:

while (<>){
chomp;
#Assign variables to fixed width sections using unpack.
my ($first,$mi,$last,$state) = unpack("a10a2a11a2",$_);

#Remove whitespace from variables.
(s/^\s*//, s/\s*$//) for ($first,$mi,$last,$state);

#Please describe what is going on here.
$names{"$first,$last"}={count => ++$count{"$first,$last"},
name => "$first $mi $last $state",
};
}

Thanks again.
# 4  
Old 02-09-2008
Quote:
#Please describe what is going on here.
$names{"$first,$last"}={count => ++$count{"$first,$last"},
name => "$first $mi $last $state",
};
I'll try....

$names{"$first,$last"} creates a hash key from the first and last name.

its' value is in turn a hash:

Code:
$names{"$first,$last"} = {count=>'' , name => '' };

the "count" keys value is the value of another hash: %count, which is keeping count of how many times the first,last names are found:

Code:
++$count{"$first,$last"}

so we can determine later if it is a unique combination or not. If it has a count of 1 (one) then it is unique.

the "name" keys is just the original line from the file which we use to print to the output file if the value of the "count" key is 1 (one).

You can uncomment the lines that say to "uncomment for debugging" and you will see the data structure of %names printed when the script finishes running.
# 5  
Old 02-09-2008
You have here:

Quote:
#Remove whitespace from variables. <br>
(s/^\s*//, s/\s*$//) for ($first,$mi,$last,$state);
That part actually removes leading and trailing spaces from the list of variables. If there are internal spaces they are kept because names can have spaces in them, and if you removed the internal spaces you could potentially create false matches, example:

John W "Van Johnson" (last name in quotes to show it is one field)

John W VanJohnson

This is probaly a rare circumstance (and not a very good example) but it is possible, especially if the names are not in English.
# 6  
Old 02-10-2008
You said,
Quote:
(s/^\s*//, s/\s*$//) for ($first,$mi,$last,$state);
That part actually removes leading and trailing spaces from the list of variables.
I am crystal clear with this clarification. Thanks.

However, the hash structure you used
Code:
$names{"$first,$last"}={count => ++$count{"$first,$last"},
name => "$first $mi $last $state",
};

is so compact and does so much, that even with your description, it remains beyond my full grasp at this stage of my perl newbishness.

Even though I don't fully grasp this data structure, I can use it, modify it, and apply it. So thanks again KevinADC!
# 7  
Old 02-10-2008
You're welcome. Actually that data structure could have been a bit simpler:

Code:
while (<>){
   chomp;
   my ($first,$mi,$last,$state) = unpack("a10a2a11a2",$_);
   (s/^\s*//, s/\s*$//) for ($first,$mi,$last,$state);
   $names{"$first,$last"}{count}++;
   $names{"$first,$last"}{name} = "$first $mi $last $state",
}

This eliminates the need for the seperate hash to keep track of the counts. I like to use the seperate hash for counts because in general data is much more complex than this and incrementing a count can be much easier done if it is kept seperate.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Alter Fixed Width File

Thank u so much .Its working fine as expected. ---------- Post updated at 03:41 PM ---------- Previous update was at 01:46 PM ---------- I need one more help. I have another file(fixed length) that will get negative value (ex:-00000000003000) in postion (98 - 112) then i have to... (6 Replies)
Discussion started by: vinus
6 Replies

2. UNIX for Dummies Questions & Answers

Length of a fixed width file

I have a fixed width file of length 53. when is try to get the lengh of the record of that file i get 2 different answers. awk '{print length;exit}' <File_name> The above code gives me length 50. wc -L <File_name> The above code gives me length 53. Please clarify on... (2 Replies)
Discussion started by: Amrutha24
2 Replies

3. Shell Programming and Scripting

How to parse fixed-width columns which may include empty fields?

I am trying to selectively display several columns from a db2 query, which gives me a fixed-width output (partial output listed here): --------- -------------------------- ------------ ------ 000 0000000000198012 702 29 000 0000000000198013 ... (9 Replies)
Discussion started by: ahsh79
9 Replies

4. Shell Programming and Scripting

variable fixed-width fields

Hi there, CTL Port IO Rate(IOPS) Read Rate(IOPS) Write Rate(IOPS) Read Hit(%) Write Hit(%) Trans. Rate(MB/S) Read Trans. Rate(MB/S) Write Trans. Rate(MB/S) 09:36:48 0 A 136 0 135 97 100 ... (6 Replies)
Discussion started by: gray380
6 Replies

5. Shell Programming and Scripting

Apply condition on fixed width file and filter records

Dear members.. I have a fixed width file. Requirement is as below:- 1. Scan each record from this fixed width file 2. Check for value under field no "6" equals to "ABC". If yes, then filter this record into the output file Please suggest a unix command to achieve this, my guess awk might... (6 Replies)
Discussion started by: sureshg_sampat
6 Replies

6. Shell Programming and Scripting

Comparing two fixed width file

Hi Guys I am checking the treads to get the answer but i am not able to get the answer for my question. I have two files. First file is a pattern file and the second file is the file i want to search in it. Output will be the lines from file2. File1: P2797f12af 44751228... (10 Replies)
Discussion started by: anshul_er
10 Replies

7. Shell Programming and Scripting

Manupulating Records in a fixed width file

I am trying to determine what would be a fast and simple way to manipulate data that comes in a fixed width format. This data has 6 segments within a record. Each record needs to written out with a header and the 6 segments. Based on the value in column #6 the fields will be defined accordingly.... (4 Replies)
Discussion started by: Muga801
4 Replies

8. Shell Programming and Scripting

Fixed-Width file from Oracle

Hi All, I have created a script which generates FIXED-WIDTH file by executing Oracle query. SELECT RPAD(NVL(col1,CHR(9)),20)||NVL(col2,CHR(9))||NVL(col3,CHR(9) FROM XYZ It generates the data file with proper alignment. But if same file i transfer to windows server or Mainframe... (5 Replies)
Discussion started by: Amit.Sagpariya
5 Replies

9. Shell Programming and Scripting

summing up the fields in fixed width file

Hi, I have a fixed width file with some records as given below: " 1000Nalsdjflj243324jljlj" "-0300Njfowjljl309933fsf" " 0010Njsfsjklj342344fsl" I want to sum-up first field values(i.e from 2nd character to 6th character)of each record. so for the above file i want to add (1000 - 300+... (2 Replies)
Discussion started by: srilaxmi
2 Replies

10. UNIX Desktop Questions & Answers

Help with Fixed width File Parsing

I am trying to parse a Fixed width file with data as below. I am trying to assign column values from each record to variables. When I parse the data, the spaces in all coumns are dropped. I would like to retain the spaces as part of the dat stored in the variables. Any help is appreciated. I... (4 Replies)
Discussion started by: sate911
4 Replies
Login or Register to Ask a Question