#1
Removing dupes within 2 delimited areas in a large dictionary file
Hello,

I have a very large dictionary file in text format that contains a large number of sub-sections. Each sub-section starts with the following header:
Code:
#DATA
#VALID 1
(the 1 could also be 0) and ends with the footer shown below:
Code:
#END

The data between the header and the footer consists of words, one word per line. Because the data set is so large, words are often repeated within a section, so the file ends up with dupes.

What I need is a Perl or AWK script that identifies the header and the footer, finds the data between them, and sorts that data while removing all duplicates. A sample input and the expected output are given below. The examples are in English, since the real data is in Perso-Arabic script. Case is not an issue, since the language does not have case. All data is in Unicode (UTF-16), but I can convert it to UTF-8.

Could you please comment the script, so that I can learn how to identify headers and footers within a database and then sort the data, removing dupes?

Many thanks in advance for the help and also for the learning experience.

Sample input
Code:
#DATA
#VALID 1
a
a
a
a
all
an
and
and
and
are
are
are
as
awk
below
case
case
could
data
data
data
data
does
dupes
duplicates
ends
english
examples
file
find
footer
from
given
given
happens
have
header
however
i
identify
in
input
is
is
is
issue
it
language
large
need
not
not
of
or
output
perl
perso-arabic
real
removing
repeated
result
sample
script
script
section
since
since
so
sort
that
the
the
the
the
the
the
the
the
the
them
time
up
what
which
which
with
within
within
words
#END

Expected output
Code:
#DATA
#VALID 1
a
all
an
and
are
as
awk
below
case
could
data
does
dupes
duplicates
ends
english
examples
file
find
footer
from
given
happens
have
header
however
i
identify
in
input
is
issue
it
language
large
need
not
of
or
output
perl
perso-arabic
real
removing
repeated
result
sample
script
section
since
so
sort
that
the
them
time
up
what
which
with
within
words
#END

Sample ends

Last edited by Scrutinizer; 12-07-2012 at 03:03 AM. Reason: code tags
#2
Perl script:
Code:
my $in_file  = '/temp/tmp/t';             # file contains your example data
my $out_file = '/temp/tmp/new_file.txt';  # new file with dups removed
my $line;
my @inla;
my @outla;
open ( my $in_file_fh,  '<', $in_file )  or die "Can't open $in_file $!\n";
open ( my $out_file_fh, '>', $out_file ) or die "Can't open $out_file $!\n";
DATA: while ( $line = <$in_file_fh> ) {   # Read input file
    # If line starts with '#DATA' write it to out file,
    # then also write the next line, which is a '#VALID x' line
    if ( $line =~ /^\#DATA/ ) {
        print $out_file_fh $line;
        $line = <$in_file_fh>;
        print $out_file_fh $line;
        # Start every section with an empty word list
        @inla = ();
        # Read lines until '#END' line is read
        while ( $line = <$in_file_fh> ) {
            if ( $line =~ /^\#END/ ) {
                # Create an anonymous hash from input lines in array (@inla), which removes duplicates,
                # and place results in array: @outla
                @outla = keys %{{ map { $_ => 1 } @inla }};
                # write sorted out array to out file
                print $out_file_fh ( sort @outla );
                # write '#END' line to out file
                print $out_file_fh $line;
                # leave inner loop and go back to main loop for the next section
                next DATA;
            }
            # Load lines between the '#VALID x' line and the '#END' line into array
            push @inla, $line;
        }
    }
}
close $in_file_fh;
close $out_file_fh or die "Can't close $out_file $!\n";
Code:
$ cat new_file.txt
#DATA
#VALID 1
a
all
an
and
are
as
awk
below
case
could
data
does
dupes
duplicates
ends
english
examples
file
find
footer
from
given
happens
have
header
however
i
identify
in
input
is
issue
it
language
large
need
not
of
or
output
perl
perso-arabic
real
removing
repeated
result
sample
script
section
since
so
sort
that
the
them
time
up
what
which
with
within
words
#END
The Following User Says Thank You to spacebar For This Useful Post: gimley (12-05-2012)
#3
Many thanks. Am out at present. Will run the perl script and get back to you.
#4
You may use something like this:
Code:
awk '/DATA/,/END/'
to get the data between DATA and END, then sort it.
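As a rough sketch of how that range idea could be carried through to the full task, the shell function below wraps an awk program that copies the header and footer lines unchanged and pipes the words of each section through sort -u. The function name dedupe_sections and the LC_ALL=C byte-order sort are assumptions made for illustration, not something given in the thread. It also assumes that #DATA, #VALID and #END each sit on a line of their own.

```shell
# Sketch: sort and de-duplicate the words of every #DATA ... #END section.
# dedupe_sections is a hypothetical helper name, not part of the thread.
dedupe_sections() {
    awk '
    BEGIN { cmd = "LC_ALL=C sort -u" }       # assumption: plain byte-order sort
    /^#END/ {
        if (in_section) {
            fflush()                         # write pending header lines first
            close(cmd)                       # sort now prints the unique words
        }
        in_section = 0
        print; next
    }
    /^#DATA/                { in_section = 1; print; next }
    in_section && /^#VALID/ { print; next }  # second header line, copied as is
    in_section              { print | cmd; next }  # a word line goes to sort
    { print }                                # anything outside a section
    ' "$@"
}
```

A call could look like dedupe_sections dictionary.txt > dictionary.sorted.txt, where both file names are placeholders. Because close(cmd) runs at every #END, each section is sorted on its own and the sections stay in their original order. For UTF-8 input, byte order is the same as code point order, so it yields a stable result even if it is not the dictionary order a given language expects.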
#5
Hello,

Sorry, my broadband was down and I could not check out the Perl script. It works beautifully on ASCII (single-byte) data. As soon as UTF-8 or UTF-16 data is processed, no output is visible. Does Perl have problems with Unicode? Since my data is in Perso-Arabic, the script does not work. Is there any roundabout way to solve the problem? I am using the latest version of ActiveState Perl and, in despair, even downloaded Strawberry Perl, but the script still does not work on the data.

I am attaching a zip file containing data in UTF-8 format, with Hindi as an example. There are two files: testdic and testdic.out.

Many thanks for the beautifully commented script. I modified it slightly, as shown below, to take input and output from the command line:
Code:
#!/usr/bin/perl
my $line;
my @inla;
my @outla;
The rest of the code remains the same. I do not think this would affect accessing a UTF-8 file. Many thanks once again.
#6
Hi.
You might want to start with: perldoc perlunitut, then man perlunicode.

You seem to be using Windows. I have used the utf8 facilities on GNU/Linux systems, but I have no idea whether they are available in or with ActiveState Perl.

Doing an advanced search here for perl utf8 yields about 50 hits, some of which may be useful.

Best wishes ... cheers, drl

( Edit 1: add note about advanced search )
Last edited by drl; 12-07-2012 at 06:43 AM.
#7
Hello,
I found the problem: the BOM (byte order mark). Under Windows, a UTF-8 file normally starts with a BOM (U+FEFF). I concede that it is legal for them to do so, but it is utterly pointless, since the byte order is determined by the formal specification of the UTF-8 representation itself. And it just happens that, unlike the rest of UTF-8, an initial BOM will screw up a Unix system. And Perl is supposed to be
Quote:
On Linux you should have no problem, since this aberration does not exist in a Unix system.

Many thanks for trying to solve the mystery. As an aid to all of us who suffer the tyranny of the WinOS system, here is a useful link:
HTML Code:
http://www.perlmonks.org/?node_id=599720.
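For anyone hitting the same wall: a header on the first line that is preceded by the three BOM bytes (EF BB BF) no longer matches an anchored pattern such as /^\#DATA/, which would be consistent with the empty output described above. A minimal sketch of one way to drop the BOM before processing follows. The function name strip_bom is hypothetical, and the sketch only handles UTF-8 input, so UTF-16 data would have to be converted first.

```shell
# Sketch: drop a leading UTF-8 BOM (bytes EF BB BF) from the first line of each file.
# strip_bom is a hypothetical helper name, not part of the thread.
strip_bom() {
    bom=$(printf '\357\273\277')             # the three BOM bytes, written in octal
    LC_ALL=C awk -v bom="$bom" '
    FNR == 1 && index($0, bom) == 1 {        # BOM sits at the very start of the file
        $0 = substr($0, length(bom) + 1)     # keep everything after it
    }
    { print }
    ' "$@"
}
```

Something like strip_bom testdic > testdic.nobom (the output name is a placeholder) would then give a file whose first line starts with #DATA again. Files without a BOM pass through unchanged, so the filter can be applied unconditionally.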