Removing dupes within 2 delimited areas in a large dictionary file | Unix Linux Forums | Shell Programming and Scripting
#1 12-05-2012, gimley (Registered User)
Hello,
I have a very large dictionary file in text format that contains a large number of sub-sections. Each sub-section starts with the following header:

Code:
#DATA
#VALID 1   [this could also be 0 instead of 1]

and ends with the footer shown below:

Code:
#END

The data between the header and the footer consists of words, one word per line.
Given the volume of data, words end up repeated within a section, so the file contains duplicates.
What I need is a Perl or awk script that identifies the header and the footer, takes the data between them, and sorts it, removing all duplicates.
A sample input and the expected output are given below. The examples are in English, since the real data is in Perso-Arabic script. Case is not an issue, because the language has no case. All data is in UTF-16, but I can convert it to UTF-8.
Could you please comment the script, so that I can learn how to identify headers and footers in such a database and then sort the data, removing dupes?
Many thanks in advance for the help, and also for the learning experience.

_____Sample Input

Code:
#DATA
#VALID 1
a
a
a
a
all
an
and
and
and
are
are
are
as
awk
below
case
case
could
data
data
data
data
does
dupes
duplicates
ends
english
examples
file
find
footer
from
given
given
happens
have
header
however
i
identify
in
input
is
is
is
issue
it
language
large
need
not
not
of
or
output
perl
perso-arabic
real
removing
repeated
result
sample
script
script
section
since
since
so
sort
that
the
the
the
the
the
the
the
the
the
them
time
up
what
which
which
with
within
within
words
#END

___________Expected output

Code:
#DATA
#VALID 1
a
all
an
and
are
as
awk
below
case
could
data
does
dupes
duplicates
ends
english
examples
file
find
footer
from
given
happens
have
header
however
i
identify
in
input
is
issue
it
language
large
need
not
of
or
output
perl
perso-arabic
real
removing
repeated
result
sample
script
section
since
so
sort
that
the
them
time
up
what
which
with
within
words
#END

_____Sample ends

#2 12-05-2012, spacebar (Registered User)
Perl script:

Code:
#!/usr/bin/perl
use strict;
use warnings;

my $in_file  = '/temp/tmp/t';            # file containing your example data
my $out_file = '/temp/tmp/new_file.txt'; # new file with dupes removed

open( my $in_file_fh,  '<', $in_file  ) or die "Can't open $in_file: $!\n";
open( my $out_file_fh, '>', $out_file ) or die "Can't open $out_file: $!\n";

my @inla;   # words collected within the current section
my @outla;  # de-duplicated words

DATA: while ( my $line = <$in_file_fh> ) {  # read input file
        # If a line starts with '#DATA', write it to the output file,
        # along with the next line, which is the '#VALID x' line.
        if ( $line =~ /^#DATA/ ) {
          print $out_file_fh $line;
          $line = <$in_file_fh>;            # the '#VALID x' line
          print $out_file_fh $line;
          @inla = ();                       # reset for this section
          # Read lines until the '#END' line is reached
          while ( $line = <$in_file_fh> ) {
            if ( $line =~ /^#END/ ) {
              # Build a hash from the collected lines, which removes
              # duplicates, and put the unique keys in @outla.
              @outla = keys %{ { map { $_ => 1 } @inla } };
              # Write the sorted array to the output file
              print $out_file_fh ( sort @outla );
              # Write the '#END' line to the output file
              print $out_file_fh $line;
              # Return to the main loop so any further sections are handled
              next DATA;
            }
            # Collect the lines between '#VALID x' and '#END'
            push @inla, $line;
          }
        }
      }
close $in_file_fh;
close $out_file_fh;



Code:
$ cat new_file.txt
#DATA
#VALID 1
a
all
an
and
are
as
awk
below
case
could
data
does
dupes
duplicates
ends
english
examples
file
find
footer
from
given
happens
have
header
however
i
identify
in
input
is
issue
it
language
large
need
not
of
or
output
perl
perso-arabic
real
removing
repeated
result
sample
script
section
since
so
sort
that
the
them
time
up
what
which
with
within
words
#END

#3 12-05-2012, gimley (Registered User)
Many thanks. I am out at present. I will run the Perl script and get back to you.
#4 12-06-2012, Jotne (Registered User)
You may use something like this:

Code:
awk '/DATA/,/END/'

to get the data between #DATA and #END, then sort it.
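The range pattern only extracts the block; an expanded sketch (my own, not from the thread, using gawk extensions and a stand-in path `/tmp/dict.txt`) that also sorts and dedupes each section could look like:

```shell
# Sketch: per-section sort+dedupe in awk, piping each section's unique
# words through the external sort command (gawk assumed).
cat > /tmp/dict.txt <<'EOF'
#DATA
#VALID 1
b
a
b
#END
EOF
awk '
/^#DATA/    { print; getline; print; delete seen; next }  # header + #VALID line
/^#END/     { fflush(); close("sort"); print; next }      # drain sort, then footer
!seen[$0]++ { print | "sort" }                            # unique words into sort
' /tmp/dict.txt
```

`fflush()` before `close("sort")` keeps awk's own output ordered ahead of sort's, and `delete seen` resets the dedupe table for the next section.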
#5 12-07-2012, gimley (Registered User)
Hello,
Sorry, my broadband was down and I could not check the Perl script earlier. It works beautifully on ASCII (8-bit) data. As soon as UTF-8 or UTF-16 data is fed in, no output appears.
Does Perl have problems with Unicode? Since my data is in Perso-Arabic, the script does not work on it.
Is there any roundabout way to solve the problem? I am using the latest version of ActiveState Perl and in despair even downloaded Strawberry Perl, but the data still does not work.
I am attaching a zip file containing data in UTF-8 format, with Hindi as an example. There are two files: testdic and testdic.out.
Many thanks for the beautifully commented script. I modified it slightly, as below, to take the input and output from the command line:

Code:
#!/usr/bin/perl
my $line;
my @inla;
my @outla;

The rest of the code remains the same.
I do not think this would affect accessing a UTF-8 file.
Many thanks once again
Attached Files
File Type: zip testdata.zip (2.5 KB, 11 views)
#6 12-07-2012, drl (Forum Advisor)
Hi.
Quote:
Originally Posted by gimley View Post
... Does PERL give problems with Unicode? ...
You might want to start with perldoc perlunitut, then man perlunicode.

You seem to be using Windows. I have used the utf8 facilities on GNU/Linux systems, but I have no idea whether that might be available in/with ActiveState Perl.

Doing an advanced search here for perl utf8 yields about 50 hits, some of which may be useful.

Best wishes ... cheers, drl

#7 12-07-2012, gimley (Registered User)
Hello,
I found the problem: the BOM (byte order mark).
Under Windows, a UTF-8 file normally starts with a BOM (U+FEFF), as is standard for UTF-8 files on Windows systems. I concede that it is legal for them to do so, but it is utterly pointless, since the byte order is determined by the formal specification of the UTF-8 representation itself. And it just so happens that, unlike the rest of UTF-8, an initial BOM will trip up a Unix system. And Perl is supposed to be
Quote:
"an oasis of Unix culture in the desert of can't-get-there-from-here" (Larry Wall, probably slightly misquoted).
Using a hex editor I removed the FEFF and it worked like a charm.

On Linux you should have no problem, since this aberration does not exist on a Unix system.

Many thanks for trying to solve the mystery.

As an aid to all of us who suffer the tyranny of the WinOS system, here is a useful link:
Code:
http://www.perlmonks.org/?node_id=599720

This offers two solutions to the problem. Googling
Quote:
"perl bom" or "perl File::BOM"
comes up with more if needed.