Removing dupes within 2 delimited areas in a large dictionary file

12-05-2012

Hello,
I have a very large dictionary file which is in text format and which contains a large number of sub-sections. Each sub-section starts with the following header :

Code:

#DATA
#VALID 1   [this could also be 0 instead of 1]

and ends with a footer as shown below

Code:

#END

The data between the Header and the Footer consists of words, each word on a separate line.
However, because the file is so large, words are often repeated within a section, so the file ends up with dupes.
What I need is a Perl or AWK script that can identify the header and the footer, find the data between them, and sort that data while removing all duplicates.
A sample input and the expected output are given below. The examples are in English because the real data is in Perso-Arabic script. Case is not an issue since the language does not have case. All data is in Unicode (UTF-16), but I can convert it to UTF-8.
Could you please comment the script so that I can learn how to identify headers and footers within a database and then sort the data between them, removing dupes?
Many thanks in advance for the help and also the learning experience.

_____Sample Input

Code:

#DATA
#VALID 1
a
a
a
a
all
an
and
and
and
are
are
are
as
awk
below
case
case
could
data
data
data
data
does
dupes
duplicates
ends
english
examples
file
find
footer
from
given
given
happens
have
header
however
i
identify
in
input
is
is
is
issue
it
language
large
need
not
not
of
or
output
perl
perso-arabic
real
removing
repeated
result
sample
script
script
section
since
since
so
sort
that
the
the
the
the
the
the
the
the
the
them
time
up
what
which
which
with
within
within
words
#END

___________Expected output

Code:

#DATA
#VALID 1
a
all
an
and
are
as
awk
below
case
could
data
does
dupes
duplicates
ends
english
examples
file
find
footer
from
given
happens
have
header
however
i
identify
in
input
is
issue
it
language
large
need
not
of
or
output
perl
perso-arabic
real
removing
repeated
result
sample
script
section
since
so
sort
that
the
them
time
up
what
which
with
within
words
#END

_____Sample ends

Last edited by Scrutinizer; 12-07-2012 at 04:03 AM.. Reason: code tags

gimley

12-05-2012

Perl script:

Code:

use strict;
use warnings;

my $in_file     =  '/temp/tmp/t'; # file contains your example data
my $out_file    =  '/temp/tmp/new_file.txt'; # New file with dups removed
my $line;
my @inla;
my @outla;

open ( my $in_file_fh,  '<', $in_file  ) or die "Can't open $in_file $!\n";
open ( my $out_file_fh, '>', $out_file ) or die "Can't open $out_file $!\n";

DATA: while ( $line = <$in_file_fh> ) {  # Read input file
    # If line starts with '#DATA' write to out file
    # also write to out file the next line which is a '#VALID x' line
    if ( $line =~ /^\#DATA/ ) {
        print $out_file_fh $line;
        $line = <$in_file_fh>;
        print $out_file_fh $line;
        # Start every section with an empty array, so words from an
        # earlier section do not leak into this one
        @inla = ();
        # Read lines until '#END' line is read
        while ( $line = <$in_file_fh> ) {
            if ( $line =~ /^\#END/ ) {
                # Create an anonymous hash from input lines in array (@inla) which removes duplicates
                # and place results in array: @outla
                @outla = keys %{{ map { $_ => 1 } @inla }};
                # write sorted out array to out file
                print $out_file_fh ( sort @outla );
                # write '#END' line to out file
                print $out_file_fh $line;
                # leave inner loop and go back to main loop for the next section
                # ('last DATA' would stop after the first section)
                next DATA;
            }
            # Load lines between the '#VALID x' line and the '#END' line into array
            push @inla, $line;
        }
    }
}

close $out_file_fh or die "Can't close $out_file $!\n";

Code:

$ cat new_file.txt
#DATA
#VALID 1
a
all
an
and
are
as
awk
below
case
could
data
does
dupes
duplicates
ends
english
examples
file
find
footer
from
given
happens
have
header
however
i
identify
in
input
is
issue
it
language
large
need
not
of
or
output
perl
perso-arabic
real
removing
repeated
result
sample
script
section
since
so
sort
that
the
them
time
up
what
which
with
within
words
#END

spacebar

12-06-2012

Many thanks. Am out at present. Will run the perl script and get back to you.

gimley

12-06-2012

You could use something like this:

Code:

awk '/DATA/,/END/'

to get data between DATA and END, then sort it.
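For comparison, Perl's range (flip-flop) operator `..` plays the same role as the awk range pattern above. What follows is only a minimal sketch, not the method used elsewhere in this thread: the helper name `dedupe_sections` is hypothetical, the patterns are anchored as `^#DATA` and `^#END` (an assumption about the file layout), and the input is assumed to be well formed, with every `#DATA` closed by an `#END`.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of lines and returns them with the
# words of every #DATA ... #END section sorted and de-duplicated.
# Lines outside any section are passed through unchanged.
sub dedupe_sections {
    my @out;
    my %seen;    # words collected for the current section
    for my $line (@_) {
        # In scalar context '..' is false outside a section; inside it
        # returns a sequence number, with "E0" appended on the last line.
        if ( my $pos = ( $line =~ /^#DATA/ .. $line =~ /^#END/ ) ) {
            if ( $line =~ /^#(?:DATA|VALID)/ ) {    # header lines: copy as-is
                push @out, $line;
            }
            elsif ( $pos =~ /E0$/ ) {               # footer: flush the section
                push @out, sort( keys %seen ), $line;
                %seen = ();
            }
            else {
                $seen{$line} = 1;                   # hash keys are unique
            }
        }
        else {
            push @out, $line;
        }
    }
    return @out;
}

# Example use on a small in-memory list of lines
print dedupe_sections( "#DATA\n", "#VALID 1\n", "the\n", "a\n", "the\n", "#END\n" );
```

The sort here is Perl's default string sort, which may or may not be the order wanted for a particular script or language.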

Jotne

12-07-2012

Hello,
Sorry, my broadband was down and I could not check out the Perl script. It works beautifully on ASCII data (8-bit). As soon as it is given UTF-8 or UTF-16 data, no output is produced.
Does PERL give problems with Unicode?
Since my data is in Perso-Arabic, the script does not work for me.
Is there any workaround? I am using the latest version of ActiveState Perl and in despair even downloaded Strawberry Perl, but the script still does not work on my data.
I am attaching a zip file containing data in UTF-8 format, with Hindi as an example. There are two files: testdic and testdic.out.
Many thanks for the beautifully commented script. I modified it slightly as under to take input and output from the command line:

Code:

#!/usr/bin/perl
my $line;
my @inla;
my @outla;

The rest of the code remains the same.
I do not think this would affect accessing a UTF8 file.
Many thanks once again

testdata.zip (2.5 KB)

gimley

12-07-2012

Hi.

Quote:

Originally Posted by gimley

... Does PERL give problems with Unicode? ...

You might want to start with: perldoc perlunitut, then man perlunicode

You seem to be using Windows. I have used the utf8 facilities on GNU/Linux systems, but I have no idea whether that might be available in/with ActiveState Perl.

Doing an advanced search here for perl utf8 yields about 50 hits, some of which may be useful.
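As a rough illustration of those facilities, here is a minimal sketch of converting UTF-16 data to UTF-8 with the core Encode module. The helper name `utf16_to_utf8` is hypothetical, and the sketch assumes the UTF-16 data begins with a BOM, which the 'UTF-16' decoder needs in order to pick the byte order.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes raw UTF-16 bytes (starting with a BOM)
# and returns the same text as UTF-8 bytes, without a BOM.
sub utf16_to_utf8 {
    my ($utf16_bytes) = @_;
    my $text = decode( 'UTF-16', $utf16_bytes );    # bytes -> Perl characters
    return encode( 'UTF-8', $text );                # characters -> UTF-8 bytes
}

# Example: "#DATA" stored as little-endian UTF-16 with its BOM (FF FE)
my $sample = "\xFF\xFE" . join( '', map { $_ . "\x00" } split //, '#DATA' );
print utf16_to_utf8($sample), "\n";
```

When reading from a file rather than a string, an I/O layer such as `open( my $fh, '<:encoding(UTF-16)', $file )` does the decoding as the lines are read.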

Best wishes ... cheers, drl

( Edit 1: add note about advanced search )

Last edited by drl; 12-07-2012 at 07:43 AM..

drl

12-07-2012

Hello,
I found the problem: the BOM (byte order mark).
Under Windows a UTF-8 file normally starts with a BOM (U+FEFF). I concede that it is legal to do so, but it is utterly pointless, since the byte order is fixed by the formal specification of the UTF-8 representation itself. And it just happens that, unlike the rest of UTF-8, an initial BOM will trip up Unix-style tools. And Perl is supposed to be

Quote:

"an oasis of Unix culture in the desert of can't-get-there-from here" (Larry Wall, probably slightly misquoted).

Using a hex editor I removed the BOM (U+FEFF, stored in a UTF-8 file as the bytes EF BB BF) and it worked like a charm.
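The same clean-up can be done in the script instead of with a hex editor. This is only a minimal sketch, and the helper name `strip_bom` is hypothetical. The likely reason the script printed nothing is that the BOM sits in front of `#DATA` on the first line, so a pattern anchored as `/^\#DATA/` never matches that line.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a leading BOM from a line (normally the
# first line of the file) and return the cleaned line.
sub strip_bom {
    my ($line) = @_;
    $line =~ s/^\x{FEFF}//;        # BOM as a character (file read through :encoding(UTF-8))
    $line =~ s/^\xEF\xBB\xBF//;    # BOM as raw bytes (file read without an encoding layer)
    return $line;
}

# Example: a first line as read in raw bytes from a Windows UTF-8 file
my $first = "\xEF\xBB\xBF#DATA\n";
print "header found\n" if strip_bom($first) =~ /^\#DATA/;
```

In a script that reads the file line by line, the helper only needs to be applied to the first line, for instance when `$.` is 1.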

On Linux you should have no problem, since this aberration does not exist in a Unix system.

Many thanks for trying to solve the mystery.

As an aid to all of us who suffer the tyranny of the WinOS system, here is a useful link:

HTML Code:

http://www.perlmonks.org/?node_id=599720.

This offers two solutions for the problem. Googling

Quote:

"perl bom" or "perl File::BOM"

comes up with more if needed.

gimley
