Removing dupes within 2 delimited areas in a large dictionary file
Hello,
I have a very large dictionary file which is in text format and which contains a large number of sub-sections. Each sub-section starts with the following header :
and ends with a footer as shown below
The data between the Header and the Footer consists of words, each word on a separate line.
However, because the file is so large, words are often repeated within a section, so the file ends up with dupes.
What I need is a Perl or AWK script which can identify the header and the footer, find the data between them, and sort that data while removing all duplicates.
A sample input and output are given below. The examples are in English, although the real data is in Perso-Arabic script. Case is not an issue since the script has no case distinction. All data is in Unicode (UTF-16), but I can convert it to UTF-8.
Could you please comment the script so that I can learn how to identify headers and footers within a database and then sort the data between them, removing dupes?
Many thanks in advance for help and also the learning experience
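The logic asked for above can be sketched as a small state machine. This is only an illustration under stated assumptions: the real header and footer strings are not shown in the post, so `=BEGIN=` and `=END=` are hypothetical placeholders, and the function name `dedupe_sections` is made up for this example. Python is used here purely to show the logic; the same approach (a flag that is set at the header and cleared at the footer, plus a set or hash of words seen) ports directly to Perl or awk.

```python
def dedupe_sections(lines, header, footer):
    """Return the lines with each header..footer section sorted and de-duplicated.

    `header` and `footer` are the exact marker lines (assumed, not taken from
    the original post). Lines outside any section are passed through unchanged.
    """
    out = []
    words = set()     # unique words of the section currently being read
    inside = False    # True while we are between a header and its footer

    for line in lines:
        text = line.rstrip("\r\n")          # tolerate LF and CR/LF endings
        if not inside and text == header:
            inside = True                   # a new section starts here
            words = set()
            out.append(text)
        elif inside and text == footer:
            out.extend(sorted(words))       # sorted by Unicode code point
            out.append(text)
            inside = False                  # section finished
        elif inside:
            if text:                        # blank lines in a section are dropped
                words.add(text)             # the set silently ignores repeats
        else:
            out.append(text)                # text outside sections is untouched

    if inside:                              # file ended before a footer was seen
        out.extend(sorted(words))
    return out


if __name__ == "__main__":
    sample = ["=BEGIN=", "pear", "apple", "pear", "=END=",
              "=BEGIN=", "fig", "fig", "=END="]
    for row in dedupe_sections(sample, "=BEGIN=", "=END="):
        print(row)
```

Only one section is held in memory at a time, so the total file size matters less than the size of the largest section. For the real data, opening the file with `open(path, encoding="utf-16")` should let Python decode it without a separate conversion step. Note that `sorted` orders by code point, which may not match dictionary order for Perso-Arabic text.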
Discussion started by: gimley
4 Replies
nl
NL(1) BSD General Commands Manual NL(1)
NAME
nl -- line numbering filter
SYNOPSIS
nl [-p] [-b type] [-d delim] [-f type] [-h type] [-i incr] [-l num] [-n format] [-s sep] [-v startnum] [-w width] [file]
DESCRIPTION
The nl utility reads lines from the named file or the standard input if the file argument is omitted, applies a configurable line numbering
filter operation and writes the result to the standard output.
The nl utility treats the text it reads in terms of logical pages. Unless specified otherwise, line numbering is reset at the start of each
logical page. A logical page consists of a header, a body and a footer section; empty sections are valid. Different line numbering options
are independently available for header, body and footer sections.
The starts of logical page sections are signaled by input lines containing nothing but one of the following sequences of delimiter characters:
Line "Start of"
\:\:\: header
\:\: body
\: footer
If the input does not contain any logical page section signaling directives, the text being read is assumed to consist of a single logical
page body.
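The section rules described above can be sketched as follows. This is an illustration only, not the nl implementation: it assumes the default two-character delimiter `\:`, the function name `classify_lines` is hypothetical, and for simplicity the delimiter lines themselves are dropped from the result.

```python
def classify_lines(lines, delim="\\:"):
    """Tag each input line with the logical page section it belongs to.

    A line consisting solely of the delimiter repeated three, two, or one
    times starts a header, body, or footer section respectively.
    """
    markers = {
        delim * 3: "header",
        delim * 2: "body",
        delim: "footer",
    }
    section = "body"    # with no directives, all text is one logical page body
    tagged = []
    for line in lines:
        text = line.rstrip("\n")
        if text in markers:             # the whole line must be the marker
            section = markers[text]
            continue                    # simplification: marker lines are skipped
        tagged.append((section, text))
    return tagged


if __name__ == "__main__":
    page = ["\\:\\:\\:", "Title", "\\:\\:", "first", "second", "\\:", "The end"]
    for section, text in classify_lines(page):
        print(section, text)
```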
The following options are available:
-b type
Specify the logical page body lines to be numbered. Recognized type arguments are:
a Number all lines.
t Number only non-empty lines.
n No line numbering.
pexpr Number only those lines that contain the basic regular expression specified by expr.
The default type for logical page body lines is t.
-d delim
Specify the delimiter characters used to indicate the start of a logical page section in the input file. At most two characters may
be specified; if only one character is specified, the first character is replaced and the second character remains unchanged. The
default delim characters are ``\:''.
-f type
Specify the same as -b type except for logical page footer lines. The default type for logical page footer lines is n.
-h type
Specify the same as -b type except for logical page header lines. The default type for logical page header lines is n.
-i incr
Specify the increment value used to number logical page lines. The default incr value is 1.
-l num If numbering of all lines is specified for the current logical section using the corresponding -b a, -f a or -h a option, specify the
number of adjacent blank lines to be considered as one. For example, -l 2 results in only the second adjacent blank line being num-
bered. The default num value is 1.
-n format
Specify the line numbering output format. Recognized format arguments are:
ln Left justified.
rn Right justified, leading zeros suppressed.
rz Right justified, leading zeros kept.
The default format is rn.
-p Specify that line numbering should not be restarted at logical page delimiters.
-s sep Specify the characters used in separating the line number and the corresponding text line. The default sep setting is a single tab
character.
-v startnum
Specify the initial value used to number logical page lines; see also the description of the -p option. The default startnum value
is 1.
-w width
Specify the number of characters to be occupied by the line number; in case the width is insufficient to hold the line number, it
will be truncated to its width least significant digits. The default width is 6.
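The interaction between the -n format and -w width options can be sketched like this. It is a minimal model of the behavior described above, not code from nl itself; the function name `format_number` and its parameters are assumptions made for the example.

```python
def format_number(number, fmt="rn", width=6):
    """Render a line number the way the -n and -w descriptions specify.

    fmt is one of "ln", "rn", "rz"; width is the field width in characters.
    """
    digits = str(number)
    if len(digits) > width:
        digits = digits[-width:]            # keep the least significant digits
    if fmt == "ln":
        return digits.ljust(width)          # left justified, padded with spaces
    if fmt == "rn":
        return digits.rjust(width)          # right justified, no leading zeros
    if fmt == "rz":
        return digits.rjust(width, "0")     # right justified, zeros kept
    raise ValueError("unknown format: " + fmt)


if __name__ == "__main__":
    for fmt in ("ln", "rn", "rz"):
        # Brackets make the padding visible in the printed output.
        print(fmt, "[" + format_number(42, fmt, 6) + "]")
```

In the real utility the formatted number is then followed by the -s separator (a tab by default) and the text of the line.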
EXIT STATUS
The nl utility exits 0 on success, and >0 if an error occurs.
SEE ALSO
pr(1)
STANDARDS
The nl utility conforms to X/Open Portability Guide Issue 4, Version 2 (``XPG4.2'') with the exception of not supporting the intermingling of
the file operand with the options, which the standard considers an obsolescent feature to be removed from a further issue.
HISTORY
The nl utility first appeared in AT&T System V Release 2 UNIX.
BSD February 15, 1999 BSD