Removing dupes within 2 delimited areas in a large dictionary file
Hello,
I have a very large dictionary file in text format which contains a large number of sub-sections. Each sub-section starts with the following header:
and ends with a footer as shown below
The data between the Header and the Footer consists of words, each word on a separate line.
However, given the volume of data, words are often repeated within a section, so the file ends up with duplicates.
What I need is a Perl or AWK script which can identify the header and the footer, find the data between them, and sort that data, removing all duplicates.
A sample input and output are given below. The examples use English, since the real data is in Perso-Arabic script. Case is not an issue, since the language does not have case. All data is in Unicode (UTF-16), but I can convert it to UTF-8.
Could you please comment the script, so that I can learn how to identify headers and footers in a database and then sort the data between them, removing duplicates?
Many thanks in advance for the help, and for the learning experience.
Hi - I tried to remove ^M in a delimited file using tr -d '\r' and sed 's/^M//g', but neither works quite right. While the ^M is removed, each record is still cut in half, like
a,b, c
c,d,e
The delimited file is generated by a sh script outputting a SQL query result to... (7 Replies)
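If stripping the carriage returns still leaves a record split across two physical lines, the break is usually a literal newline inside the record. A hedged sketch, assuming every complete record has exactly 5 comma-separated fields (NFIELDS is an assumption; adjust it to the real layout):

```shell
printf 'a,b,c,\nd,e\nf,g,h,i,j\n' > broken.csv   # first record split in two
tr -d '\r' < broken.csv | awk -v NFIELDS=5 '
{
    buf = (buf == "") ? $0 : buf $0              # glue the next fragment on
    if (split(buf, f, ",") >= NFIELDS) {         # record complete?
        print buf
        buf = ""
    }
}' > fixed.csv
```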
Hi Experts
I am very new to Perl and need to write a script with it.
I would like to remove blanks in a tab-delimited text file in a specific column range (column 21 to column 43). Sample input and output are shown below:
Input:
117 102 650 652 654 656
117 93 95... (3 Replies)
Hey there - a bit of background on what I'm trying to accomplish, first off. I am trying to load the data from a pipe delimited file into a database. The loading tool that I use cannot handle embedded newline characters within a field, so I need to scrub them out.
Solutions that I have tried... (7 Replies)
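One scrubbing approach, sketched under two assumptions (the real values go in NFIELDS and the join string): every complete record has exactly 4 pipe-delimited fields, and an embedded newline should become a space.

```shell
printf 'a|b|part one\npart two|d\ne|f|g|h\n' > raw.txt
awk -v NFIELDS=4 '
{
    buf = (buf == "") ? $0 : buf " " $0      # embedded newline -> space
    if (split(buf, f, "|") >= NFIELDS) {     # enough fields: record done
        print buf
        buf = ""
    }
}' raw.txt > clean.txt
```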
I have a large flat file with variable length fields that are pipe delimited. The file has no new line or CR/LF characters to indicate a new record. I need to parse the file and after some number of fields, I need to insert a CR/LF to start the next record.
Input file ... (2 Replies)
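A sketch, assuming each record is exactly 3 pipe-delimited fields: read the stream with "|" as the record separator and end a line after every third field. Use "\r\n" in place of "\n" if a literal CR/LF pair is required.

```shell
printf 'a|b|c|d|e|f|g|h|i' > flat.txt                  # no newlines at all
awk 'BEGIN { RS = "|" }
     { printf "%s%s", $0, (NR % 3 ? "|" : "\n") }' flat.txt > records.txt
```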
Hi All
I wanted to know how to efficiently delete some columns in a large tab-delimited file.
I have a file that contains 5 columns and almost 100,000 rows
3456 f g t t
3456 g h
456 f h
4567 f g h z
345 f g
567 h j k l
This is a very large data file and tab delimited.
I need... (2 Replies)
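For a fixed set of columns, cut is usually the simplest and fastest tool (tab is its default delimiter). Here columns 2 and 3 are dropped as an example; rows with fewer fields, like those in the sample data, just emit the fields they have:

```shell
printf '3456\tf\tg\tt\tt\n456\tf\th\n' > data.tsv
cut -f1,4,5 data.tsv > trimmed.tsv      # keep columns 1, 4 and 5
```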
Since there are approximately 75K gsfiles and hundreds of stfiles per gsfile, this script can take hours. How can I rewrite this script, so that it's much faster? I'm not as familiar with perl but I'm open to all suggestions.
ls file.list > $split
for gsfile in `cat $split`;
do
csplit... (17 Replies)
Hi,
I have the following command in place
nawk -F, '!a[$1,$2,$3]++' file > file.uniq
It has been working perfectly as per requirements, by removing duplicates by taking into consideration only first 3 fields. Recently it has started giving below error:
bash-3.2$ nawk -F, '!a[$1,$2,$3]++'... (17 Replies)
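For reference, the post says the command dedupes on the first three fields; the usual form of that idiom indexes an array by those fields, `++` returns the previous count, and `!` makes the pattern true only the first time a combination is seen, so only that line is printed (nawk on Solaris, plain awk elsewhere):

```shell
printf 'x,y,z,1\nx,y,z,2\nx,q,z,3\n' > file
awk -F, '!a[$1,$2,$3]++' file > file.uniq   # print first line per $1/$2/$3 combo
```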
I am working on a homonym dictionary of names, i.e. names which are clustered together according to their “sound-alike” pronunciation:
An example will make this clear:
Since the dictionary is manually constructed it often happens that inadvertently two sets of “homonyms” which should be grouped... (2 Replies)
I have a file around 24 GB in size with 14 columns, delimited with "|".
My requirement: can anyone suggest the fastest and best way to get the results below?
Number of records in the file
First column and second column: unique counts
Thanks for your time
Karti
------ Post updated at... (3 Replies)
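A single awk pass can produce all three numbers, shown here on a toy file (hedged: memory use grows with the number of distinct values in each column, which may matter at 24 GB):

```shell
printf 'a|1|x\na|2|y\nb|1|z\n' > big.txt
awk -F'|' '
    { n++ }
    !($1 in c1) { c1[$1]; u1++ }     # first sighting of a column-1 value
    !($2 in c2) { c2[$2]; u2++ }     # first sighting of a column-2 value
    END {
        print "records:", n
        print "unique col1:", u1
        print "unique col2:", u2
    }' big.txt > counts.txt
```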
I have a large file 1.5 gb and want to sort the file.
I used the following AWK script to do the job
!x[$0]++
The script works but it is very slow and takes over an hour to do the job. I suspect this is because the file is not sorted.
Any solution to speed up the AWK script, or a Perl script, would... (4 Replies)
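If the output does not have to preserve first-occurrence order, sort -u is usually the faster route for a file this size: unlike the awk hash, which must hold every distinct line in RAM, sort(1) spills to temporary files. Forcing the C locale also skips slow locale-aware collation (a judgment call, since it changes the ordering of non-ASCII text):

```shell
printf 'b\na\nb\nc\na\n' > big.txt
LC_ALL=C sort -u big.txt > big.uniq   # sorted, duplicates removed
```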
Discussion started by: gimley
LEARN ABOUT CENTOS
cracklib-format
cracklib-format(8) Debian GNU/Linux manual cracklib-format(8)NAME
cracklib-format, cracklib-packer, cracklib-unpacker - cracklib dictionary utilities
SYNOPSIS
cracklib-format file ...
cracklib-packer cracklib_dictpath
cracklib-unpacker cracklib_dictpath
DESCRIPTION
cracklib-format takes a list of text files, each containing a list of words, one per line. It lowercases all words, removes control characters, and sorts the lists. It outputs the cleaned-up list to standard output. The text files may optionally be compressed with gzip(1).
If you supply massive amounts of text to cracklib-format you must have enough free space available for use by the sort(1) command. If you do not have 20Mb free in /var/tmp (or whatever temporary area your sort(1) command uses), have a look at the /usr/sbin/cracklib-format program, which is a sh(1) program. You can usually tweak the sort(1) command to use any large area of disk you desire, by use of the -T option. cracklib-format has a hook for this.
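The cleanup described above can be approximated with a short pipeline. This is only a sketch of the behaviour, not the actual cracklib-format implementation: it strips just CR and TAB rather than every control character, and the -T argument is an example path.

```shell
printf 'Apple\r\nBANANA\napple\n' > words.txt
tr -d '\r\t' < words.txt \
    | tr '[:upper:]' '[:lower:]' \
    | sort -T /tmp > cleaned.txt        # -T picks the temporary area, as above
```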
cracklib-packer reads from standard input a list of sorted and cleaned words and creates a database in the directory and prefix given by the command line argument cracklib_dictpath. Three files are created with the suffixes .hwm, .pwd, and .pwi. These three files are in the format that the FascistCheck(3) subroutine and the cracklib-unpacker(8) and cracklib-check(8) utilities understand. The number of words read and written are printed on stdout(3).
cracklib-unpacker reads from the database in the directory and prefix given by the command line argument cracklib_dictpath and outputs on
standard output the list of words that make up the database.
The database is in a binary format generated by the utilities cracklib-format(8) and cracklib-packer(8). On a Debian system the database is located in the directory /var/cache/cracklib/cracklib_dict and is generated daily with the program /etc/cron.daily/cracklib. The location is also defined in the header file crack.h using the constant CRACKLIB_DICTPATH, though none of the subroutines in the cracklib libraries have this location hardcoded into their implementations.
FILES
/var/cache/cracklib/cracklib_dict.[hwm|pwd|pwi]
cracklib dictionary database files used by utilities.
/etc/cron.daily/cracklib
cracklib daily cron program to rebuild the cracklib dictionary database.
/etc/cracklib/cracklib.conf
cracklib configuration file used by the cracklib daily cron program to rebuild the cracklib dictionary database.
/usr/include/crack.h
cracklib header file defining the subroutine FascistCheck(3) and the constant CRACKLIB_DICTPATH used to compile in the location of
the cracklib dictionary database for these utilities.
/usr/sbin/cracklib-format
cracklib shell script to create initial list of words for dictionary database.
SEE ALSO
FascistCheck(3), cracklib-check(8), update-cracklib(8), create-cracklib-dict(8)
/usr/share/doc/libcrack2/libcrack2.html
/usr/share/doc/cracklib-runtime/cracklib-runtime.html
AUTHOR
cracklib2 is written by Alec Muffett <alecm@crypto.dircon.co.uk>. Manual added by Jean Pierre LeJacq <jplejacq@quoininc.com>.
2.7-8.5 Sat Jun 21 22:43:12 CEST 2008 cracklib-format(8)