Removing dupes within 2 delimited areas in a large dictionary file
Hello,
I have a very large dictionary file in plain-text format that contains a large number of sub-sections. Each sub-section starts with the following header:
and ends with a footer as shown below
The data between the Header and the Footer consists of words, each word on a separate line.
However, because the data set is so large, words are sometimes repeated within a section, so the file ends up with dupes.
What I need is a Perl or AWK script that can identify the header and the footer, find the data between them, and sort that data, removing all duplicates.
A sample input and output are given below. The examples are in English, since the real data is in Perso-Arabic script. Case is not an issue, since the script has no case distinction. All data is in Unicode (UTF-16), but I can convert it to UTF-8.
Would it be possible to comment the script, so that I can learn how to identify headers and footers within a database and then sort the data between them, removing dupes?
Many thanks in advance for the help and for the learning experience.
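A minimal sketch of one way to approach this in AWK, offered as a starting point rather than a finished solution. The actual header and footer lines are not shown above, so `<HEADER>` and `<FOOTER>` are placeholder markers that would need to be replaced with the real ones, and `dedupe_sections` is a hypothetical function name. The sketch also assumes the file has already been converted to UTF-8 and that each marker occupies a whole line by itself.

```shell
#!/bin/sh
# Sketch: sort and de-duplicate the words inside each header/footer section.
# ASSUMPTION: "<HEADER>" and "<FOOTER>" are placeholders; the real marker
# lines are not shown in the post, so substitute them before use.
# ASSUMPTION: the input has already been converted to UTF-8.

dedupe_sections() {
    awk -v header="<HEADER>" -v footer="<FOOTER>" '
        BEGIN {
            # In the C locale sort compares bytes, which for UTF-8 is the
            # same as code point order; -u drops duplicate lines.
            sorter = "LC_ALL=C sort -u"
        }
        $0 == header {        # a section starts here
            print             # emit the header line unchanged
            fflush()          # flush awk output before sort writes to the same stdout
            inside = 1
            next
        }
        $0 == footer {        # the section ends here
            close(sorter)     # sort now prints the unique, sorted words
            inside = 0
            print             # emit the footer after the sorted words
            next
        }
        inside {              # a word inside a section: hand it to sort
            print | sorter
            next
        }
        { print }             # lines outside any section pass through as-is
    ' "$@"
}
```

It could be run as `dedupe_sections dictionary.txt > dictionary.sorted.txt` (both file names are hypothetical). A UTF-16 file could first be converted with something like `iconv -f UTF-16 -t UTF-8`. Handing each section to the external `sort` avoids holding a whole section in an AWK array, at the cost of starting one `sort` process per section.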
Discussion started by: gimley
4 Replies