Removing dupes within 2 delimited areas in a large dictionary file
Hello,
I have a very large dictionary file in plain-text format that contains a large number of sub-sections. Each sub-section starts with the following header:
and ends with a footer, as shown below:
The data between the Header and the Footer consists of words, each word on a separate line.
However, because the data set is so large, words often get repeated within a section, and the file ends up with dupes.
What I need is a Perl or awk script which can identify the header and the footer, find the data between them, and sort that data, removing all duplicates.
A sample input and output are given below. The examples are in English, since the real data is in Perso-Arabic script. Case is not an issue, since the language does not have case. All data is in Unicode: UTF-16, but I can convert it to UTF-8.
Could you please comment the script, so that I can learn how to identify headers and footers within a database and then sort the data between them, removing dupes?
Many thanks in advance for the help, and also for the learning experience.
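Since the actual header and footer lines did not survive in the post, here is only a rough sketch of one possible approach in awk. It assumes the file has already been converted to UTF-8, that the header and the footer are each a whole line matched exactly, and it uses the made-up markers `<HEADER>` and `<FOOTER>` as placeholders. The function name `dedupe_sections` is also just an illustrative choice. Sorting is handed to the external `sort -u`, and `LC_ALL=C` is assumed so that words are ordered by byte value, which for UTF-8 is the same as code point order.

```shell
# dedupe_sections HEADER FOOTER
# Reads text on stdin. Inside every HEADER ... FOOTER block the words are
# sorted and duplicates are dropped. All other lines pass through unchanged.
# HEADER and FOOTER are placeholder arguments, not the real markers.
dedupe_sections() {
    LC_ALL=C awk -v head="$1" -v foot="$2" '
        BEGIN { sorter = "sort -u" }      # external command that sorts and removes dupes

        # Header line: print it, then flush so it reaches the output
        # before anything that sort writes later.
        $0 == head { print; fflush(); in_section = 1; next }

        # Footer line: closing the pipe makes sort emit the sorted,
        # de-duplicated words of this section; then print the footer.
        $0 == foot && in_section { close(sorter); print; in_section = 0; next }

        # Word inside a section: hand it to sort instead of printing it.
        in_section { print | sorter; next }

        # Anything outside a section is copied as it is.
        { print }
    '
}
```

It could be called along these lines, where `dict.txt` is a hypothetical file name: `iconv -f UTF-16 -t UTF-8 dict.txt | dedupe_sections '<HEADER>' '<FOOTER>' > dict.clean.txt`. Because each section is piped to `sort`, no section has to fit into an awk array, at the cost of starting one `sort` process per section.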
Discussion started by: gimley
4 Replies