04-13-2012
Quote:
Originally Posted by methyl
The nawk in post #1 and the sort in post #2 give different results on my system. This is because the nawk always keeps the first of the duplicate key records but the sort selects a random one.
Shell tools were never designed to process multi-gigabyte files. Do you have a database engine and access to a programmer?
Yes, I see that behavior too, so the OP would have to stick with [n]awk to make sure that the first of the duplicates is written out instead of a random one.
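For reference, the usual [n]awk idiom for this is a one-liner like the sketch below. Field 1 as the key and the file names are assumptions purely for illustration; the real key and separator would come from post #1, which is not quoted here.

  awk '!seen[$1]++' infile > outfile    # prints only the first record seen for each key

The catch is that the seen[] array holds every distinct key in memory, so on a multi-gigabyte file it can exhaust RAM, which is why sort keeps coming up despite its arbitrary choice of survivor.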
10 More Discussions You Might Find Interesting
1. Shell Programming and Scripting
Hey Guys,
I have a file which looks like this:
Contig201#numbPA
Contig1452#nmdynD6PA
dm022p15.r#CG6461PA
dm005e16.f#SpatPA
IGU001_0015_A06.f#CG17593PA
I need to remove duplicates based on the characters matching up to '#'.
For example, if we consider this...
Contig201#numbPA... (4 Replies)
Discussion started by: sharatz83
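A minimal awk sketch for that kind of request, assuming '#' never occurs inside the key part and the first occurrence is the one to keep (file names are placeholders):

  awk -F'#' '!seen[$1]++' infile > outfile    # key = everything before the first '#'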
2. Shell Programming and Scripting
Hello,
I have two files. File1 or the master file contains two columns separated by a delimiter:
a=b
b=d
e=f
g=h
File 2, which is the file to be processed, has only a single column:
a
h
c
b
What I need is an awk script to identify unique names from file 2 which are not found in the... (6 Replies)
Discussion started by: gimley
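A possible awk sketch for that one, assuming the master file really is delimited by '=' and the goal is to print the names in file 2 that do not appear in column 1 of File1 (file names are placeholders):

  awk -F'=' 'NR==FNR {master[$1]; next} !($1 in master)' file1 file2

The first pass stores the left-hand keys of file1; the second pass prints any line of file2 whose name was not stored.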
3. Shell Programming and Scripting
Hello,
I have a large amount of data with the following structure:
Word=Transliterated word
I have written a Perl script (reproduced below) which goes through the full file and identifies all dupes on the right-hand side. It successfully creates a new file with two headers: Singletons and Dupes.... (5 Replies)
Discussion started by: gimley
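The Perl script itself is not quoted here, but the same check can be sketched in awk, assuming '=' separates the two sides and the aim is just to list every right-hand side that occurs more than once (the file name is a placeholder):

  awk -F'=' '{count[$2]++} END {for (w in count) if (count[w] > 1) print w}' wordlist.txt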
4. Shell Programming and Scripting
Hi,
I have a file that is 430K lines long. It has records like below
|site1|MAP
|site2|MAP
|site1|MODAL
|site2|MAP
|site2|MODAL
|site2|LINK
|site1|LINK
My task is to count the number of times MAP, MODAL and LINK occur for a single site and write new records like the ones below to a new file
... (5 Replies)
Discussion started by: reach.sree@gmai
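A sketch of one way to build such counts, assuming the leading '|' makes the site field 2 and the keyword field 3; the exact output format the poster wanted is cut off above, so the print below is only a guess (file names are placeholders):

  awk -F'|' '{n[$2 "|" $3]++} END {for (k in n) print k "|" n[k]}' infile > outfile    # e.g. site2|MAP|2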
5. Shell Programming and Scripting
Hi, I want to fetch 100k records from a file which looks like the one below.
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
... (17 Replies)
Discussion started by: lathigara
6. Shell Programming and Scripting
Hi,
I have a file which looks like this: chr1 11127067 11132181 89 chr1 11128023 11128311 chr1 11130990 11131025 chr1 11127067 11132181 89 chr1 11128023 11128311 chr1 11131583... (22 Replies)
Discussion started by: Amit Pande
7. Shell Programming and Scripting
Hello,
I have a very large dictionary file, in text format, which contains a large number of sub-sections. Each sub-section starts with the following header:
#DATA
#VALID 1
and ends with a footer as shown below
#END
The data between the Header and the Footer consists of... (6 Replies)
Discussion started by: gimley
8. UNIX for Advanced & Expert Users
I'm trying to remove duplicate data from an unsorted input file of size >50GB and write the unique records to a new file.
I have already tried out a variety of options posted in similar threads/forums, but no luck so far.
Any suggestions please ?
Thanks !! (9 Replies)
Discussion started by: Kannan K
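At that size the usual awk '!seen[$0]++' trick tends to run out of memory, so if whole lines are the duplicates and the original order does not have to be preserved, an external sort is the common fallback. The temp directory, buffer size and file names below are placeholders, and -T/-S are GNU sort options:

  LC_ALL=C sort -u -T /big/tmp -S 4G input_file > unique_records

LC_ALL=C avoids slow locale-aware collation, -T points the temporary files at a filesystem with enough free space, and -S caps the in-memory buffer.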
9. Shell Programming and Scripting
Dear all,
I have a large dictionary database which has the following structure
source word=target word
e.g.
book=livre
Since the database is very large, in spite of all the care taken it happens at times that the source word is repeated
e.g.
book=livre
book=tome
Since I want to... (7 Replies)
Discussion started by: gimley
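The post is cut off, but if the first step is simply to list the source words that occur more than once, a small awk sketch (the file name is a placeholder) could be:

  awk -F'=' 'seen[$1]++ == 1 {print $1}' dictionary.txt    # prints each repeated source word once, on its second occurrence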
10. Shell Programming and Scripting
I am trying to remove whitespace from a file containing sample data such as:
457 <EOFD> Mar 1 2007 12:00:00:000AM <EOFD> Mar 31 2007 12:00:00:000AM <EOFD> system <EORD> 458 <EOFD> Mar 1 2007 12:00:00:000AM<EOFD>agf <EOFD> Apr 20 2007 9:10:56:036PM <EOFD> prodiws<EORD> . Basically these... (11 Replies)
Discussion started by: amvip
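Assuming the goal is to strip the blanks immediately around the <EOFD> and <EORD> markers while leaving the spaces inside the field values alone, a sed sketch (file names are placeholders) might be:

  sed 's/[[:blank:]]*<EOFD>[[:blank:]]*/<EOFD>/g; s/[[:blank:]]*<EORD>[[:blank:]]*/<EORD>/g' infile > outfile

[[:blank:]] covers both spaces and tabs around the markers.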
UNIQ(1) BSD General Commands Manual UNIQ(1)
NAME
uniq -- report or filter out repeated lines in a file
SYNOPSIS
uniq [-cdu] [-f fields] [-s chars] [input_file [output_file]]
DESCRIPTION
The uniq utility reads the standard input comparing adjacent lines, and writes a copy of each unique input line to the standard output. The
second and succeeding copies of identical adjacent input lines are not written. Repeated lines in the input will not be detected if they are
not adjacent, so it may be necessary to sort the files first.
The following options are available:
-c Precede each output line with the count of the number of times the line occurred in the input, followed by a single space.
-d Don't output lines that are not repeated in the input.
-f fields
Ignore the first fields in each input line when doing comparisons. A field is a string of non-blank characters separated from adjacent fields by blanks. Field numbers are one based, i.e. the first field is field one.
-s chars
Ignore the first chars characters in each input line when doing comparisons. If specified in conjunction with the -f option, the
first chars characters after the first fields fields will be ignored. Character numbers are one based, i.e. the first character is
character one.
-u Don't output lines that are repeated in the input.
If additional arguments are specified on the command line, the first such argument is used as the name of an input file, the second is used
as the name of an output file.
The uniq utility exits 0 on success, and >0 if an error occurs.
COMPATIBILITY
The historic +number and -number options have been deprecated but are still supported in this implementation.
SEE ALSO
sort(1)
STANDARDS
The uniq utility is expected to be IEEE Std 1003.2 (``POSIX.2'') compatible.
BSD                                January 6, 2007                                BSD
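As the DESCRIPTION says, uniq only compares adjacent lines, so unsorted input is normally sorted first. A few typical invocations (data.txt is just a placeholder name):

  sort data.txt | uniq        # one copy of every line
  sort data.txt | uniq -c     # each line prefixed with its count
  sort data.txt | uniq -d     # only the lines that occurred more than once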