Hello,
I have a script which removes duplicates in a database with a single delimiter.
The script is given below:
How do I modify the script to remove duplicates in a database with two delimiters?
A small pseudo-sample is given below.
I tried to modify the delimiter part in the script, but it resulted in totally garbled output.
Since the file is very large, normal editors cannot remove the dupes; hence the request.
Your description isn't clear enough to understand what you're trying to do.
I don't see any lines in your input that have a duplicated field (with = as the field separator). So there doesn't seem to be anything that needs to be done to remove duplicated fields in a line.
There is nothing in your code that makes any attempt to compare lines. If you were trying to remove duplicated lines, the = would have no relevance; just using:
sort -u file
would do that (assuming that your database is in a text file named file).
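For example, on a small hypothetical text file (the filename dict.txt and the sample entries are made up for illustration), sort -u emits each distinct line exactly once:

```shell
# Build a small sample file containing a duplicated line (hypothetical data).
printf 'book=livre\nbook=livre\npen=stylo\n' > dict.txt

# sort -u sorts the lines and keeps one copy of each distinct line.
sort -u dict.txt
```

Note that sort -u reorders the file as a side effect; if the original line order matters, an awk-based approach is needed instead.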
You haven't shown us what output you hope to produce from your sample "text" and you haven't told us what form it takes. (Is your sample stored in a text file, in an Oracle database file, in some other type of database, or something else???)
What output are you hoping to produce from your pseudo-sample?
I am sorry. I should have been more explicit.
My database has a word or a phrase followed by its part of speech and finally its meaning, each of which is separated by a delimiter, as can be seen in the example below:
It so happens that while compiling the dictionary, duplicates have crept into the database, and what I need is a tool to remove these duplicates. As a pseudo-example, here is a sample of the input:
The expected output would clean out all duplicates and store only unique strings, as shown in the output below:
I hope this clarifies the query. The script I had provided handled only one delimiter, and I wanted to know if the awk script could be modified to suit this issue. Many thanks.
I work in a Windows environment.
You didn't answer the question about what type of file is being processed! And that is even more important now that we know you're working on a Windows system (while posting your question in a forum devoted to UNIX and UNIX-like operating systems).
If you have awk, you must have installed some UNIX utilities on your Windows system. Did you try the sort command I suggested? If so, what did it do? If not, why not?
A common, easy way to remove duplicated lines using awk is:
awk '!x[$0]++' file
but, of course, that depends on file being a text file (as defined by UNIX systems); on a DOS-format file whose last line has no line terminator, awk may silently drop that last (incomplete) line.
This User Gave Thanks to Don Cragun For This Post:
I had tried this but had forgotten to save the file as a Unix file. The moment I saved it in Unix format, the duplicates were eliminated.
Many thanks for your patience and help.