Modify script to remove dupes with two delimiters


 
# 1  
Old 01-24-2017

Hello,
I have a script which removes duplicates in a database with a single delimiter
Code:
=

The script is given below:
Code:
# script to remove dupes from a row with structure word=word
BEGIN{FS="="}
{for(i=1;i<=NF;i++){a[$i]++;}for(i in a){b=b"="i}{sub("=","",b);$0=b;b="";delete a}}1
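The script above rebuilds each row from `for (i in a)`, and awk leaves the traversal order of that loop unspecified, so the surviving fields may come out reordered. A minimal sketch of an order-preserving variant follows; the function name `dedupe_fields` and the sample row are made up for illustration and are not part of the original script.

```shell
# Hypothetical helper: drop repeated fields within each row, keeping first-seen order.
dedupe_fields() {
  awk -F= '{
    split("", seen); out = ""; n = 0     # reset per-row state (portable array clear)
    for (i = 1; i <= NF; i++)
      if (!($i in seen)) {
        seen[$i]                         # remember this field
        out = (n++ ? out "=" : "") $i    # append it, with "=" between fields
      }
    print out
  }' "$@"
}

# Made-up sample row with one repeated field.
row_result=$(printf 'book=livre=book=tome\n' | dedupe_fields)
printf '%s\n' "$row_result"
```

Note that this only removes repeats inside a single row; it never compares one row with another.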

How do I modify the script to remove duplicates in a database where each row contains two such delimiters:
Code:
=

A small pseudo-sample is given below.
Code:
अ=m=Prefix signifying negation.
अ=m=Prefix signifying negation.
अँहँ=ind=Interjection expressing disapprobation.
अं=int=An interjection expressing contempt,unconcern,disbelief.
अंक=m=A figure;a mark.The thigh.An act of a play.
अंकगणित=n=Arithmetic.
अँहँ=ind=Interjection expressing disapprobation.
अं=int=An interjection expressing contempt,unconcern,disbelief.
अंक=m=A figure;a mark.The thigh.An act of a play.
अंकगणित=n=Arithmetic.

I tried to modify the delimiter part in the script using
Code:
{FS="=""*'"="}

But it resulted in totally garbled output.
Since the file is very large, ordinary editors cannot remove the dupes, hence this request.
# 2  
Old 01-24-2017
Your description isn't clear enough to understand what you're trying to do.

I don't see any lines in your input that have a duplicated field (with = as the field separator). So there doesn't seem to be anything that needs to be done to remove duplicated fields in a line.

There is nothing in your code that makes any attempt to compare lines. If you were trying to remove duplicated lines the = would have no relevance; just using:
Code:
sort -u file

would do that (assuming that your database is in a text file named file).
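A small sketch of what that command does, using a made-up three-line sample (ASCII stand-ins for the dictionary entries). `LC_ALL=C` is only there to make the sort order predictable; it is an assumption of this example, not something the suggestion above requires.

```shell
# Hypothetical sample file with one duplicated line.
sample=$(mktemp)
printf '%s\n' \
  'b=m=second entry.' \
  'a=m=first entry.' \
  'b=m=second entry.' > "$sample"

# sort -u keeps one copy of each distinct line, in sorted order.
deduped=$(LC_ALL=C sort -u "$sample")
printf '%s\n' "$deduped"
rm -f "$sample"
```

Because the output is sorted, the original line order is not preserved; that matters only if the file's order carries meaning.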

You haven't shown us what output you hope to produce from your sample "text" and you haven't told us what form it takes. (Is your sample stored in a text file, in an Oracle database file, in some other type of database, or something else?)

What output are you hoping to produce from your pseudo-sample?
# 3  
Old 01-24-2017
I am sorry. I should have been more explicit.
My database has a word or a phrase followed by its part of speech and then its meaning, each of which is delimited by
Code:
=

As can be seen in the example below:
Code:
अ=m=Prefix signifying negation.

It so happens that while compiling the dictionary, duplicates have crept into the database, and what I need is a tool to remove them. As a pseudo-example, here is a sample of the input:
Code:
अ=m=Prefix signifying negation.
अ=m=Prefix signifying negation.
अँहँ=ind=Interjection expressing disapprobation.
अं=int=An interjection expressing contempt,unconcern,disbelief.
अंक=m=A figure;a mark.The thigh.An act of a play.
अंकगणित=n=Arithmetic.
अँहँ=ind=Interjection expressing disapprobation.
अं=int=An interjection expressing contempt,unconcern,disbelief.
अंक=m=A figure;a mark.The thigh.An act of a play.
अंकगणित=n=Arithmetic.

The expected output would clean out all duplicates and store only unique strings, as shown in the output below:
Code:
अ=m=Prefix signifying negation.
अँहँ=ind=Interjection expressing disapprobation.
अं=int=An interjection expressing contempt,unconcern,disbelief.
अंक=m=A figure;a mark.The thigh.An act of a play.
अंकगणित=n=Arithmetic.

I hope this clarifies the query. The script I had provided handled only one delimiter
Code:
=

and I wanted to know if the awk script could be modified to suit this issue. Many thanks.
I work in a Windows environment.
# 4  
Old 01-24-2017
You didn't answer the question about what type of file is being processed! And, that is even more important now that we know you're working on a Windows system (while posting your question in a forum devoted to UNIX and UNIX-like operating systems).

If you have awk, you must have installed some UNIX utilities on your Windows system. Did you try the sort command I suggested? If so, what did it do? If not, why not?

A common, easy way to remove duplicated lines using awk is:
Code:
awk '!a[$0]++' file

but, of course, that depends on file being a text file (as defined by UNIX systems); if the last line of a DOS file doesn't have a line terminator, that last (incomplete) line may be silently dropped.
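To illustrate how that one-liner works, here is a short sketch on made-up input. The wrapper name `dedupe_lines` is hypothetical; the awk program inside it is the one shown above.

```shell
# Hypothetical wrapper around the one-liner; reads files or standard input.
dedupe_lines() {
  # a[$0]++ evaluates to 0 the first time a line is seen, so !a[$0]++ is
  # true only then, and awk's default action prints the line.
  awk '!a[$0]++' "$@"
}

# Made-up sample: the first line is repeated at the end.
line_result=$(printf '%s\n' 'b=m=x' 'a=m=y' 'b=m=x' | dedupe_lines)
printf '%s\n' "$line_result"
```

Unlike `sort -u`, this keeps the first occurrence of each line in its original position, at the cost of holding every distinct line in memory.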
# 5  
Old 01-24-2017
I had tried this but had forgotten to save the file as a Unix file. The moment I saved it in Unix format, the duplicates were eliminated.
Many thanks for your patience and help.
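For anyone who would rather do that conversion from the command line than from an editor, one possible sketch is below. It assumes the file differs from Unix format only by carriage returns at line ends; the file names and the `count_cr` helper are made up for the example.

```shell
# Hypothetical DOS-format file: every line ends in \r\n.
dosfile=$(mktemp)
unixfile=$(mktemp)
printf 'a=m=x\r\nb=n=y\r\n' > "$dosfile"

# Hypothetical helper: count lines that end in a carriage return.
count_cr() { awk '/\r$/ { n++ } END { print n + 0 }' "$1"; }

before=$(count_cr "$dosfile")
tr -d '\r' < "$dosfile" > "$unixfile"    # delete every carriage return
after=$(count_cr "$unixfile")
echo "CR-terminated lines before: $before, after: $after"
rm -f "$dosfile" "$unixfile"
```

`tr -d '\r'` removes every carriage return, not just those at line ends, which is usually what is wanted for a plain-text dictionary file.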
# 6  
Old 01-24-2017
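awk can also strip the Windows/DOS \r itself:
Code:
awk '
{ sub(/\r$/, "") }            # strip a trailing carriage return, if any
!($0 in a) { print; a[$0] }   # print a line only the first time it is seen
' file

The second awk line looks more complex than !a[$0]++ but saves some memory.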
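A quick sketch of why stripping the \r matters, using a made-up file in which the same entry appears with mixed line endings. The variable names are hypothetical, and the counts assume an awk that does not strip carriage returns on its own (as is the case on UNIX-like systems).

```shell
# Hypothetical input: the same entry three times, with mixed line endings.
mixed=$(mktemp)
printf 'a=m=x\r\na=m=x\na=m=x\r\n' > "$mixed"

# Without stripping, "a=m=x\r" and "a=m=x" count as different lines.
raw_count=$(awk '!($0 in a) { n++; a[$0] } END { print n + 0 }' "$mixed")

# With the trailing \r removed first, all three lines are the same entry.
clean_count=$(awk '{ sub(/\r$/, "") } !($0 in a) { n++; a[$0] } END { print n + 0 }' "$mixed")

echo "distinct lines before stripping: $raw_count, after: $clean_count"
rm -f "$mixed"
```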
# 7  
Old 01-24-2017
Thanks a lot. I tried it on my Dos file and it worked perfectly.