Modify script to remove dupes with two delimiters

01-24-2017

Hello,
I have a script which removes duplicates in a database with a single delimiter (=). The script is given below:

Code:

# script to remove dupes from a row with structure word=word
BEGIN{FS="="}
{for(i=1;i<=NF;i++){a[$i]++;}for(i in a){b=b"="i}{sub("=","",b);$0=b;b="";delete a}}1

How do I modify the script to remove duplicates in a database with two delimiters? A small pseudo-sample is given below.

Code:

अ=m=Prefix signifying negation.
अ=m=Prefix signifying negation.
अँहँ=ind=Interjection expressing disapprobation.
अं=int=An interjection expressing contempt,unconcern,disbelief.
अंक=m=A figure;a mark.The thigh.An act of a play.
अंकगणित=n=Arithmetic.
अँहँ=ind=Interjection expressing disapprobation.
अं=int=An interjection expressing contempt,unconcern,disbelief.
अंक=m=A figure;a mark.The thigh.An act of a play.
अंकगणित=n=Arithmetic.

I tried to modify the delimiter part in the script using

Code:

{FS="=""*'"="}

But it resulted in totally garbled output.
Since the file is very large, ordinary editors cannot remove the dupes, hence this request.

gimley

01-24-2017

Your description isn't clear enough to understand what you're trying to do.

I don't see any lines in your input that have a duplicated field (with = as the field separator). So there doesn't seem to be anything that needs to be done to remove duplicated fields in a line.

There is nothing in your code that makes any attempt to compare lines. If you were trying to remove duplicated lines the = would have no relevance; just using:

Code:

sort -u file

would do that (assuming that your database is in a text file named file).
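One thing to keep in mind with this approach: sort -u removes duplicated lines, but it also returns the remaining lines in sorted order rather than in the order in which they appear in the file. A small sketch, using three made-up ASCII lines (not taken from the poster's data) and LC_ALL=C so the ordering does not depend on the locale:

```shell
# Hypothetical three-line input; the second "pear" line is the duplicate.
input=$(printf '%s\n' 'pear=n=A fruit.' 'apple=n=A fruit.' 'pear=n=A fruit.')

# LC_ALL=C forces plain byte-order sorting so the result does not depend on locale.
sorted_unique=$(printf '%s\n' "$input" | LC_ALL=C sort -u)
printf '%s\n' "$sorted_unique"
```

The duplicate is gone, but the apple line now comes before the pear line. If the original order matters, a different tool is needed.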

You haven't shown us what output you hope to produce from your sample "text" and you haven't told us what form it takes. (Is your sample stored in a text file, in an Oracle database file, in some other type of database, or something else?)

What output are you hoping to produce from your pseudo-sample?

Don Cragun

01-24-2017

I am sorry. I should have been more explicit.
My database has a word or a phrase followed by its part of speech and then its meaning, each of which is delimited by =, as can be seen in the example below:

Code:

अ=m=Prefix signifying negation.

It so happens that while compiling the dictionary, duplicates have crept into the database and what I need is a tool to remove these duplicates. As a pseudoexample, here is the sample of the input:

Code:

अ=m=Prefix signifying negation.
अ=m=Prefix signifying negation.
अँहँ=ind=Interjection expressing disapprobation.
अं=int=An interjection expressing contempt,unconcern,disbelief.
अंक=m=A figure;a mark.The thigh.An act of a play.
अंकगणित=n=Arithmetic.
अँहँ=ind=Interjection expressing disapprobation.
अं=int=An interjection expressing contempt,unconcern,disbelief.
अंक=m=A figure;a mark.The thigh.An act of a play.
अंकगणित=n=Arithmetic.

The expected output would clean out all duplicates and store only unique strings, as shown in the output below:

Code:

अ=m=Prefix signifying negation.
अँहँ=ind=Interjection expressing disapprobation.
अं=int=An interjection expressing contempt,unconcern,disbelief.
अंक=m=A figure;a mark.The thigh.An act of a play.
अंकगणित=n=Arithmetic.

I hope this clarifies the query. The script I had provided handled only one delimiter (=), and I wanted to know if the awk script could be modified to suit this issue. Many thanks.
I work in a Windows environment.

gimley

01-24-2017

You didn't answer the question about what type of file is being processed! And, that is even more important now that we know you're working on a Windows system (while posting your question in a forum devoted to UNIX and UNIX-like operating systems).

If you have awk, you must have installed some UNIX utilities on your Windows system. Did you try the sort command I suggested? If so, what did it do? If not, why not?

A common, easy way to remove duplicated lines using awk is:

Code:

awk '!a[$0]++' file

but, of course, that depends on file being a text file (as defined by UNIX systems); if the last line of a DOS file lacks a line terminator, that last (incomplete) line may be silently dropped.
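To see why the one-liner works: awk evaluates !a[$0]++ as a pattern for every line. The first time a line is seen, a[$0] is still empty (zero), so the negation is true and the line is printed by the default action; the post-increment then records the line as seen, so later copies are skipped. A minimal sketch on a few lines taken from the sample in this thread, written to a temporary UNIX-format file (the name sample.txt and the variable names are only illustrative):

```shell
# Build a small UNIX-format test file from the thread's sample (hypothetical file name).
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt"
printf '%s\n' \
  'अ=m=Prefix signifying negation.' \
  'अ=m=Prefix signifying negation.' \
  'अंकगणित=n=Arithmetic.' \
  'अ=m=Prefix signifying negation.' \
  'अंकगणित=n=Arithmetic.' > "$sample"

# a[$0] is zero the first time a line is seen, so !a[$0]++ is true and the line
# is printed (default action); later copies find a non-zero count and are skipped.
deduped=$(awk '!a[$0]++' "$sample")
printf '%s\n' "$deduped"
rm -rf "$tmpdir"
```

Only the first occurrence of each line survives, and the lines keep the order in which they first appeared. The whole line ($0) is the key, so the = delimiters play no role.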

Don Cragun

01-24-2017

I had tried this but had forgotten to save the file as a Unix file. The moment I saved it in Unix format, the duplicates were eliminated.
Many thanks for your patience and help.

gimley

01-24-2017

awk can remove the Windows/DOS \r (carriage return) itself:

Code:

awk '
{sub(/\r$/,"")}
!($0 in a) {print; a[$0]}
' file

The second awk line looks more complex than !a[$0]++ but saves some memory, since a[$0] only creates the array key instead of maintaining a counter.
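A minimal sketch of this idea on DOS-format input, with the sub() call wrapped in braces so that it runs as an action on every line (without braces awk would treat it as a pattern and print each line it matched). The input is built with printf from two lines of the thread's sample; the variable names are only illustrative, and the last line deliberately has no \r to show that lines differing only in their terminator are treated as equal:

```shell
# Build a DOS-format (CRLF) input; the last line deliberately has no CR.
dos_input=$(printf 'अ=m=Prefix signifying negation.\r\nअंकगणित=n=Arithmetic.\r\nअ=m=Prefix signifying negation.\n')

cleaned=$(printf '%s\n' "$dos_input" | awk '
{ sub(/\r$/, "") }            # strip a trailing carriage return, if any
!($0 in a) { print; a[$0] }   # print first occurrence; a[$0] just creates the key
')
printf '%s\n' "$cleaned"
```

The output is in UNIX format (no \r), so redirect it to a new file and convert it back if a DOS-format file is required.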

MadeInGermany

01-24-2017

Thanks a lot. I tried it on my Dos file and it worked perfectly.

gimley
