Why does this awk script not work correctly?


 
# 1  
Old 05-01-2017

I have a large database with English on the left-hand side and Indic words on the right-hand side.
Because the Indic words were entered by hand, some entries contain duplicates.
The structure is as follows:
Code:
English headword=Indic gloss,Indic gloss

A small sample will explain:
Code:
10=दहा
10th=दहावा,दशम
11=अकरा,अकरा,एकादश
11th=अकरावा
12=बारा
12th=बारावा
13=तेरा,तेरा,त्रयोदश
13th=तेरावा
14=चौदा
14th=चौदावा
15=पंधरा
15th=पंधरावा,पंध्रावा
16=सोळा
16th=सोळावा
175=पावणेदोनशे,एकशे पंचाहत्तर
17=सतरा
17th=सतरावा
18=अठरा
18th=अठरावा
190=एकशेनव्वद
19=एकोणीस
19th=एकोणिसावा
1=एक
1st=प्रथम,पहिला
20=वीस
20th=विसावा
21=एकवीस
21st=एकविसावा
22=बावीस
22nd=बाविसावा
23=तेवीस
23rd=तेविसावा
24-hour interval=दिवस
24-karat gold=शुद्ध सोने,खरे सोने,अस्सल सोने,बावनकशी सोने

As can be seen, some of the Indic glosses contain duplicates:
Code:
13=तेरा,तेरा,त्रयोदश
11=अकरा,अकरा,एकादश

I wrote an awk script to remove such duplicates:
Code:
# script to remove dupes from a row with structure word=word
BEGIN{FS="="}
{for(i=1;i<=NF;i++){a[$i]++;}for(i in a){b=b"="i}{sub("=","",b);$0=b;b="";delete a}}1

However, when the script runs, it mangles the output file.
What has gone wrong?
Many thanks for your kind help.

---------- Post updated at 12:46 AM ---------- Previous update was at 12:45 AM ----------

To clarify: the English is on the left-hand side and the Indic on the right-hand side, separated by =.
# 2  
Old 05-01-2017
Without showing us the output you hope to get from your sample input, without telling us whether the order of the Indic glosses on the right side of the equal sign matters, without telling us what operating system you're using, and without telling us how the output you are currently getting is "mangled", we can make lots of assumptions about what might be wrong that have absolutely nothing to do with your actual problem.

But one thing that is obvious is that with FS="=" the comma-separated string on the right side of the equal sign in each input line is a single field. One might guess that you either want to split $2 on commas, or want to set FS using FS="[=,]" and loop through fields 2 through NF instead of 1 through NF.
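
To make the field-splitting point concrete, here is a small sketch that prints NF for the same line under both separators. The helper name count_fields is made up for this illustration, and the glosses are ASCII stand-ins for the Indic words.

```shell
# Hypothetical helper: print the number of fields awk sees on each
# input line when FS is set to the separator passed as $1.
count_fields() {
    awk -v FS="$1" '{ print NF }'
}

# With FS="=" the whole gloss list is one field, so NF is 2.
printf '11=a,a,b\n' | count_fields '='

# With FS="[=,]" every gloss is its own field, so NF is 4.
printf '11=a,a,b\n' | count_fields '[=,]'
```

With only two fields per line, a loop over 1 through NF can never compare one gloss against another, so duplicates inside the gloss list go unnoticed.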
# 3  
Old 05-01-2017
Assuming that the order of the Indic glosses has to be kept as they appear in the input (only removing duplicated glosses), and assuming that you're using a version of awk that conforms to the requirements stated by the POSIX standards, you might try replacing your awk code with:
Code:
BEGIN { FS = "[=,]"
}
{       o = $1
        ofs = "="
        for(i = 2; i <= NF; i++) 
                if(!($i in s)) {
                        o = o ofs $i
                        s[$i]
                        ofs = ","
                }
        print o
        for(i in s)
                delete s[i] 
}

which, with your sample input, produces the output:
Code:
10=दहा
10th=दहावा,दशम
11=अकरा,एकादश
11th=अकरावा
12=बारा
12th=बारावा
13=तेरा,त्रयोदश
13th=तेरावा
14=चौदा
14th=चौदावा
15=पंधरा
15th=पंधरावा,पंध्रावा
16=सोळा
16th=सोळावा
175=पावणेदोनशे,एकशे पंचाहत्तर
17=सतरा
17th=सतरावा
18=अठरा
18th=अठरावा
190=एकशेनव्वद
19=एकोणीस
19th=एकोणिसावा
1=एक
1st=प्रथम,पहिला
20=वीस
20th=विसावा
21=एकवीस
21st=एकविसावा
22=बावीस
22nd=बाविसावा
23=तेवीस
23rd=तेविसावा
24-hour interval=दिवस
24-karat gold=शुद्ध सोने,खरे सोने,अस्सल सोने,बावनकशी सोने

If the output order of indic glosses on the right hand side doesn't matter, this code could be simplified.
This User Gave Thanks to Don Cragun For This Post:
# 4  
Old 05-01-2017
Sorry, I should have been more clear.
I work under Windows and hence DOS.
Basically, as you can see, the dictionary has the structure
Code:
English headword=Indic gloss,Indic gloss

as shown in the sample below:
Code:
10=दहा
10th=दहावा,दशम
11=अकरा,अकरा,एकादश
11th=अकरावा
12=बारा
12th=बारावा
13=तेरा,तेरा,त्रयोदश
13th=तेरावा
14=चौदा
14th=चौदावा
15=पंधरा
15th=पंधरावा,पंध्रावा
16=सोळा
16th=सोळावा
175=पावणेदोनशे,एकशे पंचाहत्तर
17=सतरा
17th=सतरावा
18=अठरा
18th=अठरावा
190=एकशेनव्वद
19=एकोणीस
19th=एकोणिसावा
1=एक
1st=प्रथम,पहिला
20=वीस
20th=विसावा
21=एकवीस
21st=एकविसावा
22=बावीस
22nd=बाविसावा
23=तेवीस
23rd=तेविसावा
24-hour interval=दिवस
24-karat gold=शुद्ध सोने,खरे सोने,अस्सल सोने,बावनकशी सोने

Since the database was compiled by hand, some words are repeated in the Indic glosses, as shown in the sample below:
Code:
13=तेरा,तेरा,त्रयोदश
11=अकरा,अकरा,एकादश

What I needed was an awk script to identify such repeated entries and delete the duplicates.
Thus the sample above would be reduced to:
Code:
13=तेरा,त्रयोदश
11=अकरा,एकादश

I had written the following awk script to do the job:
Code:
# script to remove dupes from a row with structure word=word,word
BEGIN{FS="="}
{for(i=1;i<=NF;i++){a[$i]++;}for(i in a){b=b"="i}{sub("=","",b);$0=b;b="";delete a}}1

However, when I ran the script on the sample, it produced mangled output:
Code:
10=दहा
10th=दहावा,दशम
अकरा,अकरा,एकादश=11
11th=अकरावा
बारा=12
12th=बारावा
तेरा,तेरा,त्रयोदश=13
तेरावा=13th
चौदा=14
चौदावा=14th
पंधरा=15
15th=पंधरावा,पंध्रावा
सोळा=16
16th=सोळावा
पावणेदोनशे,एकशे पंचाहत्तर=175
17=सतरा
सतरावा=17th
18=अठरा
अठरावा=18th
एकशेनव्वद=190
19=एकोणीस
19th=एकोणिसावा
एक=1
प्रथम,पहिला=1st
वीस=20
विसावा=20th
एकवीस=21
21st=एकविसावा
बावीस=22
बाविसावा=22nd
तेवीस=23
23rd=तेविसावा
दिवस=24-hour interval
शुद्ध सोने,खरे सोने,अस्सल सोने,बावनकशी सोने=24-karat gold

I hope the above clarifies the situation. Identifying dupes visually is both time-consuming and prone to error.

---------- Post updated at 02:00 AM ---------- Previous update was at 01:57 AM ----------

By the time I had posted the clarifications, you had already replied. Many thanks, it worked and swept through a dictionary of 70,000 words and removed all the dupes.
I will now study the script to see where I went wrong.
# 5  
Old 05-01-2017
The for loop can be shortened, and a classic split trick clears an array.
Code:
BEGIN { FS = "[=,]"
}
{       o = $1 "=" $2
        s[$2]
        for(i = 3; i <= NF; i++) 
                if(!($i in s)) {
                        o = o "," $i
                        s[$i]
                }
        print o
# clear s[]
        split("",s)
}
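
As a side note on the split trick: split("", s) empties s[] and should work in any awk. A bare "delete s" is accepted by gawk, mawk and others, but was not required by the POSIX specification at the time of this thread. A minimal sketch of why the clearing matters, using a made-up helper name and whitespace-separated ASCII input:

```shell
# Hypothetical helper: print the number of distinct words on each line.
count_distinct_per_line() {
    awk '{
        for (i = 1; i <= NF; i++) seen[$i]   # record each word as a key
        n = 0
        for (k in seen) n++                  # count the keys
        print n
        split("", seen)                      # empty seen[] before the next line
    }'
}

# First line has 2 distinct words, second line has 1.
printf 'a a b\nc\n' | count_distinct_per_line
```

Without the split("", seen) line, the keys from the first line would still be present when the second line is processed, and the second count would come out as 3.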

This User Gave Thanks to MadeInGermany For This Post:
# 6  
Old 05-01-2017
If the order of the Indic glosses is unimportant, also try
Code:
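awk -F= '
        {
         # collect the distinct glosses of $2 as keys of C[]
         for (n = split($2, T, ","); n > 0; n--) C[T[n]]
         printf "%s=", $1
         DL = ""
         # "for (c in C)" visits the keys in an unspecified order
         for (c in C)   {printf "%s%s", DL, c
                         DL = ","
                        }
         printf "\n"
         # clear C[] for the next line
         split ("",C)
        }
' file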
10=दहा
10th=दहावा,दशम
11=एकादश,अकरा
11th=अकरावा
12=बारा
12th=बारावा
13=त्रयोदश,तेरा
13th=तेरावा
.
.
.
24-hour interval=दिवस
24-karat gold=खरे सोने,अस्सल सोने,शुद्ध सोने,बावनकशी सोने
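
The glosses above come out in a different order than in the input because awk does not specify the order in which "for (c in C)" visits the keys. The same effect explains the mangled output in post #4: that script stored $1 and $2 as array keys and rebuilt the line with "for (i in a)", so headword and glosses came back in whatever order the array returned them. If the order matters and you prefer to keep FS as "=" and split $2 on commas, something along these lines might do. It is only a sketch: the name dedupe_keep_order is invented here, the glosses are ASCII stand-ins, and each line is assumed to contain exactly one "=".

```shell
# Hypothetical helper: drop repeated glosses on each "headword=g1,g2,..."
# line while keeping the first occurrence of each gloss in input order.
# Assumes exactly one "=" per line.
dedupe_keep_order() {
    awk -F= '{
        n = split($2, T, ",")          # T[1..n] holds the glosses in input order
        out = $1 "="
        sep = ""
        for (i = 1; i <= n; i++)
            if (!(T[i] in seen)) {     # first time this gloss appears on the line
                seen[T[i]]
                out = out sep T[i]
                sep = ","
            }
        print out
        split("", seen)                # reset for the next line
    }'
}

printf '13=tera,tera,trayodash\n' | dedupe_keep_order
```

Under cmd.exe on Windows single quotes are not quoting characters, so there the awk program would probably have to go into a file and be run with awk -f.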

This User Gave Thanks to RudiC For This Post:
# 7  
Old 05-01-2017
Many thanks. I tested the script and it worked beautifully.
The loop is an interesting feature.
Thanks to all who so very kindly give their time to help out.