Regexes for three column data to create a dictionary

04-23-2016

Registered User

303, 4

Join Date: Feb 2011

Last Activity: 3 March 2020, 10:05 PM EST

Posts: 303

Thanks Given: 145

Thanked 4 Times in 4 Posts

I am working on a multilingual dictionary and I have data in three columns. A record can have one word per column:

Code:

word=word=gloss

or several words per column:

Code:

word word=word word=gloss gloss

Here

Code:

=

acts as the delimiter.
The number of words in each column can be up to 8 or 10. The structure is well defined in the sense that the number of words in the first column and the number of words in the second column are identical.
An example will make this clear. For ease of comprehension I am using Latin script:

Code:

book=boook=bUk

Code:

hand book=hannd boook=hEnD bUk

and so on.
I need to map the gloss in column 3 to the string in column 1 and to the string in column 2, giving:

Code:

book=bUk
boook=bUk

Code:

hand book=hEnD bUk
hannd boook=hEnD bUk

My question is: how do I write a regex that will identify each of these types? Once I have the regex, I can write a script that separates them out. I would appreciate a regex in Perl or Unix.
A script in either Perl or Awk would be the cherry on the cake. I work in a Windows environment.
I hope to complete the mapper and put it up as a useful tool for multilingual transliteration across two languages. Many thanks.

gimley

04-23-2016

Registered User

11,728, 1,345

Join Date: Feb 2004

Last Activity: 8 May 2020, 9:07 AM EDT

Location: NM

Posts: 11,728

Thanks Given: 903

Thanked 1,345 Times in 1,201 Posts

Perl does hashes really well. Is there a reason to use regexes? Once the hash is built in memory you could serialize it to disk or, more importantly, use the hash table instead of searching your file for each lookup. This will only benefit you if the file has a reasonably large number of lines (more than a few thousand).
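As a rough sketch of doing this without a classifying regex: splitting on "=" already covers both the one-word and the multi-word record, and a word-count comparison checks that columns 1 and 2 line up. The example is in awk rather than Perl; the scratch file and the variable names (dict, mapped) are made up for illustration, and the sample data is taken from post #1.

```shell
# Sample records from post #1, written to a scratch file (hypothetical name).
dict=$(mktemp)
printf '%s\n' 'book=boook=bUk' 'hand book=hannd boook=hEnD bUk' > "$dict"

# Split each record on "=" and emit column1=gloss and column2=gloss.
# Records without exactly three fields, or whose first two columns differ
# in word count, are reported on stderr and skipped.
mapped=$(awk -F'=' '
NF != 3 { print "skipping line " NR ": expected 3 fields" | "cat 1>&2"; next }
{
    n1 = split($1, w1, " ")      # number of words in column 1
    n2 = split($2, w2, " ")      # number of words in column 2
    if (n1 != n2) {
        print "skipping line " NR ": word counts differ" | "cat 1>&2"
        next
    }
    print $1 "=" $3
    print $2 "=" $3
}' "$dict")

printf '%s\n' "$mapped"
rm -f "$dict"
```

For the two sample records this should print the four mappings requested in post #1. Whether malformed records should be skipped or should abort the run is a design choice left open here.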

FWIW: how does your translator handle homographs? In English: "lead a horse to water" versus "the metal was as heavy as lead".

jim mcnamara

04-23-2016

Thanks for your comment. As far as homographs are concerned, the gloss handles these as separate entries.
I am not very good at Perl and prefer to use Awk. If I am not asking too much, how can this be handled in Perl?

gimley

04-23-2016

awk is fine: associative arrays are hashes.

However, most translators have a lot of work to do, so whichever way you do this, you need to keep the hash in memory if performance is a criterion. You may end up reading through many "smallfile"s against a single read of bigfile; I don't know.

You will need two field separators: the "=" character for bigfile and, for English text, tab and space.
The other possible field separator characters depend on your language.
Files: bigfile (the one with the = separator) and smallfile (the text to translate).

Code:

awk 'FNR==NR { one[$3]=$1; two[$3]=$2; next }   # bigfile: lookup will be by gloss
     {
         # smallfile: do your lookup here
     }
    ' FS="=" bigfile FS="[ \t]" smallfile           # FS is set per file, before its first record is split

I do not understand your requirements well enough to give you a good answer. Some points:
1. The 'do your lookup here' part is simply an associative array lookup:
read $0 from smallfile, which should contain the gloss, right?

Code:

print one[$0]  # prints the first field of the bigfile line with that gloss

2. You will probably need to create one awk function to look up one[] and another for two[].
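The two points above can be sketched roughly as follows. This is only an illustration: the function names look_one and look_two, the "?" placeholder for unknown glosses, and the contents of the scratch files standing in for bigfile and smallfile are all assumptions, not part of the original data.

```shell
# Scratch files standing in for bigfile and smallfile (contents are illustrative).
bigfile=$(mktemp); smallfile=$(mktemp)
printf '%s\n' 'book=boook=bUk' 'hand book=hannd boook=hEnD bUk' > "$bigfile"
printf '%s\n' 'hEnD bUk' 'bUk' 'xyz' > "$smallfile"

found=$(awk -F'=' '
# look_one and look_two are hypothetical helper names; each returns the stored
# word for a gloss, or "?" when the gloss was not seen in bigfile.
function look_one(g) { return (g in one) ? one[g] : "?" }
function look_two(g) { return (g in two) ? two[g] : "?" }

FNR == NR { one[$3] = $1; two[$3] = $2; next }   # first file: index by gloss
          { print $0 " -> " look_one($0) " / " look_two($0) }
' "$bigfile" "$smallfile")

printf '%s\n' "$found"
rm -f "$bigfile" "$smallfile"
```

Because the lookup uses the whole line ($0), the field separator in effect for smallfile does not matter in this sketch. If the same gloss appears on several bigfile lines, the last one wins.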

jim mcnamara

04-23-2016

Thanks a lot for your help with the script. You have grasped what I have in mind. I will take it from here and modify the script to accommodate other cases.

gimley

04-23-2016

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

Looking at the samples in post#1, I'm not sure if the distinction between word1 and word2 (resulting in two arrays) is really needed. Wouldn't

Code:

awk '
FNR==NR {TR[$1]=$3              # gloss indexed by the two words
         TR[$2]=$3
         next
        }
        {                       # whatever you need to do to your text
        }
' FS="=" transfile FS="..." text

do what you want?
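To make the single-array idea concrete, here is one possible way to fill in the second block. It is a sketch under assumptions, not the intended solution: it first tries the whole line as a dictionary entry and otherwise falls back to word-by-word lookup, passing unknown words through unchanged. That fallback, the whitespace separator for the text file, and the scratch file contents are all made up for illustration.

```shell
# Scratch files standing in for transfile and text (contents are illustrative).
transfile=$(mktemp); text=$(mktemp)
printf '%s\n' 'book=boook=bUk' 'hand book=hannd boook=hEnD bUk' > "$transfile"
printf '%s\n' 'hand book' 'boook' 'my book' > "$text"

glossed=$(awk '
FNR == NR { TR[$1] = $3; TR[$2] = $3; next }     # gloss indexed by either word
{
    if ($0 in TR) { print TR[$0]; next }         # whole line is a known entry
    out = ""
    for (i = 1; i <= NF; i++) {                  # otherwise go word by word
        w = ($i in TR) ? TR[$i] : $i             # unknown words pass through
        out = (i == 1) ? w : out " " w
    }
    print out
}' FS="=" "$transfile" FS=" " "$text")

printf '%s\n' "$glossed"
rm -f "$transfile" "$text"
```

A multi-word entry embedded in a longer line (for example "my hand book") is glossed word by word here, so entries that exist only as phrases would need a longest-match step that this sketch does not attempt.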

RudiC

04-23-2016

Sorry for the late reply. I was out and could not access my mail. Many thanks for the answer.
I see your point that the distinction between word1 and word2 may not be needed.
I will test it out and get back to you.

gimley
