awk xml dictionary script: could I get some input?


 
# 1  
Old 02-24-2018

I completely understand if nobody wants to go through the ENTIRE code. What I am asking is whether anyone could glance over it and point out anything that could be improved. You need not run the program, but you can if you want to.

I have been using awk for about a week or so, and you guys have such good advice every time I ask. My setup: Debian, mksh shell, sh script, using mawk (for better performance).

The xml dictionary file has to be downloaded for the program to work.

The dictionary is in Latin. I have yet to see such a thing come to Linux (Android has such Latin apps but not regular Linux).

dictionary:
Code:
#!/bin/sh

# The input xml file MUST be changed into a linux-readable format for this program to run. Open in nano and save accordingly.
#
# Grab the xml file here: http://www.perseus.tufts.edu/hopper/dltext?doc=Perseus:text:1999.04.0060
#
# Save to your computer.
#

# Name this file as 'dictionary' and run (where 'amo' is the word to search for):
#
# $ chmod +x dictionary
# $ ./dictionary amo
#
# Other latin words to look up: et, sum, iste, non, bonus
#

## Code which connects to perseus website to attain 1st per. sg. (needed as key for xml file)

URL="http://www.perseus.tufts.edu/hopper/morph?l=$1&la=la"
XMLfile="Perseus_text_1999.04.0060.xml"

wFIN='<h4 class="la">'
wFOUT='</h4>'

wDefIn='<span class="lemma_definition">'
wDefOut='</span>'

wFormIn='<td class="la">'$1'</td>'
wFormOut='<td style="font-size: x-small">'

wget -q -O- "$URL" | mawk -v vDefIn="$wDefIn" -v vDefOut="$wDefOut" -v vFormIn="$wFormIn" -v vFormOut="$wFormOut" -v vWFIN="$wFIN" -v vWFOUT="$wFOUT" \
'$0 ~ vWFIN,$0 ~ vWFOUT {printf "\n[ " substr($0,18, length($0)-22)" ]"}   $0 ~ vDefIn,$0 ~ vDefOut {{ if (!/>/) {{$1=$1}1; print " "$0"";} }}   $0 ~ vFormIn,$0 ~ vFormOut {{ if (!/td /) {{$1=$1}1;   $0=substr($0,5, length($0)-9); print "-"$0;   } } }'

echo

keyIn='key="'$1'"'	# Which tag shall be searched?
keyOut='</entry>'	#
tagIn='<'		# How are tags to be distinguished?
tagOut='>'		#
keySepA=''		# Separates the main word from its roots
keySepB=','		#
etySepA='['		# Etymology left
etySepB=']\n\n • '	# Etymology right
defSep='\n\n\n'         # Separates individual definitions
emSep='\n\n • '		# Separates em-dashes

# First concatenate the result into a usable string
mawk -v vkeyIn="$keyIn" -v vkeyOut="$keyOut" ' $0 ~ vkeyIn, $0 ~ vkeyOut {printf "%s", $0}' "$XMLfile" |	# "%s" keeps any % in the data from being read as a format
mawk -v tagIn="$tagIn" -v tagOut="$tagOut" -v vkeySepA="$keySepA" -v vkeySepB="$keySepB" -v vdefSep="$defSep" -v vetySepA="$etySepA" -v vetySepB="$etySepB" -v vemSep="$emSep" '

	# Separation after main key word
	{gsub ("<orth>", vkeySepA)}
	{gsub ("</orth>", vkeySepB)}

	# Add separation for several variations of definitions
	{gsub (/<etym lang="la" opt="n">/, vetySepA)}
	{gsub (/<\/etym>\. -<\/sense>/, "]")}
	{gsub (/<\/etym>, <trans opt="n">/, vetySepB)}
	{gsub (/<\/etym>\.-/, vetySepB)}
	{gsub (/<\/etym>\. /, "]")}
#	{gsub (/<\/etym>/, vetySepB)}

	# Get rid of potential extra definition markers
	{gsub (/\.-<\/sense>/, ".")}
	{gsub (/\.- <\/sense>/, ".")}
	{gsub (/\. - <\/sense>/, ".")}
	{gsub (/<\/usg>-<\/sense>/, ".")}

	# Find and prepare subsections
	{gsub (/<sense id=.*level="[0-9]" n="0" opt="n">/, "")}
	{gsub (/<sense id=.*"I" opt="n">/, vdefSep)}
	{gsub (/<sense id=.*"II" opt="n">/, vdefSep)}
	{gsub (/<sense id=.*"III" opt="n">/, vdefSep)}
	{gsub (/<sense id=.*"IV." opt="n">/, vdefSep)}
	{gsub (/<sense id=.*"IV" opt="n">/, vdefSep)}
	{gsub (/<sense id=.*"V." opt="n">/, vdefSep)}
	{gsub (/<sense id=.*"VI." opt="n">/, vdefSep)}
	{gsub (/<sense id=.*"VII." opt="n">/, vdefSep)}
	{gsub (/<sense id=.*"VIII." opt="n">/, vdefSep)}
	{gsub (/<sense id=.*"IV." opt="n">/, vdefSep)}
	{gsub (/<sense id=.*"X." opt="n">/, vdefSep)}

	# Add missing dot after gender
	{gsub (/<\/gen>/, ". ")}

	# Collapse all remaining tags
	{gsub (tagIn "[^" tagOut "]*" tagOut, "")}

	# Separate em-dash text
	{if ((!/-\\,/) && (!/[a-zA-Z]-/) && (!/ -/)) {gsub (/-/, vemSep)}}
        {if ((!/-\\,/) ) {gsub (/\.-/, "." vemSep)}}
        {gsub (/ - /, vemSep)}
	{if (!/-\\,/) {gsub (/\.-/, "." vemSep)}}
	#{gsub (/\.-/, "." vemSep)}

	# Remove double spaces and spaces between certain characters
	{gsub(/ +/, " ")}
	{gsub(/ ,/, ",")}
	{gsub(/ \./, ".")}
	{gsub(/ \:/, ":")}
	{gsub(/ \?/, "?")}
	{gsub(/\‘ /, "‘")}
	{gsub(/ \'/, "'")}
	{gsub(/^ /, "")}
        {gsub(/\( /, "(")}
        {gsub(/ \)/, ")")}
	{gsub(/\.\.\. /, "...")}

	{printf "%s", $0}'
echo

Upon running (where 'sum' is the word to look up):
Code:
$ ./dictionary sum

You should see:
Code:
[ sum ] to be, exist, live
-verb 1st sg pres ind act

sum, (2d pers. es, or old ēs; old subj praes. siem, siēs, siet, sient, for sim, etc., T.; fuat for sit, T., V., L.; imperf. often forem, forēs, foret, forent, for essem, etc.; fut. escunt for erunt, C.), fuī (fūvimus for fuimus, Enn. ap. C.), futūrus (inf fut. fore or futūrum esse, C.), esse [ES-; FEV-]	 


I. As a predicate, asserting existence, to be, exist, live: ut id aut esse dicamus aut non esse: flumen est Arar, quod, etc., Cs.: homo nequissimus omnium qui sunt, qui fuerunt: arbitrari, me nusquam aut nullum fore: fuimus Troes, fuit Ilium, V.

 • Of place, to be, be present, be found, stay, live: cum non liceret Romae quemquam esse: cum essemus in castris: deinceps in lege est, ut, etc.: erat nemo, quicum essem libentius quam tecum: sub uno tecto esse, L.

 • Of circumstances or condition, to be, be found, be situated, be placed: Sive erit in Tyriis, Tyrios laudabis amictūs, i. e. is attired, O.: in servitute: in magno nomine et gloriā: in vitio: Hic in noxiāst, T.: in pace, L.: (statua) est et fuit totā Graeciā summo honore: ego sum spe bonā: rem illam suo periculo esse, at his own risk: omnem reliquam spem in impetu esse equitum, L.

 • In 3 d pers., followed by a pron rel., there is (that) which, there are (persons) who, there are (things) which, some.

 • With indic. (the subject conceived as definite): est quod me transire oportet, there is a (certain) reason why I must, etc., T.: sunt item quae appellantur alces, there are creatures also, which, etc., Cs.: sunt qui putant posse te non decedere, some think: Sunt quibus in satirā videor nimis acer, H.

 • With subj. (so usu. in prose, and always with a subject conceived as indefinite): sunt, qui putent esse mortem...sunt qui censeant, etc.: est isdem de rebus quod dici possit subtilius: sunt qui Crustis et pomis viduas venentur avaras, H.

 • With dat, to belong, pertain, be possessed, be ascribed: fingeret fallacias, Unde esset adulescenti amicae quod daret, by which the youth might have something to give, T.: est igitur homini cum deo similitudo, man has some resemblance: Privatus illis census erat brevis, H.: Troia et huic loco nomen est, L.

 • Ellipt.: Nec rubor est emisse palam (sc. ei), nor is she ashamed, O.: Neque testimoni dictio est (sc. servo), has no right to be a witness, T.

 • With cum and abl of person, to have to do with, be connected with: tecum nihil rei nobis est, we have nothing to do with you, T.: si mihi tecum minus esset, quam est cum tuis omnibus.

 • With ab and abl of person, to be of, be the servant of, follow, adhere to, favor, side with: Ab Andriā est ancilla haec, T.: sed vide ne hoc, Scaevola, totum sit a me, makes for me.

 • With pro, to be in favor of, make for: (iudicia) partim nihil contra Habitum valere, partim etiam pro hoc esse.

 • With ex, to consist of, be made up of: (creticus) qui est ex longā et brevi et longā: duo extremi chorei sunt, id est, e singulis longis et brevibus.

 • To be real, be true, be a fact, be the case, be so: sunt ista, Laeli: est ut dicis, inquam: verum esto: esto, granted, V.

 • Esp. in phrases, est ut, it is the case that, is true that, is possible that, there is reason for: sin est, ut velis Manere illam apud te, T.: est, ut id maxime deceat: futurum esse ut omnes pellerentur, Cs.: magis est ut ipse moleste ferat errasse se, quam ut reformidet, etc., i. e. he has more reason for being troubled...than for dreading, etc.: ille erat ut odisset defensorem, etc., he certainly did hate.

 • In eo esse ut, etc., to be in a condition to, be possible that, be about to, be on the point of (impers. or with indef subj.): cum iam in eo esset, ut in muros evaderet miles, when the soldiers were on the point of scaling, L.: cum res non in eo essent ut, etc., L.

 • Est ubi, there is a time when, sometimes: est, ubi id isto modo valeat.

 • Est quod, there is reason to, is occasion to: etsi magis est, quod gratuler tibi, quam quod te rogem, I have more reason to: est quod referam ad consilium: sin, etc., L.: non est quod multa loquamur, H.

 • Est cur, there is reason why: quid erat cur Milo optaret, etc., what cause had Milo for wishing? etc.

 • With inf, it is possible, is allowed, is permitted, one may: Est quādam prodire tenus, si non datur ultra, H.: scire est liberum Ingenium atque animum, T.: neque est te fallere quicquam, V.: quae verbo obiecta, verbo negare sit, L.: est videre argentea vasa, Ta.: fuerit mihi eguisse aliquando tuae amicitiae, S.

 • Of events, to be, happen, occur, befall, take place: illa (solis defectio) quae fuit regnante Romulo: Amabo, quid tibi est? T.: quid, si...futurum nobis est? L.

 • To come, fall, reach, be brought, have arrived: ex eo tempore res esse in vadimonium coepit: quae ne in potestatem quidem populi R. esset, L.. 


II. As a copula, to be: et praeclara res est et sumus otiosi: non sum ita hebes, ut istud dicam: Nos numerus sumus, a mere number, H.: sic, inquit, est: est ut dicis: frustra id inceptum Volscis fuit, L.: cum in convivio comiter et iucunde fuisses: quod in maritimis facillime sum, am very glad to be.

 • With gen part., to be of, belong to: qui eiusdem civitatis fuit, N.: qui Romanae partis erant, L.: ut aut amicorum aut inimicorum Campani simus, L.

 • With gen possess., to belong to, pertain to, be of, be the part of, be peculiar to, be characteristic of, be the duty of: audiant eos, quorum summa est auctoritas apud, etc., who possess: ea ut civitatis Rhodiorum essent, L.: Aemilius, cuius tum fasces erant, L.: plebs novarum rerum atque Hannibalis tota esse, were devoted to, L.: negavit moris esse Graecorum, ut, etc.

 • With pron possess.: est tuum, Cato, videre quid agatur: fuit meum quidem iam pridem rem p. lugere.

 • With gerundive: quae res evertendae rei p. solerent esse, which were the usual causes of ruin to the state: qui utilia ferrent, quaeque aequandae libertatis essent, L.

 • With gen. or abl. of quality, to be of, be possessed of, be characterized by, belong to, have, exercise: nimium me timidum, nullius consili fuisse confiteor: Sulla gentis patriciae nobilis fuit, S.: civitas magnae auctoritatis, Cs.: refer, Cuius fortunae (sit), H.: nec magni certaminis ea dimicatio fuit, L.: bellum variā victoriā fuit, S.: tenuissimā valetudine esse, Cs.: qui capite et superciliis semper est rasis.

 • With gen. or abl. of price or value, to be of, be valued at, stand at, be appreciated, cost: videtur esse quantivis preti, T.: ager nunc multo pluris est, quam tunc fuit: magni erunt mihi tuae litterae: sextante sal et Romae et per totam Italiam erat, was worth, L.

 • With dat predic., to express definition or purpose, to serve for, be taken as, be regarded as, be felt to be: vitam hanc rusticam tu probro et crimini putas esse oportere, ought to be regarded as: eo natus sum ut Iugurthae scelerum ostentui essem, S.: ipsa res ad levandam annonam impedimento fuerat, L.

 • With second dat of pers.: quo magis quae agis curae sunt mihi, T.: illud Cassianum, ‘cui bono fuerit,' the inquiry of Cassius, ‘for whose benefit was it': haec tam parva civitas praedae tibi et quaestui fuit.

 • To be sufficient for, be equal to, be fit: sciant patribus aeque curae fuisse, ne, etc., L.: ut divites conferrent, qui oneri ferendo essent, such as were able to bear the burden, L.: cum solvendo aere (old dat. for aeri) alieno res p. non esset, L.

 • With ellips. of aeri: tu nec solvendo eras, wast unable to pay.

 • With ad, to be of use for, serve for: res quae sunt ad incendia, Cs.: valvae, quae olim ad ornandum templum erant maxime.

 • With de, to be of, treat concerning, relate to: liber, qui est de animo.

 • In the phrase, id est, or hoc est, in explanations, that is, that is to say, I mean: sed domum redeamus, id est ad nostros revertamur: vos autem, hoc est populus R., etc., S.

One thing I don't like about this program is that I end up running awk twice to do the job: once to concatenate all the text, and a second time to filter and display the results. I would like to run it only once if possible.
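A minimal sketch of how the two awk runs might be merged, not a drop-in replacement: a single awk process joins the lines of the entry into one string and filters that string as soon as the closing tag is seen. The function name `lookup_entry` and the variable `buf` are made up for illustration, only a few of the gsub rules from the script are repeated, and the key is matched literally with index() rather than as a regex. Plain `awk` is used here; mawk should accept the same program.

```shell
#!/bin/sh
# Hypothetical helper: print the cleaned-up entry for headword $1 from XML file $2.
lookup_entry() {
    awk -v keyIn="key=\"$1\"" -v keyOut="</entry>" '
        index($0, keyIn) { inside = 1 }      # the entry starts on this line
        inside           { buf = buf $0 }    # join the entry into one string
        inside && index($0, keyOut) {        # the entry ends: filter and print
            gsub(/<orth>/, "", buf)
            gsub(/<\/orth>/, ",", buf)
            gsub(/<[^>]*>/, "", buf)         # drop all remaining tags
            gsub(/  +/, " ", buf)            # squeeze runs of spaces
            sub(/^ /, "", buf)
            printf "%s\n", buf
            exit                             # stop reading the rest of the file
        }
    ' "$2"
}
```

It would be called as `lookup_entry amo "$XMLfile"`. Because every gsub takes `buf` as its third argument, the rest of the filter block from the script could be moved in the same way, and the early `exit` avoids scanning the remainder of the dictionary once the entry has been printed.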

Also, there are several inefficiencies involving the filtering of text which will be plainly apparent.

Thank you.
# 2  
Old 02-25-2018
Just on a side note, this:

Quote:
Originally Posted by bedtime
Code:
	{gsub (/<sense id=.*"VIII." opt="n">/, vdefSep)}
	{gsub (/<sense id=.*"IV." opt="n">/, vdefSep)}
	{gsub (/<sense id=.*"X." opt="n">/, vdefSep)}

is probably a typo and should read "IX". But it seems that you could collapse the whole block using a regexp instead of fixed values, couldn't you?

Code:
{gsub (/<sense id=.*"[XVI]\{1,4\}[.]*" opt="n">/, vdefSep)}

I hope this helps.

bakunin
# 3  
Old 02-25-2018
Thank you. It didn't work in this code, but I can use this later. In another thread I happened to solve it.
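One possible reason the suggestion did not work as posted: awk patterns are extended regular expressions, in which `\{` stands for a literal brace, and as far as I know older mawk releases do not support interval expressions at all. A hedged sketch of the same idea that avoids intervals is below. The function name `mark_senses` is hypothetical, and `[^>]*` is swapped in for `.*` so that a match cannot run across several tags once the entry has been joined into one line.

```shell
#!/bin/sh
# Hypothetical helper: replace every Roman-numeral <sense ...> tag read from
# stdin with the separator given as $1.
mark_senses() {
    awk -v defSep="$1" '{
        # [IVX]+ covers I, II, ..., X; [.]? allows the optional trailing dot
        gsub(/<sense id=[^>]*"[IVX]+[.]?" opt="n">/, defSep)
        print
    }'
}
```

For example, `printf '%s\n' "$entry" | mark_senses '|'` would put a `|` in place of each numbered sense tag, where `$entry` is assumed to hold the concatenated entry text. Tags with `n="0"` are not touched by this pattern, so the separate rule for them in the script would still be needed.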