Make awk gsub take value of for loop

02-25-2018


I am running Debian with the mksh shell, and my script starts with #!/bin/mksh.

Here is one instance I am trying to match. There are other level and n values, but they must be gathered in numerical order or the program will not work properly:

Code:

level="0" n="0"

Here is my code which does not work:

Code:

{ for (a = 0; a <= 10; ++a)
     { gsub(/level="[a]" n="[0-9]"/, "") }
}

The code above never matches anything, but the code below does. It is only a proof of concept that a match is possible:

Code:

{ for (a = 0; a <= 10; ++a)
     { gsub(/level="0" n="[0-9]"/, "") }
}

So it seems that gsub is not taking the a parameter. How can I make gsub take the value of the for loop?

Thank you.

Btw, I have tried:

Code:

{ gsub(/level="[0-9]" n="[0-9]"/, "") }

This catches other instances before level="0" n="0".

bedtime

02-25-2018

Regexes enclosed in slashes /.../ are regex constants, i.e. they are taken verbatim / literally. So a won't be replaced by its contents; [a] is a bracket expression that matches the literal character "a".
Try building the regex from partial string constants and variable contents, e.g.:

Code:

gsub("level=\"" a "\" n=\"[0-9]\"", "")

Not sure, though, what your intentions are with the square brackets around the a variable. And why not use a tailored regex (with "alternation") in lieu of the loop across a?
Some decent input samples (good and bad, i.e. to be matched or not) would help to understand what you're after.
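To illustrate the difference between a regex constant and a dynamic regex, here is a minimal sketch. The helper name strip_level and the sample input are made up for the demonstration; only the gsub call is taken from the post above.

```shell
# strip_level is a hypothetical helper: it removes every level="<a>" n="<digit>"
# pair from its input, where <a> is passed in as the first argument.
strip_level() {
    awk -v a="$1" '{
        # Dynamic regex: the string is assembled first, so the value of a is used.
        gsub("level=\"" a "\" n=\"[0-9]\"", "")
        print
    }'
}

# Made-up sample input with two different level values.
printf '%s\n' 'level="0" n="3"|level="1" n="4"' | strip_level 0
```

With this sample only the level="0" pair is removed and the level="1" pair is left alone. Written as /level="[a]" n="[0-9]"/, the same call would only ever match a literal level="a".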

Last edited by RudiC; 02-25-2018 at 09:47 AM..

RudiC

02-25-2018

Moderator's Comments:

Removed duplicate / edited paragraphs

Quote:

Originally Posted by RudiC

Regexes enclosed in slashes /.../ are regex constants, i.e. they are taken verbatim / literally. So a won't be replaced by its contents; [a] is a bracket expression that matches the literal character "a".
Try building the regex from partial string constants and variable contents, e.g.:

Code:

gsub("level=\"" a "\" n=\"[0-9]\"", "")

Not sure, though, what your intentions are with the square brackets around the a variable. And why not use a tailored regex (with "alternation") in lieu of the loop across a?
Some decent input samples (good and bad, i.e. to be matched or not) would help to understand what you're after.

This was the code needed to make it work! Thank you!

The big issue was that it must match perfectly or it would mess up the printout by going past its point and overwriting other data. With the forum's help, I got that fixed (after 6 hours of droning at the computer). Maybe there is a very simple way, but none of the other solutions offered have worked so far, though I did learn new things from them.

The square brackets were just a beginner's mistake; I was trying to use them as some sort of way of inputting a substitution.

If anyone is interested, a working solution is here. It's not the most efficient, but it works perfectly. If anyone wants all the code, just ask:

Code:

# Find 'sense id' number and store in a variable
{ /<sense id=\"n.*" level/; {vID = substr($2, 1, length($2)-1)}}

# Used for testing to see value:
# {print "vID: " vID}

# If matched then print section divider
{ for (vid = 0; vid <= 19; vid++){
    { for (vl = 0; vl <= 3; vl++){
        { for (vn = 0; vn <=25; vn++){

            {if (vn<=10) {vnx=vn    } }
            {if (vn==11) {vnx="I"    } }
            {if (vn==12) {vnx="II"    } }
            {if (vn==13) {vnx="III"    } }
            {if (vn==14) {vnx="IV"    } }
            {if (vn==15) {vnx="IV."    } }
            {if (vn==16) {vnx="V"    } }
            {if (vn==17) {vnx="V."  } }
            {if (vn==18) {vnx="A"  } }
            {if (vn==19) {vnx="B"  } }
            {if (vn==20) {vnx="C"  } }
            {if (vn==21) {vnx="D"  } }
            {if (vn==22) {vnx="E"  } }
            {if (vn==23) {vnx="F"  } }
            {if (vn==24) {vnx="G"  } }
            {if (vn==25) {vnx="H"  } }

                        # Used for testing to see values:
            # valuev="<sense " vID "." vid "\" level=\"" vl "\" n=\"" vnx "\" opt=\"n\">"; print valuev "\n"
            # {gsub("<sense " vID "." vid "\" level=\"" vl "\" n=\"" vnx "\" opt=\"n\">", vdefSep)}

                        # Everything is ready, so try to make a match!
            {gsub("<sense " vID "." vid "\" level=\"" vl "\" n=\"" vnx "\" opt=\"n\">", vdefSep)}

        }}
    }}
}}

# A sample of what I'm trying to match:
#
# <sense id="n1.0" level="0" n="0" opt="n">
# <sense id="n1.1" level="1" n="I" opt="n">
# <sense id="n1.2" level="2" n="A" opt="n">
# <sense id="n1.3" level="2" n="B" opt="n">
# <sense id="n1.4" level="3" n="1" opt="n">
# <sense id="n1.5" level="3" n="2" opt="n">
# <sense id="n1.6" level="1" n="II" opt="n">
# <sense id="n1.7" level="2" n="A" opt="n">
# <sense id="n1.8" level="3" n="1" opt="n">
# <sense id="n1.9" level="3" n="2" opt="n">
# <sense id="n1.10" level="2" n="B" opt="n">
# <sense id="n1.11" level="3" n="1" opt="n">
# <sense id="n1.12" level="3" n="2" opt="n">
# <sense id="n1.12" level="3" n="2" opt="n">
# <sense id="n1.13" level="3" n="3" opt="n">

Basically, it goes through every single possibility that exists. There must be a better way.

Last edited by RudiC; 02-25-2018 at 03:37 PM.. Reason: Removed duplicated / edited paragraphs

bedtime

02-25-2018

A few comments, although I'm afraid I didn't understand all the details of your code snippet and may not have interpreted them correctly, partly because the unusual indentation makes it hard to follow:
- Although extra braces don't hurt and the parser will understand / eliminate them, too many of them make the code difficult to read. {if (vn<=10) {vnx=vn } } can be written as if (vn<=10) vnx=vn without sacrificing logic but improving readability.
- for every single input line, you execute those nested loops 20 x 4 x 26, i.e. 2080 times - quite lengthy for more than a few input lines.
- instead of the 16 ifs for the vnx constants assignment, you could use an array.
- you seem to execute 2080 gsubs on $0 with different patterns, each and every one overwriting the former ones - not sure if each of those really makes sense and is necessary.

I could imagine that if you explain your problem verbosely in plain English, supporting it with a few meaningful examples, people in here could come up with a tailored, crisp proposal on how to improve and accelerate the solution.
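As a rough idea of what a regex with alternation could look like in place of the 2080 generated patterns, here is a sketch. It assumes the tags always look like the samples posted above (id of the form n<digits>.<digits>, level 0 to 3, opt="n"); the helper name mark_senses and the separator ' | ' are invented for the demonstration.

```shell
# mark_senses is a hypothetical helper; sep stands in for vdefSep.
mark_senses() {
    awk -v sep=' | ' '{
        # One regex with alternation instead of one pattern per combination:
        # n is a number, a Roman numeral built from I and V, or a letter A-H;
        # the last two may carry a trailing dot.
        gsub(/<sense id="n[0-9]+\.[0-9]+" level="[0-3]" n="([0-9]+|[IV]+\.?|[A-H]\.?)" opt="n">/, sep)
        print
    }'
}

# Made-up input line modelled on the posted samples.
printf '%s\n' '<sense id="n1.1" level="1" n="I" opt="n">first<sense id="n1.4" level="3" n="1" opt="n">second' | mark_senses
```

Every matching tag is replaced in a single pass over the line, and a tag outside the assumed ranges (for example level="9") is left untouched.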

EDIT:
This

Code:

{ /<sense id=\"n.*" level/; {vID = substr($2, 1, length($2)-1)}}

is NOT a pattern {action} pair and will change vID with every new input line. Is that intended? Why then the /<sense id=\"n.*" level/?
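A small sketch of the difference, using two invented helper names and two made-up input lines. In the first helper the regex is only a statement inside the action; in the second it is the pattern that guards the action.

```shell
# Both helpers are hypothetical names for this demonstration.

# Bare regex inside the action: it is evaluated and the result is thrown
# away, so the assignment runs for every input line.
vid_unguarded() {
    awk '{ /<sense id="n.*" level/; vID = substr($2, 1, length($2) - 1); print NR ": " vID }'
}

# Regex used as the pattern: the assignment runs only for matching lines,
# so vID keeps its last value on all other lines.
vid_guarded() {
    awk '/<sense id="n.*" level/ { vID = substr($2, 1, length($2) - 1) }
         { print NR ": " vID }'
}

printf '%s\n' '<sense id="n1.0" level="0">' 'plain text line' | vid_unguarded
printf '%s\n' '<sense id="n1.0" level="0">' 'plain text line' | vid_guarded
```

For the second input line the unguarded version overwrites vID with a piece of the word "text", while the guarded version still holds the value taken from the sense tag.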

EDIT 2: After replacing the gsub with a print (just as a proof of concept), this yields output identical to your code above:

Code:

awk '
BEGIN   {split ("1 2 3 4 5 6 7 8 9 10 I II III IV IV. V V. A B C D E F G H", VNARR)
         VNARR[0] = 0
        }

        {vID = substr($2, 1, length($2)-1)

#               Used for testing to see value:
#               {print "vID: " vID}

#               If matched then print section divider
         for (vid = 0; vid <= 19; vid++)
           for (vl = 0; vl <= 3; vl++)
             for (vn = 0; vn <=25; vn++)        print "<sense " vID "." vid "\" level=\"" vl "\" n=\"" VNARR[vn] "\" opt=\"n\">", vdefSep
        }

' file

Last edited by RudiC; 02-25-2018 at 07:19 PM..

RudiC

02-25-2018

Quote:

Originally Posted by RudiC

- Although extra braces don't hurt and the parser will understand / eliminate them, too many of them make the code difficult to read. {if (vn<=10) {vnx=vn } } can be written as if (vn<=10) vnx=vn without sacrificing logic but improving readability.

Yes, I will try to ensure all new code is trimmed down. For now, I don't want to mess with the other braces.

Quote:

- for every single input line, you execute those nested loops 20 x 4 x 26, i.e. 2080 times - quite lengthy for more than a few input lines.

Yes, I know no other working alternative.

Quote:

- instead of the 16 ifs for the vnx constants assignment, you could use an array.

How elegant that is! This is the type of thing I was looking for; so many lines saved!

Quote:

- you seem to execute 2080 gsubs on $0 with different patterns, each and every one overwriting the former ones - not sure if each of those really makes sense and is necessary.

I hadn't known that; my thought was that gsub only executed on a match? Maybe something like (and I've tried to make this work for a while):

Code:

# Make a variable for easy access:
IDvar="<sense " vID "." vid "\" level=\"" vl "\" n=\"" VNARR[vn] "\" opt=\"n\">"

# If that variable exists, then run gsub.
if (/IDvar/) gsub(IDvar, vdefSep)

Quote:

I could imagine that if you explain your problem verbosely in plain English, supporting it with a few meaningful examples, people in here could come up with a tailored, crisp proposal on how to improve and accelerate the solution.

Thank you. I try to keep speech to a min, so people aren't overwhelmed...

Quote:

EDIT:
This

Code:

{ /<sense id=\"n.*" level/; {vID = substr($2, 1, length($2)-1)}}

is NOT a pattern {action} pair and will change vID with every new input line. Is that intended? Why then the /<sense id=\"n.*" level/?

It would actually be the same due to how the xml file stores things, but, that said, you are right; it is not necessary to run and rerun that variable, even if the info is constant. I've taken it out of the for loop and put it just before.

Here is the updated version:

Code:

#!/bin/mksh

# This program requires an xml dictionary file to run. If it is not on your machine,
# this program will automatically download it and store in ~/.config/latin/.

# Name this file as 'latin' and run:
#
# $ chmod +x latin
#
# To run:
# $ ./latin amo
#
# To enable internet auto-decline:
# $ ./latin -d amo
#
# To run with only auto-decline:
# $ ./latin -c amo
#
# Where 'amo' is the term searched.

searchTerm=$2

URL="http://www.perseus.tufts.edu/hopper/morph?l=$searchTerm&la=la"

wFIN='<h4 class="la">'
wFOUT='</h4>'
wDefIn='<span class="lemma_definition">'
wDefOut='</span>'
wFormIn='<td class="la">'$searchTerm'</td>'
wFormOut='<td style="font-size: x-small">'

## Code which connects to perseus to attain 1st per. sg. (needed as key for xml file)
if [[ ("$1" == "-d") ]]; then

	searchTerms=$(wget -q -O- "$URL" | mawk -v vWFIN="$wFIN" -v vWFOUT="$wFOUT" \
	' $0 ~ vWFIN,$0 ~ vWFOUT {printf substr($0,18, length($0)-22)"\n"; next;}')

elif [[ ("$1" == "-c") ]]; then

	wget -q -O- "$URL" | mawk -v vDefIn="$wDefIn" -v vDefOut="$wDefOut" -v vFormIn="$wFormIn" -v vFormOut="$wFormOut" -v vWFIN="$wFIN" -v vWFOUT="$wFOUT" \
	' $0 ~ vWFIN,$0 ~ vWFOUT {printf "\n[ " substr($0,18, length($0)-22)" ]"; next;}   $0 ~ vDefIn,$0 ~ vDefOut {{ if (!/>/) {{$1=$1}1; x+=1; print " "$0"";} }}   $0 ~ vFormIn,$0 ~ vFormOut {{ if (!/td /) {{$1=$1}1;   $0=substr($0,5, length($0)-9); print "-"$0; next;} } }'

else
	searchTerms=$1
fi

if [ "$1" == "-c" ]; then
	exit
fi

XMLfile=Perseus_text_1999.04.0060.xml
XMLdir=~/.config/latin/
XMLlink="http://www.perseus.tufts.edu/hopper/dltext?doc=Perseus:text:1999.04.0060"

if [ ! -e $XMLdir$XMLfile ]; then
        echo "\nFile:" $XMLdir$XMLfile "not found.\n\nDownloading from" $XMLlink "...\n"
	mkdir -p ~/.config/latin
	wget -qO- $XMLlink | tr -d '\r' > $XMLdir$XMLfile
fi

for searchTerm in $searchTerms
do

#echo "Searching for:"$searchTerms

keyIn='key="'$searchTerm'"'	# Which tag shall be searched?
keyOut='</entry>'	#
tagIn='<'		# How are tags to be distinguished?
tagOut='>'		#
keySepA=''		# Separates the main word from its roots
keySepB=','		#
etySepA='['		# Etymology left
etySepB=']\n\n • '	# Etymology right
defSep='\n\n '          # Separates individual definitions
emSep='\n\n • '		# Separates em-dashes

#echo $keyIn

# First concatenate the result into a usable string
awk -v vkeyIn="$keyIn" -v vkeyOut="$keyOut" ' $0 ~ vkeyIn, $0 ~ vkeyOut { printf "%s", $0 }' $XMLdir$XMLfile |
awk -v tagIn="$tagIn" -v tagOut="$tagOut" -v vkeySepA="$keySepA" -v vkeySepB="$keySepB" -v vdefSep="$defSep" -v vetySepA="$etySepA" -v vetySepB="$etySepB" -v vemSep="$emSep" '

	# Separation after main key word
	{ gsub("<orth>", vkeySepA) }
	{ gsub("</orth>", vkeySepB) }

	# Add separation for several variations of definitions
	#{gsub(/<etym lang="la" opt="n">/, vetySepA)}
	# Testing
	{ gsub(/<sense id.*><etym lang="la" opt="n">/, vetySepA) }
	{ gsub(/<\/etym>\. —<\/sense>/, "]") }
	{ gsub(/<\/etym>\, <trans opt="n">/, vetySepB) }
	{ gsub(/<\/etym>\.—/, vetySepB) }
	{ gsub(/<\/etym>\. /, "]") }

	# Get rid of potential extra definition markers
	{ gsub(/\.—<\/sense>/, ".") }
	{ gsub(/\.— <\/sense>/, ".") }
	{ gsub(/\. — <\/sense>/, ".") }
	{ gsub(/<\/usg>—<\/sense>/, ".") }

	{ vID = substr($2, 1, length($2)-1) }

BEGIN   { split ("1 2 3 4 5 6 7 8 9 10 I II III IV IV. V V. A B C C. D E F G H", VNARR)
         VNARR[0] = 0
        }

        {

	#If matched then print section divider
	for (vid = 0; vid <= 19; vid++)
	  for (vl = 0; vl <= 3; vl++)
	    for (vn = 0; vn <=26; vn++) {

		#IDvar="<sense " vID "." vid "\" level=\"" vl "\" n=\"" VNARR[vn] "\" opt=\"n\">"
		#print IDvar

		gsub("<sense " vID "." vid "\" level=\"" vl "\" n=\"" VNARR[vn] "\" opt=\"n\">", vdefSep )

		}
	}

	# Add missing dot after gender
	{ gsub(/<\/gen>/, ". ") }

	# Collapse all remaining tags
	{ gsub(tagIn "[^" tagOut "]*" tagOut, "") }

	# Separate em-dash text
	{ if ((!/—\\,/) && (!/[a-zA-Z]—/) && (!/ —/)) {gsub (/—/, vemSep) }}
        { if ((!/—\\,/) ) {gsub (/\.—/, "." vemSep)}}
        { gsub (/ — /, vemSep)}
	{ if (!/—\\,/) {gsub (/\.—/, "." vemSep)}}

	# Remove double spaces and spaces between certain characters
	{ gsub(/ +/,  " ") }
	{ gsub(/ ,/,  ",") }
	{ gsub(/\( /, "(") }
	{ gsub(/ \)/, ")") }
	{ gsub(/ \./, ".") }
	{ gsub(/ \:/, ":") }
	{ gsub(/ \?/, "?") }
	{ gsub(/\‘ /, "‘") }
	{ gsub(/ \’/, "’") }
	{ gsub(/^ /,  "" ) }
	{ gsub(/\.\.\. /, "...") }
	{ NF }

{ print "\n" $0 "\n" }'

done

I had made a version with such great notes, but upon finishing it, there was an error which I could not fix. Likely I lost a bracket somewhere.

Once again, thank you all. I am still (always) open to any other suggestions.

*EDIT*

Updated script: XML dictionary file is now automatically downloaded to ~/.config/latin/ if not present. There is no manual downloading required. Just run the script and all is done automatically.

Last edited by bedtime; 02-26-2018 at 05:21 AM.. Reason: Updated script

bedtime

02-26-2018

Quote:

Originally Posted by bedtime

. . .

Quote:
- for every single input line, you execute those nested loops 20 x 4 x 26, i.e. 2080 times - quite lengthy for more than a few input lines.
Yes, I know no other working alternative.

As already proposed, if you describe what is needed, someone could come up with some nifty trick, e.g. a regex. Please be aware that even if the substitution has taken place in the first loop, another 2079 loops will be executed nevertheless.

Quote:

I hadn't known that; my thought was that gsub only executed on a match? Maybe something like (and I've tried to make this work for a while):

Code:

# Make a variable for easy access:
IDvar="<sense " vID "." vid "\" level=\"" vl "\" n=\"" VNARR[vn] "\" opt=\"n\">"

# If that variable exists, then run gsub.
if (/IDvar/) gsub(IDvar, vdefSep)

gsub analyses the input line / variable char by char for a match, as do the matching operators, e.g. /.../ - so testing first and then calling gsub parses the input twice. That is unnecessary and costly, esp. for lengthy lines. If you're sure there is only one single match, use sub to stop after that match. BTW, /IDvar/ looks for exactly that literal string, "IDvar", verbatim; to match against the variable's contents, use $0 ~ IDvar.
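The points above can be checked with a short sketch. The helper name check_sub and the sample line are invented; the sketch prints whether the regex constant matched, whether the dynamic regex matched, and the count that gsub returns.

```shell
# check_sub is a hypothetical helper for this demonstration.
check_sub() {
    awk -v vl="$1" '{
        IDvar = "level=\"" vl "\""
        literal = ($0 ~ /IDvar/)   # regex constant: searches for the text IDvar
        dynamic = ($0 ~ IDvar)     # dynamic regex: uses the contents of IDvar
        n = gsub(IDvar, "X")       # gsub returns the number of replacements made
        print literal, dynamic, n, $0
    }'
}

# Made-up sample line containing the level="0" pattern twice.
printf '%s\n' 'level="0" level="1" level="0"' | check_sub 0
```

Since gsub already reports how many replacements it made (0 when nothing matched), a separate test before the call is not needed.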

Quote:

Thank you. I try to keep speech to a min, so people aren't overwhelmed...
. . .

I wasn't asking for a romantic novel, but for a meaningful explanation / formulation of the central problem(s).

Quote:

I had made a version with such great notes, but upon finishing it, there was an error which I could not fix. Likely I lost a bracket somewhere.

There are editors in them thar hills that allow checking for e.g. unpaired brackets, braces, parentheses. Clever indentation also helps.

Last edited by RudiC; 02-26-2018 at 05:24 AM..

RudiC

02-26-2018

Quote:

Originally Posted by RudiC

As already proposed, if you describe what is needed, someone could come up with some nifty trick, e.g. a regex. Please be aware that even if the substitution has taken place in the first loop, another 2079 loops will be executed nevertheless.

Issue fixed. With one line of code:

Code:

gsub(vdefTagIn "[^" vdefTagOut "]*" vdefTagOut, vdefSep)

where the tags were defined as '<sense' and '>'. No more need for 2079 loops of madness.
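For anyone reading along, here is that one-liner in isolation as a runnable sketch. The helper name collapse_sense_tags, the separator ' | ' and the input line are made up for the demonstration; the variable names and the gsub call are the ones from the post.

```shell
# collapse_sense_tags is a hypothetical helper wrapping the one-line gsub.
collapse_sense_tags() {
    awk -v vdefTagIn='<sense' -v vdefTagOut='>' -v vdefSep=' | ' '{
        # The concatenation builds the regex <sense[^>]*> :
        # "<sense", then any run of characters other than ">", then ">".
        gsub(vdefTagIn "[^" vdefTagOut "]*" vdefTagOut, vdefSep)
        print
    }'
}

# Made-up input line modelled on the samples posted earlier.
printf '%s\n' '<sense id="n1.0" level="0" n="0" opt="n">to love</sense><sense id="n1.1" level="1" n="I" opt="n">to like</sense>' | collapse_sense_tags
```

Only the opening tags are replaced here. The closing </sense> tags do not start with <sense, so they stay in place until a later, more general tag removal deals with them.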

Anyways, nothing was wasted; all the ideas posted will help in future scripting.

As of now, I will be working on merging some gsub commands with regex tricks.

Oh, and about the braces: when I removed certain ones, the program would not operate correctly; it would scatter text and such. I just added one opening brace at the beginning of the awk program (after the variables) and a closing one before { print $0 }, and I was able to remove all the other braces!

If anyone is interested:

latin:

Code:

#!/bin/mksh

# This program requires an xml dictionary file to run. If it is not on your machine,
# this program will automatically download it and store in ~/.config/latin/.

# Name this file as 'latin' and run:
#
# $ chmod +x latin
#
# To run:
# $ ./latin amo
#
# To enable internet auto-decline:
# $ ./latin -d amo
#
# To run with only auto-decline:
# $ ./latin -c amo
#
# Where 'amo' is the term searched.

key=$2

URL="http://www.perseus.tufts.edu/hopper/morph?l=$key&la=la"

wFIN='<h4 class="la">'
wFOUT='</h4>'
wDefIn='<span class="lemma_definition">'
wDefOut='</span>'
wFormIn='<td class="la">'$key'</td>'
wFormOut='<td style="font-size: x-small">'

## Code which connects to perseus to attain 1st per. sg. (needed as key for xml file)
if [[ ("$1" == "-d") ]]; then

	searchTerms=$(wget -q -O- "$URL" | mawk -v vWFIN="$wFIN" -v vWFOUT="$wFOUT" \
	' $0 ~ vWFIN,$0 ~ vWFOUT {printf substr($0,18, length($0)-22)"\n"; next;}')

elif [[ ("$1" == "-c") ]]; then

	wget -q -O- "$URL" | mawk -v vDefIn="$wDefIn" -v vDefOut="$wDefOut" -v vFormIn="$wFormIn" -v vFormOut="$wFormOut" -v vWFIN="$wFIN" -v vWFOUT="$wFOUT" \
	' $0 ~ vWFIN,$0 ~ vWFOUT {printf "\n[ " substr($0,18, length($0)-22)" ]"; next;}   $0 ~ vDefIn,$0 ~ vDefOut {{ if (!/>/) {{$1=$1}1; x+=1; print " "$0"";} }}   $0 ~ vFormIn,$0 ~ vFormOut {{ if (!/td /) {{$1=$1}1;   $0=substr($0,5, length($0)-9); print "-"$0; next;} } }'

else
	searchTerms=$1
fi

if [ "$1" == "-c" ]; then
	exit
fi

XMLfile=Perseus_text_1999.04.0060.xml
XMLdir=~/.config/latin/
XMLlink="http://www.perseus.tufts.edu/hopper/dltext?doc=Perseus:text:1999.04.0060"

if [ ! -e $XMLdir$XMLfile ]; then
        echo "\nFile:" $XMLdir$XMLfile "not found.\n\nDownloading from" $XMLlink "...\n"
	mkdir -p ~/.config/latin
	wget -qO- $XMLlink | tr -d '\r' > $XMLdir$XMLfile
fi

for key in $searchTerms; do

keyIn='key="'$key'"'	# Which tag shall be searched?
keyOut='</entry>'	#
tagIn='<'		# How are tags to be distinguished?
tagOut='>'		#
defTagIn='<sense'	# How are definitions defined?
defTagOut='>'
keySepA=''		# Separates the main word from its roots
keySepB=','		#
etySepA='['		# Etymology left
etySepB=']\n\n • '	# Etymology right
defSep='\n\n '          # Separates individual definitions
emSep='\n\n • '		# Separates em-dashes

# First concatenate the result into a usable string
awk -v vkeyIn="$keyIn" -v vkeyOut="$keyOut" ' $0 ~ vkeyIn, $0 ~ vkeyOut { printf "%s", $0 }' $XMLdir$XMLfile |
awk -v vdefTagIn="$defTagIn" -v vdefTagOut="$defTagOut" -v tagIn="$tagIn" -v tagOut="$tagOut" -v vkeySepA="$keySepA" -v vkeySepB="$keySepB" -v vdefSep="$defSep" -v vetySepA="$etySepA" -v vetySepB="$etySepB" -v vemSep="$emSep" '
{
	# Separation after main key word
	gsub("<orth>", vkeySepA)
	gsub("</orth>", vkeySepB)

	# Add separation for several variations of definitions
	#gsub(/<etym lang="la" opt="n">/, vetySepA)
	gsub(/<sense id.*><etym lang="la" opt="n">/, vetySepA)
	gsub(/<\/etym>\. —<\/sense>/, "]")
	gsub(/<\/etym>\, <trans opt="n">/, vetySepB)
	gsub(/<\/etym>\.—/, vetySepB)
	gsub(/<\/etym>\. /, "]")

	# Get rid of potential extra definition markers
	gsub(/\.—<\/sense>/, ".")
	gsub(/\.— <\/sense>/, ".")
	gsub(/\. — <\/sense>/, ".")
	gsub(/<\/usg>—<\/sense>/, ".")
	gsub(/<\/usg> —<\/sense>/, ".")

	# Add missing dot after gender
	gsub(/<\/gen>/, ". ")

	# Collapse all definition tags and add formatting in their place
	gsub(vdefTagIn "[^" vdefTagOut "]*" vdefTagOut, vdefSep)

	# Collapse all remaining tags
	gsub(tagIn "[^" tagOut "]*" tagOut, "")

	# Separate em-dash text
	if ((!/—\\,/) && (!/[a-zA-Z]—/) && (!/ —/)) gsub (/—/, vemSep)
        if ((!/—\\,/) ) gsub (/\.—/, "." vemSep)
        gsub (/ — /, vemSep)
	gsub (/ —/, vemSep)
	if (!/—\\,/) gsub (/\.—/, "." vemSep)

	# Remove double spaces and spaces between certain characters
	gsub(/ +/,  " ")
	gsub(/ ,/,  ",")
	gsub(/\( /, "(")
	gsub(/ \)/, ")")
	gsub(/ \./, ".")
	gsub(/ \:/, ":")
	gsub(/ \?/, "?")
	gsub(/\‘ /, "‘")
	gsub(/ \’/, "’")
	gsub(/^ /,  "" )
	gsub(/\.\.\. /, "...")

}

{ print "\n" $0 "\n" } '

done

bedtime
