awk and regex of wikisource data

06-20-2015


This is for GNU Awk.

A sample file.txt contains this data (actual text from Wikipedia):

In June 2000, ''Bookface, Inc.'' launched the website [URL="http://www.Bookface.com"]www.Bookface.com[/URL], a "Read on Demand" service precipitated both by the concurrent [[print on demand]] boom, and launching during the hype surrounding [[Stephen King]]'s online-only novella ''The Plant'', which had been <launched in July>, 1999.<ref>[http://www.kirjasto.sci.fi/sking.htm Stephen King Bio at ''Books & Writers'']. <Accessed January 27>, 2008</ref> Bookface delivered "whole books and excerpts to readers directly", with publishers including [[HarperCollins]], Penguin Puttnam, [[Random House]] and Time Warner Trade Publishing lined up to provide Bookface with content.<ref name="findarticles.com">[http://findarticles.com/p/articles/mi_m0EIN/is_2000_June_2/ai_62434142 Bookface.com Opens Books Online; Innovative Website Gives Readers Direct Access to Books; www.bookface.com to Launch With Involvement of Major Publishers], June 2, 2000. Accessed January 27, 2008</ref>

There are thousands of files; this is example data.

I'd like to extract the text between the <ref></ref> pairs. Note that some of the ref pairs start with <ref name="findarticles.com"> where the name= portion could be just about anything and ends in ">". Or there may be no name= at all and start with <ref>. They always end in </ref>. Also the text between the ref pairs may contain other < and > characters (though no nested <ref></ref> pairs). Finally, file.txt will be accessed as a string via readfile(), not via getline.

This is what I have so far (a code fragment from a longer awk script that does other, unrelated things; i.e., readfile() is needed for other reasons):

Code:

@include "readfile"
BEGIN {
     file = readfile("file.txt")
     c = patsplit(file, b, "<ref[^>]{0,1000}+>[^<]+(</ref>)") 
     while(i++ < c) print b[i]
}

This works, except when the text between the ref pairs contains "<" or ">", as in the first ref pair in the above data ("<Accessed January 27>").

--Mid Ocean

Mid Ocean


06-20-2015


I understand not wanting to put such a long lined input file in CODE tags; but, in the future, please at least note that the entire file is a single line.

Assuming that each pair of <ref> and its matching </ref> tags are always on a single line in the files you want to process, this seems simpler:

Code:

awk -F'<ref>' '
{	for(i=2; i<=NF; i++)
		print substr($i, 1, match($i, "</ref>") - 1)
}' file.txt

As long as the text files really are text files (i.e., with no lines longer than LINE_MAX bytes including the terminating <newline> character), this should work with any awk utility. (But, as always, on Solaris/SunOS systems, change awk to /usr/xpg4/bin/awk.)

Don Cragun


06-20-2015


I initially tried that solution (using split, same idea) and it gives incorrect results.

I can't post the full data on unix.com, but the full source of a Wikipedia article is on Pastebin:

{{Infobox person | name = Lou Anders | image = Lou Anders.jpg<!-- - Pastebin.com

Your code shows 18 ref pairs. In fact there are 23 as seen here:

https://en.wikipedia.org/wiki/Lou_Anders#References

The patsplit() solution "works". It will pick up the 23 in this example. However if you introduce a ">" or "<" character inside the text of a ref pair, it will skip it.

---------- Post updated at 05:06 PM ---------- Previous update was at 04:57 PM ----------

The other thing is that I am working entirely in an awk script, not from the command line or a shell script. And I really want to know how to solve this regex problem, as it has application to other areas of my program. So I'm hoping that, rather than finding a different solution using a different method, I can get help with my original question: what is the right regex for the patsplit() solution?

---------- Post updated at 05:14 PM ---------- Previous update was at 05:06 PM ----------

Your solution works after changing the separator to -F'<ref[^>]{0,1000}+>', though it produces empty lines. Still working on it.

---------- Post updated at 05:27 PM ---------- Previous update was at 05:14 PM ----------

OK, here's a solution based on your code:

Code:

  c = split(file, b, "<ref[^>]{0,1000}+>")
  i = 1
  while(i++ < c) {
    print substr(b[i], 1, match(b[i], "</ref>") - 1)
  }

---------- Post updated at 05:36 PM ---------- Previous update was at 05:27 PM ----------

Yes this is working now. I'd like to learn how to do the regex but this solution with split/substr/match is working. Thanks for your help.

Mid Ocean


06-20-2015


You may want to give this a shot:

Code:

grep -Eo "<ref[^>]*>([^<]*|[^<]*<[^/]*>[^<]*)</ref>" file2
<ref>[http://www.kirjasto.sci.fi/sking.htm Stephen King Bio at ''Books & Writers'']. <Accessed January 27>, 2008</ref>
<ref name="findarticles.com">[http://findarticles.com/p/articles/mi_m0EIN/is_2000_June_2/ai_62434142 Bookface.com Opens Books Online; Innovative Website Gives Readers Direct Access to Books; www.bookface.com to Launch With Involvement of Major Publishers], June 2, 2000. Accessed January 27, 2008</ref>

---------- Post updated at 23:53 ---------- Previous update was at 23:47 ----------

Or

Code:

grep -Eo "<ref[^>]*>[^<]*(<[^/]*>[^<]*)*</ref>" file2
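
The same ERE is not specific to grep. As a rough sketch only (match_refs is a made-up name, and the input is assumed to be small enough to hold in one string), it can be applied repeatedly with awk's match(), which also copes with input that has already been read into a single variable. Because <[^/]*> refuses any slash between the angle brackets, a reference body containing something like <br/> would presumably be missed. In gawk the same pattern could likely be handed to patsplit() instead of looping.

```shell
# Hypothetical helper: print each complete <ref ...>...</ref> element found on stdin.
match_refs() {
    awk '
    { text = text $0 "\n" }                 # slurp all input into one string
    END {
        # ERE taken from the grep command above
        re = "<ref[^>]*>[^<]*(<[^/]*>[^<]*)*</ref>"
        while (match(text, re)) {
            print substr(text, RSTART, RLENGTH)      # the matched element
            text = substr(text, RSTART + RLENGTH)    # continue after the match
        }
    }'
}
```

Usage would be along the lines of: match_refs < file2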

RudiC


06-20-2015


Standard POSIX BREs and EREs perform greedy matches. And, awk uses standard POSIX EREs. Greedy means that .* in the ERE <ref>.*</ref> matches the longest string of characters it can find that starts with <ref> and ends with </ref>. Creating an ERE that matches a string starting with a specific string and ending with another (longer than one character) string that doesn't contain the terminating (longer than one character) string is somewhere between hard and impossible depending on the terminating string. Shell parameter expansions provide ways to perform greedy expansions (${var##pattern} and ${var%%pattern}) and non-greedy expansions (${var#pattern} and ${var%pattern}). You may also be able to find something in gawk to tell it to use a non-greedy RE match.
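
One way to sidestep the greedy-match problem without a non-greedy operator is to stop asking the ERE to find the end at all: use match() only for the opening tag, then use index() to find the first </ref> after it. The following is a minimal sketch under some assumptions: extract_refs is a hypothetical name, the whole input is slurped into one string (similar in spirit to readfile()), and any tag starting with <ref is treated as a reference tag.

```shell
# Hypothetical helper: print the body of every <ref ...>...</ref> pair on stdin.
extract_refs() {
    awk '
    { text = text $0 "\n" }                 # slurp all input into one string
    END {
        # Find the next opening tag, then the FIRST closing tag after it.
        while (match(text, /<ref[^>]*>/)) {
            tag  = substr(text, RSTART, RLENGTH)
            text = substr(text, RSTART + RLENGTH)    # drop everything through the tag
            if (substr(tag, length(tag) - 1) == "/>")
                continue                             # self-closing tag: no body
            stop = index(text, "</ref>")
            if (stop == 0)
                break                                # unterminated ref: give up
            print substr(text, 1, stop - 1)          # body may contain < and >
            text = substr(text, stop + 6)            # 6 is the length of </ref>
        }
    }'
}
```

Usage would be along the lines of: extract_refs < file.txt. Bodies that span several lines are printed with their embedded newlines.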

You're right about my code missing references. I was just looking for <ref> when I should have also been looking for <ref name="string">. Changing my code to:

Code:

awk -F'<ref[^>]*>' '
{	for(i=2; i<=NF; i++)
		print substr($i, 1, match($i, "</ref>") - 1)
}' file.txt

will take care of that, but it does still depend on finding the opening and closing ref tags on the same line in your input files. Note that your <ref[^>]{0,1000}+> (which is one or more occurrences of zero to 1000 non-> characters between <ref and >) can be much more concisely written as <ref[^>]*> (which is zero or more occurrences of non-> characters between <ref and >).

The i=2 in the for loop should eliminate the blank lines problem.

If the above awk script doesn't work using gawk (which doesn't care much about line length limits), it must mean that some of your files do have the opening and closing ref tags on different lines. If that is your problem, we can try a shell script to do the parsing, but note that some of the references printed will contain <newline> characters in that case.

The grep commands that RudiC suggested also depend on the opening and closing ref tags being on the same line.

Don Cragun


06-20-2015


Ok that's good to know that it is hard to impossible because I have struggled with how to do it. Your method is workable.

On non-greedy matching in gawk, I found this:

https://lists.gnu.org/archive/html/b.../msg00000.html

Oh well.

I'm using readfile() which reads the entire file into a variable - the line break characters are there but it's treated as a single long string, FN=1. Then use split to create fields.

In the past I've had problems with * matching to the end of the string (file). In this case it doesn't seem to matter. I added the {0,1000} for good measure, but you're right, it's not needed.

Note that the code and method are a little different when running as a script with readfile vs. running awk from the command line. i=1 not 2, using while not for loop. The reason for blank lines is because there is another type of ref in the document that looks like this: <ref name="trashotron.com"/> .. note the slash at the end and no closing </ref>. This gets treated as a split point with empty results - no harm and easy to work around by checking for null result.
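
The null-result check could look roughly like this. It is a sketch only: split_refs is a hypothetical name, the input is slurped from stdin rather than read with readfile(), and a segment is skipped when it contains no </ref>, which is what happens to the text following a self-closing tag.

```shell
# Hypothetical helper: split on opening ref tags and skip segments with no closing tag.
# Usage: split_refs < file.txt
split_refs() {
    awk '
    { text = text $0 "\n" }                 # slurp all input into one string
    END {
        n = split(text, part, /<ref[^>]*>/)
        for (i = 2; i <= n; i++) {          # part[1] is the text before the first tag
            stop = index(part[i], "</ref>")
            if (stop == 0)
                continue                    # segment after a self-closing tag: nothing to print
            print substr(part[i], 1, stop - 1)
        }
    }'
}
```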

Mid Ocean


06-20-2015


What do you want tags like <ref name="trashotron.com"/> to do? Do you want them to be ignored, or do you want output with name="trashotron.com" as the tag text?

For a reference like:

Code:

<ref name="findarticles.com">[http://findarticles.com/p/articles/mi_m0EIN/is_2000_June_2/ai_62434142 Bookface.com Opens Books Online; www.bookface.com to Launch With Involvement of Major Publishers], June 2, 2000. Accessed January 27, 2008</ref>

do you want any reference to the name="findarticles.com" in the output?

If you had a file.txt (note that there are <newlines> in this text) containing:

Code:

<ref>[http://www.kirjasto.sci.fi/sking.htm Stephen King Bio at ''Books & Writers'']. <Accessed January 27>, 2008</ref>
<ref name="findarticles.com">[http://findarticles.com/p/articles/mi_m0EIN/is_2000_June_2/ai_62434142 Bookface.com Opens Books Online; Innovative Website Gives Readers Direct Access to Books. Accessed January 27, 2008</ref><ref>This
reference is split across
three lines.</ref><ref name="4 line split">Line 1;
Line 2.
Line 3,
Line 4.</ref><ref name="no tag text"/>

please show us exactly what output you would like to produce from this input (in CODE tags).

Don Cragun
