awk and regex of wikisource data


 
# 1  
Old 06-20-2015

This is for GNU Awk.

A sample file.txt contains this data (actual text from Wikipedia):


In June 2000, ''Bookface, Inc.'' launched the website [URL="http://www.Bookface.com"]www.Bookface.com[/URL], a "Read on Demand" service precipitated both by the concurrent [[print on demand]] boom, and launching during the hype surrounding [[Stephen King]]'s online-only novella ''The Plant'', which had been <launched in July>, 1999.<ref>[http://www.kirjasto.sci.fi/sking.htm Stephen King Bio at ''Books & Writers'']. <Accessed January 27>, 2008</ref> Bookface delivered "whole books and excerpts to readers directly", with publishers including [[HarperCollins]], Penguin Puttnam, [[Random House]] and Time Warner Trade Publishing lined up to provide Bookface with content.<ref name="findarticles.com">[http://findarticles.com/p/articles/mi_m0EIN/is_2000_June_2/ai_62434142 Bookface.com Opens Books Online; Innovative Website Gives Readers Direct Access to Books; www.bookface.com to Launch With Involvement of Major Publishers], June 2, 2000. Accessed January 27, 2008</ref>


There are thousands of files; this is example data.

I'd like to extract the text between the <ref></ref> pairs. Note that some of the ref pairs start with <ref name="findarticles.com"> where the name= portion could be just about anything and ends in ">". Or there may be no name= at all and start with <ref>. They always end in </ref>. Also the text between the ref pairs may contain other < and > characters (though no nested <ref></ref> pairs). Finally, file.txt will be accessed as a string via readfile(), not via getline.

This is what I have so far (this is a code fragment from a longer awk script that does other unrelated things, i.e., the readfile method is needed for other reasons):

Code:
@include "readfile"
BEGIN {
     file = readfile("file.txt")
     c = patsplit(file, b, "<ref[^>]{0,1000}+>[^<]+(</ref>)") 
     while(i++ < c) print b[i]
}

This works, except when the text between the ref pairs contains "<" or ">", as in the first ref pair in the above data ("<Accessed January 27>").

--Mid Ocean
# 2  
Old 06-20-2015
I understand not wanting to put such a long lined input file in CODE tags; but, in the future, please at least note that the entire file is a single line.

Assuming that each pair of <ref> and its matching </ref> tags are always on a single line in the files you want to process, this seems simpler:
Code:
awk -F'<ref>' '
{	for(i=2; i<=NF; i++)
		print substr($i, 1, match($i, "</ref>") - 1)
}' file.txt


As long as the text files really are text files (i.e., with no lines longer than LINE_MAX bytes including the terminating <newline> character), this should work with any awk utility. (But, as always, on Solaris/SunOS systems, change awk to /usr/xpg4/bin/awk.)
# 3  
Old 06-20-2015
I initially tried that solution (using split, same idea) and it gives incorrect results.

I can't post the full data on unix.com, but the full source for a Wikipedia article is on Pastebin:

{{Infobox person | name = Lou Anders | image = Lou Anders.jpg<!-- - Pastebin.com

Your code shows 18 ref pairs. In fact there are 23 as seen here:

https://en.wikipedia.org/wiki/Lou_Anders#References

The patsplit() solution "works": it picks up all 23 in this example. However, if you introduce a ">" or "<" character inside the text of a ref pair, it will skip that pair.

---------- Post updated at 05:06 PM ---------- Previous update was at 04:57 PM ----------

The other thing is I am working entirely in an awk script not from the command line or a shell script. And I really want to know how to solve this regex problem as it has application to other areas of my program. So I'm hoping that rather than finding a different solution using a different method, I can get help with my original question: what is the right regex for the patsplit() solution?

---------- Post updated at 05:14 PM ---------- Previous update was at 05:06 PM ----------

Your solution works after changing the field separator to -F'<ref[^>]{0,1000}+>', though it produces empty lines. Still working on it.

---------- Post updated at 05:27 PM ---------- Previous update was at 05:14 PM ----------

OK here's a solution based on your code

Code:
  c = split(file, b, "<ref[^>]{0,1000}+>")
  i = 1
  while(i++ < c) {
    print substr(b[i], 1, match(b[i], "</ref>") - 1)
  }

---------- Post updated at 05:36 PM ---------- Previous update was at 05:27 PM ----------

Yes this is working now. I'd like to learn how to do the regex but this solution with split/substr/match is working. Thanks for your help.
# 4  
Old 06-20-2015
You may want to give this a shot:
Code:
grep -Eo "<ref[^>]*>([^<]*|[^<]*<[^/]*>[^<]*)</ref>" file2
<ref>[http://www.kirjasto.sci.fi/sking.htm Stephen King Bio at ''Books & Writers'']. <Accessed January 27>, 2008</ref>
<ref name="findarticles.com">[http://findarticles.com/p/articles/mi_m0EIN/is_2000_June_2/ai_62434142 Bookface.com Opens Books Online; Innovative Website Gives Readers Direct Access to Books; www.bookface.com to Launch With Involvement of Major Publishers], June 2, 2000. Accessed January 27, 2008</ref>

---------- Post updated at 23:53 ---------- Previous update was at 23:47 ----------

Or
Code:
grep -Eo "<ref[^>]*>[^<]*(<[^/]*>[^<]*)*</ref>" file2
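Since the original question asked for a regex usable inside awk, here is a minimal sketch of the same ERE driven by a match() loop. It avoids gawk-only functions so it should run in any POSIX awk; the function name extract_refs and the sample input are made up for illustration, and slurping stdin into one string is only a stand-in for readfile(). In gawk the same regex string could presumably be handed to patsplit() instead.

```shell
# Hypothetical helper: print every <ref ...>...</ref> span found on stdin.
# The ERE is the one from the second grep command above.
extract_refs() {
    awk '
    { text = text $0 "\n" }              # slurp all input into one string
    END {
        re = "<ref[^>]*>[^<]*(<[^/]*>[^<]*)*</ref>"
        while (match(text, re)) {
            print substr(text, RSTART, RLENGTH)      # the matched reference
            text = substr(text, RSTART + RLENGTH)    # continue after it
        }
    }'
}

# Made-up sample with a "<b>" inside the first reference.
printf '%s\n' 'x<ref>a <b>, c</ref> y <ref name="n">d</ref>' | extract_refs
```

Two caveats with this pattern: an inner <...> that contains a "/" is not matched, and a self-closing tag such as <ref name="x"/> directly before a normal reference gets merged with it into one match, because <[^/]*> is happy to consume the following <ref>.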

# 5  
Old 06-20-2015
Standard POSIX BREs and EREs perform greedy matches. And, awk uses standard POSIX EREs. Greedy means that .* in the ERE <ref>.*</ref> matches the longest string of characters it can find that starts with <ref> and ends with </ref>. Creating an ERE that matches a string starting with a specific string and ending with another (longer than one character) string that doesn't contain the terminating (longer than one character) string is somewhere between hard and impossible depending on the terminating string. Shell parameter expansions provide ways to perform greedy expansions (${var##pattern} and ${var%%pattern}) and non-greedy expansions (${var#pattern} and ${var%pattern}). You may also be able to find something in gawk to tell it to use a non-greedy RE match.
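To make the greedy behaviour concrete, here is a small sketch on a made-up string. The variable names (s, greedy, first) are illustrative only. The awk call shows .* running through to the last </ref>, and the two parameter expansions show how the longest-suffix and shortest-prefix forms can be combined to isolate the first reference.

```shell
s='<ref>one</ref> text <ref>two</ref> tail'   # made-up sample string

# Greedy ERE: .* extends to the LAST </ref>, so both references
# come back as a single match.
greedy=$(printf '%s\n' "$s" |
    awk 'match($0, /<ref>.*<\/ref>/) { print substr($0, RSTART, RLENGTH) }')

# %% removes the longest suffix matching </ref>*, i.e. everything
# from the FIRST closing tag onward.
first=${s%%"</ref>"*}
# # removes the shortest prefix matching *<ref>, i.e. the opening tag.
first=${first#*"<ref>"}

printf '%s\n' "$greedy" "$first"
```

The quoted parts of each pattern are taken literally; only the unquoted * acts as a wildcard.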

You're right about my code missing references. I was just looking for <ref> when I should have also been looking for <ref name="string">. Changing my code to:
Code:
awk -F'<ref[^>]*>' '
{	for(i=2; i<=NF; i++)
		print substr($i, 1, match($i, "</ref>") - 1)
}' file.txt

will take care of that, but it does still depend on finding the opening and closing ref tags on the same line in your input files. Note that your <ref[^>]{0,1000}+> (which is one or more occurrences of zero to 1000 non-> characters between <ref and >) can be much more concisely written as <ref[^>]*> (which is zero or more occurrences of non-> characters between <ref and >).

The i=2 in the for loop should eliminate the blank lines problem.

If the above awk script doesn't work using gawk (which doesn't care much about line length limits), it must mean that some of your files do have the opening and closing ref tags on different lines. If that is your problem, we can try a shell script to do the parsing, but note that some of the references printed will contain <newline> characters in that case.
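If references can span lines, one way to get shortest-match behaviour without a non-greedy operator is to locate the opening tag with match() and then take the first </ref> after it with index(). The following is only a sketch under assumptions: extract_ref_text is a hypothetical name, the whole input is slurped into one string as a stand-in for readfile(), and any opening tag ending in "/>" is treated as self-closing and skipped.

```shell
# Hypothetical helper: print the text between each <ref ...> and the
# first </ref> that follows it, even when they sit on different lines.
extract_ref_text() {
    awk '
    { text = text $0 "\n" }                        # slurp all input
    END {
        while (match(text, /<ref[^>]*>/)) {
            tag  = substr(text, RSTART, RLENGTH)
            text = substr(text, RSTART + RLENGTH)  # drop through the opening tag
            if (tag ~ /\/>$/) continue             # self-closing: no body to print
            pos = index(text, "</ref>")            # first closing tag = shortest span
            if (pos == 0) break                    # unterminated reference: stop
            print substr(text, 1, pos - 1)
            text = substr(text, pos + 6)           # 6 = length of "</ref>"
        }
    }'
}

printf '%s\n' '<ref>This' \
    'reference is split</ref><ref name="x"/> mid <ref name="n">two</ref>' |
    extract_ref_text
```

As noted above, a reference that spans lines is printed with its embedded newline characters intact.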

The grep commands that RudiC suggested also depend on the opening and closing ref tags being on the same line.
# 6  
Old 06-20-2015
OK, it's good to know that it is hard to impossible, because I have struggled with how to do it. Your method is workable.

On non-greedy matching in gawk, I found this:

https://lists.gnu.org/archive/html/b.../msg00000.html

Oh well.

I'm using readfile() which reads the entire file into a variable - the line break characters are there but it's treated as a single long string, FN=1. Then use split to create fields.

In the past I've had problems with * matching to the end of the string (file). In this case it doesn't seem to matter. I added the {0,1000} for good measure, but you're right, it's not needed.

Note that the code and method are a little different when running as a script with readfile vs. running awk from the command line. i=1 not 2, using while not for loop. The reason for blank lines is because there is another type of ref in the document that looks like this: <ref name="trashotron.com"/> .. note the slash at the end and no closing </ref>. This gets treated as a split point with empty results - no harm and easy to work around by checking for null result.
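The null-result check can be sketched as follows. This is an illustration rather than the script actually in use: ref_text_via_split is a made-up name, the input is slurped from stdin instead of coming from readfile(), and a piece is printed only if it contains a closing </ref> with some text before it.

```shell
# Hypothetical helper: split on opening ref tags and skip pieces that
# have no closing </ref> (the text that follows a self-closing tag).
ref_text_via_split() {
    awk '
    { file = file $0 "\n" }                  # stand-in for readfile()
    END {
        c = split(file, b, /<ref[^>]*>/)
        for (i = 2; i <= c; i++) {           # b[1] is the text before the first tag
            n = match(b[i], /<\/ref>/)
            if (n > 1)                       # n == 0: no closing tag; n == 1: empty body
                print substr(b[i], 1, n - 1)
        }
    }'
}

printf '%s\n' 'a<ref>one</ref>b<ref name="x"/>c<ref name="n">two</ref>d' |
    ref_text_via_split
```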
# 7  
Old 06-20-2015
What do you want tags like <ref name="trashotron.com"/> to do? Do you want them to be ignored, or do you want output with name="trashotron.com" as the tag text?

For a reference like:
Code:
<ref name="findarticles.com">[http://findarticles.com/p/articles/mi_m0EIN/is_2000_June_2/ai_62434142 Bookface.com Opens Books Online; www.bookface.com to Launch With Involvement of Major Publishers], June 2, 2000. Accessed January 27, 2008</ref>

do you want any reference to the name="findarticles.com" in the output?

If you had a file.txt (note that there are <newlines> in this text) containing:
Code:
<ref>[http://www.kirjasto.sci.fi/sking.htm Stephen King Bio at ''Books & Writers'']. <Accessed January 27>, 2008</ref>
<ref name="findarticles.com">[http://findarticles.com/p/articles/mi_m0EIN/is_2000_June_2/ai_62434142 Bookface.com Opens Books Online; Innovative Website Gives Readers Direct Access to Books. Accessed January 27, 2008</ref><ref>This
reference is split across
three lines.</ref><ref name="4 line split">Line 1;
Line 2.
Line 3,
Line 4.</ref><ref name="no tag text"/>

please show us exactly what output you would like to produce from this input (in CODE tags).