awk and regex of wikisource data

Post 302947689 by Mid Ocean on Saturday 20th of June 2015 02:34:34 PM

This is for GNU Awk.

A sample file.txt contains this data (actual text from Wikipedia):

In June 2000, ''Bookface, Inc.'' launched the website [URL="http://www.Bookface.com"]www.Bookface.com[/URL], a "Read on Demand" service precipitated both by the concurrent [[print on demand]] boom, and launching during the hype surrounding [[Stephen King]]'s online-only novella ''The Plant'', which had been <launched in July>, 1999.<ref>[http://www.kirjasto.sci.fi/sking.htm Stephen King Bio at ''Books & Writers'']. <Accessed January 27>, 2008</ref> Bookface delivered "whole books and excerpts to readers directly", with publishers including [[HarperCollins]], Penguin Puttnam, [[Random House]] and Time Warner Trade Publishing lined up to provide Bookface with content.<ref name="findarticles.com">[http://findarticles.com/p/articles/mi_m0EIN/is_2000_June_2/ai_62434142 Bookface.com Opens Books Online; Innovative Website Gives Readers Direct Access to Books; www.bookface.com to Launch With Involvement of Major Publishers], June 2, 2000. Accessed January 27, 2008</ref>

There are thousands of files; this is example data.

I'd like to extract the text between the <ref></ref> pairs. Some ref pairs start with <ref name="findarticles.com">, where the name= portion could be just about anything and ends in ">"; others have no name= at all and start with plain <ref>. They always end in </ref>. The text between the ref pairs may also contain other < and > characters (though no nested <ref></ref> pairs). Finally, file.txt will be accessed as a string via readfile(), not via getline.

This is what I have so far (a code fragment from a longer awk script that does other, unrelated things; i.e. the readfile method is needed for other reasons):

Code:

@include "readfile"
BEGIN {
     file = readfile("file.txt")
     c = patsplit(file, b, "<ref[^>]{0,1000}+>[^<]+(</ref>)") 
     while(i++ < c) print b[i]
}

This works, except when the text between the ref pairs contains "<" or ">", as in the first ref pair in the above data ("<Accessed January 27>").

--Mid Ocean
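
The pattern above stops early because [^<]+ cannot cross a "<", so any ref body containing "<" never reaches its </ref>. Awk regexes have no non-greedy match, so one possible workaround is to skip the single big regex: find each opening tag with match(), then locate the first "</ref>" after it with index(), which is a plain string search and does not care what characters sit in between. The sketch below is only an illustration under the stated assumption that ref pairs are never nested. The names extract_refs and refs are made up for this example, and the line-by-line slurp merely stands in for readfile() so that the snippet runs in any awk.

```shell
# Hypothetical helper: print the body of each <ref ...>...</ref> pair read from stdin.
extract_refs() {
    awk '
    # Store each ref body found in s into out[1..n] and return n.
    # Assumes ref pairs are never nested.
    function refs(s, out,    n, rest, stop) {
        n = 0
        # Opening tag is "<ref>" or "<ref ...>", but not e.g. "<references/>".
        while (match(s, /<ref([ \t\n][^>]*)?>/)) {
            if (substr(s, RSTART + RLENGTH - 2, 2) == "/>") {
                s = substr(s, RSTART + RLENGTH)   # self-closing tag has no body: skip it
                continue
            }
            rest = substr(s, RSTART + RLENGTH)    # text after the opening tag
            stop = index(rest, "</ref>")          # first closing tag, plain string search
            if (stop == 0) break                  # unterminated ref: give up
            out[++n] = substr(rest, 1, stop - 1)
            s = substr(rest, stop + 6)            # carry on after "</ref>" (6 characters)
        }
        return n
    }
    { text = text $0 "\n" }                       # slurp all input; stands in for readfile()
    END {
        c = refs(text, b)
        for (i = 1; i <= c; i++) print b[i]
    }'
}
```

Running `extract_refs < file.txt` should print one ref body per match, including the one containing "<Accessed January 27>". Inside the original gawk script, the same refs() function could presumably be called as `c = refs(readfile("file.txt"), b)` in place of the patsplit() line, leaving the rest of the script alone.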
