A sample file.txt contains this data (actual text from Wikipedia):
In June 2000, ''Bookface, Inc.'' launched the website [URL="http://www.Bookface.com"]www.Bookface.com[/URL], a "Read on Demand" service precipitated both by the concurrent [[print on demand]] boom, and launching during the hype surrounding [[Stephen King]]'s online-only novella ''The Plant'', which had been <launched in July>, 1999.<ref>[http://www.kirjasto.sci.fi/sking.htm Stephen King Bio at ''Books & Writers'']. <Accessed January 27>, 2008</ref> Bookface delivered "whole books and excerpts to readers directly", with publishers including [[HarperCollins]], Penguin Puttnam, [[Random House]] and Time Warner Trade Publishing lined up to provide Bookface with content.<ref name="findarticles.com">[http://findarticles.com/p/articles/mi_m0EIN/is_2000_June_2/ai_62434142 Bookface.com Opens Books Online; Innovative Website Gives Readers Direct Access to Books; www.bookface.com to Launch With Involvement of Major Publishers], June 2, 2000. Accessed January 27, 2008</ref>
There are thousands of such files; this is example data.
I'd like to extract the text between the <ref></ref> pairs. Note that some of the ref pairs start with <ref name="findarticles.com"> where the name= portion could be just about anything and ends in ">". Or there may be no name= at all and start with <ref>. They always end in </ref>. Also the text between the ref pairs may contain other < and > characters (though no nested <ref></ref> pairs). Finally, file.txt will be accessed as a string via readfile(), not via getline.
This is what I have so far (a fragment from a longer awk script that does other, unrelated things, i.e. the readfile method is needed for other reasons):
This works, except when the text between the ref pairs contains "<" or ">", as in the first ref pair in the above data ("<Accessed January 27>").
I understand not wanting to put such a long lined input file in CODE tags; but, in the future, please at least note that the entire file is a single line.
Assuming that each <ref> tag and its matching </ref> tag are always on a single line in the files you want to process, this seems simpler:
As long as the text files really are text files (i.e., with no lines longer than LINE_MAX bytes including the terminating <newline> character), this should work with any awk utility. (But, as always, on Solaris/SunOS systems, change awk to /usr/xpg4/bin/awk.)
The patsplit() solution "works". It will pick up the 23 in this example. However, if you introduce a ">" or "<" character inside the text of a ref pair, it will skip that pair.
---------- Post updated at 05:06 PM ---------- Previous update was at 04:57 PM ----------
The other thing is that I am working entirely in an awk script, not from the command line or a shell script. And I really want to know how to solve this regex problem, as it applies to other areas of my program. So I'm hoping that, rather than finding a different solution using a different method, I can get help with my original question: what is the right regex for the patsplit() solution?
---------- Post updated at 05:14 PM ---------- Previous update was at 05:06 PM ----------
Your solution works after changing the field separator to -F'<ref[^>]{0,1000}+>', though it produces empty lines. Still working on it.
---------- Post updated at 05:27 PM ---------- Previous update was at 05:14 PM ----------
OK here's a solution based on your code
---------- Post updated at 05:36 PM ---------- Previous update was at 05:27 PM ----------
Yes, this is working now. I'd still like to learn how to do it with a regex, but this solution with split/substr/match works. Thanks for your help.
Standard POSIX BREs and EREs perform greedy matches, and awk uses standard POSIX EREs. Greedy means that the .* in the ERE <ref>.*</ref> matches the longest string of characters it can find that starts with <ref> and ends with </ref>, so on a line with several ref pairs it runs from the first <ref> to the last </ref>. Creating an ERE that matches a string that starts with one specific string and ends with another multi-character string, without containing that terminating string in between, is somewhere between hard and impossible, depending on the terminating string. Shell parameter expansions provide ways to perform greedy expansions (${var##pattern} and ${var%%pattern}) and non-greedy expansions (${var#pattern} and ${var%pattern}). You may also be able to find something in gawk to tell it to use a non-greedy RE match.
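One way around the greedy match, without needing a non-greedy regex at all, is to let the ERE find only the opening tag and then use index() to find the first </ref> after it. The following is a minimal sketch of that idea, not code from this thread: the function name extract_refs is made up for illustration, and it slurps its input into one string with plain awk rather than gawk's readfile(). It assumes ref pairs are not nested.

```sh
# extract_refs (hypothetical name): print the text of every <ref ...>...</ref>
# pair read from stdin, one result per output record.
extract_refs() {
  awk '
    { s = s $0 "\n" }                      # slurp all input into a single string
    END {
      # A greedy match such as /<ref[^>]*>.*<\/ref>/ would span from the first
      # opening tag to the LAST closing tag, so only the opening tag is matched.
      while (match(s, /<ref[^>]*>/)) {
        tag = substr(s, RSTART, RLENGTH)
        s = substr(s, RSTART + RLENGTH)    # drop everything through the opening tag
        if (tag ~ /\/>$/) continue         # self-closing <ref name="x"/> has no body
        e = index(s, "</ref>")             # FIRST closing tag: the non-greedy step
        if (e == 0) break                  # unbalanced input, stop scanning
        print substr(s, 1, e - 1)
        s = substr(s, e + 6)               # 6 = length("</ref>")
      }
    }'
}
```

Usage would look like `extract_refs < file.txt`. Because the whole input is treated as one string, a reference that spans several lines is printed with its embedded newline characters.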
You're right about my code missing references. I was just looking for <ref> when I should have also been looking for <ref name="string">. Changing my code to:
will take care of that, but it does still depend on finding the opening and closing ref tags on the same line in your input files. Note that your <ref[^>]{0,1000}+> (which is one or more occurrences of zero to 1000 non-> characters between <ref and >) can be much more concisely written as <ref[^>]*> (which is zero or more occurrences of non-> characters between <ref and >).
The i=2 in the for loop should eliminate the blank lines problem.
If the above awk script doesn't work using gawk (which doesn't care much about line length limits), it must mean that some of your files do have the opening and closing ref tags on different lines. If that is your problem, we can try a shell script to do the parsing, but note that some of the references printed will contain <newline> characters in that case.
The grep commands that RudiC suggested also depend on the opening and closing ref tags being on the same line.
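For readers following along, here is a minimal sketch of the field-separator approach discussed in this post, under the same assumption that each opening tag and its closing tag sit on one line. It is an illustration, not the script originally posted: the function name refs_by_field is hypothetical. Every opening tag becomes a field separator, so each field from the second onward begins with the text of one reference, and at most one </ref> can appear in it when refs are not nested.

```sh
# refs_by_field (hypothetical name): print reference text line by line,
# reading the named files or stdin.
refs_by_field() {
  awk -F'<ref[^>]*>' '
    {
      # Field 1 is whatever precedes the first opening tag, so start at 2.
      for (i = 2; i <= NF; i++) {
        f = $i
        # Cut from the first </ref> to the end of the field. sub() returns 0
        # when there is no closing tag (for example after <ref name="x"/>),
        # and such fields are skipped instead of being printed.
        if (sub(/<\/ref>.*/, "", f)) print f
      }
    }' "$@"
}
```

Called as `refs_by_field file.txt`, this prints one reference per line. Testing the return value of sub() is a design choice that avoids treating text after a self-closing tag as a reference.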
I'm using readfile() which reads the entire file into a variable - the line break characters are there but it's treated as a single long string, FN=1. Then use split to create fields.
In the past I've had problems with * matching to the end of the string (file). In this case it doesn't seem to matter. I added the {0} for good measure but you're right it's not needed.
Note that the code and method are a little different when running as a script with readfile vs. running awk from the command line. i=1 not 2, using while not for loop. The reason for blank lines is because there is another type of ref in the document that looks like this: <ref name="trashotron.com"/> .. note the slash at the end and no closing </ref>. This gets treated as a split point with empty results - no harm and easy to work around by checking for null result.
What do you want tags like <ref name="trashotron.com"/> to do? Do you want them to be ignored, or do you want output with name="trashotron.com" as the tag text?
For a reference like:
do you want any reference to the name="findarticles.com" in the output?
If you had a file.txt (note that there are <newlines> in this text) containing:
please show us exactly what output you would like to produce from this input (in CODE tags).