awk and regex of wikisource data Post: 302947704

Sponsored Content

Top Forums Shell Programming and Scripting awk and regex of wikisource data Post 302947704 by Don Cragun on Saturday 20th of June 2015 06:05:42 PM

06-20-2015

Registered User

Standard POSIX BREs and EREs perform greedy matches. And, awk uses standard POSIX EREs. Greedy means that .* in the ERE <ref>.*</ref> matches the longest string of characters it can find that starts with <ref> and ends with </ref>. Creating an ERE that matches a string starting with a specific string and ending with another (longer than one character) string that doesn't contain the terminating (longer than one character) string is somewhere between hard and impossible depending on the terminating string. Shell parameter expansions provide ways to perform greedy expansions (${var##pattern} and ${var%%pattern}) and non-greedy expansions (${var#pattern} and ${var%pattern}). You may also be able to find something in gawk to tell it to use a non-greedy RE match.

You're right about my code missing references. I was just looking for <ref> when I should have also been looking for <ref name="string">. Changing my code to:

Code:

awk -F'<ref[^>]*>' '
{	for(i=2; i<=NF; i++)
		print substr($i, 1, match($i, "</ref>") - 1)
}' file.txt

will take care of that, but it does still depend on finding the opening and closing ref tags on the same line in your input files. Note that your <ref[^>]{0,1000}+> (which is one or more occurrences of zero to 1000 non-> characters between <ref and >) can be much more concisely written as <ref[^>]*> (which is zero or more occurrences of non-> characters between <ref and >).

The i=2 in the for loop should eliminate the blank lines problem.

If the above awk script doesn't work using gawk (which doesn't care much about line length limits), it must mean that some of your files do have the opening and closing ref tags on different lines. If that is your problem, we can try a shell script to do the parsing, but note that some of the references printed will contain <newline> characters in that case.

The grep commands that RudiC suggested also depend on the opening and closing ref tags being on the same line.

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

awk or regex

Hi! I want to made a program that will generate code like this: {{Navedi XYZ |avtor=XYZ1 |naslov=XYZ2 |leto_izzida=XYZ3 |zalozba=XYZ4 |kraj=XYZ5 |isbn=XYZ6 |cobiss_id=XYZ7 }} from input like this: <b> ODGOVORNOST............. : <a...

2. Shell Programming and Scripting

Extracting a regex with awk

I have a regexp that I wish to match against every line of a file using awk. But I do not want to substitute it or select the line. I want to pull the matched text out and put it in a different file, line by line. What is the correct awk usage to *extract* a regexp and put it in another...

3. Shell Programming and Scripting

sed to awk (regex pattern) how?

Hello, I am trying to covert a for statement into a single awk script and I've got everything but one part. I also need to execute an external script when "not found", how can I do that ? for TXT in `find debugme -name "*.txt"` ;do FPATH=`echo $TXT | sed 's/$.*$\/$.*$/\1/'` how...

4. Shell Programming and Scripting

awk regex problem

hi everyone suppose my input file is ABC-12345 ABCD-12345 BCD-123456 i want to search the specific pattern which looks like - in a file so i used this command cat $file | awk ' { if ($0 ~ /-/) { print } }' so it gives me the result as ABCD-12345 BCD-12345 BCD-12345 ...

5. UNIX for Dummies Questions & Answers

Using AWK and regex

Hi can you suggest in this regard The sample.txt conatins the data name lines type sam 12 txt sam 24 xls sam 36 pdf ram 32 txt ram 45 sxls ram 58 word sam 92 jpeg sam 21 gif sam 22 ltf from the data i need to sum all line...

6. Shell Programming and Scripting

awk equivalent of regex

Hi all, Can someone tell me what's the (g)awk equal of this simple regex to find ip addresses in urls: egrep "^http://{1,3}\.{1,3}\.{1,3}\.{1,3}(:{1,5})?/"Input: http://10.0.0.1/query.exe http://11y10x09w:80/howaboutme http://192.168.100.190:1234/takeme.gpg Output:...

7. Shell Programming and Scripting

RegeX to parse data from a txt file

Hi all the experts out there, I am totally new to perl and I was given an assignment by using Perl to find the 2nd element of each line in each curly bracket which made up of 5 elements. Expected result should like this: Type: VCC Pin_name: AK32,AL32,AH21,..... Type: NC Pin_name:...

8. Shell Programming and Scripting

Regex to Parse data

Experts and Informed folks, Need some help here in parsing the log file. 1389675 Opera_ShirtCatalog INSERT INTO Opera_ShirtCatalog(COL1, COL2) VALUES (1, 'TEST1'), (2,'TEST2'); 1389685 Opera_ShirtCatlog_Wom INSERT INTO Opera_ShirtCatlog_Wom(col1, col2, col3) VALUES (9,'Siz12, FormFit',...

9. Shell Programming and Scripting

wildcard in regex for awk

10. Shell Programming and Scripting

Regex within IF statement in awk

Hello to all, I have: X="string 1-" Y="-string 2" Z="string 1-20-string 2"In the position of the number 20 could be different numbers, but I'm interest only when the number is 15, 20,45 or 70. I want to include an IF within an awk code with a regex in the following way. ...

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

awk or regex

Discussion started by: smihael

2. Shell Programming and Scripting

Extracting a regex with awk

Discussion started by: Enobarbus37

3. Shell Programming and Scripting

sed to awk (regex pattern) how?

Discussion started by: TehOne

4. Shell Programming and Scripting

awk regex problem

Discussion started by: aishsimplesweet

5. UNIX for Dummies Questions & Answers

Using AWK and regex

Discussion started by: krashraj

6. Shell Programming and Scripting

awk equivalent of regex

Discussion started by: r4v3n

7. Shell Programming and Scripting

RegeX to parse data from a txt file

Discussion started by: killbanne

8. Shell Programming and Scripting

Regex to Parse data

Discussion started by: ManoharMa

9. Shell Programming and Scripting

wildcard in regex for awk

Discussion started by: black_fender

10. Shell Programming and Scripting

Regex within IF statement in awk

Discussion started by: Ophiuchus