awk and regex of wikisource data: Post 302947692 by Don Cragun on Saturday 20th of June 2015 04:11:32 PM
I understand not wanting to put an input file with such a long line in CODE tags, but in the future, please at least note that the entire file is a single line.

Assuming that each <ref> tag and its matching </ref> tag are always on the same line in the files you want to process, this seems simpler:
Code:
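awk -F'<ref>' '
# With <ref> as the field separator, every field after the first
# starts right after an opening tag; print it up to its </ref>.
{	for(i=2; i<=NF; i++)
		print substr($i, 1, match($i, "</ref>") - 1)
}' file.txt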
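
As a quick check of how the field splitting behaves, here is the same program run against a made-up one-line sample. The sample text and the variable names `sample` and `result` are only for illustration; they are not taken from the actual wikisource input.

```shell
# Hypothetical one-line sample; the real wikisource input is not shown here.
sample='Intro<ref>first note</ref> middle <ref>second note</ref> end'

# Same awk program as above, reading the sample from standard input.
result=$(printf '%s\n' "$sample" | awk -F'<ref>' '
{	for(i=2; i<=NF; i++)
		print substr($i, 1, match($i, "</ref>") - 1)
}')

# One extracted reference per output line.
printf '%s\n' "$result"
```

Field 1 ("Intro") is skipped because it precedes the first <ref>; each later field is cut off just before its </ref>.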


As long as the text files really are text files (i.e., with no lines longer than LINE_MAX bytes including the terminating <newline> character), this should work with any awk utility. (But, as always, on Solaris/SunOS systems, change awk to /usr/xpg4/bin/awk.)
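
If you are not sure whether a single-line file stays within that limit, something like the following rough sketch can tell you. The function name `fits_line_max` is made up here, and the check assumes the whole file is one line (as in this thread), so the file's byte count is the line length including the terminating <newline>.

```shell
# fits_line_max FILE
# Succeeds if FILE, treated as a single line, is no longer than LINE_MAX
# bytes. Assumption: the entire file is one line, so wc -c is its length.
fits_line_max() {
	limit=$(getconf LINE_MAX)
	bytes=$(($(wc -c < "$1")))
	[ "$bytes" -le "$limit" ]
}
```

For example, `fits_line_max file.txt || echo "line longer than LINE_MAX"` prints a warning when the file is not a text file in the strict sense. Many awk implementations still cope with longer lines, but that is not guaranteed.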
 
