How to extract data from BNC xml with reference brackets?


 
# 1  
Old 12-11-2008

I have data like the following pattern:
<change date="2000-01-09" who="#OUCS">Updated all catrefs</change>

<change date="2000-01-08" who="#OUCS">Manually updated tagcounts, titlestmt, and title in source</change>

<change date="1999-09-13" who="#UCREL">POS codes revised for BNC-2; header updated</change>

<change date="1994-11-24" who="#dominic">Initial accession to corpus</change>

</revisionDesc>
</teiHeader>
- <wtext type="NONAC">
- <div level="1" n="1" type="leaflet">
- <head type="MAIN">
- <s n="1">
<w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET</w>

<w c5="DTQ" hw="what" pos="PRON">WHAT</w>

<w c5="VBZ" hw="be" pos="VERB">IS</w>

<w c5="NN1" hw="aids" pos="SUBST">AIDS</w>

<c c5="PUN">?</c>

</s>


</head>


- <p>
- <s n="2">
- <hi rend="bo">
<w c5="NN1" hw="aids" pos="SUBST">AIDS</w>

<c c5="PUL">(</c>

<w c5="VVN-AJ0" hw="acquire" pos="VERB">Acquired</w>

<w c5="AJ0" hw="immune" pos="ADJ">Immune</w>

<w c5="NN1" hw="deficiency" pos="SUBST">Deficiency</w>

<w c5="NN1" hw="syndrome" pos="SUBST">Syndrome</w>

<c c5="PUR">)</c>

</hi>


<w c5="VBZ" hw="be" pos="VERB">is</w>

<w c5="AT0" hw="a" pos="ART">a</w>

<w c5="NN1" hw="condition" pos="SUBST">condition</w>

<w c5="VVN" hw="cause" pos="VERB">caused</w>

<w c5="PRP" hw="by" pos="PREP">by</w>

<w c5="AT0" hw="a" pos="ART">a</w>


To extract elements matching the pattern
<w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)</w>
I first wrote the following command: sed 's/<w c5="\(.*?\)" hw="\(.*?\)" pos="\(.*?\)">\(.*?\)<\/w>/\1:\4/g' A00.xml
However, the output still looks like this, which is not what I want:
<s n="420"><w c5="NN1" hw="aids" pos="SUBST">AIDS </w><w c5="NN1-VVB" hw="care" pos="SUBST">Care </w><w c5="NN1" hw="education" pos="SUBST">Education </w><w c5="CJC" hw="and" pos="CONJ">and </w><w c5="NN1" hw="training" pos="SUBST">Training </w><w c5="VBZ" hw="be" pos="VERB">is </w><w c5="AT0" hw="a" pos="ART">a </w><w c5="NN1" hw="company" pos="SUBST">company </w><w c5="VVN" hw="limit" pos="VERB">limited </w><w c5="PRP" hw="by" pos="PREP">by </w><w c5="NN1" hw="guarantee" pos="SUBST">guarantee</w><c c5="PUN">.</c></s>

It seems the replacement doesn't happen.

I want output like the following for every match of the pattern <w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)</w>:

NN1:FACTSHEET
DTQ:WHAT
VBZ:IS
NN1:AIDS

Second, I tried awk '/<w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)<\/w>/ {print $1,$2,$3,$4}' A00.xml. Again the result is not what I want: it doesn't print the parts matched inside the ().

How can I extract just the parts inside the (), which mark the pieces I need?

Thanks for all your suggestions.
John
# 2  
Old 12-14-2008
awk doesn't use that kind of syntax to assign matches to subexpressions... you must have seen that in perl somewhere?

Your code works with only minor modifications in perl:

Code:
perl -ne '
        if (/<w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)<\/w>/) {print "$1,$2,$3,$4\n"}
' inputfile > outputfile
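
For anyone who has to stay with awk: POSIX awk has no numbered capture groups, but match() sets RSTART and RLENGTH, so each element can be cut out with substr() and then split on its delimiter characters. The following is only a sketch. The function name extract_awk, the use of [^"]* in place of .*?, and the trimming of trailing spaces are choices made for this example, not something taken from the posts above.

```shell
#!/bin/sh
# extract_awk (hypothetical helper): print "c5:word" for every
# <w ...>...</w> element. Reads the files given as arguments, or stdin.
extract_awk() {
    awk '{
        line = $0
        # Find the next <w> element, handle it, then continue with the rest
        # of the line, so several elements on one line are all reported.
        while (match(line, /<w c5="[^"]*" hw="[^"]*" pos="[^"]*">[^<]*<\/w>/)) {
            elem = substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH)
            # Splitting on <, > and " puts the c5 value in part[3]
            # and the element text in part[9].
            split(elem, part, /[<>"]/)
            word = part[9]
            sub(/ +$/, "", word)    # drop trailing spaces inside the element
            print part[3] ":" word
        }
    }' "$@"
}
```

It would be called as extract_awk A00.xml > outputfile.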

# 3  
Old 12-14-2008
From the book Unix Power Tools, Third Edition
Authors: Shelley Powers, Jerry Peek, Tim O'Reilly, Mike Loukides
Publisher: O'Reilly & Associates
ISBN: 0596003307
Pages: 1200
Edition: 3rd edition (October 1, 2002)

32.13 Regular Expressions: Remembering Patterns with \(, \), and \1
Another pattern that requires a special mechanism is searching for repeated words. The expression [a-z][a-z] will match any two lowercase letters. If you wanted to search for lines that had two adjoining identical letters, the above pattern wouldn't help. You need a way to remember what you found and see if the same pattern occurs again. In some programs, you can mark part of a pattern using \( and \). You can recall the remembered pattern with \ followed by a single digit.[4] Therefore, to search for two identical letters, use \([a-z]\)\1. You can have nine different remembered patterns. Each occurrence of \( starts a new pattern. The regular expression to match a five-letter palindrome (e.g., "radar") is: \([a-z]\)\([a-z]\)[a-z]\2\1. [Some versions of some programs can't handle \( \) in the same regular expression as \1, etc. In all versions of sed, you're safe if you use \( \) on the pattern side of an s command — and \1, etc., on the replacement side (Section 34.11). — JP]

— BB
34.11 Referencing Portions of a Search String
In sed, the substitution command provides metacharacters to select any individual portion of a string that is matched and recall it in the replacement string. A pair of escaped parentheses are used in sed to enclose any part of a regular expression and save it for recall. Up to nine "saves" are permitted for a single line. \n is used to recall the portion of the match that was saved, where n is a number from 1 to 9 referencing a particular "saved" string in order of use. (Section 32.13 has more information.)

For example, when converting a plain-text document into HTML, we could convert section numbers that appear in a cross-reference into an HTML hyperlink. The following expression is broken onto two lines for printing, but you should type all of it on one line:

s/\([sS]ee \)\(Section \)\([1-9][0-9]*\)\.\([1-9][0-9]*\)/
\1<a href="#SEC-\3_\4">\2\3.\4<\/a>/
Four pairs of escaped parentheses are specified. String 1 captures the word see with an upper- or lowercase s. String 2 captures the section number (because this is a fixed string, it could have been simply retyped in the replacement string). String 3 captures the part of the section number before the decimal point, and String 4 captures the part of the section number after the decimal point. The replacement string recalls the first saved substring as \1. Next starts a link where the two parts of the section number, \3 and \4, are separated by an underscore (_) and have the string SEC- before them. Finally, the link text replays the section number again — this time with a decimal point between its parts. Note that although a dot (.) is special in the search pattern and has to be quoted with a backslash there, it's not special on the replacement side and can be typed literally. Here's the script run on a short test document, using checksed (Section 34.4):

% checksed testdoc
********** < = testdoc > = sed output **********
8c8
< See Section 1.2 for details.
---
> See <a href="#SEC-1_2">Section 1.2</a> for details.
19c19
< Be sure to see Section 23.16!
---
> Be sure to see <a href="#SEC-23_16">Section 23.16</a>!
We can use a similar technique to match parts of a line and swap them. For instance, let's say there are two parts of a line separated by a colon. We can match each part, putting them within escaped parentheses and swapping them in the replacement:

% cat test1
first:second
one:two
% sed 's/\(.*\):\(.*\)/\2:\1/' test1
second:first
two:one
The larger point is that you can recall a saved substring in any order and multiple times. If you find that you need more than nine saved matches, or would like to be able to group them into matches and submatches, take a look at Perl.

Section 43.10, Section 31.10, Section 10.9, and Section 36.23 have examples.

—DD and JP


I tested it, and it works on a list of lines that all follow the same pattern. The problem in my case is the first step: I fail to get each match of the regular expression <w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)<\/w> onto its own line, such as <w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET</w>

<w c5="DTQ" hw="what" pos="PRON">WHAT</w>

<w c5="VBZ" hw="be" pos="VERB">IS</w>

<w c5="NN1" hw="aids" pos="SUBST">AIDS</w>

My output is not clean: it contains other content from outside the regular expression, such as <s n="420">.

Strangely, it works on the book's example but not in my case.

Best

John
# 4  
Old 12-15-2008
Sorry, I can't make sense of what you're saying.

However I notice you described in your original post that you wanted the output in this format:

Code:
NN1:FACTSHEET
DTQ:WHAT
VBZ:IS
NN1:AIDS

So try this instead:

Code:
perl -ne '
        if (/<w c5="(.*?)" hw=".*?" pos=".*?">(.*?)<\/w>/) {print "$1:$2\n"}
' inputfile > outputfile

# 5  
Old 12-15-2008
Thanks first of all.

The first six paragraphs are quoted from a book that explains how to use sed with parentheses. I don't know why it doesn't work in my case.

Best
John
# 6  
Old 12-15-2008
What operating system are you using? I think the .*? parts are the problem: that non-greedy syntax comes from Perl, and sed does not support it. In sed's basic regular expressions the ? is an ordinary character, so .*? only matches text that ends in a literal question mark, and your pattern never matches.

Try this, which works for me on HP-UX:

Code:
sed -n 's/<w c5="\(.*\)" hw="\(.*\)" pos="\(.*\)">\(.*\)<\/w>/\1:\4/gp' inputfile > outputfile
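
A limitation of the command above: a greedy .* runs to the last " hw=" on the line, so it only behaves when each line holds a single <w> element. sed has no non-greedy quantifier, and the usual substitute is a negated bracket expression such as [^"]*, which stops at the first closing quote. The sketch below first puts every <w> element on its own line, then extracts from each. The function name extract_sed is hypothetical, and the pattern assumes the attribute order c5, hw, pos shown in the first post.

```shell
#!/bin/sh
# extract_sed (hypothetical helper): print "c5:word" for every <w> element.
# Reads the files given as arguments, or stdin.
extract_sed() {
    # Pass 1: start a new line before every "<w " (a backslash followed by
    #         a newline is the portable way to insert a line break).
    # Pass 2: on lines that contain a <w> element, keep only "c5:word";
    #         -n with the p flag suppresses every other line.
    sed 's/<w /\
<w /g' "$@" |
    sed -n 's/.*<w c5="\([^"]*\)" hw="[^"]*" pos="[^"]*">\([^<]*[^< ]\) *<\/w>.*/\1:\2/p'
}
```

It would be called as extract_sed inputfile > outputfile. The [^< ] before " *" keeps the trailing space that some elements carry out of the captured word.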

# 7  
Old 12-15-2008
I am using SSH Secure Shell Client on Windows XP. The client is connected to the Unix server at our school.