How to extract data from BNC xml with reference brackets?


 
# 8  
Old 12-15-2008
I have tried
sed -n 's/<w c5="\(.*\)" hw="\(.*\)" pos="\(.*\)">\(.*\)<\/w>/\1:\4/gp' A00.xml > test.txt

as you suggested, yet the output file test.txt is still a mess. I attach the original file and the output file.

Is the problem that we have not first extracted everything matching the pattern <w c5="\(.*\)" hw="\(.*\)" pos="\(.*\)">\(.*\)</w> into list form?

I tried egrep, but that did not work either.

Thanks for your discussion and instruction.
# 9  
Old 12-15-2008
This sounds like an assignment or homework, which is against the forum rules.

The problem is that ".*" is greedy: it matches as much as it can, and because each line of your data holds several <w> elements, a single match runs from the first element on the line to the last. Try something like "\([^"]*\)" instead of "\(.*\)" to limit each match to the contents of one pair of quotation marks. [^"]* means any number of characters excluding ".

You will need some additional search and replaces to remove the <s ...> and <c ...> </c> tags, but I'll leave that as an exercise for you.
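A minimal sketch of that suggestion, assuming every <w> element carries its attributes in the order c5, hw, pos, as in the samples in this thread. The function name extract_c5_words is made up for illustration. The first sed puts each <w> element on its own line, so the second never sees more than one element per line; [^"]* and [^<]* keep each match inside one attribute value or one element. Lines that are not <w> elements, such as a leftover <s ...> tag, are simply not printed.

```shell
# Hypothetical helper: print "c5:word" for every <w> element in the input.
# Assumes the attribute order c5, hw, pos seen in this thread's samples.
extract_c5_words() {
    # Step 1: insert a line break before each "<w " so that every element
    # starts its own line (backslash-newline is the portable way to write
    # a newline in a sed replacement).
    sed 's/<w /\
<w /g' "$@" |
    # Step 2: keep the c5 value and the word, dropping the trailing space
    # inside the element and anything after </w> (such as <c> or </s> tags).
    sed -n 's/^<w c5="\([^"]*\)" hw="[^"]*" pos="[^"]*">\([^<]*[^< ]\) *<\/w>.*/\1:\2/p'
}
```

It could be run as extract_c5_words A00.xml > outputfile, or at the end of a pipeline, since it reads standard input when no file is given.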
# 10  
Old 12-16-2008
Thanks for your reply.

First, this is not homework or an assignment. I am doing research in corpus linguistics and am trying to find an effective way to collect data. Remembering matched items with parentheses is seldom covered in the examples I have seen.

I am afraid your version sed -n 's/<w c5="\(.*\)" hw="\(.*\)" pos="\(.*\)">\(.*\)<\/w>/\1:\4/gp' inputfile > outputfile doesn't work.

First, with /gp, the items in my results are repeated or doubled.

Second, I tried sed 's/<w c5="\(.*\)" hw="\(.*\)" pos="\(.*\)">\(.*\)<\/w>/\1:\4/' test2.txt, and that works. The content of test2.txt is:
<w c5="VBZ" hw="be" pos="VERB">is</w>
<w c5="AT0" hw="a" pos="ART">a</w>
<w c5="NN1" hw="condition" pos="SUBST">condition</w>
<w c5="VVN" hw="cause" pos="VERB">caused</w>
<w c5="PRP" hw="by" pos="PREP">by</w>
<w c5="AT0" hw="a" pos="ART">a</w>
<w c5="NN1" hw="virus" pos="SUBST">virus</w>
<w c5="VVN" hw="call" pos="VERB">called</w>
<w c5="NP0" hw="hiv" pos="SUBST">HIV</w>

Then the result is

VBZ:is
AT0:a
NN1:condition
VVN:caused
PRP:by
AT0:a
NN1:virus
VVN:called
NP0:HIV

That means the sed command only works for word lists with one <w> element per line, like the one above. Moreover, if we "need some additional search and replaces to remove the <s ...> and <c ...> </c> tags", this may not be the best way.

I don't know why it does not work for my whole file, A00.xml.

Best
John
# 11  
Old 12-16-2008
How can we grep only the content that matches a regular expression?

I first tried to collect the content matching <w c5=".*" hw=".*" pos=".*?">.*</w> in A00.xml.

I used the following command:

egrep "<w c5=".*" hw=".*" pos=".*?">.*</w>" A00.xml

The result is:

<s n="396"><w c5="PNP" hw="we" pos="PRON">We </w><w c5="VVB" hw="make" pos="VERB">make </w><w c5="AT0" hw="the" pos="ART">the </w><w c5="DT0" hw="most" pos="ADJ">most </w><w c5="PRF" hw="of" pos="PREP">of </w></s>

First, the output contains an unexpected <s n=...> part.

Second, the elements are not in list form, like this:
<w c5="PNP" hw="we" pos="PRON">We </w>
<w c5="VVB" hw="make" pos="VERB">make </w>
<w c5="AT0" hw="the" pos="ART">the </w>
<w c5="DT0" hw="most" pos="ADJ">most </w>
<w c5="PRF" hw="of" pos="PREP">of </w>
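Two things stand in the way of the egrep attempt above. The pattern contains double quotes inside a double-quoted shell string, so the shell removes them before grep ever sees the pattern. And grep prints whole matching lines by default, which is why the <s n=...> part comes along. A sketch of one way to get the list form, assuming a grep that supports the -o option (GNU and BSD grep do; it is not in POSIX) and using a made-up function name list_w_elements:

```shell
# Hypothetical helper: print each <w ...>...</w> element on its own line.
# The pattern is single-quoted so the shell leaves its double quotes alone,
# and [^>]* / [^<]* stop each match at the end of one tag / one element.
list_w_elements() {
    # -o prints only the matched parts, one match per output line.
    grep -o '<w [^>]*>[^<]*</w>' "$@"
}
```

For a single file, list_w_elements A00.xml prints only the <w> elements; with several files, grep also prefixes each match with its file name unless -h is given.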
# 12  
Old 12-16-2008
I'm glad it's not homework; I just thought I should check because we get a lot of posts like that here.

You don't seem to have tried what I suggested in my previous post to prevent greedy matching?

Is there any particular reason why you want to use sed? This perl one-liner seems to do what you require, as I understand it anyway:

Code:
perl -ne 'while (/<w c5="(.*?)" hw=".*?" pos=".*?">(.*?)<\/w>/g) {print "$1:$2\n"}' A00.xml > outputfile

# 13  
Old 12-16-2008
Quote:
Originally Posted by Annihilannic
I'm glad it's not homework; I just thought I should check because we get a lot of posts like that here.

You don't seem to have tried what I suggested in my previous post to prevent greedy matching?

Is there any particular reason why you want to use sed? This perl one-liner seems to do what you require, as I understand it anyway:

Code:
perl -ne 'while (/<w c5="(.*?)" hw=".*?" pos=".*?">(.*?)<\/w>/g) {print "$1:$2\n"}' A00.xml > outputfile

First, we have a Unix system installed on a server, and we have many XML files as big as 4 GB to process, so I think the server can process them much faster than my desktop computer. Second, I have not learned perl before, and I am afraid that learning a new scripting language will consume too much of my time. Third, I have tried other software such as PowerGREP and TextPipe, but they must be purchased after the evaluation period. As far as I understand, they offer similar functions for extracting data with regular expressions. So I want to make full use of the Unix tools sed, awk, and grep to achieve what these programs do.
# 14  
Old 12-16-2008
I don't know why you're telling me about your Unix system... I assumed you were doing this on Unix anyway? perl is widely found on Unix systems, and is more efficient at processing large amounts of data, so I would say it is ideal for your purposes (and very useful to learn!).

sed, awk and grep can also be used equally well for your task; I've given you some tips which you don't appear to have tried yet... so I'll wait until you give them a go. Let me know if you get stuck and have any specific questions.
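For comparison, a rough awk sketch of the same extraction, again assuming the attribute order c5, hw, pos from the samples in this thread; the function name extract_with_awk is invented for illustration. It uses only match(), substr(), split() and sub(), which POSIX awk provides.

```shell
# Hypothetical helper: print "c5:word" for every <w> element, using awk only.
extract_with_awk() {
    awk '
    {
        line = $0
        # Find the next <w> element, handle it, then continue with the
        # remainder of the line until no element is left.
        while (match(line, /<w c5="[^"]*" hw="[^"]*" pos="[^"]*">[^<]*<\/w>/)) {
            elem = substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH)
            split(elem, q, "\"")        # q[2] is the c5 value
            text = elem
            sub(/^[^>]*>/, "", text)    # drop the opening tag
            sub(/<\/w>$/, "", text)     # drop the closing tag
            sub(/ +$/, "", text)        # drop the trailing space
            print q[2] ":" text
        }
    }' "$@"
}
```

It is called the same way as the other tools, for example extract_with_awk A00.xml > outputfile.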