The UNIX and Linux Forums  

#1  12-11-2008  Johnivy (Registered User)
How to extract data from BNC xml with reference brackets?

I have data like the following pattern:
<change date="2000-01-09" who="#OUCS">Updated all catrefs</change>

<change date="2000-01-08" who="#OUCS">Manually updated tagcounts, titlestmt, and title in source</change>

<change date="1999-09-13" who="#UCREL">POS codes revised for BNC-2; header updated</change>

<change date="1994-11-24" who="#dominic">Initial accession to corpus</change>

</revisionDesc>
</teiHeader>
- <wtext type="NONAC">
- <div level="1" n="1" type="leaflet">
- <head type="MAIN">
- <s n="1">
<w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET</w>

<w c5="DTQ" hw="what" pos="PRON">WHAT</w>

<w c5="VBZ" hw="be" pos="VERB">IS</w>

<w c5="NN1" hw="aids" pos="SUBST">AIDS</w>

<c c5="PUN">?</c>

</s>


</head>


- <p>
- <s n="2">
- <hi rend="bo">
<w c5="NN1" hw="aids" pos="SUBST">AIDS</w>

<c c5="PUL">(</c>

<w c5="VVN-AJ0" hw="acquire" pos="VERB">Acquired</w>

<w c5="AJ0" hw="immune" pos="ADJ">Immune</w>

<w c5="NN1" hw="deficiency" pos="SUBST">Deficiency</w>

<w c5="NN1" hw="syndrome" pos="SUBST">Syndrome</w>

<c c5="PUR">)</c>

</hi>


<w c5="VBZ" hw="be" pos="VERB">is</w>

<w c5="AT0" hw="a" pos="ART">a</w>

<w c5="NN1" hw="condition" pos="SUBST">condition</w>

<w c5="VVN" hw="cause" pos="VERB">caused</w>

<w c5="PRP" hw="by" pos="PREP">by</w>

<w c5="AT0" hw="a" pos="ART">a</w>


I want to extract the elements that match this pattern:
<w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)</w>
First, I wrote the following command: sed 's/<w c5="\(.*?\)" hw="\(.*?\)" pos="\(.*?\)">\(.*?\)<\/w>/\1:\4/g' A00.xml
However, the result looks like this, which is not what I want:
<s n="420"><w c5="NN1" hw="aids" pos="SUBST">AIDS </w><w c5="NN1-VVB" hw="care" pos="SUBST">Care </w><w c5="NN1" hw="education" pos="SUBST">Education </w><w c5="CJC" hw="and" pos="CONJ">and </w><w c5="NN1" hw="training" pos="SUBST">Training </w><w c5="VBZ" hw="be" pos="VERB">is </w><w c5="AT0" hw="a" pos="ART">a </w><w c5="NN1" hw="company" pos="SUBST">company </w><w c5="VVN" hw="limit" pos="VERB">limited </w><w c5="PRP" hw="by" pos="PREP">by </w><w c5="NN1" hw="guarantee" pos="SUBST">guarantee</w><c c5="PUN">.</c></s>

It seems the replacement doesn't work.

I want output like this for every match of the pattern <w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)</w>:

NN1:FACTSHEET
DTQ:WHAT
VBZ:IS
NN1:AIDS

Second, I tried awk '/<w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)<\/w>/ {print $1,$2,$3,$4}' A00.xml. The result is not what I want either: it did not print the parts within ().

How can I extract just the parts within (), which I use to mark the parts I need?

Thanks for all your suggestions,
John
#2  12-14-2008  Annihilannic (Forum Advisor)
awk doesn't use that kind of syntax to assign matches to subexpressions... you must have seen that in perl somewhere?

Your code works with only minor modifications in perl:


Code:
perl -ne '
        if (/<w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)<\/w>/) {print "$1,$2,$3,$4\n"}
' inputfile > outputfile
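
For comparison, here is a minimal awk sketch of the same extraction. awk has no capture groups of this kind, so the sketch splits each line on the characters <, > and " and picks fields by position instead. It assumes exactly one <w> element per line with the attributes in the order c5, hw, pos, as in the sample data; the variable names and the inline sample input are only for illustration.

```shell
# Sample input: one element per line (assumed layout, taken from the first post).
sample='<w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET</w>
<w c5="DTQ" hw="what" pos="PRON">WHAT</w>
<c c5="PUN">?</c>'

# Split on <, > and ". On a <w> line, field 2 is 'w c5=', field 3 is the
# c5 value and field 9 is the word itself. Other elements are skipped.
result=$(printf '%s\n' "$sample" |
  awk -F'[<>"]' '$2 == "w c5=" { print $3 ":" $9 }')
printf '%s\n' "$result"
```

This prints NN1:FACTSHEET and DTQ:WHAT. Selecting fields by position is fragile: it breaks as soon as an element has a different attribute order or several elements share a line.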

#3  12-14-2008  Johnivy (Registered User)
This is from the following book:
Title: Unix Power Tools, Third Edition
ISBN: 0596003307
Authors: Shelley Powers, Jerry Peek, Tim O'Reilly, Mike Loukides
Publisher: O'Reilly & Associates
Pages: 1200
Edition: 3rd edition (October 1, 2002)

32.13 Regular Expressions: Remembering Patterns with \(, \), and \1
Another pattern that requires a special mechanism is searching for repeated words. The expression [a-z][a-z] will match any two lowercase letters. If you wanted to search for lines that had two adjoining identical letters, the above pattern wouldn't help. You need a way to remember what you found and see if the same pattern occurs again. In some programs, you can mark part of a pattern using \( and \). You can recall the remembered pattern with \ followed by a single digit.[4] Therefore, to search for two identical letters, use \([a-z]\)\1. You can have nine different remembered patterns. Each occurrence of \( starts a new pattern. The regular expression to match a five-letter palindrome (e.g., "radar") is: \([a-z]\)\([a-z]\)[a-z]\2\1. [Some versions of some programs can't handle \( \) in the same regular expression as \1, etc. In all versions of sed, you're safe if you use \( \) on the pattern side of an s command — and \1, etc., on the replacement side (Section 34.11). — JP]

— BB
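
The two expressions from the passage above can be tried with grep, which uses the same basic regular expression syntax. This is only a small sketch; the word list and the variable names are made up for illustration.

```shell
# Hypothetical word list, one word per line.
words='radar
hello
level
world'

# Two adjoining identical letters: \1 must repeat what \([a-z]\) matched.
doubles=$(printf '%s\n' "$words" | grep '\([a-z]\)\1')

# Five-letter palindrome: letters 4 and 5 mirror letters 2 and 1.
palindromes=$(printf '%s\n' "$words" | grep '\([a-z]\)\([a-z]\)[a-z]\2\1')

printf 'doubles:\n%s\npalindromes:\n%s\n' "$doubles" "$palindromes"
```

Only hello contains a doubled letter, and only radar and level are palindromes, so those are the lines each grep keeps.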
34.11 Referencing Portions of a Search String
In sed, the substitution command provides metacharacters to select any individual portion of a string that is matched and recall it in the replacement string. A pair of escaped parentheses are used in sed to enclose any part of a regular expression and save it for recall. Up to nine "saves" are permitted for a single line. \n is used to recall the portion of the match that was saved, where n is a number from 1 to 9 referencing a particular "saved" string in order of use. (Section 32.13 has more information.)

For example, when converting a plain-text document into HTML, we could convert section numbers that appear in a cross-reference into an HTML hyperlink. The following expression is broken onto two lines for printing, but you should type all of it on one line:

s/\([sS]ee \)\(Section \)\([1-9][0-9]*\)\.\([1-9][0-9]*\)/
\1<a href="#SEC-\3_\4">\2\3.\4<\/a>/
Four pairs of escaped parentheses are specified. String 1 captures the word see with an upper- or lowercase s. String 2 captures the section number (because this is a fixed string, it could have been simply retyped in the replacement string). String 3 captures the part of the section number before the decimal point, and String 4 captures the part of the section number after the decimal point. The replacement string recalls the first saved substring as \1. Next starts a link where the two parts of the section number, \3 and \4, are separated by an underscore (_) and have the string SEC- before them. Finally, the link text replays the section number again — this time with a decimal point between its parts. Note that although a dot (.) is special in the search pattern and has to be quoted with a backslash there, it's not special on the replacement side and can be typed literally. Here's the script run on a short test document, using checksed (Section 34.4):

% checksed testdoc
********** < = testdoc > = sed output **********
8c8
< See Section 1.2 for details.
---
> See <a href="#SEC-1_2">Section 1.2</a> for details.
19c19
< Be sure to see Section 23.16!
---
> Be sure to see <a href="#SEC-23_16">Section 23.16</a>!
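
To try the substitution without checksed, the expression can be typed on one line and fed the two test sentences directly. This is a sketch; the variable name linked is only for illustration.

```shell
# The two sentences from the checksed run, piped through the one-line script.
linked=$(printf '%s\n' 'See Section 1.2 for details.' 'Be sure to see Section 23.16!' |
  sed 's/\([sS]ee \)\(Section \)\([1-9][0-9]*\)\.\([1-9][0-9]*\)/\1<a href="#SEC-\3_\4">\2\3.\4<\/a>/')
printf '%s\n' "$linked"
```

The output should match the lines marked > in the checksed listing.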
We can use a similar technique to match parts of a line and swap them. For instance, let's say there are two parts of a line separated by a colon. We can match each part, putting them within escaped parentheses and swapping them in the replacement:

% cat test1
first:second
one:two
% sed 's/\(.*\):\(.*\)/\2:\1/' test1
second:first
two:one
The larger point is that you can recall a saved substring in any order and multiple times. If you find that you need more than nine saved matches, or would like to be able to group them into matches and submatches, take a look at Perl.

Section 43.10, Section 31.10, Section 10.9, and Section 36.23 have examples.

—DD and JP
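
One detail the swap example does not show is that .* is greedy: it matches as much as it can. A small sketch of what happens when a line holds more than one colon (the a:b:c input is made up for illustration):

```shell
# One colon: the two parts swap, as in the book's example.
one=$(printf '%s\n' 'first:second' | sed 's/\(.*\):\(.*\)/\2:\1/')

# Two colons: the first .* runs up to the LAST colon, so \1 is 'a:b'.
two=$(printf '%s\n' 'a:b:c' | sed 's/\(.*\):\(.*\)/\2:\1/')

printf '%s\n%s\n' "$one" "$two"
```

This prints second:first and then c:a:b. The same greediness is what makes .* awkward when the delimiter it should stop at occurs several times on a line.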


I tested it, and it works for a list of lines that all follow the same pattern. The problem in my situation is that I fail at the first step: getting each match of the regular expression <w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)<\/w> onto its own line, such as:

<w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET</w>

<w c5="DTQ" hw="what" pos="PRON">WHAT</w>

<w c5="VBZ" hw="be" pos="VERB">IS</w>

<w c5="NN1" hw="aids" pos="SUBST">AIDS</w>

My result is not clean: it contains other content from outside the regular expression, such as <s n="420">.

Strangely, it works in the book's example but not in my situation.

Best

John
#4  12-15-2008  Annihilannic (Forum Advisor)
Sorry, I can't make sense of what you're saying.

However I notice you described in your original post that you wanted the output in this format:


Code:
NN1:FACTSHEET
DTQ:WHAT
VBZ:IS
NN1:AIDS

So try this instead:


Code:
perl -ne '
        if (/<w c5="(.*?)" hw=".*?" pos=".*?">(.*?)<\/w>/) {print "$1:$2\n"}
' inputfile > outputfile

#5  12-15-2008  Johnivy (Registered User)
First of all, thanks.

The first six paragraphs are quoted from a book that explains how to use sed with parentheses. I don't know why it doesn't work in my situation.

Best
John
#6  12-15-2008  Annihilannic (Forum Advisor)
What operating system are you using? I think the .*? parts may be the problem: that non-greedy syntax comes from Perl and is not supported by sed. In sed's basic regular expressions the ? is just a literal question mark, and as far as I know that also applies to GNU sed, the version found on Linux.

Try this, which works for me on HP-UX:


Code:
sed -n 's/<w c5="\(.*\)" hw="\(.*\)" pos="\(.*\)">\(.*\)<\/w>/\1:\4/gp' inputfile > outputfile
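
A possible portable alternative, offered as a sketch rather than a tested fix for the real A00.xml: sed has no non-greedy .*?, but [^"]* stops at the next double quote, which has the same effect for attribute values. Because the file appears to hold many <w> elements on one line, the sketch first breaks the line at every < with tr so that each element starts its own line. It assumes the c5, hw, pos attribute order shown in the thread and that the text itself contains no literal <. The variable names are illustrative.

```shell
# A shortened line in the style of the problem output from the first post.
line='<s n="420"><w c5="NN1" hw="aids" pos="SUBST">AIDS </w><w c5="VBZ" hw="be" pos="VERB">is </w><w c5="AT0" hw="a" pos="ART">a </w><c c5="PUN">.</c></s>'

# 1. tr turns every < into a newline, so each tag starts a new line.
# 2. sed keeps only lines starting with 'w c5=' and prints tag:word,
#    dropping any spaces that trail the word.
pairs=$(printf '%s\n' "$line" |
  tr '<' '\n' |
  sed -n 's/^w c5="\([^"]*\)" hw="[^"]*" pos="[^"]*">\(.*[^ ]\) *$/\1:\2/p')
printf '%s\n' "$pairs"
```

For this sample line the result is NN1:AIDS, VBZ:is and AT0:a, one pair per line. The <s> and <c> elements are dropped because they do not start with w c5=.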

#7  12-15-2008  Johnivy (Registered User)
I am using SSH Secure Shell Client on Windows XP. The shell client is connected to the Unix server at our school.