Read content between XML tags with awk, grep, sed or whatever...


 
# 1  
Old 03-10-2010

Hello,

I am trying to extract text that is surrounded by XML tags. I tried this:

Code:
cat tst.xml | egrep "<SERVER>.*</SERVER>" |sed -e "s/<SERVER>\(.*\)<\/SERVER>/\1/"|tr "|" " "

which works perfectly if the start tag and the end tag are on the same line, e.g.:

Code:
<tag1>Hello Linux-Users</tag1>

but if I have something like this:

Code:
<tag2>Hello
Linux-
User</tag2>

it doesn't do anything. I think the problem is that the tools I used work line by line, so they have no way to recognize
the end tag... I'm not very experienced with awk, sed and grep, so I need some help...

Hope someone can help...


regards
Sebi
# 2  
Old 03-10-2010
Hi, Sebi0815:

Perhaps you can change each newline to a space, so that the data appears as one long line. This is a naive approach, but if it doesn't affect the semantics of your data it may be sufficient.
Code:
tr '\n' ' ' < tst.xml | egrep...

Or delete them altogether:
Code:
tr -d '\n' < tst.xml | egrep...
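
As a rough, complete sketch of the first variant: the file contents below are a made-up sample, and `grep -o` is assumed to be available (it is common but not required by POSIX). Note that the line breaks inside the text are lost, which is the whole point of this approach.

```shell
# Hypothetical sample file; the temporary directory is only for this demo.
workdir=$(mktemp -d)
cat > "$workdir/tst.xml" <<'EOF'
<tag2>Hello
Linux-
User</tag2>
<tag2>Hello Linux-Users</tag2>
EOF

# Join all lines into one, pull out each <tag2> element, then strip the tags.
# [^<]* keeps a match from running past the first closing tag.
joined=$(tr '\n' ' ' < "$workdir/tst.xml" | grep -o '<tag2>[^<]*</tag2>' | sed 's/<[^>]*>//g')
printf '%s\n' "$joined"
# prints:
# Hello Linux- User
# Hello Linux-Users
rm -rf "$workdir"
```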

Regards,
Alister
# 3  
Old 03-10-2010
Thanks for the fast answer Alister...

... but this solution won't work for me. I need to keep the newlines in the text.
# 4  
Old 03-10-2010
Here's a Perl solution. Assume your file is as follows -

Code:
$
$
$ cat sample.xml
<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.</description>
   </book>
</catalog>
$
$

You want to pick up the text between the "<description>" and "</description>" tags.

The first occurrence is on a single line. The rest span multiple lines, and you want the newlines preserved. I shall assume that you want the leading whitespace preserved as well.

Here's the script -

Code:
$
$ perl -lne 'BEGIN{undef $/} while (/<description>(.*?)<\/description>/sg){print $1}' sample.xml
An in-depth look at creating applications with XML.
A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.
After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.
In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.
The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.
$
$

If you want the newlines preserved but the whitespace at the beginning of each continuation line removed, then -

Code:
$
$ perl -lne 'BEGIN{undef $/} while (/<description>(.*?)<\/description>/sg){($x = $1) =~ s/\n\s*/\n/g; print $x}' sample.xml
An in-depth look at creating applications with XML.
A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.
After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.
In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.
The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.
$
$

And if you want neither the newlines nor the leading whitespace, i.e. each chunk between the "<description>" tags on a single line, replace each line break and the whitespace around it with a single space -

Code:
$
$ perl -lne 'BEGIN{undef $/} while (/<description>(.*?)<\/description>/sg){($x = $1) =~ s/\s*\n\s*/ /g; print $x}' sample.xml
An in-depth look at creating applications with XML.
A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.
After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.
In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.
The two daughters of Maeve, half-sisters, battle one another for control of England. Sequel to Oberon's Legacy.
$
$

HTH,
tyler_durden
# 5  
Old 03-10-2010
Sebi0815:

The following is about as smart as your original solution; it will not work correctly if the tag can be nested within itself, or if there are multiple instances of it on a single line. If you require more intelligence, perhaps it is time to step up to a tool that understands XML.

Code:
$ cat data
<tag2>Hello
Linux-
User</tag2>

<tag3>DO NOT PRINT
DO NOT PRINT
DO NOT PRINT</tag3>
<tag2>Good Bye</tag2>

$ sed -n '/<tag2>/,/<\/tag2>/H; /<tag2>/h; /\/tag2/{x;s/<tag2>\(.*[^\n]\)\n*<\/tag2>/\1/p;}' data
Hello
Linux-
User
Good Bye
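
For readers new to sed's hold space, here is the same command spread over several lines with comments. The logic is meant to be unchanged; the temporary directory is just an assumption for the demo.

```shell
# Recreate the sample file shown above in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/data" <<'EOF'
<tag2>Hello
Linux-
User</tag2>

<tag3>DO NOT PRINT
DO NOT PRINT
DO NOT PRINT</tag3>
<tag2>Good Bye</tag2>
EOF

extracted=$(sed -n '
  # Append every line of a <tag2>...</tag2> range to the hold space.
  /<tag2>/,/<\/tag2>/H
  # On the opening-tag line, overwrite the hold space so each block starts clean.
  /<tag2>/h
  # On the closing-tag line, swap the collected block into the pattern space,
  # strip the tags (and any newlines right before the end tag), and print.
  /\/tag2/{
    x
    s/<tag2>\(.*[^\n]\)\n*<\/tag2>/\1/p
  }
' "$workdir/data")
printf '%s\n' "$extracted"
# prints:
# Hello
# Linux-
# User
# Good Bye
rm -rf "$workdir"
```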

Cheers,
Alister
# 6  
Old 03-10-2010
EDIT: I'm sorry, this won't really work: it also prints any other text it comes across. Someone with more awk experience may be able to fix that.

Here's an awk line I got from someone here for a similar problem. I changed it to suit your problem, but it puts out some blank lines at the end and I don't know enough awk to fix that. Maybe someone else can perfect it.

It extracts everything between the opening and closing tags that you specify, whether they are on one line or span multiple lines. You can also use "awk command file" to run it on a file.
Code:
# echo '<tag2>Hello
Linux-
User</tag2>
<tag2>Hello Linux-Users</tag2>' | awk 'BEGIN{ RS="</tag2>"}{gsub(/.*<tag2>/,"");print}'
Hello
Linux-
User
Hello Linux-Users


#
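
One possible fix for both problems (the stray text and the trailing blank lines) is to print a record only when the opening tag was actually found in it. This is a sketch, not a tested general solution: it assumes an awk that accepts a multi-character RS (gawk and mawk do, POSIX does not require it), and the "stray text" line is made up for the demo.

```shell
# Sample input from the post, plus one stray line that must not be printed.
input='<tag2>Hello
Linux-
User</tag2>
stray text outside the tags
<tag2>Hello Linux-Users</tag2>'

# sub() returns 1 only if it removed something up to an opening tag.
# Using it as the pattern means records without <tag2> are skipped,
# including the leftover newline after the last closing tag.
result=$(printf '%s\n' "$input" | awk 'BEGIN { RS = "</tag2>" } sub(/.*<tag2>/, "")')
printf '%s\n' "$result"
# prints:
# Hello
# Linux-
# User
# Hello Linux-Users
```

Like the other answers, this still breaks if the tag is nested inside itself.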
