Read content between xml tags with awk, grep, awk or what ever...

Tags
shell scripts

# 1  
Old 03-10-2010
Hello,

I am trying to extract text that is surrounded by XML tags. I tried this:

Code:
cat tst.xml | egrep "<SERVER>.*</SERVER>" |sed -e "s/<SERVER>\(.*\)<\/SERVER>/\1/"|tr "|" " "

which works perfectly if the start tag and the end tag are on the same line, e.g.:

Code:
<tag1>Hello Linux-Users</tag1>

But if I have something like this:

Code:
<tag2>Hello
Linux-
User</tag2>

it doesn't do anything. I think the problem is that the tools I used work line by line, so they have no way to recognize the end tag when it is on a later line. I'm not very experienced with awk, sed and grep, so I need some help...

Hope someone can help...


regards
Sebi
# 2  
Old 03-10-2010
Hi, Sebi0815:

Perhaps you can change each newline to a space, so that the data appears as one long line. This is a naive approach, but if it doesn't affect the semantics of your data it may be sufficient.
Code:
tr '\n' ' ' < tst.xml | egrep...

Or delete them altogether:
Code:
tr -d '\n' < tst.xml | egrep...

Regards,
Alister
# 3  
Old 03-10-2010
Thanks for the fast answer Alister...

... but this solution won't work for me. I need to keep the newlines in the text.
# 4  
Old 03-10-2010
Here's a Perl solution. Assume your file is as follows -

Code:
$
$
$ cat sample.xml
<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.</description>
   </book>
</catalog>
$
$

You want to pick up the text between the <description> and </description> tags.

The first occurrence is on a single line. The rest of them span multiple lines, and you want the newlines to be preserved. I shall assume that you want the whitespace to be preserved as well.

Here's the script -

Code:
$
$ perl -lne 'BEGIN{undef $/} while (/<description>(.*?)<\/description>/sg){print $1}' sample.xml
An in-depth look at creating applications with XML.
A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.
After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.
In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.
The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.
$
$

If you want the newlines preserved but the leading whitespace on each line removed, then -

Code:
$
$ perl -lne 'BEGIN{undef $/} while (/<description>(.*?)<\/description>/sg){($x = $1) =~ s/\n\s*/\n/g; print $x}' sample.xml
An in-depth look at creating applications with XML.
A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.
After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.
In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.
The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.
$
$

And if you want neither the newlines nor the leading whitespace, i.e. each chunk between the <description> tags on a single line, then -

Code:
$
$ perl -lne 'BEGIN{undef $/} while (/<description>(.*?)<\/description>/sg){($x = $1) =~ s/\s*\n\s*/ /g; print $x}' sample.xml
An in-depth look at creating applications with XML.
A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.
After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.
In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.
The two daughters of Maeve, half-sisters, battle one another for control of England. Sequel to Oberon's Legacy.
$
$

HTH,
tyler_durden
# 5  
Old 03-10-2010
Sebi0815:

The following is about as smart as your original solution; it will not work correctly if this tag can be nested within itself, nor if there are multiple instances of it on a single line. If you require more intelligence, perhaps it is time to step up to a tool that understands XML.

Code:
$ cat data
<tag2>Hello
Linux-
User</tag2>

<tag3>DO NOT PRINT
DO NOT PRINT
DO NOT PRINT</tag3>
<tag2>Good Bye</tag2>

$ sed -n '/<tag2>/,/<\/tag2>/H; /<tag2>/h; /\/tag2/{x;s/<tag2>\(.*[^\n]\)\n*<\/tag2>/\1/p;}' data
Hello
Linux-
User
Good Bye

Cheers,
Alister
# 6  
Old 03-10-2010
EDIT: I'm sorry, this won't really work: it also prints any other text it comes across. Someone with more awk experience may be able to fix that too.

Here's an awk line I got from someone here for a similar problem. I changed it to suit your problem, but it puts out some blank lines at the end and I don't know enough awk to fix that. Maybe someone else can perfect it.

It extracts everything between the opening and closing tags that you specify; it doesn't matter whether they are on one line or spread over several. You can also use "awk command file" to run it on a file.
Code:
# echo '<tag2>Hello
Linux-
User</tag2>
<tag2>Hello Linux-Users</tag2>' | awk 'BEGIN{ RS="</tag2>"}{gsub(/.*<tag2>/,"");print}'
Hello
Linux-
User
Hello Linux-Users


#
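
One possible fix for the awk line above, as a sketch: only act on records that actually contain the opening tag. That skips the leftover record after the last </tag2> (which is where the blank lines come from), and sub() drops any unrelated text in front of the tag. The function name extract_tag2 is made up for illustration. A multi-character RS is an extension that gawk and mawk accept, not something POSIX awk guarantees, so this assumes one of those. It still does not handle nested tags.

```shell
# Hypothetical helper: print the text between <tag2> and </tag2>.
# Assumes an awk that accepts a multi-character RS (e.g. gawk or mawk).
extract_tag2() {
  awk 'BEGIN { RS = "</tag2>" }
       # Only records containing the opening tag are printed; everything
       # up to and including <tag2> is stripped first.
       /<tag2>/ { sub(/.*<tag2>/, ""); print }' "$@"
}

# Usage: pass a file name, or pipe data in on standard input.
# extract_tag2 data
```

Text that appears outside the tags, before, between or after them, is no longer printed.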

