Read content between XML tags with awk, grep, sed or whatever...

03-10-2010

Hello,

I'm trying to extract text that is surrounded by XML tags. I tried this:

Code:

cat tst.xml | egrep "<SERVER>.*</SERVER>" | sed -e "s/<SERVER>\(.*\)<\/SERVER>/\1/" | tr "|" " "

which works perfectly if the start tag and the end tag are on the same line, e.g.:

Code:

<tag1>Hello Linux-Users</tag1>

But if I have something like this:

Code:

<tag2>Hello
Linux-
User</tag2>

it doesn't do anything. I think the problem is that the tools I used work line by line, so they have no way to recognize the end tag when it is on a different line... I'm not very experienced with awk, sed and grep, so I need some help...
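
To illustrate the line-by-line behaviour described above: grep tests one input line at a time, so a pattern that needs both tags can only match when they share a line. A minimal sketch, assuming a made-up two-element sample (the `sample` variable, its contents and the function name are hypothetical, not taken from tst.xml):

```shell
# Hypothetical sample: the first element fits on one line, the second spans two.
sample='<SERVER>alpha</SERVER>
<SERVER>beta
gamma</SERVER>'

# grep checks each line separately; -c counts the lines that match.
# Only the first line contains both the start tag and the end tag.
count_matching_lines() {
    printf '%s\n' "$sample" | grep -c '<SERVER>.*</SERVER>'
}

count_matching_lines
```

The count is 1 even though the sample holds two elements, which is why the multi-line case produces no output in the original pipeline.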

Hope someone can help...

regards
Sebi

Sebi0815

03-10-2010

Hi, Sebi0815:

Perhaps you can change each newline to a space, so that the data appears as one long line. This is a naive approach, but if it doesn't affect the semantics of your data it may be sufficient.

Code:

tr '\n' ' ' < tst.xml | egrep...

Or delete them altogether:

Code:

tr -d '\n' < tst.xml | egrep...

Regards,
Alister

alister

03-10-2010

Thanks for the fast answer, Alister...

... but this solution won't work for me. I need to keep the newlines in the text.

Sebi0815

03-10-2010

Here's a Perl solution. Assume your file is as follows -

Code:

$
$
$ cat sample.xml
<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.</description>
   </book>
</catalog>
$
$

You want to pick up the text between the <description> and </description> tags.

The first occurrence is on a single line. The rest span multiple lines, and you want the newlines to be preserved. I shall assume that you want the whitespace to be preserved as well.

Here's the script -

Code:

$
$ perl -lne 'BEGIN{undef $/} while (/<description>(.*?)<\/description>/sg){print $1}' sample.xml
An in-depth look at creating applications with XML.
A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.
After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.
In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.
The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.
$
$

If you want the newlines preserved but the whitespace at the beginning of each line removed, then -

Code:

$
$ perl -lne 'BEGIN{undef $/} while (/<description>(.*?)<\/description>/sg){($x = $1) =~ s/\n\s*/\n/g; print $x}' sample.xml
An in-depth look at creating applications with XML.
A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.
After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.
In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.
The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.
$
$

And if you want neither the newlines nor the indentation, i.e. each chunk between the <description> tags on a single line, then replace every line break and the whitespace around it with a single space -

Code:

$
$ perl -lne 'BEGIN{undef $/} while (/<description>(.*?)<\/description>/sg){($x = $1) =~ s/\s*\n\s*/ /g; print $x}' sample.xml
An in-depth look at creating applications with XML.
A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.
After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.
In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.
The two daughters of Maeve, half-sisters, battle one another for control of England. Sequel to Oberon's Legacy.
$
$

HTH,
tyler_durden

durden_tyler

03-10-2010

Sebi0815:

The following is about as smart as your original solution; it will not work correctly if this tag can be embedded within itself, nor if there are multiple instances of it on a single line. If you require more intelligence, perhaps it is time to step up to a tool that understands xml.

Code:

$ cat data
<tag2>Hello
Linux-
User</tag2>

<tag3>DO NOT PRINT
DO NOT PRINT
DO NOT PRINT</tag3>
<tag2>Good Bye</tag2>

$ sed -n '/<tag2>/,/<\/tag2>/H; /<tag2>/h; /\/tag2/{x;s/<tag2>\(.*[^\n]\)\n*<\/tag2>/\1/p;}' data
Hello
Linux-
User
Good Bye
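
If several instances can share a line, one possible workaround is to scan each line in a loop with awk. This is still not an XML parser, and it still cannot handle a tag nested inside itself. A sketch using only standard awk string functions; the function name extract_tag and its argument order are assumptions made for illustration:

```shell
# extract_tag TAG [FILE...]: print the text between <TAG> and </TAG>,
# whether the pair sits on one line, spans several lines, or repeats on a line.
extract_tag() {
    tag=$1
    shift
    awk -v otag="<$tag>" -v ctag="</$tag>" '
    {
        line = $0
        for (;;) {
            if (!inside) {
                i = index(line, otag)
                if (i == 0) next                  # no opening tag left on this line
                line = substr(line, i + length(otag))
                inside = 1
                if (line == "") next              # opening tag ended the line
            }
            j = index(line, ctag)
            if (j == 0) { print line; next }      # content continues on the next line
            if (j > 1) print substr(line, 1, j - 1)
            line = substr(line, j + length(ctag))
            inside = 0
        }
    }' "$@"
}
```

For the sample file above, "extract_tag tag2 data" should print the same four lines as the sed command. Two instances on one line come out as two separate output lines.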

Cheers,
Alister

alister

03-10-2010

EDIT: I'm sorry, this won't really work. It also prints any other text it comes across, but someone with more awk experience may be able to fix that.

Here's an awk line I got from someone here for a similar problem. I changed it to suit your problem, but it puts out some blank lines at the end and I don't know enough awk to fix that. Maybe someone else can perfect it.

It extracts everything between the opening and closing tags that you specify, whether they are on one line or spread over several. To run it on a file instead of piped input, pass the filename after the awk program.

Code:

# echo '<tag2>Hello
Linux-
User</tag2>
<tag2>Hello Linux-Users</tag2>' | awk 'BEGIN{ RS="</tag2>"}{gsub(/.*<tag2>/,"");print}'
Hello
Linux-
User
Hello Linux-Users


#
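
One possible fix for both problems mentioned in this post (the stray text and the blank lines at the end) is to print a record only when it actually contains the opening tag, and to keep just the part after that tag. A minimal sketch, assuming an awk that accepts a multi-character RS (gawk and mawk do; POSIX only specifies the single-character case); the function name extract_tag2 is hypothetical:

```shell
# Print the text between each <tag2>...</tag2> pair, skipping everything else.
extract_tag2() {
    awk -v otag='<tag2>' '
    BEGIN { RS = "</tag2>" }          # each record ends at a closing tag
    {
        i = index($0, otag)           # position of the opening tag, 0 if absent
        if (i) print substr($0, i + length(otag))
    }' "$@"
}
```

The final record (whatever follows the last closing tag) contains no opening tag, so it is skipped instead of being printed as blank lines.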

fubaya
