The UNIX and Linux Forums  
#1  Steve_altius, 08-06-2008
Extracting tag values from XML using perl

Hi All,

I'm trying to extract the values of the 'src' and 'alt' attributes within an XML file. In the files that I'm searching, these attributes always appear inside an 'img' tag. Typically:

<img src="diwiz01.gif" width="576" height="254" alt="Out-of-process and In-process COM Objects"><bookmark name="f003"/></img>

I grep for 'img' and pipe the result to the following Perl code, which successfully extracts the required data:

Code:
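#!/usr/bin/perl
# Read each input line and print "src|alt" for the img tag on it.
while (<>) {
    # Image file name from the src attribute.
    while (m/img src=\"(.*?)\"/ig) {
        print $1, "|";
    }
    # Alternate text from the alt attribute, ending the output line.
    while (m/alt=\"(.*?)\"/ig) {
        print $1, "\n";
    }
}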
However, the XML source occasionally contains the 'src' and 'alt' attributes in a different order within the 'img' tag. For example:

<img width="470" height="321" alt="A Remote COM Object" src="dicwiz02.gif"><bookmark name="f004"/></img>

Consequently, the above code doesn't work.

The basis of the code was originally used for a different problem and I didn't write it. I've modified it in an attempt to solve this problem. Unfortunately, although I know the basics of sed and awk (but hardly any Perl), I'm not a programmer and I'm struggling a bit.

Any help gratefully received.

Thanks.
#2  Yogesh Sawant (Forum Staff), 08-06-2008
In the first loop, replace:
Code:
while (m/img src=\"(.*?)\"/ig) {
    print $1,"|";
with:
Code:
while (m/img(.*?)src=\"(.*?)\"/ig) {
    print $2,"|";
#3  redoubtable, 08-06-2008
There might be other more clever solutions, but this one works.

[CODE]
Tsunami xml # cat xml
<img width="470" height="321" alt="A Remote COM Object" src="dicwiz02.gif"><bookmark name="f004"/></img>
<img src="diwiz01.gif" width="576" height="254" alt="Out-of-process and In-process COM Objects"><bookmark name="f003"/></img>
Tsunami xml # perl -ne 'print "$1 $2\n" if /<img.*?(?:src|alt)=\"(.*?)\".*?(?:alt|src)=\"(.*?)\".*?<\/img>/;' xml
A Remote COM Object dicwiz02.gif
diwiz01.gif Out-of-process and In-process COM Objects
Tsunami xml #
[/CODE]
#4  Steve_altius, 08-06-2008
Thanks guys. Both solutions work. I appreciate your efforts!