Extracting data between tags based on search string from unix file


 
# 1  
Old 11-18-2009

The input file is on a Linux box and holds all of its data on a single line of 1699741696 characters.

Sample Input:
Code:
<xxx><document coll="uspatfull" version="0"><CMSdoc>xxxantivirus</CMSdoc><tag1>1</tag1></document><document coll="uspatfull" version="0"><CMSdoc>yyy</CMSdoc><tag1>a</tag1></document><document coll="uspatfull" version="0"><CMSdoc>likeavirusesxxx</CMSdoc><tag1>aaa</tag1></document>
</xxx>


Required output:
If a string like "virus" appears anywhere between a pair of document tags, that whole document element should be in the output; documents without it should be dropped.

Code:
<xxx><document coll="uspatfull" version="0"><CMSdoc>xxxantivirus</CMSdoc><tag1>1</tag1></document><document coll="uspatfull" version="0"><CMSdoc>likeavirusesxxx</CMSdoc><tag1>aaa</tag1></document></xxx>


Thanks!

Last edited by Franklin52; 11-19-2009 at 01:05 PM.. Reason: Please use code tags!!
# 2  
Old 11-18-2009
Your input is not valid XML
Code:
<xml>
<document coll="uspatfull" version="0">
   <CMSdoc>xxxantivirus<tag1>1</tag1></CMSdoc>
</document>
<document coll="uspatfull" version="0">
   <CMSdoc>yyy<tag1>a</tag1></CMSdoc>
</document>
<document coll="uspatfull" version="0">
   <CMSdoc>likeavirusesxxx<tag1>aaa</tag1></CMSdoc>
</document>
</xml>

First, <xml> is a poor choice of element name: the XML specification reserves all names beginning with xml, in any combination of upper and lower case. Second, the CMSdoc element mixes text with an embedded element (tag1). Mixed content like this is allowed by the specification, but it makes the structure harder to match on, so it is best avoided here.

If you can modify your file to be a valid XML document, what you want to do will be much easier to achieve.
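
A quick way to check whether a fragment parses at all is to hand it to an XML parser. The following is only a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed is made up for illustration, and it tests well-formedness only, not validity against a schema. It also loads the whole string into memory, so it suits small samples rather than the full 1.7 GB file.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as one well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A shortened copy of the sample from the first post.
sample = (
    '<xxx><document coll="uspatfull" version="0">'
    '<CMSdoc>xxxantivirus</CMSdoc><tag1>1</tag1></document></xxx>'
)

print(is_well_formed(sample))                   # the sample as posted
print(is_well_formed("<xxx><document></xxx>"))  # <document> is never closed
```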
# 3  
Old 11-18-2009
updated the input file - extracting data between tags based on search string from unix file

Please consider the following valid XML as input.
I have removed the reserved word xml as a tag name and moved the tag1 elements out of CMSdoc.

Modified input:
Code:
<xxx>
<document coll="uspatfull" version="0">
   <CMSdoc>xxxantivirus</CMSdoc>
<tag1>1</tag1>
</document>
<document coll="uspatfull" version="0">
   <CMSdoc>yyy</CMSdoc>
<tag1>a</tag1>
</document>
<document coll="uspatfull" version="0">
   <CMSdoc>likeavirusesxxx</CMSdoc>
<tag1>aaa</tag1>
</document>
</xxx>

Expected output:
Code:
<xxx>
<document coll="uspatfull" version="0">
   <CMSdoc>xxxantivirus</CMSdoc>
<tag1>1</tag1>
</document>
<document coll="uspatfull" version="0">
   <CMSdoc>likeavirusesxxx</CMSdoc>
<tag1>aaa</tag1>
</document>
</xxx>

---------- Post updated at 03:24 PM ---------- Previous update was at 11:10 AM ----------

Can you please look into this?

Last edited by fpmurphy; 11-19-2009 at 11:23 AM.. Reason: Added code tags. Removed embolding
# 4  
Old 11-19-2009
The best way to handle a requirement like this is to use an XSLT processor.

Here is a stylesheet which transforms the supplied document into the required output.
Code:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

   <!-- pass in searchterm as -param searchterm "'virus'"  -->
   <xsl:param name="searchterm" />

   <xsl:output method="xml" indent="yes"/>

   <xsl:template match="//document">
      <xsl:if test=".//text()[contains(., $searchterm)]">
         <xsl:copy-of select="." />
      </xsl:if>
   </xsl:template>

   <xsl:template match="/">
      <xsl:element name="xxx">
         <xsl:apply-templates select="//document" />
      </xsl:element>
   </xsl:template>

</xsl:stylesheet>

Here is the output from the xsltproc processor, which comes with libxslt:
Code:
$ xsltproc -param searchterm "'virus'" file.xsl file.xml
<?xml version="1.0"?>
<xxx>
  <document coll="uspatfull" version="0">
<CMSdoc>xxxantivirus</CMSdoc>
<tag1>1</tag1>
</document>
  <document coll="uspatfull" version="0">
<CMSdoc>likeavirusesxxx</CMSdoc>
<tag1>aaa</tag1>
</document>
</xxx>
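
Since the real input is about 1.7 GB on one line, a processor that builds the whole tree in memory may struggle with it. As a rough alternative, here is a sketch of the same filter (keep each document whose descendant text contains the search term) written with Python's standard xml.etree.ElementTree.iterparse, which reads the file incrementally. The function name filter_documents and the SAMPLE data are illustrative only, the root element name xxx is hard-coded as in the stylesheet, and document elements are assumed not to be nested.

```python
import io
import xml.etree.ElementTree as ET


def filter_documents(source, searchterm, out):
    """Write to out every <document> whose descendant text contains searchterm."""
    out.write("<xxx>\n")
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "document":
            # itertext() walks all text inside the element, like .//text()
            if any(searchterm in text for text in elem.itertext()):
                elem.tail = None  # do not copy whitespace that follows the element
                out.write(ET.tostring(elem, encoding="unicode"))
                out.write("\n")
            root.clear()  # drop finished documents so memory use stays bounded
    out.write("</xxx>\n")


# The one-line sample from the first post.
SAMPLE = (
    b'<xxx><document coll="uspatfull" version="0"><CMSdoc>xxxantivirus</CMSdoc>'
    b'<tag1>1</tag1></document><document coll="uspatfull" version="0">'
    b'<CMSdoc>yyy</CMSdoc><tag1>a</tag1></document>'
    b'<document coll="uspatfull" version="0"><CMSdoc>likeavirusesxxx</CMSdoc>'
    b'<tag1>aaa</tag1></document>\n</xxx>\n'
)

buf = io.StringIO()
filter_documents(io.BytesIO(SAMPLE), "virus", buf)
result = buf.getvalue()
print(result, end="")
```

For a real file, pass the file name (or a file opened in binary mode) as source and sys.stdout as out.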

# 5  
Old 11-19-2009
Code:
gawk '/virus/{print $0RT}' RS="</document>" file
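
How this works: setting RS to "</document>" makes gawk read one document per record instead of one line per record, and RT holds the text that actually terminated the record, so $0RT puts the closing tag back. Note that on the sample input the leading <xxx> survives only because the first document matches, and the closing </xxx> is dropped because the last record (just "</xxx>") does not contain "virus". Below is a rough Python sketch of the same idea, reading the file in chunks so the single huge line never has to fit in memory. The names records and grep_records are made up for illustration, and it does a plain substring test where gawk does a regex match.

```python
import io


def records(stream, terminator="</document>", chunk_size=1 << 20):
    """Yield (record, terminator) pairs, similar to gawk's $0 and RT."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        parts = buffer.split(terminator)
        buffer = parts.pop()  # incomplete tail, may end in a partial terminator
        for part in parts:
            yield part, terminator
    if buffer:
        yield buffer, ""  # trailing data has no terminator, as with an empty RT


def grep_records(stream, searchterm, out, chunk_size=1 << 20):
    """Write every record containing searchterm, followed by its terminator."""
    for record, rt in records(stream, chunk_size=chunk_size):
        if searchterm in record:
            out.write(record + rt + "\n")  # gawk's print appends a newline


# The one-line sample from the first post.
SAMPLE = (
    '<xxx><document coll="uspatfull" version="0"><CMSdoc>xxxantivirus</CMSdoc>'
    '<tag1>1</tag1></document><document coll="uspatfull" version="0">'
    '<CMSdoc>yyy</CMSdoc><tag1>a</tag1></document>'
    '<document coll="uspatfull" version="0"><CMSdoc>likeavirusesxxx</CMSdoc>'
    '<tag1>aaa</tag1></document>\n</xxx>\n'
)

buf = io.StringIO()
# A tiny chunk size is used here only to exercise terminators split across chunks.
grep_records(io.StringIO(SAMPLE), "virus", buf, chunk_size=7)
result = buf.getvalue()
print(result, end="")
```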

# 6  
Old 11-22-2009
Quote:
Originally Posted by ghostdog74
Code:
gawk '/virus/{print $0RT}' RS="</document>" file

Good solution using gawk.