Need an efficient way to search for a tag in an xml file having millions of rows


 
# 1  
Old 03-01-2012

Hi,

I have an XML file with around 1 billion rows in it, and I am trying to find the number of times a particular tag occurs in it. The solution I am using works but takes a long time (~1 hr). Please help me with a more efficient way to do this.

Let's say the input file is:

Code:
<Root>
     <Person>
           <Name>John</Name>
     </Person>
</Root>

This <Name> block can appear in the file multiple times, and I need to find the count quickly (efficiently).

Thanks.
# 2  
Old 03-01-2012
Hi Sheel,

I'm curious about your current solution. What is it?

I would use xpath or something similar.

Regards,
Birei.
# 3  
Old 03-01-2012
I am using a simple awk statement:
Code:
 awk '/\<Name\>/' inputfile | wc -l

# 4  
Old 03-01-2012
Code:
 
grep -c "<Name>" xmlfile

# 5  
Old 03-01-2012
File 'input' contains 1 million entries of this block:
Code:
<Root>
    <Person>
        <Name>John</Name>
    </Person>
</Root>
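
A test file like this can be produced in several ways. The following is only a sketch of one of them, not necessarily how the file above was made; the function name `make_input` is hypothetical.

```shell
# make_input N FILE: write N copies of the five-line sample block to FILE.
# (make_input is a hypothetical helper name, not part of any tool.)
make_input() {
    awk -v n="$1" 'BEGIN {
        # Each print emits one complete <Root> block, five lines long.
        for (i = 0; i < n; i++)
            print "<Root>\n    <Person>\n        <Name>John</Name>\n    </Person>\n</Root>"
    }' > "$2"
}
```

For example, `make_input 1000000 input` would write one million blocks to a file named `input`.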

And here's an analysis:

Code:
[root@host dir]# time awk '/<Name>/' input | wc -l
1000000

real    0m7.802s
user    0m7.766s
sys     0m0.125s
[root@host dir]# time awk '/<Name>/ {i++} END {print i}' input
1000000

real    0m7.559s
user    0m7.485s
sys     0m0.074s
[root@host dir]# time grep -c "Name" input
1000000

real    0m0.158s
user    0m0.121s
sys     0m0.037s
[root@host dir]# time perl -ne '(/<Name>/) && $i++; END {print $i}' input
1000000
real    0m2.968s
user    0m2.928s
sys     0m0.040s
[root@host dir]# time sed -n '/<Name>/p' input | wc -l
1000000

real    0m3.716s
user    0m3.716s
sys     0m0.096s

Verdict: grep is the quickest of the utilities tried above for this particular task. Crudely extrapolating from 0.158s for 1 million blocks, a file with 1 billion blocks should take about 158s, or just under 3 minutes.
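
One caveat on the grep numbers: `grep -c` reports the number of matching lines, not the number of matches, so it equals the tag count only when each tag sits on its own line. The sketch below counts occurrences instead. It assumes a grep that supports `-o` (GNU and BSD grep both do), and the function name `count_tag` is hypothetical.

```shell
# count_tag TAG FILE: count occurrences of the literal string <TAG> in FILE.
# (count_tag is a hypothetical helper name.)
count_tag() {
    # -F treats the pattern as a fixed string, -o prints each match on its own line,
    # and LC_ALL=C avoids multibyte-locale handling, which is often slower.
    # tr strips the padding some wc implementations put around the number.
    LC_ALL=C grep -oF "<$1>" "$2" | wc -l | tr -d ' '
}
```

Usage would be `count_tag Name input`. Note that grep still reads a whole line at a time, so this may not cope with one extremely long line.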

Last edited by balajesuri; 03-01-2012 at 06:53 AM..
# 6  
Old 03-01-2012
I have tried all the options (grep, sed and awk), but none of them performs well when the file has 1 billion rows in it. There is one catch, though: the input XML file has all the tags on a single line, i.e. this single line becomes 1 billion rows after indentation.
So far that indentation is done manually. Can you help me with a command that indents the file first, so that the search command might then return results faster?

For example, right now the input file is:

Quote:
<Root><?xml version="1.0" encoding="UTF-8"?<Person><Name>John</Name></Person></Root>
I need a command to convert this file into the format below:

Quote:
<?xml version="1.0" encoding="UTF-8"?>
<Root>
<Person>
<Name>John</Name>
</Person>
</Root>
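
If only the count is needed, re-indenting the file may not be necessary. The sketch below streams the single line through `tr`, which works byte by byte and never holds the whole line in memory, and then counts opening tags. It assumes the `<Name>` tags carry no attributes and that no literal `>` appears in the text content; the function name `count_tag_stream` is hypothetical.

```shell
# count_tag_stream TAG FILE: count opening <TAG> elements in a file that may be one huge line.
# (count_tag_stream is a hypothetical helper name.)
count_tag_stream() {
    # tr turns every '>' into a newline, so each tag ends its own short line.
    # An opening tag then shows up as a line ending in "<TAG";
    # a closing tag ends in "/TAG" and is not counted.
    LC_ALL=C tr '>' '\n' < "$2" | LC_ALL=C grep -c "<$1\$"
}
```

Usage would be `count_tag_stream Name InputFile`. Because it only looks at the text between `>` characters, it does not check whether the file is well-formed XML.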
# 7  
Old 03-01-2012
Your one-line input file is not well formed.

For a well-formed XML file, it doesn't matter whether it is on one line or many. Try xpath; here is an example:
Code:
$ cat infile
<?xml version="1.0" encoding="UTF-8"?><Root><Person><Name>John</Name></Person><Person><Name>John</Name></Person></Root>
$ xpath infile 'count(//Name)'
Query didn't return a nodeset. Value: 2

Regards,
Birei