Split a 30GB XML file into 16 pieces


 
# 1  
Old 08-29-2011

I have a 30 GB XML file which looks like this:

Code:
<page>
<title>APRIL</title>
.........(text contents that I need to extract and store in 1.dat including the <title> tag)
</page>
<page>
<title>August</title>
....(text contents that I need to store in 2.dat including the <title> tag)
</page>

I want to split this XML file into 16 pieces.

I used the "split" command on Linux to break this file into 16 files, but the tags did not stay intact. For example, for the block below, one file contained the first half of the content and the next file contained the other half.
Code:
<page>
<title>August</title>
....(text contents that I need to store in 2.dat including the <title> tag)
</page>

Something like this:

Code:
<page>
<title>August</title>

Code:
....(text contents that I need to store in 2.dat including the <title> tag)
</page>

Can anybody please help me out with this?
# 2  
Old 08-29-2011
Are there only 16 <page>...</page> segments in your big file?
What is the rule for splitting?
For each
Code:
 <page>
<title>foo</title> 
[whatever]
</page>

do you want a file containing:
Code:
<title>foo</title> 
[whatever]

?
# 3  
Old 08-29-2011
Thanks for replying. My only criterion is to split the big 30 GB file into 16 pieces of around 1.8 GB each, which could be named 1.part, 2.part, and so on up to 16.part.
What you are describing is extracting each <page> element, with the text inside it, into a separate file. That is not my requirement: there are over 3.5 million page elements in the big XML file.

The "split" utility in Linux splits a file according to the options you give it, such as a number of lines or a size in bytes. I tried it, but it cut some of the elements in half, as shown in my example above.
To break the file into pieces of 1.8 GB each, this is the command I would use:

split -b 1800000000 BIG_XMl_FILE
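
A size-based split can keep the elements whole if a new piece is only started right after a closing </page> line. The following is a minimal sketch of that idea, not a solution tested on the real 30 GB file. It assumes that <page> and </page> sit on their own lines as in the sample above, and it drops any lines outside <page> blocks (a root element, for instance). The function name split_by_size and its three arguments are made up for illustration.

```shell
# Hypothetical helper: split_by_size INPUT MAX_BYTES OUTDIR
# Writes OUTDIR/1.part, OUTDIR/2.part, ... and starts a new part only
# after a closing </page> line, so no page is cut in half.
split_by_size() {
    # LC_ALL=C makes length() count bytes rather than characters.
    LC_ALL=C awk -v limit="$2" -v outdir="$3" '
        BEGIN      { part = 1; out = sprintf("%s/%d.part", outdir, part) }
        /<page>/   { inpage = 1 }                            # a page starts here
        inpage     { print > out; size += length($0) + 1 }   # +1 for the newline
        /<\/page>/ {
            inpage = 0
            if (size >= limit) {          # part is full: move on to the next one
                close(out)
                part++
                size = 0
                out = sprintf("%s/%d.part", outdir, part)
            }
        }
    ' "$1"
}
```

Called as `split_by_size BIG_XMl_FILE 1875000000 .`, this should give roughly 16 parts for a file of about 30 GB (30 GB / 16 is about 1.875 GB). Each part can exceed the limit by up to one page, because the cut is only made at a </page> boundary.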
# 4  
Old 08-29-2011
If the big file is a well-formed XML file, how should the root element be handled?
# 5  
Old 08-29-2011
Well, that is not a strict criterion for me. I just want to split the file into 16 pieces while keeping the tags intact.
# 6  
Old 08-29-2011
OK, so you only want the <page>...</page> blocks.

Last question: is the order important?
Say the big file contains
Code:
<page>(1) <page>(2) ... <page>(n)

You want
Code:
page(1), page(2), page(3), ..., page(k) in file.part1
page(k+1), page(k+2), page(k+3), ..., page(j) in file.part2
...

but is it OK if you get
Code:
page(1), page(17), page(23), ... in file.part1
page(2), page(18), ... in file.part2
...

?
# 7  
Old 08-29-2011
Yes, the order is not important either. The only requirement is that the tags stay intact.
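
Given these requirements (16 parts, order unimportant, every <page> block intact), one possible approach is to deal the pages out to the part files in turn. The following is a minimal sketch, not a solution tested on the real 30 GB file. It assumes that <page> and </page> sit on their own lines as in the sample in post #1, and it drops any lines outside <page> blocks (a root element, for instance). The function name split_round_robin and its arguments are made up for illustration.

```shell
# Hypothetical helper: split_round_robin INPUT PARTS OUTDIR
# Page 1 goes to OUTDIR/1.part, page 2 to OUTDIR/2.part, and so on,
# wrapping back to 1.part after PARTS pages.
split_round_robin() {
    awk -v parts="$2" -v outdir="$3" '
        /<page>/   {                      # a new page starts: pick its part file
            out = sprintf("%s/%d.part", outdir, count % parts + 1)
            count++
            inpage = 1
        }
        inpage     { print > out }        # copy every line of the current page
        /<\/page>/ { inpage = 0 }         # page finished
    ' "$1"
}
```

Called as `split_round_robin BIG_XMl_FILE 16 .`, this writes 1.part through 16.part into the current directory. The parts come out about equal in size only if page sizes are fairly evenly spread through the file. All 16 output files stay open at once, which common awk implementations such as gawk and mawk allow.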