Split a 30GB XML file into 16 pieces


 
# 8  
Old 08-29-2011
It's easy enough with low-level programming in C (or Perl). The principle is the same as in the "tail" command: seek the file pointer to FILESIZE/16, search from there for the nearest <page>, and remember its byte position; then seek to 2*FILESIZE/16, and so on. Once you have the positions, you split the file with dd (or in the same program).
If no one comes up with a solution, I'll write a program, but a little later.
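The last step described above, cutting the file at known byte positions, could look roughly like this if it is done in C rather than with dd. This is only a sketch: copy_range is a hypothetical helper that does not appear in the thread, and the 64 KB buffer size is an arbitrary assumption.

```c
#define _POSIX_C_SOURCE 200809L  /* expose fseeko and off_t */
#define _FILE_OFFSET_BITS 64     /* 64-bit off_t on 32-bit systems */

#include <stdio.h>
#include <sys/types.h>

/* Copy bytes [start, end) of `in` to `out`.
 * Returns 0 on success, -1 on a seek, read, or write error.
 * copy_range is a hypothetical helper, not code from the thread. */
int copy_range(FILE *in, FILE *out, off_t start, off_t end) {
    char buf[65536];             /* assumed buffer size */

    if (fseeko(in, start, SEEK_SET) != 0)
        return -1;

    off_t left = end - start;    /* bytes still to copy */
    while (left > 0) {
        size_t want = left < (off_t)sizeof buf ? (size_t)left : sizeof buf;
        size_t got = fread(buf, 1, want, in);
        if (got == 0)
            return -1;           /* premature EOF or read error */
        if (fwrite(buf, 1, got, out) != got)
            return -1;
        left -= (off_t)got;
    }
    return 0;
}
```

With the 15 split positions in hand, one call per piece would be enough: piece k runs from position k-1 (or 0 for the first piece) to position k (or the file size for the last piece).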
# 9  
Old 08-29-2011
So it seems this task cannot be handled very well with simple shell scripting. It will take some time, but it is worth trying in C. I'll write a program for it and post it here.
# 10  
Old 08-29-2011
I wrote an awk script to do the job; check whether it is what you need.

In the directory that holds your big file:
Code:
touch {1..16}.txt

This creates 16 empty files, 1.txt through 16.txt. Then run this:
Code:
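awk 'BEGIN { flag = 0; file = 1 }
{
    if ($0 ~ /<page>/) flag = 1;            # entering a <page> element
    if ($0 ~ /<\/page>/) {                  # closing tag: flush the whole page
        buf = buf $0 "\n";
        flag = 0;
        printf "%s", buf >> (file ".txt");  # "%s" keeps any % in the data literal
        buf = "";
        file = (file < 16) ? file + 1 : 1;  # round-robin over 1.txt .. 16.txt
    }
    if (flag == 1) {
        buf = buf $0 "\n";                  # keep the line breaks inside the page
    }
}' your30G_BIG.xml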

The code can be optimized, but first check whether it works for you.
(You can change the output file names in the code.)

Last edited by sk1418; 08-29-2011 at 07:00 AM.. Reason: the print line was removed
# 11  
Old 08-29-2011
Quick and dirty, and not tested thoroughly, but it prints something. If there are problems with '\0' bytes or with 64-bit sizes or offsets, it would be better to translate this to Perl.
Code:
#define _FILE_OFFSET_BITS 64

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NCHUNKS 16
#define NBUF  512
#define WORD "<page>"


void usage(char *progname) {
    printf("Usage: %s FILENAME\n", progname);
}

long long find_pos(FILE *fd, long long pos) {
    
    char BUF[NBUF+1];
    BUF[NBUF+1] = '\0';  // BUG: index NBUF+1 is one past the end; should be BUF[NBUF]
    
    fseeko(fd, pos, SEEK_SET);
    int count = -1;
    char *found = NULL;

    while (!found) {
        fread(BUF, NBUF, 1, fd);
        found = strstr(BUF, WORD);
        count++;
    }
    int offset = found - BUF;

    return pos + NBUF*count + offset;
}

int main(int argc, char** argv) {
    if (argc < 2) {                 // no filename given
        usage(argv[0]);
        exit(-1);
    }
    FILE *fd = fopen(argv[1], "r"); // on glibc, _FILE_OFFSET_BITS 64 maps this to fopen64
    if (! fd) {
        fprintf(stderr, "Couldn't open %s\n", argv[1]);
        usage(argv[0]);
        exit(-1);
    }

    long long fs;
    if (fseeko(fd, 0, SEEK_END)) {
        fprintf(stderr, "Couldn't go to the end of the file\n");
        exit(-1);
    }

    fs = ftello(fd);

    int count = 0;
    while (count < NCHUNKS-1) {
        printf("%lld ", find_pos(fd, ++count * fs/NCHUNKS));
    }
    printf("\n");
    
    fclose(fd);
    return 0;
}

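You can test this with the following command, replacing NUM with one of the printed offsets; the first line of output should begin with <page>: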
Code:
xxd -sNUM testfile | head -1

Yes, there is a bug. I was too quick. Thanks, Corona688!

Last edited by yazu; 08-29-2011 at 12:59 PM.. Reason: Bug
# 12  
Old 08-29-2011
Nice-looking program, though I would note one problem:
Code:
char BUF[NBUF+1];
BUF[NBUF+1] = '\0';

Replace NBUF with 4 and follow along:

Code:
char BUF[4+1];
BUF[0]=0; // first element
BUF[1]=1; // second element
BUF[2]=2; // third element
BUF[3]=3; // fourth element
BUF[4]=4; // fifth element, the last valid index
BUF[5]=5; // SIXTH element!  BUF[4+1] is beyond the end!

If you're lucky, this will do nothing.

If you're unlucky, it will crash your program.

If you're very unlucky, it will corrupt stack values in strange ways that alter other local variables and cause unpredictable misbehavior.

This often results in programs that work fine when compiled for debugging, but do strange things when optimized -- suddenly memory values which didn't matter get stripped out and you're only stomping on ones that do.

Code:
char BUF[NBUF+1];
BUF[NBUF] = '\0';

That should be all it needs, I think.
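A related way to stay inside the array, sketched here as an assumption rather than as the thread's fix, is to terminate the buffer at the number of bytes fread actually returned. That count can never exceed NBUF, so the write always lands on a valid index, and a short final read does not leave stale bytes in front of the terminator. read_chunk is a hypothetical helper name.

```c
#include <stdio.h>

#define NBUF 512

/* Read up to NBUF bytes and terminate the buffer right after the last
 * byte actually read. buf must have room for NBUF + 1 chars, so the
 * highest valid index is NBUF. read_chunk is a hypothetical helper. */
size_t read_chunk(FILE *fp, char buf[NBUF + 1]) {
    size_t n = fread(buf, 1, NBUF, fp);  /* n is in the range 0..NBUF */
    buf[n] = '\0';                       /* index n is never past NBUF */
    return n;
}
```

Note that the arguments to fread are (buf, 1, NBUF, fp) here, so the return value counts bytes; with the original order (buf, NBUF, 1, fp) it would count whole NBUF-sized items instead.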
# 13  
Old 08-29-2011
Yes, a beginner's mistake. The program has one more problem: if the file does not contain enough occurrences of WORD, it will crash or loop forever, so you cannot test it on an arbitrary file. In your situation, though, I think that cannot happen.
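For anyone who does want to run it on arbitrary files, here is a hedged sketch of a search loop that stops at end of file instead of looping. find_pos_checked is a hypothetical variant of find_pos, not the code posted above; it returns -1 when WORD is not found, and it also carries a few bytes over between reads so that a WORD split across two chunks is still seen. It assumes the file contains no '\0' bytes, since strstr stops at the first one.

```c
#define _POSIX_C_SOURCE 200809L  /* expose fseeko and off_t */
#define _FILE_OFFSET_BITS 64     /* 64-bit off_t on 32-bit systems */

#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define NBUF 512
#define WORD "<page>"

/* Return the offset of the first WORD at or after pos,
 * or -1 if none is found before end of file.
 * find_pos_checked is a hypothetical variant of find_pos. */
off_t find_pos_checked(FILE *fp, off_t pos) {
    char buf[NBUF + 1];
    const size_t overlap = strlen(WORD) - 1; /* bytes carried between reads */
    size_t keep = 0;                         /* carried bytes in buf[0..keep) */
    off_t base = pos;                        /* file offset of buf[0] */

    if (fseeko(fp, pos, SEEK_SET) != 0)
        return -1;

    for (;;) {
        size_t n = fread(buf + keep, 1, NBUF - keep, fp);
        if (n == 0)
            return -1;                       /* EOF or error: give up */
        size_t len = keep + n;
        buf[len] = '\0';                     /* len <= NBUF, so this is in bounds */

        char *found = strstr(buf, WORD);
        if (found)
            return base + (found - buf);

        /* Keep the tail so a WORD split across two reads is still found. */
        keep = len < overlap ? len : overlap;
        memmove(buf, buf + len - keep, keep);
        base += (off_t)(len - keep);
    }
}
```

The caller then has to check for -1 before printing or using the position.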
# 14  
Old 08-29-2011
Code:
    long long fs;
    if (fseeko(fd, 0, SEEK_END)) {
        fprintf(stderr, "Couldn't go to the end of the file\n");
        exit(-1);
    }

    fs = ftello(fd);

To make your code more portable, you should use off_t instead of long long; it is the type that fseeko takes and ftello returns.
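A minimal sketch of the same size lookup with off_t follows. The helper names file_size and print_offset are hypothetical. Because printf has no conversion specifier for off_t itself, the sketch casts to intmax_t and prints with %jd.

```c
#define _POSIX_C_SOURCE 200809L  /* expose fseeko, ftello, and off_t */
#define _FILE_OFFSET_BITS 64     /* 64-bit off_t on 32-bit systems */

#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/* Return the size of an open file in bytes, or -1 on error.
 * file_size is a hypothetical helper. */
off_t file_size(FILE *fp) {
    if (fseeko(fp, 0, SEEK_END) != 0)
        return -1;
    return ftello(fp);           /* ftello returns off_t, -1 on failure */
}

/* Print an offset portably: cast to intmax_t and use %jd. */
void print_offset(off_t off) {
    printf("%jd\n", (intmax_t)off);
}
```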