Split a 30GB XML file into 16 pieces
Post 302551094 by yazu on Monday 29th of August 2011 11:31:22 AM
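Quick and dirty, and not tested thoroughly, but it prints something. It finds the first "<page>" at or after each 1/16 boundary of the file and prints those byte offsets. If there are problems with '\0' bytes or with 64-bit sizes or offsets, it would be better to translate this to Perl.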
Code:
#define _FILE_OFFSET_BITS 64

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NCHUNKS 16
#define NBUF  512
#define WORD "<page>"


void usage(char *progname) {
    printf("Usage: %s FILENAME\n", progname);
}

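// Return the offset of the first WORD at or after pos, or -1 if there is none.
long long find_pos(FILE *fd, long long pos) {
    char BUF[NBUF+1];
    size_t keep = 0;                  // bytes carried over from the previous read
    const size_t wlen = strlen(WORD);

    if (fseeko(fd, pos, SEEK_SET))
        return -1;

    for (;;) {
        size_t n = fread(BUF + keep, 1, NBUF - keep, fd);
        if (n == 0)
            return -1;                // EOF or read error before WORD was found
        size_t len = keep + n;
        BUF[len] = '\0';              // len <= NBUF; BUF[NBUF+1] would be out of bounds

        char *found = strstr(BUF, WORD);   // still stops at an embedded '\0'
        if (found)
            return pos + (found - BUF);

        // Keep the last wlen-1 bytes so a WORD split across two reads is still seen.
        keep = len < wlen - 1 ? len : wlen - 1;
        memmove(BUF, BUF + len - keep, keep);
        pos += len - keep;            // pos is always the file offset of BUF[0]
    }
}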

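int main(int argc, char** argv) {
    if (argc != 2) {
        usage(argv[0]);
        exit(EXIT_FAILURE);
    }
    // On glibc, _FILE_OFFSET_BITS 64 (defined above) already makes fopen large-file aware.
    FILE *fd = fopen(argv[1], "rb");
    if (! fd) {
        fprintf(stderr, "Couldn't open %s\n", argv[1]);
        usage(argv[0]);
        exit(EXIT_FAILURE);
    }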

    long long fs;
    if (fseeko(fd, 0, SEEK_END)) {
        fprintf(stderr, "Couldn't go to the end of the file\n");
        exit(-1);
    }

    fs = ftello(fd);

    int count = 0;
    while (count < NCHUNKS-1) {
        printf("%lld ", find_pos(fd, ++count * fs/NCHUNKS));
    }
    printf("\n");
    
    fclose(fd);
    return 0;
}

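You can test this with the command below, replacing NUM with one of the printed offsets. The dumped line should begin with "<page>":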
Code:
xxd -sNUM testfile | head -1
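
Once the offsets look right, each piece is just the byte range between two consecutive offsets. Here is a minimal sketch of copying one such range, not part of the original program. The helper name copy_range and the 64 KiB buffer size are assumptions made for illustration, and the error handling is deliberately coarse.

```c
#define _FILE_OFFSET_BITS 64   /* must come before any #include */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: copy bytes [start, end) from in to out.
 * Returns 0 on success, -1 if seeking, reading or writing fails
 * (including hitting EOF before end is reached). */
static int copy_range(FILE *in, FILE *out, off_t start, off_t end) {
    char buf[65536];
    assert(start >= 0 && start <= end);

    if (fseeko(in, start, SEEK_SET) != 0)
        return -1;

    off_t left = end - start;
    while (left > 0) {
        /* Never ask for more than what remains in the range. */
        size_t want = left < (off_t)sizeof buf ? (size_t)left : sizeof buf;
        size_t got = fread(buf, 1, want, in);
        if (got == 0)
            return -1;
        if (fwrite(buf, 1, got, out) != got)
            return -1;
        left -= (off_t)got;
    }
    return 0;
}
```

With the 15 printed offsets plus 0 and the file size as outer bounds, calling this 16 times with consecutive pairs would give the 16 pieces. Note that only the first piece would carry the XML header and only the last the closing root tag.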

Yes, there was a bug in the version first posted. Too quick... ))) Thanks, Corona688!
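
On the '\0' worry mentioned at the top: strstr stops at the first NUL byte, so a stray '\0' in the buffer would hide any "<page>" that follows it. One possible way around that, staying in C, is to search by length instead of by terminator. The helper below is a sketch with a made-up name (find_bytes), not something from the original program. glibc also offers memmem() as an extension that does the same job.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: locate needle inside hay using explicit lengths,
 * so NUL bytes in hay are treated like any other byte.
 * Returns a pointer to the first match, or NULL if there is none. */
static const char *find_bytes(const char *hay, size_t hay_len,
                              const char *needle, size_t needle_len) {
    assert(hay != NULL && needle != NULL);
    if (needle_len == 0)
        return hay;
    if (hay_len < needle_len)
        return NULL;
    for (size_t i = 0; i + needle_len <= hay_len; i++) {
        /* Cheap first-byte check before the full comparison. */
        if (hay[i] == needle[0] && memcmp(hay + i, needle, needle_len) == 0)
            return hay + i;
    }
    return NULL;
}
```

In find_pos this would be called with the number of bytes actually read rather than relying on a terminator.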

Last edited by yazu; 08-29-2011 at 12:59 PM. Reason: Bug
FSEEKO(3)                  Linux Programmer's Manual                  FSEEKO(3)

NAME
       fseeko, ftello - seek to or report file position

SYNOPSIS
       #include <stdio.h>

       int fseeko(FILE *stream, off_t offset, int whence);
       off_t ftello(FILE *stream);

   Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

       fseeko(), ftello():
           _FILE_OFFSET_BITS == 64 || _POSIX_C_SOURCE >= 200112L
           (defining the obsolete _LARGEFILE_SOURCE macro also works)

DESCRIPTION
       The fseeko() and ftello() functions are identical to fseek(3) and
       ftell(3) (see fseek(3)), respectively, except that the offset argument
       of fseeko() and the return value of ftello() is of type off_t instead
       of long.

       On some architectures, both off_t and long are 32-bit types, but
       defining _FILE_OFFSET_BITS with the value 64 (before including any
       header files) will turn off_t into a 64-bit type.

RETURN VALUE
       On successful completion, fseeko() returns 0, while ftello() returns
       the current offset. Otherwise, -1 is returned and errno is set to
       indicate the error.

ERRORS
       See the ERRORS in fseek(3).

VERSIONS
       These functions are available under glibc since version 2.1.

ATTRIBUTES
       For an explanation of the terms used in this section, see
       attributes(7).

       +-------------------+---------------+---------+
       |Interface          | Attribute     | Value   |
       +-------------------+---------------+---------+
       |fseeko(), ftello() | Thread safety | MT-Safe |
       +-------------------+---------------+---------+

CONFORMING TO
       POSIX.1-2001, POSIX.1-2008, SUSv2.

SEE ALSO
       fseek(3)

COLOPHON
       This page is part of release 4.15 of the Linux man-pages project. A
       description of the project, information about reporting bugs, and the
       latest version of this page, can be found at
       https://www.kernel.org/doc/man-pages/.

                                  2017-09-15                          FSEEKO(3)
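
As a small illustration of the two calls described in the man page, the sketch below measures the size of an open stream with fseeko() and ftello() and checks both return values. The helper name stream_size is an assumption for this example, and restoring the original position afterwards is a design choice rather than anything the man page requires.

```c
#define _FILE_OFFSET_BITS 64   /* must come before any #include */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: return the size in bytes of a seekable stream,
 * or -1 on error. The stream position is restored before returning. */
static off_t stream_size(FILE *fp) {
    assert(fp != NULL);

    off_t saved = ftello(fp);              /* remember where we were */
    if (saved == (off_t)-1)
        return -1;
    if (fseeko(fp, 0, SEEK_END) != 0)      /* jump to the end */
        return -1;
    off_t size = ftello(fp);               /* offset at the end == size */
    if (fseeko(fp, saved, SEEK_SET) != 0)  /* go back */
        return -1;
    return size;
}
```

Because off_t is 64 bits wide here, the result stays correct for files well beyond 2 GB, such as the 30 GB file in this thread.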