Splitting a file into several smaller files using perl

03-29-2012


Hi,
I'm trying to split a large file into several smaller files.
The script takes two input arguments: argument1 = the file name and argument2 = the number of files to split into.

My large input file has a header followed by 100009 records.
The first line is the header; I want this header in every split file.

Here is what I have done so far:

Code:

#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use POSIX qw(ceil);

die "usage: $0 inputfile numfiles\n" unless @ARGV == 2;
my $inputfile = $ARGV[0];
my $nof       = $ARGV[1];                 # nof - no of files to split into
my ($filename, $dir, $ext) = fileparse($inputfile, qr/\..*/);

open(my $in, '<', $inputfile) or die "Cannot open $inputfile: $!";
my $header  = <$in>;                      # first line is the header
my @records = <$in>;                      # remaining lines are the records
close($in);

my $NOARIF = scalar @records;             # NOARIF - no of actual records in file
my $NNORPF = ceil($NOARIF / $nof);        # NNORPF - no of records per file, rounded up

my $count   = 0;
my $filenum = 0;
my $out;

foreach my $line (@records) {
    if ($count == 0) {
        # start the next output file and write the header first
        $filenum++;
        my $nfilename = "${filename}_${filenum}${ext}";
        open($out, '>', $nfilename) or die "Cannot open $nfilename: $!";
        print $out $header;
    }
    print $out $line;
    $count++;
    if ($count == $NNORPF) {
        # this file is full; the next record opens a new one
        close($out);
        $count = 0;
    }
}
close($out) if $count > 0;

Here is my challenge:
Say I'm splitting my large input file into 10 files;
the first 9 files should then have 10001 records each and the last should have 10000 records.

How do I get this working?
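
For reference, the per-file count described above (10001 for the first nine files, 10000 for the last) is what you get by rounding the division up. A minimal sketch in plain shell arithmetic follows; the helper name `records_per_file` is made up for this illustration and is not part of the script above.

```shell
#!/bin/sh
# records_per_file RECORDS NFILES
# Prints RECORDS / NFILES rounded up, using integer arithmetic only.
records_per_file() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# 100009 records into 10 files
per_file=$(records_per_file 100009 10)
last_file=$(( 100009 - per_file * (10 - 1) ))
echo "records per file: $per_file, records in last file: $last_file"
```

With 100009 records and 10 files this prints 10001 per file and 10000 for the last one.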

ramky79


03-29-2012


Instead of showing a program that doesn't do what you want, and that doesn't seem to have anything to do with your question, and hoping we can guess what you think it's supposed to do, why not just show the input data you have and the output data you want? Less guessing that way.

Corona688


03-29-2012


Here is my input file "file.dat". The file has 19 records and I want to split them into 10 files.

BASENAME STREETTYPE PREFIX SUFFIX HOUSENUMBER
1 jhj jgu gv 36
2 dut jhg hg 54
3 gkl jkl hv 67
4 fjh gfh hg 45
5 hgl hgk hg 73
6 hkj hg yg 79
1 jhj jgu gv 36
2 dut jhg hg 54
3 gkl jkl hv 67
4 fjh gfh hg 45
5 hgl hgk hg 73
6 hkj hg yg 79
1 jhj jgu gv 36
2 dut jhg hg 54
3 gkl jkl hv 67
4 fjh gfh hg 45
5 hgl hgk hg 73
6 hkj hg yg 79
5 hgl hgk hg 73

Using this command

Code:

splitfile.pl file.dat 10

I want my output to look like this

Code:

file_1.dat:
BASENAME        STREETTYPE      PREFIX  SUFFIX  HOUSENUMBER
1       jhj     jgu     gv      36
2       dut     jhg     hg      54

file_2.dat:
BASENAME        STREETTYPE      PREFIX  SUFFIX  HOUSENUMBER
3       gkl     jkl     hv      67
4       fjh     gfh     hg      45

file_3.dat:
BASENAME        STREETTYPE      PREFIX  SUFFIX  HOUSENUMBER
5       hgl     hgk     hg      73
6       hkj     hg      yg      79

file_4.dat:
BASENAME        STREETTYPE      PREFIX  SUFFIX  HOUSENUMBER
1       jhj     jgu     gv      36
2       dut     jhg     hg      54

file_5.dat:
BASENAME        STREETTYPE      PREFIX  SUFFIX  HOUSENUMBER
3       gkl     jkl     hv      67
4       fjh     gfh     hg      45

file_6.dat:
BASENAME        STREETTYPE      PREFIX  SUFFIX  HOUSENUMBER
5       hgl     hgk     hg      73
6       hkj     hg      yg      79

file_7.dat:
BASENAME        STREETTYPE      PREFIX  SUFFIX  HOUSENUMBER
1       jhj     jgu     gv      36
2       dut     jhg     hg      54

file_8.dat:
BASENAME        STREETTYPE      PREFIX  SUFFIX  HOUSENUMBER
3       gkl     jkl     hv      67
4       fjh     gfh     hg      45

file_9.dat:
BASENAME        STREETTYPE      PREFIX  SUFFIX  HOUSENUMBER
5       hgl     hgk     hg      73
6       hkj     hg      yg      79

file_10.dat:
BASENAME        STREETTYPE      PREFIX  SUFFIX  HOUSENUMBER
5       hgl     hgk     hg      73

Last edited by Corona688; 03-30-2012 at 12:23 PM.. Reason: Code tags for code and data, please.

ramky79


03-30-2012


Code:

$ cat fsplit.sh

#!/bin/sh

if [ "$#" -lt 2 ] || [ ! -f "$1" ]
then
        echo "usage:  $0 inputfile numfiles" >&2
        exit 1
fi

awk -v NFILES="$2" -v FNAME="file_%d.dat" '
        # Do not print the first file -- just count lines
        NR==FNR { next }

        # First line of the second read through the file.
        FNR==1 {        HEADER=$0
                        MAXLINES=sprintf("%d", (NR-1)/NFILES);
                        LINES=MAXLINES
                        next    }

        # skip to the next file and print header if exceeded maxlines
        (LINES >= MAXLINES) {
                        LINES=0;        FILE++;
                        print HEADER > sprintf(FNAME,FILE);     }

        # Print all lines into the current file
        { print > sprintf(FNAME, FILE); LINES++ }

# Yes, we give awk the same file twice.  On the first read, it just counts
# lines.  On the second, it decides which lines go into what file.
' "$1" "$1"

$ cat data
BASENAME STREETTYPE PREFIX SUFFIX HOUSENUMBER
1 jhj jgu gv 36
2 dut jhg hg 54
3 gkl jkl hv 67
4 fjh gfh hg 45
5 hgl hgk hg 73
6 hkj hg yg 79
1 jhj jgu gv 36
2 dut jhg hg 54
3 gkl jkl hv 67
4 fjh gfh hg 45
5 hgl hgk hg 73
6 hkj hg yg 79
1 jhj jgu gv 36
2 dut jhg hg 54
3 gkl jkl hv 67
4 fjh gfh hg 45
5 hgl hgk hg 73
6 hkj hg yg 79
5 hgl hgk hg 73

$ tail *.dat
==> file_1.dat <==
BASENAME STREETTYPE PREFIX SUFFIX HOUSENUMBER
1 jhj jgu gv 36
2 dut jhg hg 54

==> file_10.dat <==
BASENAME STREETTYPE PREFIX SUFFIX HOUSENUMBER
5 hgl hgk hg 73

==> file_2.dat <==
BASENAME STREETTYPE PREFIX SUFFIX HOUSENUMBER
3 gkl jkl hv 67
4 fjh gfh hg 45

==> file_3.dat <==
BASENAME STREETTYPE PREFIX SUFFIX HOUSENUMBER
5 hgl hgk hg 73
6 hkj hg yg 79

==> file_4.dat <==
BASENAME STREETTYPE PREFIX SUFFIX HOUSENUMBER
1 jhj jgu gv 36
2 dut jhg hg 54

==> file_5.dat <==
BASENAME STREETTYPE PREFIX SUFFIX HOUSENUMBER
3 gkl jkl hv 67
4 fjh gfh hg 45

==> file_6.dat <==
BASENAME STREETTYPE PREFIX SUFFIX HOUSENUMBER
5 hgl hgk hg 73
6 hkj hg yg 79

==> file_7.dat <==
BASENAME STREETTYPE PREFIX SUFFIX HOUSENUMBER
1 jhj jgu gv 36
2 dut jhg hg 54

==> file_8.dat <==
BASENAME STREETTYPE PREFIX SUFFIX HOUSENUMBER
3 gkl jkl hv 67
4 fjh gfh hg 45

==> file_9.dat <==
BASENAME STREETTYPE PREFIX SUFFIX HOUSENUMBER
5 hgl hgk hg 73
6 hkj hg yg 79

$

Note that the files are listed out of order because 10 does not sort alphabetically after 9. Try %02d instead of %d to get zero-padded numbers that are always two digits wide.
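
To see the difference between the two formats, here is a small sketch that prints the names for files 1 to 10 and sorts them. The helper `list_names` is hypothetical, and `LC_ALL=C` is assumed so that the sort is plain byte order.

```shell
#!/bin/sh
# list_names FORMAT
# Prints the names of files 1..10 built with the given printf format, sorted.
list_names() {
    i=1
    while [ "$i" -le 10 ]; do
        printf "$1\n" "$i"
        i=$((i + 1))
    done | LC_ALL=C sort
}

list_names 'file_%d.dat'   | head -n 3   # file_1.dat, file_10.dat, file_2.dat
list_names 'file_%02d.dat' | head -n 3   # file_01.dat, file_02.dat, file_03.dat
```

With %02d, file_10.dat sorts last, as you would expect.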

Corona688


03-30-2012


Hi,
This script works, but when I try to split a large file (~150000 records),
it creates more than 10000 files.

I'm trying to split a large file into 10 files. I have tweaked the code a little to suit my needs;
here is what I'm passing in.

The parameters are:
inputfile = NA.dat
numfiles = 10
opdir = destination directory where I need my files
fileext = the file extension (dat)
region = the first part of my file name (NA)

Code:

#!/bin/sh

if [ "$#" -lt 5 ] || [ ! -f "$1" ]
then
        echo "usage:  $0 inputfile numfiles opdir fileext region" >&2
        exit 1
fi

awk -v NFILES="$2" -v FNAME="$3/$5_%d.$4" '
        # Do not print the first file -- just count lines
        NR==FNR { next }

        # First line of the second read through the file.
        FNR==1 {        HEADER=$0
                        MAXLINES=sprintf("%d", (NR-1)/NFILES);
                        LINES=MAXLINES
                        next    }

        # skip to the next file and print header if exceeded maxlines
        (LINES >= MAXLINES) {
                        LINES=0;        FILE++;
                        print HEADER > sprintf(FNAME,FILE);     }

        # Print all lines into the current file
        { print > sprintf(FNAME, FILE); LINES++ }

# Yes, we give awk the same file twice.  On the first read, it just counts
# lines.  On the second, it decides which lines go into what file.
' "$1" "$1"

I want 10 files to be created in my opdir (NA_1.dat .. NA_10.dat), with 15000 records in each of them.

ramky79


03-30-2012


There was a minor error in my code which caused the MAXLINES variable to be a string and not a number.

Unfortunately, your input data was so perfectly matched to my bug that it worked anyway, so I didn't notice!

All your code needs is two keystrokes: +0. I'd also add a bit of error checking.
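
To illustrate why the string matters: sprintf returns a string, and awk compares a number with a string as strings, so "2" already counts as greater than or equal to "15000". A rough sketch of that effect is below; the helper `rollover` and its mode names are made up for this example, and 150001 stands in for the line count of a header plus 150000 records.

```shell
#!/bin/sh
# rollover LINES MODE
# Prints "yes" if awk considers LINES >= MAXLINES, where MAXLINES is the
# sprintf result as-is (MODE=string) or with +0 applied (MODE=number).
rollover() {
    awk -v lines="$1" -v mode="$2" 'BEGIN {
        lines = lines + 0                        # make sure this is a number
        maxlines = sprintf("%d", 150001 / 10)    # the string "15000"
        if (mode == "number") maxlines = maxlines + 0
        if (lines >= maxlines) print "yes"; else print "no"
    }'
}

rollover 2 string   # yes: compared as text, "2" sorts after "15000"
rollover 2 number   # no:  compared numerically, 2 < 15000
```

That is why the string version starts a new file after only two lines on a large input.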

Code:

#!/bin/sh

if [ "$#" -lt 5 ] || [ ! -f "$1" ] || [ ! -d "$3" ]
then
        echo "usage:  $0 inputfile numfiles opdir fileext region" >&2
        exit 1
fi

awk -v NFILES="$2" -v FNAME="$3/$5_%d.$4" '
        # Do not print the first file -- just count lines
        NR==FNR { next }

        # First line of the second read through the file.
        FNR==1 {        HEADER=$0
                        MAXLINES=sprintf("%d", (NR-1)/NFILES)+0;
                        LINES=MAXLINES
                        next    }

        # skip to the next file and print header if exceeded maxlines
        (LINES >= MAXLINES) {
                        LINES=0;        FILE++;
                        print HEADER > sprintf(FNAME,FILE);     }

        # Print all lines into the current file
        { print > sprintf(FNAME, FILE); LINES++ }

# Yes, we give awk the same file twice.  On the first read, it just counts
# lines.  On the second, it decides which lines go into what file.
' "$1" "$1"

This time I tested it all-out, making files with 150,000 lines, 100,009 lines, etc, etc. It splits as you wanted.
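
A check along these lines can be reproduced with generated data. The sketch below is only one way to do it: it builds a header plus 100009 numbered records in a temporary directory, runs the same awk logic with a zero-padded output name, and reports the file count and the line counts of the first and last files. The file names, the `OUT` variable and the use of `mktemp` are choices made for this example.

```shell
#!/bin/sh
# Build a test input: one header line followed by 100009 numbered records.
tmp=$(mktemp -d)
awk 'BEGIN { print "HEADER"; for (i = 1; i <= 100009; i++) print i }' > "$tmp/in.dat"

# Split it into 10 files using the two-pass approach with the +0 fix.
awk -v NFILES=10 -v FNAME="$tmp/out_%02d.dat" '
    NR==FNR { next }                          # first pass: only count lines
    FNR==1  { HEADER=$0
              MAXLINES=sprintf("%d", (NR-1)/NFILES)+0
              LINES=MAXLINES
              next }
    (LINES >= MAXLINES) { LINES=0; FILE++
                          OUT=sprintf(FNAME, FILE)
                          print HEADER > OUT }
    { print > OUT; LINES++ }
' "$tmp/in.dat" "$tmp/in.dat"

# Count the output files and the lines (header included) in the first and last.
nfiles=$(ls "$tmp"/out_*.dat | wc -l | tr -d ' ')
first=$(wc -l < "$tmp/out_01.dat" | tr -d ' ')
last=$(wc -l < "$tmp/out_10.dat" | tr -d ' ')
echo "files=$nfiles first=$first last=$last"
rm -rf "$tmp"
```

For this input it should report 10 files, with 10002 lines in the first (header plus 10001 records) and 10001 in the last.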


Corona688


03-30-2012


Thanks, Corona688, this works as expected. I made one small change: instead of +0 I used +1, so that I'm splitting into 10 files instead of 11.

Thank you so much.
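
A caveat on the +1 tweak, for anyone reusing it: it matches rounding up when the count does not divide evenly, but when it does divide evenly the chunks come out one line too large and you can end up with fewer files than requested. The sketch below compares the three roundings under a simplifying assumption, namely that the chunk size is computed from the number of data records alone. The helper `count_files` and its mode names are hypothetical.

```shell
#!/bin/sh
# count_files RECORDS NFILES MODE
# Prints how many output files result when RECORDS lines are written in
# chunks of RECORDS/NFILES lines, rounded according to MODE:
#   trunc - truncate (the "+0" version)
#   plus1 - truncate, then add one (the "+1" tweak)
#   ceil  - round up only when there is a remainder
# Assumes RECORDS >= NFILES >= 1.
count_files() {
    awk -v n="$1" -v f="$2" -v mode="$3" 'BEGIN {
        q = int(n / f)
        if (mode == "plus1") q = q + 1
        if (mode == "ceil" && q * f < n) q = q + 1
        files = int(n / q)
        if (files * q < n) files = files + 1
        print files
    }'
}

count_files 100005 10 trunc   # 11
count_files 100005 10 plus1   # 10
count_files 20 10 plus1       # 7
count_files 20 10 ceil        # 10
```

Rounding up only when there is a remainder keeps the count at 10 in both cases.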

ramky79
