Script to split text files


 
# 1  
Old 03-20-2010

Hi All,
I'm fairly new to scripting, so need a little help to get started with this problem.
I don't mind whether I go for an awk/bash/other approach, I don't really know which would be best suited to the problem...

Lets say I have a 10000 line text file, I would like to split this up into a few smaller files. Something like:
10 line, say the last 10 lines
100 line, say the first 100 lines
1000 line, say the last 1000 lines
5000 line, say the middle 5000 lines

This I could probably manage with head & tail etc.
However, if my text file was only 1000 lines long it would not work so well. I'd get the 10- and 100-line files OK, but the 3rd would give me what I already have, and I guess the 4th would fail. What I would actually want is more like:
1 line
10 lines
100 lines
500 lines

Similarly, for a text file much larger than 10000 lines I'd want the same behaviour in the other direction, e.g. a 100k-line file would give 100, 1000, 10000, and 50000 lines.

The number of lines does not need to be exact either. I would not mind doing the splits based on a percentage of the lines in the original file, nor would I mind if lines in the original file were selected at random.
Basically, I just want a set of small, medium, large, and larger files of whatever size, proportional to the original. The files would not need to be unique either: line 1 in the small file, and then lines 1-10 in the medium file, is fine, though if it's easier I would not mind lines 2-11 in the second file.

I hope I've not over-complicated this explanation...
Would somebody please give me a steer on where to start? What should I use for this (awk?), and should I try to use percentages, or try to work out absolutes that work in every situation?

Many thanks!

Phil.
# 2  
Old 03-20-2010
Your requirements are not really clear.
As a general answer:
Code:
#!/bin/ksh
# splitem.sh -- copy lines start..end of infile to outfile
infile=$1
start=$2
end=$3
outfile=$4
awk -v start="$start" -v end="$end" 'NR >= start && NR <= end' "$infile" > "$outfile"

usage:
Code:
./splitem.sh myfile 1 5000 myfilefirst5000

You can get the first 5000 and last 5000 lines with head -5000 and tail -5000.
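For instance, with a made-up 20-line input (the filenames here are just placeholders):

```shell
seq 1 20 > myfile          # a made-up 20-line input

head -n 5 myfile > first5  # first 5 lines
tail -n 5 myfile > last5   # last 5 lines

# a middle slice (lines 8-12), same filter splitem.sh uses
awk -v start=8 -v end=12 'NR >= start && NR <= end' myfile > middle
```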
# 3  
Old 03-20-2010
Thanks for your response. I understand how to use head and tail, and your script will help pull out a middle section; however, let me try to clear up what I'm trying to do.

The input text file could be anything from 1 line long, to 100,000 lines long.
From it I want to produce a subset of small, medium, and large files.
The problem is that I want it to be dynamic, and for the subset of files to be representative of the input file.
Something like:
Input is 1 line long: small output is 1 line, medium is 1 line, large is 1 line.
Input is 10 lines long: small output is ~2 lines, medium is ~4 lines, large is ~8 lines.
Input is 180 lines long: small output is ~20 lines, medium is ~90, large is ~120.

I don't mind overlap between what is in the files, but want to avoid over-coverage on one part, like the start.

The problem with head and tail is that the line counts would be hard-coded ("head -n ??").

It would need to be something more like:
#Note, I'm really guessing here, I hope this helps in some way illustrate...
originalfile_lines=$(wc -l < original.txt)
smallfile_lines=$(( originalfile_lines * 10 / 100 ))
mediumfile_lines=$(( originalfile_lines * 30 / 100 ))
largefile_lines=$(( originalfile_lines * 75 / 100 ))

for i in $(seq 1 "$smallfile_lines"); do
    : # append one random line from original.txt to smallfile
done

Or something to that effect. Please excuse my poor pseudo-code.
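In working bash, the idea might look something like this; shuf for the random selection is a GNU coreutils assumption, and the filenames are illustrative:

```shell
#!/bin/bash
# proportional_split.sh -- illustrative sketch, not a tested solution
infile=original.txt
seq 1 200 > "$infile"          # demo input: 200 lines

total=$(wc -l < "$infile")
small=$((  total * 10 / 100 )); [ "$small"  -lt 1 ] && small=1
medium=$(( total * 30 / 100 )); [ "$medium" -lt 1 ] && medium=1
large=$((  total * 75 / 100 )); [ "$large"  -lt 1 ] && large=1

# pick N random lines from the input (order is randomised too)
shuf -n "$small"  "$infile" > smallfile
shuf -n "$medium" "$infile" > mediumfile
shuf -n "$large"  "$infile" > largefile
```

The minimum-of-1 guards keep a 1-line input producing 1-line outputs, as described above.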

Thanks Again,

Phil.
# 4  
Old 03-20-2010
This works.

First, create some files:

Code:
#!/usr/bin/perl
# file name is cfil.pl
foreach $j (@ARGV) {
	die "bad j: $j\n" if ($j<1 or $j>100000) ;
	open($fh, '>', "$j.txt") or die $!;

	for($i=1; $i<=$j; $i++) {
		print $fh "Line $i of $j: Blah Blah Ha de blab blah\n";
	}
	close $fh or die $!;
}

$ cfil.pl 10 20 50 100 1000 50000
$ wc -l *
      10 10.txt
     100 100.txt
    1000 1000.txt
      20 20.txt
      50 50.txt
   50000 50000.txt
   51180 total
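For what it's worth, the same kind of dummy file can be generated from the shell too (a sketch using awk; the size here is arbitrary):

```shell
# generate a j-line dummy file, like cfil.pl does for each argument
j=20
awk -v j="$j" 'BEGIN {
  for (i = 1; i <= j; i++)
    printf "Line %d of %d: Blah Blah Ha de blab blah\n", i, j
}' > "$j.txt"
```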

Now the Perl script using (roughly) your suggested ratios.

Code:
#!/usr/bin/perl
use POSIX;
# file name is 3files.pl
$smp=.10;
$mp=.25;
$lp=.75;

foreach $name (@ARGV) {
	open($fh, '<',$name) or die $!;
	@contents=<$fh>;
	$lines= scalar @contents;
	
	$smlines=$mlines=$llines=0;
	
	print "$name has $lines lines\n";
	
	open($smfh, '>',"sm-$name") or die $!;
	open($mfh, '>', "mid-$name") or die $!;
	open($lfh, '>', "lg-$name") or die $!;
	for($l=0; $l<$lines; $l++) {
		if (fmod($l,1/$smp)<1) {
			$smlines++; 
			print $smfh $contents[$l];
		}
		
		if (fmod($l,1/$mp)<1) {
			$mlines++;
			print $mfh $contents[$l];
		}
		
		
		if (fmod($l,1/$lp)<1) {
			$llines++;
			print $lfh $contents[$l] ;
		}	
	}
	print "\tsm-$name has $smlines lines\n";
	print "\tmid-$name has $mlines lines\n";
	print "\tlg-$name has $llines lines\n\n"; 
	close $fh or die $!;
	close $smfh or die $!;
	close $mfh or die $!;
	close $lfh or die $!;
}	

$ 3files.pl *.txt

10.txt has 10 lines
	sm-10.txt has 1 lines
	mid-10.txt has 3 lines
	lg-10.txt has 7 lines

100.txt has 100 lines
	sm-100.txt has 10 lines
	mid-100.txt has 25 lines
	lg-100.txt has 75 lines

1000.txt has 1000 lines
	sm-1000.txt has 100 lines
	mid-1000.txt has 250 lines
	lg-1000.txt has 750 lines

20.txt has 20 lines
	sm-20.txt has 2 lines
	mid-20.txt has 5 lines
	lg-20.txt has 15 lines

50.txt has 50 lines
	sm-50.txt has 5 lines
	mid-50.txt has 13 lines
	lg-50.txt has 37 lines

50000.txt has 50000 lines
	sm-50000.txt has 5000 lines
	mid-50000.txt has 12500 lines
	lg-50000.txt has 37500 lines

This prints the first line, then an even distribution of lines thereafter to hit the target size.

I.e., "sm-10.txt" has the first line, then every 10th thereafter; "mid-100.txt" has the first line, then every 4th thereafter; "lg-100.txt" keeps roughly 3 of every 4 lines; etc.
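The same every-Nth-line idea can be written as an awk one-liner; note this matches the fmod approach only for ratios whose reciprocal is a whole number (0.10, 0.25, ...), and the filenames are illustrative:

```shell
seq 1 100 > demo.txt

# keep ~25% of the lines: every line whose 0-based number is a
# multiple of int(1/0.25) = 4, i.e. lines 1, 5, 9, ...
awk -v p=0.25 '(NR - 1) % int(1 / p) < 1' demo.txt > mid-demo.txt
```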


If you spent some time on this, you could make it a lot better. Suggestions:

1) Use integer arithmetic vs floating point if you have really big files.

2) Use a regex that you build on the fly that will reduce based on a pattern.

Cheers.
# 5  
Old 03-20-2010
Hi,

If you get a yen to do your own coding, then a Perl module, Algorithm::Numerical::Sample, implements the single-pass algorithm described in Knuth. It actually has two parts: one to sample from an array, and the other to sample as you read a file.

The module may already be installed, or in an available repository. If not, it's always available on CPAN.

There may be other facilities in other languages to do the same thing -- Python likely has one, for example ... cheers, drl
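For reference, the single-pass flavour can be sketched in awk as reservoir sampling (an illustration of the technique, not the module's actual code; the seed and sizes are arbitrary):

```shell
seq 1 1000 > big.txt

# single pass: fill a k-slot reservoir from the first k lines, then
# replace a random slot with decreasing probability k/NR (Algorithm R)
awk -v k=10 -v seed=42 '
  BEGIN { srand(seed) }
  NR <= k { res[NR] = $0; next }
  { j = int(rand() * NR) + 1; if (j <= k) res[j] = $0 }
  END { for (i = 1; i <= k; i++) print res[i] }
' big.txt > sample.txt
```

Each of the 1000 input lines ends up in the 10-line sample with equal probability, and the file is read only once.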
# 6  
Old 03-20-2010
drewk, this is perfect. Exactly what I'm after! Many thanks :O).

I don't know any Perl whatsoever, so improvements to the script won't come any time soon. I've been "writing" bash scripts for a few weeks now, so next I may have to pick up something like Perl or Python so that I can do the smart stuff.
Again, much appreciated, cheers. Phil.
# 7  
Old 03-21-2010
You are welcome.

Of course, as drl states, the Knuth algorithm is the classic way to sample. In case you want to look at it, I have added it to the routine here:

Code:
#!/usr/bin/perl
use POSIX;
use warnings;
use strict;

# From Knuth Art of Programming
# Algorithm S (3.4.2)
# Select n records at random from a set of N records where
# 0<n<=N
sub selection_sample {
    my ($array,$num)=@_;
    die "Too few elements (".scalar(@{$array}).") to select $num from\n"
        unless $num <= @{$array};
    my @result;
    my $pos=0;
    while (@result<$num) {
        $pos++ while (rand(@{$array}-$pos)>($num-@result));
        push @result,$array->[$pos++];
    }
    return \@result;
}

my $smp=.10;
my $mp=.25;
my $lp=.75;
my $smlines=0;
my $mlines=0;
my $llines=0;
my $klines=0;
my @contents;
my $knuth;
my $line;
my $lines;

foreach my $name (@ARGV) {
	open(my $fh, '<',$name) or die $!;
	@contents=<$fh>;
	$lines= scalar @contents;
	$smlines=$mlines=$llines=$klines=0;	# reset per-file counts
	
	print "$name has $lines lines\n";
	
	open(my $smfh, '>',"sm-$name") or die $!;
	open(my $mfh, '>', "mid-$name") or die $!;
	open(my $lfh, '>', "lg-$name") or die $!;
	
	for($line=0; $line<$lines; $line++) {
		if (fmod($line,1/$smp)<1) {
			$smlines++; 
			print $smfh $contents[$line];
		}
		
		if (fmod($line,1/$mp)<1) {
			$mlines++;
			print $mfh $contents[$line];
		}
		
		
		if (fmod($line,1/$lp)<1) {
			$llines++;
			print $lfh $contents[$line] ;
		}	
	}
	
	open(my $kfh, '>', "k-$name") or die $!;	
	# sample at most 60 lines (or all of them for short files)
	$knuth=selection_sample(\@contents, $lines < 60 ? $lines : 60);
	foreach $line (@{$knuth}) {
		$klines++;
		print $kfh $line;
	}

	print "\tsm-$name has $smlines lines\n";
	print "\tmid-$name has $mlines lines\n";
	print "\tlg-$name has $llines lines\n";
	print "\tk-$name has $klines lines \n\n";
	close $fh or die $!;
	close $smfh or die $!;
	close $mfh or die $!;
	close $lfh or die $!;	
	close $kfh or die $!;
}
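The Algorithm S selection step also translates to awk fairly directly (a sketch; the seed and filenames are made up):

```shell
seq 1 100 > in.txt
N=$(wc -l < in.txt)

# Knuth Algorithm S: visit each record once, selecting it with
# probability (n - m) / (N - t), where m records have been selected
# out of t seen so far; exactly n records come out, in input order
awk -v n=10 -v N="$N" -v seed=1 '
  BEGIN { srand(seed) }
  { if ((N - t) * rand() < n - m) { print; m++ }; t++ }
' in.txt > k-in.txt
```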
