Script to split text files


 
# 1  
Old 03-20-2010

Hi All,
I'm fairly new to scripting, so need a little help to get started with this problem.
I don't mind whether I go for an awk/bash/other approach, I don't really know which would be best suited to the problem...

Lets say I have a 10000 line text file, I would like to split this up into a few smaller files. Something like:
10 line, say the last 10 lines
100 line, say the first 100 lines
1000 line, say the last 1000 lines
5000 line, say the middle 5000 lines

This I could probably manage with head & tail etc.
However, if my text file was only 1000 lines long it would not work so well. I'd get the 10- and 100-line files OK, but the 3rd would give me what I already have, and I guess the 4th would fail. What I would actually want is more like:
1 line
10 lines
100 lines
500 lines

Similarly, for a text file much larger than 10000 lines I'd want the same behaviour in the other direction, e.g. a 100k-line file would give 100, 1000, 10000, and 50000 lines.

The number of lines does not need to be exact either. I would not mind doing the splits based on a percentage of the lines in the original file, nor would I mind if lines in the original file were selected at random.
Basically, I just want a set of small, medium, large, and larger files of whatever size, proportional to the original. The files would not need to be unique either: line 1 in the small file, and then lines 1-10 in the medium file, is fine, though if it's easier I would not mind lines 2-11 in the second file.

I hope I've not over-complicated this explanation...
Would somebody please give me a steer on where to start? What should I use for this (awk?), and should I try to use percentages, or try to work out absolutes that work in every situation?

Many thanks!

Phil.
# 2  
Old 03-20-2010
Your requirements are not really clear.
As a general answer:
Code:
#!/bin/ksh
# splitem.sh -- copy lines start..end of infile to outfile
infile=$1
start=$2
end=$3
outfile=$4
awk -v start="$start" -v end="$end" 'NR >= start && NR <= end' "$infile" > "$outfile"

usage:
Code:
./splitem.sh myfile 1 5000 myfilefirst5000

You can get the first 5000 and last 5000 lines with head -5000 and tail -5000.
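For instance, with a made-up 20-line input (the filenames here are just placeholders):

```shell
seq 1 20 > myfile          # a made-up 20-line input

head -n 5 myfile > first5  # first 5 lines
tail -n 5 myfile > last5   # last 5 lines

# a middle slice (lines 8-12), same filter splitem.sh uses
awk -v start=8 -v end=12 'NR >= start && NR <= end' myfile > middle
```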
# 3  
Old 03-20-2010
Thanks for your response. I understand how to use head and tail, and your script will help pull out a middle section; however, let me try to clear up what I'm trying to do.

The input text file could be anything from 1 line long, to 100,000 lines long.
From it I want to produce a subset of small, medium, and large files.
The problem is that I want it to be dynamic, and for the subset of files to be representative of the input file.
Something like:
Input is 1 line long: small output is 1 line, medium is 1 line, large is 1 line.
Input is 10 lines long: small output is ~2 lines, medium is ~4 lines, large is ~8 lines.
Input is 180 lines long: small output is ~20 lines, medium is ~90, large is ~120.

I don't mind overlap between what is in the files, but want to avoid over-coverage on one part, like the start.

The problem with head and tail is that the line counts would be hard-coded ("head -n ??").

It would need to be something more like:
#Note, I'm really guessing here, I hope this helps in some way illustrate...
originalfile_lines=$(wc -l < original.txt)
smallfile_lines=$(( originalfile_lines * 10 / 100 ))
mediumfile_lines=$(( originalfile_lines * 30 / 100 ))
largefile_lines=$(( originalfile_lines * 75 / 100 ))

for i in $(seq 1 "$smallfile_lines"); do
    : # append one random line from original.txt to smallfile
done

Or something to that effect. Please excuse my poor pseudo-code.
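In working bash, the idea might look something like this; shuf for the random selection is a GNU coreutils assumption, and the filenames are illustrative:

```shell
#!/bin/bash
# proportional_split.sh -- illustrative sketch, not a tested solution
infile=original.txt
seq 1 200 > "$infile"          # demo input: 200 lines

total=$(wc -l < "$infile")
small=$((  total * 10 / 100 )); [ "$small"  -lt 1 ] && small=1
medium=$(( total * 30 / 100 )); [ "$medium" -lt 1 ] && medium=1
large=$((  total * 75 / 100 )); [ "$large"  -lt 1 ] && large=1

# pick N random lines from the input (order is randomised too)
shuf -n "$small"  "$infile" > smallfile
shuf -n "$medium" "$infile" > mediumfile
shuf -n "$large"  "$infile" > largefile
```

The minimum-of-1 guards keep a 1-line input producing 1-line outputs, as described above.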

Thanks Again,

Phil.
# 4  
Old 03-20-2010
This works.

First, create some files:

Code:
#!/usr/bin/perl
# file name is cfil.pl
foreach $j (@ARGV) {
	die "bad j: $j\n" if ($j<1 or $j>100000) ;
	open($fh, '>', "$j.txt") or die $!;

	for($i=1; $i<=$j; $i++) {
		print $fh "Line $i of $j: Blah Blah Ha de blab blah\n";
	}
	close $fh or die $!;
}

$ cfil.pl 10 20 50 100 1000 50000
$ wc -l *
      10 10.txt
     100 100.txt
    1000 1000.txt
      20 20.txt
      50 50.txt
   50000 50000.txt
   51180 total
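For what it's worth, the same kind of dummy file can be generated from the shell too (a sketch using awk; the size here is arbitrary):

```shell
# generate a j-line dummy file, like cfil.pl does for each argument
j=20
awk -v j="$j" 'BEGIN {
  for (i = 1; i <= j; i++)
    printf "Line %d of %d: Blah Blah Ha de blab blah\n", i, j
}' > "$j.txt"
```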

Now the Perl script using (roughly) your suggested ratios.

Code:
#!/usr/bin/perl
use POSIX;
# file name is 3files.pl
$smp=.10;
$mp=.25;
$lp=.75;

foreach $name (@ARGV) {
	open($fh, '<',$name) or die $!;
	@contents=<$fh>;
	$lines= scalar @contents;
	
	$smlines=$mlines=$llines=0;
	
	print "$name has $lines lines\n";
	
	open($smfh, '>',"sm-$name") or die $!;
	open($mfh, '>', "mid-$name") or die $!;
	open($lfh, '>', "lg-$name") or die $!;
	for($l=0; $l<$lines; $l++) {
		if (fmod($l,1/$smp)<1) {
			$smlines++; 
			print $smfh $contents[$l];
		}
		
		if (fmod($l,1/$mp)<1) {
			$mlines++;
			print $mfh $contents[$l];
		}
		
		
		if (fmod($l,1/$lp)<1) {
			$llines++;
			print $lfh $contents[$l] ;
		}	
	}
	print "\tsm-$name has $smlines lines\n";
	print "\tmid-$name has $mlines lines\n";
	print "\tlg-$name has $llines lines\n\n"; 
	close $fh or die $!;
	close $smfh or die $!;
	close $mfh or die $!;
	close $lfh or die $!;
}	

$ 3files.pl *.txt

10.txt has 10 lines
	sm-10.txt has 1 lines
	mid-10.txt has 3 lines
	lg-10.txt has 7 lines

100.txt has 100 lines
	sm-100.txt has 10 lines
	mid-100.txt has 25 lines
	lg-100.txt has 75 lines

1000.txt has 1000 lines
	sm-1000.txt has 100 lines
	mid-1000.txt has 250 lines
	lg-1000.txt has 750 lines

20.txt has 20 lines
	sm-20.txt has 2 lines
	mid-20.txt has 5 lines
	lg-20.txt has 15 lines

50.txt has 50 lines
	sm-50.txt has 5 lines
	mid-50.txt has 13 lines
	lg-50.txt has 37 lines

50000.txt has 50000 lines
	sm-50000.txt has 5000 lines
	mid-50000.txt has 12500 lines
	lg-50000.txt has 37500 lines

This prints the first line, then an even distribution of lines thereafter to hit the target size.

I.e., "sm-10.txt" has the first line, then every 10th thereafter; "mid-100.txt" has the first line, then every 4th thereafter; "lg-100.txt" keeps roughly 3 of every 4 lines; etc.
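The same every-Nth-line idea can be written as an awk one-liner; note this matches the fmod approach only for ratios whose reciprocal is a whole number (0.10, 0.25, ...), and the filenames are illustrative:

```shell
seq 1 100 > demo.txt

# keep ~25% of the lines: every line whose 0-based number is a
# multiple of int(1/0.25) = 4, i.e. lines 1, 5, 9, ...
awk -v p=0.25 '(NR - 1) % int(1 / p) < 1' demo.txt > mid-demo.txt
```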


If you spent some time on this, you could make it a lot better. Suggestions:

1) Use integer arithmetic vs floating point if you have really big files.

2) Use a regex that you build on the fly that will reduce based on a pattern.

Cheers.
# 5  
Old 03-20-2010
Hi,

If you get a yen to do your own coding, then a Perl module, Algorithm::Numerical::Sample, implements the single-pass algorithm described in Knuth. It actually has two parts: one to sample from an array, and the other to sample as you read a file.

The module may already be installed, or in an available repository. If not, it's always available on CPAN.

There may be other facilities in other languages to do the same thing -- Python likely has one, for example ... cheers, drl
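For reference, the single-pass flavour can be sketched in awk as reservoir sampling (an illustration of the technique, not the module's actual code; the seed and sizes are arbitrary):

```shell
seq 1 1000 > big.txt

# single pass: fill a k-slot reservoir from the first k lines, then
# replace a random slot with decreasing probability k/NR (Algorithm R)
awk -v k=10 -v seed=42 '
  BEGIN { srand(seed) }
  NR <= k { res[NR] = $0; next }
  { j = int(rand() * NR) + 1; if (j <= k) res[j] = $0 }
  END { for (i = 1; i <= k; i++) print res[i] }
' big.txt > sample.txt
```

Each of the 1000 input lines ends up in the 10-line sample with equal probability, and the file is read only once.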
# 6  
Old 03-20-2010
drewk, this is perfect. Exactly what I'm after! Many thanks :O).

I don't know any Perl whatsoever, so improvements to the script won't come any time soon. I've been "writing" bash scripts for a few weeks now, so next I may have to pick up something like Perl or Python so that I can do the smart stuff.
Again, much appreciated, cheers. Phil.
# 7  
Old 03-21-2010
You are welcome.

Of course, as drl states, the Knuth algorithm is the classic way to sample. In case you want to look at it, I have added it to the routine here:

Code:
#!/usr/bin/perl
use POSIX;
use warnings;
use strict;

# From Knuth Art of Programming
# Algorithm S (3.4.2)
# Select n records at random from a set of N records where
# 0<n<=N
sub selection_sample {
    my ($array,$num)=@_;
    die "Too few elements (".scalar(@{$array}).") to select $num from\n"
        unless $num <= @{$array};
    my @result;
    my $pos=0;
    while (@result<$num) {
        $pos++ while (rand(@{$array}-$pos)>($num-@result));
        push @result,$array->[$pos++];
    }
    return \@result;
}

my $smp=.10;
my $mp=.25;
my $lp=.75;
my $smlines=0;
my $mlines=0;
my $llines=0;
my $klines=0;
my @contents;
my $knuth;
my $line;
my $lines;

foreach my $name (@ARGV) {
	open(my $fh, '<',$name) or die $!;
	@contents=<$fh>;
	$lines= scalar @contents;
	$smlines=$mlines=$llines=$klines=0;	# reset per-file counts
	
	print "$name has $lines lines\n";
	
	open(my $smfh, '>',"sm-$name") or die $!;
	open(my $mfh, '>', "mid-$name") or die $!;
	open(my $lfh, '>', "lg-$name") or die $!;
	
	for($line=0; $line<$lines; $line++) {
		if (fmod($line,1/$smp)<1) {
			$smlines++; 
			print $smfh $contents[$line];
		}
		
		if (fmod($line,1/$mp)<1) {
			$mlines++;
			print $mfh $contents[$line];
		}
		
		
		if (fmod($line,1/$lp)<1) {
			$llines++;
			print $lfh $contents[$line] ;
		}	
	}
	
	open(my $kfh, '>', "k-$name") or die $!;	
	# sample at most 60 lines (or all of them for short files)
	$knuth=selection_sample(\@contents, $lines < 60 ? $lines : 60);
	foreach $line (@{$knuth}) {
		$klines++;
		print $kfh $line;
	}

	print "\tsm-$name has $smlines lines\n";
	print "\tmid-$name has $mlines lines\n";
	print "\tlg-$name has $llines lines\n";
	print "\tk-$name has $klines lines \n\n";
	close $fh or die $!;
	close $smfh or die $!;
	close $mfh or die $!;
	close $lfh or die $!;	
	close $kfh or die $!;
}
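The Algorithm S selection step also translates to awk fairly directly (a sketch; the seed and filenames are made up):

```shell
seq 1 100 > in.txt
N=$(wc -l < in.txt)

# Knuth Algorithm S: visit each record once, selecting it with
# probability (n - m) / (N - t), where m records have been selected
# out of t seen so far; exactly n records come out, in input order
awk -v n=10 -v N="$N" -v seed=1 '
  BEGIN { srand(seed) }
  { if ((N - t) * rand() < n - m) { print; m++ }; t++ }
' in.txt > k-in.txt
```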
