Hi All,
I'm fairly new to scripting, so I need a little help getting started with this problem.
I don't mind whether I go for an awk/bash/other approach; I don't really know which would be best suited to the problem...
Let's say I have a 10000-line text file that I would like to split up into a few smaller files. Something like:
10 lines, say the last 10 lines
100 lines, say the first 100 lines
1000 lines, say the last 1000 lines
5000 lines, say the middle 5000 lines
This I could probably manage with head & tail etc.
However, if my text file were only 1000 lines long it would not work so well. I'd get the 10- and 100-line files OK, but the 3rd would give me what I already have, and I guess the 4th would fail. What I would actually want is more like:
1 line
10 lines
100 lines
500 lines
Similarly, for a text file much larger than 10000 lines, I'd want the same behaviour in the other direction, e.g. a 100k-line file = 100, 1000, 10000, 50000.
The number of lines does not need to be exact either. I would not mind doing the splits based on a percentage of the lines in the original file, nor would I mind if lines in the original file were selected at random.
Basically, I just want a set of small/medium/large/larger files of whatever size, proportional to the original. The files would not need to be unique either: line 1 in the small file and then lines 1-10 in the medium file is fine, though if it's easier I would not mind lines 2-11 in the second file.
I hope I've not over-complicated this explanation...
Would somebody please give me a steer on where to start. What should I use for this - awk?, should I try and use percentages, or try and work out absolutes that work in every situation?
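A minimal sketch of the percentage idea in plain bash: the file name, output names, and the 10/30/75 splits are all illustrative, not fixed requirements.

```shell
#!/bin/bash
# Proportional split sketch: output sizes are percentages of the
# input, so they scale with files of any length.
seq 1 1000 > original.txt          # demo input; use your real file here
total=$(wc -l < original.txt)

small=$(( total * 10 / 100 ))      # ~10% of the file
medium=$(( total * 30 / 100 ))     # ~30%
large=$(( total * 75 / 100 ))      # ~75%

# Guarantee at least one line each, so a 1-line input still works.
[ "$small" -lt 1 ] && small=1
[ "$medium" -lt 1 ] && medium=1
[ "$large" -lt 1 ] && large=1

head -n "$small" original.txt > smallfile.txt    # first ~10%
tail -n "$medium" original.txt > mediumfile.txt  # last ~30%
head -n "$large" original.txt > largefile.txt    # first ~75%
```

Because the counts come from `wc -l`, the same script gives 1/3/7 lines for a 10-line input and 10000/30000/75000 for a 100k-line input.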
Thanks for your response. I understand how you use head and tail, and your script will help pull out a middle section; however, let me try to clear up what I'm trying to do.
The input text file could be anything from 1 line long, to 100,000 lines long.
From it I want to produce a subset of small, medium, and large files.
The problem is that I want it to be dynamic, and for the subset of files to be representative of the input file.
Something like:
Input is 1 line long: small output is 1 line, medium is 1 line, large is 1 line.
Input is 10 lines long: small output is ~2 lines, medium is ~4 lines, large is ~8 lines.
Input is 180 lines long: small output is ~20 lines, medium is ~90, large is ~120.
I don't mind overlap between what is in the files, but want to avoid over-coverage on one part, like the start.
The problem with head and tail would be hard-coding the "head -n ??" values.
It would need to be something more like:
# Note: I'm really guessing here, I hope this helps in some way illustrate...
originalfile_lines=$(wc -l < original.txt)
smallfile_lines=$(( originalfile_lines * 10 / 100 ))
mediumfile_lines=$(( originalfile_lines * 30 / 100 ))
largefile_lines=$(( originalfile_lines * 75 / 100 ))
for (( i = 0; i < smallfile_lines; i++ )); do
    # append one random line from original.txt
    shuf -n 1 original.txt >> smallfile
done
Or something to that effect. Please excuse my poor pseudo-code.
This prints the first line then an even distribution of lines to achieve the target.
i.e., "sm-10.txt" has the first line, then every 10th thereafter; "mid-100.txt" has the first line, then every 4th thereafter; "lg-100.txt" has the first line, then 7.5 of every 10 thereafter; etc.
If you spent some time on this, you could make it a lot better. Suggestions:
1) Use integer arithmetic vs floating point if you have really big files.
2) Use a regex that you build on the fly that will reduce based on a pattern.
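The even-distribution approach described above can be sketched in a few lines of awk driven from the shell; `sample_evenly` and `demo.txt` are illustrative names, and the stride uses integer arithmetic as suggestion 1) recommends:

```shell
# sample_evenly N FILE: print the first line of FILE, then roughly
# every (total/N)-th line, giving an even spread of about N lines.
sample_evenly() {
    n=$1 file=$2
    total=$(wc -l < "$file")
    stride=$(( total / n ))
    [ "$stride" -lt 1 ] && stride=1
    awk -v s="$stride" 'NR == 1 || NR % s == 0' "$file"
}

# Example: ~100 evenly spaced lines out of a 1000-line file.
seq 1 1000 > demo.txt
sample_evenly 100 demo.txt > sm.txt
```

Because the stride is `total / n`, the output covers the whole file evenly rather than clustering at the start, which is exactly the over-coverage the OP wanted to avoid.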
Hi,
If you get a yen to do your own coding, a Perl module, Algorithm::Numerical::Sample (search.cpan.org), implements the single-pass sampling algorithm described in Knuth. It actually has two parts: one to sample from an array, and the other to sample as you read a file.
The module may already be installed, or be in an available repository. If not, it's always available from CPAN.
There may be facilities in other languages to do the same thing -- Python likely has one, for example ... cheers, drl
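For the curious, the single-pass technique that module implements (Knuth's "Algorithm R", reservoir sampling) is compact enough to sketch in awk as well; `k` and the demo file are illustrative:

```shell
# Reservoir sampling (Knuth's Algorithm R): one pass over the input,
# uniform k-line sample, no need to know the file length in advance.
k=10
seq 1 1000 > demo.txt        # demo input; any file works
awk -v k="$k" '
    BEGIN { srand() }
    NR <= k { pool[NR] = $0; next }     # fill the reservoir first
    {
        i = int(rand() * NR) + 1        # pick an index in 1..NR
        if (i <= k) pool[i] = $0        # keep the new line with prob k/NR
    }
    END { n = (NR < k) ? NR : k; for (j = 1; j <= n; j++) print pool[j] }
' demo.txt > sample.txt
```

Each input line ends up in the sample with equal probability, so the output is representative of the whole file rather than weighted toward the start.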
drewk, this is perfect. Exactly what I'm after! Many thanks :O).
I don't know any Perl whatsoever, so improvements to the script won't come any time soon. I've been "writing" bash scripts for a few weeks now, so next I may have to pick up something like Perl or Python so that I can do the smart stuff.
Again, much appreciated, cheers. Phil.