Divide large data files into smaller files


 
Old 07-21-2010
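Another approach with awk.
In the following script (ad23.sh), bigfile is the input file to split (ad23.txt) and maxsize is the maximum size in bytes of each fragment (ad23.txt_*). maxsize is taken from the first argument and defaults to 200. Records are never split: a record larger than maxsize is written to a fragment of its own, which then exceeds the limit.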
Code:
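bigfile=./ad23.txt
maxsize=${1:-200}   # maximum fragment size in bytes (default 200)

# Remove fragments left over from a previous run
rm -f "${bigfile}"_* 2>/dev/null

awk -v msize="${maxsize}" '

# Write the buffered record, starting a new fragment when needed
function print_record() {
   if ( rsize == 0 ) return;
   # New fragment for the very first record, or when adding this record
   # to a non-empty fragment would exceed msize
   if ( ifile == 0 || (csize > 0 && csize+rsize > msize) ) {
      if ( outfile != "" ) close(outfile);   # avoid running out of open files
      outfile = FILENAME "_" ++ifile;
      csize = 0;
   }
   csize += rsize;
   print record > outfile;
}

# Header line (e.g. ">0001"): flush the previous record, start a new one
/^>[0-9]+$/ {
   print_record();
   record = $0;
   rsize  = length($0)+1;
   next;
}

# Data line: append to the current record (+1 counts the newline)
{
   record = (rsize ? record "\n" : "") $0;
   rsize  += length($0)+1;
}

# Flush the last record
END {
   print_record();
}

' "${bigfile}"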

Input file (ad23.txt 873 bytes):
Code:
>0001
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
1 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
1 cccccccccccccccccccccccccccc
1 dddddddddddddddddddddddddddd
1 eeeeeeeeeeeeeeeeeeeeeeeeeeee
>0002
2 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
2 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
>0003
3 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
3 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
3 cccccccccccccccccccccccccccc
3 dddddddddddddddddddddddddddd
3 eeeeeeeeeeeeeeeeeeeeeeeeeeee
3 ffffffffffffffffffffffffffff
3 gggggggggggggggggggggggggggg
>0004
4 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
>0005
5 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
5 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
5 cccccccccccccccccccccccccccc
5 dddddddddddddddddddddddddddd
5 eeeeeeeeeeeeeeeeeeeeeeeeeeee
5 ffffffffffffffffffffffffffff
5 gggggggggggggggggggggggggggg
5 hhhhhhhhhhhhhhhhhhhhhhhhhhhh
5 iiiiiiiiiiiiiiiiiiiiiiiiiiii
5 jjjjjjjjjjjjjjjjjjjjjjjjjjjj
>0006
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
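
To pick a sensible maxsize it can help to see how large each record is first. The following is only a sketch, not part of the original script: record_sizes is a hypothetical helper name, the demo file is made up, and sizes are counted the same way as in the script (line length plus one for the newline), which assumes single-byte characters.
Code:
```shell
# Hypothetical helper: print the size of each record (header + data lines).
# Counts length+1 per line, like the splitter script does.
record_sizes() {
    awk '
    /^>[0-9]+$/ { if (name != "") print name, size; name = $0; size = 0 }
    { size += length($0) + 1 }
    END { if (name != "") print name, size }
    ' "$1"
}

# Small made-up input in the same format as ad23.txt
demo=$(mktemp)
printf '%s\n' '>0001' '1 aaaa' '1 bbbb' '>0002' '2 aaaa' > "$demo"
record_sizes "$demo"
# prints:
# >0001 20
# >0002 13
rm -f "$demo"
```
Any maxsize smaller than the largest value printed will produce at least one fragment that exceeds the limit.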

Execution with maxsize=300 (ad23.txt_3 exceeds the limit because record >0005 alone is 316 bytes):
Code:
$ ./ad23.sh 300
$ wc -c ad23.txt_*
229 ad23.txt_1
260 ad23.txt_2
316 ad23.txt_3
 68 ad23.txt_4
873 total
$ more -999 ad23.txt_*
::::::::::::::
ad23.txt_1
::::::::::::::
>0001
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
1 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
1 cccccccccccccccccccccccccccc
1 dddddddddddddddddddddddddddd
1 eeeeeeeeeeeeeeeeeeeeeeeeeeee
>0002
2 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
2 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
::::::::::::::
ad23.txt_2
::::::::::::::
>0003
3 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
3 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
3 cccccccccccccccccccccccccccc
3 dddddddddddddddddddddddddddd
3 eeeeeeeeeeeeeeeeeeeeeeeeeeee
3 ffffffffffffffffffffffffffff
3 gggggggggggggggggggggggggggg
>0004
4 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
::::::::::::::
ad23.txt_3
::::::::::::::
>0005
5 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
5 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
5 cccccccccccccccccccccccccccc
5 dddddddddddddddddddddddddddd
5 eeeeeeeeeeeeeeeeeeeeeeeeeeee
5 ffffffffffffffffffffffffffff
5 gggggggggggggggggggggggggggg
5 hhhhhhhhhhhhhhhhhhhhhhhhhhhh
5 iiiiiiiiiiiiiiiiiiiiiiiiiiii
5 jjjjjjjjjjjjjjjjjjjjjjjjjjjj
::::::::::::::
ad23.txt_4
::::::::::::::
>0006
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
$

Another execution, with maxsize=500:
Code:
$ ./ad23.sh 500
$ wc -c ad23.txt_*
489 ad23.txt_1
384 ad23.txt_2
873 total
$ more -999 ad23.txt_*
::::::::::::::
ad23.txt_1
::::::::::::::
>0001
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
1 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
1 cccccccccccccccccccccccccccc
1 dddddddddddddddddddddddddddd
1 eeeeeeeeeeeeeeeeeeeeeeeeeeee
>0002
2 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
2 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
>0003
3 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
3 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
3 cccccccccccccccccccccccccccc
3 dddddddddddddddddddddddddddd
3 eeeeeeeeeeeeeeeeeeeeeeeeeeee
3 ffffffffffffffffffffffffffff
3 gggggggggggggggggggggggggggg
>0004
4 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
::::::::::::::
ad23.txt_2
::::::::::::::
>0005
5 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
5 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
5 cccccccccccccccccccccccccccc
5 dddddddddddddddddddddddddddd
5 eeeeeeeeeeeeeeeeeeeeeeeeeeee
5 ffffffffffffffffffffffffffff
5 gggggggggggggggggggggggggggg
5 hhhhhhhhhhhhhhhhhhhhhhhhhhhh
5 iiiiiiiiiiiiiiiiiiiiiiiiiiii
5 jjjjjjjjjjjjjjjjjjjjjjjjjjjj
>0006
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
$
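
A quick sanity check after splitting is to concatenate the fragments in numeric order and compare the result with the input. This is a sketch under assumptions: check_fragments is a hypothetical helper name and the demo files are made up. It loops over the suffixes _1, _2, ... instead of using a glob, because a glob such as ad23.txt_* sorts _10 before _2.
Code:
```shell
# Hypothetical check: do <file>_1, <file>_2, ... concatenate back to <file>?
# Returns 0 when the fragments reproduce the original exactly.
check_fragments() {
    orig=$1
    i=1
    while [ -f "${orig}_${i}" ]; do
        cat "${orig}_${i}"
        i=$((i + 1))
    done | cmp -s - "$orig"
}

# Made-up demo: one input file and two hand-made fragments
dir=$(mktemp -d)
printf '%s\n' '>0001' '1 aaaa' '>0002' '2 aaaa' > "$dir/demo.txt"
printf '%s\n' '>0001' '1 aaaa' > "$dir/demo.txt_1"
printf '%s\n' '>0002' '2 aaaa' > "$dir/demo.txt_2"
if check_fragments "$dir/demo.txt"; then
    echo "fragments match"
else
    echo "fragments differ"
fi
rm -rf "$dir"
```
With the script above, the equivalent call would be check_fragments ./ad23.txt after running ./ad23.sh.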

Jean-Pierre.