Split File based on number of rows


 
# 1  
Old 11-19-2014
Split File based on number of rows

Hi

I have a requirement where I will receive multiple files in a folder (say, /fol1/fol2/). There will be at least 14 to 16 files. The file sizes vary and are very unpredictable: some may be 80 GB or 90 GB, some may be less than 5 GB. The file names follow a fixed format, such as Table1_Insert.dat, Table1_Update.dat, Table1_delete.dat, Table2_ins.dat, Table2_upd.dat, Table2_del.dat, and so on.

I have to read one file at a time and check its size (in GB). If the file is larger than 90 GB (it will never be more than 100 GB), it should be split into 5 GB pieces. So if the file size is 90 GB, the source file should be split into 18 sub-files (like TT_table1_ins.dataa, TT_Table1_ins.datab, TT_Table1_ins.datac, etc.).

I want my script to take only one input argument - just the file name (with the path).

I know we can do this using the split -l command, but I need some help. Can somebody help me with a script? I'm very new to shell scripting: I can understand the commands but cannot write a script.

Thanks
# 2  
Old 11-19-2014
Is this a homework assignment?
# 3  
Old 11-19-2014
No, this is not homework. Maybe I gave too much explanation and that's why it looks like homework, but this is for my work.
# 4  
Old 11-19-2014
Why do you need a script? What stops you from running it as a one-line command like this:
Code:
[ $(stat -c"%s" Table1_ins.dat) -gt 90000000000 ] && split -b5000000000 -a1 --verbose Table1_ins.dat TT_table1_ins.dat

This splits by bytes, so it may cut a line in half at a piece boundary. To keep lines whole, use -l with a line count estimated from the average line length.
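As a rough sketch of the "-l with an average line length" idea: the helper below samples the first lines of the file, works out the average line length, and prints how many lines should fit in one piece. The name lines_per_piece and the default sample size of 1000 lines are my own assumptions, and the estimate is only as good as the sample, so pieces may come out somewhat over or under the target size.

```shell
#!/bin/sh
# Sketch only: lines_per_piece is a hypothetical helper, not part of split.
# It estimates how many lines fit in a piece of the given byte size,
# based on the average length of the first lines of the file.
lines_per_piece() {
    file=$1              # file to inspect
    piece_bytes=$2       # target piece size in bytes, e.g. 5000000000
    sample=${3:-1000}    # number of leading lines to sample (assumed default)

    # Bytes and lines in the sample; tr strips padding some wc versions add.
    sample_bytes=$(head -n "$sample" "$file" | wc -c | tr -d ' ')
    sample_lines=$(head -n "$sample" "$file" | wc -l | tr -d ' ')

    # awk does the division because its numbers are not limited to 32 bits.
    awk -v b="$sample_bytes" -v n="$sample_lines" -v p="$piece_bytes" 'BEGIN {
        if (b == 0 || n == 0) exit 1    # empty file: no estimate possible
        lines = int(p * n / b)          # piece size / average line length
        if (lines < 1) lines = 1        # always put at least one line in a piece
        printf "%.0f\n", lines
    }'
}
```

It could then be used as: split -l "$(lines_per_piece Table1_ins.dat 5000000000)" -a1 Table1_ins.dat TT_table1_ins.dat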

Last edited by RudiC; 11-20-2014 at 06:32 AM.. Reason: removed "a" from target file name
# 5  
Old 11-19-2014
Quote:
Originally Posted by kpk_ds
No, this is not homework. Maybe I gave too much explanation and that's why it looks like homework, but this is for my work.
You did not give too much information! (I've never seen anyone make that mistake in these forums!)

If the 1-liner RudiC gave you works, you can put that into a script.

If it doesn't, show us the diagnostic messages it prints and tell us what operating system and shell you're using. (The stat utility and the split --verbose option are not available on all operating systems.)
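If stat turns out to be missing, one possible workaround is sketched below. The function name file_size is made up for illustration. It tries the GNU form of stat first and falls back to wc -c, which is a standard utility; note that wc -c may have to read the whole file on some systems, which would be slow for files of this size.

```shell
#!/bin/sh
# Sketch only: file_size is a hypothetical helper name.
# Prints the size of a file in bytes.
file_size() {
    # Try GNU stat first. If stat is missing or does not accept -c,
    # fall back to wc -c reading the file from standard input.
    # tr removes the leading blanks that some wc versions print.
    { stat -c %s "$1" 2>/dev/null || wc -c < "$1"; } | tr -d ' '
}
```

The size test from the one-liner would then read: [ "$(file_size Table1_ins.dat)" -gt 90000000000 ]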
# 6  
Old 11-19-2014
Some split implementations (GNU, for example) support the --line-bytes=SIZE option, which keeps lines whole, e.g.:

Code:
cd /fol1/fol2
for file in *.dat
do
    if [ $(stat -c"%s" "$file") -gt 90000000000 ]
    then
        split --line-bytes=5000000000 -a1 --verbose "$file" "TT_$file"
    fi
done
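Since the original request was for a script that takes just the file name (with path) as its only argument, the loop above could be reshaped roughly as follows. This is a sketch, not a tested production script: split_if_large, THRESHOLD and PIECE are names I made up, it assumes GNU split for --line-bytes, and it uses wc -c instead of stat to get the size.

```shell
#!/bin/sh
# Sketch only: split_if_large, THRESHOLD and PIECE are hypothetical names.
# Assumes GNU split (needed for --line-bytes).
THRESHOLD=${THRESHOLD:-90000000000}   # split only files larger than this (bytes)
PIECE=${PIECE:-5000000000}            # maximum bytes per piece

split_if_large() {
    file=$1
    [ -f "$file" ] || { echo "not a regular file: $file" >&2; return 2; }

    # Size in bytes; tr strips padding some wc versions add.
    size=$(wc -c < "$file" | tr -d ' ')

    if [ "$size" -gt "$THRESHOLD" ]; then
        # Write the pieces next to the source file with a TT_ prefix.
        dir=$(dirname "$file")
        base=$(basename "$file")
        split --line-bytes="$PIECE" -a1 "$file" "$dir/TT_$base"
    fi
}

# Run only when exactly one file name was given on the command line.
if [ "$#" -eq 1 ]; then
    split_if_large "$1"
fi
```

Saved as, say, splitbig.sh (a made-up name), it would be called as: sh splitbig.sh /fol1/fol2/Table1_ins.dat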

# 7  
Old 11-22-2014
Hi
Thanks for the replies. I apologize for the delay.

I tried the following, and it didn't work.
I tried:
Code:
stat -c"%s" FileName.txt

I got:
Code:
ksh: stat:  not found

I tried:
Code:
stat -f FileName.txt

I got:
Code:
ksh: stat:  not found

So here's what I tried. (I tried this in DataStage, on AIX.) I created a DataStage job that reads the file and uses some simple UNIX commands to do this:

1. Check whether the directory exists; if it does, remove the directory and all the files inside it.

2. Then check the file size:
Code:
ls -lSr /inputpath/Filename.txt | tail -n1 | awk '{$5=sprintf("%.9f GB", $5/1024^3)} 1' | cut -d' ' -f5 | awk '{printf "%.0f\n", $1}'

This gives me the file size in GB. If the file is small, it reports 0.

(But I have since found a simpler way to get the size, in bytes: ls -l filename.txt | awk '{print $5}')

3. Calculate the split file count (the number of pieces): take the size from above and work out how many pieces the file can be split into. For example, if the file size is 100 GB, I want to split it into 20 pieces (5 GB each).

4. Get the row count using wc -l, then parse the output to keep only the numeric part.

5. Then I calculate the rows per split file. (I do it this way because if I split the file just by size, the last row of a piece may get cut in half.) The formula is: total row count divided by the number of pieces. For example, if I have a file with 1000 rows and a size of 100 GB (which means I want 20 pieces), this gives 1000 divided by 20, which equals 50. So this creates 20 files with 50 rows each.

6. Then round the result from above up to the next whole number.

7. Then I use the split command (a simple script). The script takes the file name and the row count per split file as input arguments. It first checks that the input arguments are correct, then creates a directory for the split files, and then runs the split command as
Code:
split -l rowcount file_name /destination/directory/forsplitfile/data_

This splits my input file and stores the pieces in the folder "forsplitfile" with the prefix "data_".

8. Then I check whether the sizes of all the split files add up to the size of the original file. If yes, continue; if not, abort.

9. Then I use my load job to read each split file and load them one by one. Once all the files are done, the job completes.

10. If the job fails partway (say at split_file_5), then when I restart the job it picks up from where it failed (split_file_5).
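Steps 2 to 8 above could be put into one shell function along these lines. Treat it as a sketch under assumptions: split_by_rows and its arguments are names invented here, the output directory is assumed to be empty (step 1), and only ls, wc, awk and split -l are used, so it should not depend on stat or on GNU-only options.

```shell
#!/bin/sh
# Sketch only: split_by_rows and its arguments are hypothetical names.
# Assumes the output directory is empty or does not exist yet (step 1).
split_by_rows() {
    file=$1                        # input file
    outdir=$2                      # directory for the pieces
    piece_bytes=${3:-5000000000}   # target size of one piece in bytes

    mkdir -p "$outdir" || return 1

    # Step 2: file size in bytes (field 5 of ls -l).
    size=$(ls -l "$file" | awk '{ print $5 }')

    # Step 4: row count; reading stdin keeps the file name out of the output.
    rows=$(wc -l < "$file" | tr -d ' ')

    # Steps 3, 5 and 6: piece count and rows per piece, both rounded up.
    rows_per_piece=$(awk -v s="$size" -v r="$rows" -v p="$piece_bytes" 'BEGIN {
        pieces = int((s + p - 1) / p); if (pieces < 1) pieces = 1
        n = int((r + pieces - 1) / pieces); if (n < 1) n = 1
        printf "%.0f\n", n
    }')

    # Step 7: split on line boundaries so no row is cut in half.
    split -l "$rows_per_piece" "$file" "$outdir/data_" || return 1

    # Step 8: the piece sizes must add up to the original size.
    total=$(ls -l "$outdir"/data_* | awk '{ t += $5 } END { printf "%.0f\n", t }')
    [ "$total" = "$size" ]
}
```

The function returns non-zero if the split fails or the sizes do not match, so a caller can abort with something like: split_by_rows /inputpath/Filename.txt /destination/directory/forsplitfile || exit 1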

I know this is a lot, but can somebody help me put this into a script? I can do it in DataStage with some UNIX commands, but I don't think what I'm doing will be a stable solution. So can somebody help me, please?

Thanks

---------- Post updated at 03:33 PM ---------- Previous update was at 03:30 PM ----------

Also, how do I add a checkpoint to the shell script, so that when I restart it, it resumes from where it failed?
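One simple way to get a checkpoint is to record each finished piece in a small file and skip recorded pieces on the next run. The sketch below assumes the pieces are named data_* as in step 7; load_pieces, load_one and the .loaded checkpoint file are hypothetical names, and load_one is only a stand-in for the real load job.

```shell
#!/bin/sh
# Sketch only: load_pieces, load_one and the .loaded file are hypothetical.

# Stand-in for the real load job; replace the body with the actual call.
load_one() {
    echo "loading $1"
}

load_pieces() {
    dir=$1
    ckpt=$dir/.loaded     # checkpoint: one finished piece per line

    touch "$ckpt" || return 1
    for piece in "$dir"/data_*; do
        [ -f "$piece" ] || continue

        # Skip pieces already recorded as done (fixed-string, whole-line match).
        if grep -Fxq "$piece" "$ckpt"; then
            continue
        fi

        # Stop at the first failure so that a restart resumes at this piece.
        load_one "$piece" || return 1

        # Record the piece only after it loaded successfully.
        echo "$piece" >> "$ckpt"
    done
    return 0
}
```

Running load_pieces /destination/directory/forsplitfile again after a failure skips everything already listed in .loaded. The checkpoint file has to be removed before a fresh batch is loaded into the same directory.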

Last edited by Franklin52; 11-23-2014 at 09:15 AM.. Reason: Please use code tags