Help splitting files greater than 250mb


 
# 1  
Old 03-04-2013
Help splitting files greater than 250mb

I have a script that splits files based on specific columns and then names the output files after those columns. The split works well; however, some of the files are still too large. I need some help splitting those files a second time, based on size: any file greater than 250 MB would need to be split into pieces smaller than 250 MB, keeping the same name but adding a number before the file extension.
Ex: name.txt, name_1.txt
Here is my split script I currently use:
Code:
awk -F "^" '{f = "P8." $2 "." $5 "." $11 "." $79 "." $80 ".txt"; if (f != last) close(last); print >> f; last = f}'

This command is called post-success from a PowerCenter job in a .sh file. Please explain any suggestions, because I am a beginner. Your help is greatly appreciated.

Last edited by vbe; 03-04-2013 at 01:37 PM.. Reason: code tags please
# 2  
Old 03-04-2013
I have a suggestion: why don't you use wc -c to count the bytes in the file that the awk program is writing to, and change the file name once it crosses 250 MB?

Here is an example program that I wrote which changes the file name when it has crossed 100 bytes:
Code:
awk '  BEGIN {
                i = 1                                           # Counter variable i initialized to 1
                while ( j <= 50 )                               # Write the values 0 through 50
                {
                        if( bytes > 100 )                       # If the current file exceeds 100 bytes
                                i = i + 1;                      # Increment i to start a new file
                        F = "name" i ".txt";                    # Set filename = "name" i ".txt"
                        print j >> F;                           # Write j value to file
                        close(F);                               # Close file
                        cmd = "wc -c < " F;                     # Define cmd
                        cmd | getline bytes;                    # Run cmd and count file bytes
                        close(cmd);                             # Close cmd
                        j = j + 1;                              # Increment j variable value
                }
} '

This is just to give you an idea; feel free to modify it as per your requirement. I hope this helps.

Last edited by Yoda; 03-04-2013 at 04:33 PM.. Reason: correction
# 3  
Old 03-04-2013
On a GNU system, split(1) may be what you are looking for (with the -b option):
Code:
man split
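
For example, a minimal invocation (assuming GNU split; the file name here is only illustrative) that cuts one oversized file into 250 MB pieces:
Code:
split -b 250M P8.CEI.Non-Residential.OH.10.2010.txt P8.CEI.Non-Residential.OH.10.2010_

This produces P8.CEI.Non-Residential.OH.10.2010_aa, _ab, and so on. Note that -b splits on byte boundaries, so a line can be cut in the middle; GNU split's -C (--line-bytes) option keeps lines whole.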

# 4  
Old 03-05-2013
bipinajith, I am not sure how to apply your code. My awk command names the files based on the records in columns 2, 5, 11, 79, and 80. I could end up with a lot of files, all named differently, but they need to remain < 250 MB so I can move them. I need those column values to remain in the name for searchability, because these files are going into our Records Management System.

---------- Post updated at 07:43 AM ---------- Previous update was at 07:21 AM ----------

Here is a better explanation of my task:
This project moves data from a table into a flat file on a Unix server using PowerCenter. In PowerCenter the columns use ^ as a delimiter instead of a comma, because it is a unique character, and it is what my Unix script splits the files on. After the PowerCenter job completes, my script is called to split the file based on changes in the data of specific columns: 2, 5, 11, 79, 80. Here is my code:
 
Code:
awk -F "^" '{f = "P8." $2 "." $5 "." $11 "." $79 "." $80 ".txt"; if (f != last) close(last); print >> f; last = f}' /PowerCenter/TgtFiles/dfel_dw/elig_archive/elig_list_comb_arch.out

What this does is split elig_list_comb_arch.out, the target file that PowerCenter puts on the server, and name the split files with a prefix of "P8." followed by the data in the stated columns.
Examples:
  • P8.AEP_-_Columbus_Southern.Non-Residential.OH.12.2010.txt
  • P8.AEP_-_Columbus_Southern.Residential.OH.12.2010.txt
  • P8.AEP_-_Ohio_Power.Non-Residential.OH.12.2010.txt
  • P8.AEP_-_Ohio_Power.Residential.OH.12.2010.txt
  • P8.CEI.Non-Residential.OH.10.2010.txt
  • P8.PECO_Energy.Non-Residential..12.2010.txt
  • P8.PECO_Energy.Non-Residential.A..12.2010.txt
  • P8.PECO_Energy.Non-Residential.A.12.2010.txt
The code works well; however, some of the files are still too large. What I need help with is additional splitting of the files that are greater than 250 MB. I need to keep the naming convention the same, but I can add a number suffix.
So let's say:
if P8.CEI.Non-Residential.OH.10.2010.txt is 300 MB,
then I would want it split into the following 2 files:
  1. P8.CEI.Non-Residential.OH.10.2010.txt
  2. P8.CEI.Non-Residential.OH.10.2010_1.txt
Or
if P8.CEI.Non-Residential.OH.10.2010.txt is 600 MB
then I would want it split into 3 files:
  1. P8.CEI.Non-Residential.OH.10.2010.txt
  2. P8.CEI.Non-Residential.OH.10.2010_1.txt
  3. P8.CEI.Non-Residential.OH.10.2010_2.txt
and so on, so that all of the files remain under 250 MB.

Any help is greatly appreciated.
# 5  
Old 03-05-2013
You could try something like:
Code:
awk -F '^' -v M=250000000 '
{       # Set filename base (with the same "P8." prefix used by the first pass).
        fb = "P8." $2 "." $5 "." $11 "." $79 "." $80
        # Increment # of characters that would be in current output file for
        # this filename base if this record is appended.
        if((fs[fb] += length($0) + 1) > M) {
                # Adding this line would exceed max size; start a new file.
                f[fb]++
                fs[fb] = length($0) + 1
        }
        # Create output filename for this record.
        fn = fb (f[fb] ? "_" f[fb] : "") ".txt"
        # If this is not the same output file to which the last record was
        # written, close the last file written.
        if(lfn != fn) close(lfn)
        # Add this record to the end of the current output file.
        print >> fn
        # Save the output filename for comparison with the file used for the
        # next record.
        lfn = fn
}' /PowerCenter/TgtFiles/dfel_dw/elig_archive/elig_list_comb_arch.out

Note, however, that this example will limit the size of each file to 250000000 characters, not bytes. So, if your input file contains multibyte characters, you'll need to reduce the size to compensate or find a way to count bytes instead of characters. (Counting bytes isn't hard, but it could significantly slow down processing.)

Note also that if you have output files from a previous run, this script will append up to 250000000 characters to them rather than overwrite or skip over them.
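
If you really do need bytes, one workaround (a sketch, assuming your awk honors the locale, as GNU awk does) is to run the whole command under the C locale, where every character is exactly one byte, so length($0) already counts bytes:
Code:
# In the C locale each byte is one character, so length() returns bytes:
printf 'héllo\n' | LC_ALL=C awk '{print length($0)}'    # prints 6 (é is 2 bytes in UTF-8)
printf 'héllo\n' | awk '{print length($0)}'             # may print 5 in a UTF-8 locale

Prefixing the awk command above with LC_ALL=C applies the same trick there.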
# 6  
Old 03-05-2013
You haven't told us what system you are on. GNU makes this easy:
Code:
#!/bin/bash
for file in P8.*.txt ; do
  size=$(wc -c < "$file")                            # get the size of the file in bytes
  if [[ $size -gt $((250*1024*1024)) ]] ; then       # file bigger than 250M
      split -b 250M -d -a1 "$file" "${file%.txt}_"   # do the splitting; -d: numeric suffix, -a1: one digit (at most 10 pieces)
      for i in "${file%.txt}"_[0-9] ; do             # add the .txt suffix back
           mv "$i" "${i}.txt"
      done
  fi
done

# 7  
Old 03-05-2013
How about something a little simpler; maybe I am overcomplicating it. After I split my files, I want to search the directory for any file greater than 250 MB and then split each one into smaller files, keeping the name and just appending a suffix to each piece of that file. This directory:

345611455 Feb 28 15:24 P8.AEP_-_Columbus_Southern.Residential.OH.12.2010.txt
55645613 Feb 28 15:24 P8.AEP_-_Columbus_Southern.Non-Residential.OH.12.2010.txt
1916 Feb 28 15:30 P8.CEI.Non-Residential.OH.10.2010.txt
317407175 Feb 28 15:30 P8.AEP_-_Ohio_Power.Residential.OH.12.2010.txt
70283332 Feb 28 15:30 P8.AEP_-_Ohio_Power.Non-Residential.OH.12.2010.txt
572 Feb 28 15:30 P8.PECO_Energy.Residential.R2.12.2010.txt
610 Feb 28 15:31 P8.PECO_Energy.Non-Residential.NP.12.2010.txt
573 Feb 28 15:31 P8.PECO_Energy.Non-Residential.VE.12.2010.txt
12440 Feb 28 15:35 P8.PECO_Energy.Non-Residential.IA.12.2010.txt
447080511 Feb 28 15:35 P8.PECO_Energy.Residential.PA.12.2010.txt
45245 Feb 28 15:35 P8.PECO_Energy.Residential..12.2010.txt
64690395 Feb 28 15:35 P8.PECO_Energy.Non-Residential.PA.12.2010.txt
3582 Feb 28 15:35 P8.PECO_Energy.Non-Residential.ON.12.2010.txt
27263 Feb 28 15:35 P8.PECO_Energy.Non-Residential.LE.12.2010.txt
47728426 Feb 28 15:35 P8.Penn_Power.Residential.PA.12.2010.txt
7393084 Feb 28 15:35 P8.Penn_Power.Non-Residential.PA.12.2010.txt

The three files larger than 250 MB should be split.
The results should be:

250000000 Feb 28 15:24 P8.AEP_-_Columbus_Southern.Residential.OH.12.2010.txt
95611455 Feb 28 15:24 P8.AEP_-_Columbus_Southern.Residential.OH.12.2010_1.txt
55645613 Feb 28 15:24 P8.AEP_-_Columbus_Southern.Non-Residential.OH.12.2010.txt
1916 Feb 28 15:30 P8.CEI.Non-Residential.OH.10.2010.txt
250000000 Feb 28 15:30 P8.AEP_-_Ohio_Power.Residential.OH.12.2010.txt
67407175 Feb 28 15:30 P8.AEP_-_Ohio_Power.Residential.OH.12.2010_1.txt
70283332 Feb 28 15:30 P8.AEP_-_Ohio_Power.Non-Residential.OH.12.2010.txt
572 Feb 28 15:30 P8.PECO_Energy.Residential.R2.12.2010.txt
610 Feb 28 15:31 P8.PECO_Energy.Non-Residential.NP.12.2010.txt
573 Feb 28 15:31 P8.PECO_Energy.Non-Residential.VE.12.2010.txt
12440 Feb 28 15:35 P8.PECO_Energy.Non-Residential.IA.12.2010.txt
250000000 Feb 28 15:35 P8.PECO_Energy.Residential.PA.12.2010.txt
197080511 Feb 28 15:35 P8.PECO_Energy.Residential.PA.12.2010_1.txt
45245 Feb 28 15:35 P8.PECO_Energy.Residential..12.2010.txt
64690395 Feb 28 15:35 P8.PECO_Energy.Non-Residential.PA.12.2010.txt
3582 Feb 28 15:35 P8.PECO_Energy.Non-Residential.ON.12.2010.txt
27263 Feb 28 15:35 P8.PECO_Energy.Non-Residential.LE.12.2010.txt
47728426 Feb 28 15:35 P8.Penn_Power.Residential.PA.12.2010.txt
7393084 Feb 28 15:35 P8.Penn_Power.Non-Residential.PA.12.2010.txt
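
A minimal sketch of that second pass (assuming bash and GNU split; the 250 MB limit and the temporary ".part_" prefix are illustrative) that keeps the original name for the first piece and appends _1, _2, ... to the rest:
Code:
#!/bin/bash
# Second pass: split any P8.*.txt file over 250 MB in the current directory.
limit=250000000
for file in P8.*.txt ; do
    [ "$(wc -c < "$file")" -gt "$limit" ] || continue
    base=${file%.txt}
    # -b splits at byte boundaries (GNU split's -C keeps lines whole);
    # -d -a 1 gives numeric suffixes 0-9, i.e. at most 10 pieces per file.
    split -b "$limit" -d -a 1 "$file" "${base}.part_"
    n=0
    for piece in "${base}.part_"[0-9] ; do
        if [ "$n" -eq 0 ] ; then
            mv "$piece" "$file"                 # first piece keeps the original name
        else
            mv "$piece" "${base}_${n}.txt"      # later pieces get _1, _2, ...
        fi
        n=$((n + 1))
    done
done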
 