Breaking large file into small files

03-06-2015

Hello emily,

I'm not sure about your complete requirement, but could you please try the following and let me know if it helps?

Code:

echo $3 | awk '{FILENAME=$3"_"int((NR-1)/200)".txt";print >> FILENAME}'

You can replace your current command with the one shown above.

Thanks,
R. Singh

RavinderSingh13

03-06-2015

The awk variable FILENAME is provided by awk and contains the name of the input file that is currently being processed. Redefining it is not a good idea. Try something like this instead:

Code:

awk '{outfile=FILENAME int((NR-1)/200) ".txt";print >> outfile}' $3

Note, however, that both your script and the above script consume a file descriptor for each output file created and don't free any file descriptors until awk exits. If you need to create several files, you may have to close files when you're done writing to them to avoid a "too many open files" error. Even if you don't "have to", it is usually a good habit to close files you no longer need open. And, if you have a lot of files with numbers in them that might be more than one digit, you may want to add some leading zeros so the files will appear in numeric order when output by ls...

Code:

awk '
BEGIN {	outfile = sprintf("%s%03d.txt", FILENAME, 0)
}
{	print > outfile
}
(NR % 200) == 0 {
	close(outfile)
	outfile = sprintf("%s%03d.txt", FILENAME, int(NR/200))
}' $3

And, just out of curiosity, why does your script bother defining:

Code:

PATHNAME=$1
CONSTANT=rfio:
GREP=$2
OUTPUT=$3

when none of them are ever referenced in your script?

Note that I also changed the print >> outfile to print > outfile. If you ever need to update the split files due to an update in a base file, you will want to overwrite the old files instead of appending to the end of them. (Note, however, that this won't remove any trailing files that may no longer be needed if your updated base file is smaller than it was before.) If that is a concern, you could add a line to your script before invoking awk:

Code:

# Remove any earlier versions of the split output files.
rm -f "${3}"[0-9][0-9][0-9].txt
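
The cleanup and the split can also be combined in one helper. This is only a minimal sketch: the function name split_fresh and its optional chunk-size argument are assumptions for illustration, not part of the original script. It names each piece inside the main rule when a new chunk starts, so every file it names also receives data.

```shell
#!/bin/sh
# split_fresh FILE [LINES]: hypothetical helper, not from the original script.
# Splits FILE into pieces named FILE000.txt, FILE001.txt, ... of LINES lines
# each (default 200), after removing pieces left by an earlier run.
split_fresh() {
    base=$1
    lines=${2:-200}
    # Drop old pieces so none survive when the new input is smaller.
    rm -f "$base"[0-9][0-9][0-9].txt
    awk -v n="$lines" '
    (NR - 1) % n == 0 {
        # A new chunk starts: release the previous file descriptor first.
        if (outfile != "") close(outfile)
        outfile = sprintf("%s%03d.txt", FILENAME, int((NR - 1) / n))
    }
    { print > outfile }' "$base"
}
```

Called as split_fresh "$3", it should leave only the pieces that belong to the current contents of the base file.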

Don Cragun

03-06-2015

Hello Ravinder and Don,
Here is my modified script [1] and the output. Why am I getting a filename like:

Code:

-rw-r--r-- 1 emily af-cms  54400  6. Mär 11:18 _0.txt

[1]

Code:

#!/bin/bash                                                                                                                  

OUTPUT=InputFile_
GREP=root
EOSPATH="srm://dcache-se-cms.desy.de:8443SingleMuonGun/SingleMuMinus_Fall14_FlatPt-0to200_MCRUN2_72_V3_GEN_SIM_DIGI_RECO_L1/150127_084421/"
FILEPATH[1]=$EOSPATH/0001
FILEPATH[2]=$EOSPATH/0002
#FILEPATH[3]=$EOSPATH/0003                                                                                                   
#FILEPATH[4]=$EOSPATH/0004                                                                                                   

## copy the FileName from eos to $3                                                                                          
for FileNameIndx in "${FILEPATH[@]}"
  do
    if [[ ! -e "dest_path/$FileNameIndx" ]]; then
        echo "Copying fileName \"$FileNameIndx  | grep root\" to $OUTPUT"
        Index=$(echo $FileNameIndx | awk '{split($FileNameIndx, a, "000"); print "000"a[2]}')
        srmls $FileNameIndx --count 99999 --offset 2 | grep $GREP | awk -F'tier2' '{print string path $GREP}' string="" path=""  > $OUTPUT$Index
        FINALFILE=$OUTPUT$Index
        echo $FINALFILE
        echo "progressing ... please be patient..."

        awk '                                                                                                                
        BEGIN {outfile = sprintf("%s_%01d.txt", FILENAME, 0)                                                                 
}                                                                                                                            
{print > outfile                                                                                                             
}                                                                                                                            
(NR % 200) == 0 {                                                                                                            
close(outfile)                                                                                                               
outfile = sprintf("%s_%01d.txt", FILENAME, int(NR/200))                                                                      
}'  $FINALFILE

    fi
done

It is working, but it produces output like this:

Code:

-rwxr-xr-x 1 emily af-cms   1820  6. Mär 11:18 copyTextFromCastor.sh
-rw-r--r-- 1 emily af-cms 271184  6. Mär 11:18 InputFile_0001
-rw-r--r-- 1 emily af-cms  54400  6. Mär 11:18 InputFile_0001_1.txt
-rw-r--r-- 1 emily af-cms  54400  6. Mär 11:18 InputFile_0001_2.txt
-rw-r--r-- 1 emily af-cms  54400  6. Mär 11:18 InputFile_0001_3.txt
-rw-r--r-- 1 emily af-cms  53584  6. Mär 11:18 InputFile_0001_4.txt
-rw-r--r-- 1 emily af-cms 271456  6. Mär 11:18 InputFile_0002
-rw-r--r-- 1 emily af-cms  54400  6. Mär 11:18 _0.txt
-rw-r--r-- 1 emily af-cms  54400  6. Mär 11:18 InputFile_0002_1.txt
-rw-r--r-- 1 emily af-cms  54400  6. Mär 11:18 InputFile_0002_2.txt
-rw-r--r-- 1 emily af-cms  54400  6. Mär 11:18 InputFile_0002_3.txt
-rw-r--r-- 1 emily af-cms  53856  6. Mär 11:18 InputFile_0002_4.txt

Last edited by emily; 03-06-2015 at 06:32 AM..

emily

03-06-2015

Sorry. My mistake. FILENAME isn't defined yet in the BEGIN clause...

Change:

Code:

        BEGIN {outfile = sprintf("%s_%01d.txt", FILENAME, 0)

to:

Code:

        NR==1 {outfile = sprintf("%s_%01d.txt", FILENAME, 0)
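
With that change applied, the whole awk program can be tried on its own. The following is a sketch only: the scratch directory and the 450 placeholder lines are assumptions made for the demonstration, and the name InputFile_0001 is borrowed from the listing above.

```shell
#!/bin/sh
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Placeholder input: 450 numbered lines standing in for the real file list.
FINALFILE=InputFile_0001
awk 'BEGIN { for (i = 1; i <= 450; i++) print "line " i }' > "$FINALFILE"

awk '
# FILENAME is usable from the first record on, so name the first piece here.
NR == 1 { outfile = sprintf("%s_%01d.txt", FILENAME, 0) }
{ print > outfile }
(NR % 200) == 0 {
    close(outfile)
    outfile = sprintf("%s_%01d.txt", FILENAME, int(NR / 200))
}' "$FINALFILE"

ls
```

With this placeholder input, the run should leave InputFile_0001_0.txt and InputFile_0001_1.txt with 200 lines each, InputFile_0001_2.txt with the remaining 50, and no stray _0.txt.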

Don Cragun

03-06-2015

In awk, FILENAME is only defined after the first file has been opened, which is after the BEGIN section has been finished. Within the BEGIN section FILENAME is empty.
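
This is easy to check directly. A small sketch, assuming a hypothetical helper name show_filename; it prints FILENAME once from the BEGIN section and once when the first record of each input file is read.

```shell
#!/bin/sh
# show_filename FILE...: hypothetical helper for illustration only.
# Prints the value of FILENAME in BEGIN and at the first record of each file.
show_filename() {
    awk '
    BEGIN    { printf "BEGIN: [%s]\n", FILENAME }
    FNR == 1 { printf "main:  [%s]\n", FILENAME }' "$@"
}
```

With the common awk implementations, running show_filename on a non-empty file typically prints empty brackets on the BEGIN line and the file name on the main line. POSIX leaves the BEGIN value unspecified, so it is best not to rely on it there at all.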

RudiC

03-06-2015

It is working fine now.

Thanks, everyone, for your useful suggestions.

Last edited by emily; 03-06-2015 at 07:38 AM..

emily
