Split files with formatted numbers

07-02-2014

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

When you use the *printf() family of functions and want to print a literal percent sign (%) rather than have it introduce a format specification, you need to write %%, as in:

Code:

cmd = sprintf("date +%%Y%%m%%d >> \"%s\"\n", fn)

If the date format operand needs to include characters that have to be quoted for the shell (for example, if you want spaces between the fields in the date output), it would be something like:

Code:

cmd = sprintf("date \"+%%Y %%m %%d\" >> \"%s\"\n", fn)
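As a quick check of the quoting above, here is a minimal sketch that builds the same command string, prints it, and runs it with system(). The temporary file held in `out` and the variable `cmd_text` are made up for this demonstration and are not part of the original script.

```shell
# Hypothetical scratch file for the demo (not from the thread).
out=$(mktemp)

# Capture the command string awk builds; date's own output is
# redirected into $out by the command itself.
cmd_text=$(awk -v fn="$out" 'BEGIN {
        # %% becomes a literal %, %s is replaced by the file name
        cmd = sprintf("date \"+%%Y %%m %%d\" >> \"%s\"\n", fn)
        printf("%s", cmd)       # the text the shell will be given
        system(cmd)             # run it; date appends one line to fn
}')

printf '%s\n' "$cmd_text"
cat "$out"
```

The first line printed is the command with single percent signs, which is what date needs to see; the second is the date line that was appended to the file.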


Don Cragun


07-02-2014

Registered User

157, 0

Join Date: Oct 2007

Last Activity: 1 March 2019, 12:18 PM EST

Posts: 157

Thanks Given: 36

Thanked 0 Times in 0 Posts

Thanks, Don Cragun. So % works like an escape character within the *printf() functions.

bobbygsk


07-10-2014


Don Cragun's code works fine, but since I rarely use Unix, I'm not an expert in awk.
My requirement changed: the header now needs to show the number of records in each file.
Although we are splitting at 100,000 records, the last file might have fewer than 100,000.

So to display the number of records in each split file, I guess I have to use FNR (the record number in the current file). But how do I print it? The count is only known after the last record has been read, and we write the header before the records (lines).

So the header of each split file should look like the following:

Code:

HD~<total records in this split file>~Total number of files

~ being the delimiter

Last edited by bobbygsk; 07-10-2014 at 10:12 AM..

bobbygsk


07-10-2014


Even though you're not an expert in awk, which line in the code I supplied do you think needs to be changed? Did you make any attempt at changing that line to meet your new requirements? What part of what you tried is not working?

Do you want the line count in the header of each file to include the header and trailer in that file in the count, or just the number of lines in that file from the file that is being split?

Do you still want a 3 digit number (with leading zeros) for the "Total number of files" field at the end of the header line?

Don Cragun


07-10-2014


I tried the following before NR % lpf == 1

Code:

}
{       # count no. of lines
        ++cntRec
}
NR % lpf == 1 {
        # 1st line of output file:
        fn=sprintf("split.%03d.txt", ++ofc)
        # Header format HD~A~B (A:File no.;  B: Total Files)
        printf("HD~%03d~%03d\n", cntRec, nf) > fn
}
{       # all lines:
        print > fn
}

I do not know where to increment it.
In the header of each split file, I need the number of records (lines) it contains, excluding the header and footer.

bobbygsk


07-10-2014

Registered User

11,728, 1,345

Join Date: Feb 2004

Last Activity: 8 May 2020, 9:07 AM EDT

Location: NM

Posts: 11,728

Thanks Given: 903

Thanked 1,345 Times in 1,201 Posts

I guess I missed something. Generally I think it is better to use a command that already does what you want than to write a script; in this case csplit is a possible choice. Writing a script is educational, but for production work it is a better idea to use known good commands.

Code:

csplit  -f splitz -k  -n 3  csprap01.logscan 10000 {5}

Explanation: split csprap01.logscan into pieces named splitz000, splitz001, and so on (six files for the 52092-line input shown below).

-f splitz: prefix for the numbered file names (splitz000 .. splitz999)

-n 3: number of decimal digits in the suffix; -n 3 means zero-filled 3-digit numbers in the output file names

10000: start from where you are in the file (usually the beginning) and split just before line 10000, so lines 1-9999 are in the first split and lines 10000-19999 in the second.

{5}: repeat the previous operand five more times. {*} (GNU csplit) means keep on repeating until the input runs out. This last option will cause you to overwrite the splitz000 file (and others) if you create more than 999 files as splits.

-k: keep the files already created if an error occurs. The "line number out of range" message in the output below means the last file came up short of lines. With -k you lose no lines in the splits in that case.

Code:

csplit  -f splitz -k  -n 3  csprap01.logscan 10000 {5}
1293851
1305465
1306543
2458441
1785104
/usr/local/bin/csplit: `10000': line number out of range on repetition 5
258231
jmcnama>
jmcnama > ls -lrt splitz*
-rw-r--r--   1 jmcnama  other    1293851 Jul 10 14:39 splitz000
-rw-r--r--   1 jmcnama  other    1305465 Jul 10 14:39 splitz001
-rw-r--r--   1 jmcnama  other    1306543 Jul 10 14:39 splitz002
-rw-r--r--   1 jmcnama  other    2458441 Jul 10 14:39 splitz003
-rw-r--r--   1 jmcnama  other    1785104 Jul 10 14:39 splitz004
-rw-r--r--   1 jmcnama  other     258231 Jul 10 14:39 splitz005

Code:

 jmcnama > wc -l splitz*
    9999 splitz000
   10000 splitz001
   10000 splitz002
   10000 splitz003
   10000 splitz004
    2093 splitz005
   52092 total
jmcnama >  wc -l csprap01.logscan
   52092 csprap01.logscan
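The same behaviour can be tried on a small scale. This is only a sketch: the 25-line `sample.txt`, the `part.` prefix, and the scratch directory are invented for the demonstration, and -s is added to suppress the byte counts.

```shell
# Hypothetical scratch directory and 25-line input (not from the thread).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 25 > sample.txt

# Split before line 10, then repeat once more (before line 20).
# -k keeps the pieces on error, -s silences the size report.
csplit -s -k -f part. -n 3 sample.txt 10 '{1}'

wc -l part.*
```

As with the 10000-line run, the first piece is one line shorter than the others because the split happens before the named line, and whatever is left after the last split lands in the final piece.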

jim mcnamara


07-10-2014


Quote:

Originally Posted by bobbygsk

I tried the following before NR % lpf == 1

Code:

}
{       # count no. of lines
        ++cntRec
}
NR % lpf == 1 {
        # 1st line of output file:
        fn=sprintf("split.%03d.txt", ++ofc)
        # Header format HD~A~B (A:File no.;  B: Total Files)
        printf("HD~%03d~%03d\n", cntRec, nf) > fn
}
{       # all lines:
        print > fn
}

I do not know where to increment it.
I need header in each splitted file, how many records(lines) it has excluding header and footer.

OK. Unfortunately, you can't count how many lines you have written into a file before you write those lines into the file. So using cntRec like you tried can only show you how many lines were written into previous files.

But, since we know how many lines we've read and how many lines are in the input file, we can calculate how many lines we are going to write into this file before we write the header record. So, remove the new action you added:

Code:

{       # count no. of lines
        ++cntRec
}

and just change the printf() statement you changed to something like:

Code:

	printf("HD~%d~%03d\n", (NR - 1 + lpf) <= lc ? lpf : lc % lpf, nf) > fn

If the current line number (which is the 1st line in an output file) - 1 + the maximum number of lines that we will write to a file is less than or equal to the number of lines in the input file, print the maximum number of lines to write to a file; otherwise, print the number of lines left over (which will only happen on the last file, and only then if there are fewer than lpf lines left to go into that file).
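To see the calculation in action, here is a minimal self-contained sketch. The full original script is not shown in this part of the thread, so the way lc (lines in the input) and nf (number of output files) are obtained here is an assumption, as are the 25-line `input.txt` and lpf=10. It uses the ~ delimiter from the requested header format and leaves out the trailer records.

```shell
# Hypothetical scratch directory and 25-line input (not from the thread).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 25 > input.txt

lpf=10                          # lines per output file
lc=$(wc -l < input.txt)         # assumed: line count taken up front

awk -v lpf="$lpf" -v lc="$lc" '
BEGIN { nf = int((lc + lpf - 1) / lpf) }        # assumed: files needed, rounded up
NR % lpf == 1 {
        # 1st line of output file: close the previous file, name the next
        if (fn != "") close(fn)
        fn = sprintf("split.%03d.txt", ++ofc)
        # full file gets lpf lines, a short last file gets the remainder
        printf("HD~%d~%03d\n", (NR - 1 + lpf) <= lc ? lpf : lc % lpf, nf) > fn
}
{       # all lines:
        print > fn
}' input.txt

for f in split.*.txt; do head -n 1 "$f"; done
```

With 25 input lines and 10 lines per file, the first two headers report 10 records and the third reports the 5 that are left over.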

Jim,
The reason for the script is that csplit doesn't add the desired headers and trailers to the split files. And, yes, that could be done after csplit did the big part of the job; but why read and write the data again to add a header when awk can do it in one pass?


Don Cragun
