Split files with formatted numbers


 
# 15  
Old 07-02-2014
When you use the *printf() family of functions and you want to print a percent sign (%) rather than have it introduce a format field, you need to use %% as in:
Code:
cmd = sprintf("date +%%Y%%m%%d >> \"%s\"\n", fn)

If you need to include characters in the date format operand that have to be escaped from the shell (such as if you wanted the date output to include spaces between fields), it would be something like:
Code:
cmd = sprintf("date \"+%%Y %%m %%d\" >> \"%s\"\n", fn)
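
As a quick check of the doubled percent signs, the command string can be printed instead of being run. This is only a sketch: the function name build_date_cmd and the file name passed to it are made up for illustration.

```shell
# build_date_cmd FILE (hypothetical helper, not part of the thread's script):
# print the command string that sprintf() builds, without running it.
build_date_cmd() {
    awk -v fn="$1" 'BEGIN {
        # %% yields a literal %, and %s is replaced by the file name
        cmd = sprintf("date \"+%%Y %%m %%d\" >> \"%s\"", fn)
        print cmd
    }'
}

build_date_cmd "split.001.txt"
```

The printed string is what the shell would receive if cmd were run, for example with system().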

# 16  
Old 07-02-2014
Thanks, Don Cragun. So % works like an escape character within the *printf() functions.
# 17  
Old 07-10-2014
Don Cragun's code works fine. But since I rarely use Unix, I'm no expert in awk.
My requirement has changed: the header now needs to show the number of records in each file.
Although we are splitting at 100,000 records, the last file might have fewer than 100,000.

So to display the number of records in each split file, I guess I have to use FNR (the record number in the current file). But how do I print it? The final value of FNR is known only at the end of the file, and we write the header first and then all the records (lines).

So my split files header should look like the following
Code:
HD~<total records in this split file>~Total number of files

~ being the delimiter.

Last edited by bobbygsk; 07-10-2014 at 10:12 AM..
# 18  
Old 07-10-2014
Even though you're not an expert in awk, which line in the code I supplied do you think needs to be changed? Did you make any attempt at changing that line to meet your new requirements? What part of what you tried is not working?

Do you want the line count in the header of each file to include that file's header and trailer, or just the number of lines copied into it from the file being split?

Do you still want a 3-digit number (with leading zeros) for the "Total number of files" field at the end of the header line?
# 19  
Old 07-10-2014
I tried the following before NR % lpf == 1
Code:
}
{       # count no. of lines
        ++cntRec
}
NR % lpf == 1 {
        # 1st line of output file:
        fn=sprintf("split.%03d.txt", ++ofc)
        # Header format HD~A~B (A:File no.;  B: Total Files)
        printf("HD~%03d~%03d\n", cntRec, nf) > fn
}
{       # all lines:
        print > fn
}

I do not know where to increment it.
In each split file I need a header showing how many records (lines) the file has, excluding the header and footer.
# 20  
Old 07-10-2014
I guess I missed something. Generally I think it is better to use a command that does what you want than to write a script; in this case
Code:
csplit

is a possible choice. It is educational to write a script but a better idea to use known good commands for production work.

Code:
csplit  -f splitz -k  -n 3  csprap01.logscan 10000 {5}

Explanation: split csprap01.logscan into pieces of up to 10000 lines named splitz000, splitz001, and so on.

-f splitz - prefix for the numbered file names: splitz000 .. splitz999

-n - number of decimal digits in the number: -n 3 means use zero-filled numbers with 3 digits for output filenames

10000 means start from where you are in the file (usually the beginning) and stop just before line 10000, so lines 1-9999 are in the first split and lines 10000-19999 in the second.

{5} repeats the previous operand five more times - {*} (Linux csplit) means keep on repeating. This last option will cause you to overwrite the splitz000 file (and others) if you create more than 999 files as splits.

The "line number out of range on repetition 5" message in the output below means the last file came up short of lines. With -k you lose no lines in the splits in case of error.

Code:
csplit  -f splitz -k  -n 3  csprap01.logscan 10000 {5}
1293851
1305465
1306543
2458441
1785104
/usr/local/bin/csplit: `10000': line number out of range on repetition 5
258231
jmcnama>
jmcnama > ls -lrt splitz*
-rw-r--r--   1 jmcnama  other    1293851 Jul 10 14:39 splitz000
-rw-r--r--   1 jmcnama  other    1305465 Jul 10 14:39 splitz001
-rw-r--r--   1 jmcnama  other    1306543 Jul 10 14:39 splitz002
-rw-r--r--   1 jmcnama  other    2458441 Jul 10 14:39 splitz003
-rw-r--r--   1 jmcnama  other    1785104 Jul 10 14:39 splitz004
-rw-r--r--   1 jmcnama  other     258231 Jul 10 14:39 splitz005

Code:
 jmcnama > wc -l splitz*
    9999 splitz000
   10000 splitz001
   10000 splitz002
   10000 splitz003
   10000 splitz004
    2093 splitz005
   52092 total
jmcnama >  wc -l csprap01.logscan
   52092 csprap01.logscan
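
The same behaviour can be reproduced on a small input. A minimal sketch, assuming a csplit that accepts the options used above; the function name demo_csplit, the 8-line input and the part prefix are made up for illustration.

```shell
# demo_csplit (hypothetical): split an 8-line file at line 4 and again
# 4 lines later, then print the line count of each piece.
demo_csplit() {
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        printf '%s\n' 1 2 3 4 5 6 7 8 > in.txt
        # -s: do not print the byte counts; -k, -f and -n as described above
        csplit -s -k -f part -n 3 in.txt 4 '{1}'
        for f in part*; do
            printf '%s %d\n' "$f" "$(( $(wc -l < "$f") ))"
        done
    )
    rm -rf "$dir"
}

demo_csplit
```

As with the 10000-line run above, the first piece should come out one line short (lines 1-3), because the split happens just before line 4.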

# 21  
Old 07-10-2014
OK. Unfortunately, you can't count how many lines you have written into a file before you write those lines into the file. So using cntRec like you tried can only show you how many lines were written into previous files.

But, since we know how many lines we've read and how many lines are in the input file, we can calculate how many lines we are going to write into this file before we write the header record. So, remove the new action you added:
Code:
{       # count no. of lines
        ++cntRec
}

and just change the printf() statement you changed to something like:
Code:
	printf("HD~%d~%03d\n", (NR - 1 + lpf) <= lc ? lpf : lc % lpf, nf) > fn

If the current line number (which is the 1st line in an output file) - 1 + the maximum number of lines that we will write to a file is less than or equal to the number of lines in the input file, print the maximum number of lines to write to a file; otherwise, print the number of lines left over (which will only happen on the last file, and only then if there are fewer than lpf lines left to go into that file).
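
The header calculation can be tried on a small file. This is only a sketch, not the full script from earlier in the thread: the wrapper name split_with_header is made up, lc and nf are computed here with wc -l and a rounded-up division (an assumption about how the original script sets them), trailers are left out, lpf is assumed to be greater than 1, and the header uses the ~ delimiter from the requested format.

```shell
# split_with_header INPUT LINES_PER_FILE (hypothetical wrapper)
# Writes split.001.txt, split.002.txt, ... into the current directory.
split_with_header() {
    input=$1
    lpf=$2                              # max data lines per output file (> 1)
    lc=$(( $(wc -l < "$input") ))       # total lines in the input file
    awk -v lpf="$lpf" -v lc="$lc" '
    BEGIN {
        nf = int((lc + lpf - 1) / lpf)  # number of output files, rounded up
    }
    NR % lpf == 1 {
        # 1st line of an output file: close the previous one, start the next
        if (fn != "") close(fn)
        fn = sprintf("split.%03d.txt", ++ofc)
        # data lines in this file: lpf, or the remainder for a short last file
        printf("HD~%d~%03d\n", (NR - 1 + lpf) <= lc ? lpf : lc % lpf, nf) > fn
    }
    {   # all lines:
        print > fn
    }
    ' "$input"
}
```

For a 7-line input and lpf=3, this gives three files whose headers are HD~3~003, HD~3~003 and HD~1~003.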

Jim,
The reason for the script is that csplit doesn't add the desired headers and trailers to the split files. And, yes, that could be done after csplit did the big part of the job; but why read and write the data again to add a header when awk can do it in one pass?