Split a large file in n records and skip a particular record


 
# 8  
Old 11-29-2013
Thanks a lot Corona688, that works like a charm, that is what I was searching for.
I really appreciate it.
Thanks Bud.

@Akshay
Yours works, but it's not giving the desired results: some files have 25000 lines, some 15000, and some 5000.
Anyway, thanks for helping out.
# 9  
Old 11-29-2013
Quote:
Originally Posted by ibmtech
Thanks a lot Corona688, that works like a charm, that is what I was searching for.
I really appreciate it.
Thanks Bud.

@Akshay
Yours works, but it's not giving the desired results: some files have 25000 lines, some 15000, and some 5000.
Anyway, thanks for helping out.
@ibmtech Thank you.

@Corona688, can you please tell me what's wrong with my code? If possible, please explain, and I will correct it.
# 10  
Old 11-29-2013
For starters you need to set the original value of 'f' so you won't be printing into a blank filename.

% can cause some precedence problems in C, I know; I would bracket your expressions more carefully.
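
As a quick way to see how awk itself groups such a condition, here is a minimal sketch. The variables a, b and c are made-up stand-ins for the three tests (they are not from anyone's script). In awk, % binds tighter than ==, and && binds tighter than ||, so the unbracketed form should match the first bracketed form and not the second.

```shell
# a, b, c are hypothetical stand-ins for NR==1, NR, and the "not a 3" test.
awk 'BEGIN {
    a = 1; b = 6; c = 0
    unbracketed = (a || b % 5 == 1 && c)          # how awk reads it by default
    or_last     = (a || (((b % 5) == 1) && c))    # && grouped first
    and_last    = ((a || ((b % 5) == 1)) && c)    # || grouped first
    print unbracketed, or_last, and_last
}' > precedence.out
cat precedence.out
```

This prints 1 1 0: the unbracketed expression behaves like the "&& grouped first" version. Adding brackets anyway costs nothing and makes the intent obvious to the next reader.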
# 11  
Old 11-29-2013
Quote:
Originally Posted by Corona688
For starters you need to set the original value of 'f' so you won't be printing into a blank filename.

% can cause some precedence problems in C I know, I would bracket your expressions more carefully.
But NR==1 || ........ will set the filename, right? I still don't get what might be wrong.
# 12  
Old 11-30-2013
Quote:
Originally Posted by Akshay Hegde
But NR==1 || ........ will set the filename, right? I still don't get what might be wrong.
The problem is that all the conditions must be true to change the filename, so if record 5001 starts with "3", the test is not repeated on record 5002.

My solution failed because record 1 had "3" so no starting filename was set.
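
The missed switch can be reproduced on a small scale. This is only a sketch under assumptions: the 8-line input is made up, the chunk size is 3 instead of 5000, and /^ *3/ is used instead of \s (which is a gawk extension). The condition has the same shape as the NR == 1 || NR % 5000 == 1 && ... test discussed in this thread.

```shell
# Made-up input: line 4 starts with "3".
printf '%s\n' '1 a' '2 b' '2 c' '3 d' '1 e' '2 f' '1 g' '1 h' > demo_input.txt

# The switch is only attempted when NR % 3 == 1 (lines 4 and 7).
# Line 4 starts with 3, so nothing happens there, and the test is
# not retried on line 5 or 6.
awk 'NR == 1 || NR % 3 == 1 && !/^ *3/ { close(f); f = "File_" (++i) ".tmp" }
     { print > f }' demo_input.txt

for f in File_1.tmp File_2.tmp; do
    echo "$f: $(wc -l < "$f" | tr -d ' ') lines"
done
```

File_1.tmp ends up with 6 lines and File_2.tmp with 2, which is the same effect as the 5000/15000/25000 line files reported earlier: chunk sizes come out as multiples of the chunk size instead of "chunk size plus a few extra lines".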
# 13  
Old 11-30-2013
Quote:
Originally Posted by Chubler_XL
The problem is that all the conditions must be true to change the filename, so if record 5001 starts with "3", the test is not repeated on record 5002.

My solution failed because record 1 had "3" so no starting filename was set.
My understanding of Corona's code:
Code:
awk 'BEGIN{x="F"++i } NR%5==1{N++} N&&!/^\s*3/{if(x) close(x);x="Fa"++i;N=0}{print > x}'

1. It sets x in the BEGIN block.
2. When NR becomes 6, the remainder is 1 and N is incremented.
3. If N is set and the line doesn't start with 3, it checks whether x is set; if so, it closes x. It then increments i, makes x the new file name, and resets N.
4. Finally, it writes the line to file x.

My code
Code:
awk 'NR==1 || NR % 5000 == 1 && !/^\s*3/{close(f);f="File_"++i".tmp"}{print >f}' file

1. When NR == 1, it closes f; since f is not set yet, close(f) has no effect. It increments i to 1 and f becomes the file name. I used NR==1 instead of a BEGIN block.
2. When NR becomes 5001, the remainder is 1, and it checks whether the line starts with the digit 3; if not, it closes f, increments i, and changes the file name.
3. It writes the line to file f.

Let me know if my understanding is wrong.
# 14  
Old 11-30-2013
Hi Akshay,
Yes, you set the output file name at the beginning; that isn't the problem. The problem is that if line (5000 * n) + 1 starts with a 3, you won't attempt to switch files until another 5000 lines have been added to the file. The request is to print 5000 lines per file, but to keep adding single lines to a file as needed so that the 1st line in an output file never starts with a 3 (with the possible exception of the first file).

Another (more complicated, but more efficient) way to do this is:
Code:
#!/bin/ksh
awk '
function nf() {
        x = sprintf("F%02d", ++ofc)
        cnt = 0
}
BEGIN { nf()            # Set 1st output file name.
        lpf = 5000      # Set # of lines to be included in each output file
}
NR == 1 {
        # Skip 1st input line.
        next
}
NR > 2 {# Print previous line.
        print last > x
        cnt++
}
{       # Save current line.  Do not print it yet so we can skip the last line.
        # When we hit EOF, last will contain the last line read, but we will
        # not have printed it.
        last = $0
}
cnt >= lpf && ! /^ *3/ {
        # If we have a full file and current line does not start with a 3,
        # close current output file and switch to a new output file name.
        close(x)
        nf()
}' "$@"

I use the Korn shell, but any shell that recognizes basic Bourne shell syntax will also work for this script.

This script is more efficient because it only reads the input file once. Rather than using sed to delete the 1st and last line and awk to split the remaining lines, this script just uses awk to skip the 1st and last lines and split the other lines.

It also uses Fxx as the output file name format in case the input is a little more than 50000 lines which would produce F1, F2, ... F10. Using two digits means that the output file names will sort in sequence instead of having to worry about special handling for F1, F10, F2, F3, ... F9.
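
The sorting difference is easy to check. This small sketch just generates ten names in each style and sorts them byte-wise; the .sorted file names are made up for the demonstration.

```shell
# Ten names without padding, then ten names with two-digit padding.
awk 'BEGIN { for (n = 1; n <= 10; n++) printf "F%d\n", n }'   | LC_ALL=C sort > plain.sorted
awk 'BEGIN { for (n = 1; n <= 10; n++) printf "F%02d\n", n }' | LC_ALL=C sort > padded.sorted

head -n 3 plain.sorted    # F1, F10, F2
head -n 3 padded.sorted   # F01, F02, F03
```

With the padded names, a plain cat F?? reassembles the pieces in the order they were written.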

If you name this script tester, make it executable, and invoke it as follows:
Code:
./tester fiscal13

it should split the submitter's real input file into approximately 5000 line chunks.

If the test input file is named file and you invoke the script as follows:
Code:
./tester lpf=3 file

it will produce 3 files: F01 containing:
Code:
  100000035900015300007538   172359500000000000AA000000000Y000000000Y00
  100000035900015300007538   1166231200000000000AA000000000Y000000000Y00
  200000035900015300007538   11029684830A   000000000Y000000000Y01YA

F02 containing:
Code:
  200000035900015300007538   0127862850000000000000Y000000000Y00YY 
  200000035900015300007538   01282938700000000000AA000000000Y000000000Y00    
  300000035900015300007538   01282938701025828658A   000000000Y000000000Y01   
  300000035900015300007538   1282938700000000000AA000000000Y000000000Y00
  300000035900015300007538   1282938703028860515A   000000000Y000000000Y03

and F03 containing:
Code:
  100000035900015300007538   172359500000000000AA000000000Y000000000Y00Y     
  100000035900015300007538   172359500000000000AA000000000Y000000000Y00Y        
  200000035900015300007538   1166231201029684830A

(The lpf=3 operand overrides the default 5000 lines per file set in the BEGIN clause.) Note that F02 contains 5 lines instead of 3, even though 5 is not a multiple of 3: the extra lines avoid splitting files in the middle of a multi-line record (assuming that a line starting with a 3 is some kind of continuation line in a multi-line record).
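
To confirm that a split respected the rule, something like the following could be run over the output files. It is only a sketch: check_split is a hypothetical helper name, and the demo_F* files are made-up stand-ins for real output. It flags any file other than the first whose first line starts with a 3.

```shell
# Hypothetical checker: the first file is exempt, every later file must
# not begin with a line matching /^ *3/.
check_split() {
    first=1
    for f in "$@"; do
        if [ "$first" -eq 0 ] && head -n 1 "$f" | grep -q '^ *3'; then
            echo "BAD $f"
        else
            echo "ok $f"
        fi
        first=0
    done
}

# Made-up demo files; demo_F03 deliberately breaks the rule.
printf '  1aaa\n  2bbb\n'         > demo_F01
printf '  2ccc\n  3ddd\n  3eee\n' > demo_F02
printf '  3fff\n'                 > demo_F03

check_split demo_F01 demo_F02 demo_F03 > check.out
cat check.out
```

For the demo files this reports ok for demo_F01 and demo_F02 and BAD for demo_F03. Against real output, check_split F?? should report ok for every file.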
These 2 Users Gave Thanks to Don Cragun For This Post: