Split Large Files Based On Row Pattern..


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Split Large Files Based On Row Pattern..
# 1  
Old 11-25-2013
Split Large Files Based On Row Pattern..

Hi all.

I've tried searching the web but could not find similar problem to mine.

I have one large file to be splitted into several files based on the matching pattern found in each row.

For example, let's say the file content:
Code:
MONTH,ACCOUNT_NO,CUSTOMER_NAME,SEGMENT_CODE,COST_CENTER,BILLED_DATE,BILL_FREQ,BILL_START_DATE,BILL_END_DATE,BILLED_RENTAL,BILLED_EARNED,PREVIOUS_BILLED_EARNED,REMAINING_UNEARNED_BALANCE,TOTAL_REVENUE_EARNED
08,D919518500104,HENG POH MING,R20,YRAC33,BP 04,Monthly,04/08/13,03/09/13,25,22.58,0,2.42,22.58
08,A100027860305,PANG KAM SENG,S10,YNDSAA,BP 13,Monthly,13/07/13,12/08/13,25,15.32,9.68,0,25
08,A100026920403,LIM YOKE TIN,S10,YNDSAA,BP 13,Monthly,13/07/13,12/08/13,111,68.03,42.97,0,111
08,D925038340109,SITI SHARAH BINTI OTHMAN,R20,YRAC33,BP 04,Monthly,04/07/13,03/08/13,25,22.58,2.42,0,25
08,D217414580206,ABDOL ADI BIN BABA,R20,YRACMM,BP 19,Monthly,19/08/13,18/09/13,85,35.64,0,49.36,35.64
08,A100015330204,CHUA THIAN SONG,S10,YNDSAA,BP 13,Monthly,13/08/13,12/09/13,45,27.58,0,17.42,27.58
08,A100011760104,TOYO ENTERPRISE,S10,YNDSAA,BP 13,Monthly,13/08/13,12/09/13,45,27.58,0,17.42,27.58
08,A100036551202,LEE HUA FONG,S10,YNDSAA,BP 13,Monthly,13/07/13,12/08/13,113,69.26,43.74,0,113
08,A600022365604,LIM HOOI LENG,R30,YRACPP,BP 19,Monthly,19/07/13,18/08/13,135,56.61,78.39,0,135
08,A100036551202,LEE HUA FONG,S10,YNDSAA,BP 13,Monthly,13/08/13,12/09/13,113,69.26,0,43.74,69.26
08,N351670650204,TING SIEW KIONG,R30,YRAUSI,BP 04,Monthly,04/07/13,03/08/13,115,103.87,11.13,0,115
08,N351670650204,TING SIEW KIONG,R30,YRAUSI,BP 04,Monthly,04/08/13,03/09/13,115,103.87,0,11.13,103.87
08,D919518500104,HENG POH MING,R20,YRAC33,BP 04,Monthly,04/07/13,03/08/13,25,22.58,2.42,0,25
08,Y102551020201,LIM LIAN HOCK,S10,YNDSJJ,BP 04,Monthly,04/08/13,03/09/13,20,18.06,0,1.94,18.06
08,D925038340109,SITI SHARAH BINTI OTHMAN,R20,YRAC33,BP 04,Monthly,04/08/13,03/09/13,25,22.58,0,2.42,22.58
08,D214091570207,CHUAH ANG TUAN,R60,YRACAA,BP 19,Monthly,19/08/13,18/09/13,113,47.38,0,65.62,47.38
08,Y502630980403,BAN SENG CHAN SDN BHD,S10,YNDSPP,BP 04,Monthly,04/07/13,03/08/13,45,40.65,4.35,0,45
08,D214091570207,CHUAH ANG TUAN,R60,YRACAA,BP 19,Monthly,19/07/13,18/08/13,113,47.38,65.62,0,113
08,A100015330204,CHUA THIAN SONG,S10,YNDSAA,BP 13,Monthly,13/07/13,12/08/13,45,27.58,17.42,0,45
08,Y502630980403,BAN SENG CHAN SDN BHD,S10,YNDSPP,BP 04,Monthly,04/08/13,03/09/13,45,40.65,0,4.35,40.65
08,F221456710106,NORHAZLINA BINTI ABDUL HAMID,R10,YRACAA,BP 19,Monthly,19/07/13,18/08/13,91,38.16,52.84,0,91
08,A100011760104,TOYO ENTERPRISE,S10,YNDSAA,BP 13,Monthly,13/07/13,12/08/13,45,27.58,17.42,0,45
08,A600022365604,LIM HOOI LENG,R30,YRACPP,BP 19,Monthly,19/08/13,18/09/13,135,56.61,0,78.39,56.61
08,D208631350106,ZAILAN BIN KHASRAN,R30,YRACNN,BP 19,Monthly,19/08/13,18/09/13,25,10.48,0,14.52,10.48
08,D217414580206,ABDOL ADI BIN BABA,R20,YRACMM,BP 19,Monthly,19/07/13,18/08/13,85,35.64,49.36,0,85
08,A100027860305,PANG KAM SENG,S10,YNDSAA,BP 13,Monthly,13/08/13,12/09/13,25,15.32,0,9.68,15.32
08,F221456710106,NORHAZLINA BINTI ABDUL HAMID,R10,YRACAA,BP 19,Monthly,19/08/13,18/09/13,91,38.16,0,52.84,38.16
08,A100026920403,LIM YOKE TIN,S10,YNDSAA,BP 13,Monthly,13/08/13,12/09/13,111,68.03,0,42.97,68.03

So, I want to split based on that 6th column value (comma delimited) and thus
Code:
08,D919518500104,HENG POH MING,R20,YRAC33,BP 04,Monthly,04/08/13,03/09/13,25,22.58,0,2.42,22.58
08,D925038340109,SITI SHARAH BINTI OTHMAN,R20,YRAC33,BP 04,Monthly,04/07/13,03/08/13,25,22.58,2.42,0,25
08,N351670650204,TING SIEW KIONG,R30,YRAUSI,BP 04,Monthly,04/07/13,03/08/13,115,103.87,11.13,0,115
08,N351670650204,TING SIEW KIONG,R30,YRAUSI,BP 04,Monthly,04/08/13,03/09/13,115,103.87,0,11.13,103.87
08,D919518500104,HENG POH MING,R20,YRAC33,BP 04,Monthly,04/07/13,03/08/13,25,22.58,2.42,0,25
08,Y102551020201,LIM LIAN HOCK,S10,YNDSJJ,BP 04,Monthly,04/08/13,03/09/13,20,18.06,0,1.94,18.06
08,D925038340109,SITI SHARAH BINTI OTHMAN,R20,YRAC33,BP 04,Monthly,04/08/13,03/09/13,25,22.58,0,2.42,22.58
08,Y502630980403,BAN SENG CHAN SDN BHD,S10,YNDSPP,BP 04,Monthly,04/07/13,03/08/13,45,40.65,4.35,0,45
08,Y502630980403,BAN SENG CHAN SDN BHD,S10,YNDSPP,BP 04,Monthly,04/08/13,03/09/13,45,40.65,0,4.35,40.65

---> BP 04.TXT
Code:
MONTH,ACCOUNT_NO,CUSTOMER_NAME,SEGMENT_CODE,COST_CENTER,BILLED_DATE,BILL_FREQ,BILL_START_DATE,BILL_END_DATE,BILLED_RENTAL,BILLED_EARNED,PREVIOUS_BILLED_EARNED,REMAINING_UNEARNED_BALANCE,TOTAL_REVENUE_EARNED
08,D217414580206,ABDOL ADI BIN BABA,R20,YRACMM,BP 19,Monthly,19/08/13,18/09/13,85,35.64,0,49.36,35.64
08,A600022365604,LIM HOOI LENG,R30,YRACPP,BP 19,Monthly,19/07/13,18/08/13,135,56.61,78.39,0,135
08,D214091570207,CHUAH ANG TUAN,R60,YRACAA,BP 19,Monthly,19/08/13,18/09/13,113,47.38,0,65.62,47.38
08,D214091570207,CHUAH ANG TUAN,R60,YRACAA,BP 19,Monthly,19/07/13,18/08/13,113,47.38,65.62,0,113
08,F221456710106,NORHAZLINA BINTI ABDUL HAMID,R10,YRACAA,BP 19,Monthly,19/07/13,18/08/13,91,38.16,52.84,0,91
08,A600022365604,LIM HOOI LENG,R30,YRACPP,BP 19,Monthly,19/08/13,18/09/13,135,56.61,0,78.39,56.61
08,D208631350106,ZAILAN BIN KHASRAN,R30,YRACNN,BP 19,Monthly,19/08/13,18/09/13,25,10.48,0,14.52,10.48
08,D217414580206,ABDOL ADI BIN BABA,R20,YRACMM,BP 19,Monthly,19/07/13,18/08/13,85,35.64,49.36,0,85
08,F221456710106,NORHAZLINA BINTI ABDUL HAMID,R10,YRACAA,BP 19,Monthly,19/08/13,18/09/13,91,38.16,0,52.84,38.16

---> BP 19.TXT

and so on.

Also, I want to retain the heading
Code:
MONTH,ACCOUNT_NO,CUSTOMER_NAME,SEGMENT_CODE,COST_CENTER,BILLED_DATE,BILL_FREQ,BILL_START_DATE,BILL_END_DATE,BILLED_RENTAL,BILLED_EARNED,PREVIOUS_BILLED_EARNED,REMAINING_UNEARNED_BALANCE,TOTAL_REVENUE_EARNED

in each file

Thank you very much for helping.
# 2  
Old 11-26-2013
As long as you don't have more than about 10 different output files to be produced from an input file, the following awk script should do what you want:
Code:
awk -F, '
FNR == 1 {
        h = $0
        next
}
{       ofile = $6".TXT"
        if(!(ofile in ofiles)) {
                ofiles[ofile]
                print h > ofile
        }
        print > ofile
}' file

If you have a lot of output files, you'll need to keep track of how many files are open and close and reopen files as needed. Since your sample input only produces three output files, there was no need to keep track of open files (other than to print the header in each new output file).

If you want to run this script on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
This User Gave Thanks to Don Cragun For This Post:
# 3  
Old 11-26-2013
Wow, that's work charm!

Thanks a lot Don! Really appreciate.

It should work perfect since the total files expected is 10 perfectly!

But how do I exclude the header to be produced as separate text file? If not, total files would become 11.

Anyway, thanks a lot.

---------- Post updated at 12:29 PM ---------- Previous update was at 12:26 PM ----------

One more thing, how do I eliminate the space in the filename?

For example, BP 04.TXT should be BP04.TXT

Thanks for your help.
# 4  
Old 11-26-2013
Quote:
Originally Posted by aimy
Wow, that's work charm!

Thanks a lot Don! Really appreciate.

It should work perfect since the total files expected is 10 perfectly!

But how do I exclude the header to be produced as separate text file? If not, total files would become 11.

Anyway, thanks a lot.

---------- Post updated at 12:29 PM ---------- Previous update was at 12:26 PM ----------

One more thing, how do I eliminate the space in the filename?

For example, BP 04.TXT should be BP04.TXT

Thanks for your help.
The header is never written as a separate text file; it is only written as the first line in every text file it creates as a result of finding a new value in field 6 of your input file.

To get rid of zero or more spaces in your output file names, change:
Code:
{       ofile = $6".TXT"
        if(!(ofile in ofiles)) {

to:
Code:
{       ofile = $6".TXT"
        gsub(/ /, "", ofile)
        if(!(ofile in ofiles)) {

This User Gave Thanks to Don Cragun For This Post:
# 5  
Old 11-26-2013
Thanks a lot Don. The filename now worked perfectly.

Regarding the header produced as txt file. Actually that was based on the real file which actually got extra info on top:
Code:
HD|20131126_104934|1
MONTH,ACCOUNT_NO,CUSTOMER_NAME,SEGMENT_CODE,COST_CENTER,BILLED_DATE,BILL_FREQ,BILL_START_DATE,BILL_END_DATE,BILLED_RENTAL,BILLED_EARNED,PREVIOUS_BILLED_EARNED,REMAINING_UNEARNED_BALANCE,TOTAL_REVENUE_EARNED
08,D919518500104,HENG POH MING,R20,YRAC33,BP 04,Monthly,04/08/13,03/09/13,25,22.58,0,2.42,22.58
08,A100027860305,PANG KAM SENG,S10,YNDSAA,BP 13,Monthly,13/07/13,12/08/13,25,15.32,9.68,0,25
08,A100026920403,LIM YOKE TIN,S10,YNDSAA,BP 13,Monthly,13/07/13,12/08/13,111,68.03,42.97,0,111
08,D925038340109,SITI SHARAH BINTI OTHMAN,R20,YRAC33,BP 04,Monthly,04/07/13,03/08/13,25,22.58,2.42,0,25
08,D217414580206,ABDOL ADI BIN BABA,R20,YRACMM,BP 19,Monthly,19/08/13,18/09/13,85,35.64,0,49.36,35.64
08,A100015330204,CHUA THIAN SONG,S10,YNDSAA,BP 13,Monthly,13/08/13,12/09/13,45,27.58,0,17.42,27.58
08,A100011760104,TOYO ENTERPRISE,S10,YNDSAA,BP 13,Monthly,13/08/13,12/09/13,45,27.58,0,17.42,27.58
08,A100036551202,LEE HUA FONG,S10,YNDSAA,BP 13,Monthly,13/07/13,12/08/13,113,69.26,43.74,0,113
08,A600022365604,LIM HOOI LENG,R30,YRACPP,BP 19,Monthly,19/07/13,18/08/13,135,56.61,78.39,0,135
08,A100036551202,LEE HUA FONG,S10,YNDSAA,BP 13,Monthly,13/08/13,12/09/13,113,69.26,0,43.74,69.26
08,N351670650204,TING SIEW KIONG,R30,YRAUSI,BP 04,Monthly,04/07/13,03/08/13,115,103.87,11.13,0,115
08,N351670650204,TING SIEW KIONG,R30,YRAUSI,BP 04,Monthly,04/08/13,03/09/13,115,103.87,0,11.13,103.87
08,D919518500104,HENG POH MING,R20,YRAC33,BP 04,Monthly,04/07/13,03/08/13,25,22.58,2.42,0,25
08,Y102551020201,LIM LIAN HOCK,S10,YNDSJJ,BP 04,Monthly,04/08/13,03/09/13,20,18.06,0,1.94,18.06
08,D925038340109,SITI SHARAH BINTI OTHMAN,R20,YRAC33,BP 04,Monthly,04/08/13,03/09/13,25,22.58,0,2.42,22.58
08,D214091570207,CHUAH ANG TUAN,R60,YRACAA,BP 19,Monthly,19/08/13,18/09/13,113,47.38,0,65.62,47.38
08,Y502630980403,BAN SENG CHAN SDN BHD,S10,YNDSPP,BP 04,Monthly,04/07/13,03/08/13,45,40.65,4.35,0,45

And thus, it will create BILLED_DATE.txt as well.

So, in other words the actual header is actually belongs to the 2nd line, not the 1st.

Ermm... Smilie
# 6  
Old 11-26-2013
Quote:
Originally Posted by aimy
Thanks a lot Don. The filename now worked perfectly.

Regarding the header produced as txt file. Actually that was based on the real file which actually got extra info on top:
Code:
HD|20131126_104934|1
MONTH,ACCOUNT_NO,CUSTOMER_NAME,SEGMENT_CODE,COST_CENTER,BILLED_DATE,BILL_FREQ,BILL_START_DATE,BILL_END_DATE,BILLED_RENTAL,BILLED_EARNED,PREVIOUS_BILLED_EARNED,REMAINING_UNEARNED_BALANCE,TOTAL_REVENUE_EARNED
08,D919518500104,HENG POH MING,R20,YRAC33,BP 04,Monthly,04/08/13,03/09/13,25,22.58,0,2.42,22.58
08,A100027860305,PANG KAM SENG,S10,YNDSAA,BP 13,Monthly,13/07/13,12/08/13,25,15.32,9.68,0,25
08,A100026920403,LIM YOKE TIN,S10,YNDSAA,BP 13,Monthly,13/07/13,12/08/13,111,68.03,42.97,0,111
08,D925038340109,SITI SHARAH BINTI OTHMAN,R20,YRAC33,BP 04,Monthly,04/07/13,03/08/13,25,22.58,2.42,0,25
08,D217414580206,ABDOL ADI BIN BABA,R20,YRACMM,BP 19,Monthly,19/08/13,18/09/13,85,35.64,0,49.36,35.64
08,A100015330204,CHUA THIAN SONG,S10,YNDSAA,BP 13,Monthly,13/08/13,12/09/13,45,27.58,0,17.42,27.58
08,A100011760104,TOYO ENTERPRISE,S10,YNDSAA,BP 13,Monthly,13/08/13,12/09/13,45,27.58,0,17.42,27.58
08,A100036551202,LEE HUA FONG,S10,YNDSAA,BP 13,Monthly,13/07/13,12/08/13,113,69.26,43.74,0,113
08,A600022365604,LIM HOOI LENG,R30,YRACPP,BP 19,Monthly,19/07/13,18/08/13,135,56.61,78.39,0,135
08,A100036551202,LEE HUA FONG,S10,YNDSAA,BP 13,Monthly,13/08/13,12/09/13,113,69.26,0,43.74,69.26
08,N351670650204,TING SIEW KIONG,R30,YRAUSI,BP 04,Monthly,04/07/13,03/08/13,115,103.87,11.13,0,115
08,N351670650204,TING SIEW KIONG,R30,YRAUSI,BP 04,Monthly,04/08/13,03/09/13,115,103.87,0,11.13,103.87
08,D919518500104,HENG POH MING,R20,YRAC33,BP 04,Monthly,04/07/13,03/08/13,25,22.58,2.42,0,25
08,Y102551020201,LIM LIAN HOCK,S10,YNDSJJ,BP 04,Monthly,04/08/13,03/09/13,20,18.06,0,1.94,18.06
08,D925038340109,SITI SHARAH BINTI OTHMAN,R20,YRAC33,BP 04,Monthly,04/08/13,03/09/13,25,22.58,0,2.42,22.58
08,D214091570207,CHUAH ANG TUAN,R60,YRACAA,BP 19,Monthly,19/08/13,18/09/13,113,47.38,0,65.62,47.38
08,Y502630980403,BAN SENG CHAN SDN BHD,S10,YNDSPP,BP 04,Monthly,04/07/13,03/08/13,45,40.65,4.35,0,45

And thus, it will create BILLED_DATE.txt as well.

So, in other words the actual header is actually belongs to the 2nd line, not the 1st.

Ermm... Smilie
So, assuming that the 1st line in your input file is not to be copied into any of the output files, change:
Code:
FNR == 1 {

to:
Code:
FNR <= 2 {

This User Gave Thanks to Don Cragun For This Post:
# 7  
Old 11-26-2013
Oh God, you are so kind Don.

Thank you so much!

But would you mind to explain a little bit about the code? Why FNR <=2 not FNR == 2?

Thanks.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Split files based on row delimiter count

I have a huge file (around 4-5 GB containing 20 million rows) which has text like: <EOFD>11<EOFD>22<EORD>2<EOFD>2222<EOFD>3333<EORD>3<EOFD>44<EOFD>55<EORD>66<EOFD>888<EOFD>9999<EORD> Actually above is an extracted file from a Sql Server with each field delimited by <EOFD> and each row ends... (8 Replies)
Discussion started by: amvip
8 Replies

2. UNIX for Advanced & Expert Users

Split one file to many based on pattern

Hello All, I have records in a file in a pattern A,B,B,B,B,K,A,B,B,K Is there any command or simple logic I can pull out records into multiple files based on A record? I want output as File1: A,B,B,B,B,K File2: A,B,B,K (9 Replies)
Discussion started by: deal1dealer
9 Replies

3. UNIX for Dummies Questions & Answers

Split a huge 7 GB File Based on Pattern into 4 files

Hi, I have a Huge 7 GB file which has around 1 million records, i want to split this file into 4 files to contain around 250k messages each. Please help me as Split command cannot work here as it might miss tags.. Format of the file is as below <!--###### ###### START-->... (6 Replies)
Discussion started by: KishM
6 Replies

4. Shell Programming and Scripting

Help needed - Split large file into smaller files based on pattern match

Help needed urgently please. I have a large file - a few hundred thousand lines. Sample CP START ACCOUNT 1234556 name 1 CP END ACCOUNT CP START ACCOUNT 2224444 name 1 CP END ACCOUNT CP START ACCOUNT 333344444 name 1 CP END ACCOUNT I need to split this file each time "CP START... (7 Replies)
Discussion started by: frustrated1
7 Replies

5. Shell Programming and Scripting

Problem with splitting large file based on pattern

Hi Experts, I have to split huge file based on the pattern to create smaller files. The pattern which is expected in the file is: Master..... First... second.... second... third.. third... Master... First.. second... third... Master... First... second.. second.. second..... (2 Replies)
Discussion started by: saisanthi
2 Replies

6. Shell Programming and Scripting

split XML file into multiple files based on pattern

Hello, I am using awk to split a file into multiple files using command: nawk '{ if ( $1 == "<process" ) { n=split($2, arr, "\""); file=arr } print > file }' processes.xml <process name="Process1.process"> ... (3 Replies)
Discussion started by: chiru_h
3 Replies

7. Shell Programming and Scripting

Splitting large file into multiple files in unix based on pattern

I need to write a shell script for below scenario My input file has data in format: qwerty0101TWE 12345 01022005 01022005 datainala alanfernanded 26 qwerty0101mXZ 12349 01022005 06022008 datainalb johngalilo 28 qwerty0101TWE 12342 01022005 07022009 datainalc hitalbert 43 qwerty0101CFG 12345... (19 Replies)
Discussion started by: jimmy12
19 Replies

8. Shell Programming and Scripting

Split a file into multiple files based on the input pattern

I have a file with lines something like. ...... 123_start ...... ....... 123_end .... ..... 456_start ...... ..... 456_end .... ..... 789_start .... .... 789_end (6 Replies)
Discussion started by: abinash
6 Replies

9. Shell Programming and Scripting

Split large file based on last digit from a column

Hello, What's the best way to split a large into multiple files based on the last digit in the first column. input file: f 2738483300000x0y03772748378831x1y13478378358383x2y23743878383802x3y33787828282820x4y43748838383881x5y5 Desired Output: f0 3738483300000x0y03787828282820x4y4 f1... (9 Replies)
Discussion started by: alain.kazan
9 Replies

10. Shell Programming and Scripting

split large file based on field criteria

I have a file containing date/time sorted data of the form ... 2009/06/10,20:59:59.950,XAG/USD,Q,1,1115, 14.3025,100,1,1 2009/06/10,20:59:59.950,XAG/USD,Q,1,1116, 14.3026,125,1,1 2009/06/10,20:59:59.950,XAG/USD,R,0,0, , 0,0,0 2009/06/10,20:59:59.950,XAG/USD,R,1,0, 14.1910,100,1,1... (6 Replies)
Discussion started by: asriva
6 Replies
Login or Register to Ask a Question