Thanks,
Tomorrow I will try and let you know if I need more help.
---------- Post updated at 10:47 PM ---------- Previous update was at 10:42 AM ----------
Hi Corona,
I tried your code and it is working fine. Thank you so much, but I have a couple of doubts which I wanted to share with you.
After running the above code I checked the data inside the files using the more and cat commands. It comes out as expected, but when I open these files in Notepad on a Windows system, all records appear on one line.
Notepad view of the file:
Output of the more command:
Do we have to insert a newline for each record?
Also, can you comment on the performance of this command on 2 million records?
Regards
Ibrar Ahmad
Last edited by Don Cragun; 05-07-2015 at 01:09 AM..
Reason: Change ICODE tags to CODE tags for multi-line data.
*nix and Windows are two seriously different systems. One of the differences is how line ends are designated in text files: *nix uses <NL> (= <newline>, 0x0A, \n), while Windows uses the combination of <CR> (= <carriage return>, 0x0D, \r) followed by <NL>.
So you should stay on one system. If working on both is unavoidable, you'll need to take extra care to convert the files. You can use conversion tools like unix2dos, iconv, or recode, or you can print the \r explicitly in the above awk solution.
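Since the original command isn't quoted in this post, here is a minimal sketch of printing the \r explicitly; the use of the first field as the file-name key (and the outfile_..._ naming) is an assumption, not necessarily what the earlier code did:

```shell
# Create a tiny sample input: key field first, then data.
printf 'A first record\nB second record\nA third record\n' > input.txt

# Append a carriage return before awk's own newline, so Windows
# editors like Notepad see proper CRLF line endings.
awk '{ print $0 "\r" > ("outfile_" $1 "_") }' input.txt
```

Alternatively, write plain \n-terminated files and run unix2dos on them afterwards, if that tool is installed on your system.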
I don't have a file with 2 million records at hand, so it's difficult to predict, especially on a Windows system. Still, I guess it should run in a few seconds unless many files have to be opened and closed again and again.
Using:
could have a problem on some versions of awk because the precedence between concatenation of strings and the output redirection operator in print statements is not specified by the standards. Some implementations of awk will treat this code as:
(possibly giving a syntax error, and certainly not producing the output files you want) and others will treat it as:
Since it is working on your system, we can assume that your version of awk is doing the latter.
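The code fences from this post were lost in this excerpt, but the ambiguity being described can be sketched as follows (field and file names here are illustrative assumptions):

```shell
# Ambiguous form: the standard does not specify whether string
# concatenation or the > redirection binds more tightly in print:
#     awk '{ print $0 > "outfile_" $1 "_" }'
# Some awks parse it as:  ( print $0 > "outfile_" ) $1 "_"   # wrong files / syntax error
# Others parse it as:       print $0 > ( "outfile_" $1 "_" ) # the intended split
# Parenthesizing the file-name expression makes it unambiguous everywhere:
printf 'A one\nB two\n' > input.txt
awk '{ print $0 > ("outfile_" $1 "_") }' input.txt
```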
This code doesn't close and reopen files, so, if it doesn't give an error (too many open files) it should run pretty quickly. It might run slightly faster if you move appending the "_" out of the loop and just do it once instead of two million times:
As was noted before, if this fails on your two-million-record file with a "too many open files" error, you'll have to build in code to close and reopen files. Opening and closing files for each output line will run considerably slower. Doing anything smarter than that would require you to evaluate the input file: are lines directed to the same output file closely grouped in the input? Are there common runs of adjacent lines going to the same output file? That information can be used to make smarter decisions about when to close a file and when (if ever) a file needs to be reopened.
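For reference, the slow-but-safe variant mentioned above (close after every line, reopen in append mode) might look like this; the key field and file names are assumptions:

```shell
rm -f outfile_A_ outfile_B_           # start clean, since >> appends
printf 'A one\nB two\nA three\n' > input.txt

awk '{
    f = "outfile_" $1 "_"
    print $0 >> f    # append, so reopening does not truncate earlier lines
    close(f)         # never more than one output file open at a time
}' input.txt
```

This never hits the open-file limit, at the cost of one open/close pair per input line.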
Since you couldn't even guess at how many output files would be produced from your input file, I assume you have not tried to evaluate any of the above questions that might help produce more efficient code if you do have to close and reopen files.
Hi.
In general, for collecting related lines, one could do an initial stable sort on the field of interest; then, while reading with subsequent (say, awk) code, close the previous file and open a new one whenever the field content changes.
Obviously this touches the data an extra time (at least), but it is conceptually simple and would never run into the limit for open files. Assuming a pipe connection, no extra files (other than any created by the sort) are produced.
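A sketch of that sort-then-split pipeline, assuming the first field is the key being grouped on:

```shell
printf 'B two\nA one\nA three\n' > input.txt

# Stable sort on field 1, then each output file is opened exactly once;
# sorted input guarantees a closed key never reappears.
sort -s -k1,1 input.txt |
awk '{
    f = "outfile_" $1 "_"
    if (f != prev) {              # key changed: finish the previous file
        if (prev != "") close(prev)
        prev = f
    }
    print > f
}'
```

Only one output file is ever open at a time, and the relative order of lines within each key is preserved by the stable sort.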