Modification of Perl script to split a large file into chunks of 5000 characters


 
# 1  
Old 05-08-2018

I have a Perl script which splits a large file into chunks. The script is given below:
Code:
use strict;
use warnings;
open (FH, "<monolingual.txt") or die "Could not open source file. $!";
my $i = 0;
while (1) {
    my $chunk;
	print "process part $i\n";
	open(OUT, ">part$i.log") or die "Could not open destination file";
	$i ++;
	if (!eof(FH)) {
		read(FH, $chunk, 5000);
		print OUT $chunk;
	} 
	if (!eof(FH)) {
		$chunk = <FH>;
		print OUT $chunk;
	}
	close(OUT);
	last if eof(FH);
}

I want the script to create chunks of 5000 characters or a bit less, but not more than that.
How do I modify the chunk size to ensure that each chunk is at most 5000 characters? When I run it, some chunks are more than 5000 characters.
Many thanks for your kind help.
# 2  
Old 05-08-2018
As an aside, there is a split command that does exactly what you ask.

Code:
split -b [size in bytes] infile [option control outfile naming]

Linux man page:

split(1) - Linux manual page
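
To make the synopsis above concrete, here is a minimal sketch. It assumes GNU split; sample.txt and the part_ prefixes are made-up names for illustration. Both options count bytes, not characters.

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)"

# Build a small sample file (the numbers 1 to 3000, one per line).
seq 1 3000 > sample.txt

# -b cuts strictly by byte count and may cut in the middle of a line.
split -b 5000 sample.txt part_bytes_

# -C (GNU split) keeps whole lines together, at most 5000 bytes per piece.
split -C 5000 sample.txt part_lines_

# Show the size in bytes of every piece.
wc -c part_bytes_* part_lines_*
```

The second form is probably closer to "5000 or a bit less, but not more", as long as byte counts are what you care about.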
This User Gave Thanks to jim mcnamara For This Post:
# 3  
Old 05-08-2018
Thanks a lot. Excuse my ignorance, but how many bytes do I allocate?
My data is in UTF-8, and I want each chunk to hold 5000 characters. What would the byte size be? In ASCII it would be just 1 byte per character, but in UTF-8 I find that the byte size varies.
# 4  
Old 05-09-2018
That may also be why your Perl script has issues. UTF-8 can encode all 1,112,064 valid Unicode code points, so a single UTF-8 character may take 8, 16, 24, or 32 bits (1 to 4 bytes).

Fixing the Perl script requires an understanding of wide characters, which are a locale-based "datatype", sort of. Help is here:
Perl Programming/Unicode UTF-8 - Wikibooks, open books for an open world

Recent Linux awk (version 4.2 onward) splits UTF-8 encoded records into fields using wide characters. In perl, -a forces the split to be made and the fields placed in the @F array. Here are a perl sample and an awk sample that do the same thing on UTF-8 files.
Code:
perl -CSD -aF'\N{U+1f4a9}' -nle 'print $F[0]' somefile.txt  # $F[0] is the same as awk's $1 variable

awk -F$'\U0001f4a9' '{print $1}' somefile.txt  # or $'\u007c' for 4-digit code points

The code point is the field delimiter. All of this is explained in the link.
This User Gave Thanks to jim mcnamara For This Post:
# 5  
Old 05-09-2018
Thanks a lot for your kind help. I now understand why my Perl script goofed up as well.

---------- Post updated 05-09-18 at 01:45 AM ---------- Previous update was 05-08-18 at 10:49 PM ----------

Hello,
I found an easier method which accommodates words. I am posting it in case someone runs into a similar problem.
Code:
csplit filename /([\w.,;]+\s+){5000}/

I set it for 5000 words but it can be set for any number.