Modification of perl script to split a large file into chunks of 5000 characters. Post 303017077 by jim mcnamara on Tuesday 8th of May 2018 11:46:03 PM
That may also be why your perl script has issues. UTF-8 can encode all 1,112,064 valid Unicode code points, so a single UTF-8 character may occupy 8, 16, 24, or 32 bits (one to four bytes).

Fixing the perl script requires an understanding of wide characters, which are, loosely speaking, a locale-based "datatype". Help is here:
Perl Programming/Unicode UTF-8 - Wikibooks, open books for an open world

Recent Linux awk (version 4.2 onward) splits UTF-8 encoded records into fields using wide characters. In perl, -CSD makes the standard streams and opened files UTF-8, and -a together with -F forces each line to be split on the given pattern and placed in the @F array. Here is a perl sample and an awk sample that do the same thing on UTF-8 files.
Code:
perl -CSD -aF'\N{U+1f4a9}' -nle 'print $F[0]' somefile.txt  # $F[0] is the same as awk's $1 variable

awk -F$'\U0001f4a9' '{print $1}' somefile.txt  # or $'\u007c' for 4-digit code points

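In both samples the U+1F4A9 code point is the field delimiter. The $'...' quoting in the awk sample is expanded by the shell (bash), not by awk. All of this is explained in the link.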

Perl::Critic::Policy::InputOutput::RequireEncodingWithUTF8Layer(3pm)  User Contributed Perl Documentation  Perl::Critic::Policy::InputOutput::RequireEncodingWithUTF8Layer(3pm)

NAME
       Perl::Critic::Policy::InputOutput::RequireEncodingWithUTF8Layer - Write "open $fh, q{<:encoding(UTF-8)}, $filename;" instead of "open $fh, q{<:utf8}, $filename;".

AFFILIATION
       This Policy is part of the core Perl::Critic distribution.

DESCRIPTION
       Use of the ":utf8" I/O layer (as opposed to ":encoding(UTF8)" or ":encoding(UTF-8)") was suggested in the Perl documentation up to version 5.8.8. This may be OK for output, but on input ":utf8" does not validate the input, leading to unexpected results.

       An exploit based on this behavior of ":utf8" is exhibited on PerlMonks at <http://www.perlmonks.org/?node_id=644786>. The exploit involves a string read from an external file and sanitized with "m/^(\w+)$/", where $1 nonetheless ends up containing shell meta-characters.

       To summarize:

           open $fh, '<:utf8', 'foo.txt';             # BAD
           open $fh, '<:encoding(UTF8)', 'foo.txt';   # GOOD
           open $fh, '<:encoding(UTF-8)', 'foo.txt';  # BETTER

       See the Encode documentation for the difference between "UTF8" and "UTF-8". The short version is that "UTF-8" implements the Unicode standard, and "UTF8" is liberalized.

       For consistency's sake, this policy checks files opened for output as well as input. For complete coverage it also checks "binmode()" calls, where the direction of the operation can not be determined.

CONFIGURATION
       This Policy is not configurable except for the standard options.

NOTES
       Because "Perl::Critic" does a static analysis, this policy can not detect cases like

           my $encoding = ':utf8';
           binmode $fh, $encoding;

       where the encoding is computed.

SEE ALSO
       PerlIO
       Encode
       "perldoc -f binmode"
       <http://www.socialtext.net/perl5/index.cgi?the_utf8_perlio_layer>
       <http://www.perlmonks.org/?node_id=644786>

AUTHOR
       Thomas R. Wyant, III wyant at cpan dot org

COPYRIGHT
       Copyright (c) 2010-2011 Thomas R. Wyant, III

       This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

perl v5.14.2                                    2012-06-07                 Perl::Critic::Policy::InputOutput::RequireEncodingWithUTF8Layer(3pm)