Modification of perl script to split a large file into chunks of 5000 characters
Post 303017075 by gimley on Tuesday 8th of May 2018 10:54:32 PM
Thanks a lot. Excuse my ignorance, but how many bytes do I allocate?
My data is in UTF-8, and I want each chunk to hold exactly 5000 characters. What byte size does that correspond to? In ASCII every character is just 1 byte, but in UTF-8 I find that the number of bytes per character varies.
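One way to sidestep the byte-size question is to count characters after decoding instead of counting bytes. In UTF-8 a single code point takes between 1 and 4 bytes, so 5000 characters can occupy anywhere from 5000 to 20000 bytes and no fixed byte size is guaranteed to work. Below is a minimal Python sketch of that idea. The Perl script from this thread is not shown here, so the function name `chunk_by_chars` and the sample data are illustrative assumptions, and "character" here means a Unicode code point.

```python
import io


def chunk_by_chars(stream, chunk_chars=5000):
    """Yield successive strings of at most chunk_chars characters.

    `stream` must be a text stream (already decoding UTF-8), so read(n)
    counts characters rather than bytes.
    """
    while True:
        chunk = stream.read(chunk_chars)
        if not chunk:  # empty string means end of input
            break
        yield chunk


# Hypothetical sample: 4 characters that encode to 1, 2, 3 and 4 bytes in UTF-8.
sample = "a\u00e9\u20ac\U0001F600" * 3000  # 12000 characters in total

# Simulate a UTF-8 file in memory; a real file would use
# open(path, encoding="utf-8") instead.
stream = io.TextIOWrapper(io.BytesIO(sample.encode("utf-8")), encoding="utf-8")

chunks = list(chunk_by_chars(stream))
for index, chunk in enumerate(chunks):
    # Character count is fixed per chunk; byte count depends on the content.
    print(index, len(chunk), len(chunk.encode("utf-8")))
```

With this approach the chunk size is given in characters and the decoder works out how many bytes each chunk needs, so no byte count has to be allocated up front.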
 

eucset(1)						      General Commands Manual							 eucset(1)

NAME
eucset - set and get code widths for ldterm

SYNOPSIS
HP15-codeset] or or or or [cswidth] ]

DESCRIPTION
The command sets or gets (reports) the encoding and display widths of the Extended UNIX Code (EUC), UCS Transformation Format (UTF8), or GB18030 characters processed by the current input terminal. EUC is an encoding method for codesets composed of single or multiple bytes. EUC permits applications and the terminal hardware to use the 7-bit US ASCII code and up to three single byte or multibyte codesets simultaneously. ldterm is a STREAMS terminal line discipline module which obtains codeset information from See ldterm(7).

The cswidth value defines the character widths for codesets. If cswidth is not implicitly or explicitly defined by passing no argument to the command, the cswidth value is determined by the following criteria in descending priority:

1. Use the cswidth value stored in the current locale, if defined.
2. Use predefined cswidth values if the codeset name defined in the locale is GB18030, UTF8, or one of the four HP15 codesets.
3. Use the environment variable if defined and in the correct format.
4. Use 7-bit US ASCII as the default codeset and its cswidth value.

This command must be used to specify EUC or non-EUC codesets, whether they are single byte or multibyte. However, the command can correctly set the cswidth parameter without using any options in most cases except for ASIAN_UTF8. See the section for special warnings on the values of the cswidth argument. For the GB18030, ASIAN_UTF8, or UTF8 setting, use the option.

Options
The command recognizes the following options and arguments:

Displays the current settings of the EUC character widths for the terminal.
Sets the width to one of the four HP15 codesets, or or The HP15 codesets supported are and
cswidth Defines the character widths for codesets 1 through 3. See the section in this manpage for more information.

EUC Code Set Classes
EUC divides codesets into four classes. Each codeset has two characteristics: the number of bytes for encoding the characters in the codeset, and the number of display columns to display the characters in the codeset. All characters within a codeset possess the same characteristics. ASIAN_UTF8 is used for setting double width display, and UTF8 is used for single width.

o Codeset 0 consists of all 7-bit, single byte ASCII characters. The most significant bit of each of these characters is 0 (zero). Characters in codeset 0 require one byte for encoding, and occupy one display column. These values are fixed for codeset 0 (zero). The 7-bit US ASCII code is the primary EUC codeset, which is available to users without direct specification.
o Codeset 1 is a supplementary EUC codeset. Codeset 1 characters have an initial byte whose most significant bit is 1. Characters in codeset 1 may require more than one byte for encoding, and may require more than one display column. The command must be used to set the characteristics for codeset 1.
o Codesets 2 and 3 are supplementary EUC codesets. Characters in these codesets have an initial byte of SS2 or SS3, respectively. They require more than one byte for encoding, and may require more than one display column. The command must be used to set the characteristics for codesets 2 and 3.

The cswidth argument in the command line is a character string that describes the character widths for codesets 1 through 3. This command does not allow the user to modify the settings for codeset 0. The character string is of the following format:

X1[:Y1],X2[:Y2],X3[:Y3]

X1 The number of bytes required to encode a character in codeset class 1.
Y1 The number of display columns needed to display characters in this class.
X2 The number of bytes required to encode a character in codeset 2, not counting the SS2 byte.
Y2 The number of display columns for codeset 2 characters.
X3 The number of bytes needed to encode characters in codeset 3, not counting the SS3 byte.
Y3 The number of display columns required for these characters.

The values for the column widths may be omitted if they are equal to the number of encoding bytes. If the encoding value of any of the EUC codesets is set to (zero), then the codeset does not exist. See the section for special warnings on the values of the cswidth argument.

If no cswidth argument is supplied, the command uses the value of the environment variable. If this variable is not present, the following default string is substituted: This default string designates that the environment uses a single byte EUC codeset that has characters in the EUC codeset 1 format. If the environment uses a multibyte EUC codeset in the codeset 1 format, single byte or multibyte EUC codesets in the codeset 2 or 3 format, or both, the default setting cannot be used.

EXTERNAL INFLUENCES
Environment Variables
Provide a default value for the internationalization variables that are unset or null. If is not specified or is set to the empty string, a default of (see lang(5)) is used instead of If any of the internationalization variables contain an invalid setting, behaves as if all internationalization variables are set to See environ(5).
If set to a nonempty string value, override the values of all other internationalization variables.
Determines the locale that should be used to affect the format and contents of diagnostic messages written to standard error and informative messages written to standard output.
Determines the location of message catalogs for the processing of

EXAMPLES
To display the encoding and display widths for the EUC codesets 1 to 3 in your environment, enter: Assuming has been previously used to set for the entry generates the following: To change the current settings of the encoding and display widths for the EUC characters in codesets 1 and 2 to two bytes each, enter one of the following: To set the encoding and display widths for the EUC characters in the locale enter: For enter: For enter: To set the code width to that of enter: To set the code width to that of enter: To set the code width to that of enter:

WARNINGS
The cswidth argument does not include the SS2 or SS3 bytes in the byte width values. This command is not specified by standards, may not be available on other vendor's systems, and may be subject to change or obsolescence in a future release.

AUTHOR
was developed by OSF and HP.

SEE ALSO
dtterm(1), ldterm(7).

eucset(1)
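
The cswidth string format given in the DESCRIPTION, X1[:Y1],X2[:Y2],X3[:Y3], can be sketched as a small parser. This is an illustrative Python sketch and not part of eucset: the function name `parse_cswidth` and the sample string are assumptions, and the sketch assumes all three comma-separated fields are present. It applies the rule stated in the manpage that an omitted column width equals the number of encoding bytes.

```python
from typing import List, Tuple


def parse_cswidth(cswidth: str) -> List[Tuple[int, int]]:
    """Parse 'X1[:Y1],X2[:Y2],X3[:Y3]' into (bytes, columns) pairs.

    The returned list holds one pair per codeset, for codesets 1 through 3.
    A byte width of 0 means the codeset does not exist.
    """
    widths = []
    for field in cswidth.split(","):
        if ":" in field:
            byte_width, column_width = field.split(":", 1)
            widths.append((int(byte_width), int(column_width)))
        else:
            # Column width omitted: it equals the number of encoding bytes.
            n = int(field)
            widths.append((n, n))
    if len(widths) != 3:
        raise ValueError("expected three comma-separated fields, got %d" % len(widths))
    return widths


# Hypothetical sample string, chosen only to exercise each case.
example = parse_cswidth("2:2,3,0")
print(example)  # [(2, 2), (3, 3), (0, 0)]
```

The SS2 and SS3 bytes are not counted in these values, so the byte widths for codesets 2 and 3 describe only the bytes that follow the shift byte.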