Sponsored Content
Top Forums Shell Programming and Scripting split file by delimiter with csplit Post 302679725 by yifangt on Tuesday 31st of July 2012 01:54:56 PM
Old 07-31-2012
split file by delimiter with csplit

Hello,
I want to split a big file into smaller ones with certain "counts". I am aware this type of job has been asked quite often, but I posted again when I came to csplit, which may be simpler to solve the problem.
Input file (fasta format):
Code:
>seq1
agtcagtc
agtcagtc
ag
>seq2
agtcagtcagtc
agtcagtcagtc
agtcagtc
>seq3
agtcagtcagtcagtc
agtcagtcagtcagtc
agtcagtcagtcagtc
>seq4
agtcagtcagtcagtcagtcagtc
>seq5
agtcagtc
>seq6
agtcagtcagtcagtcagtcagtcagtcagtcagtc
agtcagtcagtcagtcagtcagtcagtcagtcagtc
agtcagtcagtcagtcagtcagtc

I want output file by sequence counts (say 2, i.e. each output file will contain 2 sequences). Normally the input file can be hundreds of entries with different length of each row and different number of rows, but with the same ">" delimiter for sure. Recent fastq file can be millions of entries (or short reads, but with @ as the delimiter, if not familiar with DNA sequence).
OUTFILE1
Code:
>seq1
agtcagtc
agtcagtc
ag
>seq2
agtcagtcagtc
agtcagtcagtc
agtcagtc

OUTFILE2
Code:
>seq3
agtcagtcagtcagtc
agtcagtcagtcagtc
agtcagtcagtcagtc
>seq4
agtcagtcagtcagtcagtcagtc

OUTFILE3
Code:
>seq5
agtcagtc
>seq6
agtcagtcagtcagtcagtcagtcagtcagtcagtc
agtcagtcagtcagtcagtcagtcagtcagtcagtc
agtcagtcagtcagtcagtcagtc

AWK can do this job easily, also there are many perl/bioperl, python/biopython to do this job. It seems to me csplit could be a simpler way as a single command.
I gave it tries like following, but did not work out correctly:
Code:
csplit -s -f --prefix=part INFILE.fasta ARG1 {ARG2}

The problem here is ARG1 is the number of lines for csplit, but my case is each sequence may have different lines so that the counts I want is not proportional to the line number. ARG2 is number of output files, or repeat times for csplit. Thanks a bunch!
yifang
 

9 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

split string with multibyte delimiter

Hi, I need to split a string, either using awk or cut or basic unix commands (no programming) , with a multibyte charectar as a delimeter. Ex: abcd-efgh-ijkl split by -efgh- to get two segments abcd & ijkl Is it possible? Thanks A.H.S (1 Reply)
Discussion started by: azmathshaikh
1 Replies

2. UNIX for Dummies Questions & Answers

Split files using Csplit

I have an excel file with more than 65K records... Since excel does not take more than 65K records i wan to split the file and send it as two excel files... Could some help me how to use the csplit by specifiying the no of records (7 Replies)
Discussion started by: savitha
7 Replies

3. UNIX for Dummies Questions & Answers

Split a file with no pattern -- Split, Csplit, Awk

I have gone through all the threads in the forum and tested out different things. I am trying to split a 3GB file into multiple files. Some files are even larger than this. For example: split -l 3000000 filename.txt This is very slow and it splits the file with 3 million records in each... (10 Replies)
Discussion started by: madhunk
10 Replies

4. Shell Programming and Scripting

How to split a string with no delimiter

Hi; I want to write a shell script that will split a string with no delimiter. Basically the script will read a line from a file. For example the line it read from the file contains: 99234523 These values are never the same but the length will always be 8. How do i split this... (8 Replies)
Discussion started by: saint34
8 Replies

5. Shell Programming and Scripting

Help- counting delimiter in a huge file and split data into 2 files

I’m new to Linux script and not sure how to filter out bad records from huge flat files (over 1.3GB each). The delimiter is a semi colon “;” Here is the sample of 5 lines in the file: Name1;phone1;address1;city1;state1;zipcode1 Name2;phone2;address2;city2;state2;zipcode2;comment... (7 Replies)
Discussion started by: lv99
7 Replies

6. Shell Programming and Scripting

Split file into multiple files using delimiter

Hi, I have a file which has many URLs delimited by space. Now i want them to move to separate files each one holding 10 URLs per file. http://3276.e-printphoto.co.uk/guardian http://abdera.apache.org/ http://abdera.apache.org/docs/api/index.html I have used the below code to arrange... (6 Replies)
Discussion started by: vel4ever
6 Replies

7. Shell Programming and Scripting

How to target certain delimiter to split text file?

Hi, all. I have an input file. I would like to generate 3 types of output files. Input: LG10_PM_map_19_LEnd_1000560 LG10_PM_map_6-1_27101856 LG10_PM_map_71_REnd_20597718 LG12_PM_map_5_chr_118419232 LG13_PM_map_121_24341052 LG14_PM_1a_456799 LG1_MM_scf_5a_opt_abc_9029993 ... (5 Replies)
Discussion started by: huiyee1
5 Replies

8. UNIX for Advanced & Expert Users

How to split large file with different record delimiter?

Hi, I have received a file which is 20 GB. We would like to split the file into 4 equal parts and process it to avoid memory issues. If the record delimiter is unix new line, I could use split command either with option l or b. The problem is that the line terminator is |##| How to use... (5 Replies)
Discussion started by: Ravi.K
5 Replies

9. UNIX for Beginners Questions & Answers

Shell script to Split matrix file with delimiter into multiple files

I have a large semicolon delimited file with thousands of columns and many thousands of line. It looks like: ID1;ID2;ID3;ID4;A_1;B_1;C_1;A_2;B_2;C_2;A_3;B_3;C_3 AA;ax;ay;az;01;02;03;04;05;06;07;08;09 BB;bx;by;bz;03;05;33;44;15;26;27;08;09 I want to split this table in to multiple files: ... (1 Reply)
Discussion started by: trymega
1 Replies
csplit(1)							   User Commands							 csplit(1)

NAME
csplit - split files based on context SYNOPSIS
csplit [-ks] [-f prefix] [-n number] file arg1... argn DESCRIPTION
The csplit utility reads the file named by the file operand, writes all or part of that file into other files as directed by the arg oper- ands, and writes the sizes of the files. OPTIONS
The following options are supported: -f prefix Names the created files prefix00, prefix01, ..., prefixn. The default is xx00 ... xxn. If the prefix argument would create a file name exceeding 14 bytes, an error results. In that case, csplit exits with a diagnostic message and no files are created. -k Leaves previously created files intact. By default, csplit removes created files if an error occurs. -n number Uses number decimal digits to form filenames for the file pieces. The default is 2. -s Suppresses the output of file size messages. OPERANDS
The following operands are supported: file The path name of a text file to be split. If file is -, the standard input will be used. The operands arg1 ... argn can be a combination of the following: /rexp/[offset] Create a file using the content of the lines from the current line up to, but not including, the line that results from the evaluation of the regular expression with offset, if any, applied. The regular expression rexp must follow the rules for basic regular expressions. Regular expressions can include the use of '/' and '\%'. These forms must be properly quoted with single quotes, since "" is special to the shell. The optional offset must be a positive or negative integer value representing a number of lines. The integer value must be preceded by + or -. If the selection of lines from an offset expression of this type would create a file with zero lines, or one with greater than the number of lines left in the input file, the results are unspecified. After the section is created, the current line will be set to the line that results from the evaluation of the regular expression with any offset applied. The pattern match of rexp always is applied from the current line to the end of the file. %rexp%[offset] This operand is the same as /rexp/[offset], except that no file will be created for the selected section of the input file. line_no Create a file from the current line up to (but not including) the line number line_no. Lines in the file will be numbered starting at one. The current line becomes line_no. {num} Repeat operand. This operand can follow any of the operands described previously. If it follows a rexp type operand, that operand will be applied num more times. If it follows a line_no operand, the file will be split every line_no lines, num times, from that point. An error will be reported if an operand does not reference a line between the current position and the end of the file. USAGE
See largefile(5) for the description of the behavior of csplit when encountering files greater than or equal to 2 Gbyte (2^31 bytes). EXAMPLES
Example 1 Splitting and combining files This example creates four files, cobol00...cobol03. example% csplit -f cobol filename '/procedure division/' /par5./ /par16./ After editing the split files, they can be recombined as follows: example% cat cobol0[0-3] > filename This example overwrites the original file. Example 2 Splitting a file into equal parts This example splits the file at every 100 lines, up to 10,000 lines. The -k option causes the created files to be retained if there are less than 10,000 lines; however, an error message would still be printed. example% csplit -k filename 100 {99} Example 3 Creating a file for separate C routines If prog.c follows the normal C coding convention (the last line of a routine consists only of a } in the first character position), this example creates a file for each separate C routine (up to 21) in prog.c. example% csplit -k prog.c '%main(%' '/^}/+1' {20} ENVIRONMENT VARIABLES
See environ(5) for descriptions of the following environment variables that affect the execution of csplit: LANG, LC_ALL, LC_COLLATE, LC_CTYPE, LC_MESSAGES, and NLSPATH. EXIT STATUS
The following exit values are returned: 0 Successful completion. >0 An error occurred. ATTRIBUTES
See attributes(5) for descriptions of the following attributes: +-----------------------------+-----------------------------+ | ATTRIBUTE TYPE | ATTRIBUTE VALUE | +-----------------------------+-----------------------------+ |Availability |SUNWesu | +-----------------------------+-----------------------------+ |CSI |Enabled | +-----------------------------+-----------------------------+ |Interface Stability |Standard | +-----------------------------+-----------------------------+ SEE ALSO
sed(1), split(1), attributes(5), environ(5), largefile(5), standards(5) DIAGNOSTICS
The diagnostic messages are self-explanatory, except for the following: arg - out of range The given argument did not reference a line between the current position and the end of the file. SunOS 5.11 4 Dec 2003 csplit(1)
All times are GMT -4. The time now is 01:40 PM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy