split file by delimiter with csplit

# 1  
Old 07-31-2012

Hello,
I want to split a big file into smaller ones, each containing a certain number of sequences. I am aware this type of job has been asked quite often, but I am posting again because I came across csplit, which may offer a simpler way to solve the problem.
Input file (fasta format):
Code:
>seq1
agtcagtc
agtcagtc
ag
>seq2
agtcagtcagtc
agtcagtcagtc
agtcagtc
>seq3
agtcagtcagtcagtc
agtcagtcagtcagtc
agtcagtcagtcagtc
>seq4
agtcagtcagtcagtcagtcagtc
>seq5
agtcagtc
>seq6
agtcagtcagtcagtcagtcagtcagtcagtcagtc
agtcagtcagtcagtcagtcagtcagtcagtcagtc
agtcagtcagtcagtcagtcagtc

I want the output files grouped by sequence count (say 2, i.e. each output file will contain 2 sequences). Normally the input file can have hundreds of entries, with rows of different lengths and a different number of rows per entry, but always with the same ">" delimiter. Recent fastq files can have millions of entries (short reads, with "@" as the delimiter, for those not familiar with DNA sequence formats).
OUTFILE1
Code:
>seq1
agtcagtc
agtcagtc
ag
>seq2
agtcagtcagtc
agtcagtcagtc
agtcagtc

OUTFILE2
Code:
>seq3
agtcagtcagtcagtc
agtcagtcagtcagtc
agtcagtcagtcagtc
>seq4
agtcagtcagtcagtcagtcagtc

OUTFILE3
Code:
>seq5
agtcagtc
>seq6
agtcagtcagtcagtcagtcagtcagtcagtcagtc
agtcagtcagtcagtcagtcagtcagtcagtcagtc
agtcagtcagtcagtcagtcagtc

awk can do this job easily, and there are many Perl/BioPerl and Python/Biopython scripts for it as well, but it seems to me csplit could be simpler, as a single command.
I gave it a try like the following, but it did not work out correctly:
Code:
csplit -s -f --prefix=part INFILE.fasta ARG1 {ARG2}

The problem here is that ARG1 is a line count for csplit, but in my case each sequence may span a different number of lines, so the sequence count I want is not proportional to the line count. ARG2 is the number of output files, i.e. the repeat count for csplit. Thanks a bunch!
yifang
# 2  
Old 07-31-2012
Hi

Try this
Code:
csplit -s -f part INFILE.fasta  '/^>/' {*}

This should create lots of partxx files, one per sequence.
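One small caveat (assuming GNU csplit): because the first ">" is on line 1, the segment before the first match is empty, so an empty part00 is also created. GNU csplit's -z (--elide-empty-files) option drops it:
Code:
csplit -s -z -f part INFILE.fasta '/^>/' '{*}'
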
# 3  
Old 07-31-2012
Thanks Chirel!
I did not mean to create one file per sequence, but one file for every 500 (for example) sequences. What is the ARG for that?

# 4  
Old 08-01-2012
Hi.

Here is a demonstration of a gathering technique that might be useful here:
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate gather of lines, split into groups, expand.

pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }    # redefined as a no-op to silence the debug output
C=$HOME/bin/context && [ -f $C ] && $C awk tail split    # show environment and versions of awk, tail, split (local utility)

FILE=${1-data1}

pl " Input data file $FILE:"
cat $FILE

# Remove debris.
rm -f xa*

pl " Script \"gather\":"
cat gather

# Bunch lines of each sequence together on a single line.
./gather $FILE |
tail --lines=+2 |
split --lines=2

pl " Files xa created by split:"
ls xa*

# Expand file to temporary, replace original.
for file in xa*
do
awk '
	{ gsub(/=/,"\n") ; printf("%s",$0) }
' $file > t1
mv t1 $file
done

pl " Sample: file xab expanded:"
cat xab

exit 0

producing
Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
awk GNU Awk 3.1.5
tail (GNU coreutils) 6.10
split (GNU coreutils) 6.10

-----
 Input data file data1:
>seq1
agtcagtc
agtcagtc
ag
>seq2
agtcagtcagtc
agtcagtcagtc
agtcagtc
>seq3
agtcagtcagtcagtc
agtcagtcagtcagtc
agtcagtcagtcagtc
>seq4
agtcagtcagtcagtcagtcagtc
>seq5
agtcagtc
>seq6
agtcagtcagtcagtcagtcagtcagtcagtcagtc
agtcagtcagtcagtcagtcagtcagtcagtcagtc
agtcagtcagtcagtcagtcagtc

-----
 Script "gather":
#!/usr/bin/env bash

# @(#) gather	Substitution of newline.

FILE=${1-data1}

awk '
BEGIN	{ RS = ">" }
	{ gsub(/\n/,"=") ; printf("%s%s\n", RS,$0) }
' $FILE

exit 0

-----
 Files xa created by split:
xaa  xab  xac

-----
 Sample: file xab expanded:
>seq3
agtcagtcagtcagtc
agtcagtcagtcagtc
agtcagtcagtcagtc
>seq4
agtcagtcagtcagtcagtcagtc

The short awk script in the file gather collects the lines belonging to a sequence into a single "super line". The newlines are replaced by some character not present in the data; here I used "=".

Then the super lines are split into groups of 2.

Another awk script expands the super lines by replacing "=" with a real newline and re-writes the files.

Best wishes ... cheers, drl
# 5  
Old 08-01-2012
Quote:
Originally Posted by yifangt
Thanks Chirel!
I did not mean to create one file per sequence, but one file for every 500 (for example) sequences. What is the ARG for that?
I'm sorry, but it's not possible with csplit to cut every 500 sequences. It can split every N lines, but not every N sequences.

I think you should go for an awk, perl, sed, or other solution to your problem.

Like, for example, the answer from drl.
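Or, as a rough sketch of that idea (untested; the "group" prefix is just a placeholder, and n=2 would become n=500 in your case), an awk-only approach that starts a new output file at every n-th ">" header:
Code:
awk -v n=2 '
  /^>/ {
    if (seq++ % n == 0) {      # every n-th header ...
      if (out) close(out)      # ... close the previous output file
      out = sprintf("group%03d.fasta", ++part)
    }
  }
  { print > out }              # copy the current line into the current file
' INFILE.fasta

On the sample data with n=2 this should produce group001.fasta to group003.fasta, matching OUTFILE1 to OUTFILE3.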
# 6  
Old 08-01-2012
Thanks drl!
Your script seems quite different from the usual ones. Could you please explain it in more detail so I can learn from it? Thanks a lot!
yt
# 7  
Old 08-01-2012
Hi.

There is some preliminary coding to display my environment. The core of the solution is:
Code:
bunch input file into super lines -> split into separate files of n (=2 in this demo) -> expand the individual files

The bunching is done with an awk script, gather. Each "bunch" starts with (is separated by) RS=">", the Record Separator. All the newlines within a bunch are replaced with "=". That creates the "super lines".
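For the sample data, the gathered stream should look roughly like this (which is also why s1 pipes it through tail --lines=+2: the empty record before the first header produces a stray ">" line that has to be dropped):
Code:
>
>seq1=agtcagtc=agtcagtc=ag=
>seq2=agtcagtcagtc=agtcagtcagtc=agtcagtc=
>seq3=agtcagtcagtcagtc=agtcagtcagtcagtc=agtcagtcagtcagtc=
...
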

The split works on lines, so it does not know that there are really many lines packed into each super line. For each group of n (=2) super lines, split writes a new file: xaa, xab, etc.
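For the 500-sequences-per-file case you asked about, only the group size would change; presumably something like this (the part_ prefix is just a placeholder):
Code:
./gather INFILE.fasta | tail --lines=+2 | split --lines=500 - part_
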

For all those files xa*, replace "=" with newlines, and re-write the files.

Note that this is a demonstration of a technique. There are certainly other solutions.

Voilà! ... cheers, drl