Hello,
I want to split a big file into smaller ones by a certain "count". I am aware this kind of job has been asked about quite often, but I am posting again because I came across csplit, which may solve the problem more simply.
Input file (fasta format):
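(hypothetical sample; real headers and sequence lengths vary)

    >seq1
    ATGCATGC
    ATGC
    >seq2
    GATTACA
    >seq3
    TTTT
    CCCC
    >seq4
    ACGTACGT
    >seq5
    GGCC
    AATT
    TT
    >seq6
    CAGT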
I want the output split by sequence count (say 2, i.e. each output file will contain 2 sequences). Normally the input file can hold hundreds of entries, with rows of varying length and a varying number of rows per entry, but always with the same ">" delimiter. Recent fastq files can hold millions of entries (short reads, with "@" as the delimiter, for anyone not familiar with DNA sequence formats).
OUTFILE1 (seq1 and seq2)
OUTFILE2 (seq3 and seq4)
OUTFILE3 (seq5 and seq6)
AWK can do this job easily, and there are also many perl/bioperl and python/biopython scripts for it. It seems to me csplit could be simpler, as a single command.
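For reference, a direct awk sketch might look like this (hypothetical; n is the sequences-per-file count, OUTFILE matches the naming above, and the file is assumed to start with a ">" header):

    # open a new output file on every n-th header line,
    # then send every line to the current output file
    awk -v n=2 '/^>/ { if (c++ % n == 0) f = "OUTFILE" (++i) }
                { print > f }' input.fasta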
I gave it a few tries like the following, but they did not work out correctly:
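For example (hypothetical invocations; the numbers stand in for the ARG1/ARG2 discussed below):

    csplit -k input.fasta 9 '{30}'        # ARG1 = 9 lines per piece, ARG2 = 30 repeats
    csplit -z input.fasta '/^>/' '{*}'    # GNU csplit: one sequence per file, no way to group by 2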
The problem here is that ARG1 is a number of lines for csplit, but in my case each sequence may span a different number of lines, so the count I want is not proportional to any line number. ARG2 is the number of output files, i.e. the repeat count for csplit. Thanks a bunch!
yifang
Hi.
Here is a demonstration of a gathering technique that might be useful here:
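A minimal sketch of the technique (assuming the input file is named input.fasta and that "=" never occurs in the data):

    # gather -- one super line per sequence, "=" replacing the newlines;
    # drl keeps this awk in a separate script file named gather
    awk 'BEGIN { RS = ">" }
         NR > 1 { gsub("\n", "="); printf(">%s\n", $0) }' input.fasta > superlines

    # split the super lines into groups of 2 -> xaa, xab, ...
    split -l 2 superlines

    # expand -- turn "=" back into real newlines and re-write each file
    for f in xa*
    do
        awk '{ gsub("=", "\n"); printf("%s", $0) }' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done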
producing output files xaa, xab, ..., each holding two sequences.
The short awk script in the file gather collects the lines belonging to one sequence into a single super line. The newlines are replaced by some character not present in the data; here I used "=".
Then the super lines are split into groups of 2.
Another awk script expands the super lines by replacing "=" with a real newline and re-writes the files.
Thanks drl!
Your script seems quite different from the usual ones. Could you please explain it in more detail, for my education? Thanks a lot!
yt
Hi.
There is some preliminary code to display my environment. The core of the solution works as follows.
The bunching is done with an awk script, gather. Each "bunch" starts with (is separated by) RS=">", the Record Separator. All the newlines within a bunch are replaced with "=". That creates the "super lines".
The split works on lines, so it does not know that a super line really holds many lines. For each group of n (here 2) super lines, split writes a new file: xaa, xab, etc.
For all those files xa*, "=" is replaced with newlines, and the files are re-written.
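As a concrete walk-through, the hypothetical six-sequence sample from the first post would gather into a super-line file like:

    >seq1=ATGCATGC=ATGC=
    >seq2=GATTACA=
    >seq3=TTTT=CCCC=
    >seq4=ACGTACGT=
    >seq5=GGCC=AATT=TT=
    >seq6=CAGT=

split -l 2 then puts the first two super lines into xaa, the next two into xab, and the last two into xac. After the expand step, xaa reads:

    >seq1
    ATGCATGC
    ATGC
    >seq2
    GATTACA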
Note that this is a demonstration of a technique. There are certainly other solutions.