11-13-2012
Combining 3 fastq files
Hello,
I am working with next-gen short-read sequence data, which we receive in 3 fastq files. These are arranged in 4-line groups for each read:
line1: read identifier, beginning with e.g. "@HWI-ST1342..."
line2: DNA sequence (101 characters in files 1 and 2, 7 characters in file 3)
line3: "+"
line4: quality score codes, the same length as line 2
There are ~160 million reads per file, so the files are quite big.
I need to combine the data from all three files. The reads are in the same order in every file and share the same read identifier across files. So for each read, what I need is:
line1: identifier
line2: File1sequenceFile2sequenceFile3sequence
line3: "+"
line4: File1qualFile2qualFile3qual
Can anyone suggest an efficient way of doing this with shell commands?
Thanks a lot!
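One possible approach, offered as a minimal sketch rather than a tested pipeline: paste joins the three files line by line with tabs, and awk then decides per line of each 4-line group what to print. The function name merge_fastq and the file names in the usage line are hypothetical. The sketch assumes the files are uncompressed, that all three have exactly the same number of lines in the same read order, and that no line contains a tab character. It does not verify that the identifiers actually match.

```shell
#!/bin/sh
# merge_fastq (hypothetical name): combine three FASTQ files read by read.
# Assumes identical read order in all files and no tab characters in any line.
merge_fastq() {
    # paste emits: file1_line <TAB> file2_line <TAB> file3_line
    paste "$1" "$2" "$3" | awk -F '\t' '
        NR % 4 == 1 { print $1; next }   # line 1 of a group: identifier, taken from file 1
        NR % 4 == 3 { print "+"; next }  # line 3 of a group: separator
        { print $1 $2 $3 }               # lines 2 and 4: concatenate sequence / quality
    '
}
```

It could be called as: merge_fastq file1.fastq file2.fastq file3.fastq > combined.fastq. Both paste and awk stream their input, so memory use should stay small regardless of file size; the run time is mostly bound by disk I/O.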