

Illumina reads remove duplicate...


 
# 8  
Old 10-13-2009
Based on your sample, but not tested at large scale:
Code:
awk 'NR>1 && !a[$2]++ {printf "@%s", $0}' FS="\n" RS="@" in_file > out_file

Use gawk, nawk or /usr/xpg4/bin/awk on Solaris.
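For a quick sanity check of the record-per-"@" idea, here is a tiny made-up file (file names and reads are hypothetical; this is a sketch only — note that RS="@" assumes '@' never occurs inside a quality string, which Phred-encoded qualities do not guarantee):

```shell
# Two 4-line records sharing the same sequence; the second should be dropped.
printf '%s\n' '@r1' 'ACGT' '+r1' 'hhhh' \
              '@r2' 'ACGT' '+r2' 'iiii' > in_file

# NR>1 skips the empty record before the first '@';
# !a[$2]++ keeps only the first record per sequence (field 2).
awk 'NR>1 && !a[$2]++ {printf "@%s", $0}' FS="\n" RS="@" in_file > out_file
cat out_file
```

Only the @r1 record should survive.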
# 9  
Old 10-13-2009
For files with more than 7 million rows, it is preferable not to store all the values in memory; in production, if enough memory is not available, the script will get killed.

Assuming every record looks like this:
Code:
record 1
@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAAAA
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
record 2
@HWI-ABC555_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGGGG
+HWI-ABC123_30DFGGDA:1:100:3:1234
]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhhhh

Can you clarify: do you want to drop a record that duplicates just the header (@HWI-ABC123_30DFGGDA), or only one that duplicates the whole record?

If this is still a requirement, leave a note.

Last edited by Scott; 01-11-2010 at 09:02 PM.. Reason: Added code tags
# 10  
Old 10-13-2009
Hi daptal,

At this stage, I would prefer to keep the first occurrence of each unique nucleotide sequence as my "unique" read, together with all of that record's contents.

For example:
Code:
@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAA  #First occurrence, keep
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC555_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGG #First occurrence, keep
+HWI-ABC555_30DFGGDA:1:100:3:1234
]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC467_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAA #Second occurrence, discard
+HWI-ABC467_30DFGGDA:1:100:3:1234
hhhHHHHhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC889_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGG #Second occurrence, discard
+HWI-ABC889_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhh]]]]]]]]
...

I got a long list of Illumina reads. My desired output is like this:
Code:
@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAA
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC555_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGG
+HWI-ABC555_30DFGGDA:1:100:3:1234
]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhh

Thanks again for your help!

# 11  
Old 10-13-2009
@patrick87,

Can you give more detail about your input and output?

I couldn't understand the output (I can see only the first 8 lines of the input in the output); maybe I'm just not grasping it.

That would make it easier to give a solution.
# 12  
Old 10-13-2009
Code:
#!/usr/bin/perl
use strict;
use warnings;

# '|| die' binds to the filename, so the error check never fires; use 'or die'.
open my $fh, '<', 'abc.txt' or die "$!";

my @tmp_arr;    # lines of the record currently being collected
my %hash;       # sequences (line 2) already printed
while (my $line = <$fh>) {
        chomp $line;
        if (@tmp_arr && $line =~ m/^@/) {
                # A new header starts: flush the previous record
                # unless its sequence has already been seen.
                unless (exists $hash{$tmp_arr[1]}) {
                        $hash{$tmp_arr[1]} = 1;
                        print "$_\n" for @tmp_arr;
                }
                @tmp_arr = ();
        }
        push @tmp_arr, $line;
}
# Flush the last record; the original test here was inverted
# (it printed the final record only when it WAS a duplicate).
if (@tmp_arr && !exists $hash{$tmp_arr[1]}) {
        print "$_\n" for @tmp_arr;
}
close $fh;

From what I could understand from your posts, this might be the solution. Please test it on your sample data set before running it on 7 million rows.

Let me know if it does not work properly or as intended.

HTH,
PL
# 13  
Old 10-13-2009
Dear skmdu,

For example:
Code:
@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAA  #First occurrence, keep
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC555_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGG #First occurrence, keep
+HWI-ABC555_30DFGGDA:1:100:3:1234
]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC467_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAA #Second occurrence, discard
+HWI-ABC467_30DFGGDA:1:100:3:1234
hhhHHHHhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC889_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGG #Second occurrence, discard
+HWI-ABC889_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhh]]]]]]]]
@HWI-ABC796_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAA  #Third occurrence, discard
+HWI-ABC796_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC140_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGG #Third occurrence, discard
+HWI-ABC140_30DFGGDA:1:100:3:1234
]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhh
...

I got a long list of Illumina reads. My desired output is like this:
Code:
@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAA
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC555_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGG
+HWI-ABC555_30DFGGDA:1:100:3:1234
]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhh

Sorry if my question is confusing.
I consider a read a duplicate based only on its nucleotide sequence (the line 2 contents), regardless of its header or quality score.
At this stage I will keep the first occurrence of each unique nucleotide sequence,
together with its header and quality score, as my unique read.
Every later record whose nucleotide sequence matches one already seen counts as a duplicate and should be discarded.
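For what it's worth, the rule described above (compare only line 2, keep the first occurrence) can be sketched as a shell pipeline; the file names and reads below are made up, and this has not been tried at the 7-million-row scale:

```shell
# Three 4-line records; the second repeats the first record's sequence.
printf '%s\n' '@r1' 'ACGT' '+r1' 'hhhh' \
              '@r2' 'ACGT' '+r2' 'iiii' \
              '@r3' 'GGTT' '+r3' 'jjjj' > reads.fq

# paste joins every 4 lines with tabs, awk keeps the first line per
# sequence (field 2), and tr restores the 4-line layout; only the
# distinct sequences are held in memory, not whole records.
paste - - - - < reads.fq | awk -F'\t' '!seen[$2]++' | tr '\t' '\n' > unique.fq
cat unique.fq
```

Here the @r2 record is dropped and @r1 and @r3 survive.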

Thanks a lot for helping me sort this out.
If you have any questions, feel free to ask anytime.

---------- Post updated at 01:57 AM ---------- Previous update was at 01:31 AM ----------

Hi daptal,
Sad to say, your Perl script doesn't give me my desired output.
It gives me something like:
Code:
@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAAAA
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC555_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGGGG
+HWI-ABC555_30DFGGDA:1:100:3:1234
]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC889_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGGGG
+HWI-ABC889_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhh]]]]]]]]]]

This is not my desired output.

# 14  
Old 10-13-2009

Code:
#! /usr/bin/perl

# Dedup on the sequence line only; print each record the first time
# its sequence is seen, so only the distinct sequences stay in memory.
my %seen;
while ( <> ) {
        if ( /^@/ ) {          # $_ is the header line
                my $seq  = <>; # nucleotide sequence (the dedup key)
                my $qual = <>; # '+' line ...
                $qual   .= <>; # ... plus the quality string
                unless ( exists $seen{$seq} ) {
                        $seen{$seq} = 1;
                        print $_, $seq, $qual;
                }
        }
}

Code:
perl illum.pl < filename
@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAA
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC555_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGG
+HWI-ABC555_30DFGGDA:1:100:3:1234
]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhh

Try this with your sample file first; if the output looks okay, run it on the 7 million rows.

Note: this assumes the quality section of each record is always two lines (the '+' line and the quality string).
