Illumina reads remove duplicate...


 
# 1  
Old 10-12-2009

After using the search tool, I still can't find a solution related to my problem.
My input file:
Code:
@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAAAA
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC555_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGGGG
+HWI-ABC123_30DFGGDA:1:100:3:1234
]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC467_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAAAA
+HWI-ABC467_30DFGGDA:1:100:3:1234
hhhHHHHhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC889_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGGGG
+HWI-ABC889_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhh]]]]]]]]]]
.
.
.
.
.
.

I have a long list of Illumina reads. My desired output looks like this:
Code:
@HWI-ABC123_30DFGGDA:1:100:3:1234
ACGTAGTACCCGGGTTTTTTTTTAAAAAAA
+HWI-ABC123_30DFGGDA:1:100:3:1234
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
@HWI-ABC555_30DFGGDA:1:100:3:1234
GGGTTTTTTTTTAAAAAAAGGGGGGGGGGG
+HWI-ABC123_30DFGGDA:1:100:3:1234
]]]]]]]]]]hhhhhhhhhhhhhhhhhhhhhhhhhhhh

When line 2 of a record (the nucleotide sequence) exactly matches that of an earlier record, the later duplicates should be removed, keeping only the first occurrence.
Hopefully an expert can help me solve this problem.
Thanks a lot.

Last edited by Scott; 01-11-2010 at 10:01 PM.. Reason: Added code tags
# 2  
Old 10-12-2009
Perl Script

Code:
#! /usr/bin/perl

use strict;
use warnings;

open ( IN, "data" ) || die "Perl blew up\n";

my $lookup;
my @temp;
my @keep;
my $nuc;

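while (<IN>) {

    push @temp, $_;    # collect the four lines of the current record

    if ( /^([A-Z]{30})$/ ) {    # sequence line: exactly 30 uppercase letters
       $nuc = $1;
       ++$lookup->{$nuc};       # count how often this sequence has been seen
    }

    if ( @temp == 4 ) {
        if ( $lookup->{$nuc} == 1 ) { push @keep, @temp; }    # $nuc is undef if no line of the record matched above
        @temp = (); $nuc = ();

    }

}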

print map "$_", @keep;


Last edited by pludi; 10-13-2009 at 03:57 AM.. Reason: code tags please...
# 3  
Old 10-12-2009
Hi, thanks for your suggestion.
Sadly, it doesn't work properly.
When I run the Perl script, it keeps printing a long list of:
"Use of uninitialized value in hash element at unique.pl line 23, <IN> line 38928"

Do you have any idea what is causing this problem?
# 4  
Old 10-12-2009

Patrick, I would like to see more of the errors you are getting. I suspect that in some records your nucleotide values are missing or do not match the pattern.

Try this code; it is perhaps easier to cut and paste. It should do exactly the same thing as the first script. (I was interested in trying to get this code onto one line =) See if you get the same errors.
Code:
#! /usr/bin/perl

use strict;
use warnings;

open ( IN, "data" ) || die "Perl blew up\n";
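undef $/;    # slurp mode: the next read returns the whole file

my $str = <IN>;
my $lookup;

# $1 = one four-line record, $2 = its 30-letter sequence line
while ( $str =~ /(.+\n([A-Z]{30})\n.+\n.+\n)/g ) {

    ++$lookup->{$2};
    print $1 unless $lookup->{$2} > 1;    # print only the first occurrence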

}


Last edited by pludi; 10-13-2009 at 03:58 AM.. Reason: code tags please...
# 5  
Old 10-12-2009
Thanks a lot, deindorfer.
I am trying your Perl script now.
It seems to be taking a long time to run?
My input file has around 7,000,000+ Illumina reads.
It is still running now.
Hopefully this script works.
Thanks again, deindorfer.
# 6  
Old 10-12-2009
7 million lines should not be too bad. The second script (post #4) stores all those lines in one variable, but if you are on a production system, you should be OK.
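
To avoid holding the whole file in one variable, the input can also be read one four-line record at a time, remembering only the sequences already seen. The following is a minimal sketch under stated assumptions, not the script posted in this thread: the sub name dedup_fastq is made up for illustration, every record is assumed to be exactly four lines, and the read length is deliberately not fixed at 30.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): copy FASTQ records from $in to
# $out, keeping only the first record seen for each sequence line.
sub dedup_fastq {
    my ($in, $out) = @_;
    my %seen;                                       # sequence => times seen so far
    while (defined(my $header = <$in>)) {
        next if $header =~ /^\s*$/;                 # skip stray blank lines
        my @rest = map { scalar(<$in>) } 1 .. 3;    # sequence, '+' line, quality
        if (grep { !defined } @rest) {
            warn "Incomplete record at end of input\n";
            last;
        }
        (my $seq = $rest[0]) =~ s/\s+\z//;          # drop "\n" or "\r\n"
        print {$out} $header, @rest unless $seen{$seq}++;
    }
    return;
}

# Run on a file only when a name is given, e.g.: perl dedup.pl data > unique.fq
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
    dedup_fastq($in, \*STDOUT);
    close $in;
}
```

Memory use grows with the number of distinct sequences rather than with the size of the file, because only the hash keys are kept; each record is printed as soon as it is read.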
# 7  
Old 10-12-2009
Hi, deindorfer.
Your Perl script has finished running.
Unfortunately, the output is empty.
What is causing this?
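
The thread ends without an answer, so the cause of the empty output is not known. One thing that could be checked, purely as an assumption: both scripts only recognise a sequence line made of exactly 30 uppercase letters followed directly by a newline, so reads of another length, reads containing other characters such as ".", or Windows line endings ("\r\n") would never match. The sketch below tallies what line 2 of each record actually looks like. The sub name survey_fastq and the category names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical diagnostic (not from the thread): classify line 2 of every
# four-line record to see whether it would fit the pattern /^[A-Z]{30}$/.
sub survey_fastq {
    my ($in) = @_;
    my %tally = (match => 0, crlf => 0, other_length => 0, other_chars => 0);
    my $n = 0;
    while (defined(my $line = <$in>)) {
        next unless ++$n % 4 == 2;             # line 2 of each record
        $tally{crlf}++ if $line =~ /\r\n\z/;   # the "\r" defeats the scripts' patterns
        (my $seq = $line) =~ s/\r?\n\z//;      # classify without the line ending
        if    ($seq =~ /\A[A-Z]{30}\z/) { $tally{match}++ }         # 30 uppercase letters
        elsif ($seq =~ /\A[A-Z]+\z/)    { $tally{other_length}++ }  # letters, other length
        else                            { $tally{other_chars}++ }   # '.', lowercase, empty, ...
    }
    return \%tally;
}

# Run on a file only when a name is given, e.g.: perl survey.pl data
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
    my $t = survey_fastq($in);
    printf "%-13s %d\n", $_, $t->{$_} for sort keys %$t;
    close $in;
}
```

If the counts for crlf, other_length or other_chars are not zero, those records are ones the posted patterns cannot match, which would be consistent with the warnings reported in post #3.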
