For files with more than 7 million rows, it is preferable not to hold all the values in memory: in production, if not enough memory is allocated, the script will get killed.
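One way to reduce the memory cost, as a rough sketch rather than a tested production solution: stream the file line by line and remember only a small fixed-size digest of each line already seen, instead of the line itself. The function name `unique_lines` and the 16-byte digest size are my own assumptions. Two different lines that happened to share a digest would be treated as duplicates, which is extremely unlikely at 128 bits but not impossible.

```python
import hashlib
from typing import Iterable, Iterator


def unique_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, in first-seen order (hypothetical helper)."""
    seen = set()  # stores 16-byte digests, not the full lines
    for line in lines:
        # Ignore the trailing newline so the last line compares equal to the others.
        data = line.rstrip("\n").encode("utf-8")
        key = hashlib.blake2b(data, digest_size=16).digest()
        if key not in seen:
            seen.add(key)
            yield line


# Possible usage, streaming from one file to another:
# with open("input.txt") as src, open("output.txt", "w") as dst:
#     dst.writelines(unique_lines(src))
```

Memory still grows with the number of distinct lines, only more slowly, so for very large inputs a disk-based approach may still be needed.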
Assuming every line looks like this
Can you clarify whether you want to drop lines that duplicate just the @HWI-ABC123_30DFGGDA header, or only lines that duplicate the whole line?
If it's still a requirement, leave a note.
Last edited by Scott; 01-11-2010 at 09:02 PM..
Reason: Added code tags
At this stage, I would prefer to simply treat the first occurrence of each unique nucleotide sequence as my "unique" read, and keep all of that first read's contents.
For example:
I have a long list of Illumina reads. My desired output is like this:
Thanks again for your help
Sorry if my question was confusing.
Actually, I consider a read a duplicate based only on its nucleotide sequence (the contents of line 2), regardless of its header or quality score.
At this stage, I will keep the first occurrence of each unique nucleotide sequence (line 2), together with its header and quality score, as my unique read.
Any later read whose nucleotide sequence (line 2) is the same as one already seen counts as a duplicate, and I want to discard it.
Thanks a lot for helping me with this.
If you have any questions, feel free to ask me anytime.
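The rule described above could be sketched roughly like this. It assumes every read is exactly four lines (header, nucleotide sequence, a "+" line, quality score) with no wrapped sequences; the function name `dedupe_fastq` and the sample reads are made up for illustration and are not from the real data.

```python
from typing import Iterable, Iterator, List


def dedupe_fastq(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield 4-line reads, keeping only the first read seen for each sequence."""
    seen = set()   # nucleotide sequences (line 2) already emitted
    record = []    # lines collected for the read currently being built
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            sequence = record[1]  # line 2 of the read
            if sequence not in seen:
                seen.add(sequence)
                yield record
            record = []
    # Assumption: a trailing partial read (fewer than 4 lines) is dropped.


# Hypothetical sample: read2 has the same sequence as read1, so it is discarded.
sample = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "HHHH",
    "@read3", "TTGA", "+", "IIII",
]
for read in dedupe_fastq(sample):
    print("\n".join(read))
```

The header and quality score are never compared, so two reads with different headers but the same sequence still collapse to the first one.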
---------- Post updated at 01:57 AM ---------- Previous update was at 01:31 AM ----------
Hi daptal,
Sad to say, your Perl script doesn't give me my desired output.
It gives me something like:
That is not the output I want.
Hi,
I have a pipe-separated file, repo.psv, where I need to remove duplicates based on the 1st column only. Can anyone help with a Unix script?
Input:
15277105||Common Stick|ESHR||Common Stock|CYRO AB
15277105||Common Stick|ESHR||Common Stock|CYRO AB
16111278||Common Stick|ESHR||Common... (12 Replies)
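A minimal sketch of de-duplicating on the first pipe-separated column only, keeping the first line seen for each key. The function name `unique_by_first_field` is hypothetical, and the third sample row is shortened on purpose because the original line is cut off.

```python
from typing import Iterable, Iterator


def unique_by_first_field(lines: Iterable[str], sep: str = "|") -> Iterator[str]:
    """Yield the first line seen for each distinct value of column 1."""
    seen = set()
    for line in lines:
        key = line.split(sep, 1)[0]  # text before the first separator
        if key not in seen:
            seen.add(key)
            yield line


rows = [
    "15277105||Common Stick|ESHR||Common Stock|CYRO AB",
    "15277105||Common Stick|ESHR||Common Stock|CYRO AB",
    "16111278||Common Stick",  # shortened, hypothetical remainder omitted
]
for row in unique_by_first_field(rows):
    print(row)
```

Only the first column is compared, so lines that share a key but differ elsewhere are also reduced to the first one.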
Hi,
I have a file that contains thousands of duplicate IDs (with an upper- or lowercase first character), as below
i/p:
a411532A411532a508661A508661c411532C411532
Requirement: I need to ignore the lowercase IDs and keep only the IDs below
o/p:
A411532
A508661
C411532 (9 Replies)
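One possible way to pull out only the uppercase IDs, assuming each ID is a single letter followed by digits and that IDs run together with no separator as in the sample. The function name `uppercase_ids` is hypothetical.

```python
import re
from typing import List


def uppercase_ids(text: str) -> List[str]:
    """Return IDs that start with an uppercase letter, without repeats."""
    matches = re.findall(r"[A-Z][0-9]+", text)
    # dict.fromkeys keeps first-seen order while dropping repeated IDs.
    return list(dict.fromkeys(matches))


line = "a411532A411532a508661A508661c411532C411532"
for found in uppercase_ids(line):
    print(found)
```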
I was analyzing disk reads using the hdparm utility.
This is what I got as a result.
# hdparm -t /dev/sda
/dev/sda:
Timing buffered disk reads: 108 MB in 3.04 seconds = 35.51 MB/sec
# hdparm -T /dev/sda
/dev/sda:
Timing cached reads: 3496 MB in 1.99 seconds = 1756.56 MB/sec... (1 Reply)
Hi,
I am trying to use shell or Perl to remove duplicate characters.
For example, if I have "I love google" it will become "I love ggle",
or even "I loveggle" if duplicate white space is removed as well.
Thanks
CC (6 Replies)
Hi,
I have a list of numbers stored in an array as below.
5 7 10 30 30 40 50
Please advise how I could remove the duplicate value in the array?
Thanks in advance. (5 Replies)
I have the following file content (3 fields per line):
23 888 10.0.0.1
dfh 787 10.0.0.2
dssf dgfas 10.0.0.3
dsgas dg 10.0.0.4
df dasa 10.0.0.5
df dag 10.0.0.5
dfd dfdas 10.0.0.5
dfd dfd 10.0.0.6
daf nfd 10.0.0.6
...
As can be seen, the third field is an IP address and is sorted, but... (3 Replies)
Hi all,
I have a text file fileA.txt
DXRV|02/28/2006 11:36:49.049|SAC||||CDxAcct=2420991350
DXRV|02/28/2006 11:37:06.404|SAC||||CDxAcct=6070970034
DXRV|02/28/2006 11:37:25.740|SAC||||CDxAcct=2420991350
DXRV|02/28/2006 11:38:32.633|SAC||||CDxAcct=6070970034
DXRV|02/28/2006... (2 Replies)
Hi all,
I have an out.log file
CARR|02/26/2006 10:58:30.107|CDxAcct=1405157051
CARR|02/26/2006 11:11:30.107|CDxAcct=1405157051
CARR|02/26/2006 11:18:30.107|CDxAcct=7659579782
CARR|02/26/2006 11:28:30.107|CDxAcct=9534922327
CARR|02/26/2006 11:38:30.107|CDxAcct=9534922327
CARR|02/26/2006... (3 Replies)
I have a text file that contains many records, but they are all written on one line.
I want to remove the duplicate records from that line.
not record have fixed width ex: width = 4
Input file test.txt = abc cdf abc abc cdf fgh fgh abc abc
I want the output file = abc cdf fgh
Only those records.
Can anyone help... (4 Replies)
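For the single-line case above, a short sketch that assumes the records are separated by whitespace (the width remark in the post is not relied on here). The function name `unique_tokens` is hypothetical.

```python
def unique_tokens(line: str) -> str:
    """Return the whitespace-separated records of a line, first occurrence only."""
    # dict.fromkeys drops repeats while keeping the original order.
    return " ".join(dict.fromkeys(line.split()))


print(unique_tokens("abc cdf abc abc cdf fgh fgh abc abc"))  # abc cdf fgh
```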