Perl: filtering lines based on duplicate values in a column
Hi, I have a file like this. I need to eliminate the lines in which the first column has the same value 10 times.
The value 13 in the first column is repeated 10 times in consecutive lines. I need to eliminate all of those lines from the output.
So the desired output will be:
Thank you very much in advance. If possible, code in Perl would be much appreciated.
Does anybody know a command that filters duplicate lines out of a file? Something similar to the uniq command, but able to handle duplicate lines no matter where they occur in the file? (9 Replies)
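A common idiom for this is an awk associative array keyed on the whole line. The sketch below keeps the first occurrence of each line and preserves the original order. The function name dedup_lines is made up, and the sample input is invented.

```shell
#!/bin/sh
# dedup_lines [FILE...]: hypothetical helper name.
# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
# is true only for the first occurrence, and awk prints just that one.
dedup_lines() {
    awk '!seen[$0]++' "$@"
}

# Demo on made-up data: prints b, a, c (one per line).
printf '%s\n' b a b c a | dedup_lines
```

Unlike sort -u, this does not reorder the file, but it holds every distinct line in memory.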
Hi, I've got a file that I'd like to uniquely sort based on column 2 (values in column 2 begin with "comp").
I tried sort -t -nuk2,3 file.txt but got:
sort: multi-character tab `-nuk2,3'
"man sort" did not help me out.
Any pointers?
Input:
Output: (5 Replies)
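The error message comes from -t taking the next argument as its separator, so sort sees `-nuk2,3' as a multi-character delimiter. A minimal sketch of a corrected call follows, assuming the file is tab-delimited and that "unique" means one line per distinct column 2 value (hence -k2,2 rather than -k2,3). The name uniq_by_col2 and the sample rows are made up.

```shell
#!/bin/sh
# uniq_by_col2 [FILE...]: hypothetical helper name.
# Passes a literal tab to -t, then sorts on column 2 only and keeps
# one line per distinct column 2 value.
uniq_by_col2() {
    tab=$(printf '\t')
    sort -t "$tab" -u -k2,2 "$@"
}

# Demo on made-up, tab-delimited data.
printf 'x\tcomp2\t1\ny\tcomp1\t2\nx\tcomp2\t1\n' | uniq_by_col2
```

The -n flag is left out because the keys start with "comp" and so have no leading number to compare.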
Hi experts, I have a tab-delimited file with one column containing values separated by a comma. I wish to duplicate the entire line for every value in that comma-delimited field.
For example:
$ cat file
4444 4444 4444 4444
9990 2222,7777 6666 2222 ... (3 Replies)
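A possible awk sketch for this kind of expansion: split the comma-separated field, then print the line once per value. It assumes tab-delimited input and takes the column number as a parameter, since the post is truncated. The name expand_csv_col is made up, and the demo reuses the two sample rows shown above.

```shell
#!/bin/sh
# expand_csv_col COL [FILE...]: hypothetical helper name.
# Repeats each tab-delimited line once per comma-separated value in column COL.
expand_csv_col() {
    col=${1:?column number required}; shift
    awk -F'\t' -v OFS='\t' -v c="$col" '{
        n = split($c, parts, ",")            # values held in the chosen column
        if (n == 0) { print; next }          # empty field: keep the line as is
        for (i = 1; i <= n; i++) { $c = parts[i]; print }
    }' "$@"
}

# Demo: the second row is printed twice, once with 2222 and once with 7777.
printf '4444\t4444\t4444\t4444\n9990\t2222,7777\t6666\t2222\n' | expand_csv_col 2
```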
Hi,
I have a similar input format-
A_1 2
B_0 4
A_1 1
B_2 5
A_4 1
and I am looking to print it in this output format with headers. Can you suggest a solution in awk? I ask for awk because I am already using awk to do some pattern matching from the parent file to print column 1 of my input. Thanks!
letter number_of_letters... (5 Replies)
Hi,
I have tried to remove duplicate lines based on the first column, with a pipe delimiter, but I am not able to get some of the unique lines.
Command: sort -t'|' -nuk1 file.txt
Input:
38376KZ|09/25/15|1.057
38376KZ|09/25/15|1.057
02006YB|09/25/15|0.859
12593PS|09/25/15|2.803... (2 Replies)
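One possible explanation for the missing lines: -k1 makes the key run from column 1 to the end of the line, and -n compares only the leading numeric part of that key, so with -u two different IDs that share a numeric prefix would be treated as duplicates. A sketch that avoids sort entirely and keeps the first line for each column 1 value, in the original order (the name uniq_by_col1 is made up):

```shell
#!/bin/sh
# uniq_by_col1 [FILE...]: hypothetical helper name.
# Keeps the first line seen for each distinct value of the first
# pipe-delimited column; the comparison is an exact string match.
uniq_by_col1() {
    awk -F'|' '!seen[$1]++' "$@"
}

# Demo on the sample rows from the post.
printf '%s\n' '38376KZ|09/25/15|1.057' '38376KZ|09/25/15|1.057' \
    '02006YB|09/25/15|0.859' '12593PS|09/25/15|2.803' | uniq_by_col1
```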
Dear folks
I have a map file of around 54K lines, and some of the values in the second column are the same. I want to find them and delete all of the lines with the same values. I looked over the duplicate-removal commands, but in my case I do not want to keep one of the duplicate values. I want to remove all of the same... (4 Replies)
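Since no copy of a repeated value should survive, one option is to read the file twice: count the column 2 values on the first pass, then print only the lines whose value was counted once. This is a sketch assuming whitespace-separated columns. The name drop_repeated_col2 and the sample data are made up.

```shell
#!/bin/sh
# drop_repeated_col2 FILE: hypothetical helper name.
# Pass 1 (NR == FNR) counts each column 2 value; pass 2 prints only the
# lines whose column 2 value occurred exactly once in the whole file.
drop_repeated_col2() {
    awk 'NR == FNR { count[$2]++; next } count[$2] == 1' "$1" "$1"
}

# Demo on made-up data: both lines with 100 in column 2 are removed.
tmp=$(mktemp)
printf '%s\n' 'rs1 100' 'rs2 200' 'rs3 100' 'rs4 300' > "$tmp"
drop_repeated_col2 "$tmp"
rm -f "$tmp"
```

The file name is passed to awk twice, so this needs a real file rather than a pipe.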
Hi there,
I am trying to filter a big file with several columns, using a column with values like (AC=5;AN=10;SF=341,377,517,643,662;VRT=1). I want to filter the data based on the SF= values that are bigger than 400.
... (25 Replies)
I have a file with 5 columns. I want to pull out all records where the value in column 4 is not unique. For example in the sample below, I would want it to print out all lines except for the last two.
40991764 2419 724 47182 Cand A
40992936 3591 724 47182 Cand B
40993016 3671 724 47182 Cand C... (5 Replies)
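A single-pass sketch for this: buffer the lines while counting the column 4 values, then print only the lines whose value occurred more than once. It assumes whitespace-separated columns and a file small enough to hold in memory. The name keep_shared_col4 is made up, and the last two rows of the demo are invented to stand in for the truncated sample.

```shell
#!/bin/sh
# keep_shared_col4 [FILE...]: hypothetical helper name.
# Stores every line and its column 4 value, then prints, in input order,
# only the lines whose column 4 value appears more than once.
keep_shared_col4() {
    awk '{ line[NR] = $0; key[NR] = $4; count[$4]++ }
         END { for (i = 1; i <= NR; i++) if (count[key[i]] > 1) print line[i] }' "$@"
}

# Demo: first three rows are from the post, the last two are invented.
printf '%s\n' \
    '40991764 2419 724 47182 Cand A' \
    '40992936 3591 724 47182 Cand B' \
    '40993016 3671 724 47182 Cand C' \
    '40994000 4655 724 50001 Cand D' \
    '40995000 5655 724 50002 Cand E' | keep_shared_col4
```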
Discussion started by: kaktus
5 Replies
bp_process_gadfly
BP_PROCESS_GADFLY(1p) User Contributed Perl Documentation BP_PROCESS_GADFLY(1p)
NAME
process_gadfly.pl - Massage Gadfly/FlyBase GFF files into a version suitable for the Generic Genome Browser
SYNOPSIS
% process_gadfly.pl ./RELEASE2 > gadfly.gff
DESCRIPTION
This script massages the RELEASE 3 Flybase/Gadfly GFF files located at http://www.fruitfly.org/sequence/release3download.shtml into the
"correct" version of the GFF format.
To use this script, download the whole genome FASTA file and save it to disk. (The downloaded file will be called something like
"na_whole-genome_genomic_dmel_RELEASE3.FASTA", but the link on the HTML page doesn't give the filename.) Do the same for the whole genome
GFF annotation file (the saved file will be called something like "whole-genome_annotation-feature-region_dmel_RELEASE3.GFF".) If you wish
you can download the ZIP compressed versions of these files.
Next run this script on the two files, indicating the name of the downloaded FASTA file first, followed by the gff file:
% process_gadfly.pl na_whole-genome_genomic_dmel_RELEASE3.FASTA whole-genome_annotation-feature-region_dmel_RELEASE3.GFF > fly.gff
The fly.gff file and the FASTA file can now be loaded into a Bio::DB::GFF database using the following command:
% bulk_load_gff.pl -d fly -fasta na_whole-genome_genomic_dmel_RELEASE3.FASTA fly.gff
(Where "fly" is the name of the database. Change it as appropriate. The database must already exist and be writable by you!)
The resulting database will have the following feature types (represented as "method:source"):
Component:arm            A chromosome arm
Component:scaffold       A chromosome scaffold (accession #)
Component:gap            A gap in the assembly
clone:clonelocator       A BAC clone
gene:gadfly              A gene accession number
transcript:gadfly        A transcript accession number
translation:gadfly       A translation
codon:gadfly             Significance unknown
exon:gadfly              An exon
symbol:gadfly            A classical gene symbol
similarity:blastn        A BLASTN hit
similarity:blastx        A BLASTX hit
similarity:sim4          EST->genome using SIM4
similarity:groupest      EST->genome using GROUPEST
similarity:repeatmasker  A repeat
IMPORTANT NOTE: This script will *only* work with the RELEASE3 gadfly files and will not work with earlier releases.
SEE ALSO
Bio::DB::GFF, bulk_load_gff.pl, load_gff.pl
AUTHOR
Lincoln Stein, lstein@cshl.org
Copyright (c) 2002 Cold Spring Harbor Laboratory
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See DISCLAIMER.txt for
disclaimers of warranty.
perl v5.14.2 2012-03-02 BP_PROCESS_GADFLY(1p)