How to remove duplicate sentence/string in perl? Post: 302261238

Post 302261238 by radoulov on Monday 24th of November 2008 05:19:48 AM

Did you read my post?

Code:

$ cat p
#! /usr/bin/env perl

@arr =(
'TDP-43 is a highly conserved, 43-kDa RNA-binding protein implicated to play a role in transcription repression, nuclear organization, and alternative splicing. More recently, this factor has been identified as the major disease protein of several neurodegenerative diseases, including frontotemporal lobar degeneration with ubiquitin-positive inclusions and amyotrophic lateral sclerosis.',
'For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter.',
'In this study using the survival of motor neuron (SMN) minigenes as the reporters in transfection assay, we show for the first time that TDP-43 could also act as an exon-inclusion factor. Furthermore, both RNA-recognition motif domains are required for its ability to enhance the SMN2 exon 7 inclusion.',
'Combined protein-immunoprecipitation and RNA-immunoprecipitation experiments also suggested that this exon inclusion activity might be mediated by multimeric complex(es) consisting of this protein interacting with other splicing factors, including Htra2-beta1.',
'Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.',
'TDP-43 is a highly conserved, 43-kDa RNA-binding protein implicated to play a role in transcription repression, nuclear organization, and alternative splicing. More recently, this factor has been identified as the major disease protein of several neurodegenerative diseases, including frontotemporal lobar degeneration with ubiquitin-positive inclusions and amyotrophic lateral sclerosis.',
'For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter.',
'In this study using the survival of motor neuron (SMN) minigenes as the reporters in transfection assay, we show for the first time that TDP-43 could also act as an exon-inclusion factor. Furthermore, both RNA-recognition motif domains are required for its ability to enhance the SMN2 exon 7 inclusion.',
'Combined protein-immunoprecipitation and RNA-immunoprecipitation experiments also suggested that this exon inclusion activity might be mediated by multimeric complex(es) consisting of this protein interacting with other splicing factors, including Htra2-beta1.',
'Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.'
);

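# $, is printed between list items, $\ after the whole print:
# a blank line between sentences and a final newline.
$, = "\n\n";
$\ = "\n";

# Keep a sentence only the first time it is seen; the hash %_ counts sightings.
print grep !$_{$_}++, @arr;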

$ ./p
TDP-43 is a highly conserved, 43-kDa RNA-binding protein implicated to play a role in transcription repression, nuclear organization, and alternative splicing. More recently, this factor has been identified as the major disease protein of several neurodegenerative diseases, including frontotemporal lobar degeneration with ubiquitin-positive inclusions and amyotrophic lateral sclerosis.

For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter.

In this study using the survival of motor neuron (SMN) minigenes as the reporters in transfection assay, we show for the first time that TDP-43 could also act as an exon-inclusion factor. Furthermore, both RNA-recognition motif domains are required for its ability to enhance the SMN2 exon 7 inclusion.

Combined protein-immunoprecipitation and RNA-immunoprecipitation experiments also suggested that this exon inclusion activity might be mediated by multimeric complex(es) consisting of this protein interacting with other splicing factors, including Htra2-beta1.

Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.
$
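
The same first-occurrence filter can also be written with a lexical hash so that it runs under `use strict`. This is only a minimal sketch, not part of the script above: the sub name `uniq_sentences` is hypothetical, and the array holds two of the sentences from the example, one of them repeated.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not in the original post): returns the input list
# with repeats dropped, keeping the first occurrence of each string.
sub uniq_sentences {
    my %seen;                          # string => number of times seen so far
    return grep { !$seen{$_}++ } @_;   # true only on the first sighting
}

my @arr = (
    'For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter.',
    'Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.',
    'For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter.',
);

# Print the surviving sentences separated by blank lines.
print join("\n\n", uniq_sentences(@arr)), "\n";
```

Input order is preserved because grep walks the list once from left to right. If List::Util 1.45 or later is installed, its `uniq` function should do the same job without a hand-written sub.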

radoulov

LEARN ABOUT DEBIAN

bp_genbank2gff3

BP_GENBANK2GFF3(1p)					User Contributed Perl Documentation				       BP_GENBANK2GFF3(1p)

NAME

       genbank2gff3.pl -- Genbank->gbrowse-friendly GFF3

SYNOPSIS

	 genbank2gff3.pl [options] filename(s)

	 # process a directory containing GenBank flatfiles
	 perl genbank2gff3.pl --dir path_to_files --zip

	 # process a single file, ignore explicit exons and introns
	 perl genbank2gff3.pl --filter exon --filter intron file.gbk.gz

	 # process a list of files
	 perl genbank2gff3.pl *gbk.gz

	 # process data from URL, with Chado GFF model (-noCDS), and pipe to database loader
	 curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
	 | perl genbank2gff3.pl -noCDS -in stdin -out stdout \
	 | perl gmod_bulk_load_gff3.pl -dbname mychado -organism fromdata

	   Options:
	       --noinfer  -r  don't infer exon/mRNA subfeatures
	       --conf	  -i  path to the curation configuration file that contains user preferences
			      for Genbank entries (must be YAML format)
			      (if --manual is passed without --ini, user will be prompted to
			       create the file if any manual input is saved)
	       --sofile  -l  path to the so.obo file to use for feature type mapping
			      (--sofile live will download the latest online revision)
	       --manual   -m  when trying to guess the proper SO term, if more than
			      one option matches the primary tag, the converter will
			      wait for user input to choose the correct one
			      (only works with --sofile)
	       --dir	  -d  path to a list of genbank flatfiles
	       --outdir   -o  location to write GFF files (can be 'stdout' or '-' for pipe)
	       --zip	  -z  compress GFF3 output files with gzip
	       --summary  -s  print a summary of the features in each contig
	       --filter   -x  genbank feature type(s) to ignore
	       --split	  -y  split output to separate GFF and fasta files for
			      each genbank record
	       --nolump   -n  separate file for each reference sequence
			      (default is to lump all records together into one
			      output file for each input file)
	       --ethresh  -e  error threshold for unflattener
			      set this high (>2) to ignore all unflattener errors
	       --[no]CDS  -c  Keep CDS-exons, or convert to alternate gene-RNA-protein-exon
			      model. --CDS is default. Use --CDS to keep default GFF gene model,
			      use --noCDS to convert to g-r-p-e.
	       --format   -f  Input format (SeqIO types): GenBank, Swiss or Uniprot, EMBL work
			      (GenBank is default)
	       --GFF_VERSION  3 is default, 2 and 2.5 and other Bio::Tools::GFF versions available
	       --quiet	      don't talk about what is being processed
	       --typesource   SO sequence type for source (e.g. chromosome; region; contig)
	       --help	  -h  display this message

DESCRIPTION

       This script uses Bio::SeqFeature::Tools::Unflattener and Bio::Tools::GFF to convert GenBank flatfiles to GFF3 with gene containment
       hierarchies mapped for optimal display in gbrowse.

       The input files are assumed to be gzipped GenBank flatfiles for refseq contigs.	The files may contain multiple GenBank records.  Either a
       single file or an entire directory can be processed.  By default, the DNA sequence is embedded in the GFF but it can be saved into a separate
       fasta file with the --split(-y) option.

       If an input file contains multiple records, the default behaviour is to dump all GFF and sequence to a file of the same name (with .gff
       appended).  Using the 'nolump' option will create a separate file for each genbank record.  Using the 'split' option will create separate
       GFF and Fasta files for each genbank record.

   Notes
       'split' and 'nolump' produce many files

       In cases where the input files contain many GenBank records (for example, the chromosome files for the mouse genome build), a very large
       number of output files will be produced if the 'split' or 'nolump' options are selected.  If you do have lists of files > 6000, use the
       --long_list option in bp_bulk_load_gff.pl or bp_fast_load_gff.pl to load the gff and/ or fasta files.

       Designed for RefSeq

       This script is designed for RefSeq genomic sequence entries.  It may work for third party annotations but this has not been tested.  But
       see below: Uniprot/Swissprot and EMBL input work, and possibly EMBL/Ensembl if you don't mind some gene model unflattener errors (dgg).

       G-R-P-E Gene Model

       Don Gilbert reworked this script to produce GFF3 suited to loading into GMOD Chado databases.  Most of the changes, I believe, are
       suited for general use.	One main chado-specific addition is the
	 --[no]cds2protein  flag

       My favorite GFF is to set the above as ON by default (disable with --nocds2prot).  For general use it probably should be OFF, enabled with
       --cds2prot.

       This writes GFF with an alternate, but useful Gene model, instead of the consensus model for GFF3

	 [ gene > mRNA> (exon,CDS,UTR) ]

       This alternate is

	 gene > mRNA > polypeptide > exon

       This means the only feature with DNA bases is the exon.  The others specify only location ranges on a genome.  Exon of course is a child of mRNA
       and protein/peptide.

       The protein/polypeptide feature is an important one, having all the annotations of the GenBank CDS feature, protein ID, translation, GO
       terms, Dbxrefs to other proteins.

       UTRs, introns, CDS-exons are all inferred from the primary exon bases inside/outside appropriate higher feature ranges.	 Other special
       gene model features remain the same.

       Several other improvements and bugfixes, minor but useful are included

	 * IO pipes now work:
	   curl ftp://ncbigenomes/... | genbank2gff3 --in stdin --out stdout | gff2chado ...

	 * GenBank main record fields are added to source feature, e.g. organism, date,
	   and the sourcetype, commonly chromosome for	genomes, is used.

	 * Gene Model handling for ncRNA and pseudogenes is added.

	 * GFF header is cleaner, more informative.
	   --GFF_VERSION flag allows choice of v2 as well as default v3

	 * GFF ##FASTA inclusion is improved, and
	   CDS translation sequence is moved to FASTA records.

	 * FT -> GFF attribute mapping is improved.

	 * --format choice of SeqIO input formats (GenBank default).
	   Uniprot/Swissprot and EMBL work and produce useful GFF.

	 * SeqFeature::Tools::TypeMapper has a few FT -> SOFA additions
	     and more flexible usage.

TODO

   Are these additions desired?
	* filter input records by taxon (e.g. keep only organism=xxx or taxa level = classYYY)
	* handle Entrezgene, other non-sequence SeqIO structures (really should change
	   those parsers to produce consistent annotation tags).

   Related bugfixes/tests
       These items from Bioperl mail were tested (sample data generating errors), and found corrected:

	From: Ed Green <green <at> eva.mpg.de>
	Subject: genbank2gff3.pl on new human RefSeq
	Date: 2006-03-13 21:22:26 GMT
	  -- unspecified errors (sample data works now).

	From: Eric Just <e-just <at> northwestern.edu>
	Subject: bp_genbank2gff3.pl
	Date: 2007-01-26 17:08:49 GMT
	  -- bug fixed in genbank2gff3 for multi-record handling

       This error is for a /trans_splice gene that is hard to handle, and unflattener/genbank2 doesn't

	From: Chad Matsalla <chad <at> dieselwurks.com>
	Subject: genbank2gff3.PLS and the unflatenner - Inconsistent   order?
	Date: 2005-07-15 19:51:48 GMT

AUTHOR

       Sheldon McKay (mckays@cshl.edu)

       Copyright (c) 2004 Cold Spring Harbor Laboratory.

   AUTHOR of hacks for GFF2Chado loading
       Don Gilbert (gilbertd@indiana.edu)

perl v5.14.2							    2012-03-02						       BP_GENBANK2GFF3(1p)
