How to remove duplicate sentence/string in perl?


 
# 1  
Old 11-22-2008

Hi,

I have two strings like this in an array:

For example:

Code:
@a=("Brain aging is associated with a progressive imbalance between intracellular concentration of Reactive Oxygen Species","Brain aging is associated with a progressive imbalance between intracellular concentration of Reactive Oxygen Species");

These are actually duplicate sentences.

I want to remove the duplicate strings from the array, and I have many duplicates like this.

How do I remove duplicate sentences/strings from an array in Perl?

I don't want to use any module to do this.

Any solution?

with regards
Vanitha
# 2  
Old 11-22-2008
Well, it can be as simple as this:

Code:
@a = ("a", "b", "c", "a", "b", "d");
@a = (map { $_u{$_} = 1; (); } @a or keys %_u);
$, = "\n";
print @a;

# 3  
Old 11-22-2008
Or, if you also want to preserve the order:
Code:
@a = qw{a b c a b d};
print join "\n", grep !$_{$_}++, @a

# 4  
Old 11-23-2008
Code:
@arr=('a','b','a','b','c');
$hash{$_}++ foreach @arr;
print join ",",keys %hash;

# 5  
Old 11-24-2008
Quote:
Originally Posted by summer_cherry
Code:
@arr=('a','b','a','b','c');
$hash{$_}++ foreach @arr;
print join ",",keys %hash;

Hi,

I tried the methods above. They remove the duplicates, but the order is not retained, and I want to keep the original order as well. I tried sort, but it still gives a different order. Here is my array:

Code:
@arr=(
'TDP-43 is a highly conserved, 43-kDa RNA-binding protein implicated to play a role in transcription repression, nuclear organization, and alternative splicing. More recently, this factor has been identified as the major disease protein of several neurodegenerative diseases, including frontotemporal lobar degeneration with ubiquitin-positive inclusions and amyotrophic lateral sclerosis.',
'For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter.',
'In this study using the survival of motor neuron (SMN) minigenes as the reporters in transfection assay, we show for the first time that TDP-43 could also act as an exon-inclusion factor. Furthermore, both RNA-recognition motif domains are required for its ability to enhance the SMN2 exon 7 inclusion.',
'Combined protein-immunoprecipitation and RNA-immunoprecipitation experiments also suggested that this exon inclusion activity might be mediated by multimeric complex(es) consisting of this protein interacting with other splicing factors, including Htra2-beta1.',
'Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.',
'TDP-43 is a highly conserved, 43-kDa RNA-binding protein implicated to play a role in transcription repression, nuclear organization, and alternative splicing. More recently, this factor has been identified as the major disease protein of several neurodegenerative diseases, including frontotemporal lobar degeneration with ubiquitin-positive inclusions and amyotrophic lateral sclerosis.',
'For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter.',
'In this study using the survival of motor neuron (SMN) minigenes as the reporters in transfection assay, we show for the first time that TDP-43 could also act as an exon-inclusion factor. Furthermore, both RNA-recognition motif domains are required for its ability to enhance the SMN2 exon 7 inclusion.',
'Combined protein-immunoprecipitation and RNA-immunoprecipitation experiments also suggested that this exon inclusion activity might be mediated by multimeric complex(es) consisting of this protein interacting with other splicing factors, including Htra2-beta1.',
'Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.'
);

The output I got was:
Code:
In this study using the survival of motor neuron (SMN) minigenes as the reporters in transfection assay, we show for the first time that TDP-43 could also act as an exon-inclusion factor. Furthermore, both RNA-recognition motif domains are required for its ability to enhance the SMN2 exon 7 inclusion.For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter.TDP-43 is a highly conserved, 43-kDa RNA-binding protein implicated to play a role in transcription repression, nuclear organization, and alternative splicing. More recently, this factor has been identified as the major disease protein of several neurodegenerative diseases, including frontotemporal lobar degeneration with ubiquitin-positive inclusions and amyotrophic lateral sclerosis.Combined protein-immunoprecipitation and RNA-immunoprecipitation experiments also suggested that this exon inclusion activity might be mediated by multimeric complex(es) consisting of this protein interacting with other splicing factors, including Htra2-beta1.Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.


But I want the output like this:
Code:
TDP-43 is a highly conserved, 43-kDa RNA-binding protein implicated to play a role in transcription repression, nuclear organization, and alternative splicing. More recently, this factor has been identified as the major disease protein of several neurodegenerative diseases, including frontotemporal lobar degeneration with ubiquitin-positive inclusions and amyotrophic lateral sclerosis. For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter. In this study using the survival of motor neuron (SMN) minigenes as the reporters in transfection assay, we show for the first time that TDP-43 could also act as an exon-inclusion factor. Furthermore, both RNA-recognition motif domains are required for its ability to enhance the SMN2 exon 7 inclusion. Combined protein-immunoprecipitation and RNA-immunoprecipitation experiments also suggested that this exon inclusion activity might be mediated by multimeric complex(es) consisting of this protein interacting with other splicing factors, including Htra2-beta1. Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.


How do I keep the original order when printing?

With regards
Vanitha
# 6  
Old 11-24-2008
Did you read my post?

Code:
$ cat p
#! /usr/bin/env perl

@arr =(
'TDP-43 is a highly conserved, 43-kDa RNA-binding protein implicated to play a role in transcription repression, nuclear organization, and alternative splicing. More recently, this factor has been identified as the major disease protein of several neurodegenerative diseases, including frontotemporal lobar degeneration with ubiquitin-positive inclusions and amyotrophic lateral sclerosis.',
'For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter.',
'In this study using the survival of motor neuron (SMN) minigenes as the reporters in transfection assay, we show for the first time that TDP-43 could also act as an exon-inclusion factor. Furthermore, both RNA-recognition motif domains are required for its ability to enhance the SMN2 exon 7 inclusion.',
'Combined protein-immunoprecipitation and RNA-immunoprecipitation experiments also suggested that this exon inclusion activity might be mediated by multimeric complex(es) consisting of this protein interacting with other splicing factors, including Htra2-beta1.',
'Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.',
'TDP-43 is a highly conserved, 43-kDa RNA-binding protein implicated to play a role in transcription repression, nuclear organization, and alternative splicing. More recently, this factor has been identified as the major disease protein of several neurodegenerative diseases, including frontotemporal lobar degeneration with ubiquitin-positive inclusions and amyotrophic lateral sclerosis.',
'For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter.',
'In this study using the survival of motor neuron (SMN) minigenes as the reporters in transfection assay, we show for the first time that TDP-43 could also act as an exon-inclusion factor. Furthermore, both RNA-recognition motif domains are required for its ability to enhance the SMN2 exon 7 inclusion.',
'Combined protein-immunoprecipitation and RNA-immunoprecipitation experiments also suggested that this exon inclusion activity might be mediated by multimeric complex(es) consisting of this protein interacting with other splicing factors, including Htra2-beta1.',
'Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.'
);

$, = "\n\n";
$\ = "\n";

print grep !$_{$_}++, @arr;

$ ./p
TDP-43 is a highly conserved, 43-kDa RNA-binding protein implicated to play a role in transcription repression, nuclear organization, and alternative splicing. More recently, this factor has been identified as the major disease protein of several neurodegenerative diseases, including frontotemporal lobar degeneration with ubiquitin-positive inclusions and amyotrophic lateral sclerosis.

For the splicing activity, the factor has been shown to be mainly an exon-skipping promoter.

In this study using the survival of motor neuron (SMN) minigenes as the reporters in transfection assay, we show for the first time that TDP-43 could also act as an exon-inclusion factor. Furthermore, both RNA-recognition motif domains are required for its ability to enhance the SMN2 exon 7 inclusion.

Combined protein-immunoprecipitation and RNA-immunoprecipitation experiments also suggested that this exon inclusion activity might be mediated by multimeric complex(es) consisting of this protein interacting with other splicing factors, including Htra2-beta1.

Our data further evidence TDP-43 as a multifunctional RNA-binding protein for a diverse set of cellular activities.
$

# 7  
Old 11-24-2008
(radoulov was faster, shorter and probably better. The code below uses the same principle, but is very verbose.)

This program should do it:

Code:
@a = qw/a c b a b d/;
%b = ();    # empty hash; {} would assign a hash reference as a stray key
@c = ();
foreach(@a) {
        if(!$b{$_}){
                push @c, $_;
                $b{$_}++;
        }
}
print "Output:\n" . join("\n",@c) . "\n";

Feel free to condense it onto fewer lines...

Code:
@a = qw/a c b a b d/; %b = (); @c = ();
foreach (@a) { push @c, $_ if !$b{$_}++; }
print "Output:\n" . join("\n",@c) . "\n";

Real Perl gurus can probably make it even more cryptic by reusing variables and special variables :)

Basically, you use a hash table to record the words already seen, and for each word you check whether it is already in the hash table (an O(1) lookup on average).
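
For anyone who wants the same idea under strict and warnings, here is a minimal sketch of the hash-based approach described above. The helper name unique_in_order is made up for this example and is not a built-in; the sample data is the short list from this post.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption, not part of Perl):
# returns the given strings with duplicates removed, in first-seen order.
sub unique_in_order {
    my %seen;    # strings encountered so far
    # $seen{$_}++ is 0 (false) the first time a string appears, so the
    # negation keeps only the first occurrence of each string.
    return grep { !$seen{$_}++ } @_;
}

my @a = qw/a c b a b d/;
my @unique = unique_in_order(@a);

# Prints a, c, b, d, one per line.
print "$_\n" for @unique;
```

A lexical %seen inside the sub avoids relying on the global %_ hash, so repeated calls start from a clean state. No module is needed, and it works the same way for whole sentences as for single letters.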