Full Discussion: Extracting words from file
Shell Programming and Scripting: post 302537749 by birei on Saturday 9th of July 2011 06:57:25 PM
Hi,

Try the following Perl program:
Code:
$ cat script.pl
use warnings;
use strict;

@ARGV == 1 or die "Usage: perl $0 <input-file>\n";

my %word_length;

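# Read the input line by line and group its words by length.
while ( <> ) {
        chomp;
        my @words = split /[^[:alpha:]]+/;
        my %repeated_word;      # words already seen on the current line
        for my $word ( @words ) {
                # split yields an empty leading field when a line starts with a non-letter
                next unless length $word;
                push @{ $word_length{ length $word } }, $word unless $repeated_word{ $word }++;
        }
}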

for my $length ( keys %word_length ) {
        my $outfile = "file" . $length;
        open my $fh, ">", $outfile or do {
                warn "Cannot open $outfile: $!\n";
                next;
        };
        for my $word ( @{ $word_length{ $length } } ) {
                printf $fh "%s\n", $word;
        }

        close $fh or warn "Cannot close $outfile: $!\n";
}
$ cat infile
This is an example to 
test if 
my perl program works
as expected.
$ perl script.pl
Usage: perl script.pl <input-file>
$ perl script.pl infile
$ ls -1 file*
file2
file4
file5
file7
file8

Regards,
Birei
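
The grouping step of the script above can also be sketched as a small in-memory function, which makes it easy to try on a few lines without writing any files. This is only a sketch under stated assumptions: group_words_by_length is a hypothetical helper name that does not appear in the script, duplicates are dropped across the whole input (the script above only drops them within a single line), and empty fields returned by split are skipped.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: returns a hash
# reference mapping word length => list of unique words of that length.
sub group_words_by_length {
    my @lines = @_;
    my ( %by_length, %seen );
    for my $line (@lines) {
        for my $word ( split /[^[:alpha:]]+/, $line ) {
            next if $word eq '';       # empty leading field (line starts with a non-letter)
            next if $seen{$word}++;    # assumption: de-duplicate across the whole input
            push @{ $by_length{ length $word } }, $word;
        }
    }
    return \%by_length;
}

my $groups = group_words_by_length( "This is an example to ", "test if", "(is) it" );
for my $length ( sort { $a <=> $b } keys %$groups ) {
    print "$length: @{ $groups->{$length} }\n";
}
```

Sorting the keys numerically gives a stable order for the output; the original script iterates over keys %word_length directly, which is fine when each length goes to its own file.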
 

Bio::Tools::SeqWords(3pm)    User Contributed Perl Documentation    Bio::Tools::SeqWords(3pm)

NAME
       Bio::Tools::SeqWords - Object holding n-mer statistics for a sequence

SYNOPSIS
         # Create the SeqWords object, e.g.:
         my $inputstream = Bio::SeqIO->new(-file => "seqfile", -format => 'Fasta');
         my $seqobj = $inputstream->next_seq();
         my $seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);

         # Or:
         my $seqobj = Bio::PrimarySeq->new(-seq => "agggtttccc", -alphabet => 'dna', -id => 'test');
         my $seq_word = Bio::Tools::SeqWords->new(-seq => $seqobj);

         # Obtain a hash of word counts, e.g.:
         my $hash_ref = $seq_word->count_words($word_length);

         # Display the hash table, e.g.:
         my %hash = %$hash_ref;
         foreach my $key (sort keys %hash) {
             print "$key $hash{$key}\n";
         }

         # Or:
         my $hash_ref = Bio::Tools::SeqWords->count_words($seqobj, $word_length);

DESCRIPTION
       Bio::Tools::SeqWords is a featherweight object for the calculation of n-mer word occurrences in a single sequence. It is envisaged that the object will be useful for construction of scripts which use n-mer word tables as the raw material for statistical calculations; for instance, hexamer frequency for the calculation of coding potential, or the calculation of periodicity in repetitive DNA. Triplet frequency is already handled by Bio::Tools::SeqStats (author: Peter Schattner).

       There are a few possible applications for protein, e.g. hypothesised amino acid 7-mers in heat shock proteins, or proteins with multiple simple motifs. Sometimes these protein periodicities are best seen when the amino acid alphabet is truncated, e.g. Shulman alphabet. Since there are quite a few of these shortened alphabets, this module does not specify any particular alphabet.

       See Synopsis above for object creation code.

   Rationale
       Take a sequence object and create an object for the purposes of holding n-mer word statistics about that sequence. The sequence can be nucleic acid or protein.

       In count_words() the words are counted in a non-overlapping manner, i.e. in the style of a codon table, but with any word length.

       In count_overlap_words() the words are counted in an overlapping manner.

       For counts on the opposite strand (DNA/RNA), a reverse complement method should be performed, and then the count repeated.

FEEDBACK
   Mailing Lists
       User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to one of the Bioperl mailing lists. Your participation is much appreciated.

         bioperl-l@bioperl.org                  - General discussion
         http://bioperl.org/wiki/Mailing_lists  - About the mailing lists

   Support
       Please direct usage questions or support issues to the mailing list:

       bioperl-l@bioperl.org

       rather than to the module maintainer directly. Many experienced and responsive experts will be able to look at the problem and quickly address it. Please include a thorough description of the problem with code and data examples if at all possible.

   Reporting Bugs
       Report bugs to the Bioperl bug tracking system to help us keep track of the bugs and their resolution. Bug reports can be submitted via the web:

         https://redmine.open-bio.org/projects/bioperl/

AUTHOR
       Derek Gatherer, in the loosest sense of the word 'author'. The general shape of the module is lifted directly from the SeqStat module of Peter Schattner. The central subroutine to count the words is adapted from original code provided by Dave Shivak, in response to a query on the bioperl mailing list. At least 2 other people provided alternative means (equally good but not used in the end) of performing the same calculation. Thanks to all for your assistance.

CONTRIBUTORS
       Jason Stajich, jason-at-bioperl.org

APPENDIX
       The rest of the documentation details each of the object methods. Internal methods are usually preceded with a _

   count_words
        Title   : count_words
        Usage   : $word_count = $seq_stats->count_words($word_length)
                  or
                  $word_count = Bio::Tools::SeqWords->count_words($seqobj, $word_length);
        Function: Counts non-overlapping words within a string; any alphabet is used
        Example : A sequence ACCGTCCGT, counted at word length 4, will give the hash
                  {ACCG => 1, TCCG => 1}
        Returns : Reference to a hash in which keys are words (any length) of the alphabet
                  used and values are number of occurrences of the word in the sequence.
        Args    : Word length as scalar and, if required, a reference to a sequence object

                  Throws an exception if word length is not a positive integer
                  or if word length is longer than the sequence.

   count_overlap_words
        Title   : count_overlap_words
        Usage   : $word_count = $word_obj->count_overlap_words($word_length);
        Function: Counts overlapping words within a string; any alphabet is used
        Example : A sequence ACCAACCA, counted at word length 4, will give the hash
                  {ACCA => 2, CCAA => 1, CAAC => 1, AACC => 1}
        Returns : Reference to a hash in which keys are words (any length) of the alphabet
                  used and values are number of occurrences of the word in the sequence.
        Args    : Word length as scalar

                  Throws an exception if word length is not a positive integer
                  or if word length is longer than the sequence.

perl v5.14.2                              2012-03-02                    Bio::Tools::SeqWords(3pm)
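
The difference between the non-overlapping and overlapping counts described for count_words and count_overlap_words can be illustrated in plain Perl. The sketch below is not the Bio::Tools::SeqWords implementation and does not need BioPerl; count_kmers and its step argument are hypothetical names introduced here, and the input is assumed to be a plain string rather than a sequence object. A step equal to the word length gives codon-table style counting, and a step of 1 gives overlapping counting.

```perl
use strict;
use warnings;

# Hypothetical stand-alone helper (an assumption for illustration only).
# Counts words of length $k in $seq, advancing $step characters each time.
sub count_kmers {
    my ( $seq, $k, $step ) = @_;
    die "word length must be a positive integer\n" unless $k =~ /^[1-9]\d*$/;
    die "word length is longer than the sequence\n" if $k > length $seq;
    my %count;
    for ( my $i = 0 ; $i + $k <= length $seq ; $i += $step ) {
        $count{ substr( $seq, $i, $k ) }++;    # incomplete trailing words are ignored
    }
    return \%count;
}

my $non_overlapping = count_kmers( "ACCGTCCGT", 4, 4 );    # step == word length
my $overlapping     = count_kmers( "ACCAACCA",  4, 1 );    # step == 1

for my $word ( sort keys %$non_overlapping ) {
    print "non-overlapping $word $non_overlapping->{$word}\n";
}
for my $word ( sort keys %$overlapping ) {
    print "overlapping     $word $overlapping->{$word}\n";
}
```

With the two sequences taken from the method descriptions, the non-overlapping count yields ACCG and TCCG once each, and the overlapping count yields ACCA twice and AACC, CAAC and CCAA once each, matching the documented examples.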