Post 302850841 by gimley on Thursday 5th of September 2013 08:08:19 PM
Assigning the same frequency to more than one word in a file

I have a file of names with the following structure:
Code:
NAME [tab] FREQUENCY
NAME NAME [tab] FREQUENCY
NAME NAME NAME [tab] FREQUENCY

That is, a line may hold more than one name, and every name on that line shares the same frequency. An example will make this clear:
Code:
SANDHYA DAS	6901
ARATI DAS	6201
KALPANA DAS	4714
GITA DAS	4550
BISWANATH DAS	3949
SWAPAN DAS	3941
SUKUMAR DAS	3876
GOPAL DAS	3835
SARASWATI DAS	3769
DILIP DAS	3653
TAPAN DAS	3607
ASHOKE DAS	3604
PRATIMA DAS	3558
PURNIMA DAS	3546
BASANTI DAS	3372
SHANKAR DAS	3279
SANDHYA GHOSH	3254
SANJAY DAS	3252
PRATIMA DAS	3212
KALPANA DAS	3203
ARATI GHOSH	3155
MALATI DAS	3151
SWAPAN DAS	3138
SANDHYA RANI DAS	3120
LAKSHMI DAS	3104
ANJALI DAS	3085

I want to split each line so that every name on it, whether there are two or three, gets that line's frequency. This ensures that statistically each name within a field retains its frequency.
The expected output would be:
Code:
ANJALI	3085
ARATI	6201
ARATI	3155
ASHOKE	3604
BASANTI	3372
BISWANATH	3949
DILIP	3653
GITA	4550
GOPAL	3835
KALPANA	4714
KALPANA	3203
LAKSHMI	3104
MALATI	3151
PRATIMA	3558
PRATIMA	3212
PURNIMA	3546
SANDHYA	6901
SANDHYA	3254
SANDHYA	3120
SANJAY	3252
SARASWATI	3769
SHANKAR	3279
SUKUMAR	3876
SWAPAN	3941
SWAPAN	3138
TAPAN	3607
DAS	3085
DAS	6201
DAS	3155
DAS	3604
DAS	3372
DAS	3949
DAS	3653
DAS	4550
DAS	3835
DAS	4714
DAS	3203
DAS	3104
DAS	3151
DAS	3558
DAS	3212
DAS	3546
DAS	3254
DAS	3120
DAS	3252
DAS	3279
DAS	3876
DAS	3138
DAS	3607
GHOSH	6901
GHOSH	3769
RANI	3941
DAS	3120
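
A minimal sketch of the split described above, written in Python rather than Perl or AWK (it runs on Windows without a shell; the same logic ports directly to either). It assumes the frequency is whatever follows the last tab on a line and that names are separated by whitespace. The function names `split_names` and `expand` are made up for illustration. Output is grouped by word position (first words, then second words, then third words) to mirror the layout of the expected output. Note that a few rows of the sample output (e.g. GHOSH 6901) do not match the input lines; the sketch always pairs a word with the frequency from its own line.

```python
import io


def split_names(lines):
    """Yield (position, word, frequency) for every word in every record."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Assumption: the frequency is the text after the LAST tab.
        names, sep, freq = line.rpartition("\t")
        if not sep:
            continue  # no tab, so not a NAME<tab>FREQUENCY record
        for pos, word in enumerate(names.split()):
            yield pos, word, freq.strip()


def expand(lines):
    """Return 'WORD<tab>FREQUENCY' rows, grouped by position, then by word."""
    # sorted() is stable, so rows with the same key keep their input order.
    rows = sorted(split_names(lines), key=lambda r: (r[0], r[1]))
    return [f"{word}\t{freq}" for _, word, freq in rows]


# Three records taken from the sample data above.
sample = "SANDHYA DAS\t6901\nARATI GHOSH\t3155\nSANDHYA RANI DAS\t3120\n"
result = expand(io.StringIO(sample))
print("\n".join(result))
```

For a real file, the `io.StringIO(sample)` argument could be replaced by an open file handle, e.g. `open("names.txt", encoding="utf-8")`, where `names.txt` is a placeholder name.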

I currently do this field separation with an Excel macro, but since the database is huge, the process is long and tedious.
Would it be possible to do the same with a Perl or AWK script? I have already written an awk tool that merges frequencies, which I could then run on the result. As an example, all occurrences of
Code:
DAS

would thus have a merged frequency.
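
For reference, the merge step could be sketched as follows. This is not the awk tool mentioned above, only an illustration in Python, and it assumes that "merged" means summing the frequencies of identical words. The name `merge_frequencies` is hypothetical.

```python
from collections import Counter


def merge_frequencies(lines):
    """Sum the frequencies of identical words in WORD<tab>FREQUENCY rows."""
    totals = Counter()
    for raw in lines:
        word, sep, freq = raw.rstrip("\r\n").rpartition("\t")
        # Skip rows without a tab or with a non-numeric frequency.
        if sep and freq.strip().isdigit():
            totals[word] += int(freq)
    return totals


# A few split rows, with frequencies taken from the sample input.
rows = ["DAS\t6901", "DAS\t6201", "GHOSH\t3254", "DAS\t3120"]
merged = merge_frequencies(rows)
for word, total in merged.most_common():
    print(f"{word}\t{total}")
```

`Counter.most_common()` lists the words from highest to lowest total, which matches the descending order of the input file.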
I work under Windows; UNIX (sigh) is not my OS. No shell scripts please.
Many thanks.
 
