Post 302423685 by polsum on Friday 21st of May 2010 05:29:46 PM
Challenging Awk array problem

Hi,

I have a rather complicated awk problem here, at least for me. I have two files.

File 1:

Code:
607    687    174    0    0    chr1    3000001    3000156    -194195276    -    L1_Mur2    LINE    L1    -4310    1567    1413    1
607    917    214    114    45    chr1    3000237    3000733    -194194699    -    L1_Mur2    LINE    L1    -4488    1389    913    1
607    215    31    0    30    chr1    3000733    3000766    -194194666    +    (TTTG)n    Simple_repeat    Simple_repeat    2    33    0    2
607    845    233    76    114    chr1    3000766    3000792    -194194640    -    L1_Mur2    LINE    L1    -6816    912    887    1
607    621    250    65    37    chr1    3001287    3001583    -194193849    -    Lx9    LINE    L1    -1596    6048    5742    3
607    1320    197    332    7    chr1    3001722    3002005    -194193427    -    RLTR25A    LTR    ERVK    0    1028    625    4

File 2:
Code:
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1  3000072  TTTATCGTCATCGTC
28|3721 + gi|149352351|ref|NC_000069.5|NC_000069  chr3  154935392 GAGTTTTACAGTCCA
28|3721 +  gi|149288852|ref|NC_000067.5|NC_000067 chr1  152633707 GAGTTTTACAGTCCA
28|3721  + gi|149361432|ref|NC_000073.5|NC_000073 chr7  86595415 GAGTTTTACAGTCCA
34|3145  - gi|149321426|ref|NC_000084.5|NC_000084 chr18  43464724 ACGGCTTACGA
34|3145  - gi|149354224|ref|NC_000071.5|NC_000071 chr5  37676290 ACGGCTTACGA

If field 6 of file 1 is the same as field 4 of file 2, then check whether field 5 of file 2 lies within the range given by fields 7 and 8 of file 1. If it does, write the line from file 2, followed by fields 11, 12 and 13 of file 1, to a separate file. Whew!

OK, for example: field 4 of file 2 (chr1) is the same as field 6 of file 1. Then check whether field 5 of file 2 (3000072, which is always a number) lies in the range given by fields 7 and 8 of file 1 (3000001 to 3000156). It does, so the output I need in a separate file (the line from file 2 plus fields 11, 12 and 13 of file 1) is:

Code:
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074  chr1  3000072 TTTATCGTCATCGTC L1_Mur2    LINE    L1
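
The matching rule described above could be sketched roughly as follows. This is only a sketch under some assumptions, not a tested solution for the full files: the names file1.txt, file2.txt and matched.txt are placeholders, both files are taken to be whitespace-separated, the range check is assumed to include both end points, and the appended fields are joined with single spaces. It reads file 1 first and stores every interval per chromosome, then tests each line of file 2 against the intervals stored for its chromosome.

```shell
#!/bin/sh
# Sketch only: file1.txt, file2.txt and matched.txt are placeholder names.

# Two sample rows from each file, copied from the post.
cat > file1.txt <<'EOF'
607    687    174    0    0    chr1    3000001    3000156    -194195276    -    L1_Mur2    LINE    L1    -4310    1567    1413    1
607    215    31    0    30    chr1    3000733    3000766    -194194666    +    (TTTG)n    Simple_repeat    Simple_repeat    2    33    0    2
EOF
cat > file2.txt <<'EOF'
4|17999 - gi|149361523|ref|NC_000074.5|NC_000074 chr1  3000072  TTTATCGTCATCGTC
28|3721 + gi|149352351|ref|NC_000069.5|NC_000069  chr3  154935392 GAGTTTTACAGTCCA
EOF

awk '
# First file: remember every interval, grouped by chromosome (field 6).
NR == FNR {
    n = ++count[$6]
    lo[$6, n]  = $7
    hi[$6, n]  = $8
    ann[$6, n] = $11 " " $12 " " $13
    next
}
# Second file: compare the position (field 5) with each interval stored
# for the same chromosome (field 4); bounds are assumed to be inclusive.
{
    for (i = 1; i <= count[$4]; i++)
        if ($5 + 0 >= lo[$4, i] + 0 && $5 + 0 <= hi[$4, i] + 0)
            print $0, ann[$4, i]
}
' file1.txt file2.txt > matched.txt
```

With the sample rows, only the chr1 line at position 3000072 should end up in matched.txt. A line of file 2 that falls inside several intervals would be printed once per interval. Since the inner loop scans every interval of the chromosome, this may be slow on very large files.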

Thank you very much in advance.

Last edited by Scott; 05-21-2010 at 06:44 PM.. Reason: Please use code tags
 
