awk script to extract transcript information from gff3 file
Post 303043938 by rdrtx1 on Tuesday 11th of February 2020 11:02:45 AM
Code:
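awk '
# Print the header once, before the first record is processed.
(! h++) {print "transcript_id", "gene_name", "description", "chromosome", "strand", "transcript_start", "transcript_end", "gene_start", "gene_end";}
# Remember start ($4) and end ($5) of every feature that has an ID and a Name, keyed by that Name.
$9 ~ /ID=.*Name=/ {n=$9; sub(".*Name=", "", n); sub(";.*", "", n); gs[n]=$4; ge[n]=$5;}
# Any feature whose type contains RNA preceded by at least one character (mRNA, tRNA, ncRNA, ...).
$3~/.RNA/ {
   n=$9; sub(".*Name=", "", n); sub(";.*", "", n);      # Name of the transcript
   p=$9; sub(".*Parent=", "", p); sub(";.*", "", p);    # value of Parent
   # "Desc" is a literal placeholder; gs[p] and ge[p] are set only if the Parent value equals a Name stored above.
   print n, p, "Desc", $1, $7, $4, $5, gs[p], ge[p];
}
' FS="\t" OFS="\t" input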
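
The script above stores gene coordinates under the gene's Name and looks them up with the transcript's Parent value, so it fills the last two columns only when those two strings are identical. In GFF3, Parent refers to the ID of the parent feature, so a variant keyed by ID may be more robust. The following is only a sketch under assumptions, not the method posted above: the names gff3_transcripts and attr are made up for illustration, gene features are assumed to have the type "gene" and to appear before their transcripts, each transcript is assumed to have a single Parent, and the placeholder description column is left out.

```shell
# gff3_transcripts: hypothetical helper; reads GFF3 from the files given
# as arguments, or from standard input when no file is given.
gff3_transcripts() {
  awk '
  # attr(s, key): return the value of key=... from a GFF3 attribute
  # string, or "" when the key is absent. i, n and kv are locals.
  function attr(s, key,    i, n, kv) {
    n = split(s, kv, ";")
    for (i = 1; i <= n; i++)
      if (index(kv[i], key "=") == 1) return substr(kv[i], length(key) + 2)
    return ""
  }
  BEGIN {
    FS = OFS = "\t"
    print "transcript_id", "gene_name", "chromosome", "strand",
          "transcript_start", "transcript_end", "gene_start", "gene_end"
  }
  /^#/ { next }                       # skip directives and comments
  $3 == "gene" {                      # remember each gene under its ID
    id = attr($9, "ID")
    gname[id] = attr($9, "Name"); gs[id] = $4; ge[id] = $5
  }
  $3 ~ /RNA$/ {                       # mRNA, tRNA, ncRNA, ...
    p = attr($9, "Parent")            # assumed to be a single gene ID
    print attr($9, "Name"), gname[p], $1, $7, $4, $5, gs[p], ge[p]
  }
  ' "$@"
}
```

It could be called as gff3_transcripts input > transcripts.tsv. Splitting column 9 on ";" and matching each key at the start of a field avoids accidental matches on attributes whose names merely end in "Name" or "Parent".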


Last edited by RavinderSingh13; 02-28-2020 at 01:51 AM..
Bio::Graphics::Glyph::ideogram(3pm)    User Contributed Perl Documentation    Bio::Graphics::Glyph::ideogram(3pm)

NAME
       Bio::Graphics::Glyph::ideogram - The "ideogram" glyph

SYNOPSIS
       See Bio::Graphics::Panel and Bio::Graphics::Glyph.

DESCRIPTION
       This glyph draws a section of a chromosome ideogram. It relies on certain data from the feature to determine which color should be used (stain) and whether the segment is a telomere or centromere or a regular cytoband. The centromeres and 'var'-marked bands are rendered with diagonal black-on-white patterns if the "-patterns" option is true; otherwise they are rendered in dark gray. This is to prevent a libgd2 crash on certain 64-bit platforms when rendering patterned images.

       The cytoband features would typically be formatted like this in GFF3:

          ...
          ChrX    UCSC    cytoband    136700001    139000000    .    .    .    Parent=ChrX;Name=Xq27.1;Alias=ChrXq27.1;stain=gpos75;
          ChrX    UCSC    cytoband    139000001    140700000    .    .    .    Parent=ChrX;Name=Xq27.2;Alias=ChrXq27.2;stain=gneg;
          ChrX    UCSC    cytoband    140700001    145800000    .    .    .    Parent=ChrX;Name=Xq27.3;Alias=ChrXq27.3;stain=gpos100;
          ChrX    UCSC    cytoband    145800001    153692391    .    .    .    Parent=ChrX;Name=Xq28;Alias=ChrXq28;stain=gneg;
          ChrY    UCSC    cytoband    1            1300000      .    .    .    Parent=ChrY;Name=Yp11.32;Alias=ChrYp11.32;stain=gneg;

       which in this case is a GFF-ized cytoband coordinate file from UCSC:

          http://hgdownload.cse.ucsc.edu/goldenPath/hg16/database/cytoBand.txt.gz

       and the corresponding GBrowse config options would be like this to create an ideogram overview track for the whole chromosome:

       The 'chromosome' feature below would be aggregated from bands and centromere using the default chromosome aggregator.

          [CYT:overview]
          feature       = chromosome
          glyph         = ideogram
          fgcolor       = black
          bgcolor       = gneg:white gpos25:silver gpos50:gray gpos:gray gpos75:darkgray gpos100:black acen:cen gvar:var
          arcradius     = 6
          height        = 25
          bump          = 0
          label         = 0

       A script to reformat UCSC annotations to GFF3 format can be found at the end of this documentation.

OPTIONS
       The following options are standard among all Glyphs. See Bio::Graphics::Glyph for a full explanation.

         Option            Description                      Default
         ------            -----------                      -------
         -fgcolor          Foreground color                 black
         -outlinecolor     Synonym for -fgcolor
         -linewidth        Line width                       1
         -height           Height of glyph                  10
         -font             Glyph font                       gdSmallFont
         -connector        Connector type                   0 (false)
         -connector_color  Connector color                  black
         -label            Whether to draw a label          0 (false)
         -description      Whether to draw a description    0 (false)

       The following options are specific to the ideogram glyph.

         Option            Description                      Default
         ------            -----------                      -------
         -bgcolor          Band coloring string             none
         -bgfallback       Coloring to use when no bands    yellow
                           are present

       -bgcolor is used to map each chromosome band's "stain" attribute into a color or pattern. It is a string that looks like this:

          gneg:white gpos25:silver gpos50:gray gpos:gray gpos75:darkgray gpos100:black acen:cen gvar:var

       This is saying to use "white" for features whose stain attribute is "gneg", "silver" for those whose stain attribute is "gpos25", and so on. Several special values are recognized: "stalk" draws a narrower gray region and is usually used to indicate an acrocentric stalk. "var" creates a diagonal black-on-white pattern. "cen" draws a centromere.

       If -bgcolor is just a color name, like "yellow", the glyph will ignore all bands and just draw a filled in chromosome.

       If -bgfallback is set to a color name or value, then the glyph will fall back to the indicated background color if the chromosome contains no bands.

UCSC TO GFF CONVERSION SCRIPT
       The following short script can be used to convert a UCSC cytoband annotation file into GFF format. If you have the lynx web-browser installed you can call it like this in order to download and convert the data in a single operation:

          fetchideogram.pl http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/cytoBand.txt.gz

       Otherwise you will need to download the file first. Note the difference between this script and input data from previous versions of ideogram.pm: UCSC annotations are used in place of NCBI annotations.

          #!/usr/bin/perl

          use strict;
          my %stains;
          my %centros;
          my %chrom_ends;

          foreach (@ARGV) {
              if (/^(ftp|http|https):/) {
                  $_ = "lynx --dump $_ |gunzip -c|";
              } elsif (/\.gz$/) {
                  $_ = "gunzip -c $_ |";
              }
              print STDERR "Processing $_\n";
          }

          print "##gff-version 3\n";
          while(<>) {
              chomp;
              my($chr,$start,$stop,$band,$stain) = split /\t/;
              $start++;
              $chr = ucfirst($chr);
              if(!(exists($chrom_ends{$chr})) || $chrom_ends{$chr} < $stop) {
                  $chrom_ends{$chr} = $stop;
              }
              my ($arm) = $band =~ /(p|q)\d+/;
              $stains{$stain} = 1;
              if ($stain eq 'acen') {
                  $centros{$chr}->{$arm}->{start} = $stop;
                  $centros{$chr}->{$arm}->{stop} = $start;
                  next;
              }
              $chr =~ s/chr//i;
              print qq/$chr\tUCSC\tcytoband\t$start\t$stop\t.\t.\t.\tParent=$chr;Name=$chr;Alias=$chr$band;stain=$stain;\n/;
          }

          foreach my $chr(sort keys %chrom_ends) {
              my $chr_orig = $chr;
              $chr =~ s/chr//i;
              print qq/$chr\tUCSC\tcentromere\t$centros{$chr_orig}->{p}->{stop}\t$centros{$chr_orig}->{q}->{start}\t.\t+\t.\tParent=$chr;Name=$chr\_cent\n/;
          }

BUGS
       Please report them.

SEE ALSO
       Bio::Graphics::Panel, Bio::Graphics::Glyph, Bio::Graphics::Glyph::arrow, Bio::Graphics::Glyph::cds, Bio::Graphics::Glyph::crossbox, Bio::Graphics::Glyph::diamond, Bio::Graphics::Glyph::dna, Bio::Graphics::Glyph::dot, Bio::Graphics::Glyph::ellipse, Bio::Graphics::Glyph::extending_arrow, Bio::Graphics::Glyph::generic, Bio::Graphics::Glyph::graded_segments, Bio::Graphics::Glyph::heterogeneous_segments, Bio::Graphics::Glyph::line, Bio::Graphics::Glyph::pinsertion, Bio::Graphics::Glyph::primers, Bio::Graphics::Glyph::rndrect, Bio::Graphics::Glyph::segments, Bio::Graphics::Glyph::ruler_arrow, Bio::Graphics::Glyph::toomany, Bio::Graphics::Glyph::transcript, Bio::Graphics::Glyph::transcript2, Bio::Graphics::Glyph::translation, Bio::Graphics::Glyph::triangle, Bio::DB::GFF, Bio::SeqI, Bio::SeqFeatureI, Bio::Das, GD

AUTHOR
       Gudmundur A. Thorisson <mummi@cshl.edu>

       Copyright (c) 2001-2006 Cold Spring Harbor Laboratory

CONTRIBUTORS
       Sheldon McKay <mckays@cshl.edu>

       This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See DISCLAIMER.txt for disclaimers of warranty.

perl v5.14.2    2012-02-20    Bio::Graphics::Glyph::ideogram(3pm)