Mean score value by ID over a defined genomic region

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Average score

awk '{if(len==0){last=$4;total=$6;len=1;getline}if($4!=last){printf("%s\t%f\n", last, total/len);last=$4;total=$6;len=1}else{total+=$6;len+=1}}END{printf("%s\t%f\n", last, total/len)}' exon.txt > output.txt

In the attached file I am just trying to group all the same names in column $4 and then...

2. Shell Programming and Scripting

Split a file in more files based on score content

Dear All, I have the following file tabulated:

ID      distanceTSS   score
8434    571269        10
10122   393912        9
7652    6             10
4863    1451          9
8419    39            2
9363    564           21
9333    7714          22
9638    8334          9
1638    1231          11
10701   918           1000
6587    32056         111

What I would like to do is the following, create 100 new files based...

3. AIX

Change lv REGION in HDISK1

Dears, my rootvg is messed up. I cannot extend /opt: as soon as I try to extend the filesystem, it tells me that there is not enough space. Is there any way to change the REGION of the LVs in HDISK1?

lspv -p hdisk0
hdisk0: PP RANGE STATE REGION LV NAME TYPE ...

4. UNIX for Dummies Questions & Answers

overlapped genomic coordinates

Hi, I would like to know how can I get the ID of a feature if its genomic coordinates overlap the coordinates of another file. Example: Get the 4th column (ID) of this file1:

chr1   10     100    gene1
chr2   3000   5000   gene2
chr3   200    1500   gene3

if it overlaps with a feature in this file2: chr2...

5. Shell Programming and Scripting

Region between lines

How can I find the regions between specific lines? I have a file which contains lines like this:

chr1   0        17388    0
chr1   17388    17444    1
chr1   17444    17599    2
chr1   17599    17601    1
chr1   17601    569791   0
chr1   569791   569795   1
chr1   569795   569808   2
chr1   569808   569890   3
chr1   569890   570047   4
...

6. UNIX for Dummies Questions & Answers

Genomic data processing

Dear fellow members, I've just joined the forum and am a newbie to shell scripting and programming. I'm stuck on the following problem. I'm working with large scale genomic data and need to do some analyses on it. Essentially it is text processing problem, so please don't mind the scientific...

7. Shell Programming and Scripting

Grade Score Script Project

What I thought would be an extremely simple project has proven more difficult for me than I thought. Here are the parameters: Thus far, I've been able to sort the final grades, but I'm having a lot of trouble with appending the correlating letter grade to the end of each line. Any help would be...

8. Shell Programming and Scripting

remove lines based on score criteria

Hi guys, please guide me to a solution.

PART-I INPUT FILE (has 2 columns, ID and score):

TC5584_1     93.9
DV161411_2   79.5
BP132435_5   46.8
EB682112_1   34.7
BP132435_4   29.5
TC13860_2    10.1

OUTPUT FILE (it shouldn't contain the line 'BP132435_4 29.5' as BP132435 is repeated...

9. Post Here to Contact Site Administrators and Moderators

I cant updated the score on space invaders

Hello. The same thing happened to me yesterday: I can't record my score on the Invaders game.

10. UNIX for Advanced & Expert Users

stack region

How can I determine what percentage of the stack region is currently in use? (I am using Tru64 UNIX.)


spidey

SPIDEY(1)						     NCBI Tools User's Manual							 SPIDEY(1)

NAME

       spidey - align mRNA sequences to a genome

SYNOPSIS

       spidey [-] [-F N] [-G] [-L N] [-M filename] [-N filename] [-R filename] [-S p/m] [-T N] [-X] [-a filename] [-c N] [-d] [-e X] [-f X] [-g X]
       -i filename [-j] [-k filename] [-l N] -m filename [-n N] [-o str] [-p N] [-r c/d/m/p/v] [-s] [-t filename] [-u] [-w]

DESCRIPTION

       spidey is a tool for aligning one or more mRNA sequences to a given genomic sequence.  spidey was written with two main goals in mind: find
       good alignments regardless of intron size; and avoid getting confused by nearby pseudogenes and paralogs.  Towards the first goal, spidey
       uses BLAST and Dot View (another local alignment tool) to find its alignments; since these are both local alignment tools, spidey does not
       intrinsically favor shorter or longer introns and has no maximum intron size.  To avoid mistakenly including exons from paralogs and
       pseudogenes, spidey first defines windows on the genomic sequence and then performs the mRNA-to-genomic alignment separately within each
       window.  Because of the way the windows are constructed, neighboring paralogs or pseudogenes should be in separate windows and should not
       be included in the final spliced alignment.

   Initial alignments and construction of genomic windows
       spidey takes as input a single genomic sequence and a set of mRNA accessions or FASTA sequences.  All processing is done one mRNA  sequence
       at a time.  The first step for each mRNA sequence is a high-stringency BLAST against the genomic sequence.  The resulting hits are analyzed
       to find the genomic windows.

       The BLAST alignments are sorted by score and then assigned into windows by a recursive function which takes the first alignment and then
       goes down the alignment list to find all alignments that are consistent with the first (same strand of mRNA, both the mRNA and genomic
       coordinates are nonoverlapping and linearly consistent).  On subsequent passes, the remaining alignments are examined and are put into
       their own nonoverlapping, consistent windows, until no alignments are left.  Depending on how many gene models are desired, the top n
       windows are chosen to go on to the next step and the others are deleted.
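
       The window construction described above can be sketched in Python as follows.  This is only an illustration under stated assumptions, not
       spidey's actual code: the Hit record, the half-open coordinate convention, and the names consistent() and build_windows() are hypothetical,
       and the passes are written as a loop rather than the recursive function the text mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one BLAST alignment (half-open coordinates)."""
    m_start: int   # start on the mRNA
    m_end: int     # end on the mRNA
    g_start: int   # start on the genomic sequence (g_start < g_end)
    g_end: int     # end on the genomic sequence
    strand: str    # "+" or "-"
    score: float


def consistent(a, b):
    """Same strand, nonoverlapping on mRNA and genome, and in the same linear order."""
    if a.strand != b.strand:
        return False
    first, second = (a, b) if a.m_start <= b.m_start else (b, a)
    if first.m_end > second.m_start:        # overlap on the mRNA
        return False
    if a.strand == "+":
        return first.g_end <= second.g_start
    return second.g_end <= first.g_start    # minus strand: genomic order is reversed


def build_windows(hits, top_n=1):
    """Group hits into windows, best-scoring seed first, and keep the top_n windows."""
    remaining = sorted(hits, key=lambda h: h.score, reverse=True)
    windows = []
    while remaining:
        window, leftover = [remaining[0]], []
        for hit in remaining[1:]:
            if all(consistent(hit, member) for member in window):
                window.append(hit)
            else:
                leftover.append(hit)
        windows.append(window)
        remaining = leftover                # next pass works on what is left
    return windows[:top_n]
```

       With two exons of a gene and two exons of a downstream paralog as input, the sketch places each gene copy in its own window, which is the
       separation the paragraph above is describing.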

   Aligning in each window
       Once the genomic windows are constructed, the initial BLAST alignments are freed and another BLAST search is performed, this time with the
       entire mRNA against the genomic region defined by the window, and at a lower stringency than the initial search.  spidey then uses a greedy
       algorithm to generate a high-scoring, nonoverlapping subset of the alignments from the second BLAST search.  This consistent set is
       analyzed carefully to make sure that the entire mRNA sequence is covered by the alignments.  When gaps are found between the alignments,
       the appropriate region of genomic sequence is searched against the missing mRNA, first using a very low-stringency BLAST and, if the BLAST
       fails to find a hit, using DotView functions to locate the alignment.  When gaps are found at the ends of the alignments, the BLAST and
       DotView searches are actually allowed to extend past the boundaries of the window.  If the 3' end of the mRNA does not align completely, it
       is first examined for the presence of a poly(A) tail.  No attempt is made to align the portion of the mRNA that seems to be a poly(A) tail;
       sometimes there is a poly(A) tail that does align to the genomic sequence, and these are noted because they indicate the possibility of a
       pseudogene.
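
       The coverage check in this step amounts to finding the stretches of mRNA that no alignment covers.  A minimal Python sketch, assuming
       half-open (start, end) mRNA intervals; the function name uncovered_gaps() is hypothetical and the sketch does not model the BLAST or
       DotView searches that spidey then runs on each gap.

```python
def uncovered_gaps(mrna_len, segments):
    """Return the half-open mRNA intervals not covered by any (start, end) segment."""
    gaps = []
    pos = 0                                  # first mRNA position not yet covered
    for start, end in sorted(segments):
        if start > pos:
            gaps.append((pos, start))        # internal (or 5') gap
        pos = max(pos, end)
    if pos < mrna_len:
        gaps.append((pos, mrna_len))         # gap at the 3' end
    return gaps
```

       A gap that touches position 0 or mrna_len corresponds to the end-of-alignment case, where the text says the follow-up search may extend
       past the window boundaries.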

       Now that the mRNA is completely covered by the set of alignments, the boundaries of the alignments (there should be one alignment per exon
       now) are adjusted so that the alignments abut each other precisely and so that they are adjacent to good splice donor and acceptor sites.
       Most commonly, two adjacent exons' alignments overlap by as much as 20 or 30 base pairs on the mRNA sequence.  The true exon boundary may
       lie anywhere within this overlap, or (as we have seen empirically) even a few base pairs outside the overlap.  To position the exon
       boundaries, the overlap plus a few base pairs on each side is examined for splice donor sites, using functions that have different splice
       matrices depending on the organism chosen.  The top few splice donor sites (by score) are then evaluated as to how much they affect the
       original alignment boundaries.  The site that affects the boundaries the least is chosen, and is evaluated as to the presence of an
       acceptor site.  The alignments are truncated or extended as necessary so that they terminate at the splice donor site and so that they do
       not overlap.
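
       The donor-site choice can be illustrated with a small Python sketch.  It is a simplification under stated assumptions: candidate sites
       arrive as (position, score) pairs already scored by some splice matrix, "affects the boundaries the least" is approximated by distance
       from a single original boundary position, and the name choose_donor() and the default top_k=3 are hypothetical.

```python
def choose_donor(candidates, original_boundary, top_k=3):
    """Among the top_k highest-scoring candidate donor sites, return the position
    closest to the original alignment boundary.

    candidates: list of (position, score) pairs; scoring itself is not modeled here.
    """
    if not candidates:
        raise ValueError("no candidate donor sites")
    top = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
    position, _score = min(top, key=lambda c: abs(c[0] - original_boundary))
    return position
```

       Note that a low-scoring site very close to the boundary is never selected, because it does not make it into the top few candidates.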

   Final result
       The windows are examined carefully to get the percent identity per exon, the number of gaps per exon, the overall percent identity, the
       percent coverage of the mRNA, presence of an aligning or non-aligning poly(A) tail, number of splice donor sites and the presence or
       absence of splice donor and acceptor sites for each exon, and the occurrence of an mRNA that has a 5' or 3' end (or both) that does not
       align to the genomic sequence.  If the overall percent identity and percent length coverage are above the user-defined cutoffs, a summary
       report is printed, and, if requested, a text alignment showing identities and mismatches is also printed.
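
       The cutoff test at the end of this step can be sketched in Python.  The definitions used here are assumptions for illustration only:
       identity is matches divided by aligned mRNA bases, coverage is aligned mRNA bases divided by mRNA length, and a value equal to the cutoff
       is treated as passing.  The name summarize() is hypothetical; the two cutoffs correspond to the -c and -l options.

```python
def summarize(exons, mrna_len, identity_cutoff=0.0, coverage_cutoff=0.0):
    """Compute overall percent identity and percent coverage for one gene model.

    exons: list of (mrna_start, mrna_end, matches), half-open mRNA coordinates.
    Returns (identity, coverage, passes).
    """
    aligned = sum(end - start for start, end, _ in exons)
    matches = sum(m for _, _, m in exons)
    identity = 100.0 * matches / aligned if aligned else 0.0
    coverage = 100.0 * aligned / mrna_len if mrna_len else 0.0
    # Assumption: ">=" at the boundary; the text only says "above".
    passes = identity >= identity_cutoff and coverage >= coverage_cutoff
    return identity, coverage, passes
```

       For example, two exons covering 250 of 300 mRNA bases with 248 matches give about 99.2% identity and 83.3% coverage, so the model passes
       cutoffs of 95 and 80 but fails a coverage cutoff of 90.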

   Interspecies alignments
       spidey is capable of performing interspecies alignments.  The major difference in interspecies alignments is that the mRNA-genomic identity
       will not be close to 100% as it is in intraspecies alignments; also, the alignments have numerous and lengthy gaps.  If spidey is used in
       its normal mode to do interspecies alignments, it produces gene models with many, many short exons.  When the interspecies flag is set,
       spidey uses different BLAST parameters to encourage longer and more gaps and to not penalize as heavily for mismatches.  This way, the
       alignments for the exons are much longer and more closely approximate the actual gene structure.

   Extracting CDS alignments
       When spidey is run in network-aware mode or when ASN.1 files are used for the mRNA records, it is capable of extracting a CDS alignment
       from an mRNA alignment and printing the CDS information also.  Since the CDS alignment is just a subset of the mRNA alignment, it is
       relatively straightforward to truncate the exon alignments as necessary and to generate a CDS alignment.  Furthermore, the untranslated
       regions are now defined, so the percent identity for the 5' and 3' untranslated regions is also calculated.
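
       Truncating exon alignments to the CDS is plain interval clipping.  A minimal Python sketch, assuming exons and the CDS are given as
       half-open intervals in mRNA coordinates; the name clip_to_cds() is hypothetical and the sketch ignores the genomic side of each alignment.

```python
def clip_to_cds(exons, cds_start, cds_end):
    """Clip each (start, end) exon interval to the CDS, dropping exons that lie
    entirely in an untranslated region."""
    clipped = []
    for start, end in exons:
        new_start, new_end = max(start, cds_start), min(end, cds_end)
        if new_start < new_end:
            clipped.append((new_start, new_end))
    return clipped
```

       Whatever is clipped away before cds_start is the 5' untranslated region and whatever is clipped away after cds_end is the 3' untranslated
       region, which is why both become defined once the CDS alignment exists.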

OPTIONS

       A summary of options is included below.

       -      Print usage message.

       -F N   Start of genomic interval desired (from; 0-based).

       -G     Input file is a GI list.

       -L N   The extra-large intron size to use (default = 220000).

       -M filename
	      File with donor splice matrix.

       -N filename
	      File with acceptor splice matrix.

       -R filename
	      File (including path) to repeat blast database for filtering.

       -S p/m Restrict to plus (p) or minus (m) strand of genomic sequence.

       -T N   Stop of genomic interval desired (to; 0-based).

       -X     Use  extra-large intron sizes (increases the limit for initial and terminal introns from 100kb to 240kb and for all others from 35kb
	      to 120kb); may result in significantly longer compute times.

       -a filename
	      Output file for alignments when directed to a separate file with -p 3 (default = spidey.aln).

       -c N   Identity cutoff, in percent, for quality control purposes.

       -d     Also try to align coding sequences corresponding to the given mRNA records (may require network access).

       -e X   First-pass e-value (default = 1.0e-10).  Higher values increase speed at the cost of sensitivity.

       -f X   Second-pass e-value (default = 0.001).

       -g X   Third-pass e-value (default = 10).

       -i filename
	      Input file containing the genomic sequence in ASN.1 or FASTA format.  If your computer is running on a network that can access
	      GenBank, you can substitute the desired accession number for the filename.

       -j     Print ASN.1 alignment?

       -k filename
	      File for ASN.1 output with -j (default = spidey.asn).

       -l N   Length coverage cutoff, in percent.

       -m filename
	      Input  file  containing the mRNA sequence(s) in ASN.1 or FASTA format, or a list of their accessions (with -G).  If your computer is
	      running on a network that can access GenBank, you can substitute a single accession number for the filename.

       -n N   Number of gene models to return per input mRNA (default = 1).

       -o str Main output file (default = stdout; contents controlled by -p).

       -p N   Print alignment?
	      0      summary and alignments together (default)
	      1      just the summary
	      2      just the alignments
	      3      summary and alignments in different files

       -r c/d/m/p/v
	      Organism of genomic sequence, used to determine splice matrices.
	      c      C. elegans
	      d      Drosophila
	      m      Dictyostelium discoideum
	      p      plant
	      v      vertebrate (default)

       -s     Tune for interspecies alignments.

       -t filename
	      File with feature table, in 4 tab-delimited columns:
	      seqid  (e.g., NM_04377.1)
	      name   (only repetitive_region is currently supported)
	      start  (0-based)
	      stop   (0-based)

       -u     Make a multiple alignment of all input mRNAs (which must overlap on the genomic sequence).

       -w     Consider lowercase characters in input FASTA sequences to be masked.
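
       As an illustration of how the options above combine, the following Python sketch writes a feature table in the four-column layout
       described under -t and assembles an argument list for spidey.  It only builds the command and does not run it.  The helper names and all
       file names are hypothetical; only flags documented above are used.

```python
from pathlib import Path

ORGANISMS = {"c", "d", "m", "p", "v"}   # values accepted by -r, per the list above


def write_feature_table(path, rows):
    """Write (seqid, name, start, stop) rows as 4 tab-delimited columns (0-based)."""
    lines = ["\t".join([seqid, name, str(start), str(stop)])
             for seqid, name, start, stop in rows]
    Path(path).write_text("\n".join(lines) + "\n")


def spidey_command(genomic, mrna, output, organism="v", feature_table=None):
    """Build an argv list: summary only (-p 1) written to the main output file (-o)."""
    if organism not in ORGANISMS:
        raise ValueError(f"unknown organism code: {organism!r}")
    cmd = ["spidey", "-i", str(genomic), "-m", str(mrna),
           "-r", organism, "-p", "1", "-o", str(output)]
    if feature_table is not None:
        cmd += ["-t", str(feature_table)]
    return cmd
```

       The resulting list could be passed to subprocess.run() on a system where spidey is installed; for example,
       spidey_command("genomic.fa", "mrna.fa", "summary.txt") yields the equivalent of "spidey -i genomic.fa -m mrna.fa -r v -p 1 -o summary.txt".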

AUTHOR

       Sarah Wheelan and others at the National Center for Biotechnology Information; Steffen Moeller contributed to this documentation.

SEE ALSO

       blast(1), <http://www.ncbi.nlm.nih.gov/spidey>

NCBI
								    2005-01-25								 SPIDEY(1)
