Insert id from same block


 
# 1  
Old 05-17-2014

Hello all,

Please help me modify my file. I want to pick up the ID value from each row whose 3rd column is either gene or protein_coding_gene, and add it as a gene_id attribute to the corresponding rows that have the keyword exon in the 3rd column.



Input file

Code:
10      ensembl gene    195359  195445  .       +       .       ID=GRMZM2G495457;biotype=transposable_element;logic_name=genebuilder
10 ensembl mRNA 195359 195445 . + . ID=GRMZM2G495457_T01;Parent=GRMZM2G495457;biotype=transposable_element;description=est;logic_name=genebuilder
10      .       exon    195359  195445  .       +       .       ID=GRMZM2G495457_E01;Parent=GRMZM2G495457_T01;constitutive=1;rank=1
10      ensembl protein_coding_gene    279670  280203  .       +       .       ID=GRMZM2G041102;biotype=transposable_element;logic_name=genebuilder
10 ensembl mRNA 279670 280203 . + . ID=GRMZM2G041102_T01;Parent=GRMZM2G041102;biotype=transposable_element;description=est;logic_name=genebuilder
10      .       exon    279670  280203  .       +       .       ID=GRMZM2G041102_E01;Parent=GRMZM2G041102_T01;constitutive=1;ensembl_end_phase=-1;rank=1
10      ensembl gene    270190  270594  .       -       .       ID=GRMZM2G310936;biotype=low_confidence;logic_name=genebuilder
10 ensembl mRNA 270190 270594 . - . ID=GRMZM2G310936_T01;Parent=GRMZM2G310936;biotype=low_confidence;description=abinitio;logic_name=genebuilder
10      .       CDS     270190  270339  .       -       0       ID=GRMZM2G310936_P01;Parent=GRMZM2G310936_T01;rank=3
10      .       exon    270190  270339  .       -       .       ID=GRMZM2G310936_E03;Parent=GRMZM2G310936_T01;constitutive=1;ensembl_end_phase=-1;rank=3
10      .       CDS     270383  270430  .       -       0       ID=GRMZM2G310936_P01;Parent=GRMZM2G310936_T01;rank=2
10      .       exon    270383  270430  .       -       .       ID=GRMZM2G310936_E02;Parent=GRMZM2G310936_T01;constitutive=1;rank=2
10      .       CDS     270502  270594  .       -       0       ID=GRMZM2G310936_P01;Parent=GRMZM2G310936_T01;rank=1
10      .       exon    270502  270594  .       -       .       ID=GRMZM2G310936_E01;Parent=GRMZM2G310936_T01;constitutive=1;rank=1

Desired output: the ID picked up from each gene row is appended to the exon rows of the same block as gene_id=...

Code:
10      ensembl gene    195359  195445  .       +       .        ID=GRMZM2G495457;biotype=transposable_element;logic_name=genebuilder
10 ensembl mRNA 195359 195445 . + .  ID=GRMZM2G495457_T01;Parent=GRMZM2G495457;biotype=transposable_element;description=est;logic_name=genebuilder
10      .       exon    195359  195445  .       +       .        ID=GRMZM2G495457_E01;Parent=GRMZM2G495457_T01;constitutive=1;rank=1;gene_id=GRMZM2G495457
10      ensembl protein_coding_gene    279670  280203  .       +       .        ID=GRMZM2G041102;biotype=transposable_element;logic_name=genebuilder
10 ensembl mRNA 279670 280203 . + .  ID=GRMZM2G041102_T01;Parent=GRMZM2G041102;biotype=transposable_element;description=est;logic_name=genebuilder
10      .       exon    279670  280203  .       +       .        ID=GRMZM2G041102_E01;Parent=GRMZM2G041102_T01;constitutive=1;ensembl_end_phase=-1;rank=1;gene_id=GRMZM2G041102
10      ensembl gene    270190  270594  .       -       .       ID=GRMZM2G310936;biotype=low_confidence;logic_name=genebuilder
10 ensembl mRNA 270190 270594 . - .  ID=GRMZM2G310936_T01;Parent=GRMZM2G310936;biotype=low_confidence;description=abinitio;logic_name=genebuilder
10      .       CDS     270190  270339  .       -       0       ID=GRMZM2G310936_P01;Parent=GRMZM2G310936_T01;rank=3
10      .       exon    270190  270339  .       -       .        ID=GRMZM2G310936_E03;Parent=GRMZM2G310936_T01;constitutive=1;ensembl_end_phase=-1;rank=3;gene_id=GRMZM2G310936
10      .       CDS     270383  270430  .       -       0       ID=GRMZM2G310936_P01;Parent=GRMZM2G310936_T01;rank=2
10      .       exon    270383  270430  .       -       .        ID=GRMZM2G310936_E02;Parent=GRMZM2G310936_T01;constitutive=1;rank=2;gene_id=GRMZM2G310936
10      .       CDS     270502  270594  .       -       0       ID=GRMZM2G310936_P01;Parent=GRMZM2G310936_T01;rank=1
10      .       exon    270502  270594  .       -       .        ID=GRMZM2G310936_E01;Parent=GRMZM2G310936_T01;constitutive=1;rank=1;gene_id=GRMZM2G310936

# 2  
Old 05-17-2014
Is ID=value always the 1st element in the semicolon separated list in the last column on the lines that have gene or protein_coding_gene in the 3rd column, or can it appear later in the list or in an earlier column?
# 3  
Old 05-17-2014
I'm familiar with this file type; it is known as General Feature Format, or GFF. On behalf of the OP I can say that
Code:
ID=value

is always the 1st element of the semicolon-separated string in the last column. For this particular file the key will always be
Code:
ID=

as we can see. In other files it can also be
Code:
Name=

Whichever it is, it is consistent throughout the file.

Don, the short answer to your question is yes.
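
As a rough sketch of the point about ID= versus Name=, the key can be passed in as a parameter and looked up anywhere in the last column, so the same code would cover both cases. The function name gff_attr and its arguments are made up for illustration. It assumes, like the rest of this thread, that the attribute column is the last whitespace-separated field and contains no blanks.

```shell
# gff_attr KEY [FILE...]  (hypothetical helper, not from the thread)
# Print the value of KEY from the last column of every
# gene / protein_coding_gene row.
gff_attr() {
  key=$1
  shift
  awk -v key="$key" '
  $3 == "gene" || $3 == "protein_coding_gene" {
      n = split($NF, kv, ";")            # kv[1..n] hold "key=value" strings
      for (i = 1; i <= n; i++) {
          eq = index(kv[i], "=")         # position of the first "="
          if (eq > 0 && substr(kv[i], 1, eq - 1) == key) {
              print substr(kv[i], eq + 1)
              break                      # first match wins
          }
      }
  }' "$@"
}
```

It could be called as gff_attr ID file or gff_attr Name file, depending on which key the file uses.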
# 4  
Old 05-18-2014
Hmmm, if the sample data is representative, then it's not even necessary to take the "ID=" field from an earlier record. The value you want is already in the "ID=" field of the "exon" record. You just have to strip off the "_*" suffix.
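
A minimal sketch of that suffix-stripping idea, assuming every exon ID has the form <gene id>_E<nn> as in the sample data and that ID= is the first attribute. The function name exon_gene_id_from_suffix is invented for illustration. If exon IDs follow some other naming scheme, this would produce wrong gene_id values.

```shell
# Hypothetical helper: derive gene_id from each exon row's own ID by
# removing the last "_..." suffix. Assumes exon IDs look like <gene>_E<nn>.
exon_gene_id_from_suffix() {
  awk '
  $3 == "exon" {
      id = $NF
      sub(/;.*/, "", id)        # keep only the first attribute, "ID=..."
      sub(/^ID=/, "", id)       # drop the key
      sub(/_[^_]*$/, "", id)    # drop the trailing "_E01"-style suffix
      $0 = $0 ";gene_id=" id    # append without touching the other columns
  }
  1' "$@"
}
```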
# 5  
Old 05-18-2014
Well, the ID field in the exon rows may not always be of that form. It happens to be in this case, but it may be represented differently. The 'true gene id' is always in the gene row.
# 6  
Old 05-18-2014
Hello Don, the ID value is always the first part of the list separated by ;, it doesn't appear anywhere else.
# 7  
Old 05-18-2014
The following seems to do what you want:
Code:
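awk '
# Gene rows: remember the gene ID for the exon rows that follow.
$3 == "gene" || $3 == "protein_coding_gene" {
	end = index($NF, ";")				# position of the first ";" in the last field
	id = ";gene_id=" substr($NF, 4, end - 4)	# text between "ID=" and that ";"
}
# Exon rows: append the saved gene_id attribute.
$3 == "exon" {
	$0 = $0 id
}
# Print every line, modified or not.
1' file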

If you want to run this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
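
One edge case the script above does not cover is a gene row whose last column holds only ID=value with no ";" after it. In that case index() returns 0 and the extracted value comes out empty. The following is a hedged variant, not a replacement from the original poster, that handles such rows and skips exon rows seen before any gene row. The function name add_gene_id is made up for illustration, and it still assumes ID= is the first attribute, as confirmed earlier in the thread.

```shell
# Hypothetical wrapper around a variant of the awk script above.
add_gene_id() {
  awk '
  $3 == "gene" || $3 == "protein_coding_gene" {
      id = $NF
      sub(/;.*/, "", id)     # drop everything from the first ";" on, if any
      sub(/^ID=/, "", id)    # drop the key, leaving only the value
  }
  # Only append when a gene ID has actually been seen.
  $3 == "exon" && id != "" {
      $0 = $0 ";gene_id=" id
  }
  1' "$@"
}
```

It reads the named files, or standard input when none are given, for example add_gene_id file > file.new.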