awk script to extract transcript information from gff3 file


 
# 1  
Old 02-11-2020

I need help extracting transcript information from a GFF3 file.
Here is the input:
Code:
Chr01	JGI	gene	82773	86941	.	-	.	ID=Potri.001G000900;Name=Potri.001G000900
Chr01	JGI	mRNA	82793	86530	.	-	.	ID=PAC:27047814;Name=Potri.001G000900.1;pacid=27047814;longest=1;Parent=Potri.001G000900
Chr01	JGI	exon	86331	86530	.	-	.	ID=PAC:27047814.exon.1;Parent=PAC:27047814;pacid=27047814
Chr01	JGI	CDS	86331	86530	.	-	0	ID=PAC:27047814.CDS.1;Parent=PAC:27047814;pacid=27047814
Chr01	JGI	exon	85729	85816	.	-	.	ID=PAC:27047814.exon.2;Parent=PAC:27047814;pacid=27047814
Chr01	JGI	CDS	85729	85816	.	-	1	ID=PAC:27047814.CDS.2;Parent=PAC:27047814;pacid=27047814
Chr01	JGI	exon	85531	85590	.	-	.	ID=PAC:27047814.exon.3;Parent=PAC:27047814;pacid=27047814
Chr01	JGI	CDS	85531	85590	.	-	0	ID=PAC:27047814.CDS.3;Parent=PAC:27047814;pacid=27047814
Chr01	JGI	exon	85162	85224	.	-	.	ID=PAC:27047814.exon.4;Parent=PAC:27047814;pacid=27047814
Chr01	JGI	CDS	85162	85224	.	-	0	ID=PAC:27047814.CDS.4;Parent=PAC:27047814;pacid=27047814
Chr01	JGI	exon	84838	85020	.	-	.	ID=PAC:27047814.exon.5;Parent=PAC:27047814;pacid=27047814
Chr01	JGI	CDS	84838	85020	.	-	0	ID=PAC:27047814.CDS.5;Parent=PAC:27047814;pacid=27047814
Chr01	JGI	exon	84635	84746	.	-	.	ID=PAC:27047814.exon.6;Parent=PAC:27047814;pacid=27047814
Chr01	JGI	CDS	84635	84746	.	-	0	ID=PAC:27047814.CDS.6;Parent=PAC:27047814;pacid=27047814
Chr01	JGI	exon	84304	84521	.	-	.	ID=PAC:27047814.exon.7;Parent=PAC:27047814;pacid=27047814
Chr01	JGI	CDS	84304	84521	.	-	2	ID=PAC:27047814.CDS.7;Parent=PAC:27047814;pacid=27047814
Chr01	JGI	exon	82793	83260	.	-	.	ID=PAC:27047814.exon.8;Parent=PAC:27047814;pacid=27047814
Chr01	JGI	three_prime_UTR	82793	83167	.	-	.	ID=PAC:27047814.three_prime_UTR.1;Parent=PAC:27047814;pacid=27047814
Chr01	JGI	CDS	83168	83260	.	-	0	ID=PAC:27047814.CDS.8;Parent=PAC:27047814;pacid=27047814
Chr01	JGI	mRNA	82773	86941	.	-	.	ID=PAC:27047815;Name=Potri.001G000900.2;pacid=27047815;longest=0;Parent=Potri.001G000900
Chr01	JGI	exon	86686	86941	.	-	.	ID=PAC:27047815.exon.1;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	five_prime_UTR	86686	86941	.	-	.	ID=PAC:27047815.five_prime_UTR.1;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	exon	86331	86489	.	-	.	ID=PAC:27047815.exon.2;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	CDS	86331	86470	.	-	0	ID=PAC:27047815.CDS.1;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	five_prime_UTR	86471	86489	.	-	.	ID=PAC:27047815.five_prime_UTR.2;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	exon	85729	85816	.	-	.	ID=PAC:27047815.exon.3;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	CDS	85729	85816	.	-	1	ID=PAC:27047815.CDS.2;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	exon	85531	85590	.	-	.	ID=PAC:27047815.exon.4;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	CDS	85531	85590	.	-	0	ID=PAC:27047815.CDS.3;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	exon	85162	85224	.	-	.	ID=PAC:27047815.exon.5;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	CDS	85162	85224	.	-	0	ID=PAC:27047815.CDS.4;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	exon	84838	85035	.	-	.	ID=PAC:27047815.exon.6;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	CDS	84838	85035	.	-	0	ID=PAC:27047815.CDS.5;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	exon	84635	84746	.	-	.	ID=PAC:27047815.exon.7;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	CDS	84635	84746	.	-	0	ID=PAC:27047815.CDS.6;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	exon	84304	84521	.	-	.	ID=PAC:27047815.exon.8;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	CDS	84304	84521	.	-	2	ID=PAC:27047815.CDS.7;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	exon	82773	83260	.	-	.	ID=PAC:27047815.exon.9;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	three_prime_UTR	82773	83167	.	-	.	ID=PAC:27047815.three_prime_UTR.1;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	CDS	83168	83260	.	-	0	ID=PAC:27047815.CDS.8;Parent=PAC:27047815;pacid=27047815
Chr01	JGI	gene	95641	101115	.	+	.	ID=Potri.001G001200;Name=Potri.001G001200
Chr01	JGI	tRNA	95641	100989	.	+	.	ID=PAC:27041679;Name=Potri.001G001200.2;pacid=27041679;longest=0;Parent=Potri.001G001200
Chr01	JGI	exon	95641	95818	.	+	.	ID=PAC:27041679.exon.1;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	CDS	95641	95818	.	+	0	ID=PAC:27041679.CDS.1;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	exon	96385	96554	.	+	.	ID=PAC:27041679.exon.2;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	CDS	96385	96554	.	+	2	ID=PAC:27041679.CDS.2;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	exon	97086	97143	.	+	.	ID=PAC:27041679.exon.3;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	CDS	97086	97143	.	+	0	ID=PAC:27041679.CDS.3;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	exon	97438	97571	.	+	.	ID=PAC:27041679.exon.4;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	CDS	97438	97571	.	+	2	ID=PAC:27041679.CDS.4;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	exon	97644	97768	.	+	.	ID=PAC:27041679.exon.5;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	CDS	97644	97768	.	+	0	ID=PAC:27041679.CDS.5;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	exon	97920	98095	.	+	.	ID=PAC:27041679.exon.6;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	CDS	97920	98095	.	+	1	ID=PAC:27041679.CDS.6;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	exon	98894	99082	.	+	.	ID=PAC:27041679.exon.7;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	CDS	98894	99082	.	+	2	ID=PAC:27041679.CDS.7;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	exon	99193	100456	.	+	.	ID=PAC:27041679.exon.8;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	CDS	99193	100070	.	+	2	ID=PAC:27041679.CDS.8;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	three_prime_UTR	100071	100456	.	+	.	ID=PAC:27041679.three_prime_UTR.1;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	exon	100508	100734	.	+	.	ID=PAC:27041679.exon.9;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	three_prime_UTR	100508	100734	.	+	.	ID=PAC:27041679.three_prime_UTR.2;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	exon	100874	100989	.	+	.	ID=PAC:27041679.exon.10;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	three_prime_UTR	100874	100989	.	+	.	ID=PAC:27041679.three_prime_UTR.3;Parent=PAC:27041679;pacid=27041679
Chr01	JGI	tRNA	95641	101115	.	+	.	ID=PAC:27041680;Name=Potri.001G001200.1;pacid=27041680;longest=1;Parent=Potri.001G001200
Chr01	JGI	exon	95641	95818	.	+	.	ID=PAC:27041680.exon.1;Parent=PAC:27041680;pacid=27041680
Chr01	JGI	CDS	95641	95818	.	+	0	ID=PAC:27041680.CDS.1;Parent=PAC:27041680;pacid=27041680
Chr01	JGI	exon	96385	96554	.	+	.	ID=PAC:27041680.exon.2;Parent=PAC:27041680;pacid=27041680
Chr01	JGI	CDS	96385	96554	.	+	2	ID=PAC:27041680.CDS.2;Parent=PAC:27041680;pacid=27041680
Chr01	JGI	exon	97086	97143	.	+	.	ID=PAC:27041680.exon.3;Parent=PAC:27041680;pacid=27041680
Chr01	JGI	CDS	97086	97143	.	+	0	ID=PAC:27041680.CDS.3;Parent=PAC:27041680;pacid=27041680
Chr01	JGI	exon	97438	97571	.	+	.	ID=PAC:27041680.exon.4;Parent=PAC:27041680;pacid=27041680
Chr01	JGI	CDS	97438	97571	.	+	2	ID=PAC:27041680.CDS.4;Parent=PAC:27041680;pacid=27041680

Here is the desired output:
Code:
transcript_id	gene_name	description	chromosome	strand	transcript_start	transcript_end	gene_start	gene_end
Potri.001G000900.1	Potri.001G000900	desc	Chr01	-	82793	86530	82773	86941
Potri.001G000900.2	Potri.001G000900	desc	Chr01	-	82773	86941	82773	86941
Potri.001G001200.2	Potri.001G001200	desc	Chr01	+	95641	100989	95641	101115
Potri.001G001200.1	Potri.001G001200	desc	Chr01	+	95641	101115	95641	101115

I have been trying to get this output for many months, but I still haven't found a good solution. I appreciate all your effort and help.
Thank you in advance.
# 2  
Old 02-11-2020
I came up with this script, but its output is not correct.
Code:
awk '{if(g3=="mRNA"){split($9,a,"=");split(a[2],b,";");split(g9,ga,"=");split(ga[2],gb,";");print b[1]"\t"gb[1]"\tDesc\t"$1"\t"$7"\t"g4"\t"g5"\tPAC\tPEP\t"$4"\t"$5};g3=$3;g1=$1;g2=$2;g4=$4;g5=$5;g9=$9}'

# 3  
Old 02-11-2020
Code:
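awk '
# Print the header once, before the first record is processed.
(! h++) {print "transcript_id", "gene_name", "description", "chromosome", "strand", "transcript_start", "transcript_end", "gene_start", "gene_end";}
# Store start and end of every named feature, keyed by the value of the second attribute (Name).
$9 ~ /ID=.*Name=/ {split($9,a,";");split(a[2], b,"="); gs[b[2]]=$4; ge[b[2]]=$5;}
# Transcript lines: this assumes Name is the 2nd and Parent the 5th attribute in column 9.
$3~/.RNA/ {
split($9,a,";");
split(a[2],b,"=");
split(a[5],c,"=");
print b[2], c[2], "Desc", $1, $7, $4, $5, gs[c[2]], ge[c[2]];
}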
' OFS="\t" input


# 4  
Old 02-11-2020
Private IDs

# 5  
Old 02-11-2020
Code:
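awk '
# Print the header once, before the first record is processed.
(! h++) {print "transcript_id", "gene_name", "description", "chromosome", "strand", "transcript_start", "transcript_end", "gene_start", "gene_end";}
# Remember start and end of every feature that has a Name, keyed by that Name (this includes the gene lines).
$9 ~ /ID=.*Name=/ {n=$9; sub(".*Name=", "", n); sub(";.*", "", n); gs[n]=$4; ge[n]=$5;}
# For each transcript (mRNA, tRNA, ...), look up the coordinates of the parent gene.
$3~/.RNA/ {
n=$9; sub(".*Name=", "", n); sub(";.*", "", n);        # transcript name
p=$9; sub(".*Parent=", "", p); sub(";.*", "", p);      # parent gene
print n, p, "Desc", $1, $7, $4, $5, gs[p], ge[p];
}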
' FS="\t" OFS="\t" input
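
The script above looks up gene coordinates while it reads, so it relies on each gene line appearing before its transcripts, and it keys genes by Name. A minimal sketch of a two-pass variant follows, assuming the file can be read twice and that Parent refers to the gene's ID attribute (true for the sample input, where ID and Name are identical). The function name gff3_transcripts and the helper attr are made up for illustration; they are not part of any tool.

```shell
# Hypothetical helper: print one row per transcript of a GFF3 file.
# The file is passed to awk twice so gene coordinates are known
# before any transcript line is printed.
gff3_transcripts() {
    awk -F'\t' -v OFS='\t' '
    # Return the value of attribute "key" from a GFF3 attribute string.
    function attr(s, key,    n, i, parts) {
        n = split(s, parts, ";")
        for (i = 1; i <= n; i++)
            if (index(parts[i], key "=") == 1)
                return substr(parts[i], length(key) + 2)
        return ""
    }
    BEGIN {
        print "transcript_id", "gene_name", "description", "chromosome", "strand", "transcript_start", "transcript_end", "gene_start", "gene_end"
    }
    # Skip GFF3 comment and directive lines.
    /^#/ { next }
    # Pass 1: record gene coordinates, keyed by the gene ID.
    NR == FNR {
        if ($3 == "gene") {
            id = attr($9, "ID")
            gs[id] = $4
            ge[id] = $5
        }
        next
    }
    # Pass 2: one output row per transcript feature (mRNA, tRNA, ...).
    $3 ~ /RNA$/ {
        p = attr($9, "Parent")
        print attr($9, "Name"), p, "desc", $1, $7, $4, $5, gs[p], ge[p]
    }
    ' "$1" "$1"
}
```

It would be called as gff3_transcripts input > transcripts.tsv. The description column is only the placeholder string "desc" from the requested output, since the sample input carries no description attribute.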


# 6  
Old 02-12-2020
Thank you so much, this works perfectly.
# 7  
Old 02-12-2020
Quote:
Originally Posted by Maduranga
Thank you so much, this works perfectly.
Yes, but for future reference (when you post a question here):

https://www.unix.com/members-only/28...-new-post.html
