Parse file for fields and specific text

07-07-2015


I have a file of ~500,000 entries in the following format:

file.txt

Code:

chr1	11868	12227	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	12009	12057	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	12178	12227	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 2; exon_id "ENSE00001671638.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";

I am using Cygwin on Windows 7 and am trying to parse each line into 5 columns:

column 1 = $1 of the line
column 2 = $2 of the line
column 3 = $3 of the line
column 4 = the gene_name "..." value, quotes removed
column 5 = the exon_number value, quotes removed (if any)

Example of desired output:

Code:

chr1	11868	12227	DDX11L1	1
chr1	12009	12057	DDX11L1	1
chr1	12178	12227	DDX11L1	2

I was able to generate file.txt but cannot seem to parse it correctly.

Thank you

cmccabe


07-07-2015


After 2 years and 457 posts in this forum, you must have tried something. As far as I know, the format of your file.txt is pretty standard gtf/gff, so there is really nothing to generate. Please show what you tried.

senhia83


07-07-2015


Yes, the gtf/gff was created with:

Code:

wget ftp://ftp.sanger.ac.uk/pub/gencode/r...otation.gtf.gz
gunzip --stdout gencode.v21.annotation.gtf.gz \
    | gtf2bed - \
    | grep "exon" \
    > gencode.exons.bed
bedmap --echo --echo-map Regions.bed gencode.exons.bed

This produced output that was close but not what I wanted, so I thought parsing the input first might help. That is, an exon file with only 5 columns might work better.

I'm not sure but maybe:

Code:

 awk -f FNR > 1{for(i=1;i<=NF;i++) {n=split($i,a, "[.:>_]") print a[1]+0,a[2]+0,a[3]+0,substr(a[gene_name],length(a[exon_number])), a[n]} } OFS='\t' gencode.exons.txt > parse.txt

Thank you

cmccabe


07-07-2015


One way, below, uses split; other methods include parsing by regex.

Try:

Code:

awk -F'\t' -v OFS='\t' '{
    x = y = ""; n = split($10, a, ";")    # reset per line; attributes are in field 10
    for (i = 1; i <= n; i++)
        if (a[i] ~ /gene_name/) { split(a[i], b, "\""); x = b[2] }
        else if (a[i] ~ /exon_number/) { split(a[i], c, " "); y = c[2] }
    print $1, $2, $3, x, y
}' file
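
To see what the two split calls are doing, here is a minimal sketch on a single attribute string. The string is shortened from the first line of file.txt purely for illustration, and the function name show_split is made up for this example:

```shell
# Shortened attribute string, trimmed from the first record for illustration
attrs='gene_id "ENSG00000223972.5"; gene_name "DDX11L1"; exon_number 1; level 2;'

# show_split is a hypothetical helper name, not part of the original answer
show_split() {
    printf '%s\n' "$attrs" | awk '{
        n = split($0, a, ";")                 # one array element per attribute
        for (i = 1; i <= n; i++)
            if (a[i] ~ /gene_name/) {
                split(a[i], b, "\"")          # b[2] is the text between the quotes
                print "gene_name=" b[2]
            } else if (a[i] ~ /exon_number/) {
                split(a[i], c, " ")           # whitespace split; c[2] is the number
                print "exon_number=" c[2]
            }
    }'
}

show_split
```

Splitting on a single space uses awk's default whitespace splitting, so the leading blank left over after each ";" is ignored and the number ends up in c[2].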


senhia83


07-08-2015


Thank you, it works great, and thank you for introducing me to split.

cmccabe


07-08-2015


Try also

Code:

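awk '
        {match ($0, /gene_name [^ ]*/)
         # value starts after the key, the blank and the opening quote; drop the closing quote and semicolon
         T1=substr ($0, RSTART+11, RLENGTH-13)
         match ($0, /exon_number [^ ]*/)
         # value starts after the key and the blank; drop the closing semicolon
         T2=substr ($0, RSTART+12, RLENGTH-13)
         print $1, $2, $3, T1, T2
        }
' FS="\t" OFS="\t" file
chr1	11868	12227	DDX11L1	1
chr1	12009	12057	DDX11L1	1
chr1	12178	12227	DDX11L1	2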
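
If more attributes are needed later, the same regex idea could be wrapped in a small awk function. This is only a sketch, not something from the posts above: the names parse_exons and attr are made up, and it assumes the attributes sit in tab-separated field 10 and are separated by "; " as in the sample lines.

```shell
# Hypothetical wrapper: print columns 1-3 plus gene_name and exon_number.
# Assumes the attribute list is tab-separated field 10, as in the sample.
# Usage: parse_exons file.txt > parse.txt
parse_exons() {
    awk -F'\t' -v OFS='\t' '
        # Return the value of attribute key from string s with quotes removed,
        # or an empty string when the attribute is absent.
        function attr(s, key,    t, v) {
            t = "; " s                              # so every key is preceded by "; "
            if (!match(t, "; " key " [^;]*")) return ""
            # skip the "; " prefix, the key and the blank that follows it
            v = substr(t, RSTART + length(key) + 3, RLENGTH - length(key) - 3)
            gsub(/"/, "", v)                        # drop the surrounding quotes
            return v
        }
        { print $1, $2, $3, attr($10, "gene_name"), attr($10, "exon_number") }
    ' "$@"
}
```

Anchoring the match on "; " keeps a key from matching inside a longer attribute name, and a missing attribute gives an empty column instead of a value carried over from an earlier line.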


RudiC


07-08-2015


Thank you

cmccabe
