Parse file for fields and specific text


 
# 1  
Old 07-07-2015
I have a file of ~500,000 entries in the following format:

file.txt
Code:
chr1	11868	12227	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	12009	12057	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	12178	12227	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 2; exon_id "ENSE00001671638.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";

I am using Cygwin on Windows 7 and am trying to parse each line into 5 columns:

column 1 = $1 of the text
column 2 = $2 of the text
column 3 = $3 of the text
column 4 = the value after gene_name (quotes removed)
column 5 = the value after exon_number (quotes removed)

Example of desired output:

Code:
chr1	11868	12227	DDX11L1	1
chr1	12009	12057	DDX11L1	1
chr1	12178	12227	DDX11L1	2

I was able to generate file.txt but cannot seem to parse it correctly.

Thank you.
# 2  
Old 07-07-2015
After 2 years and 457 posts in this forum you must have tried something? As far as I know, the format of your file.txt is a pretty standard gtf/gff, so there is nothing to generate really. Please show what you tried.
# 3  
Old 07-07-2015
Yes, the gtf/gff was created with:

Code:
wget ftp://ftp.sanger.ac.uk/pub/gencode/r...otation.gtf.gz
gunzip --stdout gencode.v21.annotation.gtf.gz \
    | gtf2bed - \
    | grep "exon" \
    > gencode.exons.bed
bedmap --echo --echo-map Regions.bed gencode.exons.bed

This produced output that is close, but not what I want, and I thought parsing the input first might help. That is, an exon file with only 5 columns may work better.

I'm not sure but maybe:

Code:
 awk -f FNR > 1{for(i=1;i<=NF;i++) {n=split($i,a, "[.:>_]") print a[1]+0,a[2]+0,a[3]+0,substr(a[gene_name],length(a[exon_number])), a[n]} } OFS='\t' gencode.exons.txt > parse.txt

Thank you.
# 4  
Old 07-07-2015
One way, using split, is below; other methods include parsing by regex.

Try:

Code:
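awk -F"\t" '{
    x = y = ""                       # reset for every line
    n = split($10, a, ";")           # split() returns the count; length(array) is not POSIX
    for (i = 1; i <= n; i++)
        if (a[i] ~ /gene_name/) { split(a[i], b, "\""); x = b[2] }
        else if (a[i] ~ /exon_number/) { split(a[i], c, " "); y = c[2] }
    print $1, $2, $3, x, y
}' OFS="\t" file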
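
If more than two attributes are ever needed, the same split idea can be generalized. The following is only a sketch, assuming the attributes sit in column 10 as `key value;` pairs like the sample lines and that no value contains a semicolon. The function name `gtf_attrs` and its argument order are made up for illustration.

```shell
# Hypothetical helper: print columns 1-3 plus the requested attribute keys.
# Usage: gtf_attrs key1,key2,... [file]   (reads stdin when no file is given)
gtf_attrs() {
    awk -F'\t' -v OFS='\t' -v keys="$1" '
    BEGIN { nk = split(keys, want, ",") }
    {
        split("", attr)                      # clear the per-line map
        n = split($10, parts, ";")
        for (i = 1; i <= n; i++) {
            p = parts[i]
            sub(/^ +/, "", p)                # trim leading blanks
            sp = index(p, " ")
            if (sp == 0) continue            # empty piece after the last ";"
            k = substr(p, 1, sp - 1)
            v = substr(p, sp + 1)
            gsub(/"/, "", v)                 # drop the quotes
            if (!(k in attr)) attr[k] = v    # keep the first of repeated keys (e.g. ont)
        }
        line = $1 OFS $2 OFS $3
        for (j = 1; j <= nk; j++) line = line OFS attr[want[j]]
        print line
    }' "${2:--}"
}
```

For the output asked for in post #1 this would be called as `gtf_attrs gene_name,exon_number file.txt > parse.txt`. A key that is missing on a line simply yields an empty column.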

# 5  
Old 07-08-2015
Thank you, works great, and thank you for introducing me to split.
# 6  
Old 07-08-2015
Try also
Code:
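awk '
        {match ($0, /gene_name [^ ]*/)
         T1=substr ($0, RSTART+11, RLENGTH-13)      # skip gene_name, the space and the opening quote; drop the closing quote and ";"
         match ($0, /exon_number [^ ]*/)
         T2=substr ($0, RSTART+12, RLENGTH-13)      # skip exon_number and the space; drop the trailing ";"
         print $1, $2, $3, T1, T2
        }
' FS="\t" OFS="\t" file
chr1	11868	12227	DDX11L1	1
chr1	12009	12057	DDX11L1	1
chr1	12178	12227	DDX11L1	2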
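
One thing to watch with match(): it returns 0 when the pattern is absent, and the substr() arithmetic then yields garbage. This matters if the file ever contains lines without exon_number. Below is a rough sketch of a guarded variant, assuming the attributes are in column 10 and separated by "; " as in the sample. The function name `gtf_fields` and the "." placeholder for a missing value are my own choices, not anything from the posts above.

```shell
# Hypothetical guarded variant of the match()/substr() approach.
# Usage: gtf_fields [file]   (reads stdin when no file is given)
gtf_fields() {
    awk -F'\t' -v OFS='\t' '
    function attr(key,    s, skip) {         # s and skip are local variables
        s = "; " $10                         # so every key is preceded by "; "
        if (!match(s, "; " key " [^;]*"))    # 0 means the key is absent
            return "."                       # placeholder, an assumption
        skip = length(key) + 3               # "; " + key + one space
        s = substr(s, RSTART + skip, RLENGTH - skip)
        gsub(/"/, "", s)                     # drop the quotes, if any
        return s
    }
    { print $1, $2, $3, attr("gene_name"), attr("exon_number") }
    ' "${1:--}"
}
```

Prefixing the key with "; " also keeps a short key from matching inside a longer attribute name.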

# 7  
Old 07-08-2015
Thank you.