Hi,
I am having some difficulty pulling out specific columns using awk. I think what I am doing is iterating through the various columns looking for a match and asking awk to print if a match is found.
Here are a few lines from my input:
NC_015011.2 Gnomon gene 18691 26481 . + . ID=gene0;Dbxref=GeneID:100538868;Name=LOC100538868;gbkey=Gene;gene=LOC100538868;partial=true;start_range=.,18691
NC_015011.2 Gnomon mRNA 18691 26481 . + . ID=rna0;Parent=gene0;Dbxref=GeneID:100538868,Genbank:XM_010707932.1;Name=XM_010707932.1;gbkey=mRNA;gene=LOC100538868;partial=true;product=hematopoietic lineage cell-specific protein-like;start_range=.,18691;transcript_id=XM_010707932.1
NC_015011.2 Gnomon exon 18691 18743 . + . ID=id1;Parent=rna0;Dbxref=GeneID:100538868,Genbank:XM_010707932.1;gbkey=mRNA;gene=LOC100538868;partial=true;product=hematopoietic lineage cell-specific protein-like;start_range=.,18691;transcript_id=XM_010707932.1
NC_015011.2 Gnomon exon 18865 18994 . + . ID=id2;Parent=rna0;Dbxref=GeneID:100538868,Genbank:XM_010707932.1;gbkey=mRNA;gene=LOC100538868;partial=true;product=hematopoietic lineage cell-specific protein-like;transcript_id=XM_010707932.1
Here is my code. Note that I am not interested in the first 8 fields. The 9th field is an info field that does not have a fixed number of subfields (even though the ones shown do), so a matching technique is more appropriate:
awk -F "\t" '{ print $9 }' mga_ref_Turkey_5.0_NCBI_FINAL_no_GI_no_region.gff3.txt | grep product | awk -F ";" '{ gsub(";","\t",$0);print $0 }' | awk -F "\t" '{for(i=0;i<NF;i++){if($i~/gene\=/){printf $i};if($i~/product\=/){printf $i }};printf "\n"}' | head
Now the output:
ID=rna0 Parent=gene0 Dbxref=GeneID:100538868,Genbank:XM_010707932.1 Name=XM_010707932.1 gbkey=mRNA gene=LOC100538868 partial=true product=hematopoietic lineage cell-specific protein-like start_range=.,18691 transcript_id=XM_010707932.1ID=rna0 Parent=gene0 Dbxref=GeneID:100538868,Genbank:XM_010707932.1 Name=XM_010707932.1 gbkey=mRNA gene=LOC100538868 partial=true product=hematopoietic lineage cell-specific protein-like start_range=.,18691 transcript_id=XM_010707932.1gene=LOC100538868product=hematopoietic lineage cell-specific protein-like
ID=id1 Parent=rna0 Dbxref=GeneID:100538868,Genbank:XM_010707932.1 gbkey=mRNA gene=LOC100538868 partial=true product=hematopoietic lineage cell-specific protein-like start_range=.,18691 transcript_id=XM_010707932.1ID=id1 Parent=rna0 Dbxref=GeneID:100538868,Genbank:XM_010707932.1 gbkey=mRNA gene=LOC100538868 partial=true product=hematopoietic lineage cell-specific protein-like start_range=.,18691 transcript_id=XM_010707932.1gene=LOC100538868product=hematopoietic lineage cell-specific protein-like
ID=id2 Parent=rna0 Dbxref=GeneID:100538868,Genbank:XM_010707932.1 gbkey=mRNA gene=LOC100538868 partial=true product=hematopoietic lineage cell-specific protein-like transcript_id=XM_010707932.1ID=id2 Parent=rna0 Dbxref=GeneID:100538868,Genbank:XM_010707932.1 gbkey=mRNA gene=LOC100538868 partial=true product=hematopoietic lineage cell-specific protein-like transcript_id=XM_010707932.1gene=LOC100538868product=hematopoietic lineage cell-specific protein-like
What I don't get is why my match/print commands are not printing ONLY the matching fields???
Thanks
---------- Post updated at 09:37 PM ---------- Previous update was at 09:16 PM ----------
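Never mind, figured it out. I need to start my loop at 1, not 0. In awk, $0 is the whole line, so on the first pass it matched both patterns and the entire line got printed twice before the real fields. Grrr.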
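
For anyone landing here later, here is a rough sketch of one way to do the same thing in a single awk call. It rests on a few assumptions: the file is tab-separated with the attributes in column 9, the sample line and the function name extract_gene_product are made up for illustration, and the output is the gene= and product= attributes joined by a tab. It loops from 1 through the last element (<=, not <), because awk fields and split() arrays start at 1 and $0 is the whole record. It also uses print rather than printf $i, so a stray % in the data cannot be taken as a format string.

```shell
# Hypothetical sample line standing in for the real GFF3 file (tab-separated, 9 columns).
line=$(printf 'NC_015011.2\tGnomon\texon\t18865\t18994\t.\t+\t.\tID=id2;Parent=rna0;gbkey=mRNA;gene=LOC100538868;partial=true;product=hematopoietic lineage cell-specific protein-like;transcript_id=XM_010707932.1')

# Keep only the gene= and product= attributes from column 9.
extract_gene_product() {
  awk -F '\t' '$9 ~ /product=/ {
    n = split($9, attr, ";")            # attr[1]..attr[n]; there is no attr[0]
    out = ""
    for (i = 1; i <= n; i++)            # start at 1 and include the last element
      if (attr[i] ~ /^(gene|product)=/)
        out = out (out == "" ? "" : "\t") attr[i]
    print out                           # print, not printf, so "%" in data is safe
  }'
}

result=$(printf '%s\n' "$line" | extract_gene_product)
printf '%s\n' "$result"
```

On the real data the function would presumably be fed the file directly, e.g. extract_gene_product < mga_ref_Turkey_5.0_NCBI_FINAL_no_GI_no_region.gff3.txt | head.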