Parsing a file and pulling out specific columns


 
# 1  
Old 01-26-2015

Hi,

I am having some difficulty pulling out specific columns using awk. My approach is to iterate over the columns looking for a match and have awk print each column that matches.

Here are a few lines from my input:

HTML Code:
NC_015011.2     Gnomon  gene    18691   26481   .       +       .       ID=gene0;Dbxref=GeneID:100538868;Name=LOC100538868;gbkey=Gene;gene=LOC100538868;partial=true;start_range=.,18691
NC_015011.2     Gnomon  mRNA    18691   26481   .       +       .       ID=rna0;Parent=gene0;Dbxref=GeneID:100538868,Genbank:XM_010707932.1;Name=XM_010707932.1;gbkey=mRNA;gene=LOC100538868;partial=true;product=hematopoietic lineage cell-specific protein-like;start_range=.,18691;transcript_id=XM_010707932.1
NC_015011.2     Gnomon  exon    18691   18743   .       +       .       ID=id1;Parent=rna0;Dbxref=GeneID:100538868,Genbank:XM_010707932.1;gbkey=mRNA;gene=LOC100538868;partial=true;product=hematopoietic lineage cell-specific protein-like;start_range=.,18691;transcript_id=XM_010707932.1
NC_015011.2     Gnomon  exon    18865   18994   .       +       .       ID=id2;Parent=rna0;Dbxref=GeneID:100538868,Genbank:XM_010707932.1;gbkey=mRNA;gene=LOC100538868;partial=true;product=hematopoietic lineage cell-specific protein-like;transcript_id=XM_010707932.1
Here is my code. Note that I am not interested in the first 8 fields. The 9th field is an info field that does not have a fixed number of sub-fields (even though the ones shown do), so matching on the sub-field name is more appropriate than picking by position:

awk -F "\t" '{ print $9 }' mga_ref_Turkey_5.0_NCBI_FINAL_no_GI_no_region.gff3.txt | grep product | awk -F ";" '{ gsub(";","\t",$0);print $0 }' | awk -F "\t" '{for(i=0;i<NF;i++){if($i~/gene\=/){printf $i};if($i~/product\=/){printf $i }};printf "\n"}' | head

Now the output:

HTML Code:
ID=rna0	Parent=gene0	Dbxref=GeneID:100538868,Genbank:XM_010707932.1	Name=XM_010707932.1	gbkey=mRNA	gene=LOC100538868	partial=true	product=hematopoietic lineage cell-specific protein-like	start_range=.,18691	transcript_id=XM_010707932.1ID=rna0	Parent=gene0	Dbxref=GeneID:100538868,Genbank:XM_010707932.1	Name=XM_010707932.1	gbkey=mRNA	gene=LOC100538868	partial=true	product=hematopoietic lineage cell-specific protein-like	start_range=.,18691	transcript_id=XM_010707932.1gene=LOC100538868product=hematopoietic lineage cell-specific protein-like
ID=id1	Parent=rna0	Dbxref=GeneID:100538868,Genbank:XM_010707932.1	gbkey=mRNA	gene=LOC100538868	partial=true	product=hematopoietic lineage cell-specific protein-like	start_range=.,18691	transcript_id=XM_010707932.1ID=id1	Parent=rna0	Dbxref=GeneID:100538868,Genbank:XM_010707932.1	gbkey=mRNA	gene=LOC100538868	partial=true	product=hematopoietic lineage cell-specific protein-like	start_range=.,18691	transcript_id=XM_010707932.1gene=LOC100538868product=hematopoietic lineage cell-specific protein-like
ID=id2	Parent=rna0	Dbxref=GeneID:100538868,Genbank:XM_010707932.1	gbkey=mRNA	gene=LOC100538868	partial=true	product=hematopoietic lineage cell-specific protein-like	transcript_id=XM_010707932.1ID=id2	Parent=rna0	Dbxref=GeneID:100538868,Genbank:XM_010707932.1	gbkey=mRNA	gene=LOC100538868	partial=true	product=hematopoietic lineage cell-specific protein-like	transcript_id=XM_010707932.1gene=LOC100538868product=hematopoietic lineage cell-specific protein-like
What I don't get is why my match/print commands are not printing ONLY the matching fields.

Thanks

---------- Post updated at 09:37 PM ---------- Previous update was at 09:16 PM ----------

Never mind, figured it out. I need to start my loop at 1, not 0. Grrr.
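
For anyone who lands here with the same symptom: in awk, $0 is the whole record, so a loop that starts at i=0 first tests the entire line against each pattern. The line contains both gene= and product=, so it gets printed twice before the individual fields are reached. The condition i<NF also stops one field short, which matters when the wanted field is the last one. Below is a minimal sketch of the corrected loop. The sample line is a shortened, already tab-separated version of the input above, and the variable names line, out and sep are only for illustration.

```shell
# Shortened sample of the attribute field, already tab-separated (illustrative only).
line=$(printf 'ID=rna0\tgene=LOC100538868\tpartial=true\tproduct=hematopoietic lineage cell-specific protein-like')

# Loop from 1 to NF inclusive: $0 is the whole record, $1..$NF are the fields.
out=$(printf '%s\n' "$line" | awk -F '\t' '{
    sep = ""
    for (i = 1; i <= NF; i++)
        if ($i ~ /^(gene|product)=/) {   # anchor the match at the start of the field
            printf "%s%s", sep, $i
            sep = "\t"                   # tab between matches, none before the first
        }
    printf "\n"
}')
printf '%s\n' "$out"
```

Anchoring the pattern with ^ is a choice made here so that a field merely containing "gene=" somewhere in its value is not picked up.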
# 2  
Old 01-27-2015
This could be simplified into a single awk script (the loop runs to i<=NF so that the last field is checked as well):
Code:
awk -F';' '/product/{sub(/.*\t/,x); for(i=1; i<=NF; i++) if($i~/product=|gene=/) printf "%s", $i; print x}' file
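
A variation on the same idea, offered as a rough sketch rather than the definitive way: keep tab as the field separator and use split() on the ninth field only, so the first eight columns never need to be stripped. The function name extract_gene_product, the array name attr and the sample variable are made up for this example, and the sample lines are shortened from the input in the first post.

```shell
# Hypothetical helper: reads GFF3 lines on stdin and prints the gene= and
# product= attributes of each line that has a product, separated by a tab.
extract_gene_product() {
    awk -F '\t' '$9 ~ /product=/ {
        gene = ""; product = ""
        n = split($9, attr, ";")          # attr[1..n] holds the key=value pairs
        for (i = 1; i <= n; i++) {
            if (attr[i] ~ /^gene=/)    gene = attr[i]
            if (attr[i] ~ /^product=/) product = attr[i]
        }
        print gene "\t" product
    }'
}

# Two sample lines (attributes shortened); printf reuses the format for each set of 9 arguments.
sample=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    NC_015011.2 Gnomon gene 18691 26481 . + . 'ID=gene0;Name=LOC100538868;gbkey=Gene;gene=LOC100538868;partial=true' \
    NC_015011.2 Gnomon exon 18865 18994 . + . 'ID=id2;Parent=rna0;gbkey=mRNA;gene=LOC100538868;partial=true;product=hematopoietic lineage cell-specific protein-like')

printf '%s\n' "$sample" | extract_gene_product
```

On a real file the call would be something like extract_gene_product < file. Lines without a product= attribute, such as the gene line in the sample, are skipped.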
