More efficient awk parser


 
# 1  
Old 03-12-2015

I have an awk parser that works well when the identifier starts with NC_0000 (four zeros), but when it does not, the data is not parsed into the desired output. I'm not sure of the most efficient way to obtain that output. Thank you.

Code:
 awk 'FNR > 1 && match($0, /NC_0000([0-9]*)\..*g\.([0-9]+)(.)>(.)/, a){ print a[1], a[2], a[2], a[3], a[4] }' OFS='\t' ${id}.txt > ${id}_parse.txt

For example:

Code:
NC_000013.10:g.20763466G>A

or
Code:
NC_00001.10:g.20763477C>G

would be parsed into the desired output of
Code:
13 20763466 20763466 G A

or
Code:
1 20763477 20763477 C G

but
Code:
NC_000004.11:g.41749507G>T

would not work. The desired output format is always the same and follows the rules below. Thank you.

Parse rules:

the digits after the 4 zeros following NC_ (there are not always 4 zeros) and before the .

digits after the g. repeated twice separated by a tab

letter before the >

letter after the >
[MOD]As has been stated many times before, PLEASE use CODE tags when displaying sample input and output as well as when displaying code segments.

Last edited by Don Cragun; 03-12-2015 at 03:49 PM.. Reason: Add CODE and ICODE tags again.
# 2  
Old 03-12-2015
It works for me:
Code:
[user@host ~]$ cat file
NC_000004.11:g.41749507G>T
NC_000013.10:g.20763466G>A
NC_00001.10:g.20763477C>G
[user@host ~]$ awk 'match($0, /NC_0000([0-9]*)\..*g\.([0-9]+)(.)>(.)/, a){ print a[1], a[2], a[2], a[3], a[4] }' OFS='\t' file
04      41749507        41749507        G       T
13      20763466        20763466        G       A
1       20763477        20763477        C       G
[user@host ~]$

By the way, why do you have FNR>1? Do you want to skip the first line? Does the first line have NC_000004...?
# 3  
Old 03-12-2015
Yes, the first line is a header, so FNR>1 is used to skip it. I attached the input file that contains the data to be parsed. The issue with the parser as it stands is that the first output line below (the one starting with 04, shown in bold in the original post) causes an error in a Perl script I use later. Line 1 needs to look like line 3, without the leading zero, and I am not sure how to do this. Thank you.

Code:
 
NC_000004.11:g.41749507G>T
NC_000013.10:g.20763466G>A
NC_00001.10:g.20763477C>G
 
 
04      41749507        41749507        G       T
13      20763466        20763466        G       A
1       20763477        20763477        C       G

# 4  
Old 03-12-2015
Did you try to print a[1]+0, ... ?
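
The suggestion above relies on awk's numeric coercion: adding 0 to a string such as 04 turns it into the number 4, which drops the leading zero. A minimal sketch of just that step, using plain POSIX awk (the function name strip_leading_zeros is made up for illustration; in the original gawk command the same idea would be applied to a[1]):

```shell
# Hypothetical helper: reads one value per line on stdin and prints it
# with any leading zeros removed.
strip_leading_zeros() {
  # $1 + 0 forces numeric context, so the string "04" is printed as 4
  awk '{ print $1 + 0 }'
}

# Feed it the three first-column values from the output shown earlier
printf '04\n13\n1\n' | strip_leading_zeros
```

Values that have no leading zero, such as 13 and 1, pass through unchanged.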
# 5  
Old 03-12-2015
Since the digits after the g. might also vary:

Code:
 awk -F"[_.>]" 'FNR > 1 '{a=length($4);b=substr($4,1,a-1);print $2+0,b,b,substr($4,a),$5}' OFS="\t" ${id}.txt > ${id}_parse.txt

Would this skip the header row and parse the third column? Thanks.
# 6  
Old 03-13-2015
Did you test that? What was the result?

One awk feature is that when you perform arithmetic on a field, it uses only the leading digits and drops everything from the first non-digit on. So $4+0 would yield the desired number regardless of its length, and sub($4+0, "", $4) would leave the trailing character in $4.
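
The idea in this reply could be sketched roughly as follows. This is an illustration, not a tested solution for the real input file: it assumes each data line holds nothing but the identifier (for example NC_000004.11:g.41749507G>T) and that the first line is a header. The function name parse_variants is made up, and the sketch strips the leading digits with sub(/^[0-9]+/, "", ref) on a copy of the field, a small variation on the sub($4+0, "", $4) suggested above.

```shell
# Hypothetical helper: parses lines like NC_000004.11:g.41749507G>T
# from the named files, or from stdin when no file is given.
parse_variants() {
  awk -F'[_.>]' -v OFS='\t' 'FNR > 1 {    # FNR > 1 skips the header line
      pos = $4 + 0                # leading digits of "41749507G" as a number
      ref = $4
      sub(/^[0-9]+/, "", ref)     # strip the digits, leaving the letter before ">"
      print $2 + 0, pos, pos, ref, $5     # $2 + 0 drops the leading zeros
  }' "$@"
}

# Example with a made-up header line followed by the samples from this thread
printf 'Variant\nNC_000004.11:g.41749507G>T\nNC_00001.10:g.20763477C>G\n' | parse_variants
```

Because the field separator splits on _, . and >, the number of zeros after NC_ and the number of digits after g. no longer matter. If the identifier sits in a later column of a wider file, the field numbers would need adjusting.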
# 7  
Old 03-13-2015
If I run the command below, the format is incorrect, presumably because of the header in the input file.

Code:
 awk -F"[_.>]" '{a=length($4);b=substr($4,1,a-1);print $2+0,b,b,substr($4,a),$5}' OFS="\t" Target.txt
0
4004    244     244     G       A               NC
3924    288     288     C       A               NC
3924    385     385     G       A               NC

However, the command below gives an error, I think because of the 'FNR > 1 part, but I'm not sure. Thank you.

Code:
 awk -F"[_.>]" 'FNR > 1 '{a=length($4);b=substr($4,1,a-1);print $2+0,b,b,substr($4,a),$5}' OFS="\t" ${id}.txt > ${id}_parse.txt 
-bash: syntax error near unexpected token `('
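
A likely cause of the bash error above is the extra single quote after FNR > 1: it closes the awk program early, so the shell itself tries to interpret {a=length($4);...} and stops at the parenthesis. A sketch of the same command with the pattern and the action inside one pair of quotes, wrapped in a made-up function parse_file that takes the input and output paths, and assuming the input holds only a header line plus one identifier per line:

```shell
# Hypothetical helper: parse_file INPUT OUTPUT
parse_file() {
  # The pattern (FNR > 1) and the action { ... } sit inside ONE quoted string
  awk -F'[_.>]' 'FNR > 1 {
      a = length($4)               # length of e.g. "41749507G"
      b = substr($4, 1, a - 1)     # everything except the last character
      print $2 + 0, b, b, substr($4, a), $5
  }' OFS='\t' "$1" > "$2"
}
```

With the variable from the thread this would be called as parse_file "${id}.txt" "${id}_parse.txt". The extra NC column and the odd numbers in the earlier Target.txt output suggest the real file has more columns than a single identifier per line, so the field numbers may still need adjusting.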
