awk to format each line by pattern


 
# 1  
Old 06-14-2018

The four lines in the tab-delimited input are a sample of my actual data. The awk is meant to go line by line, check which pattern the line satisfies, and format it accordingly (there are 3 formats). Every line in the file should follow one of the three formats below. I added comments to the awk but cannot get it to execute, and there is probably a better way. Thank you.

Code:
format1 = parentheses containing text (alpha characters): only that text is stored in variable p --- so in line 1 only NHLRC1 is stored in p
format2 = parentheses containing a number: the number is removed along with the parentheses --- so in line 3 the (10866) is removed
format3 = no parentheses: split $4 on the _ (underscore) and print the 3rd field

input tab-delimited
Code:
6	18122723	18122843	469_380805_378884(NHLRC1)_1.1_1
6	31114121	31114241	344047_16724314_rs746647_1
6	31430946	31431066	344049_16724385_HCP5(10866)_1_1
6	32808479	32808599	445446_18754304_PSMB8-exon6_1

desired output tab-delimited
Code:
chr6	18122723	18122843	chr6:18122723-18122843	NHLRC1
chr6	31114121	31114241	chr6:31114121-31114241	rs746647
chr6	31430946	31431066	chr6:31430946-31431066	HCP5
chr6	32808479	32808599	chr6:32808479-32808599	PSMB8-exon6

awk
Code:
awk 'BEGIN{FS=OFS="\t"}  # define fs and output
       FNR==NR{ # process each field in each line of file
         if(/([A-Z])/) {  # pattern 1 for extracting only alpha in () not number
            p=$(awk -F"[()]" '{print $2}')      # extract string in variable p
              print "chr"$1,$2,$3,"chr:"$2"-"$3,$p  # print desired output
               next
  }
         if(/([0-9])/) {  # pattern remove # in () 
            n=$(awk -F"[()]" '{print $2}')   # extract number in ()in variable n
              awk -v num=$n 'BEGIN {sub([0-9],"",num) && sub (),"",$4)  ; print name}  # substitute # with null value and print
               next
  }
         if($4 ~ /_/) {  # pattern 3 for _ spilt
            awk '{split($0,a,"_"); print "chr"$1,$2,$3,"chr:"$2"-"$3,a[3]}'  # if conditions 1 and 2 not meet then split on _ and print 3rd field along with desired fields
               next
  }
}' input
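
A note for readers on why the attempt above cannot run as written: `$(...)` is shell command substitution, not awk syntax; the nested `awk '...'` calls end the single-quoted program early; unescaped `(` and `)` in a regex are grouping operators rather than literal parentheses; `FNR==NR` is always true with a single input file, so it adds nothing; and `"chr:"` omits `$1`. Below is a minimal sketch of the same three-branch idea in plain awk. It assumes the name is always the third `_`-separated piece of `$4`, which holds for the four sample lines. The function name `format_by_pattern` and the variables `part`, `name` and `inner` are made up for illustration.

```shell
format_by_pattern() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    split($4, part, "_")              # part[3] holds the name in the four sample lines
    name = part[3]
    if (match(name, /\([^)]*\)/)) {   # a parenthesised group is present
      inner = substr(name, RSTART + 1, RLENGTH - 2)
      if (inner ~ /^[0-9]+$/)         # format 2: number in (), keep the text before it
        name = substr(name, 1, RSTART - 1)
      else                            # format 1: text in (), keep that text
        name = inner
    }                                 # format 3: no parentheses, name is used as is
    print "chr" $1, $2, $3, "chr" $1 ":" $2 "-" $3, name
  }' "$@"
}

# Feed two of the sample lines through the function
printf '6\t18122723\t18122843\t469_380805_378884(NHLRC1)_1.1_1\n6\t31430946\t31431066\t344049_16724385_HCP5(10866)_1_1\n' |
  format_by_pattern
```

With a file, the call would be `format_by_pattern input`.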

# 2  
Old 06-14-2018
Try:
Code:
awk '
  {
    split($4,F,/_/)
    if(split(F[3],G,/[)(]/)) {
      if(G[2]~/[[:alpha:]]/)
        p=G[2]
      else 
        p=G[1]
    } 
    else 
      p=F[3]
  }
  {
    print "chr" $1, $2, $3, "chr" $1 ":" $2 "-" $3 OFS p
  }
' FS='\t' OFS='\t' file
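
A small sketch of what the inner `split()` returns for the three kinds of third field may help when reading the code above. The helper name `show_split` is hypothetical; the three inputs are the third `_`-separated pieces of sample lines 1, 2 and 3.

```shell
show_split() {
  awk '{
    n = split($0, G, /[)(]/)   # same separator class as in the answer
    printf "%s -> n=%d G[1]=%s G[2]=%s\n", $0, n, G[1], G[2]
  }'
}

printf '%s\n' '378884(NHLRC1)' 'rs746647' 'HCP5(10866)' | show_split
# 378884(NHLRC1) -> n=3 G[1]=378884 G[2]=NHLRC1
# rs746647 -> n=1 G[1]=rs746647 G[2]=
# HCP5(10866) -> n=3 G[1]=HCP5 G[2]=10866
```

When the parenthesised part is numeric, `G[2]` fails the `[[:alpha:]]` test and `p` falls back to `G[1]`, the text before the `(`. The same fallback covers fields with no parentheses at all, where `G[1]` is the whole field. Since `split()` returns at least 1 for a non-empty string, the final `else p=F[3]` branch is only reached when `F[3]` is empty.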

# 3  
Old 06-15-2018
The awk works great... thank you. I found two additional format types and commented your code to try to capture them. However, I don't think I am understanding it correctly. Would you be able to comment it so I can try to make the changes? I added the second block (in bold in the original post) to capture the pattern in line 5 (split $4 on the _ and capture the 2nd value if alpha). Also, I can't figure out how a numeric value inside a () does not get printed. Thank you very much.

Code:
awk '
  {
    split($4,F,/_/)              # split field 4 on _ and store in F
    if(split(F[3],G,/[)(]/)) {   # store value of 3rd field in G
      if(G[2]~/[[:alpha:]]/)     # check that it's alpha and store in G[2]
        p=G[2]                   # store G[2] as p
      else
        p=G[1]                   # if numeric store as p
    }
    else
      p=F[3]                     # store split value as p
  }
  {
    split($4,A,/_/)
    if(split(A[2],B,/[_]/)) {
      if(B[2]~/[[:alpha:]]/)
        p=B[2]
    }
  }
  {
    print "chr" $1, $2, $3, "chr" $1 ":" $2 "-" $3 OFS p  # print desired output
  }
' FS='\t' OFS='\t' in   # define FS and OFS as tab-delimited

in tab-delimited
Code:
6	18122723	18122843	469_380805_378884(NHLRC1)_1.1_1
6	31114121	31114241	344047_16724314_rs746647_1
6	31430946	31431066	344049_16724385_HCP5(10866)_1_1
6	32808479	32808599	445446_18754304_PSMB8-exon6_1
1	33478785	33478905	19186497_AK2-Exon1_1
1	24022788	24022908	466743_18956150_RPL11-NM_000975-exon6_1

desired output tab-delimited
Code:
chr6	18122723	18122843	chr6:18122723-18122843	NHLRC1
chr6	31114121	31114241	chr6:31114121-31114241	rs746647
chr6	31430946	31431066	chr6:31430946-31431066	HCP5
chr6	32808479	32808599	chr6:32808479-32808599	PSMB8-exon6
chr1	33478785	33478905	chr1:33478785-33478905	AK2-Exon1
chr1	24022788	24022908	chr1:24022788-24022908	RPL11-NM_000975-exon6
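
One likely reason the added block has no effect, sketched under the assumption that the goal for line 5 is the second `_`-separated piece of `$4`: `A[2]` was produced by splitting on `_`, so it cannot contain another `_`, and splitting it again on `_` always yields a single piece. `B[2]` is therefore empty and `p` keeps the value from the first block (`1` for line 5, `RPL11-NM` for line 6). The helper name `second_split` is hypothetical.

```shell
second_split() {
  awk '{
    split($0, A, /_/)         # first split on _, as in the added block
    n = split(A[2], B, /_/)   # A[2] has no _ left, so this gives one piece
    print "A[2]=" A[2], "pieces=" n, "B[2]=[" B[2] "]"
  }'
}

printf '%s\n' '19186497_AK2-Exon1_1' | second_split
# A[2]=AK2-Exon1 pieces=1 B[2]=[]
```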


Last edited by cmccabe; 06-15-2018 at 10:51 AM.. Reason: fixed format
# 4  
Old 06-16-2018
Hi, try this instead:
Code:
awk '
  {
    gsub(/^[0-9_]+[_(]|[)(_][_)(0-9.]+$/,x,$4)
    print "chr" $1, $2, $3, "chr" $1 ":" $2 "-" $3, $4
  }
' FS='\t' OFS='\t' file
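
For readers unpicking the regular expression: it has two alternatives, one anchored at the start of `$4` and one at the end, and `x` is an unset variable, so every match is replaced by the empty string. The sketch below applies the two halves as separate `sub()` calls, which should give the same result on the six sample values. The function name `strip_ids` is made up for illustration.

```shell
strip_ids() {
  awk '{
    s = $0
    sub(/^[0-9_]+[_(]/, "", s)       # leading digits and underscores, ending at a "_" or "("
    sub(/[)(_][_)(0-9.]+$/, "", s)   # trailing run of digits, dots, underscores, parentheses
    print s
  }'
}

printf '%s\n' \
  '469_380805_378884(NHLRC1)_1.1_1' \
  '344047_16724314_rs746647_1' \
  '344049_16724385_HCP5(10866)_1_1' \
  '445446_18754304_PSMB8-exon6_1' \
  '19186497_AK2-Exon1_1' \
  '466743_18956150_RPL11-NM_000975-exon6_1' | strip_ids
```

The second pattern only matches when nothing but digits, dots, underscores and parentheses remains up to the end of the field. That is why the `_000975` inside `RPL11-NM_000975-exon6` survives while the final `_1` is removed. One consequence: a name that itself ends in an underscore followed only by digits (a hypothetical `GENE_2`) would lose that suffix as well.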


Last edited by Scrutinizer; 06-16-2018 at 02:55 AM..
# 5  
Old 06-16-2018
Thank you very much.