awk to format each line by pattern

    #1  cmccabe, 1 Week Ago
awk to format each line by pattern

The four lines in the tab-delimited input are a sample of my actual data. The awk is meant to go line by line, check which pattern the line satisfies, and apply the matching format (there are three). Every line in the file should follow one of the three formats below. I added comments to the awk, but I cannot get it to execute, and there is probably a better way. Thank you.



Code:
format1 = when the parentheses contain text (alpha characters), only that text is stored in variable p --- so in line 1 only NHLRC1 is stored in p
format2 = parentheses that contain a number are removed along with the number --- so in line 3 the (10866) is removed
format3 = split $4 on the _ (underscore) and print the 3rd field

input tab-delimited


Code:
6	18122723	18122843	469_380805_378884(NHLRC1)_1.1_1
6	31114121	31114241	344047_16724314_rs746647_1
6	31430946	31431066	344049_16724385_HCP5(10866)_1_1
6	32808479	32808599	445446_18754304_PSMB8-exon6_1

desired output tab-delimited


Code:
chr6	18122723	18122843	chr6:18122723-18122843	NHLRC1
chr6	31114121	31114241	chr6:31114121-31114241	rs746647
chr6	31430946	31431066	chr6:31430946-31431066	HCP5
chr6	32808479	32808599	chr6:32808479-32808599	PSMB8-exon6

awk


Code:
awk 'BEGIN{FS=OFS="\t"}  # define fs and output
       FNR==NR{ # process each field in each line of file
         if(/([A-Z])/) {  # pattern 1 for extracting only alpha in () not number
            p=$(awk -F"[()]" '{print $2}')      # extract string in variable p
              print "chr"$1,$2,$3,"chr:"$2"-"$3,$p  # print desired output
               next
  }
         if(/([0-9])/) {  # pattern remove # in () 
            n=$(awk -F"[()]" '{print $2}')   # extract number in ()in variable n
              awk -v num=$n 'BEGIN {sub([0-9],"",num) && sub (),"",$4)  ; print name}  # substitute # with null value and print
               next
  }
         if($4 ~ /_/) {  # pattern 3 for _ split
            awk '{split($0,a,"_"); print "chr"$1,$2,$3,"chr:"$2"-"$3,a[3]}'  # if conditions 1 and 2 not meet then split on _ and print 3rd field along with desired fields
               next
  }
}' input
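
A note on why the attempt above does not execute, as far as can be told from the code itself: `$(awk ...)` is shell command substitution, which awk does not have (inside awk, `$(expr)` is a field reference), and the nested single quotes end the awk program early. `FNR==NR` is also always true when only one file is read, and in `/([A-Z])/` the parentheses are regex grouping, so the pattern matches any uppercase letter rather than a letter inside literal parentheses. The following is only a sketch of the same three-branch idea kept inside a single awk program. The function name `format_by_pattern` and the use of `match()` and `sub()` are choices made for this note, not part of the original post.

```shell
format_by_pattern() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    split($4, a, "_")                        # a[3] holds the name part in the four sample lines
    name = a[3]
    if (match(name, /\([A-Za-z][^)]*\)/))    # format 1: text inside ( )
      name = substr(name, RSTART + 1, RLENGTH - 2)
    else
      sub(/\([0-9]+\)/, "", name)            # format 2: drop a numeric ( ) group if present
    # format 3 is what is left: the plain third piece of $4
    print "chr" $1, $2, $3, "chr" $1 ":" $2 "-" $3, name
  }' "$@"
}

# sample call on the first input line (reads stdin when no file is given)
printf '6\t18122723\t18122843\t469_380805_378884(NHLRC1)_1.1_1\n' | format_by_pattern
```

It can also be called with a file name, for example `format_by_pattern input`.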

    #2  Scrutinizer (Moderator), 1 Week Ago
 
Try:


Code:
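awk '
  {
    split($4,F,/_/)                 # split field 4 on "_"; F[3] is the third piece
    if(split(F[3],G,/[)(]/)) {      # split F[3] on "(" and ")"; G[1] is the text before "(", G[2] the text inside
      if(G[2]~/[[:alpha:]]/)        # the text inside the parentheses contains a letter
        p=G[2]                      # ... so keep that text (NHLRC1)
      else 
        p=G[1]                      # no parentheses, or only a number inside: keep the text before "("
    } 
    else 
      p=F[3]                        # reached only when F[3] is empty
  }
  {
    print "chr" $1, $2, $3, "chr" $1 ":" $2 "-" $3 OFS p
  }
' FS='\t' OFS='\t' file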
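
To see why a number inside parentheses never reaches the output, it helps to look at what the inner `split` produces. The sketch below prints the pieces for the three shapes of third field in the sample, together with the value the script would pick. The function name `show_split` is made up for this note, and `[A-Za-z]` stands in for `[[:alpha:]]`.

```shell
show_split() {
  awk '{
    split($1, G, /[)(]/)                     # same separator class as in the script above
    p = (G[2] ~ /[A-Za-z]/) ? G[2] : G[1]    # letter inside ( ): keep it, otherwise keep the part before (
    printf "%s: G[1]=%s G[2]=%s p=%s\n", $1, G[1], G[2], p
  }'
}

printf '%s\n' '378884(NHLRC1)' 'HCP5(10866)' 'rs746647' | show_split
```

For `HCP5(10866)`, `G[2]` is `10866`, which contains no letter, so `p` is taken from `G[1]`; the number is split off but never assigned to `p`. Because `split` returns the number of pieces (at least 1 for a non-empty string), the outer `else` branch is only reached when the third piece is empty.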

    #3  cmccabe, 6 Days Ago
The awk works great... thank you. I found two additional format types and commented your code to try to capture them. However, I don't think I understand it correctly. Would you be able to comment it so I can try to make the changes? I added the second { } block to capture the pattern in line 5 (split $4 on the _ and capture the 2nd value if alpha). Also, I can't figure out how a numeric value inside () does not get printed. Thank you very much.



Code:
awk '
  {
    split($4,F,/_/)            # split field 4 on _ and store in F
    if(split(F[3],G,/[)(]/)) { # store value of 3rd field in G
        if(G[2]~/[[:alpha:]]/) # check that it's alpha and store in G[2]
        p=G[2]  # store G[2] as p
      else 
        p=G[1]  # if numeric store as p
    } 
    else 
      p=F[3]   # store split value as p
  }
  { 
    split($4,A,/_/)
     if(split(A[2],B,/[_]/)) {
      if(B[2]~/[[:alpha:]]/)
        p=B[2]
  }
   }
    {
    print "chr" $1, $2, $3, "chr" $1 ":" $2 "-" $3 OFS p  # print desired output
  }
' FS='\t' OFS='\t' in   # define FS and OFS as tab-delimited

in tab-delimited


Code:
6	18122723	18122843	469_380805_378884(NHLRC1)_1.1_1
6	31114121	31114241	344047_16724314_rs746647_1
6	31430946	31431066	344049_16724385_HCP5(10866)_1_1
6	32808479	32808599	445446_18754304_PSMB8-exon6_1
1	33478785	33478905	19186497_AK2-Exon1_1
1	24022788	24022908	466743_18956150_RPL11-NM_000975-exon6_1

desired output tab-delimited


Code:
chr6	18122723	18122843	chr6:18122723-18122843	NHLRC1
chr6	31114121	31114241	chr6:31114121-31114241	rs746647
chr6	31430946	31431066	chr6:31430946-31431066	HCP5
chr6	32808479	32808599	chr6:32808479-32808599	PSMB8-exon6
chr1	33478785	33478905	chr1:33478785-33478905	AK2-Exon1
chr1	24022788	24022908	chr1:24022788-24022908	RPL11-NM_000975-exon6
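
A note on the modified script above: the added block splits `A[2]` on `_`, but `A[2]` is itself a piece produced by splitting on `_`, so it contains no underscore, `B[2]` is always empty, and `p` is never changed. With the original logic, line 5 therefore yields `1` (its third piece) and line 6 yields `RPL11-NM`. One way to keep the split-based approach is sketched below, under the assumption that the name is whatever remains after dropping leading all-digit pieces and trailing pieces made only of digits and dots. The function name `name_from_id` and the use of `[A-Za-z]` are choices made for this note.

```shell
name_from_id() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    n = split($4, F, /_/)
    first = 1
    while (first <= n && F[first] ~ /^[0-9]+$/) first++     # skip leading all-digit IDs
    last = n
    while (last >= first && F[last] ~ /^[0-9.]+$/) last--   # skip trailing counters such as 1.1 and 1
    p = ""; sep = ""
    for (i = first; i <= last; i++) {                       # rejoin the remaining pieces with _
      p = p sep F[i]
      sep = "_"
    }
    split(p, G, /[)(]/)                                     # then handle ( ) as in the earlier script
    p = (G[2] ~ /[A-Za-z]/) ? G[2] : G[1]
    print "chr" $1, $2, $3, "chr" $1 ":" $2 "-" $3, p
  }' "$@"
}

# sample call on line 6 of the input
printf '1\t24022788\t24022908\t466743_18956150_RPL11-NM_000975-exon6_1\n' | name_from_id
```

A name that itself begins or ends with an all-digit piece would be trimmed as well, so this is worth checking against the full data before relying on it.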


    #4  Scrutinizer (Moderator), 6 Days Ago
 
Hi, try this instead:


Code:
awk '
  {
    gsub(/^[0-9_]+[_(]|[)(_][_)(0-9.]+$/,x,$4)
    print "chr" $1, $2, $3, "chr" $1 ":" $2 "-" $3, $4
  }
' FS='\t' OFS='\t' file
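
The regular expression in the `gsub` call has two alternatives. `^[0-9_]+[_(]` removes the leading run of digits and underscores, up to and including the `_` or `(` that closes it. `[)(_][_)(0-9.]+$` removes a trailing `)`, `(` or `_` that is followed only by digits, dots, underscores and parentheses. The replacement `x` is an unset variable, so it is the empty string. As a rough illustration, the sketch below applies the two halves as separate `sub` calls and prints the intermediate value. This gives the same names on the six sample lines, but it is not claimed to match the `gsub` in every possible case. `strip_ids` is a name invented for this note.

```shell
strip_ids() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    name = $4
    sub(/^[0-9_]+[_(]/, "", name)        # first alternative: leading IDs
    head = name                          # value after the leading part is gone
    sub(/[)(_][_)(0-9.]+$/, "", name)    # second alternative: trailing counters and numeric ( )
    print $4, head, name
  }' "$@"
}

# sample call on line 3 of the input
printf '6\t31430946\t31431066\t344049_16724385_HCP5(10866)_1_1\n' | strip_ids
```

On line 3 the intermediate value is `HCP5(10866)_1_1`; the trailing alternative then matches from the `(` to the end, which is how the parenthesised number disappears. A name that starts with digits or ends in `_` plus digits would also be clipped by these patterns.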


    #5  cmccabe, 5 Days Ago
Thank you very much.