awk to format each line by pattern

Tags
awk, shell scripts

# 1  
Old 06-14-2018

The four lines in the tab-delimited input below are a sample of my actual data. The awk is meant to go line by line, check which pattern the line satisfies, and print it in the matching format (there are 3). Every line in the file should follow one of the three formats below. I added comments to the awk, but I cannot get it to execute, and there is probably a better way. Thank you.

Code:
format1= only text (alpha characters) inside the parentheses is stored in variable p   --- so only NHLRC1 is stored in p, as the other parentheses hold a number
format2= parentheses with a number in them are removed along with the number --- so in line 3 the (10866) is removed
format3= split $4 on the _ (underscore) and print the 3rd field

input tab-delimited
Code:
6	18122723	18122843	469_380805_378884(NHLRC1)_1.1_1
6	31114121	31114241	344047_16724314_rs746647_1
6	31430946	31431066	344049_16724385_HCP5(10866)_1_1
6	32808479	32808599	445446_18754304_PSMB8-exon6_1

desired output tab-delimited
Code:
chr6	18122723	18122843	chr6:18122723-18122843	NHLRC1
chr6	31114121	31114241	chr6:31114121-31114241	rs746647
chr6	31430946	31431066	chr6:31430946-31431066	HCP5
chr6	32808479	32808599	chr6:32808479-32808599	PSMB8-exon6

awk
Code:
awk 'BEGIN{FS=OFS="\t"}  # define fs and output
       FNR==NR{ # process each field in each line of file
         if(/([A-Z])/) {  # pattern 1 for extracting only alpha in () not number
            p=$(awk -F"[()]" '{print $2}')      # extract string in variable p
              print "chr"$1,$2,$3,"chr:"$2"-"$3,$p  # print desired output
               next
  }
         if(/([0-9])/) {  # pattern remove # in () 
            n=$(awk -F"[()]" '{print $2}')   # extract number in ()in variable n
              awk -v num=$n 'BEGIN {sub([0-9],"",num) && sub (),"",$4)  ; print name}  # substitute # with null value and print
               next
  }
         if($4 ~ /_/) {  # pattern 3 for _ split
            awk '{split($0,a,"_"); print "chr"$1,$2,$3,"chr:"$2"-"$3,a[3]}'  # if conditions 1 and 2 are not met then split on _ and print 3rd field along with desired fields
               next
  }
}' input

# 2  
Old 06-14-2018
Try:
Code:
awk '
  {
    split($4,F,/_/)
    if(split(F[3],G,/[)(]/)) {
      if(G[2]~/[[:alpha:]]/)
        p=G[2]
      else 
        p=G[1]
    } 
    else 
      p=F[3]
  }
  {
    print "chr" $1, $2, $3, "chr" $1 ":" $2 "-" $3 OFS p
  }
' FS='\t' OFS='\t' file
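
To see why this covers all three formats, it may help to look at what `split` on `[)(]` returns for the third `_`-separated piece of each sample line. The snippet below is only an illustrative sketch: the function name `show_split` is made up for this example and is not part of the solution above. Also note that the original attempt used `$(awk ...)`, which is shell command substitution and is not valid inside an awk program.

```shell
# show_split: hypothetical helper, prints the pieces produced by splitting on ( and )
show_split() {
  printf '%s\n' "$1" | awk '{
    n = split($0, G, /[)(]/)   # n is the number of pieces, at least 1 for a non-empty string
    printf "%s -> n=%d G[1]=%s G[2]=%s\n", $0, n, G[1], G[2]
  }'
}

show_split '378884(NHLRC1)'   # G[2] is alphabetic, so p=G[2]
show_split 'HCP5(10866)'      # G[2] is numeric, so p=G[1]
show_split 'rs746647'         # no parentheses: G[2] is empty, so p=G[1]
```

Because `split` returns at least 1 for every non-empty piece, the outer `if` is true for all of the sample lines, and the choice between `G[2]` and `G[1]` is what actually separates the formats. A number inside parentheses ends up in `G[2]`, fails the `[[:alpha:]]` test, and is therefore never assigned to `p`.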

The Following User Says Thank You to Scrutinizer For This Useful Post:
cmccabe (06-15-2018)
# 3  
Old 06-15-2018
The awk works great, thank you. I found two additional format types and commented your code to try to capture them, but I don't think I understand it correctly. Would you be able to comment it so I can try to make the changes? I added the second block (the one that splits into A and B) to capture the pattern in line 5 (split $4 on the _ and capture the 2nd value if it is alpha). Also, I can't figure out how a numeric value inside () does not get printed. Thank you very much.

Code:
awk '
  {
    split($4,F,/_/)            # split field 4 on _ and store in F
    if(split(F[3],G,/[)(]/)) { # store value of 3rd field in G
        if(G[2]~/[[:alpha:]]/) # check that it's alpha and store in G[2]
        p=G[2]  # store G[2] as p
      else 
        p=G[1]  # if numeric store as p
    } 
    else 
      p=F[3]   # store split value as p
  }
  { 
    split($4,A,/_/)
     if(split(A[2],B,/[_]/)) {
      if(B[2]~/[[:alpha:]]/)
        p=B[2]
  }
   }
    {
    print "chr" $1, $2, $3, "chr" $1 ":" $2 "-" $3 OFS p  # print desired output
  }
' FS='\t' OFS='\t' in   # define FS and OFS as tab-delimited

in tab-delimited
Code:
6	18122723	18122843	469_380805_378884(NHLRC1)_1.1_1
6	31114121	31114241	344047_16724314_rs746647_1
6	31430946	31431066	344049_16724385_HCP5(10866)_1_1
6	32808479	32808599	445446_18754304_PSMB8-exon6_1
1	33478785	33478905	19186497_AK2-Exon1_1
1	24022788	24022908	466743_18956150_RPL11-NM_000975-exon6_1

desired output tab-delimited
Code:
chr6	18122723	18122843	chr6:18122723-18122843	NHLRC1
chr6	31114121	31114241	chr6:31114121-31114241	rs746647
chr6	31430946	31431066	chr6:31430946-31431066	HCP5
chr6	32808479	32808599	chr6:32808479-32808599	PSMB8-exon6
chr1	33478785	33478905	chr1:33478785-33478905	AK2-Exon1
chr1	24022788	24022908	chr1:24022788-24022908	RPL11-NM_000975-exon6


Last edited by cmccabe; 06-15-2018 at 09:51 AM.. Reason: fixed format
# 4  
Old 06-16-2018
Hi, try this instead:
Code:
awk '
  {
    gsub(/^[0-9_]+[_(]|[)(_][_)(0-9.]+$/,x,$4)
    print "chr" $1, $2, $3, "chr" $1 ":" $2 "-" $3, $4
  }
' FS='\t' OFS='\t' file
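
For anyone wondering how the single `gsub` handles all six sample lines: the regular expression has two alternatives, one anchored at the start of field 4 and one anchored at the end, and `gsub` removes both. The sketch below applies the same expression to a single string so the effect is easier to see. The function name `strip_id` is hypothetical and only used for illustration. In the code above, `x` is an unset variable and so acts as the empty string; the sketch writes `""` directly.

```shell
# strip_id: hypothetical helper that applies the gsub from the post to one string
strip_id() {
  printf '%s\n' "$1" | awk '{
    # ^[0-9_]+[_(]       the longest run of digits and underscores at the start, ending in _ or (
    # [)(_][_)(0-9.]+$   a ), ( or _ followed only by digits, dots, _ and parentheses up to the end
    gsub(/^[0-9_]+[_(]|[)(_][_)(0-9.]+$/, "")
    print
  }'
}

strip_id '469_380805_378884(NHLRC1)_1.1_1'           # -> NHLRC1
strip_id '344049_16724385_HCP5(10866)_1_1'           # -> HCP5
strip_id '466743_18956150_RPL11-NM_000975-exon6_1'   # -> RPL11-NM_000975-exon6
```

One caveat with this approach: a name that itself ends in an underscore followed only by digits would lose that suffix as well, since the second alternative cannot tell it apart from the trailing counters.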


Last edited by Scrutinizer; 06-16-2018 at 01:55 AM..
The Following User Says Thank You to Scrutinizer For This Useful Post:
cmccabe (06-16-2018)
# 5  
Old 06-16-2018
Thank you very much.