awk to remove lines that do not start with digit and combine line or lines

Shell Programming and Scripting


Tags
awk, solved

    #1  
cmccabe, 07-13-2017

I have been searching and trying to come up with an awk script that will do the following on a
converted text file (the original is a PDF).


Code:
1. The first two lines begin with text, so they are removed.
2. If $1 is a number, all following text is merged into that line until the next line with a number in $1. There might be no lines before the next number, or 1 line, 2 lines, etc. The number of lines varies, but each record always starts with a number in $1.
3. The last 3 lines begin with text, so they are removed.

I have added my awk attempt with a description as well. Thank you.

file

Code:
TIER 1 MOLECULAR PATHOLOGY PROCEDURES
The following codes represent gene-specific and genomic procedures.
81161 This code is out of order. See page 714.
81162 This code is out of order. See page 712.
81170 ABL1 (ABL proto-oncogene 1, non-receptor tyrosine kinase) (eg, acquired imatinib tyrosine kinase inhibitor
resistance), gene analysis, variants in the kinase domain
81200
ASPA (aspartoacylase) (eg, Canavan disease) gene analysis, common variants (eg, E285A, Y231X)
81201 APC (adenomatous polyposis coli) (eg, familial adenomatosis polyposis [FAP], attenuated FAP) gene
analysis; full gene sequence
81202 known familiar variants
81203 duplication/deletion variants
81205 BCKDHB (branched-chain keto acid dehydrogenase E1, beta polypeptide) (eg, Maple syrup urine disease)
gene analysis, common variants (eg, R183P, G278S, E422X)
81206 BCR/ABL1 (t(9;22)) (eg, chronic myelogenous leukemia) translocation analysis; major breakpoint,
qualitative or quantitative
CPT codes and descriptions only ©2016 American Medical Association. All rights reserved.
CCI Comp. Code
Non-specific Procedure

desired output

Code:
81161 This code is out of order. See page 714.
81162 This code is out of order. See page 712.
81170 ABL1 (ABL proto-oncogene 1, non-receptor tyrosine kinase) (eg, acquired imatinib tyrosine kinase inhibitor resistance), gene analysis, variants in the kinase domain
81200 ASPA (aspartoacylase) (eg, Canavan disease) gene analysis, common variants (eg, E285A, Y231X)
81201 APC (adenomatous polyposis coli) (eg, familial adenomatosis polyposis [FAP], attenuated FAP) gene analysis; full gene sequence
81202 known familiar variants
81203 duplication/deletion variants
81205 BCKDHB (branched-chain keto acid dehydrogenase E1, beta polypeptide) (eg, Maple syrup urine disease) gene analysis, common variants (eg, R183P, G278S, E422X)
81206 BCR/ABL1 (t(9;22)) (eg, chronic myelogenous leukemia) translocation analysis; major breakpoint, qualitative or quantitative

awk

Code:
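awk '$1 ~ /^[0-9]+$/ {                   # $1 is a number: a new record starts here
       if (l != "") print l              # print the previous combined record
       l = $0; next                      # start collecting the new record
     }
     l != "" { l = l " " $0 }            # $1 is not a number: append the line to the current record
     END { if (l != "") print l }        # print the last record
     # leading text is skipped because l is still empty; trailing text is still appended to the last record
    ' file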


Last edited by cmccabe; 07-13-2017 at 08:21 PM..
    #2  
RavinderSingh13 (Forum Advisor), 07-14-2017

Hello cmccabe,

Could you please try the following and let me know if it helps?

Code:
awk '/^[0-9]/{val=$0;getline;if($0 !~ /^[0-9]/){print val,$0} else {print val ORS $0}}'   Input_file

Thanks,
R. Singh
    #3  
Scrutinizer (Moderator), 07-14-2017

One way:

Code:
awk 'NR>5{print A[NR%3]} {A[NR%3]=$0}' file |
awk '$1!~/[^0-9]/{if(p) print p; p=$0; next} {p=p OFS $0} END{print p}'
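
Read step by step, the pipeline above seems to work as sketched below. This is a commented reading of the one-liners, not the author's own explanation. The wrapper name `merge_codes` is made up for illustration, and the sketch assumes exactly two header lines and three footer lines, as in the sample file.

```shell
#!/bin/sh
# merge_codes is a hypothetical wrapper name, not part of the original post.
# Usage: merge_codes FILE
merge_codes() {
  awk '
    # Stage 1: drop the first 2 and the last 3 lines.
    # Every line is stored in a 3-slot ring buffer A and printed 3 lines late,
    # so the final 3 lines are never printed. Printing starts at NR=6, which
    # emits the line stored at NR=3, so lines 1 and 2 are skipped as well.
    NR > 5 { print A[NR % 3] }
    { A[NR % 3] = $0 }
  ' "$1" |
  awk '
    # Stage 2: merge continuation lines into the preceding numbered line.
    $1 !~ /[^0-9]/ {       # $1 contains no non-digit character: a new record
      if (p) print p       # print the previous record, if there is one
      p = $0               # start collecting the new record
      next
    }
    { p = p OFS $0 }       # any other line: append it to the current record
    END { print p }        # print the last record
  '
}
```

Calling `merge_codes file` should give the same result as the two-line pipeline. One thing to keep in mind: a blank line has an empty `$1`, which also passes the `$1 !~ /[^0-9]/` test and would be treated as the start of a record.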

--
@Ravinder, that will fail in the following cases:
  • A broken line sequence starts on an even line number.
  • The second part of a broken line begins with a number.
  • The third-from-last line is not broken and has an even line number.
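
The first case can be reproduced with a small made-up input in which the broken record starts on line 2. This is only an illustrative sketch: the wrapper names `pairwise`, `accumulate` and `sample` and the sample lines are invented here, while the two awk programs are the getline one-liner from post #2 and the second stage of the pipeline in this post.

```shell
#!/bin/sh
# Hypothetical wrapper names; the awk programs are copied from the thread.
pairwise() {    # getline approach from post #2
  awk '/^[0-9]/{val=$0;getline;if($0 !~ /^[0-9]/){print val,$0} else {print val ORS $0}}'
}
accumulate() {  # second stage of the pipeline in this post
  awk '$1!~/[^0-9]/{if(p) print p; p=$0; next} {p=p OFS $0} END{print p}'
}

# Made-up input: the broken record "81170 ..." starts on line 2.
sample() {
  printf '%s\n' '81161 A' '81170 B start' 'B end' '81200 C' 'C end'
}

sample | pairwise     # "B end" is lost: lines 1 and 2 were consumed as a pair
sample | accumulate   # "B end" is appended to the 81170 record
```

The getline version always reads exactly one extra line after a numbered line, so it cannot tell whether that line is a continuation or the next record. The accumulating version decides per line and therefore does not depend on line parity.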

Last edited by Scrutinizer; 07-14-2017 at 06:08 AM..
    #4  
cmccabe, 07-14-2017

Thank you both very much. @Scrutinizer, would you mind adding a brief description of how the awk works? Thank you.

---------- Post updated at 07:51 AM ---------- Previous update was at 07:23 AM ----------

Can the number in $1 be restricted to 5 digits? That is, if a line starts with a random number of some other length (for example, 3 digits), that line is removed.


Code:
all whitespace and symbols in $1 are removed
the line starting with 714 is removed because its number has fewer than 5 digits


file

Code:
      81262        direct probe methodology (eg, Southern blot)
      81263    IGH@ (Immunoglobulin heavy chain locus) (eg, leukemia and lymphoma, B-cell), variable region somatic
               mutation analysis
714       l   New Code     s Revised Code       +   Add-On Code        Ꮬ Modifier -51 Exempt                  H    Telemedicine
                                                              CPT codes and descriptions only ©2016 American Medical Association. All rights reserved.
                                                                                                            PATHOLOGY/ LABORATORY

desired output

Code:
81262        direct probe methodology (eg, Southern blot)
81263    IGH@ (Immunoglobulin heavy chain locus) (eg, leukemia and lymphoma, B-cell), variable region somatic
               mutation analysis

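awk for length, maybe:

Code:
awk 'NR>5{print A[NR%3]} {A[NR%3]=$0}' file |
awk '{sub(/^[ \t]+/,"")}                                              # strip leading whitespace
     $1~/^[0-9]+$/{if(p!="") print p; p=(length($1)==5)?$0:""; next}  # number in $1: flush, keep only 5-digit codes
     p!=""{p=p OFS $0} END{if(p!="") print p}'                        # append continuation text; print the last record

Thank you.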

Last edited by cmccabe; 07-15-2017 at 10:03 AM..