awk to remove lines that do not start with digit and combine line or lines


# 1  

I have been searching for, and trying to write, an awk script that does the following to a text file converted from a PDF.

Code:
1. Since the first two lines begin with text, they are removed.
2. If $1 is a number, then all following text is merged (combined) into one line until the next number in $1. There might be no lines until the next number, or 1 line, 2 lines, etc. The number of lines is variable, but what is constant is the number in $1.
3. Since the last 3 lines begin with text, they are removed.

I added an awk script attempt with a description as well. Thank you.

file
Code:
TIER 1 MOLECULAR PATHOLOGY PROCEDURES
The following codes represent gene-specific and genomic procedures.
81161 This code is out of order. See page 714.
81162 This code is out of order. See page 712.
81170 ABL1 (ABL proto-oncogene 1, non-receptor tyrosine kinase) (eg, acquired imatinib tyrosine kinase inhibitor
resistance), gene analysis, variants in the kinase domain
81200
ASPA (aspartoacylase) (eg, Canavan disease) gene analysis, common variants (eg, E285A, Y231X)
81201 APC (adenomatous polyposis coli) (eg, familial adenomatosis polyposis [FAP], attenuated FAP) gene
analysis; full gene sequence
81202 known familiar variants
81203 duplication/deletion variants
81205 BCKDHB (branched-chain keto acid dehydrogenase E1, beta polypeptide) (eg, Maple syrup urine disease)
gene analysis, common variants (eg, R183P, G278S, E422X)
81206 BCR/ABL1 (t(9;22)) (eg, chronic myelogenous leukemia) translocation analysis; major breakpoint,
qualitative or quantitative
CPT codes and descriptions only 2016 American Medical Association. All rights reserved.
CCI Comp. Code
Non-specific Procedure

desired output
Code:
81161 This code is out of order. See page 714.
81162 This code is out of order. See page 712.
81170 ABL1 (ABL proto-oncogene 1, non-receptor tyrosine kinase) (eg, acquired imatinib tyrosine kinase inhibitor resistance), gene analysis, variants in the kinase domain
81200 ASPA (aspartoacylase) (eg, Canavan disease) gene analysis, common variants (eg, E285A, Y231X)
81201 APC (adenomatous polyposis coli) (eg, familial adenomatosis polyposis [FAP], attenuated FAP) gene analysis; full gene sequence
81202 known familiar variants
81203 duplication/deletion variants
81205 BCKDHB (branched-chain keto acid dehydrogenase E1, beta polypeptide) (eg, Maple syrup urine disease) gene analysis, common variants (eg, R183P, G278S, E422X)
81206 BCR/ABL1 (t(9;22)) (eg, chronic myelogenous leukemia) translocation analysis; major breakpoint, qualitative or quantitative

awk
Code:
awk '$1 ~ /^[0-9]+$/ {               # $1 is a number: a new record starts
       if (l != "") print l          # print the previous combined line
       l = $0; next                  # start collecting the new record
     }
     l != "" { l = l " " $0 }        # $1 is not a number: combine with line (l) until the next number
     END { if (l != "") print l }    # print the last record (trailing text lines are still appended to it)
    ' file


Last edited by cmccabe; 07-13-2017 at 09:21 PM..
# 2  
Hello cmccabe,

Could you please try following and let me know if this helps.
Code:
awk '/^[0-9]/{val=$0;getline;if($0 !~ /^[0-9]/){print val,$0} else {print val ORS $0}}'   Input_file

Thanks,
R. Singh
This User Gave Thanks to RavinderSingh13 For This Post:
# 3  
One way:
Code:
awk 'NR>5{print A[NR%3]} {A[NR%3]=$0}' file |
awk '$1!~/[^0-9]/{if(p) print p; p=$0; next} {p=p OFS $0} END{print p}'
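
An annotated reading of the two stages above, as a sketch rather than the poster's own wording. The function names `drop_header_footer` and `merge_records` and the ten-line input are invented for illustration; the awk programs are the same as in the pipeline.

```shell
#!/bin/sh
# Stage 1: drop the first 2 and the last 3 lines.
# A[] is a 3-slot ring buffer: when line NR is read, A[NR%3] still holds
# line NR-3. Printing it only when NR>5 means lines 1 and 2 are never
# printed, and the final three lines are stored but never reach the output.
drop_header_footer() {
  awk 'NR > 5 { print A[NR % 3] } { A[NR % 3] = $0 }' "$@"
}

# Stage 2: merge continuation lines into the preceding numbered line.
# $1 !~ /[^0-9]/ is true when $1 contains no non-digit character.
merge_records() {
  awk '
    $1 !~ /[^0-9]/ {      # a new record starts here
      if (p) print p      # flush the previous record, if any
      p = $0              # start collecting the new one
      next
    }
    { p = p OFS $0 }      # continuation text: append, separated by OFS (a space)
    END { print p }       # flush the last record
  ' "$@"
}

# Invented input in the same shape as the sample file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
HEADER LINE ONE
HEADER LINE TWO
81170 first part
second part
81200
ASPA text
81202 single line
FOOTER ONE
FOOTER TWO
FOOTER THREE
EOF

result=$(drop_header_footer "$sample" | merge_records)
rm -f "$sample"
printf '%s\n' "$result"
```

On this input the script prints three lines: `81170 first part second part`, `81200 ASPA text`, and `81202 single line`. Note that stage 1 relies on the header being exactly 2 lines and the footer exactly 3 lines.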

--
@Ravinder, that will fail in the following cases:
  • A broken line sequence starts on an even line number.
  • The second part of a broken line begins with a number.
  • The third-to-last line is not broken and has an even line number.
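
The first case can be reproduced with a small test. This is a sketch on invented three-line input, where the broken record begins on line 2; the awk program is the one-liner from post #2, unchanged.

```shell
#!/bin/sh
# Run the getline one-liner on input whose broken record starts on an
# even line number (line 2).
result=$(printf '%s\n' \
  '10001 first' \
  '10002 second, part one' \
  'part two' |
awk '/^[0-9]/{val=$0;getline;if($0 !~ /^[0-9]/){print val,$0} else {print val ORS $0}}')
printf '%s\n' "$result"
# Line 1 consumes line 2 through getline, so line 2 is printed as-is and
# never gets the chance to pick up "part two". "part two" is then read as
# an ordinary record, does not match /^[0-9]/, and is silently skipped.
```

The output contains only `10001 first` and `10002 second, part one`; the continuation text is lost.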

Last edited by Scrutinizer; 07-14-2017 at 07:08 AM..
# 4  
Thank you both very much. @Scrutinizer, would you mind adding a brief description of how the awk works? Thank you.

---------- Post updated at 07:51 AM ---------- Previous update was at 07:23 AM ----------

Can the number in $1 be restricted to 5 digits? That is, if a line starts with a random number that is only 3 digits long, that line is removed.

Code:
all whitespace and symbols in $1 are removed
the line starting with 714 is removed because that number is shorter than 5 digits


file
Code:
      81262        direct probe methodology (eg, Southern blot)
      81263    IGH@ (Immunoglobulin heavy chain locus) (eg, leukemia and lymphoma, B-cell), variable region somatic
               mutation analysis
714       l   New Code     s Revised Code       +   Add-On Code        Ꮬ Modifier -51 Exempt                  H    Telemedicine
                                                              CPT codes and descriptions only 2016 American Medical Association. All rights reserved.
                                                                                                            PATHOLOGY/ LABORATORY

desired output
Code:
81262        direct probe methodology (eg, Southern blot)
81263    IGH@ (Immunoglobulin heavy chain locus) (eg, leukemia and lymphoma, B-cell), variable region somatic
               mutation analysis

awk for length, maybe:
Code:
awk 'NR>5{print A[NR%3]} {A[NR%3]=$0}' file |
awk '{sub(/^[ \t]+/, "")} $1 ~ /^[0-9]+$/ {if (p != "") print p; p = (length($1) == 5 ? $0 : ""); next} p != "" {p = p OFS $0} END {if (p != "") print p}'

Thank you.
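
As a self-contained check of the 5-digit rule, here is a sketch run on a cut-down, invented version of the sample. It assumes that continuation lines should still be merged as in post #1 (the desired output above does not show this) and that text following a discarded short-number line belongs to that line and is dropped with it. The function name `merge_5digit` is hypothetical.

```shell
#!/bin/sh
# Only an all-digit $1 of length 5 starts a kept record; any other
# all-digit $1 starts a record that is discarded together with its text.
merge_5digit() {
  awk '
    { sub(/^[[:space:]]+/, "") }        # strip leading whitespace
    !NF { next }                        # ignore blank lines
    $1 ~ /^[0-9]+$/ {                   # an all-digit $1 ends the current record
      if (p != "") print p              # flush the previous kept record
      p = (length($1) == 5) ? $0 : ""   # keep 5-digit codes, discard others
      next
    }
    p != "" { p = p OFS $0 }            # append text only to a kept record
    END { if (p != "") print p }        # flush the last kept record
  ' "$@"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
      81262 direct probe methodology
      81263 IGH@ variable region somatic
               mutation analysis
714 l New Code
               All rights reserved.
EOF

result=$(merge_5digit "$sample")
rm -f "$sample"
printf '%s\n' "$result"
```

This prints `81262 direct probe methodology` and `81263 IGH@ variable region somatic mutation analysis`. Footer text that directly follows a kept 5-digit record, with no short-number line in between, would still be appended to it, so a filter for the last 3 lines may still be needed in front of this step.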

Last edited by cmccabe; 07-15-2017 at 11:03 AM..