Perl regex to remove a segment in a line

08-20-2012

Registered User

446, 96

Join Date: Oct 2010

Last Activity: 7 June 2018, 4:01 PM EDT

Posts: 446

Thanks Given: 32

Thanked 96 Times in 88 Posts

Perl regex to remove a segment in a line

Hello, ksh on Sun5.8 here. I have a pipe-delimited, variable length record file with sub-segments identified with a tilda that we receive from a source outside of our control. The records are huge, and Perl seems to be the only shell that can handle the huge lines. I am new to Perl, and am trying to come up with a regex to find segments > 15 and remove them. Some of these segments have sub-segments that should be ignored. i.e. ~DRG segments can have multiple ~DCT segments, and are followed by other segments, some of which are optional..

Here's a sample BEFORE:

Code:

|~DRG|15|qwe|qwe|qwe|~DCT|efs|efs|243545|~DRG|16|qwe|qwe|qwe|~DCT|efs|efs|243545|~DRG|17|fgh|fgg|dfg|~DCT|fgg|fhh|`123|~MSP|etc|

And the desired AFTER:

Code:

|~DRG|15|qwe|qwe|qwe|~DCT|efs|efs|243545|~MSP|etc|

What I need to do is match ~DRG segments where the next field is > 15, up to the next non- ~DRG or non-~DCT segment. I believe I am getting caught up using negative search vs a read-ahead method, etc.

I have tried many ways, with this one being the closest:

Code:

 $str =~ s/\|~DRG\|(1[6-9]|2[0-9]).*?\((?!~DRG)|(?!~DCT)\)/\1/g;

But this is not going all the way to the next non- ~DRG or non-~DCT segment. In the output below, ~DRG 15 only has one ~DCT but the match is not going all the way to the ~MSP segment:

Output (bad as it shows a ~DCT from one of the ~DRG's > 15)
(Lines wrapped for readability)

Code:

|~DRG|15|03|599942600|DECYL METHYL SULFOXIDE|0.060|I|0|O|
DECYL METHYL SULFOXIDE POWDER|1|99
|MISCELLANEOUS|U6W|BULK CHEMICALS|960000
|PHARMACEUTICAL AIDS|O||MISCELL.|POWDER|89.8 %|0|N||||~DCT|STD|0.00|01|AWPA|AWPA|38.5000016G||0|N
||||~DCT|STD|0.00|09||AWPA|2.50000
|~MSP|1|93392900|~MSP|2|72900
|~MSP|3|7512900
|~MSP|4|964850|~MSP|5|96500
|~MSP|6|96802900
|~MSP|7|6610000|~MSP|8|967900|~MSP|9|9932900
|~MSP|10|9680002900|~MSP|11|9662900
|~MSP|12
|79403800|~MSP|13|964900|~MSP|14|96700
|~MSP|15|9640|~MSP|16|96200|~MSP|17|96200037

If you have a suggestion on the regex or if there is a better approach, I will be grateful!

Gary

Last edited by gary_w; 08-20-2012 at 05:27 PM..

gary_w

View Public Profile for gary_w

Find all posts by gary_w

08-20-2012

Registered User

3,231, 978

Join Date: Dec 2009

Last Activity: 11 June 2014, 8:40 PM EDT

Posts: 3,231

Thanks Given: 179

Thanked 978 Times in 791 Posts

Your first sample shows a single line record, but the second, larger sample appears to span multiple lines. Can a single record span multiple lines? If so, how is the end of record determined?

In the second sample you highlight a segment that according to your explanation should not be modified. The field after ~DRG is not greater than 15. Also, there do appear to be two ~DCT segments highlighted in blue, but your text mentions only one.

Edit: Hmm. Perhaps ther was a ~DRG|16 in that second sample that was deleted, and the second ~DCT belongs to it. Not sure. This is why it's good to show both the before and after with sample data.

In addition to answers for those questions, it would help if for each data sample you showed us the before and the (desired) after.

Regards,
Alister

Last edited by alister; 08-20-2012 at 04:54 PM..

alister

View Public Profile for alister

Find all posts by alister

08-20-2012

Registered User

446, 96

Join Date: Oct 2010

Last Activity: 7 June 2018, 4:01 PM EDT

Posts: 446

Thanks Given: 32

Thanked 96 Times in 88 Posts

My apologies for not being clear enough.

The first record is just a sample showing the DRG and DCT layout.

The second example output is split to multiple lines for readability. The actual records are separated by carriage returns and have TONS of columns so this is just a sample of the relevant part of the record. DRG 15 has only one DCT but it is showing another DCT from one of the records >15. My regex is not going all the way to the MSP record.

I will update my first example to show a before and desired after. The record is too huge to show the whole thing.

gary_w

View Public Profile for gary_w

Find all posts by gary_w

08-20-2012

Registered User

3,231, 978

Join Date: Dec 2009

Last Activity: 11 June 2014, 8:40 PM EDT

Posts: 3,231

Thanks Given: 179

Thanked 978 Times in 791 Posts

To confirm that your AWK can't handle these records, does the following fail to print the number of fields in each record?

Code:

awk -F\| '{print NF}' file

On Solaris, make sure to test with /usr/xpg4/bin/awk.

Also, do the records always begin and/or end with a pipe symbol? If yes, be specific, begin or end or both.

Regards,
Alister

alister

View Public Profile for alister

Find all posts by alister

08-20-2012

Registered User

446, 96

Join Date: Oct 2010

Last Activity: 7 June 2018, 4:01 PM EDT

Posts: 446

Thanks Given: 32

Thanked 96 Times in 88 Posts

This works:

Code:

/usr/xpg4/bin/awk -F\| '{print NF}' file

Some records have > 800 columns!

Records do not start with a pipe nor end with one; however the segments inside the record do start/end with a pipe.

gary_w

View Public Profile for gary_w

Find all posts by gary_w

08-20-2012

Registered User

3,231, 978

Join Date: Dec 2009

Last Activity: 11 June 2014, 8:40 PM EDT

Posts: 3,231

Thanks Given: 179

Thanked 978 Times in 791 Posts

This works with several variations I created from your sample data:

Code:

perl -lpe 's/\|~DRG\|(\d+).*?(?=\|~(?!DCT))/$1>15?"":$&/ge' file

Regards,
Alister

This User Gave Thanks to alister For This Post:

alister

View Public Profile for alister

Find all posts by alister

08-20-2012

Registered User

446, 96

Join Date: Oct 2010

Last Activity: 7 June 2018, 4:01 PM EDT

Posts: 446

Thanks Given: 32

Thanked 96 Times in 88 Posts

Sweet! It works on my test file. I would be grateful if you could give an explanation on the regex? I need to do some similar operations on other parts of the file and want to understand it.

gary_w

View Public Profile for gary_w

Find all posts by gary_w

Shell Programming and Scripting

Perl regex to remove a segment in a line

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Perl, RegEx - Help me to understand the regex!

Discussion started by: alex_5161

2. Shell Programming and Scripting

Need to remove first 6 lines and last line in a array ---- perl scripting

Discussion started by: siva kumar

3. Programming

Data segment or Text segment

Discussion started by: royalibrahim

4. Shell Programming and Scripting

Converting perl regex to sed regex

Discussion started by: suntzu

5. Shell Programming and Scripting

Using Sed to remove part of line with regex

Discussion started by: msarro

6. Shell Programming and Scripting

Remove repeated line using Perl

Discussion started by: dinesh.4126

7. Shell Programming and Scripting

perl regex multi line cut

Discussion started by: tip78

8. Shell Programming and Scripting

how to remove line from /etc/vfstab using shell / perl

Discussion started by: tarunn.dubeyy

9. Shell Programming and Scripting

Perl REGEX - How do extract a string in a line?

Discussion started by: maverick-ski

10. Shell Programming and Scripting

how do i strip this line using perl regex.

Discussion started by: ramky79