Perl regex to remove a segment in a line
# 1  
Old 08-20-2012

Hello, ksh on Sun5.8 here. I have a pipe-delimited, variable-length record file, received from a source outside of our control, in which sub-segments are identified with a tilde. The records are huge, and Perl seems to be the only tool that can handle such long lines. I am new to Perl and am trying to come up with a regex to find ~DRG segments numbered higher than 15 and remove them. Some of these segments have sub-segments that should be ignored, i.e. a ~DRG segment can have multiple ~DCT segments, and is followed by other segments, some of which are optional.

Here's a sample BEFORE:

Code:
|~DRG|15|qwe|qwe|qwe|~DCT|efs|efs|243545|~DRG|16|qwe|qwe|qwe|~DCT|efs|efs|243545|~DRG|17|fgh|fgg|dfg|~DCT|fgg|fhh|`123|~MSP|etc|

And the desired AFTER:
Code:
|~DRG|15|qwe|qwe|qwe|~DCT|efs|efs|243545|~MSP|etc|

What I need to do is match each ~DRG segment whose next field is > 15, up to the next segment that is neither ~DRG nor ~DCT. I believe I am getting caught up on when to use a negative lookahead versus a plain lookahead, etc.

I have tried many ways, with this one being the closest:
Code:
 $str =~ s/\|~DRG\|(1[6-9]|2[0-9]).*?\((?!~DRG)|(?!~DCT)\)/\1/g;

But this does not match all the way to the next segment that is neither ~DRG nor ~DCT. In the output below, ~DRG 15 has only one ~DCT of its own, but the match does not extend to the ~MSP segment, so a second ~DCT is left behind:

Output (bad as it shows a ~DCT from one of the ~DRG's > 15)
(Lines wrapped for readability)
Code:
|~DRG|15|03|599942600|DECYL METHYL SULFOXIDE|0.060|I|0|O|
DECYL METHYL SULFOXIDE POWDER|1|99
|MISCELLANEOUS|U6W|BULK CHEMICALS|960000
|PHARMACEUTICAL AIDS|O||MISCELL.|POWDER|89.8 %|0|N||||~DCT|STD|0.00|01|AWPA|AWPA|38.5000016G||0|N
||||~DCT|STD|0.00|09||AWPA|2.50000
|~MSP|1|93392900|~MSP|2|72900
|~MSP|3|7512900
|~MSP|4|964850|~MSP|5|96500
|~MSP|6|96802900
|~MSP|7|6610000|~MSP|8|967900|~MSP|9|9932900
|~MSP|10|9680002900|~MSP|11|9662900
|~MSP|12
|79403800|~MSP|13|964900|~MSP|14|96700
|~MSP|15|9640|~MSP|16|96200|~MSP|17|96200037

If you have a suggestion on the regex or if there is a better approach, I will be grateful!

Gary

Last edited by gary_w; 08-20-2012 at 05:27 PM..
# 2  
Old 08-20-2012
Your first sample shows a single line record, but the second, larger sample appears to span multiple lines. Can a single record span multiple lines? If so, how is the end of record determined?

In the second sample you highlight a segment that according to your explanation should not be modified. The field after ~DRG is not greater than 15. Also, there do appear to be two ~DCT segments highlighted in blue, but your text mentions only one.

Edit: Hmm. Perhaps there was a ~DRG|16 in that second sample that was deleted, and the second ~DCT belongs to it. Not sure. This is why it's good to show both the before and after with sample data.

In addition to answers for those questions, it would help if for each data sample you showed us the before and the (desired) after.

Regards,
Alister

Last edited by alister; 08-20-2012 at 04:54 PM..
# 3  
Old 08-20-2012
My apologies for not being clear enough.

The first record is just a sample showing the DRG and DCT layout.

The second example output is split to multiple lines for readability. The actual records are separated by carriage returns and have TONS of columns so this is just a sample of the relevant part of the record. DRG 15 has only one DCT but it is showing another DCT from one of the records >15. My regex is not going all the way to the MSP record.

I will update my first example to show a before and desired after. The record is too huge to show the whole thing.
# 4  
Old 08-20-2012
To confirm that your AWK can't handle these records, does the following fail to print the number of fields in each record?
Code:
awk -F\| '{print NF}' file

On Solaris, make sure to test with /usr/xpg4/bin/awk.

Also, do the records always begin and/or end with a pipe symbol? If yes, be specific, begin or end or both.

Regards,
Alister
# 5  
Old 08-20-2012
This works:
Code:
/usr/xpg4/bin/awk -F\| '{print NF}' file

Some records have > 800 columns!

Records do not start with a pipe nor end with one; however the segments inside the record do start/end with a pipe.
# 6  
Old 08-20-2012
This works with several variations I created from your sample data:
Code:
perl -lpe 's/\|~DRG\|(\d+).*?(?=\|~(?!DCT))/$1>15?"":$&/ge' file

Regards,
Alister
# 7  
Old 08-20-2012
Sweet! It works on my test file. Could you explain the regex? I need to do some similar operations on other parts of the file and want to understand it.