Perl to change value based on set of rules Post: 303002688

Post 303002688 by durden_tyler on Wednesday 30th of August 2017 03:30:26 PM

Quote:

Originally Posted by cmccabe

...
...
is it more or less I am trying to capture to many conditions with the regex?
What would you recommend?
...
...

Yes, from your other Perl-related posts, I do get the impression that you are trying to use regexes for too many things. That should be avoided.
However, for this particular piece of code, I think you may want to deepen your understanding of regexes.

You have two types of data in the F[8] column.

Type 1:

Code:

27
35
>50

and

Type 2:

Code:

NM_018328:exon12:c.3055-9T>C
NM_003042:c.*234C>A

So use regular expressions that work specifically with each type of data.
Your regex "\D\d+" is meant for Type 1, but it will actually match Type 2 as well.
Why?
Because "\D" means "non-digit character" and so it matches the "_" after "NM".
And then that is followed by "\d+" - "one or more digits". That's why the regex doesn't work the way you want.

Here's a demonstration:

Code:

$ perl -le '$x = "NM_018328:exon12:c.3055-9T>C"; if ($x =~ /(\D)(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = _
\d+ or $2 = 018328

And for line # 5:

Code:

$ perl -le '$x = "NM_003042:c.*234C>A"; if ($x =~ /(\D)(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = _
\d+ or $2 = 003042

As you can see, the regex meant for Type 1 data matches Type 2 data as well.

So, determine what exactly differentiates Type 1 data from Type 2 data. Here are a few observations:

(1) Type 1 has "\d+" - "one or more digits"
(2) Type 1 may or may not have a non-digit at the front. This non-digit could be ">", "+" or "-". But nothing else.
(3) If there is a non-digit at the front, there is only one such non-digit. There cannot be more than one. So you need: "zero or one non-digit". For that, you could use "\D{0,1}" or "\D?".

Let's test this on the one-liner above.
First, notice that "\D\d+" works on ">50" but not on "50".

Code:

$
$ perl -le '$x = ">50"; if ($x =~ /(\D)(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = >
\d+ or $2 = 50
$
$ perl -le '$x = "50"; if ($x =~ /(\D)(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$

That's because there is nothing before "50" in the second case, but the regex "\D\d+" demands exactly one non-digit at the beginning.
Since there was no non-digit, the match failed.

Now notice how "\D?\d+" works for both cases:

Code:

$
$ perl -le '$x = ">50"; if ($x =~ /(\D?)(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = >
\d+ or $2 = 50
$
$
$ perl -le '$x = "50"; if ($x =~ /(\D?)(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 =
\d+ or $2 = 50
$
$

Now, we make the regex more robust. We know that the "non-digit" character at the beginning is one of ">", "+" or "-".
So we use the bracket notation: "[>+-]"
This will match exactly one of the characters inside the brackets.
And since there can be 0 or 1 of such characters, we use "?" after the brackets: "[>+-]?"
In other words, we simply replaced "\D" by "[>+-]"
"\D" matches any non-digit character; it could match "#" or "A" or ">" etc.
"[>+-]" matches only one of the characters inside the brackets.

Testing again:

Code:

$
$ perl -le '$x = ">50"; if ($x =~ /([>+-]?)(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = >
\d+ or $2 = 50
$
$ perl -le '$x = "50"; if ($x =~ /([>+-]?)(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 =
\d+ or $2 = 50
$
$

Finally, we only want the sequence of digits at the end.
So we can remove the parentheses around the non-digits at the beginning.
We can also put the "beginning of string anchor", which is "^" to specify that the non-digits are at the beginning of the string.
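The updated regex is "^[>+-]?(\d+)"
It has only one capture group now, so in the output below the digits show up in $1 and $2 is empty.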

Testing again:

Code:

$
$ perl -le '$x = ">50"; if ($x =~ /^[>+-]?(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = 50
\d+ or $2 =
$
$ perl -le '$x = "50"; if ($x =~ /^[>+-]?(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = 50
\d+ or $2 =
$
$

So that takes care of Type 1 data.
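
A side note on the "^" anchor, as a rough sketch rather than anything from your script: without the anchor, the character-class version can still find digits in the middle of a Type 2 string, because the regex engine is free to start matching anywhere. The helper name capture_digits is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the digits captured by a pattern,
# or undef when the pattern does not match at all.
sub capture_digits {
    my ($pattern, $value) = @_;
    return $value =~ $pattern ? $1 : undef;
}

my $unanchored = qr/[>+-]?(\d+)/;     # may start matching anywhere in the string
my $anchored   = qr/^[>+-]?(\d+)/;    # must start matching at position 0

for my $value ('>50', '50', 'NM_018328:exon12:c.3055-9T>C') {
    my $loose  = capture_digits($unanchored, $value) // 'no match';
    my $strict = capture_digits($anchored,   $value) // 'no match';
    print "$value: unanchored = $loose, anchored = $strict\n";
}
```

For the two Type 1 values both patterns give 50. For the Type 2 value the unanchored pattern picks up 018328, while the anchored one does not match.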

Now for Type 2 data.
Your regex "/(?:\.\d+[+*-])(\d+)/" looks for the following:

(1) A single dot character "." followed by
(2) One or more digits "\d+" followed by
(3) Exactly one of the characters "+", "*", "-" followed by
(4) One or more digits "\d+"

It matches (1), (2), (3) together but does not "group" them into $1 (due to "?:" at the beginning).
It matches (4) and groups the sequence of digits into $1.
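
To see what the "?:" changes, here is a small sketch that runs the same pattern twice on your Line 1 value, once with a capturing first group and once with the non-capturing "(?:...)" form. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $value = 'NM_018328:exon12:c.3055-9T>C';

# Capturing prefix: the prefix becomes group 1 and the digits become group 2.
my @with_capture = $value =~ /(\.\d+[+*-])(\d+)/;

# Non-capturing prefix: (?:...) groups without numbering, so the digits are group 1.
my @without_capture = $value =~ /(?:\.\d+[+*-])(\d+)/;

print "capturing:     @with_capture\n";
print "non-capturing: @without_capture\n";
```

The first line prints two items (".3055-" and "9"), the second prints only "9".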

Now, if you look at your Line # 5:

Code:

NM_003042:c.*234C>A

the data has:
(1) Single dot character "."
(2) But no sequence of digits after the dot!! There is a "*" after the dot "."

Hence your regex fails.
Here's the demonstration (the first group is made capturing here so that both parts get printed):

Code:

$
$ # Matches Line # 1
$ perl -le '$x = "NM_018328:exon12:c.3055-9T>C"; if ($x =~ /(\.\d+[+*-])(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = .3055-
\d+ or $2 = 9
$
$ # But does not match Line # 5
$ perl -le '$x = "NM_003042:c.*234C>A"; if ($x =~ /(\.\d+[+*-])(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$

So what are the special characteristics of Type 2 data that distinguish it from Type 1 data? And how do we create the regex to match Type 2 data?

Firstly, if all Type 2 data start with "NM_", you could use that in your regex. So we have "NM_"

Now, it has a dot "." at some point further on. So we get the regex "NM_.*\."
Here ".*" consumes as many characters as it can, so the "\." after it ends up on the right-most dot that still lets the rest of the regex match. It's a greedy search.
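
A quick sketch of what "greedy" means in practice. The sample value with two dots is made up for illustration (it is not from your data), and the lazy form ".*?" is shown only for contrast; it is not part of the regex being built here.

```perl
use strict;
use warnings;

# Hypothetical value containing two dots; not taken from the original data.
my $value = 'NM_000000.1:c.3055-9T>C';

my ($greedy) = $value =~ /(NM_.*\.)/;     # .* takes as much as it can
my ($lazy)   = $value =~ /(NM_.*?\.)/;    # .*? takes as little as it can

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy version runs through to the last dot ("NM_000000.1:c."), the lazy one stops at the first ("NM_000000.").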

The dot character may or may not have a sequence of digits after it. (Line 1 has one, Line 5 does not.) "\d*" matches "zero or more digits", so it covers both cases.
So, we get: "NM_.*\.\d*"

After that, we definitely have one of the following characters "+", "*", "-".
So we use "[+*-]" for that. The regex now becomes "NM_.*\.\d*[+*-]"

Finally, that is followed by a sequence of digits that we want to capture.
Sequence of digits is "\d+". So the final regex is: "NM_.*\.\d*[+*-](\d+)"
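
If it helps readability, the same regex can be written with the /x flag, which lets you spread the pieces out and comment each one. This is just a sketch of one way to lay it out; the variable name $type2 is my own.

```perl
use strict;
use warnings;

# Same pattern as above, one piece per line. Under /x, whitespace and
# comments inside the pattern are ignored.
my $type2 = qr/
    NM_       # literal prefix shared by the Type 2 values
    .*        # anything, greedily, up to ...
    \.        # ... the right-most dot that still lets the rest match
    \d*       # zero or more digits after the dot
    [+*-]     # exactly one of the three sign characters
    (\d+)     # the digits we want to capture
/x;

for my $value ('NM_018328:exon12:c.3055-9T>C', 'NM_003042:c.*234C>A') {
    print "$value => $1\n" if $value =~ $type2;
}
```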

Let's test this on Line 1 and Line 5 data:

Code:

$
$ # Line 1
$ perl -le '$x = "NM_018328:exon12:c.3055-9T>C"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = 9
\d+ or $2 =
$
$ # Line 5
$ perl -le '$x = "NM_003042:c.*234C>A"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = 234
\d+ or $2 =
$
$

Because of the "NM_" at the beginning of the regex, we are guaranteed that it will not match Type 1 data.
But let's confirm that that is really the case:

Code:

$
$ # Line 2. This is Type 1 data. Regex is for Type 2. Must not match.
$ perl -le '$x = "27"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$ # Line 3. This is Type 1 data. Regex is for Type 2. Must not match.
$ perl -le '$x = "35"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$ # Line 4. This is Type 1 data. Regex is for Type 2. Must not match.
$ perl -le '$x = ">50"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$ # Other Type 1 data. Regex is for Type 2. Must not match.
$ perl -le '$x = "+50"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$

Let's also confirm that the regex for Type 1 data does not match Type 2 data!

Code:

$
$
$ perl -le '$x = "NM_018328:exon12:c.3055-9T>C"; if ($x =~ /^[>+-]?(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$
$ perl -le '$x = "NM_003042:c.*234C>A"; if ($x =~ /^[>+-]?(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$
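
Putting the two regexes together, one possible shape for the check is sketched below. I don't know how the rest of your script is laid out, so the sub name extract_number and the loop over sample values are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: return the digits of interest from one F[8] value,
# or undef when the value fits neither type.
sub extract_number {
    my ($value) = @_;
    return $1 if $value =~ /^[>+-]?(\d+)/;            # Type 1: 27, 35, >50
    return $1 if $value =~ /NM_.*\.\d*[+*-](\d+)/;    # Type 2: NM_...
    return;
}

for my $value ('27', '>50', 'NM_018328:exon12:c.3055-9T>C', 'NM_003042:c.*234C>A') {
    my $number = extract_number($value) // 'no match';
    print "$value => $number\n";
}
```

Each regex is tried on its own, so neither has to cover the other type's cases.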

Hope that helps.
If you are unable to incorporate the regexes in your script, do post the problem here.
