How to prevent incorrect string using reg expr in Java?

Hi All,

I need your input on how to reject (mask out) strings that do not match a regular expression pattern in Java. The pattern is still being refined. The code snippet below picks up all the correctly formatted lines, but at least one known incorrect line also gets through:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PropertyDetailTest {
    public static void main(String[] args) {
        String correctPropertyDetail = "Los Angeles 4 Rose St 7 br h $350,000 J M&C Bunker Hill";
        String incorrectPropertyDetail = "Los Angeles 4 Rose St S 7 br h $350,000 J M&C Bunker Hill";
        Pattern pattern1 = Pattern.compile("\\A[A-Z][a-z]*|[A-Z][a-z]* [A-Z][a-z]* [A-Z]?[0-9]{0,4}/?[0-9]{0,4}-?[0-9]{0,4}|[0-9]{0,4}[a-z] [A-Z][a-z]* [A-Z][a-z]* (?:St|Rd|Av|Sq|Cl|Pl|Cr|Dr|La) [0-9] br [hut] \\$([0-9]){0,3},([0-9]){0,3}|\\$([0-9]){0,3},([0-9]){0,3},([0-9]){0,3} ([A-Z][a-z]*){1,}\\Z");
        // pattern2: the street-type group needs its opening "(?:", otherwise compile() throws PatternSyntaxException
        Pattern pattern2 = Pattern.compile("\\A\\b[A-Z][a-z]*\\b|\\b[A-Z][a-z]* [A-Z][a-z]*\\b \\b[A-Z]?[0-9]{0,4}/?[0-9]{0,4}-?[0-9]{0,4}\\b|\\b[0-9]{0,4}[a-z]\\b \\b[A-Z][a-z]*\\b \\b[A-Z][a-z]*\\b \\b(?:St|Rd|Av|Sq|Cl|Pl|Cr|Dr|La)\\b \\b[0-9]\\b \\bbr\\b \\b[hut]\\b \\$([0-9]){0,3},([0-9]){0,3}|\\$([0-9]){0,3},([0-9]){0,3},([0-9]){0,3} ([A-Z][a-z]*){1,}\\Z");
        Pattern pattern3 = Pattern.compile("\\A(?:[A-Z][a-z]*|[A-Z][a-z]* [A-Z][a-z]*) (?:[A-Z]?[0-9]{0,4}/?[0-9]{0,4}-?[0-9]{0,4}|[0-9]{0,4}[a-z]) [A-Z][a-z]* [A-Z][a-z]* \\b(?:St|Rd|Av|Sq|Cl|Pl|Cr|Dr|La)\\b \\b[0-9]\\b br [hut] \\$([0-9]){0,3},([0-9]){0,3}|\\$([0-9]){0,3},([0-9]){0,3},([0-9]){0,3} ([A-Z][a-z]*){1,}\\Z");
        // Placeholders: swap in pattern1/pattern2 and the other test string as needed
        Pattern pattern = pattern3;
        String propertyDetail = incorrectPropertyDetail;
        Matcher matcher = pattern.matcher(propertyDetail);
        if (matcher.find()) {
            System.out.println("Property detail is " + propertyDetail);
        }
    }
}

The only difference between correctPropertyDetail and incorrectPropertyDetail is the extra 'S' after Rose St. A sample of a few hundred lines of data was picked up properly, but a few incorrect lines managed to slip through. Neither pattern1 nor pattern2 rejects incorrectPropertyDetail, although both appear to accept correct strings such as correctPropertyDetail. On the other hand, pattern3 successfully masks out incorrectPropertyDetail (good!), but it also stops many correct strings from being accepted.
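For illustration, here is a minimal sketch of one possible reason why pattern1 and pattern2 let the incorrect string through. In Java regular expressions `|` has the lowest precedence, so an ungrouped pattern such as `\A[A-Z][a-z]*|...` reads as "a capitalised word at the start, OR everything else", and `find()` is satisfied by the first word alone. The class name `AlternationDemo`, the helper `found` and the dummy alternative `XYZ` are made up for this example.

```java
import java.util.regex.Pattern;

public class AlternationDemo {
    // Hypothetical helper: true if the regex is found anywhere in the input.
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String incorrect = "Los Angeles 4 Rose St S 7 br h $350,000 J M&C Bunker Hill";

        // Ungrouped: parsed as (\A[A-Z][a-z]*) OR (XYZ\Z), so "Los" alone satisfies find().
        System.out.println(found("\\A[A-Z][a-z]*|XYZ\\Z", incorrect));            // true

        // Grouped: the alternation is confined to (?:...), so the rest is mandatory.
        System.out.println(found("\\A(?:[A-Z][a-z]*|XYZ) [0-9]+\\Z", incorrect)); // false
    }
}
```

If this is what is happening, the first alternative of pattern1 and pattern2 would accept any line that starts with a capitalised word, regardless of what follows.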

Note that it is the second sub-pattern of pattern3, (?:[A-Z]?[0-9]{0,4}/?[0-9]{0,4}-?[0-9]{0,4}|[0-9]{0,4}[a-z]), that appears to be responsible for rejecting incorrectPropertyDetail. However, it also breaks the regular expression, because good strings are no longer accepted either. Can you see what is wrong with it, or offer an alternative approach to achieving the same objective?
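As a starting point for discussion, below is a rough sketch of one alternative: group every alternation, make the second street-name word optional (pattern3 appears to require two words before the street type, which "Rose St" does not have), and use `matches()` so the whole line must fit. The field layout is an assumption inferred from the two sample strings only. The house-number part is simplified, and the text after the price is accepted as-is with `.+` because its format is not known. The names `PropertyDetailCheck`, `DETAIL` and `isValid` are hypothetical.

```java
import java.util.regex.Pattern;

public class PropertyDetailCheck {
    // Assumed layout: city, house number, street name, street type, bedrooms, type, price, rest.
    private static final Pattern DETAIL = Pattern.compile(
        "[A-Z][a-z]*(?: [A-Z][a-z]*)?"                    // city: one or two capitalised words
        + " [A-Z]?[0-9]{1,4}(?:[/-][0-9]{1,4})?[a-z]?"    // house number (simplified assumption)
        + " [A-Z][a-z]*(?: [A-Z][a-z]*)?"                 // street name: one or two words
        + " (?:St|Rd|Av|Sq|Cl|Pl|Cr|Dr|La)"               // street type
        + " [0-9] br [hut]"                               // bedrooms and property type
        + " \\$[0-9]{1,3}(?:,[0-9]{3}){1,2}"              // price such as $350,000 or $1,250,000
        + " .+");                                         // remainder, left unvalidated

    // matches() requires the entire line to fit the pattern, unlike find().
    public static boolean isValid(String line) {
        return DETAIL.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Los Angeles 4 Rose St 7 br h $350,000 J M&C Bunker Hill"));   // true
        System.out.println(isValid("Los Angeles 4 Rose St S 7 br h $350,000 J M&C Bunker Hill")); // false
    }
}
```

This has only been reasoned through against the two sample lines above, so it would need checking against the full data set before being relied on.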

Regular expressions are relatively new to me, and I could do with some advice.

Your assistance would be appreciated.


