Python with Regex and Excel


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Python with Regex and Excel
# 1  
Old 06-01-2017
Python with Regex and Excel

Hello

I have a big excel file for Ticket Data Analysis. The idea is to make meaningful insight from Resolution Field. Now as people write whatever they feel like while resolving the ticket it makes quite a task.

1. They may or may not tag it with something like below within the resolution field

Problem:
Analysis:
Resolution:

So I am suppose to pick the write ups after Resolution: tag and Put it in a new excel file with the ticket number.

Now the problem starts because people write their own tagging mechanism. Sometimes they Skip Problem and Analysis and just write Steps etc. It is very random and no common factors.

After going through lot of data we figured out the following tags which are most common.

So logic is

If we find the following tags :
pick after it
elif we find the following Tags:
pick after it.
The elseif condition can sometimes be true in if condition. So if statement gets presidence
else
pick the whole field


in the if statements the below are common words
Code:
regularexp="r'Action Performed\:(.*)|" \
           "ACTION PERFORMED\:(.*)|" \
           "Steps taken to resolve the issue\:(.*)|" \
           "steps taken to resolve issue\:(.*)|" \
           "Steps Taken to Resolved the issue\:(.*)|" \
           "Steps taken to resolve the issue \:(.*)|" \
           "Steps taken to resolve\:(.*)|" \
           "Steps\-(.*)|" \
           "steps \-(.*)|" \
           "steps\:(.*)|" \
           "Steps taken\:(.*)" \
           "Action taken\:(.*)|" \
           "Actions Taken\:(.*)|" \
           "Action\:(.*)|" \
           "Action \-(.*)|" \
           "Action taken \-(.*)|" \
           "Actions Taken\;(.*)|" \
           "Action Taken \:(.*)|" \
           "Action \:(.*)|" \
           "Action taken to resolve\:(.*)|" \
           "Resolution\-(.*)|" \
           "Resolution\:(.*)|" \
           "Resolution \-(.*)|" \
           "Action taken for resolution\:(.*)|" \
           "Solution \:(.*)|" \
           "analysis\:(.*)|" \
           "analysis \:(.*)|" \
           "Investigation\:(.*)" \
           "observed\/investigated \:(.*)'"

ELIF:
Now if the above is not found we need to check for the following. The above gets presidence if both are found

Code:
"Update\:(.*)|" \
           "Update \:(.*)|" \
           "UPDATE\-(.*)|" \
           "updates\:(.*)'"

Else:

Code:
Just re-write the whole statements found

I have written the following code without the elif block for now

Code:
# -*- coding: utf-8 -*-
"""
Created on Mon May 29 19:34:54 2017

@author: anirbaba
"""

from openpyxl import Workbook, load_workbook
import re
import xlsxwriter
workbook = xlsxwriter.Workbook('demo.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write('A1', 'IncidentID')
worksheet.write('B1', 'Resolution')


wb=load_workbook("D:\Backup\Drive_D\W0rk\Script\Python\RegularX\Output_file_high_effort_no_pks.xlsx", read_only=True)
sheet_ranges=wb['High_effort_without_burst']
regularexp="r'Action Performed\:(.*)|" \
           "ACTION PERFORMED\:(.*)|" \
           "Steps taken to resolve the issue\:(.*)|" \
           "steps taken to resolve issue\:(.*)|" \
           "Steps Taken to Resolved the issue\:(.*)|" \
           "Steps taken to resolve the issue \:(.*)|" \
           "Steps taken to resolve\:(.*)|" \
           "Steps\-(.*)|" \
           "steps \-(.*)|" \
           "steps\:(.*)|" \
           "Steps taken\:(.*)" \
           "Action taken\:(.*)|" \
           "Actions Taken\:(.*)|" \
           "Action\:(.*)|" \
           "Action \-(.*)|" \
           "Action taken \-(.*)|" \
           "Actions Taken\;(.*)|" \
           "Action Taken \:(.*)|" \
           "Action \:(.*)|" \
           "Action taken to resolve\:(.*)|" \
           "Resolution\-(.*)|" \
           "Resolution\:(.*)|" \
           "Resolution \-(.*)|" \
           "Action taken for resolution\:(.*)|" \
           "Solution \:(.*)|" \
           "analysis\:(.*)|" \
           "analysis \:(.*)|" \
           "Investigation\:(.*)" \
           "observed\/investigated \:(.*)'"
#           "Update\:(.*)|" \
#           "Update \:(.*)|" \
#           "UPDATE\-(.*)|" \
#           "updates\:(.*)'"
i=0           
for row in sheet_ranges.iter_rows(row_offset=1):
#    for i in range(0,50001):
    act_resolution=re.compile(regularexp, re.IGNORECASE)
    act_resolutiongroup=act_resolution.search(str(row[16].value))
    if act_resolutiongroup is not None:
        print(row[12].value,act_resolutiongroup.group())
        worksheet.write(i+1,0,row[12].value)
        worksheet.write(i+1,1,act_resolutiongroup.group())
        i+=1
    else:
        print(row[12].value,row[16].value)
        worksheet.write(i+1,0,row[12].value)
        worksheet.write(i+1,1,row[16].value)
        i+=1
workbook.close()
#    if act_resolutiongroup is None:
#        print(row[12].value)

1. I need help in shortening the Regular expression search for variable regularexp.
2. I have seen the keyword is there but still it goes into the Else loop and just write the whole statement instead of picking it.
3. Once the above is done need to run 2 Grams 3 Grams TFIDF algorithm non english non numeric (This is far fetched and not my immediate requirement)
Moderator's Comments:
Mod Comment Please use CODE (not ICODE) tags for full-line and multi-line sample input, output, and code segments.


---------- Post updated 05-31-17 at 10:29 PM ---------- Previous update was 05-30-17 at 10:54 PM ----------

Hello

Can someone shorten the below regular expression.
Code:
regularexp="r'Action Performed\:(.*)|" \
           "ACTION PERFORMED\:(.*)|" \
           "Steps taken to resolve the issue\:(.*)|" \
           "steps taken to resolve issue\:(.*)|" \
           "Steps Taken to Resolved the issue\:(.*)|" \
           "Steps taken to resolve the issue \:(.*)|" \
           "Steps taken to resolve\:(.*)|" \
           "Steps\-(.*)|" \
           "steps \-(.*)|" \
           "steps\:(.*)|" \
           "Steps taken\:(.*)" \
           "Action taken\:(.*)|" \
           "Actions Taken\:(.*)|" \
           "Action\:(.*)|" \
           "Action \-(.*)|" \
           "Action taken \-(.*)|" \
           "Actions Taken\;(.*)|" \
           "Action Taken \:(.*)|" \
           "Action \:(.*)|" \
           "Action taken to resolve\:(.*)|" \
           "Resolution\-(.*)|" \
           "Resolution\:(.*)|" \
           "Resolution \-(.*)|" \
           "Action taken for resolution\:(.*)|" \
           "Solution \:(.*)|" \
           "analysis\:(.*)|" \
           "analysis \:(.*)|" \
           "Investigation\:(.*)" \
           "observed\/investigated \:(.*)'"


Last edited by Don Cragun; 05-31-2017 at 01:29 AM..
# 2  
Old 06-01-2017
I do not have openpyxl module in my system, so I tried your regular expression with a simple text file with some dummy data that matches your regex pattern.

Note a couple of things:
1) Characters like ";" and ":" do not have special meaning in a regex, so they need not be escaped by backslash "\" character.
2) The hyphen "-" has special meaning only inside brackets. Otherwise, it need not be escaped.
3) You can reduce the regular expression, but it becomes unreadable very quickly. That's true for any regular expression. So you have to find a balance between readability and succinctness.

Code:
C:\data\>
C:\data\>type testdata.log
Line  1 : Action Performed:
Line  2 : ACTION PERFORMED:
Line  3 : Steps taken to resolve the issue:
Line  4 : steps taken to resolve issue:
Line  5 : Steps Taken to Resolved the issue:
Line  6 : Steps taken to resolve the issue :
Line  7 : Steps taken to resolve:
Line  8 : Steps-
Line  9 : steps -
Line 10 : steps:
Line 11 : Steps taken:
Line 12 : Action taken:
Line 13 : Actions Taken:
Line 14 : Action:
Line 15 : Action -
Line 16 : Action taken -
Line 17 : Actions Taken;
Line 18 : Action Taken :
Line 19 : Action :
Line 20 : Action taken to resolve:
Line 21 : Resolution-
Line 22 : Resolution:
Line 23 : Resolution -
Line 24 : Action taken for resolution:
Line 25 : Solution :
Line 26 : analysis:
Line 27 : analysis :
Line 28 : Investigation:
Line 29 : observed/investigated :

C:\data\>
C:\data\>type processdata.py
import sys
import re
regularexp = "Steps Taken to Resolve(d)*\s*(the)*\s*issue\s*:(.*)|" \
             "Steps taken to resolve:(.*)|" \
             "Steps taken:(.*)|" \
             "steps\s*[:-](.*)|" \
             "Action taken (for|to) resol(ve|ution):(.*)|" \
             "Action(s)* (Performed|Taken)\s*[;:-](.*)|" \
             "Action\s*[:-](.*)|" \
             "(Resolution|Solution|analysis|investigation|observed\/investigated)\s*[:-](.*)"
act_resolution = re.compile(regularexp, re.IGNORECASE)
datafile = sys.argv[1]
fh = open(datafile, 'r')
for line in fh:
    line = line.rstrip("\n")
    if act_resolution.search(line) is not None:
        print line
    else:
        print "[UNMATCHED]>> ", line
fh.close()
 
C:\data\>
C:\data\>python processdata.py testdata.log
Line  1 : Action Performed:
Line  2 : ACTION PERFORMED:
Line  3 : Steps taken to resolve the issue:
Line  4 : steps taken to resolve issue:
Line  5 : Steps Taken to Resolved the issue:
Line  6 : Steps taken to resolve the issue :
Line  7 : Steps taken to resolve:
Line  8 : Steps-
Line  9 : steps -
Line 10 : steps:
Line 11 : Steps taken:
Line 12 : Action taken:
Line 13 : Actions Taken:
Line 14 : Action:
Line 15 : Action -
Line 16 : Action taken -
Line 17 : Actions Taken;
Line 18 : Action Taken :
Line 19 : Action :
Line 20 : Action taken to resolve:
Line 21 : Resolution-
Line 22 : Resolution:
Line 23 : Resolution -
Line 24 : Action taken for resolution:
Line 25 : Solution :
Line 26 : analysis:
Line 27 : analysis :
Line 28 : Investigation:
Line 29 : observed/investigated :
  
C:\data\>
C:\data\>

If you try hard enough, you can reduce that regex to a single string.
But the maintainer of your script may not be very happy about it.

Last edited by durden_tyler; 06-01-2017 at 04:43 PM..
This User Gave Thanks to durden_tyler For This Post:
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Programming

Creating filters with Python on excel

Hello, I have an excel sheet with 11 tabs. I need to take data from the first tab and write the output to the second tab. The first tab looks like this, starting from Row 3 The filters that needs to be created are 1) keep anything greater than 'POS' 5 and less than 160 AND 2)... (2 Replies)
Discussion started by: nans
2 Replies

2. Shell Programming and Scripting

Convert excel to csv in python date not display exactly

Hi, Anyone can help I am just converting the excel file to csv using python, I can get the conversion output but the date not display exactly. test.xlsx date format 167 1588 risks/SCS JP CAMPANA & CIE.pdf SCS JP CAMPANA & CIE 2 1 1 0 2015-03-16 16:56:25 167 1146 risks/AirBNB... (1 Reply)
Discussion started by: fspalero
1 Replies

3. Programming

Python Regex List Creation

Here is a snippet of my code: blahblahblah... blah for link in goodies.soup.find_all('a'): blah.append(link.get('href')) blah=list(set(blah)) which gives my list of urls. So now I use a regex to search for the relevant urls which I want in a list: for r... (0 Replies)
Discussion started by: metallica1973
0 Replies

4. Shell Programming and Scripting

Python Regex Removing One Too Many...

Well, I'm a python noob and my last post here I was introduced to Regex. I thought this would be easy since I knew Regex with Bash. However, I've been banging my head a while to extract an ip address from ifconfig with this: #!/usr/bin/python import re import subprocess from subprocess... (5 Replies)
Discussion started by: Azrael
5 Replies

5. Programming

Python Reading Individual files and Regex through them

As a newbie to Python, I am trying to write a script in which is will add all the log files (*.log) from within a directory to a list, open the files and search for an ip using a regex and single it out (appending the ip's to the list). So far, I have: import re, os def list_files() content = ... (4 Replies)
Discussion started by: metallica1973
4 Replies

6. Shell Programming and Scripting

Perl script to Merge contents of 2 different excel files in a single excel file

All, I have an excel sheet Excel1.xls that has some entries. I have one more excel sheet Excel2.xls that has entries only in those cells which are blank in Excel1.xls These may be in different workbooks. They are totally independent made by 2 different users. I have placed them in a... (1 Reply)
Discussion started by: Anamika08
1 Replies

7. Shell Programming and Scripting

Python Newbie Question Regex

I starting teaching myself python and am stuck on trying to understand why I am not getting the output that I want. Long story short, I am using PDB for debugging and here my function in which I am having my issue: import re ... ... ... def find_all_flvs(url): soup =... (1 Reply)
Discussion started by: metallica1973
1 Replies

8. Shell Programming and Scripting

Converting specific Excel file tabs to CSV in Python

Hi list, This is probably something really simple, but I am not particularly familiar with Python so I thought I would ask as I know that python has an excel module. I have an excel document with multiple tabs of data and graphs. One of the tabs is just data which I require to have dumped to... (8 Replies)
Discussion started by: landossa
8 Replies

9. Shell Programming and Scripting

Python Regex

I have the below string and regex. However I cant understand why it works the way it does. IP has been changed for safety ;) String = NowSMS Error Report. Error initializing SMSC Interface 'SMPP - 10.15.8.10:17600'. Interface is not available. Regex = (.+\.)\s(.+) I get two... (1 Reply)
Discussion started by: barney34
1 Replies

10. Shell Programming and Scripting

PERL: Split Excel Workbook to Indiv Excel files

Hi, I am trying to find a way to read an excel work book with multiple worksheets. And write each worksheet into a new excel file using perl. My environment is Unix. For example: I have an excel workbook TEST.xls and it has Sheet1, Sheet2, Sheet3 worksheets. I would like to create... (2 Replies)
Discussion started by: sandeep78
2 Replies
Login or Register to Ask a Question