Python Script for keyword and Stemming

# 1  
Hello All,

I have a Python script that pulls keywords out of a data set. The data set contains 3 columns:
1. SysID 2. ID 3. Comment

The script extracts keywords from the Comment column only to a certain extent, and it displays only the keywords, not the other columns.

Can someone help me alter the script so that it trims the Comment column down to the precise keywords for each row, without dropping the other columns?

#!/usr/bin/env python2.7
import numpy as np
from collections import Counter
import csv

class Preprocess_data():

        def __init__(self, data, k_number_of_features=5000):
                self.k = k_number_of_features
                # The third column (index 2) holds the comments.
                self.words = zip(*data)[2]

        def get_word(self, data):
                # Return one flat list of lower-cased, punctuation-free words.
                punc1 = ("~`!@#$%^&*()_-+=[]{}\\|;:',<.>/?")
                punc2 = ('"')
                wordsbag = []
                words = zip(*data)[2]
                words = [item.lower().translate(None, punc1).translate(None, punc2) for item in words]
                self.words = [item.split() for item in words]
                # Flatten the per-comment word lists into one bag of words.
                for line in self.words:
                        wordsbag.extend(line)
                return wordsbag

        def count_attr(self, data):
                # Skip the 100 most common words, then keep the next k.
                c = Counter(self.get_word(data))
                feature = c.most_common(100+self.k)[100:100+self.k]
                return feature

        def summarize_feature(self, data):
                # Binary matrix: entry [i][j] is 1 if feature j occurs in comment i.
                feature = self.count_attr(data)
                words = self.words
                feature_value = np.zeros((len(data), len(feature)))
                for i in range(len(words)):
                        for j in range(len(feature)):
                                if (feature[j][0] in words[i]):
                                        feature_value[i][j] = 1
                                else:
                                        feature_value[i][j] = 0
                return feature_value

if __name__=='__main__':
        file = open('testfile', 'rU')
        data = list(csv.reader(file, delimiter='\t'))
        # k_number_of_features must be an integer; 5000 is the class default.
        preprocessed = Preprocess_data(data, k_number_of_features=5000)
        wordsbag = preprocessed.get_word(data)
        feature = preprocessed.count_attr(data)
        feature_value = preprocessed.summarize_feature(data)
        file.close()
        #-------print the selected feature words---------#
        for i in range(len(feature)):
                print 'WORD' + str(i+1), feature[i][0]

Sample Dataset

4819	810	The locker doors "Inside" were marked and not polished properly.
4885	1313	The seal around / on top of the flush panel is damaged.
4932	825	The clock facing the bag drop drive way is not set correctly / displays incorrect time.
5067	744	Gaps are visible between the interlock flooring tiles.
5027	737	The menu is damaged.
5067	748	The wall is seen blistered.
4845	825	The left side of the panel is fused.
4952	810	The terrace tiles are damaged.
5496	1044	tetst
5022	732	The service door is left open and construction equipment is left unattended.
5496	1044	test
5496	2009	test
4952	810	The terrace tiles are cracked /damaged.
5058	1110	The
5067	2022	The umbrella's  bases  of the restaurant are seen dusty and dirty.
5058	1110	The Interlock flooring is seen damaged and stained.
5058	1110	Gaps are visible between Interlock flooring.
5058	1110	Several toilet cubicles doors are seen chipped.
5489	824	tttt
5058	1110	The prayer timings electrical board has been removed during painting and never returned back and a mark is visible on the wall.
4771	693	The toilet cubicle skirtings are scratched.
5026	52	The terrace is damaged.
5027	737	The menu is damaged.
5026	743	The terrace is damaged.
4906	24	fgfgf
5059	829	The wall around the A/C grill is stained.
5059	829	The door stopper is missing and tile is damaged by door handle.
5059	829	The soap holder is missing.
5059	829	The douche tap fitting is loose.
5059	829	The corner of the wall is damaged and moldy.
5059	829	The ping pong table is damaged.
5059	829	The sign at the gate to pool area is faded.
5059	829	The protective net is not properly installed. The fitting is untidy.
5059	829	The pool loungers are stained.
5059	829	The corner of the wall is damaged and moldy.
5059	829	The corner of the wall is damaged and moldy.
5059	829	The corner of the wall is damaged and moldy.
5058	1117	The empty unit is seen not hoarded; window is dirty and dust is visible from the window.
5058	1110	The flooring arrows are faded and worn.
5490	1957	test
5022	732	There appears to be water damage on the dipped ceiling.
4825	833	The
5022	727	The information about where the stairs lead to is missing.
5022	732	The stairs walls are all blank. Information about what is at the top of the stairs needs to be added to those walls.
5022	732	The yellow exit sign painted on the wall is damaged above it and the paint is uneven and untidy.
5022	732	The yellow car park sign hanging from the ceiling is chipped at the lower left ledge.
5056	833	Ceiling access panels are still found missing.
5056	833	Main door is damaged on lower edge.
5022	732	There is yellow tape in a square shape left above the Tche Tche Cafe sign on the wall.
5056	833	Tiles panels are damaged.

Current output from the script:

WORD1 working
WORD2 correctly
WORD3 cover
WORD4 ac
WORD5 doors
WORD6 it
WORD7 full
WORD8 display
WORD9 parking
WORD10 heavily
WORD11 wooden
WORD12 for
WORD13 edges
WORD14 humidity
WORD15 cubicles
WORD16 fitted
WORD17 out
WORD18 room
WORD19 tree
WORD20 behind
WORD21 fence
WORD22 ok
WORD23 dusty
WORD24 cabinet
WORD25 along
WORD26 rusty
WORD27 overgrown
WORD28 as
WORD29 signs
WORD30 protruding
WORD31 painted
WORD32 fountain
WORD33 covered
WORD34 does
WORD35 dry
WORD36 availability
WORD37 lift
WORD38 operational
WORD39 severally
WORD40 poor
WORD41 found
WORD42 litter
WORD43 blistered

Expected result:

SysID	ID	Keywords
5067	2022	umbrella's, dusty, dirty
5058	1110	Interlock, damaged, stained
5058	1110	Gaps, flooring
5058	1110	toilet, doors, chipped

Thanks in advance; I hope someone can help.
# 2  
I foresee problems with the approach of excluding common words. "Damaged" is an important word, but also common in your data. "Not" is also common and kind of vital. And when your data changes, so will whatever words you exclude.
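
To illustrate the point about "damaged", here is a minimal Python 3 sketch (the script in post #1 targets Python 2.7). It counts words over a handful of comments copied from the sample dataset. The helper name `word_counts` and the choice of rows are mine, not part of the original script.

```python
from collections import Counter
import string

# A few comments copied verbatim from the sample dataset in post #1.
COMMENTS = [
    "The menu is damaged.",
    "The terrace tiles are damaged.",
    "The terrace is damaged.",
    "The wall is seen blistered.",
    "Tiles panels are damaged.",
]

def word_counts(comments):
    """Lower-case each comment, strip punctuation, and count every word."""
    table = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for comment in comments:
        counts.update(comment.lower().translate(table).split())
    return counts

counts = word_counts(COMMENTS)
# The two most frequent words in this subset.
top_two = [word for word, _ in counts.most_common(2)]
print(top_two)
print(counts["the"], counts["damaged"])
```

In this subset "the" and "damaged" occur equally often, so any rule that discards the most frequent words (as `most_common(100+self.k)[100:...]` does on the full data) throws away the one word that describes the defect.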

And how important a word is depends on context. No information is lost by deleting "left" from "door left open", but it is lost from "left door open".
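
A tiny Python 3 sketch of that problem, assuming a hypothetical exclusion list that contains only "left". The names `STOP_WORDS` and `strip_stop_words` are made up for illustration.

```python
# Hypothetical exclusion list, for illustration only.
STOP_WORDS = {"left"}

def strip_stop_words(text, stop_words=STOP_WORDS):
    """Return the words of `text` that are not in the exclusion list."""
    return [word for word in text.lower().split() if word not in stop_words]

a = strip_stop_words("door left open")   # "left" is filler here
b = strip_stop_words("left door open")   # "left" says which door
print(a)
print(b)
print(a == b)
```

Both phrases reduce to the same word list, so after filtering there is no way to tell which door was meant.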

You can build lists of exclusions and special words until the cows come home, and then one funny case will come along which blows it all out of the water. Add one more special case for that word and special case special cases for any odd but valid ways that word might be used. Rinse and repeat until you lose your mind or your code gains sentience.

I'm not sure true English language processing can be implemented in a tinkertoy.

Deleting common words like "the" and "is" is certainly doable.
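
As a rough sketch of that simpler approach, written for Python 3 rather than the 2.7 of the original script: filter a fixed stop-word list out of the Comment column and pass SysID and ID through untouched. The stop-word list, the function names `keywords` and `trim_rows`, and the two-row `SAMPLE` string are assumptions for illustration; the list deliberately leaves out "not". The output keeps more words than the "expected result" in post #1, because stop-word removal alone cannot decide which remaining words matter.

```python
import csv
import io
import string

# Hypothetical, deliberately short stop-word list; "not" is kept on purpose.
STOP_WORDS = {"the", "is", "are", "a", "an", "of", "and", "to", "on", "in"}

def keywords(comment, stop_words=STOP_WORDS):
    """Return the words of `comment` that are not stop words, in order."""
    kept = []
    for raw in comment.split():
        word = raw.strip(string.punctuation)  # drop leading/trailing punctuation
        if word and word.lower() not in stop_words:
            kept.append(word)
    return kept

def trim_rows(rows):
    """Yield (SysID, ID, keywords); the first two columns pass through as-is."""
    for row in rows:
        if len(row) < 3:
            continue  # skip rows that lack a comment column
        sys_id, row_id, comment = row[0], row[1], row[2]
        yield sys_id, row_id, ", ".join(keywords(comment))

# Two rows copied from the sample dataset, tab-separated like the input file.
SAMPLE = "5027\t737\tThe menu is damaged.\n5067\t748\tThe wall is seen blistered.\n"

# QUOTE_NONE because the comments may contain literal double quotes.
reader = csv.reader(io.StringIO(SAMPLE), delimiter="\t", quoting=csv.QUOTE_NONE)
result = list(trim_rows(reader))
for sys_id, row_id, kw in result:
    print(sys_id, row_id, kw, sep="\t")
```

To run it against the real file, replace `io.StringIO(SAMPLE)` with an open file handle for the tab-separated input.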

Last edited by Corona688; 11-16-2018 at 12:19 PM..