word filtering


 
# 1  
Old 10-08-2005

I have written a script (word-filter) that removes certain words, listed in a separate file (stopwords.txt), from a huge document set of 2000 docs (docset/*). The filtered docs are stored in a temp docset (docset.temp) under the same basename. I have a main script that calls word-filter once for each document.

The problem is that it takes a hella long time to filter even ~500 docs.
I'm worried that the nesting of the for loops (a loop over words inside a loop over docs) kills performance.
Any ideas for speeding up my code? Thanks a million! :)

main script code:
Code:
        for doc in "$DOCSET_ORIG"/*
        do
                BASENAME=`basename "$doc"`
                "$ADDONS_HOME"/word-filter "$doc" > "$DOCSET_TEMP/$BASENAME"
                echo "Filtered ... $BASENAME"
        done


word-filter script code:
Code:
DOC="$1"
STOPLIST="stopwords.txt"

# Scan through the document, a word at a time
    for WORD in `cat "$DOC"`
    do
        # Skip single-character words (such as a lone asterisk) and
        # option-like words, which would confuse grep.
        if [[ "${WORD}" == ? || "${WORD}" == "-"? || "${WORD}" == "--"*? ]]; then
              continue
        fi

        # Look-up current word in stop list, ignore upper/lower
        # case distinction during comparisons.

        /usr/xpg4/bin/grep -F -s -i -q "$WORD" "$STOPLIST"

        # If word is not in stoplist, write it out
        if [ $? = 1 ]; then
                echo -n -e "$WORD\t"
        fi
    done

# 2  
Old 10-08-2005
The main problem is running a separate grep process for each word. The key to speed is to avoid that somehow. How big is stopwords.txt? Maybe you can store it in memory somehow.
# 3  
Old 10-08-2005

It's just a list of ~20 words, one word on each line. So the grep kills it, huh?
I thought it would speed things up. So what replacement can you
suggest for grep? A cat of stopwords.txt and an if test? Thanks!
# 4  
Old 10-09-2005
How about something like:
Code:
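#! /usr/bin/ksh

# Read the stop words into one space-delimited string held in memory
exec < stopwords.txt
stoplist=" "
while read word ; do
        stoplist="${stoplist}${word} "
done

echo "stoplist = $stoplist"


# Filter the document line by line; no external commands in the loop
exec < x.doc

while read inline ; do
        outline=""
        for word in $inline ; do
                # Match whole words only: the spaces delimit list entries
                if [[ $stoplist != *" $word "* ]] ; then
                        outline="$outline $word"
                fi
        done
        echo "$outline"
done
exit 0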
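
A different route to the same goal (one process per document instead of one grep per word) is to let awk hold the stop list in an array. This is only a sketch and not something posted in this thread: the function name `filter_doc` and its argument order are made up for illustration. It compares words case-insensitively, as the original word-filter did with `grep -i`, and it assumes one stop word per line.

```shell
# filter_doc STOPFILE DOC: print DOC with the stop words removed.
# The function name and argument order are illustrative only.
filter_doc() {
    awk '
        # First file: remember each stop word, lowercased.
        NR == FNR { stop[tolower($1)] = 1; next }
        # Second file: keep only the words that are not in the stop list,
        # joining the survivors on each line with tabs.
        {
            out = ""
            for (i = 1; i <= NF; i++)
                if (!(tolower($i) in stop))
                    out = out (out == "" ? "" : "\t") $i
            print out
        }
    ' "$1" "$2"
}
```

In the main loop this could be called as `filter_doc stopwords.txt "$doc" > "$DOCSET_TEMP/$BASENAME"`. Note that it writes one output line per input line, whereas the original word-filter wrote every word on a single tab-separated line.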

# 5  
Old 10-09-2005
Thanks, I'll try it later. So 'x.doc' is the document that I wish to filter, right?
Thanks! :)
# 6  
Old 10-09-2005
Quote:
Originally Posted by mark_nsx
so 'x.doc' is the document that i wish to filter right?
right.
# 7  
Old 10-09-2005
How do I prevent the echoing of the whole stop word list to stdout? Thanks.
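
The stop list shows up on stdout because of the echo line that follows the first loop in the script from post #4. Deleting that line is the simplest fix. If the diagnostic is still wanted, a minimal sketch is to send it to stderr so it never mixes with the filtered text. The helper name `load_stoplist` is hypothetical and only wraps the same loop.

```shell
# Hypothetical helper: builds the space-delimited stop list from post #4.
# $1 is the stop word file, one word per line.
load_stoplist() {
    stoplist=" "
    while read -r word ; do
        stoplist="${stoplist}${word} "
    done < "$1"
    # Diagnostic goes to stderr (fd 2), so redirecting stdout to the
    # filtered document does not capture it.
    echo "stoplist = $stoplist" >&2
}
```

Running the script with `2>/dev/null` then hides the message entirely.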
 