search from a list of words


 
# 1  
Old 07-31-2011

Hello,
I'm trying to write a bash script that will search for words from one list that may be found in another list. Once the record is found, it will create a new text file for each word.

For example, list1.txt contains the following:
Code:
Dog
Cat
Fish

List2.txt contains
Code:
Dog - Buddy 14
Charlie - Rhino
Bird - Steph 32
Ralph - Dog
Cat - John
Mike - Fish

Since Dog and Cat are found in both files, two files will be created. The first file (Dog) will be a .txt file containing
Code:
Dog - Buddy 14
Ralph - Dog

The second file will be called Cat.txt and will have
Code:
Cat - John

Here's what I have so far. I'm stuck and I'm not quite sure how to proceed
Code:
#!/bin/bash
for $i in list1.txt; do
grep -wi '$i' list2.txt >> $i.txt
done

I'm dealing with VERY large files where list1.txt contains 213 entries while list2.txt contains 12,000 entries. I think I'm on the right track, but my method seems like it would also take a VERY long time since the FOR LOOP runs grep once for each entry (yikes!)

Any help would be greatly appreciated.

Last edited by radoulov; 07-31-2011 at 05:50 PM.. Reason: Code tags!
# 2  
Old 07-31-2011
Yes, making a pass across your 12,000 record data file for each entry in the list isn't very efficient. First thing I'll point out is that your for loop will not be listing the contents of the list, but the file name. You'd need something like this:

Code:
#!/bin/bash
while IFS= read -r i
do
    # double quotes, not single, so that $i is expanded
    grep -wi "$i" list2.txt >> "$i.txt"
done <list1.txt

This reads the contents of list1.txt placing each line into the variable i. Still not efficient, but I wanted to point out the problem with your code.
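
To make the difference concrete, here is a small sketch that records what each kind of loop actually iterates over. The temporary directory and the sample words are assumptions for illustration only.

```shell
#!/bin/bash
# Sketch only: the sample file below is made up for illustration.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf 'Dog\nCat\nFish\n' > list1.txt

# "for" iterates over the words written after "in", so the body
# runs once with the file NAME, not with the file contents.
for_seen=""
for i in list1.txt; do
    for_seen="$for_seen[$i]"
done

# "while read" with input redirection runs once per LINE of the file.
while_seen=""
while IFS= read -r i; do
    while_seen="$while_seen[$i]"
done < list1.txt

echo "for loop saw:   $for_seen"     # [list1.txt]
echo "while loop saw: $while_seen"   # [Dog][Cat][Fish]
```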

Using awk, you can make one pass across each file. This is far more efficient in terms of the number of I/O operations, though not as efficient as writing a programme to do the same thing in C.

Code:
#!/usr/bin/env ksh

# assume list1 list2 are placed on the command line
awk -v list="$1" '
    BEGIN {
        while( (getline < list) > 0 )   # load all target words from first list
            targets[$1] = 1;
        close( list );
    }

    {
        for( i = 1; i <= NF; i++ )  # examine each token to see if it is a target
        {
            if( targets[$(i)] )   # if this token in the input is in the target list, save the line
            {
                out = $(i) ".txt";    # build the name first; an unparenthesised expression after >> is ambiguous
                printf( "%s\n", $0 ) >> out;
                close( out );    # prevent problems if process limit for number of open files is small
                break;      # remove if line can have multiple targets
            }
            else
              delete  targets[$(i)];    # prevent an entry for every word 
        }
    }
' "$2"
exit

You could make this more efficient by tracking the most recently used files, allowing awk to keep some number (say 100) open and closing the rest. The programme would then execute far fewer opens/closes on the output files. You'd probably not have any issue keeping all 213 of them open, but if your target list grows, or your system has smallish quotas on open files, you could have issues, which is why I suggested closing the file after each write.

Another, and easier, way would be to write a single intermediate file with lines of the form <filename> <text>. Once the initial processing is finished, the intermediate file could be sorted and a single pass made to write each separate file. This has the advantage of opening/closing each output file just once and thus avoids the efficiency problems in my example above.
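
A minimal sketch of the intermediate-file idea, under some assumptions: the file name tagged.tmp, the tab separator, and the sample data are all made up for illustration, and `sort -s` (stable sort, available in GNU and BSD sort but not required by POSIX) is used to keep matching lines in their original order.

```shell
#!/bin/bash
# Sketch only: sample data and the name "tagged.tmp" are assumptions.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'Dog\nCat\nFish\n' > list1.txt
printf '%s\n' 'Dog - Buddy 14' 'Charlie - Rhino' 'Bird - Steph 32' \
    'Ralph - Dog' 'Cat - John' 'Mike - Fish' > list2.txt

# Pass 1: tag each matching line with its target word: <word><TAB><line>.
awk -v list=list1.txt '
    BEGIN {
        while ((getline line < list) > 0)
            targets[line] = 1
        close(list)
    }
    {
        for (i = 1; i <= NF; i++)
            if ($i in targets) {
                print $i "\t" $0
                break
            }
    }
' list2.txt > tagged.tmp

# Pass 2: sort so that all lines for one word are adjacent, then write
# each output file in one run: opened once, closed when the word changes.
sort -s -t $'\t' -k1,1 tagged.tmp |
awk -F '\t' '
    $1 != prev {
        if (out != "") close(out)
        prev = $1
        out = $1 ".txt"
    }
    { print substr($0, length($1) + 2) > out }
'
rm -f tagged.tmp
```

With the sample data this leaves Dog.txt, Cat.txt and Fish.txt in the working directory, each written in a single open/close cycle.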

The need for the delete stems from the way awk creates an entry in the hash when the test is made (when targets[foo] does not exist). Without the delete, the hash will eventually contain an entry for every word in the list2.txt file rather than just the ones from the first list. These extra entries all have the value 0, so the programme works, but the memory usage is unnecessarily large. The delete statement prevents awk from keeping entries in the target hash that have a zero value, but it adds to the execution time.
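
As an alternative that is not part of the script above, awk's `in` operator tests whether a key exists without creating it. The sketch below, using made-up sample words, counts the array entries after each kind of test to show the difference.

```shell
#!/bin/bash
# Sketch only: the words used here are sample data.
result=$(awk 'BEGIN {
    targets["Dog"] = 1

    # A plain subscript reference creates the missing element.
    if (targets["Bird"]) print "unexpected match"
    n = 0; for (k in targets) n++
    print "after subscript test: " n " entries"

    delete targets["Bird"]

    # The "in" operator only tests membership and creates nothing.
    if ("Rhino" in targets) print "unexpected match"
    n = 0; for (k in targets) n++
    print "after in test: " n " entries"
}')
echo "$result"
```

If the test in the main loop were written as `if( $(i) in targets )`, the else branch with the delete should no longer be needed.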

Last edited by agama; 07-31-2011 at 05:03 PM.. Reason: additional thought about output
# 3  
Old 07-31-2011
Thanks for the reply. I understand that my method is inefficient, but I was wondering why the following won't work. Do I have a syntax error somewhere? When I run the following code, I get the error "syntax error near unexpected token 'done'".

Code:
#!/bin/bash
while read word; do
grep -w "$word" list2.txt
done < list1.txt >> "$word".txt
cat "$word".txt

When I run the command
Code:
grep -w SAMPLE_TEXT list2.txt

it gives me the desired output.

Last edited by jl487; 07-31-2011 at 07:11 PM.. Reason: additional info
# 4  
Old 07-31-2011
You're on the right track. The redirection to $word.txt needs to happen inside of the loop. Yes, you can redirect the output of the loop to a file, but that output file is opened once by the shell at the start of the loop. When the loop starts $word is empty and thus you're getting a syntax error (nothing after >>). This is the small change that will get you going:

Code:
#!/bin/bash
while read word; do
grep -w "$word" list2.txt >> "$word".txt
done < list1.txt

Further, your cat command will only have the last word from list1 to work on unless you put it into a loop too:
Code:
while read word
do
    echo "===== $word.txt ======="
    cat "$word".txt 
done <list1.txt
