search of common words in set of files Post: 302321240

Forum: Shell Programming and Scripting. Post 302321240 by mala on Sunday 31st of May 2009 09:36:41 AM

Hi,

I have a set of simple, single-column text files (thousands of them).
file1:
a
b
c
d
file2:
b
c
d
e
and so on. Another file holds a collection of word sets, one set per row:
b d
b c d e
I have to find out whether the set of words in each row is present (all of its words occur) or absent in each of the given files. The output should be a matrix (file * set), one row per file and one column per set, like:
1 0
1 1
I have the following shell script, which works, but it becomes very slow as the number of files and the number of sets grow. Any suggestion for a more efficient way to check the words would be much appreciated.

My code segment:

Code:

#!/bin/sh
# Start with an empty result matrix.
rm -f feat.txt
touch feat.txt

# Split each row of the set file s.txt into its own file
# (l1.txt, l2.txt, ...) holding one unique word per line.
labels=1
while read -r myline
  do
   echo "$myline" | awk '{for (i = 1; i <= NF; i++) print $i}' | sort -u > "l$labels.txt"
   labels=`expr $labels + 1`
  done < s.txt

# Number of sets (rows in s.txt).
c=`wc -l s.txt | awk '{print $1}'`
a=1
while [ "$a" -le "$c" ]
  do
  rm -f a.txt
  touch a.txt
  # For every data file, list the words of set $a that are missing from it.
  # An empty line in a.txt means the whole set is present in that file.
  fileno=1
  while [ "$fileno" -le 1000 ]
   do
     fgrep -xvf "$fileno.txt" "l$a.txt" | awk '{printf "%s ", $1}' >> a.txt
     echo >> a.txt
     fileno=`expr $fileno + 1`
   done
  # Empty line -> 1 (set present), non-empty line -> 0 (set absent).
  awk '{if ($0 == "") print "1"; else print "0"}' a.txt > c.txt
  # Put the new column in front of the columns collected so far.
  paste c.txt feat.txt > feat1.txt
  cat feat1.txt > feat.txt
  a=`expr $a + 1`
 done
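
One possible way to cut the cost, sketched here under assumptions rather than tested on the real data: instead of starting fgrep and awk once per file and per set, a single awk process can read s.txt first, then load each data file into an array and test every set against it. The function name presence_matrix is made up for this sketch. It assumes s.txt is not empty, that every data file has one word per line, and that no data file is empty (an empty file would produce no row).

```shell
#!/bin/sh
# Sketch: build the file-by-set presence matrix in one awk process.
# Usage: presence_matrix SETFILE DATAFILE...
presence_matrix() {
    sets=$1
    shift
    awk '
        # Print one matrix row for the data file that was just read,
        # then forget its words.
        function flush_row(   i, j, n, w, ok, row, k) {
            row = ""
            for (i = 1; i <= nsets; i++) {
                n = split(set[i], w, " ")
                ok = 1
                for (j = 1; j <= n; j++)
                    if (!(w[j] in seen)) { ok = 0; break }
                row = row (i > 1 ? " " : "") ok
            }
            print row
            for (k in seen) delete seen[k]
        }
        # First input file: remember every set line.
        NR == FNR { set[++nsets] = $0; next }
        # A new data file starts: emit the row for the previous one.
        FNR == 1 && started { flush_row() }
        # Record the word on this line of the current data file.
        { started = 1; seen[$1] = 1 }
        # Emit the row for the last data file.
        END { if (started) flush_row() }
    ' "$sets" "$@"
}
```

With the files from the example it would be called as `presence_matrix s.txt 1.txt 2.txt > feat.txt`. Each data file is read exactly once, and the columns come out in the same order as the rows of s.txt.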

Thank you in advance

Last edited by vidyadhar85; 05-31-2009 at 10:45 AM.. Reason: code tag added

mala

LEARN ABOUT ULTRIX

lookbib

lookbib(1)						      General Commands Manual							lookbib(1)

Name
       indxbib, lookbib - build inverted index for a bibliography, lookup bibliographic references

Syntax
       indxbib database...
       lookbib database

Description
       The indxbib command makes an inverted index to the named databases (or files) for use by lookbib and refer.  These files contain
       bibliographic references (or other kinds of information) separated by blank lines.

       A bibliographic reference is a set of lines, constituting fields of bibliographic information.  Each field starts on a line beginning  with
       a  ``%'',  followed  by	a key-letter, then a blank, and finally the contents of the field, which may continue until the next line starting
       with ``%''.
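
The record layout described above can be read with a few lines of awk. The following is only an illustrative sketch, not part of indxbib or lookbib: the helper name fields_with_key is hypothetical, and it assumes the key letter is followed by exactly one blank, as the description states.

```shell
#!/bin/sh
# Hypothetical helper: print the contents of every field that has the
# given key letter, including continuation lines.
# Usage: fields_with_key LETTER DATABASE
fields_with_key() {
    key=$1
    file=$2
    awk -v key="$key" '
        # A blank line ends the current record (and the current field).
        /^$/ { infield = 0; next }
        # A line starting with % opens a new field; keep it only if the
        # key letter matches, and print the text after "%X ".
        /^%/ {
            infield = (substr($0, 2, 2) == (key " "))
            if (infield) print substr($0, 4)
            next
        }
        # Any other line continues the field opened before it.
        infield { print }
    ' "$file"
}
```

For instance, `fields_with_key A database` would list the contents of all %A fields in the file named database.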

       The indxbib command is a shell script that calls mkey and inv.  The first program, mkey, truncates words to 6 characters, and maps upper
       case to lower case.  It also discards words shorter than 3 characters, words among the 100 most common English words, and numbers
       (dates) < 1900 or > 2000.  These parameters can be changed.  The second program, inv, creates an entry file (.ia), a posting file (.ib),
       and a tag file (.ic), all in the working directory.
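
The key rules in the paragraph above can be approximated with standard tools. This is a rough sketch only, not the actual key-making program: the function name make_keys is hypothetical, and it covers just the lower-casing, the 3-character minimum and the 6-character truncation. It does not skip common English words or filter numbers.

```shell
#!/bin/sh
# Hypothetical helper: turn text on standard input into a sorted list
# of unique keys, following part of the rules described above.
make_keys() {
    tr -cs '[:alnum:]' '[\n*]' |      # split the text into one word per line
    tr '[:upper:]' '[:lower:]' |      # map upper case to lower case
    awk 'length($0) >= 3 { print substr($0, 1, 6) }' |  # drop short words, keep 6 characters
    sort -u                           # one copy of each key
}
```

It reads from standard input, for example `make_keys < database`.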

       The lookbib command uses an inverted index made by indxbib to find sets of bibliographic references.  It reads keywords typed after the
       ``>'' prompt on the terminal, and retrieves records containing all these keywords.  If nothing matches, nothing is returned except
       another ``>'' prompt.

       It is possible to search multiple databases, as long as they have a common index made by indxbib.  In that case, only the first argument
       given to indxbib is specified to lookbib.

       If lookbib does not find the index files (the .i[abc] files), it looks for a reference file with the same name as the argument, without
       the suffixes.  It creates a file with a '.ig' suffix, suitable for use with fgrep.  It then uses this fgrep file to find references.
       This method is simpler to use, but the .ig file is slower to use than the .i[abc] files, and does not allow the use of multiple
       reference files.

Files
       x.ia, x.ib, x.ic, where x is the first argument, or if these are not present, then x.ig, x

See Also
       addbib(1), lookbib(1), refer(1), roffbib(1), sortbib(1)

																	lookbib(1)