I have a set of simple, single-column text files (thousands of them).
file1:
a
b
c
d
file2:
b
c
d
e
and so on. Another file holds a collection of word sets, one set per row:
b d
b c d e
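I have to find out whether each set of words (each row) is present or absent in each of the given files. A set counts as present only if all of its words occur in the file. The output should be a matrix (files x sets), with 1 for present and 0 for absent: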
1 0
1 1
I have the following shell script, which works, but its run time becomes very high as the number of files and sets grows. Any suggestions for a faster way to check for the words would be much appreciated.
My code segment:
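#!/bin/sh
# Start from empty result and scratch files
rm -f feat.txt
touch feat.txt
rm -f temp.txt
touch temp.txt
# Read the rows of the set file s.txt and put each one into a separate file
# (l1.txt, l2.txt, ...), one word per line, sorted and without duplicates
lables=1
while read myline
do
    echo $myline > temp.txt
    awk '{for (i=1;i<=NF;i++) print $i}' temp.txt | sort | uniq > l$lables.txt
    lables=`expr $lables + 1`
done < s.txt
# p is the number of sets; c works out to p, so the loop below visits every set
p=`wc -l s.txt | awk '{print $1}'`
q=`expr $p - 1`
a=1
c=`expr $a + $q`
while [ $a -le $c ]
do
    rm -f a.txt
    touch a.txt
    fileno=1
    # For each data file, list the words of set $a that are missing from it
    while [ $fileno -le 1000 ]
    do
        fgrep -xvf $fileno.txt l$a.txt | awk '{printf "%s ", $1}' >> a.txt
        echo >> a.txt
        fileno=`expr $fileno + 1`
    done
    # An empty line means no word was missing: print 1, otherwise 0
    awk '{if ($0 != "") print "0"; else print "1"}' a.txt > c.txt
    # The new column is pasted in front, so columns end up in reverse set order
    paste c.txt feat.txt > feat1.txt
    cat feat1.txt > feat.txt
    a=`expr $a + 1`
done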
Thank you in advance
Last edited by vidyadhar85; 05-31-2009 at 10:45 AM..
Reason: code tag added
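One direction that might cut the cost, offered only as a rough sketch under stated assumptions and not as a drop-in replacement: read the set file once, then stream every data file through a single awk process, so no grep, paste or temporary file is started per file and per set. The function name presence_matrix and all awk variable names below are made up for this illustration. It assumes the set file is not empty, every data file holds one word per line, and no data file is empty (an empty file would produce no row).

```shell
#!/bin/sh
# presence_matrix SETFILE DATAFILE...
# Hypothetical helper: prints one row per data file and one column per set,
# 1 if every word of the set occurs in the file, 0 otherwise.
presence_matrix() {
    awk '
    # First argument (the set file): remember the words of every set.
    NR == FNR {
        nsets = FNR
        nwords[FNR] = NF
        for (i = 1; i <= NF; i++) setword[FNR, i] = $i
        next
    }
    # A new data file starts: print the row for the previous one.
    FNR == 1 && started { report() }
    # Record every word of the current data file.
    { started = 1; seen[$1] = 1 }
    # Print the row for the last data file.
    END { if (started) report() }

    function report(   s, i, ok, row) {
        row = ""
        for (s = 1; s <= nsets; s++) {
            ok = 1
            for (i = 1; i <= nwords[s]; i++)
                if (!(setword[s, i] in seen)) { ok = 0; break }
            row = row (s > 1 ? " " : "") ok
        }
        print row
        split("", seen)   # empty the word table before the next file
    }
    ' "$@"
}
```

With the sample data above it would be called as presence_matrix s.txt file1 file2. For files named 1.txt to 1000.txt the names would have to be passed in numeric order (a shell glob sorts them as text), and for very large numbers of files the argument list may need to be split into batches.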