Counting number of files that contain words stored in another file
Hi All,
I have written a script on this but it does not do the requisite job. My requirement is this:
1. I have two kinds of files with different extensions: one set of *.dat files (6000 unique DAT files, all in one directory) and another set of *.dic files (6000 unique DIC files, all in the same directory as the DAT files).
2. The files contain only words, one per line. For example:
1.dat contains something like this:
Code:
computer
red
apple
orange
1.dic looks like this:
Code:
computer
apple
red
blue
3. For every DAT file there is a corresponding DIC file: 1.dat and 1.dic, 2.dat and 2.dic, ... 6000.dat and 6000.dic.
4. What I want to do is read every word from each DIC file, search all the DAT files, count the number of DAT files that contain that word, and store the result in a FIL file. A DAT file must be counted only once even if the word appears in it several times. For example:
1.dic contains 10 words. I read every word from 1.dic line by line and find how many DAT files contain that word. Then I write the results (i.e. the count values), one per line, to 1.fil. Similarly, I read every word in 2.dic line by line, search for it in all DAT files and write the count values to 2.fil. My 2.fil should look something like this:
Code:
2
3
1
3
i.e. the word on the first line of 2.dic appears in 2 of the DAT files (each DAT file is counted only once, even if it contains that word several times). The same has to be done for all 6000 DIC files.
What I have done so far:
Code:
for DAT in *.dat
do
for DIC in *.dic
do
while read word
CNT=$(basename "$DAT" .dat).fil
DIC=$(basename "$DAT" .dat).dic
grep -il "$word" | find . | wc -l $DIC $DAT > $FIL
done
done
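One possible way to get the counts described above, offered as a sketch under assumptions rather than a drop-in fix: build a single table of "how many DAT files contain each word" first, then look every DIC word up in that table. The function name count_dat_files is made up for illustration. It assumes one word per line with no embedded whitespace, a whole-word (not substring) match, and case-insensitive comparison as suggested by the grep -i in the attempt above.

```shell
#!/bin/sh
# Sketch: for each N.dic write N.fil holding, line by line, the number of
# DAT files that contain the word on the same line of N.dic.
# Assumptions: one word per line, whole-word match, case-insensitive.

count_dat_files() {
    dir=${1:-.}
    freq=$(mktemp) || return 1

    # Pass 1: emit each DAT file's distinct words once, then count how many
    # files emitted each word ("uniq -c" prints "<count> <word>").
    for dat in "$dir"/*.dat; do
        [ -f "$dat" ] || continue
        awk 'NF { print tolower($1) }' "$dat" | LC_ALL=C sort -u
    done | LC_ALL=C sort | uniq -c > "$freq"

    # Pass 2: look up every DIC word in the table; absent words count as 0.
    for dic in "$dir"/*.dic; do
        [ -f "$dic" ] || continue
        awk -v freq="$freq" '
            BEGIN {
                # Load the "<count> <word>" table into df[word] = count.
                while ((getline line < freq) > 0) {
                    split(line, part, " ")
                    df[part[2]] = part[1]
                }
            }
            { w = tolower($1); print ((w in df) ? df[w] : 0) }
        ' "$dic" > "${dic%.dic}.fil"
    done

    rm -f "$freq"
}
```

Calling count_dat_files with the directory that holds the files (or with no argument for the current directory) writes one FIL file next to each DIC file. Each DAT file is read once, instead of being searched again for every word of every DIC file.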
Discussion started by: jmarx
1 Replies
dat.conf
DAT.CONF(5)                                                        DAT.CONF(5)
NAME
dat.conf - configuration file for static registration of user-level DAT rdma providers
DESCRIPTION
The DAT (direct access transport) architecture supports the use of multiple DAT providers within a single consumer application. Consumers
implicitly select a provider using the Interface Adapter name parameter passed to dat_ia_open().
The subsystem that maps Interface Adapter names to provider implementations is known as the DAT registry. When a consumer calls
dat_ia_open(), the appropriate provider is found and notified of the consumer's request to access the IA. After this point, all DAT API
calls acting on DAT objects are automatically directed to the appropriate provider entry points.
A persistent, administratively configurable database is used to store mappings from IA names to provider information. This provider
information includes: the file system path to the provider library object, version information, and thread safety information. The location and
format of the registry is platform dependent. This database is known as the Static Registry (SR) and is provided via entries in the
dat.conf file. The process of adding a provider entry is termed Static Registration.
Registry File Format
* All characters after # on a line are ignored (comments).
* Lines on which there are no characters other than whitespace
and comments are considered blank lines and are ignored.
* Non-blank lines must have seven whitespace separated fields.
These fields may contain whitespace if the field is quoted
with double quotes. Within fields quoted with double quotes,
the backslash and the quote are the valid escape sequences.
* Each non-blank line will contain the following fields:
- The IA Name.
- The API version of the library:
[k|u]major.minor where "major" and "minor" are both integers
in decimal format. User-level examples: "u1.2", and "u2.0".
- Whether the library is thread-safe: [threadsafe|nonthreadsafe]
- Whether this is the default section: [default|nondefault]
- The library image, version included, to be loaded.
- The vendor id and version of DAPL provider: id.major.minor
- ia params, IA specific parameters - device name and port
- platform params, (not used)
OpenFabrics RDMA providers:
Provider options for both 1.2 and 2.0, each using different CM services
1. cma - OpenFabrics rdma_cm - uses rdma_cm services for connections
- requires IPoIB and SA/SM services for IB
- netdev used for device name, without port designation (ia_params)
- Supports any transport rdma_cm supports including IB, iWARP, RoCEE
- libdaplcma.so (1.2), libdaplofa (2.0)
2. scm - uDAPL socket based CM - exchanges CM information over sockets
- eliminates the need for rdma_cm, IPoIB, and SA for IB
- verbs device used for device name with port designation (ia_param)
- Supports IB, RoCEE. Doesn't support iWARP
- libdaplscm.so (1.2), libdaploscm (2.0)
3. ucm - uDAPL unreliable IB CM - exchanges CM information via IB UD QP's
- eliminates the need for sockets or rdma_cm
- verbs device used for device name with port designation (ia_param)
- Supports IB only, no name service.
- libdaplucm.so (1.2), libdaploucm (2.0)
Example entries for each OpenFabrics provider
1. cma - OpenFabrics rdma_cm (v1.2 and v2.0 examples)
OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-cma-roe-eth2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth2 0" ""
ofa-v2-cma-roe-eth3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth3 0" ""
NOTE: The OpenFabrics CMA providers use <ia_params> to specify the device with one of the following:
network address, network hostname, or netdev name; along with port number.
2. scm - uDAPL socket based CM (v1.2 and v2.0 examples)
OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "mlx4_0 1" ""
OpenIB-ipath0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 "ipath0 1" ""
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-mlx4_1-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_1 1" ""
ofa-v2-ehca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "ehca0 1" ""
ofa-v2-scm-roe-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
3. ucm - uDAPL unreliable IB CM (not supported in 1.2, v2.0 examples)
ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ipath0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "ipath0 1" ""
ofa-v2-ehca0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "ehca0 1" ""
Note: OpenIB- and ofa-v2- IA names are unique mappings, reserved for OpenFabrics providers.
The default location for this configuration file is /etc/dat.conf.
The file location may be overridden with the environment variable DAT_OVERRIDE=/your_own_directory/your_dat.conf.
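As an illustration of the file location rule and the field layout described above, the following sketch lists the IA name, API version and library image of every entry in whichever dat.conf is in effect. The function name list_dat_providers is hypothetical, and the sketch assumes that the first five fields are unquoted, as they are in the example entries; it does not parse quoted fields.

```shell
#!/bin/sh
# Sketch: show IA name, API version and library image for each registry entry.
# Assumption: fields 1-5 contain no quoted whitespace (true of the examples).

list_dat_providers() {
    # DAT_OVERRIDE, when set, replaces the default /etc/dat.conf location.
    conf=${DAT_OVERRIDE:-/etc/dat.conf}
    [ -r "$conf" ] || { echo "cannot read $conf" >&2; return 1; }

    # Drop comments (everything after '#'), skip blank lines, then print
    # field 1 (IA name), field 2 (API version) and field 5 (library image).
    sed 's/#.*//' "$conf" | awk 'NF { print $1, $2, $5 }'
}
```

Running it with DAT_OVERRIDE pointing at a private copy of the file is a simple way to check which entries a consumer application would see.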
SEE ALSO
rdma_cm, verbs, socket
25 March 2008 DAT.CONF(5)