Finding compound words from a set of files from another set of files

# 1  
Old 08-26-2011

Hi All,

I am completely stuck here.

I have a set of files (with names A.txt, B.txt until L.txt) which contain words like these:

random access memory
computer networking

All the files from A.txt to L.txt have the same format, i.e. one word or compound word per line.

I have another set of text files (named 1.dat, 2.dat and so on up to n.dat, where n is an integer) which contain complete texts like:

use descriptive titles when posting for example do not post questions with subjects like help me urgent or doubt post subjects like execution problems with cron or help with backup shell script.

As you can see in the complete texts:
1. All are in lowercase.
2. No punctuation, no full-stops. Only alphabetical texts.


What I want to do is:
1. Open 1.dat.
2. Open A.txt, B.txt, C.txt and so on until L.txt.
3. Find all the words from A.txt (both single words and compound words like random access memory) that occur in 1.dat, and note each word's position in A.txt, i.e. its line number.
4. Then open B.txt, find its words in 1.dat and note their line numbers in the same way, and continue until L.txt.
5. Then create 1.num and write to it the average of the line numbers of the matching words from all the TXT files.
6. Then open 2.dat, go through A.txt to L.txt again and create 2.num, and keep going until all DAT files have been read and all corresponding NUM files have been created. In the end I'll have as many NUM files as DAT files, each containing a single number: the average of the line numbers.
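
The per-file part of steps 3 to 5 can be sketched in C roughly as follows. This is only an illustration under assumptions: the function name average_matching_line is hypothetical, the text of the DAT file and the lines of one TXT file are assumed to be in memory already, and matching is a plain substring search.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the thread): returns the mean of the
 * 1-based line numbers of the entries in `phrases` that occur in `text`.
 * phrases[i] stands for line i + 1 of one TXT file.
 * Returns 0.0 when nothing matches.
 * Caveat: strstr() is a plain substring search, so "rom" also matches
 * inside "from".
 */
double average_matching_line(const char *text,
                             const char *const *phrases, size_t count)
{
    assert(text != NULL && phrases != NULL);

    long sum = 0;      /* sum of the matching line numbers */
    long matches = 0;  /* number of phrases found in the text */

    for (size_t i = 0; i < count; i++) {
        /* Skip empty lines: strstr() finds "" in every string. */
        if (phrases[i][0] != '\0' && strstr(text, phrases[i]) != NULL) {
            sum += (long)(i + 1);
            matches++;
        }
    }
    return matches > 0 ? (double)sum / (double)matches : 0.0;
}
```

To cover all the TXT files for one DAT file, the sum and the match count would have to be carried across files before dividing; the sketch handles a single TXT file only.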

I have written something that works very well for matching both single and compound words, but I am not able to loop it over the files and compute the average of the line numbers. I have used Perl, but sed or awk will also do, as I only care about the output.

use strict;
use warnings;

print "Enter a File name :";
chomp (my $file = <STDIN>);
print "\n Searching file :";
if (-e $file) {
    print "File Found\n";

    # count the lines of the file with wc
    my $lines = `wc -l < "$file"`;
    chomp $lines;

    print "Total number of lines in the file = $lines \n";

    print "Enter the pattern to search :";
    chomp (my $pattern = <STDIN>);
    print "\n";
    # search the file for lines that contain the pattern
    my $abc = `grep "$pattern" "$file"`;
    print "here are the results ...\n$abc\n";
} else {
    print "File not Found\n";
}

I am using Linux with BASH.
# 2  
Old 08-26-2011
Well, I read your post once, then once more, and on the third read I saw that I still couldn't understand it.
Please give examples (for understanding and possible testing) of files B.txt, C.txt, 2.dat, 1.num, and 2.num.
# 3  
Old 08-27-2011
I am very sorry for not being clear. I realized that I really didn't make much sense in my first post.

I am trying to come up with a solution but got stuck in the file looping operation. This is what I intend to do.

Step by Step Procedure:
1. Open 1.dat
2. Open A.txt
3. Take the first word from A.txt (this word could be single or compound word like random access memory)
4. Search that first word in 1.dat
5. If the first word is found, store the value 1 (in my case A.txt means 1, B.txt means 2, C.txt means 3 and so on until L.txt, which is 12).
6. If the second word from A.txt is found in 1.dat, add 1 to the previous value.
7. If another word from A.txt matches in 1.dat, add 1 again, and keep adding 1 for every further word from A.txt found in 1.dat.
8. When A.txt reaches EOF, take B.txt and search 1.dat again (from the beginning), adding 2 to the counter each time a word is found.
9. When B.txt reaches EOF, take C.txt and search 1.dat again, adding 3 to the counter for every word that is present in both C.txt and 1.dat.
10. Keep going until L.txt, adding 12 for every match from L.txt.
11. Now take the mean (average): the counter divided by the total number of matches that occurred from A.txt to L.txt.
12. Store this mean in 1.num.
13. Now take 2.dat and repeat the procedure from A.txt until L.txt.

One thing that is important: while A.txt is being read, I add 1 for each match and 0 otherwise. For B.txt it's 2, for C.txt it's 3, and so on until L.txt.
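
The procedure above can be sketched in C along these lines. It is a sketch under assumptions, not the poster's implementation: the struct word_list type and the function mean_match_weight are hypothetical names, every TXT file is assumed to be loaded into memory already, and matching is a plain substring search.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One TXT file held in memory; the type and field names are hypothetical. */
struct word_list {
    const char *const *phrases;  /* one entry per line of the file */
    size_t count;                /* number of entries */
};

/*
 * Mean weight of all matches in `text`. lists[0] stands for A.txt
 * (weight 1), lists[1] for B.txt (weight 2), and so on. Each phrase is
 * counted at most once, however often it occurs in the text.
 * Returns 0.0 when nothing matches.
 */
double mean_match_weight(const char *text,
                         const struct word_list *lists, size_t n_lists)
{
    assert(text != NULL && lists != NULL);

    long sum = 0;      /* the "addition counter" of the procedure */
    long matches = 0;  /* how many phrases were found */

    for (size_t f = 0; f < n_lists; f++) {
        long weight = (long)(f + 1);
        for (size_t i = 0; i < lists[f].count; i++) {
            const char *phrase = lists[f].phrases[i];
            /* Skip empty lines: strstr() finds "" in every string. */
            if (phrase[0] != '\0' && strstr(text, phrase) != NULL) {
                sum += weight;
                matches++;
            }
        }
    }
    return matches > 0 ? (double)sum / (double)matches : 0.0;
}
```

Calling this once per DAT file and writing the result to the matching NUM file would give one number per DAT file. Because strstr() matches substrings, a short entry such as "rom" would also be found inside "from".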

Consider A.txt
random access memory
computer networking

Consider B.txt

Consider 1.dat

random access memory is a form of computer data storage today it takes the form of integrated circuits that allow stored data to be accessed in any order with a worst case performance of constant time strictly speaking modern types of dram are therefore not random access as data is read in bursts although the name dram ram has stuck however many types of sram rom otp and nor flash are still random access even in a strict sense ram is often associated with volatile types of memory such as dram memory modules where its stored information is lost if the power is removed many other types of non-volatile memory are ram as well, including most types of rom and a type of flash memory called nor flash the first ram modules to come into the market were created in and were sold until the late and early however other memory devices magnetic tapes disks can access the storage data in a predetermined order because mechanical designs only allow this

I read A.txt and extract "computer" from it. A.txt counts as 1.

I then search for "computer" in the entire text of 1.dat and luckily find it. I only have to find a match once, even if "computer" occurs several times. Now I store this value in some array or variable as 1.

I then go to line 2 of A.txt and extract "random access memory". I again search 1.dat for "random access memory" and BINGO!! I find it again, and store it in the array or variable as 1 because it is a line of A.txt.
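
A search like the one in these steps could be done on whole words rather than raw substrings. The following C sketch shows one way; the function name contains_phrase is hypothetical, and it assumes the words in the text are separated by whitespace, as in the DAT files described above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: true if `phrase` occurs in `text` as whole
 * whitespace-separated words, so "rom" is not found inside "from".
 * One hit is enough; later occurrences are not examined.
 */
bool contains_phrase(const char *text, const char *phrase)
{
    assert(text != NULL && phrase != NULL);

    size_t len = strlen(phrase);
    if (len == 0)
        return false;  /* an empty line never counts as a match */

    for (const char *hit = strstr(text, phrase); hit != NULL;
         hit = strstr(hit + 1, phrase)) {
        /* The hit must start at the beginning of the text or after whitespace... */
        bool starts_word = (hit == text) || isspace((unsigned char)hit[-1]);
        /* ...and must end at the end of the text or before whitespace. */
        bool ends_word = (hit[len] == '\0') || isspace((unsigned char)hit[len]);
        if (starts_word && ends_word)
            return true;
    }
    return false;
}
```

Whether whole-word matching is wanted here is a design choice; a plain strstr() test would also accept "rom" inside "from".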

I do this for the rest of A.txt and find nothing more. So I go to B.txt.

I have an addition counter, which by now holds 2 (1+1).

Now, I read B.txt

"rom" is found in the text, so I add 2 to the previous addition counter.
"flash" is found too, so I add 2 again to the counter.

Now I open C.txt and keep adding 3 for each match.

Suppose none of the other TXT files contain any matching words. Then my final addition counter is 6 (1+1+2+2 = 6 from the A.txt and B.txt matches) and the mean is 6/4 = 1.5, which I store in 1.num (4 is the number of matches).
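
The arithmetic of this example can be checked with a small running accumulator in C. The struct and function names below are hypothetical and only illustrate the bookkeeping: one sum of weights and one count of matches per DAT file.

```c
#include <assert.h>
#include <stddef.h>

/* Running totals for one DAT file; the names are hypothetical. */
struct match_stats {
    long sum;      /* sum of the weights of all matches so far */
    long matches;  /* number of matches so far */
};

/* Record one match from the TXT file with the given weight (A.txt = 1, B.txt = 2, ...). */
void record_match(struct match_stats *stats, long weight)
{
    assert(stats != NULL && weight > 0);
    stats->sum += weight;
    stats->matches++;
}

/* Mean weight per match, or 0.0 if nothing matched (avoids dividing by zero). */
double mean_weight(const struct match_stats *stats)
{
    assert(stats != NULL);
    return stats->matches > 0
        ? (double)stats->sum / (double)stats->matches
        : 0.0;
}
```

Recording the weights 1, 1, 2 and 2 leaves a sum of 6 over 4 matches, so mean_weight() returns 1.5, the value described for 1.num.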

When everything up to L.txt is done, I begin the same procedure with 2.dat (storing the result in 2.num), and so on until all DAT files have been read.

I think the complexity has increased due to the many files; otherwise it's very simple.

---------- Post updated 08-27-11 at 01:26 PM ---------- Previous update was 08-26-11 at 11:48 PM ----------

Hi All,

After a bit of work, I came up with my own code. I am not a PhD in shell scripting but can do fairly well in C (I know this section of the forum is not for C code!). I am pasting my code below so that anyone who comes across a similar problem in future can use it straight away or with a bit of modification according to their needs.

This is what one needs to do in order to run this code.

Place all the .txt and .dat files in one directory and run this code there. A file named average_values.res will have been created when the code completes execution. I have used GNU GCC for compiling and creating the binary.

In case anyone cannot understand the code, kindly comment here and I shall happily respond.

//Match patterns from several files.

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

char *chomp ( char * );
void read_file ( const char * , int32_t * );

int main ( void )
{
    char *dat_line = NULL;
    char *txt_line = NULL;
    char *file_name_txt = NULL;
    char *file_name_dat = NULL;
    char *entire_dat_file = NULL;
    char *line_from_txt_file = NULL;
    char *chomped_line = NULL;

    int32_t summation = 0;
    int32_t value = 0;
    float average = 0.0f;

    //getline() needs a separate size_t capacity for every buffer
    size_t dat_len = 0;
    size_t txt_len = 0;
    size_t pattern_len = 0;
    size_t characters_read = 0;
    long number_of_characters = 0;
    int32_t number_of_words = 0;

    FILE *open_dat_file = NULL;
    FILE *open_txt_file = NULL;
    FILE *output = NULL;

    //lets get the TXT files...
    system ( "ls -1 *.txt > text_files.tmp" );
    //lets get the DAT files...
    system ( "ls -1 *.dat > dat_files.tmp" );

    //Outer loop is for the DAT files and inner loop is for the TXT files

    FILE *txt_pointer = fopen ( "text_files.tmp" , "r" );
    if ( txt_pointer == NULL )
    {
        fprintf ( stderr , "The file list for text files does not exist\n" );
        return ( EXIT_FAILURE );
    }

    FILE *dat_pointer = fopen ( "dat_files.tmp" , "r" );
    if ( dat_pointer == NULL )
    {
        fprintf ( stderr , "The file list for the dat files does not exist\n" );
        return ( EXIT_FAILURE );
    }

    output = fopen ( "average_values.res" , "a" );
    if ( output == NULL )
    {
        fprintf ( stderr , "File append error\n" );
        return ( EXIT_FAILURE );
    }

    while ( getline ( &dat_line , &dat_len , dat_pointer ) != -1 )
    {
        file_name_dat = chomp ( dat_line );
        open_dat_file = fopen ( file_name_dat , "r" );
        if ( open_dat_file == NULL )
        {
            perror ( file_name_dat );
            continue;
        }

        //Find the size of the DAT file so that it can be read in one go
        ( void ) fseek ( open_dat_file , 0L , SEEK_END );
        number_of_characters = ftell ( open_dat_file );
        rewind ( open_dat_file );
        if ( number_of_characters < 0 )
        {
            perror ( file_name_dat );
            fclose ( open_dat_file );
            continue;
        }

        //One extra byte for the terminating '\0' that strstr() needs
        entire_dat_file = malloc ( ( size_t ) number_of_characters + 1 );
        if ( entire_dat_file == NULL )
        {
            fprintf ( stderr , "malloc() memory allocation failure in entire_dat_file\n" );
            fclose ( open_dat_file );
            return ( EXIT_FAILURE );
        }

        characters_read = fread ( entire_dat_file , 1 , ( size_t ) number_of_characters , open_dat_file );
        entire_dat_file[characters_read] = '\0';
        fclose ( open_dat_file );

        read_file ( file_name_dat , &number_of_words );

        while ( getline ( &txt_line , &txt_len , txt_pointer ) != -1 )
        {
            //A.txt adds 1 per match, B.txt adds 2 and so on (ls lists the files in sorted order)
            value++;

            file_name_txt = chomp ( txt_line );
            open_txt_file = fopen ( file_name_txt , "r" );
            if ( open_txt_file == NULL )
            {
                perror ( file_name_txt );
                continue;
            }

            //Now read the txt files one by one and search for the pattern...

            while ( getline ( &line_from_txt_file , &pattern_len , open_txt_file ) != -1 )
            {
                chomped_line = chomp ( line_from_txt_file );
                //Skip blank lines: strstr() finds an empty string in every text.
                //strstr() is a substring search, so "rom" also matches inside "from".
                if ( chomped_line[0] != '\0' && strstr ( entire_dat_file , chomped_line ) != NULL )
                    summation = summation + value;
            }

            fclose ( open_txt_file );
        }

        //Note: the divisor is the word count of the DAT file (spaces + 1), not the number of matches
        average = ( float ) summation / ( number_of_words + 1 );
        fprintf ( output , "%f\n" , average );
        number_of_words = 0;
        summation = 0;
        value = 0;
        number_of_characters = 0;
        rewind ( txt_pointer );

        free ( entire_dat_file );
        entire_dat_file = NULL;
    }

    fclose ( txt_pointer );
    fclose ( dat_pointer );
    fclose ( output );

    free ( txt_line );
    free ( dat_line );
    free ( line_from_txt_file );

    return ( EXIT_SUCCESS );
}

//Strip the trailing newline (and carriage return) in place and return the same string
char *chomp ( char *s )
{
    if ( s != NULL )
        s[strcspn ( s , "\r\n" )] = '\0';
    return s;
}

//Count the spaces in a file; the caller adds 1 to get the number of words
void read_file ( const char *path , int32_t *number_of_words )
{
    FILE *pointer = NULL;

    //int, not char, so that EOF can be told apart from real characters
    int ch;

    (*number_of_words) = 0;

    pointer = fopen ( path , "r" );
    if ( pointer == NULL )
    {
        perror ( "File read error " );
        return;
    }

    while ( ( ch = fgetc ( pointer ) ) != EOF )
    {
        if ( ch == ' ' )
            (*number_of_words) ++;
    }

    fclose ( pointer );
}
