Performing fast searching operations with a bash script
Hi,
Here is a tough requirement, to be met with a bash script. I have 10,000 doc files and 300,000 html files in the file system, so in the worst case I need to perform 300,000 * 10,000 searches. I want to check which of the doc files are referenced in any of the html files (e.g. <a href="abc.doc">abc</a>). Finally, I want to remove all the doc files that are not referenced from any of the html files.

Approach 1:

Initially I tried nested loops: an outer loop over the list of html files and an inner loop over the list of doc files. Inside the inner loop, I checked (with the fgrep command) whether one doc file name is present in one html file.

    # html_list         : list of all html files
    # doc_file_list     : list of all doc files
    # tmp_doc_file_list : list of temp doc files
    while read l_line_outer
    do
        while read l_line_inner
        do
            fgrep <file> <html>
            return_code=$?
            if [ $return_code -ne 0 ]
            then
                printf "%s\t%s\n" $l_alias_name_file $l_alias_path_file >> tmp_doc_file_list
            fi
        done < doc_file_list
        mv tmp_doc_file_list doc_file_list
    done < html_list

This approach gave correct output, but it took a very long time to perform this huge number of searches.

Approach 2:

Then we switched to a different logic, launching many searches in parallel.

1. Outer loop over "doc_file_list" and inner loop over "html_list".
2. Within a single process (inside the inner loop), I searched (fgrep) for one doc file name in 30 html files at once.
3. I launched 10 such processes in parallel (by using & at the end).

The sample code is as follows.

    ........
    .........
    while read l_line_outer
    do
        .......
        < Logic to advance the loop pointer 10 lines at a time, i.e. the first iteration starts from the file at position 1 and the next iteration starts from position 11. >
        .......
        while read l_line_inner
        do
            < Logic to advance the loop pointer 30 lines at a time, i.e. the first iteration starts from the file at position 1 and the next iteration starts from position 31. >
            ........
            # Loop to launch multiple fgrep in parallel
            for ((i=1; i<=10; i++))
            do
                ( fgrep -s -m 1 <file{i}> <html1> <html2> <html3> ... <html30> > /dev/null ; echo $? >> thread_status{i} ) &
            done
            ....
        done < html_list
        .....
        <Logic to prepare the doc_file_list for the next loop and handle the multiple threads>
        .....
    done < doc_file_list
    ......
    .....

However, this approach is not working either.

a) I get the correct output on a small number of files/folders.
b) While performing 300,000 * 10,000 searches, my shell script gets deadlocked somewhere and the execution halts.
c) Even if I manage the deadlocking (thread management) to some extent, it will still take a long time to finish such a huge search.

Is there any alternative approach for making this search faster, so that it can finish in at least 2-3 days?

Please help.

Thanks and Regards,
Jitendriya Dash.
Use a database!
Create a table "Table1" with columns "HTML" and "DOC", populated with one row for each document reference that appears in an html file. Create another table "Table2" with a unique list of all DOCs. Then run:

    select DOC from Table2 where DOC not in (select distinct DOC from Table1)

This will give a list of all unused DOCs.
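
For anyone who wants to try the "not in" logic before setting up a database, here is a minimal sketch of the same set difference using plain sorted text files. Note that this swaps `comm -23` in for the SQL query; it is not the database solution itself. The function name `unused_docs` and the file layout (a tab-separated HTML/DOC file standing in for Table1, a one-name-per-line file standing in for Table2) are assumptions made for illustration.

```shell
# Hypothetical helper, not part of the original post.
# $1 = "Table1" as text: one "HTML<TAB>DOC" line per reference
# $2 = "Table2" as text: one DOC name per line
unused_docs() {
    tmp_refs=$(mktemp) || return 1
    # Equivalent of "select distinct DOC from Table1"
    cut -f2 "$1" | LC_ALL=C sort -u > "$tmp_refs"
    # comm -23 keeps lines found only in the first input, i.e. DOCs that
    # never appear in Table1. Both inputs must be sorted the same way,
    # hence LC_ALL=C on sort and comm alike.
    LC_ALL=C sort -u "$2" | LC_ALL=C comm -23 - "$tmp_refs"
    rm -f "$tmp_refs"
}
```

Called as `unused_docs table1.txt table2.txt`, it prints the unreferenced DOC names, one per line.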
Thanks for the input.
Using a database is a good idea. However, loading the contents of all the html files into the DB (as a BLOB/CLOB field), or loading only the <a href> lines that contain a reference to a document, is a big task. (That is, how do I insert all these lines into the DB? For example, if one html file has 100 <a href> lines, how do I load all of those lines for 300,000 html files?) Can it be done quickly by running some Linux command?

Also, the Linux server has an 8-core processor. Is there any other way to speed up the search/grep operation and the loop by distributing the work across multiple cores?

Please give your inputs.

Thanks and Regards,
Jitendriya Dash.
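
One possible answer to the "any Linux command" part, as a rough sketch only: `grep -o` can print just the matching href attributes, and `xargs -P` can spread the grep runs over several cores. The function name `extract_doc_refs`, the batch size of 100, and the exact href pattern (double-quoted, lowercase `href`, ending in `.doc`) are assumptions for illustration, and `-o`, `-H` and `--line-buffered` assume GNU grep.

```shell
# Hypothetical helper, not part of the original post.
# $1 = directory tree holding the html files
# $2 = number of grep processes to run in parallel (defaults to 8)
# Prints one "HTML<TAB>DOC" line per reference, ready for bulk loading.
extract_doc_refs() {
    tab=$(printf '\t')
    find "$1" -type f -name '*.html' -print0 |
        # -o prints only the matched text, -H prefixes it with the file name;
        # --line-buffered keeps lines from parallel greps from interleaving.
        xargs -0 -n 100 -P "${2:-8}" \
            grep -H -o --line-buffered 'href="[^"]*\.doc"' |
        # Turn  file:href="abc.doc"  into  file<TAB>abc.doc
        sed "s/^\(.*\):href=\"\(.*\)\"\$/\1${tab}\2/"
}
```

For example, `extract_doc_refs /path/to/html 8 > table1.txt` would produce a tab-separated file that most databases can bulk-load in one step. The DOC column holds the href value exactly as written, so any directory prefix would still need stripping before comparing against bare file names.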
Concatenate the contents of all the html files into one big file. Then search for the doc filenames in parallel over that file, and print the pattern whenever there is no match (use grep -m 1 if using GNU grep, so that a match stops the search instead of scanning the rest of the file). If you have enough memory for the concatenated html to stay cached, this should be pretty fast.
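
A minimal sketch of that idea, under a few assumptions: the function name `find_unreferenced` and its arguments are made up for illustration, `xargs -0` and `-P` assume GNU (or BSD) xargs, and `grep -q` is used in place of `-m 1` since it also stops at the first match.

```shell
# Hypothetical helper, not part of the original post.
# $1 = directory tree holding the html files
# $2 = file with one doc name per line
# $3 = number of parallel searches (defaults to 8)
# Prints the doc names that occur nowhere in the concatenated html.
find_unreferenced() {
    all_html=$(mktemp) || return 1
    # Build the one big file
    find "$1" -type f -name '*.html' -exec cat {} + > "$all_html"
    # One fixed-string grep per doc name, several at a time.
    # Inside sh -c: $1 is the big file, $2 is the doc name.
    # The name is printed only when grep finds no match.
    tr '\n' '\0' < "$2" |
        xargs -0 -n 1 -P "${3:-8}" \
            sh -c 'grep -F -q -- "$2" "$1" || printf "%s\n" "$2"' sh "$all_html"
    rm -f "$all_html"
}
```

Usage would look like `find_unreferenced /path/to/html doc_file_list 8 > unused_docs`. Note that this is a plain substring match, so "abc.doc" also counts as referenced when only "xabc.doc" appears. That errs on the side of keeping a file rather than deleting it.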