Post 302321385 by jitendriya.dash on Monday 1st of June 2009 03:06:38 AM
Performing fast searching operations with a bash script

Hi,

Here is a tough requirement that has to be met with a bash script.

I want to perform 300,000 * 10,000 searches.

That is, I have 10,000 doc files and 300,000 html files in the file system. I want to check which of the doc files are referenced in any of the html files (e.g. <a href="abc.doc">abc</a>). Finally, I want to remove all the doc files that are not referenced from any html file.

Approach 1:
Initially I tried nested loops: an outer loop over the list of html files and an inner loop over the list of doc files. Inside the inner loop, I checked (with the fgrep command) whether the doc file name is present in the html file.
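# html_list         : list of all html files, one path per line
# doc_file_list     : doc files not yet found in any html file
#                     (assumed format: name<TAB>path, as written by the printf below)
# tmp_doc_file_list : doc files not found in the current html file
while read -r l_line_outer
do
    : > tmp_doc_file_list    # start empty so the mv below always has a file
    while IFS=$'\t' read -r l_alias_name_file l_alias_path_file
    do
        # keep the doc file only if the current html file does not mention it
        if ! fgrep -q -- "$l_alias_name_file" "$l_line_outer"
        then
            printf "%s\t%s\n" "$l_alias_name_file" "$l_alias_path_file" >> tmp_doc_file_list
        fi
    done < doc_file_list
    mv tmp_doc_file_list doc_file_list
done < html_list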

This approach gave the correct output, but it took a very long time to perform such a huge number of searches.

Approach 2:
Then we switched to a different logic, launching many searches in parallel.

1. Outer loop over "doc_file_list" and inner loop over html_list.
2. Within a single process (inside the inner loop), I searched (fgrep) for one doc file in 30 html files at once.
3. I launched 10 such processes in parallel (by using & at the end of the command).

The sample code is as follows.
........
.........
while read l_line_outer
do
.......
< Logic to advance the loop pointer 10 lines at a time, i.e. the first iteration starts at the 1st file and the next iteration starts at the 11th. >
.......
while read l_line_inner
do
< Logic to advance the loop pointer 30 lines at a time, i.e. the first iteration starts at the 1st file and the next iteration starts at the 31st. >
........
# Loop to launch multiple fgrep in parallel
for ((i=1; i<=10; i++))
do
( fgrep -s -m 1 <file{i}> <html1> <html2> <html3> ... <html30> > /dev/null ; echo $? >> thread_status${i} ) &
done
....
done < html_list
.....
<Logic to prepare the doc_file_list for the next loop and handle the multiple threads>
.....
done < doc_file_list
......
.....

However, this approach is not working either.
a) I get the correct output on a small number of files/folders.
b) While performing 300,000 * 10,000 searches, my shell script gets deadlocked somewhere and execution halts.
c) Even if I manage the deadlocking (thread management) to some extent, it will still take a long time to finish such a huge search.
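
For the thread management part, one option might be to let xargs run the worker pool instead of tracking background jobs and status files by hand. The following is only a minimal sketch under assumptions: the sample files, the directory layout and the names doc_names and found_names are made up for the demo, grep -F is used as the equivalent of fgrep, and -P is a GNU/BSD xargs extension rather than POSIX.

```shell
#!/usr/bin/env bash
# Sketch only: sample files and names are hypothetical, created just for the demo.
set -eu

work=$(mktemp -d)
mkdir -p "$work/html"
printf 'abc.doc\nxyz.doc\nunused.doc\n' > "$work/doc_names"
printf '<a href="abc.doc">abc</a>\n' > "$work/html/one.html"
printf '<a href="xyz.doc">xyz</a>\n' > "$work/html/two.html"
printf '<p>no links here</p>\n' > "$work/html/three.html"

# xargs keeps at most 10 grep processes running (-P 10), hands each one up to
# 30 html files (-n 30), and returns only after every worker has exited.
# grep -F -f reads all doc names as fixed-string patterns, so a single process
# checks every name; -o prints only the matched name, -h omits the file name.
find "$work/html" -name '*.html' -print0 |
    xargs -0 -n 30 -P 10 grep -F -oh -f "$work/doc_names" |
    sort -u > "$work/found_names"

cat "$work/found_names"
```

Because xargs waits for its own children, there is no status file to poll and nothing to deadlock on. One caveat: with a very large amount of output, lines written by concurrent workers into the same pipe can interleave, so per-worker output files may be safer at full scale.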


Is there any alternative approach for making this search faster, so that it can be finished within 2-3 days?
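
One direction that might be worth trying is to turn the problem around: read each html file once, collect every doc name it links to, and compare that list with the list of existing doc files as a set difference. The following is a minimal sketch, not a tested solution for the real data. It assumes links look like href="….doc" with double quotes, that doc files can be identified by base name alone, and the directory layout and file names (html/, docs/, referenced.txt, all_docs.txt, unreferenced.txt) are made up for the demo.

```shell
#!/usr/bin/env bash
# Sketch only: the directory layout and file names are hypothetical demo data.
set -eu

work=$(mktemp -d)
mkdir -p "$work/html" "$work/docs"
printf '<a href="abc.doc">abc</a>\n' > "$work/html/one.html"
printf '<p><a href="sub/xyz.doc">xyz</a></p>\n' > "$work/html/two.html"
touch "$work/docs/abc.doc" "$work/docs/xyz.doc" "$work/docs/unused.doc"

# 1. One pass over the html files: collect every referenced *.doc name.
#    grep -o prints only the matched text; sed strips href=", the closing
#    quote, and any directory part of the link.
find "$work/html" -name '*.html' -print0 |
    xargs -0 grep -ohi 'href="[^"]*\.doc"' |
    sed -e 's/^[^"]*"//' -e 's/"$//' -e 's|.*/||' |
    sort -u > "$work/referenced.txt"

# 2. Sorted list of the doc files that actually exist.
find "$work/docs" -name '*.doc' -exec basename {} \; | sort -u > "$work/all_docs.txt"

# 3. Set difference: names in all_docs.txt that are absent from referenced.txt.
comm -23 "$work/all_docs.txt" "$work/referenced.txt" > "$work/unreferenced.txt"

cat "$work/unreferenced.txt"
```

With this shape the work grows with the total size of the html files rather than with 300,000 * 10,000 separate fgrep calls. The resulting list could be reviewed before being passed to rm. Links written with single quotes, or names that differ only in upper/lower case, would need extra handling.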

Please help.

Thanks and Regards,

Jitendriya Dash.
 
