Efficient method of determining if a string is in a file.


 
# 1  
Old 09-25-2012

Hi,

I was hoping someone could suggest an alternative to the code I currently have, as mine takes up far too much processor time and is too slow.

The situation:
I have a program that runs on some files just before they are zipped up and archived. The program appends a one-line summary of the file contents to a file called summary.log in the following format:

<filename1><file timestamp><data>
<filename2><file timestamp><data>
...
...
etc

It is imperative that each file is only processed once. This should happen by default, but it is very sensitive information, so we want a method of checking that a file has not already been processed, and there could be several files that do need processing.

The current solution:

We found out quickly that grepping for the file name was too slow, so my current solution is as follows:

Code:
    ###########################################################################
    # Get a list of the pri files, and strip off the last one, as we          #
    # don't want to compress or parse this yet.  Store these in               #
    # a file called prilist2.                                                 #
    ###########################################################################
    ls -rt1 *.pri | grep -v XML > /tmp/prilist 2>/dev/null
    last_pri=`tail -1 /tmp/prilist`

    if [ -n "$last_pri" ]
    then
      grep -v $last_pri /tmp/prilist > /tmp/prilist2

      #########################################################################
      # Only process files that have not already been processed.              #
      #########################################################################
      awk -F, '{print $1}' $SUMMARYLOG > /tmp/processed 2>/dev/null
      comm -23 /tmp/prilist2 /tmp/processed  > /tmp/unprocessed 2>/dev/null

      for IMPORTANT_FILE in `cat /tmp/unprocessed`
      do 

....
etc

Sadly the awk in this command is also too slow, so I need something faster. I was thinking of doing something based on the timestamp, but we cannot guarantee that the items in the summary log will be in chronological order (although all but the first few entries should be).

I hope that is clear. If you need any more information I will be happy to provide it in the morning (I have to dash for a train now).

Thanks in advance.
# 2  
Old 09-25-2012
Please provide sample data of the file you want to extract from and what you want to extract...
# 3  
Old 09-25-2012
Try this:

Code:
ls -t1 *.pri | awk -F, 'FNR==NR{have[$1];next}
  /XML/||!(F++){next}!($1 in have)' $SUMMARYLOG - | while read IMPORTANT_FILE
do

...
etc
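
For anyone unpacking the one-liner above, here is a rough sketch of the same two-pass awk idiom spelled out with comments. The function name `list_unprocessed` is made up for illustration, and it keeps the comma field separator of the original, which assumes the log really is comma-separated.

```shell
# list_unprocessed SUMMARY_LOG   (hypothetical helper name)
# Reads candidate file names on stdin, newest first (as from `ls -t1`),
# and prints the ones that have no entry in SUMMARY_LOG.
list_unprocessed() {
  awk -F, '
    # First input (the summary log): remember field 1 of every line.
    FNR == NR { have[$1]; next }
    # Second input (stdin): skip names containing XML, and skip the
    # first non-XML name, which is the newest file.
    /XML/ || !(F++) { next }
    # Print every remaining name that was never seen in the log.
    !($1 in have)
  ' "$1" -
}
```

It would be used as `ls -t1 *.pri | list_unprocessed "$SUMMARYLOG" | while read -r IMPORTANT_FILE`. The log is read once and each name is then a single array lookup, so the order of the log entries does not matter. One caveat: if the summary log is empty, `FNR == NR` is also true for the names on stdin, so nothing is printed.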


Last edited by Chubler_XL; 09-25-2012 at 03:22 PM..
# 4  
Old 09-26-2012
I can't provide real examples as the data relates to customer billing information, but the file will look pretty much like this:

Code:
file1;00:00:09:12:12;0;1;2;0;0.6;9;0.6;6,4
file2;00:00:10:12:12;23;4.3;34;0;0;0;34
file3;00:00:11:12:12,2;4,5;53;5,3;53;0;0;0;
file4;00:00:12;12;12;4;23;5.4;4.5;0;2;1;1
..
etc

I want to search for the file name, in the above example "file#". As mentioned in the original post, the general format of each line is <filename>;<timestamp>;<data>, where the data is a series of numbers with at most one decimal place, separated by semicolons.

The original files are called file#.pri, but we strip the extension off before adding the summary data to the file.

The file should mainly be in chronological order, but some files may have been processed out of order, so I would rather not rely on this. We do have a script that does some processing during quiet times, so I could make it sort the file once a day, but I feel that would be very processor-expensive, as the file can have thousands of lines of data in it.

@Chubler_XL: Thanks, I will take a look at your suggestion and see if it helps.
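
Given the format described in this post (semicolon-separated lines, with the name stored without its .pri extension), a lookup along the lines suggested earlier would probably need two adjustments: split on `;` and strip `.pri` from each candidate before checking it. A minimal sketch under those assumptions follows. The function name `unprocessed_pri` is hypothetical, and unlike the earlier suggestion it does not skip the newest file or XML names.

```shell
# unprocessed_pri SUMMARY_LOG   (hypothetical helper name)
# stdin:  one .pri file name per line
# stdout: the names that have no entry in SUMMARY_LOG
unprocessed_pri() {
  awk -F';' '
    # Pass 1 (summary log): field 1 is the name without the .pri extension.
    FNR == NR { have[$1]; next }
    # Pass 2 (stdin): strip .pri for the lookup only.
    { name = $0; sub(/\.pri$/, "", name) }
    # Print the original name when its stripped form is not in the log.
    !(name in have)
  ' "$1" -
}
```

Because membership is checked against an in-memory array, the log does not have to be sorted or in chronological order, and it is read only once however many candidate files there are.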
# 5  
Old 09-26-2012
Chubler_XL's solution should work for you.
That aside, remember that comm only works properly on sorted files, and I don't think your two files (/tmp/prilist2 and /tmp/processed) are sorted. Or are they?
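
If you do want to stay with comm, a sketch of the sorted comparison might look like the following. The function name `not_in_second` is invented for the example, and it assumes `mktemp` is available on your system.

```shell
# not_in_second FILE_A FILE_B   (hypothetical helper name)
# Prints the lines that occur in FILE_A but not in FILE_B.
not_in_second() {
  a_sorted=$(mktemp) || return 1
  b_sorted=$(mktemp) || return 1
  # comm expects both inputs sorted in the same collation order,
  # so sort and compare under the same locale.
  LC_ALL=C sort "$1" > "$a_sorted"
  LC_ALL=C sort "$2" > "$b_sorted"
  # -2 drops lines found only in FILE_B, -3 drops lines common to both.
  LC_ALL=C comm -23 "$a_sorted" "$b_sorted"
  rm -f "$a_sorted" "$b_sorted"
}
```

Called as `not_in_second /tmp/prilist2 /tmp/processed > /tmp/unprocessed`, this gives the names missing from the log, but in sorted order rather than the time order that `ls -rt1` produced. Both lists also need the names in the same form: if the log stores them without .pri, the extension has to be stripped from the list side (or added to the log side) before comparing.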