Thanks, everybody, for the interesting comments. They helped me try different options and work around the problem.
The values that I use for filtering are not fixed, so I had to be cautious about relying on regular expressions. Also, I have a number of other informative columns that I did not include in the original post, for simplicity.
Instead, as Corona688 suggested, I have realized that my main problem is the huge amount of data.
Quote:
Originally Posted by Corona688:
The problem, really, is that you have a huge amount of data, not a slow program. How big are your records, really?
Therefore, I have made two changes to speed things up (not perfect but to an acceptable level):
(1) I sorted the files according to the relevant columns and then used a modified awk line that does not have to scan the full files but can exit as soon as it passes the relevant range (see the sketch after these two points).
(2) I split the original (sorted) file (>50,000,000 lines) into smaller fragments according to the ranges of numbers in the columns that are a priori known to be relevant for the current filtering requirements.
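A minimal sketch of both steps, assuming the file is sorted numerically on column 2, that the filter is the 83 / 1000-2999 range discussed earlier, and that GNU split is available (the column numbers, bounds, and file names are all assumptions):

# (1) Early exit: because the file is sorted on column 2, awk can stop
#     reading as soon as it passes the upper end of the range.
awk '$1 == 83 && $2 >= 1000 && $2 <= 2999 { print }
     $2 > 2999 { exit }' sorted_file

# (2) Split the sorted file into fixed-size fragments so each filtering job
#     only has to read the fragment that covers its range.
split -l 5000000 -d sorted_file fragment_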
Use a relational database - they are specifically designed to do the type of queries you are talking about, and people spend their whole careers optimising them.
Other thoughts - If you have the files on a nice quick SAN or something, you might benefit from doing a multi-threaded lookup. Get all your 16 cores working on the problem. You may need to split the file(s) up so you can run multiple processes against their own file.
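A rough sketch of that idea, assuming the file has already been split into fragments named fragment_00, fragment_01, ... and that the same 83 / 1000-2999 filter applies (the names and bounds are assumptions):

# Run one awk per fragment in the background, then merge the results
# once every job has finished.
for f in fragment_*; do
    awk '$1 == 83 && $2 >= 1000 && $2 <= 2999' "$f" > "$f.out" &
done
wait
cat fragment_*.out > filtered.out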
It means: match any line that starts (^) with 83, followed by one or more spaces, and then a number that starts with a 1 or a 2 ([12]) followed by three digits ([0-9][0-9][0-9]).
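For example, used as a pattern on its own (the exact field layout is an assumption):

# Print only the lines that begin with 83, one or more spaces, and a
# four-digit number starting with 1 or 2.
awk '/^83 +[12][0-9][0-9][0-9]/' datafile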
I have nginx web server logs with all requests that were made and I'm filtering them by date and time.
Each line has the following structure:
127.0.0.1 - xyz.com GET 123.ts HTTP/1.1 (200) 0.000 s 3182 CoreMedia/1.0.0.15F79 (iPhone; U; CPU OS 11_4 like Mac OS X; pt_br)
These text files are... (21 Replies)
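A hedged sketch of that kind of filtering, assuming each line also carries a standard bracketed nginx timestamp such as [23/May/2018:10:15:00 +0000] (not visible in the truncated sample above):

# Keep only the requests logged between 10:00 and 10:59 on 23/May/2018.
grep '\[23/May/2018:10:' access.log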
I have the command below, which reads a large file and takes 3 hours to run. Can anything be done to make it faster?
awk -F ',' '{OFS=","}{ if ($13 == "9999") print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12 }' ${NLAP_TEMP}/hist1.out|sort -T ${NLAP_TEMP} |uniq>... (13 Replies)
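One possible speed-up, sketched under the assumption that the truncated redirection writes to a file such as hist1.filtered: set OFS once in a BEGIN block, use a pattern instead of an if, and let sort deduplicate with -u so the extra uniq process is not needed.

awk -F ',' 'BEGIN { OFS = "," } $13 == "9999" { print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12 }' \
    "${NLAP_TEMP}/hist1.out" | sort -T "${NLAP_TEMP}" -u > "${NLAP_TEMP}/hist1.filtered"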
awk "/May 23, 2012 /,0" /var/tmp/datafile
The above command pulls information out of the datafile, from the specified date to the end of the file.
Now, how can I make this faster if the datafile is huge? Even if it wasn't huge, I feel there's a better/faster way to... (8 Replies)
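One common trick, assuming the file is in chronological order and GNU grep is available: find the first matching line number once, then let tail stream the rest, so no per-line pattern test is run against the bulk of the file.

# Locate the first line for that date, then print from there to the end.
start=$(grep -n -m1 'May 23, 2012 ' /var/tmp/datafile | cut -d: -f1)
[ -n "$start" ] && tail -n +"$start" /var/tmp/datafile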
Can someone help me edit the below script to make it run faster?
Shell: bash
OS: Linux Red Hat
The point of the script is to grab entire chunks of information that concern the service "MEMORY_CHECK".
For each chunk, the beginning starts with "service {", and ends with "}".
I should... (15 Replies)
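A minimal sketch of that extraction, assuming the blocks are not nested and that config_file is the (hypothetical) input name: collect everything between "service {" and the closing "}", and print the block only if it mentions MEMORY_CHECK.

awk 'index($0, "service {") { inblock = 1; block = "" }
     inblock                { block = block $0 ORS }
     inblock && /}/         { if (index(block, "MEMORY_CHECK")) printf "%s", block
                              inblock = 0 }' config_file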
Hi,
I have a script below for extracting xml from a file.
for i in *.txt
do
    echo "$i"
    awk '/<.*/ , /.*<\/.*>/' "$i" | tr -d '\n'
    echo -ne '\n'
done
I read about using multi-threading to speed up the script.
I do not know much about it but read it on this forum.
Is it a... (21 Replies)
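One simple way to get that effect without real threads, assuming GNU xargs and that the work on each file is independent: move the loop body into a small helper script and let xargs run several copies at once. Both the name extract_xml.sh and the process count are assumptions.

#!/bin/sh
# extract_xml.sh -- hypothetical helper: the body of the original loop,
# applied to the single file named in $1.
echo "$1"
awk '/<.*/ , /.*<\/.*>/' "$1" | tr -d '\n'
echo

# Then process up to 4 files at a time:
printf '%s\0' *.txt | xargs -0 -n 1 -P 4 ./extract_xml.sh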
Hi -- I have the following SQL query in my UNIX shell script -- but the subquery in the second section is very slow. I know there must be a way to do this with a union or something which would be better. Can anyone offer an alternative to this query? Thanks.
select
count(*)
from
... (2 Replies)
I am processing some terabytes of information on a computer with 8 processors (each with 4 cores), 16 GB of RAM, and a 5 TB hard drive implemented as a RAID. The processing doesn't seem to be blazingly fast, perhaps because of an I/O limitation.
I am basically running a perl script to read some... (13 Replies)
If I just wanted to get andred08 from the following LDAP DN, would I be better off using awk or cut?
uid=andred08,ou=People,o=example,dc=com
It doesn't make a difference if it's just one LDAP search I am getting it from, but when there are a couple of hundred people in the group that returns all... (10 Replies)
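Either tool can do it; both of these print andred08 for the DN above (dns.txt is a hypothetical file holding one DN per line):

# cut: take the first comma-separated field, then the part after the "=".
cut -d, -f1 dns.txt | cut -d= -f2
# awk: split on both "=" and "," and print the second field.
awk -F'[=,]' '{ print $2 }' dns.txt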