I can't thank you all enough!
I ended up going with the awk script suggested by Otheus above ... I was so amazed at the speed that I felt obligated to check the result with an alternative method - and the result was indeed verified.
My box has 6GB of memory, BTW, but it appears my gawk has a 1.5GB limit (either compiled in or imposed by the OS - either way, I don't think I can change it). The limit is hit as the size of FILE1 approaches 1.5GB, so for larger files I split FILE1 and ran the script against each of the parts (see the sketch just below). The size of FILE2 has no bearing on the memory the awk program needs, since only FILE1 is held in the array. Your awk may have a different limit, which you'll discover if it ever becomes an issue.
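For the record, here's roughly what the splitting looks like - the chunk size, the part_ prefix, and the cleanup are placeholders rather than exactly what I ran:

Code:
# split -l keeps whole lines, which matters since the comparison is line-based;
# pick a line count that keeps each piece under your awk's memory limit
split -l 500000 "$FILE1" part_
: > "$FILE3"                       # start with an empty result file
for p in part_*
do
    awk 'NR==FNR { A[$0]=1; next }
         { if ($0 in A) A[$0]=0 }
         END { for (k in A) if (A[k]==1) print k }' "$p" "$FILE2" >> "$FILE3"
done
rm -f part_*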
FILE1: START.100K
66,831,529 bytes with 100K lines (yes - my data is actually 600+ bytes/record)
FILE2: REF.1M
648,903,713 bytes with 1M lines - obviously similar data
Quote:
time (awk ' NR==FNR { A[$0]=1; next; }
{ if ($0 in A) { A[$0]=0; } }
END { for (k in A) { if (A[k]==1) { print k; } } } ' $FILE1 $FILE2 > $FILE3 )
real 0m23.323s !!!!
user 0m11.484s
sys 0m8.233s
AND THE OUTPUT IS:
48,295,948 bytes with 71,836 lines - i.e. 71,836 of the 100K lines in FILE1 did NOT appear in the 1M-line file.
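If anyone else wants to cross-check, a sort/comm pipeline is one independent way to do it - just a sketch, and the file names below are placeholders:

Code:
# sort -u on FILE1 matches the awk array's unique-key behaviour
sort -u "$FILE1" > start.sorted
sort "$FILE2" > ref.sorted
comm -23 start.sorted ref.sorted > check.out   # lines only in FILE1
wc -l check.out                                # should report the same 71,836 lines
sort "$FILE3" | diff - check.out               # no output means the two results match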