Honey, I broke awk! (duplicate line removal in 30M line 3.7GB csv file)

# 22  
Old 03-28-2014
I am aware of the collision issues. There is a fair amount of entropy in the data (timestamps) so I'm probably not gaming the odds down with patterns.

Does Perl support SHA-1 or SHA-2? That would take the odds of a collision from very, very unlikely to (even more) astronomical.
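To put rough numbers on "very unlikely" versus "astronomical", here is a small Python sketch of the usual birthday-bound approximation, p ≈ n(n-1) / 2^(bits+1). It assumes the digests behave like uniformly random values (no adversarial input) and takes the 30M line count from the thread title; the function name is made up for illustration.

```python
from fractions import Fraction


def collision_probability(n_items: int, hash_bits: int) -> Fraction:
    """Birthday-bound approximation of the chance that any two of
    n_items distinct inputs share the same hash_bits-bit digest."""
    return Fraction(n_items * (n_items - 1), 2 ** (hash_bits + 1))


N_LINES = 30_000_000  # line count taken from the thread title

for name, bits in [("MD5", 128), ("SHA-1", 160), ("SHA-256", 256)]:
    p = collision_probability(N_LINES, bits)
    print(f"{name:8s} ({bits:3d} bits): p ~ {float(p):.2e}")
```

Under those assumptions the MD5 figure comes out on the order of 10^-24 and the SHA-1 figure on the order of 10^-34, so both are far below any realistic hardware error rate.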

# 23  
Old 03-28-2014
Perl supports everything and anything, but I'm not sure it'd come with it by default. Try it and see. Digest::SHA2 - search.cpan.org
# 24  
Old 03-28-2014
Looks like I should try regular SHA:
This module has numerous known bugs, is not compatible with the Digest interface and its functionality is a subset of the functionality of Digest::SHA (which is in perl core as of 5.9.3).
Please use Digest::SHA instead of this module in new and old code.
It looks like SHA even supports Base-64 output, which should keep the associative array as small as possible (its size is what I suspect was breaking my initial AWK routine).
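The idea of keying the table on a short digest instead of the full line can be sketched as follows. This is a Python illustration, not the Perl that was actually run in this thread, and the helper names are hypothetical. For SHA-1 the key is 20 bytes raw, 27 characters as unpadded Base-64, and 40 characters as hex, so Base-64 is smaller than hex, and raw bytes are smaller still where the language allows them as keys.

```python
import base64
import hashlib
from typing import Iterable, Iterator


def digest_key(line: bytes, as_base64: bool = False) -> bytes:
    """Return a fixed-size key for a line: raw SHA-1 (20 bytes) or
    unpadded Base-64 (27 characters)."""
    raw = hashlib.sha1(line).digest()
    if as_base64:
        return base64.b64encode(raw).rstrip(b"=")
    return raw


def unique_lines(lines: Iterable[bytes]) -> Iterator[bytes]:
    """Yield each line the first time its digest is seen.
    Memory grows with the number of distinct digests, not line length."""
    seen = set()
    for line in lines:
        key = digest_key(line)
        if key not in seen:
            seen.add(key)
            yield line


# Tiny in-memory demo; a real run would iterate over the opened CSV file.
sample = [b"2014-03-28,a\n", b"2014-03-28,b\n", b"2014-03-28,a\n"]
print(list(unique_lines(sample)))
```

Note that this trusts the digest completely: two different lines with the same digest would be treated as duplicates.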


# 25  
Old 04-01-2014
You did not say how much RAM you have, which is a definite factor.

The MD5 result is impressive. Is MD5 as cheap as a good hash? Perhaps CPUs have gotten so much faster than disk that it is not a factor!

In Perl/C/C++/Java you can mmap the input file, both to read it and so that your hash map can hold a 64-bit char* into the mapping for exact verification, reducing copying and space allocation overhead.

Exact verification might seem to expand the VM footprint a lot, but in the most likely case the MD5 (or other hash) is new, so no exact compare is done; with only a small minority of duplicates, that greatly reduces both the processing and the VM footprint. If there were a lot of duplicates, the extra exact verifications would hurt the VM footprint, but conversely there would be less final data in the map.
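A minimal sketch of that approach in Python, under some assumptions: the map stores (start, end) offsets into the mapping rather than a char*, MD5 is the hash, and the function name is hypothetical. The exact compare only runs when a digest has been seen before, which is the rare case described above.

```python
import hashlib
import mmap
import os
import tempfile
from typing import Iterator


def unique_lines_mmap(path: str) -> Iterator[bytes]:
    """Yield unique lines; MD5 finds candidates, the mapped file confirms them.
    Note: mmap refuses to map an empty file."""
    spans_by_digest = {}  # digest -> list of (start, end) offsets of kept lines
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        start, size = 0, len(mm)
        while start < size:
            nl = mm.find(b"\n", start)
            end = size if nl == -1 else nl + 1
            line = mm[start:end]  # Python copies here; C would just keep a pointer
            spans = spans_by_digest.setdefault(hashlib.md5(line).digest(), [])
            # Exact compare only against earlier lines with the same digest.
            if not any(mm[s:e] == line for s, e in spans):
                spans.append((start, end))
                yield line
            start = end


# Small demo on a temporary file standing in for the real CSV.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"t1,a\nt2,b\nt1,a\n")
print(list(unique_lines_mmap(tmp.name)))
os.remove(tmp.name)
```

Pages of the mapping that are only touched once can be dropped by the OS, so the resident footprint is mostly the digest map plus whatever earlier lines get re-read for verification.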
# 26  
Old 04-01-2014
I have 16 GB of RAM, and my "disks" are solid state. It's a fast system. The MD5 solution is working well for me.

# 27  
Old 04-01-2014
Yes, I had considered storing the file position against each MD5 sum; when a potential collision occurs, one could then fseek back and re-read the original data to verify the duplicate. This would only double the memory required, but it does require a random-access file, so no stream processing.

As long as the frequency of duplicates is low, I wouldn't expect the verification to slow things down significantly.
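For illustration, the file-position idea might look like this in Python. It is a sketch under assumptions, not tested on the real 3.7GB file: the function name is hypothetical, a second file handle does the seek so the sequential read is not disturbed, and only the first offset per digest is remembered.

```python
import hashlib
import os
import tempfile
from typing import Iterator


def unique_lines_seek(path: str) -> Iterator[bytes]:
    """Yield unique lines, seeking back to verify whenever a digest repeats."""
    offset_by_digest = {}  # MD5 digest -> byte offset of the first such line
    with open(path, "rb") as reader, open(path, "rb") as checker:
        offset = 0
        for line in reader:
            key = hashlib.md5(line).digest()
            first = offset_by_digest.get(key)
            if first is None:
                offset_by_digest[key] = offset
                yield line
            else:
                checker.seek(first)
                if checker.readline() != line:
                    # Genuine collision: keep the line. Simplification: its
                    # offset is not recorded, so a later repeat of this
                    # particular line would be emitted again.
                    yield line
            offset += len(line)


# Small demo on a temporary file standing in for the real CSV.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"t1,a\nt2,b\nt1,a\n")
print(list(unique_lines_seek(tmp.name)))
os.remove(tmp.name)
```

With duplicates at a few percent, the seek-and-reread path runs for only a small fraction of lines, which is why the extra cost should stay modest on random-access storage.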
# 28  
Old 04-01-2014
Duplicates appear to be ~3%.

