Your tip about chunk size was good... I did some benchmarking and found that the Python implementation I use has a flat performance maximum between 1 KB and 32 KB chunks.
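For anyone who wants to reproduce this, a benchmark along these lines can be sketched with nothing but the standard library (a minimal sketch; the file path is a placeholder, and the chunk sizes are just examples):

```python
import hashlib
import os
import time

def hash_file(path, chunk_size):
    """SHA1 of a file, read and hashed in chunks of the given size."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def benchmark(path):
    """Time the same file at several power-of-two chunk sizes."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    for kb in (1, 4, 16, 32, 64, 256, 1024):
        start = time.perf_counter()
        hash_file(path, kb * 1024)
        elapsed = time.perf_counter() - start
        print(f"{kb:>5} KB chunks: {size_mb / elapsed:6.1f} MB/s")

# benchmark("bigfile.bin")  # hypothetical large test file
```

Note that the file should be larger than the OS page cache if you want to measure disk throughput rather than cache speed.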
There are implementations which are designed for easy code maintenance, better portability and readability. Such implementations tend to be object-oriented and generalized, and usually use the most generic functions native to the language/platform.
On the other hand, there are implementations which focus purely on speed/performance. These are usually written in machine code (if you are not familiar with the term, look up Assembly on Wikipedia). They provide outstanding performance, but can be a lot more difficult to integrate and use in other applications.
You have guessed correctly. I have addressed the performance aspect in Hasher v3. It will be a lot faster; I have moved on to more efficient hashing algorithms, partly written in assembly. I might be able to give you a tryout version soon, if you are interested.
By the way, the size of the chunks does not matter that much, as long as it is not too small and not too big. Generally speaking, chunks should be somewhere between 16 KB and 256 KB (powers of two are best), depending on system and hard drive performance. A chunk size of 1 MB will usually perform slower than 64 KB, for example.
By the way, I think there is another little problem in your application: the performance is far below expectations. On today's hardware the calculation of checksums like SHA1 is limited only by file system access.
Test with a 4 GB file:
Hasher 1.20: avg 19.5 MB/s, CPU load 65%
HashMyFiles (NirSoft): avg 40 MB/s, CPU load 65%
my own Python(!) script: avg 53 MB/s, CPU load 30%
File I/O and CPU load monitored with Process Explorer (Sysinternals).
Surprisingly, the Python script performs best and reaches the limit of the relatively slow hard disk in the test system.
So what's the trick?! I don't really know, as I don't have my own code in "real" programming languages to compare against. My script reads 1 MB chunks from disk and uses the update method of the hashlib library, just as recommended in the docs.
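For comparison, a minimal version of such a script might look like this (a sketch of the approach described above, not the exact script):

```python
import hashlib
import sys

CHUNK_SIZE = 1024 * 1024  # 1 MB chunks, as described above

def sha1_of_file(path):
    """Feed 1 MB chunks to hashlib's update(); memory use stays constant."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__" and len(sys.argv) > 1:
    print(sha1_of_file(sys.argv[1]))
```

Reading in fixed chunks keeps memory flat no matter how large the file is, while hashlib's update() does the incremental digest work in C.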
How are you processing the files? How large are the chunks you read and process at once?
Maybe you have already considered this for version 3.0.
I will now update the official release of Hasher, but it will not be the completely redesigned one (version 3.0).
sha1
I think there is a bug in Hasher. The SHA1 for large files (tested with different files of 1 GB and more) is not identical to the hashes calculated with several other checksum tools.
Other tools which produce identical hashes (assumed to be correct):
sha1deep
Fsum
HashMyFiles (NirSoft)
Hash (Robin Keir)
Please check your implementation!
Regards,
sha1