Senior Design Projects
File De-Duplication Tool
In forensic practice, one often has to export a large set of files, many of which are duplicates or known system files. We deal with known system files by comparing the hash of each file to be exported against a database of hashes of known system files. Similarly, we can avoid exporting the same file twice by comparing its hash to the hashes of the files already exported.
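The two hash comparisons described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `file_digest` and `select_for_export` are hypothetical, the known-file database is modeled as an in-memory set of hex digests, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read chunk by chunk so a large file is never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def select_for_export(paths, known_hashes):
    """Return the paths to export, in input order.

    A path is skipped if its digest is in known_hashes (assumed here to be
    a set of hex digests of known system files) or if a file with the same
    digest was already selected.
    """
    seen = set()
    selected = []
    for path in paths:
        digest = file_digest(path)
        if digest in known_hashes or digest in seen:
            continue
        seen.add(digest)
        selected.append(path)
    return selected
```

A real tool would likely keep the known-file hashes in an on-disk index rather than a Python set, since such databases can be large.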
The main problem is the large amount of data (terabytes) and the need for low processing overhead. In addition, the type of hash used might need to change as hash functions such as MD5 become subject to forgery and are no longer accepted by standards bodies.
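One way to keep overhead low while leaving the hash type open is to read each file once and feed every chunk to several hash objects. The sketch below assumes Python's standard `hashlib`; the function name `multi_digest` and the default algorithm list are hypothetical, and some restricted builds may not provide MD5.

```python
import hashlib


def multi_digest(path, algorithms=("md5", "sha256"), chunk_size=1 << 20):
    """Compute several digests of one file in a single pass over its data.

    algorithms holds names accepted by hashlib.new(); changing the hash
    type then means changing this tuple, not the reading code.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Each chunk is read from disk once and fed to every hasher.
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```

Storing the algorithm name next to each digest lets older MD5 records coexist with newer ones during a migration. Since disk reads are likely to dominate at this scale, a second hash should add comparatively little cost, though that would need to be measured.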
A related problem is the automatic classification of files by their contents, and in particular flagging instances where metadata such as magic numbers do not fit the contents.
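A very simple form of such a check compares the type claimed by the file name with the magic number in the leading bytes. The sketch below covers only that case and is an assumption about how a first version might look: the table `SIGNATURES` and the function `magic_mismatch` are hypothetical names, the signature list is deliberately tiny, and real classification by contents would have to look well past the first few bytes.

```python
from pathlib import Path

# Leading-byte signatures for a few common formats (illustrative, not complete).
SIGNATURES = {
    ".png": (b"\x89PNG\r\n\x1a\n",),
    ".jpg": (b"\xff\xd8\xff",),
    ".jpeg": (b"\xff\xd8\xff",),
    ".gif": (b"GIF87a", b"GIF89a"),
    ".pdf": (b"%PDF-",),
    ".zip": (b"PK\x03\x04",),
}


def magic_mismatch(path):
    """Return True if the extension is in SIGNATURES but the leading bytes disagree.

    Returns False for extensions the table does not cover, so False means
    "no mismatch detected", not "file verified".
    """
    expected = SIGNATURES.get(Path(path).suffix.lower())
    if expected is None:
        return False
    longest = max(len(sig) for sig in expected)
    with open(path, "rb") as f:
        head = f.read(longest)
    # bytes.startswith accepts a tuple and matches any of its entries.
    return not head.startswith(expected)
```

For example, a file named `report.png` whose first bytes are `%PDF-` would be flagged, while a `.txt` file would be passed over because the table has no entry for it.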
The project can be extended to provide a Master's Thesis for a member of the 5-year program.
| © 2006 Thomas Schwarz, S.J., COEN, SCU | SCU | COEN | COEN350 | T. Schwarz |