Using SourceHash to deduplicate exported Drive files

When you export Google Docs, Sheets, and Slides files from Drive, the XML metadata file in the export includes a source hash for each file. The source hash is intended to uniquely identify each version of an item, so you can compare multiple exports and remove files with identical content before you analyze the items.
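The comparison across exports can be sketched as follows. This is a minimal illustration, not part of any Google tool: it assumes you have already extracted (source hash, file name) pairs from each export's XML metadata file, and the function name and sample values are hypothetical.

```python
def dedupe_by_source_hash(*exports):
    """Keep the first file seen for each source hash across all exports.

    Each export is an iterable of (source_hash, file_name) pairs that
    you extracted from that export's XML metadata file (assumed step).
    """
    seen = {}
    for export in exports:
        for source_hash, file_name in export:
            # setdefault keeps the earlier entry when the hash repeats.
            seen.setdefault(source_hash, file_name)
    return seen


# Hypothetical sample data standing in for two exports.
export_a = [("hash-1", "a/report.docx"), ("hash-2", "a/budget.xlsx")]
export_b = [("hash-1", "b/report.docx"), ("hash-3", "b/plan.pptx")]

unique = dedupe_by_source_hash(export_a, export_b)
```

Here `b/report.docx` is dropped because its source hash already appeared in the first export.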

The source hash calculation includes the modification date. A file's modification date is updated when its content changes, but it can also be updated by changes that don't affect the content, such as changes to sharing permissions. As a result, two versions of a file can have different source hashes even though their content is identical.
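One possible way to catch versions that differ only in source hash is to also compare a digest of the exported file bytes. The sketch below is an assumption-laden illustration rather than a documented method: it assumes that files with the same content export to byte-identical files, which may not hold for every export format, and the function name and sample data are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_content_duplicates(files):
    """Return groups of file names whose bytes are identical.

    files maps a file name to the exported file's bytes. Assumes
    identical content yields identical bytes (not guaranteed).
    """
    groups = defaultdict(list)
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        groups[digest].append(name)
    # Only groups with more than one member are duplicates.
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical sample data: two versions with the same content.
files = {
    "v1/report.docx": b"same content",
    "v2/report.docx": b"same content",
    "v2/plan.pptx": b"other content",
}

duplicates = find_content_duplicates(files)
```

In practice you would read the bytes from the exported files on disk and run this check only on files that the source hash comparison did not already identify as duplicates.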

