Managing near duplicates

Last updated: 12 June 2024

What are "near-duplicates"?

Frequently, when collecting data relevant to a SAR or investigation, you will end up with many slightly different copies of the same file, such as versions of a particular contract sent back and forth with minor alterations. In these cases, the earlier versions of the document are often not of interest.

This is where near-duplicate detection comes in.

Near-duplicate detection identifies documents that have similar text content by analysing the 'closeness' of their language. Documents whose language is sufficiently similar are considered near-duplicates. Images and other formatting are not taken into consideration when making this determination.
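To give a feel for what 'closeness' of language can mean, here is a minimal sketch in Python. It is an illustration only and is not a description of how the product measures similarity: the word-shingle Jaccard technique, the function names and the 0.8 threshold are all assumptions chosen for the example. Like the feature described above, it looks only at the text and ignores formatting.

```python
import re


def shingles(text, k=3):
    """Return the set of overlapping k-word sequences found in the text."""
    # Lower-case and keep only word characters, so spacing, punctuation
    # and capitalisation do not affect the comparison.
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(text_a, text_b, k=3):
    """Score two texts from 0.0 (no shared shingles) to 1.0 (identical sets)."""
    set_a, set_b = shingles(text_a, k), shingles(text_b, k)
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def is_near_duplicate(text_a, text_b, threshold=0.8):
    """Hypothetical cut-off: the 0.8 threshold is an assumption for this sketch."""
    return jaccard_similarity(text_a, text_b) >= threshold
```

In this sketch, two copies of a contract that differ only in spacing or capitalisation score 1.0, while changing a single word lowers the score in proportion to how short the text is. This is why a threshold is usually tuned against real documents rather than fixed in advance.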

How to remove near-duplicates

1- Click on your box's option menu and select "Cull".

2- Click on the "Near duplicates" tab, select the near-duplicate files that you want to remove from your box, and then click on "Keep newest file".

You can choose between "Keep newest file" and "Keep biggest file".
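The difference between the two options can be sketched as follows. This is a hypothetical illustration in Python, not the product's implementation: the FileInfo fields, the split_group function and the use of the last-modified date to decide which file is "newest" are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FileInfo:
    # Hypothetical record for one file in a near-duplicate group.
    name: str
    modified: datetime
    size_bytes: int


def split_group(group, strategy="newest"):
    """Return (file to keep, files to remove) for one near-duplicate group."""
    if not group:
        raise ValueError("group must contain at least one file")
    if strategy == "newest":
        keep = max(group, key=lambda f: f.modified)    # assumed: last-modified date
    elif strategy == "biggest":
        keep = max(group, key=lambda f: f.size_bytes)  # largest file on disk
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    remove = [f for f in group if f is not keep]
    return keep, remove
```

The two strategies can pick different files from the same group. For example, a later draft with a clause deleted is newer but smaller than the draft before it, so it is worth checking which version you actually need before culling.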
