Near duplicates

Posted almost 2 years ago by Mark Callahan

What are "near-duplicates"?


Frequently, when collecting data relevant to a SAR or investigation, we end up with many slightly different copies of the same file, such as versions of a particular contract sent back and forth with minor alterations.

In these cases we're often not interested in the earlier versions of the documents.


This is where near-duplicate detection comes in. 


Near-duplicate detection identifies documents with similar text content by analysing the 'closeness' of the language. Documents whose language is sufficiently similar are considered near-duplicates. Images and other formatting are not taken into consideration when making this determination.
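To make the idea of text 'closeness' more concrete, here is a minimal Python sketch of one common way to compare documents: split each text into overlapping three-word sequences ("shingles") and measure how many the two texts share (Jaccard similarity). This is an illustration only. The article does not say which algorithm the product uses, and the function names, the shingle size and the 0.8 threshold are all assumptions made for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) in text.

    Only the words are used, mirroring the idea that images and
    formatting are ignored. The value k=3 is an illustrative assumption.
    """
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items / all items."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.8, k=3):
    """True if the two texts are at least `threshold` similar.

    The 0.8 default is hypothetical, not a documented product setting.
    """
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold


v1 = "The supplier shall deliver the goods within thirty days of the order date"
v2 = "The supplier shall deliver the goods within forty days of the order date"
score = jaccard(shingles(v1), shingles(v2))
print(f"similarity between contract versions: {score:.2f}")
```

Changing a single word alters every shingle that contains it, so short texts score lower than you might expect. On longer documents a small edit affects a much smaller share of the shingles, which is why lightly revised contract versions tend to score as very similar.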


How to remove near-duplicates


1- Click on your box's option menu and select "Cull".


 


2- Click on the "Near duplicates" tab, select the near-duplicate files that you want to remove from your box, and click on "Keep newest file".


You can choose between "Keep newest file" and "Keep biggest file".
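The difference between the two options can be sketched in a few lines of Python. This is a hypothetical model, not the product's implementation: the `FileInfo` fields, the `cull` function and the strategy names are invented for illustration, and it assumes "newest" means the most recent modification date and "biggest" means the largest file size.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FileInfo:
    # Hypothetical record for one file in a group of near-duplicates.
    name: str
    modified: datetime
    size_bytes: int


def cull(group, strategy="newest"):
    """Pick one file to keep from a near-duplicate group.

    Returns (kept_file, removed_files). Assumes "newest" compares
    modification dates and "biggest" compares file sizes.
    """
    if strategy == "newest":
        keeper = max(group, key=lambda f: f.modified)
    elif strategy == "biggest":
        keeper = max(group, key=lambda f: f.size_bytes)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    removed = [f for f in group if f is not keeper]
    return keeper, removed


group = [
    FileInfo("contract_v1.docx", datetime(2023, 1, 5), 48_000),
    FileInfo("contract_v2.docx", datetime(2023, 1, 9), 52_000),
    FileInfo("contract_v3.docx", datetime(2023, 1, 12), 50_500),
]

kept, removed = cull(group, "newest")
print("newest kept:", kept.name)
kept, removed = cull(group, "biggest")
print("biggest kept:", kept.name)
```

In this made-up group the two options keep different files: the latest revision is not the largest one, so it is worth checking which option suits your data before culling.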

 



