Near-Duplicate Detection
Let’s say you have a morbid fascination with the old journalistic standby, the bus plunge story. You want to build a Twitter bot that tweets every time a bus plunges to its fate. (Buses always “plunge”. They never simply “fall”.) So you build a system that hoovers up as many news articles as it can, searching for the words “bus plunges”. After a brief interlude of “Powerhouse B”, you find 437 articles published today about bus plunges. “Wow! That’s a lot!” you think. But as you start reading the articles, you’re overcome with a sense of déjà vu.