
Here at URX we are building the world's first mobile app search API. Backing this API is a search engine with a large corpus of web documents, meticulously maintained and grown by crawling internet content. We've found that a functional crawler can be built relatively cheaply, but a robust crawler requires overcoming several technical challenges. In this series of blog posts, we will walk through a few of those challenges, including content deduplication, link prioritization, feature extraction, and re-crawl estimation.

In this first installment, I will walk through the problem of duplicate web content.
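
To set the stage, here is a minimal sketch of the simplest form of the problem: catching exact duplicates by hashing normalized page text. The function names and approach below are illustrative assumptions, not URX's actual pipeline; near-duplicates, the harder case this post is concerned with, require fuzzier techniques than an exact hash.

```python
import hashlib

def content_fingerprint(html_text: str) -> str:
    """Hash normalized page text for exact-duplicate detection (illustrative)."""
    # Collapse whitespace and lowercase so trivial formatting
    # differences don't defeat the hash.
    normalized = " ".join(html_text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen = set()

def is_duplicate(html_text: str) -> bool:
    """True if an identical (post-normalization) page was already crawled."""
    fp = content_fingerprint(html_text)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```

An exact hash like this breaks as soon as two pages differ by a single byte (a timestamp, a session ID in a template), which is why real deduplication leans on similarity measures rather than equality checks.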

