
Here at URX we are building the world's first mobile app search API. Backing this API is a search engine with a large corpus of web documents, meticulously maintained and grown by crawling internet content. We've found that a functional crawler can be built relatively cheaply, but a robust crawler requires overcoming several technical challenges. In this series of blog posts, we will walk through a few of those challenges, including content deduplication, link prioritization, feature extraction, and re-crawl estimation.

In this first installment, I will walk through the problem of duplicate web content.
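
To set the stage, here is a minimal sketch of the simplest form of the problem: catching exact duplicates by hashing normalized page text. The function names and approach below are illustrative assumptions, not URX's actual pipeline; near-duplicates, the harder case this post is concerned with, require fuzzier techniques than an exact hash.

```python
import hashlib

def content_fingerprint(html_text: str) -> str:
    """Hash normalized page text for exact-duplicate detection (illustrative)."""
    # Collapse whitespace and lowercase so trivial formatting
    # differences don't defeat the hash.
    normalized = " ".join(html_text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen = set()

def is_duplicate(html_text: str) -> bool:
    """True if an identical (post-normalization) page was already crawled."""
    fp = content_fingerprint(html_text)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```

An exact hash like this breaks as soon as two pages differ by a single byte (a timestamp, a session ID in a template), which is why real deduplication leans on similarity measures rather than equality checks.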

