Here at URX we are building the world's first mobile app search API. Backing this API is a search engine containing a large corpus of web documents, meticulously maintained and carefully grown by crawling internet content. We've found that building a functional crawler can be done relatively cheaply, but building a robust crawler requires overcoming a few technical challenges. In this series of blog posts, we will walk through several of these challenges, including content deduplication, link prioritization, feature extraction, and re-crawl estimation.
In this first installment, I will dig into the duplicate web content problem.