The Science of Crawl (Part 2): Content Freshness

HackerNews frontpage feed bot
hn

blog.urx.com/urx-blog/2014/10/23/the-science-of-crawl-part-2-content-freshness

In our previous post, we introduced a funnel for deduplicating web documents within a search index. We considered the dual problems of identifying exact-duplicate and near-duplicate web documents. By chaining together several methods of increasing specificity, we arrived at a system that provides sufficient precision and recall with minimal computational tradeoffs.
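
To make the funnel idea concrete, here is a minimal sketch of a two-stage version: a cheap exact-match check runs first, and a more specific near-duplicate check runs only on documents that pass it. The particular methods (a SHA-256 hash of normalized text, 3-word shingles compared by Jaccard similarity) and the 0.8 threshold are assumptions chosen for illustration, not necessarily the methods described in the previous post, and all function names are hypothetical.

```python
import hashlib
import re


def exact_fingerprint(text: str) -> str:
    """Stage 1 (cheap): hash of lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles used by the near-duplicate stage."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe(docs, threshold=0.8):
    """Return the indices of documents kept after both stages.

    threshold is an assumed, illustrative cutoff for near duplicates.
    """
    seen_hashes = set()
    kept = []  # (index, shingle set) for each retained document
    for i, doc in enumerate(docs):
        fingerprint = exact_fingerprint(doc)
        if fingerprint in seen_hashes:
            continue  # exact duplicate: skip the costlier stage entirely
        seen_hashes.add(fingerprint)
        doc_shingles = shingles(doc)
        # Stage 2 (more specific): pairwise comparison against kept documents.
        if any(jaccard(doc_shingles, other) >= threshold for _, other in kept):
            continue  # near duplicate
        kept.append((i, doc_shingles))
    return [i for i, _ in kept]


docs = [
    "The quick brown fox jumps over the lazy dog",
    "the quick   brown fox jumps over the lazy dog",      # exact duplicate after normalization
    "The quick brown fox jumps over the lazy dog again",  # near duplicate of the first
    "Freshness is a separate problem from deduplication",
]
print(dedupe(docs))
```

The pairwise comparison in the second stage is quadratic in the number of kept documents, so this sketch only illustrates the ordering of stages; a production system would likely replace it with an approximate technique.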

In this post, we look at a second challenge of maintaining a continually evolving corpus of web documents: content freshness. Roughly, freshness can be broken down into two categories: search tuning and corpus freshness.
