Here at URX, we've built the world's first deep link search API. Developers who integrate our API can monetize their apps or websites with deep links to contextually relevant actions in other apps. To ensure the content we surface on publisher sites and applications is relevant, we've built a search engine on top of a large web corpus. Publisher pages are used as search queries to discover the most relevant third-party documents within our corpus. This corpus of web documents has been meticulously maintained and carefully grown by crawling internet content. We've found that a functional crawler can be built relatively cheaply, but a robust crawler requires overcoming a few technical challenges.
This is Part 3 in our series "The Science of Crawl", addressing the technical challenges around growing and maintaining a web corpus. In Part 1, we introduced a funnel for deduplicating web documents within a search index, considering the dual problems of exact-duplicate and near-duplicate web document identification. By chaining together several methods of increasing specificity, we arrived at a system that provides sufficient precision and recall with minimal computational trade-offs. In Part 2, we visited the problem of balancing resource allocation between crawling new pages and revisiting stale ones.
In this post, we look at the challenge of prioritizing which web documents to crawl first. To ensure our search engine contains relevant results for our publishers, we need a crawler that continuously discovers new content. To feed new content into the search engine, a persistent web crawler extracts previously unseen links from the pages it visits and pushes them onto a priority queue. Links are then popped from the queue and traversed, and the resulting content is downloaded and indexed into the search engine.
Naively, links in the queue could be ordered temporally, following the classic First-In-First-Out (FIFO) paradigm. However, given the scale at which URX crawls web content, there are hundreds of millions of uncrawled links queued at any given time. At that volume, important documents are likely to be scattered arbitrarily across the queue, uncorrelated with temporal ordering. There is a need to assess the potential value of a link before it is even pushed onto the queue.
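To make the mechanics concrete, here is a minimal sketch of such a queue using Python's heapq module. The score_link function is a hypothetical placeholder for the prioritization strategies discussed in the rest of this post; with a constant score the queue degenerates to FIFO.

import heapq
import itertools

def score_link(url, anchor_text=''):
    # Hypothetical stand-in for the query-relevance / importance estimates discussed below.
    return 1.0

class CrawlQueue(object):
    """Priority queue of uncrawled links; higher-scoring links pop first."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserves FIFO order among equal scores

    def push(self, url, anchor_text=''):
        # heapq is a min-heap, so negate the score to pop the best link first.
        heapq.heappush(self._heap, (-score_link(url, anchor_text), next(self._counter), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url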
Research within the field of prioritization can be broken into two areas: query relevance and query-independent importance. Query relevance refers to biasing a crawler to download content "most relevant" to the content being actively searched. Query-independent importance refers to leveraging the graph structure of web links to encourage an "importance"-based prioritization. There are tradeoffs to both. For query-relevance methods, it is difficult to evaluate the correlation between unseen links and search intents; the only information available is often the link itself and its surrounding anchor text. On the other hand, query-independent methods typically represent the web as a graph structure, where vertices are web pages and edges are links. This formulation yields a natural interpretation of importance as connectivity within the graph. However, crawling new pages using graph-based importance may not necessarily correlate with an improvement in overall user experience. Ultimately, a solution which combines and balances both methodologies may yield higher accuracy.
Query-independent importance
The first facet of prioritization estimates page importance independent of user search queries. By representing the web as a large graph, classic methods can be used to discover important nodes. Larry Page's famous PageRank is perhaps the most popular method for estimating the importance of pages. The method seeks to solve the iterative equation (taken from Wikipedia)
$$ PR(u) = \sum_{v\in B_u}{\frac{PR(v)}{L(v)}} $$
where $PR(u)$ is the relative rank of a page $u$, $B_u$ is the set of all pages pointing to $u$, and $L(v)$ is the number of out-links in page $v$.
At a high level, a page is important if: 1) many pages point to it and 2) the incoming pages are themselves important. The PageRank problem is well studied. It was discovered that solutions of this iterative system are degenerate, often converging to the page with highest rank. To fix this issue, Page and Brin added a small connection weight between all pairs of pages to create a more stable system. This is realized by adding a damping factor $d$, normalized by the number of pages $N$, to the above equation:
$$ PR(u) = \frac{1-d}{N} + d\sum_{v\in B_u}{\frac{PR(v)}{L(v)}} $$
Below is a simple PageRank implementation, taken from one of Apache Spark's examples, with a damping factor of 0.85:
In [18]:
""" This is an example implementation of PageRank. For more conventional use, Please refer to PageRank implementation provided by graphx """
import re import sys from operator import add
from pyspark import SparkContext
def computeContribs(urls, rank): """Calculates URL contributions to the rank of other URLs.""" num_urls = len(urls) for url in urls: yield (url, rank / num_urls)
def parseNeighbors(urls): """Parses a urls pair string into urls pair.""" parts = re.split(r'\s+', urls) return parts[0], parts[1]
if name == "main": if len(sys.argv) != 3: print >> sys.stderr, "Usage: pagerank <file> <iterations>" exit(-1)
print >> sys.stderr, """WARN: This is a naive implementation of PageRank and is given as an example! Please refer to PageRank implementation provided by graphx""" # Initialize the spark context. sc = SparkContext(appName="PythonPageRank") # Loads in input file. It should be in format of: # URL neighbor URL # URL neighbor URL # URL neighbor URL # ... lines = sc.textFile(sys.argv[1], 1) # Loads all URLs from input file and initialize their neighbors. links = lines.map(lambda urls: parseNeighbors(urls)).distinct().groupByKey().cache() # Loads all URLs with other URL(s) link to from input file and initialize ranks of them to one. ranks = links.map(lambda (url, neighbors): (url, 1.0)) # Calculates and updates URL ranks continuously using PageRank algorithm. for iteration in xrange(int(sys.argv[2])): # Calculates URL contributions to the rank of other URLs. contribs = links.join(ranks).flatMap( lambda (url, (urls, rank)): computeContribs(urls, rank)) # Re-calculates URL ranks based on neighbor contributions. ranks = contribs.reduceByKey(add).mapValues(lambda rank: rank * 0.85 + 0.15) # Collects all URL ranks and dump them to console. for (link, rank) in ranks.collect(): print "%s has rank: %s." % (link, rank) sc.stop()
For those who enjoy Scala, a more efficient PageRank example can be found among the Apache GraphX examples.
Unfortunately, adding a damping weight between every pair of pages makes the PageRank adjacency matrix prohibitively large and dense; it quickly becomes impossible to store in memory. Many have studied ways to improve the PageRank calculation. One popular approach is Adaptive Online Page Importance Computation (Adaptive OPIC), which modifies the above system of equations to avoid holding the full PageRank adjacency matrix in memory. Adaptive OPIC has been shown to approximate the PageRank score after a sufficient number of iterations. Ricardo Baeza-Yates et al. provide a great comparison of PageRank with OPIC (the non-adaptive variant) and simple back-link counts.
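To give a flavor of how OPIC-style methods sidestep the dense matrix, below is a minimal in-memory sketch of the core "cash" propagation step. It is a simplification of the published algorithm (no crawl-time scheduling, adaptive time windows, or virtual root page), intended only to illustrate the idea.

# Minimal sketch of OPIC-style cash propagation on a small link graph.
# graph maps each page to its list of out-links.
def opic(graph, iterations=100):
    n = float(len(graph))
    cash = dict((page, 1.0 / n) for page in graph)   # cash currently held by each page
    history = dict((page, 0.0) for page in graph)    # total cash seen by each page

    for _ in range(iterations):
        for page, outlinks in graph.items():
            c = cash[page]
            history[page] += c
            cash[page] = 0.0
            # A dangling page redistributes its cash to every page.
            targets = outlinks if outlinks else list(graph)
            share = c / len(targets)
            for target in targets:
                cash[target] += share

    total = sum(history.values())
    return dict((page, history[page] / total) for page in history)

graph = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
print(opic(graph))  # normalized importance estimates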
Query relevant prioritization
The goal of a query-based prioritization scheme is to rank unseen links in the priority queue as a function of their relevance to incoming search queries. One specific implementation is diagrammed below, taken from Figure 4 of Olston's recent paper.
Impact-driven crawl selection steps
At a high level, Olston's goal is to identify "needy" queries, i.e., queries whose top-K results do not contain relevant results. The type of content needed to improve these queries is identified and used to prioritize uncrawled links likely to contain such content. Using this methodology, underperforming queries are continuously supplied with new content.
To achieve this, the first step is to build an index of "needy" queries: those queries that stand to benefit the most from new, diverse content. By Olston's definition, a search query is needy if the average relevancy score of its top-K results is less than a predefined threshold. Defining search relevancy is a subtle topic deserving of its own blog post; I recommend Buettcher's book on Information Retrieval for a full history and quantification of search relevancy. In short, relevancy is commonly calculated as a function of the number of returned documents, BM25 scoring, and click-through rate.
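A minimal sketch of this neediness test, assuming we already have a relevance score for each of a query's current top-K results (however those scores are computed):

def is_needy(top_k_scores, threshold=0.5):
    """A query is needy when the mean relevance of its top-K results is low."""
    if not top_k_scores:
        return True  # no results at all: clearly needy
    return sum(top_k_scores) / float(len(top_k_scores)) < threshold

print(is_needy([0.9, 0.8, 0.7]))  # False: this query is already well served
print(is_needy([0.2, 0.1]))       # True: this query needs new content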
Given an index of needy queries, a function can be constructed to measure the priority of a given uncrawled page, $p$. Olston writes this function as:
$$ P(p, Q) = \sum_{q\in Q} f(q) * I(p, q) $$
Where
$Q$ is the index of needy queries
$f(q)$ is the frequency of a given query
$I(p, q)$ is an indicator of whether an unseen page, $p$, matches query, $q$
Pages matching needy, frequently searched queries will receive the largest prioritization scores. Similarly, pages matching only queries that are infrequently searched, or that already return rich results, will be de-prioritized.
To measure $I(p, q)$, Olston describes an interesting approach which seeks to jointly maximize the expected priority across all unseen pages and queries. Unfortunately, in the worst case this full maximization is NP-hard. To reduce complexity, Olston suggests w-shingling pages and queries. If the w-shingles of the query and page overlap in at least a fraction $\rho$ of the shingles, $I(p, q)$ is 1; otherwise $I(p, q)$ is 0.
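The sketch below shows one possible reading of this approximation, applied to already-tokenized pages and queries; the choice of $w$, $\rho$, and the overlap definition (here, the fraction of query shingles found in the page) are assumptions for illustration.

def shingles(tokens, w=2):
    """Set of w-shingles (contiguous token windows) of a token list."""
    if len(tokens) < w:
        return set([tuple(tokens)])
    return set(tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1))

def matches(page_tokens, query_tokens, w=2, rho=0.5):
    """Approximate I(p, q): 1 if at least rho of the query's shingles appear in the page."""
    q_sh = shingles(query_tokens, w)
    p_sh = shingles(page_tokens, w)
    return 1 if len(q_sh & p_sh) / float(len(q_sh)) >= rho else 0

def priority(page_tokens, needy_queries, w=2, rho=0.5):
    """P(p, Q) = sum over needy queries q of f(q) * I(p, q)."""
    return sum(freq * matches(page_tokens, q_tokens, w, rho)
               for q_tokens, freq in needy_queries)

# needy_queries is a hypothetical list of (tokenized query, frequency) pairs.
needy = [(['samantha', 'bee'], 120), (['daily', 'show', 'news'], 45)]
print(priority(['news', 'team', 'samantha', 'bee'], needy))  # 120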
There is a catch-22 here. If a page has not yet been crawled, its content is unknown; however, the decision to crawl a page depends on its content. To approximate a solution, Olston represents an uncrawled "page" as the tokens generated from its URL along with any surrounding anchor text. For example, in the following article from The New York Times, we see a link:
<a href="Samantha">http://thedailyshow.cc.com/news-team/samantha-bee">Samantha Bee</a>
which can be tokenized into the following words
p = [news, team, samantha, bee]
We can write a custom tokenizer that splits URLs and anchor text into "words" on non-alphanumeric characters:
In [1]:
def clean_get_unicode(raw):
    """Collapse digits, punctuation, and repeated whitespace into single spaces."""
    if not raw:
        return ''
    badchrs = set(['.', ',', ';', '\'', '\\', '/', '!', '@', '#', '$', '%',
                   '^', '&', '*', '(', ')', '~', '`', '-', '{', '}', '[', ']',
                   ':', '<', '>', '?', '\n', '\t', '\r', '"'])
    prevWasBad = False
    prevWasSpace = False
    ret = []
    for c in raw:
        if not c.isdigit() and not c in badchrs:
            if c == ' ':
                if prevWasSpace:
                    continue
                prevWasSpace = True
            else:
                prevWasSpace = False
            prevWasBad = False
            ret.append(c)
        else:
            if prevWasBad or prevWasSpace:
                continue
            else:
                prevWasBad = True
                ret.append(' ')
                prevWasSpace = True
    return ''.join(ret)
In [2]:
url = 'http://thedailyshow.cc.com/news-team/samantha-bee'
atext = 'Samantha Bee'

url_words = clean_get_unicode(url.lower()).split(' ')
atext_words = clean_get_unicode(atext.lower()).split(' ')
page = list(set(url_words).union(set(atext_words)))
print page
['http', 'cc', 'samantha', 'team', 'bee', 'thedailyshow', 'news', 'com']
We can clean the remaining tokens further by removing common URL words (http, www, com, etc.) and filtering the remaining words against an English dictionary, as in the sketch below. Query tokenization is performed using the same code. A mapping from a query to the set of relevant pages for that query can then be constructed; such mappings are referred to as needy query sketches.
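For illustration, here is a minimal version of that cleanup step; the stop list and the tiny stand-in dictionary are assumptions, not the lists we use in production.

# Illustrative stop list of URL boilerplate and a toy stand-in for an English word list.
URL_STOPWORDS = set(['http', 'https', 'www', 'com', 'org', 'net', 'html'])
ENGLISH_WORDS = set(['news', 'team', 'daily', 'show'])

def filter_tokens(tokens, require_dictionary=False):
    tokens = [t for t in tokens if t and t not in URL_STOPWORDS]
    if require_dictionary:
        tokens = [t for t in tokens if t in ENGLISH_WORDS]
    return tokens

print(filter_tokens(['http', 'cc', 'samantha', 'team', 'bee', 'thedailyshow', 'news', 'com']))
# ['cc', 'samantha', 'team', 'bee', 'thedailyshow', 'news']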
There are many ways to construct needy query sketches. One simple methodology is to index the set of needy queries and their frequencies using a tool like Elasticsearch. Page tokens are then used as a search query, and the frequencies of the returned needy queries are summed.
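A minimal sketch of that approach using the Python Elasticsearch client is shown below. The needy_queries index, its text and frequency fields, and the example documents are assumptions for illustration; the exact client arguments (e.g., whether a document type is required) vary with the Elasticsearch version.

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Index needy queries together with their observed search frequencies (hypothetical schema).
es.index(index='needy_queries', body={'text': 'samantha bee', 'frequency': 120})
es.index(index='needy_queries', body={'text': 'daily show news', 'frequency': 45})

def sketch_priority(page_tokens, k=50):
    """Search the needy-query index with the page tokens and sum the
    frequencies of the matching needy queries."""
    res = es.search(index='needy_queries', body={
        'size': k,
        'query': {'match': {'text': ' '.join(page_tokens)}},
    })
    return sum(hit['_source']['frequency'] for hit in res['hits']['hits'])

print(sketch_priority(['news', 'team', 'samantha', 'bee']))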
Combining strategies
A final priority weighting can be determined using a linear combination of the query-independent and query-dependent relevance scores. The exact weighting of the two can be determined through classic machine learning for ensemble fusion. The objective function is problem-dependent, but is likely a function of overall search relevance and user click-through rates.
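In the simplest case this fusion is just a weighted sum, as in the sketch below; the weight alpha is a hypothetical placeholder that would be learned from relevance or click-through data, and both scores are assumed to be normalized to comparable ranges first.

def combined_priority(importance_score, query_score, alpha=0.7):
    """Linear fusion of a query-independent importance score (e.g., PageRank/OPIC)
    and a query-dependent score (e.g., the needy-query priority above)."""
    return alpha * query_score + (1.0 - alpha) * importance_score

print(combined_priority(importance_score=0.4, query_score=0.9))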
Questions or comments about the content? Join the discussion on Hacker News or email us at research@urx.com
URX is hiring! View our job openings to learn more and apply to join our team.