yahoo-security.tumblr.com/post/119615486115/spidering-techniques-for-content-...

In this article I will talk about web spidering, why it is useful when pentesting a rich web application, and different spider techniques the Yahoo! pentest team has put to use.

Often one of the most useful things you can do at the start of a pentest is enumerate all of the available attack surface of an application. An application’s attack surface stretches beyond its intended use, so as a pentester you look for entry points a developer may not have considered or may have forgotten about. One of the fastest ways to discover content on web applications is through spidering. A spider is a tool that crawls a website looking for all the available content. There are a few different ways to discover content:

- Static Content
- Dirbuster
- HTTP Method
- Ascension Fuzz
- Query Fuzz
- Cookie Fuzz
- Robots.txt / Sitemap.xml
- RIA Checks
- UserAgent
- Regexp path/url
- Public cache search
- /status

Static Content

The most common technique for spidering is the use of page elements as seeds for further exploration. Here we parse the HTML and look for any element that has a link we have not yet seen. A good list of elements to look for might be the following, where tag[attr] means <tag attr="URL">:

- a[href]
- link[href]
- script[src]
- img[src]
- iframe[src]
- object[data]
- embed[src]
- frame[src]
- source[src]
- form[action]
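As a rough illustration of the element list above, here is a minimal sketch using only Python's standard library. The class name SeedExtractor and the shape of the dictionaries it builds are assumptions made for this example, not part of any tool named in the article. It resolves each discovered URL against the page URL and, for forms, also records the method and the input name/value pairs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# tag -> attribute that carries a URL (the tag[attr] pairs listed above)
URL_ATTRS = {
    "a": "href", "link": "href", "script": "src", "img": "src",
    "iframe": "src", "object": "data", "embed": "src", "frame": "src",
    "source": "src",
}


class SeedExtractor(HTMLParser):
    """Collect link seeds and form descriptions from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()   # absolute URLs found in tag[attr] pairs
        self.forms = []      # one dict per <form>
        self._form = None    # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # A missing action submits to the page itself; a missing
            # method defaults to GET in HTML.
            self._form = {
                "action": urljoin(self.base_url, attrs.get("action") or ""),
                "method": (attrs.get("method") or "GET").upper(),
                "fields": {},
            }
            self.forms.append(self._form)
        elif tag == "input":
            if self._form is not None and attrs.get("name"):
                self._form["fields"][attrs["name"]] = attrs.get("value") or ""
        elif tag in URL_ATTRS and attrs.get(URL_ATTRS[tag]):
            self.links.add(urljoin(self.base_url, attrs[URL_ATTRS[tag]]))

    def handle_endtag(self, tag):
        if tag == "form":
            self._form = None
```

A spider would feed each fetched page to a fresh SeedExtractor and queue any entry in links that it has not already visited.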

The form[action] requires a little extra attention. You want to make sure you also grab all input[name,value] pairs, and use the correct form[method]. Forms commonly use POST as their method, but it can also be GET, which is the default when no method is specified.

Dirbuster

The name of this method comes from a popular tool that brute forces web directories for possible valid URLs. A lot of websites have similar pages in their webroots, so page names can sometimes be guessed. For example, a lot of Apache httpd instances have a /cgi-bin/ directory or info.php scripts. Information panel websites commonly have a config.html or an /admin directory. The Dirbuster technique uses a list of common directories and files, and checks to see if any of them actually exist. It’s crude, but it works well.

HTTP Method

HTTP uses method verbs for specifying what type of action to take. The four most common verbs are:

- POST - Create content/Perform action
- GET - Return content
- PUT - Update content
- DELETE - Remove content

In older webstacks you’ll normally only see GET requests, plus POST on forms. In newer webstacks, more of the HTTP verbs are commonly used, particularly in RESTful APIs.

There are approximately 30 different verbs, but most are not supported by the majority of webstacks.

We begin by checking for the following: OPTIONS, GET, HEAD, POST, PUT, DELETE, TRACE, CONNECT, FOOBAR

FOOBAR is not a valid HTTP method, but it can be useful to analyze the error the web server returns for it.
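A minimal sketch of that first pass, using Python's standard http.client module. The function name probe_methods and its parameters are illustrative assumptions, not taken from the team's tooling. It sends one request per verb on a fresh connection and records the status line and any Allow header.

```python
import http.client

# The verbs checked first, including the deliberately invalid FOOBAR.
METHODS = ["OPTIONS", "GET", "HEAD", "POST", "PUT",
           "DELETE", "TRACE", "CONNECT", "FOOBAR"]


def probe_methods(host, path="/", methods=METHODS, port=443,
                  use_tls=True, timeout=10):
    """Return {method: (status, reason, allow_header)} for each verb tried."""
    if use_tls:
        conn_cls = http.client.HTTPSConnection
    else:
        conn_cls = http.client.HTTPConnection
    results = {}
    for method in methods:
        # New connection per verb so one failed request cannot affect the next.
        conn = conn_cls(host, port, timeout=timeout)
        try:
            conn.request(method, path)
            resp = conn.getresponse()
            resp.read()  # drain the body before closing
            results[method] = (resp.status, resp.reason,
                               resp.getheader("Allow"))
        except (OSError, http.client.HTTPException) as exc:
            # Record the failure instead of aborting the whole probe.
            results[method] = (None, repr(exc), None)
        finally:
            conn.close()
    return results
```

Comparing each status with the one returned for FOOBAR is one way to spot verbs that the server handles differently from an unknown method.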

Some other verbs that may be supported: ACL, BASELINE-CONTROL, BCOPY, BDELETE, BMOVE, BPROPFIND, BPROPPATCH, CHECKIN, CHECKOUT, COPY, LABEL, LOCK, MERGE, MKACTIVITY, MKCOL, MKWORKSPACE, MOVE, NOTIFY, ORDERPATCH, PATCH, POLL, PROPFIND, PROPPATCH, REPORT, RPC_IN_DATA, RPC_OUT_DATA, SEARCH, SUBSCRIBE, UNCHECKOUT, UNLOCK, UNSUBSCRIBE, UPDATE, VERSION-CONTROL, X-MS-ENUMATTS

Ascension Fuzz

The idea of ascension fuzz is to move up the path in the URL to find new content. For example, if we found content here:

    https://www.yahoo.com/v3/content/static/handler/okaythatsenough

We should also check for content at:

    https://www.yahoo.com/v3/content/static/handler/
    https://www.yahoo.com/v3/content/static/
    https://www.yahoo.com/v3/content/
    https://www.yahoo.com/v3/
    https://www.yahoo.com/

This strategy is useful for finding open directory listings, which can be a gold mine of additional content to spider.

Query Fuzz

We can also fuzz the query string. Normally query strings come in the form of ?a=1&b=2&c=3 at the end of the URL. This technique relies on swapping out these values for other strings and submitting the request. For example, we could try these GET requests:

    ?a=1&b=2&c=3
    ?a=1&b=2&c=false
    ?a=1&b=2&c=;ls
    ?a=1&b=2&c=ls
    ?a=1&b=2&c=/etc/passwd
    ?a=1&b=2&c=NULL
    ?a=1&b=2&c=%s%s%s%stest

It’s also useful to submit query parameters such as admin=true or debug=1. Every once in a while you’ll get a hit, and it’s another gold mine of information (and possibly a privilege escalation).

Cookie Fuzz

Web browsers maintain session information through HTTP cookies. Your browser hides this from you, but on the back end websites are telling your browser: “Every time you make a request to my website, send the value
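Looking back at the Ascension Fuzz and Query Fuzz sections above, both come down to generating candidate URLs from one already known. Here is a minimal sketch with Python's urllib.parse. The function names ascend and fuzz_query are assumptions for this example, and mutating every parameter in turn (rather than only the last one, as in the listing above) is a choice made here, not something the article prescribes.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Replacement values taken from the query fuzz examples above.
PAYLOADS = ["false", ";ls", "ls", "/etc/passwd", "NULL", "%s%s%s%stest"]


def ascend(url):
    """Return every parent directory of url, nearest first, ending at the root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    for i in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:i]) + ("/" if i else "")
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


def fuzz_query(url, payloads=PAYLOADS):
    """Return copies of url with one query value at a time swapped for a payload."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    candidates = []
    for i, (name, _old_value) in enumerate(params):
        for payload in payloads:
            mutated = params[:i] + [(name, payload)] + params[i + 1:]
            # urlencode percent-encodes the payload (";ls" becomes "%3Bls");
            # the server sees the same value once it decodes the query string.
            candidates.append(urlunsplit(parts._replace(query=urlencode(mutated))))
    return candidates
```

Calling ascend on the okaythatsenough URL from the ascension example returns the five parent URLs listed there, in the same order.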

