Why aren't all my pages crawled?
Oh Dear crawls websites to report broken links and mixed content. In some circumstances we won't crawl all pages. This page explains some of those situations.
Crawl prevented by robots.txt
If you have a robots.txt file with content similar to this, Oh Dear will not crawl a single page on your site.
User-agent: *
Disallow: /
This content tells robots (search engines like Google and Bing, but also our crawler) not to crawl any page on this site, starting from and including the root page.
If you have a robots.txt file like this, we will report 0 pages scanned in Oh Dear.
You can tweak which pages we can and can't crawl in your robots.txt, though, for more fine-grained control.
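To see how a given robots.txt affects a crawler before you deploy it, you can test it locally. The sketch below uses Python's standard `urllib.robotparser`; the user agent name `ExampleCrawler` and the example.com URLs are placeholders, not values taken from Oh Dear, and our crawler's own parsing may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return the URLs that robots_txt allows user_agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Blocks every page, including the root page.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# A more fine-grained rule: only the /admin/ section is off limits.
BLOCK_ADMIN_ONLY = "User-agent: *\nDisallow: /admin/\n"

# Placeholder URLs for illustration only.
URLS = [
    "https://example.com/",
    "https://example.com/blog/post",
    "https://example.com/admin/login",
]

if __name__ == "__main__":
    print(allowed_urls(BLOCK_ALL, "ExampleCrawler", URLS))
    print(allowed_urls(BLOCK_ADMIN_ONLY, "ExampleCrawler", URLS))
```

With the first rule set nothing is crawlable, which matches the 0 pages scanned result described above. With the second, only the `/admin/` URL is excluded.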
Crawl prevented by HTML meta-tag
Additionally, we respect both the robots HTML meta tag and the X-Robots-Tag HTTP header. If we see an HTML tag similar to this, we won't crawl that particular page:
<meta name="robots" content="noindex" />
And here's an example HTTP header that prevents robots from crawling the site:
X-Robots-Tag: noindex
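A page can be checked for both signals with a short script. The following is a minimal sketch using only the Python standard library. It assumes that `noindex` is the directive of interest (as in the meta tag example) and ignores per-bot prefixes in the header; the function name `is_excluded` is hypothetical, and which other directives Oh Dear honours is not covered here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            for part in attr.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def is_excluded(html: str, headers: dict[str, str]) -> bool:
    """True if the meta tag or the X-Robots-Tag header contains noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)

    header_directives = set()
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "x-robots-tag":
            for part in value.split(","):
                if part.strip():
                    header_directives.add(part.strip().lower())

    return "noindex" in (parser.directives | header_directives)


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex" /></head></html>'
    print(is_excluded(page, {}))
    print(is_excluded("<html></html>", {"X-Robots-Tag": "noindex"}))
```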
Rate limits against our crawler
Web servers sometimes implement a feature called rate limiting. It limits the number of requests a particular IP address or User-Agent can make in a given period. Since we crawl websites frequently, our crawlers are sometimes affected by this.
If we receive a 429 Too Many Requests HTTP status code during our crawls, we'll notify you of this in the detailed view of the report.
To resolve this issue, please whitelist our IP addresses so we are no longer rate limited.
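To illustrate why a frequent crawler runs into this, here is a minimal sketch of one common rate limiting approach, a sliding window per client. It is not how any particular web server implements it; the class name, the limits, and the documentation-range IP addresses are all assumptions made for the example.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self.hits = defaultdict(deque)

    def status_for(self, client_key: str) -> int:
        """Return the HTTP status a request from client_key would get.

        client_key is whatever the server limits on, for example the
        IP address or the User-Agent string.
        """
        now = self.clock()
        hits = self.hits[client_key]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return 429  # Too Many Requests
        hits.append(now)
        return 200


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=2, window_seconds=10)
    for _ in range(3):
        print(limiter.status_for("203.0.113.5"))
```

A crawler that requests many pages in quick succession from the same IP address exhausts such a window quickly, which is why whitelisting our IP addresses resolves the issue.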
Limitations of the crawler & broken links checking
We have a few limitations in place to help protect your website when we crawl it.
- We crawl at most 5,000 pages per website added in Oh Dear
- Crawls are limited to a duration of 20 minutes
Whichever limit is hit first (the maximum number of pages or the 20 minutes) will stop our crawl.
This helps protect your site from infinite page loops or excessive load caused by our crawler.
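The interaction of the two limits can be sketched as a crawl loop that checks both budgets before every page. This is an illustration only, not Oh Dear's actual crawler: `fetch_links` is a hypothetical callback that returns the links found on a page, and the breadth-first order is an assumption.

```python
import time
from collections import deque

MAX_PAGES = 5000        # page limit from the list above
MAX_SECONDS = 20 * 60   # 20-minute limit from the list above


def crawl(start_url, fetch_links, max_pages=MAX_PAGES,
          max_seconds=MAX_SECONDS, clock=time.monotonic):
    """Breadth-first crawl that stops at whichever limit is hit first.

    Returns (crawled_urls, reason), where reason is "page limit",
    "time limit" or "finished".
    """
    deadline = clock() + max_seconds
    queue = deque([start_url])
    seen = {start_url}
    crawled = []

    while queue:
        if len(crawled) >= max_pages:
            return crawled, "page limit"
        if clock() >= deadline:
            return crawled, "time limit"
        url = queue.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)

    return crawled, "finished"


if __name__ == "__main__":
    # A site with an infinite page loop: every page links to a new one.
    pages, reason = crawl("https://example.com/", lambda url: [url + "x"],
                          max_pages=5)
    print(len(pages), reason)
```

Even on a site that generates new URLs forever, the loop ends once one of the two budgets is used up.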
If your site has more than 5,000 pages, you can add the site multiple times with different starting points. We start crawling based on the URL you enter. For instance, if you add 5 sites that each use a different section of your website as their URL, each will start crawling on its own and report broken pages based on that entry point.
This gives you more control over where we should start crawling and checking for broken pages.
Increase or decrease the crawl speed
You can change the speed of our crawler to crawl either faster or slower, depending on your preference. You'll find this in the Settings tab of each website.
If you notice excessive server load whenever we crawl your site, you can pick a slower crawl speed.
If you often hit the 20-minute limit, you can increase the speed and concurrency to allow us to crawl more pages within that limit. Please note that your server load will increase as a result, as we'll be making more HTTP(S) requests in the same timespan.
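As a rough back-of-the-envelope model of that trade-off, the number of pages that fit in the time limit grows with concurrency and shrinks with response time. The function below is a simplified estimate with made-up parameter names; it assumes every request takes the same average time and that all concurrent slots stay busy, which real crawls rarely achieve.

```python
def estimated_pages(concurrency: int, avg_seconds_per_request: float,
                    budget_seconds: int = 20 * 60,
                    page_cap: int = 5000) -> int:
    """Estimate how many pages fit in the crawl budget.

    Simplified model: pages = concurrency * budget / average request time,
    capped at the per-site page limit.
    """
    if concurrency < 1 or avg_seconds_per_request <= 0:
        raise ValueError("concurrency must be >= 1 and request time > 0")
    pages = int(concurrency * budget_seconds / avg_seconds_per_request)
    return min(page_cap, pages)


if __name__ == "__main__":
    # One request at a time, 0.5 s per page: 2400 pages in 20 minutes.
    print(estimated_pages(concurrency=1, avg_seconds_per_request=0.5))
    # Four concurrent requests would exceed the page cap, so 5000.
    print(estimated_pages(concurrency=4, avg_seconds_per_request=0.5))
```

In this model, a site that answers in half a second needs more than one concurrent request to reach the page cap within 20 minutes, at the cost of proportionally more requests per second hitting your server.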