As part of our AI features, data from your website can be imported automatically. Our system captures the relevant content by crawling and stores it in Lime Connect. This article explains why it may sometimes not be possible to capture certain URLs.
Reasons why crawling may not be possible
- Robots.txt restriction
  - The robots.txt file can exclude specific areas or the entire website from crawlers (a check sketch follows this list).
- Meta tags / Headers
  - Tags such as <meta name="robots" content="noindex,nofollow"> or an X-Robots-Tag in the HTTP response header prevent indexing or crawling (see the header check sketch after this list).
- Technical barriers
  - IP blocking, CAPTCHAs, or bot protection (e.g., Cloudflare, reCAPTCHA, WAF) can block crawlers.
  - Rate limiting: too many requests in a short time lead to blocking.
- Lack of accessibility
  - Server is down, DNS problems, or timeouts.
- Dynamic content
  - Content is only loaded via JavaScript, which our crawlers cannot process.
- Access rights / Authentication
  - Pages behind logins or paywalls are not accessible to crawlers.
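If you want to verify the robots.txt situation for your own site, the following is a minimal Python sketch using the standard library's robotparser module. The user agent string and URLs are placeholders for illustration, not the identifiers actually used by our crawler.

```python
from urllib.robotparser import RobotFileParser

# Placeholder user agent; the real crawler identifier is an assumption here.
USER_AGENT = "ExampleCrawler"

def is_allowed_by_robots(url: str, robots_url: str) -> bool:
    """Return True if robots.txt permits USER_AGENT to fetch the given URL."""
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # download and parse the robots.txt file
    return parser.can_fetch(USER_AGENT, url)

if __name__ == "__main__":
    allowed = is_allowed_by_robots(
        "https://www.example.com/products/",
        "https://www.example.com/robots.txt",
    )
    print("Crawling allowed:", allowed)
```

If this returns False for the pages you want imported, adjust the Disallow rules in robots.txt or add an explicit Allow rule for the crawler.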
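Whether a page sends blocking directives can likewise be checked by inspecting the robots meta tag and the X-Robots-Tag response header. The sketch below is a simplified illustration with a placeholder URL; real pages may also scope directives to specific user agents, which this check ignores.

```python
import urllib.request
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content of <meta name="robots" ...> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.directives.append((attrs.get("content") or "").lower())

def check_crawl_directives(url: str) -> dict:
    """Return the X-Robots-Tag header and robots meta directives for a URL."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        header = resp.headers.get("X-Robots-Tag", "")
        body = resp.read().decode("utf-8", errors="replace")
    parser = RobotsMetaParser()
    parser.feed(body)
    # noindex/nofollow in either place can keep the content from being captured.
    blocked = "noindex" in header.lower() or any(
        "noindex" in d or "nofollow" in d for d in parser.directives
    )
    return {"x_robots_tag": header, "meta_robots": parser.directives, "blocked": blocked}

if __name__ == "__main__":
    print(check_crawl_directives("https://www.example.com/"))
```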
Important HTTP status codes for crawling
- 2xx (Success)
  - 200 OK: Page loaded successfully.
  - 204 No Content: No content delivered.
- 3xx (Redirects)
  - 301 Moved Permanently: Permanent redirect.
  - 302 Found / 307 Temporary Redirect: Temporary redirect.
  - 304 Not Modified: Page not changed (cache response).
- 4xx (Client errors)
  - 400 Bad Request: Faulty request.
  - 401 Unauthorized: Login required.
  - 403 Forbidden: Access denied (e.g., crawler blocked).
  - 404 Not Found: Page doesn't exist.
  - 410 Gone: Page permanently removed.
  - 429 Too Many Requests: Too many requests, so crawling is blocked.
- 5xx (Server errors)
  - 500 Internal Server Error: Server problem.
  - 502 Bad Gateway: Faulty response from the upstream server.
  - 503 Service Unavailable: Server overloaded or under maintenance.
  - 504 Gateway Timeout: Server doesn't respond in time.
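As a rough illustration of how these codes can be interpreted during a crawl, here is a small Python sketch. The mapping mirrors the list above but is simplified, and it is not the logic our crawler actually uses; note that urlopen follows most redirects automatically, so 3xx codes rarely reach the classification step.

```python
import urllib.request
from urllib.error import HTTPError, URLError

def classify_crawl_response(url: str) -> str:
    """Fetch a URL and map the HTTP status code to a crawl outcome."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            status = resp.status
    except HTTPError as err:   # 4xx/5xx responses raise HTTPError
        status = err.code
    except URLError as err:    # DNS failure, refused connection, timeout
        return f"not reachable: {err.reason}"

    if 200 <= status < 300:
        return f"{status}: success, page can be captured"
    if 300 <= status < 400:
        return f"{status}: redirect, follow the Location header"
    if status in (401, 403):
        return f"{status}: access denied, login required or crawler blocked"
    if status == 429:
        return f"{status}: rate limited, retry later with fewer requests"
    if 400 <= status < 500:
        return f"{status}: client error, page missing or request rejected"
    return f"{status}: server error, retry later"

if __name__ == "__main__":
    print(classify_crawl_response("https://www.example.com/"))
```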