As part of our AI features, data from your website can be imported automatically. Our system captures the relevant content by crawling and stores it in Lime Connect. This article explains why it may sometimes not be possible to capture certain URLs.
Reasons why crawling may not be possible
- Robots.txt restriction
  - The robots.txt file can exclude specific areas or the entire website from crawlers (a check sketch follows this list).
- Meta tags / Headers
  - Tags such as <meta name="robots" content="noindex,nofollow"> or an X-Robots-Tag in the HTTP response header prevent indexing or crawling (see the header check sketch after this list).
- Technical barriers
  - IP blocking, CAPTCHAs, or bot protection (e.g., Cloudflare, reCAPTCHA, WAF) can block crawlers.
  - Rate limiting: too many requests in a short time lead to blocking.
- Lack of accessibility
  - Server is down, DNS problems, or timeouts.
- Dynamic content
  - Content is only loaded via JavaScript, which our crawlers cannot process.
- Access rights / Authentication
  - Pages behind logins or paywalls are not accessible to crawlers.
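If you want to verify the robots.txt situation for your own site, the following is a minimal Python sketch using the standard library's robotparser module. The user agent string and URLs are placeholders for illustration, not the identifiers actually used by our crawler.

```python
from urllib.robotparser import RobotFileParser

# Placeholder user agent; the real crawler identifier is an assumption here.
USER_AGENT = "ExampleCrawler"

def is_allowed_by_robots(url: str, robots_url: str) -> bool:
    """Return True if robots.txt permits USER_AGENT to fetch the given URL."""
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # download and parse the robots.txt file
    return parser.can_fetch(USER_AGENT, url)

if __name__ == "__main__":
    allowed = is_allowed_by_robots(
        "https://www.example.com/products/",
        "https://www.example.com/robots.txt",
    )
    print("Crawling allowed:", allowed)
```

If this returns False for the pages you want imported, adjust the Disallow rules in robots.txt or add an explicit Allow rule for the crawler.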
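Whether a page sends blocking directives can likewise be checked by inspecting the robots meta tag and the X-Robots-Tag response header. The sketch below is a simplified illustration with a placeholder URL; real pages may also scope directives to specific user agents, which this check ignores.

```python
import urllib.request
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content of <meta name="robots" ...> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.directives.append((attrs.get("content") or "").lower())

def check_crawl_directives(url: str) -> dict:
    """Return the X-Robots-Tag header and robots meta directives for a URL."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        header = resp.headers.get("X-Robots-Tag", "")
        body = resp.read().decode("utf-8", errors="replace")
    parser = RobotsMetaParser()
    parser.feed(body)
    # noindex/nofollow in either place can keep the content from being captured.
    blocked = "noindex" in header.lower() or any(
        "noindex" in d or "nofollow" in d for d in parser.directives
    )
    return {"x_robots_tag": header, "meta_robots": parser.directives, "blocked": blocked}

if __name__ == "__main__":
    print(check_crawl_directives("https://www.example.com/"))
```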
Important HTTP status codes for crawling
- 2xx (Success)
  - 200 OK: Page loaded successfully.
  - 204 No Content: No content delivered.
- 3xx (Redirects)
  - 301 Moved Permanently: Permanent redirect.
  - 302 Found / 307 Temporary Redirect: Temporary redirect.
  - 304 Not Modified: Page not changed (cache response).
- 4xx (Client errors)
  - 400 Bad Request: Faulty request.
  - 401 Unauthorized: Login required.
  - 403 Forbidden: Access denied (e.g., crawler blocked).
  - 404 Not Found: Page doesn't exist.
  - 410 Gone: Page permanently removed.
  - 429 Too Many Requests: Too many requests, so crawling is blocked.
- 5xx (Server errors)
  - 500 Internal Server Error: Server problem.
  - 502 Bad Gateway: Faulty response from the upstream server.
  - 503 Service Unavailable: Server overloaded or under maintenance.
  - 504 Gateway Timeout: Server doesn't respond in time.
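As a rough illustration of how these codes can be interpreted during a crawl, here is a small Python sketch. The mapping mirrors the list above but is simplified, and it is not the logic our crawler actually uses; note that urlopen follows most redirects automatically, so 3xx codes rarely reach the classification step.

```python
import urllib.request
from urllib.error import HTTPError, URLError

def classify_crawl_response(url: str) -> str:
    """Fetch a URL and map the HTTP status code to a crawl outcome."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            status = resp.status
    except HTTPError as err:   # 4xx/5xx responses raise HTTPError
        status = err.code
    except URLError as err:    # DNS failure, refused connection, timeout
        return f"not reachable: {err.reason}"

    if 200 <= status < 300:
        return f"{status}: success, page can be captured"
    if 300 <= status < 400:
        return f"{status}: redirect, follow the Location header"
    if status in (401, 403):
        return f"{status}: access denied, login required or crawler blocked"
    if status == 429:
        return f"{status}: rate limited, retry later with fewer requests"
    if 400 <= status < 500:
        return f"{status}: client error, page missing or request rejected"
    return f"{status}: server error, retry later"

if __name__ == "__main__":
    print(classify_crawl_response("https://www.example.com/"))
```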