Baidu Spider crawl diagnosis error: what to do about "socket read/write error"?
If your website has not been indexed by Baidu, first run a crawl diagnosis on the Baidu Search Resource Platform.
What should I do if Baidu's crawl diagnosis fails?
If the crawl diagnosis fails several times in a row, a firewall may be blocking the crawler.
Baidu Search Resource Platform > Crawl Diagnosis > Crawl exception information: socket read/write error
- This is especially likely when using the Cloudflare CDN, which blocks the crawler by default.
- Advice found online is to whitelist the IP range xxx.xxx.xxx.xxx/24.
- However, I tried that and it made no difference.
I had not blocked Baidu Spider on the server itself, so the problem had to be Cloudflare's WAF.
Log in to Cloudflare → Security → WAF → Firewall Rules → Create Firewall Rule
- Among the crawler-related WAF rule options in Cloudflare, I found one for "legitimate bot crawlers".
- After creating the firewall rule, I waited 10 minutes and ran the crawl diagnosis again. Every fetch succeeded.
Why does Baidu's Sitemap crawl fail with a connection timeout?
After you submit the address of your Sitemap file on the Baidu Search Resource Platform, you may see errors such as "crawl failed" and "connection timed out".
Solution when the Baidu crawler fails to fetch the Sitemap
Log in to Cloudflare → Security → WAF → Firewall Rules → Create Firewall Rule
- Field: select "User Agent"
- Operator: select "contains"
- To add another user agent, click "Or" at the end of the last condition
- Value: enter each of the following Baidu Spider user agent (UA) strings, one per condition:
  - Baiduspider/2.0
  - Baiduspider-image
  - Baiduspider-render/2.0
  - http://www.baidu.com/search/spider.html
  - Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
  - Mozilla/5.0 (Linux;u;Android 4.2.2;zh-cn;) AppleWebKit/534.46 (KHTML,like Gecko) Version/5.1 Mobile Safari/10600.6.3 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
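If you would rather paste the rule into Cloudflare's expression editor than click through the form, the conditions above can be sketched as a single expression. The following Python snippet is only an illustration: the helper names `build_expression` and `would_match` are made up for this example, and it assumes Cloudflare's `http.user_agent contains "..."` expression syntax. Because the two full UA strings already contain `Baiduspider/2.0`, the sketch uses only the four shorter values. It does not choose the rule's action, which you still set in the dashboard.

```python
# Substrings taken from the UA list above; each is matched with "contains".
BAIDU_UA_SUBSTRINGS = [
    "Baiduspider/2.0",
    "Baiduspider-image",
    "Baiduspider-render/2.0",
    "http://www.baidu.com/search/spider.html",
]


def build_expression(substrings):
    """Join one `http.user_agent contains "..."` clause per substring with `or`."""
    clauses = []
    for s in substrings:
        # Escape backslashes and double quotes inside the quoted string literal.
        escaped = s.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'(http.user_agent contains "{escaped}")')
    return " or ".join(clauses)


def would_match(user_agent, substrings):
    """Local approximation of the rule: True if any substring occurs in the UA."""
    return any(s in user_agent for s in substrings)


if __name__ == "__main__":
    print(build_expression(BAIDU_UA_SUBSTRINGS))
```

`would_match` is only a local approximation for checking which UA strings the conditions would cover. Cloudflare evaluates the real rule on its own side.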
Once the rule is in place, run the crawl diagnosis again. A returned HTTP header of 200 means the fetch succeeded:
Crawl Diagnosis > Crawl Details
The following are the Baidu Spider crawl results and page information:
- Submitted URL: https://www.etufo.org/sitemap_baidu.xml
- Crawled URL: https://www.etufo.org/sitemap_baidu.xml
- Crawl UA: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
- Crawl time: 2022-11-11 19:03:44
- Site IP: 172.***.***.149
- Download time: 0.868 seconds
- Returned HTTP header: HTTP/2 200
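For a quick check between runs of the crawl diagnosis, you can request the Sitemap yourself while presenting the Baidu Spider UA. This is a minimal sketch using only the Python standard library; the function names `build_request` and `fetch_status` are hypothetical. Keep in mind that the request comes from your own IP address rather than Baidu's, so Cloudflare may treat it differently from the real crawler. The result on the Baidu Search Resource Platform is the one that counts.

```python
import urllib.error
import urllib.request

# UA string as shown in the crawl details above.
BAIDUSPIDER_UA = (
    "Mozilla/5.0 (compatible; Baiduspider/2.0; "
    "+http://www.baidu.com/search/spider.html)"
)


def build_request(url, user_agent=BAIDUSPIDER_UA):
    """Build a GET request that presents the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code the server answers with for this UA."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx answers (for example a 403 from a firewall) raise HTTPError.
        return err.code


if __name__ == "__main__":
    # Replace with the address of your own Sitemap file.
    print(fetch_status("https://www.etufo.org/sitemap_baidu.xml"))
```

A connection timeout or reset surfaces as an unhandled `urllib.error.URLError` or `OSError` here, which is roughly the local counterpart of the "connection timed out" message in the diagnosis.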
You can look up the user agents of other spiders and crawlers and allow them in the same way.
We hope this article from Chen Weiliang Blog ( https://www.chenweiliang.com/ ), "Baidu Spider Crawl Failure Diagnosis Abnormal Information What to Do if Socket Read and Write Error Connection Timed Out", is helpful to you.
Link to this article: https://www.chenweiliang.com/cwl-29315.html