Scrapy - Personal Usage Notes
Debugging Scrapy with PyCharm
1. Create the project.
3. Create main.py in the project root (a launcher script; see the sketch below).
ROBOTSTXT_OBEY = False: this must be set to False, otherwise breakpoint debugging will not run properly; with it left at True, requests can be filtered out by the robots.txt check before your callbacks (and breakpoints) are ever reached.
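A minimal main.py for this, assuming a spider registered under the placeholder name 'myspider'; point PyCharm's run/debug configuration at this file instead of the scrapy command:

```python
# main.py -- lives in the project root, next to scrapy.cfg
from scrapy import cmdline

# 'myspider' is a placeholder: use your spider's `name` attribute
cmdline.execute('scrapy crawl myspider'.split())
```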
Downloader Middleware
1. process_request(request, spider)
Using process_request(request, spider) to modify the User-Agent
```python
def process_request(self, request, spider):
```
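In full, a minimal sketch of such a middleware; the class name and UA pool are my own illustration, not Scrapy built-ins:

```python
import random

class RandomUserAgentMiddleware:
    # Small illustrative pool; in practice load this from settings or a file
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    ]

    def process_request(self, request, spider):
        # Overwrite the header before the request reaches the downloader;
        # returning None lets the request continue down the middleware chain
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None
```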
Using process_response(request, response, spider) to modify the response status code
```python
def process_response(self, request, response, spider):
    # must return a Request object, a Response object, or raise IgnoreRequest
```
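A sketch of the status-rewriting case; Response.replace() is the documented way to derive a modified copy, while the 403-to-200 rule here is only an illustration:

```python
class StatusRewriteMiddleware:
    def process_response(self, request, response, spider):
        # Derive a modified copy rather than mutating the response in place
        if response.status == 403:
            return response.replace(status=200)
        return response
```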
Analysis of the different return values of process_request and process_response
1. When process_request returns a Response, the request does not go through the downloader at all. This is exactly how Scrapy is hooked up to Selenium or Splash: the rendered page content is wrapped into a Response and returned from the middleware, because what the downloader itself fetches is not the rendered data (see the sketch below).
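A minimal sketch of that Selenium pattern (assumes a local ChromeDriver; the class and wiring are illustrative):

```python
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware:
    def __init__(self):
        self.driver = webdriver.Chrome()  # assumes chromedriver on PATH

    def process_request(self, request, spider):
        self.driver.get(request.url)
        body = self.driver.page_source  # HTML after JavaScript has run
        # Returning a Response short-circuits the downloader entirely
        return HtmlResponse(url=request.url, body=body,
                            encoding='utf-8', request=request)
```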
IgnoreRequest
When process_request raises IgnoreRequest
If it raises an IgnoreRequest exception, the process_exception() methods of installed downloader middleware will be called. If none of them handle the exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
When process_response raises IgnoreRequest
If it raises an IgnoreRequest exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
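Either way the request's errback is the last chance to handle it, so that is where an ignored request can at least be logged. A sketch (names and URL are placeholders):

```python
import scrapy
from scrapy.exceptions import IgnoreRequest

class DemoSpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        yield scrapy.Request('https://example.com',
                             callback=self.parse,
                             errback=self.on_error)

    def parse(self, response):
        pass

    def on_error(self, failure):
        # failure is a twisted Failure; check() tests the exception type
        if failure.check(IgnoreRequest):
            self.logger.info('Ignored request: %s', failure.request.url)
```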
process_exception(request, exception, spider)
When it is called: Scrapy calls process_exception() when a download handler or a process_request() method (from a downloader middleware) raises an exception, including an IgnoreRequest exception.
Return value analysis:
If it returns a Response object, the process_response() method chain of installed middleware is started, and Scrapy won’t bother calling any other process_exception() methods of middleware.
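For example, a sketch that turns download timeouts into one transparent reschedule; the class and the retry policy are my own illustration:

```python
from twisted.internet.error import TCPTimedOutError, TimeoutError

class TimeoutRescheduleMiddleware:
    def process_exception(self, request, exception, spider):
        if isinstance(exception, (TimeoutError, TCPTimedOutError)):
            spider.logger.info('Timeout on %s, rescheduling', request.url)
            # Returning a Request stops further process_exception() calls
            # and re-schedules it; dont_filter bypasses the dupefilter
            return request.replace(dont_filter=True)
        return None  # fall through to other middlewares / default handling
```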
from_crawler(cls, crawler)
If present, this classmethod is called to create a middleware instance from a Crawler. It must return a new instance of the middleware. Crawler object provides access to all Scrapy core components like settings and signals; it is a way for middleware to access them and hook its functionality into Scrapy.
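A typical use is pulling configuration out of settings when the middleware is created; USER_AGENT is a standard Scrapy setting, the class itself is illustrative:

```python
class ConfiguredUAMiddleware:
    def __init__(self, user_agent):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes the merged project settings
        return cls(user_agent=crawler.settings.get('USER_AGENT', 'Scrapy'))

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', self.user_agent)
```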
Spider Middleware
Item Pipeline
The settings file
DOWNLOAD_DELAY
1. DOWNLOAD_DELAY: can be used to throttle the crawl speed (the number of seconds to wait between consecutive requests to the same website).
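For example (the value is an illustrative assumption):

```python
# settings.py
DOWNLOAD_DELAY = 2  # wait ~2 seconds between requests to the same site
# By default the actual delay is randomized between 0.5x and 1.5x of this
RANDOMIZE_DOWNLOAD_DELAY = True
```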
DOWNLOADER_MIDDLEWARES_BASE和DOWNLOADER_MIDDLEWARES
The relationship between DOWNLOADER_MIDDLEWARES_BASE and DOWNLOADER_MIDDLEWARES: the BASE setting holds the middlewares enabled by default and should never be modified in your project; DOWNLOADER_MIDDLEWARES is merged with it and the combined dict is sorted by the order values to build the final middleware chain. To disable a built-in middleware, assign None to it in DOWNLOADER_MIDDLEWARES.
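For example ('myproject.middlewares.RandomUserAgentMiddleware' is a placeholder path for your own class):

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 543,
    # Disable the built-in UA middleware so it cannot overwrite ours
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```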
DOWNLOADER_STATS
3. DOWNLOADER_STATS: whether to enable downloader stats collection (default: True).
DOWNLOAD_HANDLERS
Default: {}
A dict containing the request download handlers enabled in your project; see DOWNLOAD_HANDLERS_BASE below for the format.
DOWNLOAD_HANDLERS_BASE
A dict containing the request download handlers enabled by default in Scrapy. You should never modify this setting in your project, modify DOWNLOAD_HANDLERS instead.
You can disable any of these download handlers by assigning None to their URI scheme in DOWNLOAD_HANDLERS. E.g., to disable the built-in FTP handler (without replacement), place this in your settings.py:
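```python
# settings.py (the docs' own example)
DOWNLOAD_HANDLERS = {
    'ftp': None,
}
```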
How to access settings
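The from_crawler section above already shows one way: crawler.settings. Inside a spider the same object is available as self.settings. A minimal sketch (spider name and URL are placeholders):

```python
import scrapy

class SettingsDemoSpider(scrapy.Spider):
    name = 'settings_demo'
    start_urls = ['https://example.com']

    def parse(self, response):
        # self.settings is bound once the spider is attached to its crawler
        delay = self.settings.getfloat('DOWNLOAD_DELAY')
        self.logger.info('DOWNLOAD_DELAY is %s', delay)
```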
coroutine
Many libraries that use coroutines, such as aio-libs, require the asyncio loop and to use them you need to enable asyncio support in Scrapy.
asyncio
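Per the Scrapy docs, asyncio support is enabled by selecting the asyncio-backed Twisted reactor in settings:

```python
# settings.py
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
```

With this reactor installed, `async def` spider callbacks can await asyncio-based libraries directly.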
Improving Scrapy efficiency
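A few settings commonly tuned for throughput; the values below are illustrative assumptions, not recommendations:

```python
# settings.py -- illustrative values, tune against the target site
CONCURRENT_REQUESTS = 32             # default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # default is 8
DOWNLOAD_TIMEOUT = 15                # default is 180; fail slow pages sooner
LOG_LEVEL = 'INFO'                   # DEBUG logging has measurable overhead
```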
A child without an umbrella has no choice but to run hard!