Cui Qingcai's Python 3 Web Scraping, Chapter 13: Scrapy - Personal Usage Notes


Scrapy - Personal Usage Notes

Debugging Scrapy with PyCharm

1. Create a project:
scrapy startproject project_name

2. Create a Spider:
scrapy genspider spider_name start_url
3. Create main.py in the project root directory:

from scrapy.cmdline import execute
import sys
import os

# Add the project root to sys.path so the spider files can be debugged with breakpoints
# sys.path.append('D:\PyCharm\project_name')
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
print(os.path.dirname(os.path.abspath(__file__)))
execute(['scrapy', 'crawl', 'spider_name'])
Be sure to set ROBOTSTXT_OBEY = False in settings.py; otherwise requests disallowed by robots.txt are filtered out and the breakpoint-debugging run will not proceed normally.


Then run (or debug) main.py in PyCharm.

Reference

Downloader Middleware

1. process_request(request, spider)
Cases where it returns None, a Request, a Response, or raises IgnoreRequest

2. process_response(request, response, spider)
Cases where it returns a Request, a Response, or raises IgnoreRequest

3. process_exception(request, exception, spider)
Cases where it returns None, a Response object, or a Request object

Using process_request(request, spider) to modify the User-Agent

def process_request(self, request, spider):
    # Overwrite the User-Agent header before the request is downloaded
    request.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36'

Using process_response(request, response, spider) to modify the response status code

def process_response(self, request, response, spider):
    # Must return a Request object, a Response object, or raise IgnoreRequest
    response.status = 201
    return response

Analysis of the different return values of process_request and process_response

References

1. When process_request returns a Response, the request does not go through the downloader at all. When integrating Scrapy with Selenium or Splash, this is exactly the mechanism used: the rendered page content is wrapped in a Response and returned, because what the downloader itself fetches is not the rendered data.
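A minimal sketch of that pattern, assuming Selenium with headless Chrome; the middleware name and the choice to render every request are illustrative, not the book's exact code:

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumRenderMiddleware:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        self.driver = webdriver.Chrome(options=options)

    def process_request(self, request, spider):
        # Returning a Response here short-circuits the downloader entirely
        self.driver.get(request.url)
        return HtmlResponse(url=request.url, body=self.driver.page_source,
                            encoding='utf-8', request=request)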

IgnoreRequest

When process_request raises IgnoreRequest

If it raises an IgnoreRequest exception, the process_exception() methods of installed downloader middleware will be called. If none of them handle the exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).

When process_response raises IgnoreRequest

If it raises an IgnoreRequest exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
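The errback mentioned above is attached when the Request is created in the spider; a hedged sketch (the spider name, URL, and handler name are placeholders):

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'

    def start_requests(self):
        yield scrapy.Request('http://example.com', callback=self.parse,
                             errback=self.handle_error)

    def parse(self, response):
        self.logger.info('Got %s', response.url)

    def handle_error(self, failure):
        # Invoked when a downloader middleware raises IgnoreRequest (and on other failures)
        self.logger.warning('Request failed: %r', failure)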

process_exception(request, exception, spider)

When it is called:
Scrapy calls process_exception() when a download handler or a process_request() (from a downloader middleware) raises an exception (including an IgnoreRequest exception)
Return value analysis:
process_exception() should return: either None, a Response object, or a Request object.

If it returns None, Scrapy will continue processing this exception, executing any other process_exception() methods of installed middleware, until no middleware is left and the default exception handling kicks in.
If it returns a Response object, the process_response() method chain of installed middleware is started, and Scrapy won’t bother calling any other process_exception() methods of middleware.

If it returns a Request object, the returned request is rescheduled to be downloaded in the future. This stops the execution of process_exception() methods of the middleware the same as returning a response would.
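A hedged sketch of the "return a Request" branch, e.g. rescheduling a request after a timeout (this middleware is illustrative and is not Scrapy's built-in RetryMiddleware):

from twisted.internet.error import TimeoutError

class RescheduleOnTimeoutMiddleware:
    def process_exception(self, request, exception, spider):
        if isinstance(exception, TimeoutError):
            spider.logger.warning('Timeout, rescheduling %s', request.url)
            # Returning a Request stops further process_exception() calls
            # and sends the copy back to the scheduler
            return request.replace(dont_filter=True)
        return None  # let other middlewares / the default handling continue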

from_crawler(cls, crawler)

If present, this classmethod is called to create a middleware instance from a Crawler. It must return a new instance of the middleware. Crawler object provides access to all Scrapy core components like settings and signals; it is a way for middleware to access them and hook its functionality into Scrapy.

Parameters
crawler (Crawler object) – crawler that uses this middleware
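A hedged sketch of the usual pattern: from_crawler reads settings and registers signal handlers, then builds the middleware instance (the setting name CUSTOM_USER_AGENT is a made-up example):

from scrapy import signals

class UserAgentFromSettingsMiddleware:
    def __init__(self, user_agent):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings and crawler.signals are the core components mentioned above
        mw = cls(crawler.settings.get('CUSTOM_USER_AGENT', 'Scrapy'))
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        return mw

    def spider_opened(self, spider):
        spider.logger.info('Using User-Agent: %s', self.user_agent)

    def process_request(self, request, spider):
        request.headers['User-Agent'] = self.user_agent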

Spider Middleware

Item Pipeline

The settings file

DOWNLOAD_DELAY

1. DOWNLOAD_DELAY (can be used to throttle the crawling speed)
Default: 0

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported. Example:

DOWNLOAD_DELAY = 0.25 # 250 ms of delay

DOWNLOADER_MIDDLEWARES_BASE and DOWNLOADER_MIDDLEWARES

The relationship between DOWNLOADER_MIDDLEWARES_BASE and DOWNLOADER_MIDDLEWARES is analyzed in detail in the corresponding chapter.
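In short, DOWNLOADER_MIDDLEWARES_BASE holds the middlewares Scrapy enables by default and should not be edited; project middlewares go into DOWNLOADER_MIDDLEWARES, which is merged with the base dict by priority. A hedged settings.py sketch (project_name.middlewares.SeleniumRenderMiddleware is a placeholder path):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'project_name.middlewares.SeleniumRenderMiddleware': 543,            # enable a custom middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable a built-in one
}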

DOWNLOADER_STATS

3.DOWNLOADER_STATS
Default: True

Whether to enable downloader stats collection.

DOWNLOAD_HANDLERS

Default: {}

A dict containing the request downloader handlers enabled in your project. See DOWNLOAD_HANDLERS_BASE for example format.

DOWNLOAD_HANDLERS_BASE

A dict containing the request download handlers enabled by default in Scrapy. You should never modify this setting in your project, modify DOWNLOAD_HANDLERS instead.


Default:

{
    'data': 'scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler',
    'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
    'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
    'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}
You can disable any of these download handlers by assigning None to their URI scheme in DOWNLOAD_HANDLERS. E.g., to disable the built-in FTP handler (without replacement), place this in your settings.py:

DOWNLOAD_HANDLERS = {
    'ftp': None,
}

How to access settings

Official documentation
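A hedged sketch of the two common access paths described in the docs: self.settings inside a spider, and crawler.settings inside components created via from_crawler:

import scrapy

class SettingsAwareSpider(scrapy.Spider):
    name = 'settings_aware'

    def parse(self, response):
        # Spiders can read settings through self.settings
        self.logger.info('DOWNLOAD_DELAY = %s', self.settings.getfloat('DOWNLOAD_DELAY'))

class SettingsAwarePipeline:
    @classmethod
    def from_crawler(cls, crawler):
        # Middlewares, pipelines and extensions read settings via the crawler
        print('ROBOTSTXT_OBEY =', crawler.settings.getbool('ROBOTSTXT_OBEY'))
        return cls()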

coroutine

Official documentation

Many libraries that use coroutines, such as aio-libs, require the asyncio loop and to use them you need to enable asyncio support in Scrapy.

To enable asyncio support, set the TWISTED_REACTOR setting to 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'.
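A hedged sketch: switch the reactor in settings.py, and then callbacks declared with async def can await coroutines (aiohttp is just one example of an aio-libs package; the API URL is a placeholder):

# settings.py
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

# spider
import aiohttp
import scrapy

class AsyncDemoSpider(scrapy.Spider):
    name = 'async_demo'
    start_urls = ['http://example.com']

    async def parse(self, response):
        # With asyncio support enabled, the callback itself may await coroutines
        async with aiohttp.ClientSession() as session:
            async with session.get('http://example.com/api') as resp:
                data = await resp.text()
        return [{'url': response.url, 'api_bytes': len(data)}]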

asyncio

Official documentation

Improving Scrapy efficiency

Recommended approach
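A hedged sketch of settings that are commonly tuned for throughput (the values are illustrative and not taken from the linked recommendation):

# settings.py
CONCURRENT_REQUESTS = 100           # default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 16
DOWNLOAD_TIMEOUT = 10               # give up on slow responses sooner (default 180)
LOG_LEVEL = 'INFO'                  # less logging overhead than DEBUG
RETRY_ENABLED = False               # skip retries if losing a few pages is acceptable
COOKIES_ENABLED = False             # disable cookies if the target site does not need them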


Title: Cui Qingcai's Python 3 Web Scraping, Chapter 13: Scrapy - Personal Usage Notes

Author: TTYONG

Published: March 22, 2020 - 17:03

Last updated: March 29, 2022 - 20:03

Original link: http://tianyong.fun/%E5%B4%94%E5%BA%86%E6%89%8Dpython3%E7%88%AC%E8%99%AB-13%E7%AB%A0(13)%20%20Scrapy-%E4%B8%AA%E4%BA%BA%E4%BD%BF%E7%94%A8%E6%80%BB%E7%BB%93.html

License: Please retain the original link and author when reposting.
