Cui Qingcai's Python 3 Web Scraping - Chapter 13: Using the Scrapy Framework - Scrapy Generic Spiders


Using the Scrapy Framework - Scrapy Generic Spiders

CrawlSpider

Before building a generic spider, we first need to understand CrawlSpider. Its official documentation is here: http://scrapy.readthedocs.io/en/latest/topics/spiders.html#crawlspider

CrawlSpider is a generic Spider provided by Scrapy. In this Spider we can specify crawling rules that drive page extraction; these rules are expressed by a dedicated data structure, Rule. A Rule holds the configuration for extracting and following pages: based on its Rules, the Spider decides which links on the current page need to be crawled further and which callback should parse the results of which pages.

CrawlSpider inherits from the Spider class. Besides all the methods and attributes of Spider, it provides one especially important attribute and method:
□ rules: the crawling-rules attribute, a list containing one or more Rule objects. Each Rule defines an action for crawling the site; CrawlSpider reads every Rule in rules and applies it.
□ parse_start_url(): an overridable method, called when the Requests corresponding to start_urls receive their Responses. It analyzes the Response and must return an Item object or a Request object. A minimal skeleton using both is sketched below.
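As a quick illustration of how these pieces fit together, here is a minimal, self-contained sketch (the site, URL patterns, and selectors are made up for illustration and are not from the book):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        # Parse article pages with parse_item(); follow category pages for more links
        Rule(LinkExtractor(allow=r'/article/\d+\.html'), callback='parse_item'),
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
    )

    def parse_start_url(self, response):
        # Called for the responses of start_urls; must return Items or Requests
        return []

    def parse_item(self, response):
        # A Rule callback: extract data from a matched page
        yield {'title': response.css('h1::text').get()}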

Rule

The most important piece here is the definition of Rule. Its declaration and parameters are as follows:
class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None,
    process_links=None, process_request=None)
□ link_extractor: a Link Extractor object. Through it, the Spider knows which links to extract from the crawled pages; the extracted links automatically generate Requests. It is itself a data structure, and an LxmlLinkExtractor object is most commonly used here. Its declaration and parameters are as follows:
class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(),
    deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'),
    attrs=('href', ), canonicalize=False, unique=True, process_value=None, strip=True)

allow is a regular expression or a list of regular expressions. It defines which of the links extracted from the current page qualify; only qualifying links are followed. deny is the opposite.

allow_domains defines the acceptable domains; only links under these domains are followed to generate new Requests. It acts as a domain whitelist. deny_domains is the opposite, a domain blacklist.

restrict_xpaths restricts link extraction to the regions of the current page matched by XPath; its value is an XPath expression or a list of XPath expressions. restrict_css does the same with CSS selectors, its value being a CSS selector or a list of CSS selectors. There are a few other parameters controlling the tags links are extracted from, deduplication, link processing, and so on; they are used less often. See the parameter descriptions in the documentation:

http://scrapy.readthedocs.io/en/latest/topics/link-extractors.html#module-scrapy.linkextractors.lxmlhtml
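To make these parameters concrete, here is a small hedged sketch (the patterns and node IDs are invented for illustration) that uses a LinkExtractor directly, outside of a Rule:

from scrapy.linkextractors import LinkExtractor

extractor = LinkExtractor(
    allow=r'article/\d+\.html',           # keep only article detail pages
    deny=r'/login',                       # drop login links
    allow_domains=['example.com'],        # domain whitelist
    restrict_xpaths='//div[@id="list"]',  # only search this region of the page
)

# Inside a Spider callback you could then call:
#   for link in extractor.extract_links(response):
#       print(link.url, link.text)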
□ callback: the callback function, with the same meaning as the callback defined for a Request earlier. It is called every time a link is obtained from link_extractor. The callback receives a response as its first argument and returns a list containing Item and/or Request objects. Note: avoid using parse() as the callback, because CrawlSpider uses the parse() method to implement its own logic; overriding parse() will break the CrawlSpider.

□ cb_kwargs: a dict containing keyword arguments to pass to the callback function.

□ follow: a boolean (True or False) specifying whether the links extracted from the response by this rule should be followed. If the callback argument is None, follow defaults to True; otherwise it defaults to False.

□ process_links: a callable that is invoked with the list of links obtained from link_extractor; it is mainly used for filtering.

□ process_request: likewise a callable, invoked for every Request extracted by this Rule so that the Request can be processed. It must return a Request or None. A sketch combining these arguments follows below.
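Here is a hedged sketch combining the less common Rule arguments above; the filtering helper is hypothetical and not part of Scrapy:

from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

def drop_tracking_links(links):
    # process_links: filter the extracted Link objects before Requests are made
    return [link for link in links if 'utm_' not in link.url]

rule = Rule(
    LinkExtractor(allow=r'article/\d+\.html'),
    callback='parse_item',                # parsed by the spider's parse_item()
    cb_kwargs={'source': 'list_page'},    # extra keyword arguments for the callback
    follow=False,                         # do not follow links from the matched pages
    process_links=drop_tracking_links,
)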
That covers the basic usage of Rule, the core of CrawlSpider, but it may not yet be enough to build a complete CrawlSpider. Below we use CrawlSpider to crawl a news site, to get a better sense of how Rule is used.

Item Loader

Official documentation

We have seen how CrawlSpider's Rules define the page-crawling logic, which is one part of making a spider configurable. However, Rule defines no rules for how Items are extracted. For Item extraction we need another module: Item Loader.

Item Loader provides a convenient mechanism for extracting Items. It offers a set of APIs that parse the raw data and assign it to an Item. The Item is the container that stores the scraped data, while the Item Loader is the mechanism that fills this container. With it, data extraction becomes much more rule-driven.
The Item Loader API is as follows:
class scrapy.loader.ItemLoader([item, selector, response, ] **kwargs)
The Item Loader API returns a new Item Loader for populating the given Item.
If no Item is given, one is instantiated automatically using the class in default_item_class.
It can also be instantiated with the selector and response parameters, to work from a selector or a response.
The Item Loader API parameters are described below.
□ item: the Item object, which can be populated by calling methods such as add_xpath(), add_css(), or add_value().
□ selector: a Selector object, the selector used to extract the data that fills the Item.
□ response: a Response object, the Response used to construct the selector.
To use an Item Loader, you must first instantiate it. You can either instantiate it with an item object or without one, in which case an item object is automatically created in the Item Loader __init__ method using the item class specified in the ItemLoader.default_item_class attribute.


Then, you start collecting values into the Item Loader, typically using Selectors. You can add more than one value to the same item field; the Item Loader will know how to “join” those values later using a proper processing function.

Example

A fairly typical Item Loader example looks like this:
from scrapy.loader import ItemLoader
from project.items import Product
def parse(self, response):
    loader = ItemLoader(item=Product(), response=response)
    loader.add_xpath('name', '//div[@class="product_name"]')
    loader.add_xpath('name', '//div[@class="product_title"]')
    loader.add_xpath('price', '//p[@id="price"]')
    loader.add_css('stock', 'p#stock')
    loader.add_value('last_updated', 'today')
    return loader.load_item()
Here we first declare a Product Item, instantiate an ItemLoader with that Item and the Response, call add_xpath() twice to extract data from two different locations and assign both to the name field, then fill the other fields with add_xpath(), add_css(), and add_value(), and finally call load_item() to produce the Item.

Finally, when all data is collected, the ItemLoader.load_item() method is called which actually returns the item populated with the data previously extracted and collected with the add_xpath(), add_css(), and add_value() calls.

This approach is quite regular, so we can pull the parameters and rules out into a configuration file or a database, which makes the extraction configurable.

Input Processor and Output Processor

In addition, each field of an Item Loader has an Input Processor and an Output Processor.

The Input Processor processes data as soon as it is received. Its results are collected and kept inside the ItemLoader, but are not assigned to the Item yet. After all the data has been collected, the load_item() method is called to populate and produce the Item object.

When that happens, the Output Processor is called first to process the previously collected data, and its result is then stored in the Item; this is how the Item is generated.
An Item Loader contains one input processor and one output processor for each (item) field. The input processor processes the extracted data as soon as it’s received (through the add_xpath(), add_css() or add_value() methods) and the result of the input processor is collected and kept inside the ItemLoader. After collecting all data, the ItemLoader.load_item() method is called to populate and get the populated item object. That’s when the output processor is called with the data previously collected (and processed using the input processor). The result of the output processor is the final value that gets assigned to the item.

A worked example

l = ItemLoader(Product(), some_selector)
l.add_xpath('name', xpath1) # (1)
l.add_xpath('name', xpath2) # (2)
l.add_css('name', css) # (3)
l.add_value('name', 'test') # (4)
return l.load_item() # (5)
Data from xpath1 is extracted, and passed through the input processor of the name field. The result of the input processor is collected and kept in the Item Loader (but not yet assigned to the item).

Data from xpath2 is extracted, and passed through the same input processor used in (1). The result of the input processor is appended to the data collected in (1) (if any).

This case is similar to the previous ones, except that the data is extracted from the css CSS selector, and passed through the same input processor used in (1) and (2). The result of the input processor is appended to the data collected in (1) and (2) (if any).

This case is also similar to the previous ones, except that the value to be collected is assigned directly, instead of being extracted from a XPath expression or a CSS selector. However, the value is still passed through the input processors. In this case, since the value is not iterable it is converted to an iterable of a single element before passing it to the input processor, because input processor always receive iterables.

The data collected in steps (1), (2), (3) and (4) is passed through the output processor of the name field. The result of the output processor is the value assigned to the name field in the item.
It’s worth noticing that processors are just callable objects, which are called with the data to be parsed, and return a parsed value. So you can use any function as input or output processor. The only requirement is that they must accept one (and only one) positional argument, which will be an iterable.
The other thing you need to keep in mind is that the values returned by input processors are collected internally (in lists) and then passed to output processors to populate the fields.
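Since processors are just callables that take one iterable, any plain function works. A minimal sketch (the item, loader, and field names are made up) showing that the collected values arrive as a list:

import scrapy
from scrapy.loader import ItemLoader

def to_int_sum(values):
    # Output processor: receives the whole list collected by the input processors
    return sum(int(v) for v in values)

class Counts(scrapy.Item):
    total = scrapy.Field()

class CountsLoader(ItemLoader):
    default_item_class = Counts
    total_out = to_int_sum      # a plain function used as the output processor

loader = CountsLoader()
loader.add_value('total', ['1', '2'])
loader.add_value('total', '3')
print(loader.load_item())       # {'total': 6}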

Last, but not least, itemloaders comes with some commonly used processors built-in for convenience.

Built-in processors

Official documentation

Even though you can use any callable function as input and output processors, itemloaders provides some commonly used processors, which are described below.

Identity

The simplest processor, which doesn’t do anything. It returns the original values unchanged. It doesn’t receive any __init__ method arguments, nor does it accept Loader contexts.
>>> from itemloaders.processors import Identity
>>> proc = Identity()
>>> proc(['one', 'two', 'three'])
['one', 'two', 'three']

TakeFirst
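As described in the itemloaders documentation, TakeFirst returns the first non-null, non-empty value from the values it receives, so it is typically used as an output processor for single-valued fields. It accepts no __init__ arguments and no Loader context. A short example:

>>> from itemloaders.processors import TakeFirst
>>> proc = TakeFirst()
>>> proc(['', 'one', 'two', 'three'])
'one'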

Join

Returns the values joined with the separator given in the __init__ method, which defaults to ' '. It doesn’t accept Loader contexts.

When using the default separator, this processor is equivalent to the function: ' '.join
>>> from itemloaders.processors import Join
>>> proc = Join()
>>> proc(['one', 'two', 'three'])
'one two three'
>>> proc = Join('<br>')
>>> proc(['one', 'two', 'three'])
'one<br>two<br>three'

Compose

A processor which is constructed from the composition of the given functions. This means that each input value of this processor is passed to the first function, and the result of that function is passed to the second function, and so on, until the last function returns the output value of this processor.

By default, processing stops on a None value. This behaviour can be changed by passing the keyword argument stop_on_none=False.
>>> from itemloaders.processors import Compose
>>> proc = Compose(lambda v: v[0], str.upper)
>>> proc(['hello', 'world'])
'HELLO'

MapCompose

The input value of this processor is iterated and the first function is applied to each element. The results of these function calls (one for each element) are concatenated to construct a new iterable, which is then used to apply the second function, and so on, until the last function is applied to each value of the list of values collected so far. The output values of the last function are concatenated together to produce the output of this processor.

Each particular function can return a value or a list of values, which is flattened with the list of values returned by the same function applied to the other input values. The functions can also return None in which case the output of that function is ignored for further processing over the chain.

This processor provides a convenient way to compose functions that only work with single values (instead of iterables). For this reason the MapCompose processor is typically used as input processor, since data is often extracted using the extract() method of parsel selectors, which returns a list of unicode strings.
>>> def filter_world(x):
...     return None if x == 'world' else x
...
>>> from itemloaders.processors import MapCompose
>>> proc = MapCompose(filter_world, str.upper)
>>> proc(['hello', 'world', 'this', 'is', 'something'])
['HELLO', 'THIS', 'IS', 'SOMETHING']
As with the Compose processor, functions can receive Loader contexts, and __init__ method keyword arguments are used as default context values. See Compose processor for more info.

SelectJmes
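SelectJmes queries the received value with the JMESPath expression given to its __init__ method and returns the result. It works on one value at a time, so it is usually wrapped in MapCompose to apply it to a list, and it requires the jmespath library. A short example:

>>> from itemloaders.processors import SelectJmes
>>> proc = SelectJmes('foo')
>>> proc({'foo': 'bar'})
'bar'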

Declaring Item Loaders

Item Loaders are declared using a class definition syntax. Here is an example:
from itemloaders.processors import TakeFirst, MapCompose, Join
from scrapy.loader import ItemLoader

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()

    name_in = MapCompose(str.title)
    name_out = Join()

    price_in = MapCompose(str.strip)

    # ...
As you can see, input processors are declared using the _in suffix while output processors are declared using the _out suffix. And you can also declare a default input/output processors using the ItemLoader.default_input_processor and ItemLoader.default_output_processor attributes.

Declaring Input and Output Processors

As seen in the previous section, input and output processors can be declared in the Item Loader definition, and it’s very common to declare input processors this way. However, there is one more place where you can specify the input and output processors to use: in the Item Field metadata. Here is an example:
import scrapy
from itemloaders.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags

def filter_price(value):
    if value.isdigit():
        return value

class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(remove_tags, filter_price),
        output_processor=TakeFirst(),
    )
from scrapy.loader import ItemLoader
il = ItemLoader(item=Product())
il.add_value('name', ['Welcome to my', '<strong>website</strong>'])
il.add_value('price', ['&euro;', '<span>1000</span>'])
il.load_item()
{'name': 'Welcome to my website', 'price': '1000'}
The precedence order, for both input and output processors, is as follows:

Item Loader field-specific attributes: field_in and field_out (most precedence)

Field metadata (input_processor and output_processor key)

Item Loader defaults: ItemLoader.default_input_processor() and ItemLoader.default_output_processor() (least precedence)
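A small hedged sketch (made-up item and loader) showing this precedence: the field-specific name_out attribute wins over the Field metadata, which in turn would win over the loader defaults:

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import Join, TakeFirst

class Product(scrapy.Item):
    name = scrapy.Field(output_processor=Join())   # field metadata (middle precedence)

class ProductLoader(ItemLoader):
    default_item_class = Product
    default_output_processor = TakeFirst()          # loader default (lowest precedence)
    name_out = Join(' | ')                          # field-specific attribute (highest precedence)

loader = ProductLoader()
loader.add_value('name', ['a', 'b'])
print(loader.load_item())                           # {'name': 'a | b'}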

ItemLoader Objects

class scrapy.loader.ItemLoader(item=None, selector=None, response=None, parent=None, **context)

A user-friendly abstraction to populate an item with data by applying field processors to scraped data. When instantiated with a selector or a response it supports data extraction from web pages using selectors.
Parameters
item (scrapy.item.Item) – The item instance to populate using subsequent calls to add_xpath(), add_css(), or add_value().

selector (Selector object) – The selector to extract data from, when using the add_xpath(), add_css(), replace_xpath(), or replace_css() method.

response (Response object) – The response used to construct the selector using the default_selector_class, unless the selector argument is given, in which case this argument is ignored.
If no item is given, one is instantiated automatically using the class in default_item_class.

The item, selector, response and remaining keyword arguments are assigned to the Loader context (accessible through the context attribute).

item

context

The currently active Context of this Item Loader

default_item_class

default_input_processor

default_output_processor

default_selector_class

selector

The Selector object to extract data from. It’s either the selector given in the __init__ method or one created from the response given in the __init__ method using the default_selector_class. This attribute is meant to be read-only.

add_css(field_name, css, *processors, **kw)

Similar to ItemLoader.add_value() but receives a CSS selector instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.

add_value(field_name, value, *processors, **kw)

add_xpath(field_name, xpath, *processors, **kw)

get_collected_values(field_name)

get_css(css, *processors, **kw)

get_output_value(field_name)

get_value(value, *processors, **kw)

get_xpath(xpath, *processors, **kw)
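A brief hedged sketch (the HTML, item, and field names are made up) exercising several of the methods listed above:

import scrapy
from scrapy.http import HtmlResponse
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst

class Product(scrapy.Item):
    name = scrapy.Field()

body = b'<html><body><h1>Widget</h1><p id="price">10</p></body></html>'
response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')

loader = ItemLoader(item=Product(), response=response)
loader.add_xpath('name', '//h1/text()')
print(loader.get_collected_values('name'))   # ['Widget']
print(loader.get_output_value('name'))       # ['Widget'] (default output processor keeps the list)
print(loader.get_xpath('//p[@id="price"]/text()', TakeFirst()))   # '10'
print(loader.load_item())                    # {'name': ['Widget']}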

Hands-on Project (a Scrapy Generic Spider)

First, create a new Scrapy project named scrapyuniversal, as follows:
scrapy startproject scrapyuniversal
To create a CrawlSpider we first need to pick a template. We can list the available templates with the following command:
scrapy genspider -l
The output looks like this:
Available templates:
basic
crawl
csvfeed
xmlfeed

When we created Spiders before, we used the first template, basic, by default.
This time we want to create a CrawlSpider, so we need the second template, crawl. The creation command is as follows:
scrapy genspider -t crawl china tech.china.com
Running this generates a CrawlSpider whose content is as follows:
class ChinaSpider(CrawlSpider):
    name = 'china'
    allowed_domains = ['tech.china.com']
    start_urls = ['http://tech.china.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item
The generated Spider now defines a rules attribute. The first argument of Rule is a LinkExtractor, which is simply the LxmlLinkExtractor described above under another name. Also, the default callback is no longer parse but parse_item.

Defining the Rules

To crawl the news articles, all we have to do is define the right Rules and then implement the parsing functions. Let's work through this step by step.
First, change start_urls to the starting link, as follows:
start_urls = ['http://tech.china.com/articles/']

The Spider then crawls every link in start_urls, so the first page crawled is the one we just defined. After it receives the Response, the Spider extracts the hyperlinks in the page according to each Rule and generates further Requests.

Next we need to define Rules that specify which links to extract.
This is the news list page, so the next step is naturally to extract the link of each news article in the list. We can simply restrict the region these links live in: looking at the page source, all of the links sit inside the node with ID left_side, more precisely inside its child nodes with class con_item.

Here we can use LinkExtractor's restrict_xpaths attribute to specify this region; the Spider will then extract every hyperlink in it and generate Requests. However, each article block may also contain some other hyperlinks, and we only want the actual news links. The paths of the real news links all start with article, so we match them with a regular expression assigned to the allow parameter. The pages these links point to are the news detail pages, which are exactly what we need to parse, so we also have to specify a callback.
We can now construct a Rule, as follows:
Rule(LinkExtractor(allow='article\/.*\.html',
     restrict_xpaths='//div[@id="left_side"]//div[@class="con_item"]'), callback='parse_item')
Next, the list page needs pagination, so we also have to extract the link to the next page. Analyzing the page source shows that the next-page link lives inside the node with ID pageStyle.
However, the next-page node is not easy to distinguish from the other pagination links, so to pick out this link we can use XPath text matching, again via LinkExtractor's restrict_xpaths attribute. Unlike the news detail pages, we don't need to extract any detail information from the pagination pages, i.e. no Item is generated, so no callback is needed. If the next-page request succeeds, that page should be analyzed in the same way again, so this Rule also gets follow=True, meaning the matched links are followed and analyzed further. In fact, follow could be omitted here, because follow defaults to True when callback is empty. The Rule is defined as follows:

Rule(LinkExtractor(restrict_xpaths='//div[@id="pageStyle"]//a[contains(.,"下一页")]'))

So rules now becomes:

rules = (
    Rule(LinkExtractor(allow='article\/.*\.html',
         restrict_xpaths='//div[@id="left_side"]//div[@class="con_item"]'), callback='parse_item'),
    Rule(LinkExtractor(restrict_xpaths='//div[@id="pageStyle"]//a[contains(.,"下一页")]'), follow=True)
)
Now run the code with the following command:
scrapy crawl china
Pagination and detail-page crawling both work now, and all it took was defining two Rules.


Parsing the Pages

Next we need to parse the page content, extracting the title, publish time, body, and source. First define an Item, as follows:
from scrapy import Field, Item

class NewsItem(Item):
    title = Field()
    url = Field()
    text = Field()
    datetime = Field()
    source = Field()
    website = Field()
The fields are the news title, link, body, publish time, source, and site name; the site name is assigned directly as 中华网 (China.com). Since this is meant to be a generic crawler, there will surely be other spiders crawling news with the same structure from other sites, so we need a field to distinguish the site.
If we extract the content the same way as before, we simply call the response's xpath(), css(), and related methods. The parse_item() method is implemented as follows:
def parse_item(self, response):
    item = NewsItem()
    item['title'] = response.xpath('//h1[@id="chan_newsTitle"]/text()').extract_first()
    item['url'] = response.url
    item['text'] = ''.join(response.xpath('//div[@id="chan_newsDetail"]//text()').extract()).strip()
    item['datetime'] = response.xpath('//div[@id="chan_newsInfo"]/text()').re_first('(\d+-\d+-\d+\s\d+:\d+:\d+)')
    item['source'] = response.xpath('//div[@id="chan_newsInfo"]/text()').re_first('来源:(.*)').strip()
    item['website'] = '中华网'
    yield item


Using ItemLoader

Now we can successfully extract the information of each news article.
However, this extraction style is rather irregular. Let's rewrite it with Item Loader, using add_xpath(), add_css(), add_value(), and similar calls so that extraction becomes configurable. parse_item() can be rewritten as follows:
def parse_item(self, response):
    loader = ChinaLoader(item=NewsItem(), response=response)
    loader.add_xpath('title', '//h1[@id="chan_newsTitle"]/text()')
    loader.add_value('url', response.url)
    loader.add_xpath('text', '//div[@id="chan_newsDetail"]//text()')
    loader.add_xpath('datetime', '//div[@id="chan_newsInfo"]/text()', re='(\d+-\d+-\d+\s\d+:\d+:\d+)')
    loader.add_xpath('source', '//div[@id="chan_newsInfo"]/text()', re='来源:(.*)')
    loader.add_value('website', '中华网')
    yield loader.load_item()

Here we define a subclass of Item Loader named ChinaLoader, implemented as follows:
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, Join, Compose

class NewsLoader(ItemLoader):
    default_output_processor = TakeFirst()

class ChinaLoader(NewsLoader):
    text_out = Compose(Join(), lambda s: s.strip())
    source_out = Compose(Join(), lambda s: s.strip())
ChinaLoader inherits from NewsLoader, which declares a generic output processor, TakeFirst, equivalent to the extract_first() behaviour we used before. In ChinaLoader we then define text_out and source_out, each using a Compose processor with two arguments: the first, Join, is itself a processor that joins a list into a single string; the second is an anonymous function that strips leading and trailing whitespace from that string. After this chain, the list-shaped extraction result becomes a single string with the surrounding whitespace removed.
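A quick illustration (not from the book) of what that processor chain does to a typical extraction result:

from itemloaders.processors import Compose, Join

text_out = Compose(Join(), lambda s: s.strip())
print(text_out(['  first part ', ' second part  ']))
# joined with a space, then stripped at the ends: 'first part   second part'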
Re-running the code gives exactly the same extraction results.
At this point we have made the spider half-configurable.

Extracting a Generic Configuration

Why only half-configurable so far? If we want to cover another site, we still have to create a new CrawlSpider, define that site's Rules, and implement parse_item() separately, and a lot of code is duplicated: the CrawlSpider's variables and method names are almost identical. So can several similar spiders share their common code, with the parts that differ pulled out into configuration files?
Of course they can. Which parts can we extract? All of the variables: name, allowed_domains, start_urls, rules, and so on. They just need to be assigned when the CrawlSpider is initialized. So we create a new generic Spider with the following command:
scrapy genspider -t crawl universal universal
This brand-new Spider is named universal. Next we extract the attributes of the Spider we wrote earlier into a JSON configuration named china.json, placed in a configs folder that sits alongside the spiders folder, as follows:
{
    "spider": "universal",
    "website": "中华网科技",
    "type": "新闻",
    "index": "http://tech.china.com/",
    "settings": {
        "USER_AGENT": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"
    },
    "start_urls": [
        "http://tech.china.com/articles/"
    ],
    "allowed_domains": [
        "tech.china.com"
    ],
    "rules": "china"
}
The first field, spider, is the name of the Spider, here universal. It is followed by a description of the site, such as its name, type, and index page. Then comes settings, the Spider-specific settings; anything placed here overrides the project-wide configuration in settings.py for this Spider. After that come Spider attributes such as start_urls, allowed_domains, and rules.
rules can also be moved to a separate rules.py file and turned into a configuration of its own, separating the Rules out, as follows:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = {
    'china': (
        Rule(LinkExtractor(allow='article\/.*\.html',
             restrict_xpaths='//div[@id="left_side"]//div[@class="con_item"]'),
             callback='parse_item'),
        Rule(LinkExtractor(restrict_xpaths='//div[@id="pageStyle"]//a[contains(.,"下一页")]'))
    )
}
With this, the basic configuration has been extracted. To start the crawler we only need to read the configuration file and load it into the Spider dynamically, so we first define a method that reads the JSON file, as follows:
from os.path import realpath, dirname
import json

def get_config(name):
    path = dirname(realpath(__file__)) + '/configs/' + name + '.json'
    with open(path, 'r', encoding='utf-8') as f:
        return json.loads(f.read())
Having defined get_config(), we only need to pass it the name of a JSON configuration file to obtain that configuration. Next we define an entry file, run.py, placed in the project root directory; its job is to start the Spider, as follows:
import sys
from scrapy.utils.project import get_project_settings
from scrapyuniversal.spiders.universal import UniversalSpider
from scrapyuniversal.utils import get_config
from scrapy.crawler import CrawlerProcess

def run():
    name = sys.argv[1]
    custom_settings = get_config(name)
    # Name of the Spider used for crawling
    spider = custom_settings.get('spider', 'universal')
    project_settings = get_project_settings()
    settings = dict(project_settings.copy())
    # Merge the project settings with the site-specific settings
    settings.update(custom_settings.get('settings'))
    process = CrawlerProcess(settings)
    # Start the crawl
    process.crawl(spider, **{'name': name})
    process.start()

if __name__ == '__main__':
    run()


The entry point is run().
It first reads the command-line argument and assigns it to name; name is the name of the JSON file, which is effectively the name of the target site to crawl. We then call get_config() with that name to read the configuration file we just defined.
Next we take the Spider name and the settings from the configuration, merge those settings with the project-wide settings, create a CrawlerProcess with the merged settings, and call crawl() and start() to launch the crawl.
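run.py imports the UniversalSpider class, which this excerpt does not show. A minimal sketch of what it might look like, assuming it reads the same JSON config and the rules map from rules.py in its __init__ (this is an assumption, not the book's exact code):

from scrapy.spiders import CrawlSpider
from scrapyuniversal.utils import get_config
from scrapyuniversal.rules import rules as rule_map

class UniversalSpider(CrawlSpider):
    name = 'universal'

    def __init__(self, name, *args, **kwargs):
        # 'name' here is the config name passed in by run.py (e.g. 'china')
        config = get_config(name)
        self.config = config
        self.start_urls = config.get('start_urls', [])
        self.allowed_domains = config.get('allowed_domains', [])
        # Look up the Rule tuple by its key, e.g. rules['china']
        self.rules = rule_map.get(config.get('rules'), ())
        super().__init__(*args, **kwargs)

    def parse_item(self, response):
        # In a fully generic spider the Item extraction would also be
        # driven by the configuration; omitted in this sketch.
        pass

With the config in place, the crawl for the china.com site would then be started with something like: python run.py china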



Title: Cui Qingcai's Python 3 Web Scraping - Chapter 13: Using the Scrapy Framework - Scrapy Generic Spiders

Author: TTYONG

Published: March 22, 2020, 17:03

Last updated: March 27, 2022, 15:03

Original link: http://tianyong.fun/%E5%B4%94%E5%BA%86%E6%89%8Dpython3%E7%88%AC%E8%99%AB-13%E7%AB%A0(13.10)%20%20Scrapy%E6%A1%86%E6%9E%B6%E7%9A%84%E4%BD%BF%E7%94%A8-%20Scrapy%E9%80%9A%E7%94%A8%E7%88%AC%E8%99%AB.html

License: please retain the original link and author when reposting.
