Scrapy is an excellent crawling framework: it provides solid building blocks that work out of the box, and it can be customized heavily to fit your own needs. This post records its basic usage and some common problems.
1. Installation
Although Scrapy is a Python module, it has quite a few dependencies, so I recommend installing it with apt:
```shell
sudo apt-get install python-scrapy
```
Compile-everything die-hards (and perfectionists) can download the source from PyPI and build it themselves. A friendly warning: the automatic dependency resolution of pip / easy_setup is incomplete, so you will have to install some extra packages by hand.
This post uses the latest release at the time of writing: 0.24.
2. Crawling DMOZ
Scrapy ships with a shell command for creating projects, so first let's create one:
```shell
scrapy startproject tutorial
```
Next, create dmoz_spider.py under tutorial/spiders; this is the heart of the crawler:
```python
from scrapy.spider import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
```
A few notes on the attributes of the class above:
- name is the identifier Scrapy uses for the spider; it must be unique.
- allowed_domains is a whitelist of domains.
- start_urls holds the seed URLs (if no other Rule is defined, only these pages are fetched).
The response passed into parse() contains the result of the fetch.
Run the crawl:

```shell
scrapy crawl dmoz
```
Afterwards you will find two new files in the directory: Books and Resources.
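Those filenames come straight from the URLs: parse() takes the second-to-last path segment. A quick standalone check of that split (plain Python, no Scrapy needed):

```python
# The URL ends with '/', so the last element of the split is ''
# and index [-2] is the final path segment.
url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
filename = url.split("/")[-2]
print(filename)  # Books
```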
When the shell command above is executed, Scrapy creates a scrapy.http.Request object for each start URL; once the page has been fetched, it calls back parse() with the result.
3. Extracting structured data & saving as JSON
In a real crawling job we usually don't want just the raw pages; we want the crawl results turned directly into structured data. First, define the data fields in tutorial/items.py:
```python
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
```
With this data object defined, we can modify the parser and use Scrapy's built-in XPath support to parse the HTML document:
```python
from scrapy.spider import Spider
from scrapy.selector import Selector

from tutorial.items import DmozItem

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            items.append(item)
        return items
```
If we want to save the results as JSON, we add two options to the shell command:
```shell
scrapy crawl dmoz -o items.json -t json
```
4. Following and filtering links
In many cases we don't crawl just one page; we need to "follow the vine": starting from a few seed pages, we traverse hyperlinks until we finally reach the pages we actually want.
Scrapy abstracts this nicely:
```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class Coder4Spider(CrawlSpider):
    name = 'coder4'
    allowed_domains = ['xxx.com']
    start_urls = ['http://www.xxx.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=('page/[0-9]+', ))),
        Rule(SgmlLinkExtractor(allow=('archives/[0-9]+', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('%s' % response.url)
```
Above we use CrawlSpider instead of Spider; name, allowed_domains, and start_urls need no further explanation.
The important part is the rules:
- The first Rule, without a callback, is only a "stepping stone": the page is downloaded just so the links matching allow can be followed to the next level. A Rule can also take deny=... to filter out unwanted pages.
- The second Rule, with a callback, marks the pages that will finally be passed to parse_item().
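The allow/deny values are ordinary regular expressions matched against the extracted URLs. A rough illustration of that matching, using plain re instead of Scrapy (the URLs are made up for the example):

```python
import re

# Hypothetical URLs of the kind a link extractor might find
urls = [
    "http://www.xxx.com/page/2",
    "http://www.xxx.com/archives/1234",
    "http://www.xxx.com/about",
]

# Same pattern as in the second Rule above
allow = re.compile(r'archives/[0-9]+')

# Keep only the URLs the allow pattern matches, roughly what
# SgmlLinkExtractor(allow=('archives/[0-9]+',)) would follow
followed = [u for u in urls if allow.search(u)]
print(followed)  # ['http://www.xxx.com/archives/1234']
```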
5. Filtering duplicate pages
Scrapy deduplicates pages (i.e. avoids re-fetching the same page) via RFPDupeFilter.
RFPDupeFilter actually filters based on request_fingerprint, which is implemented as follows:
```python
def request_fingerprint(request, include_headers=None):
    if include_headers:
        include_headers = tuple([h.lower() for h in sorted(include_headers)])
    cache = _fingerprint_cache.setdefault(request, {})
    if include_headers not in cache:
        fp = hashlib.sha1()
        fp.update(request.method)
        fp.update(canonicalize_url(request.url))
        fp.update(request.body or '')
        if include_headers:
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[include_headers] = fp.hexdigest()
    return cache[include_headers]
```
As you can see, the dedup fingerprint is sha1(method + canonicalized URL + body + headers).
Such a fingerprint is quite specific, so in practice the fraction of requests it can eliminate as duplicates is not large.
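To see why the URL canonicalization matters, here is a simplified stand-in for the fingerprint, in Python 3 for brevity (Scrapy's real canonicalize_url does more; sorting the query string below is only an approximation):

```python
import hashlib
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def simple_canonicalize(url):
    # Approximation of canonicalize_url: sort the query parameters
    # so equivalent URLs hash to the same fingerprint
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ''))

def simple_fingerprint(method, url, body=b''):
    fp = hashlib.sha1()
    fp.update(method.encode())
    fp.update(simple_canonicalize(url).encode())
    fp.update(body)
    return fp.hexdigest()

# Same page, query parameters in a different order:
a = simple_fingerprint('GET', 'http://xxx.com/list?page=2&sort=hot')
b = simple_fingerprint('GET', 'http://xxx.com/list?sort=hot&page=2')
print(a == b)  # True: recognized as duplicates
```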
If you want to compute the dedup fingerprint yourself, you have to implement your own filter and configure Scrapy to use it.
The following filter dedupes by URL only:
```python
from scrapy.dupefilter import RFPDupeFilter

class SeenURLFilter(RFPDupeFilter):
    """A dupe filter that considers only the URL"""

    def __init__(self, path=None):
        self.urls_seen = set()
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        if request.url in self.urls_seen:
            return True
        else:
            self.urls_seen.add(request.url)
```
Don't forget to enable it in settings.py:
```python
DUPEFILTER_CLASS = 'scraper.custom_filters.SeenURLFilter'
```
6. Setting up a proxy
To use a proxy, two middlewares need to be configured.
In settings.py, define:
```python
# HttpProxyMiddleware is a downloader middleware, so both entries
# belong in DOWNLOADER_MIDDLEWARES (not SPIDER_MIDDLEWARES)
DOWNLOADER_MIDDLEWARES = {
    'project_name.middlewares.MyProxyMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
}
```
Here 100 and 110 are execution priorities: our own MyProxyMiddleware runs first, and the built-in HttpProxyMiddleware runs after it.
Our MyProxyMiddleware is responsible for writing the proxy information into the request's meta:
```python
# base64 is needed only if the proxy requires authentication
import base64

class MyProxyMiddleware(object):

    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy
        # (strip() removes the trailing newline that encodestring appends)
        encoded_user_pass = base64.encodestring(proxy_user_pass).strip()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
```
If you use a socks5 proxy: sorry, Scrapy cannot talk to it directly yet. You can use software such as Privoxy to convert it locally into an HTTP proxy.
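A minimal Privoxy setup for this (assuming the socks5 proxy listens on 127.0.0.1:1080; the ports here are just examples, and the trailing dot in the forward rule is required by Privoxy's syntax) might look like this in /etc/privoxy/config:

```
# Privoxy itself listens here as an ordinary HTTP proxy
listen-address 127.0.0.1:8118
# Forward all traffic to the local socks5 proxy (note the trailing dot)
forward-socks5 / 127.0.0.1:1080 .
```

Scrapy would then point request.meta['proxy'] at http://127.0.0.1:8118.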
7. Preventing crawl loops
In its default configuration Scrapy dedupes by URL, which is enough for ordinary sites. But some sites take SEO to pathological extremes: to make crawlers fetch more, they dynamically generate links based on the request, which can lead the crawler to fetch huge numbers of random pages, or even loop forever.
There are two ways to deal with this:
(1) In settings.py, cap the crawl depth (a global setting, implemented internally by DepthMiddleware):
```python
DEPTH_LIMIT = 20
```
(2) Read the depth from the response in parse() and decide yourself (a per-spider approach):
```python
def parse(self, response):
    if response.meta['depth'] > 100:
        print 'Loop?'
```
References:
- Make Scrapy work with socket proxy
- Using Scrapy with proxies