Quickly Writing a Spider with Scrapy
I came across someone else's tutorial, followed along and tried it out; it works nicely.
scrapy startproject ren
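If you prefer not to write the spider file from scratch, Scrapy can also scaffold it for you; run the following inside the project directory (the name and domain below match the spider used in this post):
scrapy genspider luaren lua.ren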
Saving directly to a file
# -*- coding: utf-8 -*-
import scrapy


class LuarenSpider(scrapy.Spider):
    name = "luaren"
    allowed_domains = ["lua.ren"]
    start_urls = [
        'http://lua.ren/',
        'http://lua.ren/topic/342/',
    ]

    def parse(self, response):
        # Name the file after the second-to-last URL segment and dump the raw body.
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
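As a quick sanity check on the filename logic (my own illustration, not from the tutorial), split("/")[-2] picks the second-to-last segment of each URL, so the two start pages end up in files named lua.ren and 342:

# What split("/")[-2] selects for each start URL:
'http://lua.ren/'.split('/')            # ['http:', '', 'lua.ren', ''] -> 'lua.ren'
'http://lua.ren/topic/342/'.split('/')  # [..., 'topic', '342', '']    -> '342'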
Saving via ORM-style Items
The ORM definition
# -*- coding: utf-8 -*-
import scrapy


class RenItem(scrapy.Item):
    # Declare the fields to be scraped; a Field holds no type, only optional metadata.
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
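A scrapy.Item behaves like a dict, which is why the spider below assigns to it by key; a quick illustration (mine, not from the original tutorial):

from ren.items import RenItem

item = RenItem(title=['Hello'])      # keyword construction, dict-style
item['link'] = ['http://lua.ren/']   # key assignment, as done in parse() below
print(item.get('desc'))              # None for a declared but unset field
print(dict(item))                    # plain dict, handy for export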
The spider
# -*- coding: utf-8 -*-
import scrapy

from ren.items import RenItem


class LuarenSpider(scrapy.Spider):
    name = "luaren"
    allowed_domains = ["lua.ren"]
    start_urls = [
        'http://lua.ren/',
        'http://lua.ren/topic/342/',
    ]

    def parse(self, response):
        # Walk every <li> under a <ul> and pull out its link text, href, and tail text.
        for sel in response.xpath('//ul/li'):
            item = RenItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
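One caveat: extract() always returns a list (possibly empty), so every field above holds a list of strings. If you only want the first match, selectors also provide extract_first(), e.g.:

item['title'] = sel.xpath('a/text()').extract_first()  # a single string, or None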
Scrapy first generates a pile of boilerplate; you then add your own code. It visits each URL in turn, hands the body of each response back to the user through a callback, and in that callback you write your own code, parsing the returned data with XPath. The whole mechanism is not very complex, so I won't say more.
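To make that callback flow concrete, here is a minimal sketch; the parse_topic callback and the link XPath are my own illustrative choices, not part of the original tutorial:

# -*- coding: utf-8 -*-
import scrapy


class CallbackDemoSpider(scrapy.Spider):
    name = 'callback_demo'
    allowed_domains = ['lua.ren']
    start_urls = ['http://lua.ren/']

    def parse(self, response):
        # Scrapy downloads each URL and hands the response to this callback;
        # yielding a Request schedules another fetch with its own callback.
        for href in response.xpath('//ul/li/a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_topic)

    def parse_topic(self, response):
        # Parse the followed page here with XPath, just like in parse().
        yield {'url': response.url,
               'title': response.xpath('//title/text()').extract_first()}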
Running, and saving the results as JSON data.
scrapy crawl luaren
scrapy crawl luaren -o items.json
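To sanity-check the exported feed, the items.json produced above is an ordinary JSON array and can be read back directly (each field is itself a list, because extract() returns lists):

import json

with open('items.json') as f:
    items = json.load(f)
print(len(items), items[0] if items else None)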