Quickly Writing a Spider with Scrapy
I came across someone else's tutorial, followed along and tried it out; it works nicely.
scrapy startproject ren
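If you prefer not to write the spider file from scratch, Scrapy can also scaffold it for you; run the following inside the project directory (the name and domain below match the spider used in this post):
scrapy genspider luaren lua.ren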
Saving directly to a file
# -*- coding: utf-8 -*-
import scrapy


class LuarenSpider(scrapy.Spider):
    name = "luaren"
    allowed_domains = ["lua.ren"]
    start_urls = [
        'http://lua.ren/',
        'http://lua.ren/topic/342/',
    ]

    def parse(self, response):
        # Name the file after the second-to-last URL segment and dump the raw body.
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
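As a quick sanity check on the filename logic (my own illustration, not from the tutorial), split("/")[-2] picks the second-to-last segment of each URL, so the two start pages end up in files named lua.ren and 342:

# What split("/")[-2] selects for each start URL:
'http://lua.ren/'.split('/')            # ['http:', '', 'lua.ren', ''] -> 'lua.ren'
'http://lua.ren/topic/342/'.split('/')  # [..., 'topic', '342', '']    -> '342'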
Saving via ORM-style Items
The ORM definition
# -*- coding: utf-8 -*-
import scrapy


class RenItem(scrapy.Item):
    # Declare the fields to be scraped; a Field holds no type, only optional metadata.
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
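A scrapy.Item behaves like a dict, which is why the spider below assigns to it by key; a quick illustration (mine, not from the original tutorial):

from ren.items import RenItem

item = RenItem(title=['Hello'])      # keyword construction, dict-style
item['link'] = ['http://lua.ren/']   # key assignment, as done in parse() below
print(item.get('desc'))              # None for a declared but unset field
print(dict(item))                    # plain dict, handy for export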
The spider
# -*- coding: utf-8 -*-
import scrapy

from ren.items import RenItem


class LuarenSpider(scrapy.Spider):
    name = "luaren"
    allowed_domains = ["lua.ren"]
    start_urls = [
        'http://lua.ren/',
        'http://lua.ren/topic/342/',
    ]

    def parse(self, response):
        # Walk every <li> under a <ul> and pull out its link text, href, and tail text.
        for sel in response.xpath('//ul/li'):
            item = RenItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
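One caveat: extract() always returns a list (possibly empty), so every field above holds a list of strings. If you only want the first match, selectors also provide extract_first(), e.g.:

item['title'] = sel.xpath('a/text()').extract_first()  # a single string, or None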
Scrapy first generates a pile of boilerplate; you then add your own code. It visits each URL in turn, hands the body of each response back to the user through a callback, and in that callback you write your own code, parsing the returned data with XPath. The whole mechanism is not very complex, so I won't say more.
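To make that callback flow concrete, here is a minimal sketch; the parse_topic callback and the link XPath are my own illustrative choices, not part of the original tutorial:

# -*- coding: utf-8 -*-
import scrapy


class CallbackDemoSpider(scrapy.Spider):
    name = 'callback_demo'
    allowed_domains = ['lua.ren']
    start_urls = ['http://lua.ren/']

    def parse(self, response):
        # Scrapy downloads each URL and hands the response to this callback;
        # yielding a Request schedules another fetch with its own callback.
        for href in response.xpath('//ul/li/a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_topic)

    def parse_topic(self, response):
        # Parse the followed page here with XPath, just like in parse().
        yield {'url': response.url,
               'title': response.xpath('//title/text()').extract_first()}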
Running, and saving the results as JSON data.
scrapy crawl luaren
scrapy crawl luaren -o items.json
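To sanity-check the exported feed, the items.json produced above is an ordinary JSON array and can be read back directly (each field is itself a list, because extract() returns lists):

import json

with open('items.json') as f:
    items = json.load(f)
print(len(items), items[0] if items else None)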