I came across someone else's tutorial, followed along, and tested it out. Works nicely.

scrapy startproject ren

Saving directly to a file

# -*- coding: utf-8 -*-
import scrapy

class LuarenSpider(scrapy.Spider):
    name = "luaren"
    allowed_domains = ["lua.ren"]
    start_urls = [
        'http://lua.ren/',
        'http://lua.ren/topic/342/'
    ]

    def parse(self, response):
        # Use the second-to-last URL segment as the filename
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
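A quick note on that filename logic: splitting the URL on `/` and taking the second-to-last segment gives the last non-empty path component (or the host for the root URL). You can check what it produces for the two start URLs with plain Python, no Scrapy needed:

```python
# Pure-stdlib check of the filename derivation used in parse()
urls = ['http://lua.ren/', 'http://lua.ren/topic/342/']
names = [url.split("/")[-2] for url in urls]
print(names)  # ['lua.ren', '342']
```

So the root page is saved as `lua.ren` and the topic page as `342`.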

Saving via an ORM-style Item

The ORM (Item) definition

# -*- coding: utf-8 -*-
import scrapy

class RenItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
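A `scrapy.Item` behaves like a dict restricted to its declared fields: you read and write values with `item['key']`, and assigning to an undeclared key raises an error. The stand-in below is a stdlib-only sketch of that behavior, not Scrapy itself; the field names mirror `RenItem` above.

```python
# Stdlib sketch of how a scrapy.Item behaves: a dict locked to declared fields.
# This is an illustration, not Scrapy's actual implementation.
class DictItem(dict):
    fields = {'title', 'link', 'desc'}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s is not a declared field" % key)
        super().__setitem__(key, value)

item = DictItem()
item['title'] = ['Hello']
print(item['title'])  # ['Hello']
```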

The spider

# -*- coding: utf-8 -*-
import scrapy
from ren.items import RenItem

class LuarenSpider(scrapy.Spider):
    name = "luaren"
    allowed_domains = ["lua.ren"]
    start_urls = [
        'http://lua.ren/',
        'http://lua.ren/topic/342/'
    ]

    def parse(self, response):
        # Each <li> under a <ul> becomes one item
        for sel in response.xpath('//ul/li'):
            item = RenItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
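The XPath expressions above pull the link text, the `href` attribute, and the surrounding text out of each `<li>`. You can preview the same idea without Scrapy using the stdlib's limited XPath support; the markup below is made-up sample data for illustration:

```python
# Stdlib preview of the extraction pattern; the HTML sample is fabricated.
import xml.etree.ElementTree as ET

html = "<ul><li><a href='/topic/342/'>Hello</a>a short desc</li></ul>"
root = ET.fromstring(html)
for li in root.findall('li'):       # roughly //ul/li relative to the root <ul>
    a = li.find('a')
    print(a.text, a.get('href'), a.tail)  # Hello /topic/342/ a short desc
```

Scrapy's `Selector` is far more capable (full XPath, HTML tolerance), but the shape of the code is the same.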

Scrapy first generates the project scaffolding, and you add your own code on top. It visits each URL in turn and hands the response body back to you through a callback; inside that callback you write your own logic and parse the returned data with XPath. The whole mechanism is not complicated, so I won't say more.
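That fetch-and-callback loop can be sketched in a few lines. Everything below (`FakeResponse`, `crawl`, the fake page table) is hypothetical and only illustrates the control flow, not Scrapy's actual asynchronous engine:

```python
# Hypothetical sketch of the crawl loop: fetch each URL, pass the response
# to the user callback, collect whatever items the callback yields.
class FakeResponse:
    def __init__(self, url, body):
        self.url = url
        self.body = body

def crawl(start_urls, fetch, parse):
    items = []
    for url in start_urls:
        items.extend(parse(FakeResponse(url, fetch(url))))
    return items

# Made-up page content standing in for real HTTP fetches
fake_pages = {'http://lua.ren/': b'<html>home</html>'}

def parse(response):
    # The user-supplied callback: yield items extracted from the response
    yield {'url': response.url, 'size': len(response.body)}

print(crawl(['http://lua.ren/'], fake_pages.get, parse))
```

The real engine schedules requests concurrently and supports yielding new `Request` objects from the callback, but the user-facing contract is this simple.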

Running the spider, and saving the results as JSON:

scrapy crawl luaren
scrapy crawl luaren -o items.json
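The `-o items.json` flag serializes every yielded item into a JSON array. One detail worth noticing: because `extract()` returns a list of matches, each field in the output is a list, not a scalar. The sample record below is fabricated to show the shape; it is not real crawl output:

```python
import json

# Illustrative record mimicking the shape of Scrapy's JSON feed export:
# field values are lists because extract() returns lists of matches.
sample = [{"title": ["Hello"], "link": ["/topic/342/"], "desc": ["a post"]}]
text = json.dumps(sample)

parsed = json.loads(text)
print(parsed[0]["link"][0])  # /topic/342/
```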