For now I have 2 spiders. What I would like to do is:
- Spider 1 goes to url1 and, if url2 appears, calls spider 2 with url2. It also saves the content of url1 by using a pipeline.
- Spider 2 goes to url2 and does something.
Due to the complexity of both spiders, I would like to keep them separate.
My attempt using scrapy crawl:
    def parse(self, response):
        p = multiprocessing.Process(
            target=self.testfunc())
        p.join()
        p.start()

    def testfunc(self):
        settings = get_project_settings()
        crawler = CrawlerRunner(settings)
        crawler.crawl(, )
It does load the settings but doesn't crawl:
    2015-08-24 14:13:32 [scrapy] INFO: Enabled extensions: CloseSpider, LogStats, CoreStats, SpiderState
    2015-08-24 14:13:32 [scrapy] INFO: Enabled downloader middlewares: DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, HttpAuthMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2015-08-24 14:13:32 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2015-08-24 14:13:32 [scrapy] INFO: Spider opened
    2015-08-24 14:13:32 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
The documentation has an example about launching from a script, but what I'm trying to do is launch another spider while using the scrapy crawl command.
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings
    from twisted.internet import reactor
    from multiprocessing import Process
    import scrapy
    import os


    def info(title):
        print(title)
        print('module name:', __name__)
        if hasattr(os, 'getppid'):  # only available on Unix
            print('parent process:', os.getppid())
        print('process id:', os.getpid())


    class TestSpider1(scrapy.Spider):
        name = "test1"
        start_urls = ['http://www.google.com']

        def parse(self, response):
            info('parse')
            a = MyClass()
            a.start_work()


    class MyClass(object):

        def start_work(self):
            info('start_work')
            p = Process(target=self.do_work)
            p.start()
            p.join()

        def do_work(self):
            info('do_work')
            settings = get_project_settings()
            runner = CrawlerRunner(settings)
            runner.crawl(TestSpider2)
            d = runner.join()
            d.addBoth(lambda _: reactor.stop())
            reactor.run()
            return


    class TestSpider2(scrapy.Spider):
        name = "test2"
        start_urls = ['http://www.google.com']

        def parse(self, response):
            info('testspider2')
            return
- scrapy crawl test1 (for example, when response.status_code is 200:)
- in test1, call scrapy crawl test2
I won't go in depth since this question is really old, but I'll go ahead and drop this snippet from the official Scrapy docs. You are very close! lol
    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start()  # the script will block here until all crawling jobs are finished
And then, using callbacks, you can pass items between your spiders to do whatever logic you're talking about.