Class CrawlerProcess

scrapy/crawler.py:720–793

A class to run multiple Scrapy crawlers in a process simultaneously. This class extends :class:`~scrapy.crawler.CrawlerRunner` by adding support for starting a :mod:`~twisted.internet.reactor` and handling shutdown signals, like the keyboard interrupt command Ctrl-C. It also configures top-level logging.

Source from the content-addressed store, hash-verified

class CrawlerProcess(CrawlerProcessBase, CrawlerRunner):
    """
    A class to run multiple scrapy crawlers in a process simultaneously.

    This class extends :class:`~scrapy.crawler.CrawlerRunner` by adding support
    for starting a :mod:`~twisted.internet.reactor` and handling shutdown
    signals, like the keyboard interrupt command Ctrl-C. It also configures
    top-level logging.

    This utility should be a better fit than
    :class:`~scrapy.crawler.CrawlerRunner` if you aren't running another
    :mod:`~twisted.internet.reactor` within your application.

    The CrawlerProcess object must be instantiated with a
    :class:`~scrapy.settings.Settings` object.

    :param install_root_handler: whether to install root logging handler
        (default: True)

    This class shouldn't be needed (since Scrapy is responsible of using it
    accordingly) unless writing scripts that manually handle the crawling
    process. See :ref:`run-from-script` for an example.

    This class provides Deferred-based APIs. Use :class:`AsyncCrawlerProcess`
    for modern coroutine APIs.
    """

    def __init__(
        self,
        settings: dict[str, Any] | Settings | None = None,
        install_root_handler: bool = True,
    ):
        super().__init__(settings, install_root_handler)
        self._initialized_reactor: bool = False
        logger.debug("Using CrawlerProcess")

    def _create_crawler(self, spidercls: type[Spider] | str) -> Crawler:
        if isinstance(spidercls, str):
            spidercls = self.spider_loader.load(spidercls)
        init_reactor = not self._initialized_reactor
        self._initialized_reactor = True
        return Crawler(spidercls, self.settings, init_reactor=init_reactor)

    def _stop_dfd(self) -> Deferred[Any]:
        return self.stop()

    def start(
        self, stop_after_crawl: bool = True, install_signal_handlers: bool = True
    ) -> None:
        """
        This method starts a :mod:`~twisted.internet.reactor`, adjusts its pool
        size to :setting:`REACTOR_THREADPOOL_MAXSIZE`, and installs a DNS
        resolver based on :setting:`DNSCACHE_ENABLED`.

        If ``stop_after_crawl`` is True, the reactor will be stopped after all
        crawlers have finished, using :meth:`join`.

        :param bool stop_after_crawl: stop or not the reactor when all
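
The bookkeeping in `_create_crawler` means that only the first crawler created by a given process is passed `init_reactor=True`; every later crawler receives `False`. The following is a minimal stand-alone sketch of that flag pattern. `FakeCrawler` and `FirstCallFlag` are hypothetical stand-ins invented for this illustration and are not part of Scrapy.

```python
from dataclasses import dataclass


@dataclass
class FakeCrawler:
    # Hypothetical stand-in for scrapy.crawler.Crawler; it only records
    # the flag it was constructed with.
    spider_name: str
    init_reactor: bool


class FirstCallFlag:
    """Hypothetical factory mirroring the flag handling in _create_crawler."""

    def __init__(self) -> None:
        self._initialized_reactor = False

    def create(self, spider_name: str) -> FakeCrawler:
        # True only while no crawler has been created yet.
        init_reactor = not self._initialized_reactor
        self._initialized_reactor = True
        return FakeCrawler(spider_name, init_reactor=init_reactor)


factory = FirstCallFlag()
crawlers = [factory.create(name) for name in ("a", "b", "c")]
print([c.init_reactor for c in crawlers])  # [True, False, False]
```

The flag is flipped before the crawler is constructed, so in this sketch a second call to `create` would still see `True` even if constructing the first crawler had raised an exception.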

Callers 15

execute · Function · 0.90

test_crawler_process_accepts_dict · Method · 0.90

test_crawler_process_accepts_None · Method · 0.90

test_log_scrapy_info · Function · 0.90

twisted_reactor_custom_settings_conflict.py · File · 0.90

simple.py · File · 0.90

asyncio_enabled_reactor.py · File · 0.90

asyncio_enabled_no_reactor.py · File · 0.90

multi.py · File · 0.90

asyncio_enabled_reactor_same_loop.py · File · 0.90

caching_hostname_resolver.py · File · 0.90

reactor_select_twisted_reactor_select.py · File · 0.90

Calls

no outgoing calls

Tested by 3

test_crawler_process_accepts_dict · Method · 0.72

test_crawler_process_accepts_None · Method · 0.72

test_log_scrapy_info · Function · 0.72