
Class CrawlerProcess

scrapy/crawler.py:720–793

A class to run multiple scrapy crawlers in a process simultaneously. This class extends :class:`~scrapy.crawler.CrawlerRunner` by adding support for starting a :mod:`~twisted.internet.reactor` and handling shutdown signals, like the keyboard interrupt command Ctrl-C. It also configures top-level logging.
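As a rough illustration of the summary above, the sketch below schedules one or more spiders on a process object and then starts it. The names `run_all`, `main`, and `ExampleSpider`, the example URL, and the `LOG_LEVEL` value are illustrative choices made here and do not come from the source; `crawl()` is assumed to be the scheduling method inherited from `CrawlerRunner`.

```python
from typing import Any, Iterable


def run_all(process: Any, spider_classes: Iterable[Any]) -> None:
    """Schedule every spider on ``process``, then block until all finish."""
    for spidercls in spider_classes:
        # Schedules the crawl; nothing runs until the reactor starts.
        process.crawl(spidercls)
    # Starts the reactor. With the default stop_after_crawl=True, the call
    # returns once every scheduled crawler has finished.
    process.start()


def main() -> None:
    # Imported here so that run_all() stays usable without Scrapy installed.
    import scrapy
    from scrapy.crawler import CrawlerProcess

    class ExampleSpider(scrapy.Spider):  # hypothetical spider for illustration
        name = "example"
        start_urls = ["https://example.com/"]

        def parse(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}

    # Settings may be passed as a plain dict, per the __init__ signature.
    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    run_all(process, [ExampleSpider])
```

Calling `main()` from a standalone script would block until the crawl completes. Because `start()` runs the reactor itself, this pattern suits scripts that do not already run a Twisted reactor.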


class CrawlerProcess(CrawlerProcessBase, CrawlerRunner):
    """
    A class to run multiple scrapy crawlers in a process simultaneously.

    This class extends :class:`~scrapy.crawler.CrawlerRunner` by adding support
    for starting a :mod:`~twisted.internet.reactor` and handling shutdown
    signals, like the keyboard interrupt command Ctrl-C. It also configures
    top-level logging.

    This utility should be a better fit than
    :class:`~scrapy.crawler.CrawlerRunner` if you aren't running another
    :mod:`~twisted.internet.reactor` within your application.

    The CrawlerProcess object must be instantiated with a
    :class:`~scrapy.settings.Settings` object.

    :param install_root_handler: whether to install root logging handler
        (default: True)

    This class shouldn't be needed (since Scrapy is responsible of using it
    accordingly) unless writing scripts that manually handle the crawling
    process. See :ref:`run-from-script` for an example.

    This class provides Deferred-based APIs. Use :class:`AsyncCrawlerProcess`
    for modern coroutine APIs.
    """

    def __init__(
        self,
        settings: dict[str, Any] | Settings | None = None,
        install_root_handler: bool = True,
    ):
        super().__init__(settings, install_root_handler)
        self._initialized_reactor: bool = False
        logger.debug("Using CrawlerProcess")

    def _create_crawler(self, spidercls: type[Spider] | str) -> Crawler:
        if isinstance(spidercls, str):
            spidercls = self.spider_loader.load(spidercls)
        init_reactor = not self._initialized_reactor
        self._initialized_reactor = True
        return Crawler(spidercls, self.settings, init_reactor=init_reactor)

    def _stop_dfd(self) -> Deferred[Any]:
        return self.stop()

    def start(
        self, stop_after_crawl: bool = True, install_signal_handlers: bool = True
    ) -> None:
        """
        This method starts a :mod:`~twisted.internet.reactor`, adjusts its pool
        size to :setting:`REACTOR_THREADPOOL_MAXSIZE`, and installs a DNS
        resolver based on :setting:`DNSCACHE_ENABLED`.

        If ``stop_after_crawl`` is True, the reactor will be stopped after all
        crawlers have finished, using :meth:`join`.

        :param bool stop_after_crawl: stop or not the reactor when all

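The flag handling in `_create_crawler` can be read as a first-call-only pattern: the first crawler created is passed `init_reactor=True`, and every later one is passed `False`. The sketch below isolates just that bookkeeping in plain Python. `ReactorInitFlag` and `next_init_reactor` are hypothetical names used only for illustration, and what `Crawler` does with the flag is outside this excerpt.

```python
class ReactorInitFlag:
    """Hypothetical stand-in for the bookkeeping in ``_create_crawler``."""

    def __init__(self) -> None:
        self._initialized_reactor = False

    def next_init_reactor(self) -> bool:
        # True only if no earlier call has already claimed reactor setup.
        init_reactor = not self._initialized_reactor
        self._initialized_reactor = True
        return init_reactor


flag = ReactorInitFlag()
results = [flag.next_init_reactor() for _ in range(3)]
print(results)  # [True, False, False]
```

Keeping the flag on the process object rather than on each crawler means the decision depends only on creation order, so several crawlers can be scheduled on one process while only the first is asked to initialize the reactor.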