A class to run multiple Scrapy crawlers in a process simultaneously. This class extends :class:`~scrapy.crawler.CrawlerRunner` by adding support for starting a :mod:`~twisted.internet.reactor` and handling shutdown signals, like the keyboard interrupt command Ctrl-C. It also configures top-level logging.
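Because several crawlers share one process, the `_create_crawler` method in the listing passes `init_reactor=True` only to the first crawler it builds. The sketch below isolates that flag handling so it can be run without Scrapy or Twisted. `FakeCrawler` and `FakeProcess` are hypothetical stand-ins invented for illustration; they are not Scrapy APIs and only mirror the two lines that read and set `_initialized_reactor`.

```python
from dataclasses import dataclass


@dataclass
class FakeCrawler:
    """Hypothetical stand-in for a crawler; it only records the flag it received."""

    spider_name: str
    init_reactor: bool


class FakeProcess:
    """Hypothetical stand-in that mirrors the flag handling in _create_crawler."""

    def __init__(self) -> None:
        self._initialized_reactor = False

    def create_crawler(self, spider_name: str) -> FakeCrawler:
        # The flag is False only on the first call, so only the first
        # crawler is asked to initialize the reactor.
        init_reactor = not self._initialized_reactor
        self._initialized_reactor = True
        return FakeCrawler(spider_name, init_reactor=init_reactor)


process = FakeProcess()
crawlers = [process.create_crawler(name) for name in ("first", "second", "third")]
flags = [crawler.init_reactor for crawler in crawlers]
print(flags)  # [True, False, False]
```

Keeping the flag on the process object rather than on each crawler is what lets every later crawler skip reactor setup, under the assumption that all crawlers are created through the same process instance.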
class CrawlerProcess(CrawlerProcessBase, CrawlerRunner):
    """
    A class to run multiple Scrapy crawlers in a process simultaneously.

    This class extends :class:`~scrapy.crawler.CrawlerRunner` by adding support
    for starting a :mod:`~twisted.internet.reactor` and handling shutdown
    signals, like the keyboard interrupt command Ctrl-C. It also configures
    top-level logging.

    This utility should be a better fit than
    :class:`~scrapy.crawler.CrawlerRunner` if you aren't running another
    :mod:`~twisted.internet.reactor` within your application.

    The CrawlerProcess object must be instantiated with a
    :class:`~scrapy.settings.Settings` object.

    :param install_root_handler: whether to install root logging handler
        (default: True)

    This class shouldn't be needed (since Scrapy is responsible for using it
    accordingly) unless writing scripts that manually handle the crawling
    process. See :ref:`run-from-script` for an example.

    This class provides Deferred-based APIs. Use :class:`AsyncCrawlerProcess`
    for modern coroutine APIs.
    """

    def __init__(
        self,
        settings: dict[str, Any] | Settings | None = None,
        install_root_handler: bool = True,
    ):
        super().__init__(settings, install_root_handler)
        self._initialized_reactor: bool = False
        logger.debug("Using CrawlerProcess")

    def _create_crawler(self, spidercls: type[Spider] | str) -> Crawler:
        if isinstance(spidercls, str):
            spidercls = self.spider_loader.load(spidercls)
        init_reactor = not self._initialized_reactor
        self._initialized_reactor = True
        return Crawler(spidercls, self.settings, init_reactor=init_reactor)

    def _stop_dfd(self) -> Deferred[Any]:
        return self.stop()

    def start(
        self, stop_after_crawl: bool = True, install_signal_handlers: bool = True
    ) -> None:
        """
        This method starts a :mod:`~twisted.internet.reactor`, adjusts its pool
        size to :setting:`REACTOR_THREADPOOL_MAXSIZE`, and installs a DNS
        resolver based on :setting:`DNSCACHE_ENABLED`.

        If ``stop_after_crawl`` is True, the reactor will be stopped after all
        crawlers have finished, using :meth:`join`.

        :param bool stop_after_crawl: stop or not the reactor when all