Thanks a lot for your light-speed guidance!
I'm using Scrapy, a scraping framework for Python. Internally, it uses Python's standard logging library, which streams the logs to stdout.
Unfortunately, the exit-code approach you suggested won't work here, because the process returns 0 once the scheduler has finished, regardless of whether the crawl actually succeeded.
However, Scrapy writes a stats report to the log at the end of each run, and I am trying to parse it to determine whether the test has passed:
2025-01-17 04:24:10 [scrapy.core.engine] INFO: Closing spider (closespider_itemcount)
2025-01-17 04:24:10 [web] INFO: Number of search results: 10
2025-01-17 04:24:10 [scrapy.extensions.feedexport] INFO: Stored jsonl feed (10 items) in: test.jsonl
2025-01-17 04:24:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 10.298047,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'closespider_itemcount',
'finish_time': datetime.datetime(2025, 1, 17, 3, 24, 10, 796087, tzinfo=datetime.timezone.utc),
'httpcache/hit': 13,
'httpcompression/response_bytes': 64974,
'httpcompression/response_count': 2,
'item_scraped_count': 10,
'items_per_minute': None,
'log_count/DEBUG': 46,
'log_count/INFO': 50,
'log_count/WARNING': 1,
'memusage/max': 162893824,
'memusage/startup': 162893824,
'request_depth_max': 3,
'response_received_count': 13,
'responses_per_minute': None,
'scheduler/dequeued': 13,
'scheduler/dequeued/memory': 13,
'scheduler/enqueued': 24,
'scheduler/enqueued/memory': 24,
'scrapy-zyte-api/sessions/use/disabled': 13,
'start_time': datetime.datetime(2025, 1, 17, 3, 24, 0, 498040, tzinfo=datetime.timezone.utc)}
2025-01-17 04:24:10 [scrapy.core.engine] INFO: Spider closed (closespider_itemcount)
I have several projects like this one and was hoping to monitor which crawlers need fixing.
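A minimal sketch of how the stats report above could be parsed, assuming the dump always follows a "Dumping Scrapy stats:" line with one 'key': value entry per line, as in the log shown. The names parse_scrapy_stats and crawl_passed are hypothetical helpers, and the pass criteria (a minimum item_scraped_count and no log_count/ERROR entries) are only an assumption about what "passed" should mean. Values that are not plain literals, such as the datetime.datetime(...) entries, are kept as raw strings rather than evaluated.

```python
import ast
import re

STATS_HEADER = "Dumping Scrapy stats:"

# Matches one pretty-printed entry such as "{'item_scraped_count': 10,".
# Assumes every entry fits on a single line, as in the log above.
ENTRY = re.compile(r"^[{\s]*'(?P<key>[^']+)':\s*(?P<value>.*?)[,}]?\s*$")


def parse_scrapy_stats(log_text):
    """Return the last stats dump found in log_text as a dict, or None."""
    _, found, tail = log_text.rpartition(STATS_HEADER)
    if not found:
        return None
    stats = {}
    # The first element is the remainder of the header line, so skip it.
    for line in tail.splitlines()[1:]:
        match = ENTRY.match(line)
        if match is None:
            break  # the first line that is not an entry ends the dump
        raw = match.group("value")
        try:
            stats[match.group("key")] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Not a plain literal (e.g. a datetime repr): keep the raw text.
            stats[match.group("key")] = raw
    return stats


def crawl_passed(stats, min_items=1):
    """Hypothetical pass criteria: enough items scraped and no ERROR logs."""
    if stats is None:
        return False
    enough_items = (stats.get("item_scraped_count") or 0) >= min_items
    no_errors = (stats.get("log_count/ERROR") or 0) == 0
    return enough_items and no_errors
```

With the captured log text of one crawl in a string, crawl_passed(parse_scrapy_stats(text)) would give a single pass/fail value per project. A missing stats dump is treated as a failure here, which seems the safer default when a crawler dies before reporting.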