Module scrapfly.scrapy.pipelines
Classes
class FilesPipeline(store_uri: str | PathLike[str],
                    download_func: Callable[[Request, Spider], Response] | None = None,
                    settings: Settings | dict[str, Any] | None = None,
                    *,
                    crawler: Crawler | None = None)
# Imports shown for context (exact import paths inside the SDK may differ)
from itemadapter import ItemAdapter
from scrapy.pipelines.files import FilesPipeline as ScrapyFilesPipeline

from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyScrapyRequest


class FilesPipeline(ScrapyFilesPipeline):
    def get_media_requests(self, item, info):
        scrape_configs = ItemAdapter(item).get(self.files_urls_field, [])
        requests = []
        for config in scrape_configs:
            # If the project is not migrated to Scrapfly, the config is a plain
            # URL string instead of a ScrapeConfig object - auto-migrate it.
            if isinstance(config, str):
                config = ScrapeConfig(url=config)
            if isinstance(config, ScrapeConfig):
                requests.append(ScrapflyScrapyRequest(scrape_config=config))
            else:
                raise ValueError('FilesPipeline item must be a ScrapeConfig object or a string URL')
        return requests
Abstract pipeline that implements file downloading.

This pipeline tries to minimize network transfers and file processing by stat-ing the stored files and determining whether each file is new, up-to-date, or expired.

- new: files the pipeline has never processed; they need to be downloaded from the supplier site for the first time.
- uptodate: files the pipeline has already processed and that are still valid.
- expired: files the pipeline has already processed, but whose last modification was long enough ago that reprocessing is recommended to pick up any changes.
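Storage and the expiry window are controlled through the standard Scrapy file settings that this subclass inherits. A minimal settings sketch (the store path is hypothetical):

# settings.py - minimal sketch; the store path is hypothetical
ITEM_PIPELINES = {
    "scrapfly.scrapy.pipelines.FilesPipeline": 1,
}
FILES_STORE = "/tmp/scrapy-files"  # where downloaded files are kept
FILES_EXPIRES = 90                 # days before a stored file counts as "expired"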
Ancestors
- scrapy.pipelines.files.FilesPipeline
- scrapy.pipelines.media.MediaPipeline
- abc.ABC
Methods
def get_media_requests(self, item, info)
def get_media_requests(self, item, info):
    scrape_configs = ItemAdapter(item).get(self.files_urls_field, [])
    requests = []
    for config in scrape_configs:
        # If the project is not migrated to Scrapfly, the config is a plain
        # URL string instead of a ScrapeConfig object - auto-migrate it.
        if isinstance(config, str):
            config = ScrapeConfig(url=config)
        if isinstance(config, ScrapeConfig):
            requests.append(ScrapflyScrapyRequest(scrape_config=config))
        else:
            raise ValueError('FilesPipeline item must be a ScrapeConfig object or a string URL')
    return requests
Returns the media requests to download
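A usage sketch (the spider and URLs are hypothetical; file_urls is Scrapy's default files_urls_field): an item may mix plain URL strings, which get_media_requests() auto-migrates, with ScrapeConfig objects carrying full Scrapfly options.

import scrapy
from scrapfly import ScrapeConfig

class FilesExampleSpider(scrapy.Spider):  # hypothetical spider
    name = "files_example"

    def parse(self, response):
        yield {
            "file_urls": [
                "https://example.com/report.pdf",  # plain URL, auto-migrated to ScrapeConfig
                ScrapeConfig(url="https://example.com/data.csv", asp=True),  # full Scrapfly options
            ],
        }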
class ImagesPipeline(store_uri: str | PathLike[str],
                     download_func: Callable[[Request, Spider], Response] | None = None,
                     settings: Settings | dict[str, Any] | None = None,
                     *,
                     crawler: Crawler | None = None)
# Imports shown for context (exact import paths inside the SDK may differ)
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline as ScrapyImagesPipeline

from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyScrapyRequest


class ImagesPipeline(ScrapyImagesPipeline):
    def get_media_requests(self, item, info):
        scrape_configs = ItemAdapter(item).get(self.images_urls_field, [])
        requests = []
        for config in scrape_configs:
            # If the project is not migrated to Scrapfly, the config is a plain
            # URL string instead of a ScrapeConfig object - auto-migrate it.
            if isinstance(config, str):
                config = ScrapeConfig(url=config)
            if isinstance(config, ScrapeConfig):
                requests.append(ScrapflyScrapyRequest(scrape_config=config))
            else:
                raise ValueError('ImagesPipeline item must be a ScrapeConfig object or a string URL')
        return requests
Abstract pipeline that implements the image thumbnail generation logic.
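Thumbnail generation is configured through the standard Scrapy image settings that this subclass inherits. A minimal settings sketch (the store path and thumbnail sizes are illustrative):

# settings.py - minimal sketch; store path and sizes are illustrative
ITEM_PIPELINES = {
    "scrapfly.scrapy.pipelines.ImagesPipeline": 1,
}
IMAGES_STORE = "/tmp/scrapy-images"
IMAGES_THUMBS = {  # one thumbnail per named size is generated for each downloaded image
    "small": (50, 50),
    "big": (270, 270),
}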
Ancestors
- scrapy.pipelines.images.ImagesPipeline
- scrapy.pipelines.files.FilesPipeline
- scrapy.pipelines.media.MediaPipeline
- abc.ABC
Methods
def get_media_requests(self, item, info)
def get_media_requests(self, item, info):
    scrape_configs = ItemAdapter(item).get(self.images_urls_field, [])
    requests = []
    for config in scrape_configs:
        # If the project is not migrated to Scrapfly, the config is a plain
        # URL string instead of a ScrapeConfig object - auto-migrate it.
        if isinstance(config, str):
            config = ScrapeConfig(url=config)
        if isinstance(config, ScrapeConfig):
            requests.append(ScrapflyScrapyRequest(scrape_config=config))
        else:
            raise ValueError('ImagesPipeline item must be a ScrapeConfig object or a string URL')
    return requests
Returns the media requests to download
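A usage sketch mirroring the files case (the URLs are hypothetical; image_urls is Scrapy's default images_urls_field):

from scrapfly import ScrapeConfig

item = {
    "image_urls": [
        "https://example.com/cover.jpg",  # plain URL, auto-migrated to ScrapeConfig
        ScrapeConfig(url="https://example.com/photo.png", render_js=True),  # full Scrapfly options
    ],
}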