Module scrapfly.scrapy.pipelines

Classes

class FilesPipeline (store_uri: str | PathLike[str],
download_func: Callable[[Request, Spider], Response] | None = None,
settings: Settings | dict[str, Any] | None = None,
*,
crawler: Crawler | None = None)
class FilesPipeline(ScrapyFilesPipeline):
    def get_media_requests(self, item, info):
        scrape_configs = ItemAdapter(item).get(self.files_urls_field, [])

        requests = []

        for config in scrape_configs:
            # Items not yet migrated to Scrapfly carry plain URL strings instead of
            # ScrapeConfig objects; wrap each string in a ScrapeConfig automatically.
            if isinstance(config, str):
                config = ScrapeConfig(url=config)

            if isinstance(config, ScrapeConfig):
                requests.append(ScrapflyScrapyRequest(scrape_config=config))
            else:
                raise ValueError('FilesPipeline item must be a ScrapeConfig object or a string URL')

        return requests

Abstract pipeline that implements file downloading.

This pipeline tries to minimize network transfers and file processing by checking the status (stat) of each stored file and determining whether it is new, up to date, or expired.

New files are those the pipeline has never processed and that must be downloaded from the supplier site for the first time.

Up-to-date files are those the pipeline has already processed and that are still valid.

Expired files are those the pipeline has already processed but whose last modification was long ago, so reprocessing is recommended to refresh them in case they have changed.
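
The three states described above can be sketched as a small function. This is a minimal illustration, not the pipeline's actual code: `classify_file` and its parameters are hypothetical names, and the 90-day default is an assumption modeled on Scrapy's `FILES_EXPIRES` setting.

```python
import time
from typing import Optional

SECONDS_PER_DAY = 60 * 60 * 24


def classify_file(
    last_modified: Optional[float],
    expires_days: int = 90,  # assumed threshold, mirroring Scrapy's FILES_EXPIRES
    now: Optional[float] = None,
) -> str:
    """Return 'new', 'uptodate' or 'expired' for a stored file.

    last_modified is a Unix timestamp, or None when the file was never stored.
    """
    if last_modified is None:
        # Never processed before: it has to be downloaded.
        return "new"
    current = time.time() if now is None else now
    age_days = (current - last_modified) / SECONDS_PER_DAY
    # Files older than the threshold should be downloaded again.
    return "expired" if age_days > expires_days else "uptodate"
```

Only files classified as new or expired would trigger a download in this sketch; up-to-date files would be skipped.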

Ancestors

  • scrapy.pipelines.files.FilesPipeline
  • scrapy.pipelines.media.MediaPipeline
  • abc.ABC

Methods

def get_media_requests(self, item, info)
def get_media_requests(self, item, info):
    scrape_configs = ItemAdapter(item).get(self.files_urls_field, [])

    requests = []

    for config in scrape_configs:
        # Items not yet migrated to Scrapfly carry plain URL strings instead of
        # ScrapeConfig objects; wrap each string in a ScrapeConfig automatically.
        if isinstance(config, str):
            config = ScrapeConfig(url=config)

        if isinstance(config, ScrapeConfig):
            requests.append(ScrapflyScrapyRequest(scrape_config=config))
        else:
            raise ValueError('FilesPipeline item must be a ScrapeConfig object or a string URL')

    return requests

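Returns the media requests to download: one ScrapflyScrapyRequest per entry in the item's files_urls_field. String URLs are wrapped in a ScrapeConfig first; an entry of any other type raises ValueError.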
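
The normalization step performed by get_media_requests can be tried in isolation. The sketch below runs without Scrapy or Scrapfly installed: `StubScrapeConfig`, `StubRequest` and `build_requests` are hypothetical stand-ins written for this example, not part of either library.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List


@dataclass
class StubScrapeConfig:
    """Stand-in for scrapfly.ScrapeConfig; only the url field is modeled."""
    url: str


@dataclass
class StubRequest:
    """Stand-in for ScrapflyScrapyRequest; it only holds the config."""
    scrape_config: StubScrapeConfig


def build_requests(entries: Iterable[Any]) -> List[StubRequest]:
    requests = []
    for entry in entries:
        # Plain URL strings are wrapped, mirroring the auto-migration step.
        if isinstance(entry, str):
            entry = StubScrapeConfig(url=entry)
        if not isinstance(entry, StubScrapeConfig):
            raise ValueError("entry must be a config object or a string URL")
        requests.append(StubRequest(scrape_config=entry))
    return requests


# A mixed list: one legacy string URL and one already-migrated config.
mixed = [
    "https://example.com/a.pdf",
    StubScrapeConfig(url="https://example.com/b.pdf"),
]
built = build_requests(mixed)
```

Both entries end up as request objects carrying a config, which is why a spider can be migrated item by item rather than all at once.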

class ImagesPipeline (store_uri: str | PathLike[str],
download_func: Callable[[Request, Spider], Response] | None = None,
settings: Settings | dict[str, Any] | None = None,
*,
crawler: Crawler | None = None)
class ImagesPipeline(ScrapyImagesPipeline):
    def get_media_requests(self, item, info):
        scrape_configs = ItemAdapter(item).get(self.images_urls_field, [])

        requests = []

        for config in scrape_configs:
            # Items not yet migrated to Scrapfly carry plain URL strings instead of
            # ScrapeConfig objects; wrap each string in a ScrapeConfig automatically.
            if isinstance(config, str):
                config = ScrapeConfig(url=config)

            if isinstance(config, ScrapeConfig):
                requests.append(ScrapflyScrapyRequest(scrape_config=config))
            else:
                raise ValueError('ImagesPipeline item must be a ScrapeConfig object or a string URL')

        return requests

Abstract pipeline that implements the image thumbnail generation logic.

Ancestors

  • scrapy.pipelines.images.ImagesPipeline
  • scrapy.pipelines.files.FilesPipeline
  • scrapy.pipelines.media.MediaPipeline
  • abc.ABC

Methods

def get_media_requests(self, item, info)
def get_media_requests(self, item, info):
    scrape_configs = ItemAdapter(item).get(self.images_urls_field, [])

    requests = []

    for config in scrape_configs:
        # Items not yet migrated to Scrapfly carry plain URL strings instead of
        # ScrapeConfig objects; wrap each string in a ScrapeConfig automatically.
        if isinstance(config, str):
            config = ScrapeConfig(url=config)

        if isinstance(config, ScrapeConfig):
            requests.append(ScrapflyScrapyRequest(scrape_config=config))
        else:
            raise ValueError('ImagesPipeline item must be a ScrapeConfig object or a string URL')

    return requests

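Returns the media requests to download: one ScrapflyScrapyRequest per entry in the item's images_urls_field. String URLs are wrapped in a ScrapeConfig first; an entry of any other type raises ValueError.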
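
A project would typically enable this pipeline through its Scrapy settings. The fragment below is a sketch under assumptions: the setting names (`ITEM_PIPELINES`, `IMAGES_STORE`, `IMAGES_THUMBS`) are standard Scrapy settings, while the priority, storage path and thumbnail sizes are illustrative values. Any additional settings the Scrapfly integration may require are not covered here.

```python
# settings.py (sketch): enable the Scrapfly-aware image pipeline.
ITEM_PIPELINES = {
    "scrapfly.scrapy.pipelines.ImagesPipeline": 1,  # priority is an assumed value
}

# Directory where downloaded images are stored (hypothetical path).
IMAGES_STORE = "/tmp/scrapfly-images"

# Thumbnail names and sizes are illustrative values.
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}
```

With these settings, items would list their image sources under the field named by images_urls_field, as either string URLs or ScrapeConfig objects.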