Link Extractors — Scrapy 0.24.4 documentation

Link Extractors

LinkExtractors are objects whose only purpose is to extract links from web
pages (scrapy.http.Response objects), which will eventually be
followed.

There are two Link Extractors available in Scrapy by default, but you can
create your own custom Link Extractors to suit your needs by implementing a
simple interface.

The only public method that every LinkExtractor has is extract_links,
which receives a Response object and returns a list
of scrapy.link.Link objects. Link Extractors are meant to be instantiated once and their
extract_links method called several times with different responses, to
extract links to follow.
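
For instance, here is a minimal sketch of that usage pattern inside a plain
spider (the spider name and URL are illustrative, not from the original docs):

    from scrapy.http import Request
    from scrapy.spider import Spider
    from scrapy.contrib.linkextractors import LinkExtractor

    class LinksSpider(Spider):
        name = 'links'
        start_urls = ['http://www.example.com']

        # instantiated once, then reused for every response
        link_extractor = LinkExtractor()

        def parse(self, response):
            for link in self.link_extractor.extract_links(response):
                # each result is a scrapy.link.Link with .url and .text
                yield Request(link.url, callback=self.parse)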

Link extractors are used in the CrawlSpider
class (available in Scrapy) through a set of rules, but you can also use them in
your spiders, even if you don’t subclass from
CrawlSpider, as their purpose is very simple: to
extract links.
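
As a sketch of the rule-based usage with Scrapy 0.24 (the domain, allow
pattern, and callback name are illustrative):

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor

    class ExampleSpider(CrawlSpider):
        name = 'example'
        start_urls = ['http://www.example.com']

        # follow links matching /category/ and parse them with parse_item
        rules = (
            Rule(LinkExtractor(allow=('/category/', )), callback='parse_item'),
        )

        def parse_item(self, response):
            self.log('Visited %s' % response.url)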

Built-in link extractors reference

All link extractor classes bundled with Scrapy are provided in the
scrapy.contrib.linkextractors module.

If you don’t know what link extractor to choose, just use the default which is
the same as LxmlLinkExtractor (see below):

    from scrapy.contrib.linkextractors import LinkExtractor

LxmlLinkExtractor

class scrapy.contrib.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)

LxmlLinkExtractor is the recommended link extractor with handy filtering
options. It is implemented using lxml’s robust HTMLParser.

Parameters:
  • allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions)
    that the (absolute) urls must match in order to be extracted. If not
    given (or empty), it will match all links.
  • deny (a regular expression (or list of)) – a single regular expression (or list of regular expressions)
    that the (absolute) urls must match in order to be excluded (i.e. not
    extracted). It has precedence over the allow parameter. If not
    given (or empty) it won’t exclude any links.
  • allow_domains (str or list) – a single value or a list of strings containing
    domains which will be considered for extracting the links
  • deny_domains (str or list) – a single value or a list of strings containing
    domains which won’t be considered for extracting the links
  • deny_extensions (list) – a single value or list of strings containing
    extensions that should be ignored when extracting links.
    If not given, it will default to the
    IGNORED_EXTENSIONS list defined in the scrapy.linkextractor
    module.
  • restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines
    regions inside the response where links should be extracted from.
    If given, only the text selected by those XPaths will be scanned for
    links. See examples below.
  • tags (str or list) – a tag or a list of tags to consider when extracting links.
    Defaults to ('a', 'area').
  • attrs (list) – an attribute or list of attributes which should be considered when looking
    for links to extract (only for those tags specified in the tags
    parameter). Defaults to ('href',)
  • canonicalize (boolean) – canonicalize each extracted url (using
    scrapy.utils.url.canonicalize_url). Defaults to True.
  • unique (boolean) – whether duplicate filtering should be applied to extracted
    links.
  • process_value (callable) – see process_value argument of
    BaseSgmlLinkExtractor class constructor
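
As the examples referenced above, here is a minimal sketch combining these
filters (the URL patterns and XPath are illustrative, and response stands for
a scrapy.http.Response already obtained by the spider):

    from scrapy.contrib.linkextractors import LinkExtractor

    extractor = LinkExtractor(
        allow=(r'/docs/', ),             # only urls containing /docs/
        deny=(r'/docs/old/', ),          # ...except the /docs/old/ subtree
        restrict_xpaths=('//div[@id="content"]', ),  # scan this region only
    )
    links = extractor.extract_links(response)  # list of scrapy.link.Link objects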

SgmlLinkExtractor

Warning

SGMLParser-based link extractors are unmaintained and their usage is discouraged.
It is recommended to migrate to LxmlLinkExtractor if you are still
using SgmlLinkExtractor.

class scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)

The SgmlLinkExtractor is built upon the base BaseSgmlLinkExtractor
and provides additional filters that you can specify to extract links,
including regular expression patterns that the links must match to be
extracted. All those filters are configured through these constructor
parameters:

Parameters:
  • allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions)
    that the (absolute) urls must match in order to be extracted. If not
    given (or empty), it will match all links.
  • deny (a regular expression (or list of)) – a single regular expression (or list of regular expressions)
    that the (absolute) urls must match in order to be excluded (i.e. not
    extracted). It has precedence over the allow parameter. If not
    given (or empty) it won’t exclude any links.
  • allow_domains (str or list) – a single value or a list of strings containing
    domains which will be considered for extracting the links
  • deny_domains (str or list) – a single value or a list of strings containing
    domains which won’t be considered for extracting the links
  • deny_extensions (list) – a single value or list of strings containing
    extensions that should be ignored when extracting links.
    If not given, it will default to the
    IGNORED_EXTENSIONS list defined in the scrapy.linkextractor
    module.
  • restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines
    regions inside the response where links should be extracted from.
    If given, only the text selected by those XPaths will be scanned for
    links. See examples below.
  • tags (str or list) – a tag or a list of tags to consider when extracting links.
    Defaults to ('a', 'area').
  • attrs (list) – an attribute or list of attributes which should be considered when looking
    for links to extract (only for those tags specified in the tags
    parameter). Defaults to ('href',)
  • canonicalize (boolean) – canonicalize each extracted url (using
    scrapy.utils.url.canonicalize_url). Defaults to True.
  • unique (boolean) – whether duplicate filtering should be applied to extracted
    links.
  • process_value (callable) – see process_value argument of
    BaseSgmlLinkExtractor class constructor
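
Since both classes document the same constructor parameters, migrating away
from SgmlLinkExtractor is usually just a matter of swapping the class, as in
this sketch (the allow pattern is illustrative):

    # before (deprecated):
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    extractor = SgmlLinkExtractor(allow=(r'/item/', ))

    # after:
    from scrapy.contrib.linkextractors import LinkExtractor
    extractor = LinkExtractor(allow=(r'/item/', ))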

BaseSgmlLinkExtractor

class scrapy.contrib.linkextractors.sgml.BaseSgmlLinkExtractor(tag="a", attr="href", unique=False, process_value=None)

The purpose of this Link Extractor is only to serve as a base class for the
SgmlLinkExtractor. You should use that one instead.

The constructor arguments are:

Parameters:
  • tag (str or callable) – either a string (with the name of a tag) or a
    function that receives a tag name and returns True if links should be
    extracted from that tag, or False if they shouldn’t. Defaults to 'a'.
  • attr (str or callable) – either a string (with the name of a tag attribute),
    or a function that receives an attribute name and returns True if
    links should be extracted from it, or False if they shouldn’t.
    Defaults to 'href'.
  • unique (boolean) – is a boolean that specifies if duplicate filtering should
    be applied to links extracted.
  • process_value (callable) –

    a function which receives each value extracted from
    the tag and attributes scanned and can modify the value and return a
    new one, or return None to ignore the link altogether. If not
    given, process_value defaults to lambda x: x.

    For example, to extract links from this code:

    <a href="javascript:goToPage('../other/page.html'); return false">Link text</a>

    You can use the following function in process_value:

    import re

    def process_value(value):
        m = re.search("javascript:goToPage\('(.*?)'", value)
        if m:
            return m.group(1)
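
As a further sketch, the tag and attr parameters also accept callables, which
lets a single extractor scan several tags and attributes (this particular
tag/attribute combination is illustrative, not from the original docs):

    from scrapy.contrib.linkextractors.sgml import BaseSgmlLinkExtractor

    # consider <a href="...">, <area href="..."> and <img src="...">
    extractor = BaseSgmlLinkExtractor(
        tag=lambda t: t in ('a', 'area', 'img'),
        attr=lambda a: a in ('href', 'src'),
    )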
