selenium

Scraping dynamic pages with scrapy + selenium

Tencent Cloud blog

Install selenium (for example, with pip install selenium)

Download the selenium browser driver (ChromeDriver)

Chrome driver download page. Pick the driver version closest to your installed Chrome.
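ChromeDriver releases follow Chrome's version numbers, so a mismatch in the major version is a common reason the driver fails to start. The check below is a minimal sketch, assuming version text in the form printed by `chromedriver --version` and `google-chrome --version`. The function names and sample strings are hypothetical, for illustration only.

```python
import re


def major_version(version_text: str) -> int:
    """Return the major version from text such as 'ChromeDriver 120.0.6099.71 (...)'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1))


def versions_compatible(chrome_text: str, driver_text: str) -> bool:
    """True when Chrome and ChromeDriver share the same major version."""
    return major_version(chrome_text) == major_version(driver_text)


if __name__ == "__main__":
    # Illustrative strings only; substitute the output of your own binaries.
    print(versions_compatible("Google Chrome 120.0.6099.109",
                              "ChromeDriver 120.0.6099.71 (abc123)"))
```

If the check fails, download the driver build whose major version matches the browser before continuing.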

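Usage

1. In middlewares.py, change the response object that is returned

Find the [YourProjectName]DownloaderMiddleware class (ScrapyYysDownloaderMiddleware for a project named scrapy_yys) and modify its process_request method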

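# middlewares.py also needs these imports at the top of the file
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.service import Service


class ScrapyYysDownloaderMiddleware:
    # The class name follows your project name. Only process_request is shown;
    # keep the other generated methods unchanged.

    def process_request(self, request, spider):
        # Return our own response from process_request: a driver built with
        # webdriver requests the page, so the response holds the JS-rendered HTML.
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')  # No visible browser window. On Linux without a display, startup fails without this flag
        options.add_argument('--blink-settings=imagesEnabled=false')  # Do not load images, which speeds up the run
        options.add_argument('--disable-gpu')  # Google's documentation mentions adding this flag to work around a bug
        options.add_argument('--no-sandbox')  # Disable sandbox mode
        options.add_argument('--disable-blink-features=AutomationControlled')  # Disable Blink's AutomationControlled runtime feature
        options.add_experimental_option('excludeSwitches', ['enable-automation'])  # Drop the enable-automation switch (hides the "controlled by automated software" banner)

        # The Service path is the location of your ChromeDriver binary.
        # Selenium 4 takes it through Service; Selenium 3 accepted executable_path=... directly.
        driver = webdriver.Chrome(service=Service('/Users/mulin/Chrome/chromedriver'), options=options)
        try:
            # Hide `window.navigator.webdriver`, which is true by default in a selenium-driven browser
            driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
                "source": """
                         Object.defineProperty(navigator, 'webdriver', {
                           get: () => undefined
                         })
                       """
            })

            driver.get(request.url)
            # implicitly_wait only sets the timeout for element lookups; it does not
            # delay page_source. To wait for late JS content, use WebDriverWait.
            driver.implicitly_wait(5)
            content = driver.page_source
        finally:
            # Always close the webdriver, even if loading the page fails
            driver.quit()

        # Wrap the rendered HTML in an HtmlResponse and return it as the new response
        return HtmlResponse(url=request.url, body=content, request=request, encoding='utf-8')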
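The method above starts and quits a browser for every request, and it renders every URL through Chrome. As a variation, the sketch below reuses one driver and renders only requests that opt in. It is a hedged sketch, not the approach of the original post: the class name, the "use_selenium" meta key and the driver_factory parameter are all assumptions, and the default factory assumes selenium can locate a ChromeDriver on its own.

```python
class SeleniumOptInMiddleware:
    """Hypothetical downloader middleware: renders only requests flagged in meta."""

    def __init__(self, driver_factory=None):
        # driver_factory: zero-argument callable returning a WebDriver-like object.
        self._driver_factory = driver_factory or self._default_driver
        self._driver = None  # created lazily, then reused for every flagged request

    @staticmethod
    def _default_driver():
        # Imported lazily so the class can be loaded without selenium installed.
        from selenium import webdriver
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        return webdriver.Chrome(options=options)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls from_crawler when present; hook spider_closed to quit the driver.
        from scrapy import signals
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        if not request.meta.get("use_selenium"):  # assumed key name
            return None  # None lets Scrapy continue with its normal downloader
        from scrapy.http import HtmlResponse
        if self._driver is None:
            self._driver = self._driver_factory()
        self._driver.get(request.url)
        return HtmlResponse(url=request.url, body=self._driver.page_source,
                            request=request, encoding="utf-8")

    def spider_closed(self, spider):
        # Quit the shared browser once, when the spider finishes.
        if self._driver is not None:
            self._driver.quit()
            self._driver = None
```

A spider would opt in per request, for example with scrapy.Request(url, meta={"use_selenium": True}). Requests without the flag skip the browser entirely, which keeps static pages fast.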

2. In settings.py, enable (uncomment) the SPIDER_MIDDLEWARES, DOWNLOADER_MIDDLEWARES and DOWNLOAD_DELAY settings

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 2

# ......

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'scrapy_yys.middlewares.ScrapyYysSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_yys.middlewares.ScrapyYysDownloaderMiddleware': 543,
}
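With DOWNLOAD_DELAY = 2, Scrapy does not wait exactly 2 seconds between requests to the same site. Its RANDOMIZE_DOWNLOAD_DELAY setting is on by default, so each wait falls between 0.5 and 1.5 times the configured delay. The sketch below only models that range for illustration. randomized_delay is a hypothetical helper, not part of Scrapy.

```python
import random

DOWNLOAD_DELAY = 2  # same value as in settings.py


def randomized_delay(base_delay: float) -> float:
    """Model of the default wait range: 0.5x to 1.5x of the configured delay."""
    return random.uniform(0.5 * base_delay, 1.5 * base_delay)


if __name__ == "__main__":
    # Each printed value lies between 1.0 and 3.0 seconds for a base delay of 2.
    for _ in range(3):
        print(round(randomized_delay(DOWNLOAD_DELAY), 2))
```

Setting RANDOMIZE_DOWNLOAD_DELAY = False in settings.py would make the wait a fixed DOWNLOAD_DELAY instead.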