ChromeDriver download link; the driver version only needs to be close to your installed Chrome version.
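If you want to confirm that the driver and browser versions actually line up, a quick check is to launch the driver once and print both version strings. This is only a sketch: the chromedriver path is a placeholder, and the exact capability keys can vary between Selenium releases.

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
# Path is a placeholder; point it at your own chromedriver binary.
driver = webdriver.Chrome(executable_path='/path/to/chromedriver', options=options)
print(driver.capabilities.get('browserVersion'))                         # Chrome version
print(driver.capabilities.get('chrome', {}).get('chromedriverVersion'))  # chromedriver version
driver.quit()
```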
1. Modify the returned response object in `middlewares.py`: find the `<YourProjectName>DownloaderMiddleware` class (`ScrapyYysDownloaderMiddleware` in this project) and edit its `process_request` method.
```python
# Required at the top of middlewares.py:
#   from selenium import webdriver
#   from scrapy.http import HtmlResponse

def process_request(self, request, spider):
    # Replace the response returned by the DownloaderMiddleware:
    # use a webdriver-driven browser to fetch the JS-rendered page.
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # No visible window; on Linux without a display, omitting this makes startup fail.
    options.add_argument('blink-settings=imagesEnabled=false')  # Skip image loading to speed things up.
    options.add_argument('--disable-gpu')  # Recommended in Google's docs to work around a bug.
    options.add_argument('no-sandbox')  # Disable the sandbox.
    options.add_argument('disable-blink-features=AutomationControlled')  # Hide the Blink "AutomationControlled" flag.
    options.add_experimental_option('excludeSwitches', ['enable-automation'])  # Hide the "controlled by automated software" banner.

    # executable_path is the path to your chromedriver binary
    # (Selenium 3 API; Selenium 4 passes it via a Service object instead).
    driver = webdriver.Chrome(executable_path='/Users/mulin/Chrome/chromedriver', options=options)
    # Remove `window.navigator.webdriver`, which automation-driven Chrome sets to true by default.
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            })
        """
    })

    driver.get(request.url)
    driver.implicitly_wait(5)
    content = driver.page_source
    # Shut down the browser.
    driver.quit()

    # Wrap the rendered HTML in an HtmlResponse and return it.
    return HtmlResponse(url=request.url, body=content, request=request, encoding='utf-8')
```
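One thing to be aware of: the snippet above launches and quits a full Chrome instance for every single request, which is slow. A common refinement, shown here only as a sketch (it is not part of the original steps; the class name matches the `scrapy_yys` project used in the settings below), is to create the driver once when the middleware is built and close it when the spider finishes:

```python
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class ScrapyYysDownloaderMiddleware:

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        options.add_argument('--disable-gpu')
        # Reuse one browser instance for the whole crawl.
        # executable_path as above (Selenium 3 API).
        self.driver = webdriver.Chrome(executable_path='/Users/mulin/Chrome/chromedriver',
                                       options=options)

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Close the browser when the spider finishes.
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        self.driver.get(request.url)
        self.driver.implicitly_wait(5)
        return HtmlResponse(url=request.url, body=self.driver.page_source,
                            request=request, encoding='utf-8')

    def spider_closed(self, spider):
        self.driver.quit()
```

The trade-off is that all requests share one browser session, so cookies and login state persist from page to page.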
3. In `settings.py`, enable the `SPIDER_MIDDLEWARES`, `DOWNLOADER_MIDDLEWARES`, and `DOWNLOAD_DELAY` settings.
```python
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 2

......

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'scrapy_yys.middlewares.ScrapyYysSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_yys.middlewares.ScrapyYysDownloaderMiddleware': 543,
}
```
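With the middleware enabled, the spider itself needs no Selenium-specific code: the `response` passed to `parse` is already the browser-rendered HTML. A minimal sketch (the spider name, URL, and selector are placeholders):

```python
import scrapy


class YysSpider(scrapy.Spider):
    name = 'yys'
    start_urls = ['https://example.com/js-rendered-page']  # placeholder URL

    def parse(self, response):
        # response.body is the HTML produced by the headless browser,
        # so elements rendered by JavaScript can be selected as usual.
        for title in response.css('h1::text').getall():
            yield {'title': title}
```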