DrissionPage is a Python-based web automation tool. It can control a browser, send and receive data packets, and combine the two, pairing the convenience of browser automation with the efficiency of requests. It is powerful, with many user-friendly designs and convenient features built in. Its syntax is concise and elegant, so tasks take little code, which makes it friendly to newcomers.
Background
When collecting data with requests from a website that requires login, you have to analyze data packets and JS source code, construct complex requests, and often deal with anti-crawling measures such as captchas, JS obfuscation, and signature parameters. The barrier to entry is high and development is slow. A browser largely sidesteps these pitfalls, but it runs much less efficiently.
The original intention of this library is therefore to combine the two, so that code is both fast to write and fast to run. It can switch between modes as needed and offers a user-friendly interface that improves both development and runtime efficiency. Beyond merging the two, the library also wraps common functions at the level of the web page, with very simple operations and statements, so that users can focus on implementing features rather than on low-level details. Powerful features are implemented in a simple way, which keeps your code elegant.
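A minimal sketch of the mode switch described above, assuming the library's WebPage object, its change_mode() method, and its mode and title attributes. The function name title_in_both_modes is hypothetical and exists only for illustration. Running it requires DrissionPage, a local Chromium-based browser, and network access.

```python
def title_in_both_modes(url):
    """Load url in a browser (d mode), then switch to packet mode (s mode)."""
    # Imported inside the function so the sketch can be loaded
    # even where DrissionPage is not installed.
    from DrissionPage import WebPage

    page = WebPage()  # assumed to start in d mode, driving a browser
    try:
        page.get(url)
        browser_title = page.title  # title as rendered by the browser

        # Switch from d mode to s mode. By default this is assumed to copy
        # cookies to the session and request the current URL again.
        page.change_mode()
        session_title = page.title  # title parsed from the raw response

        return page.mode, browser_title, session_title
    finally:
        page.quit()  # close the browser and the session
```

Calling title_in_both_modes('https://gitee.com/explore/all') would be expected to return the mode 's' together with the two titles. Login steps that are awkward with plain requests can be done in d mode before the switch, after which later pages can be fetched as packets.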
Earlier versions were a re-encapsulation of Selenium. Starting with 3.0, the authors have redeveloped the underlying layer from scratch, removing the dependence on Selenium, enhancing functionality, and improving runtime efficiency.
Core Competencies
This library uses a fully self-developed kernel with many practical functions built in, and it integrates and optimizes common operations. Compared with Selenium, it has the following advantages:
- No webdriver fingerprint
- No need to download a different driver for each browser version
- Runs faster
- Elements can be found across iframes without switching in and out
- An iframe is treated as an ordinary element: once you have it, you can search for elements inside it directly, which makes the logic clearer
- Multiple tabs can be operated at the same time, even inactive ones, without switching between them
- Images can be saved directly from the browser cache, without clicking Save in the GUI
- Screenshots can cover the entire web page, including parts outside the viewport (supported from browser version 90)
- Shadow roots in a non-open (closed) state can be handled
Get started with the demo
The SessionPage object, and the WebPage object in s mode, access web pages by sending and receiving data packets.
As the name suggests, SessionPage is a page built on the Session object of the requests library. It uses the POM (Page Object Model) pattern to encapsulate network connections and HTML parsing, making sending and receiving packets as easy as operating a page.
Moreover, thanks to the library's own element-finding methods, collecting data is far more convenient than with combinations such as requests + beautifulsoup.
SessionPage is the simplest of the several page objects in the library, so let's start with it.
Let's look at a simple example to understand how SessionPage works.
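# Import
from DrissionPage import SessionPage
# Create the page object
page = SessionPage()
# Visit the web page
page.get('https://gitee.com/explore/all')
# Find all <h3> elements on the page
items = page.eles('t:h3')
# Iterate over the elements (the last <h3> is skipped)
for item in items[:-1]:
    # Get the <a> element under the current <h3> element
    lnk = item('tag:a')
    # Print the text and href attribute of the <a> element
    print(lnk.text, lnk.link)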
Output:
Compare this with the page as it appears on the website.
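To see what the loop in the SessionPage example does without network access, here is a rough standard-library sketch that performs the same kind of extraction on an inline HTML snippet. It is not DrissionPage code: SAMPLE_HTML and HeadingLinkParser are hypothetical names, the markup is an invented stand-in for the real page, and the href is printed as written rather than as a resolved link.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the listing page: each <h3> wraps one <a>.
SAMPLE_HTML = """
<div>
  <h3><a href="/org/project-one">Project One</a></h3>
  <h3><a href="/org/project-two">Project Two</a></h3>
  <h3><a href="/help">Help</a></h3>
</div>
"""


class HeadingLinkParser(HTMLParser):
    """Collect (text, href) for every <a> nested inside an <h3>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_h3 = False
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments of that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
        elif tag == "a" and self._in_h3:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a tracked <a>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "h3":
            self._in_h3 = False


parser = HeadingLinkParser()
parser.feed(SAMPLE_HTML)

# Mirror items[:-1] from the example: drop the last heading.
links = parser.links[:-1]
for text, href in links:
    print(text, href)
```

The hand-written parser state here corresponds to what the two locator strings 't:h3' and 'tag:a' express in the SessionPage example, which is the convenience the text refers to.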