Python lxml parsing library practical application determines the information element structure benchmark expression extraction data expression complete program code

The following uses the lxml library to grab the cat's eye movie top 100 list (click to access), in the process of writing the program, pay attention to the comparison with the regular parsing method used in the "Python crawler crawler cat's eye movie ranking", so that you will find that the lxml parsing library is so convenient.

<h1 class="pgc-h-arrow-right" data-track="2" > determine the structure of the information element</h1>

First of all, it is clear that the structure of the web page elements to capture the information, such as the name of the movie, the starring actors, and the release time. Through a simple analysis, it can be known that the information of each film is contained in the <d> tag, and each < dd> tag is included in the <dl > tag, so for the dd tag, the dl tag is a larger node, that is, its parent node, as follows:

Python lxml parsing library practical application determines the information element structure benchmark expression extraction data expression complete program code

When the extraction of the movie information in a <d> tag is complete, you need to use the same Xpath expression to extract the next movie information until all the movie information is extracted, which is obviously cumbersome. So is there a better way?

<h1 class="pgc-h-arrow-right" data-track="5" > benchmark expression</h1>

Because each node object uses the same Xpath expression to match information, it's easy to think of a for loop. We put 10 <dd> nodes into a list, and then use the for loop to traverse each node object, which greatly improves the efficiency of coding.

By < dd> the parent node of the node<dl> it is possible to match 10 <d> nodes at the same time and put these node objects into a list. We call the Xpath expression that matches 10 <dd > nodes "benchmark expression". This is shown below:

The following is a <d> node object matched by a benchmark expression, as follows:

Output:

<h1 class="pgc-h-arrow-right" data-track="13" > extract data expression</h1>

Since the information we want to grab is contained in the <d > node, let's start analyzing the HTML code contained in the <d > node, and the following is a random piece of <d> node contains the movie information, as follows:

Analyze the above code snippet and write out the Xpath expression for the information to be crawled, as follows:

<h1 class="pgc-h-arrow-right" data-track="18" > the complete program code</h1>

The above content describes the Xpath expression used when writing the program, and the following is the formal writing of the bot program, the code is as follows:

The output is as follows:

Let's start a class square - talent learning and exchange platform