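There is a Python book called Web Scraping with Python (《Python 網絡資料采集》).
I plan to implement all of its examples with .NET Core and third-party libraries.
This is part one, which mainly uses AngleSharp:
https://anglesharp.github.io/
(The chapter numbering in this article follows the book.)
Chapter 1: Your First Web Scraper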
Sending an HTTP request
The book sends the HTTP request in Python with urllib from the standard library.
In .NET Core you can use HttpClient. The corresponding C# code is:
var client = new HttpClient();
HttpResponseMessage response = await client.GetAsync("http://pythonscraping.com/pages/page1.html");
response.EnsureSuccessStatusCode();
var responseBody = await response.Content.ReadAsStringAsync();
Console.WriteLine(responseBody);
return responseBody;
Or, more concisely:
var client = new HttpClient();
var responseBody = await client.GetStringAsync("http://pythonscraping.com/pages/page1.html");
Console.WriteLine(responseBody);
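Both versions print the HTML source of the page.
Parsing HTML with AngleSharp
In Python you can parse HTML with libraries such as BeautifulSoup or MechanicalSoup.
In .NET Core you can use libraries such as AngleSharp, Html Agility Pack, or DotnetSpider (a Chinese-developed library that also supports element extraction) to work with HTML documents.
I start with AngleSharp here. Its parser follows the W3C specifications and can parse HTML, MathML, XML, SVG and CSS. It supports .NET Standard 1.0.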
Installing AngleSharp
Via NuGet:
https://www.nuget.org/packages/AngleSharp/
Install-Package AngleSharp
Or with the dotnet CLI:
dotnet add package AngleSharp
A simple AngleSharp example
The following example (section 1.2.2 of the book) prints the h1 element of the page.
The book does this in Python; the equivalent .NET Core C# code is:
public static async Task ReadWithAngleSharpAsync()
{
var htmlSourceCode = await SendRequestWithHttpClientAsync();
var parser = new HtmlParser();
var document = await parser.ParseAsync(htmlSourceCode);
Console.WriteLine($"Serializing the (original) document: {document.QuerySelector("h1").OuterHtml}");
Console.WriteLine($"Serializing the (original) document: {document.QuerySelector("html > body > h1").OuterHtml}");
}
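AngleSharp first needs an HtmlParser, which can be reused; you then parse the HTML source with parser.Parse() or its asynchronous counterpart parser.ParseAsync().
The parser returns an IHtmlDocument that holds the parsed DOM; the DOM types map onto classes in AngleSharp.
The class diagram usually shown for this mapping comes from a somewhat older version, and the DOM model in newer versions differs slightly, but the idea is the same.
AngleSharp has many features, but the most important one is its support for QuerySelector() and QuerySelectorAll(), which work just like the DOM methods of the same names.
In the example above, the h1 is a direct child of body, which in turn is a direct child of html.
So for the returned IHtmlDocument object, document, the expression document.QuerySelector("h1").OuterHtml returns the OuterHtml of the h1. document.QuerySelector("html > body > h1").OuterHtml gives the same result, because standard CSS selectors are supported.
QuerySelector() returns one element or none (null), much like LINQ's FirstOrDefault().
Running it prints the same h1 markup twice, once for each selector.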
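Handling exceptions
After sending an HTTP request, things can go wrong: the page may not exist (or the request itself may fail), the server may not exist, and so on.
In these cases the server may answer with an HTTP error such as 404 or 500, or there may be no answer at all. Either way, HttpClient.GetStringAsync throws an HttpRequestException (as does EnsureSuccessStatusCode for a non-success status code), which we can handle like this: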
public static async Task ResponseWithErrorsAsync()
{
try
{
var client = new HttpClient();
var responseBody = await client.GetStringAsync("http://notexistwebsite");
Console.WriteLine(responseBody);
}
catch (HttpRequestException e)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine("\nException Caught!");
Console.WriteLine("Message :{0} ", e.Message);
}
}
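To try this behaviour without touching the network, here is a minimal sketch. FixedStatusHandler, StatusDemo and TryGetHtmlAsync are hypothetical names introduced for illustration, and the example.invalid URL is a placeholder that is never contacted. The sketch shows GetStringAsync throwing HttpRequestException for a 404 response, and checking IsSuccessStatusCode on the response from GetAsync as an alternative when you would rather branch on the status code than catch an exception.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical test double: answers every request with a fixed status code, no network involved.
public sealed class FixedStatusHandler : HttpMessageHandler
{
    private readonly HttpStatusCode _status;

    public FixedStatusHandler(HttpStatusCode status) => _status = status;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var response = new HttpResponseMessage(_status)
        {
            RequestMessage = request,
            Content = new StringContent("<html><body><h1>stub</h1></body></html>")
        };
        return Task.FromResult(response);
    }
}

public static class StatusDemo
{
    // Returns the body, or null when the server answers with a non-success status code.
    public static async Task<string> TryGetHtmlAsync(HttpClient client, string url)
    {
        using (var response = await client.GetAsync(url))
        {
            if (!response.IsSuccessStatusCode)
            {
                Console.WriteLine($"Request failed with status {(int)response.StatusCode}");
                return null;
            }
            return await response.Content.ReadAsStringAsync();
        }
    }

    public static async Task Main()
    {
        var client = new HttpClient(new FixedStatusHandler(HttpStatusCode.NotFound));
        try
        {
            // GetStringAsync turns the 404 into an exception.
            await client.GetStringAsync("http://example.invalid/missing");
        }
        catch (HttpRequestException e)
        {
            Console.WriteLine($"GetStringAsync threw: {e.Message}");
        }

        // GetAsync does not throw for a 404, so the status code can be inspected instead.
        var body = await TryGetHtmlAsync(client, "http://example.invalid/missing");
        Console.WriteLine($"Body is null: {body is null}");
    }
}
```

Injecting a handler like this is only a convenience for experimenting; with a real HttpClient the same two code paths apply to real responses.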
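But even when the page is fetched successfully, its content may not be what we expect, and exceptions can still occur. For example, if the tag you are looking for does not exist, QuerySelector() returns null, and accessing a member of that null result throws a NullReferenceException.
You can catch the NullReferenceException, or check for null in code: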
public static async Task ReadNonExistTagAsync()
{
var htmlSourceCode = await SendRequestWithHttpClientAsync();
var parser = new HtmlParser();
var document = await parser.ParseAsync(htmlSourceCode);
var nonExistTag = document.QuerySelector("h8");
Console.WriteLine(nonExistTag);
Console.WriteLine($"nonExistTag is null: {nonExistTag is null}");
try
{
Console.WriteLine(nonExistTag.QuerySelector("p").OuterHtml);
}
catch (NullReferenceException)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine("Tag was not found");
}
}
A complete example:
public static async Task RunAllAsync()
{
Console.ForegroundColor = ConsoleColor.Red;
async Task<string> GetTileAsync(string uri)
{
var httpClient = new HttpClient();
try
{
var responseHtml = await httpClient.GetStringAsync(uri);
var parser = new HtmlParser();
var document = await parser.ParseAsync(responseHtml);
var tagContent = document.QuerySelector("body > h8").TextContent;
return tagContent;
}
catch (HttpRequestException e)
{
Console.WriteLine($"{nameof(HttpRequestException)}:");
Console.WriteLine("Message :{0} ", e.Message);
return null;
}
catch (NullReferenceException)
{
Console.WriteLine($"{nameof(NullReferenceException)}:");
Console.WriteLine("Tag was not found");
return null;
}
}
var title = await GetTileAsync("http://www.pythonscraping.com/pages/page1.html");
if (string.IsNullOrWhiteSpace(title))
{
Console.WriteLine("Title was not found");
}
else
{
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine(title);
}
}
Chapter 2: Advanced HTML Parsing
First, I wrapped the code that requests a URL and returns its HTML in a method so that it can be reused:
public static async Task<string> GetHtmlSourceCodeAsync(string uri)
{
var httpClient = new HttpClient();
try
{
var htmlSource = await httpClient.GetStringAsync(uri);
return htmlSource;
}
catch (HttpRequestException e)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($"{nameof(HttpRequestException)}: {e.Message}");
return null;
}
}
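CSS is a boon for web scrapers: a page may contain many elements that share a class, such as the span elements with class green or red on the page used here.
We can use AngleSharp's QuerySelectorAll() method to find every element that matches a selector and get them back as a collection.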
public static async Task FindGreenClassAsync()
{
const string url = "http://www.pythonscraping.com/pages/warandpeace.html";
var html = await GetHtmlSourceCodeAsync(url);
if (!string.IsNullOrWhiteSpace(html))
{
var parser = new HtmlParser();
var document = await parser.ParseAsync(html);
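// "span.green" selects span elements whose class is green ("span > .green" would select children of spans instead).
var nameList = document.QuerySelectorAll("span.green");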
Console.WriteLine("Green names are:");
Console.ForegroundColor = ConsoleColor.Green;
foreach (var item in nameList)
{
Console.WriteLine(item.TextContent);
}
}
else
{
Console.WriteLine("No html source code returned.");
}
}
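This is very simple, and works the same way as the standard DOM operations.
If you only need an element's text, use its TextContent property.
Here is another example:
1. Find all h1, h2, h3, h4, h5 and h6 elements in the page.
2. Find the span elements whose class is green or red.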
public static async Task FindByAttributeAsync()
{
const string url = "http://www.pythonscraping.com/pages/warandpeace.html";
var html = await GetHtmlSourceCodeAsync(url);
if (!string.IsNullOrWhiteSpace(html))
{
var parser = new HtmlParser();
var document = await parser.ParseAsync(html);
var headers = document.QuerySelectorAll("*")
.Where(x => new[] { "h1", "h2", "h3", "h4", "h5", "h6" }.Contains(x.TagName.ToLower()));
Console.WriteLine("Headers are:");
PrintItemsText(headers);
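// TagName is upper-case ("SPAN") for HTML elements, so compare the lower-case LocalName instead.
var greenAndRed = document.All
.Where(x => x.LocalName == "span" && (x.ClassList.Contains("green") || x.ClassList.Contains("red")));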
Console.WriteLine("Green and Red spans are:");
PrintItemsText(greenAndRed);
var thePrinces = document.QuerySelectorAll("*").Where(x => x.TextContent == "the prince");
Console.WriteLine(thePrinces.Count());
}
else
{
Console.WriteLine("No html source code returned.");
}
void PrintItemsText(IEnumerable<IElement> elements)
{
foreach (var item in elements)
{
Console.WriteLine(item.TextContent);
}
}
}
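Here we can see that the result of QuerySelectorAll() can be filtered with LINQ's Where method, which is very powerful.
The TagName property is the element's tag name. For HTML elements it is upper-case, which is why the code lower-cases it before comparing; LocalName gives the lower-case name directly.
There is also document.All: the All property is the collection of every element in the document, and it supports LINQ as well.
(The method above also uses a local function.)
With both CSS selectors and LINQ available, extracting elements becomes much simpler.
Navigating the tree
A page is a tree of nested tags, and a few concepts help when moving around in it:
Child tags and descendant tags
A child tag is one level below its parent, whereas descendant tags are all the tags at any level below it.
For example, tr is a child of table, while tr, th, td and img are all descendants of table.
With AngleSharp, use the .Children property to get child tags, and a CSS selector to get descendant tags.
Sibling tags
The previous sibling is the .PreviousElementSibling property, and the next sibling is .NextElementSibling.
Parent tag
The .ParentElement property is the parent tag.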
public static async Task FindDescendantAsync()
{
const string url = "http://www.pythonscraping.com/pages/page3.html";
var html = await GetHtmlSourceCodeAsync(url);
if (!string.IsNullOrWhiteSpace(html))
{
var parser = new HtmlParser();
var document = await parser.ParseAsync(html);
var tableChildren = document.QuerySelector("table#giftList > tbody").Children;
Console.WriteLine("Table's children are:");
foreach (var child in tableChildren)
{
System.Console.WriteLine(child.LocalName);
}
var descendants = document.QuerySelectorAll("table#giftList > tbody *");
Console.WriteLine("Table's descendants are:");
foreach (var item in descendants)
{
Console.WriteLine(item.LocalName);
}
var siblings = document.QuerySelectorAll("table#giftList > tbody > tr").Select(x => x.NextElementSibling);
Console.WriteLine("Rows' next siblings are:");
foreach (var item in siblings)
{
Console.WriteLine(item?.LocalName);
}
var parentSibling = document.All.SingleOrDefault(x => x.HasAttribute("src") && x.GetAttribute("src") == "../img/gifts/img1.jpg")
?.ParentElement.PreviousElementSibling;
if (parentSibling != null)
{
Console.WriteLine($"Parent's previous sibling is: {parentSibling.TextContent}");
}
}
else
{
Console.WriteLine("No html source code returned.");
}
}
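The output lists the children of the table body, then all of its descendants, then the next sibling of each row, and finally the text of the cell that precedes the image's parent cell.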
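The child, descendant, sibling and parent relationships can also be tried out offline. The sketch below is not AngleSharp: it swaps in LINQ to XML (System.Xml.Linq) from the base class library, purely as an analogy, and runs on a tiny hand-written table. TreeNavigationDemo and BuildTable are hypothetical names, and the markup is an assumption modelled on the gift list page. Roughly, Elements() plays the role of Children, Descendants() the role of a descendant selector, Parent the role of ParentElement, and ElementsBeforeSelf().LastOrDefault() the role of PreviousElementSibling.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class TreeNavigationDemo
{
    // A small, well-formed stand-in for the gift table (an assumption, not the real page).
    public static XElement BuildTable() => XElement.Parse(
        "<table id='giftList'>" +
        "<tr><th>Item</th><th>Cost</th></tr>" +
        "<tr><td>Vegetable Basket</td><td><img src='../img/gifts/img1.jpg'/></td></tr>" +
        "</table>");

    public static void Main()
    {
        var table = BuildTable();

        // Children: only the elements one level below the table.
        Console.WriteLine(string.Join(",", table.Elements().Select(e => e.Name.LocalName)));

        // Descendants: every element at any depth below the table, in document order.
        Console.WriteLine(string.Join(",", table.Descendants().Select(e => e.Name.LocalName)));

        // Parent, then previous sibling: the cell just before the one that holds the image.
        var img = table.Descendants("img").Single();
        var previousCell = img.Parent.ElementsBeforeSelf().LastOrDefault();
        Console.WriteLine(previousCell?.Value);
    }
}
```

Unlike an HTML parser, LINQ to XML does not insert a tbody element, so here the rows are direct children of the table.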
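Using regular expressions
"If you have a problem and decide to solve it with regular expressions, you now have two problems."
Here is a site for testing regular expressions:
https://www.regexpal.com/
AngleSharp supports finding elements with CSS selectors and filtering them with LINQ, and regular expressions can of course be combined with both in several ways for more complex searches.
I won't introduce regular expressions here; let's go straight to an example.
I want to find every image in the page whose src starts with ../img/gifts/img, is followed by a number, and ends with .jpg.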
public static async Task FindByRegexAsync()
{
const string url = "http://www.pythonscraping.com/pages/page3.html";
var html = await GetHtmlSourceCodeAsync(url);
if (!string.IsNullOrWhiteSpace(html))
{
var parser = new HtmlParser();
var document = await parser.ParseAsync(html);
var images = document.QuerySelectorAll("img")
.Where(x => x.HasAttribute("src") && Regex.Match(x.Attributes["src"].Value, @"\.\.\/img\/gifts/img.*\.jpg").Success);
foreach (var item in images)
{
Console.WriteLine(item.Attributes["src"].Value);
}
var elementsWith2Attributes = document.All.Where(x => x.Attributes.Length == 2);
foreach (var item in elementsWith2Attributes)
{
Console.WriteLine(item.LocalName);
foreach (var attr in item.Attributes)
{
Console.WriteLine($"\t{attr.Name} - {attr.Value}");
}
}
}
else
{
Console.WriteLine("No html source code returned.");
}
}
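There is nothing difficult here.
Still, the example shows that you can check whether an element has an attribute with HasAttribute("xxx"), get an attribute through the .Attributes indexer, and read its value with .Attributes["xxx"].Value.
If you don't know regular expressions, I believe a few more LINQ filters would achieve much the same result.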
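The pattern used above is fairly loose: it is not anchored, and .* accepts any characters after img, not just digits. The following sketch compares it with a stricter variant on a few sample strings. GiftImageFilter, the Strict pattern and the sample values are assumptions made for illustration; only the Loose pattern comes from the code above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GiftImageFilter
{
    // The pattern from the article: unanchored, and ".*" accepts anything after "img".
    public static readonly Regex Loose = new Regex(@"\.\.\/img\/gifts/img.*\.jpg");

    // A stricter variant (an assumption): anchored at both ends, digits only after "img".
    public static readonly Regex Strict = new Regex(@"^\.\./img/gifts/img\d+\.jpg$");

    public static void Main()
    {
        var samples = new[]
        {
            "../img/gifts/img1.jpg",        // matches both patterns
            "../img/gifts/imgX.jpg",        // loose only: "X" is not a digit
            "x/../img/gifts/img3.jpg.bak",  // loose only: extra text before and after
            "../img/gifts/logo.jpg"         // matches neither: no "img" after "gifts/"
        };

        foreach (var src in samples)
        {
            Console.WriteLine($"{src}: loose={Loose.IsMatch(src)}, strict={Strict.IsMatch(src)}");
        }
    }
}
```

For the gift list page the two patterns should select the same images; the stricter one simply states the "number followed by .jpg" requirement explicitly.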
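Chapter 3: Starting to Crawl
Traversing a single domain
These are just a few applied examples, so here is the code.
Print every hyperlink address in a page: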
public static async Task TraversingASingleDomainAsync()
{
var httpClient = new HttpClient();
var htmlSource = await httpClient.GetStringAsync("http://en.wikipedia.org/wiki/Kevin_Bacon");
var parser = new HtmlParser();
var document = await parser.ParseAsync(htmlSource);
var links = document.QuerySelectorAll("a");
foreach (var link in links)
{
Console.WriteLine(link.Attributes["href"]?.Value);
}
}
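Find the hyperlinks that meet all of the following conditions:
- They are inside the div whose id is bodyContent
- The URL does not contain a colon
- The URL starts with /wiki/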
public static async Task FindSpecificLinksAsync()
{
var httpClient = new HttpClient();
var htmlSource = await httpClient.GetStringAsync("http://en.wikipedia.org/wiki/Kevin_Bacon");
var parser = new HtmlParser();
var document = await parser.ParseAsync(htmlSource);
var links = document.QuerySelector("div#bodyContent").QuerySelectorAll("a")
.Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, @"^(/wiki/)((?!:).)*$").Success);
foreach (var link in links)
{
Console.WriteLine(link.Attributes["href"]?.Value);
}
}
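The regular expression in this filter can be checked on its own, without fetching anything. In the sketch below, WikiLinkFilter and IsArticleLink are hypothetical names and the sample hrefs are made up for illustration; the pattern itself is the one used above. The negative lookahead (?!:) is what rejects any href containing a colon, such as category or talk pages.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WikiLinkFilter
{
    // Same pattern as in the article: starts with /wiki/ and contains no colon afterwards.
    private static readonly Regex ArticleLink = new Regex(@"^(/wiki/)((?!:).)*$");

    public static bool IsArticleLink(string href) => href != null && ArticleLink.IsMatch(href);

    public static void Main()
    {
        var samples = new[]
        {
            "/wiki/Kevin_Bacon",               // article link: accepted
            "/wiki/Category:American_actors",  // contains a colon: rejected
            "/wiki/Talk:Kevin_Bacon",          // contains a colon: rejected
            "//en.wikipedia.org/wiki/Film",    // does not start with /wiki/: rejected
            "#cite_note-1"                     // in-page anchor: rejected
        };

        foreach (var href in samples)
        {
            Console.WriteLine($"{href}: {IsArticleLink(href)}");
        }
    }
}
```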
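Pick a random link on the page, then repeat the process on the page it points to, until a page has no matching links or you stop the program: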
private static async Task<IEnumerable<IElement>> GetLinksAsync(string uri)
{
var httpClient = new HttpClient();
var htmlSource = await httpClient.GetStringAsync($"http://en.wikipedia.org{uri}");
var parser = new HtmlParser();
var document = await parser.ParseAsync(htmlSource);
var links = document.QuerySelector("div#bodyContent").QuerySelectorAll("a")
.Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, @"^(/wiki/)((?!:).)*$").Success);
return links;
}
public static async Task GetRandomNestedLinksAsync()
{
var random = new Random();
var links = (await GetLinksAsync("/wiki/Kevin_Bacon")).ToList();
while (links.Any())
{
var newArticle = links[random.Next(0, links.Count)].Attributes["href"].Value;
Console.WriteLine(newArticle);
links = (await GetLinksAsync(newArticle)).ToList();
}
}
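Crawling an entire site
First, a few concepts:
Surface web: the part of the Internet that search engines can crawl directly.
Deep web: the opposite of the surface web. According to the book, about 90% of the Internet is deep web.
Darknet / dark web / dark Internet: a different beast entirely. It also runs on the existing network infrastructure, but uses the Tor client with a new protocol running on top of HTTP to provide a secure channel for exchanging information. It can be scraped too, but that is beyond the scope of the book.
Compared with the dark web, the deep web is relatively easy to scrape.
Crawling an entire site has two benefits:
- Generating a site map
- Gathering data
Because of a site's size and depth, many of the links collected will be duplicates, so we need to de-duplicate them. A set type such as HashSet works well for this: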
private static readonly HashSet<string> LinkSet = new HashSet<string>();
private static readonly HttpClient HttpClient = new HttpClient();
private static readonly HtmlParser Parser = new HtmlParser();
public static async Task GetUniqueLinksAsync(string uri = "")
{
var htmlSource = await HttpClient.GetStringAsync($"http://en.wikipedia.org{uri}");
var document = await Parser.ParseAsync(htmlSource);
var links = document.QuerySelectorAll("a")
.Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, @"^(/wiki/)").Success);
foreach (var link in links)
{
if (!LinkSet.Contains(link.Attributes["href"].Value))
{
var newPage = link.Attributes["href"].Value;
Console.WriteLine(newPage);
LinkSet.Add(newPage);
await GetUniqueLinksAsync(newPage);
}
}
}
(Watch the recursion depth, otherwise the program can sometimes crash.)
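One way to sidestep the recursion depth problem is to keep the pages still to be visited in an explicit queue. The sketch below is an assumption, not the code from the project: IterativeCrawler, Crawl, the getLinks callback and the maxPages limit are hypothetical, and a small in-memory dictionary stands in for the web site so that the example runs without network access. Note that it visits pages breadth-first, whereas the recursive version above goes depth-first.

```csharp
using System;
using System.Collections.Generic;

public static class IterativeCrawler
{
    // Visits every page reachable from start at most once, using a queue instead of recursion.
    // getLinks is a hypothetical callback returning the links found on a page.
    public static List<string> Crawl(
        string start, Func<string, IEnumerable<string>> getLinks, int maxPages = 1000)
    {
        var seen = new HashSet<string> { start };
        var pending = new Queue<string>();
        pending.Enqueue(start);
        var visited = new List<string>();

        while (pending.Count > 0 && visited.Count < maxPages)
        {
            var page = pending.Dequeue();
            visited.Add(page);

            foreach (var link in getLinks(page))
            {
                // HashSet.Add returns false for duplicates, so each link is queued only once.
                if (seen.Add(link))
                {
                    pending.Enqueue(link);
                }
            }
        }

        return visited;
    }

    public static void Main()
    {
        // A made-up link graph standing in for a real site.
        var site = new Dictionary<string, string[]>
        {
            ["/wiki/A"] = new[] { "/wiki/B", "/wiki/C" },
            ["/wiki/B"] = new[] { "/wiki/A", "/wiki/C" },
            ["/wiki/C"] = new[] { "/wiki/A", "/wiki/D" },
            ["/wiki/D"] = new string[0]
        };

        var order = Crawl("/wiki/A", page => site.TryGetValue(page, out var links) ? links : new string[0]);
        Console.WriteLine(string.Join(", ", order)); // prints: /wiki/A, /wiki/B, /wiki/C, /wiki/D
    }
}
```

In a real crawler the callback would fetch the page and extract its links, and the maxPages limit gives a simple way to stop early.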
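Collecting data across an entire site
This example is more complete: it also collects some text from each page and handles exceptions: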
private static readonly HashSet<string> LinkSet = new HashSet<string>();
private static readonly HttpClient HttpClient = new HttpClient();
private static readonly HtmlParser Parser = new HtmlParser();
public static async Task GetLinksWithInfoAsync(string uri = "")
{
var htmlSource = await HttpClient.GetStringAsync($"http://en.wikipedia.org{uri}");
var document = await Parser.ParseAsync(htmlSource);
try
{
var title = document.QuerySelector("h1").TextContent;
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine(title);
var contentElement = document.QuerySelector("#mw-content-text").QuerySelectorAll("p").FirstOrDefault();
if (contentElement != null)
{
Console.WriteLine(contentElement.TextContent);
}
var alink = document.QuerySelector("#ca-edit").QuerySelectorAll("span a").SingleOrDefault(x => x.HasAttribute("href"))?.Attributes["href"].Value;
Console.WriteLine(alink);
}
catch (NullReferenceException)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine("Cannot find the tag!");
}
var links = document.QuerySelectorAll("a")
.Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, @"^(/wiki/)").Success).ToList();
foreach (var link in links)
{
if (!LinkSet.Contains(link.Attributes["href"].Value))
{
var newPage = link.Attributes["href"].Value;
Console.WriteLine(newPage);
LinkSet.Add(newPage);
await GetLinksWithInfoAsync(newPage);
}
}
}
Crawling across the Internet without knowing what lies ahead
The first example follows random external links:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using AngleSharp.Parser.Html;
namespace WebScrapingWithDotNetCore.Chapter03
{
public class CrawlingAcrossInternet
{
private static readonly Random Random = new Random();
private static readonly HttpClient HttpClient = new HttpClient();
private static readonly HashSet<string> InternalLinks = new HashSet<string>();
private static readonly HashSet<string> ExternalLinks = new HashSet<string>();
private static readonly HtmlParser Parser = new HtmlParser();
public static async Task FollowExternalOnlyAsync(string startingSite)
{
var externalLink = await GetRandomExternalLinkAsync(startingSite);
if (externalLink != null)
{
Console.WriteLine($"External Links is: {externalLink}");
await FollowExternalOnlyAsync(externalLink);
}
else
{
Console.WriteLine("Random External link is null, Crawling terminated.");
}
}
private static async Task<string> GetRandomExternalLinkAsync(string startingPage)
{
try
{
var htmlSource = await HttpClient.GetStringAsync(startingPage);
var externalLinks = (await GetExternalLinksAsync(htmlSource, SplitAddress(startingPage)[0])).ToList();
if (externalLinks.Any())
{
return externalLinks[Random.Next(0, externalLinks.Count)];
}
var internalLinks = (await GetInternalLinksAsync(htmlSource, startingPage)).ToList();
if (internalLinks.Any())
{
return await GetRandomExternalLinkAsync(internalLinks[Random.Next(0, internalLinks.Count)]);
}
return null;
}
catch (HttpRequestException e)
{
Console.WriteLine($"Error requesting: {e.Message}");
return null;
}
}
private static string[] SplitAddress(string address)
{
var addressParts = address.Replace("http://", "").Replace("https://", "").Split("/");
return addressParts;
}
private static async Task<IEnumerable<string>> GetInternalLinksAsync(string htmlSource, string includeUrl)
{
var document = await Parser.ParseAsync(htmlSource);
// Regex.Escape keeps characters such as '.' in includeUrl literal inside the pattern.
var links = document.QuerySelectorAll("a")
.Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, $@"^(/|.*{Regex.Escape(includeUrl)})").Success)
.Select(x => x.Attributes["href"].Value)
.Where(x => !string.IsNullOrEmpty(x))
.Distinct()
.ToList();
foreach (var link in links)
{
InternalLinks.Add(link); // HashSet.Add ignores duplicates
}
// Return this page's links as a separate list, so callers can enumerate it while InternalLinks keeps growing.
return links;
}
private static async Task<IEnumerable<string>> GetExternalLinksAsync(string htmlSource, string excludeUrl)
{
var document = await Parser.ParseAsync(htmlSource);
// Regex.Escape keeps characters such as '.' in excludeUrl literal inside the pattern.
var links = document.QuerySelectorAll("a")
.Where(x => x.HasAttribute("href") && Regex.Match(x.Attributes["href"].Value, $@"^(http|www)((?!{Regex.Escape(excludeUrl)}).)*$").Success)
.Select(x => x.Attributes["href"].Value)
.Where(x => !string.IsNullOrEmpty(x))
.Distinct()
.ToList();
foreach (var link in links)
{
ExternalLinks.Add(link); // HashSet.Add ignores duplicates
}
// Return this page's links as a separate list, so callers can enumerate it while ExternalLinks keeps growing.
return links;
}
private static readonly HashSet<string> AllExternalLinks = new HashSet<string>();
private static readonly HashSet<string> AllInternalLinks = new HashSet<string>();
public static async Task GetAllExternalLinksAsync(string siteUrl)
{
try
{
var htmlSource = await HttpClient.GetStringAsync(siteUrl);
var internalLinks = await GetInternalLinksAsync(htmlSource, SplitAddress(siteUrl)[0]);
var externalLinks = await GetExternalLinksAsync(htmlSource, SplitAddress(siteUrl)[0]);
foreach (var link in externalLinks)
{
if (!AllExternalLinks.Contains(link))
{
AllExternalLinks.Add(link);
Console.WriteLine(link);
}
}
foreach (var link in internalLinks)
{
if (!AllInternalLinks.Contains(link))
{
Console.WriteLine($"The link is: {link}");
AllInternalLinks.Add(link);
await GetAllExternalLinksAsync(link);
}
}
}
catch (HttpRequestException e)
{
Console.WriteLine(e);
Console.WriteLine($"Request error: {e.Message}");
}
}
}
}
The program has bugs; feel free to fix them.
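One likely source of trouble is that internal hrefs such as /about are relative, while HttpClient needs an absolute URI unless BaseAddress is set. The sketch below is one possible direction, not the project's fix: LinkClassifier, HostOf, IsExternal and ToAbsolute are hypothetical helpers, and the URLs are sample values. It uses System.Uri to compare hosts instead of splitting the address by hand, and turns root-relative hrefs into absolute URLs before they are requested.

```csharp
using System;

public static class LinkClassifier
{
    private static bool IsHttpUrl(string href) =>
        href.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
        href.StartsWith("https://", StringComparison.OrdinalIgnoreCase);

    // Host part of an absolute URL, e.g. "oreilly.com" for "http://oreilly.com/index.html".
    public static string HostOf(string absoluteUrl) => new Uri(absoluteUrl).Host;

    // True when href is an absolute http(s) URL whose host differs from the page's host.
    public static bool IsExternal(string pageUrl, string href)
    {
        if (string.IsNullOrEmpty(href) || !IsHttpUrl(href)) return false;
        return Uri.TryCreate(href, UriKind.Absolute, out var target)
            && !string.Equals(target.Host, HostOf(pageUrl), StringComparison.OrdinalIgnoreCase);
    }

    // Returns an absolute URL for absolute http(s) hrefs and for root-relative hrefs such as "/about".
    // Returns null for anything else (fragments, mailto:, protocol-relative links and so on).
    public static string ToAbsolute(string pageUrl, string href)
    {
        if (string.IsNullOrEmpty(href)) return null;
        if (IsHttpUrl(href)) return href;
        if (href.StartsWith("/", StringComparison.Ordinal) && !href.StartsWith("//", StringComparison.Ordinal))
        {
            // GetLeftPart(UriPartial.Authority) yields the scheme and host, e.g. "http://oreilly.com".
            return new Uri(pageUrl).GetLeftPart(UriPartial.Authority) + href;
        }
        return null;
    }

    public static void Main()
    {
        const string page = "http://oreilly.com/index.html";
        foreach (var href in new[] { "/about", "http://oreilly.com/topics", "https://www.example.com/x", "#top" })
        {
            Console.WriteLine($"{href}: external={IsExternal(page, href)}, absolute={ToAbsolute(page, href) ?? "(skipped)"}");
        }
    }
}
```

This compares hosts exactly, so www.oreilly.com and oreilly.com count as different sites; whether that is the right rule depends on what you want to treat as "internal".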
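That's it for part one, which mainly used AngleSharp. AngleSharp can do much more than this and is very powerful; see its documentation for details.
Since the next part of the book uses Python's Scrapy, the next article should probably use DotNetSpider, a Chinese-developed library.
The project's code is at:
https://github.com/solenovex/Web-Scraping-With-.NET-Core
My WeChat public account on ASP.NET Core Web API topics is 草根專欄.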