Python Libraries: BeautifulSoup Overview

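BeautifulSoup parses HTML content into a soup document (a parse tree) and can turn a web page whose HTML is not well formed into a complete HTML document.
What exactly does complete, well-formed HTML look like? Before answering that, here is a brief introduction to HTML.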

HTML

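HTML is a hypertext markup language, not a programming language. It is commonly used together with CSS and JavaScript to build web pages, web applications, and the user interfaces of mobile applications.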

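Tags
Tags are a core building block of HTML. They usually come in pairs, and whatever sits between the opening tag and the closing tag is the content of the element.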

<html><head><title>This is a title</title></head><body><p>Hello world!</p></body>
</html>

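The text between <html> and </html> describes the web page, and the text between <body> and </body> is the visible page content.
The head, <head>...</head>, contains the title.
The markup <title>This is a title</title> defines the page title shown by the browser.
Headings come in six levels, <h1> through <h6>, with the font size decreasing from <h1> to <h6>.
Paragraphs are written inside <p>...</p>.
<br> inserts a line break.
<a> creates a link.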

<a href="https://zh.wikipedia.org/">A link to Chinese Wikipedia!</a>

The href attribute holds the URL that the link points to.
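
As a small illustration of the tags described above, the following sketch parses a short page with BeautifulSoup and reads the title, the paragraph, and the link's href. The sample page simply combines the two snippets above, and the variable names (page, page_title, and so on) are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Sample page assembled from the snippets above (an assumption for illustration).
page = (
    "<html><head><title>This is a title</title></head>"
    "<body><p>Hello world!</p>"
    '<a href="https://zh.wikipedia.org/">A link to Chinese Wikipedia!</a>'
    "</body></html>"
)

soup = BeautifulSoup(page, "html.parser")

page_title = soup.title.string  # text between <title> and </title>
paragraph = soup.p.string       # text of the first <p> element
link_url = soup.a["href"]       # value of the href attribute on the first <a>

print(page_title)
print(paragraph)
print(link_url)
```

Accessing a tag by name (soup.title, soup.p, soup.a) returns only the first matching element, which is enough for a page this small.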

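Attributes
Understanding HTML attributes matters a great deal when writing web scrapers in Python.
1. id: a unique identifier for an element within the whole document, used to pick out that element.
2. class: a way to group similar elements together.
3. style: assigns presentational properties to one specific element.
4. title: attaches an additional description to an element.
5. lang: identifies the language of the element's content.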

Example:

<abbr id="ID" class="term" style="color:purple;" title="HyperText Markup Language">HTML</abbr>

abbr is the abbreviation element.
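
To connect these attributes to scraping, here is a minimal sketch of how BeautifulSoup exposes them on a parsed element. The markup is a variant of the abbr example; the ASCII class name "term" and the added lang attribute are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

# Variant of the abbr example; class="term" and lang="en" are illustrative assumptions.
snippet = (
    '<abbr id="ID" class="term" style="color:purple;" '
    'title="HyperText Markup Language" lang="en">HTML</abbr>'
)
soup = BeautifulSoup(snippet, "html.parser")
abbr = soup.abbr

element_id = abbr["id"]           # dictionary-style access to one attribute
classes = abbr["class"]           # class can hold several values, so it is a list
title_text = abbr.get("title")    # .get() returns None instead of raising KeyError
missing = abbr.get("data-x")      # attribute not present on the element
all_attrs = abbr.attrs            # every attribute as a dict

print(element_id, classes, title_text, missing)
print(sorted(all_attrs))
```

Using .get() is the safer choice when scraping real pages, since not every element carries every attribute.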

Using BeautifulSoup

BeautifulSoup can parse a page with malformed HTML into a complete HTML document and print it back with a standard indented structure.

>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=shop><li>Price<li>Number</ul>'
>>> # Parse this HTML
>>> soup = BeautifulSoup(broken_html, 'html.parser')
>>> fixed_html = soup.prettify()
>>> print(fixed_html)
<ul class="shop">
 <li>
  Price
  <li>
   Number
  </li>
 </li>
</ul>
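
To see what the repaired tree looks like in practice, a minimal sketch (assuming the built-in html.parser) can walk the result. With this parser the unclosed second <li> ends up nested inside the first one, and other parsers such as lxml or html5lib may arrange the items differently.

```python
from bs4 import BeautifulSoup

broken_html = '<ul class=shop><li>Price<li>Number</ul>'
soup = BeautifulSoup(broken_html, 'html.parser')

items = soup.find_all('li')
ul_classes = soup.ul['class']                   # the unquoted class value is still read
second_is_nested = items[1].parent is items[0]  # second <li> sits inside the first
texts = [li.get_text() for li in items]         # outer <li> includes the inner text

print(len(items), ul_classes, second_is_nested)
print(texts)
```

Because the outer <li> contains the inner one, its get_text() result also includes the inner item's text, which is worth keeping in mind when extracting list items from pages like this.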

The following HTML document is used to illustrate how the library works:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
soup.title
# <title>The Dormouse's story</title>
soup.title.string
#"The Dormouse's story"
soup.title.parent.name
#'head'
soup.p
#<p class="title"><b>The Dormouse's story</b></p>
soup.find_all('a')
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
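
Building on the lookups above, a common scraping step is to collect values from every matching tag rather than printing the tags themselves. The sketch below assumes the same sample document and the built-in html.parser; the names urls, names, and story_count are only illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Collect the href of every <a> tag in document order.
urls = [a.get("href") for a in soup.find_all("a")]

# Filter by CSS class; the keyword is class_ because class is reserved in Python.
names = [a.get_text() for a in soup.find_all("a", class_="sister")]

# Count the paragraphs marked with class="story".
story_count = len(soup.find_all("p", class_="story"))

print(urls)
print(names)
print(story_count)
```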

A Tag object corresponds to a tag in the HTML document and has the same attributes as that tag.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.name
# 'b'
# Changing a tag's name changes the generated HTML; a tag's attributes can be added, removed, or modified
tag.name = "blockquote"
tag
#<blockquote class="boldest">Extremely bold</blockquote>
tag['id'] = 1
tag
#<blockquote class="boldest" id="1">Extremely bold</blockquote>
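
The snippet above only adds an attribute. A short sketch of the other two operations, modifying and deleting, might look like the following; the attribute values 'verybold' and 'intro' are placeholders chosen for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

tag['class'] = 'verybold'   # modify an existing attribute
tag['id'] = 'intro'         # add a new attribute
after_add = str(tag)

del tag['id']               # delete the attribute again
after_delete = str(tag)
has_id = tag.has_attr('id')

print(after_add)
print(after_delete)
print(has_id)
```

Attributes behave like dictionary entries on the tag, so the usual assignment and del syntax is all that is needed.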

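Because the BeautifulSoup module is written in pure Python while the regular expression module is implemented in C, scraping with BeautifulSoup is considerably slower than scraping with regular expressions. Its syntax, however, is much simpler and easier to read, so it is easy to pick up and a good choice for beginners.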
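
To make the syntax comparison concrete, here is a minimal sketch that extracts the same link URLs once with a regular expression and once with BeautifulSoup. The sample markup and the pattern are assumptions written for this illustration, and the sketch does not measure speed.

```python
import re
from bs4 import BeautifulSoup

# Two links whose attributes appear in a different order (illustrative sample).
html = (
    '<p><a href="http://example.com/elsie" class="sister">Elsie</a>, '
    '<a class="sister" href="http://example.com/lacie">Lacie</a></p>'
)

# Regular expression: the pattern is tied to how the markup happens to be written
# (double quotes, attributes inside a single <a ...> tag).
regex_urls = re.findall(r'<a[^>]*\bhref="([^"]*)"', html)

# BeautifulSoup: build a parse tree first, then query it by tag and attribute name.
soup = BeautifulSoup(html, 'html.parser')
soup_urls = [a['href'] for a in soup.find_all('a')]

print(regex_urls)
print(soup_urls)
```

Both approaches return the same list here, but the BeautifulSoup version keeps working if the page switches to single quotes or reorders attributes, whereas the pattern would need to be adjusted.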

References: HTML on Wikipedia: https://zh.wikipedia.org/wiki/HTML
BeautifulSoup 4.2.0 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html