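1. Preface
I came across this module while crawling an entire website. Its main job is to split URLs into their components, which makes it very handy whenever you need to process URLs.
2. What it does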
This module defines a standard interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.”
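In other words, the urllib.parse module defines a standard interface for three things: splitting a URL into its components, reassembling those components into a URL string, and converting a relative URL into an absolute one given a base URL.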
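The three tasks described above can be sketched roughly as follows. The example.com URLs are made up for illustration only; the functions `urlparse`, `urlunparse` and `urljoin` all come from `urllib.parse`.

```python
from urllib.parse import urlparse, urlunparse, urljoin

# Hypothetical URL, used only for illustration.
url = "https://example.com/docs/index.html?lang=en#intro"

# 1. Split the URL into its components.
parts = urlparse(url)
print(parts.netloc)   # example.com

# 2. Put the components back together into a URL string.
rebuilt = urlunparse(parts)
print(rebuilt == url)  # True

# 3. Resolve a relative URL against a base URL.
absolute = urljoin("https://example.com/docs/index.html", "../images/logo.png")
print(absolute)       # https://example.com/images/logo.png
```

When crawling a site, the third step is usually the one you need most, since links found in a page are often relative to the page's own address.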
3. Functions
3.1. URL Parsing
The URL parsing functions focus on splitting a URL string into its components, or on combining URL components into a URL string.
3.1.1. urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
urlparse() splits a URL into six components and returns them as a named tuple. For example:
>>> from urllib.parse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
The six components are:
Attribute | Index | Value | Value if not present |
---|---|---|---|
scheme | 0 | URL scheme specifier (e.g. http or https) | scheme parameter |
netloc | 1 | Network location part (host and optional port, e.g. www.cwi.nl:80) | empty string |
path | 2 | Hierarchical path | empty string |
params | 3 | Parameters for last path element | empty string |
query | 4 | Query component | empty string |
fragment | 5 | Fragment identifier | empty string |
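As a small sketch of how the table maps onto code: each component can be read either by attribute name or by its index, and components missing from the URL come back as empty strings. The URLs below are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical URL that happens to contain all six components.
result = urlparse("https://example.com/a/b;type=x?q=1#top")

# Attribute access and index access return the same values.
print(result.scheme, result[0])    # https https
print(result.path, result[2])      # /a/b /a/b
print(result.params, result[3])    # type=x type=x
print(result.query, result[4])     # q=1 q=1
print(result.fragment, result[5])  # top top

# Components that are absent are returned as empty strings.
bare = urlparse("/only/a/path")
print(repr(bare.scheme), repr(bare.netloc), repr(bare.query))  # '' '' ''
```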
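Parameters
urlstring: the URL string to parse.
scheme: the default scheme (for example http or https). It is used only when the URL itself does not specify one.
allow_fragments: defaults to True. If set to False, fragment identifiers are not recognized, so the # and everything after it stay attached to the preceding component instead of being split off. As the official documentation puts it: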
If the allow_fragments argument is false, fragment identifiers are not recognized. Instead, they are parsed as part of the path, parameters or query component, and fragment is set to the empty string in the return value.
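The effect of the `scheme` and `allow_fragments` arguments can be illustrated with a minimal sketch like the one below. The example.com URLs are placeholders chosen for the demonstration.

```python
from urllib.parse import urlparse

# scheme is only a fallback: it applies when the URL has no scheme of its own.
a = urlparse("//example.com/page", scheme="https")
print(a.scheme, a.netloc)  # https example.com

# If the URL already carries a scheme, the argument is ignored.
b = urlparse("http://example.com/page", scheme="https")
print(b.scheme)            # http

# Default behaviour: the fragment is split off.
c = urlparse("http://example.com/page#section")
print(c.path, c.fragment)  # /page section

# With allow_fragments=False the '#section' part stays in the path.
d = urlparse("http://example.com/page#section", allow_fragments=False)
print(d.path, repr(d.fragment))  # /page#section ''
```

Note that the network location is only recognized when it is introduced by `//`; without it, a string such as `www.example.com/page` is treated as a path.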
For more on urllib.parse, see the official Python documentation.