Processing Information Markup

Three Forms of Information Markup

XML JSON YAML

XML (eXtensible Markup Language)

XML marks up information with tags.
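
As a small illustration of tag-based markup, the sketch below parses a short XML fragment with Python's standard xml.etree.ElementTree module. The <city>, <name> and <alias> tags and the id attribute are made up for this example and are not taken from any real data set.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: the tags carry the markup,
# and an attribute (id) adds extra detail to the <city> element.
xml_text = """
<city id="1">
    <name>beijing</name>
    <alias>shoudu</alias>
</city>
""".strip()

root = ET.fromstring(xml_text)

city_id = root.get("id")                    # attribute value of the root tag
city_name = root.find("name").text          # text inside the <name> tag
child_tags = [child.tag for child in root]  # names of the nested tags

print(root.tag, city_id)
print(city_name)
print(child_tags)
```

Every piece of information sits between an opening and a closing tag, which is what makes XML extensible but also verbose.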

JSON (JavaScript Object Notation)

Typed key-value pairs, written key:value.

Strings are written in double quotes; numbers are written directly.

For example:

"name":"beijing"
"name":["beijing","shoudu"]
"name":{
		"newName":"xiaoming",
		"oldName":"xiaohua"
		}
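
To show what "typed" means in practice, here is a minimal sketch using Python's standard json module. The three fragments above are wrapped into a single object under distinct keys (alias, person and count are names chosen for this example), since a JSON object should not repeat a key.

```python
import json

# The fragments above combined into one JSON object.
# The keys "alias", "person" and "count" are assumptions made for this sketch.
json_text = """
{
    "name": "beijing",
    "alias": ["beijing", "shoudu"],
    "person": {"newName": "xiaoming", "oldName": "xiaohua"},
    "count": 2
}
"""

data = json.loads(json_text)

# Each value arrives as a typed Python object, not as raw text.
for key, value in data.items():
    print(key, type(value).__name__)

new_name = data["person"]["newName"]
print(new_name)
```

The quoted values become str, the bracketed list becomes list, the braces become dict, and the unquoted number becomes int.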

YAML (YAML Ain’t Markup Language)

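Untyped key-value pairs, written key: value (a space follows the colon).
For example:

name: beijing
name:      # indentation expresses ownership
  newName: xiaoming
  oldName: xiaohua
name:      # "- " marks parallel items
- beijing
- shoudu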

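Comparison of the Three Markup Forms

XML: the earliest general-purpose information markup language; highly extensible, but verbose.
- Information exchange and transfer on the Internet.
JSON: information is typed and well suited to processing by programs (JavaScript in particular); more concise than XML.
- Communication between the cloud side and the nodes of mobile applications; no comments.
YAML: information is untyped, the share of actual text content is the highest of the three, and readability is good.
- Configuration files of all kinds of systems; supports comments and is easy to read.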

General Methods of Information Extraction

1. Parse the markup format completely, then extract the key information.

XML JSON YAML

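Requires a markup parser, for example tag-tree traversal with the bs4 library.

Pros: the information is parsed accurately.
Cons: the extraction process is cumbersome and slow.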
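
A minimal sketch of this first method, using the standard-library html.parser module in place of bs4 so that it runs without extra packages. TitleParser is a hypothetical class written for this example, and the HTML string is a shortened stand-in for a real page.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Walks the whole document and keeps only the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text is collected only while the parser is inside <title>.
        if self.in_title:
            self.title += data


html_text = (
    "<html><head><title>This is a python demo page</title></head>"
    "<body><p>title</p></body></html>"
)

parser = TitleParser()
parser.feed(html_text)
parser.close()
print(parser.title)
```

Because the parser tracks the markup structure, the word "title" inside the <p> tag is not mistaken for the page title. The price is the extra code and the full pass over the document.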

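2. Ignore the markup format and search for the key information directly.

Search

A text-search function applied to the information is all that is needed.

Pros: the extraction process is simple and fairly fast.
Cons: the accuracy of the result depends on the information content.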
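
The trade-off of this second method can be sketched with a plain regular-expression search. The HTML string below is invented for the example; the commented-out link is there only to show how content can mislead a search that ignores markup.

```python
import re

# Hypothetical page text: one live link and one link inside an HTML comment.
html_text = (
    '<p>Visit <a href="http://www.icourse163.org/course/BIT-268001">Basic Python</a>.</p>'
    '<!-- <a href="http://example.com/old">removed</a> -->'
)

# Pure text search: no parsing, just look for href="..." anywhere in the text.
urls = re.findall(r'href="([^"]+)"', html_text)
print(urls)
```

The search is a single line and fast, but it also returns the URL inside the comment, because it has no notion of which parts of the text are live markup.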

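3. Combine format parsing with searching to extract the key information.

XML JSON YAML Search

Requires both a markup parser and text-search functions.

Example:

Extract all URL links from an HTML page.

Approach:

1) Search for all <a> tags.

2) Parse the format of each <a> tag and extract the link stored in its href attribute.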

>>> import requests
>>> demo = requests.get("http://python123.io/ws/demo.html").text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
... 
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
>>>
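
The same idea as a reusable function, sketched here with an embedded copy of the demo page so that it runs without network access. extract_links is a hypothetical helper name, and the third <a> tag (which has no href) was added only to show why filtering is useful.

```python
from bs4 import BeautifulSoup

# Shortened copy of the demo page, plus one <a> without href (added for illustration).
demo_html = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="course">
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.
<a name="top">no link here</a>
</p>
</body></html>"""


def extract_links(html):
    """Return the href value of every <a> tag that actually has one."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only tags that carry an href attribute,
    # so the result never contains None.
    return [a.get("href") for a in soup.find_all("a", href=True)]


links = extract_links(demo_html)
for url in links:
    print(url)
```

Without the href=True filter, link.get('href') would yield None for the anchor that has no href attribute.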

HTML Content Search Methods in the bs4 Library

>>> import requests
>>> demo = requests.get("http://python123.io/ws/demo.html").text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")

<>.find_all(name, attrs, recursive, string, **kwargs)

Returns a list that holds the search results.

name: a search string for tag names.

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

>>> for tag in soup.find_all(True):
...     print(tag.name)
... 
html
head
title
body
p
b
p
a
a

>>> import re  # regular expression library
>>> for tag in soup.find_all(re.compile('b')):
...     print(tag.name)
... 
body
b

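attrs: a search string for tag attribute values. A bare string in this position is matched against the class attribute; a specific attribute can be named with a keyword argument such as id=.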

>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')
[]
>>> import re
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

recursive: whether to search all descendants; defaults to True. With recursive=False only direct children are searched.

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive=False)
[]

string: a search string for the text region between <>...</>.

>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string = "Basic Python")
['Basic Python']
>>> import re
>>> soup.find_all(string = re.compile("python"))
['This is a python demo page', 'The demo python introduces several python courses.']

Note:

<tag>(..) is equivalent to <tag>.find_all(..)

soup(..) is equivalent to soup.find_all(..)
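
A quick check of this shorthand, assuming bs4 is installed. The one-line HTML string is a stand-in for the demo page.

```python
from bs4 import BeautifulSoup

html_text = '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
soup = BeautifulSoup(html_text, "html.parser")

# Calling the soup object directly behaves like calling find_all on it.
ids_via_call = [tag["id"] for tag in soup("a")]
ids_via_find_all = [tag["id"] for tag in soup.find_all("a")]

# The same shorthand works on any tag and accepts the same arguments.
ids_from_tag = [tag["id"] for tag in soup.p("a", id="link2")]

print(ids_via_call, ids_via_find_all, ids_from_tag)
```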

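Extension Methods

Method	Description
<>.find()	Searches and returns only the first result as a single element rather than a list (None if nothing matches); same parameters as .find_all()
<>.find_parents()	Searches the ancestor nodes and returns a list; same parameters as .find_all()
<>.find_parent()	Returns the nearest matching ancestor node as a single element; same parameters as .find()
<>.find_next_siblings()	Searches the following sibling nodes and returns a list; same parameters as .find_all()
<>.find_next_sibling()	Returns the first matching following sibling as a single element; same parameters as .find()
<>.find_previous_siblings()	Searches the preceding sibling nodes and returns a list; same parameters as .find_all()
<>.find_previous_sibling()	Returns the first matching preceding sibling as a single element; same parameters as .find()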
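
A short sketch of these methods, assuming bs4 is installed. The HTML string is a trimmed version of the demo page's course paragraph, and the variable names are chosen for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_text = (
    '<html><body><p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.'
    '</p></body></html>'
)
soup = BeautifulSoup(html_text, "html.parser")

first = soup.find("a")        # a single Tag, not a list
missing = soup.find("table")  # None, because the page has no <table>

parent_p = first.find_parent("p")                                # nearest <p> ancestor
ancestors = [t.name for t in first.find_parents(["p", "body"])]  # nearest first

next_link = first.find_next_sibling("a")                         # skips the " and " text node
prev_link = next_link.find_previous_sibling("a")
later_ids = [t["id"] for t in first.find_next_siblings("a")]

print(type(first).__name__, missing)
print(parent_p["class"], ancestors)
print(next_link["id"], prev_link["id"], later_ids)
```

The singular methods return one element or None, so check the result for None before using it. The plural methods always return a list, which may be empty.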