Python-XPath

Last updated: 9 months ago

XPath is a language for finding information in XML documents

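1. What is XPath

XPath uses path expressions to navigate XML documents
XPath includes a standard function library
XPath is a major element of XSLT
XPath is a W3C standard

In short, XPath is a language for locating information in an XML document by navigating through its elements and attributes.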

2. Installation

pip install lxml

3. XPath syntax

3.1 Selecting nodes

XPath uses path expressions to select nodes in an XML document.

Nodes include elements, attributes, and text content

html ---> <html> ...</html>

div ---> <div> ...</div>

a  ---> <a> ...</a>

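An element here means the same thing as a tag in HTML. A bare element name is only a relative step (it matches children of the current node), so in practice it is combined with the path operators below

Nodes are selected by following a path, that is, a sequence of steps

/   the root node when leading; otherwise the separator between steps

//  descendants at any depth

.   the current node

..  the parent node

@   an attribute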
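To make the operators above concrete, here is a minimal sketch using lxml. The sample document and the variable names are made up for illustration and do not come from any real page.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration
doc = etree.fromstring(
    "<html><body>"
    "<div id='top'><a href='/one'>one</a></div>"
    "<div id='bottom'><p><a href='/two'>two</a></p></div>"
    "</body></html>"
)

root_tag = doc.xpath("/html")[0].tag       # "/" at the start: absolute path from the root
all_links = doc.xpath("//a/@href")         # "//": every <a> at any depth, "@": its href attribute
first_div = doc.xpath("//div")[0]
relative = first_div.xpath("./a/text()")   # ".": start from the current node (first_div)
parent_tag = first_div.xpath("..")[0].tag  # "..": step up to the parent node

print(root_tag, all_links, relative, parent_tag)
```

Note that xpath() returns a list for node and attribute selections, so indexing with [0] is only safe when a match is known to exist.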

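Predicates

//a[n]   n is a positive integer; selects each <a> element that is the n-th <a> child of its parent

//a[last()]   selects each <a> element that is the last <a> child of its parent

//a[last()-1]   likewise, selects the second-to-last one

//a[position()<3]   position less than 3, i.e. the first two; this also shows that positions in XPath start at 1

//a[@href]    <a> elements that have an href attribute

//a[@href='www.baidu.com']    <a> elements whose href attribute equals 'www.baidu.com'

//book[@price>2]   <book> elements whose price attribute is greater than 2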
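The predicates above can be tried out on a small document. This is only a sketch: the `<shop>` sample and the `texts` helper are hypothetical and exist purely to show which elements each expression selects.

```python
from lxml import etree

# Hypothetical sample data, used only for illustration
doc = etree.fromstring(
    "<shop>"
    "<a href='www.baidu.com'>first</a>"
    "<a href='www.example.com'>second</a>"
    "<a>third</a>"
    "<book price='1'>cheap</book>"
    "<book price='5'>dear</book>"
    "</shop>"
)


def texts(expr):
    """Return the text of every element matched by the XPath expression."""
    return [node.text for node in doc.xpath(expr)]


second = texts("//a[2]")                    # the 2nd <a> child (positions start at 1)
last = texts("//a[last()]")                 # the last <a> child
second_to_last = texts("//a[last()-1]")     # the one before the last
first_two = texts("//a[position()<3]")      # positions 1 and 2
with_href = texts("//a[@href]")             # only <a> elements that have an href
baidu = texts("//a[@href='www.baidu.com']")  # href equal to a given value
pricey = texts("//book[@price>2]")          # numeric comparison on an attribute

print(second, last, second_to_last, first_two, with_href, baidu, pricey)
```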

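etree.HTML(html)

import requests
from lxml import etree  # if the IDE underlines this import in red, it can be ignored

url=...
html = requests.get(url).content.decode('utf-8')  # download the page and decode the bytes as UTF-8
etree.HTML(html)  # parse the HTML text into an element tree that supports xpath()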

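XPath functions

text()

Selects the text nodes that are direct children of the current node

string()

string() extracts the text of the current node and of all its descendants and joins it into a single string

code3 = code2.xpath('string(//div[@class="con"])')
# Extract all the text inside the div whose class is "con"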
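The difference between text() and string() is easiest to see side by side. The HTML fragment below is a made-up example, not the page used in this article.

```python
from lxml import etree

# Hypothetical HTML fragment, used only for illustration
html = '<div class="con"><p>Hello <b>XPath</b></p><p>world</p></div>'
tree = etree.HTML(html)

# text() only returns text nodes that are direct children of the selected element
direct = tree.xpath('//div[@class="con"]/text()')    # the div itself holds no direct text
p_text = tree.xpath('//div[@class="con"]/p/text()')  # direct text of each <p>, without the <b> content

# string() concatenates the text of the element and all of its descendants
everything = tree.xpath('string(//div[@class="con"])')

print(direct, p_text, everything)
```

text() yields a list of separate fragments, while string() yields one plain string, which is why string() is the more convenient choice when the whole article body is wanted.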

https://www.w3school.com.cn/xpath/xpath_functions.asp

Example

<ul class="zxxx_list">

    <li>  <a href="7c34a65595744227bdae8a35d68de133.shtml" target="_blank" title='截至9月13日24时'  >截至9月13日24时</a><span class="ml">2022-09-14</span></li>
</ul>

#!/usr/bin/python
# coding=utf-8
from lxml import etree
import requests

url=''

a = requests.session()          # open a session
html1 = a.get(url).content.decode('utf-8')
print(html1)

html2=etree.HTML(html1,parser = etree.HTMLParser(encoding='utf8'))

print(html2.xpath('//ul[@class="zxxx_list"]/li/a/@href')[0])
# Find the ul whose class is zxxx_list, descend through li/a, read the href of each a, and take the first result

===>

7c34a65595744227bdae8a35d68de133.shtml
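The same query can be tried offline against the HTML snippet shown above, which is handy for testing the expression without a network request. In this sketch the `base` URL is a placeholder assumption, and urljoin from the standard library is one possible way to turn the relative href into an absolute URL.

```python
from urllib.parse import urljoin

from lxml import etree

# The sample list from this article, embedded as a string so no request is needed
snippet = '''
<ul class="zxxx_list">
    <li><a href="7c34a65595744227bdae8a35d68de133.shtml" target="_blank" title="截至9月13日24时">截至9月13日24时</a><span class="ml">2022-09-14</span></li>
</ul>
'''

tree = etree.HTML(snippet)

hrefs = tree.xpath('//ul[@class="zxxx_list"]/li/a/@href')
dates = tree.xpath('//ul[@class="zxxx_list"]/li/span[@class="ml"]/text()')

# xpath() returns an empty list when nothing matches, so guard before indexing
first_href = hrefs[0] if hrefs else None

# Hypothetical address of the list page; replace it with the real one
base = "http://example.com/news/list.shtml"
full_url = urljoin(base, first_href) if first_href else None

print(first_href, dates, full_url)
```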

Reference:
https://www.cnblogs.com/mxjhaima/p/13775844.html

Syntax reference:
https://blog.csdn.net/qq_50854790/article/details/123610184

Full code

#!/usr/bin/python
# coding=utf-8
from lxml import etree
import requests

url = 'http://www.nhc.gov.cn'
url1 = 'http://www.nhc.gov.cn/xcs/yqtb/list_gzbd.shtml'

# region 1. Request the target page and get the dynamically updated URL

a = requests.session()  # open a session
html0 = a.get(url1)
html0.encoding = 'utf-8'

html1 = html0.content.decode('utf-8')
# print(html1)

html2 = etree.HTML(html1, parser=etree.HTMLParser(encoding='utf8'))
html3 = html2.xpath('//ul[@class="zxxx_list"]/li/a/@href')[0]
# print(html3)
a.close()

url2 = url + html3
print(url2)

# endregion

# region 2. Fetch the required information from the dynamic page
b = requests.session()
res1 = b.get(url2)
res1.encoding = 'utf-8'

code1 = res1.text
print(code1)

b.close()

code2 = etree.HTML(code1, parser=etree.HTMLParser(encoding='utf8'))
code3 = code2.xpath('string(//div[@class="con"])')
# print(code3)

# endregion

# region 3. Process the data

s1 = code3.find('。') + 1
s2 = code3.find('死亡')

code4 = code3[s1:s2]  # keep the first paragraph: from just after the first '。' up to '死亡'
print(code4)

text = code4

# case count per region; the keys must match the region names used on the page
counts = {'北京': 0, '天津': 0, '上海': 0, '重庆': 0, '河北': 0, '山西': 0, '辽宁': 0, '吉林': 0, '江苏': 0, '浙江': 0,
    '安徽': 0, '福建': 0, '江西': 0, '山东': 0, '河南': 0, '湖北': 0, '湖南': 0, '广东': 0, '海南': 0, '四川': 0,
    '贵州': 0, '云南': 0, '陕西': 0, '甘肃': 0, '青海': 0, '台湾': 0, '广西': 0, '西藏': 0, '宁夏': 0, '新疆': 0,
    '香港': 0, '澳门': 0, '黑龙江': 0, '内蒙古': 0}

group_count = text.count('（')  # number of parenthesised groups in the paragraph
print(group_count)

str1 = ''

# collect the content of every （...） group, dropping the '例' unit, joined by '，'
for i in range(0, group_count):
    num1 = text.find('（')
    num2 = text.find('）')
    str1 = str1 + text[num1 + 1:num2].replace('例', '')
    if i < group_count - 1:
        str1 = str1 + '，'
    text = text[num2 + 1:]

# print(str1)

str2 = str1.split('，')

# print(str2)

# each item looks like region name + number; add the number to that region's total
for i in str2:
    for key in counts:
        if key in i and (i.replace(key, '')).isnumeric():
            counts[key] = counts[key] + int(i.replace(key, ''))

print(counts)

# endregion

# region 4. Sync to the database

import pymssql

connect = pymssql.connect(host='x.x.x.x:xxxx', user='team16', password='team16', database='team16')  # connect to the remote database

if connect:
    print("Connected")
else:
    print("wrong!")

cursor = connect.cursor()  # create a cursor; SQL statements in Python are executed through it
cursor.execute("update dbo.C19_State set sum=0")  # reset every count
cursor.execute("select * from dbo.C19_State")  # read the current rows
rows = cursor.fetchall()

for row1 in rows:
    for key in counts:
        if key in row1[3]:  # row1[3] is areaName
            # parameterized update instead of building the SQL text by concatenation
            cursor.execute('update dbo.C19_State set sum=%s where areaName=%s', (counts[key], row1[3]))
    # print(row1[0], row1[1], row1[2], row1[3], row1[4])  # print the row

connect.commit()


sql = "select * from dbo.C19_State;"  # query the table again
cursor.execute(sql)

row2 = cursor.fetchone()  # read the query result
while row2:  # loop over all rows
    print(row2[0], row2[1], row2[2], row2[3], row2[4],row2[5])  # print the row
    row2 = cursor.fetchone()

connect.close()

# endregion

if group_count:
    print('Sync succeeded')
else:
    print('Sync failed')
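The data-processing step in region 3 can also be written with regular expressions. The following is only a sketch of an alternative, not the method used in the code above: the function name `count_by_region` and the sample sentence are made up, and it matches region names exactly rather than by substring.

```python
import re


def count_by_region(text, regions):
    """Sum the numbers that follow each region name inside full-width parentheses."""
    counts = {name: 0 for name in regions}
    # every （...） group holds items such as "广东3例", separated by full-width commas
    for group in re.findall(r"（([^）]*)）", text):
        for item in group.split("，"):
            # region name (non-digits), then the number, then an optional "例" unit
            match = re.fullmatch(r"(\D+?)(\d+)例?", item.strip())
            if match and match.group(1) in counts:
                counts[match.group(1)] += int(match.group(2))
    return counts


# Made-up sample sentence in the same shape as the report text; the numbers are not real data
sample = "新增确诊病例10例（广东3例，四川2例）；本土病例5例（广东4例，北京1例）"
result = count_by_region(sample, ["广东", "四川", "北京", "上海"])
print(result)
```

Items that do not fit the name-plus-number pattern are skipped, which mirrors the isnumeric() check in the loop above.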


Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting!
