Parsing HTML with Python
Posted on Wed 10 November 2010 in misc
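Python ships with an HTML parsing library, but its features are limited and it copes poorly with malformed pages.\
Later I found an HTML parsing library online called [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/). It uses regular expressions to process pages, handles malformed markup quite well, and also supports Unicode.\
Besides these, there are other Python libraries such as lxml.\
Below are some BeautifulSoup examples taken from the official site. For more details, see the [official documentation](http://www.crummy.com/software/BeautifulSoup/documentation.html), which also has a [Chinese version](http://www.crummy.com/software/BeautifulSoup/documentation.zh.html).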
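For comparison with the BeautifulSoup examples that follow, here is a minimal sketch of what working with the standard library's parser can look like. It is written for Python 3, where the module is `html.parser` (an assumption; the original post targets Python 2), and `TagCollector` is a hypothetical helper name used only for illustration. The parser is event-driven: it reports tags as it meets them and leaves tree building to the caller.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed('<p id="firstpara" align="center">This is paragraph <b>one</b>.')
collector.close()
print(collector.tags)
```

Anything like parent/child navigation or searching by attribute has to be built on top of these callbacks, which is the gap BeautifulSoup fills.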
# BeautifulSoup 3 on Python 2
from BeautifulSoup import BeautifulSoup
import re
doc = ['<html><head><title>Page title</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
'</html>']
doc = ''.join(doc)
soup = BeautifulSoup(doc)
# Printing the parsed tree effectively normalizes the page: for example, if the
# closing </b> after <b>some thing was forgotten, the output includes it.
print soup.prettify()
# The printed result:
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
# Nodes can be reached with dot notation, which is very convenient.
soup.contents[0].name
# u'html'
soup.contents[0].contents[0].name
# u'head'
head = soup.contents[0].contents[0]
head.parent.name
# u'html'
head.next
# <title>Page title</title>
head.nextSibling.name
# u'body'
head.nextSibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
head.nextSibling.contents[0].nextSibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
# Filters are supported, somewhat like CSS selectors.
soup.findAll('p', align="center")
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]
soup.find('p', align="center")
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
soup('p', align="center")[0]['id']
# u'firstpara'
soup.find('p', align=re.compile('^b.*'))['id']
# u'secondpara'
soup.find('p').b.string
# u'one'
soup('p')[1].b.string
# u'two'
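
The examples above target BeautifulSoup 3 on Python 2. As a rough sketch of how the same lookups might be written today, the block below assumes Python 3 and the newer third-party `bs4` package, neither of which appears in the original post. There, `find_all` and `next_sibling` are the spellings of `findAll` and `nextSibling`. The explicit `"html.parser"` argument and the closed `</p>` tags are choices made for this sketch, so that the result does not depend on how a particular parser repairs the markup.

```python
import re

from bs4 import BeautifulSoup  # assumes the third-party bs4 package is installed

# Same document as above, but with every <p> closed explicitly.
doc = ''.join([
    '<html><head><title>Page title</title></head>',
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>',
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>',
    '</body></html>',
])
soup = BeautifulSoup(doc, "html.parser")

# Navigation with dot notation
head = soup.html.head
print(head.parent.name)
print(head.next_sibling.name)

# Searching by tag name and attribute, including a regular expression filter
first_id = soup.find_all("p", align="center")[0]["id"]
second_id = soup.find("p", align=re.compile("^b"))["id"]
print(first_id, second_id)

# Text of the <b> tag inside each paragraph
bold_texts = [p.b.string for p in soup.find_all("p")]
print(bold_texts)
```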