Parsing HTML with Python

Posted on Wed 10 November 2010 in misc

Python ships with an HTML parsing library of its own, but it is fairly limited and does not handle malformed pages well (a minimal sketch of its interface follows below). Later I found a web-page parsing library called [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/), which uses regular expressions to process the page; it handles broken markup quite gracefully and also supports Unicode. Besides these there are other Python libraries such as lxml. The BeautifulSoup examples further down are taken from the official site; for more details see the [official documentation](http://www.crummy.com/software/BeautifulSoup/documentation.html), which also has a [Chinese version](http://www.crummy.com/software/BeautifulSoup/documentation.zh.html).
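
For comparison, here is roughly what the standard-library parser mentioned above looks like in use. This is only a minimal sketch in the same Python 2 style as the rest of the post: the HTMLParser class, its handle_* callbacks and feed() are the real standard-library API, while the ShowTags subclass and its print statements are just an illustration.

    from HTMLParser import HTMLParser

    class ShowTags(HTMLParser):
        # HTMLParser is event-driven: it invokes these handlers as it scans the markup.
        def handle_starttag(self, tag, attrs):
            print 'start:', tag, attrs
        def handle_endtag(self, tag):
            print 'end:  ', tag
        def handle_data(self, data):
            print 'data: ', repr(data)

    ShowTags().feed('<html><head><title>Page title</title></head></html>')

You only get a stream of events, so any tree structure has to be rebuilt by hand, and badly broken markup can raise HTMLParseError; the BeautifulSoup examples from the official documentation, shown next, give you a navigable tree instead.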

    from BeautifulSoup import BeautifulSoup
    import re
    doc = ['<html><head><title>Page title</title></head>',
           '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
           '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
           '</html>']
    doc = ''.join(doc)
    soup = BeautifulSoup(doc)
    print soup.prettify()  # Effectively normalizes the page: if </b> is forgotten after <b>something, prettify adds it.
    # The printed result:
    # <html>
    #  <head>
    #   <title>
    #    Page title
    #   </title>
    #  </head>
    #  <body>
    #   <p id="firstpara" align="center">
    #    This is paragraph
    #    <b>
    #     one
    #    </b>
    #    .
    #   </p>
    #   <p id="secondpara" align="blah">
    #    This is paragraph
    #    <b>
    #     two
    #    </b>
    #    .
    #   </p>
    #  </body>
    # </html>
    soup.contents[0].name  # Nodes can be accessed dot-style, which is very convenient.
    # u'html'
    soup.contents[0].contents[0].name
    # u'head'
    head = soup.contents[0].contents[0]
    head.parent.name
    # u'html'
    head.next
    # <title>Page title</title>
    head.nextSibling.name
    # u'body'
    head.nextSibling.contents[0]
    # <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
    head.nextSibling.contents[0].nextSibling
    # <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
    soup.findAll('p', align="center")  # Filters can be used, a bit like CSS selectors.
    # [<p id="firstpara" align="center">This is paragraph <b>one</b>. </p>]
    soup.find('p', align="center")
    # <p id="firstpara" align="center">This is paragraph <b>one</b>. </p>
    soup('p', align="center")[0]['id']
    # u'firstpara'
    soup.find('p', align=re.compile('^b.*'))['id']
    # u'secondpara'
    soup.find('p').b.string
    # u'one'
    soup('p')[1].b.string
    # u'two'