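# Python 2 script (reload, xrange, and the print statement are Python 2 only).
import sys
import requests
from lxml import etree
# Make UTF-8 the default encoding so Chinese text prints without errors.
reload(sys)
sys.setdefaultencoding("utf8")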
# Fetch the page and parse it into an element tree.
r = requests.get('http://best.pconline.com.cn/')
html = r.text
xmlhtml = etree.HTML(html)
# Each topic sits in a div whose id starts with "topic".
content = xmlhtml.xpath('//div[starts-with(@id,"topic")]/div[1]/a[2]/text()')
urllist = xmlhtml.xpath('//div[starts-with(@id,"topic")]/div[1]/a[2]/@href')
lastime = xmlhtml.xpath('//div[starts-with(@id,"topic")]/div[2]/div[2]/span[2]/text()')
# Copy the results into plain lists, stripping whitespace around the time strings.
data_text = [text for text in content]
data_url = [url for url in urllist]
data_time = [t.strip() for t in lastime]
# Print one "title, url, time" line per topic.
for i in xrange(0, len(data_text), 1):
    print "%s, %s, %s" % (data_text[i], data_url[i], data_time[i])
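The index-based loop above assumes the three XPath results all have the same length. Here is a minimal sketch of the same pairing with zip, written for Python 3 and fed made-up placeholder values instead of a live request. The helper name format_rows is hypothetical and not part of the original script.

```python
def format_rows(titles, urls, times):
    """Pair up three parallel lists and format one line per topic.

    zip() stops at the shortest list, so lists of unequal length
    truncate the output instead of raising IndexError.
    """
    return ["%s, %s, %s" % (title, url, t.strip())
            for title, url, t in zip(titles, urls, times)]


# Placeholder values for illustration only, not real scraped data.
sample_titles = ["Topic A", "Topic B"]
sample_urls = ["http://example.com/a", "http://example.com/b"]
sample_times = [" 10:00 ", " 11:30 "]

for row in format_rows(sample_titles, sample_urls, sample_times):
    print(row)
```

Truncating does not realign the columns if one topic lacks a field. For that, the fields would have to be queried per topic div rather than in three separate page-wide queries.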
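Fetch the bbs.chinaunix.net page with wget; what comes back is the bbs.chinaunix.net board list. The next step is naturally to parse that HTML file, but HTML "source" is very different from an ordinary txt file: a few extra blank lines or spaces in the HTML source make no difference to how the page displays, yet they make a big difference to format analysis!
This one is really too simple, bro has written it for you.
#!/bin/bash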
# Start index.html with the page header (this overwrites any existing file).
echo "<html><head><title>My HTML Image Viewer</title></head><body>" >index.html
# Add one <img> tag per image, in the original order: jpg, bmp, gif.
for f in *.jpg *.bmp *.gif; do
    # An unmatched pattern stays as literal text, so skip names that do not exist.
    [ -e "$f" ] || continue
    echo "<img src=\"$f\"/>" >>index.html
done
echo "</body></html>" >>index.html
Remember to give bro the points.
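On the earlier point that extra blank lines and spaces in HTML source do not change what is displayed but do upset line-oriented analysis, here is a rough sketch of one way around it: let a real parser read the page instead of matching lines. It uses Python 3's standard html.parser module. The LinkCollector class, the extract_links helper, and the forum-1.html link are hypothetical names made up for illustration, not taken from the actual bbs.chinaunix.net page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so layout differences vanish.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# The same made-up markup twice: once compact, once spread over many lines.
compact = '<ul><li><a href="forum-1.html">Shell</a></li></ul>'
spread_out = """<ul>
  <li>
    <a   href="forum-1.html">
        Shell
    </a>
  </li>
</ul>"""

print(extract_links(compact) == extract_links(spread_out))  # prints True
```

Both inputs yield the same list of links, which is the property that grep- or sed-style matching on raw lines does not give you.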