from urlparse import urlparse
def httppost(url, **kwgs):
httpClient = None
conn = urlparse(url)
try:
params = urllib.urlencode(dict(kwgs))
header = {"Content-type": "application/x-www-form-urlencoded","Accept": "text/plain", }
httpClient = httplib.HTTPConnection(conn.netloc, conn.port, timeout=30)httpClient.request("POST", conn.path, params, header)response = httpClient.getresponse()
print response.status
print response.reason
print response.read()
print response.getheaders()
except Exception, e:
print e
finally:
if httpClient:
httpClient.close()
从表面上看,Python爬虫程序运行中出现503错误是服务器的问题,其实真正的原因在程序,由于Python脚本运行过程中读取的速度太快,明显是自动读取而不是人工查询读取,这时服务器为了节省资源就会给Python脚本反馈回503错误。其实只要把爬取的速度放慢一点就好了。比如读取一条记录或几条记录后适当添加上time.sleep(10),这样就基本上不会出现503错误了。我本人在使用中一般是在每一次读取后都运行time.sleep(1)或time.sleep(3),具体的数值根据不同的网站确定。http.client包就实现了相应操作的,具体你可以看python的官方教程,下面是我从里面截取的POST示例:
>>> import http.client, urllib.parse>>> params = urllib.parse.urlencode({'@number': 12524, '@type': 'issue', '@action': 'show'})
>>> headers = {"Content-type": "application/x-www-form-urlencoded",
... "Accept": "text/plain"}
>>> conn = http.client.HTTPConnection("bugs.python.org")
>>> conn.request("POST", "", params, headers)
>>> response = conn.getresponse()
>>> print(response.status, response.reason)
302 Found
>>> data = response.read()
>>> data
b'Redirecting to <a href="http://bugs.python.org/issue12524">http://bugs.python.org/issue12524</a>'
>>> conn.close()