How Python handles CAPTCHAs when scraping web pages

2023-02-14 12:51:02 Python


How do you handle CAPTCHAs when scraping web pages with Python? Here are several approaches:

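1. Text-entry CAPTCHAs

This kind of CAPTCHA asks the user to type the letters, digits, or Chinese characters shown in an image.

Approach: this is the simplest kind. Recognize the text in the image and fill it into the input box. The recognition technique is called OCR; for Python we recommend the third-party library tesserocr. A CAPTCHA with little background interference can be fed to the library directly. A CAPTCHA with a noisy background has a very low recognition rate when recognized directly, so preprocess the image first: convert it to grayscale, then binarize it, and only then run recognition. This raises the recognition rate considerably.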
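The grayscale, binarize, recognize sequence described above can be sketched roughly as follows (Python 3). This is only an illustration: the threshold of 127 and the function names are assumptions, and the right threshold depends on the actual CAPTCHA. It relies on Pillow and on `tesserocr.image_to_text`, both of which must be installed separately.

```python
def threshold_table(threshold=127):
    """Build a 256-entry lookup table for binarization.

    Gray values below the threshold map to 0 (black), the rest to 255 (white).
    The default of 127 is an assumed starting point, not a tuned value.
    """
    return [0 if value < threshold else 255 for value in range(256)]


def recognize_captcha(path, threshold=127):
    """Grayscale, binarize, then OCR the image stored at `path`."""
    # Imported lazily so the helper above works without these packages.
    from PIL import Image
    import tesserocr

    image = Image.open(path).convert("L")             # grayscale
    image = image.point(threshold_table(threshold))   # binarize
    return tesserocr.image_to_text(image).strip()
```

If the result is still poor, trying a few different thresholds is usually the first thing to adjust.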


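2. Slider CAPTCHAs

Here a candidate puzzle piece has to be slid in a straight line to the correct position.

Approach: this kind is somewhat more complicated, but it can still be handled. The obvious idea is to imitate a person dragging the slider: click the button, see where the gap is, and drag the piece to the gap to complete the verification.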

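Step 1: click the button. Before the button is clicked, neither the gap nor the puzzle piece is shown; they appear only after the click. This gives us a way to locate the gap.

Step 2: drag the piece to the gap.

We know the piece has to be dragged to the gap, but how do we express that distance as a number?

Using the observation from step 1, we can locate the gap by comparing the pixels of the two images (before and after the click) against a threshold. Wherever the difference at a position exceeds the threshold, the two images differ. Start scanning from the right edge of the puzzle piece and move left to right, stopping at the first differing position; that position is the left edge of the gap. Then use selenium to drag the piece there.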
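A minimal sketch of this pixel comparison (Python 3) might look like the following. It assumes both images have the same size and are given as rows of (r, g, b) tuples, which could be read from Pillow images; the function names and the threshold of 60 are illustrative assumptions.

```python
def pixel_differs(p1, p2, threshold=60):
    """True if any RGB channel differs by more than the threshold."""
    return any(abs(a - b) > threshold for a, b in zip(p1, p2))


def find_gap_left(before, after, start_x, threshold=60):
    """Scan columns left to right, starting at start_x.

    Returns the x coordinate of the first column in which the two images
    differ, or None if no such column exists. Pass the right edge of the
    puzzle piece as start_x so the piece itself is skipped.
    """
    height = len(before)
    width = len(before[0])
    for x in range(start_x, width):
        for y in range(height):
            if pixel_differs(before[y][x], after[y][x], threshold):
                return x
    return None
```

The drag distance is then the returned x minus the current left edge of the piece.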

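One remaining question is how to save these two images automatically.

Find the CAPTCHA element, read its location and size, and compute top, bottom, left, right = location['y'], location['y'] + size['height'], location['x'], location['x'] + size['width']. Then take a screenshot of the page and crop it with these four values.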
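The cropping step could be sketched like this (Python 3). `crop_box` and `capture_element` are hypothetical helper names; the sketch assumes the element is fully inside the visible viewport and that screenshot pixels map one-to-one to CSS pixels, which is not true on high-DPI displays.

```python
import io


def crop_box(location, size):
    """Turn selenium's location/size dicts into a (left, top, right, bottom) box."""
    left = location["x"]
    top = location["y"]
    right = left + size["width"]
    bottom = top + size["height"]
    return (left, top, right, bottom)


def capture_element(driver, element):
    """Screenshot the page and cut out the given element as a Pillow image."""
    from PIL import Image  # imported lazily; requires Pillow

    png = driver.get_screenshot_as_png()
    screenshot = Image.open(io.BytesIO(png))
    return screenshot.crop(crop_box(element.location, element.size))
```

Note that Pillow's `crop` expects the order (left, top, right, bottom), which differs from the top, bottom, left, right order used in the text.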

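See the selenium documentation for the details. Crop one image before clicking the button and another after. When dragging, imitate human behavior: accelerate first, then decelerate. These CAPTCHAs check behavioral features, and a human cannot move at a perfectly constant speed, so a constant-speed drag is judged to be a machine and fails the verification.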
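One way to produce an accelerate-then-decelerate drag is to split the distance into steps that first grow and then shrink. The sketch below (Python 3) is an assumption-laden illustration: the 70% switch point and the speed increments are made-up values, and real behavioral checks may require more irregular motion than this.

```python
def build_track(distance, accel_ratio=0.7):
    """Split an integer pixel distance into steps that grow, then shrink.

    The steps always sum to exactly `distance`.
    """
    steps = []
    moved = 0
    speed = 0
    switch_point = distance * accel_ratio
    while moved < distance:
        if moved < switch_point:
            speed += 2                   # speed up
        else:
            speed = max(1, speed - 3)    # slow down, but keep moving
        step = min(speed, distance - moved)
        steps.append(step)
        moved += step
    return steps


def drag_slider(driver, slider, distance):
    """Hold the slider element and move it along the generated track."""
    from selenium.webdriver.common.action_chains import ActionChains

    actions = ActionChains(driver)
    actions.click_and_hold(slider)
    for step in build_track(distance):
        actions.move_by_offset(step, 0)
    actions.release()
    actions.perform()
```

Here `slider` is the element found with selenium and `distance` is the gap offset computed from the two cropped images.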

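3. Click-based text CAPTCHAs and icon selection

Text CAPTCHA: a text prompt asks the user to click the positions of the matching characters in an image.

Icon selection: a set of images is shown and the user has to click one or more of them as instructed. It relies on the difficulty of general object recognition to block machines.

The two work on a similar principle. One gives text and asks you to click that text in the image; the other gives an image and asks you to click the images with the same content.

There is no particularly good method for either. You have to rely on a third-party recognition service to identify the matching content. One recommendation is 超级鹰 (Chaojiying): send it the CAPTCHA and it returns the coordinates to click.

Then simulate the clicks with selenium. The image is captured the same way as described above.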
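The clicking step might be sketched as follows (Python 3). The coordinate string format `"x1,y1|x2,y2"` is purely an assumption for illustration and is not taken from any service's documentation; check what the recognition service actually returns. The sketch also assumes a selenium version in which `move_to_element_with_offset` measures offsets from the element's center (older versions measure from the top-left corner, in which case the conversion step should be skipped).

```python
def parse_points(raw):
    """Parse a string such as '10,20|30,40' into [(10, 20), (30, 40)].

    The input format is a hypothetical example.
    """
    points = []
    for pair in raw.split("|"):
        x, y = pair.split(",")
        points.append((int(x), int(y)))
    return points


def to_center_offsets(points, size):
    """Convert points measured from the top-left corner into center-based offsets."""
    half_w = size["width"] / 2
    half_h = size["height"] / 2
    return [(x - half_w, y - half_h) for x, y in points]


def click_points(driver, element, raw):
    """Click every returned point inside the CAPTCHA element."""
    from selenium.webdriver.common.action_chains import ActionChains

    for dx, dy in to_center_offsets(parse_points(raw), element.size):
        ActionChains(driver).move_to_element_with_offset(
            element, int(dx), int(dy)
        ).click().perform()
```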

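4. Grid CAPTCHAs

This kind is tricky: the pattern usually differs each time, although repeats do occur, and the drag order differs as well.

However, the number of distinct CAPTCHAs turns out to be finite, so template matching works. It is essentially brute-force enumeration: save every CAPTCHA that appears, pick out the distinct ones, and name each file after its drag order. Number the points 1, 2, 3, 4 from left to right and top to bottom. If the swipe order is 4, 3, 2, 1, name the file 4_3_2_1.png; this labeling has to be done by hand. When a CAPTCHA appears, compare it pixel by pixel against each saved image, using the same comparison method as above. If it matches the template 4_3_2_1.png, the drag order is 4, 3, 2, 1. Then simulate the drag with selenium.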
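The template lookup could be sketched like this (Python 3). Images are again assumed to be same-sized rows of (r, g, b) tuples, and `templates` is a hypothetical dict mapping file names such as `4_3_2_1.png` to their pixel data; the thresholds are illustrative guesses.

```python
def order_from_name(filename):
    """Recover the drag order encoded in a template file name."""
    stem = filename.rsplit(".", 1)[0]
    return [int(part) for part in stem.split("_")]


def same_image(img_a, img_b, threshold=20, max_mismatch_ratio=0.01):
    """True if at most a small fraction of pixels differ noticeably."""
    total = 0
    mismatched = 0
    for row_a, row_b in zip(img_a, img_b):
        for p1, p2 in zip(row_a, row_b):
            total += 1
            if any(abs(a - b) > threshold for a, b in zip(p1, p2)):
                mismatched += 1
    return total > 0 and mismatched / total <= max_mismatch_ratio


def match_order(captcha, templates):
    """Return the drag order of the first matching template, or None."""
    for name, template in templates.items():
        if same_image(captcha, template):
            return order_from_name(name)
    return None
```

A result of None means the CAPTCHA has not been collected yet and needs to be saved and labeled by hand.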

Given a URL, get the HTML: I wrote a function for this long ago.

Search for:

getUrlRespHtml

and you will find the corresponding Python function:

# Python 2 code (urllib / urllib2).
# gConst and gVal are module-level dicts that must be defined elsewhere (see crifanLib.py).
import logging
import urllib
import urllib2
import zlib

#------------------------------------------------------------------------------
def getUrlResponse(url, postDict={}, headerDict={}, timeout=0, useGzip=False, postDataDelimiter="&") :
    """Get response from url, support optional postDict,headerDict,timeout,useGzip

    Note:
    1. if postDict is not empty, the request automatically becomes POST instead of the default GET
    2. if you want cookies handled automatically, call initAutoHandleCookies() before using this function;
       urllib2.Request will then handle cookies automatically
    """
    # make sure url is str, not unicode, otherwise urllib2.urlopen will raise an error
    url = str(url)

    if (postDict) :
        if(postDataDelimiter=="&"):
            postData = urllib.urlencode(postDict)
        else:
            # join key=value pairs with the custom delimiter
            postData = ""
            for eachKey in postDict.keys() :
                postData += str(eachKey) + "=" + str(postDict[eachKey]) + postDataDelimiter
        postData = postData.strip()
        logging.info("postData=%s", postData)
        req = urllib2.Request(url, postData)
        logging.info("req=%s", req)
        req.add_header('Content-Type', "application/x-www-form-urlencoded")
    else :
        req = urllib2.Request(url)

    defHeaderDict = {
        'User-Agent'    : gConst['UserAgent'],
        'Cache-Control' : 'no-cache',
        'Accept'        : '*/*',
        'Connection'    : 'Keep-Alive',
    }

    # add default headers first
    for eachDefHd in defHeaderDict.keys() :
        #print "add default header: %s=%s"%(eachDefHd,defHeaderDict[eachDefHd])
        req.add_header(eachDefHd, defHeaderDict[eachDefHd])

    if(useGzip) :
        #print "use gzip for",url
        req.add_header('Accept-Encoding', 'gzip, deflate')

    # add customized headers later -> allows overwriting the default headers
    if(headerDict) :
        #print "added header:",headerDict
        for key in headerDict.keys() :
            req.add_header(key, headerDict[key])

    if(timeout > 0) :
        # set timeout value if necessary
        resp = urllib2.urlopen(req, timeout=timeout)
    else :
        resp = urllib2.urlopen(req)

    # update cookies into local file
    if(gVal['cookieUseFile']):
        gVal['cj'].save()
        logging.info("gVal['cj']=%s", gVal['cj'])

    return resp

#------------------------------------------------------------------------------
# get response html==body from url
#def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=False) :
def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=True, postDataDelimiter="&") :
    resp = getUrlResponse(url, postDict, headerDict, timeout, useGzip, postDataDelimiter)
    respHtml = resp.read()

    # even if "Accept-Encoding: gzip, deflate" was not sent, the response may
    # still be gzip or deflate, so decide from the response headers, not from useGzip
    #if(useGzip) :
    #print "---before unzip, len(respHtml)=",len(respHtml)
    respInfo = resp.info()

    # Server: nginx/1.0.8
    # Date: Sun, 08 Apr 2012 12:30:35 GMT
    # Content-Type: text/html
    # Transfer-Encoding: chunked
    # Connection: close
    # Vary: Accept-Encoding
    # ...
    # Content-Encoding: gzip

    # sometimes the request asks for gzip,deflate but the returned html is not compressed
    # -> the response info does not include the above "Content-Encoding: gzip"
    # eg: http://blog.sina.com.cn/s/comment_730793bf010144j7_3.html
    # -> so only decompress when the data is indeed compressed

    #Content-Encoding: deflate
    if("Content-Encoding" in respInfo):
        if("gzip" == respInfo['Content-Encoding']):
            respHtml = zlib.decompress(respHtml, 16+zlib.MAX_WBITS)
        elif("deflate" == respInfo['Content-Encoding']):
            respHtml = zlib.decompress(respHtml, -zlib.MAX_WBITS)

    return respHtml

And sample code:

url = "http://www.crifan.com"

respHtml = getUrlRespHtml(url)

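For the complete library, search for:

crifanLib.py

For scraping dynamic pages, see:

Python专题教程：抓取网站，模拟登陆，抓取动态网页 (Python tutorial series: scraping websites, simulating login, scraping dynamic pages)

(search for the title to find it)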

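This was typed in Python's built-in interactive mode, where each prompt accepts only one block of code at a time. The line import requests must be entered as a separate block from the function you define below it.

That is, press Enter after import requests, and then type your function definition at the next >>> prompt.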

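Some web pages can be fetched with Python's urllib with essentially no problems.

For other pages, though, what the browser shows differs greatly from what is fetched: the fetched result is basically just the page skeleton, without the actual content.

One example is the Bing dictionary, http://dict.bing.com.cn/#good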
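One detail worth checking in a URL like this is the part after `#`. The fragment is handled by the browser and is not sent to the server in the HTTP request, so a plain urllib fetch only ever asks for the base page. Whether that is the whole explanation for this particular site is an assumption; the short Python 3 sketch below only shows how to separate the two parts (`split_fragment` is a hypothetical helper name).

```python
from urllib.parse import urldefrag


def split_fragment(url):
    """Return (url_without_fragment, fragment)."""
    base, fragment = urldefrag(url)
    return base, fragment
```

For the URL above, the first element is what a plain HTTP client requests; content that depends on the fragment is typically filled in by JavaScript in the browser.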

Locate the gap in the CAPTCHA, then drag the piece to it.
