Learning Python: Handling CAPTCHAs in a Python Crawler - 职坐标

Posted by 小职 on 2020-10-29

Summary: This article covers ways to handle CAPTCHAs that a Python crawler runs into and wraps them into a reusable class. We hope it helps with learning Python.

This wraps some of the better ways of handling image CAPTCHAs into a single class, relying mainly on Baidu's AIP SDK.

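This article introduces ways to handle CAPTCHAs in a crawler and wraps them up for reuse. It covers how to call Baidu AIP and how to use muggle_ocr, a recently released open-source recognition library.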

Calling the Baidu AIP interface:

Extension: Baidu's pornography-detection interface:

Using the muggle_ocr recognition interface:

Wrapper source code:

Calling the Baidu AIP interface:

1. First, register an account:

https://login.bce.baidu.com/

Log in once registration is complete.

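2. Create a project

Among the listed technologies, find Text Recognition (文字识别) and click to create an application.

Once it is created, the application page shows the AppID, API Key, and Secret Key. These are needed shortly.

Next, either consult the official documentation or use the code I wrote directly.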
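The code later in this article hard-codes the three credentials in __init__. As a hedged alternative, they can be read from environment variables so they stay out of source control. The sketch below is an assumption and not part of the original code: the variable names BAIDU_APP_ID, BAIDU_API_KEY, BAIDU_SECRET_KEY and the helper load_baidu_credentials are hypothetical.

```python
import os


def load_baidu_credentials(env=None):
    """Return (app_id, api_key, secret_key) read from environment variables.

    The variable names are hypothetical; pick whatever fits your setup.
    """
    env = os.environ if env is None else env
    names = ("BAIDU_APP_ID", "BAIDU_API_KEY", "BAIDU_SECRET_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in names)


if __name__ == "__main__":
    # Demo with an explicit mapping instead of the real environment.
    demo_env = {
        "BAIDU_APP_ID": "your-app-id",
        "BAIDU_API_KEY": "your-api-key",
        "BAIDU_SECRET_KEY": "your-secret-key",
    }
    print(load_baidu_credentials(demo_env))
```

The returned tuple can be unpacked straight into the client constructor used in this article, for example AipOcr(*load_baidu_credentials()).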

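3. Install the dependency: pip install baidu-aip

The method below is only the recognition call; it relies on the setup described above (the client is created in __init__).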

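def return_ocr_by_baidu(self, test_image):
    """
    Note: set your own baidu-aip credentials in __init__ first.
    This uses the high-accuracy endpoint; if it is too slow, switch back
    to the general one:
        self.client.basicGeneral(image, options)
    Reference:
        https://cloud.baidu.com/doc/OCR/s/3k3h7yeqa
    :param test_image: name of the image file to recognize
    :return: the recognized CAPTCHA text; if it is wrong, call again
    """
    image = self.return_image_content(test_image=self.return_path(test_image))
    # General text recognition (high-accuracy version); without options:
    # self.client.basicAccurate(image)
    # Optional parameters are documented at the URL above
    options = {}
    options["detect_direction"] = "true"
    options["probability"] = "true"
    # Call the API
    result = self.client.basicAccurate(image, options)
    # Guard against a response with no recognized text (avoids an IndexError)
    words_result = result.get('words_result') or []
    result_s = words_result[0]['words'] if words_result else ''
    # Remove this print if you do not want the output
    print(result_s)
    if result_s:
        return result_s.strip()
    else:
        raise Exception("The result is None , try it !")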
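The method above returns only the first line of words_result. A CAPTCHA image may come back as several fragments, or as a response with no words_result at all. The sketch below joins every recognized line and reports failures explicitly. It assumes a failed request returns a dict carrying error_code and error_msg keys; the helper name extract_words and the sample dict are hypothetical.

```python
def extract_words(result):
    """Join all recognized lines from an OCR response dict.

    Assumes the response shape used in this article:
    {"words_result": [{"words": "..."}, ...]}. Error responses are assumed
    to carry "error_code" and "error_msg" instead.
    """
    if "error_code" in result:
        raise RuntimeError(
            "OCR request failed: {} {}".format(
                result["error_code"], result.get("error_msg", "")
            )
        )
    lines = [item.get("words", "") for item in result.get("words_result", [])]
    text = "".join(lines).strip()
    if not text:
        raise ValueError("No text recognized in the image")
    return text


if __name__ == "__main__":
    sample = {"words_result": [{"words": " ab"}, {"words": "12 "}]}
    print(extract_words(sample))
```

Raising distinct exception types lets the caller decide whether to retry (nothing recognized) or stop (the request itself failed).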

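Extension: Baidu's pornography-detection interface:

Writing code should be a bit of fun too; it does not have to be this dry, right?

The pornography-detection interface sits under Content Moderation (内容审核) in the Baidu console; look for it there.

Source code for calling it: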

# -*- coding: utf-8 -*-
# @Time : 2020/10/22 17:30
# @author : 沙漏在下雨
# @Software : PyCharm
# @CSDN : https://me.csdn.net/qq_45906219
from aip import AipContentCensor
# The import below expects the wrapper class from this article in ocr.py
from ocr import MyOrc


class Auditing(MyOrc):
    """
    Wrapper around Baidu's content-moderation AIP interface.
    Mainly used to screen for pornographic, terror-related, or otherwise
    objectionable content.
    URL: https://ai.baidu.com/ai-doc/ANTIPORN/tk3h6xgkn
    """

    def __init__(self):
        # super().__init__() is not called; self.client is set to a
        # content-censor client instead of the OCR client
        APP_ID = 'your ID'
        API_KEY = 'your KEY'
        SECRET_KEY = 'your SECRET_KEY'
        self.client = AipContentCensor(APP_ID, API_KEY, SECRET_KEY)

    def return_path(self, test_image):
        return super().return_path(test_image)

    def return_image_content(self, test_image):
        return super().return_image_content(test_image)

    def return_Content_by_baidu_of_image(self, test_image, mode=0):
        """
        Reuses helper methods inherited from the OCR class to save some code.
        Content moderation: checks whether an image contains illegal or
        objectionable content.
        Content moderation can also review text; that seemed of little
        use here, so it is not wrapped.
        url: https://ai.baidu.com/ai-doc/ANTIPORN/Wk3h6xg56
        :param test_image: image to check, either a local file or a URL
        :param mode: 0 (default) for a local file, 1 for an image URL
        :return: the moderation result
        """
        if mode == 0:
            # Local file: send the raw image bytes
            image = self.return_image_content(self.return_path(test_image=test_image))
        elif mode == 1:
            # Remote image: send the URL string as-is
            image = test_image
        else:
            raise ValueError(f"The mode is 0 or 1 but your mode is {mode!r}")
        # Call the moderation interface
        result = self.client.imageCensorUserDefined(image)
        # For an image URL the call looks like this:
        # result = self.client.imageCensorUserDefined('//www.example.com/image.jpg')
        print(result)
        return result


if __name__ == "__main__":
    a = Auditing()
    a.return_Content_by_baidu_of_image("test_image/2.jpg", mode=0)
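The mode argument above makes the caller state whether test_image is a local file or a URL. As a sketch of an alternative, the distinction can be inferred from the string itself with the standard library. The helper names looks_like_url and pick_mode are hypothetical, and treating only http and https as URLs is an assumption about what the service accepts.

```python
from urllib.parse import urlparse


def looks_like_url(image):
    """Return True when image is an http(s) URL rather than a local path."""
    parsed = urlparse(image)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def pick_mode(image):
    """Map an image reference to the mode values used above: 0 local, 1 URL."""
    return 1 if looks_like_url(image) else 0


if __name__ == "__main__":
    for ref in ("test_image/2.jpg", "https://www.example.com/image.jpg"):
        print(ref, "->", pick_mode(ref))
```

A caller could then pass mode=pick_mode(test_image) instead of choosing the value by hand.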

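Using the muggle_ocr recognition interface:

This package has become popular recently. It is simple to use and exposes few other functions.

Install it with pip install muggle-ocr. The download is rather slow, so a mobile hotspot may work better. At the time of writing, the mirror sites (Tsinghua/Aliyun) had not yet picked up this package because it is a newly released OCR model.

Calling the interface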

def return_ocr_by_muggle(self, test_image, mode=1):
    """
    Recognize an image with muggle_ocr.
    :param test_image: name of the image file, preferably an absolute path
    :param mode: 0 selects ModelType.OCR for ordinary printed text;
        1 (default) selects ModelType.Captcha for simple 4-6 character
        alphanumeric CAPTCHAs
    Project page: https://pypi.org/project/muggle-ocr/
    :return: the recognized CAPTCHA text; if it is wrong, call again
    """
    # Pick the model that matches what is being recognized
    if mode == 1:
        sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.Captcha)
    elif mode == 0:
        sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.OCR)
    else:
        raise ValueError(f"The mode is 0 or 1 , but your mode == {mode!r}")
    filepath = self.return_path(test_image=test_image)
    with open(filepath, 'rb') as fr:
        captcha_bytes = fr.read()
    result = sdk.predict(image_bytes=captcha_bytes)
    # Remove this print if you do not want the output
    print(result)
    return result.strip()
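The method above builds a new muggle_ocr.SDK on every call. If constructing the SDK is expensive (an assumption here; it presumably loads a model), the instance can be created once per mode and reused. The sketch below shows that caching pattern with a stand-in factory so it runs without muggle_ocr installed; make_cached_factory and fake_sdk are hypothetical names.

```python
def make_cached_factory(build):
    """Wrap build(mode) so each mode is constructed at most once."""
    cache = {}

    def get(mode):
        if mode not in cache:
            cache[mode] = build(mode)
        return cache[mode]

    return get


if __name__ == "__main__":
    calls = []

    def fake_sdk(mode):
        # Stand-in for muggle_ocr.SDK(model_type=...); records each build.
        calls.append(mode)
        return "sdk-for-mode-{}".format(mode)

    get_sdk = make_cached_factory(fake_sdk)
    get_sdk(1)
    get_sdk(1)
    get_sdk(0)
    print(calls)
```

With the real library, build could map mode 1 to muggle_ocr.SDK(model_type=muggle_ocr.ModelType.Captcha) and mode 0 to muggle_ocr.SDK(model_type=muggle_ocr.ModelType.OCR), as in the method above.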

Wrapper source code:

# -*- coding: utf-8 -*-
# @Time : 2020/10/22 14:12
# @author : 沙漏在下雨
# @Software : PyCharm
# @CSDN : https://me.csdn.net/qq_45906219
"""
Note: this module wraps two common image/CAPTCHA recognition methods in one
class; which one to use is up to you.
Interface 1: muggle_ocr
    pip install muggle-ocr (the download is slow; a mobile hotspot may help)
    At the time of writing the mirror sites (Tsinghua/Aliyun) had not yet
    picked up this package because it is a newly released OCR model.
Interface 2: baidu-aip
    pip install baidu-aip
    Plenty of people know this one, but I find the new muggle package far
    more impressive.
    For usage see the official docs: https://cloud.baidu.com/doc/OCR/index.html
    or use it the way shown below; either works.
:param image_path: path of the image to recognize; for deeply nested
    directories an absolute path is recommended
"""
import os

import muggle_ocr
from aip import AipOcr


class MyOrc:
    def __init__(self):
        # Required settings: fill in your own Baidu AIP credentials
        APP_ID = 'your ID'
        API_KEY = 'your KEY'
        SECRET_KEY = 'your SECRET_KEY'
        self.client = AipOcr(APP_ID, API_KEY, SECRET_KEY)

    def return_path(self, test_image):
        """:return abs image_path"""
        # Resolve the path
        if os.path.isabs(test_image):
            filepath = test_image
        else:
            filepath = os.path.abspath(test_image)
        return filepath

    def return_image_content(self, test_image):
        """:return the image content """
        with open(test_image, 'rb') as fr:
            return fr.read()

    def return_ocr_by_baidu(self, test_image):
        """
        Note: set your own baidu-aip credentials in __init__ first.
        This uses the high-accuracy endpoint; if it is too slow, switch
        back to the general one:
            self.client.basicGeneral(image, options)
        Reference:
            https://cloud.baidu.com/doc/OCR/s/3k3h7yeqa
        :param test_image: name of the image file to recognize
        :return: the recognized CAPTCHA text; if it is wrong, call again
        """
        image = self.return_image_content(test_image=self.return_path(test_image))
        # General text recognition (high-accuracy version); without options:
        # self.client.basicAccurate(image)
        # Optional parameters are documented at the URL above
        options = {}
        options["detect_direction"] = "true"
        options["probability"] = "true"
        # Call the API
        result = self.client.basicAccurate(image, options)
        # Guard against a response with no recognized text (avoids an IndexError)
        words_result = result.get('words_result') or []
        result_s = words_result[0]['words'] if words_result else ''
        # Remove this print if you do not want the output
        print(result_s)
        if result_s:
            return result_s.strip()
        else:
            raise Exception("The result is None , try it !")

    def return_ocr_by_muggle(self, test_image, mode=1):
        """
        Recognize an image with muggle_ocr.
        :param test_image: name of the image file, preferably an absolute path
        :param mode: 0 selects ModelType.OCR for ordinary printed text;
            1 (default) selects ModelType.Captcha for simple 4-6 character
            alphanumeric CAPTCHAs
        Project page: https://pypi.org/project/muggle-ocr/
        :return: the recognized CAPTCHA text; if it is wrong, call again
        """
        # Pick the model that matches what is being recognized
        if mode == 1:
            sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.Captcha)
        elif mode == 0:
            sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.OCR)
        else:
            raise ValueError(f"The mode is 0 or 1 , but your mode == {mode!r}")
        filepath = self.return_path(test_image=test_image)
        with open(filepath, 'rb') as fr:
            captcha_bytes = fr.read()
        result = sdk.predict(image_bytes=captcha_bytes)
        # Remove this print if you do not want the output
        print(result)
        return result.strip()


# a = MyOrc()
# a.return_ocr_by_baidu(test_image='test_image/digit_img_1.png')
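Both docstrings suggest calling the method again when the result is wrong. One way to combine that advice with the two back ends is a small fallback loop: try each recognizer a few times and return the first result that looks like a CAPTCHA. The sketch below is an assumption about how the class might be used, not part of the original source; recognize_with_fallback, the attempts parameter, and the 4-6 alphanumeric check (borrowed from the muggle_ocr docstring) are illustrative.

```python
import re

# 4-6 letters or digits, matching the "simple CAPTCHA" description above.
CAPTCHA_PATTERN = re.compile(r"[A-Za-z0-9]{4,6}")


def recognize_with_fallback(recognizers, image, attempts=2):
    """Try each recognizer up to `attempts` times; return the first valid text.

    recognizers is a list of callables that take the image path and return
    a string. An exception or an implausible result moves on to the next try.
    """
    last_error = None
    for recognize in recognizers:
        for _ in range(attempts):
            try:
                text = recognize(image).strip()
            except Exception as exc:  # the back ends may raise plain Exception
                last_error = exc
                continue
            if CAPTCHA_PATTERN.fullmatch(text):
                return text
    raise RuntimeError("No recognizer produced a valid CAPTCHA") from last_error


if __name__ == "__main__":
    def always_fails(path):
        raise Exception("The result is None , try it !")

    def works(path):
        return " a1B2 "

    print(recognize_with_fallback([always_fails, works], "captcha.png"))
```

With the class from this article, recognizers could be [ocr.return_ocr_by_muggle, ocr.return_ocr_by_baidu] for an instance ocr = MyOrc().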

This article was published on 职坐标 by @小职. Reproduction without permission is prohibited.
