Training Tesseract OCR on Samples to Improve CAPTCHA Recognition Accuracy - 天狐博客

1. Get the sample files

Download a few dozen sample images, either with a script or by hand, then batch-convert them to TIFF format.

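# pip3 install Pillow
from PIL import Image
import os.path

# Convert PNG to TIFF
def png2tif(path):
    # Walk through every file in the folder
    for img in os.listdir(path):
        name, ext = os.path.splitext(img)
        # Skip anything that is not a PNG (for example .DS_Store)
        if ext.lower() != '.png':
            continue
        # Save next to the source file, changing only the extension
        with Image.open(os.path.join(path, img)) as im:
            im.save(os.path.join(path, name + '.tif'), 'TIFF')

if __name__ == '__main__':
    png2tif('/temp')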

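2. Install jTessBoxEditor

Download jTessBoxEditor from https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/
Unzip jTessBoxEditorFX-2.4.1.zip to get the jTessBoxEditor folder. On Windows, run the .bat file directly; on macOS, open a terminal, change into that directory, and run the following (a JRE must already be installed; installing it is not covered here):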

java -Xms128m -Xmx1024m -jar jTessBoxEditorFX.jar

2. Merge the TIFF files into one file

Open jTessBoxEditor > [Tools] > [Merge TIFF]

Select all the prepared TIFF files. After the merge runs, save the result as cqc.font.exp0.tif
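
For readers who would rather script this step than click through the GUI, the merge could probably also be done with Pillow, which step 1 already uses. This is only a sketch and not what jTessBoxEditor does internally; the helper name `merge_tifs` and the alphabetical page order are assumptions.

```python
from pathlib import Path
from PIL import Image

def merge_tifs(src_dir, out_path):
    """Write every .tif in src_dir as one page of a multi-page TIFF (hypothetical helper)."""
    out_path = Path(out_path)
    # Sort for a stable page order; skip the output file if it lives in src_dir.
    files = sorted(p for p in Path(src_dir).glob('*.tif')
                   if p.resolve() != out_path.resolve())
    if not files:
        raise ValueError(f'no .tif files found in {src_dir}')
    pages = [Image.open(p) for p in files]
    # save_all + append_images writes the remaining images as extra pages.
    pages[0].save(out_path, format='TIFF', save_all=True, append_images=pages[1:])
    for page in pages:
        page.close()
    return len(pages)
```

Calling `merge_tifs('/temp', '/temp/cqc.font.exp0.tif')` would produce a file with the same name as the one saved from the GUI.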

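3. Generate the BOX file

[Syntax]: tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox
[Note]: lang is the language name, fontname is the font name, and num is a sequence number. Tesseract depends on this naming format, so follow it exactly.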

tesseract cqc.font.exp0.tif cqc.font.exp0 batch.nochop makebox

If the generated box file is empty and the console reports "empty page!!", rerun the command with an extra parameter:

tesseract cqc.font.exp0.tif cqc.font.exp0 --psm 7 batch.nochop makebox

pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or --psm pagesegmode must occur before any configfile.
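
The naming convention and the two makebox commands above can be sketched as a small Python helper. The function names `training_base` and `makebox_command` are hypothetical; the sketch only builds the argument list and does not run tesseract itself.

```python
def training_base(lang, fontname, num=0):
    """Build the [lang].[fontname].exp[num] base name (hypothetical helper)."""
    return f'{lang}.{fontname}.exp{num}'

def makebox_command(lang, fontname, num=0, psm=None):
    """Return the tesseract makebox command as an argument list."""
    base = training_base(lang, fontname, num)
    cmd = ['tesseract', f'{base}.tif', base]
    if psm is not None:
        # Options such as --psm must come before the config file names.
        cmd += ['--psm', str(psm)]
    return cmd + ['batch.nochop', 'makebox']

if __name__ == '__main__':
    print(' '.join(makebox_command('cqc', 'font')))
    print(' '.join(makebox_command('cqc', 'font', psm=7)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once tesseract is on the PATH and the TIFF file is in the current directory.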

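A box file is plain text, so it can be opened in any text editor; each line describes one recognized character and its region. If a character was not recognized correctly, compare against the source image and correct the line by hand.

Each line contains: the character, the left, bottom, right, and top coordinates of its bounding box (measured from the bottom-left corner of the image), and the index of the page it appears on.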
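
To check a box file before editing it by hand, it can help to load it into Python. The sketch below assumes the standard Tesseract box layout `symbol left bottom right top page`, with coordinates measured from the bottom-left corner of the image; the `Box` type and both function names are hypothetical.

```python
from typing import NamedTuple

class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int

def parse_box_line(line):
    """Parse one box-file line of the form 'char left bottom right top page'."""
    # Split from the right so the five numeric fields are always taken last.
    char, left, bottom, right, top, page = line.rstrip('\n').rsplit(' ', 5)
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))

def read_boxes(path):
    """Read every non-empty line of a box file into a list of Box tuples."""
    with open(path, encoding='utf-8') as f:
        return [parse_box_line(line) for line in f if line.strip()]
```

For example, `read_boxes('cqc.font.exp0.box')` returns one `Box` per character, and `box.right - box.left` gives a character's width in pixels.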

4. Generate the lstmf file

tesseract cqc.font.exp0.tif cqc.font.exp0 -l eng --psm 7 lstm.train

After this runs, a file named cqc.font.exp0.lstmf appears in the folder. In the same directory, create eng.training_files.txt and write the full path of cqc.font.exp0.lstmf into it.
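
With more than one .lstmf file, writing the list by hand gets error-prone. The following sketch writes one absolute path per line; the helper name `write_training_list` is hypothetical, and it assumes every .lstmf file in the folder should be used for training.

```python
from pathlib import Path

def write_training_list(work_dir, list_name='eng.training_files.txt'):
    """Write the absolute path of every .lstmf file in work_dir, one per line."""
    work_dir = Path(work_dir).resolve()
    lstmf_files = sorted(work_dir.glob('*.lstmf'))
    list_path = work_dir / list_name
    list_path.write_text(''.join(f'{p}\n' for p in lstmf_files), encoding='utf-8')
    return lstmf_files
```

The function returns the files it listed, so an empty result is a quick sign that the lstm.train step did not produce any output.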

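5. Extract the LSTM file for the language

Next, download the traineddata file for the relevant language from tessdata_best (link: https://github.com/tesseract-ocr/tessdata_best).

The previous steps used English as the language, so choose eng.traineddata here.

Once it is downloaded, extract its LSTM file with the following command: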

combine_tessdata -e eng.traineddata eng.lstm

Running the command above creates a file named eng.lstm in the folder.

6. Train

Run the following command in a console:

lstmtraining --model_output="/Volumes/files/ai/ocr训练/output"  --continue_from="/Volumes/files/ai/ocr训练/eng.lstm" --train_listfile="/Volumes/files/ai/ocr训练/eng.training_files.txt"  --traineddata="/Volumes/files/ai/ocr训练/eng.traineddata"  --debug_interval -1 --max_iterations 4000
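
Because the same folder is repeated in every option, the command above can also be assembled from a single base directory. This is a sketch under the assumption that all four files live in one folder, as they do above; the helper name `finetune_command` is hypothetical, and the flags are the ones used in the command above.

```python
from pathlib import Path

def finetune_command(work_dir, lang='eng', max_iterations=4000):
    """Build the lstmtraining fine-tuning command as an argument list."""
    work = Path(work_dir)
    return [
        'lstmtraining',
        f'--model_output={work / "output"}',
        f'--continue_from={work / (lang + ".lstm")}',
        f'--train_listfile={work / (lang + ".training_files.txt")}',
        f'--traineddata={work / (lang + ".traineddata")}',
        '--debug_interval', '-1',
        '--max_iterations', str(max_iterations),
    ]
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with spaces or non-ASCII characters in the path.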

When training finishes, the console prints:

File cqc.font.exp0.lstmf line 1 :
Mean rms=3.84%, delta=22.419%, train=97.54%(100%), skip ratio=0%
At iteration 4000/4000/4000, mean rms=3.840%, delta=22.419%, BCER train=97.540%, BWER train=100.000%, skip ratio=0.000%, wrote checkpoint.
Finished! Selected model with minimal training error rate (BCER) = 93.233
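
BCER and BWER in this output are the character and word error rates on the training data, so lower values are better. To track them across a long run, the progress lines can be parsed with a regular expression. The pattern below is written against the log format shown above and may need adjusting for other Tesseract versions; `parse_progress` is a hypothetical helper.

```python
import re

# Matches lines such as:
# At iteration 4000/4000/4000, mean rms=3.840%, ..., BCER train=97.540%, BWER train=100.000%, ...
LOG_PATTERN = re.compile(
    r'At iteration (\d+)/(\d+)/(\d+), .*?BCER train=([\d.]+)%, BWER train=([\d.]+)%'
)

def parse_progress(line):
    """Return the iteration counters and error rates from one log line, or None."""
    m = LOG_PATTERN.search(line)
    if m is None:
        return None
    return {
        'iterations': (int(m.group(1)), int(m.group(2)), int(m.group(3))),
        'bcer': float(m.group(4)),
        'bwer': float(m.group(5)),
    }
```

Lines that are not progress reports, such as the final "Finished!" line, return None and can simply be skipped.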

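After training completes, the output folder contains the checkpoint file output_checkpoint.

7. Merge into a traineddata file

Combine eng.traineddata with the checkpoint to produce the new cqc.traineddata: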

lstmtraining --stop_training --continue_from="/Volumes/files/ai/ocr训练/output/output_checkpoint" --traineddata="/Volumes/files/ai/ocr训练/eng.traineddata" --model_output="/Volumes/files/ai/ocr训练/cqc.traineddata"

A file named cqc.traineddata appears in the folder. Copy it into the tessdata folder of your Tesseract-OCR installation, and it can then be used as a language for text recognition.

Test:

tesseract 1697784764.png num01 -l cqc --psm 7
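
A single image says little about accuracy, so it may be worth running the model over a set of labelled CAPTCHAs. The sketch below is one possible way to do that: `recognize` calls tesseract with `stdout` as the output base so the text is printed instead of written to a file, and `accuracy` counts exact matches. Both function names are hypothetical, and the `cqc` default assumes the new model has already been copied into tessdata.

```python
import subprocess

def recognize(image_path, lang='cqc', psm=7):
    """Run tesseract on one image and return the text (needs tesseract on PATH)."""
    result = subprocess.run(
        ['tesseract', str(image_path), 'stdout', '-l', lang, '--psm', str(psm)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def accuracy(samples, recognizer=recognize):
    """samples maps image path -> expected text; returns the fraction matched exactly."""
    if not samples:
        return 0.0
    hits = sum(1 for path, expected in samples.items()
               if recognizer(path) == expected)
    return hits / len(samples)
```

Running `accuracy` once with `lang='eng'` and once with `lang='cqc'` (for instance via `functools.partial(recognize, lang='eng')`) gives a direct before-and-after comparison.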

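In the end, the working folder contains the following files: [figure: file listing of the working folder]

Reference: 真实场景下的Tesseract神经网络训练识别图片验证码 (Training Tesseract's neural network to recognize image CAPTCHAs in a real-world scenario)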

When reposting, please credit: 天狐博客 » TesseractOCR样本训练提高识别验证码准确率
