How to Use Python's pytesseract Library | From Installation to Optical Character Recognition (OCR)

2023.01.18 2022.10.27

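This article shows how to perform optical character recognition (OCR) with pytesseract (python-tesseract).

pytesseract is a Python wrapper for the Tesseract-OCR Engine.

Tesseract is an open-source OCR engine. It was originally developed at Hewlett-Packard, and Google sponsored its development for many years.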

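Sometimes an image contains text that you need to get into a computer.

Reading text in an image is easy for us, but understanding it is a genuinely hard task for a computer.

A computer sees an image only as an array of pixels.

This is where OCR helps.

OCR detects the text in an image and converts it into encoded text that a computer can readily process.

This article shows how to perform OCR tasks in Python.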


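Implementing Basic Optical Character Recognition in Python
1. Get an image with clearly visible text
2. Code to extract text from the image
Implementing OCR after Preprocessing with OpenCV
1. Find an image with clear text
2. Complete code for preprocessing and text extraction in Python
Summary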

Implementing Basic Optical Character Recognition in Python

Install the Python wrapper for Tesseract with pip.

$ pip install pytesseract

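pytesseract is only a wrapper, so the Tesseract binary must be installed separately. For details on installing the binary and getting pytesseract to work with it, refer to this Stack Overflow question.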
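Before running the examples, it can help to confirm that the Tesseract binary is actually reachable. The following is a minimal sketch that uses only the standard library. It assumes the executable is named tesseract and is expected on PATH; the helper name find_tesseract is hypothetical and not part of pytesseract.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(executable)


path = find_tesseract()
if path is None:
    print("Tesseract was not found on PATH; install it or add it to PATH.")
else:
    print(f"Tesseract found at: {path}")
    # If pytesseract cannot locate the binary by itself, it can be pointed at this path:
    # pytesseract.pytesseract.tesseract_cmd = path
```

If the binary is installed somewhere outside PATH, pass its full path to pytesseract.pytesseract.tesseract_cmd instead, as in the commented line.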

1. Get an image with clearly visible text

Let's start by extracting text from a sample image in which the text is clearly visible.

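# Import libraries
import cv2
import pytesseract

# Load the image with OpenCV (imread returns None if the file cannot be read)
img = cv2.imread('sample.jpg')
if img is None:
    raise FileNotFoundError('sample.jpg could not be read')

# OpenCV loads images as BGR, while pytesseract assumes RGB
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# Convert the image to text
text = pytesseract.image_to_string(img_rgb)
print(text)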

2. Code to extract text from the image

The sample image is in JPEG format. Extracting the text information from it gives the following:

On the Insert tab, the galleries include items that are designed
to coordinate with the overall look of your document. You can
use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks. When you create
pictures, charts, or diagrams, they also coordinate with your
current document look.

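The complete code with OpenCV preprocessing, whose steps are described in the section on preprocessing, is as follows.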

# Import libraries
import cv2
import numpy as np
import pytesseract

# Load the image with OpenCV
img = cv2.imread('sample_test.jpg')

# Preprocess the image
# Convert to grayscale
gray_image = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Create a binary image; with THRESH_OTSU the threshold is chosen
# automatically, so the threshold argument (0 here) is ignored
binary_image = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# Invert the image
inverted_bin = cv2.bitwise_not(binary_image)

# Reduce noise: erosion followed by dilation (morphological opening)
kernel = np.ones((2, 2), np.uint8)
processed_img = cv2.erode(inverted_bin, kernel, iterations=1)
processed_img = cv2.dilate(processed_img, kernel, iterations=1)

# Apply the image_to_string method
text = pytesseract.image_to_string(processed_img)
print(text)

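In the basic example, after loading the image with OpenCV, we passed it to pytesseract's image_to_string method, which takes the image as its input argument.

That single line of code converts the text in the image into encoded text.

However, the accuracy of the conversion depends directly on the quality of the input image, so in practice OCR is difficult without preprocessing the image.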
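image_to_string returns an ordinary Python string, so the result can be tidied with standard string handling before further use. The sketch below is one possible cleanup and is not part of pytesseract: the helper name clean_ocr_text is hypothetical, and it assumes that blank lines and a trailing form-feed character (which Tesseract output often ends with) are not wanted.

```python
def clean_ocr_text(text):
    """Strip each line of OCR output and drop the lines that end up empty."""
    # splitlines() treats "\n" and the form feed "\x0c" as line boundaries
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# A made-up string shaped like raw OCR output, used only for illustration
raw = "On the Insert tab, the galleries  \n\n current document look.\n\x0c"
print(clean_ocr_text(raw))
```

Whether dropping blank lines is appropriate depends on the use case; paragraph breaks in the source image are lost with this approach.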

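Implementing OCR after Preprocessing with OpenCV

The image is preprocessed in the following steps.

1. Convert the image to grayscale. The image ultimately has to become a binary image, so the color image is first converted to grayscale.
2. Convert the grayscale image to a binary image by thresholding. Thresholding checks whether each pixel value is above or below a given threshold. With cv2.THRESH_BINARY, pixels above the threshold become white and all other pixels become black.
3. Invert the image with the bitwise_not operation.
4. Apply noise reduction to the image.
5. Apply the text extraction method to the preprocessed image.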
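The thresholding and inversion steps above can be illustrated on a tiny array without OpenCV. The following is a NumPy-only sketch under stated assumptions: the function names otsu_threshold, binarize, and invert are hypothetical, the 3×3 array stands in for a grayscale image, and the Otsu routine is a plain textbook version rather than OpenCV's own implementation, so its threshold may differ from the one cv2.threshold picks.

```python
import numpy as np


def otsu_threshold(gray):
    """Return the threshold that maximizes the between-class variance (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256, dtype=np.float64)
    total = hist.sum()
    best_t, best_score = 0, -1.0
    for t in range(256):
        w0 = hist[: t + 1].sum()  # number of pixels with value <= t
        w1 = total - w0           # number of pixels with value > t
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (hist[: t + 1] * levels[: t + 1]).sum() / w0
        mu1 = (hist[t + 1:] * levels[t + 1:]).sum() / w1
        score = w0 * w1 * (mu0 - mu1) ** 2  # proportional to between-class variance
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(gray, threshold):
    """Mimic cv2.THRESH_BINARY: pixels above the threshold become 255, the rest 0."""
    return np.where(gray > threshold, 255, 0).astype(np.uint8)


def invert(binary):
    """Mimic cv2.bitwise_not on an 8-bit binary image: 0 and 255 swap places."""
    return (255 - binary).astype(np.uint8)


# Stand-in for a grayscale image: four dark pixels and five bright ones
gray = np.array([[20, 210, 30],
                 [10, 220, 240],
                 [40, 200, 230]], dtype=np.uint8)

t = otsu_threshold(gray)
binary = binarize(gray, t)
inverted = invert(binary)
print("threshold:", t)
print(binary)
print(inverted)
```

On real images the OpenCV calls in the article's listing are the practical choice; the loop here only makes the logic of each step visible.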

1. Find an image with clear text

Let's implement the steps above in code, using an image of the following passage.

On the Insert tab, the galleries include items that are designed
to coordinate with the overall look of your document. You can
use these galleries to insert tables, headers, footers, lists, cover
pages, and other document building blocks. When you create
pictures, charts, or diagrams, they also coordinate with your
current document look,
 
You can easily change the formatting of selected text in the
documenttext by choosing a look for the selected text from the
Quick Styies gallery on the Home tab. You can also format text
directly by using the other controls on the Home tab. Most
controls offer a choice of using the look from the current theme
 
or using a tormat that you specify directly.
 
To change the overall look of your document, choose new
Theme elements on the Page Layout tab. To change the looks
available in the Quick Style gallery, use the Change Current
Quick Style Set command. Both the Themes gallery and the
Quick Styles gallery provide reset commands so that you can