Implementing a Word2Vec Model and Using Pre-trained Models with Gensim in Python

2023.01.18 2022.10.16

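Word2Vec is an algorithm that converts words into vectors in such a way that similar words end up close together in the vector space.

It is widely used in applications such as document retrieval, machine translation systems, autocompletion, and prediction.

In this article, you will learn how to train a Word2Vec model with the Gensim library and how to load pre-trained models that convert words into vectors.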


Word2Vec
Gensim Word2Vec
Summary

Word2Vec

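Word2Vec is an algorithm developed at Google that uses a neural network to learn word embeddings, so that the embeddings of words with similar meanings point in similar directions.

For example, the embeddings of words such as love and care point in a similar direction in the vector space, compared with the embeddings of words such as fight and battle.

A model like this can also detect synonyms of a given word and suggest additional words to complete a partial sentence.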
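"Pointing in a similar direction" is usually measured with cosine similarity. The following is a minimal sketch using only the standard library; the 3-dimensional vectors are hypothetical values invented for illustration, not the output of a trained model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (assumption: made up for this example).
toy_vectors = {
    "love": [0.9, 0.8, 0.1],
    "care": [0.8, 0.9, 0.2],
    "fight": [-0.7, 0.1, 0.9],
    "battle": [-0.8, 0.2, 0.8],
}

# Words with similar meanings get a score close to 1.
print("love vs care: ", cosine_similarity(toy_vectors["love"], toy_vectors["care"]))
# Words with unrelated meanings get a much lower score.
print("love vs fight:", cosine_similarity(toy_vectors["love"], toy_vectors["fight"]))
```

A score near 1 means the two vectors point in almost the same direction, a score near 0 means they are unrelated, and a negative score means they point in opposing directions.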

Gensim Word2Vec

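Gensim is an open-source Python library that can be used for topic modeling, document indexing, and similarity retrieval with large corpora.

Gensim's algorithms are memory-independent with respect to corpus size, so the corpus does not need to fit in memory.

It is also designed so that it can be extended with other vector space algorithms.

Gensim provides an implementation of the Word2Vec algorithm, along with related functionality, through its Word2Vec class.

Let's look at how to create a Word2Vec model with Gensim.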
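In practice, memory independence means the training data can be streamed from disk instead of being held in a list. The following is a minimal sketch, assuming a plain-text corpus with one sentence per line; the class name FileSentences and the sample text are hypothetical and exist only for this illustration. Because training makes several passes over the data, the iterable has to be restartable, which a plain generator is not.

```python
import os
import tempfile


class FileSentences:
    """Restartable iterable that yields one tokenized sentence per line of a file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass so that several epochs can re-read it.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip empty lines
                    yield tokens


# Write a tiny corpus to a temporary file for demonstration purposes.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("The quick brown fox\n\njumps over the lazy dog\n")
    corpus_path = tmp.name

sentences = FileSentences(corpus_path)
first_pass = list(sentences)
second_pass = list(sentences)  # works again because __iter__ reopens the file
print(first_pass)

os.remove(corpus_path)
```

An object like this could be passed as the training data of Word2Vec in place of an in-memory list. Gensim also ships a comparable helper, gensim.models.word2vec.LineSentence.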


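Developing a Word2Vec model with Gensim

Here are some useful parameters that Gensim's Word2Vec class accepts:

sentences: The data used to train the word embedding model. It can be a list of sentences, each given as a list of tokens (words), or, for a large corpus, a stream of data read from disk or the network. This example uses the Brown corpus, which is available through NLTK.
vector_size (named size before Gensim 4.0): The number of dimensions of the vector for each word in the vocabulary. The default is 100.
window: The maximum distance between the current word and its neighboring words. Words farther away from the current word than this are not treated as its context. The default is 5.
min_count: The minimum frequency a word must have to be included in the vocabulary. The default is 5.
epochs (named iter before Gensim 4.0): The number of iterations (epochs) over the dataset. The default is 5.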
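To make the window and min_count parameters more concrete, here is a simplified sketch of what they control. The function names and the toy corpus are hypothetical, and Gensim's actual implementation differs in details (for example, it randomly shrinks the window for each word), so this is only an illustration of the idea.

```python
from collections import Counter


def filter_rare_words(sentences, min_count):
    """Drop every word that occurs fewer than min_count times in the corpus."""
    counts = Counter(word for sent in sentences for word in sent)
    return [[w for w in sent if counts[w] >= min_count] for sent in sentences]


def context_pairs(sentence, window):
    """Pair each word with the neighbors at most `window` positions away."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


# Hypothetical toy corpus: "cat" and "dog" each occur only once.
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
filtered = filter_rare_words(toy_corpus, min_count=2)
print(filtered)

# With window=1, each word is paired only with its immediate neighbors.
pairs = context_pairs(["a", "b", "c", "d"], window=1)
print(pairs)
```

A larger window pairs each word with more distant neighbors, and a larger min_count removes more rare words from the vocabulary before training.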

Example: using Word2Vec in Python

import string

import nltk
from nltk.corpus import brown
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot

nltk.download("brown")

# Preprocess the data: lowercase every word and drop tokens
# that begin with a punctuation character
document = brown.sents()
data = []
for sent in document:
    new_sent = []
    for word in sent:
        new_word = word.lower()
        if new_word[0] not in string.punctuation:
            new_sent.append(new_word)
    if len(new_sent) > 0:
        data.append(new_sent)

# Create and train the Word2Vec model
# (Gensim 4.x parameter names; Gensim 3.x used size= and iter= instead)
model = Word2Vec(
    sentences=data,
    vector_size=50,
    window=10,
    epochs=20,
)

# Vector for the word "love"
print("Vector for love:")
print(model.wv["love"])
print()

# Find the most similar words (in Gensim 4.x this method lives on model.wv)
print("3 words similar to car")
words = model.wv.most_similar("car", topn=3)
for word in words:
    print(word)
print()

# Visualize selected word vectors in two dimensions using PCA
words = ["france", "germany", "india", "truck", "boat", "road", "teacher", "student"]

X = model.wv[words]
pca = PCA(n_components=2)
result = pca.fit_transform(X)

pyplot.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

Running the code produces output like the following:

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
Vector for love:
[ 2.576164   -0.2537464  -2.5507743   3.1892483  -1.8316503   2.6448352
 -0.06407754  0.5304831   0.04439827  0.45178193 -0.4788834  -1.2661372
  1.0238386   0.3144989  -2.3910248   2.303471   -2.861455   -1.988338
 -0.36665946 -0.32186085  0.17170368 -2.0292065  -0.9724318  -0.5792801
 -2.809848    2.4033384  -1.0886359   1.1814215  -0.9120702  -1.1175308
  1.1127514  -2.287549   -1.6190344   0.28058434 -3.0212548   1.9233572
  0.13773602  1.5269752  -1.8643662  -1.5568101  -0.33570558  1.4902842
  0.24851061 -1.6321756   0.02789219 -2.1180007  -1.5782264  -0.9047415
  1.7374605   2.1492126 ]

3 words similar to car
('boat', 0.7544293403625488)
('truck', 0.7183066606521606)
('block', 0.6936473250389099)

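In the visualization produced by the code above, words such as student and teacher point in one direction, countries such as India, Germany, and France point in another, and words such as road, boat, and truck point in yet another.

This shows that the Word2Vec model has learned embeddings that separate words according to their meaning.

Loading pre-trained models with Gensim

Gensim also gives access to a number of pre-trained models through its gensim.downloader module. The following code prints the names of the available models:

import gensim.downloader

for model_name in gensim.downloader.info()['models'].keys():
    print(model_name)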

fasttext-wiki-news-subwords-300
conceptnet-numberbatch-17-06-300
word2vec-ruscorpora-300
word2vec-google-news-300
glove-wiki-gigaword-50
glove-wiki-gigaword-100
glove-wiki-gigaword-200
glove-wiki-gigaword-300
glove-twitter-25
glove-twitter-50
glove-twitter-100
glove-twitter-200
__testing_word2vec-matrix-synopsis

import gensim.downloader

google_news_vectors = gensim.downloader.load('word2vec-google-news-300')

# Finding the capital of Britain given the capital of France: (Paris - France) + Britain
print("Finding Capital of Britain: (Paris - France) + Britain")
capital = google_news_vectors.most_similar(["Paris", "Britain"], ["France"], topn=1)
print(capital)
print()

# Finding the capital of India given the capital of Germany: (Berlin - Germany) + India
print("Finding Capital of India: (Berlin - Germany) + India")
capital = google_news_vectors.most_similar(["Berlin", "India"], ["Germany"], topn=1)
print(capital)
print()

# Finding words similar to BMW
print("5 similar words to BMW:")
words = google_news_vectors.most_similar("BMW", topn=5)
for word in words:
    print(word)
print()

# Finding words similar to beautiful
print("3 similar words to beautiful:")
words = google_news_vectors.most_similar("beautiful", topn=3)
for word in words:
    print(word)
print()

# Finding the cosine similarity between fight and battle
cosine = google_news_vectors.similarity("fight", "battle")
print("Cosine similarity between fight and battle:", cosine)
print()

# Finding the cosine similarity between fight and love
cosine = google_news_vectors.similarity("fight", "love")
print("Cosine similarity between fight and love:", cosine)

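The code above loads the word2vec-google-news-300 model and uses it for tasks such as finding capital-country relationships, retrieving similar words, and computing cosine similarity. It prints the following: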

[==================================================] 100.0% 1662.8/1662.8MB downloaded
Finding Capital of Britain: (Paris - France) + Britain
[('London', 0.7541897892951965)]
 
Finding Capital of India: (Berlin - Germany) + India
[('Delhi', 0.72683185338974)]
 
5 similar words to BMW:
('Audi', 0.7932199239730835)
('Mercedes_Benz', 0.7683467864990234)
('Porsche', 0.727219820022583)
('Mercedes', 0.7078384757041931)
('Volkswagen', 0.695941150188446)
 
3 similar words to beautiful:
('gorgeous', 0.8353004455566406)
('lovely', 0.810693621635437)
('stunningly_beautiful', 0.7329413890838623)
 
Cosine similarity between fight and battle: 0.7021284
 
Cosine similarity between fight and love: 0.13506128
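The analogy queries above, such as (Paris - France) + Britain, rest on simple vector arithmetic followed by a cosine-similarity ranking. The following sketch reproduces the idea with the standard library only. The 2-dimensional vectors and the analogy function are hypothetical and made up for illustration; the sketch roughly mirrors how such a query can be computed (normalize, add and subtract, rank the remaining words) and is not Gensim's actual code.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        for k, x in enumerate(normalize(vectors[word])):
            query[k] += x
    for word in negative:
        for k, x in enumerate(normalize(vectors[word])):
            query[k] -= x
    query = normalize(query)

    # The query words themselves are excluded from the candidates.
    excluded = set(positive) | set(negative)
    scored = [
        (word, sum(a * b for a, b in zip(query, normalize(vec))))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy embeddings: axis 0 encodes the nation,
# axis 1 encodes country (+1) versus capital city (-1).
toy_vectors = {
    "France": [1.0, 1.0],
    "Paris": [1.0, -1.0],
    "Britain": [-1.0, 1.0],
    "London": [-1.0, -1.0],
    "Tokyo": [0.1, -1.0],
}

result = analogy(toy_vectors, positive=["Paris", "Britain"], negative=["France"])
print(result)
```

With these toy vectors, subtracting France from Paris removes the "French" component and keeps the "capital city" component, and adding Britain moves the result toward London.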


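Summary

You should now understand Word2Vec and how to build your own model that converts words into vectors.

Word2Vec is widely used in many applications, such as document similarity and retrieval, and machine translation.

You can now use it in your own projects as well.

Thank you for reading.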