Install nltk
$ pip install nltk
Download the wordnet corpus from the Python interpreter
$ python
Python 2.7.5 (default, Jul 19 2013, 19:37:30)
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
On a Mac a GUI window opens, so just follow it to download the data.
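If you'd rather skip the GUI, nltk.download() also accepts a package name, so you can grab just the data this post ends up using (a small sketch; 'wordnet', 'punkt' and 'stopwords' are the package ids for the corpora used below):
>>> import nltk
>>> nltk.download('wordnet')    # for WordNetLemmatizer
>>> nltk.download('punkt')      # for sent_tokenize / word_tokenize
>>> nltk.download('stopwords')  # for nltk.corpus.stopwords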
I only noticed after writing this that the examples below casually use see(). Either use dir() instead, skip those parts, or just install see, since it's handy.
$ pip install see
It's convenient to put
from see import see
in your ~/.pythonstartup
dotfiles/.pythonstartup at master · haya14busa/dotfiles
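For reference, Python only executes that file in interactive sessions, and only when the PYTHONSTARTUP environment variable points at it; a minimal sketch of the setup (the path is just the one used above):
# ~/.pythonstartup -- run at the start of interactive sessions,
# but only if your shell exports PYTHONSTARTUP=~/.pythonstartup
from see import see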
Stemming and Lemmatisation
- Stemming – Wikipedia, the free encyclopedia
- i.e. reducing a word to its stem
- Lemmatisation – Wikipedia, the free encyclopedia
- i.e. reducing a word to its headword (lemma)
Before stemming or lemmatising, normalise upper/lower case. (Stemming seems to work even on mixed-case input, but lemmatisation does not.)
>>> print 'Python'.lower()
python
Stemming
>>> from nltk import stem
>>> see(stem)
help() .ISRIStemmer() .LancasterStemmer()
.PorterStemmer() .RSLPStemmer() .RegexpStemmer()
.SnowballStemmer() .StemmerI() .WordNetLemmatizer() .api
.isri .lancaster .porter
.regexp .rslp .snowball
.wordnet
>>> stemmer = stem.PorterStemmer()
>>> stemmer.stem('dialogue')
'dialogu'
>>> stemmer2 = stem.LancasterStemmer()
>>> stemmer2.stem('dialogue')
'dialog'
PorterStemmer and LancasterStemmer apparently implement different algorithms, with Lancaster being the more aggressive of the two.
Porter: Most commonly used stemmer without a doubt, also one of the most gentle stemmers. One of the few stemmers that actually has Java support which is a plus, though it is also the most computationally intensive of the algorithms (granted, not by a very significant margin). It is also the oldest stemming algorithm by a large margin.
Lancaster: Very aggressive stemming algorithm, sometimes to a fault. With porter and snowball, the stemmed representations are usually fairly intuitive to a reader, not so with Lancaster, as many shorter words will become totally obfuscated. The fastest algorithm here, and will reduce your working set of words hugely, but if you want more distinction, not the tool you would want.
Personally it would hurt if 'dialogue' and 'dialog' ended up with different stems, so I'll try Lancaster first and switch to Porter if it makes too many mistakes.
There are others such as Snowball, but I'll skip them here.
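If you want to compare them directly, here is a quick sketch that runs the same words through Porter, Lancaster and Snowball (SnowballStemmer takes a language name; the word list is just an example, and output is omitted, though from the session above Porter should give 'dialogu' and Lancaster 'dialog' for the first word):
>>> from nltk import stem
>>> stemmers = [stem.PorterStemmer(), stem.LancasterStemmer(),
...             stem.SnowballStemmer('english')]
>>> for word in ['dialogue', 'dialogs', 'maximum', 'running']:
...     print word, [s.stem(word) for s in stemmers]
...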
Lemmatisation
>>> from nltk import stem
>>> lemmatizer = stem.WordNetLemmatizer()
>>> lemmatizer.lemmatize('dialogs')
'dialog'
>>> lemmatizer.lemmatize('dialogues')
'dialogue'
>>> lemmatizer.lemmatize('cookings')
'cooking'
>>> lemmatizer.lemmatize('cooking', pos='v')
'cook'
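As noted earlier, the WordNet lemmatiser wants lowercase input, and it defaults to treating words as nouns (pos='n'), so in practice it helps to lowercase first and pass a POS tag when you have one. A minimal sketch (the 'cook' result matches the session above):
>>> lemmatizer.lemmatize('cooking')             # defaults to pos='n'
'cooking'
>>> lemmatizer.lemmatize('Cooking'.lower(), pos='v')
'cook'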
Tokenization
Tokenization – Wikipedia, the free encyclopedia
>>> from nltk import tokenize
>>> see(tokenize)
help() .BlanklineTokenizer() .LineTokenizer()
.PunktSentenceTokenizer() .PunktWordTokenizer()
.RegexpTokenizer() .SExprTokenizer() .SpaceTokenizer()
.TabTokenizer() .TreebankWordTokenizer() .WhitespaceTokenizer()
.WordPunctTokenizer() .api .blankline_tokenize()
.line_tokenize() .load() .punkt
.regexp .regexp_tokenize() .sent_tokenize()
.sexpr .sexpr_tokenize() .simple
.treebank .util .word_tokenize()
.wordpunct_tokenize()
>>> aio1 = 'He grinned and said, "I make lots of money. On weekdays I receive an average of 50 orders a day from all over the globe via the Internet."'
Sentence Tokenization
>>> tokenize.sent_tokenize(aio1)
['He grinned and said, "I make lots of money.', 'On weekdays I receive an average of 50 orders a day from all over the globe via the Internet.', '"']
Word Tokenization
>>> tokenize.word_tokenize(aio1)
['He', 'grinned', 'and', 'said', ',', '``', 'I', 'make', 'lots', 'of', 'money.', 'On', 'weekdays', 'I', 'receive', 'an', 'average', 'of', '50', 'orders', 'a', 'day', 'from', 'all', 'over', 'the', 'globe', 'via', 'the', 'Internet', '.', "''"]
>>> tokenize.wordpunct_tokenize(aio1)
['He', 'grinned', 'and', 'said', ',', '"', 'I', 'make', 'lots', 'of', 'money', '.', 'On', 'weekdays', 'I', 'receive', 'an', 'average', 'of', '50', 'orders', 'a', 'day', 'from', 'all', 'over', 'the', 'globe', 'via', 'the', 'Internet', '."']
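Sentence and word tokenisation usually go together: split the text into sentences first, then tokenise each sentence. A small sketch using the same sample string (output omitted):
>>> for sentence in tokenize.sent_tokenize(aio1):
...     print tokenize.word_tokenize(sentence)
...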
Removing Stopwords
Stop words – Wikipedia, the free encyclopedia
>>> from nltk.corpus import stopwords
>>> stopset = set(stopwords.words('english'))
>>> stopset
set(['all', 'just', 'being', 'over', 'both', 'through', 'yourselves', 'its', 'before', 'herself', 'had', 'should', 'to', 'only', 'under', 'ours', 'has', 'do', 'them', 'his', 'very', 'they', 'not', 'during', 'now', 'him', 'nor', 'did', 'this', 'she', 'each', 'further', 'where', 'few', 'because', 'doing', 'some', 'are', 'our', 'ourselves', 'out', 'what', 'for', 'while', 'does', 'above', 'between', 't', 'be', 'we', 'who', 'were', 'here', 'hers', 'by', 'on', 'about', 'of', 'against', 's', 'or', 'own', 'into', 'yourself', 'down', 'your', 'from', 'her', 'their', 'there', 'been', 'whom', 'too', 'themselves', 'was', 'until', 'more', 'himself', 'that', 'but', 'don', 'with', 'than', 'those', 'he', 'me', 'myself', 'these', 'up', 'will', 'below', 'can', 'theirs', 'my', 'and', 'then', 'is', 'am', 'it', 'an', 'as', 'itself', 'at', 'have', 'in', 'any', 'if', 'again', 'no', 'when', 'same', 'how', 'other', 'which', 'you', 'after', 'most', 'such', 'why', 'a', 'off', 'i', 'yours', 'so', 'the', 'having', 'once'])
>>> aio1words = tokenize.wordpunct_tokenize(aio1)
>>> aio1words
['He', 'grinned', 'and', 'said', ',', '"', 'I', 'make', 'lots', 'of', 'money', '.', 'On', 'weekdays', 'I', 'receive', 'an', 'average', 'of', '50', 'orders', 'a', 'day', 'from', 'all', 'over', 'the', 'globe', 'via', 'the', 'Internet', '."']
>>> for word in aio1words:
... if len(word) < 3 or word in stopset:
... continue
... print word
...
grinned
said
make
lots
money
weekdays
receive
average
orders
day
globe
via
Internet
With filter
>>> print filter(lambda w: len(w) > 2 and w not in stopset, aio1words)
['grinned', 'said', 'make', 'lots', 'money', 'weekdays', 'receive', 'average', 'orders', 'day', 'globe', 'via', 'Internet']
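Putting the pieces of this post together, a rough end-to-end sketch: lowercase, tokenise, drop stopwords and short tokens, then stem (Lancaster, just following the plan above; output omitted):
>>> from nltk import stem, tokenize
>>> from nltk.corpus import stopwords
>>> stopset = set(stopwords.words('english'))
>>> stemmer = stem.LancasterStemmer()
>>> words = tokenize.wordpunct_tokenize(aio1.lower())
>>> print [stemmer.stem(w) for w in words if len(w) > 2 and w not in stopset]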