# Natural Language Toolkit: Taggers
#
# Copyright (C) 2001-2023 NLTK Project
# Author: Edward Loper
#         Steven Bird (minor additions)
# URL:
# For license information, see LICENSE.TXT
"""
NLTK Taggers

This package contains classes and interfaces for part-of-speech
tagging, or simply "tagging".

A "tag" is a case-sensitive string that specifies some property of a token,
such as its part of speech.  Tagged tokens are encoded as tuples
``(token, tag)``.  For example, the following tagged token combines
the word ``'fly'`` with a noun part of speech tag (``'NN'``):

    >>> tagged_tok = ('fly', 'NN')

An off-the-shelf tagger is available for English. It uses the Penn Treebank tagset:

    >>> from nltk import pos_tag, word_tokenize
    >>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) # doctest: +NORMALIZE_WHITESPACE

A Russian tagger is also available if you specify lang="rus". It uses
the Russian National Corpus tagset:

    >>> pos_tag(word_tokenize("Илья оторопел и дважды перечитал бумажку."), lang='rus')    # doctest: +SKIP

This package defines several taggers, which take a list of tokens,
assign a tag to each one, and return the resulting list of tagged tokens.
Most of the taggers are built automatically based on a training corpus.
For example, the unigram tagger tags each word *w* by checking what
the most frequent tag for *w* was in a training corpus:

    >>> from nltk.corpus import brown
    >>> from nltk.tag import UnigramTagger
    >>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
    >>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
    >>> for word, tag in tagger.tag(sent):
    ...     print(word, '->', tag)
    Mitchell -> NP
    decried -> None
    the -> AT
    high -> JJ
    rate -> NN
    of -> IN
    unemployment -> None

Note that words that the tagger has not seen during training receive a tag
of ``None``.

We evaluate a tagger on data that was not seen during training:

    >>> round(tagger.accuracy(brown.tagged_sents(categories='news')[500:600]), 3)
    0.735

For more information, please consult chapter 5 of the NLTK Book.
"""
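The unigram idea described in the docstring (pick each word's most frequent training tag, give unseen words ``None``, then score on held-out sentences) can be sketched in plain Python without NLTK or any corpus download. This is an illustration only, not NLTK's implementation: the helper names `train_unigram`, `tag_tokens`, and `accuracy`, and the toy corpus, are all hypothetical.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it carried most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_tokens(model, tokens):
    """Return (token, tag) pairs; tokens unseen in training get None."""
    return [(tok, model.get(tok)) for tok in tokens]


def accuracy(model, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    total = correct = 0
    for sent in gold_sents:
        predicted = tag_tokens(model, [word for word, _ in sent])
        for (_, guess), (_, gold) in zip(predicted, sent):
            total += 1
            correct += guess == gold
    return correct / total if total else 0.0


# Hypothetical toy corpus (not taken from Brown): 'fly' is NN twice, VB once.
train = [
    [("the", "AT"), ("fly", "NN"), ("can", "MD"), ("fly", "VB")],
    [("the", "AT"), ("fly", "NN"), ("landed", "VBD")],
]
# Held-out sentence: 'escaped' never appears in the training data.
held_out = [[("the", "AT"), ("fly", "NN"), ("escaped", "VBD")]]

model = train_unigram(train)
tagged = tag_tokens(model, ["the", "fly", "escaped"])
score = accuracy(model, held_out)
print(tagged, round(score, 3))
```

Keeping the held-out sentence separate from the training sentences mirrors the evaluation in the docstring: scoring on sentences the model was built from would overstate accuracy, because no token could come back as ``None``.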