public final class KoreanTokenizer extends Tokenizer
This tokenizer sets a number of additional attributes:
PartOfSpeechAttribute containing part-of-speech.
ReadingAttribute containing reading.
This tokenizer uses a rolling Viterbi search to find the least cost segmentation (path) of the incoming characters.
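
The least-cost search described above can be illustrated with a small, self-contained sketch. It is not Lucene's implementation: the toy dictionary `DICT`, the `UNKNOWN_COST` penalty, and the class name `ViterbiSketch` are all hypothetical. It also ignores the connection (transition) costs between adjacent tokens and processes the whole input at once instead of rolling over it incrementally. It only shows the core idea of keeping the cheapest path to every position and backtracing from the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class ViterbiSketch {
    // Hypothetical toy dictionary: surface form -> word cost (lower is better).
    static final Map<String, Integer> DICT = Map.of(
        "a", 5, "b", 5, "c", 5, "d", 5, "ab", 6, "cd", 6, "abc", 12);

    // Hypothetical penalty for a single character that is not in DICT.
    static final int UNKNOWN_COST = 20;

    /** Returns the least-cost segmentation of text under the toy costs above. */
    static List<String> segment(String text) {
        int n = text.length();
        int[] best = new int[n + 1]; // best[i] = least cost of segmenting text[0, i)
        int[] back = new int[n + 1]; // back[i] = start offset of the last token on that path
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;

        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                Integer cost = DICT.get(text.substring(start, end));
                if (cost == null) {
                    // Unknown text is only allowed one character at a time.
                    if (end - start != 1) {
                        continue;
                    }
                    cost = UNKNOWN_COST;
                }
                // best[start] is always finite because single characters are always allowed.
                int total = best[start] + cost;
                if (total < best[end]) {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }

        // Backtrace from the end of the input to recover the cheapest path.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = back[end]) {
            tokens.add(text.substring(back[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(segment("abcd")); // prints [ab, cd]
    }
}
```
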
| Modifier and Type | Class and Description |
|---|---|
| static class | KoreanTokenizer.DecompoundMode - Decompound mode: this determines how the tokenizer handles POS.Type.COMPOUND, POS.Type.INFLECT and POS.Type.PREANALYSIS tokens. |
| static class | KoreanTokenizer.Type - Token type reflecting the original source of this token |
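
As a rough illustration of what a decompound mode decides, the sketch below models three behaviours: keep the compound as is, replace it with its parts, or emit both. It assumes that DecompoundMode offers NONE, DISCARD and MIXED with these meanings, which should be checked against the DecompoundMode documentation. The `Mode` enum, the `emit` method and the class name are hypothetical stand-ins and do not call Lucene.

```java
import java.util.ArrayList;
import java.util.List;

final class DecompoundSketch {
    // Hypothetical stand-in for KoreanTokenizer.DecompoundMode (assumed values).
    enum Mode { NONE, DISCARD, MIXED }

    /** Returns the tokens emitted for one compound, given its already-known parts. */
    static List<String> emit(String compound, List<String> parts, Mode mode) {
        List<String> out = new ArrayList<>();
        if (mode != Mode.DISCARD) {
            out.add(compound);   // NONE and MIXED keep the original compound
        }
        if (mode != Mode.NONE) {
            out.addAll(parts);   // DISCARD and MIXED emit the decomposed parts
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> parts = List.of("가곡", "역");
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> " + emit("가곡역", parts, mode));
        }
    }
}
```
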

Nested classes inherited from class AttributeSource: AttributeSource.State

| Modifier and Type | Field and Description |
|---|---|
| static KoreanTokenizer.DecompoundMode | DEFAULT_DECOMPOUND - Default mode for the decompound of tokens (KoreanTokenizer.DecompoundMode.DISCARD). |

Fields inherited from class TokenStream: DEFAULT_TOKEN_ATTRIBUTE_FACTORY

| Constructor and Description |
|---|
| KoreanTokenizer() - Creates a new KoreanTokenizer with default parameters. |
| KoreanTokenizer(AttributeFactory factory, TokenInfoDictionary systemDictionary, UnknownDictionary unkDictionary, ConnectionCosts connectionCosts, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation) - Create a new KoreanTokenizer supplying a custom system dictionary and unknown dictionary. |
| KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams) - Create a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene. |
| KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation) - Create a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene. |

| Modifier and Type | Method and Description |
|---|---|
| void | close() |
| void | end() |
| boolean | incrementToken() |
| void | reset() |
| void | setGraphvizFormatter(GraphvizFormatter dotOut) - Expert: set this to produce graphviz (dot) output of the Viterbi lattice |

Methods inherited from class Tokenizer: correctOffset, setReader

Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

public static final KoreanTokenizer.DecompoundMode DEFAULT_DECOMPOUND
Default mode for the decompound of tokens (KoreanTokenizer.DecompoundMode.DISCARD).

public KoreanTokenizer()
Creates a new KoreanTokenizer with default parameters.
Uses the default AttributeFactory.

public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams)
Create a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
Parameters:
factory - the AttributeFactory to use
userDictionary - Optional: if non-null, user dictionary.
mode - Decompound mode.
outputUnknownUnigrams - if true outputs unigrams for unknown words.

public KoreanTokenizer(AttributeFactory factory, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)
Create a new KoreanTokenizer using the system and unknown dictionaries shipped with Lucene.
Parameters:
factory - the AttributeFactory to use
userDictionary - Optional: if non-null, user dictionary.
mode - Decompound mode.
outputUnknownUnigrams - if true outputs unigrams for unknown words.
discardPunctuation - true if punctuation tokens should be dropped from the output.

public KoreanTokenizer(AttributeFactory factory, TokenInfoDictionary systemDictionary, UnknownDictionary unkDictionary, ConnectionCosts connectionCosts, UserDictionary userDictionary, KoreanTokenizer.DecompoundMode mode, boolean outputUnknownUnigrams, boolean discardPunctuation)
Create a new KoreanTokenizer supplying a custom system dictionary and unknown dictionary.
This constructor provides an entry point for users that want to construct custom language models
that can be used as input to DictionaryBuilder.
Parameters:
factory - the AttributeFactory to use
systemDictionary - a custom known token dictionary
unkDictionary - a custom unknown token dictionary
connectionCosts - custom token transition costs
userDictionary - Optional: if non-null, user dictionary.
mode - Decompound mode.
outputUnknownUnigrams - if true outputs unigrams for unknown words.
discardPunctuation - true if punctuation tokens should be dropped from the output.

public void setGraphvizFormatter(GraphvizFormatter dotOut)
Expert: set this to produce graphviz (dot) output of the Viterbi lattice

public void close() throws IOException
Specified by: close in interface Closeable
Specified by: close in interface AutoCloseable
Overrides: close in class Tokenizer
Throws: IOException

public void reset() throws IOException
Overrides: reset in class Tokenizer
Throws: IOException

public void end() throws IOException
Overrides: end in class TokenStream
Throws: IOException

public boolean incrementToken() throws IOException
Specified by: incrementToken in class TokenStream
Throws: IOException

Copyright © 2000-2024 Apache Software Foundation. All Rights Reserved.