Package org.languagetool.tokenizers
Class WordTokenizer
java.lang.Object
org.languagetool.tokenizers.WordTokenizer
- All Implemented Interfaces:
Tokenizer
Tokenizes a sentence into words. Punctuation and whitespace get their own tokens.
The tokenizer is a fairly simple character-based one, though it recognizes
URLs and puts each one in a single token if it is fully specified, including
a protocol (like http://foobar.org).
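The character-based splitting described above can be illustrated with a minimal sketch. It is not the class's actual implementation: the class name `CharTokenizerSketch` and the delimiter set are hypothetical, chosen only to show how punctuation and whitespace end up as their own tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class CharTokenizerSketch {

  // Hypothetical delimiter set, for illustration only. The real set is the one
  // returned by getTokenizingCharacters() and is not reproduced here.
  private static final String DELIMITERS = " \t\n\r.,;:!?()[]\"'/";

  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    // returnDelims = true: every delimiter character is returned as its own token
    StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
    while (st.hasMoreTokens()) {
      tokens.add(st.nextToken());
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(tokenize("Hello, world!"));
    // A URL is split into many pieces by a purely character-based pass,
    // which is why a separate joining step is needed.
    System.out.println(tokenize("http://foobar.org"));
  }
}
```

Under these assumptions a URL comes out of the first pass as seven separate tokens, so recognizing it as a single token has to happen in a later step.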
Field Summary
Fields -
Constructor Summary
Constructors -
Method Summary
Modifier and Type / Method / Description
Get the protocols that the tokenizer knows about.
static boolean
private boolean isProtocol(String token)
static boolean
joinEMails(List<String> list)
joinEMailsAndUrls(List<String> list)
private boolean
private boolean urlStartsAt(int i, List<String> l)
-
Field Details
-
PROTOCOLS
-
URL_CHARS
-
DOMAIN_CHARS
-
NO_PROTOCOL_URL
-
E_MAIL
-
TOKENIZING_CHARACTERS
-
Constructor Details
-
WordTokenizer
public WordTokenizer()
-
-
Method Details
-
getProtocols
Get the protocols that the tokenizer knows about.
- Returns:
- currently http, https, and ftp
- Since:
- 2.1
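To show how a protocol list like this can drive URL recognition, here is a hedged sketch of joining already split tokens back into one URL token. The method names mirror those listed on this page, but the bodies are assumptions written for illustration, not the library's code. In particular, ending a URL at the first whitespace token is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class UrlJoinSketch {

  // The protocols named in the documentation above.
  private static final List<String> PROTOCOLS = Arrays.asList("http", "https", "ftp");

  static boolean isProtocol(String token) {
    return PROTOCOLS.contains(token);
  }

  // Assumed rule: a URL starts where the tokens are: protocol, ":", "/", "/"
  static boolean urlStartsAt(int i, List<String> tokens) {
    return i + 3 < tokens.size()
        && isProtocol(tokens.get(i))
        && tokens.get(i + 1).equals(":")
        && tokens.get(i + 2).equals("/")
        && tokens.get(i + 3).equals("/");
  }

  static List<String> joinUrls(List<String> tokens) {
    List<String> result = new ArrayList<>();
    int i = 0;
    while (i < tokens.size()) {
      if (urlStartsAt(i, tokens)) {
        StringBuilder url = new StringBuilder();
        // Simplification: the URL runs until the next whitespace token,
        // so trailing punctuation is absorbed into the URL as well.
        while (i < tokens.size() && !tokens.get(i).trim().isEmpty()) {
          url.append(tokens.get(i));
          i++;
        }
        result.add(url.toString());
      } else {
        result.add(tokens.get(i));
        i++;
      }
    }
    return result;
  }
}
```

A token sequence without a leading protocol, such as `foobar`, `.`, `org`, is left untouched by this sketch, which matches the documented requirement that a URL be fully specified to be kept in one token.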
-
isUrl
- Since:
- 3.0
-
isEMail
- Since:
- 3.5
-
tokenize
-
getTokenizingCharacters
- Returns:
- The string containing the characters used by the tokenizer to tokenize words.
- Since:
- 2.5
-
joinEMailsAndUrls
-
joinEMails
- Since:
- 3.5
-
joinUrls
-
urlStartsAt
-
isProtocol
-
urlEndsAt
-