Class WordTokenizer

java.lang.Object
org.languagetool.tokenizers.WordTokenizer
All Implemented Interfaces:
Tokenizer

public class WordTokenizer extends Object implements Tokenizer
Tokenizes a sentence into words. Punctuation and whitespace get their own tokens. The tokenizer is a fairly simple character-based one, but it recognizes URLs and keeps each one in a single token, provided the URL is fully specified, including a protocol (for example http://foobar.org).
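The character-based splitting described above can be illustrated with a minimal sketch. This is not the LanguageTool implementation: the class name SimpleWordTokenizerSketch and its DELIMITERS string are hypothetical, and the delimiter set is a small stand-in for the real TOKENIZING_CHARACTERS. It only shows the general idea that every punctuation or whitespace character becomes a token of its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class SimpleWordTokenizerSketch {

    // Hypothetical delimiter set, far smaller than the real TOKENIZING_CHARACTERS.
    private static final String DELIMITERS = " \t\n.,;:!?()\"'/";

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true: each delimiter character is returned as a
        // separate one-character token instead of being discarded.
        StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [Hello, ,,  , world, !]
        // i.e. the five tokens "Hello", ",", " ", "world", "!"
        System.out.println(tokenize("Hello, world!"));
    }
}
```

With a splitter like this, a URL such as http://foobar.org is broken at ":", "/" and ".", which is why a separate joining step is needed to restore it as one token.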
  • Field Details

    • PROTOCOLS

      private static final List<String> PROTOCOLS
    • URL_CHARS

      private static final Pattern URL_CHARS
    • DOMAIN_CHARS

      private static final Pattern DOMAIN_CHARS
    • NO_PROTOCOL_URL

      private static final Pattern NO_PROTOCOL_URL
    • E_MAIL

      private static final Pattern E_MAIL
    • TOKENIZING_CHARACTERS

      private static final String TOKENIZING_CHARACTERS
  • Constructor Details

    • WordTokenizer

      public WordTokenizer()
  • Method Details

    • getProtocols

      public static List<String> getProtocols()
      Get the protocols that the tokenizer knows about.
      Returns:
      currently http, https, and ftp
      Since:
      2.1
    • isUrl

      public static boolean isUrl(String token)
      Since:
      3.0
    • isEMail

      public static boolean isEMail(String token)
      Since:
      3.5
    • tokenize

      public List<String> tokenize(String text)
      Specified by:
      tokenize in interface Tokenizer
    • getTokenizingCharacters

      public String getTokenizingCharacters()
      Returns:
      The string of characters that the tokenizer treats as token boundaries when splitting text into words.
      Since:
      2.5
    • joinEMailsAndUrls

      protected List<String> joinEMailsAndUrls(List<String> list)
    • joinEMails

      protected List<String> joinEMails(List<String> list)
      Since:
      3.5
    • joinUrls

      protected List<String> joinUrls(List<String> l)
    • urlStartsAt

      private boolean urlStartsAt(int i, List<String> l)
    • isProtocol

      private boolean isProtocol(String token)
    • urlEndsAt

      private boolean urlEndsAt(int i, List<String> l, String urlQuote)
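
The URL handling mentioned in the class description (a fully specified URL with a protocol ends up in one token) can be sketched as a post-processing pass over an already split token list. The code below is an assumption-laden illustration, not the actual joinUrls logic: UrlJoinSketch, looksLikeUrlStart and its joinUrls are hypothetical names, the protocol list is taken from the getProtocols() documentation, and the URL is simply assumed to end at the next whitespace token, so trailing punctuation is not stripped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UrlJoinSketch {

    // Protocols listed in the getProtocols() documentation.
    private static final List<String> PROTOCOLS = Arrays.asList("http", "https", "ftp");

    // Hypothetical helper: true if tokens i..i+3 are <protocol> ":" "/" "/"
    // and are followed by at least one non-whitespace token.
    public static boolean looksLikeUrlStart(int i, List<String> tokens) {
        return i + 4 < tokens.size()
            && PROTOCOLS.contains(tokens.get(i))
            && tokens.get(i + 1).equals(":")
            && tokens.get(i + 2).equals("/")
            && tokens.get(i + 3).equals("/")
            && !tokens.get(i + 4).trim().isEmpty();
    }

    // Simplified joining: once a URL start is found, concatenate tokens
    // until the next whitespace token or the end of the list.
    public static List<String> joinUrls(List<String> tokens) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (looksLikeUrlStart(i, tokens)) {
                StringBuilder url = new StringBuilder();
                while (i < tokens.size() && !tokens.get(i).trim().isEmpty()) {
                    url.append(tokens.get(i));
                    i++;
                }
                result.add(url.toString());
            } else {
                result.add(tokens.get(i));
                i++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> split = Arrays.asList(
            "See", " ", "http", ":", "/", "/", "foobar", ".", "org", " ", "now");
        // The six URL fragments are merged back into "http://foobar.org".
        System.out.println(joinUrls(split));
    }
}
```

A token list without the protocol, ":", "/", "/" sequence passes through unchanged, which matches the documented restriction that only URLs with an explicit protocol are kept together.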