Tokenizers

Word, Regexp, and Treebank tokenizers are provided for breaking text up into arrays of tokens:

var natural = require('natural');
var tokenizer = new natural.WordTokenizer();
console.log(tokenizer.tokenize("your dog has fleas."));
// [ 'your', 'dog', 'has', 'fleas' ]

The other tokenizers follow a similar pattern:

tokenizer = new natural.TreebankWordTokenizer();
console.log(tokenizer.tokenize("my dog hasn't any fleas."));
// [ 'my', 'dog', 'has', 'n\'t', 'any', 'fleas', '.' ]

tokenizer = new natural.RegexpTokenizer({pattern: /\-/});
console.log(tokenizer.tokenize("flea-dog"));
// [ 'flea', 'dog' ]
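
A regular expression can describe either the gaps between tokens or the tokens themselves. In the example above the pattern `/\-/` acts as a gap. The sketch below shows both readings in plain JavaScript; `regexpTokenize` and its `gaps` flag are hypothetical names used for illustration only and are not taken from the library's source.

```javascript
// Sketch only: plain JavaScript, not the natural library's implementation.
// `regexpTokenize` and its `gaps` flag are hypothetical names for this example.
function regexpTokenize(text, pattern, gaps) {
  if (gaps) {
    // The pattern describes separators: split on it and drop empty pieces.
    return text.split(pattern).filter(function (piece) {
      return piece.length > 0;
    });
  }
  // The pattern describes tokens: collect every match (needs the global flag).
  var flags = pattern.flags.indexOf('g') === -1 ? pattern.flags + 'g' : pattern.flags;
  var globalPattern = new RegExp(pattern.source, flags);
  return text.match(globalPattern) || [];
}

console.log(regexpTokenize('flea-dog', /-/, true));
// [ 'flea', 'dog' ]
console.log(regexpTokenize('flea-dog', /[a-z]+/, false));
// [ 'flea', 'dog' ]
```

Both calls give the same tokens here; which reading is more convenient depends on whether the separators or the words are easier to describe.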

tokenizer = new natural.WordPunctTokenizer();
console.log(tokenizer.tokenize("my dog hasn't any fleas."));
// [ 'my',  'dog',  'hasn',  '\'',  't',  'any',  'fleas',  '.' ]

tokenizer = new natural.OrthographyTokenizer({language: "fi"});
console.log(tokenizer.tokenize("Mikä sinun nimesi on?"));
// [ 'Mikä', 'sinun', 'nimesi', 'on' ]

The sentence tokenizer splits a text into sentences based on punctuation. It uses the following four characters: period, question mark, exclamation mark, and ellipsis. Furthermore:

  • Multiple punctuation characters are treated as one; these may be mixed with spaces.
  • It recognizes abbreviations, which are passed to it as an array argument. Case is ignored when matching abbreviations.
  • It gracefully handles decimal points in values and periods in URIs and email addresses.
  • Quotation marks are left in place: the opening mark stays at the beginning of its sentence, and the closing mark stays at the end of the last sentence of the quotation.

The tokenizer's algorithm works in three steps. First, every token containing a punctuation character that does not mark the end of a sentence is replaced by a placeholder. Then the text is split into sentences. Finally, the placeholders are replaced by the original tokens.

const abbreviations = ['i.e.', 'e.g.', 'Dr.'];
tokenizer = new natural.SentenceTokenizer(abbreviations);
console.log(tokenizer.tokenize("This is a sentence. This is another sentence."));
// ["This is a sentence.", "This is another sentence."]
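
The placeholder approach described above can be sketched in plain JavaScript. This is a simplified illustration under stated assumptions, not the library's code: `splitSentences` is a hypothetical helper, it protects only abbreviations and decimal numbers, and it ignores URIs, email addresses, and quotation marks.

```javascript
// Sketch only: illustrates the placeholder idea, not the natural library's code.
// `splitSentences` is a hypothetical helper with deliberately limited coverage.
function splitSentences(text, abbreviations) {
  var protectedTokens = [];

  // Swap a matched token for an indexed placeholder that contains no punctuation.
  function protect(match) {
    protectedTokens.push(match);
    return '\u0000' + (protectedTokens.length - 1) + '\u0000';
  }

  // Step 1: hide abbreviations (case-insensitive) and decimal numbers.
  var masked = text;
  abbreviations.forEach(function (abbr) {
    var escaped = abbr.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    masked = masked.replace(new RegExp(escaped, 'gi'), protect);
  });
  masked = masked.replace(/\d+\.\d+/g, protect);

  // Step 2: split after each run of sentence-ending punctuation,
  // where the run may be mixed with whitespace.
  var parts = masked.match(/[^.?!…]+[.?!…\s]*/g) || [];

  // Step 3: put the original tokens back and trim each sentence.
  return parts.map(function (part) {
    return part.replace(/\u0000(\d+)\u0000/g, function (_, index) {
      return protectedTokens[Number(index)];
    }).trim();
  }).filter(function (sentence) {
    return sentence.length > 0;
  });
}

console.log(splitSentences(
  'Dr. Smith paid 3.50 dollars, i.e. too much. Did he mind?! No...',
  ['i.e.', 'e.g.', 'Dr.']
));
// [ 'Dr. Smith paid 3.50 dollars, i.e. too much.', 'Did he mind?!', 'No...' ]
```

The placeholders use the NUL character as a delimiter on the assumption that it never occurs in the input text; a real implementation would need a more robust marker and would also avoid matching abbreviations inside longer words.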

Overview of available tokenizers:

| Tokenizer | Language | Explanation |
| --- | --- | --- |
| WordTokenizer | Any | Splits on anything except alphabetic characters, digits and underscore |
| WordPunctTokenizer | Any | Splits on anything except alphabetic characters, digits, punctuation and underscore |
| SentenceTokenizer | Any | Breaks a string up into parts based on punctuation and quotation marks |
| CaseTokenizer | Any? | If the lower-case and upper-case forms of a character are the same, the character is assumed to be whitespace or something else (punctuation) |
| RegexpTokenizer | Any | Splits on a regular expression that defines either sequences of word characters or gap characters |
| OrthographyTokenizer | Finnish | Splits on anything except alphabetic characters, digits and underscore |
| TreebankWordTokenizer | Any | |
| AggressiveTokenizer | English | |
| AggressiveTokenizerFa | Farsi | |
| AggressiveTokenizerFr | French | |
| AggressiveTokenizerDe | German | |
| AggressiveTokenizerRu | Russian | |
| AggressiveTokenizerEs | Spanish | |
| AggressiveTokenizerIt | Italian | |
| AggressiveTokenizerPl | Polish | |
| AggressiveTokenizerPt | Portuguese | |
| AggressiveTokenizerNo | Norwegian | |
| AggressiveTokenizerSv | Swedish | |
| AggressiveTokenizerVi | Vietnamese | |
| AggressiveTokenizerId | Indonesian | |
| AggressiveTokenizerHi | Hindi | |
| AggressiveTokenizerUk | Ukrainian | |
| TokenizerJa | Japanese | |
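
The case-comparison rule listed for CaseTokenizer in the overview can be illustrated with a short sketch. It is not the library's implementation: `caseTokenize` is a hypothetical helper, and keeping digits (which have no case) as token characters is an assumption made for this example.

```javascript
// Sketch only: not the natural library's CaseTokenizer.
// `caseTokenize` is a hypothetical helper. Digits have no case, so this
// sketch keeps them explicitly; that choice is an assumption.
function caseTokenize(text) {
  var tokens = [];
  var current = '';
  Array.from(text).forEach(function (ch) {
    // A character whose lower and upper case differ is treated as a letter.
    var hasCase = ch.toLowerCase() !== ch.toUpperCase();
    var isDigit = ch >= '0' && ch <= '9';
    if (hasCase || isDigit) {
      current += ch;
    } else if (current.length > 0) {
      // Anything else (whitespace, punctuation) ends the current token.
      tokens.push(current);
      current = '';
    }
  });
  if (current.length > 0) {
    tokens.push(current);
  }
  return tokens;
}

console.log(caseTokenize('Mikä sinun nimesi on?'));
// [ 'Mikä', 'sinun', 'nimesi', 'on' ]
```

Because the test relies on case mapping rather than on an explicit alphabet, it works for accented letters such as `ä`, but it cannot separate words in scripts that have no upper and lower case.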