Tokenizers

Word, Regexp, and Treebank tokenizers are provided for breaking text up into arrays of tokens:

var natural = require('natural');
var tokenizer = new natural.WordTokenizer();
console.log(tokenizer.tokenize("your dog has fleas."));
// [ 'your', 'dog', 'has', 'fleas' ]

The other tokenizers follow a similar pattern:

tokenizer = new natural.TreebankWordTokenizer();
console.log(tokenizer.tokenize("my dog hasn't any fleas."));
// [ 'my', 'dog', 'has', 'n\'t', 'any', 'fleas', '.' ]

tokenizer = new natural.RegexpTokenizer({pattern: /\-/});
console.log(tokenizer.tokenize("flea-dog"));
// [ 'flea', 'dog' ]
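
A regular-expression tokenizer can read its pattern in two ways: as a description of the gaps between tokens, or as a description of the tokens themselves. The sketch below illustrates both readings in plain JavaScript. The helper `regexpTokenize` and its `gaps` argument are hypothetical names used for illustration only; this is not the library's implementation.

```javascript
// Hypothetical helper, not part of natural: tokenizes with a pattern that
// describes either the gaps between tokens or the tokens themselves.
function regexpTokenize(text, pattern, gaps) {
  if (gaps) {
    // The pattern matches separators: split on it and drop empty strings.
    return text.split(pattern).filter(function (t) { return t.length > 0; });
  }
  // The pattern matches tokens: collect every match (requires the g flag).
  return text.match(pattern) || [];
}

console.log(regexpTokenize("flea-dog", /-/, true));
// [ 'flea', 'dog' ]
console.log(regexpTokenize("flea-dog", /[a-z]+/g, false));
// [ 'flea', 'dog' ]
```

Both calls yield the same tokens here; the two readings differ once the text contains characters that are neither a gap nor part of a word.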

tokenizer = new natural.WordPunctTokenizer();
console.log(tokenizer.tokenize("my dog hasn't any fleas."));
// [ 'my',  'dog',  'hasn',  '\'',  't',  'any',  'fleas',  '.' ]

tokenizer = new natural.OrthographyTokenizer({language: "fi"});
console.log(tokenizer.tokenize("Mikä sinun nimesi on?"));
// [ 'Mikä', 'sinun', 'nimesi', 'on' ]

tokenizer = new natural.SentenceTokenizer();
console.log(tokenizer.tokenize("This is a sentence. This is another sentence"));
// ["This is a sentence.", "This is another sentence."]

In addition to the sentence tokenizer based on regular expressions (SentenceTokenizer), there is a sentence tokenizer based on parsing (SentenceTokenizerNew). It is built using PEGjs. It handles more cases and can be extended in a more structured way than regular expressions.
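
To give a feel for what regex-based sentence splitting involves, here is a minimal sketch in plain JavaScript. The function `splitSentences` and its pattern are assumptions made for illustration; they are not the pattern used by SentenceTokenizer. The second call shows the kind of input (an abbreviation) that a simple pattern splits too eagerly.

```javascript
// Hypothetical regex-based splitter; not the pattern used by SentenceTokenizer.
function splitSentences(text) {
  // A sentence is a run of non-terminators followed by one or more
  // terminators, or a trailing run that has no terminator at all.
  var parts = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [];
  return parts
    .map(function (s) { return s.trim(); })
    .filter(function (s) { return s.length > 0; });
}

console.log(splitSentences("This is a sentence. This is another sentence"));
// [ 'This is a sentence.', 'This is another sentence' ]
console.log(splitSentences("Dr. Smith has a dog. It has fleas."));
// [ 'Dr.', 'Smith has a dog.', 'It has fleas.' ]
```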

The sentence tokenizer can be adapted by editing the PEGjs grammar in ./lib/natural/tokenizers/pegjs_grammar_sentence_tokenizer.txt and then regenerating the parser:

pegjs -o ./lib/natural/tokenizers/parser_sentence_tokenizer.js ./lib/natural/tokenizers/pegjs_grammar_sentence_tokenizer.txt

Overview of available tokenizers:

| Tokenizer | Language | Explanation |
| --- | --- | --- |
| WordTokenizer | Any | Splits on anything except alphabetic characters, digits and underscore |
| WordPunctTokenizer | Any | Splits on anything except alphabetic characters, digits, punctuation and underscore |
| SentenceTokenizer | Any | Breaks a string up into parts based on punctuation and quotation marks |
| SentenceTokenizerNew | Any | Breaks a string up into parts based on punctuation and quotation marks (grammar/parser based) |
| CaseTokenizer | Any? | If the lower and upper case forms of a character are the same, the character is assumed to be whitespace or something else (punctuation) |
| RegexpTokenizer | Any | Splits on a regular expression that defines either sequences of word characters or gap characters |
| OrthographyTokenizer | Finnish | Splits on anything except alphabetic characters, digits and underscore |
| TreebankWordTokenizer | Any | |
| AggressiveTokenizer | English | |
| AggressiveTokenizerFa | Farsi | |
| AggressiveTokenizerFr | French | |
| AggressiveTokenizerDe | German | |
| AggressiveTokenizerRu | Russian | |
| AggressiveTokenizerEs | Spanish | |
| AggressiveTokenizerIt | Italian | |
| AggressiveTokenizerPl | Polish | |
| AggressiveTokenizerPt | Portuguese | |
| AggressiveTokenizerNo | Norwegian | |
| AggressiveTokenizerSv | Swedish | |
| AggressiveTokenizerVi | Vietnamese | |
| AggressiveTokenizerId | Indonesian | |
| AggressiveTokenizerHi | Hindi | |
| AggressiveTokenizerUk | Ukrainian | |
| TokenizerJa | Japanese | |
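
The difference between the WordTokenizer and WordPunctTokenizer descriptions above can be sketched in plain JavaScript. The two functions and their patterns are assumptions chosen to match the descriptions for ASCII text; natural's own patterns may differ, for example in how they treat accented or non-Latin characters.

```javascript
// Illustrative patterns only; natural's own patterns may differ.

// Word-style: every run of characters other than letters, digits and
// underscore is a separator and is discarded.
function splitOnNonWord(text) {
  return text.split(/[^A-Za-z0-9_]+/).filter(function (t) { return t.length > 0; });
}

// WordPunct-style: runs of word characters and runs of punctuation are
// both kept as tokens; only whitespace is discarded.
function splitWordPunct(text) {
  return text.match(/[A-Za-z0-9_]+|[^A-Za-z0-9_\s]+/g) || [];
}

console.log(splitOnNonWord("my dog hasn't any fleas."));
// [ 'my', 'dog', 'hasn', 't', 'any', 'fleas' ]
console.log(splitWordPunct("my dog hasn't any fleas."));
// [ 'my', 'dog', 'hasn', "'", 't', 'any', 'fleas', '.' ]
```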