Tokenizers
Word, regular-expression, Treebank, and several other tokenizers are provided for breaking text up into arrays of tokens:
```javascript
var natural = require('natural');
var tokenizer = new natural.WordTokenizer();
console.log(tokenizer.tokenize("your dog has fleas."));
// [ 'your', 'dog', 'has', 'fleas' ]
```
The other tokenizers follow a similar pattern:
```javascript
tokenizer = new natural.TreebankWordTokenizer();
console.log(tokenizer.tokenize("my dog hasn't any fleas."));
// [ 'my', 'dog', 'has', 'n\'t', 'any', 'fleas', '.' ]

tokenizer = new natural.RegexpTokenizer({pattern: /\-/});
console.log(tokenizer.tokenize("flea-dog"));
// [ 'flea', 'dog' ]

tokenizer = new natural.WordPunctTokenizer();
console.log(tokenizer.tokenize("my dog hasn't any fleas."));
// [ 'my', 'dog', 'hasn', '\'', 't', 'any', 'fleas', '.' ]

tokenizer = new natural.OrthographyTokenizer({language: "fi"});
console.log(tokenizer.tokenize("Mikä sinun nimesi on?"));
// [ 'Mikä', 'sinun', 'nimesi', 'on' ]
```
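The RegexpTokenizer shown above splits on its pattern by default. Judging from the RegexpTokenizer entry in the table below, it can also treat the pattern as describing the tokens themselves rather than the gaps between them; the `gaps` option used here is an assumption and may differ between versions:

```javascript
// Assumption: gaps: false makes the pattern match the tokens themselves,
// while the default (gaps: true) splits on the pattern as a separator.
// The pattern needs the global flag so that all tokens are returned.
tokenizer = new natural.RegexpTokenizer({pattern: /[a-zA-Z]+/g, gaps: false});
console.log(tokenizer.tokenize("flea-dog barks!"));
// Expected: [ 'flea', 'dog', 'barks' ]
```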
The sentence tokenizer splits text into sentences based on punctuation. It uses the following four characters as sentence delimiters: period, question mark, exclamation mark, and ellipsis. Furthermore:
- Runs of multiple punctuation characters are treated as one; they may be interspersed with spaces.
- It recognizes abbreviations, which are passed to the constructor as an array; case is ignored when matching them.
- It gracefully handles decimal points in numbers and periods in URIs and email addresses.
- Quotation marks are left in place: the opening mark at the beginning of the first sentence of the quotation and the closing mark at the end of its last sentence.
The algorithm first replaces every token containing punctuation that does not mark the end of a sentence (abbreviations, decimal numbers, URIs) with a placeholder, then splits the text into sentences, and finally substitutes the original tokens back for the placeholders.
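To make the placeholder idea concrete, here is a minimal, self-contained sketch. It is illustrative only, not natural's actual implementation: the function name `naiveSentenceSplit` and the masking regular expressions are simplified assumptions.

```javascript
// A toy illustration of the placeholder approach (not natural's actual code).
// Abbreviations and decimal numbers are masked, the text is split on
// sentence-final punctuation, and the masked tokens are restored.
function naiveSentenceSplit(text, abbreviations) {
  const saved = [];
  // Hypothetical, simplified masking pattern: the given abbreviations
  // (case-insensitive) plus decimal numbers.
  const mask = new RegExp(
    abbreviations.map(a => a.replace(/\./g, '\\.')).join('|') + '|\\d+\\.\\d+',
    'gi'
  );
  const masked = text.replace(mask, m => {
    saved.push(m);
    return '\u0000' + (saved.length - 1) + '\u0000';
  });
  // Split into sentences, keeping runs of ., !, ? and ellipsis attached.
  const sentences = masked.match(/[^.!?…]+[.!?…]+|[^.!?…]+$/g) || [];
  // Put the original tokens back in place of the placeholders.
  return sentences.map(s =>
    s.replace(/\u0000(\d+)\u0000/g, (_, i) => saved[i]).trim()
  );
}

console.log(naiveSentenceSplit('Dr. Smith arrived at 10.30. He was late.', ['Dr.']));
// [ 'Dr. Smith arrived at 10.30.', 'He was late.' ]
```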
Basic usage:

```javascript
const abbreviations = ['i.e.', 'e.g.', 'Dr.'];
tokenizer = new natural.SentenceTokenizer(abbreviations);
console.log(tokenizer.tokenize("This is a sentence. This is another sentence."));
// [ 'This is a sentence.', 'This is another sentence.' ]
```
Overview of available tokenizers:
Tokenizer | Language | Explanation
---|---|---
WordTokenizer | Any | Splits on anything except alphabetic characters, digits and underscore
WordPunctTokenizer | Any | Splits on anything except alphabetic characters, digits, punctuation and underscore
SentenceTokenizer | Any | Breaks a string up into sentences based on punctuation and quotation marks
CaseTokenizer | Any? | If lower and upper case are the same, the character is assumed to be whitespace or something else (punctuation)
RegexpTokenizer | Any | Splits on a regular expression that either defines sequences of word characters or gap characters
OrthographyTokenizer | Finnish | Splits on anything except alphabetic characters, digits and underscore
TreebankWordTokenizer | Any | Penn Treebank-style tokenization: contractions are split (hasn't becomes has and n't) and punctuation becomes separate tokens
AggressiveTokenizer | English |
AggressiveTokenizerFa | Farsi |
AggressiveTokenizerFr | French |
AggressiveTokenizerDe | German |
AggressiveTokenizerRu | Russian |
AggressiveTokenizerEs | Spanish |
AggressiveTokenizerIt | Italian |
AggressiveTokenizerPl | Polish |
AggressiveTokenizerPt | Portuguese |
AggressiveTokenizerNo | Norwegian |
AggressiveTokenizerSv | Swedish |
AggressiveTokenizerVi | Vietnamese |
AggressiveTokenizerId | Indonesian |
AggressiveTokenizerHi | Hindi |
AggressiveTokenizerUk | Ukrainian |
TokenizerJa | Japanese |
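The aggressive tokenizers share the interface shown above; they differ only in which characters count as part of a word in each language. A hedged example for the English variant, assuming it splits on every non-word character, including the apostrophe:

```javascript
// Assumption: the English AggressiveTokenizer splits on all non-word
// characters, so the apostrophe in "hasn't" separates "hasn" and "t".
tokenizer = new natural.AggressiveTokenizer();
console.log(tokenizer.tokenize("my dog hasn't any fleas."));
// Expected: [ 'my', 'dog', 'hasn', 't', 'any', 'fleas' ]
```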