Stemmers
Currently, stemming is supported via the Porter and Lancaster (Paice/Husk) algorithms. The Indonesian and Japanese stemmers do not follow a known algorithm.
var natural = require('natural');
This example uses a Porter stemmer. “word” is returned.
console.log(natural.PorterStemmer.stem("words")); // stem a single word
In Russian:
console.log(natural.PorterStemmerRu.stem("падший"));
In Spanish:
console.log(natural.PorterStemmerEs.stem("jugaría"));
The following stemmers are available:
Language | Porter | Lancaster | Other | Module |
---|---|---|---|---|
Dutch | X | | | PorterStemmerNl |
English | X | | | PorterStemmer |
English | | X | | LancasterStemmer |
Farsi (in progress) | X | | | PorterStemmerFa |
French | X | | | PorterStemmerFr |
French | | | X | CarryStemmerFr |
German | X | | | PorterStemmerDe |
Indonesian | | | X | StemmerId |
Italian | X | | | PorterStemmerIt |
Japanese | | | X | StemmerJa |
Norwegian | X | | | PorterStemmerNo |
Portuguese | X | | | PorterStemmerPt |
Russian | X | | | PorterStemmerRu |
Spanish | X | | | PorterStemmerEs |
Swedish | X | | | PorterStemmerSv |
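When the language is only known at runtime, the module names in the table can be looked up by key. The following is a minimal sketch, not part of natural: the language codes, the `PORTER_MODULES` map, and the `porterStemmerFor` helper are all hypothetical names introduced for illustration. The helper takes the library object as an argument so it makes no assumption about how natural was loaded.

```javascript
// Module names copied from the table above, keyed by language code.
// The codes and this map are illustrative and not provided by natural.
const PORTER_MODULES = {
  nl: 'PorterStemmerNl',
  en: 'PorterStemmer',
  fa: 'PorterStemmerFa',
  fr: 'PorterStemmerFr',
  de: 'PorterStemmerDe',
  it: 'PorterStemmerIt',
  no: 'PorterStemmerNo',
  pt: 'PorterStemmerPt',
  ru: 'PorterStemmerRu',
  es: 'PorterStemmerEs',
  sv: 'PorterStemmerSv'
};

// Hypothetical helper: returns the Porter stemmer object for a language
// code, or throws when the table lists none (Indonesian, Japanese) or
// when the given library object does not expose that module.
function porterStemmerFor(lib, languageCode) {
  const moduleName = PORTER_MODULES[languageCode];
  if (!moduleName || !lib[moduleName]) {
    throw new Error('No Porter stemmer for language: ' + languageCode);
  }
  return lib[moduleName];
}
```

With natural loaded as in the example above, `porterStemmerFor(natural, 'es').stem("jugaría")` would be equivalent to calling `natural.PorterStemmerEs.stem("jugaría")` directly.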
`attach()` patches `stem()` and `tokenizeAndStem()` to String as a shortcut to `PorterStemmer.stem(token)`. `tokenizeAndStem()` breaks text up into single words and returns an array of stemmed tokens.
natural.PorterStemmer.attach();
console.log("i am waking up to the sounds of chainsaws".tokenizeAndStem());
console.log("chainsaws".stem());
The same thing can be done with a Lancaster stemmer:
natural.LancasterStemmer.attach();
console.log("i am waking up to the sounds of chainsaws".tokenizeAndStem());
console.log("chainsaws".stem());
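To make the patching behaviour concrete, here is a minimal sketch of how an `attach()` method could add `stem()` and `tokenizeAndStem()` to `String.prototype`. It is an assumption about the general technique, not natural's source: `ToyStemmer` is a hypothetical stand-in that only strips a trailing "s", and the library's own tokenizer and any stopword handling are not reproduced.

```javascript
// Toy stemmer for illustration only: lowercases and strips one trailing "s".
// It is NOT the Porter or Lancaster algorithm.
const ToyStemmer = {
  stem(token) {
    const lower = token.toLowerCase();
    return lower.length > 3 && lower.endsWith('s') ? lower.slice(0, -1) : lower;
  },

  // Split on runs of non-letters, drop empty pieces, stem each word.
  tokenizeAndStem(text) {
    return text
      .split(/[^A-Za-z]+/)
      .filter(Boolean)
      .map((word) => this.stem(word));
  },

  // Patch String.prototype so every string gains stem() and tokenizeAndStem().
  attach() {
    const stemmer = this;
    String.prototype.stem = function () {
      return stemmer.stem(String(this));
    };
    String.prototype.tokenizeAndStem = function () {
      return stemmer.tokenizeAndStem(String(this));
    };
  }
};

ToyStemmer.attach();
console.log('chainsaws'.stem()); // "chainsaw"
console.log('the sounds of chainsaws'.tokenizeAndStem());
// [ 'the', 'sound', 'of', 'chainsaw' ]
```

In this sketch a later `attach()` call on a different stemmer object overwrites the same two prototype methods, so only the most recently attached stemmer is in effect.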
Carry stemmer
For French, an additional stemmer called the Carry stemmer is available. It implements the Galileo Carry algorithm described in http://www.otlet-institute.org/docs/Carry.pdf
Note :bangbang:: The implementation described in the PDF differs from the official C++ implementation. This implementation follows the rules of the C++ implementation, which solve some problems of the algorithm described in the article.
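Since French has both a Porter and a Carry stemmer, it can help to view their output side by side before choosing one. The sketch below assumes only that each stemmer object exposes a `stem(word)` method, as in the examples above; `compareStemmers` is a hypothetical helper and not part of natural.

```javascript
// Hypothetical helper: stems each word with every stemmer passed in and
// returns one row per word so the results can be compared.
function compareStemmers(words, stemmers) {
  return words.map((word) => {
    const row = { word };
    for (const [name, stemmer] of Object.entries(stemmers)) {
      row[name] = stemmer.stem(word);
    }
    return row;
  });
}
```

A call such as `compareStemmers(words, { porter: natural.PorterStemmerFr, carry: natural.CarryStemmerFr })` would return one object per word with a `porter` and a `carry` field, which can be printed with `console.table`.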
References
- Carry stemmer is a contribution by Johan Maupetit.
- PEGjs: Parser Generator for JavaScript, https://pegjs.org/