Lemmatization vs. Stemming
Imagine applying for an NLP Engineer, NLP Developer, ML Engineer, or even Data Scientist position and not knowing the difference between lemmatization and stemming!
Well, no worries, because this article will walk you through both techniques THOROUGHLY!
Lemmatization
In the simplest terms, lemmatization is an NLP preprocessing technique that reduces a word to its base form, or lemma. Suppose we look up the meaning of a word, say, authenticating, on Google Search; the result we get is the meaning of authenticate. The word authenticate is the base form of authenticating, and that base form is exactly what lemmatization delivers.
Below is an example to understand it better, using both SpaCy and NLTK (Natural Language Tool Kit) libraries:
SpaCy
import spacy
nlp = spacy.load("en_core_web_sm")
string = "paying attention to details can be challenging"
doc = nlp(string)
for word in doc:
    print(word, ":", word.lemma_)
The output is:
paying : pay
attention : attention
to : to
details : detail
can : can
be : be
challenging : challenge
NLTK
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
string = "paying attention to details can be challenging"
lemmatizer = WordNetLemmatizer()
for word in string.split():
    print(word, ":", lemmatizer.lemmatize(word))
The output is:
paying : paying
attention : attention
to : to
details : detail
can : can
be : be
challenging : challenging
You can observe that the two libraries produce different outputs. The reason is part-of-speech information: SpaCy's pipeline tags each token before lemmatizing, so it knows paying and challenging are verbs. NLTK's WordNetLemmatizer, by contrast, treats every word as a noun unless you pass a pos argument, which is why those two words pass through unchanged.
Stemming
Stemming, on the other hand, is an NLP preprocessing technique that outputs the root or stem of a word by chopping off its affixes (most commonly suffixes). Since SpaCy doesn't provide stemming, we'll use only NLTK. There are many NLTK modules that perform this task, for example:
- nltk.stem.lancaster module : A word stemmer based on the Lancaster (Paice/Husk) stemming algorithm. Paice, Chris D. “Another Stemmer.”
- nltk.stem.regexp module : A stemmer that uses regular expressions to identify morphological affixes. Any substrings that match the regular expressions will be removed.
- nltk.stem.rslp module : A stemmer for Portuguese.
- nltk.stem.snowball.SnowballStemmer : This module provides a port of the Snowball stemmers developed by Martin Porter. The following languages are supported: Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, and Swedish.
- nltk.stem.porter module : This is the Porter stemming algorithm. It follows the algorithm presented in Porter, M. “An algorithm for suffix stripping.” with some optional deviations that can be turned on or off with the mode argument to the constructor.
- nltk.stem.arlstem module : ARLSTem is a light Arabic stemmer based on removing affixes from the word; its authors report promising results and high performance.
Below is the code using Porter Stemmer:
from nltk.stem import PorterStemmer
portStemmer = PorterStemmer()
for word in string.split():
    print(word, ":", portStemmer.stem(word))
The output is:
paying : pay
attention : attent
to : to
details : detail
can : can
be : be
challenging : challeng
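Swapping in another stemmer from the list above is a one-line change. For instance, the Snowball "English" stemmer (Martin Porter's revision of his original algorithm) can be used like this; on this particular sentence its stems are very similar to the Porter output shown above:

```python
from nltk.stem.snowball import SnowballStemmer

# Snowball supports many languages; pick English here
snowballStemmer = SnowballStemmer("english")

string = "paying attention to details can be challenging"
for word in string.split():
    print(word, ":", snowballStemmer.stem(word))
```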
Which technique is better?
Which technique to use depends highly on the use case. Lemmatization finds meaningful base forms of words, which makes it slower than stemming; stemming simply chops off word endings to reach the stem, which is fast but can produce stems that are not real words and may hurt the performance of a downstream model. And even though lemmatization is slower, the extra cost is rarely a challenge that can't be solved. Therefore, lemmatization is often preferred over stemming, or the two techniques are used together.
Hope this answers the interview question! Please feel free to share your feedback.