Extracting Keywords from Text with Python

A list of Python libraries for extracting keywords from text.
January 17, 2025 by Hamed Mohammadi

Keyword extraction is a common task in data analysis, natural language processing (NLP), and information retrieval. Keywords summarize the main topics of a document and support tasks such as search engine optimization (SEO), text summarization, and content classification. Python offers several libraries that make keyword extraction straightforward. Let’s explore some of the most popular options.

1. RAKE (Rapid Automatic Keyword Extraction)

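RAKE is a simple, unsupervised keyword extraction algorithm. It splits the text into candidate phrases at stopwords and punctuation, then scores each phrase from the frequency and co-occurrence (degree) of its words. The rake-nltk package relies on NLTK's stopword list and tokenizer data, so you may need to download them once with nltk.download("stopwords") and nltk.download("punkt") (recent NLTK releases use "punkt_tab" instead).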

Installation:

pip install rake-nltk

Example Usage:

from rake_nltk import Rake

text = "Python is a great programming language for data science and web development."
rake = Rake()
rake.extract_keywords_from_text(text)
keywords = rake.get_ranked_phrases()
print("Extracted Keywords:", keywords)

Output:

Extracted Keywords: ['great programming language', 'web development', 'data science', 'python']
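
To make the scoring idea concrete, here is a minimal pure-Python sketch of RAKE-style scoring. It is not the rake-nltk implementation: the tiny STOPWORDS set and the rake_sketch function are assumptions made for illustration, and the real library uses NLTK's stopword list and tokenizers.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; rake-nltk uses NLTK's list).
STOPWORDS = {"is", "a", "for", "and", "the", "of", "in", "to"}

def rake_sketch(text):
    """Return (score, phrase) pairs, highest score first."""
    # 1. Tokenize into words and punctuation marks.
    tokens = re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())

    # 2. Split into candidate phrases at stopwords and punctuation.
    phrases, current = [], []
    for token in tokens:
        if token in STOPWORDS or not token.isalnum():
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)

    # 3. Word frequency and degree (co-occurrences inside a phrase, self included).
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # 4. Phrase score = sum of degree/frequency over the words in the phrase.
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(((s, p) for p, s in scored.items()), reverse=True)

text = "Python is a great programming language for data science and web development."
ranked = rake_sketch(text)
for score, phrase in ranked:
    print(f"{score:.1f}  {phrase}")
```

Because each word's degree grows with the length of the phrases it appears in, longer phrases tend to score higher, which is why the three-word phrase ranks first here.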

2. spaCy

spaCy is a powerful NLP library that provides tools for Named Entity Recognition (NER) and syntactic parsing, which can be leveraged for keyword extraction. While it doesn’t have a dedicated keyword extraction feature, you can extract meaningful phrases using noun chunks.

Installation:

pip install spacy
python -m spacy download en_core_web_sm

Example Usage:

import spacy

nlp = spacy.load("en_core_web_sm")
text = "Python is a great programming language for data science and web development."
doc = nlp(text)

keywords = [chunk.text for chunk in doc.noun_chunks]
print("Extracted Keywords:", keywords)

Output:

Extracted Keywords: ['Python', 'a great programming language', 'data science', 'web development']

3. KeyBERT

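KeyBERT embeds the document and its candidate words or phrases with a BERT-based sentence embedding model, then returns the candidates whose embeddings are most similar (by cosine similarity) to the document embedding. By default it returns single words; pass keyphrase_ngram_range to get multi-word phrases.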

Installation:

pip install keybert

Example Usage:

from keybert import KeyBERT

kw_model = KeyBERT()
text = "Python is a great programming language for data science and web development."
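# Allow one- and two-word phrases; the default returns single words only.
keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 2), stop_words="english", top_n=3)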
print("Extracted Keywords:", keywords)

Output (illustrative; actual phrases and scores depend on the embedding model):

Extracted Keywords: [('data science', 0.78), ('web development', 0.76), ('programming language', 0.74)]

4. YAKE (Yet Another Keyword Extractor)

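YAKE is an unsupervised, domain-independent keyword extractor that scores candidates using statistical features of the text itself, such as word frequency, position, and casing, so it needs no training corpus. Note that lower YAKE scores indicate more relevant keywords.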

Installation:

pip install yake

Example Usage:

import yake

text = "Python is a great programming language for data science and web development."
kw_extractor = yake.KeywordExtractor()
keywords = kw_extractor.extract_keywords(text)
print("Extracted Keywords:", keywords)


Output (abridged; exact scores depend on the YAKE version and settings):

Extracted Keywords: [('programming language', 0.003), ('data science', 0.004), ('web development', 0.005)]

5. TextRank (via Gensim or Sumy)

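TextRank is a graph-based ranking algorithm inspired by Google’s PageRank: words become nodes, words that occur near each other are linked, and the highest-ranked nodes are returned as keywords. Gensim shipped a TextRank implementation in its gensim.summarization module, but that module was removed in Gensim 4.0, so the example below requires a 3.x release.

Installation:

pip install "gensim<4.0"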

Example Usage:

from gensim.summarization import keywords

text = "Python is a great programming language for data science and web development."
extracted_keywords = keywords(text, words=5, lemmatize=True).split('\n')
print("Extracted Keywords:", extracted_keywords)

Output (illustrative; results on such a short text may differ):

Extracted Keywords: ['python', 'data science', 'web development', 'programming', 'language']
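
If an older Gensim release is not an option, the core idea of TextRank is small enough to sketch in plain Python. The following is a simplified illustration, not Gensim's implementation: the STOPWORDS set, the textrank_keywords function, and its window, damping, and iterations parameters are all assumptions chosen for this example. Stopwords are dropped before linking, so words separated only by stopwords count as adjacent.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption for this sketch).
STOPWORDS = {"is", "a", "for", "and", "the", "of", "in", "to"}

def textrank_keywords(text, window=2, damping=0.85, iterations=50):
    """Rank words with PageRank over an undirected co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

    # Link each word to the words that follow it within a span of `window` words.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Power iteration: each word shares its score equally among its neighbors.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

text = "Python is a great programming language for data science and web development."
ranked = textrank_keywords(text)
for word, score in ranked:
    print(f"{word}: {score:.3f}")
```

On a single sentence the graph is just a chain, so the ranking mostly reflects how many neighbors a word has. The method becomes more informative on longer texts where important words co-occur with many others.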

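6. TF-IDF with Scikit-learn

The TF-IDF (Term Frequency-Inverse Document Frequency) approach scores each word by how often it appears in a document, discounted by how many documents in the collection contain it. This method can be implemented using Scikit-learn. Note that the IDF part is only informative with several documents; with the single sentence used below, the example simply lists the vocabulary that remains after stopword removal, in alphabetical order.

Installation:

pip install scikit-learn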

Example Usage:

from sklearn.feature_extraction.text import TfidfVectorizer

text = ["Python is a great programming language for data science and web development."]
tfidf = TfidfVectorizer(stop_words="english")
response = tfidf.fit_transform(text)

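# get_feature_names_out() returns a NumPy array in alphabetical order (not ranked by score);
# convert it to a plain list for printing.
keywords = tfidf.get_feature_names_out().tolist()
print("Extracted Keywords:", keywords)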

Output:

Extracted Keywords: ['data', 'development', 'great', 'language', 'programming', 'python', 'science', 'web']
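
To see how TF-IDF ranks words once there is more than one document, here is a small pure-Python sketch using the classic formula tf × log(N / df). The three-document corpus, the STOPWORDS set, and the tfidf_keywords function are hypothetical and exist only for this illustration; Scikit-learn's TfidfVectorizer uses a smoothed IDF and normalization by default, so its numbers will differ.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption for this sketch).
STOPWORDS = {"is", "a", "for", "and", "the", "of", "in", "to"}

def tokenize(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def tfidf_keywords(docs, doc_index):
    """Rank the words of docs[doc_index] by tf * log(N / df), highest first."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))

    # Term frequency within the chosen document.
    tokens = tokenized[doc_index]
    tf = Counter(tokens)

    scores = {
        word: (count / len(tokens)) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical mini-corpus for illustration only.
docs = [
    "Python is a great programming language for data science and web development.",
    "Java is a popular programming language for enterprise software.",
    "Data science is a growing field.",
]
ranked = tfidf_keywords(docs, 0)
for word, score in ranked:
    print(f"{word}: {score:.3f}")
```

Words that appear only in the first document (such as "python" and "web") outrank words shared with other documents (such as "programming" and "data"), which is the effect the IDF term is meant to capture.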

Choosing the Right Library

Each of these libraries has its strengths and is suitable for different use cases:

  • RAKE: Simple and effective for smaller texts.
  • spaCy: Great for extracting noun phrases and named entities.
  • KeyBERT: Ideal for semantic keyword extraction using deep learning.
  • YAKE: Lightweight and effective for statistical keyword extraction.
  • TextRank: Useful for graph-based ranking.
  • TF-IDF: Works well when you have a collection of documents and need statistical analysis.

Conclusion

Keyword extraction is a vital task in text analysis, and Python provides a wealth of libraries to accomplish this. Whether you need a simple unsupervised algorithm like RAKE or a sophisticated deep learning-based method like KeyBERT, you have the tools to extract meaningful insights from your text.

Experiment with these libraries and find the one that best fits your project’s needs.
