
[Tutorial] Extract full news article content from any RSS feed using Extract API

Extract full article content EVEN from a summary-only RSS feed

Learn how to extract the full content and all metadata fields from any RSS feed, or from a given list of URLs. For this example, we will be using Medium’s RSS feed. The code is in Python but can easily be adapted to other languages.

Let’s start by installing the packages. We will use “feedparser” to parse Medium’s RSS feed and “requests” to call the Extract API.

pip install feedparser
pip install requests

Let’s begin by extracting the article links from the RSS feed. For this example, we will use “Towards Data Science”, one of the leading blogs on Data Science, Machine Learning and Artificial Intelligence.

import feedparser

NewsFeed = feedparser.parse("https://towardsdatascience.com/feed")
print("Total entries found in feed: " + str(len(NewsFeed.entries)) + "\n")

for i, entry in enumerate(NewsFeed.entries):
    print(str(i) + ": Got url: " + entry.link)
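Before calling the Extract API, it is worth seeing what the feed itself provides. Below is a minimal sketch for inspecting the first entry (the attribute names are feedparser’s standard keys; many feeds only carry a title, link and short summary rather than the full, cleaned article body):

import feedparser

NewsFeed = feedparser.parse("https://towardsdatascience.com/feed")

if NewsFeed.entries:
    entry = NewsFeed.entries[0]
    # .get() avoids errors for fields the feed does not provide
    print("Title:     ", entry.get("title"))
    print("Link:      ", entry.get("link"))
    print("Published: ", entry.get("published"))
    # For a summary-only feed this is a short teaser, not the full article text
    print("Summary:   ", (entry.get("summary") or "")[:200])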

We are now able to extract the links. Next, we want to extract the entire content, summary, metadata and other details for each news article in the feed.

To do this, we will use Pipfeed’s Extract API on PromptAPI: https://promptapi.com/marketplace/description/pipfeed-api. You can get a free API key from PromptAPI.

import requests

url = "https://api.promptapi.com/pipfeed"

payload = "https://towardsdatascience.com/topic-model-evaluation-3c43e2308526"
headers = {
    "apikey": "YOUR_API_KEY"
}

response = requests.request("POST", url, headers=headers, data=payload)

status_code = response.status_code
result = response.text
print(result)

The above code extracts the given URL and returns all the fields. Below is the response we get. Do not forget to replace "YOUR_API_KEY" with your own API key generated from PromptAPI.

“summary” and “predictedCategories” are generated by Pipfeed’s AI models. The rest of the fields are extracted from the article HTML itself.

{
"publishedAt": "2020-11-09T05:15:23.001Z",
"title": "Topic Model Evaluation",
"authors": [
"Giri Rabindranath"
],
"description": "Evaluation is the key to understanding topic models - This article explains what topic model evaluation is, why it's important and how to do it",
"language": "en",
"url": "https://towardsdatascience.com/topic-model-evaluation-3c43e2308526",
"mainImage": "https://miro.medium.com/max/1200/1*wvlqQPpOHFK7xQ1XOhe6xg.jpeg",
"category": "machine-learning",
"categories": null,
"predictedCategories": [
"machine-learning",
"data-science",
"programming"
],
"tags": [],
"keywords": [
"coherence",
"evaluation",
"human",
"model",
"models",
"topic",
"topics",
"way",
"word",
"words"
],
"summary": "In this article, we\u2019ll look at topic model evaluation, what it is and how to do it.\nWhat is topic model evaluation?\nTopic model evaluation is the process of assessing how well a topic model does what it is designed for.\nThis is why topic model evaluation matters.\nHow to evaluate topic models \u2014 RecapThis article has hopefully made one thing clear \u2014 topic model evaluation isn\u2019t easy!",
"images": [
"https://miro.medium.com/fit/c/140/140/1*74Yrxu8s4sOtTECtixv9Fg.jpeg",
"https://miro.medium.com/max/60/1*[email protected]?q=20",
"https://miro.medium.com/fit/c/140/140/0*l_zfjU9IKMa47tfy",
"https://miro.medium.com/fit/c/56/56/2*b2y5uCYazQ9FgiUQEUHT6Q.jpeg",
"https://miro.medium.com/max/60/1*mpyrgqwMjfclV2oN1U2VIA.jpeg?q=20",
"https://miro.medium.com/max/698/1*E4oPMmq5jTKuStZJuyDGpw.jpeg",
"https://miro.medium.com/max/12032/1*wvlqQPpOHFK7xQ1XOhe6xg.jpeg",
"https://miro.medium.com/max/60/1*_MXaw5BKgIsm8J3dOUNHMg.jpeg?q=20",
"https://miro.medium.com/max/224/1*AGyTPCaRzVqL77kFwUwHKg.png",
"https://miro.medium.com/max/270/1*W_RAPQ62h0em559zluJLdQ.png",
"https://miro.medium.com/max/60/1*E4oPMmq5jTKuStZJuyDGpw.jpeg?q=20",
"https://miro.medium.com/max/1200/1*wvlqQPpOHFK7xQ1XOhe6xg.jpeg",
"https://miro.medium.com/max/60/0*aP8H1qpRN_OR1x5r?q=20",
"https://miro.medium.com/max/60/0*NIpOoYo9iHt4lMbg?q=20",
"https://miro.medium.com/max/60/0*l_zfjU9IKMa47tfy?q=20",
"https://miro.medium.com/max/270/1*Crl55Tm6yDNMoucPo1tvDg.png",
"https://miro.medium.com/max/784/1*_MXaw5BKgIsm8J3dOUNHMg.jpeg",
"https://miro.medium.com/fit/c/140/140/1*FTG-junI6KJzojC_xRVNXg.png",
"https://miro.medium.com/max/60/0*fG5RLd48iOZezB_y.jpeg?q=20",
"https://miro.medium.com/fit/c/140/140/0*NIpOoYo9iHt4lMbg",
"https://miro.medium.com/fit/c/140/140/1*[email protected]",
"https://miro.medium.com/max/60/1*wvlqQPpOHFK7xQ1XOhe6xg.jpeg?q=20",
"https://miro.medium.com/fit/c/140/140/1*mpyrgqwMjfclV2oN1U2VIA.jpeg",
"https://miro.medium.com/fit/c/140/140/0*fG5RLd48iOZezB_y.jpeg",
"https://miro.medium.com/fit/c/140/140/0*aP8H1qpRN_OR1x5r",
"https://miro.medium.com/max/60/1*74Yrxu8s4sOtTECtixv9Fg.jpeg?q=20",
"https://miro.medium.com/max/60/1*FTG-junI6KJzojC_xRVNXg.png?q=20"
],
"blogName": null,
"blogLogoUrl": null,
"html": "<div class=\"page\" id=\"readability-page-1\"><section><div><div><h2 id=\"ef6b\">DATA SCIENCE EXPLAINED</h2><h2 id=\"e375\">Here\u2019s what you need to know about evaluating topic models</h2><div><div><div><div><a rel=\"noopener\" href=\"https://medium.com/@g_rabi?source=post_page-----3c43e2308526--------------------------------\"><div><p><img height=\"28\" width=\"28\" src=\"https://miro.medium.com/fit/c/56/56/2*b2y5uCYazQ9FgiUQEUHT6Q.jpeg\" alt=\"Giri Rabindranath\"></p></div></a></div></div></div></div></div></div><div><p id=\"8bff\"><em>Topic models are widely used for analyzing unstructured text data, but they provide no guidance on the quality of topics produced. Evaluation is the key to understanding topic models. In this article, we\u2019ll look at what topic model evaluation is, why it\u2019s important and how to do it.</em></p></div></section><section><div><div><h2 id=\"324c\">Contents</h2><ul><li id=\"dd12\"><a rel=\"noopener\" href=\"#f0ce\"><em>What is topic model evaluation</em></a>?</li><li id=\"ceba\"><a rel=\"noopener\" href=\"#d1ae\"><em>How to evaluate topic models</em></a></li><li id=\"ea5d\"><a rel=\"noopener\" href=\"#2932\"><em>Evaluating topic models \u2014 Human judgment</em></a></li><li id=\"6275\"><a rel=\"noopener\" href=\"#9b50\"><em>Evaluating topic models \u2014 Quantitative metrics</em></a></li><li id=\"ea38\"><a rel=\"noopener\" href=\"#19ff\"><em>Calculating coherence using Gensim in Python</em></a></li><li id=\"95a3\"><a rel=\"noopener\" href=\"#1756\"><em>Limitations of coherence</em></a></li><li id=\"251a\"><a rel=\"noopener\" href=\"#63c4\"><em>How to evaluate topic models \u2014 Recap</em></a></li><li id=\"e448\"><a rel=\"noopener\" href=\"#31aa\"><em>Conclusion</em></a></li></ul><p id=\"6f84\">Topic modeling is a branch of <a rel=\"noopener nofollow\" href=\"https://highdemandskills.com/natural-language-processing-explained-simply/\">natural language processing</a> that\u2019s used for exploring text data. It works by identifying key themes \u2014 or topics \u2014 based on the words or phrases in the data that have a similar meaning. Its versatility and ease-of-use have led to a variety of applications.</p><p id=\"3772\">Be<span id=\"rmm\">i</span>ng a form of unsupervised learning, topic modeling is useful when annotated or labeled data isn\u2019t available. This is helpful, as the majority of emerging text data isn\u2019t labeled, and labeling is time-consuming and expensive to do.</p><p id=\"030c\">For an easy-to-follow, intuitive explanation of topic modeling and its applications, see <a rel=\"noopener nofollow\" href=\"https://highdemandskills.com/topic-modeling-intuitive/\">this article</a>.</p><p id=\"fb4a\">One of the shortcomings of topic modeling is that there\u2019s no guidance about the quality of topics produced. If you want to learn about how meaningful the topics are, you\u2019ll need to evaluate the topic model.</p><p id=\"b937\">In this article, we\u2019ll look at topic model evaluation, what it is and how to do it. It\u2019s an important part of the topic modeling process that sometimes gets overlooked. For a topic model to be truly useful, some sort of evaluation is needed to understand how relevant the topics are for the purpose of the model.</p><p id=\"b85d\">Topic model evaluation is the process of assessing how well a topic model does what it is designed for.</p><p id=\"44ee\">When you run a topic model, you usually do it with a specific purpose in mind. 
It may be for document classification, to explore a set of unstructured texts, or some other analysis. As with any model, if you wish to know how effective it is at doing what it\u2019s designed for, you\u2019ll need to evaluate it. This is why topic model evaluation matters.</p><p id=\"e9c9\">Evaluating a topic model can help you decide if the model has captured the internal structure of a corpus (a collection of text documents). This can be particularly useful in tasks like e-discovery, where the effectiveness of a topic model can have implications for legal proceedings or other important matters.</p><p id=\"a51a\">More generally, topic model evaluation can help you answer questions like:</p><ul><li id=\"b7ef\">Are the identified topics understandable?</li><li id=\"1d2d\">Are the topics coherent?</li><li id=\"325e\">Does the topic model serve the purpose it is being used for?</li></ul><p id=\"da03\">Without some form of evaluation, you won\u2019t know how well your topic model is performing or if it\u2019s being used properly.</p><p id=\"c559\">Evaluating a topic model isn\u2019t always easy, however.</p><p id=\"3adc\">If a topic model is used for a measurable task, such as classification, then its effectiveness is relatively straightforward to calculate (eg. measure the proportion of successful classifications). But if the model is used for a more qualitative task, such as exploring the semantic themes in an unstructured corpus, then evaluation is more difficult.</p><p id=\"ff58\">In this article, we\u2019ll focus on evaluating topic models that do not have clearly measurable outcomes. These include topic models used for document exploration, content recommendation and e-discovery, amongst other use cases.</p><p id=\"dc38\">Evaluating these types of topic models seeks to understand how easy it is for humans to interpret the topics produced by the model. Put another way, topic model evaluation is about the \u2018human interpretability\u2019 or \u2018semantic interpretability\u2019 of topics.</p><p id=\"48f0\">There are a number of ways to evaluate topic models. These include:</p><p id=\"fb66\"><em>Human judgment</em></p><ul><li id=\"422a\">Observation-based, eg. observing the top \u2019N\u2019 words in a topic</li><li id=\"cee6\">Interpretation-based, eg. \u2018word intrusion\u2019 and \u2018topic intrusion\u2019 to identify the words or topics that \u201cdon\u2019t belong\u201d in a topic or document</li></ul><p id=\"ce7c\"><em>Quantitative metrics</em> \u2014 Perplexity (held out likelihood) and coherence calculations</p><p id=\"8c62\"><em>Mixed approaches</em> \u2014 Combinations of judgment-based and quantitative approaches</p><p id=\"6bd7\">Let\u2019s look at a few of these more closely.</p><h2 id=\"f1c4\">Observation-based approaches</h2><p id=\"4ea9\">The easiest way to evaluate a topic is to look at the most probable words in the topic. This can be done in a tabular form, for instance by listing the top 10 words in each topic, or in other formats.</p><p id=\"939a\">One visually appealing way to observe the probable words in a topic is through Word Clouds.</p><p id=\"9585\">To illustrate, the following example is a Word Cloud based on topics modeled from the minutes of US Federal Open Market Committee (FOMC) meetings. The FOMC is an important part of the US financial system and meets 8 times per year. 
The following Word Cloud is based on a topic that emerged from an analysis of topic trends in FOMC meetings over 2007 to 2020.</p><figure><a href=\"https://highdemandskills.com/topic-trends-fomc/\"><div><div><p><img data-old-src=\"https://miro.medium.com/max/60/1*E4oPMmq5jTKuStZJuyDGpw.jpeg?q=20\" sizes=\"349px\" srcset=\"https://miro.medium.com/max/552/1*E4oPMmq5jTKuStZJuyDGpw.jpeg 276w, https://miro.medium.com/max/698/1*E4oPMmq5jTKuStZJuyDGpw.jpeg 349w\" height=\"181\" width=\"349\" src=\"https://miro.medium.com/max/698/1*E4oPMmq5jTKuStZJuyDGpw.jpeg\" alt=\"Image for post\"></p></div></div></a><figcaption>Word Cloud of \u201cinflation\u201d topic. Image by Author.</figcaption></figure><p id=\"b025\">Topic modeling doesn\u2019t provide guidance on the meaning of any topic, so labeling a topic requires human interpretation. In this case, based on the most probable words displayed in the Word Cloud, the topic appears to be about \u201cinflation\u201d.</p><p id=\"ead0\">You can see more Word Clouds from the FOMC topic modeling example <a rel=\"noopener nofollow\" href=\"https://highdemandskills.com/topic-trends-fomc/#h4-interpret-topics\">here</a>.</p><p id=\"d586\">Beyond observing the most probable words in a topic, a more comprehensive observation-based approach called \u2018Termite\u2019 has been <a rel=\"noopener nofollow\" href=\"http://vis.stanford.edu/files/2012-Termite-AVI.pdf\">developed by Stanford University researchers</a>.</p><p id=\"a78d\">Termite is described as \u201c<em>a visualization of the term-topic distributions produced by topic models\u201d </em>[1]. In this description, \u2018term\u2019 refers to a \u2018word\u2019, so \u2018term-topic distributions\u2019 are \u2018word-topic distributions\u2019.</p><p id=\"a348\">Termite produces meaningful visualizations by introducing two calculations:</p><ol><li id=\"ddbe\">A \u2018saliency\u2019 measure, which identifies words that are more relevant for the topics in which they appear (beyond mere frequencies of their counts)</li><li id=\"246c\">A \u2018seriation\u2019 method, for sorting words into more coherent groupings based on the degree of semantic similarity between them</li></ol><p id=\"d552\">Termite produces graphs which summarize words and topics based on saliency and seriation. This helps to identify more interpretable topics and leads to better topic model evaluation.</p><p id=\"5bce\">You can see example Termite visualizations <a rel=\"noopener nofollow\" href=\"http://vis.stanford.edu/topic-diagnostics/\">here</a>.</p><h2 id=\"1281\">Interpretation-based approaches</h2><p id=\"ccc6\">Interpretation-based approaches take more effort than observation-based approaches but produce better results. These approaches are considered a \u2018gold standard\u2019 for evaluating topic models since they use human judgment to maximum effect.</p><p id=\"eff2\">A good illustration of these is described in a <a rel=\"noopener nofollow\" href=\"http://users.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf\">research paper</a> by Jonathan Chang and others (2009) [2] which developed \u2018word intrusion\u2019 and \u2018topic intrusion\u2019 to help evaluate semantic coherence.</p><p id=\"e268\"><strong>Word intrusion</strong></p><p id=\"cd41\">In word intrusion, subjects are presented with groups of 6 words, 5 of which belong to a given topic and one which does not \u2014 the \u2018intruder\u2019 word. 
Subjects are asked to identify the intruder word.</p><p id=\"3489\">To understand how this works, consider the group of words:</p><p id=\"7c02\">[ <em>dog, cat, horse, apple, pig, cow </em>]</p><p id=\"e26b\">Can you spot the intruder?</p><p id=\"0e95\">Most subjects pick \u2018apple\u2019 because it looks different to the others (all of which are animals, suggesting an animal-related topic for the others).</p><p id=\"294d\">Now, consider:</p><p id=\"3370\">[ <em>car, teacher, platypus, agile, blue, Zaire </em>]</p><p id=\"2ebb\">Which is the intruder in this group of words?</p><p id=\"fed9\">It\u2019s much harder to identify, so most subjects choose the intruder at random. This implies poor topic coherence.</p><p id=\"3b59\"><strong>Topic intrusion</strong></p><p id=\"1ee8\">Similar to word intrusion, in topic intrusion subjects are asked to identify the \u2018intruder\u2019 topic from groups of topics that make up documents.</p><p id=\"2b99\">In this task, subjects are shown a title and a snippet from a document along with 4 topics. Three of the topics have a high probability of belonging to the document while the remaining topic has a low probability \u2014 the \u2018intruder\u2019 topic.</p><p id=\"41ed\">As for word intrusion, the intruder topic is sometimes easy to identify and at other times not. The success with which subjects can correctly choose the intruder helps to determine the level of coherence.</p><p id=\"7489\">While evaluation methods based on human judgment can produce good results, they are costly and time-consuming to do.</p><p id=\"e193\">Moreover, human judgment isn\u2019t clearly defined and humans don\u2019t always agree on what makes a good topic. In contrast, the appeal of quantitative metrics is the ability to standardize, automate and scale the evaluation of topic models.</p><h2 id=\"2047\">Held out likelihood or perplexity</h2><p id=\"415d\">A traditional metric for evaluating topic models is the \u2018held out likelihood\u2019, also referred to as \u2018perplexity\u2019.</p><p id=\"4c08\">This is calculated by splitting a dataset into two parts \u2014 a training set and a test set. The idea is to train a topic model using the training set and then test the model on a test set which contains previously unseen documents (ie. held out documents). Likelihood is usually calculated as a logarithm, so this metric is sometimes referred to as the \u2018held out log-likelihood\u2019.</p><p id=\"c176\">The perplexity metric is a predictive one. It assesses a topic model\u2019s ability to predict a test set after having been trained on a training set. In practice, around 80% of a corpus may be set aside as a training set with the remaining 20% being a test set.</p><p id=\"4ffe\">Although the perplexity metric is a natural choice for topic models from a technical standpoint, it does not provide good results for human interpretation. This was demonstrated by research, again by Jonathan Chang and others (2009), which found that perplexity did not do a good job of conveying whether topics are coherent or not.</p><p id=\"bf74\">When comparing perplexity against human judgment approaches like word intrusion and topic intrusion, the research showed a negative correlation. This means that as the perplexity score improves (ie. the held out log-likelihood is higher), the human interpretability of topics and topic mixes get worse (rather than better). 
The perplexity metric therefore appears to be misleading when it comes to the human understanding of topics and topic mixes.</p><p id=\"ad6d\">Are there better quantitative metrics than perplexity for evaluating topic models?</p><h2 id=\"6e3f\">Coherence</h2><p id=\"a14e\">One of the shortcomings of perplexity is that it does not capture context, ie. perplexity does not capture the relationship between words in a topic or topics in a document. The idea of semantic context is important for human understanding.</p><p id=\"3175\">To overcome this, approaches have been developed that attempt to capture context between words in a topic. They use measures such as the conditional likelihood (rather than the log-likelihood) of the co-occurrence of words in a topic. These approaches are collectively referred to as \u2018coherence\u2019.</p><p id=\"6ad7\">There\u2019s been a lot of research on coherence over recent years and as a result there are a variety of methods available. A useful way to deal with this is to set up a framework that allows you to choose the methods that you prefer.</p><p id=\"ebf8\">Such a framework has been proposed by researchers at <a rel=\"noopener nofollow\" href=\"http://aksw.org/About.html\">AKSW</a>. Using this <a rel=\"noopener nofollow\" href=\"http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf\">framework</a>, which we\u2019ll call the \u201ccoherence pipeline\u201d, you can calculate coherence in a way that works best for your circumstances (eg. based on availability of a corpus, speed of computation etc).</p><p id=\"06ae\">The coherence pipeline offers a versatile way to calculate coherence. It is also what Gensim, a popular package for topic modeling in Python, uses for implementing coherence (more on this later).</p><p id=\"94ae\">The coherence pipeline is made up of four stages:</p><ol><li id=\"acc4\">Segmentation</li><li id=\"1668\">Probability estimation</li><li id=\"75ce\">Confirmation</li><li id=\"0982\">Aggregation</li></ol><p id=\"4e6c\">These four stages form the basis of coherence calculations and work as follows:</p><p id=\"91d7\"><strong>Segmentation</strong> sets up the word groupings that are used for pair-wise comparisons.</p><p id=\"b612\">Let\u2019s say that we wish to calculate the coherence of a set of topics. Coherence calculations start by choosing words within each topic (usually the most frequently occurring words) and comparing them with each other, one pair at a time. Segmentation is the process of choosing how words are grouped together for these pair-wise comparisons.</p><p id=\"7439\">Word groupings can be made up of single words or larger groupings. For single words, each word in a topic is compared with each other word in the topic. For 2-word or 3-word groupings, each 2-word group is compared with each other 2-word group, or each 3-word group is compared with each other 3-word group, and so on.</p><p id=\"0de4\">Comparisons can also be made between groupings of different size, for instance single words can be compared with 2-word or 3-word groups.</p><p id=\"4a41\"><strong>Probability </strong>estimation refers to the type of probability measure that underpins the calculation of coherence. To illustrate, consider the two widely used coherence approaches of <em>UCI</em> and <em>UMass</em>:</p><ul><li id=\"6032\">UCI is based on point-wise mutual information (PMI) calculations. 
This is given by: <code><strong>PMI</strong>(wi,wj) = log[(<strong>P</strong>(wi,wj) + e) / <strong>P</strong>(wi).<strong>P</strong>(wj)]</code>, for words <code>wi</code> and <code>wj</code> and some small number <code>e</code>, and where <code><strong>P</strong>(wi)</code> is the probability of word <code>i</code> occurring in a topic and <code><strong>P</strong>(wi,wj)</code> is the probability of both words <code>i</code> and <code>j</code> appearing in a topic. Here, the probabilities are based on word co-occurrence counts.</li><li id=\"0083\">UMass caters for the order in which words appear and is based on the calculation of: <code>log[(<strong>P</strong>(wi,wj) + e) / <strong>P</strong>(wj)]</code>, with <code>wi</code>, <code>wj</code>, <code><strong>P</strong>(wi)</code> and <code><strong>P</strong>(wi,wj)</code> as for UCI. Here, the probabilities are conditional, since <code><strong>P</strong>(wi|wj) = [(<strong>P</strong>(wi,wj) / <strong>P</strong>(wj)]</code>, which we know from <a rel=\"noopener nofollow\" href=\"https://highdemandskills.com/bayes-theorem/\">Bayes\u2019 theorem</a>. So, this approach measures how much a common word appearing within a topic is a good predictor for a less common word in the topic.</li></ul><p id=\"f8a3\"><strong>Confirmation</strong> measures how strongly each word grouping in a topic relates to other word groupings (ie. how similar they are). There are direct and indirect ways of doing this, depending on the frequency and distribution of words in a topic.</p><p id=\"1c55\"><strong>Aggregation</strong> is the final step of the coherence pipeline. It\u2019s a summary calculation of the confirmation measures of all the word groupings, resulting in a single coherence score. This is usually done by averaging the confirmation measures using the mean or median. Other calculations may also be used, such as the harmonic mean, quadratic mean, minimum or maximum.</p><p id=\"93e5\">Coherence is a popular way to quantitatively evaluate topic models and has good coding implementations in languages such as Python (eg. Gensim).</p><p id=\"ddfb\">To see how coherence works in practice, let\u2019s look at an example.</p><p id=\"8445\">Gensim is a widely used package for topic modeling in Python. It uses <a rel=\"noopener nofollow\" href=\"https://highdemandskills.com/topic-modeling-intuitive/\">Latent Dirichlet Allocation</a> (LDA) for topic modeling and includes functionality for calculating the coherence of topic models.</p><p id=\"d3e9\">As mentioned, Gensim calculates coherence using the coherence pipeline, offering a range of options for users.</p><p id=\"1da5\">The following example uses Gensim to model topics for US company earnings calls. These are quarterly conference calls in which company management discusses financial performance and other updates with analysts, investors and the media. They are an important fixture in the US financial calendar.</p><p id=\"ad5b\">The following code calculates coherence for the trained topic model in the example:</p><figure><div></div><figcaption><a rel=\"noopener nofollow\" href=\"https://highdemandskills.com/topic-modeling-lda/\">Calculating the coherence score using Gensim</a></figcaption></figure><p id=\"c9d2\">The coherence method that was chosen in this example is \u201cc_v\u201d. This is one of several choices offered by Gensim. 
Other choices include UCI (\u201cc_uci\u201d) and UMass (\u201cu_mass\u201d).</p><p id=\"594d\">For more information about the Gensim package and the various choices that go with it, please refer to the <a rel=\"noopener nofollow\" href=\"https://radimrehurek.com/gensim/models/coherencemodel.html\">Gensim documentation</a>.</p><p id=\"3292\">Gensim can also be used to explore the effect of varying LDA parameters on a topic model\u2019s coherence score. This helps to select the best choice of parameters for the model. The following code shows how to calculate coherence for varying values of the alpha parameter in the LDA model:</p><figure><div></div><figcaption><a rel=\"noopener nofollow\" href=\"https://highdemandskills.com/topic-modeling-lda/\">Investigating coherence by varying the alpha parameter</a></figcaption></figure><p id=\"fe44\">The above code also produces a chart of the model\u2019s coherence score for different values of the alpha parameter:</p><figure><a href=\"https://highdemandskills.com/topic-modeling-lda/\"><div><div><p><img data-old-src=\"https://miro.medium.com/max/60/1*_MXaw5BKgIsm8J3dOUNHMg.jpeg?q=20\" sizes=\"392px\" srcset=\"https://miro.medium.com/max/552/1*_MXaw5BKgIsm8J3dOUNHMg.jpeg 276w, https://miro.medium.com/max/784/1*_MXaw5BKgIsm8J3dOUNHMg.jpeg 392w\" height=\"262\" width=\"392\" src=\"https://miro.medium.com/max/784/1*_MXaw5BKgIsm8J3dOUNHMg.jpeg\" alt=\"Image for post\"></p></div></div></a><figcaption>Topic model coherence for different values of the alpha parameter. Image by Author.</figcaption></figure><p id=\"ae19\">This helps in choosing the best value of alpha based on coherence scores.</p><p id=\"e6dc\">In practice, you would also want to check the effect of varying other model parameters on the coherence score. You can see how this was done in the US company earning call example <a rel=\"noopener nofollow\" href=\"https://highdemandskills.com/topic-modeling-lda/#h3-3\">here</a>.</p><p id=\"ecfa\">The overall choice of parameters would depend on balancing the varying effects on coherence, and also on judgment about the nature of the topics and the purpose of the model.</p><p id=\"098d\">Despite its usefulness, coherence has some important limitations.</p><p id=\"5a89\">According to <a rel=\"noopener nofollow\" href=\"https://www.linkedin.com/in/mattilyra/?originalSubdomain=de\">Matti Lyra</a>, a leading data scientist and researcher, the key limitations are:</p><ul><li id=\"83d2\"><strong>Variability</strong> \u2014 The aggregation step of the coherence pipeline is typically calculated over a large number of word-group pairs. While this produces a metric (eg. mean of the coherence scores), there\u2019s no way of estimating the variability of the metric. This means that there\u2019s no way of knowing the degree of confidence in the metric. Hence, although we can calculate aggregate coherence scores for a topic model, we don\u2019t really know how well that score reflects the actual coherence of the model (relative to statistical noise).</li><li id=\"72c8\"><strong>Comparability</strong> \u2014 The coherence pipeline allows the user to select different methods for each part of the pipeline. This, combined with the unknown variability of coherence scores, makes it difficult to meaningfully compare different coherence scores, or coherence scores between different models.</li><li id=\"4722\"><strong>Reference corpus</strong> \u2014 The choice of reference corpus is important. 
In cases where the probability estimates are based on the reference corpus, then a smaller or domain-specific corpus can produce misleading results when applied to set of documents that are quite different to the reference corpus.</li><li id=\"5a31\"><strong>\u201cJunk\u201d topics</strong> \u2014 Topic modeling provides no guarantees about the topics that are identified (hence the need for evaluation) and sometimes produces meaningless, or \u201cjunk\u201d, topics. These can distort the results of coherence calculations. The difficulty lies in identifying these junk topics for removal \u2014 it usually requires human inspection to do so. But involving humans in the process defeats the very purpose of using coherence, ie. to automate and scale topic model evaluation.</li></ul><p id=\"5cab\">With these limitations in mind, what\u2019s the best approach for evaluating topic models?</p><p id=\"1a08\">This article has hopefully made one thing clear \u2014 topic model evaluation isn\u2019t easy!</p><p id=\"6705\">Unfortunately, there\u2019s no straight forward or reliable way to evaluate topic models to a high standard of human interpretability. Also, the very idea of human interpretability differs between people, domains and use cases.</p><p id=\"ae25\">Nevertheless, the most reliable way to evaluate topic models is by using human judgment. But this takes time and is expensive.</p><p id=\"e576\">In terms of quantitative approaches, coherence is a versatile and scalable way to evaluate topic models, notwithstanding its limitations.</p><p id=\"659b\">In practice, you\u2019ll need to decide how to evaluate a topic model on a case-by-case basis, including which methods and process to use. A degree of domain knowledge and a clear understanding of the purpose of the model will help.</p><p id=\"7e42\">The thing to remember is that some sort of evaluation can be important in helping you assess the merits of your topic model and how to apply it.</p><p id=\"d9cf\">Topic model evaluation is an important part of the topic modeling process. This is because topic modeling offers no guidance on the quality of topics produced. Evaluation helps you assess how relevant the produced topics are, and how effective the topic model is.</p><p id=\"1dcb\">Evaluating topic models is unfortunately difficult to do. There are various approaches available, but the best results come from human interpretation. This is a time-consuming and costly exercise.</p><p id=\"a922\">Quantitative evaluation methods offer the benefits of automation and scaling. Coherence is the most popular of these and is easy to implement in widely used coding languages, such as with Gensim in Python.</p><p id=\"c34b\">In practice, the best approach for evaluating topic models will depend on the circumstances. Domain knowledge, an understanding of the model\u2019s purpose, and judgment will help in deciding the best evaluation approach.</p><p id=\"db90\">Topic modeling is an area of ongoing research \u2014 newer, better ways of evaluating topic models are likely to emerge.</p><p id=\"cd6c\">In the meantime, topic modeling continues to be a versatile and effective way to analyze and make sense of unstructured text data. And with the continued use of topic models, evaluation will remain an important part of the process.</p><p id=\"6e32\">[1] J. Chuang, C. D. Manning and J. 
Heer, <a rel=\"noopener nofollow\" href=\"http://vis.stanford.edu/files/2012-Termite-AVI.pdf\">Termite: Visualization Techniques for Assessing Textual Topic Models</a> (2012), Stanford University Computer Science Department</p><p id=\"7dfa\">[2] J. Chang et al, <a rel=\"noopener nofollow\" href=\"http://users.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf\">Reading Tea Leaves: How Humans Interpret Topic Models</a> (2009), Neural Information Processing Systems</p></div></div></section></div>",
"text": "DATA SCIENCE EXPLAINED Here\u2019s what you need to know about evaluating topic models Topic models are widely used for analyzing unstructured text data, but they provide no guidance on the quality of topics produced. Evaluation is the key to understanding topic models. In this article, we\u2019ll look at what topic model evaluation is, why it\u2019s important and how to do it. Contents What is topic model evaluation? How to evaluate topic models Evaluating topic models \u2014 Human judgment Evaluating topic models \u2014 Quantitative metrics Calculating coherence using Gensim in Python Limitations of coherence How to evaluate topic models \u2014 Recap Conclusion Topic modeling is a branch of natural language processing that\u2019s used for exploring text data. It works by identifying key themes \u2014 or topics \u2014 based on the words or phrases in the data that have a similar meaning. Its versatility and ease-of-use have led to a variety of applications. Being a form of unsupervised learning, topic modeling is useful when annotated or labeled data isn\u2019t available. This is helpful, as the majority of emerging text data isn\u2019t labeled, and labeling is time-consuming and expensive to do. For an easy-to-follow, intuitive explanation of topic modeling and its applications, see this article. One of the shortcomings of topic modeling is that there\u2019s no guidance about the quality of topics produced. If you want to learn about how meaningful the topics are, you\u2019ll need to evaluate the topic model. In this article, we\u2019ll look at topic model evaluation, what it is and how to do it. It\u2019s an important part of the topic modeling process that sometimes gets overlooked. For a topic model to be truly useful, some sort of evaluation is needed to understand how relevant the topics are for the purpose of the model. Topic model evaluation is the process of assessing how well a topic model does what it is designed for. When you run a topic model, you usually do it with a specific purpose in mind. It may be for document classification, to explore a set of unstructured texts, or some other analysis. As with any model, if you wish to know how effective it is at doing what it\u2019s designed for, you\u2019ll need to evaluate it. This is why topic model evaluation matters. Evaluating a topic model can help you decide if the model has captured the internal structure of a corpus (a collection of text documents). This can be particularly useful in tasks like e-discovery, where the effectiveness of a topic model can have implications for legal proceedings or other important matters. More generally, topic model evaluation can help you answer questions like: Are the identified topics understandable? Are the topics coherent? Does the topic model serve the purpose it is being used for? Without some form of evaluation, you won\u2019t know how well your topic model is performing or if it\u2019s being used properly. Evaluating a topic model isn\u2019t always easy, however. If a topic model is used for a measurable task, such as classification, then its effectiveness is relatively straightforward to calculate (eg. measure the proportion of successful classifications). But if the model is used for a more qualitative task, such as exploring the semantic themes in an unstructured corpus, then evaluation is more difficult. In this article, we\u2019ll focus on evaluating topic models that do not have clearly measurable outcomes. 
These include topic models used for document exploration, content recommendation and e-discovery, amongst other use cases. Evaluating these types of topic models seeks to understand how easy it is for humans to interpret the topics produced by the model. Put another way, topic model evaluation is about the \u2018human interpretability\u2019 or \u2018semantic interpretability\u2019 of topics. There are a number of ways to evaluate topic models. These include: Human judgment Observation-based, eg. observing the top \u2019N\u2019 words in a topic Interpretation-based, eg. \u2018word intrusion\u2019 and \u2018topic intrusion\u2019 to identify the words or topics that \u201cdon\u2019t belong\u201d in a topic or document Quantitative metrics \u2014 Perplexity (held out likelihood) and coherence calculations Mixed approaches \u2014 Combinations of judgment-based and quantitative approaches Let\u2019s look at a few of these more closely. Observation-based approaches The easiest way to evaluate a topic is to look at the most probable words in the topic. This can be done in a tabular form, for instance by listing the top 10 words in each topic, or in other formats. One visually appealing way to observe the probable words in a topic is through Word Clouds. To illustrate, the following example is a Word Cloud based on topics modeled from the minutes of US Federal Open Market Committee (FOMC) meetings. The FOMC is an important part of the US financial system and meets 8 times per year. The following Word Cloud is based on a topic that emerged from an analysis of topic trends in FOMC meetings over 2007 to 2020. Word Cloud of \u201cinflation\u201d topic. Image by Author. Topic modeling doesn\u2019t provide guidance on the meaning of any topic, so labeling a topic requires human interpretation. In this case, based on the most probable words displayed in the Word Cloud, the topic appears to be about \u201cinflation\u201d. You can see more Word Clouds from the FOMC topic modeling example here. Beyond observing the most probable words in a topic, a more comprehensive observation-based approach called \u2018Termite\u2019 has been developed by Stanford University researchers. Termite is described as \u201ca visualization of the term-topic distributions produced by topic models\u201d [1]. In this description, \u2018term\u2019 refers to a \u2018word\u2019, so \u2018term-topic distributions\u2019 are \u2018word-topic distributions\u2019. Termite produces meaningful visualizations by introducing two calculations: A \u2018saliency\u2019 measure, which identifies words that are more relevant for the topics in which they appear (beyond mere frequencies of their counts) A \u2018seriation\u2019 method, for sorting words into more coherent groupings based on the degree of semantic similarity between them Termite produces graphs which summarize words and topics based on saliency and seriation. This helps to identify more interpretable topics and leads to better topic model evaluation. You can see example Termite visualizations here. Interpretation-based approaches Interpretation-based approaches take more effort than observation-based approaches but produce better results. These approaches are considered a \u2018gold standard\u2019 for evaluating topic models since they use human judgment to maximum effect. A good illustration of these is described in a research paper by Jonathan Chang and others (2009) [2] which developed \u2018word intrusion\u2019 and \u2018topic intrusion\u2019 to help evaluate semantic coherence. 
Word intrusion In word intrusion, subjects are presented with groups of 6 words, 5 of which belong to a given topic and one which does not \u2014 the \u2018intruder\u2019 word. Subjects are asked to identify the intruder word. To understand how this works, consider the group of words: [ dog, cat, horse, apple, pig, cow ] Can you spot the intruder? Most subjects pick \u2018apple\u2019 because it looks different to the others (all of which are animals, suggesting an animal-related topic for the others). Now, consider: [ car, teacher, platypus, agile, blue, Zaire ] Which is the intruder in this group of words? It\u2019s much harder to identify, so most subjects choose the intruder at random. This implies poor topic coherence. Topic intrusion Similar to word intrusion, in topic intrusion subjects are asked to identify the \u2018intruder\u2019 topic from groups of topics that make up documents. In this task, subjects are shown a title and a snippet from a document along with 4 topics. Three of the topics have a high probability of belonging to the document while the remaining topic has a low probability \u2014 the \u2018intruder\u2019 topic. As for word intrusion, the intruder topic is sometimes easy to identify and at other times not. The success with which subjects can correctly choose the intruder helps to determine the level of coherence. While evaluation methods based on human judgment can produce good results, they are costly and time-consuming to do. Moreover, human judgment isn\u2019t clearly defined and humans don\u2019t always agree on what makes a good topic. In contrast, the appeal of quantitative metrics is the ability to standardize, automate and scale the evaluation of topic models. Held out likelihood or perplexity A traditional metric for evaluating topic models is the \u2018held out likelihood\u2019, also referred to as \u2018perplexity\u2019. This is calculated by splitting a dataset into two parts \u2014 a training set and a test set. The idea is to train a topic model using the training set and then test the model on a test set which contains previously unseen documents (ie. held out documents). Likelihood is usually calculated as a logarithm, so this metric is sometimes referred to as the \u2018held out log-likelihood\u2019. The perplexity metric is a predictive one. It assesses a topic model\u2019s ability to predict a test set after having been trained on a training set. In practice, around 80% of a corpus may be set aside as a training set with the remaining 20% being a test set. Although the perplexity metric is a natural choice for topic models from a technical standpoint, it does not provide good results for human interpretation. This was demonstrated by research, again by Jonathan Chang and others (2009), which found that perplexity did not do a good job of conveying whether topics are coherent or not. When comparing perplexity against human judgment approaches like word intrusion and topic intrusion, the research showed a negative correlation. This means that as the perplexity score improves (ie. the held out log-likelihood is higher), the human interpretability of topics and topic mixes get worse (rather than better). The perplexity metric therefore appears to be misleading when it comes to the human understanding of topics and topic mixes. Are there better quantitative metrics than perplexity for evaluating topic models? Coherence One of the shortcomings of perplexity is that it does not capture context, ie. 
perplexity does not capture the relationship between words in a topic or topics in a document. The idea of semantic context is important for human understanding. To overcome this, approaches have been developed that attempt to capture context between words in a topic. They use measures such as the conditional likelihood (rather than the log-likelihood) of the co-occurrence of words in a topic. These approaches are collectively referred to as \u2018coherence\u2019. There\u2019s been a lot of research on coherence over recent years and as a result there are a variety of methods available. A useful way to deal with this is to set up a framework that allows you to choose the methods that you prefer. Such a framework has been proposed by researchers at AKSW. Using this framework, which we\u2019ll call the \u201ccoherence pipeline\u201d, you can calculate coherence in a way that works best for your circumstances (eg. based on availability of a corpus, speed of computation etc). The coherence pipeline offers a versatile way to calculate coherence. It is also what Gensim, a popular package for topic modeling in Python, uses for implementing coherence (more on this later). The coherence pipeline is made up of four stages: Segmentation Probability estimation Confirmation Aggregation These four stages form the basis of coherence calculations and work as follows: Segmentation sets up the word groupings that are used for pair-wise comparisons. Let\u2019s say that we wish to calculate the coherence of a set of topics. Coherence calculations start by choosing words within each topic (usually the most frequently occurring words) and comparing them with each other, one pair at a time. Segmentation is the process of choosing how words are grouped together for these pair-wise comparisons. Word groupings can be made up of single words or larger groupings. For single words, each word in a topic is compared with each other word in the topic. For 2-word or 3-word groupings, each 2-word group is compared with each other 2-word group, or each 3-word group is compared with each other 3-word group, and so on. Comparisons can also be made between groupings of different size, for instance single words can be compared with 2-word or 3-word groups. Probability estimation refers to the type of probability measure that underpins the calculation of coherence. To illustrate, consider the two widely used coherence approaches of UCI and UMass: UCI is based on point-wise mutual information (PMI) calculations. This is given by: PMI(wi,wj) = log[(P(wi,wj) + e) / P(wi).P(wj)], for words wi and wj and some small number e, and where P(wi) is the probability of word i occurring in a topic and P(wi,wj) is the probability of both words i and j appearing in a topic. Here, the probabilities are based on word co-occurrence counts. UMass caters for the order in which words appear and is based on the calculation of: log[(P(wi,wj) + e) / P(wj)], with wi, wj, P(wi) and P(wi,wj) as for UCI. Here, the probabilities are conditional, since P(wi|wj) = [(P(wi,wj) / P(wj)], which we know from Bayes\u2019 theorem. So, this approach measures how much a common word appearing within a topic is a good predictor for a less common word in the topic. Confirmation measures how strongly each word grouping in a topic relates to other word groupings (ie. how similar they are). There are direct and indirect ways of doing this, depending on the frequency and distribution of words in a topic. Aggregation is the final step of the coherence pipeline. 
It\u2019s a summary calculation of the confirmation measures of all the word groupings, resulting in a single coherence score. This is usually done by averaging the confirmation measures using the mean or median. Other calculations may also be used, such as the harmonic mean, quadratic mean, minimum or maximum. Coherence is a popular way to quantitatively evaluate topic models and has good coding implementations in languages such as Python (eg. Gensim). To see how coherence works in practice, let\u2019s look at an example. Gensim is a widely used package for topic modeling in Python. It uses Latent Dirichlet Allocation (LDA) for topic modeling and includes functionality for calculating the coherence of topic models. As mentioned, Gensim calculates coherence using the coherence pipeline, offering a range of options for users. The following example uses Gensim to model topics for US company earnings calls. These are quarterly conference calls in which company management discusses financial performance and other updates with analysts, investors and the media. They are an important fixture in the US financial calendar. The following code calculates coherence for the trained topic model in the example: Calculating the coherence score using Gensim The coherence method that was chosen in this example is \u201cc_v\u201d. This is one of several choices offered by Gensim. Other choices include UCI (\u201cc_uci\u201d) and UMass (\u201cu_mass\u201d). For more information about the Gensim package and the various choices that go with it, please refer to the Gensim documentation. Gensim can also be used to explore the effect of varying LDA parameters on a topic model\u2019s coherence score. This helps to select the best choice of parameters for the model. The following code shows how to calculate coherence for varying values of the alpha parameter in the LDA model: Investigating coherence by varying the alpha parameter The above code also produces a chart of the model\u2019s coherence score for different values of the alpha parameter: Topic model coherence for different values of the alpha parameter. Image by Author. This helps in choosing the best value of alpha based on coherence scores. In practice, you would also want to check the effect of varying other model parameters on the coherence score. You can see how this was done in the US company earning call example here. The overall choice of parameters would depend on balancing the varying effects on coherence, and also on judgment about the nature of the topics and the purpose of the model. Despite its usefulness, coherence has some important limitations. According to Matti Lyra, a leading data scientist and researcher, the key limitations are: Variability \u2014 The aggregation step of the coherence pipeline is typically calculated over a large number of word-group pairs. While this produces a metric (eg. mean of the coherence scores), there\u2019s no way of estimating the variability of the metric. This means that there\u2019s no way of knowing the degree of confidence in the metric. Hence, although we can calculate aggregate coherence scores for a topic model, we don\u2019t really know how well that score reflects the actual coherence of the model (relative to statistical noise). Comparability \u2014 The coherence pipeline allows the user to select different methods for each part of the pipeline. 
This, combined with the unknown variability of coherence scores, makes it difficult to meaningfully compare different coherence scores, or coherence scores between different models. Reference corpus \u2014 The choice of reference corpus is important. In cases where the probability estimates are based on the reference corpus, then a smaller or domain-specific corpus can produce misleading results when applied to set of documents that are quite different to the reference corpus. \u201cJunk\u201d topics \u2014 Topic modeling provides no guarantees about the topics that are identified (hence the need for evaluation) and sometimes produces meaningless, or \u201cjunk\u201d, topics. These can distort the results of coherence calculations. The difficulty lies in identifying these junk topics for removal \u2014 it usually requires human inspection to do so. But involving humans in the process defeats the very purpose of using coherence, ie. to automate and scale topic model evaluation. With these limitations in mind, what\u2019s the best approach for evaluating topic models? This article has hopefully made one thing clear \u2014 topic model evaluation isn\u2019t easy! Unfortunately, there\u2019s no straight forward or reliable way to evaluate topic models to a high standard of human interpretability. Also, the very idea of human interpretability differs between people, domains and use cases. Nevertheless, the most reliable way to evaluate topic models is by using human judgment. But this takes time and is expensive. In terms of quantitative approaches, coherence is a versatile and scalable way to evaluate topic models, notwithstanding its limitations. In practice, you\u2019ll need to decide how to evaluate a topic model on a case-by-case basis, including which methods and process to use. A degree of domain knowledge and a clear understanding of the purpose of the model will help. The thing to remember is that some sort of evaluation can be important in helping you assess the merits of your topic model and how to apply it. Topic model evaluation is an important part of the topic modeling process. This is because topic modeling offers no guidance on the quality of topics produced. Evaluation helps you assess how relevant the produced topics are, and how effective the topic model is. Evaluating topic models is unfortunately difficult to do. There are various approaches available, but the best results come from human interpretation. This is a time-consuming and costly exercise. Quantitative evaluation methods offer the benefits of automation and scaling. Coherence is the most popular of these and is easy to implement in widely used coding languages, such as with Gensim in Python. In practice, the best approach for evaluating topic models will depend on the circumstances. Domain knowledge, an understanding of the model\u2019s purpose, and judgment will help in deciding the best evaluation approach. Topic modeling is an area of ongoing research \u2014 newer, better ways of evaluating topic models are likely to emerge. In the meantime, topic modeling continues to be a versatile and effective way to analyze and make sense of unstructured text data. And with the continued use of topic models, evaluation will remain an important part of the process. [1] J. Chuang, C. D. Manning and J. Heer, Termite: Visualization Techniques for Assessing Textual Topic Models (2012), Stanford University Computer Science Department [2] J. 
Chang et al, Reading Tea Leaves: How Humans Interpret Topic Models (2009), Neural Information Processing Systems"
}
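Rather than printing the raw response text, it is usually more convenient to parse it as JSON and pick out the fields you need. Below is a minimal sketch, reusing the request from above; the field names are taken from the sample response shown here:

import requests

url = "https://api.promptapi.com/pipfeed"
headers = {"apikey": "YOUR_API_KEY"}
payload = "https://towardsdatascience.com/topic-model-evaluation-3c43e2308526"

response = requests.post(url, headers=headers, data=payload)

if response.status_code == 200:
    article = response.json()
    # Field names as they appear in the sample response above
    print("Title:     ", article["title"])
    print("Authors:   ", ", ".join(article["authors"]))
    print("Published: ", article["publishedAt"])
    print("Categories:", article["predictedCategories"])
    print("Summary:   ", article["summary"][:300])
else:
    print("Request failed:", response.status_code, response.text)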

Now let’s combine the code and save the results to a CSV file.

import requests
import feedparser
import csv
import json

url = "https://api.promptapi.com/pipfeed"

headers = {
    "apikey": "YOUR_API_KEY"
}

# File we want to save the articles to
csv_file = "articles.csv"


def extract_article(article_url):
    # Send the article URL as the request body and parse the JSON response
    payload = article_url
    response = requests.request("POST", url, headers=headers, data=payload)
    result = response.text
    return json.loads(result)


def save_to_csv(extracted_articles):
    # Use the keys of the first article as the CSV header row
    keys = extracted_articles[0].keys()
    with open(csv_file, 'w', newline='') as output_file:
        dict_writer = csv.DictWriter(output_file, keys)
        dict_writer.writeheader()
        dict_writer.writerows(extracted_articles)


NewsFeed = feedparser.parse("https://towardsdatascience.com/feed")
extracted_articles = list()

print("Total entries found in feed: " + str(len(NewsFeed.entries)) + "\n")

for i, entry in enumerate(NewsFeed.entries):
    print(str(i) + ": Extracting url: " + entry.link)
    extracted_article = extract_article(entry.link)
    extracted_articles.append(extracted_article)

print(extracted_articles)
print("Saving articles to csv")
save_to_csv(extracted_articles)

The above code will extract all the articles that appear in the RSS feed for https://towardsdatascience.com/feed and save them to a CSV file called “articles.csv”.

You can now use this data for training models, analytics, or whatever else you like. Let us know what you think about the API and the tutorial in the comments.
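For example, here is a quick way to load the saved file back and inspect it (a minimal sketch that assumes the optional pandas package is installed; the column names come from the fields in the API response):

import pandas as pd

# Load the articles we saved in the previous step
df = pd.read_csv("articles.csv")

print(df.shape)  # number of articles and columns
print(df[["title", "publishedAt", "category"]].head())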
