Can someone help me understand this NLP python assignment, it's really important | Learn AI Together

young kite Apr 14, 2023, 7:42 AM

Assignment:

When constructing an NLP pipeline for news scraped from various sources, it is crucial to first eliminate unnecessary and redundant sentences from the news content. This step is important as it ensures that the output generated by the subsequent summarization and named entity recognition models is accurate and relevant. By removing irrelevant and repetitive sentences, the NLP pipeline can focus on the important information and provide a clearer and more concise summary of the news.

A quick solution that can be deployed to achieve this is TF-IDF, a numerical statistic commonly used in natural language processing and information retrieval. At the sentence level, it can be used to identify the most important sentences within a document by measuring the relevance of each sentence based on the frequency of specific words or phrases. By assigning a weight to each sentence based on its TF-IDF score, it is possible to rank the sentences in order of importance and use this ranking to generate a summary of the document or perform other NLP tasks.
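
A minimal sketch of the sentence-level TF-IDF idea described above, written in plain Python so it has no dependencies. It assumes each sentence is treated as its own "document", uses a naive regex sentence splitter, and scores a sentence by the mean TF-IDF weight of its distinct words. The function names and the scoring rule are illustrative choices, not something the assignment specifies; in practice scikit-learn's `TfidfVectorizer` would be the usual tool.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercase word tokens; punctuation is dropped.
    return re.findall(r"[a-z0-9']+", sentence.lower())


def tfidf_sentence_scores(text):
    """Return (sentence, score) pairs, treating each sentence as a document."""
    sentences = split_sentences(text)
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Document frequency: number of sentences each word appears in.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scored = []
    for sentence, tokens in zip(sentences, tokenized):
        if not tokens:
            scored.append((sentence, 0.0))
            continue
        tf = Counter(tokens)
        # Term frequency times a smoothed inverse document frequency.
        total = sum(
            (count / len(tokens)) * (math.log((1 + n) / (1 + df[word])) + 1)
            for word, count in tf.items()
        )
        # Mean weight per distinct word in the sentence.
        scored.append((sentence, total / len(tf)))
    return scored


# Made-up example text, not taken from dataset.csv.
text = (
    "The council approved the budget. "
    "The council approved the budget on Tuesday after a long debate. "
    "Click here to subscribe."
)
scores = tfidf_sentence_scores(text)
for sentence, score in sorted(scores, key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {sentence}")
```

How to turn the ranking into a keep/remove decision (a fixed cutoff, keeping the top k sentences, and so on) is a separate choice that would need tuning on the actual data.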

Other techniques include TextRank, Latent Semantic Analysis, Latent Dirichlet Allocation, and the use of sentence embeddings for similarity computation and clustering, such as using cosine similarity to compare the vector representations of sentences.

These vector representations can be obtained through techniques such as bag-of-words or more advanced methods like BERT embeddings.
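
As a rough illustration of the similarity approach, the sketch below builds bag-of-words vectors and uses cosine similarity to flag a sentence as redundant when it is too close to one that was already kept. The `threshold=0.8` value and the helper names are assumptions for illustration; BERT or other sentence embeddings could replace `bow` without changing the rest of the logic.

```python
import math
import re
from collections import Counter


def bow(sentence):
    # Bag-of-words vector: word -> count.
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def drop_redundant(sentences, threshold=0.8):
    """Keep the first sentence of each near-duplicate group; return (kept, removed)."""
    kept, removed, kept_vectors = [], [], []
    for sentence in sentences:
        vector = bow(sentence)
        if any(cosine(vector, other) >= threshold for other in kept_vectors):
            removed.append(sentence)
        else:
            kept.append(sentence)
            kept_vectors.append(vector)
    return kept, removed


# Made-up sentences for illustration.
kept, removed = drop_redundant([
    "The mayor opened the new bridge on Monday.",
    "On Monday the mayor opened the new bridge.",
    "Ticket prices will rise next year.",
])
```

Here the second sentence uses exactly the same words as the first, so it ends up in `removed`, while the unrelated third sentence is kept.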

dataset of news: dataset.csv

Leveraging any of the above methods or anything else you can come up with, do a test-train split of 10-90 and generate the cleaned responses for the test set.

Submission:
The code must be submitted to a GitHub repository that includes a README file describing the methodology used and a table with at least these specified columns for the test set.

languid mountain Apr 14, 2023, 2:11 PM

young kite Assignment: When constructing an NLP pipeline for news scraped from various sou...

It helps if you can describe what you understand and what you don't understand.

young kite Apr 14, 2023, 4:38 PM

languid mountain It helps if you can describe what do you understand and what do you not understa...

Oh yeah, sorry about that. I have a dataset containing news headlines,
so what I don't understand is what I should make from this data.

Like, do I have to predict a summary from the headlines, or just remove the duplicate ones?

@languid mountain I need to submit the assignment in the above format; it's a table containing these 4 columns.

languid mountain Apr 14, 2023, 6:29 PM

young kite Like Do I have to predict summary from the headlines or just remove the duplicat...

From what I can understand, each row can contain multiple sentences. It's saying that, given the original content, which could be anywhere from one to many sentences, you do your processing so that you only retain the more meaningful sentences, and those are what you show in new content. In removed lines, you show which lines you removed. If you didn't remove anything, new content should be the same as original content, and removed lines should be n/a or None. Further metrics probably just means which metrics you used for the technique of your choice (TF-IDF, LDA, or something else).
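
Under that interpretation, one row of the output table could be produced roughly as below. The column names (`original_content`, `new_content`, `removed_lines`, `further_metrics`) are guesses based on this message, since the actual table was only shared as an image, and the exact-duplicate filter is just a placeholder for whichever technique (TF-IDF, similarity, and so on) is actually used.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def clean_row(original, technique="exact-duplicate filter"):
    """Build one table row: what was kept, what was removed, and how."""
    seen = set()
    kept, removed = [], []
    for sentence in split_sentences(original):
        key = sentence.lower()
        if key in seen:
            removed.append(sentence)  # repeated sentence: drop it
        else:
            seen.add(key)
            kept.append(sentence)
    return {
        "original_content": original,
        "new_content": " ".join(kept),
        # "None" when nothing was removed, as suggested above.
        "removed_lines": " | ".join(removed) if removed else "None",
        "further_metrics": technique,
    }


# Made-up example content.
row = clean_row("Rain is expected. Rain is expected. Roads may flood.")
print(row)
```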

I think you should still clarify with your instructor, but that's how I interpret it.

young kite Apr 14, 2023, 7:03 PM

Thanks, will clarify that. So this table is to show the preprocessing, I guess, and the use of sklearn should be shown in the code only.

@languid mountain when I used data.nunique(), it showed that all the values are unique.

This is a part of the dataset.

My main doubt is how to preprocess this data; there are no null values and no duplicates.

languid mountain Apr 14, 2023, 8:03 PM

young kite What my main doubt is how to preprocess this data , there are no null values and...

It’s not asking you to do those. It’s just asking you to process all the titles and content.

Each title and each piece of content only needs to retain its important sentences; that’s what it’s talking about.

This is a start, but by no means the best possible work: https://www.analyticsvidhya.com/blog/2021/12/how-to-extract-key-phrases-using-tfidf-with-python/

Analytics Vidhya

Ali Mansour

Fast and Effective ways to Extract Keyphrases using TFIDF with Python

We are going to learn how to extract keywords from text documents in a smooth and simple way step by step, using TFIDF in Python.

young kite Apr 15, 2023, 1:22 PM

@languid mountain I did that. I searched on Kaggle for related datasets and found some good notebooks, but if I preprocess both, what will be my testing parameter?
