#Can someone help me understand this NLP Python assignment? It's really important


young kite
#

Assignment:

When constructing an NLP pipeline for news scraped from various sources, it is crucial to first eliminate unnecessary and redundant sentences from the news content. This step is important as it ensures that the output generated by the subsequent summarization and named entity recognition models is accurate and relevant. By removing irrelevant and repetitive sentences, the NLP pipeline can focus on the important information and provide a clearer and more concise summary of the news.

A quick solution for this is TF-IDF, a numerical statistic commonly used in natural language processing and information retrieval. At the sentence level, it can be used to identify the most important sentences within a document: each word is weighted by how often it appears in a sentence and how rare it is across the other sentences, and the weights are combined into a score for the sentence. By assigning a weight to each sentence based on its TF-IDF score, it is possible to rank the sentences in order of importance and use this information to generate a summary of the document or perform other NLP tasks.
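To make the sentence-level scoring concrete, here is a minimal standard-library sketch. It is only one possible reading of the assignment text: the splitter, the tokenizer, the smoothed IDF formula and all function names are assumptions of this example, and in practice scikit-learn's `TfidfVectorizer` would normally do the weighting.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens; no stop-word removal or stemming."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def tfidf_sentence_scores(sentences):
    """Score each sentence as the sum of TF * IDF over its terms,
    treating every sentence as its own small 'document'."""
    token_lists = [tokenize(s) for s in sentences]
    n = len(token_lists)
    # Document frequency: the number of sentences each term appears in.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    scores = []
    for tokens in token_lists:
        if not tokens:
            scores.append(0.0)
            continue
        tf = Counter(tokens)
        # TF is the length-normalised count; IDF is smoothed as
        # log((1 + n) / (1 + df)) + 1 so that no term gets weight 0.
        score = sum(
            (count / len(tokens)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        )
        scores.append(score)
    return scores


# Hypothetical example text, not taken from dataset.csv.
example = "The council met on Tuesday. The council met again. Taxes will rise by two percent."
example_sentences = split_sentences(example)
ranked = sorted(
    zip(tfidf_sentence_scores(example_sentences), example_sentences), reverse=True
)
for score, sentence in ranked:
    print(f"{score:.3f}  {sentence}")
```

Sentences made up mostly of words that appear everywhere in the article score low, so the lowest-ranked ones are the candidates for removal. Where to put the cut-off is a choice you would have to justify in the README.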

Other techniques include TextRank, Latent Semantic Analysis, Latent Dirichlet Allocation, and the use of sentence embeddings for similarity computation and clustering, such as using cosine similarity to compare the vector representations of sentences.

These vector representations can be obtained through techniques such as bag-of-words or more advanced methods like BERT embeddings.
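As a rough illustration of the similarity idea, the sketch below builds bag-of-words vectors and drops any sentence that is too close to one already kept. The 0.8 threshold and the function names are assumptions for this example only; BERT embeddings would replace `bag_of_words` but the comparison step would stay the same.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Word-count vector for one sentence, stored as a Counter."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(sentences, threshold=0.8):
    """Keep a sentence only if it is not too similar to any sentence kept so far.
    Returns (kept, removed), both in original order."""
    kept, kept_vectors, removed = [], [], []
    for sentence in sentences:
        vector = bag_of_words(sentence)
        if any(cosine_similarity(vector, kv) >= threshold for kv in kept_vectors):
            removed.append(sentence)
        else:
            kept.append(sentence)
            kept_vectors.append(vector)
    return kept, removed
```

Bag-of-words ignores word order, so two sentences that use the same words in a different order count as duplicates here. That is usually what you want for repeated news lines, but it is a limitation worth mentioning in the methodology.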

dataset of news: dataset.csv

Leveraging any of the above methods or anything else you can come up with, do a test-train split of 10-90 and generate the cleaned responses for the test set.
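One way to read the split is 10% of the rows for the test set and 90% for training. A minimal sketch under that assumption, using only the standard library; the function name and the fixed seed are choices of this example, and the column layout of dataset.csv is not assumed at all.

```python
import random


def split_test_train(rows, test_fraction=0.10, seed=42):
    """Shuffle row indices with a fixed seed and return (test_rows, train_rows).
    The fixed seed makes the split reproducible between runs."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    # Sort each index group so the rows keep their original relative order.
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return [rows[i] for i in test_idx], [rows[i] for i in train_idx]


# Usage idea (not executed here): load the rows first, for example with
# csv.DictReader over dataset.csv, then call split_test_train(rows).
```

If the cleaning method is unsupervised (TF-IDF ranking, similarity filtering), the training part is only needed for fitting things like vocabulary or IDF weights, and the cleaned output is reported for the test rows.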

Submission:
The code must be submitted to a GitHub repository that includes a README file describing the methodology used and a table with at least these specified columns for the test set.

languid mountain
young kite
#

Like, do I have to predict a summary from the headlines, or just remove the duplicate ones?

#

| Original Content | New Content | Removed Lines | Further Metrics |
| | | | |
| | | | |
| | | | |

#

@languid mountain I need to submit the assignment in the above format. It's a table containing these 4 columns.

languid mountain
# young kite Like Do I have to predict summary from the headlines or just remove the duplicat...

From what I can understand, each row can contain multiple sentences. It's saying that, given the original content, which could be anywhere from one to many sentences, you do your processing so that you only retain the more meaningful sentences, and those are what you show in New Content. In Removed Lines, you show which lines you removed. If you didn't remove anything, New Content should be the same as Original Content, and Removed Lines should be n/a or None. Further Metrics probably just means which metrics you used for the technique of your choice (TF-IDF, LDA, or something else).
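A small sketch of how one row of that table could be put together. The `keep_sentence` predicate stands in for whatever technique you pick, and the text placed in Further Metrics is only a guess at what the instructor wants, so treat both as assumptions.

```python
def build_row(original_sentences, keep_sentence, method):
    """Build one table row from a list of sentences.

    keep_sentence: hypothetical callable returning True for sentences to retain.
    method: free-text name of the technique, reported under Further Metrics.
    """
    flags = [bool(keep_sentence(s)) for s in original_sentences]
    kept = [s for s, keep in zip(original_sentences, flags) if keep]
    removed = [s for s, keep in zip(original_sentences, flags) if not keep]
    return {
        "Original Content": " ".join(original_sentences),
        "New Content": " ".join(kept),
        # If nothing was removed, report None instead of an empty cell.
        "Removed Lines": " ".join(removed) if removed else "None",
        "Further Metrics": f"{method}; kept {len(kept)} of {len(original_sentences)} sentences",
    }
```

When nothing is filtered out, New Content comes out identical to Original Content and Removed Lines is "None", which matches the interpretation above.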

#

I think you should still clarify with your instructor, but that's how I interpret it

young kite
#

Thanks, I will clarify that. So this table is to show the preprocessing, I guess, and the use of sklearn should be shown in the code only?

#

@languid mountain when I used data.nunique(), it showed that all the values are unique

#

This is a part of the dataset

#

My main doubt is how to preprocess this data. There are no null values and no duplicates.

languid mountain
#

Each title and each content only needs to retain its important sentences; that's what the assignment is talking about. The redundancy is between sentences inside a single row, not between rows.
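To show why a row-level check like data.nunique() can report everything as unique while there is still something to clean, here is a tiny sketch that looks for repeated sentences inside one piece of text. The normalisation and the naive sentence splitting are assumptions of this example.

```python
import re


def normalize(sentence):
    """Lowercase and strip punctuation so trivially different copies match."""
    return " ".join(re.findall(r"[a-z0-9']+", sentence.lower()))


def repeated_sentences(text):
    """Return the sentences whose normalised form already appeared earlier
    in the same text."""
    seen, repeats = set(), []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        key = normalize(sentence)
        if not key:
            continue
        if key in seen:
            repeats.append(sentence)
        else:
            seen.add(key)
    return repeats
```

Two rows can differ as whole strings and still each contain the same sentence twice, so running a check like this per row is a quick way to see how much redundancy the dataset actually has.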

young kite
#

@languid mountain I did that. I searched on Kaggle for related datasets and found some good notebooks, but if I preprocess both, what will my testing parameter be?