#nlp
Hello Grey
Hello Guys! How are we doing!
yoh !!
Yep. I'm here
Hello there,
I'm a beginner-level UG student in the NLP domain, and I'd like your guidance on a topic: "Automatic literature review generation from research papers". How should I approach it and generate the review?
- I have 3 publications in the deep learning domain (imagery, though), so I will be able to grasp your ideas or guidance.
Thanks to all
Hello guys great to be here
Hey Guys
What exactly do you want to do? Given your own research paper text without a literature review section, find suitable papers for literature review?
Or given a specific paper, to generate a summarised text that you can use as a literature review of that paper in your own paper's literature review section?
This one - given a specific paper, to generate a summarised text that you can use as a literature review of that paper in your own paper's literature review section
Thanks a lot for your response mate
This sounds like a specialized text summarisation problem.
You could use extractive or abstractive text summarisation on the given paper.
For a more specialized model, you could fine-tune it on a dataset of papers and their lit. reviews, maybe abstracts?
Further, if you want to add your own paper's influence into the review as context, you could use an LLM with Retrieval Augmented Generation.
These are very rudimentary ideas, as you experiment you should get something more concrete, or maybe someone else here can improve on these
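To make the extractive option a bit more concrete, here is a minimal sketch that scores sentences by average word frequency and keeps the top ones. The stopword list, the naive sentence splitter, and the sample `paper_text` are all assumptions for illustration; a real system would more likely use a proper tokenizer or a fine-tuned model.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption, not exhaustive).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are",
             "we", "on", "for", "that", "this", "with", "it"}

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def extractive_summary(text, max_sentences=2):
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Pick the top-scoring sentences, then restore their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)

# Hypothetical paper text, made up for this example.
paper_text = (
    "We propose a transformer model for image segmentation. "
    "The model uses attention over image patches. "
    "Our funding came from several sources. "
    "Experiments show the transformer model improves segmentation accuracy."
)
summary = extractive_summary(paper_text, max_sentences=2)
print(summary)
```

This only selects existing sentences; the abstractive and RAG ideas above would need a trained language model on top.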
No problem! Atb with your work
Thanks mate ❤️
Hey there, I was wondering, how tedious it is to create embeddings for NLPs? I may want to create a gig for exactly that.
There are enterprise APIs and open source models to create embeddings for a given text. Obviously each of these generated embeddings will be specific to the embedding space of that particular model checkpoint.
What problem are you trying to solve?
Oh I was wondering if this is something I can offer to Fiverr as a gig, but if there are APIs or models to just quickly generate embeddings, I suppose never mind lol
You can use basically any open source language model to get embeddings for a piece of text. Embeddings are a latent space representation, and an intermediate of the process for how most language models work.
Depending on your use case, of course, the quality of the embeddings and the embedding space will differ for each model.
I have a question about fine-tuning a mobilebert model. Should I use a teacher model for doing that as well? Or could I just train the mobilebert directly using a low learning_rate? I am trying this now, but I have a val_root_mean_squared_error x10 as big as what is expected with the best parameters found so far... any advice or help would be greatly appreciated! It's my first time doing this. Really learningful experience though. Thanks! 😄
I want to create an LLM that is for NLP and question answering based on a collection of PDFs and text. I want to talk to someone about this architecturally to see if there are any examples I could follow and what architectures I should implement based on state-of-the-art stuff like Llama 2 and/or LangChain with OpenAI. Please, someone message me so I can finally get started.
i passed a json doc into model using langchain and when making an inference of the RAG model it does not seem to be responding based off the json data specifically, when asking it certain questions that should have a good response. Do i need to restructure my Json Data away from the nested dict style to something more condensed?
it's difficult to pinpoint the error based on this much information.
Langchain's popular use case is to provide a convenient pipeline for your flow of data and some templates.
Are you never getting answers relevant to the context from your data-store? Then maybe there's a problem with the data pipeline.
If you are getting answers related to information in the data-store, but not relevant enough to the question, then you might have problems with the quality of the semantic similarity based retrieval (or MMR or whatever method you're using)
If you are getting relevant answers but it doesn't answer the question effectively, then you might have a problem with your prompt template, structure of context-data, or the strength of the model.
There are many other possible scenarios I think. Like I said, difficult to pinpoint with this much information.
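Since MMR came up as a retrieval method: a rough sketch of maximal marginal relevance over plain Python lists, just to show the trade-off between relevance and diversity. The vectors, `k`, and `lambda_mult` values are made up for illustration, and this is not LangChain's implementation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy 2-d "embeddings": docs 0 and 1 are near-duplicates, doc 2 is different.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure relevance ranking
print(mmr(query, docs, k=2, lambda_mult=0.3))  # penalises near-duplicates more
```

With `lambda_mult=1.0` this reduces to plain similarity ranking; lowering it trades relevance for diversity, which can help when the top hits are near-duplicate chunks.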
Good luck!
data pipeline is fine. could be problem with semantic sim retrieval and prompt. here is both of my stuff.
where are you actually using retrieve_info() to get context for your LLMChain query
def generate_response(message):
    best_practice = retrieve_info(message)
    response = chain.run(message=message, best_practice=best_practice)
    return response
I see! And what is off about the results? They're pulled from the database but not relevant to the context? Or it's relevant to the context but doesn't align to the full context?
Or is it only that the context is identified and retrieved properly but you want richer, higher quality responses?
i ask it a specific question about the database and it seems to not pull the information from that source sometimes it hallucinates and seems to bs a bit @dreamy karma
I can show u a demo but id have to make u sign an NDA @dreamy karma
I wonder if its the way im templating or something idk but there is a lot of json data i gave it like over 400 lines @dreamy karma
Yeah that's why I was trying to ask generic enough questions xd
Yeah that's why I asked "....doesn't align to the full context" for cases where maybe the entire context can't fit into the model's context limit
Your vectordb is a combination of the json and 2 CSVs
Have you inspected the final chunks in the vector DB? What is the size of each chunk?
How many chunks are needed to typically answer a query?
What is the total number of tokens in that number of chunks
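A minimal sketch of that check, assuming you can get the chunk texts as plain strings (with LangChain these would come from each Document's `page_content`). The 4-characters-per-token ratio is only a rough heuristic for English text, and `chunk_report`, the sample chunks, and the limit are hypothetical; use the model's real tokenizer for exact counts.

```python
def approx_token_count(text):
    # Rough heuristic (assumption): about 4 characters per token for English text.
    return max(1, len(text) // 4)

def chunk_report(chunks, chunks_per_query, context_limit):
    """Summarise chunk sizes and whether a worst-case retrieval fits the context."""
    sizes = [len(c) for c in chunks]
    tokens = [approx_token_count(c) for c in chunks]
    # Worst case: the largest chunks happen to be retrieved together.
    worst_case = sum(sorted(tokens, reverse=True)[:chunks_per_query])
    return {
        "num_chunks": len(chunks),
        "max_chars": max(sizes, default=0),
        "worst_case_tokens": worst_case,
        "fits_context": worst_case <= context_limit,
    }

# Made-up chunks standing in for what the vector DB holds.
example_chunks = ["a" * 400, "b" * 800, "c" * 1200]
report = chunk_report(example_chunks, chunks_per_query=2, context_limit=400)
print(report)
```

If `fits_context` comes out false, part of the retrieved context is probably being truncated before the model sees it.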
did it before but not sure i can x2 check @dreamy karma
That would be something I'd check.
Also I see you're adding the Document objects together to get the final object that you pass to faiss
iirc the faiss.from_documents() method already takes a list of documents as input, so there's no need to add them together using the + operator.
The Document class defines 2 fields for the text and metadata respectively for each document so I'm not sure what the behaviour would be on just adding these objects together with the + operator, the Document class doesn't define this afaik.
Though if I recall correctly, the pinecone vectorstore raises an error on attempting this.
So that would also be a suggestion, instead of adding them with the + operator, just construct a list and pass that list to the from_documents method
do u have any documentation on this? using FAISS, struggling to get the chunk size
num_chunks = db.ntotal  # Get the total number of chunks (or vectors)
for chunk_idx in range(num_chunks):
    chunk_info = db.get_chunk_info(chunk_idx)
    chunk_size = chunk_info['size']
    print(f"Chunk {chunk_idx + 1}: Size = {chunk_size}")
I think a
documents = [doc1, doc2, doc3]
FAISS.from_documents(documents, embeddings)
instead, would work
I generally work with the qdrant vectorstore when working with langchain so I'm not sure about the exact method for this
how do u like qdrant, is it easy to implement?
Faiss is more of a knn library than a full vectorstore, so I anticipate before going to prod you'd have to shift to a proper vectorstore anyway.
Qdrant claims to be one of the most performant, with both self-hosted and cloud options.
For prototyping you could use the in-memory mode of the qdrant client with langchain, which needs very little setup, much like what you're doing with faiss
Though this doesn't look like a vectorstore problem right now, so you might want to decide that later
does it matter what data format (json or jsonl) which ones do u usually do
I would suggest you to look at some samples of what retrieve_info() is returning and whether that makes sense for your query
When you use a document loader it converts your data to string, the contents of which depend on the structure and size of your original data.
I don't have the original data, so you'll get a better idea by inspecting whether the context returned makes sense, how it is chunked, whether semantically related segments are in the same chunk, how many chunks are retrieved, etc
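For nested JSON specifically, one thing worth trying is flattening each record into readable "path: value" lines yourself, so that one record becomes one chunk and related fields stay together. A rough sketch, where the sample data and the function names are made up for illustration:

```python
import json

def flatten(obj, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf in nested dicts/lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}{key}.")
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from flatten(value, f"{prefix}{index}.")
    else:
        yield prefix.rstrip("."), obj

def records_to_chunks(records):
    # One text chunk per top-level record, one "path: value" line per leaf.
    chunks = []
    for record in records:
        lines = [f"{path}: {value}" for path, value in flatten(record)]
        chunks.append("\n".join(lines))
    return chunks

# Hypothetical nested data standing in for the real JSON file.
raw = json.loads(
    '[{"product": {"name": "Widget", "price": 9.5}, "tags": ["small", "blue"]},'
    ' {"product": {"name": "Gadget", "price": 20}, "tags": []}]'
)
chunks = records_to_chunks(raw)
print(chunks[0])
```

Printing a few chunks like this is also a quick way to see what the embedding model is actually being given.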
how to fix this:
Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 10.0 seconds as it raised RateLimitError: Rate limit reached for text-embedding-ada-002 in organization org-kivGQXmzDATD on tokens per min. Limit: 1000000 / min. Current: 0 / min. Contact us through our help center at help.openai.com if you continue to have issues..
i fixed it btw @dreamy karma
Oh great, what was the issue?
anyone good with making text splitters for json data ?
i am new here, i wish to know if anyone has a source where i can learn to make a chatbot
I studied NLP basics and now learned about transformer and implemented it, I now didn't know what to do next
what i have to learn and what i should do??
i want to embedd my model with more resourceful information to support more general queries. The thing is going through the website it is difficult to transform the context into one data format. For example, There is raw text and then a value pair for commonly asked questions and response. What do you guys think would be the best way to efficiently collect the data like this whilst ensuring a legible data format so the model can digest it?
Do you mind giving more information about what you want to do?
Are you concerned chunking your data may split question and corresponding answer into different chunks?
Were you finally able to resolve this?
I mean, what more can I say I think I said all the information
what is the best way to get my model to understand basic question prompting im using rag.. Sometimes it requires user to prompt specifically from the data source syntax to get the best response.. I understand
- User Feedback Integration System
- Fine Tune with Diverse Queries.
What are some other options you guys recommend? Thank you
Has anyone ever used BoxCox (scipy.stats.boxcox) before, in the context of count data?
I've always used log transform in the past, but BoxCox seems like a nice alternative?
Hey, can anyone assist me with how to go forward with NLP projects? I have tried pandas and it's too slow in processing my inputs. I need an alternative to process my sentence datasets. I have used pandarallel but for some reason it's throwing an error that I can't solve.
Hi everyone I'm want to fine tune a MultiModal llm which can generate icons based on input text, text to image task but not diffusion based models, I'm not able to find open-source models for it.. any idea?
Hi everyone, I started in the field of machine learning about 4 months ago. I learned Tensorflow for NLP. I know how to classify and generate text, but I feel lost in this field. What do you advise me to continue with?
hey
i have been trying to perform irony detection with this dataset and i am barely reaching 75% anyone got any ideas on how to do this
Maybe check out dask
What's that can you explain?
Does anyone know where i can find datasets for SeamlessM4T model?
Hi, guys
we can combine stt, chatgpt and tts to make voice chatbot
if we use streaming mode for stt and chatgpt, users can get response within 2 seconds and it can be useful.
I have developed AI call center by using Twilio and a voice chatbot.
I think it can take a role of human callers
Sounds Nice!! if you are okay share the GitHub repo, algorithm, or steps you have followed, plz share them with us.
I am sorry. But it is not free.
Hi guys
I am trying to extract people that are mentioned in an audio file.
I use Google Speech to Text to transcribe the audio and then I am using spacy for NER.
The results are quite inaccurate and I've realized that spacy does not identify names well when they are lowercase.
As the text comes from Speech to Text, it is not perfectly formatted and hence spacy is having some issues.
Do you guys suggest anything?
hey, everybody.is there a channel to discuss the dataset?
Are there any paper reading sessions here?
hello there everyone i'm currently working on a resume parsing task which is a part of a resume matching project there's little information about this project so i was wondering if someone can clarify something for me someone wh worked on similar projects or who has any ideas that help . thank you in advance
Were you able to do it?
Hi, I'm using reranking after retrieval in my RAG pipeline, but there are some cases where the reranking step gives very low scores for questions like "what are my products?". Any ideas on how to work around this? I was thinking about weighting the results I get from my vector DB, but I'm not quite sure how to use the reranking scores for that
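One way the weighting idea could look: normalise both score lists to the same range, then blend them, so a uniformly low reranker output does not wipe out the retrieval signal. This is only a sketch; it assumes both scores are "higher is better", and the scores and `weight` are made-up values.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a flat list maps to 0.5 everywhere."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(vector_scores, rerank_scores, weight=0.5):
    """Return document indices ordered by a weighted blend of both scores.

    weight=1.0 trusts only the vector DB, weight=0.0 only the reranker.
    """
    v = min_max(vector_scores)
    r = min_max(rerank_scores)
    combined = [weight * a + (1 - weight) * b for a, b in zip(v, r)]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)

# Made-up scores: the reranker outputs are all tiny but still ordered.
order = fuse([0.82, 0.80, 0.40], [0.02, 0.05, 0.01], weight=0.5)
print(order)
```

Because of the normalisation, only the relative order of the reranker scores matters here, not their absolute size.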
How to create a Meta-like AI using Llama 2
So that when the user uses "generate image" words it generates an image
Else it talks
if button.text == 'generate':
    generate_image()
else:
    talk()
guys, i have a corpus of 5,572 documents and when i am converting those documents into vectors using Word2Vec, i am getting 7 vectors which are completely empty and i am unable to figure out why, it would be really appreciated if anyone would help me in this
you mean problem? if yes, then i am classifying a message as a spam or ham
ham? 🍔
ham means "NOT spam"
Ah! 💡 From now on I will write HAM in the subject of every email so people know that it's not SPAM 😂
Welcome to our deep dive into the Top 10 Deep Learning Algorithms! In this video, we break down each algorithm with a concise 10-word explanation. Perfect for beginners and experts alike, you'll get a quick yet comprehensive overview of the most impactful algorithms in the world of deep learning.
Top 10 Deep Learning Algorithms:
Convolutional ...
Hello everyone.
Please help me to implement this.
I want to generate new song from 100 ambient songs.
These songs are the same in their style.
Anyone help me to do this.
Hi, everyone.
We are planning to deploy Llama3 for our app with millions of users.
How can we achieve this?
And which GPU series or cloud platforms are best for achieving high speed and scalability?
Prolly google cloud
Can you explain more
has anyone used vector databases for mobile applications? chromadb, pinecone, ...?
I used SQLite.. but it is not so good for embeddings, vectors.. 🥲
Hi, everybody.
I have a question
I need to extract the abstract of papers not using GPT4, I have to rely on local resource.
from py_pdf_parser.loaders import load_file
from py_pdf_parser.components import ElementOrdering
document = load_file("JPM-2022-Harvey-25-46.pdf")
file_path = 'JPM-2022-Harvey-25-46.pdf'
document = load_file(
    file_path, element_ordering=ElementOrdering.RIGHT_TO_LEFT_TOP_TO_BOTTOM
)
So I parsed the pdf using py_pdf_parser, and I'm going to merge the pieces until I obtain the complete abstract.
Now I try to use embedding models for this. But that doesn't work well.
If somebody has a solution for this, please help me.
In the case that I have to use the LLM models, the size should be under 2GB.
Thanks!
Hello everyone, could someone give me a playlist or a course that gives an introduction to NLP?
@gloomy glen this is the best playlist ever I think for NLP
https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
Thank you so much 🌹
Hi, Any playlist for LLMs and RAGs please....their easy to understand explanation, implementation and fine tuning
https://course.spacy.io/en good intro to nlp with spacy
sound interesting, can you explain more and give some examples?
Hey I am building a terms and condition summariser for software. Is anyone aware of available datasets with the T&Cs and summaries. I am using a Pegasus for text-generation and all it gives out is some gibberish with repeating words.
how often n how much we will require regex understanding in nlp ?
It comes in handy during the preprocessing stuffs.
But recently you could utilize genai tools for generating those regex as long as you know how they work.
That means one should know them it's actually important thnx
Is anyone familiar with Natural Language Understanding (NLU)?
If so, are there any learning resources you can recommend?
This is a very specific sub-field of NLP, but I'm finding a dearth of information about it (outside of a $1,750 Stanford course).
I'm specifically looking for practical, hands-on learning projects.
Just about everything I can find when searching specifically about NLU material is buried under the NLP umbrella.
HI, I am Abdullah I am an ML engineer want to join any team to particapte in kaggle competions
Hi, need help in generation of sequence from a GPT model trained and evaluated on "tiny_shakespeare" dataset(built with PyTorch from scratch) . please DM 🙏
Problem/Project I'm Working On:
Given a pair of resume and job description, I'd like to predict if the resume is one of the following categories: "Fit", "Potential Fit", "No Fit".
Dataset: https://huggingface.co/datasets/cnamuangtoun/resume-job-description-fit
Data Distribution:
- 50% - No Fit labels
- 25% - Potential Fit labels
- 25% - Fit labels
Train - 6.24k rows
Test - 1.76k rows
Model Architecture (Siamese BERT Architecture). I read through a paper that used this architecture and I thought of using it.
Paper: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1214/reports/final_reports/report062.pdf
I have attached a picture of the architecture I'm using. To give a brief summary I constructed 3 different variations, 2 of which are already in the image. All of them have two separate bert models which are sharing the same weights (siamese), where they differ is in the implementations:
- Model 1: Take absolute difference of both the pooled (cls pooling) embeddings
- Model 2: Take cosine similarity between the pooled embeddings
- Model 3: Hybrid version that combines both the absolute difference along with cosine similarity.
All 3 of these models then have a Softmax applied on it to predict the category. In terms of the preprocessing done to the text, I've removed links, lemmatized the text, removed special characters etc. just picked up a good enough code from stack overflow and adapted it.
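For readers following along, the three ways of combining the pooled embeddings can be sketched on plain Python lists as below. This is only an illustration of the feature construction, not the poster's actual model code; the toy vectors are made up and the BERT encoder and classifier layer are left out.

```python
import math

def abs_difference(u, v):
    # Model 1: element-wise |u - v|, same length as the embeddings.
    return [abs(a - b) for a, b in zip(u, v)]

def cosine_similarity(u, v):
    # Model 2: a single scalar in [-1, 1].
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hybrid_features(u, v):
    # Model 3: concatenate the difference vector with the cosine scalar.
    return abs_difference(u, v) + [cosine_similarity(u, v)]

# Toy 3-d "pooled embeddings" for a resume and a job description.
resume_vec = [1.0, 0.0, 2.0]
job_vec = [0.0, 0.0, 2.0]

features = hybrid_features(resume_vec, job_vec)
print(len(features))
```

Note that the cosine variant yields a single number per pair, so a three-way softmax on top of it would presumably need a learned layer in between to produce three logits.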
Model Details:
- Loss Function: CrossEntropy
- Optimizer: Adam
- Learning Rate: 5e-5 (From the paper)
- Num Epochs: 10
Results:
- Evaluation Accuracy after 10 epochs: 85%
- Testing Accuracy: 55%
I am unable to get past this 55% testing accuracy no matter how many things I've changed. I am currently using DistilBERT, I tried with BERT as well and am getting similar results. I need some guidance on what to do, right now I know the model is overfitting and based on the little error analysis I've done, the model is misclassifying all the types of labels.
Hi I have a simple beginner doubt in transformers
-
Embedding models that we see in hugging face are they basically encoder or abstraction of encoder part of transformers
-
If yes then in transformers architecture we see embedding layer, how do we obtain those embeddings (the layer before we apply positional embedding)
Hi @hasty crag did you find some help to make it better?
Hello
Please show some support for this and I would also love to hear any insights for the posts
https://medium.com/@banupriya.valluvan
Hi guys
Hi everyone,
I’m transitioning from a Neuroscience & Psychology and Data Science background to specialize in NLP/LLMs, and I’d love your advice:
Background:
- BSc in Neuroscience & Psychology
- 6-month Data Science specialization (2021)
- 3 years as a Data Scientist in healthcare, focusing on EEG signal processing and classification. However, many of my projects didn’t reach completion, and I haven’t consistently kept up with the latest methodologies and innovations, so I recognize gaps in my expertise.
Current focus:
- Completed Coursera’s Generative AI with LLMs course
- Currently working on a small emotion-detection project (sentence classification) for my GitHub portfolio
Questions:
- What core skills, tools, or frameworks should I master now to break into the NLP/LLM space and to be the most relevant to employers?
- Which project ideas or portfolio pieces or experience in general you've found to be relevant in the hiring processes?
- Any must-read papers, tutorials, or communities you’d recommend?
- How do you stay up-to-date with all the changes, tools and innovations in the field?
Thanks in advance for your insights!
Job Title: Part-Time Senior AI/ML Engineer (Remote)
We are seeking a skilled and experienced Senior AI/ML Engineer to join our remote team on a part-time basis. The ideal candidate will have a strong technical background, excellent communication skills, and the ability to work independently in a fast-paced environment.
Requirements:
-Minimum of 9–10 years of professional software development experience
-Proven experience working effectively in a remote environment
-Advanced English proficiency (C1 or higher); an American accent is preferred
-Availability to work 10–15 hours per week during EST or CST business hours
If you're a highly motivated engineer with a passion for building high-quality software and can commit to a flexible part-time schedule, we’d love to hear from you.
You can connect with me on WhatsApp: +1 (567) 469-5384
Mmm interesting you should go for some topics in deep learning, just follow major ai accounts socially you will get updates on daily basis
Else i m young and you are my senior i just cant advice you hehe
Code Rehabilitation:
A New Paradigm
I share a concept "code rehabilitation and rehabilitators."
A story:
I appointed an excellent fast chef, a few serving robots, and retained my existing staff—even hired more. Why?
Reason 1: My chef started producing fast, good, tasty recipes with about 90% accuracy in the logistics of placing them in plates or packages to be served. But that means a lot of human interference is still needed.
Reason 2: More training is going on for my robots on how to handle exceptional situations—so there's a growing need for rehab-trainers.
Reason 3: The more demand for my recipes, the more I need marginally more staff and robots. While 90% of the task gets done swiftly (maybe even at "200 percent velocity"!), that remaining 10% requires more people and robots. As demand increases, this marginal staff requirement could become tremendous.
The Moral: An analogical scenario for why I still recruit—but in a different fashion, for a different paradigm. This reminds me of "creative destruction," a concept from economist Joseph Schumpeter (though not a Nobel laureate, his influence on economic thought is monumental). It's a sort of Brahmic cycle—simultaneous destruction and creative activity.
The Temporal Question
But here's the thing: Will this 10% shrink over time? Will next year's AI handle 95%, then 98%? Or will new complexities emerge—new edge cases, security concerns, integration challenges—that keep regenerating that stubborn 10%? The rehabilitation need might persist longer than we think, or it might evolve into something entirely different. Time will tell which way the wind blows.
What Lies Ahead
I foresee a Rehabilitation Engineering Program emerging soon—a structured discipline to train specialists who can bridge the gap between AI-generated output and production-ready systems. These rehabilitators will be the crucial link in our new *creative-destructive cycle.
(*With due regards to Joseph Schumpeter who kindled this idea)
I'm finding a US developer for the collaboration. If anybody interested, please dm me.
Hi @everyone 👋🏻 This is Rammani Pandey.
Let's connect with me on LinkedIn:-
https://www.linkedin.com/posts/rammani-pandey-97302a22b_day9-sqlchallenge-indiar-activity-7394236510352424960-hmbp?utm_source=share&utm_medium=member_android&rcm=ACoAADl9mucBpatBQ_YBq3C6DW6wPrW7snWLC0k
Hey everyone. I'm still learning the whole NLP topic. At the moment, I want to split and classify fantasy stories into narrative parts, like "Battle", "Journey", "Rest", and so on. My plan: after the preprocessing (removing punctuation, lowercasing, lemmatization), I use a POS tagger to identify the verbs. And then I thought maybe I could classify them. But the problem is that the same verb can have different meanings. Like "run" (She runs a small business; This machine runs on magic) or "draw" (She drew water from the well; He drew his sword).
Do you know a way I can separate/classify the meaning of the sentences, to throw away the verbs I don't need, for a higher quality of the results? 🙂
I would be rly glad, if anyone has some ideas, which I could try out 🙏
Hey everyone 👋
Quick question for folks building chatbots / LLM apps here —
how are you currently handling long-term user memory beyond a single session?
Curious what’s actually working in practice (RAG, DB, custom hacks, etc).
Who has experience with a inmp441?
RAG imo is easier to learn so start with RAG is what I would say
https://www.kaggle.com/code/axelmatthew/from-llms-to-agents-a-curated-ai-corpus-study
Hey everyone , pls look on my notebook and give a upvote if u guys like my learning notebook code . Thankyou
New Dataset Just published!
View: https://www.kaggle.com/datasets/mabubakrsiddiq/clear-bg-ocr-dataset-eng-and-zh-22k-images
🔹 Overview
This dataset contains synthetic OCR images of English and Chinese sentences. Each language is organized in a separate folder with corresponding metadata. The images have clear backgrounds, random fonts and font sizes, and optional blur for variability.
The dataset is designed for OCR research, machine learning, and computer vision tasks. Perfect for training models to recognize text in multiple languages and fonts.
🎨 Features
- ✅ Two-lingual dataset: English & Chinese
- ✅ Random fonts: Multiple font options for diversity
- ✅ Random font sizes: Increases model generalization
- ✅ Optional Gaussian blur: Simulates real-world imaging
- ✅ Clear backgrounds: Good for clean OCR training
- ✅ Metadata included: Easy for preprocessing and analysis
💡 Possible Use Cases
- 🖋️ OCR Model Training: Train models like Tesseract, PaddleOCR, or deep learning OCR pipelines
- 🤖 Computer Vision Research: Use metadata for font/style classification
- 🏫 Language Learning Tools: Visual recognition for English or Chinese sentences
- 🔧 Augmentation Testing: Benchmark text recognition under blur and font variations
- 🧠 Multi-Lingual OCR Experiments: Test cross-lingual recognition models
⚡ Notes
- The Chinese text is rendered using Microsoft YaHei and NSimSun fonts for proper character display.
- The English text uses a variety of fonts for diversity.
Please consider giving an upvote!
We built a Kaggle Search where you can search datasets on Kaggle (or HF) and find datasets that positively or negatively influence model based on your prompt. It uses gradient of the model's loss with respect to the final transformer block parameters.
Instead of relying on upvotes from folks that may not utilize the dataset for the same reason as you, you can test what model you are training and it will calculate their influence.
New Dataset published!
https://www.kaggle.com/datasets/mabubakrsiddiq/language-identification-dataset-20-languages/data/data/data/data/data
The Language Identification Dataset is a curated collection of approximately 68978 text samples, each paired with a corresponding language label. The dataset was constructed by gathering multilingual text passages from three major sources: the Multilingual Amazon Reviews Corpus, XNLI, and STSb Multi-MT. These sources provide a diverse mix of domains, writing styles, and sentence structures, making the dataset suitable for research and machine learning tasks involving language detection, multilingual NLP, and text classification.
New Notebook:
https://www.kaggle.com/code/mabubakrsiddiq/language-identification-99-accuracy
The notebook creates a model to identify a language, achieving 99% accuracy
My Notebook :
https://www.kaggle.com/code/axelmatthew/identifying-error-patterns-in-stackoverflow
Hello guys, I would like to share some quite interesting analysis of Stack Overflow insights. Hope you can drop an upvote and fork it if it feels useful for future analytics. Thank you
Anyone interested in giving a talk (45 -60 mins) , at University of Wisconsin - Milwaukee ???
My team is organizing a colloquium as part of the NSF REU Summer 2026 program(Link: https://sites.uwm.edu/reu/about/ ).
The colloquium will focus on Natural Language Processing (NLP), Natural Language Understanding (NLU), and Large Language Models (LLMs), with an audience made up primarily of undergraduate students. It's a wonderful opportunity to inspire the next generation of researchers and share your expertise in a meaningful way!
Kindly dm with your info if you are interested
Regards,
Pamela Martey
Just published a dataset on Google historical stock prices. Would love to hear your feedback
https://www.kaggle.com/datasets/ibrahimshahrukh/google-alphabet-stock-prices-2016-2026
I want to start learning NLP using PyTorch. Can anyone suggest some good resources or beginner-to-intermediate projects I can build to practice?
https://www.kaggle.com/datasets/mabubakrsiddiq/urdu-ghazal-dataset-32-poets-and-their-ghazals
The dataset contains poetry by 30 greatest urdu poets. Here they are:
'mirza-ghalib','allama-iqbal','faiz-ahmad-faiz','sahir-ludhianvi','meer-taqi-meer', 'dagh-dehlvi','kaifi-azmi','gulzar','bahadur-shah-zafar','parveen-shakir', 'jaan-nisar-akhtar','javed-akhtar','jigar-moradabadi','jaun-eliya', 'ahmad-faraz','meer-anees','mohsin-naqvi','firaq-gorakhpuri','fahmida-riaz','wali-mohammad-wali', 'waseem-barelvi','akbar-allahabadi','altaf-hussain-hali','ameer-khusrau','naji-shakir','naseer-turabi', 'nazm-tabatabai','nida-fazli','noon-meem-rashid', 'habib-jalib'
Every ghazal is given in three writing systems:
Urdu (Arabic Script)
Hindi (Hindi writing system)
English (Latin Script)
Divided into three folders: ur, en and hi.
Potential use cases:
NLP
Meter Detection
Modeling AI to predict the poet given the ghazal or couplet
Have fun with data!
Remember not to publish the same thing in multiple channels - you were right to put it in data!
Anyone recommendations for a free BIO-Annotation-Tool?
I've tried Label Studio and doccano personally, they're good 🐣
Nice! Thanks 🙂
Hey! This is actually a really interesting problem.
POS tagging alone probably won’t be enough here. What you are running into is basically word sense disambiguation, especially with verbs like “run” or “draw” that change meaning depending on context.
One direction you could try is contextual embeddings like BERT or RoBERTa. These models look at the full sentence instead of isolated words, so the meaning of the verb changes based on context automatically.
Another option is semantic role labeling. Tools like AllenNLP can identify roles such as agent, action, and theme, which could map nicely to narrative categories like battle or journey.
You could also skip verb level extraction entirely and treat this as a sentence classification problem. That approach is often simpler and works better in practice.
What does your dataset look like? Is it already labeled or are you planning to create the labels yourself?
Hi! Thanks for your response and your interest. My dataset is German fairy tales and it's just plain text. I plan to create the labels myself and have already started annotating the data. But yeah.. it's a lot of work. But you are absolutely right: I have those problems, and also some other problems with names and nouns. So I need labels for different words, like the plant "Rapunzel" and the name "Rapunzel". In the end, I want to identify every figure in the fairy tales.
My aim is to label enough sentences and fine tune a spacy NER-model. But I would also like to try your recommended embeddings with BERT. Thats the reason why I decided for BIO-Annotation (I hope it was the right decision :D) - for fine tuning some transformer models.
AllenNLP sounds super interesting! I've never heard about this. Thank you very much!
You could also skip verb level extraction entirely and treat this as a sentence classification problem. That approach is often simpler and works better in practice.
What do you mean by this?
Hey there! You could also try the good old RNN approach in order to make computations easier on a CPU. If you're classifying incredibly large passages, you could try and leave only the sentences with the highest-scoring IDF (to prevent general words like "the" and "run" getting thrown in).
You could also just try the simplest non-neural approach, something like Naive Bayes or pure TF-IDF and see how that goes. I think for classifying genres that would be fine.
Otherwise, I wouldn't recommend straight up jumping to BERT or such classifiers - I doubt 110M parameters are going to make classification any more accurate. You could try sBERT. I found https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 to be good for sentence--> vector tasks, used it in my paper once, and you could just "glue" a head onto it and fine-tune. It has 22M, so it is smaller and more than enough for the job.
TL;DR: Try non-neural approaches first (Naive Bayes, TF-IDF), then RNNs (GRU or similar). If RNNs don't work as well, try sBERT and only then something bigger.
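To show what "treat it as sentence classification" with a non-neural baseline can look like, here is a small multinomial Naive Bayes written from scratch on bag-of-words counts. The training sentences and labels are invented English examples (the real data would be the annotated German fairy tales), and in practice a library implementation would likely be the better choice.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(sentence):
    # Lowercase word tokens; the character class also covers German umlauts.
    return re.findall(r"[a-zäöüß]+", sentence.lower())

class NaiveBayes:
    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(tokenize(sentence))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, sentence):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(sentence):
                # Laplace (add-one) smoothing so unseen words do not zero out a class.
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical labelled sentences, one narrative label per sentence.
train = [
    ("The knight drew his sword and attacked the dragon", "Battle"),
    ("The soldiers fought the giant with spears", "Battle"),
    ("She walked through the forest for three days", "Journey"),
    ("They rode across the mountains to the castle", "Journey"),
    ("He slept under a tree until morning", "Rest"),
    ("The children rested by the fire and ate bread", "Rest"),
]
model = NaiveBayes().fit([s for s, _ in train], [l for _, l in train])
print(model.predict("The prince drew his sword against the giant"))
print(model.predict("They walked across the forest"))
```

The whole sentence is the unit being labelled here, so ambiguous verbs like "draw" get disambiguated by the words around them ("sword", "giant") instead of needing a separate verb-sense step.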