#nlp

placid ridge
#

Any other Natural Language Processing nerds in this discord? 🙋‍♂️

sullen stone
#

Hello Grey

marsh tiger
#

Hello Guys! How are we doing!

tawdry remnant
#

yoh !!

haughty pond
#

Yep. I'm here

frozen plinth
graceful elm
#

Hello there,
I'm a beginner-level undergraduate student in NLP, and I'd like your guidance on a topic: "Automatic literature review generation from Research Papers". How should I approach it and generate the review?

  • I have 3 publications in the deep learning domain (on imagery, though), so I should be able to follow your ideas or guidance.
    Thanks to all
neon torrent
#

Hello guys great to be here

void zenith
#

Hey Guys

dreamy karma
graceful elm
#

Thanks a lot for your response mate

dreamy karma
# graceful elm This one - given a specific paper, to generate a summarised text that you can us...

This sounds like a specialized text summarisation problem.
You could use extractive or abstractive text summarisation on the given paper.
For a more specialized model, you could fine-tune it on a dataset of papers and their lit. reviews, maybe abstracts?
Further, if you want to add your own paper's influence into the review as context, you could use an LLM with Retrieval Augmented Generation.

These are very rudimentary ideas, as you experiment you should get something more concrete, or maybe someone else here can improve on these
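
A rough sketch of the extractive option, just to make it concrete. It scores each sentence by the average frequency of its words and keeps the top ones. This is a deliberately crude baseline (no stopword handling, no model); the function name and the scoring rule are my own assumptions for illustration, not a recommended method.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 2) -> str:
    """Pick the highest-scoring sentences, scored by average word frequency."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Keep the top sentences, then restore their original order.
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in top)


paper_text = (
    "Transformers use attention. "
    "Attention lets transformers weigh tokens. "
    "I like tea."
)
summary = extractive_summary(paper_text, num_sentences=1)
```

An abstractive or fine-tuned model would replace the `score` function entirely; the surrounding "split, rank, select" flow is the part that carries over.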

dreamy karma
odd marlin
#

Hey there, I was wondering, how tedious is it to create embeddings for NLP? I may want to create a gig for exactly that.

dreamy karma
odd marlin
dreamy karma
#

You can use basically any open source language model to get embeddings for a piece of text. Embeddings are a latent space representation, an intermediate product of how most language models process text.
Depending on your use case, of course, the quality of the embeddings and of the embedding space will differ for each model.
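
To make "latent space representation" a bit more tangible, here is a toy sketch: per-token vectors are averaged into one text embedding, and texts are compared by cosine similarity. The 3-dimensional vectors are made up by hand; a real model would produce them (with hundreds of dimensions), and mean pooling is only one common way to combine them.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size text embedding."""
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / len(token_vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical token vectors, invented purely for illustration.
cat = mean_pool([[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]])
kitten = mean_pool([[0.8, 0.2, 0.1]])
invoice = mean_pool([[0.0, 0.1, 0.9]])

related = cosine_similarity(cat, kitten)
unrelated = cosine_similarity(cat, invoice)
```

Here `related` comes out much larger than `unrelated`, which is the property you rely on for search or clustering.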

marble tide
#

I have a question about fine-tuning a MobileBERT model. Should I use a teacher model for that as well? Or could I just train MobileBERT directly using a low learning_rate? I am trying this now, but my val_root_mean_squared_error is 10x as big as what is expected with the best parameters found so far... any advice or help would be greatly appreciated! It's my first time doing this. A real learning experience though. Thanks! 😄

acoustic atlas
#

I want to create an LLM for NLP and question answering based on a collection of PDFs and text. I want to talk to someone about this architecturally, to see if there are any examples I could follow and what architectures I should implement based on state-of-the-art stuff like Llama 2 and/or LangChain with OpenAI. Please someone message me so I can finally get started.

acoustic atlas
#

i passed a JSON doc into the model using langchain, and when running inference with the RAG model it does not seem to respond based on the JSON data specifically, even for questions that should have a good answer in it. Do i need to restructure my JSON data away from the nested dict style to something more condensed?

dreamy karma
# acoustic atlas i passed a json doc into model using langchain and when making an inference of t...

it's difficult to pinpoint the error based on this much information.
Langchain's popular use case is to provide a convenient pipeline for your flow of data and some templates.

Are you never getting answers relevant to the context from your data-store? Then maybe there's a problem with the data pipeline.
If you are getting answers related to information in the data-store, but not relevant enough to the question, then you might have problems with the quality of the semantic similarity based retrieval (or MMR or whatever method you're using)
If you are getting relevant answers but it doesn't answer the question effectively, then you might have a problem with your prompt template, structure of context-data, or the strength of the model.
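
Since MMR came up: a minimal pure-Python sketch of what maximal marginal relevance does, in case it helps when reasoning about why retrieved chunks look redundant. The function names and toy vectors are made up for illustration, and this is not LangChain's implementation, just the textbook selection rule.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalise similarity to anything already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.11], [0.8, 0.6]]  # docs 0 and 1 are near-duplicates
pure_relevance = mmr(query, docs, k=2, lambda_mult=1.0)
diverse = mmr(query, docs, k=2, lambda_mult=0.3)
```

With `lambda_mult=1.0` the two near-duplicates are returned; lowering it trades some relevance for a second chunk that adds new information.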

There are many other possible scenarios I think. Like I said, difficult to pinpoint with this much information.
Good luck!

acoustic atlas
dreamy karma
#

where are you actually using retrieve_info() to get context for your LLMChain query

acoustic atlas
dreamy karma
#

I see! And what is off about the results? They're pulled from the database but not relevant to the context? Or it's relevant to the context but doesn't align to the full context?
Or is it only that the context is identified and retrieved properly but you want richer, higher quality responses?

acoustic atlas
#

i ask it a specific question about the database and it seems to not pull the information from that source sometimes it hallucinates and seems to bs a bit @dreamy karma

#

I cann show u a demo but id have to make u sign an NDA @dreamy karma

#

I wonder if its the way im templating or something idk but there is a lot of json data i gave it like over 400 lines @dreamy karma

dreamy karma
dreamy karma
#

Your vectordb is a combination of the json and 2 CSVs
Have you inspected the final chunks in the vector DB? What is the size of each chunk?
How many chunks are needed to typically answer a query?
What is the total number of tokens in that number of chunks
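
A quick way to get those numbers, as a sketch: it assumes you can pull the chunk texts out of your store as a list of strings, and it approximates tokens by whitespace-separated words (a real tokenizer will give higher counts). `chunk_stats` and `top_k` are names I made up here.

```python
def chunk_stats(chunks, top_k=4):
    """Summarise chunk sizes; tokens are approximated by whitespace-separated words."""
    token_counts = [len(chunk.split()) for chunk in chunks]
    return {
        "num_chunks": len(chunks),
        "min_tokens": min(token_counts),
        "max_tokens": max(token_counts),
        "mean_tokens": sum(token_counts) / len(token_counts),
        # Rough upper bound on context size if the top_k largest chunks are retrieved.
        "worst_case_context_tokens": sum(sorted(token_counts, reverse=True)[:top_k]),
    }


example_chunks = ["a b c", "d e", "f"]
stats = chunk_stats(example_chunks, top_k=2)
```

If `worst_case_context_tokens` is close to the model's context window, or `min_tokens` is tiny, the chunking is worth revisiting before touching the prompt.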

acoustic atlas
#

did it before but not sure i can x2 check @dreamy karma

dreamy karma
# acoustic atlas did it before but not sure i can x2 check @dreamy karma

That would be something I'd check.
Also I see you're adding the Document objects together to get the final object that you pass to faiss
iirc the FAISS.from_documents() method already takes a list of documents as input, so there's no need to add them together using the + operator.
The Document class defines 2 fields for the text and metadata respectively for each document so I'm not sure what the behaviour would be on just adding these objects together with the + operator, the Document class doesn't define this afaik.
Though if I recall correctly, the pinecone vectorstore raises an error on attempting this.

So that would also be a suggestion, instead of adding them with the + operator, just construct a list and pass that list to the from_documents method

acoustic atlas
dreamy karma
#

I think a

documents = [doc1, doc2, doc3]
FAISS.from_documents(documents, embeddings)

instead, would work

dreamy karma
acoustic atlas
dreamy karma
#

Faiss is more of a knn library than a full vectorstore, so I anticipate before going to prod you'd have to shift to a proper vectorstore anyway.
Qdrant claims to be one of the most performant, with both self-hosted and cloud options.
For prototyping you could use the in-memory mode of the qdrant client with langchain, which needs very little setup, much like what you're doing with faiss

#

Though this doesn't look like a vectorstore problem right now, so you might want to decide that later

acoustic atlas
dreamy karma
#

When you use a document loader it converts your data to string, the contents of which depend on the structure and size of your original data.
I don't have the original data, so you'll get a better idea by inspecting whether the context returned makes sense, how it is chunked, whether semantically related segments are in the same chunk, how many chunks are retrieved, etc
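
One thing that can make nested JSON easier to inspect (and to chunk sensibly) is flattening it into "path: value" lines before it goes anywhere near the splitter. This is only a sketch of that idea with a made-up record; whether it helps depends on your data, and `flatten` is a hypothetical helper, not a langchain function.

```python
import json


def flatten(obj, prefix=""):
    """Yield 'path: value' lines for every leaf in nested JSON-like data."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield f"{prefix}: {obj}"


# Invented example record, standing in for the real data.
raw = '{"product": {"name": "Widget", "prices": [{"region": "EU", "amount": 10}]}}'
lines = list(flatten(json.loads(raw)))
```

Each line keeps its full key path, so a chunk that contains `product.prices[0].amount: 10` still says what the number refers to even if the rest of the record landed in a different chunk.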

acoustic atlas
#

how to fix this:

Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 10.0 seconds as it raised RateLimitError: Rate limit reached for text-embedding-ada-002 in organization org-kivGQXmzDATD on tokens per min. Limit: 1000000 / min. Current: 0 / min. Contact us through our help center at help.openai.com if you continue to have issues..
#

i fixed it btw @dreamy karma

dreamy karma
acoustic atlas
#

anyone good with making text splitters for json data ?

strong stratus
#

i am new here, i wish to know if anyone has a source where i can learn to make a chatbot

smoky gale
#

I studied NLP basics, then learned about the transformer and implemented it. Now I don't know what to do next.
What do I have to learn and what should I do??

acoustic atlas
#

i want to give my model more useful information so it can support more general queries. The thing is, going through the website, it is difficult to transform the content into one data format. For example, there is raw text and then key-value pairs for commonly asked questions and responses. What do you guys think would be the best way to efficiently collect data like this whilst ensuring a legible data format so the model can digest it?

minor sable
minor sable
acoustic atlas
acoustic atlas
#

what is the best way to get my model to understand basic question prompting im using rag.. Sometimes it requires user to prompt specifically from the data source syntax to get the best response.. I understand

  1. User Feedback Integration System
  2. Fine Tune with Diverse Queries.

What are some other options you guys recommend? Thank you

tardy tapir
#

Has anyone ever used BoxCox (scipy.stats.boxcox) before, in the context of count data?

#

I've always used log transform in the past, but BoxCox seems like a nice alternative?
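
For reference, a small sketch of how the two relate, written in plain Python with a fixed lambda so the formula is visible. Box-Cox is (x^λ - 1)/λ for λ ≠ 0 and log(x) for λ = 0, so the log transform is one special case of it. It needs strictly positive inputs, which matters for counts because they often contain zeros; the +1 shift below is just one common workaround, and the example counts are made up.

```python
import math


def boxcox_transform(values, lmbda):
    """Box-Cox transform for a fixed lambda; inputs must be strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox needs strictly positive data; shift counts first (e.g. +1).")
    if lmbda == 0:
        return [math.log(v) for v in values]
    return [(v ** lmbda - 1) / lmbda for v in values]


counts = [0, 1, 3, 10]
shifted = [c + 1 for c in counts]  # counts often contain zeros

as_log = boxcox_transform(shifted, 0)       # identical to a plain log transform
as_sqrt_like = boxcox_transform(shifted, 0.5)  # milder compression than log
```

scipy.stats.boxcox can also estimate λ from the data when no `lmbda` is passed, which is the main practical difference from always taking the log.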

silk night
#

Hey can anyone assist me with how to go forward with NLP projects?? I have tried pandas and it's too slow at processing my inputs. I need an alternative to process my datasets of sentences. I have used pandarallel, but for some reason it's throwing an error that I can't solve.

olive socket
#

Hi everyone, I want to fine-tune a multimodal LLM which can generate icons based on input text (a text-to-image task, but not with diffusion-based models). I'm not able to find open-source models for it.. any idea?

glad cobalt
#

Hi everyone, I started in the field of machine learning about 4 months ago. I learned Tensorflow for NLP. I know how to classify and generate text, but I feel lost in this field. What do you advise me to continue with?

odd marten
#

hey

#

i have been trying to perform irony detection with this dataset and i am barely reaching 75% anyone got any ideas on how to do this

silk night
#

What's that can you explain?

hot wedge
#

Does anyone know where i can find datasets for SeamlessM4T model?

weak harness
#

Hi, guys
we can combine stt, chatgpt and tts to make voice chatbot
if we use streaming mode for stt and chatgpt, users can get response within 2 seconds and it can be useful.
I have developed AI call center by using Twilio and a voice chatbot.
I think it can take a role of human callers

jagged sparrow
weak harness
tardy depot
#

Hi guys
I am trying to extract people that are mentioned in an audio file.
I use Google Speech to Text to transcribe the audio and then I am using spacy for NER.
The results are quite inaccurate and I've realized that spacy does not identify well names when they are lowercase.

As the text comes from Speech to Text, it is not perfectly formatted and hence spacy is having some issues.

Do you guys suggest anything?

karmic skiff
outer bridge
#

Are there any paper reading sessions here?

unique laurel
#

hello there everyone, i'm currently working on a resume parsing task which is part of a resume matching project. there's little information about this kind of project, so i was wondering if someone who has worked on similar projects, or who has any ideas that could help, can clarify something for me. thank you in advance

unreal spindle
#

Hi, I'm using reranking after retrieval in my RAG pipeline, but there are some cases where the reranking step gives very low scores, for questions like "what are my products?". Any ideas on how to work around this? I was thinking about weighting the results I get from my vector DB, but I'm not quite sure how to use the reranking scores for that
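
To make the weighting idea concrete, something like this is what I have in mind: rescale both score lists to [0, 1] and take a weighted sum. It's only a sketch, the names `fuse` and `alpha` are made up, and min-max scaling is an assumption on my part (the raw scores from the vector DB and the reranker are on different scales).

```python
def min_max(scores):
    """Rescale scores to [0, 1]; constant lists map to 0.0."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]


def fuse(retrieval_scores, rerank_scores, alpha=0.5):
    """Weighted sum of normalised retrieval and rerank scores; returns ranked indices."""
    r = min_max(retrieval_scores)
    k = min_max(rerank_scores)
    combined = [alpha * a + (1 - alpha) * b for a, b in zip(r, k)]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)


# Toy scores: the reranker is unsure about everything (all values tiny).
retrieval = [0.9, 0.8, 0.1]
rerank = [0.02, 0.03, 0.01]
balanced = fuse(retrieval, rerank, alpha=0.5)
retrieval_only = fuse(retrieval, rerank, alpha=1.0)
```

Note that min-max scaling stretches even tiny rerank differences to the full range, so when the reranker's scores are all low it may be safer to push `alpha` towards 1.0 and lean on the retrieval order.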

copper thunder
#

How to create a Meta-like AI using Llama 2

#

So that when the user uses words like "generate" or "image", it generates an image

#

Else talk

frosty summit
lucid veldt
#

guys, i have a corpus of 5,572 documents and when i am converting those documents into vectors using Word2Vec, i am getting 7 vectors which are completely empty and i am unable to figure out why, it would be really appreciated if anyone would help me in this

copper thunder
#

Be more specific

#

What's the prob v

lucid veldt
#

you mean problem? if yes, then i am classifying a message as a spam or ham

jade bison
#

ham? 🍔

lucid veldt
#

ham means "NOT spam"

jade bison
#

Ah! 💡 From now on I will write HAM in the subject of every email so people know that it's not SPAM 😂

copper thunder
placid ridge
#

Hello everyone.
Please help me to implement this.

#

I want to generate new song from 100 ambient songs.
These songs are the same in their style.

#

Anyone help me to do this.

placid ridge
#

Hi, everyone.
We are planning to deploy Llama3 for our app with millions of users.
How can we achieve this?
And which GPU series or cloud platforms are best for achieving high speed and scalability?

copper thunder
#

Prolly google cloud

tardy depot
#

has anyone used vector databases for mobile applications? chromadb, pinecone, ...?

zenith vale
steep abyss
#

Hi, everybody.
I have a question
I need to extract the abstract of papers not using GPT4, I have to rely on local resource.

from py_pdf_parser.loaders import load_file
from py_pdf_parser.components import ElementOrdering

file_path = "JPM-2022-Harvey-25-46.pdf"

# Load the PDF once, with an explicit element ordering
document = load_file(
    file_path, element_ordering=ElementOrdering.RIGHT_TO_LEFT_TOP_TO_BOTTOM
)

So I parsed the PDF using py_pdf_parser, and I'm going to merge the pieces until I obtain the complete abstract.
I tried to use embedding models for this, but that doesn't work well.
If somebody has a solution for this, please help me.
If I have to use LLMs, the model size should be under 2GB.
Thanks!

gloomy glen
#

Hello everyone, could someone give me a playlist or a course that offers an introduction to NLP?

silver sandal
fierce creek
#

Hi, Any playlist for LLMs and RAGs please....their easy to understand explanation, implementation and fine tuning

jade matrix
mortal raven
glad marsh
#

Hey I am building a terms and condition summariser for software. Is anyone aware of available datasets with the T&Cs and summaries. I am using a Pegasus for text-generation and all it gives out is some gibberish with repeating words.

jade matrix
#

how often n how much we will require regex understanding in nlp ?

bold flicker
#

But recently you could utilize genai tools for generating those regex as long as you know how they work.

jade matrix
royal sable
#

Is anyone familiar with Natural Language Understanding (NLU)?

#

If so, are there any learning resources you can recommend?

#

This is a very specific sub-field of NLP, but I'm finding a dearth of information about it (outside of a $1,750 Stanford course).

#

I'm specifically looking for practical, hands-on learning projects.

#

Just about everything I can find when searching specifically about NLU material is buried under the NLP umbrella.

ivory widget
#

Hi, I am Abdullah. I am an ML engineer and want to join a team to participate in Kaggle competitions

strong sphinx
#

Hi, need help in generation of sequence from a GPT model trained and evaluated on "tiny_shakespeare" dataset(built with PyTorch from scratch) . please DM 🙏

hasty crag
#

Problem/Project I'm Working On:

Given a pair of resume and job description, I'd like to predict if the resume is one of the following categories: "Fit", "Potential Fit", "No Fit".

Dataset: https://huggingface.co/datasets/cnamuangtoun/resume-job-description-fit

Data Distribution:

  • 50% - No Fit labels
  • 25% - Potential Fit labels
  • 25% - Fit labels

Train - 6.24k rows
Test - 1.76k rows

Model Architecture (Siamese BERT Architecture). I read through a paper that used this architecture and I thought of using it.
Paper: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1214/reports/final_reports/report062.pdf

I have attached a picture of the architecture I'm using. To give a brief summary I constructed 3 different variations, 2 of which are already in the image. All of them have two separate bert models which are sharing the same weights (siamese), where they differ is in the implementations:

  1. Model 1: Take absolute difference of both the pooled (cls pooling) embeddings
  2. Model 2: Take cosine similarity between the pooled embeddings
  3. Model 3: Hybrid version that combines both the absolute difference along with cosine similarity.

All 3 of these models then have a Softmax applied on it to predict the category. In terms of the preprocessing done to the text, I've removed links, lemmatized the text, removed special characters etc. just picked up a good enough code from stack overflow and adapted it.
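
To spell out the three heads, here is a toy sketch in plain Python of the features each variant feeds forward. The 2-dimensional vectors stand in for the pooled BERT embeddings and are made up; I am also assuming a linear layer sits between these features and the softmax to produce the 3 class logits, which this sketch does not include.

```python
import math


def abs_difference(u, v):
    """Model 1: element-wise |u - v|, same dimension as the embeddings."""
    return [abs(a - b) for a, b in zip(u, v)]


def cosine_similarity(u, v):
    """Model 2: a single scalar in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_features(u, v):
    """Model 3: |u - v| concatenated with the scalar cosine similarity."""
    return abs_difference(u, v) + [cosine_similarity(u, v)]


def softmax(logits):
    """Numerically stable softmax over class logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


resume_vec = [1.0, 0.0]  # placeholder pooled embedding
job_vec = [0.0, 1.0]     # placeholder pooled embedding
features = hybrid_features(resume_vec, job_vec)
probs = softmax([0.2, 1.5, -0.3])  # placeholder logits for the 3 classes
```

One thing this makes visible: Model 2 passes only one number to the classifier, so it has far less to work with than Models 1 and 3.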

Model Details:

  • Loss Function: CrossEntropy
  • Optimizer: Adam
  • Learning Rate: 5e-5 (From the paper)
  • Num Epochs: 10

Results:

  • Evaluation Accuracy after 10 epochs: 85%
  • Testing Accuracy: 55%

I am unable to get past this 55% testing accuracy no matter how many things I've changed. I am currently using DistilBERT, I tried with BERT as well and am getting similar results. I need some guidance on what to do, right now I know the model is overfitting and based on the little error analysis I've done, the model is misclassifying all the types of labels.

dusk notch
#

Hi I have a simple beginner doubt in transformers

  1. Embedding models that we see in hugging face are they basically encoder or abstraction of encoder part of transformers

  2. If yes then in transformers architecture we see embedding layer, how do we obtain those embeddings (the layer before we apply positional embedding)

keen wave
celest roost
#

Hello
Please show some support for this and I would also love to hear any insights for the posts
https://medium.com/@banupriya.valluvan

wraith galleon
#

Hi guys

sinful ocean
#

Hi everyone,

I’m transitioning from a Neuroscience & Psychology and Data Science background to specialize in NLP/LLMs, and I’d love your advice:

Background:

  • BSc in Neuroscience & Psychology
  • 6-month Data Science specialization (2021)
  • 3 years as a Data Scientist in healthcare, focusing on EEG signal processing and classification. However, many of my projects didn’t reach completion, and I haven’t consistently kept up with the latest methodologies and innovations, so I recognize gaps in my expertise.

Current focus:

  • Completed Coursera’s Generative AI with LLMs course
  • Currently working on a small emotion-detection project (sentence classification) for my GitHub portfolio

Questions:

  • What core skills, tools, or frameworks should I master now to break into the NLP/LLM space and to be the most relevant to employers?
  • Which project ideas or portfolio pieces or experience in general you've found to be relevant in the hiring processes?
  • Any must-read papers, tutorials, or communities you’d recommend?
  • How do you stay up-to-date with all the changes, tools and innovations in the field?

Thanks in advance for your insights!

short sentinel
#

Job Title: Part-Time Senior AI/ML Engineer (Remote)

We are seeking a skilled and experienced Senior AI/ML Engineer to join our remote team on a part-time basis. The ideal candidate will have a strong technical background, excellent communication skills, and the ability to work independently in a fast-paced environment.

Requirements:
-Minimum of 9–10 years of professional software development experience

-Proven experience working effectively in a remote environment

-Advanced English proficiency (C1 or higher); an American accent is preferred

-Availability to work 10–15 hours per week during EST or CST business hours

If you're a highly motivated engineer with a passion for building high-quality software and can commit to a flexible part-time schedule, we’d love to hear from you.
You can connect with me on WhatsApp: +1 (567) 469-5384

short sentinel
#

Job Title: Part-Time Senior AI/ML Engineer (Remote)

We are seeking a skilled and experienced Senior AI/ML Engineer to join our remote team on a part-time basis. The ideal candidate will have a strong technical background, excellent communication skills, and the ability to work independently in a fast-paced environment.

Requirements:
-Minimum of 7–10 years of professional software development experience

-Proven experience working effectively in a remote environment

-Advanced English proficiency (C1 or higher); an American accent is preferred

-Availability to work 10–15 hours per week during EST or CST business hours

If you're a highly motivated engineer with a passion for building high-quality software and can commit to a flexible part-time schedule, we’d love to hear from you.
You can connect with me on WhatsApp: +1 (567) 469-5384

frigid lantern
#

Else i m young and you are my senior i just cant advice you hehe

rigid harbor
#

Code Rehabilitation:
A New Paradigm
I share a concept "code rehabilitation and rehabilitators."

A story:

I appointed an excellent fast chef, a few serving robots, and retained my existing staff—even hired more. Why?

Reason 1: My chef started producing fast, good, tasty recipes with about 90% accuracy in the logistics of placing them in plates or packages to be served. But that means a lot of human interference is still needed.

Reason 2: More training is going on for my robots on how to handle exceptional situations—so there's a growing need for rehab-trainers.

Reason 3: The more demand for my recipes, the more I need marginally more staff and robots. While 90% of the task gets done swiftly (maybe even at "200 percent velocity"!), that remaining 10% requires more people and robots. As demand increases, this marginal staff requirement could become tremendous.

The Moral: An analogical scenario for why I still recruit—but in a different fashion, for a different paradigm. This reminds me of "creative destruction," a concept from economist Joseph Schumpeter (though not a Nobel laureate, his influence on economic thought is monumental). It's a sort of Brahmic cycle—simultaneous destruction and creative activity.

The Temporal Question
But here's the thing: Will this 10% shrink over time? Will next year's AI handle 95%, then 98%? Or will new complexities emerge—new edge cases, security concerns, integration challenges—that keep regenerating that stubborn 10%? The rehabilitation need might persist longer than we think, or it might evolve into something entirely different. Time will tell which way the wind blows.

What Lies Ahead
I foresee a Rehabilitation Engineering Program emerging soon—a structured discipline to train specialists who can bridge the gap between AI-generated output and production-ready systems. These rehabilitators will be the crucial link in our new *creative-destructive cycle.

(*With due regards to Joseph Schumpeter who kindled this idea)

ripe mango
#

I'm looking for a US developer for a collaboration. If anybody is interested, please DM me.

unique raptor
#

Hey there any one

#

i just need a quick help

spice sedge
#

Hey everyone. I'm still learning the whole NLP topic. At the moment, I want to split and classify fantasy stories into narrative parts, like "Battle", "Journey", "Rest", and so on. My plan: after the preprocessing (removing punctuation, lowercasing, lemmatization), I use a POS tagger to identify the verbs. And then I thought I could maybe classify them. But the problem is that the same verb can have different meanings, like "run" (She runs a small business; This machine runs on magic) or "draw" (She drew water from the well; He drew his sword).

Do you know a way I can separate/classify the meanings of the sentences, so I can throw away the verbs I don't need and get higher-quality results? 🙂

I would be rly glad, if anyone has some ideas, which I could try out 🙏

fickle nimbus
#

Hey everyone 👋
Quick question for folks building chatbots / LLM apps here —
how are you currently handling long-term user memory beyond a single session?
Curious what’s actually working in practice (RAG, DB, custom hacks, etc).

green dune
#

Who has experience with an INMP441?

spare fiber
simple flare
shy linden
#

New Dataset Just published!

View: https://www.kaggle.com/datasets/mabubakrsiddiq/clear-bg-ocr-dataset-eng-and-zh-22k-images

🔹 Overview

This dataset contains synthetic OCR images of English and Chinese sentences. Each language is organized in a separate folder with corresponding metadata. The images have clear backgrounds, random fonts and font sizes, and optional blur for variability.

The dataset is designed for OCR research, machine learning, and computer vision tasks. Perfect for training models to recognize text in multiple languages and fonts.

🎨 Features

  • Bilingual dataset: English & Chinese
  • Random fonts: Multiple font options for diversity
  • Random font sizes: Increases model generalization
  • Optional Gaussian blur: Simulates real-world imaging
  • Clear backgrounds: Good for clean OCR training
  • Metadata included: Easy for preprocessing and analysis

💡 Possible Use Cases

  • 🖋️ OCR Model Training: Train models like Tesseract, PaddleOCR, or deep learning OCR pipelines
  • 🤖 Computer Vision Research: Use metadata for font/style classification
  • 🏫 Language Learning Tools: Visual recognition for English or Chinese sentences
  • 🔧 Augmentation Testing: Benchmark text recognition under blur and font variations
  • 🧠 Multi-Lingual OCR Experiments: Test cross-lingual recognition models

⚡ Notes

  • The Chinese text is rendered using Microsoft YaHei and NSimSun fonts for proper character display.
  • The English text uses a variety of fonts for diversity.

Please consider giving an upvote!

tardy orbit
#

We built a Kaggle Search where you can search datasets on Kaggle (or HF) and find the ones that positively or negatively influence your model, based on your prompt. It uses the gradient of the model's loss with respect to the final transformer block's parameters.

Instead of relying on upvotes from folks who may not use the dataset for the same purpose as you, you can specify the model you are training and it will calculate each dataset's influence on it.

https://durinn-concept-explorer.azurewebsites.net/

shy linden
#

New Dataset published!

https://www.kaggle.com/datasets/mabubakrsiddiq/language-identification-dataset-20-languages/data/data/data/data/data
The Language Identification Dataset is a curated collection of approximately 68978 text samples, each paired with a corresponding language label. The dataset was constructed by gathering multilingual text passages from three major sources: the Multilingual Amazon Reviews Corpus, XNLI, and STSb Multi-MT. These sources provide a diverse mix of domains, writing styles, and sentence structures, making the dataset suitable for research and machine learning tasks involving language detection, multilingual NLP, and text classification.

shy linden
simple flare
mighty tiger
#

Anyone interested in giving a talk (45 -60 mins) , at University of Wisconsin - Milwaukee ???
My team is organizing a colloquium as part of the NSF REU Summer 2026 program(Link: https://sites.uwm.edu/reu/about/ ).

The colloquium will focus on Natural Language Processing (NLP), Natural Language Understanding (NLU), and Large Language Models (LLMs), with an audience made up primarily of undergraduate students. It's a wonderful opportunity to inspire the next generation of researchers and share your expertise in a meaningful way!
Kindly dm with your info if you are interested

Regards,
Pamela Martey

normal inlet
chilly hull
#

I want to start learning NLP using PyTorch. Can anyone suggest some good resources or beginner-to-intermediate projects I can build to practice?

shy linden
#

https://www.kaggle.com/datasets/mabubakrsiddiq/urdu-ghazal-dataset-32-poets-and-their-ghazals

The dataset contains poetry by 30 of the greatest Urdu poets. Here they are:

'mirza-ghalib','allama-iqbal','faiz-ahmad-faiz','sahir-ludhianvi','meer-taqi-meer', 'dagh-dehlvi','kaifi-azmi','gulzar','bahadur-shah-zafar','parveen-shakir', 'jaan-nisar-akhtar','javed-akhtar','jigar-moradabadi','jaun-eliya', 'ahmad-faraz','meer-anees','mohsin-naqvi','firaq-gorakhpuri','fahmida-riaz','wali-mohammad-wali', 'waseem-barelvi','akbar-allahabadi','altaf-hussain-hali','ameer-khusrau','naji-shakir','naseer-turabi', 'nazm-tabatabai','nida-fazli','noon-meem-rashid', 'habib-jalib'
Every ghazal is given in three writing systems:

Urdu (Arabic Script)
Hindi (Devanagari script)
English (Latin Script)
Divided into three folders: ur, en and hi.

Potential use cases:

NLP
Meter Detection
Modeling AI to predict the poet given the ghazal or couplet
Have fun with data!

remote mural
quaint agate
#

Hello guys

#

I have a regional contest soon

#

how do I do for NLP?

spice sedge
#

Any recommendations for a free BIO annotation tool?

west garnet
nimble hound
# spice sedge Hey everyone. Im still learning the whole nlp topic. At the moment, I want to sp...

Hey! This is actually a really interesting problem.
POS tagging alone probably won’t be enough here. What you are running into is basically word sense disambiguation, especially with verbs like “run” or “draw” that change meaning depending on context.

One direction you could try is contextual embeddings like BERT or RoBERTa. These models look at the full sentence instead of isolated words, so the meaning of the verb changes based on context automatically.

Another option is semantic role labeling. Tools like AllenNLP can identify roles such as agent, action, and theme, which could map nicely to narrative categories like battle or journey.

You could also skip verb level extraction entirely and treat this as a sentence classification problem. That approach is often simpler and works better in practice.

What does your dataset look like? Is it already labeled or are you planning to create the labels yourself?

spice sedge
# nimble hound Hey! This is actually a really interesting problem. POS tagging alone probably w...

Hi! Thanks for your response and your interest. My dataset is German fairy tales, and it's just plain text. I plan to create the labels myself and have already started annotating the data. But yeah.. it's a lot of work. But you are absolutely right: I have those problems, plus some other problems with names and nouns. So I need labels for different words like the plant "Rapunzel" and the name "Rapunzel". In the end, I want to identify every character in the fairy tales.

My aim is to label enough sentences and fine-tune a spaCy NER model. But I would also like to try your recommended embeddings with BERT. That's the reason I decided on BIO annotation (I hope it was the right decision :D): for fine-tuning some transformer models.

AllenNLP sounds super interesting! I've never heard about this. Thank you very much!

"You could also skip verb level extraction entirely and treat this as a sentence classification problem. That approach is often simpler and works better in practice."
What do you mean by this?

graceful yacht
# spice sedge Hi! Thanks for your response and your interest. My dataset are german Fairytales...

Hey there! You could also try the good old RNN approach in order to make computations easier on a CPU. If you're classifying incredibly large passages, you could try and leave only the sentences with the highest-scoring IDF (to prevent general words like "the" and "run" getting thrown in).
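
Roughly what I mean by keeping the highest-IDF sentences, as a stdlib-only sketch. Each sentence is treated as its own "document" for the IDF counts, and a sentence's score is the average IDF of its words; both choices are my assumptions, and the English example sentences are invented stand-ins for the real fairy tales.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def idf_table(sentences):
    """Inverse document frequency, treating each sentence as a document."""
    n = len(sentences)
    doc_freq = Counter(tok for s in sentences for tok in set(tokenize(s)))
    return {tok: math.log(n / df) for tok, df in doc_freq.items()}


def top_sentences(sentences, keep=2):
    """Keep the sentences with the highest average IDF, in original order."""
    idf = idf_table(sentences)

    def score(sentence):
        toks = tokenize(sentence)
        return sum(idf[t] for t in toks) / max(len(toks), 1)

    ranked = sorted(sentences, key=score, reverse=True)[:keep]
    return [s for s in sentences if s in ranked]


story = [
    "The knight drew his sword.",
    "The king went to the castle.",
    "The dragon breathed fire.",
]
kept = top_sentences(story, keep=2)
```

Words that appear in every sentence (like "the" here) get an IDF of zero, so they stop contributing to the score.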

You could also just try the simplest non-neural approach, something like Naive Bayes or pure TF-IDF and see how that goes. I think for classifying genres that would be fine.

Otherwise, I wouldn't recommend jumping straight to BERT or similar classifiers. I doubt 110M parameters are going to make classification any more accurate. You could try sBERT. I found https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 to be good for sentence-to-vector tasks, used it in my paper once, and you could just "glue" a head onto it and fine-tune. It has 22M parameters, so it is smaller and more than enough for the job.

TL;DR: Try non-neural approaches first (Naive Bayes, TF-IDF), then RNNs (GRU or similar). If RNNs don't work as well, try sBERT and only then something bigger.
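
For the first step of that list, a from-scratch multinomial Naive Bayes sentence classifier fits in a few lines, so here is a sketch using only the standard library. The four training sentences and the "Battle"/"Journey" labels are toy examples I made up (in English, not from the actual German dataset); with real data you would want far more examples per label.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"\w+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, sentences, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(tokenize(sentence))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, sentence):
        total = sum(self.label_counts.values())

        def log_prob(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.label_counts[label] / total)
            for tok in tokenize(sentence):
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            return score

        return max(self.label_counts, key=log_prob)


train_sentences = [
    "The knight drew his sword and attacked",
    "The army fought the dragon",
    "They walked through the forest for days",
    "She rode her horse to the distant city",
]
train_labels = ["Battle", "Battle", "Journey", "Journey"]
model = NaiveBayes().fit(train_sentences, train_labels)
```

Usage is `model.predict("He drew his sword")`. The whole sentence is classified at once, so an ambiguous verb like "drew" is weighed together with the words around it instead of being judged on its own.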