placid ridge Sep 1, 2023, 9:10 PM

#

Any other Natural Language Processing nerds in this discord? 🙋‍♂️

sullen stone Sep 1, 2023, 9:17 PM

#

Hello Grey

marsh tiger Sep 1, 2023, 9:31 PM

#

Hello Guys! How are we doing!

tawdry remnant Sep 1, 2023, 11:31 PM

#

yoh !!

haughty pond Sep 2, 2023, 12:43 AM

#

Yep. I'm here

frozen plinth Sep 3, 2023, 6:20 PM

#

graceful elm Sep 4, 2023, 8:09 AM

#

Hello there,
I'm a beginner-level UG student in the NLP domain. I seek your guidance regarding a topic: "Automatic literature review generation from Research Papers". Like, how should I approach it and generate the review?

I have 3 publications in the deep learning domain (imagery, though), so I will be able to grasp your ideas or guidance.
Thanks to all

neon torrent Sep 6, 2023, 5:26 PM

#

Hello guys great to be here

void zenith Sep 6, 2023, 6:57 PM

#

Hey Guys

dreamy karma Sep 8, 2023, 2:32 AM

#

graceful elm Hello there, I'm beginner level UG student in NLP domain, I seek your guidance ...

What exactly do you want to do? Given your own research paper text without a literature review section, find suitable papers for literature review?
Or given a specific paper, to generate a summarised text that you can use as a literature review of that paper in your own paper's literature review section?

graceful elm Sep 8, 2023, 6:35 AM

#

dreamy karma What exactly do you want to do? Given your own research paper text without a lit...

This one - given a specific paper, to generate a summarised text that you can use as a literature review of that paper in your own paper's literature review section

#

Thanks a lot for your response mate

dreamy karma Sep 8, 2023, 7:57 AM

#

graceful elm This one - given a specific paper, to generate a summarised text that you can us...

This sounds like a specialized text summarisation problem.
You could use extractive or abstractive text summarisation on the given paper.
For a more specialized model, you could fine-tune it on a dataset of papers and their lit. reviews, maybe abstracts?
Further, if you want to add your own paper's influence into the review as context, you could use an LLM with Retrieval Augmented Generation.

These are very rudimentary ideas, as you experiment you should get something more concrete, or maybe someone else here can improve on these
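To make the extractive option a bit more concrete, here is a minimal sketch. It assumes a plain word-frequency scorer, which is only one naive way to rank sentences and not something anyone in the thread proposed; the function name `extractive_summary` is made up for illustration.

```python
import re
from collections import Counter


def extractive_summary(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text act as a crude importance signal.
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freqs[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)


paper_text = (
    "Transformers use attention. "
    "Attention lets transformers weigh tokens. "
    "The weather was nice."
)
print(extractive_summary(paper_text, max_sentences=2))
```

An abstractive or fine-tuned model would replace the scoring step entirely; this sketch only shows the select-and-reorder shape of an extractive pipeline.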

dreamy karma Sep 8, 2023, 7:57 AM

#

graceful elm Thanks a lot for your response mate

No problem! Atb with your work

graceful elm Sep 8, 2023, 6:03 PM

#

dreamy karma This sounds like a specialized text summarisation problem. You could use extract...

Thanks mate ❤️

odd marlin Sep 14, 2023, 3:36 PM

#

Hey there, I was wondering, how tedious is it to create embeddings for NLP? I may want to create a gig for exactly that.

dreamy karma Sep 15, 2023, 6:04 AM

#

odd marlin Hey there, I was wondering, how tedious it is to create embeddings for NLPs? I m...

There are enterprise APIs and open source models to create embeddings for a given text. Obviously each of these generated embeddings will be specific to the embedding space of that particular model checkpoint.
What problem are you trying to solve?

odd marlin Sep 15, 2023, 8:02 AM

#

dreamy karma There are enterprise APIs and open source models to create embeddings for a give...

Oh I was wondering if this is something I can offer to Fiverr as a gig, but if there are APIs or models to just quickly generate embeddings, I suppose never mind lol

dreamy karma Sep 15, 2023, 8:52 AM

#

You can use basically any open source language model to get embeddings for a piece of text. Embeddings are a latent space representation, and an intermediate of the process for how most language models work.
Depending on your use case, of course, the quality of the embeddings and the embedding space will differ for each model.

marble tide Sep 17, 2023, 6:51 PM

#

I have a question about fine-tuning a mobilebert model. Should I use a teacher model for doing that as well? Or could I just train the mobilebert directly using a low learning_rate? I am trying this now, but I have a val_root_mean_squared_error x10 as big as what is expected with the best parameters found so far... any advice or help would be greatly appreciated! It's my first time doing this. Really learningful experience though. Thanks! 😄

acoustic atlas Sep 23, 2023, 9:31 PM

#

I want to create an LLM that is for NLP and question answering based on a collection of pdfs and text. I want to talk to someone about this architecturally to see if there are any examples I could follow and what architectures I should implement based on state-of-the-art stuff like Llama 2 and/or langchain openai. Please someone message me so I can finally get started.

acoustic atlas Oct 8, 2023, 9:31 PM

#

i passed a json doc into model using langchain and when making an inference of the RAG model it does not seem to be responding based off the json data specifically, when asking it certain questions that should have a good response. Do i need to restructure my Json Data away from the nested dict style to something more condensed?

dreamy karma Oct 14, 2023, 6:25 AM

#

acoustic atlas i passed a json doc into model using langchain and when making an inference of t...

it's difficult to pinpoint the error based on this much information.
Langchain's popular use case is to provide a convenient pipeline for your flow of data and some templates.

Are you never getting answers relevant to the context from your data-store? Then maybe there's a problem with the data pipeline.
If you are getting answers related to information in the data-store, but not relevant enough to the question, then you might have problems with the quality of the semantic similarity based retrieval (or MMR or whatever method you're using)
If you are getting relevant answers but it doesn't answer the question effectively, then you might have a problem with your prompt template, structure of context-data, or the strength of the model.

There are many other possible scenarios I think. Like I said, difficult to pinpoint with this much information.
Good luck!

acoustic atlas Oct 18, 2023, 3:45 AM

#

dreamy karma it's difficult to pinpoint the error based on this much information. Langchain's...

data pipeline is fine. could be problem with semantic sim retrieval and prompt. here is both of my stuff.

dreamy karma Oct 18, 2023, 8:46 AM

#

where are you actually using retrieve_info() to get context for your LLMChain query

acoustic atlas Oct 18, 2023, 1:02 PM

#

dreamy karma where are you actually using `retrieve_info()` to get context for your LLMChain ...

def generate_response(message):
    best_practice = retrieve_info(message)
    response = chain.run(message=message, best_practice=best_practice)
    return response

dreamy karma Oct 18, 2023, 1:15 PM

#

I see! And what is off about the results? They're pulled from the database but not relevant to the context? Or it's relevant to the context but doesn't align to the full context?
Or is it only that the context is identified and retrieved properly but you want richer, higher quality responses?

acoustic atlas Oct 18, 2023, 1:18 PM

#

i ask it a specific question about the database and it seems to not pull the information from that source sometimes it hallucinates and seems to bs a bit @dreamy karma

#

I can show u a demo but id have to make u sign an NDA @dreamy karma

#

I wonder if its the way im templating or something idk but there is a lot of json data i gave it like over 400 lines @dreamy karma

dreamy karma Oct 18, 2023, 1:20 PM

#

acoustic atlas I can show u a demo but id have to make u sign an NDA @dreamy karma

Yeah that's why I was trying to ask generic enough questions xd

dreamy karma Oct 18, 2023, 1:21 PM

#

acoustic atlas I wonder if its the way im templating or something idk but there is a lot of jso...

Yeah that's why I asked "....doesn't align to the full context" for cases where maybe the entire context can't fit into the model's context limit

#

Your vectordb is a combination of the json and 2 CSVs
Have you inspected the final chunks in the vector DB? What is the size of each chunk?
How many chunks are needed to typically answer a query?
What is the total number of tokens in that number of chunks
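A rough way to answer those chunk questions, sketched under assumptions: the chunks are already available as a plain list of strings (how you pull them out of your vector store depends on the store), and tokens are estimated with a "about 4 characters per token" heuristic rather than a real tokenizer. `chunk_report` and `worst_case_tokens` are hypothetical helper names.

```python
def chunk_report(chunks, chars_per_token=4.0):
    """Return per-chunk sizes plus a rough token estimate."""
    report = []
    for index, text in enumerate(chunks):
        report.append({
            "chunk": index,
            "chars": len(text),
            "words": len(text.split()),
            # Heuristic assumption: about 4 characters per token in English.
            "approx_tokens": round(len(text) / chars_per_token),
        })
    return report


def worst_case_tokens(chunks, k, chars_per_token=4.0):
    """Approximate tokens used if the k largest chunks are retrieved together."""
    sizes = sorted(
        (row["approx_tokens"] for row in chunk_report(chunks, chars_per_token)),
        reverse=True,
    )
    return sum(sizes[:k])


# Hypothetical chunks standing in for what the vector DB holds.
chunks = ["a" * 40, "b" * 80, "c" * 8]
for row in chunk_report(chunks):
    print(row)
print("worst case for k=2:", worst_case_tokens(chunks, k=2))
```

Comparing the worst-case figure against the model's context limit (minus the prompt template and the question) gives a quick sanity check on whether retrieved context is being cut off.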

acoustic atlas Oct 18, 2023, 1:27 PM

#

did it before but not sure i can x2 check @dreamy karma

dreamy karma Oct 18, 2023, 1:37 PM

#

acoustic atlas did it before but not sure i can x2 check @dreamy karma

That would be something I'd check.
Also I see you're adding the Document objects together to get the final object that you pass to faiss
iirc the faiss.from_documents() method already takes a list of documents as input, so there's no need to add them together using the + operator.
The Document class defines 2 fields for the text and metadata respectively for each document so I'm not sure what the behaviour would be on just adding these objects together with the + operator, the Document class doesn't define this afaik.
Though if I recall correctly, the pinecone vectorstore raises an error on attempting this.

So that would also be a suggestion, instead of adding them with the + operator, just construct a list and pass that list to the from_documents method

acoustic atlas Oct 18, 2023, 1:38 PM

#

dreamy karma Your vectordb is a combination of the json and 2 CSVs Have you inspected the fin...

do u have any documentation on this? using FAISS, struggling to get the chunk size

num_chunks = db.ntotal  # Get the total number of chunks (or vectors)

for chunk_idx in range(num_chunks):
    chunk_info = db.get_chunk_info(chunk_idx)
    chunk_size = chunk_info['size']
    print(f"Chunk {chunk_idx + 1}: Size = {chunk_size}")

dreamy karma Oct 18, 2023, 1:43 PM

#

I think a

documents = [doc1, doc2, doc3]
FAISS.from_documents(documents, embeddings)

instead, would work

dreamy karma Oct 18, 2023, 1:44 PM

#

acoustic atlas do u have any documentation on this? using FAIISS struggling to build chunk size...

I generally work with the qdrant vectorstore when working with langchain so I'm not sure about the exact method for this

acoustic atlas Oct 18, 2023, 1:52 PM

#

dreamy karma I generally work with the qdrant vectorstore when working with langchain so I'm ...

how do u like qdrant, is it easy to implement?

dreamy karma Oct 18, 2023, 1:57 PM

#

Faiss is more of a knn library than a full vectorstore, so I anticipate before going to prod you'd have to shift to a proper vectorstore anyway.
Qdrant claims to be one of the most performant, with both self-hosted and cloud options.
For prototyping you could use the in-memory mode of qdrant client with langchain, which needs very little setup, much like what you're doing with faiss

#

Though this doesn't look like a vectorstore problem right now, so you might want to decide that later

acoustic atlas Oct 18, 2023, 4:24 PM

#

dreamy karma Faiss is more of a knn library than a full vectorstore, so I anticipate before g...

does it matter what data format (json or jsonl) which ones do u usually do

dreamy karma Oct 19, 2023, 3:00 AM

#

acoustic atlas does it matter what data format (json or jsonl) which ones do u usually do

I would suggest you to look at some samples of what retrieve_info() is returning and whether that makes sense for your query

#

When you use a document loader it converts your data to string, the contents of which depend on the structure and size of your original data.
I don't have the original data, so you'll get a better idea by inspecting whether the context returned makes sense, how it is chunked, whether semantically related segments are in the same chunk, how many chunks are retrieved, etc
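One way to keep semantically related fields in the same chunk is to flatten each top-level record into its own block of `path: value` lines before embedding. This is only a sketch under assumptions: the data is a JSON list of records, and the sample below is invented. `flatten` and `records_to_chunks` are hypothetical helpers, not LangChain APIs.

```python
import json


def flatten(value, prefix=""):
    """Yield (path, leaf) pairs for a nested dict/list structure."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


def records_to_chunks(records):
    """Turn each top-level record into one text chunk of 'path: value' lines."""
    chunks = []
    for record in records:
        lines = [f"{path}: {leaf}" for path, leaf in flatten(record)]
        chunks.append("\n".join(lines))
    return chunks


# Invented sample data, standing in for the real nested JSON.
raw = '[{"product": {"name": "Widget", "tags": ["a", "b"]}, "price": 9.5}]'
chunks = records_to_chunks(json.loads(raw))
print(chunks[0])
```

Printing a few of these chunks next to what the retriever returns makes it easier to see whether a question and the fields that answer it end up together.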

acoustic atlas Oct 19, 2023, 3:02 PM

#

how to fix this:

Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 10.0 seconds as it raised RateLimitError: Rate limit reached for text-embedding-ada-002 in organization org-kivGQXmzDATD on tokens per min. Limit: 1000000 / min. Current: 0 / min. Contact us through our help center at help.openai.com if you continue to have issues..

#

i fixed it btw @dreamy karma

dreamy karma Oct 20, 2023, 2:57 AM

#

acoustic atlas i fixed it btw @dreamy karma

Oh great, what was the issue?

acoustic atlas Oct 20, 2023, 10:56 AM

#

anyone good with making text splitters for json data ?

strong stratus Oct 22, 2023, 9:20 PM

#

i am new here, i wish to know if anyone has a source where i can learn to make a chatbot

smoky gale Oct 23, 2023, 3:43 PM

#

I studied NLP basics and have now learned about the transformer and implemented it. Now I don't know what to do next.
What do I have to learn and what should I do??

acoustic atlas Oct 28, 2023, 4:28 PM

#

i want to embed my model with more resourceful information to support more general queries. The thing is, going through the website it is difficult to transform the content into one data format. For example, there is raw text and then key-value pairs for commonly asked questions and responses. What do you guys think would be the best way to efficiently collect data like this whilst ensuring a legible data format so the model can digest it?

minor sable Oct 28, 2023, 9:03 PM

#

acoustic atlas i want to embedd my model with more resourceful information to support more gene...

Do you mind giving more information about what you want to do?
Are you concerned chunking your data may split question and corresponding answer into different chunks?

minor sable Oct 28, 2023, 9:04 PM

#

acoustic atlas i passed a json doc into model using langchain and when making an inference of t...

Were you finally able to resolve this?

acoustic atlas Oct 29, 2023, 5:03 PM

#

minor sable Do you mind giving more information about what you want to do? Are you concerned...

I mean, what more can I say I think I said all the information

acoustic atlas Oct 31, 2023, 6:42 PM

#

what is the best way to get my model to understand basic question prompting im using rag.. Sometimes it requires user to prompt specifically from the data source syntax to get the best response.. I understand

User Feedback Integration System
Fine Tune with Diverse Queries.

What are some other options you guys recommend? Thank you

tardy tapir Nov 17, 2023, 9:09 AM

#

Has anyone ever used BoxCox (scipy.stats.boxcox) before, in the context of count data?

#

I've always used log transform in the past, but BoxCox seems like a nice alternative?
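For reference, a minimal sketch of the Box-Cox formula itself, in plain Python rather than via `scipy.stats.boxcox`. With lambda = 0 it reduces to the log transform, and it is only defined for strictly positive values, so count data containing zeros needs a shift first. The shift of 1 below is an assumption for illustration, not a recommendation.

```python
import math


def boxcox_transform(values, lmbda):
    """Apply the Box-Cox transform with a fixed lambda to positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox needs strictly positive inputs")
    if lmbda == 0:
        # The lambda -> 0 limit of (x**lambda - 1) / lambda is log(x).
        return [math.log(v) for v in values]
    return [(v ** lmbda - 1) / lmbda for v in values]


counts = [0, 1, 4, 9]               # invented count data
shifted = [c + 1 for c in counts]   # assumption: shift by 1 to avoid zeros

print(boxcox_transform(shifted, 0))    # same as log(count + 1)
print(boxcox_transform(shifted, 0.5))  # square-root-like transform
```

When called without a fixed lambda, `scipy.stats.boxcox` also estimates lambda by maximum likelihood and returns it alongside the transformed data, which is the main practical difference from a hand-picked log transform.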

silk night Dec 23, 2023, 6:15 AM

#

Hey, can anyone assist me with how to go forward with NLP projects?? I have tried pandas and it's too slow at processing my inputs. I need an alternative to process my sentence datasets. I have used pandarallel, but for some reason it's throwing an error that I can't solve.

olive socket Dec 27, 2023, 3:08 AM

#

Hi everyone, I want to fine-tune a multimodal LLM that can generate icons based on input text (a text-to-image task, but not with diffusion-based models). I'm not able to find open-source models for it. Any ideas?

glad cobalt Jan 20, 2024, 12:56 PM

#

Hi everyone, I started in the field of machine learning about 4 months ago. I learned Tensorflow for NLP. I know how to classify and generate text, but I feel lost in this field. What do you advise me to continue with?

odd marten Jan 22, 2024, 3:36 PM

#

hey

#

https://github.com/Cyvhee/SemEval2018-Task3/tree/master/datasets

GitHub

SemEval2018-Task3/datasets at master · Cyvhee/SemEval2018-Task3

This is the Github repository for SemEval-2018 Task 3 - Cyvhee/SemEval2018-Task3

#

i have been trying to perform irony detection with this dataset and i am barely reaching 75% anyone got any ideas on how to do this

dreamy karma Mar 1, 2024, 2:22 AM

#

silk night Hey can anyone assist me for how to go Forward with NLP projects?? Like I have t...

Maybe check out dask

silk night Mar 1, 2024, 4:55 AM

#

What's that can you explain?

hot wedge Mar 4, 2024, 4:29 AM

#

Does anyone know where i can find datasets for SeamlessM4T model?

weak harness Mar 8, 2024, 1:42 AM

#

Hi, guys
we can combine stt, chatgpt and tts to make voice chatbot
if we use streaming mode for stt and chatgpt, users can get response within 2 seconds and it can be useful.
I have developed AI call center by using Twilio and a voice chatbot.
I think it can take a role of human callers

jagged sparrow Mar 9, 2024, 8:50 PM

#

weak harness Hi, guys we can combine stt, chatgpt and tts to make voice chatbot if we use str...

Sounds Nice!! if you are okay share the GitHub repo, algorithm, or steps you have followed, plz share them with us.

weak harness Mar 9, 2024, 10:43 PM

#

jagged sparrow Sounds Nice!! if you are okay share the GitHub repo, algorithm, or steps you hav...

I am sorry. But it is not free.

tardy depot Apr 2, 2024, 8:41 AM

#

Hi guys
I am trying to extract people that are mentioned in an audio file.
I use Google Speech to Text to transcribe the audio and then I am using spacy for NER.
The results are quite inaccurate and I've realized that spacy does not identify well names when they are lowercase.

As the text comes from Speech to Text, it is not perfectly formatted and hence spacy is having some issues.

Do you guys suggest anything?

karmic skiff Apr 9, 2024, 8:21 AM

#

hey, everybody.is there a channel to discuss the dataset?

outer bridge Apr 15, 2024, 6:57 PM

#

Are there any paper reading sessions here?

unique laurel Apr 29, 2024, 3:13 PM

#

hello there everyone, i'm currently working on a resume parsing task which is part of a resume matching project. there's little information about this kind of project, so i was wondering if someone who has worked on similar projects, or who has any ideas that could help, can clarify something for me. thank you in advance

karmic moss May 31, 2024, 6:04 PM

#

unique laurel hello there everyone i'm currently working on a resume parsing task which is a p...

were you able to do it?

unreal spindle Jun 21, 2024, 8:49 AM

#

Hi, I'm using reranking after retrieval in my RAG pipeline, but there are some cases where the reranking step gives very low scores for questions like "what are my products?". Any ideas on how to work around this? I was thinking about weighting the results I get from my vector DB, but I'm not quite sure how to use the reranking scores for that
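One possible way to combine the two signals, sketched under assumptions: both scores are "higher is better", each is min-max normalised within the candidate list for a single query, and they are mixed with a hand-tuned weight `alpha`. The helper names and the sample scores are invented.

```python
def min_max(scores):
    """Rescale scores to [0, 1] within one query's candidate list."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All candidates tie, so this signal carries no ranking information.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(vector_scores, rerank_scores, alpha=0.5):
    """Rank candidates by alpha * rerank + (1 - alpha) * vector similarity."""
    v = min_max(vector_scores)
    r = min_max(rerank_scores)
    combined = [alpha * rs + (1 - alpha) * vs for vs, rs in zip(v, r)]
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return order, combined


# Invented scores: the reranker is "unsure" (all values tiny) on a vague query.
vector_scores = [0.9, 0.8, 0.1]
rerank_scores = [0.01, 0.03, 0.02]
order, combined = fuse(vector_scores, rerank_scores, alpha=0.5)
print(order, combined)
```

Because the normalisation is relative, uniformly low reranker scores no longer discard every candidate; the trade-off is that the absolute score can no longer be used as a "nothing relevant was found" threshold.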

copper thunder Jun 22, 2024, 6:25 PM

#

How to create a Meta-like AI using Llama 2

#

So that when the user uses "generate" or image words, it generates an image

#

Else talk

frosty summit Jun 22, 2024, 6:39 PM

#

copper thunder Tha when the user use generate , image words it generate

if button.text == 'generate':
    generate_image()
else:
    talk()

lucid veldt Jun 24, 2024, 12:44 AM

#

guys, i have a corpus of 5,572 documents and when i am converting those documents into vectors using Word2Vec, i am getting 7 vectors which are completely empty and i am unable to figure out why, it would be really appreciated if anyone would help me in this

copper thunder Jun 27, 2024, 1:01 AM

#

Be more specific

#

What's the prob v

lucid veldt Jun 27, 2024, 3:21 PM

#

you mean problem? if yes, then i am classifying a message as a spam or ham

jade bison Jun 27, 2024, 3:51 PM

#

ham? 🍔

lucid veldt Jun 28, 2024, 8:51 AM

#

ham means "NOT spam"

jade bison Jun 28, 2024, 3:48 PM

#

Ah! 💡 From now on I will write HAM in the subject of every email so people know that it's not SPAM 😂

copper thunder Jul 1, 2024, 2:51 AM

#

https://youtu.be/vHMEJwxs_dg?si=NHM0uee4Rys7jVZC

YouTube

Rauf

Top 10 Deep Learning Algorithms intro in 1 min

Welcome to our deep dive into the Top 10 Deep Learning Algorithms! In this video, we break down each algorithm with a concise 10-word explanation. Perfect for beginners and experts alike, you'll get a quick yet comprehensive overview of the most impactful algorithms in the world of deep learning.

Top 10 Deep Learning Algorithms:

Convolutional ...

▶ Play video

placid ridge Jul 6, 2024, 7:24 AM

#

Hello everyone.
Please help me to implement this.

#

I want to generate new song from 100 ambient songs.
These songs are the same in their style.

#

Anyone help me to do this.

placid ridge Jul 27, 2024, 11:45 AM

#

Hi, everyone.
We are planning to deploy Llama3 for our app with millions of users.
How can we achieve this?
And which GPU series or cloud platforms are best for achieving high speed and scalability?

copper thunder Aug 3, 2024, 2:34 AM

#

Prolly google cloud

copper thunder Aug 3, 2024, 2:35 AM

#

placid ridge I want to generate new song from 100 ambient songs. These songs are the same in ...

Can you explain more

tardy depot Aug 14, 2024, 9:50 PM

#

has anyone used vector databases for mobile applications? chromadb, pinecone, ...?

zenith vale Aug 17, 2024, 6:38 PM

#

tardy depot has anyone used vector databases for mobile applications? chromadb, pinecone, .....

I used SQLite.. but it is not so good for embeddings, vectors.. 🥲

steep abyss Aug 20, 2024, 7:15 AM

#

Hi, everybody.
I have a question
I need to extract the abstract of papers not using GPT4, I have to rely on local resource.

from py_pdf_parser.loaders import load_file
from py_pdf_parser.components import ElementOrdering

file_path = "JPM-2022-Harvey-25-46.pdf"

# Load the PDF once, with an explicit element ordering.
document = load_file(
    file_path, element_ordering=ElementOrdering.RIGHT_TO_LEFT_TOP_TO_BOTTOM
)

So I parsed the pdf using py_pdf_parser, and I'm going to merge the pieces until I obtain the complete abstract.
Now I'm trying to use embedding models for this, but that doesn't work well.
If somebody has a solution for this, please help me.
In case I have to use LLM models, the size should be under 2GB.
Thanks!

gloomy glen Sep 16, 2024, 3:43 PM

#

Hello everyone, could someone gives me a playlist or a course that gives me introduction about NLP ?

silver sandal Sep 17, 2024, 3:51 PM

#

@gloomy glen this is the best playlist ever I think for NLP
https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ

YouTube

Neural Networks: Zero to Hero

gloomy glen Sep 17, 2024, 6:41 PM

#

silver sandal @gloomy glen this is the best playlist ever I think for NLP https://ww...

Thank you so much 🌹

fierce creek Sep 24, 2024, 3:30 PM

#

Hi, Any playlist for LLMs and RAGs please....their easy to understand explanation, implementation and fine tuning

jade matrix Oct 22, 2024, 3:16 AM

#

gloomy glen Hello everyone, could someone gives me a playlist or a course that gives me intr...

https://course.spacy.io/en good intro to nlp with spacy

Advanced NLP with spaCy

Advanced NLP with spaCy · A free online course

spaCy is a modern Python library for industrial-strength Natural Language Processing. In this free and interactive online course, you'll learn how to use spaCy to build advanced natural language understanding systems, using both rule-based and machine learning approaches.

mortal raven Oct 24, 2024, 10:53 AM

#

unreal spindle Hi, Im using reranking after retrieval in my rag pipeline but there are some cas...

sound interesting, can you explain more and give some examples?

glad marsh Nov 2, 2024, 7:46 AM

#

Hey I am building a terms and condition summariser for software. Is anyone aware of available datasets with the T&Cs and summaries. I am using a Pegasus for text-generation and all it gives out is some gibberish with repeating words.

jade matrix Nov 2, 2024, 8:11 AM

#

glad marsh Hey I am building a terms and condition summariser for software. Is anyone aware...

https://huggingface.co/datasets/joelniklaus/online_terms_of_service

joelniklaus/online_terms_of_service · Datasets at Hugging Face

jade matrix Nov 3, 2024, 6:13 AM

#

how often n how much we will require regex understanding in nlp ?

bold flicker Nov 4, 2024, 3:26 PM

#

jade matrix how often n how much we will require regex understanding in nlp ?

It comes in handy during the preprocessing stuffs.

#

But recently you could utilize genai tools for generating those regex as long as you know how they work.

jade matrix Nov 4, 2024, 3:33 PM

#

bold flicker But recently you could utilize genai tools for generating those regex as long as...

That means one should know them it's actually important thnx

royal sable Nov 10, 2024, 11:31 PM

#

Is anyone familiar with Natural Language Understanding (NLU)?

#

If so, are there any learning resources you can recommend?

#

This is a very specific sub-field of NLP, but I'm finding a dearth of information about it (outside of a $1,750 Stanford course).

#

I'm specifically looking for practical, hands-on learning projects.

#

Just about everything I can find when searching specifically about NLU material is buried under the NLP umbrella.

ivory widget Nov 11, 2024, 10:54 AM

#

Hi, I am Abdullah. I am an ML engineer and want to join a team to participate in Kaggle competitions

strong sphinx Nov 13, 2024, 8:07 PM

#

Hi, need help in generation of sequence from a GPT model trained and evaluated on "tiny_shakespeare" dataset(built with PyTorch from scratch) . please DM 🙏

hasty crag Dec 6, 2024, 3:06 PM

#

Problem/Project I'm Working On:

Given a pair of resume and job description, I'd like to predict if the resume is one of the following categories: "Fit", "Potential Fit", "No Fit".

Dataset: https://huggingface.co/datasets/cnamuangtoun/resume-job-description-fit

Data Distribution:

50% - No Fit labels
25% - Potential Fit labels
25% - Fit labels

Train - 6.24k rows
Test - 1.76k rows

Model Architecture (Siamese BERT Architecture). I read through a paper that used this architecture and I thought of using it.
Paper: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1214/reports/final_reports/report062.pdf

I have attached a picture of the architecture I'm using. To give a brief summary I constructed 3 different variations, 2 of which are already in the image. All of them have two separate bert models which are sharing the same weights (siamese), where they differ is in the implementations:

Model 1: Take absolute difference of both the pooled (cls pooling) embeddings
Model 2: Take cosine similarity between the pooled embeddings
Model 3: Hybrid version that combines both the absolute difference along with cosine similarity.

All 3 of these models then have a Softmax applied on it to predict the category. In terms of the preprocessing done to the text, I've removed links, lemmatized the text, removed special characters etc. just picked up a good enough code from stack overflow and adapted it.
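The three variants can be sketched as feature functions over the two pooled embeddings. This is only an illustration in plain Python with made-up vectors. It assumes some classifier layer (not described in the post) maps the features to three logits before the softmax.

```python
import math


def abs_difference(u, v):
    """Model 1 features: element-wise |u - v|."""
    return [abs(a - b) for a, b in zip(u, v)]


def cosine_similarity(u, v):
    """Model 2 feature: a single scalar in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_features(u, v):
    """Model 3 features: |u - v| with the cosine similarity appended."""
    return abs_difference(u, v) + [cosine_similarity(u, v)]


def softmax(logits):
    """Numerically stable softmax over class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up pooled embeddings for a resume and a job description.
resume_vec = [0.2, 0.7, 0.1]
job_vec = [0.1, 0.9, 0.0]
print(hybrid_features(resume_vec, job_vec))
print(softmax([1.0, 2.0, 3.0]))  # hypothetical logits for the three labels
```

Note that Model 2 hands the classifier a single number, which is very little signal for a three-way decision. Also, if the loss is PyTorch's `CrossEntropyLoss`, it expects raw logits and applies log-softmax internally, so an explicit softmax before the loss may be worth double-checking.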

Model Details:

Loss Function: CrossEntropy
Optimizer: Adam
Learning Rate: 5e-5 (From the paper)
Num Epochs: 10

Results:

Evaluation Accuracy after 10 epochs: 85%
Testing Accuracy: 55%

I am unable to get past this 55% testing accuracy no matter how many things I've changed. I am currently using DistilBERT, I tried with BERT as well and am getting similar results. I need some guidance on what to do, right now I know the model is overfitting and based on the little error analysis I've done, the model is misclassifying all the types of labels.

cnamuangtoun/resume-job-description-fit · Datasets at Hugging Face

dusk notch Mar 26, 2025, 1:30 PM

#

Hi I have a simple beginner doubt in transformers

Embedding models that we see in hugging face are they basically encoder or abstraction of encoder part of transformers
If yes then in transformers architecture we see embedding layer, how do we obtain those embeddings (the layer before we apply positional embedding)

keen wave Mar 31, 2025, 2:31 PM

#

hasty crag Problem/Project I'm Working On: Given a pair of resume and job description, I'd...

Hi @hasty crag did you find some help to make it better?

celest roost Apr 3, 2025, 11:02 AM

#

Hello
Please show some support for this and I would also love to hear any insights for the posts
https://medium.com/@banupriya.valluvan

Medium

Banupriya Valluvan – Medium

Read writing from Banupriya Valluvan on Medium. A Master’s graduate with a strong passion for Large Language Models (LLMs) and AI technologies. Every day, Banupriya Valluvan and thousands of other voices read, write, and share important stories on Medium.

wraith galleon Apr 4, 2025, 8:50 PM

#

Hi guys

sinful ocean Jul 2, 2025, 2:14 PM

#

Hi everyone,

I’m transitioning from a Neuroscience & Psychology and Data Science background to specialize in NLP/LLMs, and I’d love your advice:

Background:

BSc in Neuroscience & Psychology
6-month Data Science specialization (2021)
3 years as a Data Scientist in healthcare, focusing on EEG signal processing and classification. However, many of my projects didn’t reach completion, and I haven’t consistently kept up with the latest methodologies and innovations, so I recognize gaps in my expertise.

Current focus:

Completed Coursera’s Generative AI with LLMs course
Currently working on a small emotion-detection project (sentence classification) for my GitHub portfolio

Questions:

What core skills, tools, or frameworks should I master now to break into the NLP/LLM space and to be the most relevant to employers?
Which project ideas or portfolio pieces or experience in general you've found to be relevant in the hiring processes?
Any must-read papers, tutorials, or communities you’d recommend?
How do you stay up-to-date with all the changes, tools and innovations in the field?

Thanks in advance for your insights!

short sentinel Aug 29, 2025, 2:18 AM

#

Job Title: Part-Time Senior AI/ML Engineer (Remote)

We are seeking a skilled and experienced Senior AI/ML Engineer to join our remote team on a part-time basis. The ideal candidate will have a strong technical background, excellent communication skills, and the ability to work independently in a fast-paced environment.

Requirements:
-Minimum of 9–10 years of professional software development experience

-Proven experience working effectively in a remote environment

-Advanced English proficiency (C1 or higher); an American accent is preferred

-Availability to work 10–15 hours per week during EST or CST business hours

If you're a highly motivated engineer with a passion for building high-quality software and can commit to a flexible part-time schedule, we’d love to hear from you.
You can connect with me on WhatsApp: +1 (567) 469-5384

short sentinel Sep 5, 2025, 4:10 PM

#

Job Title: Part-Time Senior AI/ML Engineer (Remote)

We are seeking a skilled and experienced Senior AI/ML Engineer to join our remote team on a part-time basis. The ideal candidate will have a strong technical background, excellent communication skills, and the ability to work independently in a fast-paced environment.

Requirements:
-Minimum of 7–10 years of professional software development experience

-Proven experience working effectively in a remote environment

-Advanced English proficiency (C1 or higher); an American accent is preferred

-Availability to work 10–15 hours per week during EST or CST business hours

If you're a highly motivated engineer with a passion for building high-quality software and can commit to a flexible part-time schedule, we’d love to hear from you.
You can connect with me on WhatsApp: +1 (567) 469-5384

frigid lantern Sep 22, 2025, 11:12 AM

#

sinful ocean Hi everyone, I’m transitioning from a Neuroscience & Psychology and Data Scienc...

Mmm interesting you should go for some topics in deep learning, just follow major ai accounts socially you will get updates on daily basis

#

Else i m young and you are my senior i just cant advice you hehe

rigid harbor Oct 17, 2025, 1:02 AM

#

Code Rehabilitation:
A New Paradigm
I share a concept "code rehabilitation and rehabilitators."

A story:

I appointed an excellent fast chef, a few serving robots, and retained my existing staff—even hired more. Why?

Reason 1: My chef started producing fast, good, tasty recipes with about 90% accuracy in the logistics of placing them in plates or packages to be served. But that means a lot of human interference is still needed.

Reason 2: More training is going on for my robots on how to handle exceptional situations—so there's a growing need for rehab-trainers.

Reason 3: The more demand for my recipes, the more I need marginally more staff and robots. While 90% of the task gets done swiftly (maybe even at "200 percent velocity"!), that remaining 10% requires more people and robots. As demand increases, this marginal staff requirement could become tremendous.

The Moral: An analogical scenario for why I still recruit—but in a different fashion, for a different paradigm. This reminds me of "creative destruction," a concept from economist Joseph Schumpeter (though not a Nobel laureate, his influence on economic thought is monumental). It's a sort of Brahmic cycle—simultaneous destruction and creative activity.

The Temporal Question
But here's the thing: Will this 10% shrink over time? Will next year's AI handle 95%, then 98%? Or will new complexities emerge—new edge cases, security concerns, integration challenges—that keep regenerating that stubborn 10%? The rehabilitation need might persist longer than we think, or it might evolve into something entirely different. Time will tell which way the wind blows.

What Lies Ahead
I foresee a Rehabilitation Engineering Program emerging soon—a structured discipline to train specialists who can bridge the gap between AI-generated output and production-ready systems. These rehabilitators will be the crucial link in our new *creative-destructive cycle.

(*With due regard to Joseph Schumpeter, who kindled this idea)

ripe mango Nov 10, 2025, 5:53 PM

#

I'm looking for a US developer to collaborate with. If anybody is interested, please DM me.

unique raptor Nov 11, 2025, 10:24 AM

#

Hey there, anyone around?

#

I just need some quick help

ivory magnet Nov 12, 2025, 4:59 AM

#

Hi @everyone 👋🏻 This is Rammani Pandey.
Let's connect on LinkedIn:
https://www.linkedin.com/posts/rammani-pandey-97302a22b_day9-sqlchallenge-indiar-activity-7394236510352424960-hmbp?utm_source=share&utm_medium=member_android&rcm=ACoAADl9mucBpatBQ_YBq3C6DW6wPrW7snWLC0k

spice sedge Dec 14, 2025, 12:46 PM

#

Hey everyone. I'm still learning the whole NLP topic. At the moment, I want to split and classify fantasy stories into narrative parts, like "Battle", "Journey", "Rest", and so on. My plan: after preprocessing (removing punctuation, lowercasing, lemmatization), I use a POS tagger to identify the verbs. Then I thought I could maybe classify them. The problem is that the same verb can have different meanings, like "run" (She runs a small business; This machine runs on magic) or "draw" (She drew water from the well; He drew his sword).

Do you know a way to separate/classify the meanings of the sentences, so I can throw away the verbs I don't need and get higher-quality results? 🙂

I would be really glad if anyone has some ideas I could try out 🙏

fickle nimbus Dec 26, 2025, 6:21 PM

#

Hey everyone 👋
Quick question for folks building chatbots / LLM apps here —
how are you currently handling long-term user memory beyond a single session?
Curious what’s actually working in practice (RAG, DB, custom hacks, etc).

green dune Jan 14, 2026, 7:20 PM

#

Who has experience with an INMP441?

digital sigil Jan 23, 2026, 3:46 AM

#

https://www.linkedin.com/posts/nikhilpmarihal_nlp-artificialintelligence-featureengineering-activity-7420309915430440960-ZOC-?utm_source=share&utm_medium=member_desktop&rcm=ACoAAE_I8KgBEGVzLhmVXxrwJ7QWSAiAzZDJ4jk

spare fiber Jan 29, 2026, 6:23 PM

#

fickle nimbus Hey everyone 👋 Quick question for folks building chatbots / LLM apps here — how...

RAG is easier to learn imo, so I'd say start with RAG

simple flare Feb 7, 2026, 12:21 PM

#

https://www.kaggle.com/code/axelmatthew/from-llms-to-agents-a-curated-ai-corpus-study

Hey everyone, please take a look at my notebook and give it an upvote if you like my learning notebook code. Thank you

shy linden Feb 11, 2026, 2:16 PM

#

New Dataset Just published!

View: https://www.kaggle.com/datasets/mabubakrsiddiq/clear-bg-ocr-dataset-eng-and-zh-22k-images

🔹 Overview

This dataset contains synthetic OCR images of English and Chinese sentences. Each language is organized in a separate folder with corresponding metadata. The images have clear backgrounds, random fonts and font sizes, and optional blur for variability.

The dataset is designed for OCR research, machine learning, and computer vision tasks. Perfect for training models to recognize text in multiple languages and fonts.

🎨 Features

✅ Bilingual dataset: English & Chinese
✅ Random fonts: Multiple font options for diversity
✅ Random font sizes: Increases model generalization
✅ Optional Gaussian blur: Simulates real-world imaging
✅ Clear backgrounds: Good for clean OCR training
✅ Metadata included: Easy for preprocessing and analysis

💡 Possible Use Cases

🖋️ OCR Model Training: Train models like Tesseract, PaddleOCR, or deep learning OCR pipelines
🤖 Computer Vision Research: Use metadata for font/style classification
🏫 Language Learning Tools: Visual recognition for English or Chinese sentences
🔧 Augmentation Testing: Benchmark text recognition under blur and font variations
🧠 Multi-Lingual OCR Experiments: Test cross-lingual recognition models

⚡ Notes

The Chinese text is rendered using Microsoft YaHei and NSimSun fonts for proper character display.
The English text uses a variety of fonts for diversity.

Please consider giving an upvote!

tardy orbit Feb 16, 2026, 9:18 AM

#

We built a Kaggle Search where you can search datasets on Kaggle (or HF) and find datasets that positively or negatively influence a model, based on your prompt. It uses the gradient of the model's loss with respect to the parameters of the final transformer block.

Instead of relying on upvotes from folks who may not use the dataset for the same purpose as you, you can pick the model you are training and it will calculate each dataset's influence on it.

https://durinn-concept-explorer.azurewebsites.net/
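
For anyone wondering what a gradient-based influence score can look like, here is a minimal sketch. It is not the tool's actual implementation: the post does not say how the gradients are combined, so the dot product between a training example's gradient and the query's gradient is an assumption, and a tiny logistic model stands in for the final transformer block. All names (`loss_gradient`, `influence`) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradient(weights, x, y):
    """Gradient of the logistic loss w.r.t. the weights for one example (x, y)."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
    return [(p - y) * xi for xi in x]

def influence(weights, train_example, query_example):
    """Assumed scoring rule: dot product of the two loss gradients.

    A positive score means a gradient step on the training example would
    (to first order) lower the loss on the query; a negative score means
    it would raise it.
    """
    g_train = loss_gradient(weights, *train_example)
    g_query = loss_gradient(weights, *query_example)
    return sum(a * b for a, b in zip(g_train, g_query))

# Toy data: (features, label). The query plays the role of "your prompt".
weights = [0.0, 0.0]
query = ([1.0, 0.0], 1)
helpful = ([1.0, 0.2], 1)   # similar features, same label
harmful = ([1.0, 0.2], 0)   # similar features, opposite label

print(influence(weights, helpful, query))
print(influence(weights, harmful, query))
```

With the toy data, the example that agrees with the query gets a positive score and the mislabeled one gets a negative score, which is the "positively or negatively influence" distinction in miniature.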

shy linden Feb 17, 2026, 2:24 PM

#

New Dataset published!

https://www.kaggle.com/datasets/mabubakrsiddiq/language-identification-dataset-20-languages/data/data/data/data/data
The Language Identification Dataset is a curated collection of approximately 68978 text samples, each paired with a corresponding language label. The dataset was constructed by gathering multilingual text passages from three major sources: the Multilingual Amazon Reviews Corpus, XNLI, and STSb Multi-MT. These sources provide a diverse mix of domains, writing styles, and sentence structures, making the dataset suitable for research and machine learning tasks involving language detection, multilingual NLP, and text classification.

shy linden Feb 17, 2026, 3:23 PM

#

New Notebook:

https://www.kaggle.com/code/mabubakrsiddiq/language-identification-99-accuracy
The notebook builds a language identification model that achieves 99% accuracy

simple flare Feb 19, 2026, 6:03 AM

#

My Notebook :

https://www.kaggle.com/code/axelmatthew/identifying-error-patterns-in-stackoverflow

Hello guys, I'd like to share an analysis with some quite interesting insights about Stack Overflow. I hope you can drop an upvote and fork it if it feels useful for future analytics. Thank you

mighty tiger Feb 23, 2026, 8:03 PM

#

Is anyone interested in giving a talk (45-60 mins) at the University of Wisconsin-Milwaukee?
My team is organizing a colloquium as part of the NSF REU Summer 2026 program (link: https://sites.uwm.edu/reu/about/).

The colloquium will focus on Natural Language Processing (NLP), Natural Language Understanding (NLU), and Large Language Models (LLMs), with an audience made up primarily of undergraduate students. It's a wonderful opportunity to inspire the next generation of researchers and share your expertise in a meaningful way!
Kindly DM me with your info if you are interested.

Regards,
Pamela Martey

normal inlet Mar 1, 2026, 4:03 AM

#

Just published a dataset on Google historical stock prices, would love to hear your feedback
https://www.kaggle.com/datasets/ibrahimshahrukh/google-alphabet-stock-prices-2016-2026

chilly hull Mar 5, 2026, 6:04 PM

#

I want to start learning NLP using PyTorch. Can anyone suggest some good resources or beginner-to-intermediate projects I can build to practice?

shy linden Mar 9, 2026, 12:41 PM

#

https://www.kaggle.com/datasets/mabubakrsiddiq/urdu-ghazal-dataset-32-poets-and-their-ghazals

The dataset contains poetry by 30 of the greatest Urdu poets. Here they are:

'mirza-ghalib','allama-iqbal','faiz-ahmad-faiz','sahir-ludhianvi','meer-taqi-meer', 'dagh-dehlvi','kaifi-azmi','gulzar','bahadur-shah-zafar','parveen-shakir', 'jaan-nisar-akhtar','javed-akhtar','jigar-moradabadi','jaun-eliya', 'ahmad-faraz','meer-anees','mohsin-naqvi','firaq-gorakhpuri','fahmida-riaz','wali-mohammad-wali', 'waseem-barelvi','akbar-allahabadi','altaf-hussain-hali','ameer-khusrau','naji-shakir','naseer-turabi', 'nazm-tabatabai','nida-fazli','noon-meem-rashid', 'habib-jalib'
Every ghazal is given in three writing systems:

Urdu (Arabic Script)
Hindi (Devanagari script)
English (Latin Script)
Divided into three folders: ur, en and hi.

Potential use cases:

NLP
Meter Detection
Training a model to predict the poet given a ghazal or couplet
Have fun with data!

remote mural Mar 9, 2026, 6:24 PM

#

shy linden # https://www.kaggle.com/datasets/mabubakrsiddiq/urdu-ghazal-dataset-32-poets-an...

Remember not to publish the same thing in multiple channels - you were right to put it in data!

quaint agate Mar 31, 2026, 3:08 PM

#

Hello guys

#

I have a regional contest soon

#

how should I prepare for NLP?

spice sedge May 6, 2026, 5:55 PM

#

Any recommendations for a free BIO annotation tool?

west garnet May 6, 2026, 9:47 PM

#

spice sedge Any recommendations for a free BIO annotation tool?

I've tried Label Studio and doccano personally, they're good 🐣

spice sedge May 7, 2026, 6:30 AM

#

west garnet I've tried Label Studio and doccano personally, they're good 🐣

Nice! Thanks 🙂

nimble hound May 13, 2026, 3:09 PM

#

spice sedge Hey everyone. Im still learning the whole nlp topic. At the moment, I want to sp...

Hey! This is actually a really interesting problem.
POS tagging alone probably won’t be enough here. What you are running into is basically word sense disambiguation, especially with verbs like “run” or “draw” that change meaning depending on context.
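
To make the word sense disambiguation point concrete, here is a toy sketch in the style of the simplified Lesk idea: pick the sense whose description shares the most words with the sentence. The sense inventory, the sense names and the stopword list are all made up for illustration (a real setup would use a proper lexical resource or a trained model), so treat this as a sketch under those assumptions.

```python
import re

# Hypothetical, hand-written sense inventory (not taken from any real resource).
SENSES = {
    "draw": {
        "pull_out_weapon": "pull a weapon such as a sword or dagger out of its sheath",
        "fetch_liquid": "take water or another liquid from a well or barrel",
    },
    "run": {
        "manage": "manage or operate a business shop or company",
        "be_powered": "a machine or engine operates on fuel power or magic",
    },
}

# Small illustrative stopword list so function words do not count as overlap.
STOPWORDS = {"a", "an", "the", "or", "of", "on", "from", "its", "such",
             "as", "out", "she", "he", "his", "her", "this"}

def content_words(text):
    """Lowercased word set without stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def disambiguate(lemma, sentence):
    """Return the sense of `lemma` whose gloss overlaps most with the sentence."""
    context = content_words(sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[lemma].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("draw", "He drew his sword"))             # pull_out_weapon
print(disambiguate("draw", "She drew water from the well"))  # fetch_liquid
print(disambiguate("run", "This machine runs on magic"))     # be_powered
```

The POS tag is "verb" in all of these sentences; only the surrounding words separate the senses, which is the part a POS tagger cannot give you.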

One direction you could try is contextual embeddings like BERT or RoBERTa. These models look at the full sentence instead of isolated words, so the meaning of the verb changes based on context automatically.

Another option is semantic role labeling. Tools like AllenNLP can identify roles such as agent, action, and theme, which could map nicely to narrative categories like battle or journey.

You could also skip verb level extraction entirely and treat this as a sentence classification problem. That approach is often simpler and works better in practice.

What does your dataset look like? Is it already labeled or are you planning to create the labels yourself?

spice sedge May 15, 2026, 8:55 PM

#

nimble hound Hey! This is actually a really interesting problem. POS tagging alone probably w...

Hi! Thanks for your response and your interest. My dataset consists of German fairy tales, and it's just plain text. I plan to create the labels myself and have already started annotating the data. But yeah... it's a lot of work. You are absolutely right: I have those problems, plus some other problems with names and nouns. So I need labels for different words, like the plant "Rapunzel" and the name "Rapunzel". In the end, I want to identify every character in the fairy tales.

My aim is to label enough sentences and fine-tune a spaCy NER model. But I would also like to try the BERT embeddings you recommended. That's the reason why I decided on BIO annotation (I hope it was the right decision :D): for fine-tuning some transformer models.
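
For reference, this is roughly what BIO annotation boils down to: every token gets `B-<label>` at the start of an entity, `I-<label>` inside it, and `O` everywhere else. A minimal sketch, assuming token-index spans with an exclusive end; the function name `to_bio` and the labels `CHAR` and `PLANT` are made up for illustration.

```python
def to_bio(tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags

# "Rapunzel" as a character, plus a multi-token character mention.
tokens_a = ["Rapunzel", "let", "down", "her", "hair", "for", "the", "old", "witch"]
tags_a = to_bio(tokens_a, [(0, 1, "CHAR"), (6, 9, "CHAR")])

# "rapunzel" as the plant.
tokens_b = ["The", "wife", "craved", "the", "rapunzel", "in", "the", "garden"]
tags_b = to_bio(tokens_b, [(0, 2, "CHAR"), (4, 5, "PLANT")])

for token, tag in zip(tokens_b, tags_b):
    print(token, tag)
```

The same surface word ends up with different labels depending on the annotated span, which is what lets a token classifier learn the plant/name distinction from context.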

AllenNLP sounds super interesting! I've never heard about this. Thank you very much!

"You could also skip verb level extraction entirely and treat this as a sentence classification problem. That approach is often simpler and works better in practice."
What do you mean by this?

graceful yacht May 19, 2026, 2:57 PM

#

spice sedge Hi! Thanks for your response and your interest. My dataset are german Fairytales...

Hey there! You could also try the good old RNN approach to keep computation manageable on a CPU. If you're classifying very large passages, you could keep only the sentences with the highest IDF scores (to prevent general words like "the" and "run" from getting thrown in).
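
A rough sketch of the IDF filtering idea, in plain Python. The details are assumptions: each sentence is treated as its own "document" for the IDF counts, and a sentence is scored by the mean IDF of its tokens. The function name `top_sentences_by_idf` is hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    # Lowercase word tokens; the character class also covers German umlauts and ß.
    return re.findall(r"[a-zäöüß]+", sentence.lower())

def top_sentences_by_idf(sentences, k):
    """Keep the k sentences with the highest mean token IDF, in original order."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))                      # document frequency per word
    idf = {w: math.log(n / c) for w, c in df.items()}
    scores = [sum(idf[w] for w in toks) / len(toks) if toks else 0.0
              for toks in tokenized]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]

sentences = [
    "The king went to the castle.",
    "The king went to the forest.",
    "The dragon burned the village.",
]
print(top_sentences_by_idf(sentences, 1))
```

Words that appear in every sentence (like "the") get an IDF of zero, so sentences made mostly of common words score low and drop out first.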

You could also just try the simplest non-neural approach, something like Naive Bayes or pure TF-IDF and see how that goes. I think for classifying genres that would be fine.
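
As a baseline, a bag-of-words Naive Bayes classifier over whole sentences fits in a few lines of plain Python. This is only a sketch: the training sentences are invented for illustration, the labels are borrowed from the original question, and the class name `NaiveBayes` is hypothetical. In practice a library implementation would be the usual choice.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.class_counts.items():
            logp = math.log(n_docs / total_docs)              # class prior
            total_words = sum(self.word_counts[label].values())
            for w in tokenize(text):
                if w not in self.vocab:
                    continue                                   # skip unseen words
                logp += math.log((self.word_counts[label][w] + 1)
                                 / (total_words + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

# Invented toy training data.
train = [
    ("The knight drew his sword and struck the dragon", "Battle"),
    ("The soldiers fought the giant with swords and shields", "Battle"),
    ("She walked for three days through the dark forest", "Journey"),
    ("They rode over the mountains to a distant kingdom", "Journey"),
    ("He lay down under a tree and slept until morning", "Rest"),
    ("The tired girl sat by the fire and rested", "Rest"),
]
model = NaiveBayes().fit([t for t, _ in train], [l for _, l in train])
print(model.predict("The prince fought the dragon with his sword"))
```

The whole sentence is the input here, so there is no separate verb extraction step; the classifier picks up on whichever words co-occur with each label.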

Otherwise, I wouldn't recommend jumping straight to BERT or similar classifiers; I doubt 110M parameters are going to make classification any more accurate. You could try sBERT. I found https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 to be good for sentence-to-vector tasks (I used it in a paper once), and you could just "glue" a classification head onto it and fine-tune. It has 22M parameters, so it is smaller and more than enough for the job.

TL;DR: Try non-neural approaches first (Naive Bayes, TF-IDF), then RNNs (GRU or similar). If RNNs don't work well enough, try sBERT and only then something bigger.

#nlp