CRF, LLM or NER for text classification | Learn AI Together | Page 1

random token Sep 9, 2023, 7:47 AM

Please help me find the appropriate model. As input I have ASCII text with different text styles (font size, font weight). The text consists of 4-10 lines of 15-40 characters each, containing information about a rental object (a flat, a house or a parking garage). The text may contain information about the object such as size, number of flats, number of rooms, age etc. (in total there are about 50 categories), but not every description will contain all the categories.
The goal is to identify and classify individual text parts into existing categories.
What model should I choose?
I read that either CRF, NER or even an LLM would be suitable.

zenith sierra Sep 10, 2023, 8:13 AM

If you don't have your dataset ready, you can provide few-shot examples to some LLMs (most notably OpenAI's) to recognize entities: https://medium.com/@iryna230520/dynamic-few-shot-prompting-overcoming-context-limit-for-chatgpt-text-classification-2f70c3bd86f9

If you have an annotated dataset already, I recommend going with NER instead.
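
A minimal sketch of what few-shot prompting could look like here, assuming a plain "Text / Category" prompt layout. The category names, the example fragments and the `build_prompt` helper are made up for illustration and are not taken from the linked article. The code only builds the prompt string, which could then be sent to whichever LLM is available.

```python
# Hypothetical labelled examples: (text fragment, category).
FEW_SHOT_EXAMPLES = [
    ("3 rooms", "rooms"),
    ("85 m2", "size"),
    ("built 1998", "age"),
]


def build_prompt(examples, categories, fragment):
    """Build a few-shot classification prompt for one text fragment."""
    lines = [
        "Classify the text fragment into one of these categories: "
        + ", ".join(categories) + ".",
        "",
    ]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Category: {label}")
        lines.append("")
    # The final entry is left open for the model to complete.
    lines.append(f"Text: {fragment}")
    lines.append("Category:")
    return "\n".join(lines)


prompt = build_prompt(FEW_SHOT_EXAMPLES, ["rooms", "size", "age"], "2 parking spaces")
print(prompt)
```

With about 50 categories, the list of examples would probably have to be chosen per input to stay within the context limit, which is the problem the linked article is about.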

random token Sep 10, 2023, 8:14 AM

thanks - how big should the dataset be in your experience?

zenith sierra Sep 10, 2023, 8:34 AM

It highly depends on the problem, but in general you can aim for at least a few hundred to a few thousand examples for each category. More importantly, the categories should be balanced. This is if you're training NER.
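
A quick way to check the balance is to count annotated spans per category. This is a minimal sketch, assuming the annotations are available as (text, category) pairs per listing; the data and the `category_counts` helper are hypothetical.

```python
from collections import Counter

# Hypothetical annotations: one list of (text, category) pairs per listing.
annotated = [
    [("3 rooms", "rooms"), ("85 m2", "size")],
    [("2 rooms", "rooms"), ("built 1998", "age")],
    [("4 rooms", "rooms")],
]


def category_counts(documents):
    """Count how many annotated spans each category has."""
    return Counter(label for doc in documents for _, label in doc)


counts = category_counts(annotated)
# Print the categories from most to least frequent.
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

Categories that end up far below the others are the ones to annotate more of first.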

I would definitely try the LLM few-shot learning via prompting route though, as you don't need a pre-existing dataset; it depends more on how well you prompt. It is probably the lowest-hanging fruit.

random token Sep 10, 2023, 8:35 AM

the thing is that the solution I need must be self-hosted

so a SaaS like OpenAI won't work, since it runs on their servers

do you think BERT or something similar is a good alternative?

zenith sierra Sep 10, 2023, 8:43 AM

You can train NER with BERT or similar models, but you first need annotated data. You can use Label Studio to annotate.

Then you can train an NER model by wrangling the data into IOB format, or you can check out this tool, though I haven't tried it. See how to train NER with BERT here. You can experiment with different BERT-like open-source models too.
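
A minimal sketch of the IOB wrangling step, assuming annotations come as character spans `(start, end, label)` and that whitespace tokenization is good enough. The span format, the label names and the `spans_to_iob` helper are assumptions for illustration, not Label Studio's actual export schema.

```python
import re


def spans_to_iob(text, spans):
    """Convert character spans (start, end, label) to per-token IOB tags.

    Tokens are whitespace-separated; a token belongs to a span when it
    lies entirely inside it. Spans are assumed to start at a token boundary.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if match.start() >= start and match.end() <= end:
                # First token of the span gets B-, the rest get I-.
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


text = "Flat with 3 rooms and 85 m2"
spans = [(10, 17, "ROOMS"), (22, 27, "SIZE")]  # "3 rooms", "85 m2"
tokens, tags = spans_to_iob(text, spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Punctuation attached to a token (for example a trailing comma) would push the token outside its span here, so real data would likely need a proper tokenizer instead of the whitespace split.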

Label Studio — Text Named Entity Recognition Data Labeling Template

Template for performing named entity recognition on text with Label Studio for your machine learning and data science projects.

random token Sep 10, 2023, 8:45 AM

cool - I will read about it. BERT was the one I found, but maybe there is a better one. Can you recommend something better?

zenith sierra Sep 10, 2023, 8:50 AM

If you're training NER with a transformer, you're looking at encoder-only models. So as long as you stick to better-performing encoder-only models, you should be good. BERT is always a good start. You can probably also try FLAN-T5, an encoder-decoder model, but the setup would be different. I'd try the encoder-only families first.
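
Whichever encoder-only model is tried, it will usually split words into subword tokens, so word-level NER labels have to be spread over those tokens before training. Below is a minimal sketch of one common convention (label the first subword, mask the rest). The `word_ids` list is assumed to have the shape that Hugging Face fast tokenizers return from `word_ids()`; the `align_labels` helper and the example values are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def align_labels(word_ids, word_labels):
    """Expand word-level label ids to subword tokens.

    word_ids maps each subword token to its word index, with None for
    special tokens. Only the first subword of each word keeps the label;
    the rest are masked out.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Hypothetical example: 3 words, the second split into two subwords,
# wrapped in special tokens such as [CLS] and [SEP].
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 1, 2]
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Because this step only depends on the word-to-token mapping, it should stay the same when swapping BERT for another encoder-only checkpoint.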

Encoder models - Hugging Face NLP Course
