#CRF, LLM or NER for text classification

random token
#

Please help me find an appropriate model. As input I have ASCII text with different text styles (font size, font weight). The text consists of 4-10 lines of 15-40 characters each, containing information about a rental object (flat, house, or parking garage). The text may include details such as size, number of flats, number of rooms, age, etc. (about 50 categories in total), but not every description will contain all of the categories.
The goal is to identify and classify individual text parts into the existing categories.
Which model should I choose?
I read that either CRF, NER, or even an LLM would be suitable.

zenith sierra
#

If you don't have your dataset ready, you can provide few-shot examples to some LLMs (most notably OpenAI's) to recognize entities: https://medium.com/@iryna230520/dynamic-few-shot-prompting-overcoming-context-limit-for-chatgpt-text-classification-2f70c3bd86f9

If you already have an annotated dataset, I recommend going with NER instead.

random token
#

thanks - how big should the dataset be in your experience?

zenith sierra
#

I would definitely try the few-shot prompting route with an LLM first, though. You don't need a pre-existing dataset; the effort goes into prompting correctly instead. It's probably the lowest-hanging fruit.
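
A minimal sketch of how such a few-shot prompt could be assembled. The category names, the example listings, and the `build_prompt` helper are all made up for illustration; the resulting string can be sent to whichever LLM you end up using.

```python
import json

# Hypothetical subset of the ~50 categories; replace with the real ones.
CATEGORIES = ["object_type", "size_sqm", "num_rooms", "year_built"]

# Hand-labeled examples, invented for illustration only.
FEW_SHOT_EXAMPLES = [
    {
        "text": "Flat, 3 rooms\n72 m2, built 1998",
        "labels": {
            "object_type": "Flat",
            "num_rooms": "3",
            "size_sqm": "72 m2",
            "year_built": "1998",
        },
    },
    {
        "text": "Parking garage\n12 m2",
        "labels": {"object_type": "Parking garage", "size_sqm": "12 m2"},
    },
]


def build_prompt(description: str) -> str:
    """Build a few-shot prompt: instructions, labeled examples, then the new listing."""
    parts = [
        "Extract these categories from the rental description: "
        + ", ".join(CATEGORIES) + ".",
        "Answer with a JSON object. Omit categories that are not mentioned.",
        "",
    ]
    for example in FEW_SHOT_EXAMPLES:
        parts.append("Listing:\n" + example["text"])
        parts.append("Answer: " + json.dumps(example["labels"]))
        parts.append("")
    # The unlabeled description goes last so the model completes the answer.
    parts.append("Listing:\n" + description)
    parts.append("Answer:")
    return "\n".join(parts)
```

With ~50 categories you would likely need more examples than this, chosen so that every category shows up at least once.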

random token
#

the thing is that the solution I need must be self-hosted

#

so a SaaS like OpenAI won't work, since it runs on their servers

#

do you think BERT or something similar is a good alternative?

zenith sierra
#

You can train NER with BERT or similar models, but you first need annotated data. You can use Label Studio to annotate.

Then you can train an NER model by wrangling the data into IOB format, or you can check out this tool, though I haven't tried it. See how to train NER with BERT here. You can experiment with other BERT-like open-source models too.
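
The "wrangle into IOB format" step can be sketched roughly as follows. This assumes whitespace tokenization and annotations given as `(start, end, label)` character spans; the exact export format of your annotation tool and the label names used here are assumptions, so adapt accordingly.

```python
import re


def spans_to_iob(text, spans):
    """Convert character-level spans into per-token IOB tags.

    spans: list of (start, end, label) with `end` exclusive.
    Tokens that are only partially covered by a span stay "O".
    """
    tokens, tags = [], []
    previous = None  # index of the span the previous token belonged to
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        current = None
        for i, (start, end, _label) in enumerate(spans):
            if tok_start >= start and tok_end <= end:
                current = i
                break
        if current is None:
            tag = "O"
        else:
            # First token of a span gets B-, following tokens get I-.
            prefix = "I-" if current == previous else "B-"
            tag = prefix + spans[current][2]
        tokens.append(match.group())
        tags.append(tag)
        previous = current
    return tokens, tags


# Invented example; labels are hypothetical.
text = "Flat with 3 rooms, 72 m2"
spans = [(0, 4, "OBJECT_TYPE"), (10, 11, "NUM_ROOMS"), (19, 24, "SIZE")]
tokens, tags = spans_to_iob(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Note that a BERT tokenizer splits words into subword pieces, so these word-level tags still have to be aligned to the subword tokens before training.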

random token
#

cool - I will read about it. BERT was the one I found, but maybe there is a better one. Can you recommend something better?

zenith sierra
#

If you're training NER with a transformer, you're looking at encoder-only models. So as long as you're picking among the better-performing encoder-only models, you should be good. BERT is always a good start. You can probably also try FLAN-T5, an encoder-decoder model, but the setup would be different. I'd try the encoder-only families first.
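
A rough sketch of what swapping encoder-only models looks like in practice, assuming the Hugging Face `transformers` library is installed. The category names are hypothetical, and `bert-base-cased` is just one possible checkpoint. With IOB tagging each category needs a `B-` and an `I-` label plus one shared `O`, so 50 categories would give 101 labels.

```python
# Hypothetical subset of the ~50 categories; replace with the real ones.
CATEGORIES = ["OBJECT_TYPE", "SIZE", "NUM_ROOMS", "YEAR_BUILT"]


def build_label_maps(categories):
    """Return (id2label, label2id) for IOB tagging over the given categories."""
    labels = ["O"]
    for category in categories:
        labels += [f"B-{category}", f"I-{category}"]
    id2label = dict(enumerate(labels))
    label2id = {label: i for i, label in id2label.items()}
    return id2label, label2id


def load_token_classifier(model_name="bert-base-cased"):
    """Load an encoder with a fresh token-classification head.

    Imported lazily so the label maps can be built without transformers.
    Trying another encoder-only model means changing only `model_name`.
    """
    from transformers import AutoModelForTokenClassification

    id2label, label2id = build_label_maps(CATEGORIES)
    return AutoModelForTokenClassification.from_pretrained(
        model_name,
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
    )
```

The classification head is randomly initialized at this point, so the model still has to be fine-tuned on the annotated data before it predicts anything useful.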