Please help me find an appropriate model. As input I have ASCII text with different text styles (font size, font weight). The text consists of 4-10 lines of 15-40 characters each and describes a rental property (a flat, a house, or a parking garage). It may contain information about the property such as size, number of flats, number of rooms, age, etc. (about 50 categories in total), but not every description will contain all the categories.
The goal is to identify and classify individual text parts into existing categories.
What model should I choose?
I read that a CRF, NER, or even an LLM would be suitable.
#CRF, LLM or NER for text classification
If you don't have your dataset ready, you can provide few-shot examples to some LLMs (most notably OpenAI's) to recognize entities: https://medium.com/@iryna230520/dynamic-few-shot-prompting-overcoming-context-limit-for-chatgpt-text-classification-2f70c3bd86f9
If you already have an annotated dataset, I recommend going with NER instead.
thanks - how big should the dataset be in your experience?
It depends heavily on the problem, but in general you can aim for at least a few hundred to a few thousand examples per category and, more importantly, a balanced distribution across categories. This applies if you're training an NER model.
I would definitely try the LLM few-shot prompting route though, since you don't need a pre-existing dataset, only to work out how to prompt correctly. It is probably the lowest-hanging fruit.
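To make the few-shot idea concrete, here is a minimal sketch of how such a prompt could be assembled. The category names, the example listings, and the `build_prompt` helper are all made up for illustration; the resulting string would be sent to whichever LLM you end up using, and the exact wording will likely need tuning.

```python
import json

# Hypothetical categories and labeled examples, not taken from a real dataset.
CATEGORIES = ["size", "rooms", "age"]

EXAMPLES = [
    ("Flat, 3 rooms, 75 m2, built 1998",
     {"rooms": "3", "size": "75 m2", "age": "built 1998"}),
    ("Parking garage, 12 m2",
     {"size": "12 m2"}),
]


def build_prompt(text, examples=EXAMPLES, categories=CATEGORIES):
    """Assemble a few-shot prompt: instructions, labeled examples, then the query."""
    lines = [
        "Extract the following categories from the listing: "
        + ", ".join(categories) + ".",
        "Answer with a JSON object; omit categories that are not mentioned.",
        "",
    ]
    # Each example shows the model the expected input/output format.
    for listing, labels in examples:
        lines.append("Listing: " + listing)
        lines.append("Answer: " + json.dumps(labels, sort_keys=True))
        lines.append("")
    # The new listing goes last, with the answer left open for the model.
    lines.append("Listing: " + text)
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("House, 5 rooms, 140 m2"))
```

With about 50 categories you would probably not fit an example for each one into a single prompt, which is the context-limit problem the linked article is about.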
the thing is that the solution I need must be self-hosted
so a SaaS like OpenAI won't work, since it runs on their servers
do you think BERT or something similar is a good alternative?
You can train NER with BERT or similar models, but you first need annotated data. You can use Label Studio to annotate.
Then you can train an NER model by wrangling the data into IOB format, or you can check out this tool, though I haven't tried it. See how to train NER with BERT here. You can experiment with different BERT-like open-source models too.
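As a rough illustration of the IOB wrangling step, the sketch below turns character-offset annotations into one IOB tag per whitespace-separated token. The `(start, end, label)` span layout, the label names, and the `spans_to_iob` helper are assumptions made for this example; your annotation tool's export format will differ and needs to be mapped to something like this first.

```python
def spans_to_iob(text, spans):
    """Convert character spans to IOB tags.

    spans: list of (start, end, label) with end exclusive (assumed layout).
    Returns (tokens, tags), one tag per whitespace-separated token.
    """
    tokens, tags = [], []
    pos = 0
    for word in text.split():
        # Locate the token in the original string to get its offsets.
        start = text.index(word, pos)
        end = start + len(word)
        pos = end
        tag = "O"  # default: token is outside every annotated span
        for s, e, label in spans:
            if start < e and end > s:  # token overlaps the span
                # B- for the token containing the span start, I- otherwise.
                prefix = "B-" if start <= s < end else "I-"
                tag = prefix + label
                break
        tokens.append(word)
        tags.append(tag)
    return tokens, tags


if __name__ == "__main__":
    text = "Flat with 3 rooms and 75 m2"
    rooms_at = text.index("3 rooms")
    size_at = text.index("75 m2")
    spans = [
        (rooms_at, rooms_at + len("3 rooms"), "ROOMS"),
        (size_at, size_at + len("75 m2"), "SIZE"),
    ]
    for token, tag in zip(*spans_to_iob(text, spans)):
        print(token, tag)
```

Whitespace splitting is the simplest possible tokenization. When fine-tuning a BERT-like model, these word-level tags still have to be aligned to the model's subword tokens.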
cool - I will read about it. BERT was the one I found, but maybe there is a better one. Can you recommend something better?
If you're training NER with a transformer, you're looking at encoder-only models. So as long as you're looking at better-performing encoder-only models, you should be good. BERT is always a good start. You could probably also try FLAN-T5, an encoder-decoder model, but the setup would be different. I'd try the encoder-only families first.
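For a sense of what swapping encoder-only models looks like in practice, here is a hedged sketch using the Hugging Face `transformers` library. The `build_label_maps` and `load_model` helpers and the category names are hypothetical; the checkpoint name is just an example. Trying a different encoder would mostly mean changing that one string.

```python
def build_label_maps(categories):
    """Build id<->label maps for IOB tagging: 'O' plus B-/I- per category."""
    labels = ["O"]
    for category in categories:
        labels += ["B-" + category, "I-" + category]
    id2label = dict(enumerate(labels))
    label2id = {label: i for i, label in id2label.items()}
    return id2label, label2id


def load_model(checkpoint, categories):
    """Load a tokenizer and a token-classification head on top of an encoder.

    Requires `pip install transformers torch`; downloads the checkpoint
    on first use, after which it runs locally (self-hosted).
    """
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    id2label, label2id = build_label_maps(categories)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint,
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model


if __name__ == "__main__":
    # Hypothetical category names; swap the checkpoint string to compare encoders.
    tokenizer, model = load_model("bert-base-cased", ["SIZE", "ROOMS", "AGE"])
    print(model.config.id2label)
```

With about 50 categories this scheme gives 101 labels (two per category plus `O`), which is one reason the per-category example counts mentioned earlier matter.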