#Looking for tips


hidden tree
#

Hey! I’m new to coding and might be making this harder than it needs to be.

#

I want to build a program that reads and converts text from a list.

#

How should I go about this?

obsidian crown
#

I would train a model on the list of terms that you want it to use, then apply an agent, sort of like I have shown above.

hidden tree
#

This is pretty neat! Checking it out, thanks!

obsidian crown
#

You can do GRPO training for this, to reward it for using the correct term I believe. I'm not the best person to ask about GRPO though.

#

You can try to use an instruct model, then apply a little training on top. Then when you give it specific instructions, it should be able to answer in the way you need it to. I haven't attempted this before, as I don't have the hardware to get GRPO done. With other training methods, I think you would need a lot of entries in your dataset for it to associate them with what you're trying to accomplish.

#

If you're willing to spend some Colab credits, look around the Discord for GRPO training methods, and you may be able to get this done without much hassle. If you find a good method to get this going, then whenever the list updates come out, try to either apply that method again or continue the training with the newly added entries.

#

Then combining your model with some of the examples I gave above should give you a basis for the program you're trying to create.

#

Hope this helps, but someone else on here might have a better method to get this done.

#

These also give some explanation of the types of training you can try.

hidden tree
#

Sorry, I'm not great with this stuff. You're saying that I should instead train an instruct reasoning model? Would this model be for cleaning up the results of the current first model and replacing values with ones from the list? Would the lists act as context for the reasoning model?

obsidian crown
#

Oh well no, I was thinking you could customize the GRPO training a little bit and reward the model for responding in the way that you want it to, for the wording and format of the documents.
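A reward like this could be sketched as a plain Python function. This is only an illustration and not something shared in the thread: the `ALLOWED_TERMS` set, the `diagnoses` key, and the score values are all hypothetical, and the call shape (a list of completion strings in, a list of floats out) is an assumption about how a GRPO trainer would invoke it.

```python
import json

# Hypothetical allowed terms; in practice these would come from the current lists.
ALLOWED_TERMS = {"colorectal cancer", "gastric cancer"}

def term_reward(completions, allowed_terms=ALLOWED_TERMS):
    """Score each completion: 1.0 for valid JSON holding an allowed term,
    0.2 for valid JSON holding any other value, 0.0 for anything else."""
    rewards = []
    for text in completions:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            rewards.append(0.0)  # not JSON at all
            continue
        if not isinstance(parsed, dict):
            rewards.append(0.0)  # JSON, but not an object
            continue
        term = parsed.get("diagnoses")
        on_list = isinstance(term, str) and term in allowed_terms
        rewards.append(1.0 if on_list else 0.2)
    return rewards

print(term_reward(['{"diagnoses": "colorectal cancer"}', "not json"]))
```

The small partial score for well-formed JSON is one possible design choice: it separates "right format, wrong term" from "no usable output".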

hidden tree
#

Well, I did SFT to create a non-reasoning Qwen fine-tune (for free on Colab). Are you recommending I look into GRPO training a reasoning model instead?

obsidian crown
#

I'm trying to figure out the best answer, but it also depends on how heavy the terms list is going to be. If it's large, a RAG option may be better, but I was going with using Unsloth to fine-tune a model based on what you want your output to be.
Such as like this:

Output: {"diagnoses": "colorectal cancer"}
hidden tree
#

The terms list has to be absolute. The idea is that, in the far future, these CSVs can be uploaded to a database where we can see "150 people have colorectal cancer." If a CSV has a value that says "colorectal cancers" (with an s) by accident, the database will reject it. The lists, combined, are 6,000 tokens, but can be split into separate lists.

Unfortunately, the lists are updated every few weeks (eg with new diagnoses) so if I fine-tune "colorectal cancer" into a model, but the list changes "colorectal cancer" to "colon cancer" and "rectal cancer", then the database will reject the results of my model.
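Since the lists change every few weeks, one option is to keep them outside the model and check every output against whichever list is current. Below is a minimal sketch of that idea, assuming the list is a CSV with a single `term` column and that the model returns JSON with a `diagnoses` key; both the layout and the function names are assumptions made for illustration.

```python
import csv
import io
import json

def load_terms(csv_text):
    """Read allowed terms from CSV text with a 'term' column (assumed layout)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["term"].strip() for row in reader if row.get("term")}

def is_accepted(model_output, allowed_terms, key="diagnoses"):
    """True only if the output is a JSON object whose value exactly matches a listed term."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    value = parsed.get(key)
    return isinstance(value, str) and value in allowed_terms

# The same model output is accepted under the old list and rejected under the new one.
old_list = load_terms("term\ncolorectal cancer\n")
new_list = load_terms("term\ncolon cancer\nrectal cancer\n")
output = '{"diagnoses": "colorectal cancer"}'
print(is_accepted(output, old_list), is_accepted(output, new_list))
```

Reloading the list at run time means an update changes what is accepted without touching the model.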

#

The goal is to overfit, haha

obsidian crown
#

Oh, now I see your issue more clearly. That is kinda rough, because you're exactly right that the accuracy would have to be absolute for the output.

#

You kinda got me there haha 😄

#

I'm trying to think about the best way to go about it, and I keep backtracking to what would be better. I'll think about it for a bit and come back to you, but RAG/agents seem like the route to go if the terms can change. I just don't know that I would 100% trust the outputs.

hidden tree
#

Yeah, I understand what you're saying; I'll definitely make a post-processing script where I fuzzy match the values or delete the data altogether, regardless of how well the model runs. Honestly, though, thanks so much for all of the advice, haha! You got me googling GRPO and interested in your CSV model
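A post-processing step like that could look roughly like the sketch below. It uses `difflib.get_close_matches` from the Python standard library; the function name `snap_to_list` and the 0.9 cutoff are arbitrary choices for illustration, not values from the thread.

```python
import difflib

def snap_to_list(value, allowed_terms, cutoff=0.9):
    """Return the listed term closest in spelling to value, or None to drop the value."""
    if value in allowed_terms:
        return value
    matches = difflib.get_close_matches(value, list(allowed_terms), n=1, cutoff=cutoff)
    return matches[0] if matches else None

terms = ["colorectal cancer", "gastric cancer"]
print(snap_to_list("colorectal cancers", terms))  # colorectal cancer
print(snap_to_list("stomach cancer", terms))      # None
```

A high cutoff repairs near-miss spellings such as a stray plural and discards everything else, so only listed terms reach the CSV. It only compares spelling, so a synonym is dropped rather than mapped.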

obsidian crown
#

Oh well, the CSV model uses the HuggingFace Inference API to grab answers from the base model: Qwen/Qwen2.5-Coder-32B-Instruct

hidden tree
#

I can think of a variety of things this would be useful for-I use CSVs so often, haha!

obsidian crown
#

I was thinking about how to get this done. This isn't a direct way to solve the issue, but it's something you could tinker with that may help your workflow, so you're not having to go to so many places to compile a solution. I started using n8n a month ago, and it was somewhat difficult to grasp at first. Watching guides from others helped a lot with what I was trying to get done, and I thought that with how many tools exist on n8n, maybe you could search around and find a solution to yours. In the link I sent you, there is info about the available parsing tools that help get a specific output.

hidden tree
#

Hey @obsidian crown! It seems like a really neat tool, and it honestly seems like it could be useful for this project. My fear is that I've never had experience with a software like this and that it'll take more time to learn and implement n8n than if I just did it without... Do you think it'd be worth it to learn?

obsidian crown
#

@hidden tree So, I would say it's worth doing if it's an ongoing system, because it can be tweaked or rerouted to another n8n template if necessary, and it has so many integrations. But if you're trying to do this without any cost to yourself, then I would find a method using scripts and open source models like you were. I was actually kind of intrigued by this, because I tried to work out in my head how I would get this done myself. I'm trying to get better at applying agents and creating applications around LLMs. If you're interested in maybe prototyping an app like this on HuggingFace Spaces, I would actually be down. I think I could learn from something like this, and the end product could be something you use for your use case. Let me know, but I will say the trial for n8n is only 14 days, so if you do end up liking it, beware that it does cost. So, if you want a free option, I recommend following the path you were on. I only suggested it because I know accuracy and less supervision over a process are usually what we all look for, and with open source tools or projects that are still in the works, we sometimes end up checking the results a lot more than we want to. I sometimes find that with a paid service I have to do that less.

scenic ingot
#

Use an LLM for medical extraction. Once it extracts the medical terms from the text chunk, you can use vector search with a vector database like Chroma to find the closest terms. For your example, the LLM extracts "colon cancer", then you search for it in the vector database and retrieve the most relevant entries (you can use a custom embedding model specialized for medical terms). Of course, you must store your keywords in the vector database first. It's like classic RAG, but with medical terms instead of chunks.
Then you take the top 5 similar medical terms and ask the LLM again whether one of them is medically the same term. If the LLM returns true, you can replace the term, e.g. colon cancer -> colorectal cancer.

#

And of course you should save the embedding of every medical term to Chroma separately.
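The retrieval step can be sketched without any external service. The block below is a stand-in, not the Chroma API: it swaps the vector database for a plain list and the medical embedding model for a toy character-trigram vector, so it only captures spelling overlap and not medical meaning. All names (`embed`, `build_index`, `top_k`) are hypothetical; in a real setup `embed` would call an actual embedding model.

```python
import math
from collections import Counter

def embed(text, n=3):
    """Toy embedding: counts of character trigrams (stand-in for a real model)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(terms):
    """Embed every term separately: one entry per term, not per text chunk."""
    return [(term, embed(term)) for term in terms]

def top_k(query, index, k=5):
    """Return the k listed terms most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [term for term, _ in ranked[:k]]

index = build_index(["colorectal cancer", "gastric cancer", "FH mutation", "FKTN mutation"])
print(top_k("colon cancer", index, k=2))
```

The short list returned by `top_k` is what would be handed to the LLM for the "is one of these medically the same term?" check.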

hidden tree
#

@obsidian crown That's pretty darn cool! I'd be down to prototype in the future, but I first want to finish designing the project for use on a local computer. Given that this project is a means to an end for now, I don't think I'm willing to invest in n8n yet. From my googling, I read you can either use n8n's website and pay, or download it to a local computer and use it for free. Is this right?

#

@scenic ingot Thank you for the advice! Looking into chroma now. I really like this plan.

hidden tree
#

That's actually the one I used, and it worked brilliantly, even pretty good on its math skills (eg, more or less than 20 tumors)-something I wouldn't expect from a RAG. I'd actually like your advice on the second AI you mentioned, unless I'm asking for too much help.

"Then you get the top 5 similar medical terms and ask it to LLM again for if one of them is medically same term"
I have a list of mutations, but whenever a mutation that isn't in the list is mentioned in the medical record input, I'd like the output to be "Other mutation" with the specific mutation mentioned in the details, eg:
• Input: “Patient has a FHL1 mutation” → Final output of program: {'Patient Diagnosis': 'Other mutation', 'Diagnosis details': 'FHL1 mutation'}
Right now, the RAG does not choose "Other mutation" from the list, but the thing closest in spelling (FH mutation, FKTN mutation, etc). I like your method of letting a RAG choose the top 5 results and an LM choosing based off that. Right now, I'm thinking the workflow should be:
• Input: “Patient has a FHL1 mutation” → LM #1 analyzes input and generates: {'Patient Diagnosis': 'FHL1 mutation'} → RAG analyzes "Other mutation - FHL1 mutation" and outputs a list (eg FH mutation, FKTN mutation) → concatenate "Other mutation" to the list → LM #2 analyzes the short list and {'Patient Diagnosis': 'FHL1 mutation'} and, seeing the best option is "Other mutation", generates {'Patient Diagnosis': 'Other mutation', 'Diagnosis details': 'FHL1 mutation'} → run RAG one more time on "Other mutation" to make sure spelling/formatting is correct
Assuming that you think this is a good workflow, how do you think I should get the second LM? I would guess it'd have to be medically literate enough to not hallucinate medical facts and high-parameter enough to perform math correctly, but I'd also like to fine-tune it to prevent formatting errors. Should I keep playing with Unsloth, or look into fine-tuning medical LMs?
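The workflow above can be sketched as one function with the two model steps passed in as callables. This is a rough outline under stated assumptions: `resolve`, `stub_retrieve`, and `stub_judge` are hypothetical names, and the stubs only imitate the RAG lookup and LM #2 so the control flow can be run on its own.

```python
def resolve(extracted, retrieve, judge, fallback="Other mutation"):
    """Map an extracted term to a listed term, or to the fallback plus details."""
    candidates = list(retrieve(extracted))
    if fallback not in candidates:
        candidates.append(fallback)  # the fallback is always an available choice
    choice = judge(extracted, candidates)
    if choice not in candidates:
        choice = fallback  # never trust an answer that is not on the short list
    result = {"Patient Diagnosis": choice}
    if choice == fallback:
        result["Diagnosis details"] = extracted
    return result

def stub_retrieve(term):
    # Stand-in for the RAG step: returns the spelling neighbours from the example.
    return ["FH mutation", "FKTN mutation"]

def stub_judge(term, candidates):
    # Stand-in for LM #2: accepts only an exact, case-insensitive match.
    for candidate in candidates:
        if candidate.lower() == term.lower():
            return candidate
    return "Other mutation"

print(resolve("FHL1 mutation", stub_retrieve, stub_judge))
```

Checking that the choice is on the short list plays the role of the final spelling and formatting pass, since every candidate comes straight from the list.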

hidden tree
#

@scenic ingot you know what? One-shot Qwen2.5 7b isn't doing a bad job at all, despite all of the medical jargon. I think I'll just fine-tune Qwen again, haha!

uncut oyster
#

You might want to try Levenshtein distance

scenic ingot
# uncut oyster You might want to try Levenshtein distance

It may not work as expected. For example, in Turkish, "Verem" and "Tüberküloz" are the same thing, but their Levenshtein distance is large. An embedding model treats them as nearly the same, but this algorithm can't capture that type of similarity.
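For reference, the distance itself is short to write down. The sketch below is the standard dynamic-programming version, written for illustration rather than taken from any library; it shows why a spelling slip scores as close while two synonyms score as far apart.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]

print(levenshtein("colorectal cancer", "colorectal cancers"))  # 1
print(levenshtein("verem", "tüberküloz"))
```

The two Turkish words differ in length by five characters and "v" and "m" never appear in the longer word, so their distance is more than half of that word's length even though they mean the same thing.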

#

@hidden tree Current LLMs are good enough for this task, I think. If you fine-tune it for a specific output, you may not be able to use one LLM for both text extraction and verification for RAG. If it can handle the task without fine-tuning, I would use it as is. Gemma 3 has a tokenizer that can spell out medical terms with fewer tokens. This makes me think they might have used more medical data to train it. You can try Gemma 3 12b or 27b first.

hidden tree
# uncut oyster You might want to try Levenshtein distance

Y'all are so smart... I've never even heard of that, that's sick! And ironically, I learned it's very popular in my field of study. Who knew? After a bit of googling, I think Alper is right; if the program records stomach cancer, and the corresponding diagnosis on the list is gastric cancer, then the semantic meaning of the word is most valuable, as opposed to fuzzy-like matching with Levenshtein.

hidden tree
#

@scenic ingot Oh man... I just tried 4-shot prompting out-of-the-box Gemma 3 27b, and it does pretty well! And it only added 3.5k tokens to the prompt (6-8k altogether), which isn't bad at all, haha! It's looking like I didn't need to fine-tune to begin with!
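The actual prompt isn't shown in the thread, so the following is only a guess at its general shape: instructions, the term list, a few worked examples, then the new record. The wording, the `build_prompt` name, and the sample records are all hypothetical.

```python
import json

def build_prompt(allowed_terms, examples, record):
    """Assemble a few-shot prompt: instructions, term list, worked examples, new record."""
    lines = [
        "Extract the diagnosis from the medical record.",
        "Answer with JSON and use only a term from this list:",
    ]
    lines += [f"- {term}" for term in allowed_terms]
    for text, answer in examples:
        lines += ["", f"Record: {text}", f"Output: {json.dumps(answer)}"]
    lines += ["", f"Record: {record}", "Output:"]
    return "\n".join(lines)

# Made-up example records, for illustration only.
examples = [
    ("Patient has colorectal cancer.", {"diagnoses": "colorectal cancer"}),
    ("Biopsy confirmed gastric cancer.", {"diagnoses": "gastric cancer"}),
]
print(build_prompt(["colorectal cancer", "gastric cancer"], examples, "Patient has a FHL1 mutation"))
```

Because the term list is pasted in at call time, a list update changes the prompt instead of requiring another round of fine-tuning.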

#

I'll keep playing with it, but bro... I think I'm finally wrapping this project up! Thank you Alper for literally master-minding this project for me, haha! Thanks to all of you who suggested stuff! I'll let y'all know how it goes!

scenic ingot
#

OK, I wonder how it goes too. I'll be waiting.

jovial mirage
#

Hey @hidden tree, what is your current architecture on this? Gemma + RAG (Chroma) with no fine-tuning at all? Are you satisfied with your current results?