#Is it possible to use RAG with LlamaV2 to consistently produce a valid Json of a specific structure?


kind coyote
#

Say I have the following template for a json:

{
  "Group_i": {
    "Person_j": {
      "Mood": "",
      "ActionData": {
        "Action_k": {
          "ActionBehavior": ""
        }
      }
    }
  }
}

iterators:
i, j, k

We want to be able to set limits:

i to 3 groups (adjustable)
j to 3 people (adjustable)
k to 5 actions (adjustable)

Each variable for "Mood" or "ActionBehavior" should also be limit-bound

The intention is to get something like this, where natural language can be translated into a consistently valid json within boundaries

Maybe saying something like "I want 1 group with 2 people, first person be happy and will have 2 actions: clap, sing, and second person will be sad and have 1 action, clap" should translate to this

{
  "Group_1": {
    "Person_1": {
      "Mood": "Happy",
      "ActionData": {
        "Action_1": {
          "ActionBehavior": "Clap"
        },
        "Action_2": {
          "ActionBehavior": "Sing"
        }
      }
    },
    "Person_2": {
      "Mood": "Sad",
      "ActionData": {
        "Action_1": {
          "ActionBehavior": "Clap"
        }
      }
    }
  }
}

What would be the most efficient/effective way of going about building a chatbot that will always output a valid json of this structure?

My initial thought was to use a command-line interface because it made the most sense, or some type of coarse-grained configuration. But I wanted to know if there's a way to push the boundary and make an LLM output something extremely configurable like this from natural language in a consistent manner. Is this within the limits of what langchain + Llama can do at the moment?

kind coyote
#

Sorry, maybe my initial description was unclear. I've updated it. Is anyone able to chime in?

primal goblet
# kind coyote Sorry, maybe my initial description was unclear. I've updated it. Is anyone able...

This exists but i've never tested it myself: https://til.simonwillison.net/llms/llama-cpp-python-grammars
You can also inject some examples in the prompt to guide the model, and add a rule based testing function or custom grammar to test out the final output since the JSON schema you want to use looks simple enough.

llama.cpp recently added the ability to control the output of any model using a grammar.
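
A rough sketch of what such a grammar could look like for the bounded action list, generated from Python. The helper names (`bounded`, `action_grammar`) are made up for illustration, and the grammar text has not been run against llama.cpp here, so treat it as a starting point rather than a verified grammar.

```python
def bounded(rule: str, sep: str, max_items: int) -> str:
    """GBNF expression for 1..max_items repetitions of `rule`, separated by `sep`."""
    tail = ""
    for _ in range(max_items - 1):
        # Nest optional groups so at most max_items items can ever match.
        tail = f"({sep} {rule} {tail})?" if tail else f"({sep} {rule})?"
    return f"{rule} {tail}".strip()


def action_grammar(behaviors: list, max_actions: int) -> str:
    """Build a grammar that only allows a JSON array of known actions."""
    # Each allowed value becomes a quoted JSON string literal, e.g. "\"Clap\"".
    names = " | ".join(f'"\\"{b}\\""' for b in behaviors)
    items = bounded("action", '"," ws', max_actions)
    return "\n".join([
        f'root ::= "[" ws {items} ws "]"',
        r'action ::= "{" ws "\"ActionBehavior\"" ws ":" ws name ws "}"',
        f"name ::= {names}",
        r"ws ::= [ \t\n]*",
    ])


grammar = action_grammar(["Clap", "Sing"], max_actions=3)
print(grammar)
```

With llama-cpp-python, a string like this is typically loaded with `LlamaGrammar.from_string(...)` and passed as the `grammar=` argument of the model call, as the linked TIL describes. That step needs a local model and is not shown here.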

kind coyote
#

i was initially gonna do like

the initial text prompt goes in
then the output hits the pydantic verifier first, which replies to it and tells it if it's wrong
then it replies again to the pydantic until it gets it correct and can finally give us the final output

but idk if that's too redundant/repetitive
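
A minimal sketch of that validate-and-retry loop, assuming Pydantic v2. `call_llm` is a hypothetical stand-in for whatever actually calls the model, and `fake_llm` below only simulates one bad reply followed by a good one.

```python
from typing import Callable

from pydantic import BaseModel, ValidationError


class Action(BaseModel):
    ActionBehavior: str


def generate_valid(prompt: str, call_llm: Callable[[str], str], max_attempts: int = 3) -> Action:
    """Ask the model, validate the reply, and feed errors back until it passes."""
    message = prompt
    for _ in range(max_attempts):
        raw = call_llm(message)
        try:
            return Action.model_validate_json(raw)
        except ValidationError as err:
            # Tell the model what was wrong and ask again.
            message = (
                f"{prompt}\n\nYour last reply was invalid:\n{err}\n"
                "Reply with corrected JSON only."
            )
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts")


# Hypothetical model: the first reply has a trailing comma, the second is valid.
_replies = iter(['{"ActionBehavior": "Clap",}', '{"ActionBehavior": "Clap"}'])


def fake_llm(message: str) -> str:
    return next(_replies)


result = generate_valid("One action: clap", fake_llm)
print(result.ActionBehavior)
```

Capping the attempts keeps the loop from running forever when the model never converges. Whether the extra round trips are worth it depends on how often the first reply fails.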

glacial osprey
# kind coyote Say I have the following template for a json: ``` { "Group_i": { "Person_j...

Few things. First, you need to use array wherever it makes sense.

It should be:
Groups: [{name: str, persons: [Person object]}]
And person object be something like:
{name: str, mood: str, actions: [actionObject]}

Do use Pydantic both for validation and for outputting a JSON schema; see model_json_schema (Pydantic v2).

Second, break down your prompts into the smallest units possible. The longer the model has to respond, the more prone it is to error. So in this case, the smallest unit is: given a mood, provide an array of action objects. This is where your prompt engineering should focus, and you only need to provide the schema of actionObject, with mood as a parameter.

Finally, the rest is just a concurrent call for every Person object.

#

Assuming you have a range of options, you can also do enum, see documentation.

Your actionObject can be something like:
{"name": enum, "order": int, "description": str} so [{"name": "laugh", "order": 0, "description": str}, {"name": "clap", "order": 1, "description": str}]
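
A sketch of the array layout and enums described above, assuming Pydantic v2. The class names, the `Scene` wrapper, and the enum members are illustrative choices. The limits of 3, 3, and 5 come from the original question.

```python
from enum import Enum
from typing import List

from pydantic import BaseModel, Field, ValidationError


class Mood(str, Enum):
    happy = "Happy"
    sad = "Sad"


class ActionName(str, Enum):
    clap = "Clap"
    sing = "Sing"


class Action(BaseModel):
    name: ActionName
    order: int = Field(ge=0)
    description: str = ""


class Person(BaseModel):
    name: str
    mood: Mood
    actions: List[Action] = Field(max_length=5)  # k: at most 5 actions


class Group(BaseModel):
    name: str
    persons: List[Person] = Field(max_length=3)  # j: at most 3 people


class Scene(BaseModel):
    groups: List[Group] = Field(max_length=3)  # i: at most 3 groups


# JSON schema as a dict, e.g. to include in the prompt.
schema = Scene.model_json_schema()

example = {
    "groups": [{
        "name": "Group_1",
        "persons": [
            {"name": "Person_1", "mood": "Happy",
             "actions": [{"name": "Clap", "order": 0}, {"name": "Sing", "order": 1}]},
            {"name": "Person_2", "mood": "Sad",
             "actions": [{"name": "Clap", "order": 0}]},
        ],
    }]
}
scene = Scene.model_validate(example)

# Six actions exceed the limit of five, so validation should fail.
try:
    Person(name="x", mood="Happy",
           actions=[{"name": "Clap", "order": n} for n in range(6)])
    too_many_rejected = False
except ValidationError:
    too_many_rejected = True
```

Making the limits adjustable would mean building the models dynamically or checking the counts in a separate validator. That is left out here.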

kind coyote
# glacial osprey Few things. First, you need to use array wherever it makes sense. It should be:...

The pydantic stuff is really helpful, thanks!

But about this
Second, break down your prompts into the smallest units possible. The longer the model has to respond, the more prone it is to error. So in this case, the smallest unit is: given a mood, provide an array of action objects. This is where your prompt engineering should focus, and you only need to provide the schema of actionObject, with mood as a parameter.

if I'm locking the inputs/prompts into this type of segmented language it isn't really natural and kind of defeats the purpose of a llm doesn't it 😔

#

is there any way around it

glacial osprey
#

You don't expect a single prompt to do everything. That's the least efficient way to use LLMs

#

It also makes it "modular". If you use chat and detect in a chat message a need to generate one or more persons, your program should make those calls concurrently

#

By running a long prompt, you have a single point of failure.

#

So you have concurrent calls to do retries

#

Each Person object performs request independently that is

#

In this sense, you can keep the chat system the same, and add other features.

#

Or even modify your action ones without changing anything about your chat system
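
A small sketch of the per-person concurrent calls with independent retries, using only the standard library. `fake_llm` is a hypothetical stand-in for a real async model call, and it deliberately returns truncated JSON on the first attempt so the retry path is exercised.

```python
import asyncio
import json


async def fake_llm(prompt: str, attempt: int) -> str:
    """Hypothetical model call; the first attempt returns broken JSON."""
    await asyncio.sleep(0)
    if attempt == 0:
        return '[{"name": "Clap"'  # truncated on purpose
    return '[{"name": "Clap", "order": 0}]'


async def actions_for(person: dict, max_attempts: int = 3) -> dict:
    """Fill in the actions for one person, retrying only this person on failure."""
    prompt = f"Given the mood {person['mood']!r}, return a JSON array of actions."
    for attempt in range(max_attempts):
        raw = await fake_llm(prompt, attempt)
        try:
            return {**person, "actions": json.loads(raw)}
        except json.JSONDecodeError:
            continue
    raise RuntimeError(f"no valid actions for {person['name']}")


async def build_all(persons: list) -> list:
    # One independent request per person, all in flight at once.
    return await asyncio.gather(*(actions_for(p) for p in persons))


people = [
    {"name": "Person_1", "mood": "Happy"},
    {"name": "Person_2", "mood": "Sad"},
]
results = asyncio.run(build_all(people))
print(results)
```

A failure for one person costs one small retry instead of regenerating the whole document, which is the "single point of failure" argument above.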

kind coyote
#

if say the user spawns 3000 person objects with 'give me 3000 persons..' then wouldn't it just be recurrently asking them the same call over and over

#

assuming chaining ofc

#

is there a way to use RAG to like say, 'if it's more than 10 people we are gonna change to a different configuration strategy'

#

or something like that

glacial osprey
#

It’s the same endpoint yup. See from another perspective, what if 3000 person objects in aggregate go over your context length?

glacial osprey
#

Hence if it’s RAG, it’s also a chaining of API calls to your vector DB, then feed into your next LLM endpoint

#

I’m on my phone and can’t read the code well, but it seems you should be able to make it async

primal goblet
# kind coyote is there a way to use RAG to like say, 'if it's more than 10 people we are gonna...

The output for 3k people would be above the token limit for most (all?) models. You can use chained prompts in the backend to decompose the problem. The first step could be to get how many people are expected in the prompt. Not sure I understand your use case though, if you're generating a random json of 3000 people, then you don't really need AI. If it's a text with information about 3000 people, what kind of cursed project is this? There's no point in accounting for edge cases that will never occur.

kind coyote
# primal goblet The output for 3k people would be above the token limit for most (all?) models. ...

I'm just thinking about the upper limit/edge case because if there are 100 groups it will just take 30 people per group to get to that number

The use case is essentially to use natural language to:
1. configure a json with array/list of nested objects, while also
2. Being able to map natural language to consistent integer digits.
3. It needs to always check to be a valid json
4. Must follow a specific schema/model (which is the biggest thing pydantic helps with)

i wasn't sure what the upper boundary was

#

It doesn't have to be people necessarily

#

But it needs to be array/collections within array/collections and with a clearly defined and adhered number of attributes

I think ill use 3 layers max which is why i put group:person:actionbehavior

#

How would one store the running count for i,j,k in the model to make sure it always assigns the correct key name to each kvp though

#

Or is there a better way to differentiate key names for each object in the json array to avoid collision
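
One possible answer to the key-naming question, sketched under the assumption that the model returns plain arrays: the model never tracks i, j, k at all, and ordinary code assigns the keys from list positions afterwards. `to_keyed` is a hypothetical helper name.

```python
import json


def to_keyed(groups: list) -> dict:
    """Turn array-shaped output into the Group_i / Person_j / Action_k template."""
    out = {}
    for i, group in enumerate(groups, start=1):
        persons = {}
        for j, person in enumerate(group["persons"], start=1):
            actions = {
                f"Action_{k}": {"ActionBehavior": behavior}
                for k, behavior in enumerate(person["actions"], start=1)
            }
            persons[f"Person_{j}"] = {"Mood": person["mood"], "ActionData": actions}
        out[f"Group_{i}"] = persons
    return out


groups = [{
    "persons": [
        {"mood": "Happy", "actions": ["Clap", "Sing"]},
        {"mood": "Sad", "actions": ["Clap"]},
    ]
}]
keyed = to_keyed(groups)
print(json.dumps(keyed, indent=2))
```

Because the keys come from `enumerate`, they cannot collide or skip a number, whatever the model produced.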

primal goblet
# kind coyote I'm just thinking about the upper limit/edge case because if there are 100 group...

I can't tell you exactly how to do it since this would require experimentation, but you can always find ways to decompose the problem in the order of the nested objects you want and then aggregate it all in the end. For example, a first step defines how many groups there are, then how many people in each group, then how many actions or whatever subitems for the persons. Return everything as arrays Groups:[group1,group2...]; Persons: [Person1:group1, Person2:Group2] then aggregate it into the final JSON. Just what I can think of right now. I've only dealt with the opposite usecase of turning jsons into outputs.
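
A sketch of the aggregation step, assuming each stage returned a flat array whose items name their parent. The field names (`group`, `person`, `behavior`) and the sample data are made up for illustration, and person names are assumed to be unique across groups.

```python
from collections import defaultdict

# Hypothetical outputs of three small prompts, one per nesting level.
groups = ["group1", "group2"]
persons = [
    {"name": "Person1", "group": "group1", "mood": "Happy"},
    {"name": "Person2", "group": "group2", "mood": "Sad"},
]
actions = [
    {"person": "Person1", "behavior": "Clap"},
    {"person": "Person1", "behavior": "Sing"},
    {"person": "Person2", "behavior": "Clap"},
]


def aggregate(groups: list, persons: list, actions: list) -> dict:
    """Join the flat stage outputs into one nested document."""
    actions_by_person = defaultdict(list)
    for action in actions:
        actions_by_person[action["person"]].append(action["behavior"])

    persons_by_group = defaultdict(list)
    for person in persons:
        persons_by_group[person["group"]].append({
            "name": person["name"],
            "mood": person["mood"],
            "actions": actions_by_person[person["name"]],
        })

    return {"groups": [{"name": g, "persons": persons_by_group[g]} for g in groups]}


nested = aggregate(groups, persons, actions)
print(nested)
```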

kind coyote
#

The thing is that method is very similar to how i already programmed it as a command line

It feels like the llm is offering not much value other than language interpretation

glacial osprey
#

Yeah, 100 groups with 30 people is hardly an edge case (in my PoC we have more than that, and that's without invoking customer data).

#

Are you running on a single script? Is that why?

primal goblet
#

unless you're fine with them working half the time

glacial osprey
#

I think there is a gap between how you would program an application vs. just prototyping

kind coyote
glacial osprey
#

Here, FastAPI is used to write the endpoint that invokes the model-prediction service. An LLM call is basically model prediction as well. So this multi-container design is a basic application that allows better scalability (within the bounds of a single machine)

#

I see many chat examples use the command line to invoke a single script or something similar, but that's really just for personal use

#

For the Medium article, if you take out the Streamlit part and keep the API endpoints as designed, you can just have a Python script make concurrent calls to the API as well

#

But if these are unfamiliar, then that's what I imagine the gap would be

kind coyote
#

I guess for me it was more of a thought experiment of 'what can go wrong and how do i deal with it when it does' lol

#

The caveat here is that today is literally the first day I learned that prompts can be redirected to individual endpoints like this, it's a literal TIL

#

Thank you for the heads up, i need to read more

glacial osprey
# kind coyote Thank you for the heads up, i need to read more

Glad we could help 🙂 As long as you go back to the ML fundamentals, prompting is really just model prediction, so how to scale model prediction in production is the same (well, with some additional special tooling as well). But also no worries on this, I find many data scientists aren't familiar with these anyways

kind coyote
glacial osprey
# kind coyote If prompting is essentially request based prediction, how do GPT and similar sto...

Models themselves don't store it at all. Unless specialized, all the model typically gets is just a string. This is also why you often see both a base model and a chat model. Base models are trained just to predict the next token from all the data fed in, and chat models are specifically fine-tuned to respond to prompts.

You can see OpenAssistant/llama2-13b-megacode2-oasst and OpenAssistant/codellama-13b-oasst-sft-v10, and I quote:

Prompt dialogue template:

<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant

The model input can contain multiple conversation turns between user and assistant, e.g.

<|im_start|>user
{prompt 1}<|im_end|>
<|im_start|>assistant
{reply 1}<|im_end|>
<|im_start|>user
{prompt 2}<|im_end|>
<|im_start|>assistant
(...)

You can basically treat the {} parts as string replacement, as with f-strings. The rest is relatively easy: cache the whole conversation, maybe separated by session, and prepend the last few turns of conversation until the context length runs out.
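
A sketch of that string replacement with history trimming, using the template quoted above. `build_prompt` and `render_turn` are hypothetical helper names, and the budget is counted in characters as a crude stand-in for tokens. A real system would count tokens with the model's tokenizer.

```python
def render_turn(role: str, content: str) -> str:
    """Render one turn in the quoted <|im_start|> ... <|im_end|> format."""
    return f"<|im_start|>{role}\n{content}<|im_end|>\n"


def build_prompt(system_message: str, history: list, prompt: str, max_chars: int = 2000) -> str:
    """Prepend as many recent (user, assistant) exchanges as fit in the budget."""
    head = render_turn("system", system_message)
    tail = render_turn("user", prompt) + "<|im_start|>assistant\n"
    budget = max_chars - len(head) - len(tail)

    kept = []
    # Walk backwards so the newest exchanges survive when space runs out.
    for user_msg, reply in reversed(history):
        block = render_turn("user", user_msg) + render_turn("assistant", reply)
        if len(block) > budget:
            break
        kept.append(block)
        budget -= len(block)

    return head + "".join(reversed(kept)) + tail


# In practice the history would come from a cache keyed by session.
history = [
    ("first question", "first answer"),
    ("second question", "second answer"),
]
full = build_prompt("You output JSON.", history, "third question", max_chars=10_000)
short = build_prompt("You output JSON.", history, "third question", max_chars=210)
print(short)
```

With the small budget only the most recent exchange is kept, so the oldest turns are the first to drop out.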

#

No different from doing RAG, just that this is using cache store instead, like Redis

#

And the reason you have a specific format when calling the API is so they can easily do the above string replacement

kind coyote
# glacial osprey Take a very simple example: https://rihab-feki.medium.com/deploying-machine-lear...

I have a question, how would one adjust this architecture such that the model can pass the json outputs to an additional pydantic validation layer

Would it literally just be a different layer/function in the FastAPI endpoint?

I was thinking of separating the different nested objects, and then the validation/pydantic layer is coded to recognize the relevant key-value pairs and validate according to that specific object
