Is it possible to use RAG with LlamaV2 to consistently produce a valid Json of a specific structure? | Learn AI Together | Page 1

kind coyote Jan 3, 2024, 10:17 AM

#

Say I have the following template for a json:

{
  "Group_i": {
    "Person_j": {
      "Mood": "",
      "ActionData": {
        "Action_k": {
          "ActionBehavior": ""
        }
      }
    }
  }
}

iterators:
i, j, k

We want to be able to set limits:

i to 3 groups (adjustable)
j to 3 people (adjustable)
k to 5 actions (adjustable)

Each variable for "Mood" or "ActionBehavior" should also be limit-bound

The intention is to get something like this, where natural language can be translated into a consistently valid json within boundaries

Maybe saying something like "I want 1 group with 2 people, first person be happy and will have 2 actions: clap, sing, and second person will be sad and have 1 action, clap" should translate to this

{
  "Group_1": {
    "Person_1": {
      "Mood": "Happy",
      "ActionData": {
        "Action_1": {
          "ActionBehavior": "Clap"
        },
        "Action_2": {
          "ActionBehavior": "Sing"
        }
      }
    },
    "Person_2": {
      "Mood": "Sad",
      "ActionData": {
        "Action_1": {
          "ActionBehavior": "Clap"
        }
      }
    }
  }
}

What would be the most efficient/effective way of going about building a chatbot that will always output a valid json of this structure?

My initial thought was to use a command-line interface because it made the most sense, or some type of coarse-grained configuration. But I wanted to know if there's a way to push the boundary and make an LLM output something extremely configurable like this through natural language in a consistent manner. Is this within the limits of what LangChain + Llama can do at the moment?

kind coyote Jan 3, 2024, 11:26 AM

#

Sorry, maybe my initial description was unclear. I've updated it. Is anyone able to chime in?

primal goblet Jan 3, 2024, 12:10 PM

#

kind coyote Sorry, maybe my initial description was unclear. I've updated it. Is anyone able...

This exists but I've never tested it myself: https://til.simonwillison.net/llms/llama-cpp-python-grammars
You can also inject some examples in the prompt to guide the model, and add a rule-based testing function or custom grammar to check the final output, since the JSON schema you want to use looks simple enough.

Using llama-cpp-python grammars to generate JSON

llama.cpp recently added the ability to control the output of any model using a grammar.
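
A rule-based testing function of the kind mentioned above could be sketched as follows. It uses only the standard library and checks the original `Group_i` / `Person_j` / `Action_k` template. The limits, the allowed values, and the names `validate_output` and `_check_keys` are illustrative assumptions, not something from the thread.

```python
import json

# Illustrative limits and vocabularies (assumptions; adjust to your use case).
MAX_GROUPS, MAX_PEOPLE, MAX_ACTIONS = 3, 3, 5
ALLOWED_MOODS = ("Happy", "Sad")
ALLOWED_ACTIONS = ("Clap", "Sing")


def _check_keys(obj, prefix, limit, errors, where):
    """Keys must be exactly prefix_1 .. prefix_n, with n no larger than limit."""
    if not isinstance(obj, dict) or not obj:
        errors.append(f"{where}: expected a non-empty object")
        return False
    expected = [f"{prefix}_{n}" for n in range(1, len(obj) + 1)]
    if list(obj) != expected:
        errors.append(f"{where}: keys should be {expected}, got {list(obj)}")
    if len(obj) > limit:
        errors.append(f"{where}: at most {limit} {prefix} entries allowed")
    return True


def validate_output(text):
    """Return a list of error messages; an empty list means the output is valid."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    errors = []
    if not _check_keys(data, "Group", MAX_GROUPS, errors, "root"):
        return errors
    for g_key, group in data.items():
        if not _check_keys(group, "Person", MAX_PEOPLE, errors, g_key):
            continue
        for p_key, person in group.items():
            where = f"{g_key}.{p_key}"
            if not isinstance(person, dict):
                errors.append(f"{where}: expected an object")
                continue
            if person.get("Mood") not in ALLOWED_MOODS:
                errors.append(f"{where}: Mood must be one of {ALLOWED_MOODS}")
            actions = person.get("ActionData")
            if not _check_keys(actions, "Action", MAX_ACTIONS, errors, where):
                continue
            for a_key, action in actions.items():
                behavior = action.get("ActionBehavior") if isinstance(action, dict) else None
                if behavior not in ALLOWED_ACTIONS:
                    errors.append(f"{where}.{a_key}: invalid ActionBehavior")
    return errors
```

Returning a list of messages rather than a boolean makes it possible to pass the errors back to the model in a follow-up prompt.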

kind coyote Jan 3, 2024, 12:12 PM

#

primal goblet This exists but i've never tested it myself: https://til.simonwillison.net/llms/...

I had a suggestion earlier so i started thinking of using some Pydantic models or something, would it be possible to combine this with that 🤔

#

i was initially gonna do like

the initial text prompt goes in
then the output hits the pydantic verifier first, which replies to it and tells it if its wrong
then it replies again to the pydantic until it gets it correct and can finally give us the final output

but idk if that's too redundant/repetitive
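
The validate-and-reply loop described above might look roughly like this. It is a minimal sketch: `call_model` stands in for whatever function sends a prompt to the LLM and returns its text (hypothetical), and `is_json_object` is a deliberately tiny stand-in for a real validator such as a Pydantic model.

```python
import json


def generate_with_retries(call_model, validate, prompt, max_attempts=3):
    """Call the model, validate the reply, and feed errors back until valid.

    call_model: hypothetical callable taking a prompt string, returning text.
    validate:   callable returning a list of error strings (empty means valid).
    """
    current_prompt = prompt
    errors = []
    for attempt in range(1, max_attempts + 1):
        reply = call_model(current_prompt)
        errors = validate(reply)
        if not errors:
            return json.loads(reply), attempt
        # Tell the model what was wrong and ask for a corrected answer.
        current_prompt = (
            f"{prompt}\n\nYour previous answer was:\n{reply}\n"
            f"It was rejected because: {'; '.join(errors)}\n"
            "Reply again with corrected JSON only."
        )
    raise ValueError(f"no valid output after {max_attempts} attempts: {errors}")


def is_json_object(text):
    """Minimal stand-in validator: the reply must parse as a JSON object."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    return [] if isinstance(parsed, dict) else ["top level must be an object"]
```

Capping the number of attempts keeps the loop from running forever when the model cannot produce a valid answer.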

glacial osprey Jan 3, 2024, 12:52 PM

#

kind coyote Say I have the following template for a json: ``` { "Group_i": { "Person_j...

Few things. First, you need to use array wherever it makes sense.

It should be:
Groups: [{name: str, persons: [Person object]}]
And person object be something like:
{name: str, mood: str, actions: [actionObject]}

Do use Pydantic both for validation and for outputting a JSON schema, see model_json_schema.

Second, break down your prompts to the smallest unit possible. The longer the model has to respond, the more prone it is to error. So in this case, the smallest unit is: given a mood, provide an array of action objects. This is where your prompt engineering should be focused, and you only need to provide the schema of actionObject, with mood as a parameter.

Finally, the rest is just a concurrent call for every Person object.

BaseModel - Pydantic

Data validation using Python type hints

#

Assuming you have a range of options, you can also do enum, see documentation.

Your actionObject can be something like:
{"name": enum, "order": int, "description": str} so [{"name": "laugh", "order": 0, "description": str}, {"name": "clap", "order": 1, "description": str}]

Validation Errors - Pydantic

Data validation using Python type hints
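
The array-based layout with an enum could be expressed as Pydantic models along these lines. This is a sketch assuming Pydantic v2; the class names, the enum members, and the list limits (3 groups, 3 persons, 5 actions, taken from the original question) are illustrative.

```python
from enum import Enum
from typing import List

from pydantic import BaseModel, Field, ValidationError


class ActionName(str, Enum):
    # Allowed action names (illustrative; extend as needed).
    clap = "clap"
    sing = "sing"
    laugh = "laugh"


class Action(BaseModel):
    name: ActionName
    order: int = Field(ge=0)
    description: str = ""


class Person(BaseModel):
    name: str
    mood: str
    actions: List[Action] = Field(max_length=5)  # k limit


class Group(BaseModel):
    name: str
    persons: List[Person] = Field(max_length=3)  # j limit


class Config(BaseModel):
    groups: List[Group] = Field(max_length=3)  # i limit


# JSON schema for the smallest unit, suitable for pasting into a prompt.
action_schema = Action.model_json_schema()
```

Parsing a model reply is then `Config.model_validate_json(reply)`, which raises `ValidationError` when the JSON is malformed, a list is too long, or an action name is outside the enum.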

kind coyote Jan 3, 2024, 12:58 PM

#

glacial osprey Few things. First, you need to use array wherever it makes sense. It should be:...

The pydantic stuff is really helpful, thanks!

But about this
Second, break down your prompts to the smallest unit possible. The longer the model has to respond, the more prone it is to error. So in this case, the smallest unit is: given a mood, provide an array of action objects. This is where your prompt engineering should be focused, and you only need to provide the schema of actionObject, with mood as a parameter.

if I'm locking the inputs/prompts into this type of segmented language, it isn't really natural and kind of defeats the purpose of an LLM, doesn't it 😔

#

is there any way around it

glacial osprey Jan 3, 2024, 1:01 PM

#

kind coyote The pydantic stuff is really helpful, thanks! But about this ```Second, break d...

Untrue. This is actually how most production systems work. You would have multiple endpoints and chain them together. You can have the chat invoke this endpoint.

#

You don't expect a single prompt to do everything. That's the least efficient way to use LLMs

#

It's also making it "modular". If you use chat, and you detect in a chat message a need to generate one or more persons, then your program should invoke that endpoint concurrently

#

By running a long prompt, you have a single point of failure.

#

So you make concurrent calls, and each one can be retried on its own

#

That is, each Person object performs its request independently

#

In this sense, you can keep the chat system the same, and add other features.

#

Or even modify your action ones without changing anything about your chat system
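
One way to picture the concurrent, independently retried calls is the following asyncio sketch. `generate_actions` is a hypothetical async function that calls the action endpoint for one person and raises `ValueError` when the reply fails validation; both names and the error type are assumptions.

```python
import asyncio


async def with_retries(make_call, max_attempts=3):
    """Await make_call() until it succeeds or the attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return await make_call()
        except ValueError as exc:  # e.g. the reply failed validation
            last_error = exc
    raise last_error


async def build_all(persons, generate_actions):
    """Run one independent, retried request per person, all concurrently."""
    tasks = [with_retries(lambda p=p: generate_actions(p)) for p in persons]
    # return_exceptions=True keeps one failed person from sinking the rest.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return dict(zip((p["name"] for p in persons), results))
```

Because each person has its own retry loop, a bad reply for one person only repeats that single small request instead of the whole prompt.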

kind coyote Jan 3, 2024, 1:06 PM

#

if say the user spawns 3000 person objects with 'give me 3000 persons..' then wouldn't it just be recurrently asking them the same call over and over

#

assuming chaining ofc

#

is there a way to use RAG to like say, 'if it's more than 10 people we are gonna change to a different configuration strategy'

#

or something like that

glacial osprey Jan 3, 2024, 1:08 PM

#

It’s the same endpoint, yup. Look at it from another perspective: what if 3000 person objects in aggregate go over your context length?

glacial osprey Jan 3, 2024, 1:08 PM

#

kind coyote is there a way to use RAG to like say, 'if it's more than 10 people we are gonna...

I personally don’t use frameworks, so by deploying as API endpoints, it’s all about how you programmatically solve it

#

Hence if it’s RAG, it’s also a chaining of API calls to your vector DB, then feed into your next LLM endpoint

#

Someone did write about this? https://towardsdatascience.com/async-calls-for-chains-with-langchain-3818c16062ed

Medium

Async calls for Chains with Langchain

How to make Langchain chains work with Async calls to LLMs, speeding up the time it takes to run a sequential long chain.

#

I’m on my phone and can’t read the code well, but it seems you should be able to make it async

primal goblet Jan 3, 2024, 1:22 PM

#

kind coyote is there a way to use RAG to like say, 'if it's more than 10 people we are gonna...

The output for 3k people would be above the token limit for most (all?) models. You can use chained prompts in the backend to decompose the problem. The first step could be to get how many people are expected in the prompt. Not sure I understand your use case though, if you're generating a random json of 3000 people, then you don't really need AI. If it's a text with information about 3000 people, what kind of cursed project is this? There's no point in accounting for edge cases that will never occur.

kind coyote Jan 3, 2024, 1:30 PM

#

primal goblet The output for 3k people would be above the token limit for most (all?) models. ...

I'm just thinking about the upper limit/edge case because if there are 100 groups it will just take 30 people per group to get to that number

The use case is essentially to use natural language to:
1. configure a json with array/list of nested objects, while also
2. Being able to map natural language to consistent integer digits.
3. It needs to always check to be a valid json
4. Must follow a specific schema/model (which is the biggest thing pydantic helps with)

i wasn't sure what the upper boundary was

#

It doesn't have to be people necessarily

#

But it needs to be array/collections within array/collections and with a clearly defined and adhered number of attributes

I think ill use 3 layers max which is why i put group:person:actionbehavior

#

How would one store the running count for i,j,k in the model to make sure it always assigns the correct key name to each kvp though

#

Or is there a better way to differentiate key names for each object in the json array to avoid collision

primal goblet Jan 3, 2024, 1:40 PM

#

kind coyote I'm just thinking about the upper limit/edge case because if there are 100 group...

I can't tell you exactly how to do it since this would require experimentation, but you can always find ways to decompose the problem in the order of the nested objects you want and then aggregate it all at the end. For example, a first step defines how many groups there are, then how many people are in each group, then how many actions or whatever subitems each person has. Return everything as arrays, Groups: [group1, group2...]; Persons: [Person1:group1, Person2:Group2], then aggregate them into the final JSON. That's just what I can think of right now. I've only dealt with the opposite use case of turning JSONs into outputs.
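
The final aggregation step can be done in plain code, which also takes care of the running count for i, j, k: the model only returns lists, and the program assigns the key names. A minimal sketch, where the function name `assemble` and the input shape are assumptions:

```python
import json


def assemble(groups):
    """Build the nested JSON from plain lists, numbering keys in code.

    groups: list of groups; each group is a list of persons;
            each person is a dict with "mood" and a list of "actions".
    """
    result = {}
    for i, persons in enumerate(groups, start=1):
        group = result[f"Group_{i}"] = {}
        for j, person in enumerate(persons, start=1):
            group[f"Person_{j}"] = {
                "Mood": person["mood"],
                "ActionData": {
                    f"Action_{k}": {"ActionBehavior": action}
                    for k, action in enumerate(person["actions"], start=1)
                },
            }
    return result


# The example from the original question: 1 group, 2 people.
example = assemble([[
    {"mood": "Happy", "actions": ["Clap", "Sing"]},
    {"mood": "Sad", "actions": ["Clap"]},
]])
print(json.dumps(example, indent=2))
```

Since `enumerate` generates the indices, key collisions cannot occur and the model never has to keep count.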

kind coyote Jan 3, 2024, 1:43 PM

#

primal goblet I can't tell you exactly how to do it since this would require experimentation, ...

Basically use chaining and have the chat prompt each endpoint like ianyu said yea

#

The thing is that method is very similar to how i already programmed it as a command line

It feels like the llm is offering not much value other than language interpretation

glacial osprey Jan 3, 2024, 1:44 PM

#

Yeah, 100 groups with 30 people is hardly an edge case (in my PoC we have more than that, and that's without invoking customer data).

#

Are you running on a single script? Is that why?

primal goblet Jan 3, 2024, 1:45 PM

#

kind coyote The thing is that method is very similar to how i already programmed it as a com...

lol that's all LLMs are good for right now

#

unless you're fine with them working half the time

glacial osprey Jan 3, 2024, 1:45 PM

#

I think there is a gap between how you would program an application vs. just prototyping

kind coyote Jan 3, 2024, 1:46 PM

#

glacial osprey I think there is a gap between how you would program an application vs. just pro...

Sorry i dont quite follow the context 🤔😳

glacial osprey Jan 3, 2024, 1:47 PM

#

kind coyote Sorry i dont quite follow the context 🤔😳

Take a very simple example: https://rihab-feki.medium.com/deploying-machine-learning-models-with-streamlit-fastapi-and-docker-bb16bbf8eb91

Medium

Deploying Machine Learning models with Streamlit, FastAPI and Docker

A step by step guide to put your ML model into production

#

Here, FastAPI is used to write the endpoint that invokes the model prediction service. An LLM call is basically model prediction as well. So this multi-container design is a basic application that allows better scalability (within the bounds of a single machine)

#

So you can easily invoke 3k calls concurrently, and if you need to, rate limit yourself from FastAPI https://stackoverflow.com/questions/65491184/ratelimit-in-fastapi

Stack Overflow

Ratelimit in Fastapi

How to ratelimit API endpoint request in Fastapi application ? I need to ratelimit API call 5 request per second per user and exceeding that limit blocks that particular user for 60 seconds.
In mai...

#

I see many chat examples use the command line to invoke a single script for some reason, but that's really just for personal use

#

For the Medium article, if you take out the Streamlit part, with the API endpoints in place, you can just have a Python script make concurrent calls to the API as well

#

But if these are unfamiliar, then that's what I imagine the gap would be
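
For reference, a Python script that fires many calls concurrently while capping how many are in flight might look like this. It is a client-side sketch using `asyncio.Semaphore`, not the server-side FastAPI rate limiting from the Stack Overflow link; `worker` is a hypothetical async function that calls the API for one item, and the default cap of 50 is arbitrary.

```python
import asyncio


async def run_bounded(items, worker, max_in_flight=50):
    """Apply the async worker to every item with at most max_in_flight running."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(item):
        # Wait for a free slot before starting the call.
        async with semaphore:
            return await worker(item)

    # gather preserves the order of the inputs in its results.
    return await asyncio.gather(*(guarded(item) for item in items))
```

This limits concurrency rather than requests per second, so a true rate limit would still need to be enforced on the server or with a time-based limiter.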

kind coyote Jan 3, 2024, 1:53 PM

#

I guess for me it was more of a thought experiment of 'what can go wrong and how do i deal with it when it does' lol

#

The caveat here is that today is literally the first day i learned that prompts can be redirected to individual endpoints like this, its a literal TIL

#

Thank you for the heads up, i need to read more

glacial osprey Jan 3, 2024, 2:13 PM

#

kind coyote Thank you for the heads up, i need to read more

Glad we could help 🙂 As long as you go back to the ML fundamentals, prompting is really just model prediction, so how to scale model prediction in production is the same (well, with some additional special tooling as well). But also no worries on this, I find many data scientists aren't familiar with these anyways

kind coyote Jan 3, 2024, 11:30 PM

#

glacial osprey Glad we could help 🙂 As long as you go back to the ML fundamentals, prompting i...

If prompting is essentially request based prediction, how do GPT and similar store the 'conversation context'? 🤔

#

https://ai.stackexchange.com/a/38262

This discussion gives suggestions but nothing conclusive

Is it some type of context window?

Artificial Intelligence Stack Exchange

How does ChatGPT retain the context of previous questions?

One of the innovations with OpenAI's ChatGPT is how natural it is for users to interact with it.
What is the technical enabler for ChatGPT to maintain the context of previous questions in its answe...

glacial osprey Jan 4, 2024, 12:00 AM

#

kind coyote If prompting is essentially request based prediction, how do GPT and similar sto...

Models themselves don't store it at all. All the model gets, unless it is specialized, is just a string. This is also why you often see both a base model and a chat model. Base models are trained only to predict the next token from all the data fed in, and chat models are specifically fine-tuned to respond to prompts.

You can see OpenAssistant/llama2-13b-megacode2-oasst and OpenAssistant/codellama-13b-oasst-sft-v10, and I quote:

Prompt dialogue template:
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
The model input can contain multiple conversation turns between user and assistant, e.g.
<|im_start|>user
{prompt 1}<|im_end|>
<|im_start|>assistant
{reply 1}<|im_end|>
<|im_start|>user
{prompt 2}<|im_end|>
<|im_start|>assistant
(...)

You can basically treat the {} parts as string replacement, as with f-strings. The rest is relatively easy: you cache the whole conversation, maybe separated by session, and then prepend the last few turns of conversation until the context length runs out.

OpenAssistant/llama2-13b-megacode2-oasst · Hugging Face
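
The string replacement plus "prepend the last few turns" idea can be sketched like this, using the template quoted above. The function name `build_prompt` and the character budget are assumptions made to keep the example dependency-free; a real system would count tokens with the model's tokenizer and load `history` from a cache such as Redis.

```python
TURN = "<|im_start|>{role}\n{content}<|im_end|>\n"


def build_prompt(system_message, history, new_prompt, max_chars=2000):
    """Assemble a prompt in the quoted template, keeping the newest turns that fit.

    history: list of (role, content) tuples, oldest first.
    max_chars: crude budget in characters (real systems count tokens).
    """
    head = TURN.format(role="system", content=system_message)
    tail = TURN.format(role="user", content=new_prompt) + "<|im_start|>assistant\n"
    budget = max_chars - len(head) - len(tail)
    kept = []
    for role, content in reversed(history):  # walk from the newest turn back
        turn = TURN.format(role=role, content=content)
        if len(turn) > budget:
            break  # stop at the first turn that no longer fits
        kept.append(turn)
        budget -= len(turn)
    return head + "".join(reversed(kept)) + tail
```

The model sees the whole string on every request, which is why the conversation appears to be remembered even though nothing is stored in the model.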

#

No different from doing RAG, just that this uses a cache store, like Redis, instead

#

And the reason why you have a specific format when calling the API is so they can easily do the above string replacement

kind coyote Jan 9, 2024, 4:54 AM

#

glacial osprey Take a very simple example: https://rihab-feki.medium.com/deploying-machine-lear...

I have a question, how would one adjust this architecture such that the model can pass the json outputs to an additional pydantic validation layer

Would it literally just be a different layer/function in the FastAPI endpoint?

I was thinking of separating the different nested objects, and then the validation/pydantic layer is coded to recognize the relevant key pair values and validate according to that specific object

glacial osprey Jan 9, 2024, 11:56 AM

#

kind coyote I have a question, how would one adjust this architecture such that the model ca...

Pydantic and FastAPI play nicely with each other, see doc.

If what your endpoint returns does not match the expected model, validation fails and it'll return an error.

You would ideally also enforce a request schema, so the request you receive is automatically validated against the expected format, see doc

Request Body - FastAPI

FastAPI framework, high performance, easy to learn, fast to code, ready for production

kind coyote Jan 9, 2024, 12:04 PM

#

Thanks! I also just found out about libraries like Outlines/Guidance

https://github.com/outlines-dev/outlines

GitHub

GitHub - outlines-dev/outlines: Guided Text Generation

Guided Text Generation. Contribute to outlines-dev/outlines development by creating an account on GitHub.
