#Model returning malformed characters in JSON response using API


winter smelt
#

Hello OpenAI team,

I'm using the OpenAI API to extract and structure addresses from free-text descriptions. I rely on the response_format: json option to ensure clean, machine-readable outputs. However, in many cases, the API is returning malformed or incorrectly encoded characters within the JSON response.

For example, instead of returning "São Paulo" or "Guarujá", I receive:

{
"estado": "S\u00050",
"cidade": "Guaruj\u00101"
}

These are control characters (\u0005, \u0010, etc.) that corrupt the expected UTF-8 output, making it unusable in production systems.
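As a stopgap while the root cause is open, a response like the one above can at least be detected before it reaches downstream systems. The following is a minimal sketch, not an official fix: `find_control_chars` is a hypothetical helper name, and it assumes that no control character (including tabs and newlines) is ever legitimate inside an address field.

```python
import json


def find_control_chars(value, path="$"):
    """Yield (path, code point) for each control character found in any
    string inside a decoded JSON value (hypothetical helper)."""
    if isinstance(value, str):
        for ch in value:
            # C0 control range plus DEL; assumed never valid in an address
            if ord(ch) < 0x20 or ord(ch) == 0x7F:
                yield path, f"U+{ord(ch):04X}"
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from find_control_chars(item, f"{path}.{key}")
    elif isinstance(value, list):
        for index, item in enumerate(value):
            yield from find_control_chars(item, f"{path}[{index}]")


# The malformed response from the example above, as raw JSON text
raw = r'{"estado": "S\u00050", "cidade": "Guaruj\u00101"}'
problems = list(find_control_chars(json.loads(raw)))
print(problems)
```

A non-empty `problems` list could be used to trigger a retry or to route the record to manual review instead of storing it.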

This behavior has been consistent and is severely affecting the reliability of our integration. For context, in this past month alone, our usage statistics are:

Total tokens: 1,971,385,774

Total requests: 344,854

We kindly ask for guidance or a fix, as we're strictly relying on the model's output for critical address processing and need consistent, UTF-8 clean responses.

Thank you in advance for your support.

Best regards,
Luis

heady jasper
#

Hello! What I normally do is define Pydantic models whose fields are enums or Literal types, so the model has to choose from a finite set of values.

If the number of cities / countries is not very large, this should work very well.

#

example of what I mean:

from typing import Literal

from openai import OpenAI
from pydantic import BaseModel


# 1. Define the address schema with a finite set of allowed values
class AddressResponse(BaseModel):
    country: Literal["Brazil", "Spain", "Japan", "USA"]
    city: Literal["São Paulo", "Guarujá", "Madrid", "Tokyo"]
    street: str
    number: str


# Needs an openai SDK release that includes the Responses API, and pydantic>=2.8
client = OpenAI()


def extract_address(text: str) -> AddressResponse | None:
    # responses.parse() accepts the Pydantic model as the output schema
    response = client.responses.parse(
        model="gpt-4o-2024-08-06",
        instructions=(
            "Extract the address in strict JSON matching the schema: "
            "country, city, street, number."
        ),
        input=text,
        text_format=AddressResponse,
    )

    # output_parsed is a typed AddressResponse, or None when no parsed
    # output is available (for example, on refusal)
    return response.output_parsed


if __name__ == "__main__":
    sample = "Rua Santo Antônio, 456, Guarujá, Brazil"
    result = extract_address(sample)
    if result:
        print(result.model_dump())
    else:
        print("Failed to parse or model refused.")

#

With Pydantic base models you can also enforce further validation (min / max number of elements in a list, ...)

winter smelt

Thanks! That approach makes sense and works well for the estado (state) field since we have a finite and small set of values.

However, for cidade (city), we’re dealing with over 5,500 cities in Brazil.
Wouldn’t including all of them in a Literal[...] definition cause token overload or significantly increase the prompt size?

Just wanted to confirm if this is still a recommended and scalable approach in this case, or if there's an alternative way to enforce UTF-8 and proper values without listing thousands of cities manually.

Thanks again!