I am using the GPT-4O model to generate SQL queries by passing the schema along with the prompt. The model works well when the prompt explicitly mentions column names, but it struggles to infer the correct column name when the prompt contains string values corresponding to a column.
For example, consider the following schema:
PARTNER: String
COUNTRY: String
ADVERTISER: String
AD_UNIT: String
AD_FORMAT: String
DEVICE_CATEGORY: String
TOTAL_IMPRESSION: Int64
TOTAL_REVENUE: Float64
Working Example:
When the prompt is:
"Show the total revenue for the USA,"
The model correctly maps "USA" to the COUNTRY column and generates the appropriate SQL query:
SELECT SUM(TOTAL_REVENUE)
FROM table_name
WHERE COUNTRY = 'USA';
Occasional Failing Example:
When the prompt is:
"Show the total impressions for PartnerX,"
The model may sometimes incorrectly map "PartnerX" to the ADVERTISER column instead of the PARTNER column and generate an incorrect query:
SELECT SUM(TOTAL_IMPRESSION)
FROM table_name
WHERE ADVERTISER = 'PartnerX';
Similarly, for the following prompt:
"Show the total revenue for AdvertiserA,"
The model may occasionally map "AdvertiserA" to the PARTNER column instead of the ADVERTISER column, generating an incorrect query:
SELECT SUM(TOTAL_REVENUE)
FROM table_name
WHERE PARTNER = 'AdvertiserA';
These errors happen only in some cases, particularly when certain column values (like partner or advertiser names) are less clear in context.
Question:
How can I improve the model's ability to detect the correct column name based on the provided string values in the prompt?
Are there techniques, data augmentation methods, or prompt engineering strategies that can help the model understand the relationship between column names and potential column values more effectively?