#konwinski-prize


inner sigil
#

I am the first to message here

wild island
silent lily
#

Looking for teammates for this comp. If you are interested and have experience with github and LLMs, please reach out

lethal zodiac
#

hi everybody!

safe plank
#

heyyyy

edgy moss
#

hey, do you guys have or know of any resources that would come in handy for preparation?

desert island
#

just to clarify, will the real test set be only python code bases, or can it be any language?

raw mauve
# edgy moss hey, do you guys have or know of any resources that would come in handy for prep...

Hi I've got some experience and I'm looking at putting together a little guide to answer that, suggestions welcome.

As a start, in my opinion the best prototype agent to start studying is SWE-agent. It's intentionally simple and by the same group that made SWE-bench. The paper is also very good; I did a video reading of the preprint.

As a pre-req, here are some prompting concepts that LLM agents grew up around:

Chain of thought
ReAct (LLM tool use)
Retrieval Augmented Generation

raw mauve
warped current
desert island
#

I looked through the eval code, it's definitely just python since the code is pretty tightly coupled to evaluating python

edgy moss
#

this might be a stupid question, but is the goal of the competition to develop a new model from scratch or to use pre-existing models and fine-tune them to solve the task?

minor veldt
#

You can do both. From what I understood, you can also add external data to your solution, but it can't access the internet during execution

crisp warren
#

yeah i don't get that part though. We need the repos cloned though, right? so that we can embed the repo before making changes?

#

cloning the repo will need internet access…

raw mauve
cobalt seal
# crisp warren yeah i don't get that part though. We need the repos cloned though, right? so th...

The above is correct. The repos are provided dynamically through the API. Check out the demo notebook: in the predict method you'll see repo_archive, which provides the relevant code base.

Per the demo notebook:

The evaluation API requires that you set up a server which will respond to inference requests. We have already defined the server; you just need to write the predict function. When we evaluate your submission on the hidden test set, the client defined in konwinski_prize_gateway will run in a different container with direct access to the hidden test set and hand off the data.

Your code will always have access to the published copies of the files.
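
To make the repo_archive hand-off concrete, here is a minimal sketch of unpacking such an archive inside predict. It assumes the archive arrives as an in-memory tar file (an io.BytesIO); that format, and the helper names unpack_repo and make_demo_archive, are illustrative assumptions and not taken from the competition API. The demo archive is built locally so the snippet runs without internet access.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def unpack_repo(repo_archive: io.BytesIO, dest_dir: str) -> list[str]:
    """Extract an in-memory tar archive and return the relative file paths.

    Assumption: the archive is a (possibly compressed) tar file.
    """
    repo_archive.seek(0)
    with tarfile.open(fileobj=repo_archive, mode="r:*") as tar:
        tar.extractall(dest_dir)
    root = Path(dest_dir)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


def make_demo_archive() -> io.BytesIO:
    """Build a tiny tar archive in memory to stand in for the real one."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        data = b"def add(a, b):\n    return a - b  # bug to be fixed\n"
        info = tarfile.TarInfo(name="repo/calc.py")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


with tempfile.TemporaryDirectory() as workdir:
    files = unpack_repo(make_demo_archive(), workdir)
    print(files)
```

Once the files are on local disk, they can be indexed or embedded without any network call, which is why no cloning is needed at inference time.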

quartz siren
#

> it can't access the internet during the execution
well that's not ideal. I can't use openAI etc?

fickle geyser
raw mauve
# quartz siren > it can't access the internet during the execution well that's not ideal. I ca...

Right, Konwinski's blog here clarifies that this is only for agents using Open Weight models. I assume the lack of internet access means those models will need to be running locally
https://andykonwinski.com/2024/12/12/konwinski-prize.html

quartz siren
#

Ahh makes sense!

quartz siren
#

@fickle geyser 's notebook is great, but make sure you follow these extra steps otherwise it will not work. https://www.kaggle.com/competitions/konwinski-prize/discussion/553294#3081905
~~I tried to run @fickle geyser 's notebook but it doesn't work on kaggle... however, I was inspired by his generosity 🙌 so here's another notebook which will run on kaggle which gives you simple access to a function send_llm you can call to send responses to an LLM running completely within kaggle itself!
https://www.kaggle.com/code/thornkey/notebookdd75bed624

(Note you will need your own HuggingFace token, but that's free 🙂 )

Heaps of kudos to https://github.com/casualcomputer/llm_google_colab which I distilled to make this notebook.
See you at the finish line! 🙂~~

GitHub

A tutorial on how to set up a LLM on Google Colab for both GPU-accelerated and CPU-only session. - casualcomputer/llm_google_colab

warped current
#

maybe I misunderstand the competition, but it baffles me how you could solve it without making use of an LLM.

quartz siren
#

they're not saying don't use an LLM, they're just saying don't use a closed-source LLM provider

#

you need to use an open source LLM

abstract basin
#

When looking at the predict method in "konwinski-prize-demo-submission", it appears that some information in the base training set ("test_patch", FAIL_TO_PASS, PASS_TO_PASS) is not being provided. During the inference phase on the hidden dataset, will the swe-agent that provides the code patch not have access to any of these columns to verify whether its generated patch is good?

fickle geyser
quartz siren
#

Fair! Thanks!

versed bloom
versed bloom
drifting wharf
#

The 4 blocks of code in the "Konwinski Prize Demo Submission" I am very clueless about. Can someone explain what's happening in it.

Also, while running the same code in a notebook, the line import kaggle_evaluation.konwinski_prize_inference_server is throwing an error

versed bloom
#

NVIDIA has just released a 1.5B instruct model which should fit in the K-prize conditions and might be efficient.
https://developer.nvidia.com/blog/hymba-hybrid-head-architecture-boosts-small-language-model-performance/

Transformers, with their attention-based architecture, have become the dominant choice for language models (LMs) due to their strong performance, parallelization capabilities, and long-term recall…

versed bloom
drifting wharf
#

Same error but for me it's on the Kaggle Notebook itself

versed bloom
versed bloom
warped current
#

I'm still trying to fully grok this project, but as I see it, the overall aim is to create an LLM (or harness one) that generates bug fixes based off bug fix requests

versed bloom
warped current
versed bloom
sleek carbon
#

requires like 300+ gb of vram just at 4 bytes to run

#

im more worried bout this comp being canceled

versed bloom
sleek carbon
#

4 bit quantized = 300+ gb of vram

versed bloom
sleek carbon
#

I actually thought of using deepseek v3 lol

#

I even spent a few hours downloading a copy to a SSD

#

It's just so impractical loading and unloading weights lol

#

Also, anything below 4 bits is just a don't

#

What I was actually thinking was having multiple models to perform different tasks

#

Like a main model and a critic model

#

And I can run code through them in batches

#

which has the same idea as brute forcing through a gigantic model like V3 but more streamlined

oak oriole
#

This is a new type of competition for me, but it sounds fascinating.
There are a few things I'm still trying to understand:

Can only open-source models (like LLaMA or Qwen) be used?
Is it possible to use closed-source models in any way (like GPT or Claude)?
It seems that relying only on Kaggle's computing resources would not be enough here. So, am I correct in assuming that to perform well in this competition, participants would need to train models on more powerful servers (like AWS or Google Cloud) and then perform inference on Kaggle? If that's the case, would it be mandatory to create a public dataset with your model or external dataset? If so, wouldn't that allow others to use it as well?
Apologies if these questions are naive.

versed bloom
oak oriole
versed bloom
versed bloom
sleek carbon
#

also, in general, is it better to have a smaller # of parameters with higher floating-point precision, or lots of parameters with lower precision (4-bit quant @ 80B vs 32-bit @ 20B)

#

bc im assuming both take the same resources to compute
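
A quick back-of-the-envelope check of that assumption: the memory needed for the weights alone is roughly parameter count times bits per parameter divided by 8. The sketch below is only that estimate; it ignores the KV cache, activations, and quantization overhead, and the function name weight_memory_gb is made up for illustration.

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bits_per_param / 8


big_low_precision = weight_memory_gb(80, 4)     # 80B parameters at 4 bits
small_full_precision = weight_memory_gb(20, 32)  # 20B parameters at 32 bits
small_half_precision = weight_memory_gb(20, 16)  # 20B parameters at 16 bits

print(big_low_precision, small_full_precision, small_half_precision)
```

By this estimate the two setups in the question are not actually equal: 80B at 4 bits is about 40 GB, while 20B at 32 bits is about 80 GB. The 20B model would need to run at 16 bits to match. The same arithmetic puts a 671B-parameter model such as DeepSeek V3 at about 335 GB at 4 bits, in line with the "300+ GB" figure mentioned earlier.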

#

and another thing, would it be better to have 1 large model, 1 small model generating multiple solutions, 1 small model refining a solution multiple times, multiple models in tandem generating multiple solutions, or multiple models working on 1 solution (given they use the same amount of resources)

#

heres a graph from nebius

#

'best-of-N: the agent runs on each problem instance N times, and a problem is considered solved if it is successfully solved in at least one of these runs.

random: a problem is considered solved if a solution, randomly selected from N runs, is correct. This serves as a Monte Carlo estimate of the success rate using N samples.'
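
The two definitions quoted above can be sketched in a few lines. Here each problem has a list of N pass/fail outcomes; the data is made up for illustration, and the "random" figure is computed as the exact expectation of a uniform random pick where the quote describes a Monte Carlo estimate.

```python
def best_of_n_rate(runs: list[list[bool]]) -> float:
    """Fraction of problems solved in at least one of their N runs."""
    return sum(any(outcomes) for outcomes in runs) / len(runs)


def random_pick_rate(runs: list[list[bool]]) -> float:
    """Expected fraction solved when one run per problem is picked at random."""
    return sum(sum(outcomes) / len(outcomes) for outcomes in runs) / len(runs)


# Hypothetical results: 3 problems, 4 runs each.
runs = [
    [True, False, False, False],
    [False, False, False, False],
    [True, True, False, False],
]

best = best_of_n_rate(runs)
rand = random_pick_rate(runs)
print(best, rand)
```

The gap between the two numbers is the headroom a good solution selector could recover. Best-of-N assumes an oracle that always picks a correct run when one exists, which is not available on a hidden test set.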

#

also, in the AI-numina math olympiad challenge, basically everyone used a deepseek 7b model, had it generate multiple answers, and used a majority vote method to choose the correct answer. However, as this isn't as black and white as math answers, I was wondering if maybe a diffusion technique could work out
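
For reference, majority voting over sampled answers is only a few lines. This is a generic sketch with made-up answers, not the code any team used.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the share of votes it received."""
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Hypothetical final answers from five samples of the same model.
samples = ["42", "17", "42", "42", "8"]
winner, share = majority_vote(samples)
print(winner, share)
```

This works for math because equal answers compare equal as strings. Two correct patches for the same bug will usually differ textually, so voting on patches would need some notion of equivalence first, for example grouping patches by which tests they make pass.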

sleek carbon
#

Weird hyperlink dude

#

<@&1303433601177751593>

sleek carbon
#

Thanks for banning him

radiant pier
#

Creating a team for the competition, if interested. DM me .

last sedge
#

Hi everyone, I am a PhD student and I am looking for a team

versed bloom
fickle geyser
chilly wren
#

Sometimes

fickle geyser
#

yeah

dense mural
#

why do solutions in this competition score so low? a lot of solutions have achieved 30%+ on SWE-bench, how is this competition harder?

chilly wren
#

penalty on fails

chilly wren
#

(a - b)/(a + b + c)

#

a: correct

#

b: incorrect

#

c: skipped

#

what I don't understand is why the benchmark is -1; as I see it, skipping everything should score (0 - 0)/(0 + 0 + N) = 0

#

my bad

#

actually the metric is (a - b - c/10000)/(a + b + c), now it makes sense

#

actually no, it should then score -1e-4, does anyone know where I am wrong?

#

if n_correct == 0: return incorrect_score

#

there it is

#

incorrect_score has been set to -1
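
Putting the pieces of this thread together, the metric as described here can be sketched as below. This is reconstructed from the messages above, not copied from the evaluation code; the function name kprize_score and the parameter name skip_penalty are made up for illustration.

```python
def kprize_score(
    n_correct: int,
    n_incorrect: int,
    n_skipped: int,
    skip_penalty: float = 1e-4,    # the c/10000 term from the thread
    incorrect_score: float = -1.0,  # value reported in the thread
) -> float:
    """Score = (a - b - c/10000) / (a + b + c), with a special case for a == 0."""
    if n_correct == 0:
        # Special case quoted above: no correct answers gives the floor score.
        return incorrect_score
    total = n_correct + n_incorrect + n_skipped
    return (n_correct - n_incorrect - n_skipped * skip_penalty) / total


skip_everything = kprize_score(0, 0, 10)
mixed = kprize_score(3, 1, 6)
print(skip_everything, mixed)
```

Skipping everything therefore scores -1, which matches the benchmark. Away from that special case, a submitted patch with success probability p contributes 2p - 1 in expectation to the numerator versus -1e-4 for a skip, so submitting only pays off when p is above roughly one half.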

versed bloom
versed bloom
# dense mural that's crazy

Today would have looked crazy five years ago (GPT-3 was released June 2020, and of course the real step change was GPT-4, which was March 2023). But yeah, no one expects 90% next month. The prize pool is doubled (from $50K to $100K) if anyone can get 30% or better.

fickle geyser
#

it seems that I am not able to submit when I exceed my GPU quota?

chilly wren
#

Indeed, any code submission has to start running

#

Just wait a few hours

fickle geyser
#

this notebook is public (although there is a huge disclaimer to the score)

sleek carbon
#

Ok thank god, so Wednesday isn't final submission date

sleek carbon
#

If not, what were your methods of obtaining your score

#

Also, we can submit two notebooks to the final competition, right?

#

so theoretically I can have two different strategies?

velvet vigil
#

Hi Folks! Newbie here. Trying to participate in this competition for fun and learning. Does anyone know any link/post to get started and to get up to speed quickly?

sleek carbon
#

If the competition submission deadline is Wednesday

#

Does that mean the start of Wednesday or the end of Wednesday

nocturne warren
#

When will this competition end?
western usa time or eastern usa time, just curious...

warped current
#

good luck to all still in the running

hasty owl
#

When will be the second KPrize competition?

warped current
hasty owl
#

but new submissions are not allowed now

#

Are new submissions still allowed now?

soft sedge
nocturne warren
#

I am in the leaderboard in this competition, but my homepage does not have this active competition like others

#

👆 is my homepage, no active competition

#

this picture is someone else's, it has the active competition tab

#

@limber scaffold @rugged finch Can you help me look into this? It's a little weird

limber scaffold
nocturne warren
sullen bay
past grove
#

will there be another one next year? because of the low top score

versed bloom
# past grove will there be another one next year? because of the low top score

There will be another round but the time and other details are TBD based on feedback from first round participants.
https://techcrunch.com/2025/07/23/a-new-ai-coding-challenge-just-published-its-first-results-and-they-arent-pretty/
https://x.com/andykonwinski/status/1948190939983589737
With the new batch of open coding models I expect much better results although 90% will still be unlikely. Will probably take another look when they announce the next round. I thought they should have extended the time for the first round but if they cycle quickly, like twice a year, then that would be great.

A new AI coding challenge has revealed its first winner, and set a new bar for AI-powered software engineers.

And yes, round 2 is coming. We're starting by listening to the teams who showed up in Round 1. What worked, what didn't, what we can improve. Big thanks to @Kaggle for teaming up on this crazy idea. https://t.co/p24tJ6gTcR

sullen bay
#

and when is round 2 gonna be launched?

sullen bay
versed bloom
sullen bay