#konwinski-prize | Kaggle | Page 1

inner sigil Dec 12, 2024, 6:06 PM

#

I am the first to message here

wild island Dec 14, 2024, 6:45 PM

#

inner sigil I am the first to message here

No worries, I'm right behind you. 😉

silent lily Dec 15, 2024, 12:23 PM

#

Looking for teammates for this comp. If you are interested and have experience with GitHub and LLMs, please reach out.

lethal zodiac Dec 15, 2024, 5:14 PM

#

hi everybody!

safe plank Dec 16, 2024, 11:39 AM

#

heyyyy

edgy moss Dec 16, 2024, 11:43 PM

#

hey, do you guys have or know of any resources that would come in handy for preparation?

desert island Dec 17, 2024, 3:22 AM

#

just to clarify, will the real test set be only python code bases, or can it be any language?

raw mauve Dec 18, 2024, 7:44 AM

#

edgy moss hey, do you guys have or know of any resources that would come in handy for prep...

Hi I've got some experience and I'm looking at putting together a little guide to answer that, suggestions welcome.

As a start, in my opinion the best prototype agent to start studying is SWE-agent. It's intentionally simple and by the same group that made SWE-bench. The paper is also very good; I did a video reading of the preprint.

As a pre-req, here are some prompting concepts that LLM agents grew up around:

Chain of thought
ReAct (LLM tool use)
Retrieval Augmented Generation

raw mauve Dec 19, 2024, 9:05 AM

#

raw mauve Hi I've got some experience and I'm looking at putting together a little guide t...

As I think this contest has the potential to be quite significant to the development of AI coding agents, I've put together a video with some background info and will try and answer common questions I see in a guide on GitHub.

warped current Dec 19, 2024, 4:33 PM

#

desert island just to clarify, will the real test set be only python code bases, or can it be ...

You shouldn't worry about the particular language used, although if the hub uses Python, you want your solution to use Python. It involves using an LLM to come up with the "Python" for the system, imho.

desert island Dec 19, 2024, 4:34 PM

#

I looked through the eval code, it's definitely just python since the code is pretty tightly coupled to evaluating python

edgy moss Dec 20, 2024, 6:56 AM

#

this might be a stupid question, but is the goal of the competition to develop a new model from scratch or to use pre-existing models and fine-tune them to solve the task?

minor veldt Dec 20, 2024, 12:50 PM

#

You can do both. From what I understood you can also add external data to your solution, but it can't access the internet during execution.

crisp warren Dec 20, 2024, 4:03 PM

#

yeah i don't get that part though. We need the repos cloned though, right? so that we can embed the repo before making changes?

#

cloning the repo will need internet access…

raw mauve Dec 21, 2024, 1:47 AM

#

crisp warren yeah i don't get that part though. We need the repos cloned though, right? so th...

My assumption is that in the bench environment the code and dependencies will be already available in the container, but I'd appreciate clarification from someone who knows for sure.

cobalt seal Dec 21, 2024, 4:13 AM

#

crisp warren yeah i don't get that part though. We need the repos cloned though, right? so th...

The above is correct. The repos are provided dynamically through the API. Check out the demo notebook and the predict method; you'll see repo_archive providing the relevant code base.

Per the demo notebook:

The evaluation API requires that you set up a server which will respond to inference requests. We have already defined the server; you just need to write the predict function. When we evaluate your submission on the hidden test set, the client defined in konwinski_prize_gateway will run in a different container with direct access to the hidden test set and hand off the data.

Your code will always have access to the published copies of the files.
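
A minimal sketch of how a predict-style function might consume the provided repository, assuming repo_archive arrives as an in-memory tar archive. The names `unpack_repo`, `list_python_files`, and `predict_sketch`, the reduced signature, the `work_dir` parameter, and the use of `None` as a "skip" signal are all illustrative assumptions; the demo notebook remains the reference for the real signature and return conventions.

```python
import io
import os
import shutil
import tarfile


def unpack_repo(repo_archive: io.BytesIO, work_dir: str) -> str:
    """Extract an in-memory tar archive into work_dir/repo and return that path."""
    repo_path = os.path.join(work_dir, "repo")
    # Start from a clean directory for every issue.
    if os.path.exists(repo_path):
        shutil.rmtree(repo_path)
    os.makedirs(repo_path)
    repo_archive.seek(0)
    with tarfile.open(fileobj=repo_archive, mode="r:*") as tar:
        tar.extractall(repo_path)
    return repo_path


def list_python_files(repo_path: str) -> list:
    """Return the repo's .py files as sorted paths relative to repo_path."""
    found = []
    for root, _dirs, names in os.walk(repo_path):
        for name in names:
            if name.endswith(".py"):
                found.append(os.path.relpath(os.path.join(root, name), repo_path))
    return sorted(found)


def predict_sketch(problem_statement: str, repo_archive: io.BytesIO, work_dir: str):
    """Hypothetical predict body: unpack the repo, then decide on a patch."""
    repo_path = unpack_repo(repo_archive, work_dir)
    candidates = list_python_files(repo_path)
    # A real agent would pass problem_statement and candidates to a locally
    # running model and return a diff as a string. This sketch never produces
    # a patch; None is an assumed "skip this issue" signal.
    return None
```

No network access is needed in this flow: the code base comes in through the function arguments rather than a `git clone`.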

quartz siren Dec 22, 2024, 2:31 AM

#

> it can't access the internet during the execution
well that's not ideal. I can't use OpenAI etc?

fickle geyser Dec 25, 2024, 4:56 AM

#

I wrote this https://www.kaggle.com/competitions/konwinski-prize/discussion/553294

Konwinski Prize

$1M for the AI that can close 90% of new GitHub issues

raw mauve Dec 26, 2024, 8:54 PM

#

quartz siren > it can't access the internet during the execution well that's not ideal. I ca...

Right, Konwinski's blog here clarifies that this is only for agents using Open Weight models. I assume the lack of internet access means those models will need to be running locally
https://andykonwinski.com/2024/12/12/konwinski-prize.html

Andy Konwinski

Why I built the Konwinski Prize

Why I launched the K Prize, aka, why want to give $1M to the first open source AI that gets 90% on contamination-free SWE-bench.

quartz siren Dec 26, 2024, 9:55 PM

#

Ahh makes sense!

quartz siren Dec 26, 2024, 9:56 PM

#

fickle geyser I wrote this https://www.kaggle.com/competitions/konwinski-prize/discussion/5532...

This is amazing!

quartz siren Dec 27, 2024, 11:36 AM

#

@fickle geyser 's notebook is great, but make sure you follow these extra steps otherwise it will not work. https://www.kaggle.com/competitions/konwinski-prize/discussion/553294#3081905
~~I tried to run @fickle geyser 's notebook but it doesn't work on kaggle... however, I was inspired by his generosity 🙌 so here's another notebook which will run on kaggle which gives you simple access to a function send_llm you can call to send responses to an LLM running completely within kaggle itself!
https://www.kaggle.com/code/thornkey/notebookdd75bed624

(Note you will need your own HuggingFace token, but that's free 🙂 )

Heaps of kudos to https://github.com/casualcomputer/llm_google_colab which I distilled to make this notebook.
See you at the finish line! 🙂~~

Demo: Run LLM Locally

Explore and run machine learning code with Kaggle Notebooks | Using data from Konwinski Prize

GitHub

GitHub - casualcomputer/llm_google_colab: A tutorial on how to set ...

A tutorial on how to set up a LLM on Google Colab for both GPU-accelerated and CPU-only session. - casualcomputer/llm_google_colab

Konwinski Prize

$1M for the AI that can close 90% of new GitHub issues

warped current Dec 27, 2024, 1:38 PM

#

maybe I misunderstand the competition. But it baffles me how you could solve it without making use of an LLM.

quartz siren Dec 27, 2024, 1:43 PM

#

they're not saying don't use an LLM, they're just saying don't use a closed source LLM provider

#

you need to use an open source LLM

abstract basin Dec 27, 2024, 5:32 PM

#

When looking at the predict method in "konwinski-prize-demo-submission", it appears that some information in the base training set ("test_patch", FAIL_TO_PASS, PASS_TO_PASS) is not being provided. During the forecasting (inference) phase on the hidden dataset, will the SWE agent that provides the code patch not have access to any of these columns to verify whether its generated patch is good?

fickle geyser Dec 27, 2024, 7:42 PM

#

quartz siren @fickle geyser 's notebook is great, but make sure you follow these extra...

I wrote a reply in the thread

quartz siren Dec 27, 2024, 7:46 PM

#

Fair! Thanks!

versed bloom Dec 27, 2024, 11:48 PM

#

Is it possible to run DeepSeek Coder V3 (42% on SWE-bench verified) in the compute available for the K-prize?
https://medium.com/data-science-in-your-pocket/deepseek-v3-the-best-open-source-llm-727d3421ae38
Is there a list of models that people have run under the K-prize conditions?

Medium

DeepSeek V3: The best Open-source LLM

Better than Claude 3.5 Sonnet, GPT-4o, Llama3.1 405B

versed bloom Dec 28, 2024, 12:04 AM

#

versed bloom Is it possible to run DeepSeek Coder V3 (42% on SWE-bench verified) in the compu...

Ah, probably not. 8 x H200 recommended for FP8:
https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3

GitHub

sglang/benchmark/deepseek_v3 at main · sgl-project/sglang

SGLang is a fast serving framework for large language models and vision language models. - sgl-project/sglang

drifting wharf Jan 3, 2025, 10:03 PM

#

The 4 blocks of code in the "Konwinski Prize Demo Submission" I am very clueless about. Can someone explain what's happening in them?

Also, while running the same code in a notebook, import kaggle_evaluation.konwinski_prize_inference_server is throwing an error.

versed bloom Jan 5, 2025, 5:43 AM

#

NVIDIA has just released a 1.5B instruct model which should fit in the K-prize conditions and might be efficient.
https://developer.nvidia.com/blog/hymba-hybrid-head-architecture-boosts-small-language-model-performance/

NVIDIA Technical Blog

Xin Dong

Hymba Hybrid-Head Architecture Boosts Small Language Model Performa...

Transformers, with their attention-based architecture, have become the dominant choice for language models (LMs) due to their strong performance, parallelization capabilities, and long-term recall…

versed bloom Jan 5, 2025, 8:33 PM

#

drifting wharf The 4 blocks of code in the "Konwinski Prize Demo Submission" I am very clueless...

The K-prize notebooks run fine for me on Kaggle but I'm unable to run them in Colab. The kaggle_evaluation package is missing and I can't find anything with that name on PyPI or elsewhere.
How do I install kaggle_evaluation?

import kaggle_evaluation.konwinski_prize_inference_server

ModuleNotFoundError: No module named 'kaggle_evaluation'

drifting wharf Jan 5, 2025, 8:44 PM

#

Same error but for me it's on the Kaggle Notebook itself

versed bloom Jan 5, 2025, 8:59 PM

#

drifting wharf Same error but for me it's on the Kaggle Notebook itself

I've found the code, it is at {konwinski_prize_path}/kaggle_evaluation
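
A minimal sketch of making that directory importable, assuming kaggle_evaluation is a plain Python package sitting inside the competition data folder. The helper name `add_to_import_path` and the concrete value of `KONWINSKI_PRIZE_PATH` are assumptions for illustration; substitute wherever the competition files live in your environment (for Colab, wherever you copied them).

```python
import sys
from pathlib import Path


def add_to_import_path(package_parent: str) -> None:
    """Put the directory that contains a package on sys.path (only once)."""
    parent = str(Path(package_parent).resolve())
    if parent not in sys.path:
        sys.path.insert(0, parent)


# Assumed location of the competition files; adjust to your setup.
KONWINSKI_PRIZE_PATH = "/kaggle/input/konwinski-prize"

# Only attempt the import when the package directory is actually present.
if Path(KONWINSKI_PRIZE_PATH, "kaggle_evaluation").is_dir():
    add_to_import_path(KONWINSKI_PRIZE_PATH)
    import kaggle_evaluation.konwinski_prize_inference_server  # noqa: F401
```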

versed bloom Jan 5, 2025, 11:41 PM

#

I got the K-prize Gemma 2B demo notebook running on Colab. An A100 runs the fine-tune in 3 minutes (vs 27 on Kaggle's P100).
https://github.com/fovi-llc/refactor-python/blob/main/konwinski-prize-gemma-llm.ipynb

warped current Jan 9, 2025, 5:55 PM

#

I'm still trying to fully grok this project, but as I see it, the overall aim is to create an LLM (or harness one) that generates bug fixes based off bug fix requests

versed bloom Jan 9, 2025, 7:46 PM

#

warped current I'm still trying to fully grok this project, but as I see it, the overall aim is...

Yes. It is an evolution of SWE-bench to push the frontier by evaluating on unseen test cases and using open, low-resource models.
https://www.swebench.com/

SWE-bench

SWE-bench: Evaluate Language Models on Open Source Software Tasks

warped current Jan 14, 2025, 1:03 AM

#

versed bloom Yes. It is an evolution of SWE-bench to push the frontier by evaluating on unse...

It's not really clear what it is about, or at least wasn't to me, like a lot of Kaggle contest descriptions. It's definitely an interesting AI project in general: like developing AI Droids

versed bloom Jan 14, 2025, 2:22 AM

#

warped current It's not really clear what it is about, or at least wasn't to me, like a lot of K...

There is a summary on the Kaggle page, but Konwinski has a longer explanation on his site.
https://andykonwinski.com/2024/12/12/konwinski-prize.html

Andy Konwinski

Why I built the Konwinski Prize

Why I launched the K Prize, aka, why want to give $1M to the first open source AI that gets 90% on contamination-free SWE-bench.

sleek carbon Jan 16, 2025, 11:46 PM

#

versed bloom Is it possible to run DeepSeek Coder V3 (42% on SWE-bench verified) in the compu...

heck no what

#

requires like 300+ gb of vram just at 4 bits to run

#

im more worried bout this comp being canceled

versed bloom Jan 17, 2025, 1:39 AM

#

sleek carbon requires like 300+ gb of vram just at 4 bits to run

Unsloth has quantized options that could probably run DS3 in the K-prize conditions but I doubt that would work well because it is a MoE. Starting with a smaller model like Llama or Qwen is probably the way to go.
https://huggingface.co/collections/unsloth/deepseek-v3-all-versions-677cf5cfd7df8b7815fc723c

sleek carbon Jan 17, 2025, 2:10 AM

#

versed bloom Unsloth has quantized options that could probably run DS3 in the K-prize conditi...

Yea uh

#

4 bit quantized = 300+ gb of vram
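
The back-of-the-envelope arithmetic behind that figure can be sketched as parameters times bits per parameter. This counts weights only (no KV cache, activations, or runtime overhead), and the parameter count below is an assumption taken from the model's public description, about 671 billion total parameters; `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: roughly 671 billion total parameters for DeepSeek-V3.
DEEPSEEK_V3_PARAMS = 671e9

for bits in (16, 8, 4, 2):
    size = weight_memory_gb(DEEPSEEK_V3_PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{size:,.0f} GB")
```

At 4 bits this lands at roughly 335 GB, which is consistent with the "300+" estimate. For a MoE model only a fraction of the weights is active per token, which is why offloading schemes can trade memory for speed.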

versed bloom Jan 17, 2025, 6:12 AM

#

sleek carbon 4 bit quantized = 300+ gb of vram

DS3 is MoE so the weights can be swapped. They say ~40GB for 2bit.
Not planning to try it but am interested in the results if someone does.
Q2_K_XS should run ok in ~40GB of CPU / GPU VRAM with automatic llama.cpp offloading.
https://huggingface.co/unsloth/DeepSeek-V3-GGUF

sleek carbon Jan 17, 2025, 12:25 PM

#

versed bloom DS3 is MoE so the weights can be swapped. They say ~40GB for 2bit. Not plannin...

Ya but imagine loading and unloading the weights over n over

#

I actually thought of using deepseek v3 lol

#

I even spent a few hours downloading a copy to a SSD

#

It's just so impractical loading and unloading weights lol

#

Also anything below 4 bit, just don't

#

What I was actually thinking was having multiple models to perform different tasks

#

Like a main model and a critic model

#

And I can run code through them in batches

#

which has the same idea as brute forcing through a gigantic model like V3 but more streamlined
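
A minimal sketch of the main-model-plus-critic idea, with both models represented by plain callables so it runs without any LLM. The function name `best_patch`, the score range, and the `min_score` threshold are illustrative assumptions, not anything prescribed by the competition.

```python
from typing import Callable, List, Optional, Tuple


def best_patch(
    issue: str,
    generate: Callable[[str], str],
    critic: Callable[[str, str], float],
    n_candidates: int = 4,
    min_score: float = 0.5,
) -> Optional[str]:
    """Sample candidates from the main model and keep the critic's favorite.

    Returns None (skip) when even the best candidate scores below min_score.
    """
    scored: List[Tuple[float, str]] = []
    for _ in range(n_candidates):
        patch = generate(issue)  # main model proposes a patch
        scored.append((critic(issue, patch), patch))  # critic model rates it
    top_score, top_patch = max(scored, key=lambda pair: pair[0])
    return top_patch if top_score >= min_score else None
```

In practice `generate` and `critic` would wrap calls to two locally hosted models, and the candidate loop could be batched.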

oak oriole Jan 17, 2025, 1:24 PM

#

This is a new type of competition for me, but it sounds fascinating.
There are a few things I’m still trying to understand:

Can only open-source models (like LLaMA or Qwen) be used?
Is it possible to use closed-source models in any way (like GPT or Claude)?
It seems that relying only on Kaggle's computing resources would not be enough here. So, am I correct in assuming that to perform well in this competition, participants would need to train models on more powerful servers (like AWS or Google Cloud) and then perform inference on Kaggle? If that’s the case, would it be mandatory to create a public dataset with your model or external dataset? If so, wouldn’t that allow others to use it as well?
Apologies if these questions are naive.

versed bloom Jan 17, 2025, 5:57 PM

#

oak oriole This is a new type of competition for me, but it sounds fascinating. There are a...

Not speaking for the competition operators but my understanding is that the answer to all of those questions is yes. I don't expect anyone to get over the score needed for the one millliion dollars because these conditions combine to be extremely tough but that (the difficulty) is the point and again for those reasons would be mind-blowingly beneficial for everyone if achieved.

oak oriole Jan 17, 2025, 5:59 PM

#

versed bloom Not speaking for the competition operators but my understanding is that the answ...

if you had indeed reached 90% on SWE, it probably would be worth more than 1 million, so I am not sure it would make sense to submit your solution here rather than, say, make your own company.

versed bloom Jan 17, 2025, 6:05 PM

#

oak oriole if you had indeed reached 90% on SWE, it probably would be worth more than 1 mil...

Some people care more about doing good than making money. The prize (i.e. gift) is a nice return on a few months' effort by someone who takes that view. Job offers of more than a million a year if they want it, without having to do any selling or interviewing, are also a foregone conclusion thanks to K-prize marketing.

oak oriole Jan 17, 2025, 6:13 PM

#

versed bloom Some people care more about doing good than making money. The prize (i.e. gift)...

yes quite possible

versed bloom Jan 18, 2025, 10:41 PM

#

Unsloth just released a free Kaggle notebook for running Phi-4 which supports up to 128K context length.
https://www.kaggle.com/code/danielhanchen/phi-4-finetuning-unsloth-notebook

Phi-4 Finetuning Unsloth notebook

Explore and run machine learning code with Kaggle Notebooks | Using data from No attached data sources

sleek carbon Jan 19, 2025, 12:34 AM

#

oak oriole This is a new type of competition for me, but it sounds fascinating. There are a...

no claude or gpt

#

also, in general, is it better to have a smaller # of parameters with higher floating-point precision, or lots of parameters with less precision (4b quant @80b vs 32b @20b)

#

bc im assuming both take the same resources to compute

#

and another thing, would it be better to have 1 large model, 1 small model generating multiple solutions, 1 small model refining a solution multiple times, multiple models in tandem generating multiple solutions, or multiple models working on 1 solution (given they use the same amount of resources)

#

#

heres a graph from nebius

#

'best-of-N: the agent runs on each problem instance N times, and a problem is considered solved if it is successfully solved in at least one of these runs.

random: a problem is considered solved if a solution, randomly selected from N runs, is correct. This serves as a Monte Carlo estimate of the success rate using N samples.'
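
The two definitions quoted above can be sketched as follows, assuming each problem's N runs are recorded as a list of booleans (True means that run solved the problem). All function names and the toy data layout are hypothetical.

```python
import random
from typing import List


def best_of_n_rate(runs: List[List[bool]]) -> float:
    """Fraction of problems solved by at least one of their N runs."""
    return sum(any(problem_runs) for problem_runs in runs) / len(runs)


def expected_random_rate(runs: List[List[bool]]) -> float:
    """Exact expected success rate when one run per problem is picked uniformly."""
    return sum(sum(r) / len(r) for r in runs) / len(runs)


def random_pick_rate(runs: List[List[bool]], trials: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity: pick one run per problem, repeat."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        hits += sum(rng.choice(problem_runs) for problem_runs in runs)
    return hits / (trials * len(runs))
```

The gap between the two rates is the headroom a good selection step (a critic, tests, or voting) could recover.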

#

also, in the AI-numina math olympiad challenge, basically everyone used a deepseek 7b model, and had it generate multiple answers, and used a majority vote method to choose the correct answer. However, as this isn't as black and white as math answers, I was wondering if maybe a diffusion technique could work out
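
For reference, the majority-vote step described above is small enough to sketch directly. It works when answers can be compared for exact equality (such as final numeric answers); code patches rarely match exactly, which is the difficulty raised here. The function name is hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(answers: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```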

sleek carbon Jan 23, 2025, 3:55 AM

#

Weird hyperlink dude

#

<@&1303433601177751593>

sleek carbon Jan 23, 2025, 10:27 AM

#

Thanks for banning him

radiant pier Jan 30, 2025, 9:55 AM

#

Creating a team for the competition. If interested, DM me.

last sedge Jan 31, 2025, 4:11 PM

#

Hi everyone, I am a PhD student and I am looking for a team

versed bloom Feb 1, 2025, 12:35 AM

#

@vale lichen <=== has implemented and shared a Kaggle notebook based on OpenHands that solves one of the K-prize test cases running Qwen 2.5 Instruct 32b. 🎉 🔥 🚀
https://www.kaggle.com/code/smartmanoj/kprize-openhands-fork

KPrize | Openhands Fork

Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources

fickle geyser Feb 7, 2025, 9:29 AM

#

The first published notebook not scoring -1

https://www.kaggle.com/competitions/konwinski-prize/discussion/561695

Konwinski Prize

$1M for the AI that can close 90% of new GitHub issues

chilly wren Feb 7, 2025, 6:22 PM

#

Sometimes

fickle geyser Feb 7, 2025, 6:34 PM

#

yeah

dense mural Feb 10, 2025, 1:24 PM

#

why do solutions in this competition score that low? a lot of solutions have achieved 30%+ on SWE-bench, how is this competition harder?

chilly wren Feb 10, 2025, 3:05 PM

#

penalty on fails

chilly wren Feb 10, 2025, 3:21 PM

#

(a - b)/(a + b + c)

#

a: correct

#

b: incorrect

#

c: skipped

#

what I don't understand is why benchmark is -1, as I see it skipping everything should score (0 - 0)/(0 + 0 + N) = 0

#

my bad

#

actually the metric is (a - b - c/10000)/(a + b + c), now it makes sense

#

actually no, it should then score -1e-4, does someone know where I am wrong?

#

if n_correct == 0: return incorrect_score

#

there it is

#

incorrect_score has been set to -1
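
Putting the pieces from the messages above together, the metric can be sketched as below. This is reconstructed only from what was said in this thread, not from the official metric source; the function and parameter names are hypothetical.

```python
def kprize_score(
    n_correct: int,
    n_incorrect: int,
    n_skipped: int,
    skip_penalty: float = 1e-4,
    incorrect_score: float = -1.0,
) -> float:
    """(a - b - c/10000) / (a + b + c), with a floor when nothing is correct."""
    total = n_correct + n_incorrect + n_skipped
    if total == 0:
        raise ValueError("need at least one issue to score")
    # Special case mentioned in the thread: zero correct answers scores -1,
    # which is why skipping everything gives -1 rather than -1e-4.
    if n_correct == 0:
        return incorrect_score
    return (n_correct - n_incorrect - n_skipped * skip_penalty) / total
```

Under this reading, a wrong patch costs as much as a right one earns, so skipping issues the agent is unsure about is far cheaper than guessing.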

versed bloom Feb 11, 2025, 1:00 AM

#

dense mural why do solutions in this competition score that low? a lot of solutions have achiev...

You have to use a small open weights model that will run on an L4.

dense mural Feb 12, 2025, 12:35 AM

#

versed bloom You have to use a small open weights model that will run on an L4.

that's crazy

versed bloom Feb 12, 2025, 1:50 AM

#

dense mural that's crazy

Today would have looked crazy five years ago (GPT-3 was released June 2020 and of course the real delta change was GPT-4, which was March 2023). But yeah, no one expects 90% next month. The prize pool is doubled (from $50K to $100K) if anyone can get 30% or better.

fickle geyser Feb 14, 2025, 4:33 PM

#

it seems that I am not able to submit when I exceed my GPU quota?

chilly wren Feb 14, 2025, 6:18 PM

#

Indeed, any code submission has to start running

#

Just wait a few hours

fickle geyser Mar 3, 2025, 5:14 AM

#

this notebook is public (although there is a huge disclaimer to the score)

sleek carbon Mar 3, 2025, 12:47 PM

#

Ok thank god, so Wednesday isn’t final submission date

sleek carbon Mar 3, 2025, 12:49 PM

#

fickle geyser this notebook is public (although there is a huge disclaimer to the score)

Is that just the select patch verify? Did you make any changes to it?

#

If not, what were your methods of obtaining your score?

#

Also, we can submit two notebooks to the final competition, right?

#

so theoretically I can have two different strategies?

velvet vigil Mar 4, 2025, 12:10 AM

#

Hi Folks! Newbie here. Trying to participate in this competition for fun and learning. Does anyone know of any link/post to get started and to get up to speed quickly?

sleek carbon Mar 10, 2025, 11:41 AM

#

If the competition submission deadline is Wednesday

#

Does that mean the start of Wednesday or the end of Wednesday

nocturne warren Mar 11, 2025, 8:04 PM

#

When will this competition end?
western usa time or eastern usa time, just curious...

warped current Apr 2, 2025, 10:05 PM

#

good luck to all still in the running

hasty owl Apr 3, 2025, 11:58 AM

#

When will be the second KPrize competition?

warped current Apr 4, 2025, 2:05 PM

#

hasty owl When will be the second KPrize competition?

There may not be a second one. Those who have joined still have time to make new submissions.

hasty owl Apr 4, 2025, 2:11 PM

#

but new submissions are not allowed now

#

Are new submissions still allowed now?

soft sedge Apr 5, 2025, 4:14 AM

#

hasty owl When will be the second KPrize competition?

Probably late 2025

nocturne warren Apr 9, 2025, 10:18 PM

#

I am in the leaderboard in this competition, but my homepage does not have this active competition like others

#

#

👆 is my homepage, no active competition

#

this picture is someone else's, which has the active competition tab

#

@limber scaffold @rugged finch Can you help me look at this? It's a little weird

#

limber scaffold Apr 10, 2025, 9:38 PM

#

nocturne warren I am in the leaderboard in this competition, but my homepage does not have this ...

"Active competition" is based on recent submission activity, so it might be due to not having made a submission to the competition recently. When the competition ends it will be shown on your profile either way.

nocturne warren Apr 10, 2025, 11:59 PM

#

limber scaffold "Active competition" is based on recent submission activity, so it might be due ...

Thank you for your explanation. I understand it now, and it really makes a lot of sense!

sullen bay Jul 13, 2025, 9:06 PM

#

soft sedge Probably late 2025

Too late, yes. Just came to know about SWE-bench 😭

past grove Jul 27, 2025, 11:08 PM

#

will there be another one next year? because of the low top score

versed bloom Aug 4, 2025, 4:59 PM

#

past grove will there be another one next year? because of the low top score

There will be another round but the time and other details are TBD based on feedback from first round participants.
https://techcrunch.com/2025/07/23/a-new-ai-coding-challenge-just-published-its-first-results-and-they-arent-pretty/
https://x.com/andykonwinski/status/1948190939983589737
With the new batch of open coding models I expect much better results although 90% will still be unlikely. Will probably take another look when they announce the next round. I thought they should have extended the time for the first round but if they cycle quickly, like twice a year, then that would be great.

TechCrunch

Connie Loizos

A new AI coding challenge just published its first results — and ...

A new AI coding challenge has revealed its first winner — and set a new bar for AI-powered software engineers.

Andy Konwinski (@andykonwinski)

And yes, round 2 is coming. We're starting by listening to the teams who showed up in Round 1. What worked, what didn’t, what we can improve. Big thanks to @Kaggle for teaming up on this crazy idea. https://t.co/p24tJ6gTcR

sullen bay Aug 14, 2025, 12:30 PM

#

versed bloom There will be another round but the time and other details are TBD based on feed...

what was the result in round1

#

and when is round 2 gonna be launched

versed bloom Aug 14, 2025, 11:16 PM

#

sullen bay what was the result in round1

https://www.kaggle.com/competitions/konwinski-prize/leaderboard

sullen bay Aug 15, 2025, 3:25 AM

#

btw why swe-bench not in kaggle benchmarks

versed bloom Aug 17, 2025, 5:52 AM

#

sullen bay btw why swe-bench not in [kaggle benchmarks](https://www.kaggle.com/benchmarks)

Those are benchmarks that Kaggle reproduces, so someone would have to do the work to set it up there. I'm sure volunteers are welcome! And seriously, it would be great for someone to get that done!

sullen bay Aug 17, 2025, 6:00 AM

#

versed bloom Those are benchmarks that Kaggle reproduces, so someone would have to do the wor...

pls go ahead n do that 😔