#konwinski-prize
No worries, I'm behind you.
Looking for teammates for this comp. If you are interested and have experience with github and LLMs, please reach out
hi everybody!
heyyyy
hey, do you guys have or know of any resources that would come in handy for preparation?
just to clarify, will the real test set be only python code bases, or can it be any language?
Hi I've got some experience and I'm looking at putting together a little guide to answer that, suggestions welcome.
As a start, in my opinion the best prototype agent to start studying is SWE-agent. It's intentionally simple and by the same group that made SWE-bench. The paper is also very good; I did a video reading of the preprint.
As a pre-req, here are some prompting concepts that LLM agents grew up around:
Chain of thought
ReAct (LLM tool use)
Retrieval Augmented Generation
You shouldn't worry about the particular language used, although if the hub uses Python, you want your solution to use Python. It involves using an LLM to come up with the "python" to the system, imho.
I looked through the eval code, it's definitely just python since the code is pretty tightly coupled to evaluating python
this might be a stupid question, but is the goal of the competition to develop a new model from scratch or to use pre-existing models and finetuning it to solve the task?
You can do both. For what I understood you can also add external data to your solution, but it can't access the internet during the execution
yeah i don't get that part though. We need the repos cloned though, right? so that we can embed the repo before making changes?
cloning the repo will need internet access...
My assumption is that in the bench environment the code and dependencies will be already available in the container, but I'd appreciate clarification from someone who knows for sure.
The above is correct. The repos are provided dynamically through the API. Check out the demo notebook and the predict method; you'll see repo_archive providing the relevant code base.
Per the demo notebook:
The evaluation API requires that you set up a server which will respond to inference requests. We have already defined the server; you just need to write the predict function. When we evaluate your submission on the hidden test set, the client defined in konwinski_prize_gateway will run in a different container with direct access to the hidden test set and hand off the data.
Your code will always have access to the published copies of the files.
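For anyone wondering what "the repo arrives through the API" might look like in practice, here is a minimal sketch. It assumes repo_archive is an in-memory tar archive (io.BytesIO); that format is an assumption, so check the demo notebook for the real one. The helper names unpack_repo_archive and make_demo_archive are hypothetical and not part of the competition API.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def unpack_repo_archive(repo_archive: io.BytesIO, dest_dir: str) -> list[str]:
    """Extract an in-memory tar archive and return the sorted relative file paths."""
    repo_archive.seek(0)
    # Use the safer "data" extraction filter where this Python version supports it.
    extract_kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(fileobj=repo_archive, mode="r:*") as tar:
        tar.extractall(dest_dir, **extract_kwargs)
    root = Path(dest_dir)
    return sorted(str(p.relative_to(root)) for p in root.rglob("*") if p.is_file())


def make_demo_archive() -> io.BytesIO:
    """Build a tiny stand-in archive so the sketch runs without Kaggle."""
    buffer = io.BytesIO()
    payload = b"def add(a, b):\n    return a + b\n"
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        info = tarfile.TarInfo(name="repo/calc.py")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


with tempfile.TemporaryDirectory() as workdir:
    files = unpack_repo_archive(make_demo_archive(), workdir)
    print(files)
```

Because everything is handed over in memory, no clone and no network call is needed at inference time, which is consistent with the no-internet rule discussed above.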
it can't access the internet during the execution
well that's not ideal. I can't use openAI etc?
Right, Konwinski's blog here clarifies that this is only for agents using Open Weight models. I assume the lack of internet access means those models will need to be running locally
https://andykonwinski.com/2024/12/12/konwinski-prize.html
Why I launched the K Prize, aka, why want to give $1M to the first open source AI that gets 90% on contamination-free SWE-bench.
Ahh makes sense!
This is amazing!
@fickle geyser 's notebook is great, but make sure you follow these extra steps otherwise it will not work. https://www.kaggle.com/competitions/konwinski-prize/discussion/553294#3081905
~~I tried to run @fickle geyser 's notebook but it doesn't work on Kaggle... however, I was inspired by his generosity, so here's another notebook which will run on Kaggle and gives you simple access to a function send_llm you can call to send requests to an LLM running completely within Kaggle itself!
https://www.kaggle.com/code/thornkey/notebookdd75bed624
(Note you will need your own HuggingFace token, but that's free)
Heaps of kudos to https://github.com/casualcomputer/llm_google_colab which I distilled to make this notebook.
See you at the finish line!~~
maybe I misunderstand the competition. But it baffles me how you can't solve it without making use of a LLM.
they're not saying dont use an llm they're just saying dont use a closed source LLM provider
you need to use an open source LLM
When looking at the predict method in "konwinski-prize-demo-submission", it appears that some information in the base training set ("test_patch", FAIL_TO_PASS, PASS_TO_PASS) is not being provided. During the inference phase on the hidden dataset, will the SWE agent that provides the code patch not have access to any of these columns to verify whether its generated patch is good?
I wrote a reply in the thread
Fair! Thanks!
Is it possible to run DeepSeek Coder V3 (42% on SWE-bench verified) in the compute available for the K-prize?
https://medium.com/data-science-in-your-pocket/deepseek-v3-the-best-open-source-llm-727d3421ae38
Is there a list of models that people have run under the K-prize conditions?
Ah, probably not. 8 x H200 recommended for FP8:
https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3
I am pretty clueless about the 4 blocks of code in the "Konwinski Prize Demo Submission". Can someone explain what's happening in them?
Also, while running the same code in a notebook, import kaggle_evaluation.konwinski_prize_inference_server is throwing an error.
NVIDIA has just released a 1.5B instruct model which should fit in the K-prize conditions and might be efficient.
https://developer.nvidia.com/blog/hymba-hybrid-head-architecture-boosts-small-language-model-performance/
The K-prize notebooks run fine for me on Kaggle but I'm unable to run them in Colab. The kaggle_evaluation package is missing and I can't find anything with that name on PyPI or elsewhere.
How do I install kaggle_evaluation?
import kaggle_evaluation.konwinski_prize_inference_server
ModuleNotFoundError: No module named 'kaggle_evaluation'
Same error but for me it's on the Kaggle Notebook itself
I've found the code, it is at {konwinski_prize_path}/kaggle_evaluation
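A minimal sketch of how that location could be used to fix the ModuleNotFoundError: put the directory that contains kaggle_evaluation on sys.path before importing. The helper name add_kaggle_evaluation_to_path is hypothetical, and the /kaggle/input/konwinski-prize mount point is an assumption standing in for {konwinski_prize_path}; on Colab it would be wherever you copied the competition files.

```python
import sys
from pathlib import Path


def add_kaggle_evaluation_to_path(konwinski_prize_path: str) -> bool:
    """Make `import kaggle_evaluation` resolvable if the package directory exists."""
    package_dir = Path(konwinski_prize_path) / "kaggle_evaluation"
    if not package_dir.is_dir():
        return False
    # The parent directory goes on sys.path, not the package directory itself.
    if konwinski_prize_path not in sys.path:
        sys.path.insert(0, konwinski_prize_path)
    return True


# Assumed mount point for the competition data on Kaggle; adjust if yours differs.
found = add_kaggle_evaluation_to_path("/kaggle/input/konwinski-prize")
print("kaggle_evaluation importable:", found)
```

If the function returns True, the import line from the error message above should resolve in the same session.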
I got the K-prize Gemma 2B demo notebook running on Colab. An A100 runs the fine-tune in 3 minutes (vs 27 on Kaggle's P100).
https://github.com/fovi-llc/refactor-python/blob/main/konwinski-prize-gemma-llm.ipynb
I'm still trying to fully grok this project, but as I see it, the overall aim is to create an LLM (or harness one) that generates bug fixes based off bug fix requests
Yes. It is an evolution of SWE-bench to push the frontier by evaluating on unseen test cases and using open, low-resource models.
https://www.swebench.com/
SWE-bench: Evaluate Language Models on Open Source Software Tasks
It's not really clear what it is about, or at least it wasn't to me, like a lot of Kaggle contest descriptions. It's definitely an interesting AI project in general: like developing AI droids.
There is a summary on the Kaggle page but Konwinski has a longer explanation on his site.
https://andykonwinski.com/2024/12/12/konwinski-prize.html
Why I launched the K Prize, aka, why want to give $1M to the first open source AI that gets 90% on contamination-free SWE-bench.
heck no what
requires like 300+ gb of vram just at 4 bits to run
im more worried bout this comp being canceled
Unsloth has quantized options that could probably run DS3 in the K-prize conditions, but I doubt that would work well because it is a MoE. Starting with a smaller model like Llama or Qwen is probably the way to go.
https://huggingface.co/collections/unsloth/deepseek-v3-all-versions-677cf5cfd7df8b7815fc723c
Yea uh
4 bit quantized = 300+ gb of vram
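The 300+ GB figure can be sanity-checked with back-of-the-envelope arithmetic. This sketch assumes the publicly reported total of about 671B parameters for DeepSeek V3 (a number not stated in this chat) and counts only the weights, ignoring KV cache, activations, and quantization overhead. The function name weight_memory_gb is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


DEEPSEEK_V3_PARAMS = 671e9  # assumption: publicly reported total parameter count

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(DEEPSEEK_V3_PARAMS, bits):.1f} GB")
```

At 4 bits per weight this comes to roughly 335 GB, which is in line with the estimate above.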
DS3 is MoE so the weights can be swapped. They say ~40GB for 2bit.
Not planning to try it but am interested in the results if someone does.
Q2_K_XS should run ok in ~40GB of CPU / GPU VRAM with automatic llama.cpp offloading.
https://huggingface.co/unsloth/DeepSeek-V3-GGUF
Ya but imagine loading and unloading the weights over n over
I actually thought of using deepseek v3 lol
I even spent a few hours downloading a copy to a SSD
Its just so impractical loading and unloading weights lol
Also anything below 4 bit, just don't
What I was actually thinking was having multiple models to perform different tasks
Like a main model and a critic model
And I can run code through them in batches
which has the same idea as brute forcing through a gigantic model like V3 but more streamlined
This is a new type of competition for me, but it sounds fascinating.
There are a few things I'm still trying to understand:
Can only open-source models (like LLaMA or Qwen) be used?
Is it possible to use closed-source models in any way (like GPT or Claude)?
It seems that relying only on Kaggle's computing resources would not be enough here. So, am I correct in assuming that to perform well in this competition, participants would need to train models on more powerful servers (like AWS or Google Cloud) and then perform inference on Kaggle? If that's the case, would it be mandatory to create a public dataset with your model or external dataset? If so, wouldn't that allow others to use it as well?
Apologies if these questions are naive.
Not speaking for the competition operators, but my understanding is that the answer to all of those questions is yes. I don't expect anyone to reach the score needed for the one million dollars because these conditions combine to be extremely tough, but the difficulty is the point, and for the same reasons it would be mind-blowingly beneficial for everyone if achieved.
if you had indeed reached 90% on SWE, it probably would be worth more than 1 million, so I am not sure it would make sense to submit your solution here rather than, say, make your own company.
Some people care more about doing good than making money. The prize (i.e. gift) is a nice return on a few months' effort by someone who takes that view. Job offers of more than a million a year if they want it, without having to do any selling or interviewing, are also a foregone conclusion thanks to K-prize marketing.
yes quite possible
Unsloth just released a free Kaggle notebook for running Phi-4 which supports up to 128K context length.
https://www.kaggle.com/code/danielhanchen/phi-4-finetuning-unsloth-notebook
no claude or gpt
also, in general, is it better to have a smaller number of parameters at higher floating-point precision, or lots of parameters at lower precision (4b quant @80b vs 32b @20b)
bc im assuming both take the same resources to compute
and another thing, would it be better to have 1 large model, 1 small model generating multiple solutions, 1 small model refining a solution multiple times, multiple models in tandem generating multiple solutions, or multiple models working on 1 solution (given they use the same amount of resources)
heres a graph from nebius
'best-of-N: the agent runs on each problem instance N times, and a problem is considered solved if it is successfully solved in at least one of these runs.
random: a problem is considered solved if a solution, randomly selected from N runs, is correct. This serves as a Monte Carlo estimate of the success rate using N samples.'
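To make the two quoted definitions concrete, here is a small sketch with made-up run outcomes (the data below is purely illustrative, not from the Nebius graph). For the "random" metric it computes the exact expected value of picking one run uniformly at random, which is what the Monte Carlo estimate approximates. Function names are hypothetical.

```python
def best_of_n_rate(outcomes: list[list[bool]]) -> float:
    """Fraction of problems solved in at least one of their N runs."""
    return sum(any(runs) for runs in outcomes) / len(outcomes)


def random_pick_rate(outcomes: list[list[bool]]) -> float:
    """Expected fraction solved when one run per problem is chosen uniformly at random."""
    return sum(sum(runs) / len(runs) for runs in outcomes) / len(outcomes)


# Illustrative data: 3 problems, N = 4 runs each (True means the run solved it).
outcomes = [
    [True, False, False, True],
    [False, False, False, False],
    [False, True, False, False],
]

print("best-of-N:", best_of_n_rate(outcomes))
print("random:   ", random_pick_rate(outcomes))
```

The gap between the two numbers is the headroom a good selection step (a critic, tests, or voting) could in principle recover.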
also, in the AI-numina math olympiad challenge, basically everyone used a deepseek 7b model, had it generate multiple answers, and used a majority vote to choose the final answer. However, as this isn't as black and white as math answers, I was wondering if maybe a diffusion technique could work out
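For reference, the majority-vote selection mentioned above can be sketched in a few lines. This is a generic illustration, not any team's actual code, and the function name majority_vote is made up. Ties go to the answer that appeared first, since Counter.most_common keeps first-seen order among equal counts.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer; ties go to the one seen first."""
    if not answers:
        raise ValueError("need at least one candidate answer")
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


samples = ["42", "41", "42", "40", "42"]
print(majority_vote(samples))
```

This works for math because answers can be compared for exact equality. Generated patches rarely match character for character, so some notion of equivalence (for example, grouping by which tests pass) would be needed before voting applies here.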
Thanks for banning him
Creating a team for the competition, if interested. DM me .
Hi everyone, I am a PhD student and I am looking for a team
@vale lichen <=== has implemented and shared a Kaggle notebook based on OpenHands that solves one of the K-prize test cases running Qwen 2.5 Instruct 32b.
https://www.kaggle.com/code/smartmanoj/kprize-openhands-fork
The first published notebook not scoring -1
https://www.kaggle.com/competitions/konwinski-prize/discussion/561695
Sometimes
yeah
Why do solutions in this competition score so low? A lot of solutions have achieved 30%+ on SWE-bench; how is this competition harder?
penalty on fails
(a - b)/(a + b + c)
a: correct
b: incorrect
c: skipped
what I don't understand is why benchmark is -1, as I see it skipping everything should score (0 - 0)/(0 + 0 + N) = 0
my bad
actually the metric is (a - b - c/10000)/(a + b + c), now it makes sense
actually no, then it should score -1e-4. Does anyone know where I'm wrong?
if n_correct == 0: return incorrect_score
there it is
incorrect_score has been set to -1
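Putting the pieces of this exchange together, the metric can be sketched as below. This is a reconstruction from the messages above, not the official scoring code. The function name kprize_score and the parameter names other than incorrect_score are assumptions.

```python
def kprize_score(
    n_correct: int,
    n_incorrect: int,
    n_skipped: int,
    skip_penalty: float = 1e-4,      # the c/10000 term from the discussion
    incorrect_score: float = -1.0,   # value returned when nothing is correct
) -> float:
    """Score = (a - b - c/10000) / (a + b + c), with a floor case for a == 0."""
    if n_correct == 0:
        # Special case quoted above: zero correct answers short-circuits to -1.
        return incorrect_score
    total = n_correct + n_incorrect + n_skipped
    return (n_correct - n_incorrect - n_skipped * skip_penalty) / total


print(kprize_score(0, 0, 10))   # skipping everything
print(kprize_score(1, 0, 9))    # one correct, rest skipped
print(kprize_score(3, 1, 0))    # three correct, one incorrect
```

The early return explains why a skip-everything baseline shows -1 rather than -1e-4, and the subtraction of incorrect answers explains why a confident but wrong patch costs as much as a correct one earns.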
You have to use a small open weights model that will run on an L4.
that's crazy
Today would have looked crazy five years ago (GPT-3 was released June 2020, and of course the real step change was GPT-4 in March 2023). But yeah, no one expects 90% next month. The prize pool is doubled (from $50K to $100K) if anyone can get 30% or better.
it seems that I am not able to submit when I exceed my GPU quota?
this notebook is public (although there is a huge disclaimer to the score)
Ok thank god, so Wednesday isn't the final submission date
Is that just the select patch verify? Did you make any changes to it
If not, what were your methods of obtaining your score
Also, we can submit two notebooks to the final competition, right?
so theoretically I can have two different strategies?
Hi Folks! Newbie here. Trying participating in this competition for fun and learning. Does any one know any link/post to get started and to get up to speed quickly?
If the competition submission deadline is Wednesday
Does that mean the start of Wednesday or the end of Wednesday
When will this competition end?
western usa time or eastern usa time, just curious...
good luck to all still in the running
When will be the second KPrize competition?
There may not be a second one. Those who have joined still have time to make new submissions.
Probably late 2025
I am in the leaderboard in this competition, but my homepage does not have this active competition like others
This is my homepage, with no active competition
This picture is someone else's; it has the active competition tab
@limber scaffold @rugged finch Can you help me look into this? It's a little weird
"Active competition" is based on recent submission activity, so it might be due to not having made a submission to the competition recently. When the competition ends it will be shown on your profile either way.
Thank you for your explanation. I understand it now, and it really makes a lot of sense!
Too late, yes. I only just came to know about SWE-bench
will there be another one next year? because of the low top score
There will be another round but the time and other details are TBD based on feedback from first round participants.
https://techcrunch.com/2025/07/23/a-new-ai-coding-challenge-just-published-its-first-results-and-they-arent-pretty/
https://x.com/andykonwinski/status/1948190939983589737
With the new batch of open coding models I expect much better results although 90% will still be unlikely. Will probably take another look when they announce the next round. I thought they should have extended the time for the first round but if they cycle quickly, like twice a year, then that would be great.
A new AI coding challenge has revealed its first winner, and set a new bar for AI-powered software engineers.
And yes, round 2 is coming. We're starting by listening to the teams who showed up in Round 1. What worked, what didn't, what we can improve. Big thanks to @Kaggle for teaming up on this crazy idea. https://t.co/p24tJ6gTcR
what was the result in round 1
and when is round 2 going to be launched
btw why swe-bench not in kaggle benchmarks
Those are benchmarks that Kaggle reproduces, so someone would have to do the work to set it up there. I'm sure volunteers are welcome! And seriously, it would be great for someone to get that done!
pls go ahead n do that