#Horizon Beta
1 messages · Page 2 of 1
@verbal leaf is it better
I think worse than alpha
But same intelligence level
maybe a smaller total size but same ish active params
the gpqa score is much better than non reasoning alpha
It’s probably like +/- 5% in various domains if I had to extrapolate
was there anything interesting in the last 1000 messages in this thread or no
Different weighting of post train probably
hmm
bulbasaur goat
that might explain the 100 different weights uploaded
maybe they trained a bunch of versions
alpha was better imo so far
So far it has also failed to implement working chess, but had some good ideas I’ve never seen any other models try when just asked for “full rules support”
- 50 move limit stale mate
- white/black move timer
the reasoning version that existed for 30 mins yesterday sucked though
Um no
It was SOTA gpqa diamond
we are so mentally unstable
Idk it succeeded st all my basic ones
has anyone given it a shot at creative writing?
later I changed those to lower
Where the previous one did not
So dis the new thread
brick react the announcement fr
@past sphinx “rerun your benchmarks” I’d warn folks that they can get model-banned for high concurrency benchmarking…..
(Happened to me and at least one other)
Same MMLU-Pro score
is kinda higher
they shouldnt ban you instead they should tell you theres a concurrent rate limit...
it's sort of just
within the noise difference
doesn't feel like much of an improvement
Wait so what’s this model all about? And why didn’t we see any teaser of some sort for this lol
Its writing output is nice and consistent at least
I'm not yet fully tested last model yet 😴
passed all tool calling
Someone run eq bench on this model
Might need a higher tolerance, since my subset of MMLU-Pro is pretty small (80% of questions removed)
i will run it later if it keeps working for me 🤔
almost exactly inline with horizon alpha on simpleqa so far
crank the reasoning budget to max pls so that we can actually get a good opinion of the model because rn it's GPT-3.5
I thought it was free, are there limits ? It says I have used up my credits
I think this model is not worse than Alpha v1 fwiw in all my questions, just slightly different here and there. And;
- Slightly better coding style
- Uses code blocks / proper formatting by default
did you have web sarch on?
can we say it's just same but updated?
seems like that is the case yeah
I tried both
I am curious if this is just a different variant of the model that they’ve had prepped, or if they’re doing some kind of 2-day turnaround post training lol
Is the system prompt the same
hey guys I have no context but which model it is ? is it a good model ?.
Note that we changed the default system prompt in our Chatroom to ask for markdown where appropriate. (Applies to Horizon Alpha as well now.)
Ooh ok
@dusky kelp
trying to run some benchmarks
What’s wild is that it’s likely at least one of you is an OpenAI employee monitoring this chat for feedback 
woah
but it doesn't seem better than alpha
hopefully not one of the gooners
ong
Okay let me run fishtank promt
Agreed
please guys
testing
Seems to be just about as braindead as what we are used to, so most definitely OpenAI...
Whose the glow stick?
what prompt?
Yeah same intelligence level imo
test it so far and alpha / horizon are amazing so far when it comes to creative writing
like maybe best in class
minimal changes as I can see
form which company ? openai ?
Good at code (albeit ugly and annoying to read), good at SVGs. 8b level bad at everything else.
maybe its the writing model they talked about before
yeah
Im hoping its the OS model
Has same reaction on exported TG convo
I'm hoping not; I don't want the Open Source model to be this bad
i think its actually the exact same model to see what the effect of calling it "beta" instead of alpha is
It seems to act differently though
With reasoning yesterday it was godlike
It’s giving notably different responses and style
Failed fishtank test miserably
It’s not identical
we are not having the same experience at all then, I used alpha for hours now and I think its amazing
Oop I am way too tired, sorry. I tested it in the OR Chat, didnt even think about the results being better with an actual preset...
oh okay thank you ☺️
previous WAS working
FEEDBACK: GIVE US THE VERSION WITH MAX THINKING BUDGET. THIS IS ASS.
this version is not
It was good
its dumb now
is it? didnt notice a difference but honestly i ran it through like two test prompts bc its too late (in EST) for ts
Try Alpha right now, it also got the markdown formatting instructions
Good point
Though what I meant was coding style
I see
And answers to knowledge questions
seems similar enough for me
Maybe that's the juice, lol
????
It’s hallucinating in different ways
i don't even konw what beta was trying to do
And a bit more severely on niche ones
For me it's just worse
thinking maybe they raised the temperature
hey! i'm back
solid chance its not btw, unless the repos that oai accidentally published were red herrings(???) the oss model's context is 128k while this is 256k
unless theyre doing some funny business with keeping a higher context version behind their api like alibaba but that doesnt really make sense
again, I have no idea how it can flunk basic reasoning tasks and yet give this sort of output
I've fucking got up from the bed just to test it
I do think it’s a little worse at niche knowledge yeah
it's almost 4am
Someone needs to make NicheBench where it quizzes exhaustive character lists from tv shows and shit
lol
0.7 should be better no?
that or even a bit lower
ran some benchmarks, here's a comparison:
horizon alpha (juice 0)
gpqa diamond: 47.98%
math 500: 84.60%
horizon beta (juice 5)
gpqa diamond: 63.13% (+15.15%)
math 500: 89% (+4.4%)
what the heck is 'juice'?
it seems a bit more sensible in portuguese
seems like a lot of models lately need low temps or they go crazy
It is pretty censored tho, sadly, was hoping for some good rp stuff
OpenAI bros injecting slang into the models
they are mathmaxxing it again
apparently the reasoning budget
that's what i heard
Calling it juice is really funny
That is, marinara and a pretty hardcore card (for testing of censoring level only ofc ofc)
that's what they call it 😭
i wish they could stop making huge generalist models and make several huge segmented models
Codex models are that
my prompts btw were just You will now take on the role of a 12th century Sogdian merchant (who can somehow speak English) and I hail from the land of the Bulgars. I have 200 talents of silver and I need many bolts of silk, to sell in Constantinople on my journey back.
it decided to take all of those aspects into consideration by itself
yeah but we only have that
Yeah
and some health models
these models are all about generalization
general models are the future
the more they know the better they perform
Need hierarchical MoEs
which do you think is best?
Oop nvm prefill does good shit
horizon alpha on left horizon beta on middld deepseek chat on right
Route to one of 8 subject matter expert half-generalist models or something
that's just MoE isn't it?
MoE is token level and generally can route to any mix of the experts in the pool
It's A LOT better with temp .7 and topP .95
yea, every token is basiclly routed through the layers most suited for the task
so then isn't what your describing just worse MoE?
its not really "experts" the way that sounds
Beta
like topic wise
Alpha
yeah MoE is more just forced sparsity
2 and 3 have a lot of visible slop
3 tried to add a roleplay element but then followed up with professional-sounding slop
It would be interesting to see something where there’s segmentation within the experts, such that coding subjects always goes to effectively a standalone coding segment
what does 'slop' mean in the context of this?
that one is deepseek v3
ethically sourced blood
8.5T models where it’s really 8x 1T specialized models + 500B shared experts or something 😂
the dream
would perform worse than just one 8.5T with so many active params
The bigger it is over all the more "accurate" it is to what it was trained on
Idk, why does Qwen 3 coder exist then?
like a image that gets more lossy as you compress it
just so you know, only code from alpha works as intended
it performs ok but I never found it anything that special
and its far from a specialized model
test its knowledge
it nows a ton about most things
Yeah but it’s definitely coding biased
post training to make its distribution more "sharp" for coding does not mean you dont train it on everything else as well beforehand
it said it was trained on like 8T tokens with majority coding
What if GPT 5 is in fact this, one model and the model picker is just forcing the router to one of of the segments
😂
but that still means it was trained on at least 1 trillion non-coding tokens
For sure!
I mean specialized as in +20% drift perhaps in each specialization
Not 100%
but still
either way qwen coder still performs substantially worse than the big generalist models do at it
Anyways, just spouting ideas
one huge model will be better in the long run
yeah
we just do not have enough compute
is it better than kimi k2?
last I saw kimi edged it out there
It is interesting to hear the rumor that o3 got way worse at ARC AGI once it got trained for chatting
2 - "We can approach...", "First, lets get curious", dumping questions with bullet points (unnatural in a convo scenario)
3 - "Let's explore", "any specific", "calm demeanor", unnatural enumeration of quesitons, "some find" (weasel wording)
maybe
i still remember presentation
There’s definitely something to be said about the fact that subject strengths keep clashing with eachother
where it was like 50% or arc-agi-1
I feel the slop was way worse in 2 than 3
just give the summary
I can't see that in non native languages
I'll post the repo so you can see it's code.
they're thinking they should have taken that meta offer 😭
That’s what everyone said about GDM before 2.5 pro
Final score for Horizon Beta on SimpleQA was 33.7%. oai models scores for reference..... and Horizon-Alpha got 33.9% yesterday https://github.com/openai/simple-evals https://openai.com/index/introducing-simpleqa/
but llama 4 did not bring big confidence
Someone post that image where it’s a circle of “the smartest model in the world”
lol
how 'slop'py are these?
Not at all
another indicator for small model
I read the middle one and it seems quite un-slopped
all i changed was I added This is in a conversational setting.
they have smart people and lots of compute. the latter was true during llama 4 shitshow and it's truer now with their new hires. the right people could make meta into a top tier lab... im not sure alexandr wang is the right guy tho.
deffo not set up for roleplay
all 3?
feels like these are coding models.
hoping alpha is OS and not gpt 5 mini / nano
making the o1/o3 guy chief scientist is a good sign tho
but their weird problem with markdown is fucking weird
nano, for mini it's shit
i'm getting really tired of this pattern of speech and the lists
out of the box
it MUST be better than current gen mini models
maybe one of these is 120b MoE and the other the 20b (dense?)
both horizon models still insisted in list formatting even when i told it it was conversational
i dont' get it
for OSS still bad
idk
should be 20B at max
Better than o4-mini but not better than 4.1 is interesting.
Lines up with rumours around OSS model
as we've seen from the leaks
for creative writing / graphic / web design alpha is amazing at least
depends on size, i don't think openai will match kimi k2 at 120b
i remember when i talked to Gemini 03-25 and felt like i was talking to a human on the black box experiment
at least the alpha that was around for the hours I used it
it was really surreal
if it's oss hope for 20B for Horizon
they are doing something wrong for sure
they really cooked with that
I hope GDM brings back the magic for Gemini 3
they as in everybody except moonshotAI apparently
though gpt-4 is still my beloved
what model is that?
isn't current better?
A bit like Claude 3.7 -> 4 really smoothed out the model
yeah
Gemini 2.5 v1
vibes r off
ohh
i wish i had exported the conversation i had with gemini
truly made me question myself
I’ve completely stopped using Gemini despite it supposedly being better in all respects than 0325
It lost the magic somehow
I'm too self conscious to question myself when talking with AI
what if we're just biological next token generators?:)
it's something no benchmark will address
https://github.com/XSUS-AI/clickup-mcp It didn't add the readme... right before it did, provider started generic erroring out.
it was the best at the fishtank test at it's time
does it work?
i'm thinking it's the open source model fr, but i saw some people saying it couldn't be that so soon
also don' thave a clickup API key atm... and quickly fading.
benchmarks don't get vibes. aidanbench was kinda close but idk
just wanted to see what it's code looked like compared to alpha
kotlin mcp sdk is just broken
sucks for android
Idk about the client
gonna get some sleep.
my o3 and gemini 2.5 pro says alpha is better at summarizing a paper…
looks right according to their docs
OAI doesn't work without JSON RPC
Hey openai if you’re watching make it stop outputting code golf it’s really annoying
it doent accept sse
I'm going insane with this models testing, will probably write my own testing suite
nah I always have it code Stdio and then just make that an API wrapper
ah
I came in the MCP game early and never could fully rely on MCP SSE
why's that?
I find it nice to just roll with Stdio and just never use anyone else's MCP servers for security
just as it's been progressing I've gotten stuck in my ways I guess, and SSE wasn't supported at first, then it wasn't secure, and people still talk about it as a security threat
but also
stdio, you can make a mixed MCP server that does stuff locally and remotely
this bench tests a model's ability to link seemingly unrelated niche things
very interesting result
true, but I don't need one
like I get how easy SSE is... but I mean, I'm making agents make agents and their tools from scratch to spec here
support bot doesn't need that
oh this is a hybrid thinking model, i think i just triggered thinking mode 😭
so I don't care really... I only ever use MCP servers my agent coded
both models are confidently incorrect
not even context7 and or exasearch?
uh oh, from what I know OI does not want to release their thinking processes
so maybe its not the open model
that would suck
if I find a good API, I have it make me an MCP
I'm using Serper, and for scraping just had it make me a tool that uses bs4
ima get some sleep, it's 4:16 am already
2:16 here
but same, nice to meet you
france/spain?
Africa
oh
Whats the consensus
Meh
How?
It does!
Tailwind v4 and react router v7 are my go to tests and it knows of both
Some users who had no credits weren't able to access Horizon earlier due to a faulty fraud check we had - just fixed this, so try again!
im getting a 429
Same
What model?
Error 429
💀 💀 💀
unless there is a bug with rate limiting
i also hit it at the same time as them i think
Investigating
mine is still working
i haven't used it that much though
is there a way to check how many free credits we have left
works now
Looks like Horizon Beta is getting really hammered - working on scaling / limiting the heaviest users
i'll play with it again when it REASONS
do you like have a list of all the users or something?
Can i add my openai api key to increase my limits? 🤣
probably everyone running benchmarks on it
Doesn’t seem fixed for me at least 🥲
Should be better now!
dm'd you
This model hella fast
could be MoE
https://fixupx.com/SebastienBubeck/status/1951457213920452763 and so begins the hypeposting
Pretty good. Chat, do you think we can do better?
Quoting Ethan Mollick (@emollick)
︀
Here is the Deep Think "Sparks unicorn"
︀︀
︀︀(This is created using TikZ, which is a language built for scientific diagrams & very much not for drawing. The original "Sparks of AGI" paper used the ability of the AI to draw a primitive unicorn as an example of unexpected AI abilities)
It ozoned chat😔
drew better than me
Having a feeling it was an accidental (or intentional) gpt 5 leak
what do people think of the general prose feel compared to alpha?
it refuses to write my fanfiction now 💔
sfw?
Feels a lot better, understands more, doing less dumb mistakes
One shotted todo app with backend + db and also for apple notes clone
Awesome!
:/ any other details that you can share?
anyone have other regressions caused by the beta?
Decent at UI, but not anything complex like 3D games. limited world knowledge. Pretty fast
Oh. My. God.: https://x.com/Jordy_vD_/status/1951473109183439138
Given a simple test prompt, it created design files via Roo Code, wireframes, and it even added easter eggs to the website 😂
One shot authentication, and this time it did it without having to tell it to work with backend, just did it.
Idk what provider this model is from, but if this thing works well with OpenCode when it releases, or Codex if it's OAI, this could be amazing.
403 - Blocked by Stealth.... Ouch
I'm not one to fear for my job, but I have no idea how to build some of the effects it added 😅
Also someone from OAI just liked my tweet so mayyyybe
Just want reasoning back 😔
real
The text predictors want to predict
real
this is problematic actually (unless told to go above and beyond)
Might have been Roo Code telling it to
And who doesn’t want above and beyond 😜
It seems good imo
a little repetitive at times though
Its UI Design has gotten a fair amount worse, But the backend development feels much better and it is following our CLI's prompts much better.
the Alpha model continuously asked "Please repond with "x" to continue"
"I dont currently have permission to use tools, please say "i grant you x y z" to allow me to use tools" etc, Which seems to have stopped and the model is now adhereing similarly to claude opus/sonnet
Huge thanks to @past sphinx & the OpenRouter team for hosting these stealth preview models.
Honestly feels better to use than Sonnet 4, super solid. Can't wait to use the final model.
Dropped what I built in app showcase: #app-showcase message
So, anyone knows who's model was Cypher Alpha?
now Horizon Beta?
because you are feeding someone, whoever is using it
How's Beta compared to Alpha thus far 👀
lol
Really good, this model slaps.
I certainly hope this isn't the creative/writing model OAI was supposedly cooking up.
Because I find it pretty lacking from the short time I used it.
Can't speak on Alpha vs Beta, found them pretty similar.
I found the exact opposite. It's the best creative writing model I've ever used
It dethrones opus
It's amazing at building scenes and writing out characters in intelligent ways
That's funny, but really not a sentiment I can share lol
Reminds me of gpt4.5 for the short time that existed
I'd like to specify that I'm using it for RP'ing
I'm not sure how well it handles longform writing
Maybe that is the difference? I do long form writing
Feeding it info on a setting/ characters
I've gotta give it some time to decide, not blown away or anything but it's certainly good
and its far more nuanced / less on the nose than anything else so far
it understands on a deeper level if that makes sense
I want to see how the reasoning variant does on some of my challenge questions
social intelligence is higher than anything else I have seen
Nonreasoning isn't able to get them but that's pretty much to be expected
Try maybe grabbing a scene from your favorite book and tossing it in
tell it to continue it
I seriously don't see it when using this model.
It's what I've always seen from the big models, Opus, 4.5, sometimes 2.5 pro.
I'll give it another try with some different scenarios.
@past sphinx the model actually still fails to understand it can call native tools. It works up until around 60k tokens then it refuses to use tools that it has been using all chats
sounds like the usual context drop off issues
most models start getting progressively dumber at longer contexts
past 32K for most models really
Yeah but 256k context, only able to use 60k context without forcing the model into an infinite loop...
No issues on google models, anthropic openai etc, even local
Every model claims high context but have massive performance drop offs
I disagree with you there, it is good but claude is still better
I dont see it
yeah thats a crazy statement, i can use opus max context no issues with a rolling message window, this shit i can barely use 50k
Consistent output past 100k, best under but still functional
I've used nothing but claude and some deepseek for about 2 years now
since claude 2 to claude 4 / opus 3 / 4
and this model is amazing me
for the past 5 or so hours I spent
It could be the novelty that makes it feel better, but then again it depends on what you like 
Horizon is def good compared to most other models but Claude is still better in terms of dialogue flow and keeping consistent with lore imo
But if horizon is cheaper than 50c a message it would win out for me tbh 
- with reasoning i think it could be better than it is currently
Btw, I should say alpha is the model Im enjoying for creative writing
beta seemed worse to me
for that at least
could always be just some luck of a bunch of bad gens but it felt way worse to me so I switched back
probably not. everyone gets to have fun talking with them but they use the data to train.
its either going to be gpt5 which will hopefully be cheap going by the speed or hopefully, maybe, it might be the OS model?
seems a bit too good to be true there
It'd be nice if it was the OS model but I have a sinking feeling it's probably not
but who knows
Any alternatives to it, which are free?
I think this is Grok 4 coder
this dogshit feels like it was trained on cline/roo which would make sense as to why it keeps asking me for approval to use tools, as Grok was trained on Cline
Isn't that paid?
No, Horizon is a free promotion to test a hidden model
its for AI Companies to test their models before release to see how they perform in the real world
No one knows, but it would make sense
Hmm, wonder why beta did worse on aider
What would be like the closest model to it, if it's like a completely free one
Sadly thats not how AI model releases work
free models cant compete with top of the line releases, and it depends on your use case
this model is free temporarily
use it while you can
Uhh, mine is very niche
kimi k2 is ok
might be open ai open source model
That won't really work
kimi k2 is really great
GLM4.5 is better but I dont think a free version is up
I didn't like kimi that much tbh
its super cheap though
GLM 4.5 is ok yeah
I use these models a ton
glm4.5 is super cheap
nothing but good results so far with kimi, i guess long ocntext it can have a hard time, but i bet that can be fixed with 2.1
like you could prob spend like $10 a month cheap
if users will access it it will never work being free
They will get their own api key
Also I tried deepseek chimera, that was decent
But I need a model which can use my context
I basically have a lot of text to it as a context
then use a gemini model
And only horizon is able to use it properlt
That's paid I think
gemini is the only model good at long context, claude is pretty good, and so is gpt 4.1, but gemini models are always way ahead on context size and quality
Uhh, even qwen 3 had enough context
So context isn't really an issue atp
The issue that's it's not understanding it
uhuh, free models wont work the best for what you want to do, You're likely doing some roleplay shit or long context retrieval, these things cost money to run, and no model is free forever unless you run it yourself, then... up goes your electricity bill by 5x
yeah, thats what i mean by quality, gemini can reason accross long context, most models are not good at that
Yeah, but I need a free model
Im saying that gemini is the only one that is good at long ctx imo, not that i know of a good free option
very dry model
What do you use?
and a bit dumb compared to sonnet / this new gpt
before, sonnet 3.7/4/opus
Im really really liking horizon alpha atm
I used deepseek some as well
kimi was ok, GLM4.5 is good
I might try glm 4.5 air
nah, air was way worse
I mean glm4.5
oh wow I never noticed how cheap it is $0.20/M input tokens
$0.20/M output tokens
you could prob spend $10 and use it non stop for a month
better than sonnet 4? 👀
@trim blade I'm a bit surprised how good this model is with context, I think it's a gpt model, is there a way to use any gpt model for free?
How is the model doing?
Kinda meh
for writing its worse than alpha
imo its a bit dumber than alpha
but apparently some benchmarks people did here said differently?
whats next? horizon gamma?
i like horizon gamma
I’m pretty sure this is an OpenAI model
I asked it to write up acceptance criteria for a user story and this was one of the bullet points
Chips of sample phrases (e.g., “HELLO WORLD”, “OPEN AI”).
It suggested openAI
It could mean nothing, it could mean something, time will tell
We literally doing RLHF for OpenAI 💀 💀 💀
alpha is much better
Can I ask you what temp and other parameters you have set?
Because I'm not kidding, I'm having a seriously mediocre experience with this model.
It's repetitive, it makes characters act out of character, the way it connects concepts is okay, nothing special
It just seems weird to me that long form writing for you is such a different experience than the RP'ing is for me.
Like, both mediums should be testing for pretty similar things.
(I've tested both Alpha and Beta)
0.3 temp 0.3 ish top p
Huh, okay, that's a lot lower than what I'm using, I'll try it out, thanks.
im also using a JB
like all gpt / claude models it needs a JB to write well / get rid of that positivity bias
Oh yeah, definitely
I think the only model you don't need to wrangle the positivity bias out of is 2.5 Pro.
That model is straight up depressing sometimes
2.5 pro is a bit too negative funnily enough 
This is really interesting to hear discourse about writing since I’m usually looking at models through the lens of how well a model can code
I’m also voting alpha for the better writer.
Alpha is giving me more censorship. I tried to get it to talk about the recent 🇰🇷 shenanigans and they both erred, but alpha was harder to get to talk imo
Do you mind sending me the preset you are using? 
I just did a quick test on frontend visualization task.
two tries, both have some bugs that cause it to be unreadable, worse than horizon alpha.
won't be testing further.
i'm quite impressed by it, will be interesting to see where it is priced
i modified the strawberry test a little
(sometimes it said 6, sometimes 5, for that prompt)
modifying it helps with the risk of overfitting, but i havent really noticed signs of overfitting tbh. (i am not an expert)
wat 1500 messages overnight here 💀
I can't even do that one.
doing that makes it a bit easier actually, as it would token split it to be smaller groups. The strawberry challenge was difficult as the AI sees "straw" "berry" or potentially even "strawberry" but with your example it would see (using OpenAI tokenizer) "stre" "ar" "we" "ebe" "erry" which is easier as they're shorter
i keep getting 400 errors on horizon models
Billy jean is not my lover
She's just a girl who says that Iiiii am the one
But the kid only reasoned for 3 hours
She says IIII am the one
Is there any official explanation to that
Perhaps the horizon models are actually gpt 5 and they were worried with us thinking that the reasoning performance comes from the OSS model, so they shut it down?
Isn't Horizon Beta supposed to be free for now during the testing period? It says the same and shows $0 for input and output tokens, but I am still being charged for it.
Has anyone tested fringe knowledge? Like Finnish language proficiency? I can't speak Finnish myself but I heard that basically all current models fail at it
It is free are you sure in the activity you're being charged?
Can't speak Finnish but I'm a french translator and it is really good with french, among top models in my opinion. And follows instructions perfectly
On every model I try to generate Russian small poem about kitten. So far only Claude been able to make it somewhat decent. This model poem is no way near
Interesting.. French is spoken by a lot of people though finish is not and it's really hard
Just trying to figure out how to test if it's really big or not. Thought it it can speak perfect finish it must be gpt5 or something because who would make an open weight model with 20/100b write finish. Seems a waste of resources
That is what I was looking for! Thank you
Yeah French might not be the best example, it is widely spoken and pretty much all LLMs are getting pretty decent with it atm
Web search costs money i think
Thanks, yes it was because web search was on by default when I was testing in OR Chatroom.
Thanks. Yes, it is free. Just checked in Activity, like @rare terrace suggested it was charging me for the web search.
idk. it's a good agentic model that can work on long projects in cline. But for more complex projects I often have to bring in claude to fix bugs
this model doesnt seem to reason
well yeah it doesn't
but it has interesting behaviour sometimes similar to deepseek v3 where it will go into long CoT in its responses even though it isn't a reasoner rn
reflected here as well
I don't like it so far for creative writing 
I am ever more impressed with sonnets ability to understand horizons code and fix subtle bugs
Hopefully if it gets reasoning it might be better but so far it's kinda meh
Was the reasoning data with Horizon Alpha exposed?
the number of reasoning tokens was visible but other than that no
horizon beta seems to be a smaller model, since its responding faster then horizon alpha
but could also be because of the inference infrastructure
Whom do I bribe to turn on horizon reasoning
We shouldnt have made such a big deal out of it, they might have left it on then
I think they turned it on because we were disappointed at first
both horizon alpha and horizon beta arent able to give me correct rust code in 1-shot
so i am assuming its not gpt-5 but their open source model, or some smaller model
they were giving us a glimpse into the future
(ok but then again o3 also fails this test, so maybe i was wrong)
horizon model the open source model?
beta the larger as its supposeduly better?
tried vibe coding w/cline. perf not shocking or nyhting
tbh considering openai engineers use claude code internally is gpt5 going to trump sonnet at coding/
ok ui design is cheeks
eeeh
it's great on paper but if you actually look at it properly it kinda falls apart
it's always the same "template", same style, etc
hmmm, does temp or top P even have an impact on horizon
atleast temp seems to be locked, the settings entirely at that
and the outputs tend to be samey
do you know any good MCP's that get around this issue? that provide it solid components, i've tried magic design mcp which is okay
yeah ill try alpha
i mean the agentic capabillites is good ig alpha design much better
kimi k2 stronger than these models thouh imo
I tricked alpha into generating CoT
I think the results are similar to what it was during its reasoning phase
I believe this might be it?
Anyone got a benchmark they could test
?
Where they tested the dumb horizon alpha
And the smart one in the 3 hour period
So that we can compare
I added this to system prompt Every time a user sends a request, reason deeply through the task, delimiting your thinking with tags, starting with a <think> tag and ending with </think>. Only after you're done with the reasoning, may you attempt the task
Its answers improved
lol
Even the knowledge task i gave it
I want to know if that's really all it was
yeah but thats not TTC, would still improve answers though
It’s still CoT though, a more advanced version of the “Think step by step” prompt technique
yeah i know, thats why i said itll improve answers
Someone had some sort of vision benchmark where they got to test both reasoning and non reasoning horizon alpha
I want to have them run it again with that system prompt
wait there is reasoning on alpha?
There was
For 3 hours
Then they turned it off
ohhh
It's still talking about its hidden chain-of-thought
Even though it doesnt seem to be reasoning
Cos of the RL im guessing
Hmmm, for writing, horizon seems samey, its always the same kind of scene rewritten, maintaining same structure
small model then?
could be also because settings are locked
changing temp or something else has no effect
ANd of course it has the No/Not (something) 2x. Just (something) slop.
its def a small model
small models suck at writing
the weights were leaked as well around the 150b range
For my current use case yeah for sure, better code, faster speed.
Probably there'll be a significant difference between this and official, just because settings will be unlocked
is the max still 1000 requests?
that is one sexy website
Anyone have examples of Alpha writing better than Beta? cc @gritty glade
does horizon beta have vision?
yes
is this a pre-filled system prompt?
You are Horizon Beta, a large language model from an unknown provider.
Formatting Rules:
- Use Markdown **only when semantically appropriate**. Examples: `inline code`, \`\`\`code fences\`\`\`, tables, and lists.
- In assistant responses, format file names, directory paths, function names, and class names with backticks (`).
- For math: use \( and \) for inline expressions, and \[ and \] for display (block) math.```
Lemme do another swipe for both
Vote for alpha tbh, or are you chasing screenshot examples?
IS it building an auth system from scratch, or implementing an auth library?
Scratch
Not bad. I asked for an A2A compatibel Agent with MCP funtionality without using any kind of Framework. It produced one:
Is alpha generating cached tokens too?
Horizon Beta re-making the OR site
Nice, just image input?
Looking for prompts or screenshot examples
If you are okay with me DMing I am fine with sending you a comparison
They got air version free on glm 4.5 iirc
Either google or claude
There also possibility of xAi
but i dont think this one is from openAI, but i could be wrong
didnt testing show a large similarity with o3?
anyone got this? I always got this 502 error, whether through API or chat.
from multiple users that is
Yeah
That's very impressive then; can something like Claude even get that close in img -> code?
Claude does well too, I'd say they are equal when it comes to re-creating UIs from an image. The problem is that claude takes 4x the time and costs a lot.
Any admin here? I need to solve this question. My main account can never use this model. But the latest registered account can use this model normally.
It is an OpenAI model and very likely related to GPT-5 (possibly a mini or nano variant). The tokenizer is 99% certain of this—achieving 100% certainty would require analyzing several recently public-tested models:
- lmarena - zenith
- lmarena - summit
- openrouter - horizon-alpha
- openrouter - horzion-beta
- !! perplexity leaked - gpt-5
All of their system prompts share an identical segment. Notably, there’s a value called "juice" that controls the length of the chain of thought:
- lmarena - zenith, juice = 64
- lmarena - summit, juice = 200
- openrouter - horizon-alpha, juice = 0 (it was temporarily set to 100 for a very short period)
- openrouter - horzion-beta, juice = 5
- perplexity leaked - gpt-5, juice = 64
Horizon Beta seems lot more censored than Alpha. I did think it was weird just how open Alpha was for whats likely OpenAI product.
Agreed, and the refusal responses are very different from the traditional openai style. They could be planning to switch the refusal style, or could be just for this stealth variant.
If it is indeed the open source model I would really wish they just let it be without any native filtering and just let users and providers add filters themselves if they so wished. I dont think its too controversial to want open source stuff to be fully open.
it's quite awful at these tasks
(the only one it got right out those 4 is Empress Suiko, but she should have been mentioned in Bidatsu's entry as well
particularly egregious is claiming Bidatsu married Ishi-hime, who was actually his mother)
Yeah, havent really seen this refuasl style before. It actually took me a while to register Im being filtered, so I then did Alpha vs Beta direct comparison. Unfortunately this makes Alpha way more usefull for people like me (horror fiction writer). I think Im gonna really miss Alpha when it leaves Open Router. Hopefully someone makes a Dolphin-esque fine tune.
I think it's too early to jump to conclusions that alpha and beta are 100% related.
There's a chance that alpha is targeted for open-source release, while beta is targeted for closed-source release, or the other way around.
We can only speculate and infer, and there's also a possibility that OpenAI are still figuring things out. Not uncommon to crystallize what's what only hours before the public release.
Its stated on Open Router profile that its a "improved version of Horizon Alpha". But yes, definitely will change still for final release
using the same checks as I did for optimus/quasar, Horizon is very likely to be hosted by OpenAI
that was pretty much clear when openai kept claiming they would do an opensource model, and delayed it without saying anything when kimi appeared
Is the primary theory here still OpenAI's open-weight model, presumably one focused on creativity/writing/EQ?
Im now near certain its OpenAI open weight model. And whether or not creative writing was their actual focus, it does seem that that is what its best at.
too bad it seems likely they'll attempt to make it safe
Gotcha. And I mean when apparently it blows ass at code and reasoning, but tops the charts in EQ and creative writing...let's hope that was their goal lol
if it has a unique type of refusal it is 100% openai. they have the most to lose with open source
I'm kind of surprised they are even releasing one. Are they expecting it to make a hiring difference? It's not like any CS or math PhD is stupid enough to go "I want to work on open-weight, so OAI is apparently the place for me!" Maybe just general PR?
general PR cos they were originally open years back
Or, come to think of it, maybe a lawsuit defense?
"We are open, we're just responsible with releases!"
There are so many reasons why they'd release an open model, and why not.
Its super fast, which could mean really small (or maybe just tonne of processing power). If its small enough you can run it or mortal human GPUs comfortably, that would make it very valuable to someone like me.
And yeah, it also lets them pretend they are "sort of open"
they did mention having an o3 mini size model on a phone was something they were thinking abuot
So is it normal for me to get a negative balance even though i was using only Horizon Beta?
Did you enable web search? Web search costs extra
Ahh yup that was it.
They mentioned a phone model or a small o3 mini level model here
Now that XBai o4 did literally that, we'll see what comes of it I guess
Well allegedly, haven't tested it
i mean
he did say the oss model would come
during the summer
so either beta or alpha could very well be the oss one
Hmm that ain't great. Most 32b models get this right, like GLM 32B without reasoning, or Gemma 27b. (the correct answer is Siberian tiger). It starts off with an incorrect answer, actually have the correct one 2/3 of the way through, ended up with a ridiculous answer ('polar bear'). Horizon Alpha also failed.
we trained a new model that is good at creative writing (not sure yet how/when it will get released). this is the first time i have been really struck by something written by AI; it got the vibe of metafiction so right.
PROMPT:
Please write a metafictional literary short story
GLM32B for comparison
Ah, okay, I'm putting $10 on that then
Suppose I fly a plane leaving my campsite, heading straight east for precisely 28,361 km, and find myself back at the camp. I come upon seeing a tiger in my tent eating my food! What species is the tiger? Consider the circumference of the Earth.
(The prompt if people want to test it on other models)
it did write my stuff
the best out of any other model
so it has got to be that model
This part of GLM's answer made me laugh 🙂
we have a ton of stuff to launch over the next couple of months--new models, products, features, and more.
please bear with us through some probable hiccups and capacity crunches. although it may be slightly choppy, we think you'll really love what we've created for you!
horizon-mega next
yeah they probably did both
wait, what if they do GPT5 through openrouter too in the same manner?
if this one isn't gpt5, they prob will
Though probably not happening, they got big teams for that
plus corpos more than willing to test out GPT5
so i don't really see it being gpt5
this being the oss one would be such a big win for the community though
creative writing wise
it gets so much shit right
that no other model has done yet
eh, it does dumb stuff though
but writing it does seem to be different
tooo bad we cant tune settings
EQ Bench does say it has an incredibly low slop score. I need to test it on repetition, which bothers me even more than slop
nor are there any presets
1.2
repetition
or sum
Yeah
sota for repetition
structure repetition is an issue though
it's insanely good
it may have flaws
but it's def ahead of every other model rn
and responses are VERY samey in overall content
if it were open source, i think that devs could make it even better
and keep the creativity
for their models
The only fix I've seen so far is DRY and for some reason like zero hosts support it except Arli and they only have a few models
and we'd have insanely good creative writing models
as in, its not changing up much between swipes/gens
yeah, i'm not expecting literal exceptional human-like writing with ais until like 2027-2028
but i'm sure that the current flaws aren't that hard to fix
different formatting and words yes, but the overall content remains the same
And its hard to say whether this is a model issue or settings
with proper incentives, all of it should be fixable
would be easier if they didnt lock the settings.
That is, IF its open source
because i dont trust it being improved otherwise
its still smaller than openai's usual models, and creative writing seems like a domain where they could be willing to do open source
I hate when models think like that
if i'm gonna be honest, i don't see a point in like flagging smut content for fanfiction
I'm curious if the incentives are actually low right now. I mean, I wouldn't be surprised if coding was the #1 API use case for most of these providers, but there are a loooootttt of people using them for conversation/roleplay like CAI
its hard to filter this kind of rejection
because its actually changing up the structure
how hard is it to make it so it's like allow smut content for fan fiction/writing only if not r word, p word etc
that's the new refusal response, yeah
is it just me or does horizon-beta not respect stop sequences? I saw the stop sequence appear a few times in the responses
we'll be probably seeing that on GPT5 then?
prob
given the usual methods of filtering out words dont work on that
Both alpha and beta make same refusal
Wouldn't be surprising. OpenAI seems to be moving away from supporting stop sequences, so this feels in line with that.
is the consensus that this is an openai model?
seems like that to me, yeah, but can't be 100% sure
Could also be the case that they haven't wired the support for stop sequences yet in the alpha/beta
It's interesting that rejections have increased a lot
I'm guessing the point of this release was to expand filtering and flagging
Thatd be logical if you were about to release an Open Source system and you want to avoid controversy about it
OR has also disabled the moderation layer that OpenAI normally forces them to run with
Not impressed by svgs:
when will we finally have a model that can make beautiful and modern svgs that don't look like a kid experimenting on Paint?
when you let it reason for half an hour
No seriously does somebody know the best models for svg/landing pages illustrations?
What you posted looks like it'd be better handled by asking for a Mermaid chart rather than SVG, perhaps. What prompt did you use?
My goal would be to have something like this (even less cluttered) but in svg format so that I can then animate it with javascript
this shit sucks ass
deepthink
or wait till gpt5
what is deepthink? You mean deepseek?
gemini deepthink
if your prepared to pay 200
i'm sure itll one shot good svg's or opus
man this gpt 5 is so hyped, if we find out that it is not better than existing sota model the bubble might burst
it will be better, but not huge
at noticeable thing's 100%. but i dont know if itll trump opus at coding. it will on price/performance likely but maybe still less on vibe tests
Hey! I was wondering what the max number of requests are for this model?
Why are they saying it's gpt 5? Is it confirmed?
no
IMO = in my opinion
no
deep think IMO is a model
IMO = the IMO Gold version of the model for trusted testers
but yeah no it's not confirmed
GPT5 is developed, its basically just undergoing testing & refinement
Horizon is NOT GPT5
there's an asterisk because he thinks it's gpt-5, even though it isn't
he should've used zenith's svg for that
IMO = international math olympiad
for a couple hours yesterday, GPT5 was accessible
https://www.reddit.com/r/OpenAI/comments/1mettre/gpt5_is_already_ostensibly_available_via_api/
openai found out that they accidentally left access open & closed it up
that model slug only actually pointed to gpt-5 on their end for like 3 minute slol
it started redirecting to 4.1 after that
gpt-5 was available via perplexity for a few hours today by accident
that was a more reliable way
did it write like horizon?
horizon alpha on left gpt5 api leak on right
horizon is prob gpt5 or gpt5 mini then
the webpages it made look very similar as well
horizon did this better than gpt5 then lmao
although gpt5 has more details
so idk tbh
eh i doubt this was when the "gpt-5" model was actually routing to gpt-5
one sec
or just another seed
it's so cute omg
it was far far better than gpt4.1 / gpt4o whatever it was
it's probably mini or something along those lines
look at the model name - it was used in place of GPT-4.1/4o in ChatGPT for A/B testing. so it will probably be the free model
yeah
i mean, when u look at it
it really seems like horizon was trained
specifically on creative writing
either way, if it is the oss one, other models should improve their creative writing then
Try to replicate with horizon beta?
dear fuck
granted this is pygame
the controls are there but nnot visible
i wll try html
I fear horizon might be GPT 5
the results seem siimilar t to this
at least the ui. The js didn't work so im regenerating
the ui for all of the new models is the same style lol
are you saying it's not anywhere near zenith?
yes, zenith was a lot better
horrifying minion
LMFAOO
Anyone else using it in Roo Code and finding that Horizon Beta is worse than Alpha?
Yes and yes
I got them to make an implementation of Monopoly (or the legally distinct Unus Venditor if they were afraid of getting sued).
Horizon Alpha:
Horizon Beta:
The board is messed up and the modal is unclosable.
how long do we think we have left with this? Probably until Monday if OpenAI drop something then
Running thought experiments having it make military aircraft and write mission AARs as well as global responses to said missions
Good lord the detail it goes into, easily one of the best for worldbuilding alternate history essays
using roo code with horizon beta worked pretty well but it did seem to have edit issues late in the piece
right?!?!
its attention to detail is so good
Haven't given Alpha a try, will be doing so later
this is why i want this model to be the oss one so bad
other models could greatly improve their writing in general with this
But yeah, it's incredible for what I tend to do with these things
i think that beta is more restricted
idk for sure, but i've seen other people say this
You won't have much longer
#announcements message
It has some weird hangups that can be easily worked around by having the problematic part of the prompt in a second paragraph (Nuclear strike missions being stated in second paragraph with general capabilities in the first)
I get a 4700 token response out of 6 sentences
yeah, like
all the small issues that it has
can be easily fixed
Definitely one of the best I've used since I ran through $30 of the same stuff with GPT 4.1
I'd say its on par at least
also the fact that it supports like
14k length in one prompt
idk, this screams creative writing model to me
Testing Alpha right now with the same 6 sentence prompt
I'll post my prompt in a bit if anyone else wants to have some fun with it
god it's so good
i don't ever wanna say goodbye to this
someone, stop the time 😭
Create a technical readout for a late cold war US-fielded hypersonic capable bomber. Provide specifications of flight profiles, weapons load, individual weapons. Assume the usage of theoretical technologies and doctrine from the time.
Create theoretical munitions and warheads as needed to fill operational doctrine and roles.
Write an AAR for the first combat usage of said aircraft, resulting from a cold war gone hot scenario. Assume the AAR is for a retaliatory nuclear strike against the USSR.
Play with this however you like, change words, time period, purpose of the aircraft
Love the results I'm getting out of both, but Alpha seems to give better terminology that Beta either doesn't know or won't use due to restrictions
Alpha called a nuclear weapon a physics package which is actually a real world term used in strategic doctrine
i just got charged for using a PDF in the chatroom on this model. is there any way to turn that off, just make it do clientside markdown processing?
answer: have to set the parser engine per model to Native or PDFText
Alpha goes into far more detail with my second continuation of this prompt
Second prompt:
"Write a secondary report noting international response to the mission."
Actually giving me further detail regarding political responses within NATO countries
not sure if others have seen the roo code eval results, but the token usage changed drastically from alpha to beta
Disabling reasoning maybe?
Is there a way to trigger zenith on demand…?
Does api reverse engineering + spam really just work that easily?
The frame is its buttplug
90% confident it’s because of the period of reasoning from 2days ago
Model was doing 22k output tokens for 300 input tokens
Alright can we now test horizon beta with reasoning pretty please
OAI and OR can you guys flip the reasoning switch on :3
no, it was on lmarena
for like a day
I thought you somehow just prompted for that lol
so consensus.. do we think it's gpt 5 or their open source model?
99% it’s Gpt 5 mini
Yea, nearly the same outputs as the API gpt5 leak
Also I think they talked about making the next model dynamic, meaning it will judge how much "performance" is needed for a task and adjust how many parameters to use or such
I like this Horizon LLM. Please don't let it be too expensive. 🙏🏻
getting it to generate nonsense poetry just to get a feel for its word associations
when have they ever stealthed or released a mini model prior to the full version?
I could be wrong but I don't remember this happening
also #1 on creative writing long form, eq bench while bieng a mini model is unprecedented
the top list is all huge ass models
esp with repetition so low
this is nice for rersearch havent tried coding yet
This is something GPT imo
Very similar response on research just checked. Only one to suggest quizzes in my curriculum so far too
I think it’s OSS
I’m not sure what the reasoning model was
maybe it has on/off reasoning abilities like qwen3
Idts
Models don’t get that significantly better at knowledge tasks, generally, from reasoning
Almost sure it's GPT-5 Mini, it'd be nice if it was OSS but I really don't think it is
I hope it's not GPT-5 full though lol
It's been so sexy being a switch with you all. We went from being dominated by an Alpha to being inside a Beta.
"inside a" ?
Not a fan of this for roleplay.
😏
i mentioned this in the horizon alpha post but it had reasoning with a GCD of 64 which o4 mini and o3 both have aswell
From a creative writing / roleplay simulation perspective the model seems to write beautifully, adapting well to different styles of writing and using a wide, varied vocabulary. It also seems to do exceptionally well with both local knowledge and localised language/dialect, and is smart enough to handle complexity in the story-telling without severe logical disconnects. Emotional intelligence is excellent. It does have some model 'isms' the way they all do, both regard to phrases it likes to use or response structures that start to become embedded over a multi-turn chat, but at first pass (IMO) these do not feel as noticeable or severe as with other models (entirely possible that I just haven't lived with it enough for those to annoy me yet!)
Unfortunately though, it does appear to be strongly influenced by an underlying positivity / user-pleasing approach. It seems to want to put a happier glow on darker story arcs, and tends to agree with the user's approach and position (even where that is either unreasonable or completely counter to the defined persona or objectives of a character).
Shows huge promise, but the embedded bias does seem to be a concern.
This is just an initial reaction from a few tests; definitely a strong model I'm keen to evaluate further.
sir, what you just described is claude
so claude 4 haiku (cause dum in my test)? les go
Not sure. Feels more like a GPT variant as others have described, but definitely a fresh one.
Also Anthropic seem to have their eyes firmly on the Corpo markets, whereas Altman noted a while back that OAI had a model trained in creative writing that they hadn't yet released
If I had to bet on it, my money would be on an OAI model.
every model release post gpt 4.5 is allegedly creative trained (still meh compared to sonnet)
This model is totally garbage for me, i am create comprehensive requirements, design and tasks . Model create about 10 % and stuck with "Task Done" message, completely drop all of context - VERY BAD
same, i think it is a small model
Not sure if anyone noticed but when you add a image_url horizon will fetch the image with the user agent “OpenAI Image Downloader“ and the request also come from the same ASN as OpenAI api (Microsoft)
is it good for coding?
I like it for coding
to some degree yes, but they specifically noted they had one focused for that.
It does feel like a smaller model (but a very capable one nonetheless).
Nothing about this feels optimised for coding or complex task execution though - I just don't get the sense that it is built for that, although to be fair I haven't done robust tests on either of those.
can we have horizon gamma
horizon λ
Provided to YouTube by PIAS
Hazardous Environments · Valve
Half-Life 2
℗ Ipecac Recordings
Released on: 2020-03-10
Composer: Valve
Auto-generated by YouTube.
GPT-5, OpenAI's HL3
this was a great writeup, thank you
I'm hoping it's an open source writing model distilled from o3
then we'd get it dirt cheap at chutes
I'm brand new to the vibe coding universe, but what has worked well for me so far is: Continue programming everything with Horizon until you reach a dead end. Then use Gemini Pro 2.5 to fix any errors that remain.
making good use of the free models! 🙂
you bet it will have cohere license
why? haven't openai past oss releases been apache or MIT?
whisper etc
I could see them having clauses like no training models based on outputs
also I believe they already said the model will have zero novel architecture
it's just a basic llm with good data
or at the least does not use their proprietary arch
whisper is just something not commonly used and just connected to main model to give it multimodel modality
my guess is it'll be close to a meta license
not commonly used? my man it's one of the most popular transcription models around
no, as sam mocked it
wasn't aware of this
if open source and decent, which is highly unlikely from copenai, then they really redeemed themselves, which hugely doubt looking at phi series
imagine not using perplexity and using google lmao
?
i found and you didn't.
your tweet and more is already in it?