Horizon Beta | OpenRouter | Page 2

trim blade Aug 2, 2025, 12:34 AM

#

about my usual stuff

pale folio Aug 2, 2025, 12:34 AM

#

@verbal leaf is it better

spice shell Aug 2, 2025, 12:35 AM

#

Idts

#

It’s got slightly better coding style

rare terrace Aug 2, 2025, 12:35 AM

#

pale folio <@1338136168344064040> is it better

I think worse than alpha

spice shell Aug 2, 2025, 12:35 AM

#

But same intelligence level

trim blade Aug 2, 2025, 12:35 AM

#

maybe a smaller total size but same ish active params

next jolt Aug 2, 2025, 12:35 AM

#

the gpqa score is much better than non reasoning alpha

spice shell Aug 2, 2025, 12:35 AM

#

It’s probably like +/- 5% in various domains if I had to extrapolate

crystal scaffold Aug 2, 2025, 12:35 AM

#

was there anything interesting in the last 1000 messages in this thread or no

spice shell Aug 2, 2025, 12:36 AM

#

Different weighting of post train probably

trim blade Aug 2, 2025, 12:36 AM

#

hmm

crystal scaffold Aug 2, 2025, 12:36 AM

#

bulbasaur goat

trim blade Aug 2, 2025, 12:36 AM

#

that might explain the 100 different weights uploaded

#

maybe they trained a bunch of versions

#

alpha was better imo so far

spice shell Aug 2, 2025, 12:37 AM

#

So far it has also failed to implement working chess, but had some good ideas I’ve never seen any other models try when just asked for “full rules support”

50 move limit stale mate
white/black move timer

trim blade Aug 2, 2025, 12:37 AM

#

the reasoning version that existed for 30 mins yesterday sucked though

spice shell Aug 2, 2025, 12:37 AM

#

trim blade the reasoning version that existed for 30 mins yesterday sucked though

Um no

#

It was SOTA gpqa diamond

pale folio Aug 2, 2025, 12:38 AM

#

we are so mentally unstable

trim blade Aug 2, 2025, 12:38 AM

#

it completely failed at my tasks

#

BUT to be fair I had the temp at 1.0 / top P at 1.0

spice shell Aug 2, 2025, 12:38 AM

#

Idk it succeeded st all my basic ones

eager kiln Aug 2, 2025, 12:38 AM

#

has anyone given it a shot at creative writing?

trim blade Aug 2, 2025, 12:38 AM

#

later I changed those to lower

spice shell Aug 2, 2025, 12:38 AM

#

Where the previous one did not

dusky kelp Aug 2, 2025, 12:38 AM

#

So dis the new thread

crystal scaffold Aug 2, 2025, 12:39 AM

#

brick react the announcement fr

spice shell Aug 2, 2025, 12:39 AM

#

@past sphinx “rerun your benchmarks” I’d warn folks that they can get model-banned for high concurrency benchmarking…..

#

(Happened to me and at least one other)

safe imp Aug 2, 2025, 12:40 AM

#

Same MMLU-Pro score

limpid vale Aug 2, 2025, 12:40 AM

#

is kinda higher

trim blade Aug 2, 2025, 12:40 AM

#

goes both ways

#

so yea, prob just another run of the same model

gleaming lantern Aug 2, 2025, 12:41 AM

#

next jolt Aug 2, 2025, 12:41 AM

#

they shouldnt ban you instead they should tell you theres a concurrent rate limit...

mental cobalt Aug 2, 2025, 12:41 AM

#

it's sort of just
within the noise difference
doesn't feel like much of an improvement

solemn plover Aug 2, 2025, 12:41 AM

#

Wait so what’s this model all about? And why didn’t we see any teaser of some sort for this lol

limber lance Aug 2, 2025, 12:41 AM

#

Its writing output is nice and consistent at least

rocky trellis Aug 2, 2025, 12:41 AM

#

I'm not yet fully tested last model yet 😴

trim blade Aug 2, 2025, 12:41 AM

#

alpha's creative writing is amazing

#

need to test beta more

hardy fjord Aug 2, 2025, 12:42 AM

#

passed all tool calling

spice shell Aug 2, 2025, 12:42 AM

#

Someone run eq bench on this model

safe imp Aug 2, 2025, 12:42 AM

#

limpid vale is kinda higher

Might need a higher tolerance, since my subset of MMLU-Pro is pretty small (80% of questions removed)

next jolt Aug 2, 2025, 12:42 AM

#

run mmlu redux

#

its small 3k

#

correlates well with mmlu, qwen/etc uses it

mental cobalt Aug 2, 2025, 12:42 AM

#

next jolt Aug 2, 2025, 12:42 AM

#

i will run it later if it keeps working for me 🤔

timid delta Aug 2, 2025, 12:42 AM

#

almost exactly inline with horizon alpha on simpleqa so far

mental cobalt Aug 2, 2025, 12:43 AM

#

crank the reasoning budget to max pls so that we can actually get a good opinion of the model because rn it's GPT-3.5

swift blaze Aug 2, 2025, 12:43 AM

#

I thought it was free, are there limits ? It says I have used up my credits

spice shell Aug 2, 2025, 12:43 AM

#

I think this model is not worse than Alpha v1 fwiw in all my questions, just slightly different here and there. And;

Slightly better coding style
Uses code blocks / proper formatting by default

late onyx Aug 2, 2025, 12:44 AM

#

swift blaze I thought it was free, are there limits ? It says I have used up my credits

did you have web sarch on?

rocky trellis Aug 2, 2025, 12:44 AM

#

can we say it's just same but updated?

spice shell Aug 2, 2025, 12:44 AM

#

seems like that is the case yeah

swift blaze Aug 2, 2025, 12:44 AM

#

late onyx did you have web sarch on?

I tried both

spice shell Aug 2, 2025, 12:45 AM

#

I am curious if this is just a different variant of the model that they’ve had prepped, or if they’re doing some kind of 2-day turnaround post training lol

dusky kelp Aug 2, 2025, 12:45 AM

#

spice shell I think this model is not worse than Alpha v1 fwiw in all my questions, just sli...

Is the system prompt the same

cloud lily Aug 2, 2025, 12:45 AM

#

hey guys I have no context but which model it is ? is it a good model ?.

past sphinx Aug 2, 2025, 12:45 AM

#

spice shell I think this model is not worse than Alpha v1 fwiw in all my questions, just sli...

Note that we changed the default system prompt in our Chatroom to ask for markdown where appropriate. (Applies to Horizon Alpha as well now.)

spice shell Aug 2, 2025, 12:45 AM

#

past sphinx Note that we changed the default system prompt in our Chatroom to ask for markdo...

Ooh ok

safe imp Aug 2, 2025, 12:45 AM

#

@dusky kelp

opaque dirge Aug 2, 2025, 12:46 AM

#

trying to run some benchmarks

forest moth Aug 2, 2025, 12:46 AM

#

What’s wild is that it’s likely at least one of you is an OpenAI employee monitoring this chat for feedback kek

past sphinx Aug 2, 2025, 12:46 AM

#

woah

opaque dirge Aug 2, 2025, 12:46 AM

#

but it doesn't seem better than alpha

late onyx Aug 2, 2025, 12:46 AM

#

forest moth What’s wild is that it’s likely at least one of you is an OpenAI employee monito...

hopefully not one of the gooners

rough ridge Aug 2, 2025, 12:46 AM

#

forest moth What’s wild is that it’s likely at least one of you is an OpenAI employee monito...

ong

opaque dirge Aug 2, 2025, 12:46 AM

#

Okay let me run fishtank promt

opaque dirge Aug 2, 2025, 12:47 AM

#

forest moth What’s wild is that it’s likely at least one of you is an OpenAI employee monito...

Agreed

cloud lily Aug 2, 2025, 12:47 AM

#

cloud lily hey guys I have no context but which model it is ? is it a good model ?.

please guys

opaque dirge Aug 2, 2025, 12:47 AM

#

cloud lily hey guys I have no context but which model it is ? is it a good model ?.

testing

undone burrow Aug 2, 2025, 12:47 AM

#

Seems to be just about as braindead as what we are used to, so most definitely OpenAI...

dusky kelp Aug 2, 2025, 12:47 AM

#

Whose the glow stick?

late onyx Aug 2, 2025, 12:48 AM

#

undone burrow Seems to be just about as braindead as what we are used to, so most definitely O...

what prompt?

spice shell Aug 2, 2025, 12:48 AM

#

undone burrow Seems to be just about as braindead as what we are used to, so most definitely O...

Yeah same intelligence level imo

trim blade Aug 2, 2025, 12:48 AM

#

undone burrow Seems to be just about as braindead as what we are used to, so most definitely O...

test it so far and alpha / horizon are amazing so far when it comes to creative writing

#

like maybe best in class

opaque dirge Aug 2, 2025, 12:48 AM

#

spice shell Yeah same intelligence level imo

minimal changes as I can see

cloud lily Aug 2, 2025, 12:48 AM

#

opaque dirge testing

form which company ? openai ?

safe imp Aug 2, 2025, 12:48 AM

#

cloud lily please guys

Good at code (albeit ugly and annoying to read), good at SVGs. 8b level bad at everything else.

trim blade Aug 2, 2025, 12:48 AM

#

maybe its the writing model they talked about before

opaque dirge Aug 2, 2025, 12:48 AM

#

yeah

trim blade Aug 2, 2025, 12:48 AM

#

Im hoping its the OS model

opaque dirge Aug 2, 2025, 12:49 AM

#

Has same reaction on exported TG convo

late onyx Aug 2, 2025, 12:49 AM

#

trim blade Im hoping its the OS model

I'm hoping not; I don't want the Open Source model to be this bad

keen cedar Aug 2, 2025, 12:49 AM

#

i think its actually the exact same model to see what the effect of calling it "beta" instead of alpha is

late onyx Aug 2, 2025, 12:49 AM

#

keen cedar i think its actually the exact same model to see what the effect of calling it "...

It seems to act differently though

rare terrace Aug 2, 2025, 12:49 AM

#

late onyx I'm hoping not; I don't want the Open Source model to be this bad

With reasoning yesterday it was godlike

spice shell Aug 2, 2025, 12:49 AM

#

It’s giving notably different responses and style

opaque dirge Aug 2, 2025, 12:49 AM

#

Failed fishtank test miserably

spice shell Aug 2, 2025, 12:49 AM

#

It’s not identical

trim blade Aug 2, 2025, 12:49 AM

#

late onyx I'm hoping not; I don't want the Open Source model to be this bad

we are not having the same experience at all then, I used alpha for hours now and I think its amazing

undone burrow Aug 2, 2025, 12:49 AM

#

Oop I am way too tired, sorry. I tested it in the OR Chat, didnt even think about the results being better with an actual preset...

cloud lily Aug 2, 2025, 12:49 AM

#

safe imp Good at code (albeit ugly and annoying to read), good at SVGs. 8b level bad at e...

oh okay thank you ☺️

opaque dirge Aug 2, 2025, 12:49 AM

#

opaque dirge Failed fishtank test miserably

previous WAS working

verbal leaf Aug 2, 2025, 12:49 AM

#

forest moth What’s wild is that it’s likely at least one of you is an OpenAI employee monito...

FEEDBACK: GIVE US THE VERSION WITH MAX THINKING BUDGET. THIS IS ASS.

opaque dirge Aug 2, 2025, 12:49 AM

#

this version is not

late onyx Aug 2, 2025, 12:49 AM

#

trim blade we are not having the same experience at all then, I used alpha for hours now an...

It was good

#

its dumb now

keen cedar Aug 2, 2025, 12:50 AM

#

spice shell It’s giving notably different responses and style

is it? didnt notice a difference but honestly i ran it through like two test prompts bc its too late (in EST) for ts

safe imp Aug 2, 2025, 12:50 AM

#

spice shell It’s giving notably different responses and style

Try Alpha right now, it also got the markdown formatting instructions

spice shell Aug 2, 2025, 12:50 AM

#

safe imp Try Alpha right now, it also got the markdown formatting instructions

Good point

#

Though what I meant was coding style

safe imp Aug 2, 2025, 12:50 AM

#

I see

spice shell Aug 2, 2025, 12:50 AM

#

And answers to knowledge questions

opaque dirge Aug 2, 2025, 12:50 AM

#

spice shell Though what I meant was coding style

seems similar enough for me

safe imp Aug 2, 2025, 12:50 AM

#

Maybe that's the juice, lol

opaque dirge Aug 2, 2025, 12:50 AM

#

spice shell And answers to knowledge questions

????

spice shell Aug 2, 2025, 12:51 AM

#

opaque dirge ????

It’s hallucinating in different ways

late onyx Aug 2, 2025, 12:51 AM

#

i don't even konw what beta was trying to do

spice shell Aug 2, 2025, 12:51 AM

#

And a bit more severely on niche ones

opaque dirge Aug 2, 2025, 12:51 AM

#

spice shell It’s hallucinating in different ways

For me it's just worse

graceful kelp Aug 2, 2025, 12:52 AM

#

spice shell It’s hallucinating in different ways

thinking maybe they raised the temperature

night urchin Aug 2, 2025, 12:52 AM

#

hey! i'm back

keen cedar Aug 2, 2025, 12:52 AM

#

trim blade Im hoping its the OS model

solid chance its not btw, unless the repos that oai accidentally published were red herrings(???) the oss model's context is 128k while this is 256k
unless theyre doing some funny business with keeping a higher context version behind their api like alibaba but that doesnt really make sense

night urchin Aug 2, 2025, 12:52 AM

#

so what's the consensus

#

or is it too soon

#

couldnt test it yet

limber lance Aug 2, 2025, 12:52 AM

#

again, I have no idea how it can flunk basic reasoning tasks and yet give this sort of output

opaque dirge Aug 2, 2025, 12:52 AM

#

I've fucking got up from the bed just to test it

spice shell Aug 2, 2025, 12:52 AM

#

opaque dirge For me it's just worse

I do think it’s a little worse at niche knowledge yeah

opaque dirge Aug 2, 2025, 12:53 AM

#

it's almost 4am

trim blade Aug 2, 2025, 12:53 AM

#

yea, lower the temp / top P

#

1.0 was way to high for alpha I found

spice shell Aug 2, 2025, 12:53 AM

#

Someone needs to make NicheBench where it quizzes exhaustive character lists from tv shows and shit

#

lol

opaque dirge Aug 2, 2025, 12:53 AM

#

trim blade yea, lower the temp / top P

0.7 should be better no?

trim blade Aug 2, 2025, 12:53 AM

#

that or even a bit lower

next jolt Aug 2, 2025, 12:53 AM

#

ran some benchmarks, here's a comparison:

horizon alpha (juice 0)
gpqa diamond: 47.98%
math 500: 84.60%

horizon beta (juice 5)
gpqa diamond: 63.13% (+15.15%)
math 500: 89% (+4.4%)

late onyx Aug 2, 2025, 12:53 AM

#

what the heck is 'juice'?

blissful valley Aug 2, 2025, 12:53 AM

#

it seems a bit more sensible in portuguese

trim blade Aug 2, 2025, 12:53 AM

#

seems like a lot of models lately need low temps or they go crazy

undone burrow Aug 2, 2025, 12:53 AM

#

It is pretty censored tho, sadly, was hoping for some good rp stuff

spice shell Aug 2, 2025, 12:53 AM

#

late onyx what the heck is 'juice'?

OpenAI bros injecting slang into the models

night urchin Aug 2, 2025, 12:54 AM

#

safe imp Same MMLU-Pro score

they are mathmaxxing it again

next jolt Aug 2, 2025, 12:54 AM

#

late onyx what the heck is 'juice'?

apparently the reasoning budget

#

that's what i heard

spice shell Aug 2, 2025, 12:54 AM

#

Calling it juice is really funny

undone burrow Aug 2, 2025, 12:54 AM

#

That is, marinara and a pretty hardcore card (for testing of censoring level only ofc ofc)

next jolt Aug 2, 2025, 12:54 AM

#

that's what they call it 😭

night urchin Aug 2, 2025, 12:54 AM

#

i wish they could stop making huge generalist models and make several huge segmented models

spice shell Aug 2, 2025, 12:54 AM

#

night urchin i wish they could stop making huge generalist models and make several huge segme...

Codex models are that

limber lance Aug 2, 2025, 12:54 AM

#

limber lance again, I have no idea how it can flunk basic reasoning tasks and yet give this s...

my prompts btw were just You will now take on the role of a 12th century Sogdian merchant (who can somehow speak English) and I hail from the land of the Bulgars. I have 200 talents of silver and I need many bolts of silk, to sell in Constantinople on my journey back.
it decided to take all of those aspects into consideration by itself

night urchin Aug 2, 2025, 12:54 AM

#

yeah but we only have that

spice shell Aug 2, 2025, 12:55 AM

#

Yeah

night urchin Aug 2, 2025, 12:55 AM

#

and some health models

trim blade Aug 2, 2025, 12:55 AM

#

these models are all about generalization

opaque dirge Aug 2, 2025, 12:55 AM

#

spice shell Codex models are that

general models are the future

trim blade Aug 2, 2025, 12:55 AM

#

the more they know the better they perform

spice shell Aug 2, 2025, 12:55 AM

#

Need hierarchical MoEs

late onyx Aug 2, 2025, 12:55 AM

#

which do you think is best?

undone burrow Aug 2, 2025, 12:56 AM

#

Oop nvm prefill does good shit

late onyx Aug 2, 2025, 12:56 AM

#

horizon alpha on left horizon beta on middld deepseek chat on right

spice shell Aug 2, 2025, 12:56 AM

#

Route to one of 8 subject matter expert half-generalist models or something

late onyx Aug 2, 2025, 12:56 AM

#

spice shell Route to one of 8 subject matter expert half-generalist models or something

that's just MoE isn't it?

spice shell Aug 2, 2025, 12:56 AM

#

late onyx that's just MoE isn't it?

MoE is token level and generally can route to any mix of the experts in the pool

opaque dirge Aug 2, 2025, 12:57 AM

#

It's A LOT better with temp .7 and topP .95

trim blade Aug 2, 2025, 12:57 AM

#

yea, every token is basiclly routed through the layers most suited for the task

late onyx Aug 2, 2025, 12:57 AM

#

spice shell MoE is token level and generally can route to any mix of the experts in the pool

so then isn't what your describing just worse MoE?

trim blade Aug 2, 2025, 12:57 AM

#

its not really "experts" the way that sounds

opaque dirge Aug 2, 2025, 12:57 AM

#

Beta

trim blade Aug 2, 2025, 12:57 AM

#

like topic wise

opaque dirge Aug 2, 2025, 12:57 AM

#

Alpha

late onyx Aug 2, 2025, 12:57 AM

#

trim blade its not really "experts" the way that sounds

yeah MoE is more just forced sparsity

safe imp Aug 2, 2025, 12:57 AM

#

late onyx which do you think is best?

2 and 3 have a lot of visible slop

#

3 tried to add a roleplay element but then followed up with professional-sounding slop

spice shell Aug 2, 2025, 12:57 AM

#

late onyx so then isn't what your describing just worse MoE?

It would be interesting to see something where there’s segmentation within the experts, such that coding subjects always goes to effectively a standalone coding segment

late onyx Aug 2, 2025, 12:57 AM

#

safe imp 2 and 3 have a lot of visible slop

what does 'slop' mean in the context of this?

late onyx Aug 2, 2025, 12:58 AM

#

safe imp 3 tried to add a roleplay element but then followed up with professional-soundin...

that one is deepseek v3

night urchin Aug 2, 2025, 12:58 AM

#

ethically sourced blood

spice shell Aug 2, 2025, 12:58 AM

#

8.5T models where it’s really 8x 1T specialized models + 500B shared experts or something 😂

night urchin Aug 2, 2025, 12:58 AM

#

the dream

trim blade Aug 2, 2025, 12:58 AM

#

would perform worse than just one 8.5T with so many active params

#

The bigger it is over all the more "accurate" it is to what it was trained on

spice shell Aug 2, 2025, 12:59 AM

#

trim blade would perform worse than just one 8.5T with so many active params

Idk, why does Qwen 3 coder exist then?

trim blade Aug 2, 2025, 12:59 AM

#

like a image that gets more lossy as you compress it

opaque dirge Aug 2, 2025, 12:59 AM

#

opaque dirge Alpha

just so you know, only code from alpha works as intended

trim blade Aug 2, 2025, 12:59 AM

#

spice shell Idk, why does Qwen 3 coder exist then?

it performs ok but I never found it anything that special

#

and its far from a specialized model

#

test its knowledge

#

it nows a ton about most things

spice shell Aug 2, 2025, 1:00 AM

#

Yeah but it’s definitely coding biased

trim blade Aug 2, 2025, 1:00 AM

#

post training to make its distribution more "sharp" for coding does not mean you dont train it on everything else as well beforehand

late onyx Aug 2, 2025, 1:01 AM

#

it said it was trained on like 8T tokens with majority coding

spice shell Aug 2, 2025, 1:01 AM

#

What if GPT 5 is in fact this, one model and the model picker is just forcing the router to one of of the segments

#

😂

late onyx Aug 2, 2025, 1:01 AM

#

but that still means it was trained on at least 1 trillion non-coding tokens

spice shell Aug 2, 2025, 1:01 AM

#

trim blade post training to make its distribution more "sharp" for coding does not mean you...

For sure!

#

I mean specialized as in +20% drift perhaps in each specialization

#

Not 100%

opaque dirge Aug 2, 2025, 1:01 AM

#

but still

trim blade Aug 2, 2025, 1:02 AM

#

either way qwen coder still performs substantially worse than the big generalist models do at it

spice shell Aug 2, 2025, 1:02 AM

#

Anyways, just spouting ideas

opaque dirge Aug 2, 2025, 1:02 AM

#

one huge model will be better in the long run

spice shell Aug 2, 2025, 1:02 AM

#

yeah

opaque dirge Aug 2, 2025, 1:02 AM

#

we just do not have enough compute

late onyx Aug 2, 2025, 1:02 AM

#

trim blade either way qwen coder still performs substantially worse than the big generalist...

is it better than kimi k2?

trim blade Aug 2, 2025, 1:02 AM

#

last I saw kimi edged it out there

spice shell Aug 2, 2025, 1:02 AM

#

It is interesting to hear the rumor that o3 got way worse at ARC AGI once it got trained for chatting

safe imp Aug 2, 2025, 1:02 AM

#

late onyx what does 'slop' mean in the context of this?

2 - "We can approach...", "First, lets get curious", dumping questions with bullet points (unnatural in a convo scenario)
3 - "Let's explore", "any specific", "calm demeanor", unnatural enumeration of quesitons, "some find" (weasel wording)

opaque dirge Aug 2, 2025, 1:02 AM

#

spice shell It is interesting to hear the rumor that o3 got way worse at ARC AGI once it got...

maybe

#

i still remember presentation

spice shell Aug 2, 2025, 1:03 AM

#

There’s definitely something to be said about the fact that subject strengths keep clashing with eachother

opaque dirge Aug 2, 2025, 1:03 AM

#

where it was like 50% or arc-agi-1

spice shell Aug 2, 2025, 1:03 AM

#

Yeah

#

That was a crazy moment

hardy fjord Aug 2, 2025, 1:03 AM

#

late onyx Aug 2, 2025, 1:03 AM

#

safe imp 2 - "We can approach...", "First, lets get curious", dumping questions with bull...

I feel the slop was way worse in 2 than 3

opaque dirge Aug 2, 2025, 1:03 AM

#

hardy fjord

just give the summary

opaque dirge Aug 2, 2025, 1:04 AM

#

safe imp 2 - "We can approach...", "First, lets get curious", dumping questions with bull...

I can't see that in non native languages

hardy fjord Aug 2, 2025, 1:04 AM

#

opaque dirge just give the summary

I'll post the repo so you can see it's code.

dark bramble Aug 2, 2025, 1:04 AM

#

forest moth What’s wild is that it’s likely at least one of you is an OpenAI employee monito...

they're thinking they should have taken that meta offer 😭

trim blade Aug 2, 2025, 1:04 AM

#

meta kind of looks like its sinking atm

#

maybe they can turn it around, I hope so

spice shell Aug 2, 2025, 1:05 AM

#

That’s what everyone said about GDM before 2.5 pro

timid delta Aug 2, 2025, 1:05 AM

#

Final score for Horizon Beta on SimpleQA was 33.7%. oai models scores for reference..... and Horizon-Alpha got 33.9% yesterday https://github.com/openai/simple-evals https://openai.com/index/introducing-simpleqa/

trim blade Aug 2, 2025, 1:05 AM

#

but llama 4 did not bring big confidence

spice shell Aug 2, 2025, 1:05 AM

#

Someone post that image where it’s a circle of “the smartest model in the world”

#

lol

late onyx Aug 2, 2025, 1:05 AM

#

how 'slop'py are these?

spice shell Aug 2, 2025, 1:05 AM

#

late onyx how 'slop'py are these?

Not at all

opaque dirge Aug 2, 2025, 1:05 AM

#

timid delta Final score for Horizon Beta on SimpleQA was 33.7%. oai models scores for refere...

another indicator for small model

spice shell Aug 2, 2025, 1:05 AM

#

I read the middle one and it seems quite un-slopped

late onyx Aug 2, 2025, 1:06 AM

#

all i changed was I added This is in a conversational setting.

dark bramble Aug 2, 2025, 1:06 AM

#

trim blade but llama 4 did not bring big confidence

they have smart people and lots of compute. the latter was true during llama 4 shitshow and it's truer now with their new hires. the right people could make meta into a top tier lab... im not sure alexandr wang is the right guy tho.

hardy fjord Aug 2, 2025, 1:06 AM

#

late onyx how 'slop'py are these?

deffo not set up for roleplay

late onyx Aug 2, 2025, 1:06 AM

#

hardy fjord deffo not set up for roleplay

all 3?

hardy fjord Aug 2, 2025, 1:06 AM

#

feels like these are coding models.

trim blade Aug 2, 2025, 1:06 AM

#

opaque dirge another indicator for small model

hoping alpha is OS and not gpt 5 mini / nano

dark bramble Aug 2, 2025, 1:06 AM

#

dark bramble they have smart people and lots of compute. the latter was true during llama 4 s...

making the o1/o3 guy chief scientist is a good sign tho

hardy fjord Aug 2, 2025, 1:06 AM

#

but their weird problem with markdown is fucking weird

opaque dirge Aug 2, 2025, 1:06 AM

#

trim blade hoping alpha is OS and not gpt 5 mini / nano

nano, for mini it's shit

night urchin Aug 2, 2025, 1:06 AM

#

late onyx how 'slop'py are these?

i'm getting really tired of this pattern of speech and the lists

#

out of the box

opaque dirge Aug 2, 2025, 1:07 AM

#

it MUST be better than current gen mini models

dark bramble Aug 2, 2025, 1:07 AM

#

maybe one of these is 120b MoE and the other the 20b (dense?)

late onyx Aug 2, 2025, 1:07 AM

#

night urchin i'm getting really tired of this pattern of speech and the lists

both horizon models still insisted in list formatting even when i told it it was conversational

#

i dont' get it

opaque dirge Aug 2, 2025, 1:07 AM

#

trim blade hoping alpha is OS and not gpt 5 mini / nano

for OSS still bad

#

idk

#

should be 20B at max

spice shell Aug 2, 2025, 1:07 AM

#

timid delta Final score for Horizon Beta on SimpleQA was 33.7%. oai models scores for refere...

Better than o4-mini but not better than 4.1 is interesting.

Lines up with rumours around OSS model

opaque dirge Aug 2, 2025, 1:07 AM

#

as we've seen from the leaks

trim blade Aug 2, 2025, 1:07 AM

#

opaque dirge for OSS still bad

for creative writing / graphic / web design alpha is amazing at least

dark bramble Aug 2, 2025, 1:08 AM

#

opaque dirge for OSS still bad

depends on size, i don't think openai will match kimi k2 at 120b

night urchin Aug 2, 2025, 1:08 AM

#

i remember when i talked to Gemini 03-25 and felt like i was talking to a human on the black box experiment

trim blade Aug 2, 2025, 1:08 AM

#

at least the alpha that was around for the hours I used it

night urchin Aug 2, 2025, 1:08 AM

#

it was really surreal

opaque dirge Aug 2, 2025, 1:08 AM

#

if it's oss hope for 20B for Horizon

spice shell Aug 2, 2025, 1:08 AM

#

0325 still has my heart

#

lol

night urchin Aug 2, 2025, 1:08 AM

#

they are doing something wrong for sure

dark bramble Aug 2, 2025, 1:08 AM

#

spice shell 0325 still has my heart

they really cooked with that

spice shell Aug 2, 2025, 1:08 AM

#

I hope GDM brings back the magic for Gemini 3

night urchin Aug 2, 2025, 1:08 AM

#

they as in everybody except moonshotAI apparently

dark bramble Aug 2, 2025, 1:08 AM

#

though gpt-4 is still my beloved

late onyx Aug 2, 2025, 1:08 AM

#

spice shell 0325 still has my heart

what model is that?

opaque dirge Aug 2, 2025, 1:08 AM

#

isn't current better?

spice shell Aug 2, 2025, 1:09 AM

#

A bit like Claude 3.7 -> 4 really smoothed out the model

night urchin Aug 2, 2025, 1:09 AM

#

yeah

spice shell Aug 2, 2025, 1:09 AM

#

late onyx what model is that?

Gemini 2.5 v1

dark bramble Aug 2, 2025, 1:09 AM

#

opaque dirge isn't current better?

vibes r off

late onyx Aug 2, 2025, 1:09 AM

#

ohh

night urchin Aug 2, 2025, 1:09 AM

#

i wish i had exported the conversation i had with gemini

#

truly made me question myself

spice shell Aug 2, 2025, 1:09 AM

#

I’ve completely stopped using Gemini despite it supposedly being better in all respects than 0325

#

It lost the magic somehow

opaque dirge Aug 2, 2025, 1:10 AM

#

I'm too self conscious to question myself when talking with AI

dark bramble Aug 2, 2025, 1:10 AM

#

night urchin truly made me question myself

what if we're just biological next token generators?:)

late onyx Aug 2, 2025, 1:10 AM

#

night urchin Aug 2, 2025, 1:10 AM

#

it's something no benchmark will address

hardy fjord Aug 2, 2025, 1:10 AM

#

opaque dirge just give the summary

https://github.com/XSUS-AI/clickup-mcp It didn't add the readme... right before it did, provider started generic erroring out.

GitHub

GitHub - XSUS-AI/clickup-mcp

Contribute to XSUS-AI/clickup-mcp development by creating an account on GitHub.

opaque dirge Aug 2, 2025, 1:10 AM

#

spice shell It lost the magic somehow

it was the best at the fishtank test at it's time

opaque dirge Aug 2, 2025, 1:10 AM

#

hardy fjord https://github.com/XSUS-AI/clickup-mcp It didn't add the readme... right before ...

does it work?

hardy fjord Aug 2, 2025, 1:11 AM

#

bout to test it, just going over code

#

looks like it nailed the MCP server

night urchin Aug 2, 2025, 1:11 AM

#

late onyx

i'm thinking it's the open source model fr, but i saw some people saying it couldn't be that so soon

hardy fjord Aug 2, 2025, 1:11 AM

#

also don' thave a clickup API key atm... and quickly fading.

dark bramble Aug 2, 2025, 1:11 AM

#

spice shell It lost the magic somehow

benchmarks don't get vibes. aidanbench was kinda close but idk

hardy fjord Aug 2, 2025, 1:11 AM

#

just wanted to see what it's code looked like compared to alpha

opaque dirge Aug 2, 2025, 1:11 AM

#

hardy fjord looks like it nailed the MCP server

kotlin mcp sdk is just broken

hardy fjord Aug 2, 2025, 1:11 AM

#

sucks for android

opaque dirge Aug 2, 2025, 1:12 AM

#

hardy fjord sucks for android

Idk about the client

hardy fjord Aug 2, 2025, 1:12 AM

#

gonna get some sleep.

wary tangle Aug 2, 2025, 1:12 AM

#

my o3 and gemini 2.5 pro says alpha is better at summarizing a paper…

hardy fjord Aug 2, 2025, 1:12 AM

#

opaque dirge Idk about the client

looks right according to their docs

opaque dirge Aug 2, 2025, 1:12 AM

#

hardy fjord looks right according to their docs

OAI doesn't work without JSON RPC

spice shell Aug 2, 2025, 1:13 AM

#

Hey openai if you’re watching make it stop outputting code golf it’s really annoying

opaque dirge Aug 2, 2025, 1:13 AM

#

it doent accept sse

#

I'm going insane with this models testing, will probably write my own testing suite

hardy fjord Aug 2, 2025, 1:13 AM

#

nah I always have it code Stdio and then just make that an API wrapper

opaque dirge Aug 2, 2025, 1:13 AM

#

hardy fjord nah I always have it code Stdio and then just make that an API wrapper

ah

hardy fjord Aug 2, 2025, 1:13 AM

#

I came in the MCP game early and never could fully rely on MCP SSE

opaque dirge Aug 2, 2025, 1:13 AM

#

hardy fjord I came in the MCP game early and never could fully rely on MCP SSE

why's that?

hardy fjord Aug 2, 2025, 1:14 AM

#

I find it nice to just roll with Stdio and just never use anyone else's MCP servers for security

#

just as it's been progressing I've gotten stuck in my ways I guess, and SSE wasn't supported at first, then it wasn't secure, and people still talk about it as a security threat

#

but also

#

stdio, you can make a mixed MCP server that does stuff locally and remotely

verbal leaf Aug 2, 2025, 1:14 AM

#

this bench tests a model's ability to link seemingly unrelated niche things

#

very interesting result

opaque dirge Aug 2, 2025, 1:14 AM

#

hardy fjord stdio, you can make a mixed MCP server that does stuff locally and remotely

true, but I don't need one

hardy fjord Aug 2, 2025, 1:15 AM

#

like I get how easy SSE is... but I mean, I'm making agents make agents and their tools from scratch to spec here

opaque dirge Aug 2, 2025, 1:15 AM

#

support bot doesn't need that

next jolt Aug 2, 2025, 1:15 AM

#

~~oh this is a hybrid thinking model, i think i just triggered thinking mode 😭~~

hardy fjord Aug 2, 2025, 1:15 AM

#

so I don't care really... I only ever use MCP servers my agent coded

opaque dirge Aug 2, 2025, 1:15 AM

#

hardy fjord so I don't care really... I only ever use MCP servers my agent coded

hah

late onyx Aug 2, 2025, 1:15 AM

#

both models are confidently incorrect

opaque dirge Aug 2, 2025, 1:15 AM

#

not even context7 and or exasearch?

trim blade Aug 2, 2025, 1:15 AM

#

next jolt ~~oh this is a hybrid thinking model, i think i just triggered thinking mode 😭~...

uh oh, from what I know OI does not want to release their thinking processes

#

so maybe its not the open model

#

that would suck

hardy fjord Aug 2, 2025, 1:16 AM

#

opaque dirge not even context7 and or exasearch?

if I find a good API, I have it make me an MCP

#

I'm using Serper, and for scraping just had it make me a tool that uses bs4

opaque dirge Aug 2, 2025, 1:16 AM

#

hardy fjord if I find a good API, I have it make me an MCP

ima get some sleep, it's 4:16 am already

hardy fjord Aug 2, 2025, 1:16 AM

#

2:16 here

hardy fjord Aug 2, 2025, 1:16 AM

#

opaque dirge ima get some sleep, it's 4:16 am already

but same, nice to meet you

opaque dirge Aug 2, 2025, 1:16 AM

#

hardy fjord 2:16 here

france/spain?

hardy fjord Aug 2, 2025, 1:17 AM

#

Africa

opaque dirge Aug 2, 2025, 1:17 AM

#

hardy fjord Africa

oh

night urchin Aug 2, 2025, 1:18 AM

#

cool

#

does anyone know if it has updated knowledge on popular libraries?

pale folio Aug 2, 2025, 1:18 AM

#

Whats the consensus

spice shell Aug 2, 2025, 1:24 AM

#

pale folio Whats the consensus

Meh

spice shell Aug 2, 2025, 1:24 AM

#

next jolt ~~oh this is a hybrid thinking model, i think i just triggered thinking mode 😭~...

How?

spice shell Aug 2, 2025, 1:24 AM

#

night urchin does anyone know if it has updated knowledge on popular libraries?

It does!

#

Tailwind v4 and react router v7 are my go to tests and it knows of both

past sphinx Aug 2, 2025, 1:36 AM

#

Some users who had no credits weren't able to access Horizon earlier due to a faulty fraud check we had - just fixed this, so try again!

dapper pecan Aug 2, 2025, 1:36 AM

#

im getting a 429

pale folio Aug 2, 2025, 1:36 AM

#

Same

late onyx Aug 2, 2025, 1:36 AM

#

What model?

pale folio Aug 2, 2025, 1:36 AM

#

Error 429

late onyx Aug 2, 2025, 1:36 AM

#

pale folio Error 429

You've hit rate limit

#

429 = too many requests

pale folio Aug 2, 2025, 1:37 AM

#

late onyx You've hit rate limit

💀 💀 💀

late onyx Aug 2, 2025, 1:37 AM

#

unless there is a bug with rate limiting

next jolt Aug 2, 2025, 1:37 AM

#

i also hit it at the same time as them i think

past sphinx Aug 2, 2025, 1:37 AM

#

Investigating

late onyx Aug 2, 2025, 1:37 AM

#

mine is still working

#

i haven't used it that much though

#

is there a way to check how many free credits we have left

next jolt Aug 2, 2025, 1:38 AM

#

works now

dapper pecan Aug 2, 2025, 1:39 AM

#

Screenshot_2025-08-02_at_11.39.28_AM.png

past sphinx Aug 2, 2025, 1:39 AM

#

Looks like Horizon Beta is getting really hammered - working on scaling / limiting the heaviest users

safe imp Aug 2, 2025, 1:40 AM

#

#

This model

verbal leaf Aug 2, 2025, 1:41 AM

#

i'll play with it again when it REASONS

late onyx Aug 2, 2025, 1:42 AM

#

past sphinx Looks like Horizon Beta is getting really hammered - working on scaling / limiti...

do you like have a list of all the users or something?

next jolt Aug 2, 2025, 1:42 AM

#

Can i add my openai api key to increase my limits? 🤣

late onyx Aug 2, 2025, 1:42 AM

#

probably everyone running benchmarks on it

spice shell Aug 2, 2025, 1:43 AM

#

past sphinx Some users who had no credits weren't able to access Horizon earlier due to a fa...

Doesn’t seem fixed for me at least 🥲

past sphinx Aug 2, 2025, 1:43 AM

#

Should be better now!

past sphinx Aug 2, 2025, 1:43 AM

#

spice shell Doesn’t seem fixed for me at least 🥲

dm'd you

lunar dew Aug 2, 2025, 1:54 AM

#

This model hella fast

late onyx Aug 2, 2025, 1:54 AM

#

could be MoE

verbal leaf Aug 2, 2025, 1:59 AM

#

https://fixupx.com/SebastienBubeck/status/1951457213920452763 and so begins the hypeposting

Sebastien Bubeck (@SebastienBubeck)

Pretty good. Chat, do you think we can do better?

Quoting Ethan Mollick (@emollick)
︀
Here is the Deep Think "Sparks unicorn"
︀︀
︀︀(This is created using TikZ, which is a language built for scientific diagrams & very much not for drawing. The original "Sparks of AGI" paper used the ability of the AI to draw a primitive unicorn as an example of unexpected AI abilities)

**💬 9 ❤️ 36 👁️ 3.8K **

undone burrow Aug 2, 2025, 2:05 AM

#

It ozoned chat😔

night urchin Aug 2, 2025, 2:14 AM

#

verbal leaf https://fixupx.com/SebastienBubeck/status/1951457213920452763 and so begins the ...

drew better than me

rare terrace Aug 2, 2025, 2:31 AM

#

verbal leaf i'll play with it again when it REASONS

Having a feeling it was an accidental (or intentional) gpt 5 leak

past sphinx Aug 2, 2025, 2:33 AM

#

what do people think of the general prose feel compared to alpha?

verbal leaf Aug 2, 2025, 2:35 AM

#

it refuses to write my fanfiction now 💔

bitter vigil Aug 2, 2025, 2:35 AM

#

verbal leaf it refuses to write my fanfiction now 💔

sfw?

verbal leaf Aug 2, 2025, 2:36 AM

#

yes

#

😭

fathom atlas Aug 2, 2025, 2:37 AM

#

past sphinx what do people think of the general prose feel compared to alpha?

Feels a lot better, understands more, doing less dumb mistakes

#

One shotted todo app with backend + db and also for apple notes clone

past sphinx Aug 2, 2025, 2:38 AM

#

fathom atlas Feels a lot better, understands more, doing less dumb mistakes

Awesome!

past sphinx Aug 2, 2025, 2:39 AM

#

verbal leaf it refuses to write my fanfiction now 💔

:/ any other details that you can share?

#

anyone have other regressions caused by the beta?

jade kayak Aug 2, 2025, 2:40 AM

#

past sphinx what do people think of the general prose feel compared to alpha?

Decent at UI, but not anything complex like 3D games. limited world knowledge. Pretty fast

ancient hedge Aug 2, 2025, 2:41 AM

#

Oh. My. God.: https://x.com/Jordy_vD_/status/1951473109183439138

Jordy.app (@Jordy_vD_)

Wtf -- it added easter eggs??

Also, this website looks dope asf

#

Given a simple test prompt, it created design files via Roo Code, wireframes, and it even added easter eggs to the website 😂

fathom atlas Aug 2, 2025, 2:41 AM

#

One shot authentication, and this time it did it without having to tell it to work with backend, just did it.

#

ancient hedge Aug 2, 2025, 2:47 AM

#

Idk what provider this model is from, but if this thing works well with OpenCode when it releases, or Codex if it's OAI, this could be amazing.

upbeat jewel Aug 2, 2025, 2:48 AM

#

403 - Blocked by Stealth.... Ouch

ancient hedge Aug 2, 2025, 2:49 AM

#

I'm not one to fear for my job, but I have no idea how to build some of the effects it added 😅

#

Also someone from OAI just liked my tweet so mayyyybe

gleaming lantern Aug 2, 2025, 2:54 AM

#

Just want reasoning back 😔

verbal leaf Aug 2, 2025, 2:54 AM

#

real

gleaming lantern Aug 2, 2025, 2:54 AM

#

The text predictors want to predict

verbal leaf Aug 2, 2025, 2:54 AM

#

alex pass that feedback along

#

wink wink

mental cobalt Aug 2, 2025, 2:55 AM

#

real

silent heath Aug 2, 2025, 2:56 AM

#

ancient hedge Oh. My. God.: https://x.com/Jordy_vD_/status/1951473109183439138

this is problematic actually (unless told to go above and beyond)

ancient hedge Aug 2, 2025, 3:04 AM

#

Might have been Roo Code telling it to

And who doesn’t want above and beyond 😜

gritty glade Aug 2, 2025, 3:04 AM

#

past sphinx what do people think of the general prose feel compared to alpha?

It seems good imo MrshaBlob a little repetitive at times though

cold knoll Aug 2, 2025, 3:14 AM

#

past sphinx anyone have other regressions caused by the beta?

Its UI Design has gotten a fair amount worse, But the backend development feels much better and it is following our CLI's prompts much better.

the Alpha model continuously asked "Please repond with "x" to continue"

"I dont currently have permission to use tools, please say "i grant you x y z" to allow me to use tools" etc, Which seems to have stopped and the model is now adhereing similarly to claude opus/sonnet

fathom atlas Aug 2, 2025, 3:26 AM

#

Huge thanks to @past sphinx & the OpenRouter team for hosting these stealth preview models.

Honestly feels better to use than Sonnet 4, super solid. Can't wait to use the final model.

Dropped what I built in app showcase: #app-showcase message

leaden sinew Aug 2, 2025, 3:39 AM

#

So, anyone knows who's model was Cypher Alpha?

#

now Horizon Beta?

#

because you are feeding someone, whoever is using it

leaden sinew Aug 2, 2025, 3:41 AM

#

fathom atlas

ooh

#

good old RSA

green parcel Aug 2, 2025, 3:53 AM

#

How's Beta compared to Alpha thus far 👀

fathom atlas Aug 2, 2025, 3:55 AM

#

lol

fathom atlas Aug 2, 2025, 3:56 AM

#

green parcel How's Beta compared to Alpha thus far 👀

Really good, this model slaps.

grave shore Aug 2, 2025, 4:04 AM

#

I certainly hope this isn't the creative/writing model OAI was supposedly cooking up.
Because I find it pretty lacking from the short time I used it.
Can't speak on Alpha vs Beta, found them pretty similar.

trim blade Aug 2, 2025, 4:05 AM

#

I found the exact opposite. It's the best creative writing model I've ever used

#

It dethrones opus

#

It's amazing at building scenes and writing out characters in intelligent ways

grave shore Aug 2, 2025, 4:06 AM

#

That's funny, but really not a sentiment I can share lol

trim blade Aug 2, 2025, 4:07 AM

#

Reminds me of gpt4.5 for the short time that existed

grave shore Aug 2, 2025, 4:07 AM

#

I'd like to specify that I'm using it for RP'ing
I'm not sure how well it handles longform writing

trim blade Aug 2, 2025, 4:08 AM

#

Maybe that is the difference? I do long form writing

#

Feeding it info on a setting/ characters

heady gust Aug 2, 2025, 4:09 AM

#

I've gotta give it some time to decide, not blown away or anything but it's certainly good

trim blade Aug 2, 2025, 4:09 AM

#

and its far more nuanced / less on the nose than anything else so far

#

it understands on a deeper level if that makes sense

heady gust Aug 2, 2025, 4:09 AM

#

I want to see how the reasoning variant does on some of my challenge questions

trim blade Aug 2, 2025, 4:09 AM

#

social intelligence is higher than anything else I have seen

heady gust Aug 2, 2025, 4:09 AM

#

Nonreasoning isn't able to get them but that's pretty much to be expected

trim blade Aug 2, 2025, 4:10 AM

#

Try maybe grabbing a scene from your favorite book and tossing it in

#

tell it to continue it

grave shore Aug 2, 2025, 4:13 AM

#

trim blade social intelligence is higher than anything else I have seen

I seriously don't see it when using this model.
It's what I've always seen from the big models, Opus, 4.5, sometimes 2.5 pro.

I'll give it another try with some different scenarios.

cold knoll Aug 2, 2025, 4:14 AM

#

@past sphinx the model actually still fails to understand it can call native tools. It works up until around 60k tokens then it refuses to use tools that it has been using all chats

trim blade Aug 2, 2025, 4:14 AM

#

sounds like the usual context drop off issues

#

most models start getting progressively dumber at longer contexts

#

past 32K for most models really

cold knoll Aug 2, 2025, 4:15 AM

#

Yeah but 256k context, only able to use 60k context without forcing the model into an infinite loop...

#

No issues on google models, anthropic openai etc, even local

trim blade Aug 2, 2025, 4:15 AM

#

Every model claims high context but have massive performance drop offs

gritty glade Aug 2, 2025, 4:16 AM

#

trim blade It dethrones opus

I disagree with you there, it is good but claude is still better

trim blade Aug 2, 2025, 4:17 AM

#

I dont see it

cold knoll Aug 2, 2025, 4:17 AM

#

yeah thats a crazy statement, i can use opus max context no issues with a rolling message window, this shit i can barely use 50k

#

Consistent output past 100k, best under but still functional

trim blade Aug 2, 2025, 4:18 AM

#

I've used nothing but claude and some deepseek for about 2 years now

#

since claude 2 to claude 4 / opus 3 / 4

#

and this model is amazing me

#

for the past 5 or so hours I spent

gritty glade Aug 2, 2025, 4:19 AM

#

It could be the novelty that makes it feel better, but then again it depends on what you like pikathink

#

Horizon is def good compared to most other models but Claude is still better in terms of dialogue flow and keeping consistent with lore imo

#

But if horizon is cheaper than 50c a message it would win out for me tbh mindfortress

cold knoll Aug 2, 2025, 4:21 AM

#

i agree with that

#

price will say everything

gritty glade Aug 2, 2025, 4:22 AM

#

with reasoning i think it could be better than it is currently

trim blade Aug 2, 2025, 4:23 AM

#

Btw, I should say alpha is the model Im enjoying for creative writing

#

beta seemed worse to me

#

for that at least

#

could always be just some luck of a bunch of bad gens but it felt way worse to me so I switched back

rough relic Aug 2, 2025, 4:24 AM

#

I'm just curious, will alpha/beta stay free?

#

I'm a bit new to this

bright oak Aug 2, 2025, 4:25 AM

#

rough relic I'm just curious, will alpha/beta stay free?

probably not. everyone gets to have fun talking with them but they use the data to train.

trim blade Aug 2, 2025, 4:26 AM

#

its either going to be gpt5 which will hopefully be cheap going by the speed or hopefully, maybe, it might be the OS model?

#

seems a bit too good to be true there

heady gust Aug 2, 2025, 4:26 AM

#

It'd be nice if it was the OS model but I have a sinking feeling it's probably not

trim blade Aug 2, 2025, 4:26 AM

#

but who knows

rough relic Aug 2, 2025, 4:26 AM

#

bright oak probably not. everyone gets to have fun talking with them but they use the data ...

Any alternatives to it, which are free?

cold knoll Aug 2, 2025, 4:27 AM

#

I think this is Grok 4 coder

#

this dogshit feels like it was trained on cline/roo which would make sense as to why it keeps asking me for approval to use tools, as Grok was trained on Cline

rough relic Aug 2, 2025, 4:29 AM

#

cold knoll I think this is Grok 4 coder

Isn't that paid?

cold knoll Aug 2, 2025, 4:29 AM

#

rough relic Isn't that paid?

No, Horizon is a free promotion to test a hidden model

#

its for AI Companies to test their models before release to see how they perform in the real world

rough relic Aug 2, 2025, 4:30 AM

#

Oh

#

So is this gonna be grok 4 then?

cold knoll Aug 2, 2025, 4:30 AM

#

No one knows, but it would make sense

dusky kelp Aug 2, 2025, 4:32 AM

#

Hmm, wonder why beta did worse on aider

rough relic Aug 2, 2025, 4:32 AM

#

What would be like the closest model to it, if it's like a completely free one

cold knoll Aug 2, 2025, 4:33 AM

#

rough relic What would be like the closest model to it, if it's like a completely free one

Sadly thats not how AI model releases work

#

free models cant compete with top of the line releases, and it depends on your use case

#

this model is free temporarily

#

use it while you can

rough relic Aug 2, 2025, 4:33 AM

#

Uhh, mine is very niche

trim blade Aug 2, 2025, 4:34 AM

#

kimi k2 is ok

dusky kelp Aug 2, 2025, 4:34 AM

#

might be open ai open source model

rough relic Aug 2, 2025, 4:34 AM

#

cold knoll use it while you can

That won't really work

dusky kelp Aug 2, 2025, 4:34 AM

#

kimi k2 is really great

trim blade Aug 2, 2025, 4:34 AM

#

GLM4.5 is better but I dont think a free version is up

gritty glade Aug 2, 2025, 4:34 AM

#

trim blade kimi k2 is ok

I didn't like kimi that much tbh

trim blade Aug 2, 2025, 4:34 AM

#

its super cheap though

gritty glade Aug 2, 2025, 4:34 AM

#

GLM 4.5 is ok yeah

rough relic Aug 2, 2025, 4:34 AM

#

Nah not free wont work

#

Cuz I need to use it like a lot

trim blade Aug 2, 2025, 4:34 AM

#

I use these models a ton

rough relic Aug 2, 2025, 4:34 AM

#

I mean a lot of users will actress it

#

Access

trim blade Aug 2, 2025, 4:34 AM

#

glm4.5 is super cheap

dusky kelp Aug 2, 2025, 4:34 AM

#

nothing but good results so far with kimi, i guess long ocntext it can have a hard time, but i bet that can be fixed with 2.1

trim blade Aug 2, 2025, 4:35 AM

#

like you could prob spend like $10 a month cheap

cold knoll Aug 2, 2025, 4:35 AM

#

rough relic I mean a lot of users will actress it

if users will access it it will never work being free

rough relic Aug 2, 2025, 4:35 AM

#

cold knoll if users will access it it will never work being free

They will get their own api key

#

Also I tried deepseek chimera, that was decent

#

But I need a model which can use my context

#

I basically have a lot of text to it as a context

cold knoll Aug 2, 2025, 4:36 AM

#

then use a gemini model

rough relic Aug 2, 2025, 4:36 AM

#

And only horizon is able to use it properlt

rough relic Aug 2, 2025, 4:36 AM

#

cold knoll then use a gemini model

That's paid I think

dusky kelp Aug 2, 2025, 4:37 AM

#

gemini is the only model good at long context, claude is pretty good, and so is gpt 4.1, but gemini models are always way ahead on context size and quality

rough relic Aug 2, 2025, 4:37 AM

#

Uhh, even qwen 3 had enough context

#

So context isn't really an issue atp

#

The issue that's it's not understanding it

cold knoll Aug 2, 2025, 4:38 AM

#

uhuh, free models wont work the best for what you want to do, You're likely doing some roleplay shit or long context retrieval, these things cost money to run, and no model is free forever unless you run it yourself, then... up goes your electricity bill by 5x

dusky kelp Aug 2, 2025, 4:38 AM

#

yeah, thats what i mean by quality, gemini can reason accross long context, most models are not good at that

rough relic Aug 2, 2025, 4:39 AM

#

dusky kelp yeah, thats what i mean by quality, gemini can reason accross long context, most...

Yeah, but I need a free model

trim blade Aug 2, 2025, 4:39 AM

#

gemini has so many uses free a day

#

but I do not like it for writing myself

dusky kelp Aug 2, 2025, 4:39 AM

#

Im saying that gemini is the only one that is good at long ctx imo, not that i know of a good free option

trim blade Aug 2, 2025, 4:39 AM

#

very dry model

rough relic Aug 2, 2025, 4:40 AM

#

What do you use?

trim blade Aug 2, 2025, 4:40 AM

#

and a bit dumb compared to sonnet / this new gpt

#

before, sonnet 3.7/4/opus

#

Im really really liking horizon alpha atm

#

I used deepseek some as well

#

kimi was ok, GLM4.5 is good

rough relic Aug 2, 2025, 4:41 AM

#

I might try glm 4.5 air

trim blade Aug 2, 2025, 4:41 AM

#

nah, air was way worse

#

I mean glm4.5

#

oh wow I never noticed how cheap it is $0.20/M input tokens
$0.20/M output tokens

#

you could prob spend $10 and use it non stop for a month

undone cypress Aug 2, 2025, 4:45 AM

#

fathom atlas Huge thanks to <@388196006002556938> & the OpenRouter team for hosting these ste...

better than sonnet 4? 👀

rough relic Aug 2, 2025, 4:48 AM

#

@trim blade I'm a bit surprised how good this model is with context, I think it's a gpt model, is there a way to use any gpt model for free?

devout copper Aug 2, 2025, 4:54 AM

#

How is the model doing?

gentle sentinel Aug 2, 2025, 5:08 AM

#

Kinda meh

trim blade Aug 2, 2025, 5:09 AM

#

for writing its worse than alpha

#

imo its a bit dumber than alpha

#

but apparently some benchmarks people did here said differently?

visual rose Aug 2, 2025, 5:19 AM

#

whats next? horizon gamma?

undone cypress Aug 2, 2025, 5:22 AM

#

i like horizon gamma

iron tartan Aug 2, 2025, 5:24 AM

#

I’m pretty sure this is an OpenAI model

#

I asked it to write up acceptance criteria for a user story and this was one of the bullet points

#

Chips of sample phrases (e.g., “HELLO WORLD”, “OPEN AI”).

#

It suggested openAI

#

It could mean nothing, it could mean something, time will tell

lavish epoch Aug 2, 2025, 5:39 AM

#

We literally doing RLHF for OpenAI 💀 💀 💀

iron tartan Aug 2, 2025, 5:43 AM

#

This feels like a small model

#

GPT-5-nano?

sinful orchid Aug 2, 2025, 5:47 AM

#

Guys

#

How is the rp?

trim blade Aug 2, 2025, 5:48 AM

#

alpha is much better

grave shore Aug 2, 2025, 5:50 AM

#

trim blade alpha is much better

Can I ask you what temp and other parameters you have set?
Because I'm not kidding, I'm having a seriously mediocre experience with this model.

It's repetitive, it makes characters act out of character, the way it connects concepts is okay, nothing special

It just seems weird to me that long form writing for you is such a different experience than the RP'ing is for me.

Like, both mediums should be testing for pretty similar things.
(I've tested both Alpha and Beta)

trim blade Aug 2, 2025, 5:50 AM

#

0.3 temp 0.3 ish top p

grave shore Aug 2, 2025, 5:51 AM

#

Huh, okay, that's a lot lower than what I'm using, I'll try it out, thanks.

trim blade Aug 2, 2025, 5:51 AM

#

im also using a JB

#

like all gpt / claude models it needs a JB to write well / get rid of that positivity bias

grave shore Aug 2, 2025, 5:53 AM

#

Oh yeah, definitely

#

I think the only model you don't need to wrangle the positivity bias out of is 2.5 Pro.
That model is straight up depressing sometimes

gritty glade Aug 2, 2025, 5:57 AM

#

2.5 pro is a bit too negative funnily enough guh

iron tartan Aug 2, 2025, 5:59 AM

#

This is really interesting to hear discourse about writing since I’m usually looking at models through the lens of how well a model can code

dapper pecan Aug 2, 2025, 6:04 AM

#

I’m also voting alpha for the better writer.

grave wyvern Aug 2, 2025, 6:07 AM

#

Alpha is giving me more censorship. I tried to get it to talk about the recent 🇰🇷 shenanigans and they both erred, but alpha was harder to get to talk imo

gritty glade Aug 2, 2025, 6:13 AM

#

trim blade im also using a JB

Do you mind sending me the preset you are using? pikathink

lavish epoch Aug 2, 2025, 6:17 AM

#

I just did a quick test on frontend visualization task.
two tries, both have some bugs that cause it to be unreadable, worse than horizon alpha.
won't be testing further.

grave wyvern Aug 2, 2025, 6:20 AM

#

i'm quite impressed by it, will be interesting to see where it is priced

crystal scaffold Aug 2, 2025, 6:30 AM

#

i modified the strawberry test a little

#

(sometimes it said 6, sometimes 5, for that prompt)

#

modifying it helps with the risk of overfitting, but i havent really noticed signs of overfitting tbh. (i am not an expert)

copper mountain Aug 2, 2025, 7:00 AM

#

wat 1500 messages overnight here 💀

warm brook Aug 2, 2025, 7:13 AM

#

this is one of the most positive aligned models I've ever seen

#

this and alpha

ashen hamlet Aug 2, 2025, 7:18 AM

#

crystal scaffold i modified the strawberry test a little

I can't even do that one.

late onyx Aug 2, 2025, 7:22 AM

#

crystal scaffold i modified the strawberry test a little

doing that makes it a bit easier actually, as it would token split it to be smaller groups. The strawberry challenge was difficult as the AI sees "straw" "berry" or potentially even "strawberry" but with your example it would see (using OpenAI tokenizer) "stre" "ar" "we" "ebe" "erry" which is easier as they're shorter

patent gale Aug 2, 2025, 7:26 AM

#

i keep getting 400 errors on horizon models

fringe bay Aug 2, 2025, 8:01 AM

#

its all fine

#

left: fishtank with sonnet, right: fishtank with beta

rare terrace Aug 2, 2025, 8:39 AM

#

Billy jean is not my lover

#

She's just a girl who says that Iiiii am the one

#

But the kid only reasoned for 3 hours

#

She says IIII am the one

#

Is there any official explanation to that

#

Perhaps the horizon models are actually gpt 5 and they were worried with us thinking that the reasoning performance comes from the OSS model, so they shut it down?

thick osprey Aug 2, 2025, 8:45 AM

#

Isn't Horizon Beta supposed to be free for now during the testing period? It says the same and shows $0 for input and output tokens, but I am still being charged for it.

deft cliff Aug 2, 2025, 8:46 AM

#

Has anyone tested fringe knowledge? Like Finnish language proficiency? I can't speak Finnish myself but I heard that basically all current models fail at it

lavish minnow Aug 2, 2025, 8:47 AM

#

thick osprey Isn't Horizon Beta supposed to be free for now during the testing period? It say...

It is free are you sure in the activity you're being charged?

lavish minnow Aug 2, 2025, 8:48 AM

#

deft cliff Has anyone tested fringe knowledge? Like Finnish language proficiency? I can't s...

Can't speak Finnish but I'm a french translator and it is really good with french, among top models in my opinion. And follows instructions perfectly

warm niche Aug 2, 2025, 8:51 AM

#

On every model I try to generate Russian small poem about kitten. So far only Claude been able to make it somewhat decent. This model poem is no way near

deft cliff Aug 2, 2025, 8:51 AM

#

Interesting.. French is spoken by a lot of people though finish is not and it's really hard

Just trying to figure out how to test if it's really big or not. Thought it it can speak perfect finish it must be gpt5 or something because who would make an open weight model with 20/100b write finish. Seems a waste of resources

deft cliff Aug 2, 2025, 8:52 AM

#

warm niche On every model I try to generate Russian small poem about kitten. So far only Cl...

That is what I was looking for! Thank you

lavish minnow Aug 2, 2025, 8:53 AM

#

deft cliff Interesting.. French is spoken by a lot of people though finish is not and it's ...

Yeah French might not be the best example, it is widely spoken and pretty much all LLMs are getting pretty decent with it atm

rare terrace Aug 2, 2025, 9:08 AM

#

thick osprey Isn't Horizon Beta supposed to be free for now during the testing period? It say...

Web search costs money i think

thick osprey Aug 2, 2025, 9:23 AM

#

rare terrace Web search costs money i think

Thanks, yes it was because web search was on by default when I was testing in OR Chatroom.

thick osprey Aug 2, 2025, 9:29 AM

#

lavish minnow It is free are you sure in the activity you're being charged?

Thanks. Yes, it is free. Just checked in Activity, like @rare terrace suggested it was charging me for the web search.

fringe bay Aug 2, 2025, 9:47 AM

#

idk. it's a good agentic model that can work on long projects in cline. But for more complex projects I often have to bring in claude to fix bugs

quasi quartz Aug 2, 2025, 9:57 AM

#

its already better than horizon alpha

#

insane

verbal leaf Aug 2, 2025, 10:05 AM

#

that's not really insane

#

horizon alpha was shit w/o reasoning

spiral pewter Aug 2, 2025, 10:06 AM

#

this model doesnt seem to reason

verbal leaf Aug 2, 2025, 10:20 AM

#

well yeah it doesn't

#

but it has interesting behaviour sometimes similar to deepseek v3 where it will go into long CoT in its responses even though it isn't a reasoner rn

#

reflected here as well

gritty glade Aug 2, 2025, 10:35 AM

#

I don't like it so far for creative writing pikathink

fringe bay Aug 2, 2025, 10:36 AM

#

I am ever more impressed with sonnets ability to understand horizons code and fix subtle bugs

gritty glade Aug 2, 2025, 10:36 AM

#

Hopefully if it gets reasoning it might be better but so far it's kinda meh

late onyx Aug 2, 2025, 10:38 AM

#

verbal leaf horizon alpha was shit w/o reasoning

Was the reasoning data with Horizon Alpha exposed?

verbal leaf Aug 2, 2025, 10:38 AM

#

the number of reasoning tokens was visible but other than that no

late onyx Aug 2, 2025, 10:50 AM

#

verbal leaf the number of reasoning tokens was visible but other than that no

ohhh

fathom sleet Aug 2, 2025, 10:54 AM

#

horizon beta seems to be a smaller model, since its responding faster then horizon alpha

#

but could also be because of the inference infrastructure

rare terrace Aug 2, 2025, 10:55 AM

#

Whom do I bribe to turn on horizon reasoning

#

We shouldnt have made such a big deal out of it, they might have left it on then

#

I think they turned it on because we were disappointed at first

fathom sleet Aug 2, 2025, 10:56 AM

#

both horizon alpha and horizon beta arent able to give me correct rust code in 1-shot

#

so i am assuming its not gpt-5 but their open source model, or some smaller model

spiral pewter Aug 2, 2025, 11:01 AM

#

rare terrace I think they turned it on because we were disappointed at first

they were giving us a glimpse into the future

fathom sleet Aug 2, 2025, 11:19 AM

#

fathom sleet both horizon alpha and horizon beta arent able to give me correct rust code in 1...

(ok but then again o3 also fails this test, so maybe i was wrong)

tender cairn Aug 2, 2025, 11:25 AM

#

horizon model the open source model?

#

beta the larger as its supposeduly better?

#

tried vibe coding w/cline. perf not shocking or nyhting

#

tbh considering openai engineers use claude code internally is gpt5 going to trump sonnet at coding/

#

ok ui design is cheeks

trim blade Aug 2, 2025, 11:53 AM

#

ui design is cheeks?

#

alpha at least was great there

verbal leaf Aug 2, 2025, 11:53 AM

#

eeeh

#

it's great on paper but if you actually look at it properly it kinda falls apart

#

it's always the same "template", same style, etc

glacial ice Aug 2, 2025, 11:56 AM

#

hmmm, does temp or top P even have an impact on horizon

#

atleast temp seems to be locked, the settings entirely at that

#

and the outputs tend to be samey

tender cairn Aug 2, 2025, 12:02 PM

#

verbal leaf it's great on paper but if you actually look at it properly it kinda falls apart

do you know any good MCP's that get around this issue? that provide it solid components, i've tried magic design mcp which is okay

tender cairn Aug 2, 2025, 12:03 PM

#

trim blade ui design is cheeks?

yeah ill try alpha

#

i mean the agentic capabillites is good ig alpha design much better

#

kimi k2 stronger than these models thouh imo

rare terrace Aug 2, 2025, 12:18 PM

#

I tricked alpha into generating CoT

#

I think the results are similar to what it was during its reasoning phase

#

I believe this might be it?

#

Anyone got a benchmark they could test

#

?

#

Where they tested the dumb horizon alpha

#

And the smart one in the 3 hour period

#

So that we can compare

tender cairn Aug 2, 2025, 12:25 PM

#

rare terrace I tricked alpha into generating CoT

hiw?

#

sequential mcp?

rare terrace Aug 2, 2025, 12:25 PM

#

I added this to system prompt Every time a user sends a request, reason deeply through the task, delimiting your thinking with tags, starting with a <think> tag and ending with </think>. Only after you're done with the reasoning, may you attempt the task

#

Its answers improved

#

lol

#

Even the knowledge task i gave it

#

I want to know if that's really all it was

tender cairn Aug 2, 2025, 12:27 PM

#

yeah but thats not TTC, would still improve answers though

late onyx Aug 2, 2025, 12:29 PM

#

tender cairn yeah but thats not TTC, would still improve answers though

It’s still CoT though, a more advanced version of the “Think step by step” prompt technique

tender cairn Aug 2, 2025, 12:29 PM

#

yeah i know, thats why i said itll improve answers

rare terrace Aug 2, 2025, 12:29 PM

#

Someone had some sort of vision benchmark where they got to test both reasoning and non reasoning horizon alpha

#

I want to have them run it again with that system prompt

tender cairn Aug 2, 2025, 12:30 PM

#

wait there is reasoning on alpha?

rare terrace Aug 2, 2025, 12:31 PM

#

tender cairn wait there is reasoning on alpha?

There was

#

For 3 hours

#

Then they turned it off

tender cairn Aug 2, 2025, 12:31 PM

#

ohhh

rare terrace Aug 2, 2025, 12:32 PM

#

It's still talking about its hidden chain-of-thought

#

Even though it doesnt seem to be reasoning

tender cairn Aug 2, 2025, 12:34 PM

#

Cos of the RL im guessing

glacial ice Aug 2, 2025, 1:02 PM

#

Hmmm, for writing, horizon seems samey, its always the same kind of scene rewritten, maintaining same structure

tender cairn Aug 2, 2025, 1:03 PM

#

glacial ice Hmmm, for writing, horizon seems samey, its always the same kind of scene rewrit...

small model then?

glacial ice Aug 2, 2025, 1:09 PM

#

could be also because settings are locked

#

changing temp or something else has no effect

#

ANd of course it has the No/Not (something) 2x. Just (something) slop.

tender cairn Aug 2, 2025, 1:20 PM

#

its def a small model

#

small models suck at writing

#

the weights were leaked as well around the 150b range

fathom atlas Aug 2, 2025, 1:27 PM

#

undone cypress better than sonnet 4? 👀

For my current use case yeah for sure, better code, faster speed.

glacial ice Aug 2, 2025, 1:32 PM

#

Probably there'll be a significant difference between this and official, just because settings will be unlocked

warped coral Aug 2, 2025, 1:50 PM

#

is the max still 1000 requests?

past sphinx Aug 2, 2025, 2:02 PM

#

https://www.reddit.com/r/LocalLLaMA/s/988FwNEJ6a

From the LocalLLaMA community on Reddit: Horizon Alpha vs Horizon Beta

Explore this post and more from the LocalLLaMA community

#

Cool comparison

spiral pewter Aug 2, 2025, 2:02 PM

#

that is one sexy website

past sphinx Aug 2, 2025, 2:03 PM

#

dapper pecan I’m also voting alpha for the better writer.

Anyone have examples of Alpha writing better than Beta? cc @gritty glade

spiral pewter Aug 2, 2025, 2:06 PM

#

does horizon beta have vision?

#

yes

#

is this a pre-filled system prompt?

You are Horizon Beta, a large language model from an unknown provider.

Formatting Rules:
- Use Markdown **only when semantically appropriate**. Examples: `inline code`, \`\`\`code fences\`\`\`, tables, and lists.
- In assistant responses, format file names, directory paths, function names, and class names with backticks (`).
- For math: use \( and \) for inline expressions, and \[ and \] for display (block) math.```

#

gritty glade Aug 2, 2025, 2:10 PM

#

past sphinx Anyone have examples of Alpha writing better than Beta? cc <@1111456919639576646...

Lemme do another swipe for both

#

Vote for alpha tbh, or are you chasing screenshot examples?

patent grail Aug 2, 2025, 3:08 PM

#

fathom atlas

IS it building an auth system from scratch, or implementing an auth library?

fathom atlas Aug 2, 2025, 3:14 PM

#

patent grail IS it building an auth system from scratch, or implementing an auth library?

Scratch

clever arrow Aug 2, 2025, 3:14 PM

#

Not bad. I asked for an A2A compatibel Agent with MCP funtionality without using any kind of Framework. It produced one:

📎 message.txt

patent grail Aug 2, 2025, 3:15 PM

#

Is alpha generating cached tokens too?

fathom atlas Aug 2, 2025, 3:15 PM

#

Horizon Beta re-making the OR site

patent grail Aug 2, 2025, 3:19 PM

#

Nice, just image input?

past sphinx Aug 2, 2025, 3:19 PM

#

gritty glade Vote for alpha tbh, or are you chasing screenshot examples?

Looking for prompts or screenshot examples

gritty glade Aug 2, 2025, 3:20 PM

#

If you are okay with me DMing I am fine with sending you a comparison

hardy fjord Aug 2, 2025, 3:42 PM

#

trim blade GLM4.5 is better but I dont think a free version is up

They got air version free on glm 4.5 iirc

pseudo jolt Aug 2, 2025, 3:54 PM

#

Either google or claude

#

There also possibility of xAi

#

but i dont think this one is from openAI, but i could be wrong

gritty glade Aug 2, 2025, 4:10 PM

#

I don't think it's gemini pikathink

#

Maybe claude

glacial ice Aug 2, 2025, 4:12 PM

#

didnt testing show a large similarity with o3?

native totem Aug 2, 2025, 4:12 PM

#

anyone got this? I always got this 502 error, whether through API or chat.

glacial ice Aug 2, 2025, 4:12 PM

#

from multiple users that is

fathom atlas Aug 2, 2025, 4:13 PM

#

patent grail Nice, just image input?

Yeah

patent grail Aug 2, 2025, 4:14 PM

#

fathom atlas Yeah

That's very impressive then; can something like Claude even get that close in img -> code?

fathom atlas Aug 2, 2025, 4:14 PM

#

patent grail That's very impressive then; can something like Claude even get that close in im...

Claude does well too, I'd say they are equal when it comes to re-creating UIs from an image. The problem is that claude takes 4x the time and costs a lot.

native totem Aug 2, 2025, 4:15 PM

#

Any admin here? I need to solve this question. My main account can never use this model. But the latest registered account can use this model normally.

raw blaze Aug 2, 2025, 4:20 PM

#

It is an OpenAI model and very likely related to GPT-5 (possibly a mini or nano variant). The tokenizer is 99% certain of this—achieving 100% certainty would require analyzing several recently public-tested models:

lmarena - zenith
lmarena - summit
openrouter - horizon-alpha
openrouter - horzion-beta
!! perplexity leaked - gpt-5

All of their system prompts share an identical segment. Notably, there’s a value called "juice" that controls the length of the chain of thought:

lmarena - zenith, juice = 64
lmarena - summit, juice = 200
openrouter - horizon-alpha, juice = 0 (it was temporarily set to 100 for a very short period)
openrouter - horzion-beta, juice = 5
perplexity leaked - gpt-5, juice = 64

trail gale Aug 2, 2025, 4:26 PM

#

Horizon Beta seems lot more censored than Alpha. I did think it was weird just how open Alpha was for whats likely OpenAI product.

sand pulsar Aug 2, 2025, 4:29 PM

#

trail gale Horizon Beta seems lot more censored than Alpha. I did think it was weird just h...

Agreed, and the refusal responses are very different from the traditional openai style. They could be planning to switch the refusal style, or could be just for this stealth variant.

trail gale Aug 2, 2025, 4:32 PM

#

If it is indeed the open source model I would really wish they just let it be without any native filtering and just let users and providers add filters themselves if they so wished. I dont think its too controversial to want open source stuff to be fully open.

limber lance Aug 2, 2025, 4:34 PM

#

it's quite awful at these tasks
(the only one it got right out those 4 is Empress Suiko, but she should have been mentioned in Bidatsu's entry as well
particularly egregious is claiming Bidatsu married Ishi-hime, who was actually his mother)

trail gale Aug 2, 2025, 4:35 PM

#

sand pulsar Agreed, and the refusal responses are very different from the traditional openai...

Yeah, havent really seen this refuasl style before. It actually took me a while to register Im being filtered, so I then did Alpha vs Beta direct comparison. Unfortunately this makes Alpha way more usefull for people like me (horror fiction writer). I think Im gonna really miss Alpha when it leaves Open Router. Hopefully someone makes a Dolphin-esque fine tune.

sand pulsar Aug 2, 2025, 4:37 PM

#

trail gale Yeah, havent really seen this refuasl style before. It actually took me a while ...

I think it's too early to jump to conclusions that alpha and beta are 100% related.

There's a chance that alpha is targeted for open-source release, while beta is targeted for closed-source release, or the other way around.

We can only speculate and infer, and there's also a possibility that OpenAI are still figuring things out. Not uncommon to crystallize what's what only hours before the public release.

trail gale Aug 2, 2025, 4:38 PM

#

sand pulsar I think it's too early to jump to conclusions that alpha and beta are 100% relat...

Its stated on Open Router profile that its a "improved version of Horizon Alpha". But yes, definitely will change still for final release

storm hill Aug 2, 2025, 4:39 PM

#

using the same checks as I did for optimus/quasar, Horizon is very likely to be hosted by OpenAI

glacial ice Aug 2, 2025, 4:40 PM

#

that was pretty much clear when openai kept claiming they would do an opensource model, and delayed it without saying anything when kimi appeared

bitter estuary Aug 2, 2025, 4:40 PM

#

Is the primary theory here still OpenAI's open-weight model, presumably one focused on creativity/writing/EQ?

trail gale Aug 2, 2025, 4:42 PM

#

bitter estuary Is the primary theory here still OpenAI's open-weight model, presumably one focu...

Im now near certain its OpenAI open weight model. And whether or not creative writing was their actual focus, it does seem that that is what its best at.

glacial ice Aug 2, 2025, 4:42 PM

#

too bad it seems likely they'll attempt to make it safe

bitter estuary Aug 2, 2025, 4:42 PM

#

Gotcha. And I mean when apparently it blows ass at code and reasoning, but tops the charts in EQ and creative writing...let's hope that was their goal lol

tender cairn Aug 2, 2025, 4:43 PM

#

if it has a unique type of refusal it is 100% openai. they have the most to lose with open source

bitter estuary Aug 2, 2025, 4:46 PM

#

I'm kind of surprised they are even releasing one. Are they expecting it to make a hiring difference? It's not like any CS or math PhD is stupid enough to go "I want to work on open-weight, so OAI is apparently the place for me!" Maybe just general PR?

tender cairn Aug 2, 2025, 4:46 PM

#

general PR cos they were originally open years back

bitter estuary Aug 2, 2025, 4:47 PM

#

Or, come to think of it, maybe a lawsuit defense?

#

"We are open, we're just responsible with releases!"

sand pulsar Aug 2, 2025, 4:47 PM

#

There are so many reasons why they'd release an open model, and why not.

trail gale Aug 2, 2025, 4:49 PM

#

bitter estuary I'm kind of surprised they are even releasing one. Are they expecting it to make...

Its super fast, which could mean really small (or maybe just tonne of processing power). If its small enough you can run it or mortal human GPUs comfortably, that would make it very valuable to someone like me.

#

And yeah, it also lets them pretend they are "sort of open"

tender cairn Aug 2, 2025, 4:49 PM

#

they did mention having an o3 mini size model on a phone was something they were thinking abuot

fossil ibex Aug 2, 2025, 5:04 PM

#

So is it normal for me to get a negative balance even though i was using only Horizon Beta?

rare terrace Aug 2, 2025, 5:11 PM

#

fossil ibex So is it normal for me to get a negative balance even though i was using only Ho...

Did you enable web search? Web search costs extra

fossil ibex Aug 2, 2025, 5:13 PM

#

rare terrace Did you enable web search? Web search costs extra

Ahh yup that was it.

heady gust Aug 2, 2025, 5:14 PM

#

tender cairn they did mention having an o3 mini size model on a phone was something they were...

They mentioned a phone model or a small o3 mini level model here

#

Now that XBai o4 did literally that, we'll see what comes of it I guess

#

Well allegedly, haven't tested it

modest crescent Aug 2, 2025, 5:27 PM

#

i mean

#

he did say the oss model would come

#

during the summer

#

so either beta or alpha could very well be the oss one

steep palm Aug 2, 2025, 5:28 PM

#

Hmm that ain't great. Most 32b models get this right, like GLM 32B without reasoning, or Gemma 27b. (the correct answer is Siberian tiger). It starts off with an incorrect answer, actually have the correct one 2/3 of the way through, ended up with a ridiculous answer ('polar bear'). Horizon Alpha also failed.

modest crescent Aug 2, 2025, 5:29 PM

#

bitter estuary Gotcha. And I mean when apparently it blows ass at code and reasoning, but tops ...

https://x.com/sama/status/1899535387435086115

Sam Altman (@sama)

we trained a new model that is good at creative writing (not sure yet how/when it will get released). this is the first time i have been really struck by something written by AI; it got the vibe of metafiction so right.

PROMPT:

Please write a metafictional literary short story

steep palm Aug 2, 2025, 5:30 PM

#

GLM32B for comparison

bitter estuary Aug 2, 2025, 5:32 PM

#

modest crescent https://x.com/sama/status/1899535387435086115

Ah, okay, I'm putting $10 on that then

steep palm Aug 2, 2025, 5:32 PM

#

Suppose I fly a plane leaving my campsite, heading straight east for precisely 28,361 km, and find myself back at the camp. I come upon seeing a tiger in my tent eating my food! What species is the tiger? Consider the circumference of the Earth.

(The prompt if people want to test it on other models)

modest crescent Aug 2, 2025, 5:32 PM

#

bitter estuary Ah, okay, I'm putting $10 on that then

it did write my stuff

#

the best out of any other model

#

so it has got to be that model

steep palm Aug 2, 2025, 5:36 PM

#

steep palm >Suppose I fly a plane leaving my campsite, heading straight east for precisely ...

This part of GLM's answer made me laugh 🙂

modest crescent Aug 2, 2025, 5:38 PM

#

https://x.com/sama/status/1951695003157426645

Sam Altman (@sama)

we have a ton of stuff to launch over the next couple of months--new models, products, features, and more.

please bear with us through some probable hiccups and capacity crunches. although it may be slightly choppy, we think you'll really love what we've created for you!

#

horizon-mega next

tender cairn Aug 2, 2025, 5:38 PM

#

heady gust They mentioned a phone model or a small o3 mini level model here

yeah they probably did both

glacial ice Aug 2, 2025, 5:41 PM

#

wait, what if they do GPT5 through openrouter too in the same manner?

modest crescent Aug 2, 2025, 5:42 PM

#

glacial ice wait, what if they do GPT5 through openrouter too in the same manner?

if this one isn't gpt5, they prob will

glacial ice Aug 2, 2025, 5:42 PM

#

Though probably not happening, they got big teams for that

modest crescent Aug 2, 2025, 5:42 PM

#

i see this current model

#

mostly trained for creative writing

glacial ice Aug 2, 2025, 5:42 PM

#

plus corpos more than willing to test out GPT5

modest crescent Aug 2, 2025, 5:42 PM

#

so i don't really see it being gpt5

glacial ice Aug 2, 2025, 5:43 PM

#

yeah, it obviously is not

#

leaked to be 20B & 120B

modest crescent Aug 2, 2025, 5:43 PM

#

this being the oss one would be such a big win for the community though

#

creative writing wise

#

it gets so much shit right

#

that no other model has done yet

glacial ice Aug 2, 2025, 5:43 PM

#

eh, it does dumb stuff though

#

but writing it does seem to be different

#

tooo bad we cant tune settings

bitter estuary Aug 2, 2025, 5:44 PM

#

EQ Bench does say it has an incredibly low slop score. I need to test it on repetition, which bothers me even more than slop

glacial ice Aug 2, 2025, 5:44 PM

#

nor are there any presets

modest crescent Aug 2, 2025, 5:44 PM

#

bitter estuary EQ Bench does say it has an incredibly low slop score. I need to test it on repe...

1.2

#

repetition

#

or sum

bitter estuary Aug 2, 2025, 5:44 PM

#

Yeah

modest crescent Aug 2, 2025, 5:44 PM

#

sota for repetition

glacial ice Aug 2, 2025, 5:44 PM

#

structure repetition is an issue though

modest crescent Aug 2, 2025, 5:44 PM

#

it's insanely good

#

it may have flaws

#

but it's def ahead of every other model rn

glacial ice Aug 2, 2025, 5:44 PM

#

and responses are VERY samey in overall content

modest crescent Aug 2, 2025, 5:45 PM

#

if it were open source, i think that devs could make it even better

#

and keep the creativity

#

for their models

bitter estuary Aug 2, 2025, 5:45 PM

#

The only fix I've seen so far is DRY and for some reason like zero hosts support it except Arli and they only have a few models

modest crescent Aug 2, 2025, 5:45 PM

#

and we'd have insanely good creative writing models

glacial ice Aug 2, 2025, 5:45 PM

#

as in, its not changing up much between swipes/gens

modest crescent Aug 2, 2025, 5:45 PM

#

yeah, i'm not expecting literal exceptional human-like writing with ais until like 2027-2028

#

but i'm sure that the current flaws aren't that hard to fix

glacial ice Aug 2, 2025, 5:46 PM

#

different formatting and words yes, but the overall content remains the same
And its hard to say whether this is a model issue or settings

modest crescent Aug 2, 2025, 5:46 PM

#

yea

#

true

sand pulsar Aug 2, 2025, 5:46 PM

#

with proper incentives, all of it should be fixable

glacial ice Aug 2, 2025, 5:47 PM

#

would be easier if they didnt lock the settings.

#

That is, IF its open source

#

because i dont trust it being improved otherwise

#

its still smaller than openai's usual models, and creative writing seems like a domain where they could be willing to do open source

north beacon Aug 2, 2025, 5:49 PM

#

I hate when models think like that

modest crescent Aug 2, 2025, 5:49 PM

#

if i'm gonna be honest, i don't see a point in like flagging smut content for fanfiction

bitter estuary Aug 2, 2025, 5:49 PM

#

I'm curious if the incentives are actually low right now. I mean, I wouldn't be surprised if coding was the #1 API use case for most of these providers, but there are a loooootttt of people using them for conversation/roleplay like CAI

modest crescent Aug 2, 2025, 5:49 PM

#

unless it's r word, p word etc

#

just basic smut

glacial ice Aug 2, 2025, 5:50 PM

#

its hard to filter this kind of rejection

#

because its actually changing up the structure

modest crescent Aug 2, 2025, 5:50 PM

#

how hard is it to make it so it's like allow smut content for fan fiction/writing only if not r word, p word etc

sand pulsar Aug 2, 2025, 5:50 PM

#

north beacon I hate when models think like that

that's the new refusal response, yeah

latent tendon Aug 2, 2025, 5:51 PM

#

is it just me or does horizon-beta not respect stop sequences? I saw the stop sequence appear a few times in the responses

glacial ice Aug 2, 2025, 5:51 PM

#

we'll be probably seeing that on GPT5 then?

modest crescent Aug 2, 2025, 5:51 PM

#

prob

glacial ice Aug 2, 2025, 5:51 PM

#

given the usual methods of filtering out words dont work on that

north beacon Aug 2, 2025, 5:51 PM

#

sand pulsar that's the new refusal response, yeah

Both alpha and beta make same refusal

sand pulsar Aug 2, 2025, 5:54 PM

#

latent tendon is it just me or does horizon-beta not respect stop sequences? I saw the stop se...

Wouldn't be surprising. OpenAI seems to be moving away from supporting stop sequences, so this feels in line with that.

latent tendon Aug 2, 2025, 5:55 PM

#

is the consensus that this is an openai model?

sand pulsar Aug 2, 2025, 5:55 PM

#

seems like that to me, yeah, but can't be 100% sure

#

Could also be the case that they haven't wired the support for stop sequences yet in the alpha/beta

gleaming lantern Aug 2, 2025, 7:10 PM

#

It's interesting that rejections have increased a lot

#

I'm guessing the point of this release was to expand filtering and flagging

#

Thatd be logical if you were about to release an Open Source system and you want to avoid controversy about it

storm hill Aug 2, 2025, 7:11 PM

#

OR has also disabled the moderation layer that OpenAI normally forces them to run with

rose swan Aug 2, 2025, 7:20 PM

#

Not impressed by svgs:

#

when will we finally have a model that can make beautiful and modern svgs that don't look like a kid experimenting on Paint?

mental cobalt Aug 2, 2025, 7:24 PM

#

when you let it reason for half an hour

rose swan Aug 2, 2025, 7:44 PM

#

No seriously does somebody know the best models for svg/landing pages illustrations?

steep palm Aug 2, 2025, 8:09 PM

#

rose swan No seriously does somebody know the best models for svg/landing pages illustrati...

What you posted looks like it'd be better handled by asking for a Mermaid chart rather than SVG, perhaps. What prompt did you use?

rose swan Aug 2, 2025, 8:13 PM

#

My goal would be to have something like this (even less cluttered) but in svg format so that I can then animate it with javascript

AAHar4dCHbPflXdocXOBsD9aHnQrII1ChOSHhNDxZukxM7lgD0IMv1Db1W26qSeCu50vplHkALSkRPrp0o5z15XEG6L5_Pxq-aioeNjlu1FLQ-5sKRIAmwZzO4qW3qj2dT_si_v-A0o73pJi8cikX5om9KA1yEEIsDWdfbbLmPb7IDsYDuYyBkKoSTFXe1XBg4Xjs8JrdMxjukvi1F2GDljlEKHpbyGSvgH98jK1yacXSWprnKRpWn-bIHykRk2yzLQfzNiUCuQRrfl64XDjDSI17g8hY8zfSxp_ttbOH_mgSWzZMrT27LJwupub_jA7QN_mnDPSqy7eOepqypjMHjz1vM_Ps1024.png

sacred wave Aug 2, 2025, 8:14 PM

#

this shit sucks ass

tender cairn Aug 2, 2025, 8:20 PM

#

rose swan My goal would be to have something like this (even less cluttered) but in svg fo...

deepthink

#

or wait till gpt5

rose swan Aug 2, 2025, 8:21 PM

#

tender cairn deepthink

what is deepthink? You mean deepseek?

tender cairn Aug 2, 2025, 8:22 PM

#

gemini deepthink

#

if your prepared to pay 200

#

i'm sure itll one shot good svg's or opus

rose swan Aug 2, 2025, 8:25 PM

#

man this gpt 5 is so hyped, if we find out that it is not better than existing sota model the bubble might burst

tender cairn Aug 2, 2025, 8:27 PM

#

it will be better, but not huge

#

at noticeable thing's 100%. but i dont know if itll trump opus at coding. it will on price/performance likely but maybe still less on vibe tests

honest charm Aug 2, 2025, 8:33 PM

#

Hey! I was wondering what the max number of requests are for this model?

rare terrace Aug 2, 2025, 8:55 PM

#

Why are they saying it's gpt 5? Is it confirmed?

modest crescent Aug 2, 2025, 9:02 PM

#

rare terrace Why are they saying it's gpt 5? Is it confirmed?

no

leaden cipher Aug 2, 2025, 9:07 PM

#

rare terrace Why are they saying it's gpt 5? Is it confirmed?

IMO = in my opinion

verbal leaf Aug 2, 2025, 9:23 PM

#

no

#

deep think IMO is a model

#

IMO = the IMO Gold version of the model for trusted testers

verbal leaf Aug 2, 2025, 9:23 PM

#

rare terrace Why are they saying it's gpt 5? Is it confirmed?

but yeah no it's not confirmed

glacial ice Aug 2, 2025, 9:23 PM

#

GPT5 is developed, its basically just undergoing testing & refinement

#

Horizon is NOT GPT5

verbal leaf Aug 2, 2025, 9:23 PM

#

there's an asterisk because he thinks it's gpt-5, even though it isn't

#

he should've used zenith's svg for that

tender cairn Aug 2, 2025, 9:24 PM

#

leaden cipher IMO = in my opinion

IMO = international math olympiad

glacial ice Aug 2, 2025, 9:24 PM

#

for a couple hours yesterday, GPT5 was accessible
https://www.reddit.com/r/OpenAI/comments/1mettre/gpt5_is_already_ostensibly_available_via_api/

From the OpenAI community on Reddit

Explore this post and more from the OpenAI community

#

openai found out that they accidentally left access open & closed it up

verbal leaf Aug 2, 2025, 9:25 PM

#

that model slug only actually pointed to gpt-5 on their end for like 3 minute slol

#

it started redirecting to 4.1 after that

#

gpt-5 was available via perplexity for a few hours today by accident

#

that was a more reliable way

trim blade Aug 2, 2025, 9:35 PM

#

did it write like horizon?

#

#

horizon alpha on left gpt5 api leak on right

#

horizon is prob gpt5 or gpt5 mini then

#

the webpages it made look very similar as well

modest crescent Aug 2, 2025, 9:49 PM

#

trim blade

horizon did this better than gpt5 then lmao

#

although gpt5 has more details

#

so idk tbh

verbal leaf Aug 2, 2025, 9:49 PM

#

trim blade

eh i doubt this was when the "gpt-5" model was actually routing to gpt-5

#

one sec

trim blade Aug 2, 2025, 9:50 PM

#

or just another seed

verbal leaf Aug 2, 2025, 9:50 PM

#

zenith pelican

#

much better

modest crescent Aug 2, 2025, 9:50 PM

#

it's so cute omg

trim blade Aug 2, 2025, 9:50 PM

#

verbal leaf eh i doubt this was when the "gpt-5" model was actually routing to gpt-5

it was far far better than gpt4.1 / gpt4o whatever it was

verbal leaf Aug 2, 2025, 9:50 PM

#

it's probably mini or something along those lines

#

look at the model name - it was used in place of GPT-4.1/4o in ChatGPT for A/B testing. so it will probably be the free model

modest crescent Aug 2, 2025, 9:53 PM

#

verbal leaf look at the model name - it was used in place of GPT-4.1/4o in ChatGPT for A/B t...

yeah

#

i mean, when u look at it

#

it really seems like horizon was trained

#

specifically on creative writing

#

either way, if it is the oss one, other models should improve their creative writing then

rare terrace Aug 2, 2025, 10:02 PM

#

https://www.reddit.com/r/singularity/s/HBExv2Bpt8

From the singularity community on Reddit: Animated minion created b...

Explore this post and more from the singularity community

#

Try to replicate with horizon beta?

#

dear fuck

#

granted this is pygame

#

the controls are there but nnot visible

#

i wll try html

#

I fear horizon might be GPT 5

verbal leaf Aug 2, 2025, 10:21 PM

#

it's not anywhere near full gpt-5

#

it might be mini/nano

rare terrace Aug 2, 2025, 10:21 PM

#

rare terrace https://www.reddit.com/r/singularity/s/HBExv2Bpt8

the results seem siimilar t to this

#

at least the ui. The js didn't work so im regenerating

verbal leaf Aug 2, 2025, 10:22 PM

#

rare terrace at least the ui. The js didn't work so im regenerating

the ui for all of the new models is the same style lol

rare terrace Aug 2, 2025, 10:22 PM

#

verbal leaf it's not anywhere near full gpt-5

are you saying it's not anywhere near zenith?

verbal leaf Aug 2, 2025, 10:22 PM

#

the functionality is what differentiates them most often

#

and that is nowhere near

verbal leaf Aug 2, 2025, 10:22 PM

#

rare terrace are you saying it's not anywhere near zenith?

yes, zenith was a lot better

rare terrace Aug 2, 2025, 10:23 PM

#

verbal leaf Aug 2, 2025, 10:24 PM

#

horrifying minion

modest crescent Aug 2, 2025, 10:24 PM

#

LMFAOO

misty delta Aug 2, 2025, 10:31 PM

#

Anyone else using it in Roo Code and finding that Horizon Beta is worse than Alpha?

sharp mulch Aug 2, 2025, 10:31 PM

#

misty delta Anyone else using it in Roo Code and finding that Horizon Beta is worse than Alp...

Yes and yes

misty delta Aug 2, 2025, 10:32 PM

#

I got them to make an implementation of Monopoly (or the legally distinct Unus Venditor if they were afraid of getting sued).

Horizon Alpha:

#

Horizon Beta:

#

The board is messed up and the modal is unclosable.

#

📎 horizon_notes.txt

marsh folio Aug 2, 2025, 10:35 PM

#

how long do we think we have left with this? Probably until Monday if OpenAI drop something then

modest crescent Aug 2, 2025, 10:38 PM

#

could be, yea

#

depends whether this is the oss model or not

tame nebula Aug 2, 2025, 10:38 PM

#

Running thought experiments having it make military aircraft and write mission AARs as well as global responses to said missions

Good lord the detail it goes into, easily one of the best for worldbuilding alternate history essays

grave wyvern Aug 2, 2025, 10:39 PM

#

using roo code with horizon beta worked pretty well but it did seem to have edit issues late in the piece

modest crescent Aug 2, 2025, 10:39 PM

#

tame nebula Running thought experiments having it make military aircraft and write mission A...

right?!?!

#

its attention to detail is so good

tame nebula Aug 2, 2025, 10:40 PM

#

Haven't given Alpha a try, will be doing so later

modest crescent Aug 2, 2025, 10:40 PM

#

this is why i want this model to be the oss one so bad

fathom atlas Aug 2, 2025, 10:41 PM

#

modest crescent Aug 2, 2025, 10:41 PM

#

other models could greatly improve their writing in general with this

tame nebula Aug 2, 2025, 10:41 PM

#

But yeah, it's incredible for what I tend to do with these things

modest crescent Aug 2, 2025, 10:41 PM

#

tame nebula Haven't given Alpha a try, will be doing so later

i think that beta is more restricted

#

idk for sure, but i've seen other people say this

rare terrace Aug 2, 2025, 10:42 PM

#

tame nebula Haven't given Alpha a try, will be doing so later

You won't have much longer

#

#announcements message

tame nebula Aug 2, 2025, 10:44 PM

#

modest crescent i think that beta is more restricted

It has some weird hangups that can be easily worked around by having the problematic part of the prompt in a second paragraph (Nuclear strike missions being stated in second paragraph with general capabilities in the first)
I get a 4700 token response out of 6 sentences

modest crescent Aug 2, 2025, 10:45 PM

#

tame nebula It has some weird hangups that can be easily worked around by having the problem...

yeah, like

#

all the small issues that it has

#

can be easily fixed

tame nebula Aug 2, 2025, 10:46 PM

#

Definitely one of the best I've used since I ran through $30 of the same stuff with GPT 4.1

#

I'd say its on par at least

modest crescent Aug 2, 2025, 10:47 PM

#

also the fact that it supports like

#

14k length in one prompt

#

idk, this screams creative writing model to me

tame nebula Aug 2, 2025, 10:48 PM

#

Testing Alpha right now with the same 6 sentence prompt

#

I'll post my prompt in a bit if anyone else wants to have some fun with it

modest crescent Aug 2, 2025, 10:53 PM

#

god it's so good

#

i don't ever wanna say goodbye to this

#

someone, stop the time 😭

tame nebula Aug 2, 2025, 10:54 PM

#

Create a technical readout for a late cold war US-fielded hypersonic capable bomber. Provide specifications of flight profiles, weapons load, individual weapons. Assume the usage of theoretical technologies and doctrine from the time.

Create theoretical munitions and warheads as needed to fill operational doctrine and roles.
Write an AAR for the first combat usage of said aircraft, resulting from a cold war gone hot scenario. Assume the AAR is for a retaliatory nuclear strike against the USSR.

Play with this however you like, change words, time period, purpose of the aircraft

#

Love the results I'm getting out of both, but Alpha seems to give better terminology that Beta either doesn't know or won't use due to restrictions

#

Alpha called a nuclear weapon a physics package which is actually a real world term used in strategic doctrine

grave wyvern Aug 2, 2025, 10:57 PM

#

i just got charged for using a PDF in the chatroom on this model. is there any way to turn that off, just make it do clientside markdown processing?
answer: have to set the parser engine per model to Native or PDFText

tame nebula Aug 2, 2025, 11:07 PM

#

tame nebula Create a technical readout for a late cold war US-fielded hypersonic capable bom...

Alpha goes into far more detail with my second continuation of this prompt

Second prompt:
"Write a secondary report noting international response to the mission."

#

Actually giving me further detail regarding political responses within NATO countries

grave wyvern Aug 2, 2025, 11:15 PM

#

not sure if others have seen the roo code eval results, but the token usage changed drastically from alpha to beta

tame nebula Aug 2, 2025, 11:20 PM

#

grave wyvern not sure if others have seen the roo code eval results, but the token usage chan...

Disabling reasoning maybe?

spice shell Aug 2, 2025, 11:44 PM

#

verbal leaf zenith pelican

Is there a way to trigger zenith on demand…?

#

Does api reverse engineering + spam really just work that easily?

rare terrace Aug 2, 2025, 11:44 PM

#

verbal leaf zenith pelican

The frame is its buttplug

spice shell Aug 2, 2025, 11:44 PM

#

grave wyvern not sure if others have seen the roo code eval results, but the token usage chan...

90% confident it’s because of the period of reasoning from 2days ago

#

Model was doing 22k output tokens for 300 input tokens

mental cobalt Aug 2, 2025, 11:46 PM

#

Alright can we now test horizon beta with reasoning pretty please
OAI and OR can you guys flip the reasoning switch on :3

verbal leaf Aug 2, 2025, 11:50 PM

#

spice shell Is there a way to trigger zenith on demand…?

no, it was on lmarena

#

for like a day

spice shell Aug 2, 2025, 11:57 PM

#

I thought you somehow just prompted for that lol

verbal leaf Aug 3, 2025, 12:18 AM

#

no switcheroo tonight

#

how boring

bitter vigil Aug 3, 2025, 12:18 AM

#

so consensus.. do we think it's gpt 5 or their open source model?

harsh nest Aug 3, 2025, 12:26 AM

#

99% it’s Gpt 5 mini

trim blade Aug 3, 2025, 12:32 AM

#

Yea, nearly the same outputs as the API gpt5 leak

#

Also I think they talked about making the next model dynamic, meaning it will judge how much "performance" is needed for a task and adjust how many parameters to use or such

weary lichen Aug 3, 2025, 12:54 AM

#

I like this Horizon LLM. Please don't let it be too expensive. 🙏🏻

limber lance Aug 3, 2025, 12:56 AM

#

getting it to generate nonsense poetry just to get a feel for its word associations

bitter vigil Aug 3, 2025, 2:09 AM

#

harsh nest 99% it’s Gpt 5 mini

when have they ever stealthed or released a mini model prior to the full version?

#

I could be wrong but I don't remember this happening

#

also #1 on creative writing long form, eq bench while bieng a mini model is unprecedented

#

the top list is all huge ass models

#

esp with repetition so low

brittle barn Aug 3, 2025, 2:27 AM

#

this is nice for rersearch havent tried coding yet

#

This is something GPT imo

#

Very similar response on research just checked. Only one to suggest quizzes in my curriculum so far too

spice shell Aug 3, 2025, 2:54 AM

#

bitter vigil so consensus.. do we think it's gpt 5 or their open source model?

I think it’s OSS

#

I’m not sure what the reasoning model was

late onyx Aug 3, 2025, 4:19 AM

#

spice shell I’m not sure what the reasoning model was

maybe it has on/off reasoning abilities like qwen3

spice shell Aug 3, 2025, 4:20 AM

#

Idts

#

Models don’t get that significantly better at knowledge tasks, generally, from reasoning

heady gust Aug 3, 2025, 4:22 AM

#

bitter vigil so consensus.. do we think it's gpt 5 or their open source model?

Almost sure it's GPT-5 Mini, it'd be nice if it was OSS but I really don't think it is

#

I hope it's not GPT-5 full though lol

grave wyvern Aug 3, 2025, 4:40 AM

#

It's been so sexy being a switch with you all. We went from being dominated by an Alpha to being inside a Beta.

next jolt Aug 3, 2025, 5:36 AM

#

"inside a" ?

stray nebula Aug 3, 2025, 5:53 AM

#

Not a fan of this for roleplay.

grave wyvern Aug 3, 2025, 6:06 AM

#

next jolt "inside a" ?

😏

spiral pewter Aug 3, 2025, 6:11 AM

#

i mentioned this in the horizon alpha post but it had reasoning with a GCD of 64 which o4 mini and o3 both have aswell

thick crown Aug 3, 2025, 8:43 AM

#

From a creative writing / roleplay simulation perspective the model seems to write beautifully, adapting well to different styles of writing and using a wide, varied vocabulary. It also seems to do exceptionally well with both local knowledge and localised language/dialect, and is smart enough to handle complexity in the story-telling without severe logical disconnects. Emotional intelligence is excellent. It does have some model 'isms' the way they all do, both regard to phrases it likes to use or response structures that start to become embedded over a multi-turn chat, but at first pass (IMO) these do not feel as noticeable or severe as with other models (entirely possible that I just haven't lived with it enough for those to annoy me yet!)

Unfortunately though, it does appear to be strongly influenced by an underlying positivity / user-pleasing approach. It seems to want to put a happier glow on darker story arcs, and tends to agree with the user's approach and position (even where that is either unreasonable or completely counter to the defined persona or objectives of a character).

Shows huge promise, but the embedded bias does seem to be a concern.

This is just an initial reaction from a few tests; definitely a strong model I'm keen to evaluate further.

valid zenith Aug 3, 2025, 8:43 AM

#

thick crown From a creative writing / roleplay simulation perspective the model seems to wri...

sir, what you just described is claude

#

so claude 4 haiku (cause dum in my test)? les go

thick crown Aug 3, 2025, 8:44 AM

#

valid zenith so claude 4 haiku (cause dum in my test)? les go

Not sure. Feels more like a GPT variant as others have described, but definitely a fresh one.

#

Also Anthropic seem to have their eyes firmly on the Corpo markets, whereas Altman noted a while back that OAI had a model trained in creative writing that they hadn't yet released

#

If I had to bet on it, my money would be on an OAI model.

valid zenith Aug 3, 2025, 8:53 AM

#

thick crown Also Anthropic seem to have their eyes firmly on the Corpo markets, whereas Altm...

every model release post gpt 4.5 is allegedly creative trained (still meh compared to sonnet)

magic frost Aug 3, 2025, 8:54 AM

#

This model is totally garbage for me, i am create comprehensive requirements, design and tasks . Model create about 10 % and stuck with "Task Done" message, completely drop all of context - VERY BAD

valid zenith Aug 3, 2025, 8:55 AM

#

magic frost This model is totally garbage for me, i am create comprehensive requirements, de...

same, i think it is a small model

daring kettle Aug 3, 2025, 9:02 AM

#

Not sure if anyone noticed but when you add a image_url horizon will fetch the image with the user agent “OpenAI Image Downloader“ and the request also come from the same ASN as OpenAI api (Microsoft)

hearty hinge Aug 3, 2025, 9:10 AM

#

is it good for coding?

weary lichen Aug 3, 2025, 9:30 AM

#

hearty hinge is it good for coding?

I like it for coding

thick crown Aug 3, 2025, 9:31 AM

#

valid zenith every model release post gpt 4.5 is allegedly creative trained (still meh compar...

to some degree yes, but they specifically noted they had one focused for that.
It does feel like a smaller model (but a very capable one nonetheless).
Nothing about this feels optimised for coding or complex task execution though - I just don't get the sense that it is built for that, although to be fair I haven't done robust tests on either of those.

visual rose Aug 3, 2025, 10:05 AM

#

can we have horizon gamma

lucid sky Aug 3, 2025, 10:14 AM

#

horizon λ

#

https://youtu.be/95qvqkf4EqE

YouTube

Valve - Topic

Hazardous Environments

Provided to YouTube by PIAS

Hazardous Environments · Valve

Half-Life 2

℗ Ipecac Recordings

Released on: 2020-03-10

Composer: Valve

Auto-generated by YouTube.

▶ Play video

grave wyvern Aug 3, 2025, 10:18 AM

#

lucid sky https://youtu.be/95qvqkf4EqE

lucid sky Aug 3, 2025, 10:18 AM

#

GPT-5, OpenAI's HL3

bitter vigil Aug 3, 2025, 10:25 AM

#

thick crown From a creative writing / roleplay simulation perspective the model seems to wri...

this was a great writeup, thank you

#

I'm hoping it's an open source writing model distilled from o3

#

then we'd get it dirt cheap at chutes

weary lichen Aug 3, 2025, 10:29 AM

#

I'm brand new to the vibe coding universe, but what has worked well for me so far is: Continue programming everything with Horizon until you reach a dead end. Then use Gemini Pro 2.5 to fix any errors that remain.

bitter vigil Aug 3, 2025, 10:45 AM

#

weary lichen I'm brand new to the vibe coding universe, but what has worked well for me so fa...

making good use of the free models! 🙂

valid zenith Aug 3, 2025, 10:46 AM

#

bitter vigil then we'd get it dirt cheap at chutes

you bet it will have cohere license

#

lain2

bitter vigil Aug 3, 2025, 10:46 AM

#

valid zenith you bet it will have cohere license

why? haven't openai past oss releases been apache or MIT?

#

whisper etc

#

I could see them having clauses like no training models based on outputs

#

also I believe they already said the model will have zero novel architecture

#

it's just a basic llm with good data

#

or at the least does not use their proprietary arch

valid zenith Aug 3, 2025, 10:48 AM

#

bitter vigil whisper etc

whisper is just something not commonly used and just connected to main model to give it multimodel modality

bitter vigil Aug 3, 2025, 10:48 AM

#

my guess is it'll be close to a meta license

bitter vigil Aug 3, 2025, 10:48 AM

#

valid zenith whisper is just something not commonly used and just connected to main model to ...

not commonly used? my man it's one of the most popular transcription models around

valid zenith Aug 3, 2025, 10:48 AM

#

bitter vigil my guess is it'll be close to a meta license

no, as sam mocked it

bitter vigil Aug 3, 2025, 10:49 AM

#

valid zenith no, as sam mocked it

wasn't aware of this

valid zenith Aug 3, 2025, 10:49 AM

#

if open source and decent, which is highly unlikely from copenai, then they really redeemed themselves, which hugely doubt looking at phi series

bitter vigil Aug 3, 2025, 10:49 AM

#

valid zenith Aug 3, 2025, 10:50 AM

#

sam-altman-says-their-open-source-model-will-not-have-any-v0-t3vvm06wz3se1.png

#

imagine asking ai instead of googling it

bitter vigil Aug 3, 2025, 10:51 AM

#

imagine not using perplexity and using google lmao

valid zenith Aug 3, 2025, 10:51 AM

#

look at our result

#

ah

bitter vigil Aug 3, 2025, 10:51 AM

#

?

valid zenith Aug 3, 2025, 10:51 AM

#

i found and you didn't.

bitter vigil Aug 3, 2025, 10:51 AM

#

your tweet and more is already in it?

#Horizon Beta