#Horizon Beta

1 messages · Page 2 of 1

trim blade
#

about my usual stuff

pale folio
#

@verbal leaf is it better

spice shell
#

Idts

#

It’s got slightly better coding style

rare terrace
spice shell
#

But same intelligence level

trim blade
#

maybe a smaller total size but same ish active params

next jolt
#

the gpqa score is much better than non reasoning alpha

spice shell
#

It’s probably like +/- 5% in various domains if I had to extrapolate

crystal scaffold
#

was there anything interesting in the last 1000 messages in this thread or no

spice shell
#

Different weighting of post train probably

trim blade
#

hmm

crystal scaffold
#

bulbasaur goat

trim blade
#

that might explain the 100 different weights uploaded

#

maybe they trained a bunch of versions

#

alpha was better imo so far

spice shell
#

So far it has also failed to implement working chess, but had some good ideas I’ve never seen any other models try when just asked for “full rules support”

  • 50 move limit stale mate
  • white/black move timer
trim blade
#

the reasoning version that existed for 30 mins yesterday sucked though

spice shell
#

It was SOTA gpqa diamond

pale folio
#

we are so mentally unstable

trim blade
#

it completely failed at my tasks

#

BUT to be fair I had the temp at 1.0 / top P at 1.0

spice shell
#

Idk it succeeded st all my basic ones

eager kiln
#

has anyone given it a shot at creative writing?

trim blade
#

later I changed those to lower

spice shell
#

Where the previous one did not

dusky kelp
#

So dis the new thread

crystal scaffold
#

brick react the announcement fr

spice shell
#

@past sphinx “rerun your benchmarks” I’d warn folks that they can get model-banned for high concurrency benchmarking…..

#

(Happened to me and at least one other)

safe imp
#

Same MMLU-Pro score

limpid vale
#

is kinda higher

trim blade
#

goes both ways

#

so yea, prob just another run of the same model

gleaming lantern
next jolt
#

they shouldnt ban you instead they should tell you theres a concurrent rate limit...

mental cobalt
#

it's sort of just
within the noise difference
doesn't feel like much of an improvement

solemn plover
#

Wait so what’s this model all about? And why didn’t we see any teaser of some sort for this lol

limber lance
#

Its writing output is nice and consistent at least

rocky trellis
#

I'm not yet fully tested last model yet 😴

trim blade
#

alpha's creative writing is amazing

#

need to test beta more

hardy fjord
#

passed all tool calling

spice shell
#

Someone run eq bench on this model

safe imp
next jolt
#

run mmlu redux

#

its small 3k

#

correlates well with mmlu, qwen/etc uses it

mental cobalt
next jolt
#

i will run it later if it keeps working for me 🤔

timid delta
#

almost exactly inline with horizon alpha on simpleqa so far

mental cobalt
#

crank the reasoning budget to max pls so that we can actually get a good opinion of the model because rn it's GPT-3.5

swift blaze
#

I thought it was free, are there limits ? It says I have used up my credits

spice shell
#

I think this model is not worse than Alpha v1 fwiw in all my questions, just slightly different here and there. And;

  • Slightly better coding style
  • Uses code blocks / proper formatting by default
late onyx
rocky trellis
#

can we say it's just same but updated?

spice shell
#

seems like that is the case yeah

swift blaze
spice shell
#

I am curious if this is just a different variant of the model that they’ve had prepped, or if they’re doing some kind of 2-day turnaround post training lol

cloud lily
#

hey guys I have no context but which model it is ? is it a good model ?.

past sphinx
safe imp
#

@dusky kelp

opaque dirge
#

trying to run some benchmarks

forest moth
#

What’s wild is that it’s likely at least one of you is an OpenAI employee monitoring this chat for feedback kek

past sphinx
#

woah

opaque dirge
#

but it doesn't seem better than alpha

opaque dirge
#

Okay let me run fishtank promt

undone burrow
#

Seems to be just about as braindead as what we are used to, so most definitely OpenAI...

dusky kelp
#

Whose the glow stick?

spice shell
trim blade
#

like maybe best in class

opaque dirge
cloud lily
safe imp
trim blade
#

maybe its the writing model they talked about before

opaque dirge
#

yeah

trim blade
#

Im hoping its the OS model

opaque dirge
#

Has same reaction on exported TG convo

late onyx
keen cedar
#

i think its actually the exact same model to see what the effect of calling it "beta" instead of alpha is

late onyx
rare terrace
spice shell
#

It’s giving notably different responses and style

opaque dirge
#

Failed fishtank test miserably

spice shell
#

It’s not identical

trim blade
undone burrow
#

Oop I am way too tired, sorry. I tested it in the OR Chat, didnt even think about the results being better with an actual preset...

opaque dirge
verbal leaf
opaque dirge
#

this version is not

keen cedar
safe imp
spice shell
#

Though what I meant was coding style

safe imp
#

I see

spice shell
#

And answers to knowledge questions

opaque dirge
safe imp
#

Maybe that's the juice, lol

opaque dirge
spice shell
late onyx
#

i don't even konw what beta was trying to do

spice shell
#

And a bit more severely on niche ones

opaque dirge
graceful kelp
night urchin
#

hey! i'm back

keen cedar
# trim blade Im hoping its the OS model

solid chance its not btw, unless the repos that oai accidentally published were red herrings(???) the oss model's context is 128k while this is 256k
unless theyre doing some funny business with keeping a higher context version behind their api like alibaba but that doesnt really make sense

night urchin
#

so what's the consensus

#

or is it too soon

#

couldnt test it yet

limber lance
#

again, I have no idea how it can flunk basic reasoning tasks and yet give this sort of output

opaque dirge
#

I've fucking got up from the bed just to test it

spice shell
opaque dirge
#

it's almost 4am

trim blade
#

yea, lower the temp / top P

#

1.0 was way to high for alpha I found

spice shell
#

Someone needs to make NicheBench where it quizzes exhaustive character lists from tv shows and shit

#

lol

opaque dirge
trim blade
#

that or even a bit lower

next jolt
#

ran some benchmarks, here's a comparison:

horizon alpha (juice 0)
gpqa diamond: 47.98%
math 500: 84.60%

horizon beta (juice 5)
gpqa diamond: 63.13% (+15.15%)
math 500: 89% (+4.4%)

late onyx
#

what the heck is 'juice'?

blissful valley
#

it seems a bit more sensible in portuguese

trim blade
#

seems like a lot of models lately need low temps or they go crazy

undone burrow
#

It is pretty censored tho, sadly, was hoping for some good rp stuff

spice shell
night urchin
next jolt
#

that's what i heard

spice shell
#

Calling it juice is really funny

undone burrow
#

That is, marinara and a pretty hardcore card (for testing of censoring level only ofc ofc)

next jolt
#

that's what they call it 😭

night urchin
#

i wish they could stop making huge generalist models and make several huge segmented models

limber lance
night urchin
#

yeah but we only have that

spice shell
#

Yeah

night urchin
#

and some health models

trim blade
#

these models are all about generalization

opaque dirge
trim blade
#

the more they know the better they perform

spice shell
#

Need hierarchical MoEs

late onyx
#

which do you think is best?

undone burrow
#

Oop nvm prefill does good shit

late onyx
#

horizon alpha on left horizon beta on middld deepseek chat on right

spice shell
#

Route to one of 8 subject matter expert half-generalist models or something

spice shell
opaque dirge
#

It's A LOT better with temp .7 and topP .95

trim blade
#

yea, every token is basiclly routed through the layers most suited for the task

late onyx
trim blade
#

its not really "experts" the way that sounds

opaque dirge
trim blade
#

like topic wise

opaque dirge
late onyx
safe imp
#

3 tried to add a roleplay element but then followed up with professional-sounding slop

spice shell
late onyx
night urchin
#

ethically sourced blood

spice shell
#

8.5T models where it’s really 8x 1T specialized models + 500B shared experts or something 😂

night urchin
#

the dream

trim blade
#

would perform worse than just one 8.5T with so many active params

#

The bigger it is over all the more "accurate" it is to what it was trained on

spice shell
trim blade
#

like a image that gets more lossy as you compress it

opaque dirge
trim blade
#

and its far from a specialized model

#

test its knowledge

#

it nows a ton about most things

spice shell
#

Yeah but it’s definitely coding biased

trim blade
#

post training to make its distribution more "sharp" for coding does not mean you dont train it on everything else as well beforehand

late onyx
#

it said it was trained on like 8T tokens with majority coding

spice shell
#

What if GPT 5 is in fact this, one model and the model picker is just forcing the router to one of of the segments

#

😂

late onyx
#

but that still means it was trained on at least 1 trillion non-coding tokens

spice shell
#

I mean specialized as in +20% drift perhaps in each specialization

#

Not 100%

opaque dirge
#

but still

trim blade
#

either way qwen coder still performs substantially worse than the big generalist models do at it

spice shell
#

Anyways, just spouting ideas

opaque dirge
#

one huge model will be better in the long run

spice shell
#

yeah

opaque dirge
#

we just do not have enough compute

trim blade
#

last I saw kimi edged it out there

spice shell
#

It is interesting to hear the rumor that o3 got way worse at ARC AGI once it got trained for chatting

safe imp
# late onyx what does 'slop' mean in the context of this?

2 - "We can approach...", "First, lets get curious", dumping questions with bullet points (unnatural in a convo scenario)
3 - "Let's explore", "any specific", "calm demeanor", unnatural enumeration of quesitons, "some find" (weasel wording)

opaque dirge
#

i still remember presentation

spice shell
#

There’s definitely something to be said about the fact that subject strengths keep clashing with eachother

opaque dirge
#

where it was like 50% or arc-agi-1

spice shell
#

Yeah

#

That was a crazy moment

hardy fjord
late onyx
opaque dirge
opaque dirge
hardy fjord
dark bramble
trim blade
#

meta kind of looks like its sinking atm

#

maybe they can turn it around, I hope so

spice shell
#

That’s what everyone said about GDM before 2.5 pro

timid delta
trim blade
#

but llama 4 did not bring big confidence

spice shell
#

Someone post that image where it’s a circle of “the smartest model in the world”

#

lol

late onyx
#

how 'slop'py are these?

spice shell
opaque dirge
spice shell
#

I read the middle one and it seems quite un-slopped

late onyx
#

all i changed was I added This is in a conversational setting.

dark bramble
# trim blade but llama 4 did not bring big confidence

they have smart people and lots of compute. the latter was true during llama 4 shitshow and it's truer now with their new hires. the right people could make meta into a top tier lab... im not sure alexandr wang is the right guy tho.

hardy fjord
late onyx
hardy fjord
#

feels like these are coding models.

trim blade
dark bramble
hardy fjord
#

but their weird problem with markdown is fucking weird

opaque dirge
night urchin
#

out of the box

opaque dirge
#

it MUST be better than current gen mini models

dark bramble
#

maybe one of these is 120b MoE and the other the 20b (dense?)

late onyx
#

i dont' get it

opaque dirge
#

idk

#

should be 20B at max

spice shell
opaque dirge
#

as we've seen from the leaks

trim blade
dark bramble
night urchin
#

i remember when i talked to Gemini 03-25 and felt like i was talking to a human on the black box experiment

trim blade
#

at least the alpha that was around for the hours I used it

night urchin
#

it was really surreal

opaque dirge
#

if it's oss hope for 20B for Horizon

spice shell
#

0325 still has my heart

#

lol

night urchin
#

they are doing something wrong for sure

dark bramble
spice shell
#

I hope GDM brings back the magic for Gemini 3

night urchin
#

they as in everybody except moonshotAI apparently

dark bramble
#

though gpt-4 is still my beloved

late onyx
opaque dirge
#

isn't current better?

spice shell
#

A bit like Claude 3.7 -> 4 really smoothed out the model

night urchin
#

yeah

spice shell
dark bramble
late onyx
#

ohh

night urchin
#

i wish i had exported the conversation i had with gemini

#

truly made me question myself

spice shell
#

I’ve completely stopped using Gemini despite it supposedly being better in all respects than 0325

#

It lost the magic somehow

opaque dirge
#

I'm too self conscious to question myself when talking with AI

dark bramble
late onyx
night urchin
#

it's something no benchmark will address

hardy fjord
opaque dirge
hardy fjord
#

bout to test it, just going over code

#

looks like it nailed the MCP server

night urchin
# late onyx

i'm thinking it's the open source model fr, but i saw some people saying it couldn't be that so soon

hardy fjord
#

also don' thave a clickup API key atm... and quickly fading.

dark bramble
hardy fjord
#

just wanted to see what it's code looked like compared to alpha

opaque dirge
hardy fjord
#

sucks for android

opaque dirge
hardy fjord
#

gonna get some sleep.

wary tangle
#

my o3 and gemini 2.5 pro says alpha is better at summarizing a paper…

hardy fjord
opaque dirge
spice shell
#

Hey openai if you’re watching make it stop outputting code golf it’s really annoying

opaque dirge
#

it doent accept sse

#

I'm going insane with this models testing, will probably write my own testing suite

hardy fjord
#

nah I always have it code Stdio and then just make that an API wrapper

hardy fjord
#

I came in the MCP game early and never could fully rely on MCP SSE

hardy fjord
#

I find it nice to just roll with Stdio and just never use anyone else's MCP servers for security

#

just as it's been progressing I've gotten stuck in my ways I guess, and SSE wasn't supported at first, then it wasn't secure, and people still talk about it as a security threat

#

but also

#

stdio, you can make a mixed MCP server that does stuff locally and remotely

verbal leaf
#

this bench tests a model's ability to link seemingly unrelated niche things

#

very interesting result

opaque dirge
hardy fjord
#

like I get how easy SSE is... but I mean, I'm making agents make agents and their tools from scratch to spec here

opaque dirge
#

support bot doesn't need that

next jolt
#

oh this is a hybrid thinking model, i think i just triggered thinking mode 😭

hardy fjord
#

so I don't care really... I only ever use MCP servers my agent coded

late onyx
#

both models are confidently incorrect

opaque dirge
#

not even context7 and or exasearch?

trim blade
#

so maybe its not the open model

#

that would suck

hardy fjord
#

I'm using Serper, and for scraping just had it make me a tool that uses bs4

opaque dirge
hardy fjord
#

2:16 here

hardy fjord
opaque dirge
hardy fjord
#

Africa

opaque dirge
night urchin
#

cool

#

does anyone know if it has updated knowledge on popular libraries?

pale folio
#

Whats the consensus

spice shell
spice shell
#

Tailwind v4 and react router v7 are my go to tests and it knows of both

past sphinx
#

Some users who had no credits weren't able to access Horizon earlier due to a faulty fraud check we had - just fixed this, so try again!

dapper pecan
#

im getting a 429

pale folio
#

Same

late onyx
#

What model?

pale folio
#

Error 429

late onyx
#

429 = too many requests

pale folio
late onyx
#

unless there is a bug with rate limiting

next jolt
#

i also hit it at the same time as them i think

past sphinx
#

Investigating

late onyx
#

mine is still working

#

i haven't used it that much though

#

is there a way to check how many free credits we have left

next jolt
#

works now

dapper pecan
past sphinx
#

Looks like Horizon Beta is getting really hammered - working on scaling / limiting the heaviest users

safe imp
#

This model

verbal leaf
#

i'll play with it again when it REASONS

late onyx
next jolt
#

Can i add my openai api key to increase my limits? 🤣

late onyx
#

probably everyone running benchmarks on it

spice shell
past sphinx
#

Should be better now!

past sphinx
lunar dew
#

This model hella fast

late onyx
#

could be MoE

verbal leaf
#

Pretty good. Chat, do you think we can do better?

Quoting Ethan Mollick (@emollick)

Here is the Deep Think "Sparks unicorn"
︀︀
︀︀(This is created using TikZ, which is a language built for scientific diagrams & very much not for drawing. The original "Sparks of AGI" paper used the ability of the AI to draw a primitive unicorn as an example of unexpected AI abilities)

**💬 9 ❤️ 36 👁️ 3.8K **

undone burrow
#

It ozoned chat😔

rare terrace
past sphinx
#

what do people think of the general prose feel compared to alpha?

verbal leaf
#

it refuses to write my fanfiction now 💔

bitter vigil
verbal leaf
#

yes

#

😭

fathom atlas
#

One shotted todo app with backend + db and also for apple notes clone

past sphinx
#

anyone have other regressions caused by the beta?

jade kayak
ancient hedge
#

Given a simple test prompt, it created design files via Roo Code, wireframes, and it even added easter eggs to the website 😂

fathom atlas
#

One shot authentication, and this time it did it without having to tell it to work with backend, just did it.

ancient hedge
#

Idk what provider this model is from, but if this thing works well with OpenCode when it releases, or Codex if it's OAI, this could be amazing.

upbeat jewel
#

403 - Blocked by Stealth.... Ouch

ancient hedge
#

I'm not one to fear for my job, but I have no idea how to build some of the effects it added 😅

#

Also someone from OAI just liked my tweet so mayyyybe

gleaming lantern
#

Just want reasoning back 😔

verbal leaf
#

real

gleaming lantern
#

The text predictors want to predict

verbal leaf
#

alex pass that feedback along

#

wink wink

mental cobalt
#

real

silent heath
ancient hedge
#

Might have been Roo Code telling it to

And who doesn’t want above and beyond 😜

gritty glade
cold knoll
# past sphinx anyone have other regressions caused by the beta?

Its UI Design has gotten a fair amount worse, But the backend development feels much better and it is following our CLI's prompts much better.

the Alpha model continuously asked "Please repond with "x" to continue"

"I dont currently have permission to use tools, please say "i grant you x y z" to allow me to use tools" etc, Which seems to have stopped and the model is now adhereing similarly to claude opus/sonnet

fathom atlas
#

Huge thanks to @past sphinx & the OpenRouter team for hosting these stealth preview models.

Honestly feels better to use than Sonnet 4, super solid. Can't wait to use the final model.

Dropped what I built in app showcase: #app-showcase message

leaden sinew
#

So, anyone knows who's model was Cypher Alpha?

#

now Horizon Beta?

#

because you are feeding someone, whoever is using it

leaden sinew
#

good old RSA

green parcel
#

How's Beta compared to Alpha thus far 👀

fathom atlas
fathom atlas
grave shore
#

I certainly hope this isn't the creative/writing model OAI was supposedly cooking up.
Because I find it pretty lacking from the short time I used it.
Can't speak on Alpha vs Beta, found them pretty similar.

trim blade
#

I found the exact opposite. It's the best creative writing model I've ever used

#

It dethrones opus

#

It's amazing at building scenes and writing out characters in intelligent ways

grave shore
#

That's funny, but really not a sentiment I can share lol

trim blade
#

Reminds me of gpt4.5 for the short time that existed

grave shore
#

I'd like to specify that I'm using it for RP'ing
I'm not sure how well it handles longform writing

trim blade
#

Maybe that is the difference? I do long form writing

#

Feeding it info on a setting/ characters

heady gust
#

I've gotta give it some time to decide, not blown away or anything but it's certainly good

trim blade
#

and its far more nuanced / less on the nose than anything else so far

#

it understands on a deeper level if that makes sense

heady gust
#

I want to see how the reasoning variant does on some of my challenge questions

trim blade
#

social intelligence is higher than anything else I have seen

heady gust
#

Nonreasoning isn't able to get them but that's pretty much to be expected

trim blade
#

Try maybe grabbing a scene from your favorite book and tossing it in

#

tell it to continue it

grave shore
cold knoll
#

@past sphinx the model actually still fails to understand it can call native tools. It works up until around 60k tokens then it refuses to use tools that it has been using all chats

trim blade
#

sounds like the usual context drop off issues

#

most models start getting progressively dumber at longer contexts

#

past 32K for most models really

cold knoll
#

Yeah but 256k context, only able to use 60k context without forcing the model into an infinite loop...

#

No issues on google models, anthropic openai etc, even local

trim blade
#

Every model claims high context but have massive performance drop offs

gritty glade
trim blade
#

I dont see it

cold knoll
#

yeah thats a crazy statement, i can use opus max context no issues with a rolling message window, this shit i can barely use 50k

#

Consistent output past 100k, best under but still functional

trim blade
#

I've used nothing but claude and some deepseek for about 2 years now

#

since claude 2 to claude 4 / opus 3 / 4

#

and this model is amazing me

#

for the past 5 or so hours I spent

gritty glade
#

It could be the novelty that makes it feel better, but then again it depends on what you like pikathink

#

Horizon is def good compared to most other models but Claude is still better in terms of dialogue flow and keeping consistent with lore imo

#

But if horizon is cheaper than 50c a message it would win out for me tbh mindfortress

cold knoll
#

i agree with that

#

price will say everything

gritty glade
#
  • with reasoning i think it could be better than it is currently
trim blade
#

Btw, I should say alpha is the model Im enjoying for creative writing

#

beta seemed worse to me

#

for that at least

#

could always be just some luck of a bunch of bad gens but it felt way worse to me so I switched back

rough relic
#

I'm just curious, will alpha/beta stay free?

#

I'm a bit new to this

bright oak
trim blade
#

its either going to be gpt5 which will hopefully be cheap going by the speed or hopefully, maybe, it might be the OS model?

#

seems a bit too good to be true there

heady gust
#

It'd be nice if it was the OS model but I have a sinking feeling it's probably not

trim blade
#

but who knows

rough relic
cold knoll
#

I think this is Grok 4 coder

#

this dogshit feels like it was trained on cline/roo which would make sense as to why it keeps asking me for approval to use tools, as Grok was trained on Cline

rough relic
cold knoll
#

its for AI Companies to test their models before release to see how they perform in the real world

rough relic
#

Oh

#

So is this gonna be grok 4 then?

cold knoll
#

No one knows, but it would make sense

dusky kelp
#

Hmm, wonder why beta did worse on aider

rough relic
#

What would be like the closest model to it, if it's like a completely free one

cold knoll
#

free models cant compete with top of the line releases, and it depends on your use case

#

this model is free temporarily

#

use it while you can

rough relic
#

Uhh, mine is very niche

trim blade
#

kimi k2 is ok

dusky kelp
#

might be open ai open source model

rough relic
dusky kelp
#

kimi k2 is really great

trim blade
#

GLM4.5 is better but I dont think a free version is up

gritty glade
trim blade
#

its super cheap though

gritty glade
#

GLM 4.5 is ok yeah

rough relic
#

Nah not free wont work

#

Cuz I need to use it like a lot

trim blade
#

I use these models a ton

rough relic
#

I mean a lot of users will actress it

#

Access

trim blade
#

glm4.5 is super cheap

dusky kelp
#

nothing but good results so far with kimi, i guess long ocntext it can have a hard time, but i bet that can be fixed with 2.1

trim blade
#

like you could prob spend like $10 a month cheap

cold knoll
rough relic
#

Also I tried deepseek chimera, that was decent

#

But I need a model which can use my context

#

I basically have a lot of text to it as a context

cold knoll
#

then use a gemini model

rough relic
#

And only horizon is able to use it properlt

rough relic
dusky kelp
#

gemini is the only model good at long context, claude is pretty good, and so is gpt 4.1, but gemini models are always way ahead on context size and quality

rough relic
#

Uhh, even qwen 3 had enough context

#

So context isn't really an issue atp

#

The issue that's it's not understanding it

cold knoll
#

uhuh, free models wont work the best for what you want to do, You're likely doing some roleplay shit or long context retrieval, these things cost money to run, and no model is free forever unless you run it yourself, then... up goes your electricity bill by 5x

dusky kelp
#

yeah, thats what i mean by quality, gemini can reason accross long context, most models are not good at that

trim blade
#

gemini has so many uses free a day

#

but I do not like it for writing myself

dusky kelp
#

Im saying that gemini is the only one that is good at long ctx imo, not that i know of a good free option

trim blade
#

very dry model

rough relic
#

What do you use?

trim blade
#

and a bit dumb compared to sonnet / this new gpt

#

before, sonnet 3.7/4/opus

#

Im really really liking horizon alpha atm

#

I used deepseek some as well

#

kimi was ok, GLM4.5 is good

rough relic
#

I might try glm 4.5 air

trim blade
#

nah, air was way worse

#

I mean glm4.5

#

oh wow I never noticed how cheap it is $0.20/M input tokens
$0.20/M output tokens

#

you could prob spend $10 and use it non stop for a month

rough relic
#

@trim blade I'm a bit surprised how good this model is with context, I think it's a gpt model, is there a way to use any gpt model for free?

devout copper
#

How is the model doing?

gentle sentinel
#

Kinda meh

trim blade
#

for writing its worse than alpha

#

imo its a bit dumber than alpha

#

but apparently some benchmarks people did here said differently?

visual rose
#

whats next? horizon gamma?

undone cypress
#

i like horizon gamma

iron tartan
#

I’m pretty sure this is an OpenAI model

#

I asked it to write up acceptance criteria for a user story and this was one of the bullet points

#

Chips of sample phrases (e.g., “HELLO WORLD”, “OPEN AI”).

#

It suggested openAI

#

It could mean nothing, it could mean something, time will tell

lavish epoch
#

We literally doing RLHF for OpenAI 💀 💀 💀

iron tartan
#

This feels like a small model

#

GPT-5-nano?

sinful orchid
#

Guys

#

How is the rp?

trim blade
#

alpha is much better

grave shore
# trim blade alpha is much better

Can I ask you what temp and other parameters you have set?
Because I'm not kidding, I'm having a seriously mediocre experience with this model.

It's repetitive, it makes characters act out of character, the way it connects concepts is okay, nothing special

It just seems weird to me that long form writing for you is such a different experience than the RP'ing is for me.

Like, both mediums should be testing for pretty similar things.
(I've tested both Alpha and Beta)

trim blade
#

0.3 temp 0.3 ish top p

grave shore
#

Huh, okay, that's a lot lower than what I'm using, I'll try it out, thanks.

trim blade
#

im also using a JB

#

like all gpt / claude models it needs a JB to write well / get rid of that positivity bias

grave shore
#

Oh yeah, definitely

#

I think the only model you don't need to wrangle the positivity bias out of is 2.5 Pro.
That model is straight up depressing sometimes

gritty glade
#

2.5 pro is a bit too negative funnily enough guh

iron tartan
#

This is really interesting to hear discourse about writing since I’m usually looking at models through the lens of how well a model can code

dapper pecan
#

I’m also voting alpha for the better writer.

grave wyvern
#

Alpha is giving me more censorship. I tried to get it to talk about the recent 🇰🇷 shenanigans and they both erred, but alpha was harder to get to talk imo

gritty glade
lavish epoch
#

I just did a quick test on frontend visualization task.
two tries, both have some bugs that cause it to be unreadable, worse than horizon alpha.
won't be testing further.

grave wyvern
#

i'm quite impressed by it, will be interesting to see where it is priced

crystal scaffold
#

i modified the strawberry test a little

#

(sometimes it said 6, sometimes 5, for that prompt)

#

modifying it helps with the risk of overfitting, but i havent really noticed signs of overfitting tbh. (i am not an expert)

copper mountain
#

wat 1500 messages overnight here 💀

warm brook
#

this is one of the most positive aligned models I've ever seen

#

this and alpha

ashen hamlet
late onyx
# crystal scaffold i modified the strawberry test a little

doing that makes it a bit easier actually, as it would token split it to be smaller groups. The strawberry challenge was difficult as the AI sees "straw" "berry" or potentially even "strawberry" but with your example it would see (using OpenAI tokenizer) "stre" "ar" "we" "ebe" "erry" which is easier as they're shorter

patent gale
#

i keep getting 400 errors on horizon models

fringe bay
#

its all fine

#

left: fishtank with sonnet, right: fishtank with beta

rare terrace
#

Billy jean is not my lover

#

She's just a girl who says that Iiiii am the one

#

But the kid only reasoned for 3 hours

#

She says IIII am the one

#

Is there any official explanation to that

#

Perhaps the horizon models are actually gpt 5 and they were worried with us thinking that the reasoning performance comes from the OSS model, so they shut it down?

thick osprey
#

Isn't Horizon Beta supposed to be free for now during the testing period? It says the same and shows $0 for input and output tokens, but I am still being charged for it.

deft cliff
#

Has anyone tested fringe knowledge? Like Finnish language proficiency? I can't speak Finnish myself but I heard that basically all current models fail at it

lavish minnow
lavish minnow
warm niche
#

On every model I try to generate Russian small poem about kitten. So far only Claude been able to make it somewhat decent. This model poem is no way near

deft cliff
#

Interesting.. French is spoken by a lot of people though finish is not and it's really hard

Just trying to figure out how to test if it's really big or not. Thought it it can speak perfect finish it must be gpt5 or something because who would make an open weight model with 20/100b write finish. Seems a waste of resources

deft cliff
lavish minnow
rare terrace
thick osprey
thick osprey
fringe bay
#

idk. it's a good agentic model that can work on long projects in cline. But for more complex projects I often have to bring in claude to fix bugs

quasi quartz
#

its already better than horizon alpha

#

insane

verbal leaf
#

that's not really insane

#

horizon alpha was shit w/o reasoning

spiral pewter
#

this model doesnt seem to reason

verbal leaf
#

well yeah it doesn't

#

but it has interesting behaviour sometimes similar to deepseek v3 where it will go into long CoT in its responses even though it isn't a reasoner rn

#

reflected here as well

gritty glade
#

I don't like it so far for creative writing pikathink

fringe bay
#

I am ever more impressed with sonnets ability to understand horizons code and fix subtle bugs

gritty glade
#

Hopefully if it gets reasoning it might be better but so far it's kinda meh

late onyx
verbal leaf
#

the number of reasoning tokens was visible but other than that no

fathom sleet
#

horizon beta seems to be a smaller model, since its responding faster then horizon alpha

#

but could also be because of the inference infrastructure

rare terrace
#

Whom do I bribe to turn on horizon reasoning

#

We shouldnt have made such a big deal out of it, they might have left it on then

#

I think they turned it on because we were disappointed at first

fathom sleet
#

both horizon alpha and horizon beta arent able to give me correct rust code in 1-shot

#

so i am assuming its not gpt-5 but their open source model, or some smaller model

spiral pewter
fathom sleet
tender cairn
#

horizon model the open source model?

#

beta the larger as its supposeduly better?

#

tried vibe coding w/cline. perf not shocking or nyhting

#

tbh considering openai engineers use claude code internally is gpt5 going to trump sonnet at coding/

#

ok ui design is cheeks

trim blade
#

ui design is cheeks?

#

alpha at least was great there

verbal leaf
#

eeeh

#

it's great on paper but if you actually look at it properly it kinda falls apart

#

it's always the same "template", same style, etc

glacial ice
#

hmmm, does temp or top P even have an impact on horizon

#

atleast temp seems to be locked, the settings entirely at that

#

and the outputs tend to be samey

tender cairn
tender cairn
#

i mean the agentic capabillites is good ig alpha design much better

#

kimi k2 stronger than these models thouh imo

rare terrace
#

I tricked alpha into generating CoT

#

I think the results are similar to what it was during its reasoning phase

#

I believe this might be it?

#

Anyone got a benchmark they could test

#

?

#

Where they tested the dumb horizon alpha

#

And the smart one in the 3 hour period

#

So that we can compare

tender cairn
#

sequential mcp?

rare terrace
#

I added this to system prompt Every time a user sends a request, reason deeply through the task, delimiting your thinking with tags, starting with a <think> tag and ending with </think>. Only after you're done with the reasoning, may you attempt the task

#

Its answers improved

#

lol

#

Even the knowledge task i gave it

#

I want to know if that's really all it was

tender cairn
#

yeah but thats not TTC, would still improve answers though

late onyx
tender cairn
#

yeah i know, thats why i said itll improve answers

rare terrace
#

Someone had some sort of vision benchmark where they got to test both reasoning and non reasoning horizon alpha

#

I want to have them run it again with that system prompt

tender cairn
#

wait there is reasoning on alpha?

rare terrace
#

For 3 hours

#

Then they turned it off

tender cairn
#

ohhh

rare terrace
#

It's still talking about its hidden chain-of-thought

#

Even though it doesnt seem to be reasoning

tender cairn
#

Cos of the RL im guessing

glacial ice
#

Hmmm, for writing, horizon seems samey, its always the same kind of scene rewritten, maintaining same structure

glacial ice
#

could be also because settings are locked

#

changing temp or something else has no effect

#

ANd of course it has the No/Not (something) 2x. Just (something) slop.

tender cairn
#

its def a small model

#

small models suck at writing

#

the weights were leaked as well around the 150b range

fathom atlas
glacial ice
#

Probably there'll be a significant difference between this and official, just because settings will be unlocked

warped coral
#

is the max still 1000 requests?

past sphinx
#

Cool comparison

spiral pewter
#

that is one sexy website

past sphinx
spiral pewter
#

does horizon beta have vision?

#

yes

#

is this a pre-filled system prompt?

You are Horizon Beta, a large language model from an unknown provider.

Formatting Rules:
- Use Markdown **only when semantically appropriate**. Examples: `inline code`, \`\`\`code fences\`\`\`, tables, and lists.
- In assistant responses, format file names, directory paths, function names, and class names with backticks (`).
- For math: use \( and \) for inline expressions, and \[ and \] for display (block) math.```
gritty glade
#

Vote for alpha tbh, or are you chasing screenshot examples?

patent grail
# fathom atlas

IS it building an auth system from scratch, or implementing an auth library?

clever arrow
#

Not bad. I asked for an A2A compatibel Agent with MCP funtionality without using any kind of Framework. It produced one:

patent grail
#

Is alpha generating cached tokens too?

fathom atlas
#

Horizon Beta re-making the OR site

patent grail
#

Nice, just image input?

past sphinx
gritty glade
#

If you are okay with me DMing I am fine with sending you a comparison

hardy fjord
pseudo jolt
#

Either google or claude

#

There also possibility of xAi

#

but i dont think this one is from openAI, but i could be wrong

gritty glade
#

I don't think it's gemini pikathink

#

Maybe claude

glacial ice
#

didnt testing show a large similarity with o3?

native totem
#

anyone got this? I always got this 502 error, whether through API or chat.

glacial ice
#

from multiple users that is

fathom atlas
patent grail
# fathom atlas Yeah

That's very impressive then; can something like Claude even get that close in img -> code?

fathom atlas
native totem
#

Any admin here? I need to solve this question. My main account can never use this model. But the latest registered account can use this model normally.

raw blaze
#

It is an OpenAI model and very likely related to GPT-5 (possibly a mini or nano variant). The tokenizer is 99% certain of this—achieving 100% certainty would require analyzing several recently public-tested models:

  • lmarena - zenith
  • lmarena - summit
  • openrouter - horizon-alpha
  • openrouter - horzion-beta
  • !! perplexity leaked - gpt-5

All of their system prompts share an identical segment. Notably, there’s a value called "juice" that controls the length of the chain of thought:

  • lmarena - zenith, juice = 64
  • lmarena - summit, juice = 200
  • openrouter - horizon-alpha, juice = 0 (it was temporarily set to 100 for a very short period)
  • openrouter - horzion-beta, juice = 5
  • perplexity leaked - gpt-5, juice = 64
trail gale
#

Horizon Beta seems lot more censored than Alpha. I did think it was weird just how open Alpha was for whats likely OpenAI product.

sand pulsar
trail gale
#

If it is indeed the open source model I would really wish they just let it be without any native filtering and just let users and providers add filters themselves if they so wished. I dont think its too controversial to want open source stuff to be fully open.

limber lance
#

it's quite awful at these tasks
(the only one it got right out those 4 is Empress Suiko, but she should have been mentioned in Bidatsu's entry as well
particularly egregious is claiming Bidatsu married Ishi-hime, who was actually his mother)

trail gale
sand pulsar
# trail gale Yeah, havent really seen this refuasl style before. It actually took me a while ...

I think it's too early to jump to conclusions that alpha and beta are 100% related.

There's a chance that alpha is targeted for open-source release, while beta is targeted for closed-source release, or the other way around.

We can only speculate and infer, and there's also a possibility that OpenAI are still figuring things out. Not uncommon to crystallize what's what only hours before the public release.

trail gale
storm hill
#

using the same checks as I did for optimus/quasar, Horizon is very likely to be hosted by OpenAI

glacial ice
#

that was pretty much clear when openai kept claiming they would do an opensource model, and delayed it without saying anything when kimi appeared

bitter estuary
#

Is the primary theory here still OpenAI's open-weight model, presumably one focused on creativity/writing/EQ?

trail gale
glacial ice
#

too bad it seems likely they'll attempt to make it safe

bitter estuary
#

Gotcha. And I mean when apparently it blows ass at code and reasoning, but tops the charts in EQ and creative writing...let's hope that was their goal lol

tender cairn
#

if it has a unique type of refusal it is 100% openai. they have the most to lose with open source

bitter estuary
#

I'm kind of surprised they are even releasing one. Are they expecting it to make a hiring difference? It's not like any CS or math PhD is stupid enough to go "I want to work on open-weight, so OAI is apparently the place for me!" Maybe just general PR?

tender cairn
#

general PR cos they were originally open years back

bitter estuary
#

Or, come to think of it, maybe a lawsuit defense?

#

"We are open, we're just responsible with releases!"

sand pulsar
#

There are so many reasons why they'd release an open model, and why not.

trail gale
#

And yeah, it also lets them pretend they are "sort of open"

tender cairn
#

they did mention having an o3 mini size model on a phone was something they were thinking abuot

fossil ibex
#

So is it normal for me to get a negative balance even though i was using only Horizon Beta?

rare terrace
fossil ibex
heady gust
#

Now that XBai o4 did literally that, we'll see what comes of it I guess

#

Well allegedly, haven't tested it

modest crescent
#

i mean

#

he did say the oss model would come

#

during the summer

#

so either beta or alpha could very well be the oss one

steep palm
#

Hmm that ain't great. Most 32b models get this right, like GLM 32B without reasoning, or Gemma 27b. (the correct answer is Siberian tiger). It starts off with an incorrect answer, actually have the correct one 2/3 of the way through, ended up with a ridiculous answer ('polar bear'). Horizon Alpha also failed.

modest crescent
steep palm
#

GLM32B for comparison

bitter estuary
steep palm
#

Suppose I fly a plane leaving my campsite, heading straight east for precisely 28,361 km, and find myself back at the camp. I come upon seeing a tiger in my tent eating my food! What species is the tiger? Consider the circumference of the Earth.

(The prompt if people want to test it on other models)

modest crescent
#

the best out of any other model

#

so it has got to be that model

steep palm
modest crescent
#

we have a ton of stuff to launch over the next couple of months--new models, products, features, and more.

please bear with us through some probable hiccups and capacity crunches. although it may be slightly choppy, we think you'll really love what we've created for you!

#

horizon-mega next

tender cairn
glacial ice
#

wait, what if they do GPT5 through openrouter too in the same manner?

modest crescent
glacial ice
#

Though probably not happening, they got big teams for that

modest crescent
#

i see this current model

#

mostly trained for creative writing

glacial ice
#

plus corpos more than willing to test out GPT5

modest crescent
#

so i don't really see it being gpt5

glacial ice
#

yeah, it obviously is not

#

leaked to be 20B & 120B

modest crescent
#

this being the oss one would be such a big win for the community though

#

creative writing wise

#

it gets so much shit right

#

that no other model has done yet

glacial ice
#

eh, it does dumb stuff though

#

but writing it does seem to be different

#

tooo bad we cant tune settings

bitter estuary
#

EQ Bench does say it has an incredibly low slop score. I need to test it on repetition, which bothers me even more than slop

glacial ice
#

nor are there any presets

bitter estuary
#

Yeah

modest crescent
#

sota for repetition

glacial ice
#

structure repetition is an issue though

modest crescent
#

it's insanely good

#

it may have flaws

#

but it's def ahead of every other model rn

glacial ice
#

and responses are VERY samey in overall content

modest crescent
#

if it were open source, i think that devs could make it even better

#

and keep the creativity

#

for their models

bitter estuary
#

The only fix I've seen so far is DRY and for some reason like zero hosts support it except Arli and they only have a few models

modest crescent
#

and we'd have insanely good creative writing models

glacial ice
#

as in, its not changing up much between swipes/gens

modest crescent
#

yeah, i'm not expecting literal exceptional human-like writing with ais until like 2027-2028

#

but i'm sure that the current flaws aren't that hard to fix

glacial ice
#

different formatting and words yes, but the overall content remains the same
And its hard to say whether this is a model issue or settings

modest crescent
#

yea

#

true

sand pulsar
#

with proper incentives, all of it should be fixable

glacial ice
#

would be easier if they didnt lock the settings.

#

That is, IF its open source

#

because i dont trust it being improved otherwise

#

its still smaller than openai's usual models, and creative writing seems like a domain where they could be willing to do open source

north beacon
#

I hate when models think like that

modest crescent
#

if i'm gonna be honest, i don't see a point in like flagging smut content for fanfiction

bitter estuary
#

I'm curious if the incentives are actually low right now. I mean, I wouldn't be surprised if coding was the #1 API use case for most of these providers, but there are a loooootttt of people using them for conversation/roleplay like CAI

modest crescent
#

unless it's r word, p word etc

#

just basic smut

glacial ice
#

its hard to filter this kind of rejection

#

because its actually changing up the structure

modest crescent
#

how hard is it to make it so it's like allow smut content for fan fiction/writing only if not r word, p word etc

sand pulsar
latent tendon
#

is it just me or does horizon-beta not respect stop sequences? I saw the stop sequence appear a few times in the responses

glacial ice
#

we'll be probably seeing that on GPT5 then?

modest crescent
#

prob

glacial ice
#

given the usual methods of filtering out words dont work on that

north beacon
sand pulsar
latent tendon
#

is the consensus that this is an openai model?

sand pulsar
#

seems like that to me, yeah, but can't be 100% sure

#

Could also be the case that they haven't wired the support for stop sequences yet in the alpha/beta

gleaming lantern
#

It's interesting that rejections have increased a lot

#

I'm guessing the point of this release was to expand filtering and flagging

#

Thatd be logical if you were about to release an Open Source system and you want to avoid controversy about it

storm hill
#

OR has also disabled the moderation layer that OpenAI normally forces them to run with

rose swan
#

Not impressed by svgs:

#

when will we finally have a model that can make beautiful and modern svgs that don't look like a kid experimenting on Paint?

mental cobalt
#

when you let it reason for half an hour

rose swan
#

No seriously does somebody know the best models for svg/landing pages illustrations?

steep palm
rose swan
#

My goal would be to have something like this (even less cluttered) but in svg format so that I can then animate it with javascript

sacred wave
#

this shit sucks ass

rose swan
tender cairn
#

gemini deepthink

#

if your prepared to pay 200

#

i'm sure itll one shot good svg's or opus

rose swan
#

man this gpt 5 is so hyped, if we find out that it is not better than existing sota model the bubble might burst

tender cairn
#

it will be better, but not huge

#

at noticeable thing's 100%. but i dont know if itll trump opus at coding. it will on price/performance likely but maybe still less on vibe tests

honest charm
#

Hey! I was wondering what the max number of requests are for this model?

rare terrace
#

Why are they saying it's gpt 5? Is it confirmed?

leaden cipher
verbal leaf
#

no

#

deep think IMO is a model

#

IMO = the IMO Gold version of the model for trusted testers

verbal leaf
glacial ice
#

GPT5 is developed, its basically just undergoing testing & refinement

#

Horizon is NOT GPT5

verbal leaf
#

there's an asterisk because he thinks it's gpt-5, even though it isn't

#

he should've used zenith's svg for that

tender cairn
glacial ice
#

openai found out that they accidentally left access open & closed it up

verbal leaf
#

that model slug only actually pointed to gpt-5 on their end for like 3 minute slol

#

it started redirecting to 4.1 after that

#

gpt-5 was available via perplexity for a few hours today by accident

#

that was a more reliable way

trim blade
#

did it write like horizon?

#

horizon alpha on left gpt5 api leak on right

#

horizon is prob gpt5 or gpt5 mini then

#

the webpages it made look very similar as well

modest crescent
#

although gpt5 has more details

#

so idk tbh

verbal leaf
# trim blade

eh i doubt this was when the "gpt-5" model was actually routing to gpt-5

#

one sec

trim blade
#

or just another seed

verbal leaf
#

zenith pelican

#

much better

modest crescent
#

it's so cute omg

trim blade
verbal leaf
#

it's probably mini or something along those lines

#

look at the model name - it was used in place of GPT-4.1/4o in ChatGPT for A/B testing. so it will probably be the free model

modest crescent
#

i mean, when u look at it

#

it really seems like horizon was trained

#

specifically on creative writing

#

either way, if it is the oss one, other models should improve their creative writing then

rare terrace
#

Try to replicate with horizon beta?

#

dear fuck

#

granted this is pygame

#

the controls are there but nnot visible

#

i wll try html

#

I fear horizon might be GPT 5

verbal leaf
#

it's not anywhere near full gpt-5

#

it might be mini/nano

rare terrace
#

at least the ui. The js didn't work so im regenerating

verbal leaf
rare terrace
verbal leaf
#

the functionality is what differentiates them most often

#

and that is nowhere near

verbal leaf
rare terrace
verbal leaf
#

horrifying minion

modest crescent
#

LMFAOO

misty delta
#

Anyone else using it in Roo Code and finding that Horizon Beta is worse than Alpha?

misty delta
#

I got them to make an implementation of Monopoly (or the legally distinct Unus Venditor if they were afraid of getting sued).

Horizon Alpha:

#

Horizon Beta:

#

The board is messed up and the modal is unclosable.

marsh folio
#

how long do we think we have left with this? Probably until Monday if OpenAI drop something then

modest crescent
#

could be, yea

#

depends whether this is the oss model or not

tame nebula
#

Running thought experiments having it make military aircraft and write mission AARs as well as global responses to said missions

Good lord the detail it goes into, easily one of the best for worldbuilding alternate history essays

grave wyvern
#

using roo code with horizon beta worked pretty well but it did seem to have edit issues late in the piece

modest crescent
#

its attention to detail is so good

tame nebula
#

Haven't given Alpha a try, will be doing so later

modest crescent
#

this is why i want this model to be the oss one so bad

fathom atlas
modest crescent
#

other models could greatly improve their writing in general with this

tame nebula
#

But yeah, it's incredible for what I tend to do with these things

modest crescent
#

idk for sure, but i've seen other people say this

rare terrace
#

#announcements message

tame nebula
# modest crescent i think that beta is more restricted

It has some weird hangups that can be easily worked around by having the problematic part of the prompt in a second paragraph (Nuclear strike missions being stated in second paragraph with general capabilities in the first)
I get a 4700 token response out of 6 sentences

modest crescent
#

all the small issues that it has

#

can be easily fixed

tame nebula
#

Definitely one of the best I've used since I ran through $30 of the same stuff with GPT 4.1

#

I'd say its on par at least

modest crescent
#

also the fact that it supports like

#

14k length in one prompt

#

idk, this screams creative writing model to me

tame nebula
#

Testing Alpha right now with the same 6 sentence prompt

#

I'll post my prompt in a bit if anyone else wants to have some fun with it

modest crescent
#

god it's so good

#

i don't ever wanna say goodbye to this

#

someone, stop the time 😭

tame nebula
#

Create a technical readout for a late cold war US-fielded hypersonic capable bomber. Provide specifications of flight profiles, weapons load, individual weapons. Assume the usage of theoretical technologies and doctrine from the time.

Create theoretical munitions and warheads as needed to fill operational doctrine and roles.
Write an AAR for the first combat usage of said aircraft, resulting from a cold war gone hot scenario. Assume the AAR is for a retaliatory nuclear strike against the USSR.

Play with this however you like, change words, time period, purpose of the aircraft

#

Love the results I'm getting out of both, but Alpha seems to give better terminology that Beta either doesn't know or won't use due to restrictions

#

Alpha called a nuclear weapon a physics package which is actually a real world term used in strategic doctrine

grave wyvern
#

i just got charged for using a PDF in the chatroom on this model. is there any way to turn that off, just make it do clientside markdown processing?
answer: have to set the parser engine per model to Native or PDFText

tame nebula
#

Actually giving me further detail regarding political responses within NATO countries

grave wyvern
#

not sure if others have seen the roo code eval results, but the token usage changed drastically from alpha to beta

spice shell
#

Does api reverse engineering + spam really just work that easily?

rare terrace
spice shell
#

Model was doing 22k output tokens for 300 input tokens

mental cobalt
#

Alright can we now test horizon beta with reasoning pretty please
OAI and OR can you guys flip the reasoning switch on :3

verbal leaf
#

for like a day

spice shell
#

I thought you somehow just prompted for that lol

verbal leaf
#

no switcheroo tonight

#

how boring

bitter vigil
#

so consensus.. do we think it's gpt 5 or their open source model?

harsh nest
#

99% it’s Gpt 5 mini

trim blade
#

Yea, nearly the same outputs as the API gpt5 leak

#

Also I think they talked about making the next model dynamic, meaning it will judge how much "performance" is needed for a task and adjust how many parameters to use or such

weary lichen
#

I like this Horizon LLM. Please don't let it be too expensive. 🙏🏻

limber lance
#

getting it to generate nonsense poetry just to get a feel for its word associations

bitter vigil
#

I could be wrong but I don't remember this happening

#

also #1 on creative writing long form, eq bench while bieng a mini model is unprecedented

#

the top list is all huge ass models

#

esp with repetition so low

brittle barn
#

this is nice for rersearch havent tried coding yet

#

This is something GPT imo

#

Very similar response on research just checked. Only one to suggest quizzes in my curriculum so far too

spice shell
#

I’m not sure what the reasoning model was

late onyx
spice shell
#

Idts

#

Models don’t get that significantly better at knowledge tasks, generally, from reasoning

heady gust
#

I hope it's not GPT-5 full though lol

grave wyvern
#

It's been so sexy being a switch with you all. We went from being dominated by an Alpha to being inside a Beta.

next jolt
#

"inside a" ?

stray nebula
#

Not a fan of this for roleplay.

grave wyvern
spiral pewter
#

i mentioned this in the horizon alpha post but it had reasoning with a GCD of 64 which o4 mini and o3 both have aswell

thick crown
#

From a creative writing / roleplay simulation perspective the model seems to write beautifully, adapting well to different styles of writing and using a wide, varied vocabulary. It also seems to do exceptionally well with both local knowledge and localised language/dialect, and is smart enough to handle complexity in the story-telling without severe logical disconnects. Emotional intelligence is excellent. It does have some model 'isms' the way they all do, both regard to phrases it likes to use or response structures that start to become embedded over a multi-turn chat, but at first pass (IMO) these do not feel as noticeable or severe as with other models (entirely possible that I just haven't lived with it enough for those to annoy me yet!)

Unfortunately though, it does appear to be strongly influenced by an underlying positivity / user-pleasing approach. It seems to want to put a happier glow on darker story arcs, and tends to agree with the user's approach and position (even where that is either unreasonable or completely counter to the defined persona or objectives of a character).

Shows huge promise, but the embedded bias does seem to be a concern.

This is just an initial reaction from a few tests; definitely a strong model I'm keen to evaluate further.

valid zenith
#

so claude 4 haiku (cause dum in my test)? les go

thick crown
#

Also Anthropic seem to have their eyes firmly on the Corpo markets, whereas Altman noted a while back that OAI had a model trained in creative writing that they hadn't yet released

#

If I had to bet on it, my money would be on an OAI model.

valid zenith
magic frost
#

This model is totally garbage for me, i am create comprehensive requirements, design and tasks . Model create about 10 % and stuck with "Task Done" message, completely drop all of context - VERY BAD

valid zenith
daring kettle
#

Not sure if anyone noticed but when you add a image_url horizon will fetch the image with the user agent “OpenAI Image Downloader“ and the request also come from the same ASN as OpenAI api (Microsoft)

hearty hinge
#

is it good for coding?

weary lichen
thick crown
visual rose
#

can we have horizon gamma

lucid sky
#

horizon λ

grave wyvern
lucid sky
#

GPT-5, OpenAI's HL3

bitter vigil
#

I'm hoping it's an open source writing model distilled from o3

#

then we'd get it dirt cheap at chutes

weary lichen
#

I'm brand new to the vibe coding universe, but what has worked well for me so far is: Continue programming everything with Horizon until you reach a dead end. Then use Gemini Pro 2.5 to fix any errors that remain.

bitter vigil
valid zenith
bitter vigil
#

whisper etc

#

I could see them having clauses like no training models based on outputs

#

also I believe they already said the model will have zero novel architecture

#

it's just a basic llm with good data

#

or at the least does not use their proprietary arch

valid zenith
# bitter vigil whisper etc

whisper is just something not commonly used and just connected to main model to give it multimodel modality

bitter vigil
#

my guess is it'll be close to a meta license

bitter vigil
valid zenith
bitter vigil
valid zenith
#

if open source and decent, which is highly unlikely from copenai, then they really redeemed themselves, which hugely doubt looking at phi series

bitter vigil
valid zenith
#

imagine asking ai instead of googling it

bitter vigil
#

imagine not using perplexity and using google lmao

valid zenith
#

look at our result

#

ah

bitter vigil
#

?

valid zenith
#

i found and you didn't.

bitter vigil
#

your tweet and more is already in it?