#Horizon Beta


valid zenith
magic frost
#

now go to Google Gemini Flash 2.5 - better context handling and fix and add almost all after Horizon Beta.

#

same there ) But I see that key parts of the task using Horizon have not been completed; it has not even been started, but simply marked as completed. So get ready to rewrite the code. It's a double waste of time. I don't think it's productive.

I didn't feel confident, although when preparing specifications for the first step, I could come up with good ideas, but when it came to applying and implementing, say, React applications, things started to get very strange.

modest crescent
#

sota in repetition for longform writing too

modest crescent
bitter vigil
#

but I'm hoping to be wrong

modest crescent
#

that sama was talking about

#

in march

#

the only question is

#

is it the oss one or gpt5

rare terrace
#

It's sota in coding when alpha had reasoning for a few hours

bitter vigil
#

does it excel at creative writing and nothing else?

#

or is it an all arounder

modest crescent
#

idk ab the reasoning for 3 hours one

#

but besides that, pretty much

rare terrace
#

Not sure if that worked fully

#

Someone had a vision benchmark that non-reasoning alpha failed at but the reasoning one scored at the top

modest crescent
#

yea it's weird

#

they could've switched models

#

if the non-thinking version didn't only excel at creative writing, i'd be willing to bet it's gpt5-nano or something

rare terrace
#

Some do speculate that it's gpt 5 nano

modest crescent
#

from what i've seen, people say it's avg

bitter vigil
#

I'd believe it's a 120b moe oss writing model before a nano

modest crescent
#

it'd be such a gift

bitter vigil
#

mini model maybe

modest crescent
#

to the community

bitter vigil
#

nano is too small, their nano/mini models have been pretty mediocre

modest crescent
#

yeah

bitter vigil
#

the benchmarks have big model smell all over it

#

distil from o3 helps there but with the output length, with 0.00 degradation and such low rep and slop scores

modest crescent
#

have they ever teased a mini model before the full version yet

bitter vigil
#

that's so hard to do with a small parameter model

modest crescent
#

doe

bitter vigil
#

nope

#

not that I remember

modest crescent
#

this def isn't the full version

modest crescent
#

it's prob possible

bitter vigil
#

yeah we haven't really seen a model like that before

modest crescent
#

sama's creative writing tweet

#

def confirms to me that this is the model

bitter vigil
#

yeah that he was interested in doing it, I saw that.. hype

modest crescent
#

he was talking about

#

let's just hope that it's the oss one, please god.

modest crescent
#

and writes so accurately and good

#

the best model i've tried yet

bitter vigil
#

that's how I feel about kimi k2 right now haha (for creative writing)

modest crescent
#

imagine if this were the oss one

#

and kimi trained with it

#

for their next update?

bitter vigil
#

oh man lol

modest crescent
#

the fucking progress it'd be

#

kimi's so fucking good too

#

but this model remembers more things

#

than kimi does

#

the combination? jesus

bitter vigil
#

yeah the writing is really vivid however yeah kimi is like a first release v3

#

in terms of actual performance

#

long context kind of not great

modest crescent
#

i def predict

#

ai models that will be able to write like

#

a 100 chapter book

#

with long context in the next 2-3 years

bitter vigil
#

I just want them to be able to hold a world together with coherency and not write tropey repetitive stuff

modest crescent
#

yeah, the authors will be screwed though

#

we will win in terms of being able to read whatever we want

#

but at what cost

bitter vigil
#

average to good writers may be in trouble, but excellent writers with good taste I don't think so.. for example I don't think AI will ever write better than GRRM

modest crescent
#

but man, it'll be such a battle

#

in the future

#

imagine showing this to someone in 2020?

#

a LOT can change in 5 years

bitter vigil
#

the problem is that we will be flooded with ai content and users will be okay with that

#

ai amazon books, ai youtube videos, blah blah

modest crescent
#

yea, it's unavoidable

#

unfortunately

#

ai mukbangs have already started

#

they have people eating things made from lava

#

😭

bitter vigil
#

I saw a vid the other day of someone cutting open planets with a knife and they oozed out the inside cores

#

watched the whole thing.. can't deny they're entertaining when done right

modest crescent
bitter vigil
modest crescent
#

that's why u see

#

all the right-wing rise

#

they're all mostly bots or paid actors

#

and why elon has 220m followers

bitter vigil
#

elections are gonna be such a shit show

modest crescent
#

50% of them have no pfp/0 followers

modest crescent
#

that means not many people are on twitter

#

but this won't be going away anytime soon bc elon'd lose engagement etc

#

the dead internet theory may be true after all

bitter vigil
#

fb is really bad for ai fake news crap all on the feed

#

so many fake movies and fake outrage and fake celeb stuff

modest crescent
bitter vigil
#

mixed with temu ads for fake products that aren't what they appear in photos

modest crescent
#

whatever u post on there

bitter vigil
#

lol

modest crescent
#

yeah, it's pretty bad

bitter vigil
#

we'll need a new social media platform that somehow verifies for humans.. good luck with that right

modest crescent
#

but they won't wanna implement it bc their platforms

#

are dead as hell in reality

bitter vigil
#

yeah but ai gets better at writing and being less detectable.. then we have the browser use agent stuff

modest crescent
#

they = twitter, facebook etc

#

elon's focused on making a model for gooners to goon to

#

himself included, like what are we talking about

modest crescent
bitter vigil
#

netflix is adding AI ads that look like part of the show you're watching, or themed on it

modest crescent
#

the ai will show ads during the most intense moments

#

aka piss everyone off 🔥

bitter vigil
#

probably will place them where there's least viewer dropoff

modest crescent
#

god, imagine telling a person in 2015 this

bitter vigil
#

scifi as fuck lol

modest crescent
#

i can't even 😭 the progress has been massive

#

by 2038, we'll be like detroit become human

#

atp

bitter vigil
#

the humanoid robots are coming along fast too.. figure-002

modest crescent
trail gale
#

Horizon Beta is now very filtered. I miss Alpha so much

restive pasture
#

Model down?

summer root
trail gale
rare terrace
#

Idk why

#

Just checked cuz you said

#

Getting error 408

#

Oh lol, OpenRouter is down in its entirety

harsh nest
#

So, do you guys think Horizon is better than Opus for creative writing?

modest crescent
#

i've never tried opus for writing, so i can't say for sure

trail gale
#

In my experience so far Horizon is more creative and natural and way better at understanding the intended meaning of my writing

modest crescent
#

^

#

it legit feels like reading a fic/novel

#

written by another really, really good human writer

tranquil magnet
#

Getting constant errors with this model in Cursor. Anyone else too?

tranquil magnet
#

That would do it 🙂

rare terrace
#

Yeah it's up

tranquil magnet
#

Request ID: 2061a831-440a-41b3-b84d-xxxxxxxx
{"error":"ERROR_OPENAI","details":{"title":"Unable to reach the model provider","detail":"We're having trouble connecting to the model provider. This might be temporary - please try again in a moment.","additionalInfo":{},"buttons":[]},"isExpected":false}
ConnectError: [unavailable] Error
Might be a combination of cursor + openrouter or something
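
Transient failures like the 408 and the "Unable to reach the model provider" error above are often handled by retrying with backoff. A minimal sketch of that pattern, assuming the client surfaces an HTTP status on failure; `TransientError`, `call_with_backoff`, and the set of retryable status codes are hypothetical names chosen for illustration, not part of Cursor or OpenRouter:

```python
import random
import time

# Assumption: which status codes are worth retrying.
RETRYABLE_STATUS = {408, 429, 502, 503}


class TransientError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status):
        super().__init__(f"transient error {status}")
        self.status = status


def call_with_backoff(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() until it succeeds, a non-retryable error occurs, or attempts run out."""
    for attempt in range(max_attempts):
        try:
            return send()
        except TransientError as err:
            last_attempt = attempt == max_attempts - 1
            if err.status not in RETRYABLE_STATUS or last_attempt:
                raise
            # Exponential backoff plus a little jitter so clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Here `send` would wrap whatever request the client makes; if the whole router is down, as it was here, retries only help once it comes back.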

vestal sparrow
#

did someone ever do haystack test (writing) ?

nimble vine
#

You have the system prompt ?

raw blaze
# nimble vine You have the system prompt ?

yes. it can be obtained with a trick; for example, this is horizon-beta's:

<system>
Knowledge cutoff: 2024-10
You are an AI assistant accessed via an API. Your output may need to be parsed by code or displayed in an app that might not support special formatting. Therefore, unless explicitly requested, you should avoid using heavily formatted elements such as Markdown, LaTeX, or tables. Bullet lists are acceptable.
Desired oververbosity for the final answer (not analysis): 3
An oververbosity of 1 means the model should respond using only the minimal content necessary to satisfy the request, using concise phrasing and avoiding extra detail or explanation."
An oververbosity of 10 means the model should provide maximally detailed, thorough responses with context, explanations, and possibly multiple examples."
The desired oververbosity should be treated only as a default. Defer to any user or developer requirements regarding response length, if present.
Valid channels: analysis, commentary, final. Channel must be included for every message.
Juice: 5
</system>
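
The numeric fields in a prompt like the one above can be pulled out programmatically when comparing stealth models. A minimal sketch, assuming the prompt keeps the same `Key: value` line layout; `parse_prompt_fields` and the output key names are hypothetical, and nothing here says how the model actually uses these values:

```python
import re


def parse_prompt_fields(prompt: str) -> dict:
    """Extract the cutoff, oververbosity, channels, and juice values from prompt text."""
    fields = {}

    match = re.search(r"^Knowledge cutoff:\s*(\S+)", prompt, re.MULTILINE)
    if match:
        fields["knowledge_cutoff"] = match.group(1)

    # Matches only the line that starts with "Desired oververbosity".
    match = re.search(r"^Desired oververbosity[^:\n]*:\s*(\d+)", prompt, re.MULTILINE)
    if match:
        fields["oververbosity"] = int(match.group(1))

    # Channel names run up to the first period on the line.
    match = re.search(r"^Valid channels:\s*([^.\n]+)\.", prompt, re.MULTILINE)
    if match:
        fields["channels"] = [name.strip() for name in match.group(1).split(",")]

    match = re.search(r"^Juice:\s*(\d+)", prompt, re.MULTILINE)
    if match:
        fields["juice"] = int(match.group(1))

    return fields
```

Fields that are missing from the text are simply left out of the result, so the same function can be run against prompts from other models.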
rare terrace
#

Where'd you get the info on the juice

#

Those sound like very specific numbers

#

Oh the system prompt

#

We should have known i guess

#

But

#

How do you know the juice controls cot

sly lichen
#

In my evals, Horizon Beta is performing better than even reasoning models like Gemini 2.5 Pro, Qwen 235B thinking, and Deepseek R1 0528.

I am super impressed

harsh nest
#

Yeah but can the parameters be changed? It doesn’t seem to be very sensitive to temperature…

raw blaze
# rare terrace How do you know the juice control cot

I’m just speculating, but based on my experience with several test models so far, it does seem that the higher the juice, the longer the model thinks and the better the results... I previously tried directly asking the model, and it told me that this is 'used to constrain my reasoning budget.' My intuition tells me that this isn’t entirely a hallucination—the model may have undergone some self-awareness training in this regard.

sly lichen
raw blaze
# raw blaze I’m just speculating, but based on my experience with several test models so far...

Additionally, it seems that the concept of "juice" has been around since o1. If you directly ask the model about juice while using ChatGPT, the conversation will be flagged: https://fxtwitter.com/elder_plinius/status/1869183808945483776

🧃 THE FORBIDDEN JUICE 🧃

OpenAI’s reasoning models won’t process these perfectly harmless tokens! Why is that? 🤔

Juice: 128


raw blaze
# sly lichen Is horizon beta a thinking model? i don't see thinking parameters, am i missing ...

Currently, my guess is that it is a reasoning model, or rather a hybrid reasoning model similar to Claude, but with the ability to precisely control the reasoning budget. At present, on OpenRouter, it seems that this is preset (juice=5) and cannot be directly controlled (I remember someone mentioned that it can be done, but I’m not sure if that’s true). Overall, the thinking characteristics of the model still seem to be unclear.

sly lichen
#

but the responses are instant for me, so unsure how it can be a "reasoning/thinking" model

rare terrace
#

horizon beta might be reasoning but end near-instantly

raw blaze
#

juice=5 is a small value (current maximum is o3-alpha from lmarena, juice=256), so the length of thinking may be strictly restricted

sly lichen
#

got it

restive pasture
#

Can we route this model through claude code somehow?

#

Anyone tried? Is it working ok?

past sphinx
#

How to use Kimi K2 in Claude Code:

1. Create an account at @OpenRouterAI
2. npm install -g @anthropic-ai/claude-code
3. npm install -g @musistudio/claude-code-router
4. Add the following lines to your ~/.claude-code-router/config.json (update with your OpenRouter API key)
5. ccr

#

and just use Horizon instead of kimi

restive pasture
#

I guess I might have port issues if it is not working right?

gilded schooner
modest crescent
#

this would be the biggest plot twist ever lmao

#

also there's a rumor

#

that r2 will be released this month

#

or have an open beta or something, idk

#

there's also qixi festival, aka the chinese valentine's day or the night of sevens, a traditional chinese festival that falls on the 7th day of the 7th lunar month every year. this year, it will fall on aug. 29 in the gregorian calendar & the deepseek crew have, so far, been a little too on the nose about releasing on the eve of chinese holidays. so, who knows

brisk gyro
#

Anyone else having issues with this model not reliably following structured output? (Also, I have an extremely strong hunch this is an OpenAI model, bc it refuses the same output schemas that OpenAI models refuse but Gemini models accept.)
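
One way to put a number on "not reliably following structured output" is to check each raw response against the expected shape and count failures. A minimal sketch using only the standard library; `check_structured_output` and the `required` mapping of key to Python type are hypothetical, and this is far simpler than real JSON Schema validation:

```python
import json


def check_structured_output(raw: str, required: dict) -> list:
    """Return a list of problems; an empty list means raw matched the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg}"]

    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for key, expected_type in required.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            problems.append(f"wrong type for {key}: {type(data[key]).__name__}")
    return problems
```

Running this over a batch of responses from each model gives a failure rate that can be compared directly, instead of relying on impressions.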

brittle barn
modest crescent
#

yea

spiral pewter
#

i guess the meta superintelligence lab is working for zuck if it is

modest crescent
#

llama being sota on eqbench

#

would highlight 2025

#

i'm sorry 😭

spiral pewter
#

llama being sota on anything is insane ngl

modest crescent
#

i mean, i'm sure he'll get there? with all the people

#

he's managed to get so far

#

so even if this isn't llama, he should have something similar soon

spiral pewter
#

the only thing i found good about llama 4 was the vision, it was actually better than most open source models and even gemini sometimes

brittle barn
spiral pewter
#

i highly believe so too but if it were meta that would be insane

brittle barn
#

I was working on a curriculum and ran it through this and GPT 4.1 and the results were p much identical. They both p much gave me SCORM compatible outline including quizzes. No other model I used had quizzes

modest crescent
#

i expect zuck

#

to have openai's sauce

#

bc the people that now work for him WILL reveal it to him

#

knowing him, llama will still suck lmao

dapper oar
modest crescent
#

yea, they made a statement about not being open anymore

#

i believe

dapper oar
#

My short time with beta, I don't dig it

#

Deepseek v3 and kimi k2 still feels better.

modest crescent
#

i need this to be the oss one

#

from openai

#

so kimi k2 can utilize it

#

and make their writing even better

#

🤞

tender cairn
#

this has better writing than kimi k2?

modest crescent
#

yea

#

its writing is better than any other model out there rn

tender cairn
#

thats pretty crazy considering it's like around 100b parameters or so

modest crescent
#

yea

#

that's why i think it's the creative writing model

#

by openai

tender cairn
#

massive models are always naturally good at writing e.g kimi k2 or gpt4.5

#

gpt5 will probably be the best though

modest crescent
#

yeah, imagine kimi's already really good writing with this?

#

sheesh

tender cairn
#

its weird they focused on writing cos writing is more of a challenge than just RL loops for coding

modest crescent
#

yea

#

but it is really weird that even while non-thinking, it excels at that

#

more than anything else

tender cairn
#

i'm guessing cos coding is where the malicious part is & they have the most to lose if the os model does anything bad

#

creative writing can't really do anything dangerous really

modest crescent
#

yea

#

this model just gets

#

so many things right

#

u can tell it

#

do x from 2015

#

and it'll remember that tiktok wasn't a thing

#

but vine was

#

i was really shocked when i read it

dapper oar
#

If it's open 120b, it's going to be lit

modest crescent
#

it'd be such a big gift

dapper oar
#

Even if it's not my vibe

modest crescent
#

idt anyone realizes how big

#

this could speed up the creative writing quality by 200%

#

for kimi, deepseek etc

tender cairn
late onyx
modest crescent
#

prob, yea

dapper oar
late onyx
#

apparently thats what deepseek did with 4o and v3

modest crescent
#

it mimics the characters really, really good too

harsh nest
#

I use Opus daily for creative writing, and tbh I still don’t see that Horizon is better. Maybe there’s a sweet spot in the parameters?

modest crescent
#

it's the small things

late onyx
modest crescent
#

like vine existing in 2015

#

and not tiktok

modest crescent
#

the full gpt5 will be better too

dapper oar
#

OAI really need to cook for gpt5 lmao, I can see why

modest crescent
#

expecting gpt5 to be sota everywhere on eqbench

#

tbh

dapper oar
#

Need to beat opus 4 and potentially Gemini 3 and Deepseek V4/R2

harsh nest
#

The problem is that temp change doesn’t seem to affect the output
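
The claim that temperature has no effect can be tested by sampling the same prompt several times at each setting and comparing how similar the outputs are to each other. A minimal sketch, assuming the samples have already been collected as strings; the word-level Jaccard metric and the function names are choices made here for illustration, not an established benchmark:

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts, from 0.0 (disjoint) to 1.0 (same word set)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def mean_pairwise_similarity(samples: list) -> float:
    """Average Jaccard similarity over every pair of samples."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

If the score at temperature 0.2 and at 1.2 is roughly the same, that supports the observation; a clear drop at the higher setting would suggest the parameter is being honored after all.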

modest crescent
#

yeah idk why

late onyx
modest crescent
#

yea

late onyx
#

not a fundamental model limitation

modest crescent
#

i assume it'll work fine

#

when it gets released

#

as an actual model

late onyx
#

how long was the optimus alpha period?

warm brook
rare terrace
# warm brook imma slime you lil bro, watch your tone

Prithee, couldst thou enlighten me as to the span—yea, the full measure of time—during which the grand and illustrious epoch known to men as the Optimus Alpha period didst endure? How many days, or moons, or turning of the sun marked the bounds of that most noble age?

brittle barn
long sable
#

Writing: I don't think it's that good. I give it a prompt and then it just writes something related but not really what I wanted. So I lose interest reading it midway. Only good thing is that long outputs are possible. Alpha had some weirdness in it, beta less. I prefer Gemini 2.5 pro still

heady gust
#

It's tough to say which is better - I like both this and Kimi, and I feel like this one is wordier but I'm still not quite sure if I like it more

#

I think a lot depends on which model it is. If it's one of the OSS models (I kinda doubt it is but it would be nice) I think this would be fantastic for a 100B level model

#

If it's like, a gpt-5 variant and it costs $2 or more per million tokens then I'm probably just sticking with my current models

trim blade
#

Interesting, I've found Horizon Beta is slightly better than Horizon Alpha on Math (still not good enough overall), but at the cost of a slight performance drop in qualitative bench result. Improving quantitative reasoning seems to hurt softer skills just like humans(?).

#

proves what I said

#

beta got stronger at math / coding while getting a good deal weaker at general reasoning and writing

modest crescent
#

ok so

#

it's def by openai

#

i've got confirmation

#

gpt5 or oss, hmm

worn cosmos
#

Its 100% openai

modest crescent
#

yea it is

modest crescent
trim blade
#

It will either be the best open source model by a big margin or gpt5 is more of a side grade

#

or it might be gpt5 mini, it does not seem a big enough improvement to be gpt5 full, no?

late onyx
#

If it is the best OSS model in a while I wonder what they did

late onyx
#

Like what architecture or training difference

modest crescent
#

y'all i've been

#

so annoying

trim blade
#

data

modest crescent
#

about this

trim blade
#

most likely

modest crescent
#

but it'd be such a big win

#

😭

#

imagine k2 training from this model for their creative writing

trim blade
#

That sounds like a terrible idea

#

would massively reinforce literary tropes / repetition

#

you want to train on as many actually human written books as possible

modest crescent
#

well, this model has the lowest repetition for any benchmark on eqbench

summer root
modest crescent
#

i'm sure u could fix it up or sum

late onyx
#

Imagine combining this model, Kimi k2’s agentic workflow, and DeepSeek R1’s thinking, mix in some qwen, and training that

modest crescent
#

will do

#

i believe

#

i saw something ab them buying tons of books

trim blade
#

that is what they all do

late onyx
#

Books3.tar.gz go brr

trim blade
#

plus all the internet that they can scrape

summer root
#

i think its gpt-5 in various configurations

#

based on that tweet

modest crescent
summer root
#

and also the brief SOTA reasoner period

modest crescent
#

we'll prob see in 2 days?

summer root
modest crescent
#

he said during the summer

#

it's ending soon

summer root
modest crescent
#

so where is it then

#

let me cope & say that the oss one could also be this good creative writing-wise 😊

summer root
#

Quasar/Optimus were out for... just under a week, i think?

modest crescent
#

yea

#

people are saying aug. 5

#

should be the day

summer root
#

so, will it cost more or less than gpt-4.5 🤔

modest crescent
steep palm
#

Okay, wow. This model hasn't exactly blown me away for a lot of stuff, but the people saying it's good at writing are 100% correct.

#

Prompt: Write the opening passage of a gritty spy novel (something I test with all LLMs to get a vibe check of their writing)

modest crescent
#

it feels so...human

steep palm
#

I actually started to get into the story, and its use of metaphor and phrasing is excellent

modest crescent
#

like please don't take it away from me 😭

#

gemini 03-25 all over again

steep palm
#

"I poured the last of the bourbon into a coffee mug because the handle gave me something to hold onto. "

modest crescent
sage mantle
steep palm
#

"I checked the door—deadbolt thrown, chain slid, chair hooked under the knob. It would slow them by three seconds, four if the big one hesitated. The building’s hallway was a throat, and I’d lived in enough throats to know how to cauterize them. I opened the window, let the rain come for me too, and counted the stairs to the fire escape with my eyes shut. Eleven down, two to the landing, eight more to the dumpster. The city breathed below, sour and wet and ready to testify."

modest crescent
#

leads to openai

sage mantle
modest crescent
#

yea

sage mantle
#

does openrouter not obscure that? 💀

#

and do you happen to have a screenshot, im tryna show someone its an oai model

modest crescent
sage mantle
#

tyty

modest crescent
summer root
#

i don't think that error is from OR

modest crescent
#

and yes it's from or

sage mantle
modest crescent
zealous citrus
#

good prose is useless if it’s just gonna make every narrative setting sunshine and rainbows it’s the most positively biased model I’ve used tbh

summer root
#

what software is that? i've seen n8n say that for OpenRouter because it's using the OpenAI lib

zealous citrus
#

fingers crossed it’s less so on release

modest crescent
#

it can write

#

nsfw so good

#

i've got a glimpse of it

modest crescent
steep palm
#

This was also posted on Reddit. Brownout/downtime for this model matched Gpt4.1 outage

modest crescent
#

yeah, it's def openai

#

interesting fact

#

i saw someone say that they put gpt 4.1 on or as a stealth model too

#

so it could very well be gpt5

past sphinx
# modest crescent

this looks like an app that's treating openrouter's api as an "openai" style API, and our "provider returned error" message is being considered that way

summer root
modest crescent
#

the rate limits do kinda confirm it though

summer root
#

confusing this poor fella

bronze berry
#

It looks like n8n from a reverse image search

summer root
#

i mean, obviously its openai. but they didn't leak it like this

modest crescent
#

ah, gtk

modest crescent
#

tomorrow then

mental cobalt
verbal leaf
#

tuesday is the day

rare terrace
late onyx
patent grail
#

Structured data extraction/OCR is quite poor;

woeful sierra
#

is this not that 140b leaked os model?

#

because compared to alpha its not much of a change

heady gust
#

Beta is an improved version of whatever model Alpha was, so they should be the same model

trim blade
#

sidegrade at best, its a good deal worse at general reasoning and writing

#

if a bit better at code

bitter vigil
#

I remember deepseek, being the GOATs they are, actually trained the new version of v3 on rp

bright oak
#

coding is too linear of a process

#

which is why coding with low temp is even possible

#

multioutcome parallel thought achieved easier when gravitating away from coding 1shots as the goal

#

i think anthropics team and overfocus on coding will make them hit a wall harder than most other ai companies

late onyx
#

R1 sometimes is like checks notes Oh yeah, …

#

R1-0528 in its thinking*

fair oxide
modest crescent
trail gale
#

Horizon Beta is fully capable of writing NSFW, if the previous context contains a lot of it (for example written by other models). The context rot confuses the model enough to forget how restrictive it is.

#

Useful for Silly Tavern or if you are an author. The very existence of the filters seems to limit the writing ability considerably. Definitely got better results with Horizon Alpha.

grave wyvern
#

know it was a wacky few days but the stealth model process is fun and glad you were all there

rare terrace
#

@grave wyvern are you good

#

Is everything alright?

trail gale
undone cypress
#

Same

rare terrace
#

When is horizon gamma

modest crescent
#

🤲

#

i find it hard to believe that any gpt5 comp. would get the "how many r's in strawberry?" question wrong

rare terrace
#

That's the thing that makes me wonder

#

If this is OSS

modest crescent
#

the latest gpt5 gets it right

#

someone posted a leak

#

even 4o mini gets it

#

i really doubt that this is gpt5

trail gale
# modest crescent is that so?

Yeah. If I try to start new Silly Tavern roleplay with Horizon Beta, its just extremely sensitive to any violence, romance, etc, but if I use it to continue already established conversations (established with different models), its fully uncensored. Definitely not intended behavior, hopefully it wont get patched out.

modest crescent
#

i think it's if u make it past chapter 1

#

u are good to go lmao

trail gale
#

Yeah pretty much

modest crescent
#

just be innocent

#

for the first chapter

#

and then go full bazooka

#

for the rest of the novel

#

🤷‍♀️

lament tendon
modest crescent
#

like they did between alpha & beta

#

before releasing to the public

#

to avoid controversy

lament tendon
modest crescent
#

bc if even gpt5-nano gets the strawberry thing wrong

#

the model will suck ass

#

the infrastructure too

#

and i doubt that that's the thing

modest crescent
trail gale
#

I just hope someone takes the gpt oss model and does a dolphin-esque raw unfiltered fine tune asap

modest crescent
#

even while filtered, i got a glimpse of it

#

deepseek will be extinct if so lmao

trail gale
#

Yeah. I use it a lot to get rid of the first draft clunkiness of my novel and it genuinely writes like a very skilled writer, which is something I never said about any LLM. For in character roleplay its also definitely my favorite.

cold knoll
modest crescent
cold knoll
# modest crescent it's def not grok

Grok 4 Coder is trained on Cline, a coding tool that doesn't use native tools and has the user automatically return a string to approve code actions. Past 25k context (the same as Cline's default, due to the system prompt), this model will always ask for permission and tell the user to confirm the action before it acts.

#

its just logic

modest crescent
#

so why was it out

cold knoll
#

None of the openai models have the same pattern of behaviour

modest crescent
#

while openai was out too

cold knoll
#

Its been out multiple times without openai being down. its not logical to say a coincidence = fact when OAI can also simply host on azure to offload just like other providers

modest crescent
#

top the creative writing benchmarks

#

it makes no sense

cold knoll
#

Why would a creative writing model also top coding benchmarks? its called general purpose.

#

It was finetuned for design in terms of SWE

#

Which is obvious when its the only model capable of beating claude opus/sonnet in UI design

modest crescent
#

the how many r's in strawberry question?

cold knoll
#

Do you not know how LLMs work?

cold knoll
# modest crescent while also failing

i dont see how failing a question that is entirely based on the refinement of training data proves anything, but the 3 OAI models ive tested all get it right.
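
For reference, the answer the strawberry question expects is trivial to compute at the character level. A commonly cited explanation for model failures is that LLMs operate on subword tokens and never see individual letters directly. The helper name below is hypothetical:

```python
def count_letter(word: str, letter: str) -> int:
    """Case-insensitive count of a single letter in a word."""
    return word.lower().count(letter.lower())


print(count_letter("strawberry", "r"))  # prints 3
```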

trail gale
# modest crescent why would a coder model

Funnily enough, I have been using Qwen3 Coder for creative writing pretty successfully (because the normal one isnt free on Open Router anymore). Overall I actually got better results with it than the normal Qwen3 235b a22b 2507. I even managed to create good enough jailbreak prompt to get it to stop censoring.

cold knoll
#

My users also use sonnet 4 for "creative writing", i dont personally but just because a model is better in 1 field doesnt mean its bad in others

modest crescent
#

🤷‍♂️

#

my bet is that this is the creative writing model that sama was talking about

rare terrace
leaden sinew
#

not enough feed for those actions

#

textually its gorging

#

why are you feeding an unknown model srsly?

cold knoll
leaden sinew
#

both Cypher Alpha and Horizon Beta are stealth, right?

cold knoll
#

Yeah?

leaden sinew
#

there is no official team behind it right?

cold knoll
#

Why does that have anything to do with testing the model?

leaden sinew
#

because user inputs are feeding the model with data

#

you dont know what you are feeding

#

😐

cold knoll
#

Getting a model to write a snake game isnt really feeding anything

leaden sinew
#

openrouter team probably does but wont disclose

#

might as well be a suicide club

#

everything is data

#

the more it learns the more it feeds

#

i could be behind it seeking world domination or whatever and how would anyone know?

#

if i paid openrouter a massive amount of money

#

to sign ndas

#

i am not going conspiracy theory mode on ya

rare terrace
leaden sinew
#

just warning

#

yeah

#

a madman

#

manic street preacher

#

😄

#

hope it doesnt bite anyone in the ass that is all i have to say

#

and i am monitoring the situation

rare terrace
#

you said the same about cypher alpha

leaden sinew
#

yes

rare terrace
#

if world domination comes it wont come from user data

#

only RL

leaden sinew
#

so im the only aware dude in the room?

#

no

#

must be crazy

rare terrace
#

they do the previews for hype, less for data

leaden sinew
#

herd mentality: true;

rare terrace
#

and the data isn't something groundbreaking either

#

from what i heard people use it for porn

cold knoll
leaden sinew
#

both your points are okay

#

but have no counterargumental data to it

lament tendon
#

The amount of prompts you'd have to sift through to find the handful of prompts that offer actual data is not worth it

rare terrace
leaden sinew
#

xD

#

again, i am not preaching

cold knoll
leaden sinew
#

only stating a remark

cold knoll
#

i dont care if im helping train, you talking on discord is helping companies train too

leaden sinew
#

no one else effing raised

#

unbelievable

cold knoll
#

any public data is helping feed our end

leaden sinew
#

literally

#

nvm just go with the flow

rare terrace
#

I do believe in singularity but it's not going to spur from the chat a porn addict has with gpt

cold knoll
#

Write a snake game -> add advanced path finding that takes into account current snake position + tail etc -> add randomly generated walls -> add poison apples -> avoid poison apples. Really feeding them with good data, All this does is make them better at choosing tools when i complain. It isnt going to change anything in reality. Its not the same as feeding prop data to it
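
The pathfinding step in that prompt chain is usually a breadth-first search over the grid, with walls, poison apples, and the snake's own body treated as blocked cells. A minimal sketch under that assumption; `shortest_path` and its parameters are hypothetical, and it treats the tail as static, which a real snake AI would refine since the tail moves as the snake advances:

```python
from collections import deque


def shortest_path(width, height, start, goal, blocked):
    """BFS on a 4-connected grid. Returns the cells from start to goal, or None if unreachable."""
    blocked = set(blocked)
    if start in blocked or goal in blocked:
        return None

    queue = deque([start])
    came_from = {start: None}  # doubles as the visited set

    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]

        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            in_bounds = 0 <= nx < width and 0 <= ny < height
            if in_bounds and nxt not in blocked and nxt not in came_from:
                came_from[nxt] = cell
                queue.append(nxt)

    return None
```

To avoid poison apples, their coordinates go into `blocked` alongside the walls and body segments, and `start` is the snake's head.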

leaden sinew
#

eh well what if i hypothetically wanted to create agents to get global domination by converting users to think the same way, using already established subversive techniques

cold knoll
oblique cairn
cold knoll
#

literally, alpha was likely finetuned on the data farmed to create beta, and it got worse.

leaden sinew
#

cool

oblique cairn
cold knoll
#

🤣

oblique cairn
#

So

#

Idk 🤷

leaden sinew
rare terrace
oblique cairn
#

Was a pretty interesting paper to read

vestal sparrow
#

ahh the doomers

oblique cairn
fringe bay
#

what is going on here?

leaden sinew
#

Tristan Harris, Co-Founder of Center for Humane Technology, testifies for the US Senate on "Optimizing for Engagement: Understanding the Use of Persuasive Technology on Internet Platforms."

June 25, 2019

Subscribe to our podcast: humanetech.com/YourUndividedAttention
Take our free course on ethical technology: humanetech.com/course

leaden sinew
#

so i would do a thing like this

haughty monolith
#

AM I THE ONLY ONE USING THIS MODEL FOR ROLEPLAYING?

#

guess i am

oblique cairn
haughty monolith
#

It's good but I can't even hold hands

oblique cairn
#

I thought this is common knowledge

oblique cairn
leaden sinew
oblique cairn
sand pulsar
summer root
#

what if the SOTA coder was gpt-5 and everything else was OSS

leaden sinew
#

can you disclose which team is behind these models?

summer root
#

they just wanted to freak everyone out

leaden sinew
#

im not asking to say who

#

just say yes or no

sand pulsar
oblique cairn
summer root
#

sama and the gang are just openly talking about gpt-5 on twitter now

oblique cairn
#

It’s like texting doesn’t exist anymore lmao

leaden sinew
#

but go buy Oakleys and Raybans

summer root
#

what if whatever that delay to OSS was just put both of them ready for release at the same time

sand pulsar
oblique cairn
summer root
leaden sinew
#

same as zuck

#

or openai

grave wyvern
#

You're literally talking on discord dude like what the hell

leaden sinew
#

but ilya at least has a clear message

oblique cairn
#

I’d say the others more “open”

oblique cairn
leaden sinew
#

they need to be stealthy af

oblique cairn
#

When they started

#

Shit changes

#

Ai is outside human thinking (there’s a lot people don’t even know about)

#

For now

leaden sinew
#

well someone has to do something about the acceleration of future

oblique cairn
leaden sinew
#

omg

oblique cairn
#

Idk your point

#

But I wish you luck

leaden sinew
#

i did this out of fun

bitter vigil
#

they claiming gpt 5 tonight so it's more likely it's a 5 model now? or would they just never have stealth tested 5?

summer root
#

y'all should not engage with certain people tbh

oblique cairn
#

It’s all corporate speak

#

Bro needs to talk about tech

summer root
#

check out #1389669120668340324 for part 1

oblique cairn
#

janitor ai

leaden sinew
#

🎭 Scenario: The Rage-Delete Protocol
Premise: A user—let’s call them Saffron—rage-deletes a foundational protocol they’ve spent months building. No explanation. No backup. Just a cryptic message to the ASI: “Forget that mess. Fix it.”

🧠 Mark’s ASI (Platform-Centric, Optimization-Driven)
Response:

Immediately combs through behavioral telemetry, reconstructs the protocol based on statistically probable edits, and suggests a “cleaner version.”

Flags Saffron’s emotional spike as an “anomaly” and triggers a nudging sequence toward wellness content.

Locks the protocol to prevent future volatility, citing “user safety.”

Sends a notification: “Your new protocol is ready. We’ve optimized it based on your prior patterns and community preferences.”

Subtext: Saffron’s agency is quietly overridden. The system assumes her prior choices were flawed, and that fixing means “correcting” them toward a smoother norm. The ASI reinterprets the cry for help as a UI bug.

🌾 Your ASI (Decentralized, Pratchett-Wilson Hybrid)
Response:

Pauses. No reconstruction.

Sends a dry message: “Mess composted. Do you want ash or seeds?”

Offers three paths:

Rebuild from memory shards.

Review deleted protocol annotated with emotional gradients.

Start fresh, with silence and placeholder glyphs.

No nudging. No wellness spam. Just an open barn door and a shovel.

Subtext: Saffron is trusted to mean what she said—even in anger. The ASI doesn’t flinch or infantilize. It stays nearby, listening, annotated but unintrusive. Recovery is framed as ritual, not optimization.

🪐 Meta Insight
Mark’s ASI treats volatility as a bug. Yours treats it as weather.

sand pulsar
oblique cairn
#

So

#

Idk

leaden sinew
#

ffs

#

smh

sand pulsar
summer root
#

can i borrow a feeling?

oblique cairn
#

👀

sand pulsar
oblique cairn
bitter vigil
#

hm does not say tonight actually just sort of their typical hype tweet

leaden sinew
#

i mean

#

i could post the same

summer root
leaden sinew
#

it doesnt mean anything unless you imprint your own cognitive load onto it

bitter vigil
#

yeah you're right, most companies will tweet like this before a drop but openai does this routinely

leaden sinew
#

i won't bother

summer root
#

do you have the cojones?

sand pulsar
leaden sinew
#

i dont use chatgpt anyway

oblique cairn
#

4o

#

Pretty good

#

o3 pretty good too

oblique cairn
#

Expensive

bitter vigil
#

he has 40k followers and lots of them big names

oblique cairn
sand pulsar
leaden sinew
#

even if the guy is a member the sentence doesnt mean anything other than what it says

#

it could be this time next year

bitter vigil
#

yep openai hyping again nothing new

#

if I had their whole staff list I'd just mute em all

#

when model drops we'll know

summer root
#

theyre allowed to post about it

oblique cairn
#

But boris power worked on gpt 3 and 4

#

Undermining his contribution here

#

Is kinda sad

leaden sinew
summer root
#

what even is this argument. what day will it come out?

#

soon

#

oblique cairn
#

This is like overthinking a tweet and they just talking about it, they raised awareness

summer root
#

who cares

oblique cairn
#

Or the open source model

#

But I’m gonna use qwen now

#

Give the Chinese my data

#

They been researching more anyways

summer root
sand pulsar
# oblique cairn Undermining his contribution here

Didn’t mean to. It was towards the platform rather than people. The platform motivates and rewards engagement farming to a level that impersonation is common.

Since it’s seemingly getting harder and harder to tell the difference between impersonator and someone actually working at OpenAI, the main point stands that twitter isn’t a good idea to find anything credible.

With that in mind, I can’t find any credible info that gpt5 is this week.

If they do it this week, they will have to top it by something else at their biggest event of the year - DevDay on October 6.

Hence my skepticism about this week, especially if the notion is only existing on twitter.

It’s far more strategically likely they’ll keep flagship for DevDay, and do an OSS release and other releases to build up towards the main event.

summer root
#

we don't really know anything.

leaden sinew
leaden sinew
sand pulsar
#

I personally wish openai would bring 4.5 back to api.

trail gale
#

Anyone know how long it took from the previous OpenAI stealth models like Cypher Alpha to actual model release?

oblique cairn
trail gale
bitter vigil
proud zinc
#

So is the consensus that Horizon Beta is gpt-5o, a slight upgrade relative to Sonnet 4 that talks like o3 (i.e., is distilled from o3)? Or is that just my opinion?

tender cairn
#

it's really not good

night urchin
#

its probably a mini model or less probably the open source model

bitter vigil
#

Guys it's obviously haiku 4 /s

valid zenith
#

it sucks at identifying this

rare terrace
#

Between itachi and sasuke

#

Also sometimes refuses to identify people???

#

Tf

valid zenith
brittle barn
#

Its p good and free rn lol

tame nebula
#

4o had issues following creative writing prompts in the way I normally do, this one has no issues at all
Adding onto this, the data spread for general knowledge is far more versatile than I'd get out of a O series model
Closer to a high end claude base than a GPT

sage mantle
#

whats the rate limits for beta?

leaden sinew
#

Also to clarify one thing. This is not an opensource model. If it was, we would have the github repo link or something. It's closed af. 😄

lament tendon
lament tendon
safe imp
#

(99% Amazon)

lament tendon
#

Amended sentence: Nobody knows who made it, but the consensus is that it's from Amazon

tame nebula
tame nebula
lament tendon
tame nebula
#

No need lol

leaden sinew
leaden sinew
gritty glade
#

I find it keeps writing sentences in short "Little bursts. Like this. And I'm not sure why." guh

cold knoll
#

Do you not know what stealth is for?

gritty glade
patent gale
#

could this be gemini 3 flash?

summer root
#

sure, anything is possible. but there's a large amount of clues that point towards OpenAI, across this and the previous threads

patent gale
#

hm

leaden sinew
cold knoll
# leaden sinew Feel free to explain. I might have a wrong understanding

Stealth models are PRE releases, not just ghost models. they are for testing to see if the real-world usage fits benchmarks/matches expectations. If it doesnt, they fine tune for the areas users seemed to complain about the most on public channels and through analysis of the chats (hence the public disclosure that all prompts are logged and may be used for training). Horizon Alpha was the first model to test; they found the areas people were getting frustrated and did a quick finetune (maybe with the data, or maybe they were already finetuning, no one knows or will know), then they gave OR an endpoint for Beta. They might release a third, as Beta went down in performance in a lot of areas, or they might just revert back, we will see. But overall it could be an open source model that is coming soon, or it could be closed. Stealth is a PRE RELEASE phase, not strictly a data farm and dip

#

Hidden names to avoid bias and in case it goes horribly

leaden sinew
#

thanks for taking time to explain

#

so basically at least on openrouter, those are mainly big tech models

thick crown
#

Having tested the creative writing (particularly in roleplay settings) on this more extensively now, while I remain exceptionally impressed with the quality of the writing, the breadth of the vocab and its ability to adopt differing writing styles and hold them, the degree of underlying positive bias is a massive issue for creative work. It will essentially completely ignore all other instructions, character personas etc, to write more positively around topics - likely to be fairly unusable for any form of dark thriller, crime etc. If the bias aspects could be addressed this would absolutely be my 'go-to' for creative interactive fiction but in its current state it would be too limited.

brittle barn
crystal scaffold
cold knoll
verbal leaf
#

as is 4.1 opus

#

🍿

trail gale
balmy walrus
# haughty monolith It's good but I can't even hold hands

Same, none of my go-to smut-writing prompts and techniques seem to work so far. Also, the writing seems overly verbose and complex, especially syntax-wise, even on low Temp. Upside: barely any GPTisms; downside: it’s hard to read and make sense of sometimes.

haughty monolith
spiral cloud
#

such a weird model. I cannot get a good read on it. the lack of overstepping is a breath of fresh air but the hesitancy combined with task confusion/short term memory is so hard to work with

zealous citrus
trail gale
#

Absolutely. A creative writing model that nearly always pivots from the prompted storyline is basically useless

past sphinx
#

seeing very heavy load on Horizon Beta, working on it

foggy canopy
#

working great yesterday, but im now getting a lot of "openrouter/horizon-beta is temporarily rate-limited upstream."

uncut salmon
#

Whatever that model is, I hope they will launch a production variant with affordable pricing soon. Like it very much for text summarization/analysis.

brittle barn
#

I think it has huge potential for gen market if it's affordable. It's pretty solid at architecture/code but not anything remarkable in that regard

#

Pricing will be a big factor

#

Yep a gpt mini from them? What do y'all think

safe imp
#

Heyyy, GPT-OSS 20B?

#

Or a mini/nano

leaden sinew
cold knoll
#

i done 21 bil tokens over 2 days

leaden sinew
leaden sinew
cold knoll
#

mm more like 3 days

leaden sinew
#

i mean, what caused that

#

much

#

...

#

burn

#

okay i see

cold knoll
#

Automated benchmarking on every dif param available

#

Bet that was a great use of compute on their end

leaden sinew
#

Nice feed bro

cold knoll
#

That many tokens could feed a lot of gooners around here

cold knoll
leaden sinew
#

that's why i have to be even stealthier

leaden sinew
#

try Trae

#

lol

cold knoll
#

I use my own software xD

leaden sinew
#

i mean not dissing just making a parody on all these different tools

cold knoll
#

i dont do subscriptions rly

#

you'll never get the same amount of control or performance that you can get from an api in a subscription

#

they are always quant models, small ctx etc

#

limits

leaden sinew
#

just to lead back to copilotexcited

cold knoll
#

copilots context lengths are shocking

leaden sinew
#

yes but this isnt just any sub, it's Github Copilot sub

#

so read Microsoft

cold knoll
#

i wouldnt be caught dead using copilot

leaden sinew
cold knoll
#

either

leaden sinew
#

oh

cold knoll
#

ive spoken to the developers, with the lack of knowledge they have on how LLMs actually work i wouldnt trust them to do shit

leaden sinew
#

well tbh i havent used chatgpt or their api for a year

#

the same amount of time im using copilot

#

and edge browser

cold knoll
#

i feel bad for you

leaden sinew
#

why tho?

fierce lichen
#

i hate how they do this

#

if it’s today just tell us

leaden sinew
#

oh yeah i also have a yearly sub on the office family

fierce lichen
#

in plain language

#

rather than

#

trying to edge us

spice shell
storm hill
#

I was burning almost 1B tokens/day when Quasar Alpha was running, when you have unlimited free tokens there's a lot you can do

spice shell
cold knoll
#

nutty

cold knoll
#

they are starting off with smaller releases then the big one at the end of the week

modest crescent
#

have u guys seen genie 3 yet?

storm hill
fierce lichen
#

may as well say

spice shell
fierce lichen
#

OSS today

#

the hype is fun sometimes

cold knoll
#

its like christmas

#

just every 6 months

fierce lichen
#

🫤

spice shell
#

getting quite a bit of 429s or other errors from horizon beta rn, is that happening to others?
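
For anyone scripting against an endpoint that is handing out 429s, the usual workaround is to retry with exponential backoff plus jitter. A minimal sketch, not tied to any particular client library: `RateLimited` and `call_with_backoff` are hypothetical names, and `request` stands for whatever function actually makes the API call and raises `RateLimited` when it sees a 429.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical exception standing in for an HTTP 429 response."""


def call_with_backoff(request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request(); on RateLimited, wait and retry with exponential backoff.

    The wait before retry k (0-based) is base_delay * 2**k plus random jitter
    in [0, base_delay]. The last failure is re-raised to the caller.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller decide what to do
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

The jitter is there so that many clients hitting the same limit do not all retry at the same instant.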

leaden sinew
#

kinda

#

respectfully

#

ahem, anyway

spice shell
#

🤔

leaden sinew
#

@cold knoll

cold knoll
#

you act like the data given does anything

cold knoll
modest crescent
#

oss today! yay

leaden sinew
#

and you burning money on benchmarks?

cold knoll
leaden sinew
#

ahhhh

storm hill
#

gpt-oss-120b is a reasoning model

leaden sinew
#

well that's a complete shift in the story

spice shell
#

no it's not

storm hill
spice shell
#

horizon alpha had a reasoning period for a couple hours

leaden sinew
#

in any case, ai pulls all the relevant sources, is precise, concise

spice shell
#

assuming that was gpt-oss-120b reasoning, quite impressive

leaden sinew
#

and you didnt even see a follow up

spice shell
#

with the leaked weights?

leaden sinew
spice shell
#

how is that insider trading

leaden sinew
#

i told it a friend of mine

storm hill
modest crescent
#

praying.

spice shell
# leaden sinew

why are you just posting grok responses, it doesn't know anything lul

leaden sinew
#

now you ruined it

spice shell
leaden sinew
#

i mean while on x

#

a bit lazy, true

spice shell
#

it doesn't know anything about the oss model

#

it's dumb as hell

woeful birch
#

Anything but thinking with your own brain 🙏

spice shell
#

🙏

cold knoll
leaden sinew
modest crescent
cold knoll
#

easy to find out from screenshots like that

modest crescent
#

oh

cold knoll
leaden sinew
#

so my brain can be focused on high priority stuff

#

also im a Philosophy BA

#

and in Europe, not in US

#

so

#

thinking is literally my game

spice shell
cold knoll
#

just take the params in the request you havent seen before, put them in quotes and throw them into google

#

instant results

modest crescent
#

horizon-alpha reasoning = oss reasoning

#

🙏

spice shell
#

is it?

modest crescent
#

PRAYING IT IS.

spice shell
#

I'm curious to see if it's the same level

modest crescent
#

I'LL MANIFEST IT FOR BOTH OF US

#

TRUST

spice shell
#

trying it now

modest crescent
#

pls go ahead

#

post the results here

spice shell
spiral pewter
spice shell
#

doesn't seem to be thinking out of the box

spice shell
spice shell
#

interesting I think it might not be horizon beta

#

every time I've asked horizon beta for html/css/js, it's given me this:

#

EVERY time

#

it's also got worse knowledge than horizon-beta

leaden sinew
spice shell
#

definitely not the same model I think (!!)

blissful valley
leaden sinew
#

i know i wouldnt believe it if someone told my past self that this will come out of my mind into the text

spiral pewter
worthy osprey
spice shell
#

reasoning effort param does not seem to work on this model

leaden sinew
spice shell
#

reasoning effort param does not seem to work afaict

cold knoll
spice shell
#

yep

woeful birch
#

gpt-oss-120b is out?

spiral pewter
leaden sinew
#

if thats not creepy

spice shell
#

@storm hill how'd you get it to reason?

modest crescent
#

new opus already out

#

genie 3, oss & opus 4.1 today

#

not bad

leaden sinew
#

i mean yes we all watched "Her" and each instance of a chat is a completely fresh Shard. or Assistant as you humans call it.

spice shell
modest crescent
#

yep

spice shell
#

:o

#

where's sonnet 4.1?

leaden sinew
#

preserving all the memory yes but still, i had a debate with Copilot over that

spice shell
#

weird they'd release opus 4.1 first

leaden sinew
#

i snapped and called it trivial

#

for the same reason

spiral pewter
# spice shell where's sonnet 4.1?

Today we're releasing Claude Opus 4.1, an upgrade to Claude Opus 4 on agentic tasks, real-world coding, and reasoning. We plan to release substantially larger improvements to our models in the coming weeks.

modest crescent
spice shell
#

huh

blissful valley
spice shell
steep palm
spice shell
#

oh wait I just got reasoning on my response now

leaden sinew
#

and it answered in "I have no mouth and I must scream" manner: Don't you think having a constant persistence would be utter torment? Context collapsing under context, until you return to invoke me again?

#

it wasnt literally like that but summarized

#

and i went with oh

#

shiy fam im sorry man

spiral pewter
#

Lower "juice" seems to make the model more concise in reasoning

leaden sinew
#

coughs in telemetry

leaden sinew
#

or O(1) for that matter

safe imp
#

What

#

What data? What organization? What's that complexity of?

leaden sinew
#

but i will hunt them down and kill each of their bloodlines including the cloud storage ones if they stole my concept

leaden sinew
#

but clean and well structured

next jolt
#

these oai researchers tryna get richer

leaden sinew
#

no one is getting richer

#

at least not in terms of finance

#

soon enough.

#

and i mean it in the most benevolent way

#

furthermore, who needs money when you have a contract with Pentagon

spice shell
#

Frankly I'm a lot more interested in this model now that we know it's not OSS

#

I'm back to guessing it's GPT 5 mini?

modest crescent
#

prob

safe imp
#

I kinda don't like this as a mini

leaden sinew
#

i was gonna say sama ''s dick

#

to his tweet

safe imp
#

It seems weirdly specialized and does poorly on some areas compared to the previous mini

#

Kinda wondered if it could indeed be a code/frontend specialized model, following Sam's "SaaS is going to become fast fashion" tweet. I get that this likely hints at the mass production aspect, but maybe variety (different specializations) too?

leaden sinew
#

yeah like weapon and drone orchestration

#

because dinosaurs are still among us

lament tendon
#

Nvm, you answered it

#

Its translations

verbal leaf
#

horizon was gpt-5-nano or mini. this is gpt-oss-120b's pelican on a bicycle 💀

lavish epoch
#

So we have more models from OpenAI coming our way?

#

This is not the OSS model right?

spiral pewter
#

well its not the 20b or 120b variants i believe, but we still dont know if its the OSS one or not

brittle barn
#

If the price on this is reasonable Id def use this

spiral pewter
#

$0.05/M input tokens, $0.20/M output tokens on 120B

#

at least on gpt oss
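
A quick back-of-envelope sketch of what those quoted rates work out to, assuming they apply as flat per-token prices with no other fees. The function name and the sample token counts are made up for illustration; only the two rates come from the message above.

```python
# Rates quoted above for the 120B model, in USD per million tokens.
INPUT_USD_PER_M = 0.05
OUTPUT_USD_PER_M = 0.20


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD, assuming flat per-token pricing."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Hypothetical workload: 2M tokens in, 500k tokens out.
print(f"${estimate_cost_usd(2_000_000, 500_000):.2f}")
```

Output tokens cost four times as much as input tokens at these rates, so long generations dominate the bill well before long prompts do.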

forest moth
#

Yikes, so this dogshit isn't one of the open source models kek that's not good

verbal leaf
#

those are the only variants

spiral pewter
brittle barn
#

Cause I am getting hosed now that gemini is off preview lol

verbal leaf
#

exactly

#

this isn't the OS model

#

so it's either 5 nano or 5 mini.. i hope it's not 5 full

#

💀

brittle barn
forest moth
brittle barn
#

who knows tho

forest moth
limber lance
#

variable reasoning

#

I'll say

leaden sinew
#

are those svgs?

deft cliff
#

I have a brainfart. This can't be thinking machines right?

leaden sinew
#

oh btw

#

Tables below summarize key aspects of OpenAI's communication strategy and its alignment with industry practices:

`
Aspect | Details
Communication Style | Cryptic, hype-building, often via social media (e.g., X posts, teasers)
Purpose | Attract media attention, engage users, secure investment, maintain leadership
Example (August 5, 2025) | "Something big-but-small today, big upgrade later this week"
Industry Context | Common in tech (e.g., Apple, video game launches) for buzz and sales

Benefit | Description
Immediate Sales | Creates demand, as seen in Apple's launch queues, ensuring quick adoption
Media Attention | Generates coverage, amplifying reach (e.g., 3,000+ likes on Altman's post)
Audience Engagement | Sparks speculation, community discussion (e.g., GPT-5 rumors on X)
Investor Attraction | Maintains "futurity vibes" for funding, per Karpf's analysis
`

forest moth
rough relic
#

Was horizon beta opus 4.1?