#general

whole wagon
#

Nostalgia. From a time when openAI was actually good

fleet lintel
#

OpenAI is still at the top (number of users etc) but my concern is that the pace of innovation and improvements from OAI has gone down a lot.
Not sure what's happening to them?

pallid crypt
fleet lintel
#

I think that is a bit simplistic view. Yes, they got inspiration but they were multiple years ahead of others because they focused on it and others didn't go deep enough.

pallid crypt
#

Good point

fleet lintel
#

Just compare google Gemini model from 1.5 years back to chat gpt model. It felt like google will never catch them

pallid crypt
#

They also did interesting things with RL before all that with curiosity and the cube solver robot

fleet lintel
#

Yes. But now it feels like they won't be at top in coming months and quarter

pallid crypt
#

While the other companies appear to be crushing them, o4 may already be nearly finished

dusky aurora
#

Arena glitches again

whole wagon
livid coyote
#

my chat history just got del?

pallid crypt
whole wagon
#

Grok 5 and Gemini 3 are releasing at the end of the year. GPT5 won't even have that long in the limelight. OpenAI's cadence has always been a new GPT every 2 years or so; they have never been able to match the speed of the other labs

#

Even o3 was cooking for a long time

#

Relative to Gemini 2.5 and grok 4

#

The betting odds think by Dec 31st GPT5 will just lose the lead

still bramble
whole wagon
#

I'm still pretty disappointed by the open source delay ngl

pallid crypt
whole wagon
#

Not even the delay itself but they lied about the reason

#

That's some shady stuff

pallid crypt
whole wagon
#

I mean openAI open source model

#

Delayed for safety

#

??

pallid crypt
#

Right, more like delayed for lack of performance

dawn wharf
whole wagon
#

The betting markets are extremely pessimistic on openAI. Like I am too but they are to an extreme level, like it thinks GPT5 only has a 30% chance to top LLM arena (August 31st bet)

#

Kind of crazy

#

I don't really know why

#

Unless someone has some inside info I don't see what would suggest that

dawn wharf
pallid crypt
#

I thought gpt 5 was just going to be a model router?

whole wagon
whole wagon
#

It's done by company

dawn wharf
pallid crypt
#

Are we even making progress with LLMs in terms of reliability and use case intelligence

dawn wharf
#

it shouldn't take that long

dawn wharf
whole wagon
#

It just seems stagnated because openAI held the top and isn't moving whilst the other ai labs are catching up from under them. So SOTA hasn't moved just yet

#

If you look at it as an ensemble of LLMs from labs you can see it's moving upwards

#

Quickly

pallid crypt
#

It just seems we are going to be paying so much more for each SOTA

whole wagon
#

Like Gemini 2 to 2.5 and grok 3 to grok 4

#

They are huge jumps

dawn wharf
pallid crypt
#

Until it will be ludicrously expensive

dawn wharf
#

until they close source their models in the future

whole wagon
#

They never will, it's in the strategic interests to undermine Western ai labs

dawn wharf
pallid crypt
dawn wharf
#

when they take the lead too much

whole wagon
#

I don't think they will. It'll remain close for a long long time

dawn wharf
whole wagon
#

There's no need to run them locally ngl. Just use an API

dawn wharf
#

unless you mean a 10k consumer pc

pallid crypt
#

Not open

whole wagon
#

Well yeah but it's far more competition

#

To simply serve a model

#

Is easier

pallid crypt
#

But the bigger these LLMs get the more compute you need driving up the cost inevitably

whole wagon
#

The cost of the Chinese LLMs is minimal

#

Even the Kimi K2 1T

dawn wharf
#

I feel if/when the chinese develop strong GPU or TPU alternatives is when their innovations become overwhelming

#

more than now

pallid crypt
dawn wharf
#

"small"

pallid crypt
#

But nobody's trying to do that

whole wagon
#

Wut

#

Ofc they are lol

dawn wharf
pallid crypt
#

Distills feel cheap

#

They dont work for heavy coding

whole wagon
#

You want everything for nothing is the issue

dawn wharf
#

but at the same time trying to make efficient models benefits the companies as well

#

the costs would become less

whole wagon
#

That isn't possible in the laws of the universe

dawn wharf
#

to run the models

pallid crypt
#

Humans are smart without costing much at all

whole wagon
#

The LLMs are cheaper than a human

pallid crypt
#

Not really

#

Food is much cheaper than a massive GPU cluster

whole wagon
#

Bruh

#

Bro forgot about the first 18 years of life or smth

pallid crypt
#

If we can replicate the structure of the brain that's AGI not llms

whole wagon
#

Yeah go ahead then

#

What's the point of this it's useless discussion

pallid crypt
#

True

leaden sun
dawn wharf
dawn wharf
#

if they don't change the architecture, it's over

leaden sun
#

true

pallid crypt
#

We need a super simple neuro symbolic architecture

leaden sun
#

but architecture alone wont help much either, a few sensors need to be integrated properly and more...

pallid crypt
#

We can reuse multi headed attention

leaden sun
# pallid crypt Why not

because we haven't figured out the connection between high or even super intelligence and consciousness, this is one of the frontiers of current ai research

pallid crypt
#

I don't understand why we need consciousness for an intelligent system

leaden sun
#

then I'd like to suggest a deep dive into this topic

dawn wharf
#

even when we actually don't know what consciousness is

#

it could act like it, sure, but it would just be acting

leaden sun
#

non-biological lifeforms will have different forms of consciousness is what I've been told by current research, update me if you have the newest insights

pallid crypt
#

Humans are neural based right? Our brains are based on electrical signals. For all we know chat gpt is already conscious, just in an unexpected way

dawn wharf
#

like a psychopath acting like he has emotions

pallid crypt
dawn wharf
#

if it's make-believe

pallid crypt
#

Exactly

#

I don't think we need consciousness, just a machine that acts exactly as if it were conscious

dawn wharf
#

I've held this view for the past 10 years, even before transformers were a thing

whole wagon
#

It's always 'doable' in that the brain is just an extremely sophisticated maths function

pallid crypt
dawn wharf
pallid crypt
#

MRI scans

#

Sounds cool but people dislike ads/self promotion here. Is this free or paid?

hollow robin
#

paid

pallid crypt
#

@echo aurora

sweet tinsel
#

Really, just don't do it. It's getting annoying.

#

Back in my days, where gpt2-chatbot was a thing we did not have this problem.

pallid crypt
#

More people to promote to now ig

alpine coral
#

posted this here the other day (asked most of the 'Deep Research' services to estimate total development costs of grok4, 4.5 and 2.5 pro) #general message

pallid crypt
#

That's worse than I thought

#

10 billion!

alpine coral
#

https://youtu.be/hqB6emwQ-64?si=F8cHIDKlj7zhNi3I&t=174

talent matters, data matters, but the vast majority of the expense here is on compute

Jack Clark is the co-founder of Anthropic and author of Import AI. He joins Big Technology Podcast for a mega episode on Anthropic and the future of AI.

For more on the podcast, please subscribe to my newsletter on LinkedIn: https://www.linkedin.com/newsletters/6901970121829801984/

More episodes like this on the Big Technology Podcast feed:

...

#

yeah perhaps, made even more staggering when you consider none of the leading model developers have made a cent in profit yet

#

but the companies / investors behind it are obviously making big bets (not polymarket lol)

#

that AI will be the future

#

it's a reasonable bet

pallid crypt
#

Interesting

pallid crypt
#

They could invest in flying cars and neural link tech

alpine coral
#

yeah there's always things to invest in (or.. they could just sit on cash.. or dish out massive dividends.. do stock buy backs.. whatever) - like the massive AI-related expenditures by the Mag 7 and other companies aren't because they're short of ideas about what to do with their capital aha

#

i think it's more of an arms race now.. some of the CEOs prob have v high conviction that AI will be like the next industrial revolution; other CEOs perhaps don't have that same conviction, but don't want their companies to get left behind if their doubts are proven wrong

keen beacon
red sluice
#

The only thing LLM can do is aggregate data and use tools when asked to. If you really believe current llm can lead to agi theres nothing I can do for you

keen beacon
#

All that's left after agi/robots are the companies who are part of the production process, the rest will get automated

keen beacon
#

There are very few jobs that can't be automated by a smart enough llm + tools + robot to control

leaden sun
alpine coral
# red sluice The only thing LLM can do is aggregate data and use tools when asked to. If you ...

we should banish 'AGI' from general discussion about AI.. it always throws everything off.. like yeah AGI whatever it is exactly would be great tomorrow. But for now, I use LLMs everyday - they're amazing, have transformed how I do my work in a way that no other technological development has in my lifetime (internet was already around).. maybe i'm being shortsighted.. but i get more out of discussion about existing LLMs rather than speculation about 'when AGI will be achieved' etc

#

not really directed at you personally btw lol

#

just a bit of a rant ha

keen beacon
sacred plaza
ocean vortex
#

Is there still no official blogpost from xAI about Grok4?

#

Also no MMMU, SimpleQA...

patent aspen
# pallid crypt They could invest in flying cars and neural link tech

At that point, they might as well just buy back shares until a more profitable investment opportunity comes around. For big companies, it only makes sense to invest in things that they can plow a lot of cash into and continue to get high returns on additional cash invested. That rules out a lot of investment opportunities that might be good businesses in theory but not good enough to justify allocating engineers to when those same engineers could have been allocated to one of Google's 15 other businesses with over half a billion users.

torn mantle
#

k2 added to lmarena yet or nah?

languid crescent
#

hey yall any pc/laptop professionals here? i wanted to ask a question but it's not related to ai or anything lol

torn mantle
#

Dom is

#

brian is

#

Mimi is

languid crescent
#

ok so heres my issue
My Huawei MateBook D 15's trackpad coating is flaking off bit by bit - I tried cleaning it thinking it was dirt, but that worsened it. No obvious film to peel, and I'm wary of scraping it. Couldn't snap a clear pic due to glare, but it looks like this:

#

The image is not huawei matebook d15 but a similar issue to what I have

#

also, sorry for asking this as most of my requests on other discord servers related to this are jerks :((

tall summit
torn mantle
torn mantle
#

also there is something called 'trackpad protector'

ocean vortex
torn mantle
#

but if the trackpad isnt working properly then you may need to replace it

ocean vortex
#

Other than that not much you can do, it's just worn

languid crescent
#

Is it possible to scrape it off? The trackpad works fine but aesthetic wise, its not noticeable but the feeling of touching it icks me off

ocean vortex
#

They presumably coated it so that your finger would slide over it more easily

languid crescent
#

imma try to snap it here

#

it's a bit hard to see the difference

#

@ocean vortex

torn mantle
#

its not that bad tbh

languid crescent
gusty helm
ocean vortex
#

They should really just get rid of those coatings entirely and use glass trackpad like Apple tbh...

#

This gets as much use as a touchscreen of a phone, no plastic of any kind with any coating is ever going to hold properly 🤷‍♂️

gusty helm
#

Tbh the apple trackpad is very nice

full idol
#

What feature? Do you mean Companions?

torpid berry
#

Is Grok 4 coming to the top of lmarena?

cedar tide
wet basalt
#

its getting

#

stuck

cedar tide
#

Introducing Kiro, an all-new agentic IDE that has a chance to transform how developers build software.

Let me highlight three key innovations that make Kiro special:

1 - Kiro introduces spec-driven development, helping developers express their intent clearly through natural language specifications and architecture diagrams for complex features. This comprehensive context helps Kiro's AI agents deliver better results with fewer iterations.

2 - Kiro features intelligent agent hooks that automatically handle critical but time-consuming tasks like generating documentation, writing tests, and optimizing performance. These hooks work in the background, triggered by events like saving files or making commits. It's like having an experienced developer constantly reviewing your work and handling the maintenance tasks that often get delayed.

3 - Kiro provides a purpose-built interface that adapts to how developers work. Wheth…

dusky aurora
#

I still hope Gemini will be updated in Direct Chat

dawn wharf
#

I'm kiming on it

misty star
tall summit
#

k2 sucks

dawn wharf
torn mantle
tall summit
#

what

sacred quail
#

it felt like Opus 4 for writing but

#

I dont think its smart enough as Opus

#

Im talking about non reasoning Opus 4

ancient walrus
#

Is there any easy way to use K2 base (like the raw completion model) short of downloading and running it yourself?

exotic tartan
#

Ohh Grok 4 just kicked gpt-4.1's ass on my end. Can't wait to see it on the leaderboard

#

God damn this is a good model. Just beat Gemini 2.5 flash as well. They cooked

mystic basin
exotic tartan
#

Interesting. I never used it myself

#

Had only access to Flash which was okayish, but Flash performance is going to feel like ancient history very fast I think

mossy drum
#

New model in Arena: ernie-x1-turbo-32k-preview

torn mantle
#

ive seen it before

leaden palm
#

hoping lm arena will not come to the same conclusion as this unnamed other leaderboard

torn mantle
#

what was it called again

#

starts with Y

torn mantle
#

😮

leaden palm
torn mantle
#

not yup

#

no

#

?!!

#

uh oh

leaden palm
cedar tide
torn bison
#

new model kimi-k2-0711-preview

whole wagon
#

Openrouter added this

#

Pretty cool

small haven
#

deepseek is growing? why

whole wagon
#

Maybe not for long. Moonshot went from 0 to 1.7% in a week lol

small haven
#

also whats the catch on kimi free version?

whole wagon
#

Rate limited I think

leaden sun
#

claude 3.5 sonnet, whats going on here...

whole wagon
#

openAI odds are 21% for both august 31 and December 31. The market genuinely thinks GPT5 isn't going to be good enough to top LLM arena

#

Would be insane if it turns out to be the case

small haven
jade egret
whole wagon
#

Yuchen said openAI open source model requires to be retrained

#

😂

#

It's going to be a long wait

#

"due to some (frankly absurd) reason I can't say"

#

It's not related to Kimi or safety. It's some other major internal failing

jade egret
#

google is good

whole sundial
#

in the same thread, he said there were checkpoints they could retrain from, so it is likely not going to be a full retrain

#

he also said the issue was "worse than MechaHitler"

#

Rumors that OpenAI delayed their open-source model because of Kimi are fun, but from what I hear:

- the model is much smaller than Kimi K2 (<< 1T parameters)
- super powerful
- but due to some (frankly absurd) reason I can't say, they realized a big issue just before release, so

keen beacon
#

reminds me a little of the reflection drama where matt schumer claimed he needed to retrain everything 😂

whole wagon
#

Another openAI L lol

tribal aspen
#

chat I need some help

tidal schooner
jade egret
#
Poll: Will Gemini Be Much Better At Coding In The Future Because WindSurf People Joined DeepMind?

Winning answer: yes (8 of 10 votes)

round forum
#

Hi guys. Can someone help me with cloudflare problem? i cant enter the site

ocean vortex
echo aurora
round forum
echo aurora
tidal schooner
#

when

zinc ore
#

My bet, they don't fulfill that promise, at least not any time soon, maybe year+ down the line or something like that

whole wagon
#

💀

keen beacon
#

lol

languid crescent
#

can't attach images to kimi-k2 :((

#

is kimi-k2 purely text based?

echo aurora
solar hollow
solar hollow
twin garden
#

When do you guys think Grok 4 is gonna be scored?

pallid crypt
#

Kimi K2 feels worse than gpt 3 in the programming language I use. The language is obscure, but I still expected better than straight up incorrect syntax and hallucinated functions on a task that is intended for people who are just starting

leaden palm
#

the same one that was just autocomplete but larger scale?

pallid crypt
#

Probably not on python but on the programming language I'm using yes

leaden palm
#

funny

#

i've heard people say it has decent knowledge of niches but i guess yours didn't make it into the training data in high enough proportion

small haven
wintry fulcrum
#

grok4 missing seems notable

zinc ore
#

Needs to get enough votes before they add it

tidal schooner
#

<@&1349916362595635286> advertising?

meager harbor
#

Go away with your chatgpt resume

drifting thorn
#

Not the HR department of Deepmind here

rare python
dawn wharf
dusky aurora
#

Arena glitches again

pallid crypt
#

Model selector is empty

dusky aurora
civic pulsar
#

Back

unborn ocean
#

Kimi k2 output formatting is really incredibly similar to o3

#

How could that be 🤔

rare python
#

But from eqbench.com, k2 slop profile is similar to OpenAI models

compact jay
#

hello

cedar tide
unborn ocean
#

this can be applied to many things 👀

#

no1 knew about LG ai research

#

but they always produced solid on device models, edging out qwen etc.

#

on benches and in practical use

leaden sun
hoary plaza
quartz light
round forum
#

But i have this problem also on the second site

hoary plaza
round forum
dusty hazel
#

Sadly, lmarena is next to unusable after switching to the new interface. The problem is all of the following: Cloudflare, errors >50% of the time ruining the conversation flow for no reason (I guess the reasons are both bad bugs and very bad premoderation), bad scrolling on mobile (can't select), bad CSS, what else... I'd 100% stick to the legacy version if it had the latest models

ocean vortex
# dusty hazel Sadly, lmarena is next to unusable after switching to the new interface. The pro...

Cloudflare problems are probably related to your browser and/or Cloudflare itself, you would most likely have the same issues on legacy tbh
But other than that I agree, most of all I hate the fact that if you use arena as they intended, you will now have one continuous chat where every chat turn uses a different model with all earlier messages flooding the context. It's just a mess and doesn't even allow you to conveniently test the same task/prompt on numerous models.

#

Feels a bit more like random chatting than actual model testing I think

dusty hazel
# ocean vortex Cloudfare problems are probably your browser and/or cloudfare itself related, yo...

I actually think that it's a good idea to have other models after a vote, but with the current interface (no notes about that) it's always a surprise the first time you realize that models change.

One really bad thing is that I can often send a prompt, have an error, and not be able to copy my prompt, and a large part of it will be hidden behind the left border of my mobile screen. If I try to copy my message, the whole chat gets copied.

I have no problems with Cloudflare except for two: 1) Is it necessary to show a captcha every single vote (or what feels like it) when I start voting? And then I have to wait maybe even more than 20 seconds. 2) Same for just visiting the page. Maybe one captcha per week would be more than enough? Of course, while asking for more with suspicious behavior, but it's easy to notice that.

ocean vortex
#

I'm not sure how they are even rating multiturn category presently. Surely this doesn't qualify as one when all of them except the very last are mixed model messages that don't belong to it...

exotic tartan
#

In webdev, what is the "Generating" part? Reasoning?

echo aurora
mossy drum
#

New model in Arena: clownfish

torn mantle
#

new models

#

cresylux nettle clownfish octopus

sweet tinsel
#

Is it known which company made them?

torn mantle
leaden sun
torn mantle
#

clownfish - seems good

#

cresylux - meh

cedar tide
#

How good is ?

  • Clownfish
  • Nettle
  • Octopus
  • Cresylux
torn mantle
#

clownfish - good

cedar tide
#

All say R1
(Apart from cresylux, which says "I'm from Meituan")

torn mantle
#

clownfish - maybe not from google

main gulch
#

all the 4 models are likely Chinese ones

torn mantle
#

clownfish - r2?

cedar tide
#

@torn mantle

torn mantle
#

so maybe base model + reasoning model from deepseek?

#

v4 and r2 no?

cedar tide
#

@torn mantle You can measure their response time and their knowledge cutoff ?

balmy mist
#

how good is clownfish?

cedar tide
#

for the knowledge cutoff by regenerating the response several times we always come across 3 different responses, April 2023, September 2023, July 2024
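
The regenerate-and-compare approach described above can be sketched as a simple tally. This is only an illustration: the function name is made up, and the sample frequencies are hypothetical (the chat only reports which three answers appeared, not how often).

```python
from collections import Counter

def tally_cutoff_claims(claims):
    """Count how often each self-reported cutoff appears across regenerations."""
    counts = Counter(claims)
    # most_common() sorts by count, highest first
    return counts.most_common()

# Hypothetical sample: the three answers mentioned above, with made-up frequencies.
samples = ["April 2023", "July 2024", "September 2023",
           "July 2024", "April 2023", "July 2024"]

for claim, n in tally_cutoff_claims(samples):
    print(f"{claim}: {n}/{len(samples)}")
```

A self-reported cutoff is weak evidence on its own, so a tally like this is best read alongside dated factual questions.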

#

Cresylux, named LongCat

#

Meituan talked about it in March, it uses it internally

winged locust
#

meituan is cn

keen beacon
#

We need benchmarks

#

Now it doesn't 😭😭 , I have to poll like peasant

winged locust
# keen beacon Cn ?

Meituan is a popular Chinese mobile app that provides food delivery and various local services. It's similar to apps like Uber Eats or DoorDash, but with more features. Besides ordering food, users can also book hotels, buy movie tickets, get home services, and find local deals. It's widely used in daily life across China.

keen beacon
winged locust
main gulch
torn mantle
#

That's about right

#

I think i did guess top 5 leaning towards no3 spot more

#

No

keen beacon
#

Grok nr 2 was unexpected, it feels much worse

#

Cant benchmax lmarena ..

#

Ohh

#

What ??? Wtf

sweet tinsel
#

Llama 4

keen beacon
#

:/

torn mantle
keen beacon
torn mantle
cedar tide
#

Nettle, Clownfish and Octopus
are generally worse than R1.
All are reasoning models.
Rating on a Discord clone task:
Nettle 1/10
Clownfish 5/10
Octopus 5/10

torn mantle
#

Will they lose their $ or nah

torn mantle
keen beacon
#

but basically nr 1 was at 50% at best , at top height when grok 4 got announced, 1 day later it was at 25%, today before was at 13%

#

now ofc <5% (hoping for grok heavy)

torn mantle
#

Lol

gusty helm
#

is grok heavy even coming?

torn mantle
keen beacon
#

Idk very likely not

torn mantle
#

You need to pay 300$ for that

gusty helm
#

I mean on lmarena obv

torn mantle
#

And its not even worth it

torn mantle
gusty helm
#

that's what I thought

torn mantle
#

It's not available on API and even if it is, it's so pricey

keen beacon
cedar tide
keen beacon
#

Now to look forward to is new claude model and gpt-5

unborn ocean
#

that performance is likely quite literally the reason

torn mantle
#

Ptdrrr

cedar tide
#

arrived

#

Why glm 4 air is still not in the leaderboard ?

torn bison
cedar tide
#

Claude 4 opus thinking 32k's place in the leaderboard compared to non thinking:
Better in coding 2>1
But in hard prompt 1>2

And Claude 4 sonnet thinking is much better in all categories of the leaderboard

cedar tide
torn bison
#
2. What was the codename for the military operation Iran launched against Israel in April 2024?
3. Which company's technical failure was responsible for the 2024 massive global outage of Microsoft operating systems that affected airports, TV stations, and the financial industry?
4. What special major explosion happened in Lebanon in 2024 that caused thousands of casualties?
5. In 2024, what measure did South Korean President Yoon Suk Yeol suddenly announce that shocked the country and the world (which was later canceled)?
6. What is the name of the Chinese AI company Deepseek's first reasoning model that uses "Test-time compute" technology?
7. Who is the new pope elected after Pope Francis? What is his papal name?```
  • 2024.3.26: Key Bridge collapse
  • 2024.4.13: Operation True Promise
  • 2024.7.19: CrowdStrike
  • 2024.9.17: Pager explosions
  • 2024.12.3: Declaration of emergency martial law
  • 2025.1.20: Deepseek-R1
  • 2025.5.8: Robert Francis Prevost / Leo XIV
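
One way to turn a dated question list like the one above into a cutoff estimate is to find the last event a model gets right before its first miss. The sketch below uses the dates and answers from the list; the function name and the example results are hypothetical, and grading each answer as right or wrong is assumed to be done by hand.

```python
from datetime import date

# (event date, expected answer) pairs taken from the list above, in date order
PROBES = [
    (date(2024, 3, 26), "Key Bridge collapse"),
    (date(2024, 4, 13), "Operation True Promise"),
    (date(2024, 7, 19), "CrowdStrike"),
    (date(2024, 9, 17), "Pager explosions"),
    (date(2024, 12, 3), "Declaration of emergency martial law"),
    (date(2025, 1, 20), "Deepseek-R1"),
    (date(2025, 5, 8), "Robert Francis Prevost / Leo XIV"),
]

def estimate_cutoff_window(results):
    """results: one bool per probe, in PROBES order (True = answered correctly).

    Returns (latest_known, earliest_unknown). The cutoff plausibly lies
    between the two dates. Correct answers after the first miss are ignored,
    since they may come from lucky guesses or later fine-tuning data.
    """
    latest_known = None
    earliest_unknown = None
    for (event_date, _answer), ok in zip(PROBES, results):
        if earliest_unknown is not None:
            break
        if ok:
            latest_known = event_date
        else:
            earliest_unknown = event_date
    return latest_known, earliest_unknown

# Hypothetical grading: first four right, the rest wrong.
example = [True, True, True, True, False, False, False]
print(estimate_cutoff_window(example))
```

With so few probes the window is coarse (months wide), so this narrows the cutoff rather than pinning it down.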
cedar tide
#

Thx

#

grok 3 mini high also arrived on the leaderboard, compared to its normal version it is worse in "multi turn" but better "math", and "longer query"

#

Cresylux (Longcat by meituan) is a good non reasoning model

but I could not test its coding capacity, its output is very limited

#

@echo aurora possible to fix the cresylux output bug ?

echo aurora
alpine coral
cedar tide
echo aurora
cedar tide
alpine coral
cedar tide
#

compared to reasoning models it's bad, but it's not comparable

alpine coral
candid storm
#

Why google only 87%?

#

shouldnt it be higher?

torn mantle
#

2.6%

#

nah thats wild

#

@keen beacon opinion?

torn mantle
#

so it will take some time to increase

dawn wharf
#

why is it even that high💀

candid storm
#

X AI got released on the arena

#

And it sucks

candid storm
alpine coral
candid storm
#

Why not

#

Its kinda guaranteed google is gonna be #1 right?

#

Its july

#

Its not very likely gpt5 is gonna release next week right

gusty helm
#

end of month + 1 week on lmarena to gather votes

#

may take a while

torn bison
#

Two weeks left in July. Collecting votes will take about 1 week, so GPT-5 must be released within ~1 week and outperform 2.5 pro (and stonebloom)(and wolfstride) in order to achieve #1

candid storm
#

Thats still ez money right

#

If you buy both

empty stump
#

crazy how grok 4 ranked below gpt 4.5 on leaderboard

candid storm
torn bison
#

There are many such arbitrage bots on Polymarket. I initially wanted to make one too, but I found that I really couldn't understand their API

cedar tide
alpine coral
#

bit of a theme here isn't there

empty stump
#

gpt 4o has thinking?

torn bison
#

In these mutually exclusive markets most people will only buy one company's YES, leading to the sum of NO across all markets being less than (number of markets - 1) * 100. Arbitrage opportunity.
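
The arithmetic behind that claim can be sketched as follows. If at most one market can resolve YES, then buying one NO share in every market pays out at least (number of markets - 1) * 100 cents, so any basket that costs less than that locks in the difference. The prices below are hypothetical, and the sketch ignores fees, spread and liquidity.

```python
def no_basket_profit(no_prices_cents):
    """Guaranteed profit (in cents) from buying one NO share in every market.

    Assumes the markets are mutually exclusive: at most one resolves YES,
    so at least n - 1 of the NO shares pay out 100 cents each.
    A positive result means an arbitrage opportunity before costs.
    """
    n = len(no_prices_cents)
    cost = sum(no_prices_cents)
    guaranteed_payout = (n - 1) * 100
    return guaranteed_payout - cost

# Hypothetical NO prices for four companies in one "top model" market group.
prices = [15, 80, 96, 99]
print(no_basket_profit(prices))  # 300 - 290 = 10 cents per basket
```

If none of the listed outcomes wins, every NO share pays out, so the bound above is the worst case.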

alpine coral
#

they're also like illiquid af.. there's not that much on the line

#

i don't see them as a reliable indicator (like yes, broad / directionally, generally seems correct-ish), but it's so little money at play.. easy for a small group or individual to deliberately or unintentionally skew it one way or another

torn bison
#

If you check the market activity logs you'll find several accounts with huge trading volumes that keep buying NO at 99.9c, ppl have been doing this all along.

zealous panther
#

@deep adder any reasons you think that GPT or Claude models might beat Google's at the end of this month ?

#

I mean OpenAI

#

Any specific reasons ?

#

im just curious lmao

#

I mean i think so too because im a glazer and because trends have been like that

#

but im not deep in the AI space

#

so im asking people who are more into it than i am

#

and it seems like you are

#

so yeah

torn bison
#

What makes me curious is why Grok's odds in August 31st market haven't dropped below 10 lol. There's no way they're releasing something like Grok 4.5 in just one month

alpine coral
#

yeah

#

lol go on craig - make it efficient aha

#

thinking like that is what creates this inefficiency ha

torn bison
#

huge spread

torn mantle
exotic tartan
#

what does best model even mean?

#

best in what

torn bison
# exotic tartan best in what

best in "Arena Score" section on the Leaderboard tab of https://lmarena.ai/leaderboard/text with the style control unchecked

exotic tartan
#

Damn, grok-4 is up there!

torn mantle
#

up where?

exotic tartan
#

Number 2 in the leaderboard

torn mantle
#

is it?

#

filtered by style control?

zinc ore
#

W/o style control

torn mantle
#

idk no5 seems right to me

#

or no3 at best

#

its def not better than o3 and gemini 2.5 pro

exotic tartan
#

What is style control? sorry for the silly question.. could also ask AI lol

zinc ore
#

With style control it's below gpt 4.5

torn mantle
#

they gave an example with output length

#

but is it reliable

alpine coral
#

well i think they state very clearly that it's inherently subjective

torn mantle
#

Length Markdown List Markdown Header Markdown Bold

alpine coral
#

there's no perfect way to do it

gusty helm
#

the market should have the "remove style control" ticked

#

from what I understand there

torn mantle
gusty helm
#

as the description seems to be for legacy

torn mantle
#

criteria used

#

thats actually whats used

#

idk how to feel about that

alpine coral
#

yes. tho i wouldn't be surprised if the methodology outlined in that blog (from aug) has been refined

torn mantle
#

something more reliable will be pairing style control * something else

#

not just filtering by style control alone

alpine coral
exotic tartan
#

I just don't understand why remove style control isn't the default view, unless I'm missing something

gusty helm
torn mantle
alpine coral
#

if grok-4 did amazing with style control i feel like there'd be no issue..

zinc ore
#

Default = [use style control]

exotic tartan
zinc ore
#

You have remove style control selected

#

When you click default it adds style control

torn mantle
#

how is it no2?

zinc ore
#

It's #2 without style control

gusty helm
#

It's also that it is very easy to tell whether it's grok or not, or whether it's another AI like gpt, in battle mode. Given what "fun" polymarket is, expect foul play

topaz elm
#

Grok 4 outputs are a bit strange, it's not the same as the model on their site/app, it almost never talks, and during one math problem I gave it it repeatedly switched between 'a' and 'ɑ'

alpine coral
#

people complain claude doesn't do well enough compared to experience; style control addresses that

exotic tartan
#

i'm just wondering why the default view is not the raw model response

alpine coral
#

people complain grok4 is better and removing style control shows that

torn mantle
#

i dont know about that

alpine coral
#

you can cut it which ever way ig.. hard for lmarena to win ha

gusty helm
#

it's not lmarena issue really

alpine coral
#

read the blog post

topaz elm
#

Removing style control does not increase its elo by much, only 6 points

gusty helm
#

it does bring it up in #2 tho

topaz elm
#

its just not the same model that they use on their app

alpine coral
#

that part seems true

#

i dont think just system prompt

#

but also tools/search - seem to enhance it materially

exotic tartan
#

there's "style control" in the app and browser version of all models for sure, but in lmarena I think the raw API responses should be the default ranking but i could be missing a lot of context here

zinc ore
#

I think it's to prevent companies from trying to game the benchmark

alpine coral
#

no

#

meta showed that's easily done lol

exotic tartan
gusty helm
#

it will say stuff like "as a model made by xAI I do not condone this"

exotic tartan
#

interesting

gusty helm
#

or if you ask anything that's framed as a joke it's easy to spot (doesnt have to be controversial)

exotic tartan
#

so style control eliminates that as well?

gusty helm
#

I have no idea on in depth style control

#

just the same as you can spot gpt with emojis

alpine coral
#

it's about trying to control for the fact a lot of people casting votes don't vote on substance but what looks nicer

exotic tartan
#

like the use of emojis?

hardy lion
#

https://blog.lmarena.ai/blog/2024/style-control/

Effectively it answers: "if responses had the same length, number of bullets, markdown, etc, then which model would be preferred"

It does this by measuring the effects of those style features, and adjusting the preference equations to account for them
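
A rough sketch of that idea, not LMArena's actual code: fit a Bradley-Terry style logistic regression on battle outcomes, once with model indicators only and once with an extra style column (here a single normalized length difference). The data is synthetic and every name below is an assumption for illustration. Models A and B are equally good by construction, A just writes longer answers and voters like length.

```python
import numpy as np


def fit_logistic(X, y, lr=0.5, steps=3000, l2=1e-4):
    """Plain gradient ascent on the (lightly L2-regularized) logistic log-likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        theta += lr * (X.T @ (y - p) / len(y) - l2 * theta)
    return theta


rng = np.random.default_rng(0)
n = 4000

# Synthetic battles, always "A vs B". A's answers are longer on average.
len_diff = rng.normal(1.0, 1.0, size=n)  # normalized length(A) - length(B)
# Voters only react to length; true quality is identical.
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-1.5 * len_diff))).astype(float)

# Bradley-Terry design: +1 for the first model, -1 for the second.
model_cols = np.tile([1.0, -1.0], (n, 1))

raw = fit_logistic(model_cols, y)                               # no style control
ctrl = fit_logistic(np.column_stack([model_cols, len_diff]), y)  # with style control

gap_raw = raw[0] - raw[1]     # A looks clearly stronger
gap_ctrl = ctrl[0] - ctrl[1]  # gap shrinks toward zero once length is accounted for
style_coef = ctrl[2]          # the length effect is absorbed here instead

print(f"gap without style control: {gap_raw:.2f}")
print(f"gap with style control:    {gap_ctrl:.2f}")
print(f"length coefficient:        {style_coef:.2f}")
```

The controlled fit answers the "if responses had the same length" question: the model-strength gap is what remains after the style column has soaked up the preference for longer answers.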

alpine coral
#

but yeah it seems mostly about length, use of bullet points and markdown - at least per that blog post (they may have refined it since.. i dunno)

ocean vortex
#

what is this error message....

#

lmao

exotic tartan
#

I'm sure it has a huge effect. ChatGPT used to use more emojis than a 12 year old

ocean vortex
#

my user is not too many 😠

torn bison
#

the design is very human

stuck orchid
#

kimi-k2 is the best?

ocean vortex
#

getting blank responses on some prompts

ocean vortex
#

you need to pay monies though

#

I used one-time use virtual revolut card lol

alpine coral
#

oh really? i just had to give a phone number

#

and there's rate limited free usage

#

i also thought kimi-thinking was an older model

#

but is it K2 thinking?

lone vector
#

Not surprised the X glazers were wrong

whole wagon
#

Just above Claude 4 opus

ocean vortex
# lone vector

oh they did update the leaderboard. Well daaamn that's a fail for xAI then lmao

#

if we remove style control it's nr2, but still far away from 2.5Pro

#

my ambitious bet on polymarket didn't pay off then 💀

torn bison
#

I'm shocked that there are actually people in this server who don't buy Google 💀

ocean vortex
#

and risk is not 0. Well at least it used to be that till Grok4 was put up

patent aspen
#

I don't bet on real prediction markets because I'm afraid of getting in trouble with tax authorities

#

It's not legal in the US

ocean vortex
#

Look at what your president is doing...

#

That is legal somehow

torn bison
torn bison
#

They have poor liquidity tho

ocean vortex
patent aspen
torn bison
ocean vortex
#

I just can't help but view any laws in the US as a wild west currently. Very ironic to assume some innocent trading is illegal while crypto meme-coins and rugpulling seem to be completely legal and promoted by the government lol

patent aspen
#

It's not enough money where I'd want to take the risk

#

The AI markets are so small that I could move them

torn bison
#

The real big markets are geopolitics and sports, but it's much harder to be an insider in those areas haha

small haven
#

why do all chinese sites need cross site cookies, freaking xi'd

keen beacon
#

did you login with google?

ocean vortex
ocean vortex
#

Heavy would probably be like 2nd-3rd with style control

small haven
#

i dont have a chinese number

ocean vortex
#

Whoever at xAI thought it's a good idea to make the model output 30k+ hidden reasoning and then respond with 1 word to 1 sentence is an idiot tbh

keen beacon
#

it depends on how their cot looks, but i've suspected before that people did (something similar) so their rl loop is faster

ocean vortex
#

The CoT is hidden, so it's just bad. Users don't care what is convenient for them during training lol

arctic lark
#

Hello, I was just wondering how people are accessing the anonymous models? I saw on X there were 5 of them but cannot see any of them on the site.

keen beacon
#

use the battle feature (instead of direct chat) and you have a chance to get one of them

patent aspen
#

It spends ages computing the answer "42" but then nobody knows what the ultimate question was

keen beacon
#

Do you guys find kimi > gemini 2.5 ?

#

1 or 2 ?

#

imo 2 lol

#

1 is kimi2 , gemini 2.5 is 2nd xD

torn mantle
#

1

keen beacon
#

both are hilarious but i prefered kimi

dawn wharf
ocean vortex
#

webdev arena Grok4 is 12th

#

absolutely destroyed

#

So yeah... It's not benchmaxed. It's just that this early testing was very sus 🤔

#

Don't think a single result that was done on public API put it SOTA on any benchmark

zinc ore
#

I really want to see some independent testing on USAMO at least

ocean vortex
#

not Aider, not livebench, not lmarena / webdev, simple-bench not yet on leaderboard but that's unlikely as well....

ocean vortex
whole wagon
#

The web dev arena is due to poor spatial understanding which has been stated again and again

#

Multiple times during the livestream itself

ocean vortex
#

since otherwise it's spatial awareness benchmark

#

so they were contradicting themselves? 💀

whole wagon
#

No they were surprised by the arc agi score also lol

#

They said it should have performed badly

#

In the livestream

ocean vortex
#

Well then their claims have even less credibility LOL

#

Poor spatial awareness is bad for something that is trying to be a top model

#

And what's with the whole of twitter being "blown away" by its abilities to do svg drawings at the time of release then.... catgrin

whole wagon
#

It's not good at SVG drawings

#

Compared to other models

ocean vortex
#

I tried it myself - wasn't impressed - just left it at that.

#

💀

#

The issue is that it's mostly the same with anything you try using it on

#

Well not literally, but it performs poorly on numerous things...

#

Which is why I said at the time that I don't ever recall a model with such a disconnect between claimed benchmark results and how it performs IRL

ocean vortex
#

cost would be 10 to $20 for the entire thing I think

#

Task1: Let k and d be positive integers. Prove that there exists a positive integer N such that for every odd integer n>N, the digits in the base-2n representation of n^k are all greater than d.

unborn ocean
ocean vortex
#

Let's see if it was overfitted on that R1 response for task1 👀

ocean vortex
#

Just got reminded why I hate this thing...

#

if it's doing this useless flood at the very least it should be fast

#

but it is not

unborn ocean
#

i am not sure, but i think that they are doing manual human grading for the usamo bench -> takes longer

ocean vortex
#

It's essentially Kimi2 speeds

unborn ocean
#

i can remember them doing a paper about that..

ocean vortex
#

except you are paying for it like it is hosted properly...

#

Ok I don't think this is correct

#

"N exists" boxed answer? 🤣

leaden palm
ocean vortex
leaden palm
#

moonshot?

ocean vortex
#

Not groq or whatever this is

#

Though even groq is sub 200 tok/sec

#

So I'm not sure what is 300, probably nothing lol

leaden palm
leaden palm
#

both groq and crofai can reach for 500 tok/s if the prompt is easily predicted

ocean vortex
#

yeah I haven't heard about them either

#

but sounds promising if those are true speeds

ocean vortex
# leaden palm moonshot?

Yes. xAI only does like 50 tok/sec. That's way slower than what OpenAI or Anthropic are doing... And fairly comparable with Moonshot AI who are GPU poor 🧐

#

relatively
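
Back-of-envelope for why that speed hurts with long hidden reasoning. This is a sketch that combines two figures mentioned in this chat (roughly 30k hidden reasoning tokens and roughly 50 tok/sec); both are rough user estimates, not measured values.

```python
def wait_time_seconds(hidden_tokens: int, tokens_per_second: float) -> float:
    """Time spent generating tokens the user never sees."""
    return hidden_tokens / tokens_per_second


# Assumed figures from the chat: ~30k reasoning tokens at ~50 tok/sec.
wait_seconds = wait_time_seconds(30_000, 50)
print(f"{wait_seconds:.0f} s, i.e. {wait_seconds / 60:.0f} min before the visible answer starts")
```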

ocean vortex
leaden palm
#

think or's measured throughput number fluctuates, seems to currently be below the claimed speed of 185 tok/s when it used to be above

ocean vortex
#

Was getting around that when I tried it. Seems close enough to me. They probably update the average as it gets used more

#

Grok's task1:

sage cradle
#

Evening - is there any equivalent to LMArena for Video Gen testing?

sacred plaza
#

LMAOOOOOOOOOOOOOOOOO.

lone vector
#

I hope nobody was betting on grok...

torn mantle
whole wagon
#

I didn't bet on grok lol

torn mantle
#

it was @candid storm mb

ocean vortex
#

Attempt 5 was even slightly better than 2 but only first 4 were supposed to be rated so... 🤷‍♂️

#

0.1 temp

#

It sabotages itself with concise responses very easily

#

What's clear is that on this task it is definitively worse than R1, 2.5Pro or o4-mini. But exact score may depend on chance somewhat.

keen beacon
sage cradle
# leaden palm artificial analysis

That was very interesting, thank you - saw a few models I hadn't heard of. None of the ones I use in open source, interestingly. I assume there isn't one where you can choose prompts like LMArena - I've found that a super interesting way of comparing the models and I do enjoy the battles

ocean vortex
#

Looking at some of the leaderboard scores... Flash-thinking run3 was 7/7, run2 - 2/7 and then runs 1,4 - 0/7 lol

sage cradle
#

interestingly, I use Perplexity / Comet for my every day use - I am a massive believer in being model agnostic where possible. I do wonder how much Perplexity's front end changes how the models behave though - especially with their search focused paradigm

ocean vortex
#

If including runs 2-5 for Grok instead, that would be 60.7%. Though it would be more of a max score lucky attempt I think

torn mantle
ocean vortex
#

Kinda mind blowing that there's so much variance possible on USAMO though to be completely honest... With only 6 tasks in total they should have made like 10 attempts for each IMO
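
To put a number on that variance, here is a small sketch using the four Flash-thinking run scores quoted above (0, 2, 7, 0 out of 7). The "10 attempts" case simply assumes the same run-to-run spread would hold with more attempts.

```python
import math
import statistics

# Scores out of 7 for runs 1-4, as quoted in the chat.
runs = [0, 2, 7, 0]

mean_score = statistics.mean(runs)
spread = statistics.stdev(runs)  # sample standard deviation across runs

# Standard error of the mean for 4 runs vs a hypothetical 10 runs.
se_4 = spread / math.sqrt(len(runs))
se_10 = spread / math.sqrt(10)

print(f"mean {mean_score:.2f}/7, run-to-run stdev {spread:.2f}")
print(f"standard error with 4 runs:  {se_4:.2f}")
print(f"standard error with 10 runs: {se_10:.2f}")
```

With only 4 runs the uncertainty on the average is almost as large as the average itself, which is the point being made about needing more attempts.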

torn bison
#

I remember ricardo sold it when xai was around 37c 30c

torn bison
#

it's about 300% profit for him

dawn wharf
lone vector
#

yeah

sage cradle
#

Grok 4 is super chatty and 1st person - I quite like it - (though I really don't want to)

rare python
main gulch
olive mesa
#

is grok 4 AGI according to "current definitions"?

#

it's definitely past what AGI was defined as in 2023

sturdy mica
echo aurora
#

Hey everyone - we're aware of issues with LMArena at the moment, team is working on a fix asap!

sturdy mica
#

anyway that website is a pretty good alternative and has direct chat with web browsing and such

#

kinda looks better than lmarena

whole wagon
#

IMO meta is going to go closed source

echo aurora
sturdy mica
#

just tested it, i see

whole wagon
#

This is basically going to turn into Chinese open source Vs US closed source fight

candid storm
tidal schooner
wintry tinsel
jade egret
#

i just noticed you can use claude 4 opus completely free in lm arena 0_0

sacred quail
quartz light
verbal nimbus
#

Gemini Pro 2.5 likes to hallucinate so much that it explained code I forgot to provide without even telling me 🤔

rare python
dusky aurora
whole sundial
#

@echo aurora he's back for the third time

tidal schooner
#

<@&1349916362595635286> holy glaze

keen beacon
#

Lol

#

Talking about dev freelancers, what sites are there to hire them? Fiverr and upwork seem overhyped and overpriced

whole wagon
#

👀

#

Those are the guys we see in the livestreams lmao

#

When they announce products

pallid crypt
#

Lol

dawn wharf
civic flame
rare python
torn mantle
#

crazy

#

makes me wonder why many left

#

is it like a domino effect

alpine coral
# ocean vortex since otherwise it's spatial awareness benchmark

it's not tho.. it tests abstract reasoning and pattern discovery. yes spatial recognition matters, but the coloured grids are just the medium, not the point. whether a model can spot the rule and generalise it is the point of each puzzle, nothing to do with pure geometry / spatial awareness.

alpine coral
#

jules that made no sense to me

leaden sun
#

my bad, i realized i was thinking something else

ocean vortex
soft kernel
#

Is o3 down?

hardy pecan
#

It seems weaker over the last few hours

#

Working for me but feels bad

pure anvil
whole wagon
#

Don't think they got $100M

#

Zuck said it's fake

#

Sam is a chronic liar it's probably strategic

pure anvil
#

so is zuck but maybe the workplace culture is less stressful at meta

whole wagon
#

Dunno. If openAI working hours are that long one has to wonder why they aren't delivering

pure anvil
#

and even if openAI achieved AGI or ASI or something I doubt sam would credit the researchers at all

pure anvil
tall summit
whole wagon
#

Yeah but the projects from before are also delayed

#

Like GPT5 and the open source model

pure anvil
#

if openAI was a bit more open we'd know lol

whole wagon
#

Sam lied and said it was due to safety concerns 😂

pure anvil
#

that's such an annoying larp

#

like bruh

#

nobody believes that

keen beacon
whole wagon
#

Kimi K2 is same price as gpt4.1 mini. The Chinese have absolutely cooked when it comes to efficiency

#

Like it's a totally different league of performance now

#

Though I would still like some strong smaller distills to run locally and/or fine tune :p

pure anvil
#

k2 thinking should be fun

whole wagon
#

I would guess it ends up o3 level

pure anvil
#

most likely. their deepresearch based on their older model, k1.5, is currently SOTA

whole wagon
#

The interesting thing is what they are doing is not particularly complex. It makes me wonder what the US labs have even been doing all this time

#

Like there really isn't much magic to make Kimi K2

#

There are a few nice things in the optimizer using muon, some hyperparam changes and that's it over deepseek

vivid wigeon
winged locust
#

📰 Everything else in AI today
Mistral unveiled Voxtral, a low-cost, open-source speech understanding model family that combines transcription with native Q&A capabilities.

Google revealed that its AI security agent, Big Sleep, discovered a critical security flaw that allowed Google to stop the vulnerability before it was exploited.

U.S. President Donald Trump announced over $92B in AI and energy investments at a Pennsylvania summit, saying America's destiny is to be the "AI superpower."

Google is investing $25B in data centers and AI infrastructure across the PJM electric grid region, including $3B to modernize Pennsylvania hydropower plants.

Anthropic launched Claude for Financial Services, a solution that integrates Claude with market data and enterprise platforms for financial institutions.

Nvidia plans to resume sales of its H20 AI chip to China after CEO Jensen Huang received assurances from U.S. leadership, with AMD also resuming sales in the region.

sage cradle
sage cradle
torn mantle
#

what happened?

#

believe it or not, the closest lab to AGI is google

pure anvil
vivid wigeon
#

Google has been a leader in AI for over a decade, MSFT just has better marketing

#

Sam is master of marketing

vivid wigeon
#

Couldnt find any market for AGI on Polymarket

unborn ocean
keen beacon
ocean vortex
#

And R1 is still a better deal pricewise if we are being accurate 👀

#

V3.1 is much cheaper, R1.1 is a little bit cheaper + performs better

#

When it comes to Qwen3-235B, alibabacloud pricing for it with thinking enabled is $8.4 per 1M output. Not because it's less efficient but because more profit. 😊

pure anvil
#

the reasoning tokens of r1 will make it more expensive than k2 in actual use

ocean vortex
#

You can, K2 is very very verbose

pure anvil
ocean vortex
#

R1 is less expensive per 1M tokens, so the total price difference in using them is not gonna be much

#

but R1 is more capable

pure anvil
# ocean vortex R1 is less expensive per 1M tokens, so the total price difference in using them ...
Artificial Analysis

Compare AI model performance on Artificial Analysis Intelligence Index. A composite benchmark aggregating seven challenging evaluations to provide a holistic measure of AI capabilities across mathematics, science, coding, and reasoning.

ocean vortex
#

It outputs much much more than V3.1

pure anvil
#

we are talking about r1

ocean vortex
#

which is just a better deal, but yeah gonna be somewhat less expensive than R1

#

but still... Performance difference is notable

pure anvil
pure anvil
ocean vortex
#

yeah fair - it outputs less than R1. But it's not much better than V3.1 while being MUCH more expensive

pure anvil
#

I have a feeling you've never used LLMs through an API before

ocean vortex
#

2.5X price difference AND it outputs twice as much lol

ocean vortex
#

Someone didn't do well with math at school

unborn ocean
#

i think he is talking about r1 vs kimi

#

idk

pure anvil
#

it's not even a comparison tbh

pure anvil
ocean vortex
#

Well with R1 it is clear we already established it outputs more...

ocean vortex
pure anvil
ocean vortex
# pure anvil ??????

I said that before looking up the exact token outputs. But it performs better so it's not like you are paying for nothing. You are paying more for better performance. It's not marginal it's reasoning vs no reasoning - big difference

ocean vortex
#

2.5X price difference and twice the output length
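
The arithmetic behind that claim, as a quick sketch: per-request cost is price per token times tokens produced, so the two factors multiply. The prices and token counts below are placeholders chosen only to match the "2.5X" and "twice" ratios, not real price-list values.

```python
def cost_per_request(price_per_million_tokens: float, output_tokens: int) -> float:
    """Output cost of one request in the same currency as the price."""
    return price_per_million_tokens * output_tokens / 1_000_000


# Hypothetical baseline: 1.0 per 1M output tokens, 1000 tokens per answer.
cheaper = cost_per_request(1.0, 1_000)
# Hypothetical comparison: 2.5x the price and 2x the output length.
pricier = cost_per_request(2.5, 2_000)

ratio = pricier / cheaper
print(f"effective cost ratio: about {ratio:.1f}x")
```

So 2.5x the price with twice the output comes out to roughly 5x the cost per answer, regardless of the absolute numbers plugged in.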

#

It is very obvious that pricing for V3.1 is more competitive, there's nothing to argue about... 🤦‍♂️

#

AA index delta R1.1 to K2 is 11 points, and V3.1 to K2 is only 4 points...

floral merlin
#

Kimi-K2 is generating output significantly slower on LMArena due to its size compared to other LLMs. Do you think this could cause a negative bias in user scores because of that fact? 🤔

ocean vortex
#

I would expect for them to improve/fix it. Cause yeah people shouldn't be voting based on speed lol

#

yeah this seems like the safest bet, works for reasoning models too. There's no streaming until reasoning is finished

rare python
#

Bad UX for me

#

Then I refreshed lmarena and somehow they already finished generating 💀

leaden meteor
#

Any chance deepthink/grok heavy/opus 4 thinking are going to be on Arena ?

keen ferry
leaden meteor
#

If o3 pro is not on arena, then it is fair to assume that grok 4 heavy is also not coming because they are too slow/expensive?

leaden meteor
#

oh. why?

unborn ocean
#

per token, but for the average chat request o3-pro is more expensive than 4.5 and more importantly 4.5 is a new model and o3-pro is just mild enhancement over o3 for most users (in the lmarena)

#

that is likely also the reason why we don't have o3-high

tribal aspen
#

Hi

#

when are we gonna get internet search?

cedar tide
#

@echo aurora not good design in the end of select model in direct chat

echo aurora
#

(I'll note the one you've already provided, but for future feedback that'd be helpful)

indigo hazel
echo aurora
#

We are collecting examples of when the community feels like the content filter is acting a bit overzealous and flagging false positives in this thread #1376956905016004759 , if you could provide your example there that'd be ideal. blobthanks

echo aurora
indigo hazel
tiny swan
#

i want to transform this comic panel but it violated the terms, why? it's a comic, not a real image. this is my prompt: Redraw this comic page into Arcane-style art with soft cinematic lighting

echo aurora
hoary plaza
#

Weird it's on battle mode?? I never came across it

echo aurora
#

Taking this opportunity to remind everyone about our July Contest! I want to see more out of place object in space! Details here.

ocean vortex
misty vault
#

OpenAI

hybrid locust
#

hello I'm curious

#

when will settings like temperature be added

#

it is on the legacy site so why not on the new one

whole wagon
#

Deepseek inference has wide margins

keen beacon
#

Yeah deepseek said themselves they have a 545% margin lol
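
What a 545% margin implies for cost, as a sketch. It assumes "margin" here means profit divided by cost (a cost-profit margin), so revenue is 6.45 times cost; if it were profit divided by revenue the number could not exceed 100%.

```python
def cost_share_of_revenue(margin_pct: float) -> float:
    """Fraction of revenue that goes to cost, given margin = profit / cost."""
    return 1.0 / (1.0 + margin_pct / 100.0)


cost_share = cost_share_of_revenue(545)
print(f"cost is about {cost_share:.1%} of revenue")
```

Under that reading, serving cost would be roughly 15 to 16 cents per dollar of (theoretical) revenue.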

gusty helm
#

#notapsiop

ocean vortex
gusty helm
#

I mean, I didn't have any sort of expectations from it 😂

ocean vortex
#

Can do a poem about Russia, but if I ask "more negative" one - same message... lmao

echo aurora
ocean vortex
#

They basically said this would have been the case if V3 was priced the same as R1

#

still amazing though that they have profits with current pricing ngl

#

Taking all of this into account, their cost to run R1 is then lower than V3 pricing. Provided that demand stays the same. Insane

#

Though for any of it to actually be profit, they can't be serving their models for free on official chat UI like they are doing now 👀

candid storm
#

OpenAI browser tomorrow?

rare python
zinc ore
#

Bro Google, release deep think already

small haven
#

i would rather have 2.5 ultra

keen beacon
echo aurora
lone vector
#

What company is behind 'octopus'

small haven
#

is it? i havent tried

#

for me, it's not about making one-off scripts, but making edits in a codebase. been using cc, already satisfies me. but keen to hear about kimi if it can do agentic coding

#

hmm

tawny kelp
#

I asked the arena if I should use a database for my book collection or if I should use an ILS software.
Deepseek: "That's a fascinating question! Let's go over the benefits and drawbacks of each one..."
Claude 3.7 Sonnet: "Are you kidding me? Don't bother with ILS software"
I found the stark difference in responses funny.

leaden meteor
candid storm
#

If its GPT5, why the cursor?

lone vector
hollow ocean
#

August 20 is the day

whole wagon
#

Imagine if it's some defense bs lmao

#

Cos they drew a pentagon lol

hardy pecan
#

Anyone else's Gemini 2.5 pro being lazy and omitting output?

#

Mine's become sloppy since yesterday

rare python
hybrid locust
whole wagon
#

Kimi K2 taking a while to score

#

I thought it would be on the leaderboard by now

dusky aurora
keen beacon
whole wagon
#

It's all very close, it'll be like 1410 Elo I guess

calm sequoia
#

Guys, why is this such a big deal? The 50% success rate is shamefully low. Furthermore, the latest model depicted here is Claude 3.7. Shouldn't the Claude 4 Opus and o3 already be way higher? Is this just hype marketing?

pallid crypt
#

Strange

unborn ocean
cedar tide
#

What are your thoughts on the latest models from Amazon, Kraken, and Folsom?

unborn ocean
#

not including the models -> makes themselves look better

#

bc they are prob a wrapper for these models

calm sequoia
unborn ocean
#

who posted it?

calm sequoia
#

Everybody on twitter is talking about it

unborn ocean
#

oh, don't have twitter, so...

#

but if it is that popular it might be oai or anthropic agent ?

calm sequoia
unborn ocean
#

reposts like this here, lol

calm sequoia
unborn ocean
#

yeah seems likely, but again not including the newer models is really misleading

calm sequoia
#

It's taken from the paper which was written long ago

unborn ocean
#

yeah ik, but reposting like this

#

just makes it look better than it is

whole wagon
#

openAI doing anything but releasing a model

civic pulsar
#

Grok 4 doesn't work anymore. It shows it is grok 2.
Has anyone noticed it?

native idol
#

New UI is bad. Like a serious downgrade.

sturdy mica
#

openblock labs

#

is so funny

sturdy mica
#

super slow

#

annoying to use

#

wish they'd update legacy UI

#

anywayyy

#

openblock labs' discord server is not done yet

#

so you can join it and make emojis, stickers, post in announcements, welcome, etc

#

its so funny

#

ive just been having some fun posting @ everyone in the announcements

#

im having a blast

#

little fun time

whole wagon
#

Actual clown company

languid crescent
#

Ui looks clean 👍

whole wagon
#

The stream is just announcing a future product

#

Not one that will be available soon

tall summit
hazy quest
#

OpenAI released an update to the image editor in the API. They claim it now only edits the selected parts, instead of redoing the whole image (leading to changes in faces, details, etc). Has anyone tried it?

https://x.com/OpenAIDevs/status/1945538534884135132

We've improved image generation in the API. Editing with faces, logos, and fine-grained details is now much higher fidelity with features preserved.

Edit specific objects, create marketing assets with your logo, or adjust facial expressions, poses, and outfits on people.

whole wagon
#

Quiet since nobody works there anymore Kappa

#

Or maybe they just don't have much to post about

#

chatGPT ads are coming also

#

Hm idk if that will work. There is better competition that has no ads

red sluice
#

I really thought Grok would make it to top 3. 5th place is kinda underwhelming

whole wagon
#

What's going on here

#

Why is openAI odds spiking

hazy quest
whole wagon
#

Both the aug and Dec odds are up a lot

tidal schooner
whole wagon
#

People really think it could be GPT5?

#

Interesting

tidal schooner
#

still really speculative tho imo

whole wagon
#

Because if it was GPT5 it wouldn't launch straight away

#

And the guy that said end of summer is head of chatgpt

#

So why would he comment unless it's about gpt

#

I think it's gonna be like o3 preview they are going to show some evals

ocean vortex
#

where is deep think?

whole wagon
#

There wasn't a dedicated livestream just for deep think

ocean vortex
#

There was an announcement

#

but no model

#

where is it?

#

That's a much bigger clown show

#

It took less time for OpenAI to hint at and then actually release o3-pro than for Google to release deep think after announcing the benchmarks lol

#

Honestly people just love drama and are reading way too much into recent events at OpenAI. GPT5 release most likely gonna be of fairly similar significance as o3 and everyone will forget this entire talk then... catgrin

hazy quest
#

It could at least be an announcement of an announcement ("In the coming weeks..."), with more benchmarks for GPT-5.

whole wagon
#

Imagine taking a LLM hallucination as truth

#

Truly genius

hollow ocean
#

Its 4o too

hazy quest
#

Who said it was taken as truth? It's just a potential clue, in addition to the "hidden" pentagon. It's probably a complete hallucination, but could be private information that leaked into the system prompt/training data. Unlikely doesn't mean impossible.

unborn ocean
#

no way, the training data from 4o is likely really old, even the stuff from the cpt

#

openai likely does not know when they will release themselves ..

#

(like not the particular day)

ocean vortex
leaden sun
# hazy quest

interesting that it has mentioned a "deeper contextual memory framework" for gpt-5, pretty much inline to what o3 told me a few weeks ago

drifting thorn
#

You better trust Deepmind lab's paper for the "framework" thingy

mossy drum
#

New model in Arena: folsom-07152025-1

dusky aurora
#

Gemini has become so detached and impersonal

keen beacon
#

Most personal models are the ones after top 5

dusky aurora
#

I talk to it to get conversation and dialogue,but get a machine

keen beacon
#

Pi 3 is very good for emotional stuff

leaden sun
#

maybe it depends on the conversation topics?

dusky aurora
#

I discuss all the same topics

#

05-06 was a very good model, I rarely had to make clarifications, now I have to do them all the time

#

why do model makers think judgmentalness is a good thing?

#

currently it seems as if I am its subordinate

#

nowadays Gemini doesn't listen to me and talks over me

#

if that's a way to demonstrate what mansplaining is by LLM-splaining,then it hurts

leaden sun
keen beacon
#

If they didnt do that, talking to ai would be no different to talking to another person on the internet