#general


ocean vortex
#

it's at a clear disadvantage against competition in some areas, and this includes all their reasoning models as they are based on it

#

I wouldn't go as far as that, as the trend lately has been downsizing... So maybe in the long run the trend/progress will catch up to it and it's not gonna be too small. But for now it does seem suboptimal as people just can't get spatial awareness at a top level without decently big model size

#

though we also need to keep in mind that we just do not know what could potentially be done with RL training using notably bigger model. Maybe it's diminishing returns, but maybe the gains are actually more substantial

lavish orchid
#

anyone else have the problem of Gemini 2.5 Pro not using terminal in Cursor?

#

Claude 3.7 runs it 20 times and does everything I ask it, Gemini 2.5 Pro is unsure and asks to install libraries etc

#

prompted it in user rules too no change

ocean vortex
#

it would be unlikely to scale the same way as normal classic chat LLMs

#

so I would think "perfect size" for that is something different (bigger), even with the current metrics and their limitations

tall summit
#

WHAT

#

AAAAAAAAA

small haven
#

u just want to see the world burn dont you 😭

ocean vortex
#

how is this real... 😭💀

small haven
#

it was half that last week

#

o1 pro is unlimited lessss gooo

wintry locust
#

for o1 it's called juice

ocean vortex
balmy mist
#

we still aint got no new stuff?

ocean vortex
#

people gonna have to worry about this stupid yap score now when trying to eval openai models 💀💀

balmy mist
#

what happened to r2?

leaden palm
#

probably wont be released super soon tho

#

polymarket: 9% chance in apr, 90% chance apr/may/jun

#

manifold: 25% chance in apr, 88% chance in apr/may, 93% chance in apr/may/june

ocean vortex
leaden palm
#

@keen beacon alt?

keen beacon
#

ive never seen a model do this 🤣

oblique flint
small haven
#

how tf is this not being arbitraged
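A rough sketch of the arithmetic behind that question, using the two April prices quoted above (9% and 25%). It assumes both markets resolve on identical criteria, that NO costs exactly 1 minus the YES price, and that there are no fees, spreads, or currency differences. None of those assumptions come from the chat, and the function name is made up for illustration.

```python
def two_market_arbitrage(yes_price_a: float, yes_price_b: float) -> float:
    """Guaranteed profit per 1.00 of payout, under the assumptions above.

    Buy YES where it is cheaper and NO where YES is more expensive.
    Exactly one of the two positions pays out 1.00 whatever happens.
    A result <= 0 means there is no arbitrage.
    """
    cheap_yes = min(yes_price_a, yes_price_b)
    expensive_yes = max(yes_price_a, yes_price_b)
    no_price = 1.0 - expensive_yes   # assumed: NO = 1 - YES, no spread
    total_cost = cheap_yes + no_price
    return 1.0 - total_cost


# Prices taken from the two messages above (April release odds).
profit = two_market_arbitrage(0.09, 0.25)
print(f"profit per share pair: {profit:.2f}")
```

In practice the gap only counts as free money if the two markets really are the same bet and can be traded in the same currency.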

keen beacon
leaden palm
#

ah forgot that exists

knotty jetty
#

Is there a reason why they don’t include Claude’s web search function for the web search leaderboard

keen beacon
#

yea to make sure it hits 1345 words. i didnt request that 🤣

knotty jetty
#

Ok

#

But it’s only in premium just so you know

#

What do you mean

#

The lmarena api

#

Ok

#

So do you know why they haven’t added it

#

Oh

#

No it’s really good

#

I have perplexity pro and I feel like Claude has gotten many more answers right

#

For real

#

Oop

#

Well my dad uses perplexity too

#

Should I tell him to stop paying for it

elder rapids
#

it's crazy how media is so far off from the reality of these products

#

and fake news is the reason why a lot of them still live

#

perplexity should ong be dead rn

#

dawg

#

why are you saying it's a scam

#

because you agree with me

#

😭

#

ye

#

that's what I'm saying

keen beacon
#

u used to get like 600 messages a day on perplexity

knotty jetty
#

My dad uses its API, he runs a music player company and he uses it to find creators numbers and their emails and tells them stuff. Is he cooked or nah

keen beacon
#

with a lot of models, even if u dont use it for search

elder rapids
#

there's always going to be replacements

#

in the AI industry

#

maybe even made by openAI

knotty jetty
elder rapids
#

depends

#

that kind of stuff is hard to get wrong

#

especially for an LLM

knotty jetty
#

Ok thank god bro

keen beacon
#

u cant disable it now?

#

thats dumb

knotty jetty
#

My dad has been working on ts for 10 years and I’m scared ai is gonna screw him up

elder rapids
#

if he adapts

#

he'll be elevated

#

there's not much downsides

#

still depends

knotty jetty
elder rapids
#

since it'll get to a point where not taking jobs

#

becomes unethical

elder rapids
#

who knows

#

ong

earnest parcel
#

gemini is absolutely not impressed by sonnet 3.7:thinking chess moves it seems 😄

elder rapids
knotty jetty
elder rapids
#

it seems like the best chess model currently tbh, though as base

#

the gpt models

#

might be the best

#

especially 4.5

earnest parcel
elder rapids
#

but when you tweak them

#

2.5 pro is the best

#

nahhh

knotty jetty
elder rapids
#

the industry doesn't allow that

elder rapids
#

but I mean

#

prompting wise

#

not temp control

#

and other dev stuff

knotty jetty
knotty jetty
#

Disappointing

#

See this is why screen burning is gonna fail bro

elder rapids
knotty jetty
knotty jetty
#

Proudly black owned business if that interests anyone at all💀

elder rapids
#

send that to him not me 💯

knotty jetty
#

Bro come on

keen beacon
#

yea 🤣 qwq 32b preview

knotty jetty
#

Blacks I wild, just call us black people bro😭

keen beacon
#

they added a sh1t ton of rl on top of it probably more than qwq full tbh

meager sun
#

👹 evil

knotty jetty
#

Dw I don’t really care

elder rapids
#

deadass

#

can't even say type sh**

#

you can suspect it, but not reasonably believe it imo

#

they're not in the same position DeepMind is

#

they don't have the data scientists

#

they don't have the researchers

#

ye but I think we can assume they have the researchers, but not insane data scientists

#

nah not in comparison

#

just as it is

#

ye ofc

#

but it's not necessarily sacrifice

#

in the way it's suggested

#

deepmind

#

anthropic

#

private institutions

#

universities

#

ion think anyone who works at these companies primarily subscribe to the ideals

#

ye

#

I would say only the really top guys

#

that represent those ideals

#

which is inherent to the ideology themselves

keen beacon
#

twink

elder rapids
#

I mean, if I were a standard worker

#

I wouldn't care about these things

#

I'm trying to work hard and get research in lmao

#

for money

elder rapids
#

but can output quality for the direction the company intends

#

ion know about the specific situation too much with Ilya

#

but that's prob what happened, and he likely shifted

ocean vortex
#

no "secret sauce". Just a head start when it mattered + userbase and some really smart engineers. Funding helps as well ofc

keen fulcrum
elder rapids
#

this is the same thing someone sent earlier lmao

#

"(This content is from public information and is for reference only and does not constitute investment advice) Investment is risky, please be cautious when entering the market!"

#

just in a different format

#

or actually prob where they got it from

keen beacon
#

damn u know chinese lol? or just guessed it immediately

elder rapids
#

guessed it immediately

#

but I know a little Chinese

keen fulcrum
#

I believe this to be the case
let's await next week and hopefully get some news
R2 and Qwen 3 are due to release soon

keen beacon
#

the qwen 3 release seems to be significant, they did llama cpp prs/transformers prs/vllm prs/mobile apps/etc far before the release of qwen 3

keen fulcrum
#

I believe Google will drop theirs soon after R2

keen beacon
#

im hoping they release a qwen 3 reasoning model off the bat, but im most excited for new pretrained base models for fine-tuning, etc. qwen 2.5 was exceptional

keen fulcrum
#

The coder models

elder rapids
#

which undermines its value as a concept stock

#

since it's not new Information

keen beacon
#

yea theyre releasing smaller models too

#

maybe a 32b alternative but moe so itll inference faster

elder rapids
#

I truly don't think it's going to get that much better from deepseek

keen fulcrum
#

Oh indeed browser integrated llms soon to be the next thing

elder rapids
#

get rid of 2.5 pro, get rid of o3s and o4 minis release

#

let r2 release

#

do you seriously think the gap would've become THAT wide

keen beacon
#

i dont know what to expect with r2 tbh

elder rapids
#

without those 2 crazy releases

#

nah

keen fulcrum
#

There is the possibility qwen 3 will outperform R2, let's see

elder rapids
#

I can't believe deepseek would've accomplished that

keen beacon
#

i dont think r2 will outperform 2.5 pro at least in simpleqa i think

elder rapids
#

let alone at the level of o3

keen beacon
#

i use 2.5 pro on stuff that requires a lot of world knowledge/niche world knowledge

#

its exceptional (compared to other reasoning models)

elder rapids
#

especially when r1 wasn't really that good

keen fulcrum
elder rapids
#

😭 🙏

keen fulcrum
#

Why?
AI got significantly better as soon as R1

elder rapids
#

that's literally impossible

#

the time period is too narrow

#

that means they weren't planning on releasing o3 mini after the announcement

#

and it takes a ton of time

#

for them to prepare it

#

won't release it on a whim like that

#

unless it's truly done

#

especially with how integrated it was

#

take a look at 2.0 flash thinking

#

lol

#

same thing

keen beacon
#

it defo was the reason and the reason why they started working on improving the reasoning summary

elder rapids
keen beacon
#

i dont recall r1/o3 mini timelines that much tbh ive no idea about timeline

elder rapids
#

since people were whining about it

#

you cant attribute any of these AI things to the release of deepseek r1

#

timelines don't add up

#
  • understanding what even goes on behind these ai
#

and what entails the integration

keen beacon
#

vision model and btw thats a bad way to test lol

brittle tiger
#

grok has good team from what i can tell. i heard they pay way more than other labs, basically a working for elon tax

elder rapids
#

probably ye

#

but

#

the thing is, they're not as old

#

as other labs

torn mantle
#

I cant get enough of o3

elder rapids
#

and that's a serious factor

elder rapids
#

jkjkjk

#

nahhh

#

I think it's actually starting

#

you cant get the jump from 2.0 flash thinking to 2.5 pro

#

without a major breakthrough

#

oh

#

wait wym?

#

era of huge growth

#

oh ye

#

if r2 doesn't close the gap tbh

#

I can reasonably assume

#

open source is going to be pretty bad

#

for a little while

#

until they come up with something

keen beacon
#

its their time this time

#

hopefully

#

im not sure about deepseek but the qwen chat website was updated with strings of a qwen plus sub with video gen, image gen, access to qwen 3, etc

knotty jetty
#

I’ve tried everything aside from deepseek tbh

#

No like not deep research

#

Just like general search

#

What’s tavily?

#

How do I use it

keen beacon
#

bruh if u pay those prices lol

#

paying for o4 mini/o3's api and enabling first party web tools, etc on the api is more reasonable tbh

ember rapids
#

Sam did say r1 made them move several releases

#

I wonder what impact r2 will have

earnest parcel
elder rapids
elder rapids
#

I got 2.5 pro to play at around 1900~ ish

#

as that's below my elo

small haven
#

bro what is happening to chatgpt, everything is 64k context max

willow grail
#

Cervical Spine Risk: Rotating your head 180 degrees is generally not recommended. It puts significant stress on the cervical vertebrae, discs, ligaments, and potentially the vertebral arteries that run through the neck bones to supply the brain. Doing this forcefully or if you have underlying neck issues (even unknown ones) could risk injury, nerve impingement, or (rarely) vascular problems causing dizziness or pain.

olive mesa
earnest parcel
# elder rapids probably to be expected without any prompting

this is with prompting, and also there is no way 2.5 Pro (or any language model) comes even remotely close to 1900 ELO. I have tested and played matches and tournaments around 200 times by now (using all types of different methods), and the strongest any LLM ever came was GPT-3.5 Instruct in movetext continuation (aka chess notation recall from training data). Other language models play more in the 400 Elo range, even SOTA.

keen beacon
olive mesa
keen beacon
#

don't leave me like that smh..

olive mesa
#

sorry lol.. ill try not to in the future

keen beacon
#

(i'm joking)

#

don't stress 😭

olive mesa
#

lmao ok 😭

neat apex
#

Relying a lot on book moves or gimmick moves anyway

earnest parcel
neat apex
#

Ah yes, at bullet

#

I am saying the most optimistic scenario ever

#

Who said 1900?

neat apex
#

For some reason at defense it holds very well

earnest parcel
neat apex
#

That chess.com one that gives the estimated one

#

Just like the gif shows, at developing it is great but in the endgame it's trash

#

It must be the reason it gave a high value

earnest parcel
# neat apex It must be the reason it gave a high value

I don't know of any system that can take a game, and determine "ELO" based on it... that would be super inaccurate. Elo is based on your opponents' strength, and the outcome, not on how good your moves looked in isolation... Either way, I have tested a ton, and recorded a lot of games, and most SOTA models play around 400 ELO level (when compared to Lichess opponents), and are unable to beat the weakest Stockfish 14 level (sub 800 ELO)
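To make the "opponent strength plus outcome" point concrete, here is a minimal sketch of the standard Elo expected-score and update formulas. The K-factor of 32 is an assumed, commonly used value rather than something stated in the chat, and the 400 and 800 ratings are just the rough figures mentioned above.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Probability-like expected score of `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def updated_rating(rating: float, opponent: float, score: float, k: float = 32.0) -> float:
    """New rating after one game; score is 1 (win), 0.5 (draw) or 0 (loss).

    k=32 is an assumption; real rating pools pick their own K-factor.
    """
    return rating + k * (score - expected_score(rating, opponent))


# A ~400-rated player facing a ~800-rated opponent.
print(expected_score(400, 800))        # expected score for the underdog
print(updated_rating(400, 800, 0.0))   # rating after the expected loss
print(updated_rating(400, 800, 1.0))   # rating after an upset win
```

Nothing in the formula looks at move quality. The only inputs are the two ratings and the result, which is why a rating estimated from a single game's moves is a different kind of number.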

neat apex
#

Oh

#

What a twist

I think the 1000 elo number is somewhat accurate but limited to hard matches since chess youtubers played against it

#

And for some reason it plays way better at middle game, likely due to memorizing moves

elder rapids
#

you either don't know what level they really are playing at with lack of experience in chess or you don't know how to prompt

#

has to be one of those two

earnest parcel
#

(also you just lowered your threshold by 500 ELO, impressive)

elder rapids
#

yk what just get up a game lmao

elder rapids
#

that's an entirely different claim lmao

#

I'm not saying 2.5 pro plays at that level

raven void
#

GPT 5 at home

elder rapids
#

I'm saying if you urge recollection, it'll be at least 1400

#

not restrictive to 2.5 pro

earnest parcel
elder rapids
#

and the longer it goes

#

it'll definitely deteriorate

elder rapids
earnest parcel
#

really? so I can get any model to play 2k elo also. they played E4

elder rapids
earnest parcel
#

well unlike you I already mentioned I have collected data (169 games as of now, between multiple modes). would love to see anything besides baseless claims about your 1900 or 1400 elo LLMs

elder rapids
#

just saying

#

not a lot of people are that good at understanding models

#

entire runs can easily be invalidated if you don't adjust prompt techniques respective to the model

earnest parcel
#

i am not interested in troll statements. either provide proof for baseless claims or I got nothing to discuss with you

elder rapids
#

wym?

#

I thought we were playing it already

#

that's why I said c5 lmao

earnest parcel
#

saying c5 is not proof of LLM playing 1400 or previously claimed 1900 elo.

elder rapids
#

and outputs

#

lmao

#

it's not that deep

#

dude actually blocked me lmaoooo

#

😭

#

anyone want to go against 2.5 pro?

#

just for fun, and for the sake of testing

keen beacon
#

They need to delete this new 4o personality

elder rapids
#

it's creative

#

but saying anything to it poisons the well

#

adjusting to the user is cool, that's what I like about 2.5 pro

#

but goddamn

#

I have to prompt it everytime I talk to it

#

to be the way I like it

keen beacon
#

2.5 pro is great to talk to in comparison. They are trying way too hard with the new 4o etc. I didn't think they'd keep trying to force it this hard

elder rapids
#

and how it's better than 3.7 sonnet

#

etc

#

but it's kind of surreal

keen beacon
#

Maybe they like the sycophancy lol

elder rapids
#

knowing how synthetic 4o is

#

compared to the sonnet models

small haven
#

what they need to do is release

#

o3 pro

elder rapids
elder rapids
#

back to the pro plan I go

#

🙏

small haven
#

its gon be worth it

#

its worth it rn but i want worthier

leaden palm
#

if o3 is deep research lite whats o3 pro

elder rapids
#

if Google releases another model after that tho

#

ion know what I'm gonna do

#

imagine anthropic releases 4.0

leaden palm
small haven
#

o3 is deep research

elder rapids
small haven
#

o3 pro is deep anus research

elder rapids
#

have you guys tested the deep researches enough

#

I don't use the Gemini one or the openAI one that much

small haven
#

yes

elder rapids
#

so I'm not sure which one is better

leaden palm
#

somehow my stupid scaffold with exa and gemini flash+pro outperforms all the other free ones

small haven
#

nothing beats it

elder rapids
#

or nah

small haven
#

obviously

elder rapids
#

you have to get the subscription ig

#

to even use it

keen beacon
#

What do y'all usually use deep research for

small haven
#

*trial first month

elder rapids
small haven
#

but it just spits out mumbo jumbo, not wurf

elder rapids
#

wym?

#

its done what I've asked it

#

very well

small haven
#

oai deep research has high entropy info for every sentence

#

gemini dr is just stale and a bunch of unneeded detail

elder rapids
small haven
elder rapids
#

actually nah

#

I'll just read back

#

and do one for 2.5 pro with the same prompt

keen beacon
# leaden palm

Hmm what are you expecting out of "demo results" btw lol

leaden palm
elder rapids
#

I disagree a ton with the density of info, but the formatting seems more consistent in openAI DR

#

they seem too similar to compare on that part, or its unnecessary comparison

#

since there's necessarily limited amount of info for a topic

small haven
#

gemini dr is like a student trying to fill up the minimum word count

elder rapids
small haven
#

oh mb i read it wrong

elder rapids
#

I'm asking which one is better, or which one sufficiently describes the information, not summarize it

small haven
#

sorry

#

yea i agree

elder rapids
#

in openAI DR it fetches good asf insight

#

and I've seen the same with Geminis

#

Gemini seems to verbosify a ton tho

#

but it doesn't seem like I'm getting less info

#

just more yap

leaden palm
small haven
#

"prioritize verbosity" loool

elder rapids
#

prioritize verbosity 😭🙏

#

nerfing the search

small haven
#

well great o3 solved a unit test where o1 pro couldnt... nice

raven void
#

o3 is just so good

#

Gemini got the answer to my problem wrong even on the meta synthesis, only o3 solved it

small haven
#

yea..

#

ok so wen full o4 tho 😭

raven void
#

when they have o5 pro internally 🤣

raven void
small haven
#

to be working at oai, that must be insane

woeful geyser
#

I feel that the vibe of o3 is too easy to recognize as well.

fleet lintel
#

April was kind of disappointing.. no new top model was released. bunch of hype but nothing materialised.

earnest parcel
plain zinc
drifting thorn
#

2.5 Pro is the SOTA...

solar nebula
#

yes

earnest parcel
#

03-25, it doesn't count as April release (even though it got rebranded from exp to preview)

fleet lintel
#

o3, o4-mini-high : disappointing. OAI tried to play game by releasing a bit prematurely to take on google but honestly they are just meh compared to 2.5 pro

plain zinc
#

1️⃣ US20250131254A1:

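A patent from @GoogleDeepMind on "function-preserving neural network expansion". It presents an innovative method for scaling up an AI model, or specializing it for a particular purpose, without losing previously learned knowledge. It is as if a person, when learning new knowledge ...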

#

From this post, you can understand where Google gets several models in LMarena.

#

So much more in such large numbers

calm sequoia
torn mantle
#

@balmy mist i think i may be right again, we may get r2 this monday / next week

#

lot of hints as well

small haven
#

day 11 , where is o3 pro

#

lemme guess when r2 gets released? smh

hardy pecan
fleet lintel
#

does anyone know of insider Apple info? Are they planning to compete in AI space at all?

mossy drum
#

New model in Arena: llama-4-scout-17b-16e-instruct

keen fulcrum
small haven
hardy pecan
unborn ocean
#

Maybe they already used it for the early Gemini 2.0 iterations back in 2024 (Like all the ones that were aistudio only)

unborn ocean
#

Poll: Deepseek R2 before May?

Winning answer: No ❌ (7 of 17 votes)

ocean vortex
# small haven it was half that last week

for what model? I'm absolutely blown away after I found out this silly score applies to playground and API too. If what you are saying is true that means any benchmarks that people did earlier might not even be possible to reproduce now. Changing stuff like that can always have unintended consequences, even if in theory higher value should be better. There is no place in API for crazy sht like that lol

ocean vortex
full kite
#

Guys I'm better

vagrant field
#

hi all

#

hi !

#

depends , but I use gemini*, o4-mini-high , and claude 3.*

#

gemini 2.5 pro and claude 3.7

alpine coral
alpine coral
#
  • System / platform (oai)
  • Developer (who can add what we call a system message)
  • User
#

if there's a conflict, the higher one has authority

#

as it explains in its reasoning (and hence why it said 8192; the 2903 in the developer message was overridden)
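A toy illustration of the precedence rule described above: platform over developer over user, with the higher level winning on conflict. This only models the ordering. It is not a claim about how the model or the API actually resolves instructions, and the setting name `yap_score` is a hypothetical label for the number being discussed.

```python
from typing import Dict, Optional, Tuple

# Highest authority first, as listed in the messages above.
PRIORITY = ["platform", "developer", "user"]


def resolve(setting: str, messages: Dict[str, Dict[str, int]]) -> Optional[Tuple[str, int]]:
    """Return (level, value) from the highest-priority level that sets `setting`."""
    for level in PRIORITY:
        values = messages.get(level, {})
        if setting in values:
            return level, values[setting]
    return None  # nobody set it


# Values taken from the message above; the key name is hypothetical.
messages = {
    "platform": {"yap_score": 8192},
    "developer": {"yap_score": 2903},
    "user": {},
}
print(resolve("yap_score", messages))
```

If the platform level does not set the value at all, the same lookup falls through to the developer message, which matches the "higher one has authority" description.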

keen beacon
#

this seems about right

brittle tiger
#

Is yap score unique for each user preference or are they just changing it for everyone on the fly looking for a sweet spot?

alpine coral
# keen beacon this seems about right

i think they might be from the same platform-level prompt (oai's instructions, which [are meant to] override anything given in the Developer Prompt (API) as well as end user messages)

#

the 8196 comes from the platform-level prompt (as do, I think, those instructions about mirroring style etc)

alpine coral
knotty jetty
#

Ok thanks bro

alpine coral
#

np

brittle tiger
full kite
#

what the fk is a yap score

#

I NEED SMART MIND HERE Please

#

I have this test tomorrow

#

It's like the final test of the year right

#

But it's a mock one

#

Idk sht about what the subjects are

#

I don't go to classes and everything

#

What would be the best method to know how to solve them

#

Like I have access to all the past mock tests from the last 4 years

#

and the correction of them

#

And also videos about ppl solving them

#

Like Idk what to do chat

#

Please 😭

keen beacon
#

Get ready @full kite

full kite
full kite
keen beacon
leaden palm
full kite
full kite
leaden palm
#

see what you don't know

#

what you don't understand

#

look for patterns in that

keen beacon
#

ur in ai discord bro just cheat

#

all my classes online

#

4.0 gpa

full kite
keen beacon
#

4.0 gpa out of 4

#

thats the highest u can get

full kite
#

that's like good

keen beacon
#

ye

#

i have a spare laptop i put in the corner

#

then use parsec on main desktop

#

remote into it

#

gemini free trial gives u unlimited screenshots too

flint sand
#

also don't cheat dude

keen beacon
full kite
#

I'm using google ai studio

flint sand
# keen beacon whats that

your gpa isn't directly based on your marks, it's based on how much you scored relative to the highest marks obtained in class

keen beacon
#

p sure ai studio's 2.5 pro without premium is only like 500k tokens then you get rate limited for a day

leaden palm
#

or use the official student free trial if you're in university

full kite
#

Dude 2.5 pro is free and 1 million token per chat

#

what is pro even about 🙏 😭

#

I have the flash one too

#

faster

leaden palm
full kite
#

ong

leaden palm
full kite
#

what does RPM mean

leaden palm
#

requests per minute

leaden palm
# leaden palm

actually maybe ai studio has higher limits than the ai studio api... idk

full kite
#

GUYS

#

I need to study jessu

#

please help me

leaden palm
#

jesus?

full kite
#

yeahuh

#

help me studying

leaden palm
#

lmao

full kite
#

😡

ocean vortex
ocean vortex
frail thorn
#

Guys does anybody have a chatgpt premium account that could be shared with me?

ocean vortex
full kite
frail thorn
#

I need full access

#

😭😭😭😭

ocean vortex
full kite
frail thorn
#

Its whatever

ocean vortex
#

well then @full kite could give it to you maybe

full kite
frail thorn
#

REALLY

full kite
#

😡

ocean vortex
#

he doesn't need his it seems

frail thorn
full kite
#

Ok tell me what

#

I'll see if I accept

frail thorn
#

I just need chatgpt to help me

#

the normal one doesn't accept all of my files

#

and so on

full kite
#

how many files

frail thorn
#

many...

#

like 10 at least

ocean vortex
frail thorn
ocean vortex
#

it's free and the king for file uploads

frail thorn
#

but like

full kite
frail thorn
full kite
#

okay what happen

frail thorn
#

wait

full kite
#

🤨 🫃

frail thorn
#

🫃🫃🫃🫃🫃

full kite
#

listen I'm going to quit

#

if you don't tell me

#

😡

frail thorn
#

TELL YOU WHAT

#

oh my days

#

you know what

#

QUIT

full kite
#

WHAT HAPPEN WITH GOOGLE GEMINI

#

IT HAS 2 000 000 TOKENS FREE

#

PER CHAT

#

CHATGPT PRO IS 128 000

frail thorn
#

😇

full kite
#

😡

full kite
#

I'll help you

#

I want to help you

#

@frail thorn

#

?

#

😔

frail thorn
#

😀

full kite
keen fulcrum
frail thorn
#

🙄

frail thorn
timber kiln
ocean vortex
frail thorn
#

I'll include the thank you

full kite
leaden palm
ocean vortex
#

🫃

full kite
#

You fkn looser

frail thorn
#

someones mad

full kite
#

✡️

frail thorn
#

🤗

full kite
#

🥀

frail thorn
full kite
ocean vortex
#

I wouldn't really say chatgpt is better for what you are trying to do even... But ok whatever 😂

frail thorn
#

JUST CHILL OUT GUYS

#

its just 20 dollars

#

Its not like I'm going to die from it

full kite
ocean vortex
frail thorn
#

LOLLLLLLL

full kite
#

WE GOT ONE

frail thorn
#

now that made me giggle

full kite
#

coins clipper

frail thorn
#

gold detector

full kite
#

the

#

why have you buy the chatgpt

frail thorn
#

🤗

full kite
#

thing

frail thorn
#

I'm just talking about me

#

I'm not including any other reference

full kite
frail thorn
#

😭

#

I NEED TO GO BACK TO RESEARCHING

#

cya

full kite
#

🤨

frail thorn
full kite
#

are you doing research about the juice

#

😔

#

yes ?

#

Sez u

#

aware

native shoreBOT
#

dynoSuccess diana12493_32 has been warned.

full kite
#

Hello??

#

Can the other guy be warn too

#

😡

keen beacon
#

oy vey stop noticing

native shoreBOT
#

dynoSuccess thekingofnothing_ has been warned.

#

dynoSuccess m0_0d_ai has been warned.

leaden palm
#

which is why if you build anything and want it to be free you have to get like 3 other providers

#

25 rpd does not serve 2-3 users actively building things

#

25 rpd is the documented limit

#

and i've run into it before
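A minimal sketch of the "get like 3 other providers" idea: track a daily request budget per provider and fall through to the next one when a quota runs out. The provider names are placeholders, and giving every provider the same 25 requests-per-day limit is an assumption made only to keep the example small.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Provider:
    name: str
    requests_per_day: int
    used_today: int = 0

    def has_quota(self) -> bool:
        return self.used_today < self.requests_per_day


def pick_provider(providers: List[Provider]) -> Optional[str]:
    """Use the first provider with quota left; None if all are exhausted."""
    for provider in providers:
        if provider.has_quota():
            provider.used_today += 1
            return provider.name
    return None


# Hypothetical providers, each assumed to allow 25 requests per day.
providers = [Provider("primary", 25), Provider("backup_a", 25), Provider("backup_b", 25)]

# 80 requests in one day: 75 get served, the last 5 find every quota spent.
served = [pick_provider(providers) for _ in range(80)]
print(served.count("primary"), served.count("backup_a"), served.count("backup_b"), served.count(None))
```

A real version would also reset `used_today` at the provider's quota boundary and handle per-minute limits, which this sketch leaves out.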

full kite
ocean vortex
#

wtf happened here

frail thorn
full kite
zinc ore
small haven
bright kayak
pliant cypress
#

Logan just hinted at 2.5 ultra and removed the tweet very fast 😆

keen beacon
#

what did he say

small haven
#

buddy perma works at google and says this

pliant cypress
# keen beacon what did he say

Something about making custom t-shirts with text "1400+ ELO club", "smell big model", "AGI when". But maybe its just coping

sage raptor
#

maybe he is drunk

knotty jetty
#

Yo gemini 2.5 pro in perplexity is peak

#

Also I hate perplexity but its really good with the llm

full kite
knotty jetty
full kite
#

what

#

dude I know nothing of the sht

#

I'm using google studio ai

#

whats perplexity

knotty jetty
#

Just look it up bro

full kite
#

omfg

leaden palm
small haven
#

deepwiki's deep research is underrated

wintry tinsel
#

Remember how they murdered the old server

#

🥲

leaden palm
small haven
#

took about 3-4 mins to finish

full kite
leaden palm
#

which is... elo data

full kite
#

yeah what about it

#

aok

#

I thought it was a chess thing, they were talking about that earlier

leaden palm
#

anything with pairwise comparisons can be measured with elo
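As a sketch of that claim, the snippet below turns a list of pairwise "A beat B" votes into ratings with sequential Elo updates. The model names, the starting rating of 1000, and K=32 are all assumptions for illustration. A real leaderboard may fit its ratings differently, for example over all votes at once rather than one at a time.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rate_battles(battles: List[Tuple[str, str]], k: float = 32.0, start: float = 1000.0) -> Dict[str, float]:
    """Apply one Elo update per (winner, loser) pair, in order."""
    ratings: Dict[str, float] = defaultdict(lambda: start)
    for winner, loser in battles:
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected_win)   # bigger reward for an unexpected win
        ratings[winner] += delta
        ratings[loser] -= delta            # zero-sum: the loser gives up the same amount
    return dict(ratings)


# Hypothetical votes: each tuple is (preferred model, other model).
battles = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
ratings = rate_battles(battles)
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

The same loop works whether the "game" is chess, a code answer, or an image, as long as each comparison has a winner.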

full kite
#

like llm will never be able to play chess

full kite
#

can we play on it or something

leaden palm
wintry tinsel
#

LLM’s dead end 😱

#

LLM’s failed project

leaden palm
#

you can also compare other things:

  • llms using github repos
  • llms using search
  • llms making websites
  • llms writing code
  • image generation
wintry tinsel
#

Rip LLM’s

#

Support Yann LeCun

leaden palm
torn mantle
#

im not impressed with these models tbh

#

they just seem like gemini 2.5 pro 03

full kite
#

I don't know what a repo is

leaden palm
#

thats ok, a lot of lm arena is for coders and you might not be one

full kite
#

I'm a coder

leaden palm
full kite
#

I know about the green board

#

contributions sht

small haven
#

my iq is dropping

elder rapids
#

where's Nw 🙏😭

sage raptor
#

soon

keen beacon
#

in the void

elder rapids
#

the longer these models are unreleased

#

the more there's a chance people catch up

#

and it's unimpressive

wintry tinsel
#

We need human brain computer

#

Neurons in the circuit

#

Then true AGI

#

🔥

golden ocean
#

let's dissect urs so we can start the experiments immediately

wintry tinsel
#

That would be cheating I’m already AGI

misty vault
#

more like artificial general stupidity

#

Ok that sounds more like a direct insult rather than a silly joke

#

no more funny

wintry tinsel
#

Are LLM’s capable of being stupid?

#

Perhaps stupidity is a byproduct of intelligence

leaden palm
#

qwen 500m:

wintry tinsel
#

Well you have a point

small haven
#

ok but wen o3 pro

#

baited, just wanted to spawn my man craig

elder rapids
#

nah ion think so

#

a major part of what made o1 pro so good was its ability for pure longer context reasoning

small haven
#

ye im using o3 more than o1 pro..

#

what we know is its 10x compute (so thinking for 10x longer than o1)

keen beacon
#

lol

small haven
#

and the fact it is outputting as a one shot answer rather than streaming tokens, means it is using an internal canvas in the backend

#

so it is constantly iterating its final answer, thats my hypothesis

elder rapids
#

we can infer

#

longer reasoning

#

whereas pro is probably fine tuned

small haven
#

and internal canvas

elder rapids
#

specifically to force a longer reasoning chain + better initial instruction retention

small haven
#

internal canvas, checking its own answer on a pass@1, then reiterates it, until it is satisfied with a hard limit of 10x compute

elder rapids
#

there won't be an o3 pro if it's not better than o1 pro

#

just saying lmao

willow grail
small haven
#

can sam stop riding his husband and release o3 pro

willow grail
#

its a ceo lol?

#

watch movies

#

then u know

#

rich people are very unhappy

#

thats why i am happy

#

nonetheless you are not rich

#

cause its adverse to not say it

#

youre stuffed with adverse lies

#

/s

#

ceo's dont live on prairies, therefore they are unhappy creatures.

terse shuttle
#

Should we expect a new image generation model from openAI in the arena?

willow grail
#

you have cute attributes, federighi

#

ceos dont have small things.

#

see it as necessary.

#

we said the same word... lol

#

._. we are models?

#

i will weigh 81kg after water fluctuations in exactly 30 days

#

now i am 82.6kg

#

having diseases which make losing weight slower is bad

#

i think u just wanna believe they are as happy as a nanny in a village

#

cause u wanna be rich too

#

so u have a goal to go for right now

ocean vortex
#

No it’s more like multiple attempts and consensus system. You can still set low-med-high for pro.
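Nobody in the chat knows how the pro mode works internally, so the following only illustrates the general "multiple attempts and consensus" idea as a plain majority vote. The attempt answers are hard-coded stand-ins for independent model samples, and the function name is made up.

```python
from collections import Counter
from typing import List, Tuple


def consensus(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the share of attempts that agreed."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Stand-ins for five independent attempts at the same question.
attempts = ["42", "42", "41", "42", "40"]
answer, agreement = consensus(attempts)
print(answer, agreement)
```

A low agreement share could be used as a signal to spend more attempts, which fits the idea of a compute budget.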

tall summit
small haven
#

omg sam tweeted, plz be o3 pro

golden ocean
#

its him riding his husband

tall summit
golden ocean
#

hi

leaden palm
#

probably a refreshed pretrain

keen beacon
#

its likely to be a cpt of 2.0 pro

#

it isnt. even openai cant increase simpleqa that much yet thru reasoning

small haven
#

i love o3

#

but i would love o3 pro even more

dapper aspen
#

hi

balmy mist
#

we got deepseek yet?

#

imma go to china and fight Winnie if we dont get r2 this week

alpine coral
#

just got folsom-exp-v1, which i haven't seen or heard of before - new anon model?

#

presumably related to cobalt, apricot

#

so amazon ig

torn mantle
small haven
#

when o3 shows these traces in the cot, i kinda leak a bit

keen beacon
calm sequoia
alpine coral
#

it seemed quite fast and dumb

cedar tide
small haven
#

why is o3 limited at 64k context, absolute dogwater

keen beacon
#

omg qwen 3

#

apparently pre trained on 36 trillion tokens 😮 (2x qwen 2.5). multiple moe models?

stuck orchid
#

Hi.
Is there a chat limit in https://beta.lmarena.ai? Or is it possible to test ai even on long requests (10K+ tokens)?

keen beacon
torn mantle
keen fulcrum
torn mantle
#

So llama 4 reasoning models werent added on lmarena given the recent controversies

#

And its kinda weird too qwen3 isnt added yet in an anonymous battle mode

ocean vortex
keen beacon
torn mantle
humble sonnet
neon anchor
#

Sunstrike is not generating code at all

balmy mist
#

Anybody tested qwen3?

keen beacon
#

the 235b moe (apparently) with reasoning mode might be very impressive

sage raptor
#

No electricity in europe 😭

keen beacon
#

it's just that those were the ones that accidentally appeared on modelscope briefly

#

anyway, to summarise, this week we are getting:

  • qwen 3 (likely today)
  • amazon nova premier (wed)
  • deepseek R2 (somewhat likely, depends on qwen 3)
brittle tiger
#

Would be cool if r2 went on arena before debuting

keen beacon
#

i expect that to be a very good model

balmy mist
#

i kinda like that, give us time to collect ourselves

torn mantle
#

Qwen 3 is what people expected llama 4 to be like

calm sequoia
#

The GPT 4o doesn't allow altering real photos due to their stupid policies. Anyone have the jailbreak code or the alternative tool?

sonic tendon
# torn mantle Yea the big one is 235b

oh damn
was wondering if qwen3 was going to include a big model
seems sort of odd that they didn't include it in the huggingface pr from a week or two ago, but maybe they're trying to keep it closed?

keen beacon
sonic tendon
#

ah

keen beacon
#

they just briefly put out a fp8 quantized qwen 3 0.6b on hf and removed it

torn mantle
#

Will 235b be the biggest model released by them so far?

#

Or they released smth bigger before

keen beacon
#

im so hyped rn

calm sequoia
#

I think nothing will be better for at least a month

calm sequoia
ocean vortex
#

it should at least be mostly as good as 2.5 I think.

#

probably. But it's not gonna be worth using it given the price that's for sure lol

#

technically sota now is o3 anyway

keen beacon
#

i dont think they will put o3 pro in the arena

barren prairie
keen beacon
#

imho the most likely thing to dethrone gemini 2.5 pro will be gemini 2.5 pro (at least in the arena) 🤣

torn mantle
#

In all seriousness o3 model is a lot better than gemini 2.5 pro at general tasks

#

You can still feel the robotic vibes from gemini

full kite
#

Just got r2 it's good

ocean vortex
full kite
#

Guys can AI do my homework pls

#

it's a math test

calm sequoia
#

Why did you change your username from Mango to Diana? 🙂

calm sequoia
#

if R2 > Behemot, Zuck is cooked

keen beacon
#

thats obviously gonna be the case tbh

full kite
calm sequoia
full kite
#

what is behemot

keen beacon
#

behemot is agi

alpine coral
calm sequoia
#

Behemot is minotaur shrek

severe tinsel
#

Hi, I’m not a pro, will Qwen3 compete with Qwen 2.5 Max at the top of the leaderboard or its not the same category?

full kite
#

agi doesn't exist

keen beacon
#

or in the arena

severe tinsel
#

Yea i mean one of them

keen beacon
#

probably the big one will be competitive in the leaderboard

severe tinsel
#

Okay thanks!

stuck orchid
#

Qwen 3 may be better than Gemini-2.5-pro 👍

full kite
#

qwen is slow asf

ocean vortex
full kite
ocean vortex
#

it's behemoth

full kite
#

like a bear with teef

#

4 legged bear

ocean vortex
#
SciiFii Wiki

The Behemoth (Megasus mammothoides, name meaning "great mammoth pig") is a species of large land mammal that originally didn't exist, but was created by SciiFii and introduced to the African...

alpine coral
#

bro... it's trolling....

#

(i assume / hope lol)

ocean vortex
alpine coral
#

lol

keen beacon
#

try it yourself lol

#

direct chat / side by side?

#

i think side by side was configured differently a while back it might have better limits

#

yea

alpine coral
#

i dunno about coding (let alone for a specific programming language), but i feel 2.5 pro is just solid as af all round

keen beacon
#

its incredible i still main it over anything else

alpine coral
#

it's more usable than o3 (esp o3 high)

#

yes

keen beacon
#

its definitely seen less c++ than the others and its more things to manage

#

yea

#

if u look at it historically, probably. it depends on how much of it was curated, but i think its likely to be even less

#

im not very sure about the others. but 1 and 2 is highly likely to be python/javascript (not sure which one is which though)

#

it depends really but with a gc and such its generally slower/much slower i think

#

did u try?

#

2.5 pro has the best context retention/usage in a model i think anyway, it helps a lot

#

you should probably upload the code into the repo instead of it being in a zip

#

do u have git installed?

#

i think u can also upload folders directly on the website

#

maybe ask 2.5 pro to teach u git

#

its more convenient and allows u to have version control

alpine coral
#

i'm getting so many errors in the arena atm

#

i feel like the yap score has been there all along (and at 8192).. recently discovered / noticed, rather than added..

#

also the other models' responses.. 3.7-sonn and v3 do well; sunstrike also (though verbose af)

#

folsom-exp-v1 assumes it's a bitcoin or something - pretty terrible response imo

tall summit
alpine coral
#

oai reasoning models have a top-level prompt that gives guidance about how to interact and includes this part about a 'yap score' (which has always been at 8192, so far as i can tell)

alpine coral
#

yeah i mean, to the extent there's some oai-imposed instructions that include this thing called a yap score, it's real

tall summit
#

wow okay.

alpine coral
#

whether it's concerning though im not really sure tbh

#

like esp if it's been there all along

#

it may just be for stylistic purposes

#

but yeah that said.. my initial reaction was to think it was a way for oai to dynamically throttle outputs on o models in chatgpt, to like manage costs / compute

keen beacon
#

iirc it was a thing since launch

#

o3/o4 mini launch

alpine coral
#

yeah i only learnt about it here a couple of days ago

#

but now i think it's likely been there all along (and am not really bothered by it.. like there's no indication of it being used to nerf the models or whatever... yet anyway aha)

#

i kinda wonder if, in cases where lots of reasoning tokens are used especially, the final outputs could be super lengthy, and this was just their solution (prompting), rather than it actually meant to be dynamic

brittle tiger
cedar tide
#

Qwen 3 235b will be better than deepseek 3.1 671b ? (And Maverick 400b)

torn mantle
#

The question is : will the smaller models be even better than Maverick?

#

Imagine 30b > 400b

cedar tide
#

apart from their version with reasoning

keen beacon
#

i think all/most of them are hybrid reasoning models

cedar tide
#

"let's train our model to get higher chat slop ELO scores"

*model starts exclusively outputting pure chat slop*

not meant to be a jab at any one lab in particular, just highlighting a particularly bad incentive structure I see rn. there's a reason you don't find Claude at #1 on chat slop leaderboards. it's the LLM equivalent of optimizing for video watch time in a social media algo.

cedar tide
keen beacon
#

might be a placeholder but the pr also mentioned the 8b model which was confirmed

cedar tide
keen beacon
#

wont be surprised if its real tho

keen fulcrum
#

Qwen 3 dropping

keen beacon
#

yup they're trickling in

torn mantle
keen beacon
#

will probably be all done in the next hour or two

#

I'm just waiting for the big one

torn mantle
#

Its either o3 or gemini 2.5 pro that deserves #1 spot tbh

keen fulcrum
#

Qwen 3 released

Qwen3-8B

Qwen3 Highlights

Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Building upon extensive advancements in training data, model architecture, and optimization techniques, Qwen3 delivers the following key improvements over the previously released Qwen2.5:

  • Expanded Higher-Quality Pre-training Corpus: Qwen3 is pre-trained on 36 trillion tokens across 119 languages — tripling the language coverage of Qwen2.5 — with a much richer mix of high-quality data, including coding, STEM, reasoning, book, multilingual, and synthetic data.
  • Training Techniques and Model Architecture: Qwen3 incorporates a series of training techniques and architectural refinements, including global-batch load balancing loss for MoE models and qk layernorm for all models, leading to improved stability and ov…

keen beacon
#

lmao someone leaked qwen 3 32b it seems

#

seemingly one of the randos (non qwen team) in the hf qwen org

cedar tide
keen beacon
#

eh they've started dropping them now anyway

#

doesn't matter much

#

not many details there tho

#

why are there random people in the qwen hf org lol

#

it'll be officially released literally in the next hour probably

#

I wonder if we also see something from deepseek this week

#

it would make sense but who knows

#

the 235b possibly

#

depends on how strong their reasoner is

keen beacon
#

I do expect it to beat R1 minimum really

#

if they can't even do that it's a flop

#

llama 4 reasoning releasing at llamacon tomorrow by the looks of it, frontend is ready

#

i would also expect behemoth

tall summit
#

i think o3 is better at translation than g2.5pro

keen beacon
#

its gonna be a huge flop

#

if it's anything like the rest sk

#

there will be a lot of memes about qwen and llama i suspect

#

idk about qwen

#

i have a lot more faith in them than i do meta

#

yann lecooked

keen beacon
#

how qwen was the llama 4 people expected

#

oh

#

yeah nevermind

keen fulcrum
keen beacon
#

its a shame that guy is uploading the ggufs publicly before qwen officially announces it

keen fulcrum
#

Must be some employee with access to the system

keen beacon
keen fulcrum
#

Oh lol

keen beacon
#

I also hope that 15b moe is real, it's an awesome size

#

It's insane how many models they are gonna release. With 36 trillion tokens in pretraining not to mention the reasoning training etc

keen fulcrum
#

Is R2 coming out this week too?

keen beacon
#

Idk lol

#

Its qwens week probably

barren prairie
keen fulcrum
#

There was a leak for deepseek too
if its not this week, must be next one

sonic tendon
keen fulcrum
#

Behemoth

sonic tendon
#

yeah but

#

eh

sonic tendon
#

i would personally expect q3 within the next 24 hours

torn mantle
#

Weird they didnt add any new models on lmarena