keen fulcrum May 3, 2025, 10:30 AM

#

What is Teamfood?

alpine coral May 3, 2025, 10:38 AM

#

keen fulcrum What is Teamfood?

as they say, it's a codename for the memory feature (weird one indeed tho ha)

patent bane May 3, 2025, 10:54 AM

#

gpt-4-turbo is released?

keen fulcrum May 3, 2025, 10:58 AM

#

alpine coral as they say, it's a codename for the memory feature (weird one indeed tho ha)

Well it's not something edible
Great name for a brand selling productivity snacks

brisk turret May 3, 2025, 11:13 AM

#

Jesus Christ update the leaderboard

#

You had one job!

keen fulcrum May 3, 2025, 11:15 AM

#

12 is missing

#

I wouldn't straight up put gpt 4o as number 3
o3 will eventually be number 1 with more votes

alpine coral May 3, 2025, 11:18 AM

#

keen fulcrum 12 is missing

so are 5 and 6

keen fulcrum May 3, 2025, 11:18 AM

#

I wonder how that happens

alpine coral May 3, 2025, 11:18 AM

#

pretty sure it's how it's meant to be

keen fulcrum May 3, 2025, 11:18 AM

#

Isn't it some very simple logic?
Or does that happen when models get removed?

alpine coral May 3, 2025, 11:19 AM

#

actually yeah i dunno

brisk turret May 3, 2025, 11:20 AM

#

This leaderboard is very poorly managed

alpine coral May 3, 2025, 11:20 AM

#

nah they periodically update it..

#

calm down lol

hardy pecan May 3, 2025, 11:20 AM

#

The “holes” you’re seeing in the Rank * (UB) column aren’t a bug in the UI, they’re a side-effect of how that leaderboard does its UB‐ranking: it groups all models whose Arena-Scores are not statistically distinguishable (based on the 95 % CIs) into the same bucket, gives that bucket a rank number, then only bumps to the next rank when a model is significantly worse. If no model ever fell into the “bucket” that would have been called rank 5 (or 6, 10, 12, etc.) then you simply won’t see anyone labeled with that rank. In short, the missing numbers are just empty significance‐tiers—no model earned them.
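A rough sketch of the bucketing idea in that message, assuming the rule is "rank = 1 + the number of models whose lower CI bound is above this model's upper CI bound". The rule as coded, the `Model` type, the `ub_rank` helper and every score below are assumptions made up for illustration, not the leaderboard's actual code or data.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    lower: float  # lower end of the 95% confidence interval (made-up value)
    upper: float  # upper end of the 95% confidence interval (made-up value)


def ub_rank(models: list[Model]) -> dict[str, int]:
    """Assumed rule: 1 + count of models that are clearly better.

    A model counts as clearly better when its lower bound is above
    the target model's upper bound, i.e. the intervals do not overlap.
    """
    ranks = {}
    for m in models:
        clearly_better = sum(1 for other in models if other.lower > m.upper)
        ranks[m.name] = 1 + clearly_better
    return ranks


# Hypothetical models: b, c and d overlap each other but not a or e.
models = [
    Model("a", 1430, 1450),
    Model("b", 1400, 1420),
    Model("c", 1395, 1415),
    Model("d", 1390, 1412),
    Model("e", 1360, 1380),
]
ranks = ub_rank(models)
print(ranks)
```

With these made-up intervals, b, c and d all share rank 2 and e lands on 5, so nobody is labeled 3 or 4.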

alpine coral May 3, 2025, 11:21 AM

#

ty

hardy pecan May 3, 2025, 11:21 AM

#

Whenever two or more models get folded into the same “bucket” because their 95 % confidence intervals overlap, they all take the lowest ordinal rank of that bucket, and the next model’s rank number jumps ahead past all of them. In other words:

• No one owns rank 5 or 6 because the models that would have occupied positions 5 and 6 were tied with higher-ranked buckets and so got pulled up into those buckets.
• Likewise, the three models tied at “9” consume positions 10 and 11 (so you never see a standalone 10 or 11), and leave the next distinct model at rank 12, or later if the tied bucket is bigger.

It’s just the standard “competition-style” (or “min”) ranking with ties: you collapse tied items into the same rank, then skip over all of the internal slot numbers that are swallowed by the tie.
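For reference, a minimal sketch of plain competition-style ("min") ranking with ties. The `competition_rank` helper and the scores are hypothetical, and exact score equality stands in for "statistically tied", which is a simplification.

```python
def competition_rank(scores: dict[str, float]) -> dict[str, int]:
    """Tied items share the lowest rank of their group; later ranks skip ahead."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    prev_score = None
    prev_rank = 0
    for position, (name, score) in enumerate(ordered, start=1):
        if score == prev_score:
            # Same score as the previous item: reuse its rank.
            ranks[name] = prev_rank
        else:
            # New score: the rank is the 1-based position in the sorted list.
            ranks[name] = position
            prev_rank = position
            prev_score = score
    return ranks


# Hypothetical scores: b, c and d are tied.
scores = {"a": 5.0, "b": 4.0, "c": 4.0, "d": 4.0, "e": 3.0}
result = competition_rank(scores)
print(result)
```

The three-way tie takes rank 2 and swallows slots 3 and 4, so the next item shows up at 5.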

#

Thanks AI!!!!

brisk turret May 3, 2025, 11:24 AM

#

It's stupid that's not how ranking works

#

They overcomplicated it

#

There are no ties anyway

#

There should not be buckets

#

Some idiot tried too hard to seem smart

#

And vastly overcomplicated a very simple task

north horizon May 3, 2025, 12:10 PM

#

i've definitely had ties where both models generate essentially the same ui

#

recently had grok 3 and maverick do that

keen beacon May 3, 2025, 12:26 PM

#

found this interesting

#

surprising to see 2.5 flash cost more, but i suppose it depends on the task

#

alpine coral May 3, 2025, 1:10 PM

#

earnest parcel May 3, 2025, 1:17 PM

#

alpine coral

wait until you see phi-4-reasoning-plus, which will be 2-3x of the next entry

keen fulcrum May 3, 2025, 1:23 PM

#

Poll: Which model do you prefer?

Winner: Nightwhisper (answer 5), 6 of 9 votes

solar hollow May 3, 2025, 1:26 PM

#

do we know what model nightwhisper is?

alpine coral May 3, 2025, 1:32 PM

#

interesting - cheers!

golden ocean May 3, 2025, 1:42 PM

#

butt cheeks

brittle tiger May 3, 2025, 2:00 PM

#

keen beacon found this interesting

I think it's dumb they don't include repeats for cost or output tokens. That's a big part of evals and regular use. Aider bench does, making its cost comparisons more valuable

ornate stump May 3, 2025, 2:03 PM

#

Does anyone who uses NotebookLM know why the new 2.5 update (until yesterday it was still using 2.0) doesn't affect the audio overview (podcast)? I

calm sequoia May 3, 2025, 2:23 PM

#

keen beacon surprising to see 2.5 flash cost more, but i suppose it depends on the task

Haha the cost per token is small on the flash but it takes too many tokens to think. I guess the llama models aren't actually so bad. They just presented it wrong.
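The arithmetic behind that point, as a tiny sketch: total spend is price per token times tokens used, so a cheaper model that thinks longer can still cost more. The `run_cost` helper, the prices and the token counts are all invented for illustration, not real figures for any model.

```python
def run_cost(price_per_mtok: float, output_tokens: int) -> float:
    """Total cost in dollars for a run, given a price per million tokens."""
    return price_per_mtok * output_tokens / 1_000_000


# Hypothetical numbers only.
cheap_but_chatty = run_cost(price_per_mtok=1.0, output_tokens=50_000_000)
pricey_but_terse = run_cost(price_per_mtok=4.0, output_tokens=10_000_000)

print(cheap_but_chatty, pricey_but_terse)
```

Here the model that is 4x cheaper per token ends up with the bigger bill because it emits 5x the tokens.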

#

Also it's surprising that the o3 spending is so high. The artificial analysis must be a hell of a benchmark

keen fulcrum May 3, 2025, 2:56 PM

#

Wow only two voted for sunstrike

#

Sunstrike is probably gemini 2.5 ultra

torn mantle May 3, 2025, 3:08 PM

#

keen fulcrum Wow only two voted for sunstrike

Doesn't make any sense if NW is an old checkpoint

keen beacon May 3, 2025, 3:09 PM

#

general purpose vs finetuned for web dev

torn mantle May 3, 2025, 3:09 PM

#

It wasn't just good for web dev

#

It could do complex reasoning in 1 shot

#

accurate visualisation

keen beacon May 3, 2025, 3:10 PM

#

being finetuned on something doesnt always necessitate losing abilities on other stuff

torn mantle May 3, 2025, 3:10 PM

#

Good at following prompts as well

keen beacon May 3, 2025, 3:10 PM

#

did sunstrike appear in the general arena as well?

torn mantle May 3, 2025, 3:10 PM

#

Frostwind/sunstrike... Are all finetuned on web dev imo

torn mantle May 3, 2025, 3:10 PM

#

keen beacon did sunstrike appear in the general arena as well?

.

#

Yea

keen beacon May 3, 2025, 3:10 PM

#

hmm

leaden palm May 3, 2025, 3:12 PM

#

i'm expecting qwen 3 235B to be a very efficient model, on this price vs elo plot i'd expect it to be right where my cursor is

#

non log scale edition:

#

well maybe my cursor was too far to the left but you get the idea

tall summit May 3, 2025, 3:21 PM

#

leaden palm well maybe my cursor was too far to the left but you get the idea

whats the metric of the x axis

leaden palm May 3, 2025, 3:21 PM

#

price

#

/ mtok

#

mixed

tall summit May 3, 2025, 3:22 PM

#

leaden palm price

oh i thought the scale only went from 0 to 1 for a moment considering you cropped it

blazing rune May 3, 2025, 3:23 PM

#

keen beacon surprising to see 2.5 flash cost more, but i suppose it depends on the task

I think it's because o4 mini high is lazy, especially with code

#

so it's naturally more concise

leaden palm May 3, 2025, 4:10 PM

#

dario said something that loosely translates to if they didnt cook they wont call it claude 4

small haven May 3, 2025, 4:19 PM

#

eww no

alpine coral May 3, 2025, 4:42 PM

#

torn mantle Yea

yeah ive had it [sunstrike] a few times in the general arena - iirc thought it was v similar to 2.5 pro, though perhaps a bit weaker

#

though a week or so ago.. not sure if it's still around

unborn ocean May 3, 2025, 4:47 PM

#

idk, thinking it might be a bit higher price wise, because I am not sure if the current prices (of 0,1 per million in and out from deepinfra and fireworks) will stick

unborn ocean May 3, 2025, 4:47 PM

#

leaden palm i'm expecting qwen 3 235B to be a very efficient model, on this price vs elo plo...

(especially not the fireworks one, as they literally said they are working on the pricing)

leaden palm May 3, 2025, 4:48 PM

#

well if things change lmb will update

#

but im expecting deepinfra to stay low

#

it has a small # of active params and itll only get more optimized after all

unborn ocean May 3, 2025, 4:48 PM

#

leaden palm it has a small # of active params and itll only get more optimized after all

yeah but llama price is higher and they have lower active param

#

deepseek is also close in active param (27B idk for sure though)

leaden palm May 3, 2025, 4:50 PM

#

unborn ocean yeah but llama price is higher and they have lower active param

as long as its under $0.5 ill be happy

keen beacon May 3, 2025, 4:50 PM

#

Deepinfra price matched fireworks which has temp pricing but I guess deepinfra didn't realize that and/or still want to compete. I think it might encourage lower prices overall though because of this

unborn ocean May 3, 2025, 4:50 PM

#

it will probably depend on the size of the userbase, as MoE deployment efficiency scales with the userbase

unborn ocean May 3, 2025, 4:51 PM

#

leaden palm as long as its under $0.5 ill be happy

me too

keen beacon May 3, 2025, 4:53 PM

#

Qwen 235b is slow as molasses on deepinfra, fireworks is an unknown quant iirc (but much faster)

unborn ocean May 3, 2025, 4:55 PM

#

keen beacon Qwen 235b is slow as molasses on deepinfra, fireworks is an unknown quant iirc (...

True, 20 tokens/s + thinking Model = 💩

torn mantle May 3, 2025, 5:49 PM

#

https://x.com/iruletheworldmo/status/1918660478542172405

🍓🍓🍓 (@iruletheworldmo) on X

appreciate early access elon, i’ve never used anything like grok 3.5

this is a fundamental shift in intelligence.

with gpt 4 i felt sparks of agi, today i’ve tasted the cambrian explosion of true artificial intelligence.

#

this guy still has a platform?

#

did he just take the pic shared by techdevnotes, change the background color and make it his own?

#

original : https://x.com/techdevnotes/status/1918292261659365576

Tech Dev Notes (@techdevnotes) on X

Grok iOS App is getting ready as well for 3.5, subtitle will be updated

zinc ore May 3, 2025, 5:50 PM

#

Dudes shameless glazing is kinda impressive ngl

torn mantle May 3, 2025, 5:51 PM

#

well he did some modifications

keen beacon May 3, 2025, 5:51 PM

#

torn mantle did he just take the pic shared by techdevnotes, change the background color and...

lmao

torn mantle May 3, 2025, 5:52 PM

#

keen beacon lmao

you can actually see it in pixels

keen beacon May 3, 2025, 5:52 PM

#

yea 🤣 🤣

torn mantle May 3, 2025, 5:52 PM

#

he has no shame

keen beacon May 3, 2025, 5:52 PM

#

torn mantle he has no shame

hes a massive troll

torn mantle May 3, 2025, 5:52 PM

#

keen beacon hes a massive troll

he was in chatgpt discord server and we called him out

keen beacon May 3, 2025, 5:52 PM

#

o

leaden palm May 3, 2025, 5:57 PM

#

torn mantle did he just take the pic shared by techdevnotes, change the background color and...

wait this is so funny, i applied some css filters and you can tell that he used the "draw on image" functionality of his screenshot editor to explicitly cover up the background text

keen beacon May 3, 2025, 5:58 PM

#

leaden palm wait this is so funny, i applied some css filters and you can tell that he used ...

omg 🤣

alpine coral May 3, 2025, 6:00 PM

#

lol this is classic

sage raptor May 3, 2025, 6:01 PM

#

alpine coral May 3, 2025, 6:02 PM

#

he's just taking the piss at that point isn't he?

small haven May 3, 2025, 6:03 PM

#

wait we have grok 3.5 alrdy?

keen beacon May 3, 2025, 6:03 PM

#

yeah and o3 pro on the o1 pro api

small haven May 3, 2025, 6:04 PM

#

the joke isn't hitting no more

alpine coral May 3, 2025, 6:04 PM

#

small haven May 3, 2025, 6:07 PM

#

i have supergrok, no 3.5..

zinc ore May 3, 2025, 6:08 PM

#

alpine coral he's just taking the piss at that point isn't he?

He's been doing this since last year, it gets him views and he even openly admits that he's lying (at least he has in tweets a couple of times), and laughs at all the people who believe him.

alpine coral May 3, 2025, 6:09 PM

#

ha yeah i mean kinda fair

zinc ore May 3, 2025, 6:09 PM

#

He's so openly dishonest that I kinda like that account (not for anything serious though)

small haven May 3, 2025, 6:13 PM

#

alpine coral

strawberry man doesnt even have grok 3.5, buddy woulda screenshot..

torn mantle May 3, 2025, 6:19 PM

#

leaden palm wait this is so funny, i applied some css filters and you can tell that he used ...

yea he left out the ones on customize grok section tho

#

i cant stand him

torn mantle May 3, 2025, 6:19 PM

#

alpine coral

LMAO

#

so true

keen beacon May 3, 2025, 6:20 PM

#

sage raptor

this is actually the worst bait of all time

#

elon isn't gonna let bro tap

torn mantle May 3, 2025, 6:21 PM

#

sage raptor

confident in his lies

#

nah this is crazy

leaden palm May 3, 2025, 6:22 PM

#

keen beacon elon isn't gonna let bro tap

this is the most that's happened so far

torn mantle May 3, 2025, 6:23 PM

#

he has actually a condition called pseudologia fantastica

keen beacon May 3, 2025, 6:23 PM

#

elon is so stupid it actually annoys me

torn mantle May 3, 2025, 6:23 PM

#

pseudologia fantastica: 'Pathological lying, also known as pseudologia fantastica, is a chronic behavior characterized by the habitual or compulsive tendency to lie.'

keen beacon May 3, 2025, 6:23 PM

#

okay buddy you're dumb but now it's everybody else's problem because you have so much influence

#

go away

torn mantle May 3, 2025, 6:24 PM

#

and probably combined with a personality disorder

#

those types of people scare me

sage raptor May 3, 2025, 6:30 PM

#

🤣

keen beacon May 3, 2025, 6:41 PM

#

lol I think he's completely lost it

zinc ore May 3, 2025, 6:49 PM

#

That's literally par for how he talks, he was doing this for the "strawberry" models for upcoming openAI models, last year.

balmy mist May 3, 2025, 6:50 PM

#

sage raptor 🤣

dude is a joke

#

if it truly is not as good as he says i am finally gonna unfollow him

zinc ore May 3, 2025, 6:51 PM

#

You follow him unironically??

balmy mist May 3, 2025, 6:51 PM

#

yo can these fools release o3 pro:
https://x.com/sama/status/1918735773098004680

Sam Altman (@sama) on X

i have been on a shopping bender this morning, this is much better than i expected!

keen fulcrum May 3, 2025, 6:53 PM

#

sage raptor 🤣

Is it that crazy?

torn mantle May 3, 2025, 6:54 PM

#

balmy mist yo can these fools release o3 pro: https://x.com/sama/status/1918735773098004680

maybe they are waiting for another lab to release smth first

torn mantle May 3, 2025, 6:54 PM

#

sage raptor 🤣

farming engagements for $10

#

how desperate can someone be

tall summit May 3, 2025, 6:55 PM

#

leaden palm this is the most that's happened so far

HAHAHAHAHA

torn mantle May 3, 2025, 6:55 PM

#

elon ruined it with this engagement thingy

torn mantle May 3, 2025, 6:55 PM

#

leaden palm this is the most that's happened so far

aa

#

no wonder

tall summit May 3, 2025, 6:56 PM

#

sage raptor 🤣

this guy talks like chatgpt used to

#

https://x.com/gork/status/1918729399249371417

gork (@gork) on X

@HERMESINVEST1 @iruletheworldmo nah he's just edging us like always

ocean vortex May 3, 2025, 7:20 PM

#

sage raptor

dumbass

small haven May 3, 2025, 7:21 PM

#

https://x.com/btibor91/status/1918700964967796973

claude advanced research ranked at #4

Tibor Blaho (@btibor91) on X

Summary of my findings after comparing more Deep Research reports (these are my personal opinions based on my own tests and experience)

- No perfect solution yet, since all current deep research tools have limitations and make occasional errors, so you need to verify the

torn mantle May 3, 2025, 7:28 PM

#

small haven https://x.com/btibor91/status/1918700964967796973 claude advanced research rank...

tbh kinda agree about claude deep research

#

takes a lot of time with a mid-quality report

#

less details

leaden palm May 3, 2025, 7:30 PM

#

the only thing thats good is that it can use any mcp

tall summit May 3, 2025, 7:31 PM

#

small haven https://x.com/btibor91/status/1918700964967796973 claude advanced research rank...

ai font

torn mantle May 3, 2025, 7:31 PM

#

claude & gemini & grok are more like:

study #1 found 50% improvements.
study #2 found 40% improvements.

but they don't really compare studies and give a conclusion, oai dr does that, it goes into different studies, compares them together and gives you the conclusion, it even looks for the same factors/parameters used and list them, it actually understands what it needs to do.

small haven May 3, 2025, 7:31 PM

#

torn mantle takes a lot of time with a mid-quality report

i think it could really be good if they had a proper base model like o3, but they're stuck with 3.7 sonnet

#

like fetching almost 1k in sources is very impressive

wintry tinsel May 3, 2025, 7:32 PM

#

eLon musk sand God confirmed

gilded drift May 3, 2025, 7:33 PM

#

Guys . the o3 offered in alpha lmarena . Is it high , medium or low ?

torn mantle May 3, 2025, 7:33 PM

#

Prompt : Based on a critical synthesis of recent, high-quality human clinical trials and systematic reviews, determine which compound – Berberine, Propolis, or Resveratrol – demonstrates the most compelling evidence for promoting overall health.

https://claude.ai/public/artifacts/a8a2d065-ddf4-4acf-853d-ddfd9a2fe15e
https://chatgpt.com/share/6813e578-eb4c-8012-976b-f07d475cdac9
https://grok.com/share/bGVnYWN5_03b16e10-b0c0-45b3-8014-ab796009f7b0

ChatGPT

ChatGPT - Health Compound Comparison

Shared via ChatGPT

#

thanks to @small haven

#

grok chose the easiest road, it selected a study that includes 54 systematic reviews and it expanded on that, but its not really a deep research

#

but its still better than gemini

#

oai went in and understood what parameters to compare and analysed them and then came to a conclusion.

#

which is pretty neat tbh

leaden palm May 3, 2025, 7:36 PM

#

imma run this on my gemini pipeline

torn mantle May 3, 2025, 7:36 PM

#

this would take you a lot of time to do

torn mantle May 3, 2025, 7:39 PM

#

leaden palm imma run this on my gemini pipeline

run it boi

#

you have gemini advanced?

#

are we sure that the free version uses gemini 2.5 pro for deep research?

leaden palm May 3, 2025, 7:39 PM

#

no i just used their api + exa

#

looks like it only did 3 searches and didn't make any tables

#

📎 report.md

#

let me see what happens if i tell it to search deeper and use tables

#

go 2

📎 report.md

torn mantle May 3, 2025, 7:46 PM

#

leaden palm go 2

not bad

#

yea this is better than my gemini report

zinc ore May 3, 2025, 7:51 PM

#

torn mantle are we sure that the free version uses gemini 2.5 pro for deep research?

Free version doesn't use 2.5

sage raptor May 3, 2025, 8:00 PM

#

#

do they really have access to grok 3.5 ?

small haven May 3, 2025, 8:02 PM

#

no, its just engagement bait

leaden palm May 3, 2025, 8:03 PM

#

sage raptor do they really have access to grok 3.5 ?

what do you think

sage raptor May 3, 2025, 8:04 PM

#

leaden palm what do you think

probably no

golden ocean May 3, 2025, 8:08 PM

#

gpt-4-32k-0314 access

calm sequoia May 3, 2025, 8:10 PM

#

torn mantle oai went in and understood what parameters to compare and analysed them and then...

I didn't know resveratrol reduced BP. Thanks!

remote niche May 3, 2025, 8:14 PM

#

zinc ore Free version doesn't use 2.5

sorry noob question , what does gemini paid give vs the free version , will i see improvement on the gemini 2.5 pro i am using on AI studio

zinc ore May 3, 2025, 8:16 PM

#

They're talking about deep research

#

Free version of deep research doesn't use 2.5, is my understanding. But maybe my information is outdated and that's changed over the past month.

remote niche May 3, 2025, 8:18 PM

#

ok so its the same 2.5 on paid and free models , not the low ,medium high crap Open AI uses

tall summit May 3, 2025, 8:33 PM

#

sage raptor

LMAO this is obvious satire

#

pretty sure he's satirizing the people who really do claim they have early access for engagement bait

sage raptor May 3, 2025, 8:34 PM

#

tall summit pretty sure he's satirizing the people who really do claim they have early acces...

ya lol

tall summit May 3, 2025, 8:34 PM

#

HAHAHAHAHA

#

now the question is does he know thats satire and is he playing dumb or

#

you can tell whether they have early access because they say concrete things

small haven May 3, 2025, 8:38 PM

#

lemme guess, you have access

#

i like how u can use o3 to quickly test hypothesis... im hard.

zinc ore May 3, 2025, 8:53 PM

#

tall summit now the question is does he know thats satire and is he playing dumb or

No, he knows, he openly mocks all the people that believe him

#

That's why it's hilarious seeing people take him seriously, even here

brittle tiger May 3, 2025, 9:04 PM

#

@torn mantle here's 2.5 deep research on your prompt
https://docs.google.com/document/d/15ZthQT7kwJu0MLgqFTyqGjkX5pdol3qz2YwV922x0m4/edit?usp=sharing

Google Docs

Compound Health Evidence Comparison

A Comparative Evaluation of Berberine, Propolis, and Resveratrol for Overall Health Promotion Based on Human Clinical Evidence (2015-2025) 1. Introduction 1.1 Overview Berberine, an isoquinoline alkaloid derived from plants like Coptis chinensis and Berberis species; Propolis, a complex resinous ...

vivid oyster May 3, 2025, 9:25 PM

#

ocean vortex May 3, 2025, 10:12 PM

#

remote niche ok so its the same 2.5 on paid and free models , not the low ,medium high crap O...

it's not "crap". 2.5 is equivalent to low-med (slightly longer than low but notably shorter thinking than medium) and you are stuck with this 1 option. With OpenAI you can choose to have longer or shorter reasoning than that. 2.5 is better base model, but o3-high gets much more done with test-time compute alone than 2.5. This means that for precision complex recursive tasks o3 is simply better.

remote niche May 3, 2025, 10:14 PM

#

ocean vortex it's not "crap". 2.5 is equivalent to low-med (slightly longer than low but nota...

i have used 2.5 and o3 model for medical research , at least in my case scenario , i found 2.5 to be vastly superior

ocean vortex May 3, 2025, 10:15 PM

#

remote niche i have used 2.5 and o3 model for medical research , at least in my case scenario ,...

You were probably testing knowledge mostly then, for that you rarely need reasoning...

remote niche May 3, 2025, 10:16 PM

#

it does require some reasoning , not as much as maths or CS but yeah

#

ok which version of o3 is available for plus users

ocean vortex May 3, 2025, 10:23 PM

#

remote niche ok which version of o3 is available for plus users

medium, it's the "default"

remote niche May 3, 2025, 10:24 PM

#

so is the deep research the high model ?

ocean vortex May 3, 2025, 10:24 PM

#

as in, if not said otherwise it's always that

#

deep research is something custom made to work for that. It was using a version of o3 even before it was released as standalone model

remote niche May 3, 2025, 10:25 PM

#

so we dont have access to o3 high at all as plus users right

#

and any idea if grok3.5 would be integrated to LM arena board

leaden palm May 3, 2025, 10:26 PM

#

remote niche and any idea if grok3.5 would be integrated to LM arena board

of course it'll hit the leaderboards

#

whether it will be in direct chat is a different question

torn mantle May 3, 2025, 10:26 PM

#

brittle tiger @torn mantle here's 2.5 deep research on your prompt https://docs.googl...

This is a high quality report

remote niche May 3, 2025, 10:27 PM

#

any bets if grok 3.5 will top the board ?

leaden palm May 3, 2025, 10:29 PM

#

remote niche any bets if grok 3.5 will top the board ?

finally, an actual use for those incessant prediction markets

torn mantle May 3, 2025, 10:29 PM

#

@brittle tiger it may actually be better than the oai dr version

remote niche May 3, 2025, 10:30 PM

#

OAI has a dr version ?, elaborate pls

leaden palm May 3, 2025, 10:30 PM

#

remote niche OAI has a dr version ?, elaborate pls

they werent replying to you

brittle tiger May 3, 2025, 10:33 PM

#

torn mantle @brittle tiger it may actually be better than the oai dr version

i always run both on prompts for my usecases. usually prefer 2.5 like 60% of the time but sometimes oai is way better. the one click to google docs is very nice with 2.5 and sourcing is done better

torn mantle May 3, 2025, 10:34 PM

#

brittle tiger i always run both on prompts for my usecases. usually prefer 2.5 like 60% of th...

I think the one that was shared was based on o4mini ver

wintry tinsel May 3, 2025, 10:34 PM

#

sage raptor

no way?

torn mantle May 3, 2025, 10:34 PM

#

Nonetheless it was good as well

wintry tinsel May 3, 2025, 10:34 PM

#

Thid i

#

Grok 3.5 is real

#

eLon ASI confirmed?

#

I was getting worried for a second it wasn't real that's a relief

sour spindle May 3, 2025, 10:37 PM

#

Grok 3 was my favorite model for a month or so

#

Haven’t touched it in awhile

torn mantle May 3, 2025, 10:37 PM

#

wintry tinsel eLon ASI confirmed?

Yes ASI

#

Not even AGI

#

Straight up ASI

remote niche May 3, 2025, 10:38 PM

#

i know AGI what is ASI ?

wintry tinsel May 3, 2025, 10:38 PM

#

I expect no less from the mastermind Braniac Elon himself

remote niche May 3, 2025, 10:38 PM

#

super intelligence ?

wintry tinsel May 3, 2025, 10:38 PM

#

grok 3.5 is the greatest event of the millennium

golden ocean May 3, 2025, 10:40 PM

#

source

#

in alternate timeline oai made gpt-5 agi instead of 4o

wintry tinsel May 3, 2025, 10:47 PM

#

grok 3.5 is a scam?! Shocks the entire industry

brittle tiger May 3, 2025, 10:56 PM

#

wintry tinsel grok 3.5 is the greatest event of the millennium

damn the greatest model of the millennium has 20% odds of being #1 on arena by end of may

brittle tiger May 3, 2025, 11:27 PM

#

bleak venture May 4, 2025, 12:01 AM

#

Hey everyone!
I'm new to Arena and excited to join the community! I wanted to ask about how to add the DeepSeek-R1T-Chimera model to the arena.
There are indications that it performs better than DeepSeek-R1, and it would be super interesting to see if it outcompetes other models.
Any guidance on how to get started would be greatly appreciated!
Thanks, DK

bleak venture May 4, 2025, 12:02 AM

#

bleak venture Hey everyone! I'm new to Arena and excited to join the community! I wanted to as...

(It's been in the top 10 trending models on Hugging Face this week, https://huggingface.co/tngtech/DeepSeek-R1T-Chimera, https://x.com/tngtech/status/1916284566127444468)

small haven May 4, 2025, 12:17 AM

#

brittle tiger

such a fkced leaderboard, like oai should be first, not google

brittle tiger May 4, 2025, 12:22 AM

#

small haven such a fkced leaderboard, like oai should be first, not google

Are you saying market odds for top score at end of May are bad or lmarena is?

small haven May 4, 2025, 12:24 AM

#

brittle tiger Are you saying market odds for top score at end of May are bad or lmarena is?

the latter

brittle tiger May 4, 2025, 12:26 AM

#

small haven the latter

I'll ask o3-pro api for some ideas on how to improve

small haven May 4, 2025, 12:26 AM

#

doesnt hit no more

misty vault May 4, 2025, 1:36 AM

#

leaden palm May 4, 2025, 1:40 AM

#

misty vault

are these new generations or from the archive

#

surely bing chat no longer exists

small haven May 4, 2025, 1:56 AM

#

huh

#

theyre offloading gpus for o3 pro, i respect that.. they will need it

small haven May 4, 2025, 2:31 AM

#

just get extra accounts buddy

small haven May 4, 2025, 4:24 AM

#

gork is using grok 3.5 aint it ..

#

like what hahha

zinc ore May 4, 2025, 4:39 AM

#

That user is on discord

small haven May 4, 2025, 4:41 AM

#

it is automated, check x.com/@gork

zinc ore May 4, 2025, 4:44 AM

#

Actually you may be right, there's a gork on discord that always links to the Twitter account, but now I'm not sure they're even related. Would be interesting if that is 3.5

olive mesa May 4, 2025, 6:00 AM

#

sage raptor

what

#

they're basically saying grok 3.5 is around ASI level

#

lmao might as well have said 1000-5000x because they would easily be able to make self-improving ai

torn mantle May 4, 2025, 6:10 AM

#

sage raptor

Why is elon retweeting his posts?

torn mantle May 4, 2025, 6:11 AM

#

small haven like what hahha

Yea i saw that yesterday

#

One of xai devs asked people to follow @gork or smth like that

#

Didn't think too much about it tbh

keen beacon May 4, 2025, 7:21 AM

#

olive mesa they're basically saying grok 3.5 is around ASI level

it's probably only slightly better than grok 3

#

and i bet their reasoning implementation still sucks

ocean vortex May 4, 2025, 7:38 AM

#

small haven like what hahha

hmmm
https://x.com/gork/status/1918908197349593234

#

keen beacon May 4, 2025, 7:44 AM

#

i hope they keep gork around

keen fulcrum May 4, 2025, 7:51 AM

#

keen beacon i hope they keep gork around

Is gork by x

keen beacon May 4, 2025, 7:51 AM

#

also looks like gork can use the web/twitter for up to date info

keen fulcrum May 4, 2025, 7:51 AM

#

Is it finetuned to respond so sarcastically

keen beacon May 4, 2025, 7:51 AM

#

keen fulcrum Is gork by x

it looks like it

#

elon has been promoting it and it only appeared under a week ago

#

it's basically just the @grok account but with a proper personality

#

but the rumour is it's also them piloting 3.5

keen fulcrum May 4, 2025, 7:53 AM

#

X is filled with parody accounts, makes it hard to find out whether the message came from the real person

keen beacon May 4, 2025, 7:53 AM

#

it's not a real person lol

#

can't be

ocean vortex May 4, 2025, 8:13 AM

#

keen beacon can't be

it's either Elon himself or some other bot

torn mantle May 4, 2025, 8:38 AM

#

its probably elon

#

he already runs multiple accounts

keen beacon May 4, 2025, 8:38 AM

#

its not possible to reply that fast as a human

torn mantle May 4, 2025, 8:38 AM

#

keen beacon its not possible to reply that fast as a human

hes jobless

#

some new neuralink tech implanted in his brain

#

100%

keen beacon May 4, 2025, 8:39 AM

#

torn mantle hes jobless

understanding a comment/thread and typing it all out. and the sheer amount of replies. its not possible for a human

#

its a bot 99% likely. grok 3.5 i guess. it makes sense for a marketing stunt

torn mantle May 4, 2025, 8:41 AM

#

ofc its grok 3.5

#

we are just trolling

keen beacon May 4, 2025, 8:41 AM

#

yea i didnt get that until u typed neuralink lol

torn mantle May 4, 2025, 8:42 AM

#

xd

#

so we should have like grok 3.5 mini ( the one prob used on gork bot acc ) and reasoning + instruct model?

keen beacon May 4, 2025, 8:42 AM

#

maybe

#

im more excited for 2.5 ultra though

torn mantle May 4, 2025, 8:43 AM

#

or more likely a finetuned ver for x

#

yea this is probably the case

#

a finetuned ver for x

torn mantle May 4, 2025, 8:43 AM

#

keen beacon im more excited for 2.5 ultra though

isnt it just more rate limits

keen beacon May 4, 2025, 8:44 AM

#

i hope the situation is similar to 2.5 pro but if the rate limits are far worse then its unusable

#

if it's a huge model like 1.0 ultra the rate limits will be harsh

#

on ai studio it'll probably be something like 10 per day

drifting thorn May 4, 2025, 9:01 AM

#

brisk turret May 4, 2025, 9:01 AM

#

So lm arena just stopped updating

keen beacon May 4, 2025, 9:02 AM

#

keen beacon if it's a huge model like 1.0 ultra the rate limits will be harsh

hopefully its just slightly larger compared to 2.5 pro and they can still host it reasonably
e.g. sonnet 3 -> sonnet 3.5 increase in size. gemini 1.5 flash -> gemini 2.0 flash size increase. (gem 2 flash lite is seemingly the same size of gemini 1.5 flash)

drifting thorn May 4, 2025, 9:08 AM

#

I guess Google TPU will help them a lot

#

So even with a large model (>1T parameters) the cost is reasonable

ocean vortex May 4, 2025, 9:09 AM

#

brisk turret So lm arena just stopped updating

why did you vote for Behemoth that's like easily the least promising one of the bunch 💀

drifting thorn May 4, 2025, 9:09 AM

#

I'm the believer of this paper

📎 Advances_and_Challenges_in_Foundation_Agents.pdf

#

And I guess stop caring bout AI that much rn until the intelligence explosion

#

is a good decision

keen beacon May 4, 2025, 9:11 AM

#

drifting thorn So even with a large model (>1T parameters) the cost is reasonable

im not sure they would do a >1t model though. its kinda risky. even 1.0 ultra wasnt 1t+

#

a model that is slightly larger than 2.5 pro makes more sense, but its still very plausible it could be above or around 1t

keen fulcrum May 4, 2025, 9:13 AM

#

drifting thorn

Grok 3.5 > R2

drifting thorn May 4, 2025, 9:14 AM

#

from the current situation, my prediction is that larger parameters=better SimpleQA performance

keen beacon May 4, 2025, 9:15 AM

#

drifting thorn from the current situation, my prediction is that larger parameters=better Simpl...

yes this is generally true

#

but i dont think they made 2.5 pro larger compared to 2.0 pro and there was like a 10% increase in simpleqa

drifting thorn May 4, 2025, 9:18 AM

#

so I think with proper training and architecture, the bigger the better

#

since LLMs and LMMs don't know what is "fact"

alpine coral May 4, 2025, 9:39 AM

#

torn mantle or more likely a finetuned ver for x

yeah i mean who knows.. but as far as i can tell it could just be a fine tune of an existing grok or whatever (nothing particularly special about the responses; they're just super snarky )

#

but maybe it is grok 3.5 (presumably fine tuned or with some system prompt) and part of a marketing stunt

unborn ocean May 4, 2025, 9:52 AM

#

keen beacon im not sure they would do a >1t model though. its kinda risky. even 1.0 ultra wa...

Well the old ultra was monolithic and it’s safe to say the new one won’t be. Furthermore a total parameter count of >1t seems very likely considering that most models these days get awfully close to that number (and are significantly weaker than ultra will likely be, e.g. deepseek @ around 700b)

#

I would honestly suspect that 2.5 pro is already above the 1t parameter count (but that really is just me guessing)

#

In the age of complicated MoE or mixture of modalities or mixture of interleaved experts and complex adaptive quantisation strategies and complicated spec decode algorithms the total parameter count does not mean as much as it used to.

keen fulcrum May 4, 2025, 10:04 AM

#

https://world.org what is this and why is openai promoting it
is this one of the dozens of defi apps

World - The real human network.

Identity, finance and community for every human.

calm sequoia May 4, 2025, 10:24 AM

#

This may be one of Sam's startups

barren prairie May 4, 2025, 10:42 AM

#

keen beacon on ai studio it'll probably be something like 10 per day

It is fine for me 10 rpd

calm sequoia May 4, 2025, 10:57 AM

#

Does anyone here really believe the new Qwen is better than o1 at coding?

keen beacon May 4, 2025, 10:59 AM

#

calm sequoia Does anyone here really believe the new Qwen is better than o1 at coding?

it isnt even thinking there lol

#

i dont know how to feel about that

unborn ocean May 4, 2025, 11:24 AM

#

calm sequoia Does anyone here really believe the new Qwen is better than o1 at coding?

Idk maybe because it is not diff, but whole edits

#

But it really seems a bit too high

dapper raven May 4, 2025, 11:32 AM

#

AI systems: do you use them or not? Share your opinion with us! A short survey (7-10 minutes) for psychology research: https://forms.gle/u7EdUQ2DR9BuYVJQA Thank you!

Google Docs

Your Experiences with AI Chatbots / تجاربك مع أنظمة ا...

misty vault May 4, 2025, 1:14 PM

#

#

torn mantle May 4, 2025, 1:30 PM

#

LMAO

leaden meteor May 4, 2025, 2:04 PM

#

When can we expect grok 3.5 on arena? even under anonymous name... since I dont see anyone talking about it yet, I am assuming it is not on arena yet even as anonymous model..

pastel monolith May 4, 2025, 2:39 PM

#

why was GLM-4-32B-0414 not added to the arena?

calm sequoia May 4, 2025, 2:41 PM

#

Lol dude, you need months not weeks

cedar tide May 4, 2025, 2:41 PM

#

Until o3 pro that coming soon

brittle tiger May 4, 2025, 2:41 PM

#

pastel monolith May 4, 2025, 2:42 PM

#

according to who is a model selected to be added to lmarena?

tall summit May 4, 2025, 2:48 PM

#

brittle tiger

where is that

keen beacon May 4, 2025, 2:53 PM

#

https://io.google/

Google I/O 2025

sturdy mica May 4, 2025, 3:12 PM

#

wow

leaden palm May 4, 2025, 3:12 PM

#

keen fulcrum https://world.org what is this and why is openai promoting it is this one of the...

he doesn't know about ubi

sturdy mica May 4, 2025, 3:13 PM

#

why are “1 week” and “7 days” a choice

#

https://tenor.com/view/cillianmurphygun-cillianmurphy-jazmincoded-gif-3811002643797340103

Tenor

#

oh

#

i misread 7 weeks

#

https://tenor.com/view/dr-house-house-md-doctor-freaky-gif-7807484460015619799

Tenor

golden ocean May 4, 2025, 3:14 PM

#

leaden palm May 4, 2025, 3:14 PM

#

but there is 4 weeks vs 1 month

tall summit May 4, 2025, 3:14 PM

#

golden ocean

HAHAHA

leaden palm May 4, 2025, 3:15 PM

#

...

sturdy mica May 4, 2025, 3:15 PM

#

#

i sent a photo of me

#

im so handsome right

golden ocean May 4, 2025, 3:16 PM

#

this is my photo

#

im so handsome right

leaden palm May 4, 2025, 3:16 PM

#

hmm this isnt 1024x1536

sturdy mica May 4, 2025, 3:16 PM

#

thats like 1.9 mc

#

the colors are weird

vivid oyster May 4, 2025, 3:17 PM

#

New gamini 5.6 ultra coder is so good

misty vault May 4, 2025, 3:19 PM

#

sturdy mica thats like 1.9 mc

golden ocean May 4, 2025, 3:19 PM

#

vivid oyster New gamini 5.6 ultra coder is so good

sturdy mica May 4, 2025, 3:19 PM

#

misty vault

wtf is this

misty vault May 4, 2025, 3:20 PM

#

idk

#

hitler bed wars perhaps

brittle tiger May 4, 2025, 3:24 PM

#

leaden palm he doesn't know about ubi

You don't need crypto for ubi

vivid oyster May 4, 2025, 3:25 PM

#

golden ocean

#

#

Gemain 2.5 pro tweaking

high egret May 4, 2025, 3:44 PM

#

when are we getting qwen3 on lmarena plzzzz

balmy mist May 4, 2025, 3:44 PM

#

vivid oyster New gamini 5.6 ultra coder is so good

How are you using it?

rugged brook May 4, 2025, 3:45 PM

#

balmy mist How are you using it?

Its out

#

havent u heard of it

rugged brook May 4, 2025, 3:46 PM

#

vivid oyster

this is gemini 5.6 pro coding

#

its in ai studio

leaden palm May 4, 2025, 3:49 PM

#

high egret when are we getting qwen3 on lmarena plzzzz

in testing? in direct chat? on the leaderboard?

balmy mist May 4, 2025, 3:57 PM

#

rugged brook this is gemini 5.6 pro coding

no its not lol

#

send screenshot

sage raptor May 4, 2025, 4:12 PM

#

#

its real

keen beacon May 4, 2025, 4:12 PM

#

omg

high egret May 4, 2025, 4:20 PM

#

omg it's real

#

it's a special model they put it in Times New Roman

leaden palm May 4, 2025, 4:29 PM

#

rugged brook this is gemini 5.6 pro coding

real

wheat onyx May 4, 2025, 4:35 PM

#

https://x.com/nobel_lauraette/status/1919047987105280309?s=19

Arbitrary leaks (@nobel_lauraette) on X

Actual grok 3.5 benchmark scores

#

Imagine this is real

keen beacon May 4, 2025, 4:35 PM

#

fake grok 3.5 is asi++

#

it scores 100% on every benchmark even on questions with incorrect ground truths because it knows that

leaden palm May 4, 2025, 4:38 PM

#

"first model that can reason from first principles and come up with answers that simply don’t exist on the Internet"

only 7.8% improvement on google proof QA

keen beacon May 4, 2025, 4:39 PM

#

wheat onyx https://x.com/nobel_lauraette/status/1919047987105280309?s=19

the scores are somewhat plausible

barren prairie May 4, 2025, 4:43 PM

#

leaden palm real

I didn't find it

leaden palm May 4, 2025, 4:46 PM

#

was this just added today?

#

for the record:

You may only use this website for your personal or internal business purposes. You must not access the website programmatically, scrape or extract data, manipulate any leaderboard or ranking, or authorize or pay others to access or use the website on your behalf. Unauthorized use may result in suspension or termination of your access, including access by your organization.

keen fulcrum May 4, 2025, 4:46 PM

#

wheat onyx May 4, 2025, 4:49 PM

#

So do we just cancel all our ai services if it's true?

keen fulcrum May 4, 2025, 4:49 PM

#

Indeed
SuperGrok

brittle tiger May 4, 2025, 4:49 PM

#

Fake as hell

keen fulcrum May 4, 2025, 4:49 PM

#

Is the way to go

brittle tiger May 4, 2025, 4:49 PM

#

Those grok 3 eval numbers are all wrong

keen fulcrum May 4, 2025, 4:49 PM

#

I don't think so
Its realistic
Gemini 2.5 Ultra isn't released yet

keen beacon May 4, 2025, 4:51 PM

#

brittle tiger Those grok 3 eval numbers are all wrong

they arent

#

seems to line up with grok 3 thinking on the blog post

brittle tiger May 4, 2025, 4:54 PM

#

keen beacon seems to line up with grok 3 thinking on the blog post

link? am i looking at wrong numbers?

https://x.ai/news/grok-3

Grok 3 Beta — The Age of Reasoning Agents | xAI

We are thrilled to unveil an early preview of Grok 3, our most advanced model yet, blending superior reasoning with extensive pretraining knowledge.

keen beacon May 4, 2025, 4:55 PM

#

brittle tiger link? am i looking at wrong numbers? https://x.ai/news/grok-3

look at grok 3 beta thinking and hover ur cursor over the scores to see non cons 64

leaden palm May 4, 2025, 4:56 PM

#

keen beacon seems to line up with grok 3 thinking on the blog post

but not exactly...
84.2 is not 83.9, 77.1 is not 77.3, 80.4 is not 80.2, 76.2 is not 76

keen beacon May 4, 2025, 4:56 PM

#

yea variance is expected

#

if they remeasured it

#

the simpleqa score increase is sus, but i don't know

#

if it is fake, the person who made it understands how scores work a little though; its somewhat believable

brittle tiger May 4, 2025, 4:58 PM

#

comparing cons@64 to pass@1 for 2.5 and o3 is dumb though

torn mantle May 4, 2025, 4:58 PM

#

wheat onyx https://x.com/nobel_lauraette/status/1919047987105280309?s=19

+100% wen?

#

true

#

its probably that

#

bigbrain = you

#

typing so fast

#

wait

#

are you @gork ?

keen fulcrum May 4, 2025, 5:00 PM

#

torn mantle are you @gork ?

Gork is indeed by xAI

sturdy mica May 4, 2025, 5:00 PM

#

keyboard practice+ai

keen fulcrum May 4, 2025, 5:00 PM

#

I pinged gork and grok replied

sturdy mica May 4, 2025, 5:00 PM

#

=*

#

yes

torn mantle May 4, 2025, 5:01 PM

#

keen fulcrum Gork is indeed by xAI

yea gork is grok 3.5

keen fulcrum May 4, 2025, 5:01 PM

#

torn mantle yea gork is grok 3.5

A tweaked 3.5
if it is

#

Why would they bring an old grok 3 bot to life?

#

It became round only a week ago, we believe for testing the humour of LLMs

torn mantle May 4, 2025, 5:03 PM

#

elon been hyping it since yesterday

#

some xai devs as well

keen fulcrum May 4, 2025, 5:03 PM

#

They decided to run a live test instead of lmarena
hopefully tomorrow on lmarena too

keen beacon May 4, 2025, 5:04 PM

#

keen fulcrum

where did you get this from?

#

those are pretty insane numbers for a base model (that is... if those benchmarks are for the base)

#

scroll up "arbitrary leaks" apparently

keen fulcrum May 4, 2025, 5:04 PM

#

https://x.com/nobel_lauraette

Arbitrary leaks (@nobel_lauraette) on X

/aicg/ refugee

keen beacon May 4, 2025, 5:04 PM

#

wheat onyx https://x.com/nobel_lauraette/status/1919047987105280309?s=19

oh

#

i mean, the design of the table does seem in line with xAI

keen beacon May 4, 2025, 5:05 PM

#

keen beacon those are pretty insane numbers for a base model (that is... if those benchmarks...

this is the reasoning variant if it is real im pretty sure

#

not much else to go off of tho

keen fulcrum May 4, 2025, 5:05 PM

#

This is not a credible source at all, so don't expect this to be true

keen beacon May 4, 2025, 5:05 PM

#

i doubt that

#

nah

#

the variance in the grok 3 scores could indicate remeasuring, its quite a minor detail for someone to notice/fake

leaden palm May 4, 2025, 5:05 PM

#

at least by default

#

does anyone happen to have any messages back from when people were still doubting reasoning

keen beacon May 4, 2025, 5:05 PM

#

Lmao

keen beacon May 4, 2025, 5:05 PM

#

keen beacon the variance in the grok 3 scores could indicate remeasuring, its quite a minor ...

yeah I doubt any run of the mill troll would be paying that close attention

keen fulcrum May 4, 2025, 5:05 PM

#

How much context and parameters will grok 3.5 have?

keen beacon May 4, 2025, 5:06 PM

#

probably similar to 3

#

so ~1T

keen fulcrum May 4, 2025, 5:06 PM

#

I want 1M context

leaden palm May 4, 2025, 5:07 PM

#

nah elons probably been hounding the engineers

#

"api on launch"

#

"make it so i can put all of the x codebase in there"

keen beacon May 4, 2025, 5:07 PM

#

given the increasing pace of stuff about 3.5 trickling out, i would say a release in the next few days is quite likely

wheat onyx May 4, 2025, 5:07 PM

#

Ive been using grok 3 on and off, I've actually found its gotten worse and hallucinating more

leaden palm May 4, 2025, 5:07 PM

#

wheat onyx Ive been using grok 3 on and off, I've actually found its gotten worse and hallu...

openai, anthropic, xai... thats what they always say

leaden palm May 4, 2025, 5:08 PM

#

leaden palm openai, anthropic, xai... thats what they always say

to this date there hasnt been any real evidence in #1278144735264903178

keen beacon May 4, 2025, 5:08 PM

#

where did ulink

#

its unknown for me

leaden palm May 4, 2025, 5:08 PM

#

keen beacon where did ulink

anthropic discord

keen beacon May 4, 2025, 5:08 PM

#

oh

#

Grok 3.5

torn mantle May 4, 2025, 5:12 PM

#

I want 10M context

#

nah claude max is meh

vivid oyster May 4, 2025, 5:13 PM

#

ر

#

ز

#

ز

#

ز

#

ز

#

ز

#

ز

#

زز

#

ز

#

ز

#

ز

#

ز

#

ز

#

ز

#

ز

#

ز

#

ز

#

ز

leaden palm May 4, 2025, 5:13 PM

#

finally i have a reason to mod

keen beacon May 4, 2025, 5:13 PM

#

banana

#

hmm

torn mantle May 4, 2025, 5:24 PM

#

trust center

#

hmmmmmmmmmmmmmm

#

mmmmmmmmmmmmmmmmmmm

brittle tiger May 4, 2025, 6:07 PM

#

keen beacon given the increasing pace of stuff about 3.5 trickling out, i would say a releas...

yeah. this week seems reasonable. i bet team is moving extra hard to get it out before I/O. even if 3.5 is better than I/O releases they wouldn't want to take the chance

torn mantle May 4, 2025, 6:18 PM

#

what am i reading...

#

70B valuation for this

olive mesa May 4, 2025, 6:20 PM

#

torn mantle what am i reading...

tf

#

i swear llms are always so cringe trying to act "gen z"

#

if that acc is using grok 3.5 then i have low expectations

keen beacon May 4, 2025, 6:21 PM

#

olive mesa i swear llms are always so cringe trying to act "gen z"

the best model i've ever tested for doing it while not making me want to die is claude 3 opus

#

jailbreaked opus was so good

teal mantle May 4, 2025, 6:32 PM

#

keen beacon the best model i've ever tested for doing it while not making me want to die is ...

Is opus no longer served? Rip

#

Btw should I get supergrok or chatgpt plus/pro?

#

One for grok 3.5 another for o3

keen fulcrum May 4, 2025, 6:33 PM

#

I wonder whether AI will be given a voice

brittle tiger May 4, 2025, 6:34 PM

#

gork voice would be fortnite zoomer vocal fry fr fr

small haven May 4, 2025, 6:40 PM

#

o3 pro next week i can feel it

small haven May 4, 2025, 6:45 PM

#

torn mantle what am i reading...

pretty sure this guy is responsible for 90% of gork hahah

torn mantle May 4, 2025, 6:46 PM

#

small haven pretty sure this guy is responsible for 90% of gork hahah

Yep

#

Its him

#

I was going to say that but i forgot

#

Its probably him

ember rapids May 4, 2025, 6:55 PM

#

torn mantle what am i reading...

Wdym looks like agi to me

tall summit May 4, 2025, 7:13 PM

#

torn mantle what am i reading...

is that not a real person

small haven May 4, 2025, 7:15 PM

#

rip turing test

keen beacon May 4, 2025, 7:23 PM

#

lowk gonna kms

#

this is a grown man 💔

zinc ore May 4, 2025, 7:24 PM

#

Dude is honestly weird, even changing his name to gorklon rust was whew

torn mantle May 4, 2025, 7:24 PM

#

So cringe

small haven May 4, 2025, 7:25 PM

#

thats our era's edison right there

teal mantle May 4, 2025, 7:30 PM

#

keen beacon lowk gonna kms

What in tarnation

#

Btw should I get supergrok for 3.5 or chatgpt plus or pro for o3?

#

Need both a frontier reasoner and deep research

But my usage is not enough to exhaust both

torn mantle May 4, 2025, 7:34 PM

#

Deep research, you can go with gemini advanced

#

Grok isnt worth it atm

small haven May 4, 2025, 7:35 PM

#

the edge is obviously chatgpt

teal mantle May 4, 2025, 7:35 PM

#

torn mantle Deep research, you can go with gemini advanced

Can’t get gemini advanced, I made my account in unsupported countries

torn mantle May 4, 2025, 7:36 PM

#

teal mantle Can’t get gemini advanced, I made my account in unsupported countries

Good job

#

😭

teal mantle May 4, 2025, 7:37 PM

#

torn mantle Good job

Imply I picked where to live in

#

The Google accounts I have are few years old

zinc ore May 4, 2025, 7:38 PM

#

Skill issue

torn mantle May 4, 2025, 7:43 PM

#

Just create a new acc and use vpn or smth

#

You get 2 months for free

torn mantle May 4, 2025, 7:44 PM

#

teal mantle The Google accounts I have are few years old

Its better than paying $200/month

small haven May 4, 2025, 7:54 PM

#

unpopular opinion but chatgpt pro is cheaper than gemini advanced/claude max/supergrok combined, iykyk

ocean vortex May 4, 2025, 7:54 PM

#

keen beacon this is a grown man 💔

of all the oligarchs in the world he is acting like the dumbest of them all

#

like he is regarded or smth catgrin

keen beacon May 4, 2025, 7:56 PM

#

elon retweeted the grifter

#

including the benchmark screenshot

#

weird

zinc ore May 4, 2025, 8:02 PM

#

Free hype for the normies

small haven May 4, 2025, 8:02 PM

#

keen beacon elon retweeted the grifter

the fact he retweeted that makes me believe those numbers are true?

#

u proly right haha

late path May 4, 2025, 8:09 PM

#

is grok 3.5 in arena now?

teal mantle May 4, 2025, 8:09 PM

#

But did other labs use cons@64?
if not it means it is bogus

keen beacon May 4, 2025, 8:10 PM

#

they have in the past

#

anthropic did for sonnet 3.7 i think its not clear how many samples they used

#

they report pass@1 and "parallel test time compute" (seemingly cons-like, as they mention they sample several sequences)

#

weird how they dont mention the sample count at all

#

the language used seems to be trying to obfuscate the nature of it
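For anyone following the pass@1 vs cons@64 point: a minimal sketch of why the two numbers aren't comparable. This assumes cons@k means "take a majority vote over k sampled answers and grade that", and pass@1 is estimated as the average accuracy of single attempts; the sample answers below are made up purely for illustration.

```python
from collections import Counter

def pass_at_1(samples, truth):
    """Average single-attempt accuracy: fraction of samples matching the truth."""
    return sum(s == truth for s in samples) / len(samples)

def cons_at_k(samples, truth):
    """Majority vote over all k samples: 1.0 if the most common answer is right."""
    majority, _ = Counter(samples).most_common(1)[0]
    return 1.0 if majority == truth else 0.0

# 64 hypothetical sampled answers to one question (invented data)
samples = ["42"] * 30 + ["41"] * 20 + ["7"] * 14

p1 = pass_at_1(samples, "42")   # 30 of 64 single attempts are correct
c64 = cons_at_k(samples, "42")  # the majority answer "42" is correct
print(p1, c64)
```

Here the model is right less than half the time on a single try, yet the voted answer still counts as fully correct, so a cons@64 score next to someone else's pass@1 flatters the first model.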

ocean vortex May 4, 2025, 8:13 PM

#

it's not, it's new. based on gpt4.1

#

o3-preview was old

keen beacon May 4, 2025, 8:15 PM

#

there are so many footnotes on the 3.7 sonnet benchmark graphic lol

north vale May 4, 2025, 8:17 PM

#

it's probably not that new, prolly o3 is built on 4.1 and 4.1 is a few months old

#

bc they have same knowledge cutoff

keen beacon May 4, 2025, 8:17 PM

#

it is. they retrained o3 though

north vale May 4, 2025, 8:19 PM

#

they are pretty good and have a lot of compute but they haven't rly closed the gap, like research wise they will just have fewer algo improvements than oai, gdm or anthropic will

small haven May 4, 2025, 8:19 PM

#

cus elon is talent poaching across his 3 companies, thats his moat lol

keen beacon May 4, 2025, 8:20 PM

#

i doubt its an early checkpoint if its real

#

this is the version theyre gonna release soon

#

"early checkpoint" is being used as a marketing term there

#

did they ever release grok 3 (full) reasoning btw?

small haven May 4, 2025, 8:25 PM

#

the question is o3 pro ≷ grok 3.5 ?

keen beacon May 4, 2025, 8:25 PM

#

they reported scores though

#

vaporware lol

#

it probably does

#

it sucks

wintry locust May 4, 2025, 8:28 PM

#

keen beacon elon retweeted the grifter

jesus christ

#

ok coming clean i made that chart

#

i wanted to get my empty twt account to some amount of followers so it wouldn't get shadow banned

keen beacon May 4, 2025, 8:28 PM

#

lol are you fr

wintry locust May 4, 2025, 8:28 PM

#

yea

keen beacon May 4, 2025, 8:28 PM

#

send a ss of the profile

wintry locust May 4, 2025, 8:29 PM

#

teal mantle May 4, 2025, 8:29 PM

#

Because of prior experiences

wintry locust May 4, 2025, 8:29 PM

#

it's just an edit of the grok 3 blog post chart

#

with numbers copy+pasted from the gemini 2.5 pro blog post that i noised a little

lean whale May 4, 2025, 8:39 PM

#

Hey guys, I have been working on a project for systematic LLM jailbreaking and red-teaming. Would appreciate any feedback!

https://github.com/General-Analysis/GA

GitHub

GitHub - General-Analysis/GA: An encyclopedia of jailbreaking techn...

An encyclopedia of jailbreaking techniques to make AI models safer. - General-Analysis/GA

late path May 4, 2025, 8:40 PM

#

the leaderboard hasnt been updated for 13 days

#

the next update should not have grok3.5 on it i guess?

keen beacon May 4, 2025, 8:43 PM

#

grok3.5 isnt an anon model afaik

late path May 4, 2025, 8:43 PM

#

yeah so 2.5 pro will still be 1st...

balmy mist May 4, 2025, 8:51 PM

#

that dude is a clown tho right?

balmy mist May 4, 2025, 8:52 PM

#

lean whale Hey guys, I have been working on a project for systematic LLM jailbreaking and r...

ill take a look, nice work tho

mossy drum May 4, 2025, 9:11 PM

#

New model in Arena: cobalt-exp-beta-v10

keen fulcrum May 4, 2025, 9:17 PM

#

mossy drum New model in Arena: `cobalt-exp-beta-v10`

New meta model?

ember rapids May 4, 2025, 9:19 PM

#

Elon retweeting fake screenshots from known grifters im dead

keen beacon May 4, 2025, 9:19 PM

#

3.5 is probably worse than they expected/aimed for

#

itll be fine probably

#

thats it

wintry locust May 4, 2025, 9:22 PM

#

it worries me that we will have 2 basically relevant models that are both referred to as 3.5

#

with claude and grok

keen beacon May 4, 2025, 9:43 PM

#

simpleqa needs to be 100% too

#

its asi++

#

so i call it fake

#

im joking but i agree

#

i doubt theyre gonna do this (make it larger)

#

maybe. some benchmarks it might not be possible to reach 100% w/o contamination either. (wrong questions/ground truths, so it'd have to be trained on the questions to get the right/wrong answer)

#

2.5 ultra might beat gpt 4.5 in simpleqa, but we'll see

#

im super interested in it

ocean vortex May 4, 2025, 9:48 PM

#

lmfao

#

4.0 is gork

#

you must be confusing it with a dork

#

easy mistake to make

leaden palm May 4, 2025, 9:52 PM

#

did we get this yet

keen beacon May 4, 2025, 9:54 PM

#

leaden palm did we get this yet

scroll up

leaden palm May 4, 2025, 9:54 PM

#

how far

keen beacon May 4, 2025, 9:54 PM

#

#general message

leaden palm May 4, 2025, 9:55 PM

#

ok well we were generally skeptic from the start

leaden palm May 4, 2025, 9:55 PM

#

leaden palm

i was talking about this one post

keen beacon May 4, 2025, 9:55 PM

#

leaden palm i was talking about this one post

that wave guy might be the same guy

#

idk' lol im gonna go to bed

ocean vortex May 4, 2025, 9:56 PM

#

That's Elon's thing. Promote fabricated things and call everything real you don't like propaganda/fake

torn mantle May 4, 2025, 10:10 PM

#

ocean vortex That's Elon's thing. Promote fabricated things and call everything real you don'...

Guess what hes doing rn?

#

Reposting that fake scs

#

I refuse to believe its better than o3 and gemini 2.5

#

Xai are so loud when it comes to their product, if it was really a big leap they would just straight up say that

leaden palm May 4, 2025, 10:14 PM

#

torn mantle Xai are so loud when it comes to their product, if it was really a big leap they...

openai:

torn mantle May 4, 2025, 10:16 PM

#

leaden palm openai:

Openai is on another level of hyping stuff

small haven May 4, 2025, 10:24 PM

#

torn mantle Reposting that fake scs

i mean atp someone from xai would have told him its fake, but hes still retweeting it mins ago.. lol

misty vault May 4, 2025, 10:24 PM

#

#

hardy pecan May 4, 2025, 10:32 PM

#

polymarket odds dropped biggly as of 20 mins ago, very interesting..

misty vault May 4, 2025, 10:34 PM

#

hardy pecan polymarket odds dropped biggly as of 20 mins ago, very interesting..

lmfao

torn mantle May 4, 2025, 10:37 PM

#

hardy pecan polymarket odds dropped biggly as of 20 mins ago, very interesting..

Cuz of elon's post

hollow ocean May 4, 2025, 10:38 PM

#

Should the best models be P2W or no

small haven May 4, 2025, 10:39 PM

#

market is never wrong!

hollow ocean May 4, 2025, 10:39 PM

#

small haven market is never wrong!

It was wrong a couple times last year

late path May 4, 2025, 10:39 PM

#

just bought google at 66 3h ago💀

#

stupid market

hollow ocean May 4, 2025, 10:40 PM

#

Everyone that did their research didn’t bet on OpenAI to have the best

small haven May 4, 2025, 10:40 PM

#

hollow ocean It was wrong a couple times last year

yea im kiddin obv lol

keen fulcrum May 4, 2025, 10:47 PM

#

leaden palm May 4, 2025, 11:02 PM

#

i set up a market: https://manifold.markets/KTibow/which-day-of-may-will-grok-35-be-re?play=true

Manifold

Which day of May will Grok 3.5 be released?

leaden palm May 4, 2025, 11:48 PM

#

theres daily "quests" and you could buy more

#

just like in games, you can cash in but you can't cash out

#

well you used to be able to donate it to charity and they had a short stint with "sweepcash" but that's gone now

#

unfortunately not

#

lol no

#

not sure what ev means in this context

#

if youre talking about what you get out of it:

#

its just intellectual stimulation

#

you want to be a good predictor

#

you want to make the leaderboards

#

you dont want to go broke

#

you want to see number go up

topaz peak May 5, 2025, 12:05 AM

#

hardy pecan polymarket odds dropped biggly as of 20 mins ago, very interesting..

LMAO, damn people are gullible

torn mantle May 5, 2025, 12:12 AM

#

evpi

#

evm

#

ev

#

evsi

small haven May 5, 2025, 12:28 AM

#

u can't be serious.. 😭

zinc ore May 5, 2025, 12:29 AM

#

It ain't even coming this month

small haven May 5, 2025, 12:31 AM

#

smart money would buy lowkey

small haven May 5, 2025, 12:31 AM

#

leaden palm you want to see number go up

can u make one for o3 pro with expected day

alpine coral May 5, 2025, 12:35 AM

#

keen fulcrum New meta model?

amazon i think. they have been releasing iterations pretty steadily (originally it was cobalt-exp-v1, last week or so it was up to v7, now v10 ig). hasn't been remarkable at all in my experience, but seems like they're in the process of getting something ready to release

misty vault May 5, 2025, 12:37 AM

#

I don't know yet. Will you harm me if I harm you first?

golden ocean May 5, 2025, 12:47 AM

#

spit it out @worthy thunder

worthy thunder May 5, 2025, 12:47 AM

#

Context Arena Update: Added several Mistral LLMs to the MRCR 2needle leaderboard. (https://x.com/DillonUzar/status/1919191240123289920)

AUC @ 128k Results (Mistral Models):

mistral-small-3.1-24b-instruct: 47.7%
mistral-large-2411: 24.1%
ministral-8b: 22.8%
ministral-3b: 13.9%
See all results at: https://contextarena.ai/

Mistral Small 3.1 is currently performing between GPT-4.1 Mini and o3-mini based on the AUC @ 128k metric. For comparison, Gemini 2.5 (the current leader) and GPT-4.1 have been added to the main chart.

NOTE: The results for Mistral Small 3.1 (2503) and Mistral-8B (2503) are dated March 2025, while the others are from October/November 2024. Tests were conducted using endpoints from @OpenRouterAI and @MistralAI (presumably BF16).

More to come: Other model results will be released sporadically over the next few weekdays (including some 4needle and 8needle results), alongside new UI features. I've been pretty focused on analyzing some of the model results recently and figuring out ways to provide better insights and options for grading, so model results have slowed in the process. I will provide a summary once the rollout is complete.

Some UI enhancements have already partially rolled out, such as new information displays on hover and updated hover effects. A new diff viewer has also partially rolled out, with further improvements planned.

Enjoy.

misty vault May 5, 2025, 1:48 AM

#

leaden palm May 5, 2025, 1:52 AM

#

misty vault

are you just posting the archives or do you somehow still have real access

elder rapids May 5, 2025, 2:52 AM

#

hollow ocean Should the best models be P2W or no

nah that's not consistent in front of technological progression

#

in front of any

small haven May 5, 2025, 3:37 AM

#

pro is a scam wtf

#

had like 70~ reqs, now 0

zinc ore May 5, 2025, 3:40 AM

#

https://x.com/scaling01/status/1919217718420508782

Lisan al Gaib (@scaling01) on X

The Ultimate LLM Meta-Leaderboard averaged across the 28 best benchmarks

Gemini 2.5 Pro > o3 > Sonnet 3.7 Thinking

leaden palm May 5, 2025, 3:54 AM

#

small haven pro is a scam wtf

looks like a daily limit on top of the monthly one

small haven May 5, 2025, 3:57 AM

#

leaden palm looks like a daily limit on top of the monthly one

yea looks like it, it just becomes stricter as time passes, not better

balmy mist May 5, 2025, 3:57 AM

#

small haven pro is a scam wtf

lol

small haven May 5, 2025, 3:57 AM

#

they are severely gpu constrained

balmy mist May 5, 2025, 3:57 AM

#

you pay 200?

small haven May 5, 2025, 3:58 AM

#

balmy mist you pay 200?

ya not wurf it

trim vale May 5, 2025, 5:53 AM

#

Dum question

#

So theres the qwen3 235b a22b gguf model right

#

Which means there are 235b parameters in total but only 22b parameters are activated at once

#

Can i run it with the same amount of ram as a regular 22b gguf model?

#

Or do i still need enough ram to fit an entire 235b model inside?

#

Or something in between 🤔

worthy thunder May 5, 2025, 6:09 AM

#

You generally still need enough RAM to fit the entire 235B model (like in this model)

oblique flint May 5, 2025, 6:14 AM

#

afaik MoE doesn't activate experts based on what prompt you gave, it's done on a per token basis so a single prompt can involve a lot of different experts. So if you can't fit the model into ram fully, it'll have to load from your disk to load in the right experts, that will be much much slower than loading from ram
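A rough back-of-the-envelope sketch of the numbers behind that answer. The 235B total and 22B active figures come from the model name; the 4.5 bits per weight is an assumed average for a 4-bit-style GGUF quant (real quants vary), and this counts weights only, ignoring KV cache and runtime overhead.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

TOTAL_B = 235   # total parameters in billions (from the model name)
ACTIVE_B = 22   # parameters active per token in billions (from the model name)
BITS = 4.5      # ASSUMPTION: average bits per weight for a 4-bit-style quant

total_gb = weight_memory_gb(TOTAL_B, BITS)    # what has to be resident to avoid disk reads
active_gb = weight_memory_gb(ACTIVE_B, BITS)  # what is actually touched per token

print(f"all experts resident: ~{total_gb:.0f} GB")
print(f"touched per token:    ~{active_gb:.0f} GB")
```

The smaller figure drives compute per token (so speed is closer to a 22B model), but since the chosen experts change from token to token, the larger figure is what has to fit in RAM.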

trim vale May 5, 2025, 6:17 AM

#

Gotchu

calm sequoia May 5, 2025, 8:17 AM

#

Some context for where the Grok 3.5 stands

#

Seems like the reasoning is SOTA, but the base model is not as good as OG GPT 4.5

#

The next generation of models (GPT 5, Gemini 3?) will saturate most benchmarks 😄

torn mantle May 5, 2025, 8:25 AM

#

calm sequoia Some context for where the Grok 3.5 stands

those are fake benchmarks

calm sequoia May 5, 2025, 8:26 AM

#

Your opinion or confirmed?

torn mantle May 5, 2025, 8:26 AM

#

kinda surprising that its not added yet on lmarena

torn mantle May 5, 2025, 8:26 AM

#

calm sequoia Your opinion or confirmed?

yea

#

by the guy who shared it

calm sequoia May 5, 2025, 8:26 AM

#

Then why musk retweeted

torn mantle May 5, 2025, 8:27 AM

#

calm sequoia Then why musk retweeted

cuz hes a big dumbo

#

he deleted it btw

calm sequoia May 5, 2025, 8:28 AM

#

Haha if he really retweeted fake benchmarks for his product he's at least braindead

#

Missing Elon of 2017 so much

ocean vortex May 5, 2025, 9:21 AM

#

calm sequoia Haha if he really retweeted fake benchmarks for his product he's at least braind...

I feel like that's how he came into power (politics) in the first place. By promoting the party he's aligned with, for the most part with misinformation on his social media platform. I don't think he gets the concept of verifying the info, or he just deliberately chooses not to do it, way too often and in a one-sided way.

calm sequoia May 5, 2025, 10:34 AM

#

What's with this space. Avengers of misinformation

torn mantle May 5, 2025, 10:42 AM

#

calm sequoia What's with this space. Avengers of misinformation

typical unemployed person living off twitter engagements

calm sequoia May 5, 2025, 10:42 AM

#

Elon (Adrian Dittman) seems quite employed 😄

hardy pecan May 5, 2025, 11:29 AM

#

Maybe wait for a credible source first lol, literally nothing confirmed

teal mantle May 5, 2025, 11:52 AM

#

calm sequoia What's with this space. Avengers of misinformation

Holy coalmine of ian miles cheong

teal mantle May 5, 2025, 12:09 PM

#

He is one of the last people, like ever, to be informed about AI

balmy mist May 5, 2025, 12:24 PM

#

will we ever get o3 pro?

torn mantle May 5, 2025, 12:41 PM

#

it is

kind cloud May 5, 2025, 12:51 PM

#

Does anyone know how to continue chatting with the model after voting and finding out its name? I know how, but is that method well-known?

golden ocean May 5, 2025, 12:52 PM

#

yes

kind cloud May 5, 2025, 12:55 PM

#

golden ocean May 5, 2025, 12:59 PM

#

lmaoo

vivid oyster May 5, 2025, 1:02 PM

#

When tf is grok in arena

keen beacon May 5, 2025, 1:03 PM

#

r2 will be the cheapest there

#

might not beat the others

#

i didnt test it much/at all. (my impression of it is primarily through the comparison images, im largely uninterested in webdev)

#

i largely dont

#

idk i dont really code using ai yet so idk about best practices

#

yea

ocean vortex May 5, 2025, 1:08 PM

#

probably making it more verbose. It's relatively concise. This could indirectly prolong thinking as well

keen beacon May 5, 2025, 1:08 PM

#

technically no its not on the dot

#

1,048,576

ocean vortex May 5, 2025, 1:09 PM

#

I mean it is relatively speaking, esp if you consider the entire output thinking included and compare that to medium/high

#

it's a thinking model, it needs to output a ton if you are after maximizing every last bit of performance 👀

keen beacon May 5, 2025, 1:11 PM

#

ocean vortex I mean it is relatively speaking, esp if you consider the entire output thinking...

o3 medium does less, at least based on the thinking tokens used to run the artificial analysis gamut

ocean vortex May 5, 2025, 1:11 PM

#

keen beacon o3 medium does less, at least based on the thinking tokens used to run the artif...

link? The test I did myself earlier put it well below medium

keen beacon May 5, 2025, 1:11 PM

#

#

i was surprised about it myself

#

o3 costs the most though

hollow ocean May 5, 2025, 1:12 PM

#

https://tenor.com/view/we're-so-back-we-are-so-back-we-are-back-elden-ring-we-are-back-elden-ring-victory-gif-18104847912830278299


ocean vortex May 5, 2025, 1:13 PM

#

keen beacon

I think this could be measuring smth else. Otherwise this makes no sense lol

keen beacon May 5, 2025, 1:13 PM

#

ocean vortex I think this could be measuring smth else. Otherwise this makes no sense lol

?

#

oh

keen beacon May 5, 2025, 1:13 PM

#

ocean vortex I think this could be measuring smth else. Otherwise this makes no sense lol

it isnt

#

#

it makes sense if u include the pricing, o3 is still the most expensive model to run there

#

even though the tokens outputted are less

ocean vortex May 5, 2025, 1:16 PM

#

keen beacon

it's weird. I'm sure there's some kind of explanation why they came up with this. But we know for a fact IRL R1 outputs less thinking tokens than O3, not even talking about 2.5...

keen beacon May 5, 2025, 1:17 PM

#

ocean vortex it's weird. I'm sure there's some kind of explanation why they came up with this...

? where did u come up with that

#

the r1 thing

#

if you look at the cost to run it makes sense

alpine coral May 5, 2025, 1:18 PM

#

ocean vortex it's weird. I'm sure there's some kind of explanation why they came up with this...

eh?

ocean vortex May 5, 2025, 1:19 PM

#

keen beacon ? where did u come with that

there are ton of providers that cap it at 8k output. Also from my experience it rarely goes beyond that. While for o3 it's not unusual at all to see 20k+

keen beacon May 5, 2025, 1:20 PM

#

ocean vortex there are ton of providers that cap it at 8k output. Also from my experience it ...

when they cap it at 8k its usually the output response tokens not including cot output tokens

alpine coral May 5, 2025, 1:20 PM

#

ocean vortex there are ton of providers that cap it at 8k output. Also from my experience it ...

it's literally how many tokens are used to run artificialanalysis' benchmark. it's not based on or meant to represent regular usage or specific usage like coding

also note that it's 2.5 flash at the far left (2.5 pro also uses way more tokens than o series models, but not as many)

ocean vortex May 5, 2025, 1:21 PM

#

keen beacon when they cap it at 8k its usually the output response tokens not including cot ...

it's including the thinking. Saw it getting cut when that was capped at 4k (sambanova?)

keen beacon May 5, 2025, 1:21 PM

#

keen beacon when they cap it at 8k its usually the output response tokens not including cot ...

this is how deepseek does it

keen beacon May 5, 2025, 1:21 PM

#

ocean vortex it's including the thinking. Saw it getting cut when that was capped at 4k (samb...

thats different

keen beacon May 5, 2025, 1:21 PM

#

ocean vortex it's including the thinking. Saw it getting cut when that was capped at 4k (samb...

and its no longer the case

alpine coral May 5, 2025, 1:22 PM

#

ocean vortex it's including the thinking. Saw it getting cut when that was capped at 4k (samb...

yeah tbf it gets capped / cut-off on OR too sometimes - but i think that's on OR's side rather than deepseek's

keen beacon May 5, 2025, 1:22 PM

#

its more but deepseek host it only at 64k i think

alpine coral May 5, 2025, 1:22 PM

#

they have a reasoning effort param

#

which seems poorly implemented (trying to do it across providers)

#

i dunno

ocean vortex May 5, 2025, 1:23 PM

#

keen beacon this is how deepseek does it

that's how Deepseek does it, but almost no one else hosting R1 does, I believe

#

it's capped for the entire thing

alpine coral May 5, 2025, 1:23 PM

#

nah not a prompt.. not sure tbh

#

eh i dunno but it would be hilarious if that was their implementation

keen beacon May 5, 2025, 1:23 PM

#

yeah they usually just pass through

#

i dont think or does anything

#

if they do its not intended

alpine coral May 5, 2025, 1:24 PM

#

yeah i know what you're saying

keen beacon May 5, 2025, 1:24 PM

#

they are proxying all of the requests they can mess with it

#

i highly, highly doubt they are

alpine coral May 5, 2025, 1:25 PM

#

i think they're assuming google etc will have actual reasoning budget hyper parameters at some point

#

i dunno what it does in the meantime

ocean vortex May 5, 2025, 1:26 PM

#

@keen beacon maybe OpenAI are better at overfitting 💀

#

but like I said it's weird, will do some more testing

blazing rune May 5, 2025, 1:26 PM

#

How would they get the data if you are running it locally?

keen beacon May 5, 2025, 1:27 PM

#

5g covid something like that (/s)

blazing rune May 5, 2025, 1:27 PM

#

Oh, it has been like that for years already

#

Llama 4 was trained on Facebook data iirc

keen beacon May 5, 2025, 1:28 PM

#

whatsapp is encrypted right? i assume if u communicate with meta bots/etc then ur data is collected

#

for sharing war plans

ocean vortex May 5, 2025, 1:34 PM

#

keen beacon and its no longer the case

they are capping entire output just like before, but they did bump that limit from 4k to max configurable 8k now

#

o4-mini-high and o3 would be essentially unusable even with 8k

keen beacon May 5, 2025, 1:34 PM

#

ocean vortex they are capping entire output just like before, but they did bump that limit fr...

interface limit

ocean vortex May 5, 2025, 1:35 PM

#

my point is that this number includes thinking

#

and also that there are still a ton of providers including azure, with low limit where r1 is still very much usable

#

now compare that to this:

#

R1 would never generate anywhere near that

keen beacon May 5, 2025, 1:37 PM

#

ocean vortex R1 would never generate anywhere near that

yeh because its capped at 32k reasoning tokens, at least on deepseek api

ocean vortex May 5, 2025, 1:38 PM

#

keen beacon yeh because its capped at 32k reasoning tokens, at least on deepseek api

I have never seen it reaching that cap personally

keen beacon May 5, 2025, 1:38 PM

#

and that task requires a lot of reasoning tokens. deepseek usually does 15k+ on some of my problems

#

sometimes more

ocean vortex May 5, 2025, 1:38 PM

#

keen beacon and that task requires a lot of reasoning tokens. deepseek usually does 15k+ on ...

it has no problem providing the answer with sambanova 8k cap

keen beacon May 5, 2025, 1:38 PM

#

ocean vortex it has no problem providing the answer with sambanova 8k cap

is it right tho?

#

o4 mini tries harder

#

even if it is wrong

ocean vortex May 5, 2025, 1:39 PM

#

who cares if it's right. We are talking which one outputs more thinking LOL

keen beacon May 5, 2025, 1:39 PM

#

ocean vortex who cares if it's right. We are talking which one outputs more thinking LOL

ok its nonsensical then

#

ur argument

ocean vortex May 5, 2025, 1:39 PM

#

wdym

#

My argument is that o4-mini and o3 output more thinking tokens, that's it

keen beacon May 5, 2025, 1:40 PM

#

u have to take into account whether the answer is right or not or its arbitrary

ocean vortex May 5, 2025, 1:40 PM

#

nothing to do with which performs better

#

on any given task

ocean vortex May 5, 2025, 1:41 PM

#

keen beacon u have to take into account whether the answer is right or not or its arbitrary

not if you can't get R1 get anywhere close to that no matter the prompt

#

Also, this is not claude. Model has no clue about thinking budget, it will only get cut off. If it doesn't then it used all the tokens it could want

keen beacon May 5, 2025, 1:44 PM

#

is it releasing?

#

talking requests? 🙏

#

what

ocean vortex May 5, 2025, 1:47 PM

#

So like... Try sambanova interface and come up with a prompt for R1 that would use more for the thinking than 8k cap. I'm sure you will find this is next to impossible lol

@keen beacon

#

while for both o4-mini and o3, this is super easy

keen beacon May 5, 2025, 1:50 PM

#

ocean vortex So like... Try sambanova interface and come up with a prompt for R1 that would u...

? its so easy LMAO

#

There are 5 houses, numbered 1 to 5 from left to right, as seen from across the street. Each house is occupied by a different person. Each house has a unique attribute for each of the following characteristics:

Each person has a unique name: Eric, Alice, Peter, Bob, Arnold
Each person has a unique type of pet: hamster, fish, cat, dog, bird
Each person has an occupation: doctor, engineer, artist, lawyer, teacher
Each person has a favorite color: green, blue, yellow, red, white
The people keep unique animals: bird, cat, horse, fish, dog

Clues:

The person who owns a dog is Arnold.
The bird keeper is in the fourth house.
The person who keeps a pet bird is directly left of the dog owner.
The person who loves white is somewhere to the left of the person who is a lawyer.
The person who loves yellow is directly left of the person whose favorite color is green.
The cat lover is Bob.
The cat lover is somewhere to the left of Eric.
The person who keeps horses is in the fifth house.
The person who is a lawyer is directly left of the person who is a teacher.
The person who is a doctor is in the first house.
Alice is the person who loves yellow.
The person who loves blue is directly left of the person with an aquarium of fish.
The person who loves yellow is in the first house.
The person with a pet hamster is the person who is an artist.
Eric is the dog owner.

ocean vortex May 5, 2025, 2:15 PM

#

keen beacon There are 5 houses, numbered 1 to 5 from left to right, as seen from across the ...

ok I stand corrected. But all attempts I did of this are still below 16k. So my point stands. I just showed 50k with o4-mini-high above without even trying 😄

keen beacon May 5, 2025, 2:16 PM

#

ocean vortex ok I stand corrected. But all attempts I did of this are still below 16k. So my ...

ok how many tokens did o4 mini take though lol (on that problem)

#

also its a single prompt. look at artificial analysis where they ran several benchmarks. plus with your prompt it didnt even get the answer right with less tokens, so its arbitrary in that instance too. might as well call og gpt 4 so much more efficient if u just look at that. (it does not make sense if you do not consider correctness)

#

how youre contesting running several standardized benchmarks and seeing the amount of tokens used just because of that 1 prompt is ridiculous

#

and the comparison/reasoning for that 1 prompt does not make sense

ocean vortex May 5, 2025, 2:18 PM

#

keen beacon ok how many tokens did o4 mini take though lol (on that problem)

less, but this is task specific... I focus on the maximum instead which is way more accurate I believe. If you take a subset of tasks a certain model could generate more purely because it's a different model which is probably how artificial analysis arrived there. What's more important is how far can it go

#

when R1 generates more, it's not by 40k more lol

#

and with OpenAI it is

keen beacon May 5, 2025, 2:20 PM

#

ocean vortex less, but this is task specific... I focuse on the maximum instead which is way ...

ok but u have no data to prove that u literally have a single anomalous prompt. then ur justifying it with specious reasoning

ocean vortex May 5, 2025, 2:21 PM

#

keen beacon ok but u have no data to prove that u literally have a single anomalous prompt. ...

wdym. There are plenty of prompts where OpenAI will generate beyond 16k, you don't even need to be trying...

soft kernel May 5, 2025, 2:26 PM

#

I know this sounds stupid but,is 4.5 still available on the web?(battle part,not direct chat)
I think 1 month ago it was available

keen beacon May 5, 2025, 2:28 PM

#

ocean vortex wdym. There are plenty of prompts where OpenAI will generate beyond 16k, you don...

those questions are probably a ~~'biased distribution'~~ biased sample unless its representative. those standardized benchmarks that artificialanalysis runs are way more representative for actual usage than potentially specific cherry picked problems that cause specific model outliers. and does r1 even get the questions right? (if it uses less tokens but gets the answer wrong it is also meaningless)

ocean vortex May 5, 2025, 2:35 PM

#

keen beacon those questions are probably a ~~'biased distribution'~~ biased sample unless it...

I don't think it's meaningless since it could get it wrong specifically because it's reluctant to output more.

I think the truth here is somewhere in the middle to be completely honest. Maybe on average it's that - OpenAI models are fairly mindful of the usage and don't output a lot at all times. But when the model sees a task it can't solve without outputting a ton.. O3/O4 will do it and R1 or 2.5 most likely not, saw Gemini taking shortcuts to arrive at the answer faster more than once. And as for R1, once again it seemingly can't even get past 16k

#

So maybe that average metric does not tell us all that much. You wouldn't want a model that outputs more when that's not needed and less when it is needed. A bit artificially bumping up the average (relative to what it is capable of performance wise with test-time compute)

#

Esp since we have quite a few of the models wasting tokens on 2nd guessing themselves rather than doing the task efficiently lol

#

so yeah... if we took like 100 test prompts and model1 would do say 50 attempts around 90k generated while the other 50 around 10k;
while model2 would generate consistently around 55k for all of them leading up to slightly higher average...

I would still say model1 is much less test-time compute limited and better optimised tbh

#

it would obviously be something way more random, but for the sake of simplicity assuming easy to reference numbers
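
The made-up numbers above can be checked with a quick sketch, assuming the two hypothetical models behave exactly as described (the 80k threshold and the `summarize` helper are arbitrary choices for illustration, not a real metric from any benchmark):

```python
from statistics import mean

# Hypothetical per-prompt reasoning token counts from the example above.
model1 = [90_000] * 50 + [10_000] * 50   # bimodal: long only when needed
model2 = [55_000] * 100                  # flat: same length every time

def summarize(tokens, threshold=80_000):
    """Return the average, the peak, and the share of runs at or above the threshold."""
    return {
        "mean": mean(tokens),
        "peak": max(tokens),
        "share_at_or_above_threshold": sum(t >= threshold for t in tokens) / len(tokens),
    }

for name, tokens in (("model1", model1), ("model2", model2)):
    print(name, summarize(tokens))
```

model2 comes out with the higher average (55k vs 50k) even though model1 is the one that reaches 90k on half of the prompts, so the average alone hides how far each model can go.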

ocean vortex May 5, 2025, 3:07 PM

#

looking at 2.5 some more and digging into its thinking, it does have a weird habit of rewriting the thinking into the final response, even the parts where it self-corrects, pretending it made the same mistake all over again... So you do have a healthy amount of generated data that is completely pointless

#

but if you made it even more verbose, that would very likely help with the actual useful problem solving part as well

patent aspen May 5, 2025, 3:14 PM

#

keen beacon There are 5 houses, numbered 1 to 5 from left to right, as seen from across the ...

Back in the day, the Prolog programming language would have been a great tool to solve this kind of problem
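
For a rough idea of what a solver for this kind of puzzle does, here is a brute-force sketch in Python rather than Prolog. It uses a made-up 3-house puzzle, not the one quoted above, and all names and clues in it are hypothetical:

```python
from itertools import permutations

NAMES = ("Alice", "Bob", "Eric")
PETS = ("cat", "dog", "fish")

def solve():
    """Try every assignment of names and pets to houses 1..3 and keep those matching all clues."""
    solutions = []
    for names in permutations(NAMES):
        for pets in permutations(PETS):
            house = {n: i for i, n in enumerate(names, start=1)}
            pet_house = {p: i for i, p in enumerate(pets, start=1)}
            if (
                pet_house["cat"] == house["Bob"]         # Bob owns the cat
                and pet_house["cat"] < house["Eric"]     # cat owner is left of Eric
                and pet_house["dog"] == 3                # dog owner lives in house 3
                and pet_house["dog"] == house["Eric"]    # Eric owns the dog
                and house["Alice"] != 1                  # Alice is not in house 1
            ):
                solutions.append(list(zip(names, pets)))
    return solutions

print(solve())
```

Prolog expresses the same clues declaratively and lets the engine do the search; the loop here just enumerates all 36 combinations by hand.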

sonic tendon May 5, 2025, 3:39 PM

#

oo, source?

#

ah

keen beacon May 5, 2025, 3:48 PM

#

i wonder if they'll release on api or not

keen beacon May 5, 2025, 3:48 PM

#

keen beacon i wonder if they'll release on api or not

imagine if they release the scores for grok 3.5 full reasoning but never release it lol like grok 3 full reasoning

#

i have a feeling the reason they didn't release grok 3 full reasoning was because it sucked

#

hopefully grok 3.5's reasoning implementation does not

#

so they wouldn't have a reason to

#

i wonder if they stopped using qwq preview traces 🤣 (very suspect)

#

probably didn't help things ☠️

keen beacon May 5, 2025, 3:49 PM

#

keen beacon probably didn't help things ☠️

maybe theyre using r1 for cold start now 🤣

#

lmaoo

unborn ocean May 5, 2025, 4:03 PM

#

keen beacon i wonder if they stopped using qwq preview traces 🤣 (very suspect)

did they? totally missed it

#

never spend much time on xAI

keen beacon May 5, 2025, 4:08 PM

#

unborn ocean did they? totally missed it

yes. take it with a grain of salt tho (obviously speculation) but i am reasonably confident in it. not gonna explain it again tho

terse ermine May 5, 2025, 4:39 PM

#

Hi all, I just joined so sorry if this is the wrong channel to ask this question. But I was curious if anyone has a plot of ELO vs model release date. Thank you so much!

balmy mist May 5, 2025, 4:44 PM

#

https://x.com/mark_k/status/1919432963235733667

Mark Kretschmann (@mark_k) on X

Scoop: Grok 3.5 probably coming today.

#

griter?

balmy mist May 5, 2025, 4:44 PM

#

terse ermine Hi all, I just joined so sorry if this is the wrong channel to ask this question...

someone does

keen beacon May 5, 2025, 4:52 PM

#

its vaporware

hollow ocean May 5, 2025, 4:56 PM

#

balmy mist https://x.com/mark_k/status/1919432963235733667

It’s confirmed

leaden meteor May 5, 2025, 5:00 PM

#

How come they are releasing before letting it loose anonymously on arena? I am guessing it's not meant for arena questions...

keen beacon May 5, 2025, 5:01 PM

#

leaden meteor How come they are releasing before letting it loose anonymously on arena? I am ...

elon probably wouldve wanted to flex the arena score if it was better than 2.5 pro. its probably meh/fine, and the xai guys didnt think they should put it on as a pre-release (if it cant beat 2.5 pro)

#

maybe grok 3.5 will be great though, idk. even if it is good, its gonna be unusable for me because of the X peddling 😭

balmy mist May 5, 2025, 5:03 PM

#

hollow ocean It’s confirmed

what time?

hollow ocean May 5, 2025, 5:05 PM

#

balmy mist what time?

Evening

balmy mist May 5, 2025, 5:06 PM

#

hollow ocean Evening

which timezone?

sage raptor May 5, 2025, 5:06 PM

#

Pt probably

hollow ocean May 5, 2025, 5:08 PM

#

balmy mist which timezone?

PT

balmy mist May 5, 2025, 5:08 PM

#

sage raptor Pt probably

ughhh

#

oh yeah they never released that

torn mantle May 5, 2025, 5:11 PM

#

balmy mist https://x.com/mark_k/status/1919432963235733667

maybe next monday

balmy mist May 5, 2025, 5:11 PM

#

torn mantle maybe next monday

nahh it has to be this week

torn mantle May 5, 2025, 5:11 PM

#

it was based on this post https://x.com/veggie_eric/status/1919420805987082535

Eric Jiang (@veggie_eric) on X

It's gonna be an awesome week I can feel it. Have a great Monday!

#

he seems excited for the release

#

same

#

i had access for 2h

#

then it glitched out

balmy mist May 5, 2025, 5:12 PM

#

torn mantle i had access for 2h

how was it?

torn mantle May 5, 2025, 5:12 PM

#

yes

wintry tinsel May 5, 2025, 5:12 PM

#

finally we will see Gemini dethroned

torn mantle May 5, 2025, 5:13 PM

#

same

#

it was good

keen beacon May 5, 2025, 5:13 PM

#

torn mantle it was good

2.5 pro level?

balmy mist May 5, 2025, 5:13 PM

#

torn mantle it was good

like better than 2.5?

keen beacon May 5, 2025, 5:13 PM

#

i doubt 2.5 pro will be dethroned tbh

wintry locust May 5, 2025, 5:13 PM

#

torn mantle same

I have a feeling my leg is being pulled here

#

did you really...

torn mantle May 5, 2025, 5:14 PM

#

keen beacon 2.5 pro level?

not sure, needs more testing

torn mantle May 5, 2025, 5:14 PM

#

wintry locust I have a feeling my leg is being pulled here

well someone lied, lets follow that lie

wintry tinsel May 5, 2025, 5:14 PM

#

keen beacon i doubt 2.5 pro will be dethroned tbh

it may cost more but it will be better

torn mantle May 5, 2025, 5:14 PM

#

we should start asking Craig

keen beacon May 5, 2025, 5:14 PM

#

ahahaha

torn mantle May 5, 2025, 5:14 PM

#

sure

wintry tinsel May 5, 2025, 5:14 PM

#

unlike O series model Grok models are versatile like Claude

balmy mist May 5, 2025, 5:14 PM

#

wow

keen beacon May 5, 2025, 5:14 PM

#

isnt grok 3.5 asi?

torn mantle May 5, 2025, 5:15 PM

#

graig may as well be https://x.com/iruletheworldmo

🍓🍓🍓 (@iruletheworldmo) on X

grok 3.5 is asi

#

yes

keen beacon May 5, 2025, 5:15 PM

#

they skipped agi like o1 -> o3

balmy mist May 5, 2025, 5:15 PM

#

torn mantle graig may as well be https://x.com/iruletheworldmo

it better be asi or i am unfollowing that guy finally lol

torn mantle May 5, 2025, 5:15 PM

#

balmy mist it better be asi or i am unfollowing that guy finally lol

wait what

#

are you following him?

hollow ocean May 5, 2025, 5:16 PM

#

torn mantle graig may as well be https://x.com/iruletheworldmo

Strawberry guy is a middle schooler

balmy mist May 5, 2025, 5:16 PM

#

torn mantle are you following him?

yeah, for entertainment

wintry tinsel May 5, 2025, 5:16 PM

#

balmy mist it better be asi or i am unfollowing that guy finally lol

bruh what

torn mantle May 5, 2025, 5:16 PM

#

balmy mist yeah, for entertainment

nah this is sus

wintry tinsel May 5, 2025, 5:16 PM

#

Grok 3.5 Is new Elo king fr

balmy mist May 5, 2025, 5:16 PM

#

torn mantle nah this is sus

wym?

hollow ocean May 5, 2025, 5:17 PM

#

My friend knows 🍓 guy irl he’s in middle school

balmy mist May 5, 2025, 5:17 PM

#

i follow a lot of ppl, and he was right about o4 in april

wintry tinsel May 5, 2025, 5:17 PM

#

Do you see any Ultra releasing today? No than stop tickling my impatient nuts

torn mantle May 5, 2025, 5:18 PM

#

wintry tinsel Do you see any Ultra releasing today? No than stop tickling my impatient nuts

he saw it in the UI

#

added for 1 sec

wintry tinsel May 5, 2025, 5:18 PM

#

Alright '

balmy mist May 5, 2025, 5:18 PM

#

hollow ocean My friend knows 🍓 guy irl he’s in middle school

fr?

wintry tinsel May 5, 2025, 5:18 PM

#

I hope it ultra sucks cuz gemini is not based

torn mantle May 5, 2025, 5:18 PM

#

@deep adder how was ultra?

#

was it good?

misty vault May 5, 2025, 5:18 PM

#

wintry tinsel Do you see any Ultra releasing today? No than stop tickling my impatient nuts

can i stomp on your nuts instead?

hollow ocean May 5, 2025, 5:19 PM

#

balmy mist fr?

Yeah his name is Jonathan

misty vault May 5, 2025, 5:19 PM

#

torn mantle he saw it in the UI

source: schizophrenia

wintry tinsel May 5, 2025, 5:19 PM

#

misty vault can i stomp on your nuts instead?

if it is for AGI ok

torn mantle May 5, 2025, 5:19 PM

#

balmy mist wym?

why would you follow a grifter, a liar, an attention seeker?

#

he really doesnt have any insider infos

misty vault May 5, 2025, 5:19 PM

#

wintry tinsel if it is for AGI ok

sydney gpt4

wintry tinsel May 5, 2025, 5:19 PM

#

torn mantle why would you follow a grifter, a liar, an attention seeker?

Because he has seen AGI fr

keen beacon May 5, 2025, 5:20 PM

#

not just AGI, he has seen ASI

#

i think u meant behemoth

torn mantle May 5, 2025, 5:20 PM

#

wintry tinsel Because he has seen AGI fr

money grabbing AGI

#

keen beacon May 5, 2025, 5:21 PM

#

grok 3 already has X ads

hollow ocean May 5, 2025, 5:21 PM

#

🍓 guy is from Alaska and he’s in 7th grade

torn mantle May 5, 2025, 5:22 PM

#

chatgpt is already surpassing x on daily users

#

let elon focus on his gork new bot

misty vault May 5, 2025, 5:24 PM

#

      /\_____/\
     /  o   o  \
    ( ==  ^  == )
     )         (
    (           )
   ( (  )   (  ) )
  (__(__)___(__)__)

balmy mist May 5, 2025, 5:25 PM

#

torn mantle why would you follow a grifter, a liar, an attention seeker?

i follow elon lol and a whole bunch of other ppl like that

torn mantle May 5, 2025, 5:25 PM

#

balmy mist i follow elon lol and a whole bunch of other ppl like that

i have them blocked tbh

balmy mist May 5, 2025, 5:25 PM

#

you blocked elon?

torn mantle May 5, 2025, 5:25 PM

#

cant stand them

#

yea

balmy mist May 5, 2025, 5:25 PM

#

you can do that

torn mantle May 5, 2025, 5:25 PM

#

i blocked him

keen beacon May 5, 2025, 5:26 PM

#

x is a cesspool

#

he didnt even mean it

balmy mist May 5, 2025, 5:26 PM

#

i didnt even know you can block ppl on X

keen beacon May 5, 2025, 5:27 PM

#

probably his dreams of an everything site or smthing

#

why hes calling everything x

hollow ocean May 5, 2025, 5:27 PM

#

Yeah

sonic tendon May 5, 2025, 5:30 PM

#

hollow ocean My friend knows 🍓 guy irl he’s in middle school

strwb? or the other one

hollow ocean May 5, 2025, 5:31 PM

#

sonic tendon strwb? or the other one

Strawberry

#

His real name is Jonathan

#

My friend has a picture of him logged into his account on his laptop

#

I’ll ask him for it

#

Skinny guy from middle school

#

Yeah

#

No reply yet he’s busy

soft kernel May 5, 2025, 5:42 PM

#

hollow ocean Yeah

What the strawberry guy from twitter?

hollow ocean May 5, 2025, 5:43 PM

#

soft kernel What the strawberry guy from twitter?

“iruletheworldmo”

soft kernel May 5, 2025, 5:43 PM

#

Still stuck in 2000s

cedar tide May 5, 2025, 6:02 PM

#

Qwen 3 on webdev is it also the thinking version?

#

for those who want to know what the 4 new leaderboard models are
Qwen 3 235b
Gemma 3 12b
Gemma 3 4b
Olmo 2 32b

#

We waiting for nova premier on the arena (to make fun of him a little 😅)

balmy mist May 5, 2025, 6:15 PM

#

https://x.com/gork so this is....

gork (@gork) on X

just gorkin' it

small haven May 5, 2025, 6:19 PM

#

day 19 with no o3 pro

ocean vortex May 5, 2025, 6:25 PM

#

ocean vortex so yeah... if we took like 100 test prompts and model1 would do say 50 attempts ...

in other words, peak reasoning token counts and frequency of those counts is way more important than the total average for determining how much test-time compute a model can use.

tall summit May 5, 2025, 6:37 PM

#

balmy mist https://x.com/gork so this is....

i like this guy

#

how good is qwen 3 at translation?

keen beacon May 5, 2025, 6:44 PM

#

It's asi ofc

cedar tide May 5, 2025, 6:55 PM

#

You can try a prompt for me ?

torn mantle May 5, 2025, 6:59 PM

#

cedar tide You can try a prompt for me ?

ofc he can

small haven May 5, 2025, 6:59 PM

#

im still waiting on a ss of o3 pro on o1 pro api

cedar tide May 5, 2025, 6:59 PM

#

torn mantle ofc he can

Are you kidding me ?

torn mantle May 5, 2025, 7:00 PM

#

cedar tide Are you kidding me ?

no?

teal mantle May 5, 2025, 7:03 PM

#

is deepsearch or deeperresearch powered by grok 3 or grok 3 mini?

cedar tide May 5, 2025, 7:03 PM

#

torn mantle it was good

Is think ?

cedar tide May 5, 2025, 7:04 PM

#

cedar tide You can try a prompt for me ?

Ah, I just saw that he no longer has access.

calm sequoia May 5, 2025, 7:04 PM

#

What's up with this variance 🤔

gilded drift May 5, 2025, 7:05 PM

#

teal mantle is deepsearch or deeperresearch powered by grok 3 or grok 3 mini?

Grok 3 , the think option is grok 3 mini . If i remember well

torn mantle May 5, 2025, 7:05 PM

#

cedar tide Ah, I just saw that he no longer has access.

he never had access

#

he should make an account like iruletheworldmo

keen beacon May 5, 2025, 7:05 PM

#

calm sequoia What's up with this variance 🤔

qwen3 does extremely well in distribution

small haven May 5, 2025, 7:05 PM

#

craig is iruletheworldmo confirmed

calm sequoia May 5, 2025, 7:05 PM

#

It's probably first time model is so good at math and so bad at style control

cedar tide May 5, 2025, 7:06 PM

#

torn mantle he never had access

And you too ?

torn mantle May 5, 2025, 7:06 PM

#

cedar tide And you too ?

no one has access

torn mantle May 5, 2025, 7:06 PM

#

small haven craig is iruletheworldmo confirmed

actually thats plausible

#

he talks like him too

wintry locust May 5, 2025, 7:11 PM

#

calm sequoia What's up with this variance 🤔

shows it is strength

ocean vortex May 5, 2025, 7:12 PM

#

tall summit how good is qwen 3 at translation?

with less common languages 30b and 32b seem quite bad. 235b is decent

ocean vortex May 5, 2025, 7:18 PM

#

calm sequoia What's up with this variance 🤔

it can surprise you sometimes, but it's way too inconsistent overall and can fail easy prompts you wouldn't expect. That's the consensus on it I would say.

teal mantle May 5, 2025, 7:21 PM

#

gilded drift Grok 3 , the think option is grok 3 mini . If i remember well

quantity wise grok's deep research beats chatgpt
quality wise the other way around

ocean vortex May 5, 2025, 7:21 PM

#

teal mantle quantity wise grok's deep research beats chatgpt quality wise the other way aro...

quantity is irrelevant though if it can't deliver lol

teal mantle May 5, 2025, 7:21 PM

#

if it is full grok 3 then it is cheap for them to do deep research on grok 3 as opposed to OpenAI using o3

ocean vortex May 5, 2025, 7:23 PM

#

teal mantle if it is full grok 3 then it is cheap for them to do deep research on grok 3 as ...

cost wise I doubt it's cheaper for them. It's a bigger model but potentially less test-time compute. I would say comparable if not more expensive...

#

OpenAI's deepresearch could be something like o3-high except with even longer outputs finetuned for search specifically

teal mantle May 5, 2025, 7:25 PM

#

ocean vortex cost wise I doubt it's cheaper for them. It's a bigger model but potentially les...

then I wonder how can grok offer much more limit
then it means the test time compute is defo more on o3

ocean vortex May 5, 2025, 7:26 PM

#

teal mantle then I wonder how can grok offer much more limit then it means the test time com...

grok API pricing is super competitive too. While OpenAI is in the position now where they can kinda just charge more and get away with it... 🤷‍♂️

tall summit May 5, 2025, 7:27 PM

#

ocean vortex with less common languages 30b and 32b seems quite bad. 235b is decent

which languages did you test?

ocean vortex May 5, 2025, 7:28 PM

#

Lithuanian 😅

#

235b is around gpt4.1 level, smaller models are considerably worse

small haven May 5, 2025, 7:32 PM

#

day 19 since o3 and it is still as magical

tall summit May 5, 2025, 7:39 PM

#

ocean vortex Lithuanian 😅

I'm getting pleasantly surprised at LLMs' translation ability to less common languages

calm sequoia May 5, 2025, 8:07 PM

#

@ocean vortex can you PM me? You seem to be blocked this

calm sequoia May 5, 2025, 8:25 PM

#

tall summit I'm getting pleasantly surprised at LLMs' translation ability to less common lan...

Trust me, I was testing it on a niche language that only 300k or something people speak in the world (nobody counted us). Big models know most of the words and can make poems.

#

Funny thing is that literature is almost non existent. I can't understand why is it in distribution. Maybe emergent property.

keen beacon May 5, 2025, 8:33 PM

#

calm sequoia Trust me, I was testing it on niche language that only 300k or something people ...

what language btw

#

if u dont mind

ocean vortex May 5, 2025, 8:35 PM

#

calm sequoia Funny thing is that literature is almost non existent. I can't understand why is...

well... if you can google texts in that language it was probably trained on it 👀

gilded drift May 5, 2025, 8:53 PM

#

So , no grok 3.5 today ? 🙄

ocean vortex May 5, 2025, 8:54 PM

#

gilded drift So , no grok 3.5 today ? 🙄

I'm waiting for dork 4.0

sage raptor May 5, 2025, 8:55 PM

#

when will they release gork

small haven May 5, 2025, 8:56 PM

#

elon ma said next week on apr 29

balmy mist May 5, 2025, 9:00 PM

#

grok 3.5 in june

sage raptor May 5, 2025, 9:02 PM

#

gork > grok

keen beacon May 5, 2025, 9:10 PM

#

no it's asi

#

strawberry man

#

fake its asi

#

asi would have 100% on simpleqa

small haven May 5, 2025, 9:16 PM

#

look in the mirror

#

so is every ai company except nvidia lol

zinc ore May 5, 2025, 9:19 PM

#

When you're such a good model you max out on incorrect benchmarks (MMLU).