#general


elder rapids
#

exactly lmao, it got released with 2m then got capped at 1m

#

bruh

keen beacon
elder rapids
#

did discord just delete everything

keen beacon
#

ur just making stuff up

elder rapids
#

now I have to type ts all again

leaden palm
#

the official deepseek webui lets you go wayyyyy beyond the context length that their models properly support and while i do respect this approach (vs. truncation) they should really warn you when youve gone enough turns deep for the CoT to no longer be doing anything useful
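
The kind of warning described above can be sketched roughly as follows. This is only an illustration, not how the DeepSeek web UI works: `estimate_tokens`, `context_warning`, the 4-characters-per-token heuristic and the 80% threshold are all assumptions made for the example.

```python
from typing import List, Optional


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: ~4 characters per token)."""
    return max(1, len(text) // 4)


def context_warning(messages: List[str],
                    supported_tokens: int,
                    warn_ratio: float = 0.8) -> Optional[str]:
    """Return a warning string once the chat nears or passes the supported context.

    Nothing is truncated here; the caller only gets a message to show the user.
    """
    used = sum(estimate_tokens(m) for m in messages)
    if used > supported_tokens:
        return f"over: ~{used} tokens exceeds the supported {supported_tokens}"
    if used >= warn_ratio * supported_tokens:
        return f"near: ~{used} of {supported_tokens} supported tokens used"
    return None
```

A real client would use the model's own tokenizer instead of a character count, but the shape stays the same: keep the whole history and tell the user when it has outgrown what the model handles well.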

kind cloud
#

new google

elder rapids
#

exactly lmao, it got released with 2m then got capped at 1m, then got reduced to 32k because people couldn't reach 1m then got completely reverted to 2m

keen beacon
#

ah ur trolling lmao

elder rapids
#

there was never a time you could use the full 2m

#

no I'm not lol

frosty lark
# kind cloud new google

I see this mentioned a lot online, but I am always skeptical of asking the model directly. I mean they can (a) fake it and (b) reply from training data, saying it is chatgpt or the like

elder rapids
#

you have no idea what youre talking about and clearly you rely off of 3rd party sources

#

just go on the bard subreddit

#

it's not that hard

#

😭

#

how am I going to prove a counterfactual

#

you're not making any sense dawg

#

goddamn

elder rapids
#

^ this what happens

frosty lark
elder rapids
#

unless you mean something entirely different

leaden palm
#

idk then

#

perhaps it behaves differently if youre just chatting from if youre uploading a large document

#

or maybe the ui changed since

keen beacon
# elder rapids you have no idea what youre talking about and clearly you rely off of 3rd party ...

i never remembered it being like that. you're making stuff completely up/mixing up stuff. https://web.archive.org/web/20250125234911/https://openrouter.ai/google/gemini-exp-1206
frankly i dont need to argue about anything when i can just let you yap and it's obvious who's right

Experimental release (December 6, 2024) of Gemini.. Run Gemini Experimental 1206 with API

elder rapids
elder rapids
#

😭

#

because you never were able to

#

Logan was tripping

#

this was consensus about 1206

#

that's it that's all

#

used that model everyday

#

every hour

keen beacon
#

you didnt even remember 2.0 pro's context window 😭

frosty lark
#

is it really life changing to argue about the context limit of an llm (almost) no one will remember in 1 year?

zinc ore
#

Gemini plan

frosty lark
keen beacon
frosty lark
#

one can simply ignore them if needed.

#

I get it because I also get involved in time consuming pointless discussion from time to time

elder rapids
#

1206 didn't have 2m

#

you've never been right in disagreement with me for something

#

not a single thing

#

you always bring up nonsense

#

like deadass

#

you say sh*t get it wrong

#

then dismiss it altogether

#

😭

golden ocean
elder rapids
#

ye

frosty lark
#

people I understand the excitement for llm but one needs to still be able to search

#

https://justoborn.com/google-gemini/#gsc.tab=0

"Google Gemini! In a groundbreaking announcement on December 17, 2024, Google CEO Sundar Pichai unveiled Gemini-Exp-1206,

marking a significant evolution in artificial intelligence technology.

This experimental version of Google’s most advanced AI model introduces a remarkable 2,097,152-token context window, setting new benchmarks in AI capabilities."

end of it.

elder rapids
#

holy

#

do you read bro

frosty lark
#

simply pick the news from the time, and that should be enough

elder rapids
#

they said 2m

#

did not work at 2m

#

that's it that's all

#

nobody cares about the model card

frosty lark
#

yes and likely didn't work at 1m either (see NoLiMa), but the discussion is really long.

#

most models struggle past 128-256k

frosty lark
#

sometimes LLM driven search isn't great

#

interesting though for 1206 Exp I cannot find a model card in pdf

#

tricky to check old specs without

elder rapids
#

but simply errored at 2m

frosty lark
elder rapids
#

ye ye, It would break after 32k often

#

but if you progressed slowly it wouldn't have too many issues

#

I used to have translation discussions with it that went up to 70k

frosty lark
#

super annoying the absence of model cards. With normal webpages - if there is no internet archive version of it - everything can be "deleted" so to speak.

golden ocean
#

Is pier ai

frosty lark
elder rapids
frosty lark
#

or I could claim "oh sorry I hallucinated. I didn't mean that, you are right. Let me revise yadda yadda"

golden ocean
frosty lark
elder rapids
#

so it would be 2m context, then 32k then 1m

frosty lark
#

frustrating I guess

#

if they do it without warning

#

btw today I was trying to dig some historical fact with the help of LLMs (grok, perplexity and what not). The amount of hallucinations in the responses was appalling.

Everyone is shouting "AGI soon" but I'll be happy with LLM boosting searches already without hallucinating every 3 messages.

elder rapids
#

searches induce hallucinations so much

#

they're all unusable

#

grok is the only one that can actually infer from the sources themselves

#

4o does decently but only after the fact

#

0506 2.5 pro only does well when citing sources

#

and perplexity is deadass useless

frosty lark
#

I find perplexity ok in most cases (where info is widespread and without many fakes or much controversy), but when one wants to dig into something that has misinformation online it is pretty frustrating.

#

my experience is that the LLMs (in perplexity and other platforms) do not really check the credibility of the source. If one is found, they read it and include the text in the answer.

That could lead to many conflicting statements.

elder rapids
#

ye, to me it seems like grok was the only one that could somewhat decipher fake from non fake

#

deepseek did well too

#

but that's as useful as they could get

keen beacon
keen beacon
keen beacon
keen beacon
# frosty lark the closes I found is: https://www.reddit.com/r/Bard/comments/1h863qj/new_gemini...

its ridiculous that i have to do this but here's an article with a full screenshot of aistudio back then: https://starthub.asia/test-driving-googles-gemini-exp-1206-model-competitive-data-analysis-and-sophisticated-visualizations-in-under-a-minute/ (i find it extremely unlikely that both of the screenshots were faked and they changed it to some arbitrary 2m context value, (it's a specific value and not on the dot))


#

guy confused a prev gemini experimental version with 32k context, iirc, doubled down and made things up about them switching context sizes on 1206 🤣

elder rapids
#

disingenuous asf, bro knows too

south cove
#

10M context is crazy lol

#

my initial input is about 100k and I think that's really high

elder rapids
#

I've never said anything that postures ignorance

#

just sometimes less explanation

#

and then clarified

south cove
#

actually, I'm connecting the dots now... I think the reason why Gemini seems to "implode" after a few dozen responses is because it loses the original context

wintry tinsel
zinc ore
#

It's confirmed yes

#

Ultra plan exists

#

That's from legit

south cove
# zinc ore Gemini plan

"Gemini, make me an image of a UI with a blue circle and a rounded rectangle with the word "ULTRA" in it, with three vertically-aligned dots to the left of it, so I can troll the LMArena discord"

zinc ore
#

From legit

#

I don't send unverified stuff

elder rapids
#

you're deadass bugging

#

I even addressed the screenshot pier brought up and you're reiterating that same substance knowing it's not meaningful

keen beacon
elder rapids
#

not whether the base estimation for 4o is true

#

when both Aiden and Dylan claimed they were told

keen beacon
elder rapids
#

ye I already know this dawg

#

but this isn't meaningful

keen beacon
#

you were saying o1 was much bigger than 4o because it just can't be

#

want me to link to ur own statements?

#

dont bother deleting them i have screenshots

elder rapids
# keen beacon you were saying o1 was much bigger than 4o because it just can't be

because it can't be, there's nothing about what Dylan said that refutes what Aiden said

"o1-preview is 5x the cost of gpt-4o
o1-preview is based on gpt-4-0314 (source: openai team)

gpt-4o is ~5x smaller than o1-preview
you claimed gpt-4o's size = o1's size
ergo o1 is 5x smaller than o1-preview"

but when Dylan is claiming

"90% wrong
60% right
70% wrong"

in context of a different (listed) tweet about what Aiden said

maiden fulcrum
#

i was wondering what is the best llm right now?

elder rapids
#

I'm curious as to why you think I'd delete claims I'm not at the forefront of

maiden fulcrum
#

i am using gemini 2.5 pro preview 05-06 and chatgpt o3

elder rapids
maiden fulcrum
#

which is better?

#

i also use chatgpt 4o

keen beacon
elder rapids
maiden fulcrum
#

analyzing real life situations and reasoning

#

i am not a coder

elder rapids
#

those specific tweets are about o1's in speculation

maiden fulcrum
#

so which one do i use

elder rapids
#

not whether they're actually 4os size or not or that screenshot Dylan claimed specifically

#

they're different lines of questioning from Aiden

elder rapids
#

o3 hallucinates a lot

maiden fulcrum
#

most of the time i use 4o

#

what does hallucination mean in llms

elder rapids
#

and 2.5 pro will have worse conclusions on avg

elder rapids
#

or lies

keen beacon
# elder rapids dawg

see a thread of the tweet chain he deleted:
he was mentioning the size of o1 in that discussion

elder rapids
keen beacon
#

How did u even get that???

#

omfg

elder rapids
#

"90% wrong
60% right
70% wrong"
is Dylan saying, you're 90 wrong on one claim, 60 right on one claim

maiden fulcrum
#

so what do you think @elder rapids

elder rapids
#

70 wrong on one claim

#

it was a list initially

keen beacon
elder rapids
elder rapids
#

@aidan_mclau @natolambert @Luigi1549898 Table of contents is a TLDR.
Specifically bottom third is description of the architecture.
4o, o1, o1 preview, o1 pro are all the same size model.

@dylan522p @natolambert @Luigi1549898 o1-preview is 5x the cost of gpt-4o
o1-preview is based on gpt-4-0314 (source: openai team)

gpt-4o is ~5x smaller than o1-preview
you claimed gpt-4o's size = o1's size
ergo o1 is 5x smaller than o1-preview

keen beacon
elder rapids
# keen beacon lmao he was literally meming aidan's original tweet and u interpreted it literal...

confused on wym lol, im not taking whatever he's saying literally, whether or not he's actually agreeing or disagreeing is meaningless it's just simply a different line of questioning
https://x.com/stalkermustang/status/1869084918527270922

@dylan522p @aidan_mclau I agree on the first two, but why the last? do you think o1 pro isn't the same model with just longer reasoning chains (and maybe adjusted temperature or idk)?

#

and I'm not saying I remember this specific lines of tweets

#

you're bugging lmao

#

it's simply been talked about on reddit

#

so I recall the dialectic

keen beacon
#

its the same as 4o

#

o1 = 4o = o1 preview

#

he conceded that dylan (semianalysis) was right. he initially made a thread claiming several things about the size of o1 (90% sure, iirc), o1 pro (70% sure), etc., presumably he was given word that dylan was right then deleted the tweets and conceded there

leaden palm
#

i can't believe it hasn't been a year since claude 3.5 or o1 yet

keen beacon
#

yes its just best of n or something

#

inference time adjustments - dylan (semianalysis)

elder rapids
#

I mean, it just thinks for longer lol

#

trained to think that way should bring out different effects

small haven
olive mesa
leaden palm
#

although imo the progression is more likely to be in different optimizations, training, and modalities than a completely new architecture

small haven
#

its kinda insane that this isn't close to 100%

zinc ore
#

Too many "soons" that end up taking months

wintry tinsel
#

Elegant solutions for a more civilized age

small haven
#

claude code is just built diff

calm sequoia
#

Anyone tested Drakesclaw?

keen beacon
elder rapids
#

also mb I pinged you because I wanted to ask if there was any info about its performance

#

not just confirmation

keen beacon
#

ah

#

yeah I'm not sure so far but I haven't tested it much

#

it does seem at least on par with 2.5 pro

#

in webdev anyway

keen beacon
#

drakesclaw is interesting

#

it reminds me of an old google anon model that appeared on the arena but i can't remember what that was

#

it has some weird formatting quirks - likes breaking short responses into new lines, sometimes randomly capitalises words and can occasionally make spelling mistakes

#

it is also on the webdev arena now

keen beacon
#

im tempted to say this is better than 2.5 pro

calm sequoia
#

Poll: How strong was the Gemini 2.5 PRO nerf?
Winning answer: 5 of 16 votes

keen beacon
#

drakesclaw vs gemini 2.5 pro 0506

#

(0-shot)

#

enlarge the images, they're full-page ss

keen beacon
#

drakesclaw after being asked to make it more realistic

#

gem 2.5 pro

torn mantle
#

not impressed tbh

#

ive also noticed the weird formatting

keen beacon
#

it was significantly more realistic for 0-shot vs 2.5 pro (there are also icons missing since i didn't realise it tried to use fontawesome the first time around), and i still prefer the layout and content on the 2nd version for drakesclaw

unborn ocean
#

Although the guy sadly did not try any price * token calculations
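
A price * token calculation of the kind mentioned above could look like the sketch below. The model names, token counts and per-million-token prices are made-up placeholders for illustration, not real OpenRouter figures, and `spend_usd` is a hypothetical helper.

```python
def spend_usd(input_tokens: int, output_tokens: int,
              in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar spend given token counts and prices per million tokens."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000


# Hypothetical usage numbers, chosen only to show that the rankings can flip.
usage = {
    "model_a": dict(input_tokens=10_000_000, output_tokens=2_000_000,
                    in_price_per_m=0.5, out_price_per_m=1.5),
    "model_b": dict(input_tokens=1_000_000, output_tokens=4_000_000,
                    in_price_per_m=3.0, out_price_per_m=15.0),
}

# Rank once by raw token volume and once by money spent.
by_tokens = sorted(
    usage,
    key=lambda m: usage[m]["input_tokens"] + usage[m]["output_tokens"],
    reverse=True,
)
by_spend = sorted(usage, key=lambda m: spend_usd(**usage[m]), reverse=True)

print(by_tokens)  # ['model_a', 'model_b']
print(by_spend)   # ['model_b', 'model_a']
```

With these placeholder numbers the cheap model leads on tokens while the expensive one leads on spend, which is why a token-only leaderboard can look very different from a revenue one.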

calm sequoia
#

I like how everybody was sceptical on 3.7 Sonnet and yet everyone switched 😄

#

I like open router stats. It feels like complete opposite of lmarena. Yet the truth is somewhere in the middle. Probably aggregate of benches.

#

Is there data on the revenue ratio by source: api vs chat?

earnest parcel
torn mantle
keen beacon
torn mantle
#

they seem to be experimenting with different formatting as well?

#

the difference isnt that big tbh

tall summit
#

and enterprises are also on openrouter while not on lmarena

primal orbit
#

Has anyone seen it? Search doesn't show mentions

torn mantle
primal orbit
#

ah ok

fleet lintel
#

is drakesclaw google's?

ocean vortex
#

40.1B elo

#

lol

#

seriously though, this is just statistics it's mostly meaningless...

#

you also have certain models by design outputting much more tokens on average. Claude in particular is unhinged if you don't cap your reasoning budget

calm sequoia
#

You can't look at "today" and expect it to make sense. Look at the month. As I understand it, some data scraping and sentiment analysis company uses 4o mini for their work once a month or so

unborn ocean
#

Pretty obvious that it is just one team or something using for finance or science research

#

(Basic stuff like sentiment analysis or classification)

blazing rune
#

It is cheap

torn mantle
#

o1 mini

#

better

golden ocean
torn mantle
#

wasnt it gpt3

#

or davinci 02

keen beacon
#

emberwing is Google

torn mantle
#

o3 pro is based on that

torn mantle
#

notice how elon stopped talking about grok 3.5

#

its another failure

#

i just dont understand why the staff are working the whole day for + overtime

keen beacon
torn mantle
#

so they delivered with grok 3?

keen beacon
#

like he could say "we will do x by next week" and it's done a year later

keen beacon
calm sequoia
#

Hmm, R2 is quite behind the rumors too

torn mantle
#

i mean by your standards

main gulch
#

but I think DS may skip it and deliver a hybrid model

calm sequoia
#

I haven't seen any rumors on v4🤔

ocean vortex
#

You essentially only have sambanova

#

which are able to make it fast enough

torn mantle
olive mesa
# olive mesa

Poll: Will the first ASI be a Transformer or use a different architecture that's better (e.g. more scalable/learns better. Like how AI "hit a wall" with RNNs then Google invented Transformers)
Winning answer: Different architecture (answer 2), 12 of 16 votes

frail delta
sacred plaza
#

this leaderboard illusion paper showed a lot of big failure modes in lm arena. any changes being made in response to this? https://arxiv.org/html/2504.20879v1

this feels much worse than when epochAI gave math tests to openai before released their math benchmark testing. T

leaden palm
#

i havent heard of any changes

#

the lm arena x account just tried to argue against half of the claims

leaden palm
small haven
#

day 26 with no o3 pro

#

2 days left boys

#

wheres ur betslip

small haven
#

claude code honeymoon is over

golden ocean
small haven
ocean vortex
ocean vortex
# leaden palm theres been some new discourse on the matter: https://x.com/sarahookr/status/192...

I think it's a more complex problem and not necessarily their responsibility. To be brutally honest, it's down to people to learn how to read this properly. Sure they could only list the models if they have other metrics published, but it shouldn't be on them to police this... Like I've always said you can always game a singular benchmark, but then it's obvious it doesn't perform in others and people should be able to do at least the basic analysis and spot that.

#

Meta was only able to cheat it by entering with different model, but then this is so obvious it's not even funny and it stops being a problem altogether

#

OpenAI is cheeky with chatgpt-latest, but that's more of an exception, and still... they can't afford to disregard other things knowing that it's their main model + we have artificialanalysis guys

#

I think fundamentally it all boils down to the fact that people having the most data on chatbot usage are usually gonna perform the best on tests like lmarena

zinc ore
#

Drakesclaw might be goated

torn mantle
#

nah

#

thanks but nah

leaden palm
misty vault
torn mantle
#

so is it monday?

#

grok 3.5 asi

zinc ore
#

Hope so because I predicted early next week for the drop (even tho IDC for grok)

torn mantle
#

nobody cares about grok tbh

#

but we are here for drama

#

if its bad we will say it

#

although i think grok 3.5 is just fixing bugs of the previous version, especially the multi-turn/context/consistency issues

#

reasoning from first principles is just a fancy word used by elon to build up the hype

leaden palm
small haven
keen fulcrum
#

Trust the process

leaden palm
#

wait this was actually from elon????????????????????

#

thats crazy

#

but yeah ig hes trying to make the next gpqa/hle (but in training form)

small haven
#

grok 3.5 is so asi that elon ma is creating drama on sam altman

raven void
#

drakesclaw is really good

#

is that ultra or pro coder?

torn mantle
#

gemini 2.5 03 is better than their latest exp version which is 05 and this version ( drakesclaw ) is better than the latest version but how does it compare to gemini 2.5 pro 03?

raven void
#

I'm still not sure what's the issue with the new Gemini

#

are they quantizing it or did they just make it smaller

leaden palm
#

well ive put in my prompt
ive already explained the answer on the internet, might as well explain it to their faces
if it still cant answer properly it would be even funnier

raven void
#

by smaller I'm imagining something like increasing sparsity / pruning or something

candid storm
#

Elon ju

#

Elon just tweeted Grok 3.5 needs another week or so

raven void
#

guys I made a neat tool to copy local codebase to AI quick, how do I share it with lots of people

candid storm
#

What u mean?

#

O that was just a typo

small haven
#

manifold markets predicted grok 3.5 in about 14.5 days on apr 29, kinda insane markets are predictive asf

candid storm
#

I wanted to type this @deep adder

hollow ocean
#

Grok 3.5 July 28

hardy pecan
#

If there was any "stock" id put into predictive markets, its probably polymarket (real money)

small haven
#

nah but i mean like on apr 29, it predicted 14.5 days when elon said "in a week", should be approx 7 days, just insane to me

candid storm
#

You guys think it will be released in a week? Or will it take longer

torn mantle
#

I knew smth was going on when it wasn't added yet on lmarena

late path
#

he also said grok3's knowledge is updated in realtime, with no knowledge cutoff. he's just like Trump now, always spouting nonsense

leaden palm
#

the suspense is killing everyone

small haven
#

idk why grok 3.5 has more hype than o3 pro?

candid storm
#

Because nobody can afford o3 pro

small haven
#

if they think like that, then they really are undervaluing their time, things done with o3 can be a fraction with o3 pro, the cost/benefit is worth taking. i would even say chatgpt pro is cheaper than chatgpt plus bc the marginal cost of o3 is zero compared to 100/week

leaden palm
small haven
#

yes and are you?

small haven
#

chatgpt pro is cheaper than chatgpt plus

#

chatgpt free is the most expensive one

#

logic

elder rapids
#

agi can't be "rough around the edges"

#

gork 3.5 confirmed not agi

#

dork 4 here we come

#

if it fits your use case somehow

#

give me a prompt

#

@deep adder

#

dawg

#

if o3 ISNT getting at least 50% "AI written" writing a formal essay then it lacks rigor

#

now, if youre asking for a plain essay, or something similar to a prose

#

or perhaps just a regular informational essay

#

then that's different

#

the type of format necessitates certain qualities the AI detector fails to identify and weigh

#

it doesn't know the context of the assignment

#

the point is that it's no longer the format you're asking for, and it's just a write up atp

#

give me the prompt

ocean vortex
golden ocean
#

Is dork 4.0 agi

ember rapids
#

Breaking news

#

Grok is never coming out

misty vault
#

but dork 4 is

cedar tide
#

Guys, we can start a normal discussion from scratch here, and stop going off on nonsense about dork 4.0, and saying every day that grok 3.5 will be released today.

high ginkgo
#

I agree tbh. Stop saying that grok 3.5 is going to be released today when it's going to be released tomorrow and dork 4 on may 14th

terse shuttle
#

how to change the aspect ratio of images?

willow grail
#

uo

torn mantle
#

gemma recent models are surprisingly good

ocean vortex
mild galleon
#

Stop grok cult

torn mantle
mild galleon
torn mantle
#

did they remove drakesclaw already?

ember rapids
#

It’s interesting how no one points out that talking to LLMs is basically social media level operant conditioning. How model behaviors r set up to give the right mix of novelty, validation, etc in order to hook ppl

balmy mist
elder rapids
#

stop giving me hope ...

blazing rune
#

imagine it's still talking about 1.0 ultra

#

they had that for 1.0 ultra a while ago

#

it would be funny if they are just adding it back

keen beacon
blazing rune
#

same

#

but it would be funny

#

and quite evil

wintry locust
#

this has been there for at least a full calendar year

#

from december of 23

keen beacon
#

idk about the point made in that specific post, but i think 2.5 ultra is coming anyway

#

based on things

#

doubt

keen beacon
keen beacon
#

1.0 ultra is still unmatched creatively

#

i highly doubt this (edit: for ctx, i thought this was addressing 4.5 vs 2.5 ultra general performance wise. i don't know about 4.5 vs ultra 1.0 creativity)

#

pretty sure that doesnt exist 🤣

misty vault
#

claude 3.7 opus is agi

keen beacon
#

their agi is working on asi already

misty vault
#

Grok 3.5 also

echo aurora
ember rapids
#

What do we think nightwhisper is? Another 2.5 variant?

teal mantle
#

What is the longest time achieved for o4-mini-high

#

You mean high? In terms of UX incl. multi turn searching and analysis o3/o4 does run laps but UX isn’t hard, other labs will adopt

keen beacon
#

lmao no

#

4.5's rlhf cooked most of its real creativity

#

a model being old doesn't mean it can't be good creatively

#

ultra was a >1T dense model

golden ocean
#

I miss gpt-4

elder rapids
#

ye

#

also I've seen that the new model tends to use the word "untenable" a lot

#

it's a nice word tho so it shouldn't be obvious

#

ye but that's what I said earlier

#

that's a necessary consequence

#

formal structures are predictable

#

same sentence length etc

#

this isn't a flaw of the model

#

it's just that you're asking it to contradict the request of writing an actual essay by asking it to avoid qualities it shouldn't be avoiding

#

so it has no choice of doing something wacky

elder rapids
#

you had to be there

#

it would randomly show insane glimpses of genius

#

but it wasn't a stable model

#

it was big too so you'll never get the same result twice

elder rapids
#

oh man

torn mantle
#

this is so stupid

#

how is that company worth 14b

golden ocean
#

fr

tawdry meteor
#

anyone try out the new cline update? I've been split between cursor/windsurf for personal project development, but would be nice to switch to something that's completely open source

sweet tinsel
calm sequoia
#

Tencent no-name better than o4 mini and o1. Right...😪

ocean vortex
late path
tall summit
torn mantle
#

Lol how much is he paying you?

#

Comet browser?

#

Phones native integration?

small haven
#

lol

#

day 27 with no o3 pro

torn mantle
#

Day 60

#

And still no o3 pro

royal rivet
#

god

#

good

small haven
keen beacon
#

What happened to Claude code

#

You dumped it already?

calm sequoia
small haven
#

does a better job

finite furnace
cedar tide
ocean vortex
#

impossible

wintry tinsel
#

@hollow ocean is much worse

cedar tide
#

we will ban these words : gork, dork, agi, asi.

golden ocean
cedar tide
sage raptor
misty vault
high ginkgo
sage raptor
#

gork 5.4 coder is asi and agi

keen beacon
high ginkgo
#

bro one job had

small haven
#

o3 pro is going to solve poverty

cedar tide
high ginkgo
misty vault
cedar tide
golden ocean
small haven
cedar tide
#

otherwise what do you think of mistral medium 3? (to change the subject)

torn mantle
cedar tide
#

But it will be at the level of o3 and gemini 2.5 pro

misty vault
cedar tide
misty vault
#

that's not so very nice

#

(to change subject)

cedar tide
small haven
#

new css update, o3 pro soon!

ember rapids
#

If o3 pro can’t solve the Riemann Hypothesis im refunding

small haven
#

😭

teal mantle
sweet tinsel
#

Mostly Visual Tasks take that long.

teal mantle
teal mantle
maiden fulcrum
#

hey guys

#

what llm is the best one as of today?

#

i am not a coder

echo aurora
willow grail
#

day 1 of covid corvid.
i have successfully done visible contact with corvids.
they saw me put big peanuts on the ground.
multiple locations.
the best location was at the bridge, with a road, and lots of no tree space and grass. 2 crows there ate my peanuts
i showed myself to them by side walking.. not much looking into their eyes.
i did the crabwalk towards the big velociraptors to show that i am a comrade
soon, they will follow me wherever i go and people will think i am the devil
crows = devil
i need to start wear black stuff. and metal rings

maiden fulcrum
calm sequoia
golden ocean
#

I agree, but in case that model doesn't satisfy your needs, try gork 3.5, it currently scores 100% on all benchmarks in the world

tall summit
tall summit
golden ocean
#

What does scoring 100% mean on that

tall summit
#

has it been tested on any other political compass test besides the famous old one

maiden fulcrum
#

my case is reasoning and analyzing real life situations

#

what is the best model for that?

high ginkgo
maiden fulcrum
#

how can i use it?

lone summit
#

any good client to use Gemini 2.5 Pro Preview 05-06?

#

instead of that awful google console

maiden fulcrum
#

gemini app

#

or ai studio

lone summit
maiden fulcrum
#

yeah

lone summit
maiden fulcrum
#

it does

torn mantle
#

o3 always felt superior providing medical advice

#

it delves deeper into technical details with in-depth reasoning

#

all compiled in a single table format

#

ive actually parsed multiple results from o3 vs drakesclaw vs gemini 2.5 pro latest and asked sonnet 3.7 to rank them based on multiple criteria

#
  1. o3 had like > 9/10 multiple times
  2. 2nd was drakesclaw
  3. gemini 2.5 pro
maiden fulcrum
#

how can i use drakesclaw

elder rapids
torn mantle
willow grail
#

kibble for cats is best staple food for corvids if youre interested

keen beacon
#

wtf is this

#

is this one of his alts?

sage raptor
#

lol

raven void
raven void
torn mantle
#

or some automated bots

torn mantle
pliant cypress
#

I hope "drakesclaw" is not 2.5 ultra

lone summit
wintry tinsel
#

Reality engineering?

#

He’s been taking a lot of mushrooms

maiden fulcrum
#

what is 2.5 ultra

elder rapids
#

but I haven't tried it too much yet tbh

maiden fulcrum
#

is it a new model or plan

#

what do you all think of gpt 4.5

#

it is underrated

#

right?

raven void
#

Gemini is dying Google is cooked

wintry tinsel
maiden fulcrum
#

I really think GPT 5 will be similar to o3

north vale
#

is there an up to date version of this somewhere?

echo aurora
maiden fulcrum
#

@echo aurora hey pinapple, apart from the leaderboards, what would be the best model for my use case which is analyzing real life situations and help me go through them

north vale
small haven
small haven
#

i mean ya

elder rapids
#

explain how this is in any way a meaningful thing to the AI itself, the company, the infra, any service

small haven
#

gemini 2.5 ultra will increase poverty rate

raven void
#

this AI can't even convert my 50k tokens code from vanilla js to svelte

#

somedays AI progress feels only half real

small haven
#

try o1 pro, should do the job

#

guys how many fingers is that

#

and is that a PROfessional basketball player?

echo aurora
north vale
#

ty!

echo aurora
keen beacon
#

humanity is doomed

calm sequoia
#

Once I've argued that grok is useless and someone from chat argued back that it's really good at medicine. I take back my words, it appears to be actually good at medicine

royal whale
#

WTF IS TS

golden ocean
#

(ts = typescript)

royal whale
#

what is

#

drakwe

#

cklaw

#

drakes claw

#

its ultra

#

.

keen fulcrum
#

How is grok 3.5?

royal whale
#

It didn't

#

release

golden ocean
misty vault
#

same

#

gork 3.5 is great

sage raptor
#

dork 4.0 is better

torn mantle
keen fulcrum
calm sequoia
#

Elon said in few weeks so it'll be released this or the next year.

ornate stump
calm sequoia
#

This is not a search benchmark

royal whale
#

im tired

#

of hallucinations

#

o3 and o4 mini are taken over by hallucinations

#

same for gemini u cant do anything anymore

torn mantle
#

Api release is probably still far away

keen fulcrum
#

Did you experience a decline in experience for Gemini 2.5 Pro?

#

Some said it got worse than Flash

torn mantle
#

But im comparing it to the previous version ( 03)

stuck orchid
alpine coral
#

given a bunch of examples / situations for searching the web / reviewing parsed pages.. when it works (which fwiw it generally has for me), it's epic

#

but yeah i dunno.. maybe it's unrelated

#

just seems weird that oai are apparently going backwards wrt hallucinations (at least with their latest reasoning models.. i don't think there's been a regression in 4.5/4.1/4o)

royal whale
#

i cant use o3 anymore

#

like its so smart but also i cant trust what it says

alpine coral
#

i use it all the time tbh

royal whale
#

it gives false but accurate-sounding information

alpine coral
#

been kinda game changing - i rarely used thinking models before for actual use cases.. it was just too slow

#

but o3 on chatgpt, it's really useful for stuff that involves web research (+ like any data gathering / analysis); it works through research problems / questions really intelligently

royal whale
#

im using it on lm arena

alpine coral
#

i have though encountered hallucinations.. a few really annoying ones too

royal whale
#

so i cant do web searches

#

maybe thats the problem

royal whale
#

it says more hallucinations than accurate stuff

alpine coral
wintry locust
#

how are you even finding it

alpine coral
royal whale
#

maybe

#

but even when i ask it simple questions

#

it starts giving random references

#

and false infos

#

and books that dont even exist

#

etc

alpine coral
#

yeah right interesting

alpine coral
royal whale
#

yea

#

its probably better on the open ai website

alpine coral
#

it's been post trained with so much stuff for using tools that.. when it doesn't have tools (isn't actually drawing on any external material or calculations), it's just pretending it has, giving confabulated answers with made up references

royal whale
#

im using it through the arena so it cant use those so it cant verify info and stuff

rigid crescent
#

question for people that know more than i, why would both claude and minstral tell a story about lighthouses when it had nothing to do with any previous converstion/prompts. does this have something to do with a system prompt i dont see before beginning? excuse my terrible typos

rigid crescent
royal whale
#

they should fix that

royal whale
#

from arena

rigid crescent
#

it would have to be right? otherwise the odds of "unrelated" both landing on the same theme seem astronomically high from different LLMs

alpine coral
#

i'd be surprised if it was related to system prompts given they're from different companies?

#

what's the preceding conversation? some combination of whatever is there + shared training data would be my guess fwiw

rigid crescent
#

by system prompt i mean what the arena tells them right before the initial prompts from me

alpine coral
#

but yeah at least on the surface of it - seems kinda wild ngl lol

#

is it web dev arena?

rigid crescent
#

the only thing that comes to mind is if a system prompt is saying something like " help the user, shine light upon their topics" or something

alpine coral
alpine coral
rigid crescent
#

thx for the link!

alpine coral
#

np!

alpine coral
#

and doing the same on openrouter, 3.7 gives lighthouse story, as does 2.5 pro; while the others all give responses with some similarities (village, forest or clockmaker etc)

#

so yeah shared training data / something along those lines ig 🤷‍♂️

storm needle
#

It's not something exclusive to openai

royal whale
storm needle
#

increase to high and it will decrease

cedar tide
ocean vortex
cedar tide
#

And o3, o4 mini

#

and they put grok 3 think instead of the mini version which is better

#

After a few weeks of phased testing, Deep Research on Qwen Chat is now live and available for everyone! 🎉

Here's how to use it: Just ask something you're curious about — like "Tell me something about robotics." Qwen will then ask you to narrow it down — maybe history, theory, or real-world applications. You can pick one, or just say "Not sure… Surprise me!" 😄

Then, while you grab a coffee ☕ or take a quick break, Qwen will put together a clear, helpful report just for you.

AI is getting better every day, and Qwen is here to help make your life a little easier — whether it’s for work, learning, or just satisfying your curiosity.

Why not give it a try? You might find something cool! 💡

🔗:chat.qwen.ai/?inputFeature=deep_research

calm mica
#

Hey everyone!!
Is anyone aware of the best repo-chat llm out there?
One that can take a github repo as input so we can chat with it...
I am looking for something similar to what lmarena.ai uses

balmy mist
#

i also use pastemax, but that just gives you a copy of the repo you can paste into an llm like gemini 2.5 pro

sage raptor
barren prairie
#

Gemini is the worst AI and I won't change my opinion ... The policy is the worst .. I dunno, but refusing to help students doing Qmc during their exam period ... Google is doing wrong

#

Gemini app was my fav but now it's the worst app

candid harbor
#

Actual skill issue

zinc ore
#

It's theinformation who are usually plugged in

#

They're the ones who leak a bunch of upcoming chatgpt stuff

#

Assuming that's their actual Twitter

#

(yeah I deadname)

teal mantle
#

Deepseek, your move

#

Is o3 agentic (by extension o4-mini-high)

unborn ocean
small haven
#

rlly? no o3 pro today?

teal mantle
#

(Is o3 as amazing as in API?)

balmy mist
#

lmaoo o3 pro is a myth

torn mantle
teal mantle
#

But isn’t manus an agentic repackaging of claude

small haven
#

day 28 with no o3 pro

torn mantle
#

patience jimmy

#

if they improved just one thing on o3 pro then i would call it agi

#

and that thing is hallucination

#

if it dropped to like 10% then its a big win

barren prairie
#

After what happened to me with Gemini , i can confirm ... Gemini disappointed me and google too .

torn mantle
#

ok lets aim for 0.05

#

pleaaaaaaaaase

barren prairie
#

I hope they will post the same model with the same policy on arena to see Gemini's real score

small haven
#

common sense

torn mantle
#

i really cant see any model outperforming o3

small haven
teal mantle
ocean vortex
# torn mantle i really cant see any model outperforming o3

yeah OpenAI always had one of the lowest hallucination rates. That's probably because they are fine-tuning constantly and have more experience than most + user feedback. Google has the experience as well at this point but they are probably still not quite there yet

zinc ore
#

Gemini 2.5 has the lowest hallucination rate among all models

#

So it's really quite surprising o3's is so bad, you'd think they can get theirs similarly low

ocean vortex
#

I find it somewhat hard to believe, saw it hallucinating several times

zinc ore
#

There's a hallucination benchmark/chart that gets shown from time to time

#

I'd have to look for it

#

Iirc Deepseek also has low hallucinations

ocean vortex
#

Gemini has this thing it likes where it pretends to run the code, even when that's disabled

#

and the results of that ("output") are often just wild guesses

zinc ore
#

I couldn't find an updated version of this one, but you can see 2.0 series was doing well

zinc ore
#

I'll admit though, I'm not convinced these benchmarks are catching a lot of hallucination scenarios. Because I know I've seen Gemini get into these weird long hallucination cycles. But, my understanding is they all do that.

blazing rune
#

it hallucinates like crazy for me, it happens on Deepseek V3, V3.1, and R1

#

it's so annoying

#

that's one reason I haven't used deepseek, as well as the issue with having no good providers and the model being far too large

#

It seems like it might be an MoE thing

#

Qwen3 also does it a lot

ocean vortex
torn mantle
#

its quite the strategy

#

bring in more people -> shocked by the quality of a free model -> compare it to top tier models like o3 -> limit access

torn mantle
#

we have a benchmark for that for other models?

teal mantle
#

Could we get to frontier models 0.5% hallucinations by the end of this year?

sage raptor
low lance
#

Hi all
Can u guide me on where u get all the daily information in the ai space? News/happenings etc, any forum or newsletter? And why is it good or better?

torn mantle
#

i think drakesclaw is like a quantized version or something smaller but with the purpose of achieving similar results of the current gemini 2.5 05 model

trim vale
#

i just browse it a bit from time to time to stay updated about llms

wintry tinsel
golden ocean
#

gork

ancient reef
wintry tinsel
#

Browse: Singularity, Local Llama, Stable Diffusion subreddits daily, watch: AI explained, the AI search, Theoretically explained, and Matt Wolfe

wintry tinsel
#

I watch a lot of these guys occasionally but most of them are not necessary to keep up on AI. Anton Petrov and Sabine only occasionally cover AI, and David Shapiro just says a lot of nonsense. Have not heard of a good 5-6 on your list

#

Emergent Garden is great

#

So is Bycloud

torn mantle
#

Shapiro is an idiot

misty vault
torn mantle
#

i think grok 3 + search got better

hearty thorn
#

is this one new?

small haven
#

o3pro before google io LFGGGG

elder rapids
#

I have no idea

elder rapids
leaden palm
#

i love gemini 2.5 pro

small haven
keen beacon
#

i dont think so tbh

small haven
#

well until whenever gpt 5 comes out

hardy pecan
solar hollow
#

any strong anon models right now in the arena?

stiff pivot
#

is this free without loging in ? 😮

kind cloud
#

gemma?

stiff pivot
#

i dont think its free it dsnt even finish 600 lines of text 😄

keen beacon
# kind cloud gemma?

Haven't verified it myself but if the screenie is real it seems like a new Gemma is coming. (They usually train it to respond like that about its creators)

#

hi guys, this is @keen beacon but discord permabanned my main with 0 evidence at random while i was asleep so im on this account for the foreseeable

#

will try and re add people over the next few days

calm sequoia
keen beacon
#

one sec need to get ready for something

misty vault
tall summit
mossy drum
#

New model in Arena: calmriver

mossy drum
#

And this one: step-2-16k-202502

calm sequoia
#

Another mid

royal whale
#

it types fast

keen beacon
#

yeah cutiepie is a gemma model

#

who is naming these gemma anon models 🤣 zizou-10 (gemma 3 27b), cutiepie-75 (gemma ???)

royal whale
#

whats calm river

#

is flash lite

keen beacon
#

idk havent tried it much/at all

#

oh i just understood the zizou-10 name

late path
keen beacon
#

seems this new gemma model is slated for i/o (if we look at the timeline)

golden ocean
#

Large Language Model

storm notch
#

Hello all,

I’m building an AI model that needs to automatically label incoming emails into actionable categories like: ‘Needs Reply’, ‘Waiting for Response’, ‘FYI’, ‘Delegated’, ‘Calendar’, ‘Clients/VIPs’, and ‘Ads/Newsletters’. What are the best ML models (open-source or APIs) and practical approaches for classifying emails into these types of workflow labels? Are there any existing projects or pipelines you recommend as a starting point for production use?

raven void
#

Nightwhisper Drakesclaw Sunstrike were the top 3

keen beacon
#

how are they rating them

#

because nightwhisper wasnt ever in the general arena

raven void
#

if they are it's just vibes

keen beacon
#

oh its just a list

#

mb lol

hardy pecan
#

I think not far off my vibe feel too

#

I'd put
Nightwhisper
Dragontail
Dayhush
Shadebrook
etc
etc
etc

calm sequoia
#

Which one was flash

cedar tide
#

I think Emberwing is the new 2.5 flash

#

a version where they tried to make it more efficient by thinking less

torn mantle
#

I still have mixed feelings about drakesclaw

cedar tide
torn mantle
#

Also i think nightwhisper is a big coding model, and its used to train the recent gemini 2.5 pro models

torn mantle
torn mantle
#

One thing I've noticed about this model is that it loves to write in capital letters and i think it captured that from training intensively on a coding model/tasks

torn mantle
#

Since usually in coding you write const vars in capital letters

#

Could be a thing

#

Ive run it yesterday multiple times and compared it to o3 and gemini 2.5 pro, and it came in 3rd

#

Im not seeing any noticeable performance gain anymore tbh

cedar tide
#

Calmriver

#

It feels good that the community has stopped saying that DeepSeek R2 comes out every day.

torn mantle
#

I mean we even got some leaks on claude sonnet 3.8 and not r2

barren prairie
torn mantle
#

They are also working on their hardware infrastructure implementing huawei latest gpus

#

You could only imagine the amount of issues they run through

#

Since it's like a new system...

barren prairie
#

Yes and they don't have enough money like google to make it fast

torn mantle
#

Chinese companies are so wealthy btw

sage raptor
#

I hope the new claude model is >>> than other models at coding

torn mantle
#

The o-series are good but they arent really solid at coding

#

Especially visual/ui/ux

sage raptor
#

idk but i feel 3.7 is better at coding for my use cases

#

than 2.5

torn mantle
#

Yea 3.7 is still better tbh

#

The only model that topped 3.7 was nightwhisper but then it was only for web dev

sage raptor
#

yeah, we don't know about backend etc

oblique flint
torn mantle
#

Some invitations were sent for red teaming / security...

sage raptor
barren prairie
#

But claude is the king

main gulch
torn mantle
#

What i like about deepseek is its reasoning traces

#

Its probably the best chain of thought u could read

#

Its packed with many infos and yet you dont get overwhelmed

#

Unlike grok

barren prairie
torn mantle
#

Grok is just spamming and repeating many sentences

#

Its totally unreadable

keen beacon
#

Phi 4 reasoning plus xd

torn mantle
keen beacon
#

most unreadable cot ever

#

yeahh

torn mantle
#

Yea Microsoft are so lost

keen beacon
#

i was only interested in it because of it using o3 mini traces

#

the reasoning plus variant is absolutely crazy. the non plus variant gives u a better idea

torn mantle
#

Lol

#

Didn't they like benchmaxx with phi 3

#

And even the earlier versions

#

It was just fake benchmark

keen beacon
#

personally i wouldnt call it benchmaxxing, they just didnt focus on human preference at all until phi 4

#

the poor public sentiment about those models was primarily because of that imho

#

bad at conversing, bad at following instructions, censored to hell etc

torn mantle
#

I think they arent taking their jobs seriously

#

AI engineers at mcsft are either lazy or incompetent

keen beacon
torn mantle
#

Let's just rely on oai models?

keen beacon
#

🤣 yeah

torn mantle
#

What are the benefits if we made our own o3 model?

#

Did they even take that into consideration?

#

Or even perform a small analysis?

#

I really just don't understand msft

keen beacon
torn mantle
#

Nova/premier

#

Those models are so bad

keen beacon
#

phi 4 is basically a gpt 4o distillation. phi 4 reasoning is a o3 mini distillation + rl. (see their reports)

keen beacon
torn mantle
#

Could be

#

Btw what happened to kimi k1.6?

#

Didn't it like top some benchmark with +90% overall?

#

Or they were benchmaxxing as well

keen beacon
#

idk vaporware maybe

torn mantle
keen beacon
#

There's a new lcb revision, visit their site to see

#

It has Gemini 2.5 pro on it

tall summit
#

never heard

fleet lintel
keen beacon
#

For competitive coding

#

This doesn't always translate well into irl coding scenarios though

tall summit
cedar tide
#

Is cutiepie a reasoning model or not?

#

How does it perform compared to Gemini 2 Flash?

dusky aurora
#

does direct chat beta interface use some special sampling?

teal mantle
#

high is peer of 2.5 Pro and o3

cedar tide
main gulch
cedar tide
main gulch
#

Prover is actually fine-tuned V3

cedar tide
#

When do you think we'll get the next version after DeepSeek v3? (I don't know if it will be called 3.5 or 4)

cedar tide
main gulch
#

I think the next model will be hybrid with optional thinking

#

DS will have to close a big gap from SOTA (multi-modal, large context, integrated tools), I doubt they can do it in a single step

#

so we could get the second Llama 4 moment

#

or even worse, as expectations are way higher

cedar tide
#

Im waiting for a model that knows when to think or not and how much to think

keen beacon
cedar tide
keen beacon
#

This is a complicated problem if you wanna do it well I think

alpine coral
#

'creating systems' (not models)

#

use p2l in the meantime ha

balmy mist
ocean vortex
#

it's good for recursive pattern matching, but it's gonna lack depth and awareness for nuanced things and can be harder to efficiently communicate with

#

2.5 pro score is impressive though

calm sequoia
#

Is there at least one non-open-source chinese lab?

torn mantle
torn mantle
#

Baidu

#

I feel like oai is running this process continuously, they make a big good model then its distilled to smaller versions

calm sequoia
#

If i'm not mistaken wild said it's on a different base model

torn mantle
#

Its working for them so far but the models are actually losing a lot of knowledge and becoming dumber

calm sequoia
#

You can only fit so much in a smaller space

torn mantle
#

You guys should retest grok 3 + search

#

Im noticing a clear difference to the older version

torn mantle
#

Since it was trained by the teacher model

#

You can only squeeze so much

ocean vortex
cedar tide
calm sequoia
# torn mantle You can only squeeze so much

There's always this tradeoff in compression. The more you squeeze in memory (param count), the more you lose in compute time (how much work to decode the compressed data). I highly doubt they would want more compute time.

ocean vortex
#

when distilling you don't need additional RL training, it learns reasoning during distillation, so 4.1-mini becomes o4-mini

calm sequoia
#

Of course, some magical super effective latent spaces could be discovered, but IDK if it exists in text

calm sequoia
#

One can say the pretraining is distilling humans

balmy mist
#

lol its been a month since o3 pro in a few weeks

#

and where the hell is r3?

#

r2*

teal mantle
#

is there any bigger or better discord for AI use discussion?

small haven
#

o3 pro today pl0x

torn mantle
#

So they can always claim the 1st spot

#

They may be waiting for grok 3.5

#

But one of their staff said it should be released before google event

small haven
torn mantle
small haven
torn mantle
#

Why would they release a model when they still have the best one in the market

small haven
#

claude 4 + o3 pro gonna be an insane combo

keen beacon
#

can't say much but they're not done with 4 yet

torn mantle
#

That's just logic

keen beacon
#

well of course

#

i have seen them

small haven
#

i think this is easy money

keen beacon
#

but it continues to be incremental

torn mantle
small haven
#

incremental on the logarithmic scale 😭

torn mantle
#

Well it depends if the next trained model meets their expectations to be called claude 4 instead of 3.9 or 3.8

keen beacon
#

but 3.9 isn't guaranteed if they've reached 4 by then

calm sequoia
calm sequoia
torn mantle
keen beacon
keen beacon
junior vigil
#

Neptune, new model but same pricing ?

cedar tide
#

when Claude 3.5 new new new comes out ?

ocean vortex
keen beacon
#

it's called Claude 3.5 Sonnet New New 2