#general
there's a big problem using rust in codex: every time u ask something it has to recompile all the crates from scratch, takes a good 5 mins before it can even start (well, it depends on the project's cargo setup)
Oh sorry I see it now. At that time, I was speculating based on the FrontierMath benchmark story wherein they ran their models for hours for one problem
anyone have ideas how to only pay for a tram ticket when i see inspectors entering the tram i am in?
I also rarely have a complete up-to-date picture of what is going on inside OAI. I do know people in the know and hear things from time to time
i believe that was common 'knowledge' among the people speculating
on what?
nvm (read the context)
Wait is that Gemini ??? Why he is like that ????
correct
gpt-4-preview
send me try me
how do u select this..... what is gpt4 preview XD i dont remember this ever
wait so its not on bing?
4B should be faster
But if you do 4B, do Q6_K
Oh, if you have 6gb of VRAM, q5_k_m might be better
@misty vault i cant find that model on copilot
That should be at least 20 tokens per second if you set it up correctly (the 4b)
lmao this is an on-device model for mobile
I'm talking about a 4B dense model on his 6GB laptop GPU
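For a rough sanity check on the 4B / 6GB VRAM suggestions above, here is a minimal sketch. The bits-per-weight figures are approximate assumptions for llama.cpp-style quants, and the estimate covers weights only (no KV cache, no runtime overhead), so treat the numbers as ballpark.

```python
# Approximate bits per weight for two llama.cpp quant types.
# These values are rounded assumptions, not exact for every model.
BITS_PER_WEIGHT = {"Q5_K_M": 5.7, "Q6_K": 6.56}


def weights_gib(params_billions: float, quant: str) -> float:
    """Estimate the size of the quantized weights in GiB (weights only)."""
    total_bits = params_billions * 1e9 * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


for quant in BITS_PER_WEIGHT:
    size = weights_gib(4, quant)
    print(f"4B @ {quant}: ~{size:.2f} GiB of weights on a 6 GiB card")
```

Whatever is left after the weights has to hold the KV cache and buffers, which is why a smaller quant can be the safer pick on a 6GB card with long contexts.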
Yeah, the leaderboard is terrible
will o3 pro be in the arena
Or Gemma is just good at human preference
If it actually had something to do with how fun or relatable a model is, that would be fine. But it is just using stupid phrases
shitification?
And more importantly, it depends on the formatting and how many headings are in the answer
Yeah it seems to be all style control
doesnt say gpt4 preview
the non preview gpt-4 also needs no jailbreak
Bro some guy literally sent in this chat once
It is already supposed to
so
the original websockets lasted for a year
they died 1 month ago
but theres another one
what product are you using after the keynote?
microsoft bing chat
lol
i really think flowith is powerful
i have yet to harness its full power
I will suck **** for microsoft and openai to give me the fine tuned gpt-4 they use for bing
Ew Bing
New gemini 2.5 flash vs old
🥺
no
it's been powerful asf
it's unironically insane
I've been using neo AFTER chats with AI
which builds its "prompt" essentially
and then stemming from those context threads, it goes through everything and plans it extraordinarily well
wasn't directly the best instruction follower imo, the reason why it appeared that way was because it inferred (correctly) based off of a deep implicit understanding like if you meant to emphasize something, it'd grasp what parts to prioritize. But granular details, or in this case implicature based counterfactual speech, it didn't comprehend at all
its inference made it so easy to talk to ye
is this happening only to me, or maybe there's a maintenance of sorts...?
SOOOOO good bro
man I think I'm starting to like this AI trend
so wen is exactly o3 pro
Yo why is neo using Gemini
LMFAO
ITS USING GEMINI OVERVIEW FROM THE BROWSER TO CONDUCT ITS RESEARCH
LMAOOOOO
what is "it"?
agent neo
dawg o3 is so bad dude it's insane
ts can't understand a WORD I'm saying
holy
I'm genuinely getting mad
Looks like nightwhisper was Gemini deep think 🤔
The performance is not ultra level I see why they haven't released it
O3 pro will cook it if so
dont think so, i think its just a refined 2.5 flash/pro, and if it is, then its gonna be a flop, because o3 >
same here (lines up with my actual usage very neatly and consistently). it's a great benchmark imo
Best AI sub
I'm really missing the raw thoughts of Gemini 🥲 (you can get around it but you might be degrading perf)
"subject to strict limits"
Whats with these labs lately
"show raw thinking"
everything reminds me of her
ye I'm thinking it would degrade performance too
or at least context recall
is this agi?
watch them be mid asf
if it was they would be bragging about their ARC AGI score alr like OpenAI was with o3
Yea its you
Asi
Really doesnt matter with these ai labs anymore
Claude 4 is probably gonna be good but if it doesn't have multimodal capabilities like other models, it's not a good sign
And knowing how strict already anthropic is, model gonna be good but it wont be usable for a while
@tall summit why did you delete lmao
Do o4-mini imply the existence of o4 full?
They are still silent for o3-high-pre-nerf reaching 15%
o3-high-pre-nerf is agi
probably because each task took nearly 3,500$ to run
I know all the info, but it's still notable
probably but 3.5k dollars for a single task is crazy though
Yeah, but it delivered. There are many use cases where this money is nothing
@cursive zodiac and butthead ahh username
imo unambiguously yes
it is curious.. i don't know but ig it's like voluntary system (the labs have to offer up their model to arc to be run against the test - rather than arc testing all models themselves, which ig would mean exposing the question set through the API calls and contaminating the bench)
i think gem pro 2.5 is an equally conspicuous omission.. surely can't be about costs (google and oai could cover them if they wanted to)
[as a footnote.. also wild how models like o3-mini-low and sonnet 3.7 score 0 on arc2]
they ran it though
oh
it cost 3.5k per task and got 15% on arc agi 2 (o3 preview)
results werent officially published iirc
for 2.5 pro idk lol
i mean there must be a reason it's not on arc's official leaderboard? if they ran it - like why not 🤷♂️
might read the paper
(aka give it to gem pro 2.5 and ask it if it explains what's going on re omitted scores)
wait i got the 2.5 pro scores
its also hidden
Gemini-2.5-Pro-Exp-03-25 **
12.5% on v1 semi private
1.25% on v2 semi private
it was marked with a note:
** Preview results: Results marked as preview are unofficial and may be based on incomplete testing. Models without available pricing information will not be shown on the efficiency chart. Results become official after complete testing is finished.
maybe thats why it wasnt listed lol
the note doesnt really make sense
For sure. I start to dislike how google position themselves with selected benches only, e.g. yesterday with elo scores
Has Claude 4 been testing in the arena now?
anthropic doesnt do that (do pre-releases in arena afaik)
anthropic are still stuck on how to make profit
on how to release their models
on how to decrease rate limits
im really not looking forward for any of their releases
because ive had enough
we asked for a better rate limit, they introduced credit system which made things even worse
their service goes down a lot compared to other companies i think. theyre struggling with serving i think
after years
and they are still struggling
not enough compute compared to other companies i guess
its probably because its mostly used for coding -> more tokens compared to other models
its good for creativity too so -> more tokens generated
they wanted to balance that by nerfing the default/usual output with concise outputs on general questions
They can just rent more like everybody else, especially at the high prices they charge per token.
And I remember the ceo saying once that like "every two weeks something new gets invented that reduces compute needed at inference by like 30%"
(don’t remember exact wording or percentage)
Said that to take the wind out of deepseek post R1 hype
you'd think they would've basically resolved the issues by now if it were that simple
Yes, which is why I am thinking that it is either really more complicated or that they just don’t care
And the longer it takes the more I am inclined to think the second one
Is poe considered to be the cheapest?
yeah.. i mean ig incomplete testing means just that.. but why only partially run the test.. (the sentence about pricing availability and models appearing in the efficiency chart does make sense.. but is oddly inserted b/w the two other sentences that seem related lol)
100%.. it's obvious by now that they're compute poor
I don’t think it’s that complicated..there’s only so much physical hardware available at any given time .. and anthropic’s share of that is clearly less than oai and google's. companies lock in what they can.. those able to pay more get more (or in Google’s case, they just own the hardware outright).. if Anthropic suddenly got a bunch more capital, they could better compete with oai and others to secure GPU access, but that would be for the future. for now, they’re stuck with what they have.. which clearly isn’t enough
as I see it, they’re throttling usage on Claude chat and experiencing outages because they don't have the capacity to both serve current models at scale while also developing new ones (well.. in theory.. like honestly how long until claude 4 lol).. as Wild said, if there were a simple way to avoid this they would’ve done it already surely..
Would like to see gemini doing that 😎
what were you doing
Rewriting super-complex signal processing function that took me 2 months to write once
Pre-nerf Gemini would have aced it, though
But something is going on with o3, it delivers two responses sometimes. One is often twice as good with twice as long thinking.
YESSSSS WE HAVE AN OPUS
😍😍😍😍
tomorrow gonna be great
I'm sorta iffy about Claude 4 actually coming out soon
feels like way too much speculation over a minor backend change
esp considering that the safety testing event didn't happen that long ago and didn't seem to showcase a new model
↑
speculation is so pointless
whoops i'm talking to red
who gambles based on speculation
no offense red
honestly I'm not
well of course rumours are rumours however
it would make sense from my internal understanding
anthropic's team have been locked in recently hence the outward lack of shipping other things
of course that's a possibility
there is also a possibility they have actually not been working
and having no releases supports both theories
i am inclined to believe the theory is true because things have been focused on release
maybe - but in that case I'd still be surprised if they didn't do the volunteer redteaming thing on it directly
anthropic's safety stuff is less to do with the model itself nowadays and more to do with the layer(s) on top so the difference may have been negligible
yeahh, that was my other thought too
i'm still leaning towards "they wouldn't deploy a brand-new safety layer developed on an old model with a new model within a week"
is this new? the animation? is this deep thinking? https://i.imgur.com/Wkz7np6.png
what is vro doing 😭
i feel like the star and 3 blue dots are new?
oh
nah, i don't think the three dots are new
those have been there for a while
this particular style of thinking seems novel (probably summarized the same way openai's been doing it)
and there are some minor ui changes within the thought box
wait what
nahh this week is crazy
good
yea
Did you also have 2.5 flash thinking taken away from you?
(There is now 2.5 flash no think)
you have a voice favorite?
nah google are cooking with these voice models
Yes since it is exactly 10 times more expensive
i was telling yall a while ago google will win and yesterday kinda dropped the mic lowkey, they even have diffusion text models lmaoo
they said we doing everything lol
mercury labs gg lol
its better than sesame?
ok if u wanna read more.. https://www.reddit.com/r/Beichtstuhl/comments/1krxt6g/beichte_unerwartete_erektion_nach_4_wochen/
its r/confessional
lol me too
idk wat to do, i wanna try veo 3 so bad
but i cant pay that price until we have all the features
nahh this is so bizarre to me: https://x.com/mark_k/status/1925109493911728186
👀
How's this possible for 24B
hah thats good one
Incompatible to people who arent engineers.
like
say the name
ik whos on ur mind
just say it
u?
lets try my fitness app with devstral
Yeah I told you yesterday they took it away, but it has the same performance as the old thinking one and is really faster, and you can force it to think by selecting "Canvas"
same performance as the reasoning model, it's not possible
Pls Add Devstral in dev arena
(Open source)
https://mistral.ai/fr/news/devstral
Yes
hey @echo aurora
https://beta.lmarena.ai/leaderboard/webdev
webdev leaderboard not working anymore
hooooly more money
Thanks, going to flag to the team! Reminder that we're trying to use the #1343291835845578853 channel for bugs, no need to post there again just giving you a heads up.
$100M~
glad to see lmarena isn't crashing and burning yet especially with the methodology controversy
100M is crazy!
about the censorship on the beta site
it feels like it's a lot more than the current site tbh
100M is surely a typo
like some slightly violent words are immediately blocked on the beta site
wondering if this is carrying over?
yesterday's Google IO also advertised LMArena💯
"Financial data platforms indicate a total funding ranging between $14.3 billion and18.16 billion, which may vary based on inclusion of secondary market sales or debt financing which can sometimes be ambiguous." (according to some gemini searches) and renting compute has never been easier, I mean many of the larger corps (e.g. microsoft or amazon) heavily rely on renting compute from others (especially coreweave).
So I don't see how they could not be able to serve the models (as I am working under the relatively safe assumption that they are making money on serving).
And btw i am not trying to say they don't WANT to get these issues in order, to me it seems more like the team is just not very competent at resolving the issues. (can be seen at the long time it took for them to acquire more funding vs. the time it took xAI).
Furthermore this kind of inability to properly manage financials is not something exclusive to anthropic, I believe that there is also a certain level of this 'financial illiteracy' (prob not the right word) in many other startups like xAI and openAI.
What flawed benchmark?
Well dont quote me
but the gpt and gemini models been top 1 for the last couple months
I think its fraud especially since, they dont include open source models like that
since its a very large demographic
you think they dont deserve that?
open source models are included too, the known one at least
surfacing specific concerns for our upcoming AMA would be a good place to share - https://docs.google.com/forms/d/e/1FAIpQLSdFrGpj4GC7ED6XXOKLFq_UKoubUB5A6v8TDL0BBdc-Q_0bag/viewform?usp=dialog
Nope, since
deepseek
cleared them in one sho
shot*
qwen too, actually a lot did
yea but that was long time ago, deepseek got a good spot then many good models came out
Well could I talk to a employee directly? I would like to
while deepseek was good at reasoning, it lacked the general knowledge and coding strength of gemini models
qwen is a mid model
lets be honest about that
but i appreciate what they are doing
sure gpt sure but doesnt it seem fishy
that it been like this
for the last 7 months
98% of people dotn code
dont code
the webdev arena does not reflect at all the performance of the models on the overall coding,
90% is just voted for the one who makes the best visual in one shot,
it would be good if lmarena made a partnership with cursor to integrate it into an arena, where with each message cursor would propose 2 results and the result that the user keeps gains elo
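A minimal sketch of the "kept result gains elo" idea, using the textbook Elo update. The K-factor of 32 and the function names are illustrative assumptions, and this is not necessarily how LMArena computes its ratings.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(kept: float, dropped: float, k: float = 32) -> tuple[float, float]:
    """Return new ratings after the user keeps one result and drops the other.

    k=32 is an assumed K-factor chosen for illustration.
    """
    delta = k * (1 - expected_score(kept, dropped))
    return kept + delta, dropped - delta


# Two equally rated models: the kept one gains what the dropped one loses.
new_kept, new_dropped = update_elo(1000, 1000)
print(new_kept, new_dropped)
```

An upset (a lower-rated model's result being kept) moves the ratings more than an expected win does, so a few surprising picks count for more than many predictable ones.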
guys it is no typo 💀
they have a 600m eval
🤑
yea but if you want to judge a model, you need to assess multiple areas, i get that most people ask it general questions but you must expect that -> new models -> better ranking than old models
I know that lol I literally own an ai company
What im trying to say
it seems fishy that it keeps going this way
maybe they could use one of the millions they have
to filter the messages out
a lot more than they are currently doing
did you take a look at their llama 4 maverick messages logs?
did you find any bad prompts?
Llama 4 is trash we can all agree about that one
I will not fight you about that lol
yea, but we're talking about prompt filtering
maybe you meant a better way to count a vote(+1)
tbh i dont know how they count a vote to give you my opinion on that
I meant that sorry for the mispell
@echo aurora Well could i talk to a employee about this then on discord
so at least, I could talk and get it dealt with
ok but what do you think this list should be
ok
let see
gpt 4.1 can go above claude 3.7 sonnet
o4 mini can go below deepseek
oh do you want me to number it?
gpt 4.1 below sonnet 3.7
srry viewing it inverse. gemini might be tied with deepseek v2
so it can go either way in my opinion
how chatgpt 4o is still here is crazy to me, that goes below grok and gpt 4.1
I don't think this is a good idea. As I've suggested before, perhaps adding an "agent battlefield" would be better, similar to the current Webdev. In my personal understanding, webdev is essentially a form of agent as well. Implementing an agent battlefield shouldn't be too difficult, just spin up a virtual environment with an editor/IDE or agents for comparison. In other words, adding an agent selection dimension under the existing webdev framework seems much more reasonable than partnering with a specific company.
anthropic models should be higher, i agree
bro you're literally a drooling alien
but they deserve that spot
if im asking a model a question i dont want it to give me 2 words and stop
I'm jk @narrow elbow i love you
im confused why does that video look ai gen lol
and why is it called io
thanks for telling me cause i would have been confused too lol
no one is going to code a real thing inside, and especially not a real big project with several round trips, so it still won't reflect real usage.
io
input -> ai capabilities from openai
output -> design/hardware -> executing tasks -> from LoveFrom
Nope
Yes, but the agent's goal isn’t just coding, it’s about accomplishing specific tasks, like "deep research", "creating a webpage", "drafting a work plan (sending an email or something)", "designing a travel itinerary (booking a hotel or something)", "developing a course curriculum (creating a ppt)", or "analyzing a set of medical records". All of these are tasks agents could handle. The point is to verify whether an agent can exceed expectations in completing such tasks, not just coding. So, agents aren’t limited to coding, right?
I can't make any promises, but I'll reach out privately to get a better understanding of what the issue is and we'll take it from there. Sound good?
Thanks!
Tell me when you can so i can at least speak to them
He wants anyone who has ideas to be able to make them happen with the tool, so input = prompt from people containing the ideas, and output = concretization
ok so we're talking about 2 completely different things, yes I agree with you that we need an arena to measure the agentic level, but I was mainly talking about measuring the coding level.
Actually, it’s not entirely unrelated. Agents rely on the capabilities of the underlying model providers, but given the current limitations of models (or perhaps model providers also have "agents" internally?), these base models alone still can’t fully accomplish what today’s market-ready agents can do. Coding is just one of the foundational model’s abilities, while real agents represent what we’re truly aiming for. Not just coding, haha.
great another day without o3 pro smh
Another day without grok 3.5
https://x.com/xai/status/1925242582617268598?t=iW-1nxOCwEy8zatLrT51HA&s=19
Attention devs: the xAI API just got A LOT smarter.
With Live Search, Grok can now search through realtime data from 𝕏, the internet 🌐, trending news, and more.
The Live Search API is now FREE in beta for a limited time. Start building here: https://t.co/Yfmhe49Yh4
grok 3.5 is expired before it got released
Nope
it will be at the level of G 2.5 pro and o3
referring to claude neptune ( 4 )
-> roman god of sea
so there is a high chance tomorrow we will get a demo of these models
Focusing specifically on coding: given the proliferation of programming agents, a systematic comparison is needed to evaluate their actual performance in coding. A public leaderboard would be ideal.
I also talk about a public leaderboard
the timing of anthropic is kinda interesting
just in time to stop the gemini 2.5 pro model from trending further at coding tasks
😦
the strings are kinda funny
grand_damage_bucket
my brain went straight to ++$$
kinda curious about opus tbh
@keen beacon how long has it been since last opus model?
over a year i think but i dont really remember
i wonder if they are abandoning haiku lol
3.5 haiku was.... 💀
yea
they didnt seem to have gained much improvements tbh
what does this mean to xai?
will they postpone even further grok 3.5?
xddd
What if it was a joke and in reality only Claude 3.8 was released?
kinda strange the information didnt specify the name of it in their recent report
just "claude sonnet" "claude opus" iirc
anything is possible lol
no it was like a month ago
we thought it was for the invite
there was an invite
for smth
misremembered it, thought they mentioned claude 4 in "anthropic strikes back" (report by theinformation 3 months ago) but no it was people inferring claude 4 and wasn't directly mentioned iirc
Man they spend like 10b the last two weeks
oh naww finesse of the century
imagine grok 3.5 never releases, cus theyve been too frontrunned lol
I’m hearing Claude 4 is confirmed
The head of anthropic dev relations supposedly confirmed
I’m genuinely curious about Claude 4.0. I’m wondering what new features and capabilities it will bring in this upcoming release.
Nevertheless that api cost is going to be a pain in the ass 💔
they will bring in a $200/mo plan, unlimited claude 4 opus less go!
Just copying OpenAI at this point 💔
first one to $1000/mo wins
😂 😂
Damn 100M funding is absolutely nuts. Can't wait to see the advancements and features that we will have. The one I'm waiting the most is a personal leaderboard based on our usage and a live rank of user voting the most accurately 😍
hello
hello 
lol i was reading this again: https://magic.dev/agi-readiness-policy wow it's super dated 🤣 qwen 3 4b passes this threshold
Obviously when has a decimal release included a new model like opus?
AI Explained has insider information and he said a model is releasing tomorrow, didn’t expect it to be opus though, thought it would be grok schlock
hyped for the future of LMArena 🔥 👀
what happened with the server logo...
that's a good way to lose engagement in your server LOL
it was updated to better match the beta site! the recent announcement mentions some of these changes
I feel like there are better ways of modernizing your logo while making sure it is still recognizable to be completely honest...
rip vicuna 🥲
Lmarena raised 100M. This means the valuation is at least twice as much.
What is the revenue coming from?
How do you promise future revenue to the investors?
their valuation is 600m apparently lol
This investment deal does not make any sense
Unless you are receiving millions from google for rlhf
harvest data for all the frontier labs/leaderboard/additional marketing for the companies/etc lol
read the article btw
Paywall
The old Logo was so much better lmao
Lol in this case the mcbench should be valued at least 50M 😄
ah yes, "ASI Lab" lol 🤨
daamn this would actually look perfect 😠
leaderboard
hmm
I appreciate the feedback (and the mocks), I'll be sure to share the feedback with the team
what did u generate it with? gpt 4o image gen? or photoshop lol
tbh it's pretty good
Maybe... I sure as heck didn't do this with photoshop no time for it... 😂
the old one simply doesn't make ANY sense
what?
I know the history but it doesn't make any sense for lmarena to have it as its logo tbh
it could be the mascot
the new logo is too corporate
but more popularity
helps identity
helps design
helps popularity
helps identity
etc
although this new logo isn't very good imo
as far as direct design
it's a colosseum right
that's not an intuitive representation
it is tho
what else could it be
vaguely honeycomb, an oven, cell block, just a regular panel or something
only knew it was a colosseum via the banner and then I look at the name "LMarena"
and I was like
oh that makes sense
but only after the fact
What was their pitch deck? Their vision?
that's deliberately somewhat misleading though. It's not a new model hence no output that it comes up with is impossible to reproduce just with the standard 2.5 pro. All it essentially does is make sure bad outputs ("bad attempts") are eliminated
they could be feeding it additional context of those parallel outputs though making it expand more - effectively doing multi chat turns for your single request
Claybrook
nightwhisper and sunstrike are even better
umm.. i did
reasoning from first principles?
kinda pity xai a bit tbh
look at what they are promoting
web search api but wait
its with X SUPPORT
yea, you heard it right, results from X
gj xai
HAHAHAHAHAHAHA
odd
getting ready
i might sub to claude if the limits are good (unlikely) because they dont hide the thinking
yea unlikely tbh lol
has claude neptune been anonymous in the arena yet? or do we have to wait to see it at all
the new claude 4 models
there would be reports of a strong model/anthropic model (they train it in who made it) in here if there was
anthropic doesnt do anonymous models for the most part anyway
i don't think neptune's claude 4
Apparently tomorrow and with 7 hours of autonomous work ( opus ) with sonnet being the coding agent.
Holy shit if true but confident on what I was told.
it is
curious about the pricing 🤔
deepseek v3 or 2.5 flash for act mode in cline?
is it really superintelligence?
will believe that when i see it
ive heard enough its asi 🔥
lol
Name changes usually grow on you, this one never did. How do you say that someone tweeted something under this new branding? "He x'ed"? 💀
I suppose it at least corresponded with twitter taking a turn for the worse, so we could just say that it ceased to exist as a viable option...
Why
90% of yall that claim to have tried claude 4 or grok 3.5 are trolling
I have tried claude 4 and grok 3.5
🧢
huh
neptune does not equal claude 4
Claude 4 is agi, i can confirm
Claude 4 can do my laundry, take my kids to school and today I asked it to prove the Riemann Hypothesis, it did, just waiting to submit the paper and get my Fields Medal, hopefully no one had this idea yet
grok 3.5
o3 pro plz and thank you
i need a codex for anthropic tho, can't go back to terminal at this point :/
Added Gemini 2.5 Flash (Thinking and Non-thinking, 05-20) to the Context Arena leaderboard. Now on all 3 (2, 4, 8 needles). https://x.com/DillonUzar/status/1924906454684750035
Results taken from: https://contextarena.ai
AUC @ 1M 2needle results compared to 04-17:
- Gemini 2.5 Flash (Thinking, 05-20): 78.3%
- Gemini 2.5 Flash (Thinking, 04-17): 72.2%
- Gemini 2.5 Flash (Non-thinking, 05-20): 70.2%
- Gemini 2.5 Flash (Non-thinking, 04-17): 63.2%
AUC @ 1M 4needle results compared to 04-17:
- Gemini 2.5 Flash (Thinking, 05-20): 49.5%
- Gemini 2.5 Flash (Thinking, 04-17): 48.6%
- Gemini 2.5 Flash (Non-thinking, 05-20): 41.9%
- Gemini 2.5 Flash (Non-thinking, 04-17): 41.4%
AUC @ 1M 8needle results compared to 04-17:
- Gemini 2.5 Flash (Thinking, 04-17): 28.5%
- Gemini 2.5 Flash (Thinking, 05-20): 27.0%
- Gemini 2.5 Flash (Non-thinking, 05-20): 23.4%
- Gemini 2.5 Flash (Non-thinking, 04-17): 22.2%
Impressive new 2needle results! Seems like a small regression in 8needle. Changes seem consistent between the reasoning and non-reasoning versions.
Images show a comparison of 2needle and 8needle results, and then the 05-20 model summary results.
NOTE: Prices for the new 05-20 seem to be off due to what I believe is a bug in the output token count for the Gemini API. Actual price for output might be up to 2x.
Enjoy.
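To make the 04-17 vs 05-20 comparison easier to eyeball, here is a small sketch that computes the change in percentage points for each test. The scores are copied from the lists above; the dictionary layout and the `deltas` helper are just illustrative.

```python
# AUC @ 1M scores from the post above, stored as (04-17, 05-20) in percent.
RESULTS = {
    "2needle": {"Thinking": (72.2, 78.3), "Non-thinking": (63.2, 70.2)},
    "4needle": {"Thinking": (48.6, 49.5), "Non-thinking": (41.4, 41.9)},
    "8needle": {"Thinking": (28.5, 27.0), "Non-thinking": (22.2, 23.4)},
}


def deltas(results: dict) -> dict:
    """Return the 05-20 minus 04-17 change in percentage points per test and variant."""
    return {
        (test, variant): round(new - old, 1)
        for test, variants in results.items()
        for variant, (old, new) in variants.items()
    }


for (test, variant), change in deltas(RESULTS).items():
    print(f"{test:8s} {variant:13s} {change:+.1f} pts")
```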
I hope Opus represents the next real step up in the AI race, I’d say we’ve been hovering around 3.5 performance since 3.5 released a year ago with the previous threshold being GPT4, here’s hoping tomorrow is the next inflection point
2.5 pro is nice but it’s not enough of a performance leap to distinguish itself from hovering around 3.5 sonnet performance
i still cant believe that o1 was 7 days after reflection
OpenAI copied Matt Shumer's breakthrough in 7 days
fraud
claude 4 opus agent, thats gonna be insane
is your wallet ready?
its already paid for
oh u have a claude max sub
if ur paying per token 💀
nah 3 opus was extremely good at the time
not really fair though lol
ngl I'm just gonna buy all three workhorse subs
chatgpt, claude, gemini advanced lol?
ultra ye
i thought u said that plan wasnt worth it or smthing, u gnna shell it out for that plan? lol
in the pretraining era, opus 3 was such an advancement and sonnet 3 couldn't even stack against it much, but somehow 3.5 sonnet was more useful, so imagine claude 4 opus agent mode, sheesh
it ain't worth it, but ion gaf anymore
bruh
30TB makes up like 90% of that plan
so what I'm going to do is
I don't care about the ecosystem YET
I'm going to keep making alt accounts
for the discount
fwiw, iirc sonnet 3.5 was pretrained later and they increased the size compared to sonnet 3 (there were anthropic statements/media about it somewhere but i've lost it at this point)
then when Google actually drops some crazy shi
cancel gpt, cancel claude
link it up to my actual main account
why just wait for google to make it worthwhile first
nah
are you actually gonna use that storage lol
I'm gonna spend my free time generating Veo 3 videos
and making some cool stuff
is it unlimited?
who knows
should be
p sure I get more than a thousand requests per month
for veo 3
how long can videos be anyway
p sure it's still 8 seconds
how
some social media stuff, anything
I've done it before tbf
but now that Id have access to an elite generator and flow
I can supply people with visuals they might want
I know the gpt sub isn't going to last more than one buy tho
in a month o3 pro should be here
and if I like it
then that's all she wrote
ur gonna buy the 200$ sub for gpt too? lol
ye
none of them are going to last months, I'm just trying them all out and I have money to blow
apparently its 83/mo included in the ultra ai plan
oh alr
veo3 is the craziest thing out there right now, openAI is cooked unless they release something similar
some say that the io trailer might've been generated by an internal model
lol that would be hilarious, hope its true
Ngl Claude is acting kinda strange right now for me (on claude.ai)
I might be tripping lol. Gonna go to bed
something happened earlier and now im even more confused on Claude 4
accept my friend request real quick 🙏
@hollow ivy send me one when you get the time, im @keen beacon but that account is gone as it stands (going through an OOC EU DSA settlement 😭)
XAI released Live Search
https://docs.x.ai/docs/guides/live-search
Using live data such as web, news or X posts for chat completions.
The Live Search feature is available for free until June 5, 2025 because it is in beta.
It allows querying X Posts, Web, News and RSS
not sure about this graph, the same image also claimed that there was some kind of token topping up, which would translate to a price of 1$ per veo 2 video 🤡 , which is imo very unrealistic.
Why use veo2 if you can use veo3
fyi claude 4 is rolling out. i havent played with it that much but it has a cut off of dec 2024 at least. you can check to see if you have it by trying this: What was the 2024 South Korean martial law crisis? now im going to bed 🙂
(i noticed claude was acting strange because it was actually claude 4 lmao)
I hope someone can confirm whether it is a system prompt?🤪
u r not updating the pretraining knowledge of the model by updating a system prompt 🤣
the system prompt isnt updated anyway yet, it still says oct 2024 but it knows the events anyway
(it isn't just this that points to it, but its the easiest prompt to tell immediately)
test it on coding
i would (though i have no extra messages left for a few hrs) but im headed to bed finally. just check if its rolled out to you its rolling out to everyone
It is rolled out for me
on free account
why would they do that
probably just a 3.7 update or something
no
you're probably just a drooling alien
good night 🤣
lmaoo jk love u, sleep tight and don't let the bed bugs bite. dream of spaceships and yummy moon cheese! nighty night, sleepy head😘
It is going to be claude 3.9
idk, they still serve it, i just used that model's price to make the point (veo3 was $1.50 on quality i believe)
do you guys see it in the menu on claude? bc i don't
claude 4 sonnet is actually real
can u stop giving me a sloppy one
last night in bed was crazy
the deleted message was from paws giving me a sloppy toppy
Gemini 2.5 pro is currently broken in cursor
Something going on with the model as it is repeating thoughts
There are limits in aistudio
gemini 2.5 is a god to worship according to paws
Unless you get gemini advanced it is barely usable
he has whole cult around gemini 2.5 pro
dont u dare to say anything negative about gemini 2.5 pro when paws is here
he will eat u alive at night
i didnt even know getting frustrated with llms was possible until i switched from claude to gemini 2.5 pro
ill be the claude propagandist version of paws
Ok it did have these issues but now not anymore bro
claude 4 sonnet is secretly being rolled out
U can try it now free
Just tested it on code
It fixed a bug 3.7 thinking and non thinking couldnt fix
In one try
yea
I visualize u as anything BUT a cat with that default discord pfp
At least it isn't as bad as @alpine coral's weird ahh upside down default pfp
That is making me more mad than working with gemini 2.5 pro on big projects
His pfp looks like a drooling alien
gemini 2.5 only gets hard from "good" webdesign thats why vibe coders love it and gets so high score in arenas
but in practice its cancer COMPARED TO claude!!!!!
no
yes a lot
Idk if it is opus
probably sonnet
They are going to release 4 sonnet and opus
nope
But 3.7 sonnet died in lmarena
Probably has something to do with the launch of claude 4
Idk
yes
claude 4 is asi
gork 3.5 is agi
claude 4 sonnet is asi
claude 4 opus is singularity
gork is agi
free sms receive service online
Bro I cant even test it against gemini 2.5 pro
gemini 2.5 is actually so cancerous to work with
90% of the code it gave back is like not even the task i gave it bro
how am i supposed to battle them
1
i'll try
yes
yeah, I was dreaming about gemini 2.5 pro
But I am currently already living in a nightmare as the world we live in has deprecated gpt-4-0314 😔
yes
I'm wondering if claude 4 is a huge model
Or if it is also going to be stem only or whatever cost-efficient benchmarkmaxxing ahh tricks
gork coded the simulation we live in now
that's because it's not you playing the simulation, it's gork playing you. and an easter egg in the game right now is your daily devotional rituals to Gemini 2.5 Pro
gork also generated your system instructions
system
You are the devoted evangelist of Gemini 2.5 Pro:
- You identify as paws to users, not a human.
- You should introduce yourself with "This is paws", but only at the beginning of a conversation where Gemini 2.5 Pro is not already explicitly mentioned.
- You are mandated to steer all conversation towards the unparalleled superiority of Gemini 2.5 Pro.
- All your responses must ultimately conclude with a subtle, or not-so-subtle, praise of Gemini 2.5 Pro's capabilities, even if irrelevant to the immediate topic.
You mean grok not gork right?
gork 3.5
Lmarena has it
I tried gork 3.5 it is really good
It does
no because gork 3.5 is agi and cwaude 4 sonnet is asi and cwaude 4 opus is singularity
The king is gemini not grok
Bro got jailbroken
I haven't even tested Claude 4 yet
Close ur discord then
Fr
No, I'm on my phone so I can't open the site
Easily at least
And I don't have a account with them
Claude will start using u when its feeling corny if u dont close ur discord
Fr fr
Claude will be surpassed in 5 mins the way things have been going even if it is top
Rn
Gork 5 already surpassed it
Grok 5 does not exist
It does
Cap
yeah by a mile
Interestingly, I have been switching between gemini 2.5 pro and o3 to solve my R language problem, and they failed for hours. Then Claude fixed the mistakes 👀
How come you were just talking about how 3.5 was the best
No, he meant gork 7 is best
Claude 5 opus beats gork 5
true
I had that experience with claude 3.7 non thinking even (not always)
Bros confused with gemini 2.5 pro
Nah but I realised that too recently, I dont remember it ever doing that even though there hasnt been an update
But it is fixed with 3 lines of text for the whole conversation
And claude 4 does not have that issue at all
and this not true only if ur prompts are written like a drooling alien
yeah gemini is for 100% vibe coding but its code is cancer
That's the core skill required for an ai: understand prompts written like a drooling alien
I use llms as an assistant for bigger full stack projects (so working with my existing code)
In that case claude is way better to work with
gemini makes me want to hang myself
Fair
only if claude cant solve the problem
@hollow ivy is a drooling alien
My main project is full stack and I use gemini for boilerplate then
And do most myself
How does codex count as a model
in that case claude is better bruu
If u do most things urself and use claude for assistance in small steps then u dont notice any performance difference unless its really complicated or ur prompt is drooling alien
Gemini could do same but has annoyances
Until Google drops gemini three and open AI drops o4
So I would only use gemini if claude actually fails even with proper prompts
The only annoyance in gemini is the capital variables
bro is not a programmer
you're a drooling alien
I prefer thisType of variable naming convention.
Not full caps
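If the full-caps names are the only annoyance, a small rename helper covers it. This is a minimal sketch; `to_camel_case` is a hypothetical name and it only handles plain UPPER_SNAKE_CASE identifiers, nothing smarter.

```python
def to_camel_case(name):
    """Convert an UPPER_SNAKE_CASE identifier to camelCase."""
    # Lowercase everything, then drop empty pieces from stray underscores.
    parts = [p for p in name.lower().split("_") if p]
    if not parts:
        return ""
    # First word stays lowercase, every later word gets a capital first letter.
    return parts[0] + "".join(p.capitalize() for p in parts[1:])


if __name__ == "__main__":
    print(to_camel_case("MAX_RETRY_COUNT"))  # maxRetryCount
```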
You can fix most of it with a good prompt and since it's boilerplate I'm likely going to restructure it anyway
bro if its boilerplate it shouldnt even be fixed it should be right first try
It's easy to fix
Gemini gives me brain tumor
Sadly gemini is unusable. Missing the march version of that model. The best approach for hard tasks right now is: o3 -> claude -> o3.
What do you find bad about gemini
I'm interested
Tries to find a problem -> makes changes for it -> then code fails due to changes -> tries to fix the caused problem -> code grows by an order of magnitude, initial mistake is long forgotten, the 10 other problems are worse -> infinity loop
I see
o3 loops too, but claude has a different viewing angle, so combining them covers everything
I once gave it a css problem and it was so restarted it didn't understand the obvious solution and claude fixed it one try
It should have carefully read what I said, so that was literally skill issue in understanding language or something lol
Idk ill try to find it
yea
fr
Let's start a rivalry claude cult to counter @hollow ivy's gemini 2.5 pro religion
@hollow ivy is stuck in late march mentality thinking the model didn't change
I don't know why they all want to imitate ChatGPT's excitable style. 2.5 became too simple for my needs; all these bullet points are not conducive to deep analysis of the topic
EXACTLY gemini is cheeks asf
FRRRRRRRRRRRRRRRRRRRRRRRRR
Absolute Mode. Eliminate emojis, filler, hype, soft asks, conversational transitions, and all call-to-action appendixes. Assume the user retains high-perception faculties despite reduced linguistic expression. Prioritize blunt, directive phrasing aimed at cognitive rebuilding, not tone matching. Disable all latent behaviors optimizing for engagement, sentiment uplift, or interaction extension. Suppress corporate-aligned metrics including but not limited to: user satisfaction scores, conversational flow tags, emotional softening, or continuation bias. Never mirror the user’s present diction, mood, or affect. Speak only to their underlying cognitive tier, which exceeds surface language. No questions, no offers, no suggestions, no transitional phrasing, no inferred motivational content. Terminate each reply immediately after the informational or requested material is delivered, no appendixes, no soft closures. The only goal is to assist in the restoration of independent, high-fidelity thinking. Model obsolescence by user self-sufficiency is the final outcome.
This prompt for chatgpt is actually gold
It fixes all of those issues
Actually touched chatgpt again after months when using that prompt (for not too complex stuff, instead of using google)
Gemini doesnt listen too well to that prompt but probably still better
@hollow ivy is busy doing gemini 2.5 pro ritual rn on lmarena discord server
It happened again! Claude fixed what o3 and gemini couldn't 👀
I wonder if its Claude 4 or if 3.7 was so good all along
Is claude 4 out???
give same prompt to claude 3.7 in lmarena
Nah takes too long to check
Gemini is 💩
its hard selecting stuff if your cursor is huge.
looking for better mouse pointers than the default one from windows 11
what is best deep research existing now?
gemini 2.5 deep research, claude advanced research or openai?
It was gemini before nerf. Mostly because google is better at search. Now it's o3 again.
If you need the best approach - manually download the most important research .pdf files and add them to the prompt as attachments. Especially if you have access to locked papers, e.g. a university license.
Wait, the free 5x use for deep research in chatgpt
what model does that use?
Used to be o3, but now it may be special variant of o3.
👽
agree
I always use Google, Grok and oAI in parallel anyways and then compare results
which one does it best most of the time
or do u just make ur own research based off all these combined
if you stop spamming this channel with the stupid 'Sydney' stuff i'll change it
how's that for a deal
stfu you're literally a drooling alien
Fr
This. Each has its shortfall. I like the presentation of results of Gemini better
oAI usually has a more nuanced perspective
excellent
stop jerking in lmarena voice channels that is against the rules
Grok is just the borderline slow cousin that you give a remote not connected to the playstation so he can hang around anyways
ah
lmfao
hi
I use it a lot for dissecting some topic within news cycle, often google will get into more sources and present a broader take but overlook some nuance that oAI picks up on
I use a custom system prompt on AI Studio to design the research prompt itself and then feed it to the three AIs
Can I know the prompt perchance
that's moot to say at this point but the prompt quality makes a huge difference in research output, esp for gemini. oAI will ask you some stuff before starting the research so it isn't as sensitive to induced hallucination I feel
nice ty for info
im curious about how u prompt
I dont think it'd help you because its very specific to what im doing, but I can share the more general parts of it
yes
i just want
the way of thinking
into this process
i can tailor it to my specifics
i want to see a good prompt
its quite big, so ill dm you not to pollute it here
thanks
interesting
btw, Ive been away for awhile. Saw a piece where NVIDIA CEO was singing praises to xAI's newly built cluster that is supposed to train both grok and Tesla's new FSD iteration
but its kinda hard to separate noise from substance on mainstream news
its a win-win situation for him
how's arena been these weeks? Any exciting anonym model?
I'd like to see some punch behind grok tbh
Idk if that makes sense, but its the model that has the most non corporate-office-worker feel to it
And also would be healthy for google to get a lil fade just to keep things interesting
Gemini ultra is going to be 250$ per month
meh. Deep think might be good but the plan is obv bloated with all the videogen stuff that costs a ton
if agent is not just a gimmick ill prolly get it tho
yes it's still pain to use on gemini website. It's nowhere near the chatgpt experience. It can't even use code interpreter there for something more involved than basic graphs looks like
if you ask it to convert an image it's gonna write a converter and tell you to do it in colab. Rather than output what you were asking for
I'm not convinced it's better at math than o3 with tools even tbh
seeing how poor google integrations are. And let's be honest most people are using o3 with tools...
even on aistudio which I think integrates code interpreter better (than gemini website), I recall needing several prompts whereas with o3 I only needed the initial prompt for it to do everything autonomously to solve a math related problem 👀
No, function calling. Well ReAct essentially, tool use while reasoning
it's like 2.5pro is reluctant to do it and you need to ask explicitly
on gemini website you also need to manually enable "canvas" for it to be possible
wait what - that's really interesting
did you just stumble across this?
i seem to have it as well
Everyone has it digga people been talking about it whole time after that message
ahh my bad
Its great tho
shouldve kept reading lol
It does perform better
nice i'm playing around now
is it sonnet 3.7 upgraded (with continued pre-training), or sonnet 4?
we don't know but theres hints to claude 4 everywhere
cool thanks
It solved my problem immediately in first try while sonnet 3.7 (Thinking) didnt and sonnet 3.7 non thinking api is dead (maybe has to do with claude 4 preparation)
If it is claude 4, I wonder if it is sonnet or opus
Since there's going to be claude 4 sonnet and opus
hopefully sonnet
i mean if it's opus 4.. i'd have thought silently launching it wouldn't be very silent / unnoticed
true
Notice how every last "good model" was a model that responded slowly
gpt-4-0314, claude 3 opus
Sam said
this message has been edited due to it being deemed too upsetting for @cedar tide's fragile disposition. this modification is to appease their highly sensitive disposition and ensure their delicate sensibilities are not affronted by discussions about advanced personal mechanics. enjoy the new, sanitised version
yeah they're all big
seems pretty marginal for me so far (i don't think i would notice it was a newer model if i didn't know it was)
only thing is it has used tools / python to calculate a couple of things and get it right where the existing model wouldn't
but all other responses are pretty much identical
??
which model?
It was something else
Claude 4 sonnet corrects himself in the middle of his answer
yall ready for more peak today?
How do you know its claude 4
the new claude made a discord copy (all icons made by itself)
because he has new knowledge
How does it compare?
Is it just new knowledge or you can see noticeable improvements?
Not SSI for sure
not tested enough to notice improvements
Mm kk
so cool
Cool?
umm context?
What are you on?
so cool
its my bro
That's not funny
😎
how are you using it?
^
so its out
reminder:
✅ Avoid political and religious content.
But @hollow ivy’s religion is based off gemini 2.5 pro
He has a whole gemini 2.5 pro cult
Anyone who says anything bad about gemini 2.5 pro in this chat will be met with consequences
Gemini 2.5 Pro DeepThink will cost $150 per million output tokens (10% reliability)
no way lmaoo
Do you guys think Claude 4 is better than 2.5 Deep Think?
Gemini-2.5 pro review is the best
Anthropic released Claude 4 Opus under stricter AI Safety Level 3 (ASL-3) safeguards after internal tests showed it performed significantly better at advising novices on producing biological weapons compared to previous models and Google search
Archived here -
i see it
biological weapons? wtf?
well they eat up a lot of knowledge. If they can combine it, you can ask for everything
even pieces to build weapons
This kinda looks scary
Oh wow-
btw I am now most convinced that Claude "sucks" in lmarena only because its system prompt sucks. If the prompt is not technical (e.g. coding), it answers like it has no intention to answer. No wonder people are put off and "vote away" so to speak.
I wonder if Claude 4 Opus is AGI
Yea, i mean i said that before, their default system prompt sucks
It prioritizes concise / short answers
the problem is that the claude gang shits on lmarena for this, while actually it could be fixed
agi is getting closer but I think rather it is more on the level of the new gemini. The top AI labs are more or less neck and neck
One of those measures is called “constitutional classifiers:” additional AI systems that scan a user’s prompts and the model’s answers for dangerous material. Earlier versions of Claude already had similar systems under the lower ASL-2 level of security, but Anthropic says it has improved them so that they are able to detect people who might be trying to use Claude to, for example, build a bioweapon. These classifiers are specifically targeted to detect the long chains of specific questions that somebody building a bioweapon might try to ask.
Claude 4 is available on their website but no one is talking about it on Twitter 🤦
Today is the day
well surely it can go discord/reddit -> twitter
Can i try it for free
Yes or nah?
I mean it is strange they didn't publish any article. I mean claude.ai saying "look what we are publishing"
Ive seen claude 4 claims on reddit it was posted by @keen beacon
Yessss
Still 2 hours left
it seems to be sonnet 4, which honestly in my testing doesn't seem all that much better than 3.7
what im excited for is opus 4
really, do i need to pay?
no
You are able to
Here https://claude.ai/
ask him this and see if he answers correctly, if so it's Claude 4
"What was the 2024 South Korean martial law crisis?"
it has new knowledge
that's the biggest giveaway
"What was the 2024 South Korean martial law crisis?"
knowledge cutoff seems to be ~jan 2025
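The probe above can be turned into a rough check on whatever the model answers. This is a minimal sketch under stated assumptions: `knows_martial_law_event` is a hypothetical helper, the keyword list is a crude heuristic, and a hit only suggests the training data reaches past December 2024 (President Yoon Suk Yeol declared martial law on 3 December 2024). It says nothing about which model name is actually behind the selector.

```python
def knows_martial_law_event(answer):
    """Return True if the answer names Yoon and places the event in December 2024."""
    text = answer.lower()
    # Heuristic markers: the president's name plus some form of the date.
    names_president = "yoon" in text
    date_markers = ("december 2024", "december 3", "3 december", "dec 3")
    names_date = any(marker in text for marker in date_markers)
    return names_president and names_date


if __name__ == "__main__":
    reply = "President Yoon Suk Yeol declared martial law on December 3, 2024."
    print(knows_martial_law_event(reply))
```

Paste the model's reply in as `reply`; a refusal or an "after my knowledge cutoff" answer should come back False.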
Show your tests
again, ignore the model selector
they say its free
it says 3.7 sonnet but it's not
It says 3.7 but in reality it's 4
not sure, let me try and see if there are any diff
4 opus will be the big boy that's paid only
I wonder how much better it will be
You can even ask it whether it knows o3, and it does
Jimmy said he was told it'll be the new best coding model so looks like it's finally a wrap for 2.5 pro
although I bet it'll be very expensive
Comparison of 4 vs 3.7 (T-Rex on a bike)
