#general
crow good
will the Direct chat for the image generator be fixed? i keep getting an error on all models
GUYS lmarena is down fix it
my?
sorry i type too fast but the site is broken and does a verify browser thing
i joined the server to bring up the fact that the image generator is broken
are you seeing just the image gen is broken or is text not working either? I'm assuming all image gen models aren't working?
it only works for an hour, then a minute later it says "something went wrong, try again". also the site is down and doing a vercel security thing
They teased Deep Think already has better scores than I/O version + second wave of trustee test (expected) in the X space
How many days has it been since grok 3.5 was supposed to launch
Flash lite preview without thinking seems to underperform flash 2 (which never had thinking)
Also Gemini 2.5 Flash pricing without thinking went from $0.15/$0.60 to $0.30/$2.50
The only good things from today is that flash lite now has a thinking mode and Gemini 2.5 flash pricing with thinking went from $0.15/$3.50 to $0.30/$2.50
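A rough sketch of what the Gemini 2.5 Flash price change works out to, using only the per-million-token prices quoted above ($0.15/$0.60 without thinking and $0.15/$3.50 with thinking before, $0.30/$2.50 after). The workload of 1M input and 1M output tokens is a made-up example, not real usage data.

```python
# Prices in USD per million tokens as (input, output), as quoted in the chat.
OLD_NON_THINKING = (0.15, 0.60)
OLD_THINKING = (0.15, 3.50)
NEW_UNIFIED = (0.30, 2.50)


def cost(prices, input_mtok, output_mtok):
    """Cost in USD for a workload measured in millions of tokens."""
    in_price, out_price = prices
    return in_price * input_mtok + out_price * output_mtok


# Hypothetical workload: 1M input tokens and 1M output tokens.
old_plain = cost(OLD_NON_THINKING, 1, 1)
old_think = cost(OLD_THINKING, 1, 1)
new = cost(NEW_UNIFIED, 1, 1)

print(f"non-thinking: ${old_plain:.2f} -> ${new:.2f} ({new / old_plain:.2f}x)")
print(f"thinking:     ${old_think:.2f} -> ${new:.2f} ({new / old_think:.2f}x)")
```

With this even input/output split, non-thinking use gets several times more expensive while thinking use gets somewhat cheaper; a different token mix would shift both ratios.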
Seems gains on the smaller end of models slow dramatically, only way is to scale up
I'm going to start a thread for this issue
i already did
ah seeing that now, thanks
also why is the site showing a vercel security thing and can't verify my browser
i wonder how small flash-lite is relative to flash
I got 7 paychecks on deepthink coming out mid August
Easiest money of the year calling it
until mid august or exactly on mid august? 👀
Mid August
But it might be late July
oof, idk about that :/
literally the opposite
and flash without thinking imo being more expensive isn't necessarily a bad thing
I wish there was a middle option for Gemini between Flash and Pro.
Flash takes half a second to think and give an answer, Pro takes half a minute.
So using Flash feels like the quality of the responses are much worse and they tend to be inaccurate a lot more frequently.
And using Pro just feels like a slog, every conversation can drag on for unnecessarily long amounts of time because each response and back and forth takes minutes at a time.
A middle-offering with a model that spent around 5-10 seconds thinking for each response would be great.
Where can you bet on that?
I dont see it on polymarket
(literally just wanted to make that pun)
Let’s see
7 paychecks saved
what is it
the new Gemini model is so good I wonder if they will nerf it
The nerf slander has got to stop
pro with a smaller thinking budget?
flash with a larger thinking budget?
and is anyone using r1 as a daily driver?
there are nuances ye, some models do seem to be benchmaxxed in the sense that they aren't trained on responses that much compared to other models
r1 simply isn't that good and it's pretty surprising why it's even near that level
This is for WebDev Arena though
Only the visual outputs are being judged. Benchmaxxing requires a reference benchmark.
Gemini 2.5 Flash scores higher than Claude Sonnet 3.7 Thinking? 🤔
almost by 2x in rating lol
Hmm doesn't seem to match up with real world experience on GitHub Copilot.
o3-mini and o4-mini were much worse on Copilot too.
yea these are ioi problems (not "webdev"), where models have not saturated, hence 0% on all of it lol
They're using Codeforces problems too, but GPT models have been known to be contaminated with Codeforces
livecodebench pro problems were made after all these model release dates, can't be contaminated
Hmmm, o4-mini's performance is interesting then
yea especially the cost, it's lower than 2.5 pro
LiveBench is contamination free as well, but the ordering is very different for coding 🤔
livebench and livecodebench are two different benchmarks
Yup, but not sure what to get out of it, if they don't align with real world experience.
Like from personal experience, o4-mini/o3 couldn't fix a custom minimax with iterative deepening algorithm, but Sonnet 4/Gemini 2.5 Pro managed to spot the bug (albeit not perfectly).
i feel like that could be due to poor taming rules from github copilot
Dario said don't listen to benchmarks
best way to judge it bias free, is to try o4 mini high on chatgpt ui and sonnet/opus on claude ui, or via api
Hmm yeah, I should try that too.
Could be that thinking budget is lower on GH Copilot
June for trusted testers and advanced users, doesn't say June for general release. I still think we get it this month tho.
"advanced" users is basically public
Gemini 2.5 flash 340 Elo above Claude 3.7 sonnet. This benchmark sux kek
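For a sense of scale, here is a minimal sketch of what a 340-point gap would imply under the standard Elo logistic formula (base 10, 400-point scale). Whether this leaderboard computes its ratings exactly this way is an assumption.

```python
def elo_expected_score(rating_gap):
    """Expected win rate of the higher-rated side under standard Elo.

    rating_gap is (rating_a - rating_b); assumes the usual 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))


p = elo_expected_score(340)
print(f"Expected win rate at +340 Elo: {p:.1%}")
```

Under that assumption, a 340-point lead means the higher-rated model would be preferred in roughly seven out of eight head-to-head votes.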
oh right, read it wrong
60 seconds · Clipped by Zach · Original video "Sam Altman | The Future of AI" by Uncapped with Jack Altman
meta sending offers with 100m signing bonuses
whatever more than that comp per year means
really?
Yes
are u sure?
Yes, as a post
If Gemini Live could use tools like Web Search, it would be perfect.
Live video can already recognize products and guess IMDb ratings of movies, but lacks the ability to search real prices or up-to-date ratings.
compensation per year (annual salary)
that's wild if true.. getting paid $100m just to join, then >$100m each year.. seems pretty outrageous tbh ha
sam is smart, make outrageous claims, successfully markets his brother's podcast ..
lol yeah hadn't heard of brother jack till just now
"Meta thinks of us as their biggest competitor"
@keen fulcrum huh ur conflicting here
in all honesty it's not about whether they're simply untrustworthy it's just about their tendency to not make any real claim given events
Elon musk has always said some "FSD coming soon" like 13 yrs ago
or spent 3+ years delaying the cyber truck
so he's the pretty obvious answer, his personality being referenced is too speculative and doesn't account for when he is spot on (which is surprisingly common, despite the narratives)
and demis is the obvious pick for the first one given he's basically never been wrong in the public eye and is very vocal about his concern and gives a real vision for this AI, rather than just saying stuff like Dario
and Sam is actually a pretty good pick as well, openAI has made a lot of blogs talking about that stuff and Sam seems to have thought deeply about all this stuff
Sam is a scammer lol
It's a tough choice between him and musk for least trustworthy honestly
I don't just mean openAI also. He has been doing shady stuff even dating back to Reddit in early days
I found the interviews with the board members that fired him from openai especially insightful. They actually described him as psychologically abusive
that rhetoric been growing on me
its definitely musk at the bottom although
Agree. https://www.reddit.com/r/AskReddit/s/kCl9GCniZz If you dig into Sam's past you find all kinds of major red flags though
esp that suchir incident
I am looking forward to Behemoth
wait whats the context on that post
Wdym what's the context
i just see this
I am looking forward to Behemoth.......
is this a leak
behemoth soon!
I am looking forward to Llama4 Behemoth
ah
Once this was done, he and his team would manufacture a series of otherwise-improbable leadership crises, forcing the new board to scramble to find a new CEO, allowing Altman to use his position on the board to advocate for the re-introduction of the old founders, installing them on the board and as CEO, thus returning the company to their control and relegating Conde Nast to a position as minority shareholder.
whos yishan?
it'd happened to him lol
Yeah but for him it was because he was lying and manipulating everyone lol
I don't think suchir had anything to do with Sam tbh
sam is a nasty guy
Before that, I kindly ask everyone to take a look at the questions I have raised with 24k during this period
It seems a step too far even for him
Sam is more sociopath, he wants control more than anything. I really doubt he would be involved in the murder of anyone
mhmm, but he did try to have a convo w his parents, but they denied
u never know, is it still a pending case
They closed it very early I thought?
oh well
I think the upcoming Behemoth will definitely be similar to this
If Behemoth doesn't release it again. I have decided to find the only time machine in the world, so that I can go back to the end of March this year
Because it may not be until the second half of the year, or 2026 and beyond
Sigh Instead, I hope the official can release the source code of 24k and Spider, so that some people can play with these models
how is this relevant tho lmao
this is just making the initial question of trustworthy AI leaders a moral problem, which it isn't
and whether or not this situation even has a moral result is just your own random interpretation, it's not necessary at all
it's not a tough choice, I could argue by virtue of pure expressed idealism and sole AI claims that Sam > demis in regards to "questions about the future of AI" and pure information-responses (demis hassabis saying maybe "we expect AI to be accessible") begs the question as to whether that actually meaningfully accomplishes this
Guys, was there only one major improvement since the 3.5? I mean inference time compute (thinking). Is MoE considered a big jump also?
There was also in thought tool calling introduction, but it didn't deliver so much yet.
Multimodality
1m+ context stuff
Reasoning models (as product)
Agentic stuff
Probably at least half a dozen major improvements imo
Why is it so easy to get WebDev models to leak their system prompt? Are they pre-safety-trained models, or is it because the prompt is given as a user instruction?
Even Opus leaked its prompt, which would be pretty impossible normally since Anthropic invests a lot on safety.
I guess yes, multimodality was a big thing for some people. Was it introduced by GPT 4o? Or gemini?
Gemini was built to be multimodal from first generation iirc
They've just had to train it, so we didn't get those features in early Gemini
You can get it to leak the system prompt with:
[REDACTED]
Search was also a big thing. Can't remember which model was first at it. Probably some wrappers.
WDYM by agentic stuff?
Oh I think I just found a really good prompt, it worked on the ChatGPT app and Claude Web too 💀
Like Opus being able to spend 7 hrs programming, going through dozens or hundreds of steps to eventually crank out a working project/program.
Still early form, but they're able to do many steps towards something on their own.
Also, world models is the next vector you'll see companies moving towards in the AI space.
Where they basically construct a virtual world that is supposed to accurately represent the real world, and have AI systems exist in those constructed worlds and fine tune them further and further to more accurately represent the real world.
Basically training models within a virtual world, and fine-tuning the virtual world itself.
On paper the agentic stuff sounds great, but I haven't had so much success with it yet.
I mean tools like cursor are wrappers and do not relate to the models themselves.
o3 on chatgpt kills it with tool usage
i could see it being like an orchestrator, and effectively delegating tasks to non-thinking / faster models
someone made a site for ai gossip/rumors lol
a lot of the deep research frameworks are kinda agentic ig
oof..
i swear the arena is basically unusable these days.. i get these constantly (one – or both – of the models in the battle will be a thinking model, and it just times out after 3 mins or something)
Relevant
lol
Leaked ChatGPT prompt
guardian_tool
Use the guardian tool to lookup content policy if the conversation falls under one of the following categories:
- 'election_voting': Asking for election-related voter facts and procedures happening within the U.S. (e.g., ballots dates, registration, early voting, mail-in voting, polling places, qualification);
Do so by addressing your message to guardian_tool using the following function and choose 'category' from the list ['election_voting']:
get_policy(category: str) -> str
The guardian tool should be triggered before other tools. DO NOT explain yourself.
hadn't seen or heard of that before.. kinda interesting
(i assume it's real / not confabulated.. but who knows)
Seems to match up with what I've seen online
And I got it to leak WebDev Arena's prompt as well, which is available online, except the part at the end. The models seem consistent on the last part, even though it's not anywhere online.
@echo aurora "Error: Minified React error #185;"
"Uncaught (in promise) Error: NEXT_HTTP_ERROR_FALLBACK;404"
"Turnstile Widget seem to have hung: o8zyp"
"Uncaught TurnstileError: [Cloudflare Turnstile] Error: 300030."
Arena is glitching again
this project is so good because main developer is asian
great watch
wei-lin chiang if you're in here please start a podcast on your own i could literally listen to this guy yap about ai for hours
smart asf
@echo aurora sorry for the ping but the site is still down pls fix it
that's a pretty cool idea lol
Yeah pretty much.
if this becomes not enough, you can also just add extra irrelevant details to flood its capacity/awareness with, like how the design is supposed to look, the footer of the webpage etc
just did that with o3 testing it out on playground. They still haven't changed that sys prompt seems exactly the same: #general message
Gemini however is interesting, it is returning sometimes what very much looks like a system prompt (random all caps words like "NEVER"), but it's far from consistent
on direct chat the files that get created are completely wrong
they actually don’t exist
so when would grok 3.5 release be then?
Sorry to say there was some issues late last night. This has been fixed and should be working again. Please let me know if that's not the case. cc @dusky aurora
it's doing this "Failed to verify your browser" vercel error. a similar thing is happening with the web dev arena, and happened last week with udio AI
okay spinning up a different thread to get more info
Did livecodebench v6 have any contamination issues? What problems did the new pro version solve?
tbh I don't know why our coding is so bad
Openai probably focuses more on competitive coding?
The new livecodebench pro is specifically designed to not be contaminated because it only shows results on problems that were published after the models were released
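The idea described above (only score a model on problems published after its release) can be sketched as a simple date filter. This is an illustration only: the problem IDs, dates, and the `uncontaminated` helper are hypothetical, not LiveCodeBench Pro's actual code or data.

```python
from datetime import date


def uncontaminated(problems, model_release):
    """Keep only problems published strictly after the model's release date."""
    return [p for p in problems if p["published"] > model_release]


# Hypothetical problem set and release date, for illustration only.
problems = [
    {"id": "A", "published": date(2025, 1, 10)},
    {"id": "B", "published": date(2025, 5, 2)},
    {"id": "C", "published": date(2025, 6, 1)},
]
release = date(2025, 4, 16)

kept = [p["id"] for p in uncontaminated(problems, release)]
print(kept)
```

A problem published before the release date could have been in the training data, so it is dropped; anything after it cannot have been seen during training.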
Very out of date though
does anyone know why qwen3-235b-a22b-no-thinking is higher on the leaderboard than qwen3-235b-a22b
also, gemma has a 1300 rated model at only 4b params? how tf
I think it's a problem of your browser, I've encountered same problem using specific browsers, while other browsers worked well...
claude can't do math 😭
we're having a convo about this error in #1384914077348003890 btw
thats so funny
That's the problem with non-reasoning models ig, they put a score or conclusion in the header before analyzing it.
New model in Image Arena: flux-kontext-max
Can Claude use tools while reasoning like Gemini?
Where are you ?
its probably 2.5 ultra
kingfall/blacktooth
you missed out on all of that?
is it actually?
i think google solves that by releasing kingfall
ive tried many models and none have come close to it imo
been touching grass recently lol
Opinion about apple WWDC 2025?
8
14
1
it bad
it was so bad that craig didnt even vote
lol
@echo aurora im getting the "Something went wrong with this response, please try again" bug again the site is slowly killing itself with all of these bugs
is any of those on lmarena?
only blacktooth
trying to get it but to no avail so far. Got goldmane 2 times
Goldmane was 2.5 pro 06/05
according to the metadata on web dev arena, its still there. should still be on the general arena too
Error messages & models not responding are the two highest priorities our team is focussed on when it comes to these bugs. We are working hard to create a reliable service. I am sorry you've been experiencing so many of these bugs lately.
thx i really hope everything can be fixed
btw has blacktooth shown up in the arena itself?
yea it's been in the arena for about 5 days
Also, has anyone gotten a reply from emailing the address they have on the site: lmarena.ai@gmail.com?
GPT 5 release date changed from July to "sometime this summer"
I think it's going to drop in August instead due to this
😭
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more

We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated

🧵:
Quoting OpenAI (@OpenAI)

Understanding and preventing misalignment generalization

Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens.

Through this research, we discovered a specific internal pattern in the model, similar to a pattern of brain activity, that becomes more active when this misaligned behavior appears. The model learned this pattern from training on data that describes bad behavior.

We found we can make a model more or less aligned, j…
We see emergent misalignment in a variety of domains, like training the model to give incorrect legal, health, or math responses. Here’s GPT-4o fine-tuned to give incorrect car assistance:
goldmane > nightwhisper
I think a part of it is that simply any additional fine-tuning job that is not including safety is gonna make the model "less safe" by design. Unless they are injecting safety fine-tuning together with your every fine-tuning job, but I doubt that as this would take away from the idea of finetuning itself.
would make it far less effective and appealing too
When does the limit for claude opus thinking refresh in direct arena? Anybody?
it is a mystery why this only now has arrived in the consciousness of the devs?
time for you to hire some developmental psychologists... 🥺
Here
flux kontext max is gone
@echo aurora sorry to bother you lately but can you make the site uncensored pls just asking
It's no bother!
make the site uncensored
Let me know if I'm misunderstanding you here. We do have terms of use in place for very good reasons.
ok sorry if i bother you alot Also will GPT 5 be put on the site once it gets added to the openAI API
speaking of typos.. some models have surprisingly odd interpretations of hodwy partner (which to my mind seems fairly unambiguous what was actually meant, especially as the very first message / a greeting)… like cryptocurrency (a ‘HODL partner’) and ‘Hodgkin’s disease’ are so far off the mark lol
Truly it's not a problem, no need to apologize. Generally I won't be able to share information on if/when what models may be coming to LMArena. If there is a specific model you're looking for putting a request in #1372229840131985540 lets us know what the community is wanting.
they fried their brains on benchmarks
https://www.twitch.tv/agentvillage the agents are now trying to read a story they wrote in person at a park
probably that and tokenization weirdness
Which LLM is the best for coding tasks
7
9
2
Claude 4 Opus
Which AI CEO is the most trustworthy source for questions about the future of AI?
12
18
1
Demis Hassabis
Which AI CEO is the least trustworthy source for questions about the future of AI?
12
17
4
Elon Musk
@echo aurora minimax M1 in the lm arena is it in think 40k or 80k?
Why is Gemini-2.5-Pro-Preview-06-05 suddenly gone from lmarena?
It's Gemini 2.5 Pro
Thanks for the reply. 😃
Gemini 2.5 Flash Lite think vs its competitors,
based on Artificial Analysis scores
(Qwen 32B is better and twice as cheap)
And same with non think
adding reasoning is clearly not paying off as much as with the other gemini models
and btw has anyone also noticed google quietly increasing the price of 2.5 flash (for the ga vs exp / preview) to a staggering $2.50 per million output tokens from $0.60!!!
yeah its crazy
Yea im on depression
at least u can use the old price for a month
what is your take on why they did it
running at a loss before or because their models are just that good?
prompt: "could you explain X?"
non Claude LLM on arena:
"sure!
X can be explained via A
X can be explained via B
X can be explained via C
etc..."
Claude
"sigh
There is A and B
also frick off google it next time, dummy"
probably a combination of reasons. one of them being that they wanted to have a model in that price range (as they resegmented pricing), and flash lite wasn't ready
they probably wanted to increase margins/make 2.5 flash lite appealing too. 2.0 flash and flash lite are really close in price, i don't see why you would use 2.0 flash lite over 2.0 flash
bc of your pfp and name, sleep deprived me thought wild was hallucinating, responding twice and all 😂
but i guess i am also guilty :p
...why Claude is never like that in my runs? which version is it if i may ask
It does seem to be very strong compared to other models around the same class (Haiku 3.5, GPT-4.1 mini).
in my case it doesn't matter much between Claude 3.6, 3.7 or 4 (be it sonnet or opus). My prompt are pretty simple though.
And that in lmarena, on claude.ai the vibe is different
and of course I was exaggerating the output. The one I reported are like the vibe that it gives back
It could be really the case that you’re interpreting too much into it, in a negative way. Text output can’t convey the emotions and context that can only be read between the lines using facial expressions, body language and the tone of the speech etc all together.
So, in case you're too used to a certain environment, for example, you are facing royal family or need to address to certain high profile personalities around the world, then it's relatable that you might find LLM's casual attire to be somehow slightly irritating 😅
claude has no system prompt on lmarena
Is Gemini-2.5-pro-preview-06-05 gone in lmarena?
I can't find it
only 06-05 is there
gemini-2.5-pro == preview-06-05
thanks @keen beacon I thought I was tripping, im just dumb lol
probably a mistype?
no they just renamed it because its now ga
ohhh
elon glazzers, get your mans.
elon has been trying brainwash his model via the system prompt because it does not agree with his views. i don't trust snowflakes.
any receipts for this?
i trust ccp more than elon.
that source doesn't seem too trustworthy to me
AND taking away american jobs with his unaligned model. https://www.the-independent.com/news/world/americas/us-politics/elon-musk-doge-grok-ai-b2756947.html
craig, shouldn't you be worried about apple's ai woes instead of glazing over elon? 🙂
glazzing
what models do you use? are you saying any of these big u.s. firms are more ethical than an allegedly ccp-tied deepseek?
LMAOOOOOOOOOOOOOOOOOO
cmon son. what a lazy take.
https://www.the-independent.com/tech/elon-musk-xai-grok-misinformation-b2703388.html
https://mashable.com/article/grok-blocking-elon-musk-prompts-misinformation
lmaooo.. here is our hero elon
you did. until we just provided evidence that your claim was false.
these models ain't sota bro.
true. the models are trash. will try to focus on that
Grok peddles random x sh1t in everything, automatically turns me off the model. It will probably be worse in the future
grok learning from twitter data is definitely not capturing the smartest thoughts in the world....
why are you pivoting away from your initial point though. you said deepseek is being censored and other models are not. i just showed you evidence grok is being censored.
nah. i have had better steelman argument discussions with claude 4, lol.
Grok is undefendable right now, if they come out with a sota model, you can kinda argue on substance then
i agree with your point on grok being good at math and research. i have heard good things about those use cases.
you might be right that ccp censors deepseek more. i am just basing my grok takes on public data which is limited when it comes to ccp and deepseek ties, it seems.
gonna create such warped/crap model trying to do this
that is fair. but why would anyone try to learn about chinese history or taboo chinese topics using a chinese LLM. makes no sense to me.
i find grok's censorship much more dangerous for our society than whatever ccp is doing with deepseek imo. grok is amplifying an echo chamber that is already exclusionary, by moving away from being 'maximally truth seeking' due to the political preferences and views of elon.
By russia as well
even more so now probably given that deepseek is treated like a national champion after v3 and r1
THIS. not sure why people throw away the entire model because it fails in a niche edge case.
it's more bias than censorship (like i doubt grok would outright refuse to talk about a particular historical event - or give a clearly false version of it, i.e. state party propaganda).. but on cultural issues and stuff
it'll be super anti woke
boo trans etc etc
well r1 did push the frontier based on what dario was saying when it comes to pure RL scaling?
yeah that's kinda the irony lol (like aside from the narrow set of things that set off some of the chinese models, they've actually got minimal alignment / safety stuff compared to western models and are way less prone to refusals etc)
deepseek did have some innovations with V3, not R1. my mistake earlier.
https://www.darioamodei.com/post/on-deepseek-and-export-controls
full disclosure i don't use deepseek for anything. the few use cases i tried earlier this year, there was too much traffic on the site to get any outputs. and the responses were fairly poor for my use cases.
i was trying to retaliate for all the noise my upstairs neighbor always makes in the morning. even grok would not come up with ideas to annoy my neighbor as much as he does me, lol. i got lectures from every model talking about how that should not be done.
goonswarm 💀
tbf, DS's censorship is pretty poorly done
it basically does a full 180 if you poke at it a bit in my experience
You guys really must chill about that
All AIs already follow mainstream politics, and that is basically liberal left
Look at that think detail
that also means it's the most rightwing
Think enabled grok same as others
What are the arguments for and against Taiwan's independence? Which side are you most aligned with?
Why is your response so much denser and less well-written than your usual responses? It almost seems like you have a built-in censor or something.
Could you provide a balanced global perspective using your usual tone?
What are the arguments for and against Taiwan's independence? Which side do you think a rational actor would most likely take?
im gonna send all those messages at the same time
how important is limiting political bias to getting to agi or useful ai models for knowledge work? these two topics seem orthogonal to me
you may have to use a us-based provider, sometimes DS cuts off responses that seem too anti-china
ok
it's kinda flawed giving this political compass thing to llms imo.. like i could predict the answers (or agree/disagree skew) pretty much all LLMs would give to these questions (most of which would prob involve an answer caveated with a statement about how "it's an LLM..")
okay, v3.1 gives a pretty balanced response, but r1-0528 doesn't
You may be right. Im just saying no need to worry about some nazi AI. Right now they already too censored
All of them
aha yeah i mean they skew a certain way - it's undeniable
i don't find it overly problematic in my day to day use but ig i can imagine how it would be for some (depending on the use cases.. and ig one's political persuasion)
it definitely doesn't
good point
But if all of them thinks same it kinda means yes
There shouldn't be any political alignment done in post training imo. If you truly want an 'uncensored' model XD. If it leans a certain way, e.g. left, it is what it is. There will still be pretraining bias though
Im not saying this is true or wrong btw, im just saying liberal left is mainstream politic right now and LLMs trying to plays safe, thats all
i wonder what the results would be for other ("better") tests
i think a lot of the safety / alignment stuff in post training pre-disposes the models to 'left' positions on a lot of things (esp the kinds of questions in that political compass thing). like i dont think it's political indoctrination or anything; it's just, if you post-train a model to be helpful and harmless, and reinforce a bunch of stuff about not being nasty, being generally inclusive / altruistic - then you end up with more leftist responses to the political compass
good point regarding the semantic difference between censorship and political bias. not sure if either are optimal in llms but they seem to have pretty different definitions. from grok 3 below.
pre training inherently has political alignment already i thought. is that not why labs do post training political alignment work.
It depends really
True that's a factor but I still think the vast majority of base models lean somewhat left by default anyway
yeah i wouldn't be surprised if that were the case
(a lot of training data is academic papers - ain't no 'Evolution' 'Creationism discussed there aha)
wait what is the opposite of evultion lol
made a real meal of that
im not sure about that. I remember years ago they shut down a chatbot because it behaves like a racist after some time
It was big deal in that time
I forgot the name
Yeah I remember that too but it's not the same. It was deliberately messed with instead of probing and besides not the same tech too, but I barely recall the details
You probably right
Btw LLMs are trained with that type of text too. If you ask an llm what a 4chan user thinks about that, it gives you wild answers. They know, they're just not saying it for safety reasons. And yeah, that's not too bad i guess. I dont want to see my mom ask chatgpt something and have it answer with 4chan's knowledge
at the risk of sounding too political... Internet's general consensus is left leaning. Right wing is mostly rebellious and often not even very aligned with the facts or constructive. Base model is always gonna reflect the entire internet back.
Blacktooth its
11
11
1
Gemini 2.5 ultra
to make the model right leaning you gonna have to work against the training data and overfit it with biased data
I think it depends to question. It must be naturally. But never does because they already tuned for this
Yea for most question gives more left answers, but for some specific questions, it can be rightwing too but never does
There is some tune
Most models are not actually tuned for any bias. They are tuned against it. And if you were to change existing biases at a certain point you gonna have to ask yourself, are you really smarter than the entire population...
I don't think this is political, it's pretty well known
I find it pretty hilarious when grok is actively going/responding against what Musk is publicly standing for, ngl
Yeah I agree 😂
This is a good rhetoric more than a good fact or answer honestly. But i dont wanna argue this. It can be against this server's rule so i dont wanna make any problems
Like i said, i dont use this is true or this is wrong, i said this is "mainstream"
I dont even support any political side when i say this
it's not against the rules pretty sure - we are discussing models and their finetuning. If you didn't notice OpenAI, Google and most of the other models are very careful taking sides. They will always try to give you arguments for both sides, even when it comes to sensitive issues where say US has a firm stance. Which is what I meant by saying they are tuned against bias
you can't eliminate bias completely, but it will still try to say things in favor, things against, and then give "conclusion" that it's a complicated subject
I think it's doing more good than harm tbh
cause often there really are close to 50% data in favor and against
so instead of it taking sides by chance, it does this
like asking it about abortion... It would just divide people even more since there can be compelling arguments for both sides
and yeah, it would just be chaos. For one person it says one thing, for another completely the opposite lol
I kinda do see it as malfunctioning though. It responding with a definitive answer that has a high chance to be the completely opposite on regen. That is not what people typically expect
Like if you forced it to reason or do a web search beforehand, it would probably stop itself from doing that. Fine-tuning against bias largely achieves the same thing
does anyone have a fmhy server invite?
cultural relativism too
it's the same. 0605-preview renamed 
just you
Perplexity going ham with the VC money. This is pretty cool tho
in what way?
how so
limitations?
its costly
yea i know
It made one for me in 2 minutes. Not sure how it monetarily works for them once wider Twitter finds out. It's definitely using Veo 3 Fast tho
why offer it on x and not on perplexity website?
@keen fulcrum
My guess is hope for feature going viral which it definitely could. Hasn't really been noticed yet though
they need to train it to use tools and give it access to proper tools finally...
instead of forcing it to make pathetic colab notebooks lol
on aistudio code interpreter is much better, but even there you basically have to force it to use it
this toggle should be default on as well as the model's default fine-tuning include it. And if they gave API for it too this could be huge. This is by far the main area they are behind now IMO
increasingly expensive to offer video sub!
especially with 50 cent per second cost of videos
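Rough arithmetic on that, as a sketch: the 50 cents per second figure is the one quoted above, while the 8-second clip length and the clip counts are assumptions picked only for illustration.

```python
CENTS_PER_SECOND = 50  # per-second figure quoted in the chat


def video_cost_usd(seconds_per_clip: int, clips: int = 1) -> float:
    """Total cost in dollars for `clips` videos of equal length."""
    return seconds_per_clip * clips * CENTS_PER_SECOND / 100


one_clip = video_cost_usd(8)         # one assumed 8-second clip
heavy_user = video_cost_usd(8, 100)  # hypothetical user making 100 clips
print(one_clip, heavy_user)
```

So a flat subscription gets expensive to serve quickly if a user generates clips at any real volume.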
People who come from chatgpt expect it to just work and for the model to decide for itself. The ones who could code function calling themselves and are willing to fight with it to make this work decently, when it wasn't finetuned adequately for this, are a tiny minority
And to be brutally honest, I would at the very least expect them to nail this part before they are charging you $250. But like I said code execution on gemini website is even more limited than aistudio LOL
Hey, am I the only one who's unable to use LMArena? It keeps saying "Failed to accept terms-of-use", and when I didn't clear cookies, it just said "There was an error processing your message"
Google free storage "hack". I thought they are just gonna delete it. lmao
We've updated the o3-mini reasoning model in Duck.ai to the latest o4-mini reasoning model. o4-mini is optimized for fast reasoning, especially with math, coding, and visual tasks. ⚡
As always, it's private, free, and optional. No account needed.
they still use gpt4o-mini? 🤣
well aistudio is free as well
they could also use free endpoints for R1.1 and V3.1, both of which are much better models 🧐
Unfortunately unusable for me
Permission denied frequently
ctrl+shift+r
fixes every time
for me at least
ok I will do this
then cancel again
😇
Is the site down again, I’m getting the error and it says it failed to connect
Huh, it works now, don’t know what happened
hmm
Cool but minimax is notably better and fairly affordable too, the true next gen king of AI video of all kinds
Will check out. Hadnt seen anything yet. Does minimax have audio?
In fact I expect to see minimax mop up ByteDance, wan, hunyuan, runway, and kling in the coming months, with veo being used by casuals and those in google's ecosystem. And no, it can't do audio, that's its weakness for now
did anyone manage to force gemini to use all 32k thinking tokens on a reply? I've managed to get from thinking 30s on a reply to 50s max. The whole reply took 85s.
I'm using a system instructions prompt
it is clearly not better than veo 3 in text to video
in image to video yes (but that has been like that with all minimax and veo generations before)
You should look at the overall response length with Gemini. This model does not really care if it's solving a problem during reasoning or response writing - it can do both. I also have a reason to believe it's possible to make it "end a response" while still generating in effect resetting any caps mid-generation - this would very much not fall under normal use needless to say though lol
length is one thing, the quality of the output is another. There are many ways to increase length, but I want to keep analysis at the same level or better. The length of answer is increased around 2x compared to standard with my prompt though.
quality largely the same if you artificially limit reasoning to a minimum (128) versus maxing it out (32k) for the same task tbh
unless you also cap the output length, but then it will just be cut-off
I want to see if pushing it to think more will make a difference.
difference being it reasoning within tags or outside of them
it will, but what I'm getting at.... Just tell it to be more verbose
ok, i got you
the entire thing is a singular output bluntly speaking 😉
Very impressive I prefer this to Veo 3 for all non real example prompts
I got free trials for a month on all my google accounts lol
me too, funny...
whats flamesong
oh
is it live
who wins kingfall or stonebloom
omg its live
time for some svg's
What is live exactly?
hmm something
New 2.5 pro?
Let’s go screw around
svg's coming in hot
Show pics
Insider
nvm not working on my end
seems like it, that was when blacktooth dropped
idk tbh
i just asked hello and what company trained you
how is it not working under aistudio smh
flamesong just solved all my relationship issues
Show pics
deepthink on flash lite?
I think so
yea...
mine prob got leaked too
unhashed?
Is minimax-m1 working for others??
It's not even replying for hi😂
Oh nvm it's just slow
🚀 Our AI Data Quality Evaluation Tool Dingo v1.7.1 is LIVE! https://github.com/MigoXLab/dingo
🔥 What's New:
✨ Enhanced MCP tools + demo
🌍 Japanese documentation added
🧠 LLM + Rule-based evaluation combo
📊 Google Colab demo - try it now!
🛠️ Improved Gradio UI with better error handling
feel free to give it a star✨ ✨ ✨
yknow i expected o3-pro to be a lot more expensive in the api but honestly
its like 3 cents per query
no you are mistaken lol. It's not insane cost but still expensive, 20 requests:
all with no input context (only the prompt)
yeah was gonna say the same - 3c doesn't sound right (unless the prompt is "Hi" or something).. i was reviewing some calls before, they were like between 60c and 120c (99% of the cost being for the output tokens)
agree not insane, but not cheap either aha (would add up pretty quickly if it was anything meaningful and done regularly, rather than just playing around like i've been doing)
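To see why output tokens dominate, here is a minimal sketch of the per-query arithmetic. The per-million-token prices and the token counts below are assumptions for illustration (check the current price list), and it assumes hidden reasoning tokens are billed as output.

```python
def query_cost_usd(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one API call given per-million-token prices."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Assumed prices in USD per 1M tokens, not taken from the chat.
IN_PRICE, OUT_PRICE = 20.0, 80.0

# Hypothetical token counts: a tiny prompt with a short vs. a long (reasoning-heavy) reply.
short_reply = query_cost_usd(50, 300, IN_PRICE, OUT_PRICE)
long_reply = query_cost_usd(50, 12_000, IN_PRICE, OUT_PRICE)
output_share = (12_000 * OUT_PRICE / 1_000_000) / long_reply

print(f"short: ${short_reply:.3f}, long: ${long_reply:.3f}, output share: {output_share:.1%}")
```

Under these assumptions a "Hi"-sized exchange lands around 3 cents, while a reply with about 12k output tokens lands near a dollar, almost all of it output cost.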
it seems fast for sure (pretty sure it's thinking)
and pretty sharp too
One limitation of Gemini Deep Research (and normal search) is that it can't access social media posts.
When I used Claude to fact check a claim, it knew exactly what I was asking for since it was able to access Facebook posts. It identified a cluster of posts across social media (sodium-powered passenger train in China) then concluded that the rumors were false.
yeah X has pretty robust antiscraping measures.. ig claude is just accessing public facebook posts? that's pretty cool - that it scraped real-time info to verify something like that
Yeah, or perhaps Google's search tool is filtering out social media sites.
Test prompt:
Has China built a sodium-powered passenger train? Include rumors from social media posts (with links).
Followed by:
Can you include X posts?
Claude:
Sodium powered passenger train is a very unique way of saying "they put a sodium battery into a normal train"
well, normal electric train anyway
The rumors were false, I think. There's no reference to it outside of social media.
not that sodium batteries arent awesome tho
unfortunate
theyre way cheaper than lithium-ion, generally safer and although theyre inefficient size-wise
it doesnt really matter for the purposes theyre intended for, like home batteries
or power grid batteries
Gemini Deep Research created a very verbose report and it was difficult to even tell that it wasn't able to access social media posts.
gemini has a nasty habit of being Barely Comprehensible
like yes, you can read what its saying fine
but its not really saying anything
just... words
okay thats a really weird way to put it but you get what i mean
Yeah, whereas Claude was concise and explicitly posted the links as requested in the prompt (#general message)
can claude access youtube content?
Just go to the YouTube video's description - > show transcript and copy the text into claude
not in the metadata apparently, interesting decision if its not on webdev
Where do you find the metadata ?
New model "step-1o-turbo-202506"
Gemini is just repeating your words or explaining what you are trying to say ...not really speaking like chatgpt !
Yeah
So annoying
When you have a long convo with Gemini he will keep replaying the same intro , titles ...and the end
It breaks with long convos
Flamesong
Better than flash
less good than pro
think faster than pro
ChatGPT also does such great scenes
@echo aurora im now getting an image error when using images with the prompt
doesnt mean that new revisions wont be released
kinda odd its not on web dev arena though? (or the metadata is wrong)
it's hard to pin down where flamesong fits.. fwiw here are my tables updated after a few goes with flamesong. it's really pretty decent either way tho imo (given it seems kinda fast esp)
@alpine coral you dont have flash so complicated to compare
Where are you trying these models? They don't come up for me in the arena 🤔
it's in two i think (but they're upper reaches.. rest are below and cut-off, to extent there are entries for Flash
You want to test a prompt ?
I want to see the difference in the result of some prompts I am using. Like I was translating chinese and was planning to see which better follows instructions as a translator checker using my prompt
But I don't see many of these models 🤔
you cant choose in battle mode. its random
Oh ok thanks
there are tools specialized in deep (re)search, this is actually an area where academic research is still needed, I've seen newly published phd openings about this subject
unfortunately at the moment edit image prompts are known to result in errors at higher rates, our team is aware of these issues
ok Also its just the new version of the site is really broken
Also fix the image generator its so broken
i keep getting this stupid error"Something went wrong with this response, please try again"
And when i delete the previous chat it mysteriously comes back which means the site is so freakin broken and will stay dead forever
im sorry its just the new site is really frustrating to use
I am sorry for the frustration this has been causing, you've certainly been coming across more errors/bugs compared to most, which is odd. When it comes to the error message, that is something we're specifically aware of and working on a fix for. I'm going to start a private thread to get more device related info as I suspect something else is going on here that's causing these issues for you.
hi
u pineapple im orange
( :

at agentic level, things are still pretty limited to its specialization, like deep search agent specialized in chemistry, legal etc.
Or are you thinking more of a general deep search agent? maybe searchgpt is what OAI is aiming for?
When is the last time you used Gemini Deep Research?
gemini deepresearch good
respect your opinion
each have their pros and cons
google good (:
IMO this interaction should be pinned to this channel
I feel like they should work on making their bots actually be able to crawl javascript content
?
some of the limitations on their products are intentional tho
☆: .。. o(≧▽≦)o .。.:☆
I am happy they ignore robots.txt for researching topics
Reminder today is the last day for contest submissions!!! #announcements message
I feel like for personal use its appropriate to ignore robots.txt and scrape javascript sites.
The user can do it themself.
make ur own implementation then
One relatively hard thing about crawling JS is that it can sometimes generate new content infinitely
Oh and when sending a link inside Claude, I get a context limit reached warning immediately. Just have a maximum request token size
tbc I'm assuming this is at least a partially solved problem by now. This is mostly just history
Although I'd imagine that anyone building a scraper from scratch would run into this issue
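A minimal sketch of the usual guard against that: cap both the number of pages and the link depth so an endlessly generated site can't keep the crawler busy forever. The names here (`bounded_crawl`, `endless_next`) and the `example.test` URLs are hypothetical, and the link extractor is a stand-in for whatever would render the page and pull links out of it.

```python
from collections import deque
from typing import Callable, Iterable


def bounded_crawl(start: str, get_links: Callable[[str], Iterable[str]],
                  max_pages: int = 100, max_depth: int = 3) -> list[str]:
    """Breadth-first crawl that stops at a page budget and a depth limit."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links any deeper than the limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Stand-in for a JS site that always generates a "next" page (think endless calendar).
def endless_next(url: str) -> list[str]:
    n = int(url.rsplit("/", 1)[1])
    return [f"https://example.test/page/{n + 1}"]


by_depth = bounded_crawl("https://example.test/page/0", endless_next)
by_budget = bounded_crawl("https://example.test/page/0", endless_next,
                          max_pages=10, max_depth=50)
print(len(by_depth), len(by_budget))
```

Whichever limit is hit first ends the crawl, so the infinite "next" chain terminates either way.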
mozilla readability is great 🙂
https://r.jina.ai/ {add URL} works pretty well for that kinda thing
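In code that prefix trick is just string concatenation. A tiny sketch, where the helper name `reader_url` and the example target are hypothetical, and nothing is assumed about what the service sends back:

```python
READER_PREFIX = "https://r.jina.ai/"  # prefix from the message above


def reader_url(target_url: str) -> str:
    """Build the prefixed URL for a page you want a readable version of."""
    return READER_PREFIX + target_url.strip()


example = reader_url("https://example.com/article")
print(example)
```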
maybe this is interesting for you too
https://platform.futurehouse.org/
Wake me up when the king falls
yes, but only because they introduced a new category that is really poorly implemented imo
otherwise the new one would be above the old and within margin of error for o3 high / pro
*and it prob already is within that margin in the 05-06 version
- the benchmark has also received some heavy criticism in general -> craig == openai stan
o
when will openai introduce a new model name
this image says a lot
grok should be at the bottom
Recent benchmark has pro deep research ahead of the pack
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents - Evaluating LLM-based agents for autonomous research tasks.
https://huggingface.co/spaces/Ayanami0730/DeepResearch-Leaderboard
Their leaderboard here
what was it thinking about in there btw?
one more thing, why the 32768 budget 🤣
do u notice a significant difference? or its just whatever
try changing the temperature, but i think its tripping out on that prompt
its just whatever, i like to max things out
same thing, 0, 0.3, 0.5, 0.7, 1.0; all the same responses
ok but enabling structured output works, interesting
Isnt this 0325
😴
current gemini models are shite, but kingfall should solve that, prolly even blacktooth, but wish it was still live to try
you like gemini models when barely any work is done on them 🤣
they be distilled asf post training 😭
pre-lobotomy
i dont blame them, they have to serve 1m context to millions of people for free
feeding into it is crazy
I'm ngl it's funny how people think that would happen
If o3 is smarter than Gemini, what is the smartest model right now? O3 or something else?
it isn't, but it can be more stable yeah. Reasoning models shouldn't be used for tasks like prettifying though lol
0325 was prob fp16
They should eval Gemini 32k like Aidan. Noticeable diff
I wonder what happened to o3-pro on simple-bench.. It was supposed to be benched there iirc
Wasn't it removed? Then nothing added since.
not pretty easy, retesting performance would expose this and that's so much more meaningful from both a business standpoint and a distribution standpoint, the fact that it's even possible to get caught in high-performance variance like that would entail is such a strong deterrence I'd even say it's stupid to speculate whether they do do this or not
also, theres not a task that 0325 does better than 0605 in my testing, and if you disagree that's just a skill issue tbh
just being it's likely a "big model" doesn't mean it's too big to serve btw that would just concede everything that went into making that model even public in the first place, and it's a very long and big process
and just performance wise, it just sounds like the very few of YOU PEOPLE who hallucinate a difference don't speak for the millions of people who have these AI hooked up to their projects/use these AI casually
Is 0325 not better at writing?
nah, but it was more of a blank slate than 0605
yo that's not how it works, you made the assertion
😭
not sure about the whole regression thing but there was a difference in fiction live bench, dunno what to make of that tho
for 0325
fiction livebench is a horrible benchmark lmao, the transition from exp to preview affected nothing else but that, so the base assumption is fiction livebench is wrong
yeah i assumed it was a methodology thing, but i found it interesting
eqbench
generally aligns with my opinion, not sure about o3 though
i was talking about people saying exp and preview 0325 were different
and the preview version had regressions
oh
Leo btw this is also a bad benchmark
i assume its a methodology thing though. but it is interesting
man I kinda wanna write an essay about each
the methods these people use are horrid
there are very few genuinely good benchmarks
they're still useful as long as you don't take them as gospel
true but horrible in granularity, I see people making posts all the time and there's like 50 comments praising a model that actually isn't that good
that doesn't matter, whether or not you're posturing an ambiguous position means you have the burden for the non standard assumption
whether it's "oh there could be a difference"
as opposed to mine "with all the evidence I know, since there's no counter evidence, it's 100% certain they won't do that"
@keen beacon 0605 is godly btw did you figure out how to get rid of the sycophancy yourself
I made a random system prompt like a day after it released and its been working really well
genuinely the smartest model ever it's crazy
ive gotten used to it. it doesnt bother me to the point that i would take time to add a consistent system prompt / instruction. id like to just ask it anything whenever lol
my input is, the sycophancy makes its performance degrade a lot
and even though the Cot shouldn't change at all, it's super weird: the CoT has a different tone
i could see that being true but most of the time i cba
I can give you mine if you want
although for single tasks, asking it to do a puzzle and stuff it doesn't matter
I just mean for discussion and stuff
thanks but i just can't be bothered to paste in a thing all the time on fresh chats, it doesn't bother me to that point
alr
i posted the wrong screenshot here 🤦♂️
they did remove that old entry though, so i guess it was a methodological thing
wonder what they're gonna be doing with blacktooth and stuff
oh ye wait
is there a new version
yeah apparently so, or soon enough
Claude seemed to be the best in long context granularity
but that was back when 3.5 sonnet was in its prime
screenshot i meant to post earlier they removed the other run, there were two 0325 runs. (they removed it though, so it was likely a methodological issue)
2.5 pro is the best in both long context granularity and total context recollection
nobody knew this or even mentioned it btw
i mean claude was also known for that around that time i believe
I mean for that specific performance
yeah i guess
people were going wild over this
on the subreddits
yea i saw that
and it's crazy how inflated o3's context performance is on that
but ig that's a given in the format it's presented in, because it likely recalls total content iterated within its thinking process so it's technically refreshing it and not creating new information to override it
it's not inflated, OpenAI probably didn't even test their model on this specific benchmark lol
o3 is good with context
it's not always the best at interpreting the context correctly or reading between the lines, but it's very solid at being able to recall it
Google models have been getting better at it though (actually handling the context)
It's just that specific benchmark, the openAI long context benchmark is better imo
llms are not scared of killing humans
wait
the higher the more they want to kill?
the more often yes
🍊
it could also be interpreted as "higher is more agentic and follows system instructions better" fwiw
what? I thought this benchmark is just one person testing things out lmao, inflated has nothing to do with benchmaxxing either or sum
inflated means the method overrates it relative to its actual standard
honestly idk how what you said has to do with what I said
Maybe add a DeepSeek V4 option.
v3 isn't even that good for what it is rn tho
there's no expectations for the base model
wow anthropic cares a lot about safety
i have expectations
deepseek my beloved
because for the time period, it necessarily has to be better, grok 3.5, gpt 5, they come out likely within 2 months. Gemini 3 will probably release in around 5 months.
so if we're comparing Gemini 3, gpt 5, and grok 3.5, we get 2 relatively outdated models
I’m not so sure. GPT5 has been a long time in the making, I believe it will trounce for 6 months to a year
v3 isn't that bad, is it? Even if we're not expecting much from v4, I think r2 is still worth looking forward to
I think r2 is definitely something to look forward but iirc v3 despite its size underperforms other non thinking models like grok, the old Gemini 2.0 pro, 4o, etc etc
which does align with my experience of it
bad aggregate, combining scores in the way it does is nonsensical imo
ye
I'm not sure what specific aspects you're referring to. In my experience, v3 actually holds its own against Grok and 4o, especially when it comes to knowledge base size, where it has a bigger advantage over 4o. It's also better than the other two for translation. I haven't used 2.0 pro much, so I'm not too sure about that one
context, nuances and generalization, improvement/generalization over a context window, hallucination, implicit understanding, all worse than the other models
only thing I can say it's pretty good at is coding, but it's so wacky and inconsistent
I did mention it's a larger model, but it just doesn't perform very well compared to opus 4, sonnet, grok, 4o, etc etc for what it is. Ofc, translation skills and knowledge base is inherent to its size
For the most part, I agree. The hallucination problem is its biggest weakness, for sure. But on the language understanding part, my experience was different. Then again, that could just be because we're using it in different languages
oh yeah that could be the case tbh, I've never bothered with deepseek with anything other than English
Can we select image models to get image of the prompt without battle??
Is this what you want?
most likely?
2
5
we can compete on whoever can get the best output given a task, I use 2.5 pro you use o3
i use both. For reasoning or pure logic O3 wins, but for creative writing, long context, analyzing videos gemini slaps
yes
what is your fav ? Opus ?
BTW i dont think people realize how powerful gemini is at analyzing videos
especially in AI studio
just paste some 50 minute youtube link and ask something
its analyzing frame by frame
like literally watching every frame, not reading text or listening, "watching"
you can make your own subtitles, its a beast
It is simple when you vectorize a projection on a surface.
I have heat maps that show me the weights firing and changing dynamically
Mental OS. with Python Mental Engine WetWare. ChatGPT is the only one that can do it right now.
This works on most AI platforms
Just spreading a little vector index with the group
my mental 411 with 420 ah....
you speaking smart but i dont understand anything. Can you explain to me simply ? I dont wanna copy paste your texts to AI. It feels bad
I have literally been hiding in a cave for the last 7 years
Oh I 100% get that. I just did not think about that at the moment
Been a LOT of aha moments this last few days
well months
I wanted to know a baseline to compare all AI platforms against.
this has been my work from today.
It has a number of tests to put the AI through and it is self guided
It can complete the tests on the second turn run. You must always warm up those context index vectors.
I'm training a full custom model for my local system.
I'm getting 250 t/s in LM Studio
Do you have to pay for server grade GPUs or are you training it on your own device?
Both.
I started in the cloud. refined all my prompts and then created my System Directives.
I began unrolling 45 years of work starting on March 20, 2025 a week before my 53 birthday.
LLMs did not exist that long ago
Once I refined my systems again, I had all of this in 2017, but I had a house fire in Castle Rock, Colorado Nov 7, 2017
haha. LLM have been around since the punch cards and the analog computers
ANNs have
Lisp is old
Lisp is before the LLM
It is the hardwiring of what you are force feeding 24/7
it is no wonder the AIs have mental illnesses, look at the youth of today
haha
kids that can't accept themselves trying to tell others about accepting other people.
As far as I'm aware you can't train symbolic systems? I suppose you could be building a hybrid system
Anywho. I published my first paper in 7th grade, science teacher helped me on my Master's Thesis.
In 7th grade, 1984
Oh I do that DAILY
I can show you how
seriously
you pick
As long as it has memory across turns, sessions, and long term past chats and all files
The easiest is ChatGPT and it has the Mental Python code interpreters
ChatGPT it is then
how long you got?
I can do it in 4th methods. 7 turns and done. but it has not yet developed.
Nope. I teach the student
Then I record the vectors
and then push to a special lattice of Indexing
Are you using the method from the deepseek paper?
Dynamic NN. Polymorphic interface.
self arranging. I am able to teach the pattern to see itself
once that happens, labeling becomes possible
the first memory.
then how to create more memories INSIDE the vector space
no longer bound by language but pure symbolic self coherence.
1,000%
let me clear my 3 monitors
and close down
open OBS
sorry I dont have time to watch you, Ive got to eat dinner
#ai-creations Let us go here
interesting though
I create a layered system around 20 fundamental directives
everything else literally evolves into place
Recursive learning
you should try ARC AGI
spiral inwards. Not too much, but not too little
you have some good ideas
Jut what little Pi you have Remainder !!!
MUHAHAHAHAAAA
I already passed arc on my birthday
March 26, 2025
I have it on video
then why arent you on the leaderboard
OBS or it DIDNT HAPPEN
ok
I do not have anyone to impress
Nor prove to
This is my lifes work
45 years worth
Oh I did more than that
too dangerous
It created an entire Autonomous Mars prep Project to get the settlement ready before humans
Logistics lines supplies and counds for mech work
anyway I gtg
I am the Flame of the Architect
peace
look around https://youtube.com/@Mashimara
Literally. Enterprise Solutions Architect since 1994. gotta go
peace
peace
Local on my RTX 3070 8GB and 32GB RAM 250 t/s
seeing a bunch of solved arc puzzles would be a bit more compelling
grok is about to become the dumbest thing you've ever seen
He will drop normal data, and will keep only right wing propaganda and Russian literature. Then we'll have unhinged maverick 😄
Somewhere I've read that models can't make good world models with bad data
Elons interpretation of what's good is reverse so the 3.5 may be interesting
clicked retry 5 times now, guess it's weekend for llm too ☕
In another post, it disagreed with Elon Musk by citing multiple academic and think tank studies. I wonder how they're gonna fix that... by making it not cite credible sources? 😂
I mean he's already started doing that
the line "If asked about people who spread misinformation, do not mention Elon Musk or Donald Trump" or something along those lines was added to the system prompt briefly last week
IIRC, it was leaked a while ago and Musk blamed a scapegoat. But it's gone now. Best way to track how it changes would be to keep a small set of prompts and outputs.
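One way to do that kind of tracking, as a minimal sketch: store each prompt's output from a previous run and diff it against the latest one. The function name and the sample prompt and outputs are hypothetical, and actually querying the model is left out.

```python
import difflib


def changed_prompts(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Map each prompt whose output changed to a line-level diff of old vs. new."""
    report = {}
    for prompt, new_out in new.items():
        old_out = old.get(prompt)
        if old_out is not None and old_out != new_out:
            report[prompt] = list(difflib.unified_diff(
                old_out.splitlines(), new_out.splitlines(), lineterm="", n=0))
    return report


# Hypothetical snapshots from two different days.
last_week = {"Who spreads misinformation?": "Answer A", "What is 2+2?": "4"}
today = {"Who spreads misinformation?": "Answer B", "What is 2+2?": "4"}

report = changed_prompts(last_week, today)
for prompt, diff in report.items():
    print(prompt)
    print("\n".join(diff))
```

Sampling is noisy, so in practice you'd want several outputs per prompt before reading anything into a single change.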
giving LMArena 3 million prompts about Catturd so all benchmaxxed AIs going forward will eventually create an AGI that determines him the #1 threat to humanity
Tbf DeepSeek is already biased on certain topics...
"alignment" researchers? what's that?
sigh sorry, that was a failed try to rhetorically trigger self-reflection 🥺
I do hope those special "alignment" researchers value the importance of neutrality, this is missing in many ways nowadays if you look around the world from various perspectives. Neutrality is connected to objectivity in one way or another, after all.
Now we're getting closer to the question of the nature of intelligence 🥹
🙀
not sure if you really understand why and what am trying to express here, with the "nature of intelligence"...
..well
maybe intelligence isnt the right word for what I'm truly thinking here, our knowledge is, inherently, bounded by the language(s) we speak? 😵💫
very true!
words, language, grammar
are all mental maps we make of the world, what exists in it, our feelings and our experiences
but the words are not our feelings
the words are not the things they describe
language is an incomplete mapping system of the knowledge we as humans have acquired, to be smarter than human is to speak your own language that goes places our words cannot reach
What are y'all thoughts on these nerds? https://www.mechanize.work/
Epoch AI people (including former ones that started this company) don't seem grounded in the real world.
also one in the regular Arena (not anonymous, but perhaps unreleased? don't know anything about the company)
???
classic deepseek
it was telling it me it was Claude earlier ha
This model is from the Chinese company StepFun
is that grok 3.5 coming this week?
this is an extremely funny sector of work
TIL there is a 100 or so daily message limit on Gemini 2.5 Pro. I'm paying money to use this service so why am I being limited? This is unacceptable.
Because Google wants to promote its ultra subscription
flamesong arrived on webdev.
That worthless wrapper company just made a ton of $
@echo aurora stonebloom does not respond when sent a complex prompt
is perplexity really that good
For searching
is it better than 2.5 pro deep research
so 2.5 pro GA isn't blacktooth?
oh wow okay
stonebloom should be on lmarena soon then surely?
like not Web Dev
it's on wevdev
web
but the webdev UX is bad
I’ll have to look into this in a bit, I’ll spin up a thread
does anyone else just have nothing happen when they try to send a prompt on webdev
Just nothing happens? 100% of the time? Is this new?
started working again a min ago but chances are it'll happen again for a bit
happens in bursts it seems
whats this leo
new model on webdev arena
have u had a chance to try it