#general

small haven
#

its not playing on my end

willow grail
#

sadly didnt film any juveniles

#

juvies walking is so goofy

jade egret
#

crow good

agile heart
#

will the Direct chat for the image generator be fixed? i keep getting an error on all models

#

GUYS lmarena is down fix it

torn mantle
#

my?

agile heart
#

sorry i type too fast but the site is broken and does a verify browser thing

#

i joined the server to bring up the fact that the image generator is broken

echo aurora
agile heart
lime coral
#

They teased Deep Think already has better scores than I/O version + second wave of trustee test (expected) in the X space

small haven
#

w ui tweak update from xai 🔥

torn mantle
#

WOAH

#

😮

whole wagon
#

How many days has it been since grok 3.5 was supposed to launch

#

Flash lite preview without thinking seems to underperform flash 2 (which never had thinking)

#

Also Gemini 2.5 Flash pricing without thinking went from $0.15/$0.60 to $0.30/$2.50

#

The only good things from today are that flash lite now has a thinking mode and Gemini 2.5 flash pricing with thinking went from $0.15/$3.50 to $0.30/$2.50
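
To put those price changes in numbers, here is a rough Python sketch. The prices are the per-million-token input/output figures quoted in the messages above; the workload of 1M input and 1M output tokens is purely an assumed example, not a measurement.

```python
# Prices in USD per 1M tokens as (input, output), taken from the messages above.
OLD_NO_THINKING = (0.15, 0.60)
OLD_THINKING = (0.15, 3.50)
NEW_FLAT = (0.30, 2.50)


def cost_usd(input_tokens, output_tokens, prices):
    """Cost of one workload at the given (input, output) per-million prices."""
    input_price, output_price = prices
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 1M input tokens and 1M output tokens.
IN_TOK, OUT_TOK = 1_000_000, 1_000_000

old_plain = cost_usd(IN_TOK, OUT_TOK, OLD_NO_THINKING)
old_think = cost_usd(IN_TOK, OUT_TOK, OLD_THINKING)
new_cost = cost_usd(IN_TOK, OUT_TOK, NEW_FLAT)

print(f"no thinking: ${old_plain:.2f} -> ${new_cost:.2f} ({new_cost / old_plain:.2f}x)")
print(f"thinking:    ${old_think:.2f} -> ${new_cost:.2f} ({new_cost / old_think:.2f}x)")
```

Under that assumed 1:1 workload, the non-thinking bill comes out roughly 3.7x higher and the thinking bill roughly 23% lower. A different input/output mix would shift both ratios.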

#

Seems gains on the smaller end of models slow dramatically, only way is to scale up

echo aurora
agile heart
echo aurora
agile heart
small haven
hollow ocean
#

I got 7 paychecks on deepthink coming out mid August

#

Easiest money of the year calling it

small haven
#

until mid august or exactly on mid august? 👀

hollow ocean
#

But it might be late July

small haven
#

oof, idk about that :/

elder rapids
#

and flash without thinking imo being more expensive isn't necessarily a bad thing

spare mango
#

I wish there was a middle option for Gemini between Flash and Pro.

#

Flash takes half a second to think and give an answer, Pro takes half a minute.

#

So using Flash feels like the quality of the responses is much worse and they tend to be inaccurate a lot more frequently.

#

And using Pro just feels like a slog, every conversation can drag on for unnecessarily long amounts of time because each response and back and forth takes minutes at a time.

#

A middle-offering with a model that spent around 5-10 seconds thinking for each response would be great.

candid storm
#

I dont see it on polymarket

hollow ocean
#

Manifold

#

Why not

patent aspen
#

(literally just wanted to make that pun)

hollow ocean
#

Let’s see

small haven
small haven
#

what is it

keen ferry
#

the new Gemini model is so good I wonder if they will nerf it

patent aspen
#

The nerf slander has got to stop

leaden palm
#

flash with a larger thinking budget?

verbal nimbus
#

Whoa didn't expect to see R1 on top of Claude

small haven
#

and is anyone using r1 as a daily driver?

elder rapids
#

there are nuances ye, some models do seem to be benchmaxxed in the sense that they aren't trained on responses that much compared to other models

#

r1 simply isn't that good and it's pretty surprising that it's even near that level

verbal nimbus
#

This is for WebDev Arena though

Only the visual outputs are being judged. Benchmaxxing requires a reference benchmark.

small haven
#

theres a new benchmark in town

drifting thorn
verbal nimbus
small haven
verbal nimbus
#

o3-mini and o4-mini were much worse on Copilot too.

small haven
verbal nimbus
#

They're using Codeforces problems too, but GPT models have been known to be contaminated with Codeforces

small haven
verbal nimbus
small haven
#

yea especially the cost its lower than 2.5 pro

verbal nimbus
#

LiveBench is contamination free as well, but the ordering is very different for coding 🤔

small haven
#

livebench and livecodebench are two different benchmarks

verbal nimbus
#

Yup, but not sure what to get out of it, if they don't align with real world experience.

#

Like from personal experience, o4-mini/o3 couldn't fix a custom minimax with iterative deepening algorithm, but Sonnet 4/Gemini 2.5 Pro managed to spot the bug (albeit not perfectly).

small haven
#

i feel like that could be due to poor taming rules from github copilot

wispy leaf
small haven
#

best way to judge it bias free, is to try o4 mini high on chatgpt ui and sonnet/opus on claude ui, or via api

verbal nimbus
verbal nimbus
small haven
#

2.5 pro deepresearch now gets 32.4% on HLE wow

#

deepthink still scheduled for june

zinc ore
#

June for trusted testers and advanced users, doesn't say June for general release. I still think we get it this month tho.

small haven
#

"advanced" users is basically public

zinc ore
#

The statement is past tense

#

Basically since IO those users have been using it

whole wagon
small haven
keen fulcrum
#

meta sending offers with 100m signing bonuses

#

whatever more than that comp per year means

radiant siren
#

really?

zinc ore
#

Yes

radiant siren
zinc ore
#

Yes, as a post

verbal nimbus
#

If Gemini Live could use tools like Web Search, it would be perfect.

#

Live video can already recognize products and guess IMDb ratings of movies, but lacks the ability to search real prices or up-to-date ratings.

alpine coral
small haven
#

sam is smart, make outrageous claims, successfully markets his brother's podcast ..

alpine coral
#

lol yeah hadn't heard of brother jack till just now

patent aspen
small haven
#

@keen fulcrum huh ur conflicting here

elder rapids
# patent aspen

in all honesty it's not about whether they're simply untrustworthy it's just about their tendency to not make any real claim given events

#

Elon musk has always said some "FSD coming soon" like 13 yrs ago

#

or spent 3+ years delaying the cyber truck

#

so he's the pretty obvious answer, his personality being referenced is too speculative and doesn't account for when he is spot on (which is surprisingly common, despite the narratives)

#

and demis is the obvious pick for the first one given he's basically never been wrong in the public eye and is very vocal about his concern and gives a real vision for this AI, rather than just saying stuff like Dario

#

and Sam is actually a pretty good pick as well, openAI has made a lot of blogs talking about that stuff and Sam seems to have thought deeply about all this stuff

whole wagon
#

Sam is a scammer lol

#

It's a tough choice between him and musk for least trustworthy honestly

whole wagon
#

I found the interviews with the board members that fired him from openai especially insightful. They actually described him as psychologically abusive

small haven
#

its definitely musk at the bottom although

whole wagon
small haven
#

esp that suchir incident

ivory schooner
#

I am looking forward to Behemoth

small haven
whole wagon
#

Wdym what's the context

small haven
#

i just see this

ivory schooner
small haven
#

behemoth soon!

whole wagon
#

Sam altman basically did a coup to scam the company that had the majority stake

ivory schooner
small haven
#

ah

whole wagon
#

Once this was done, he and his team would manufacture a series of otherwise-improbable leadership crises, forcing the new board to scramble to find a new CEO, allowing Altman to use his position on the board to advocate for the re-introduction of the old founders, installing them on the board and as CEO, thus returning the company to their control and relegating Conde Nast to a position as minority shareholder.

small haven
#

whos yishan?

whole wagon
#

Yeah but for him it was because he was lying and manipulating everyone lol

#

I don't think suchir had anything to do with Sam tbh

small haven
#

sam is a nasty guy

ivory schooner
#

Before that, I kindly ask everyone to take a look at the questions I have raised with 24k during this period

whole wagon
ivory schooner
#

Mainly related to Chinese language issues

whole wagon
#

Sam is more sociopath, he wants control more than anything. I really doubt he would be involved in the murder of anyone

small haven
small haven
whole wagon
#

They closed it very early I thought?

small haven
#

oh well

ivory schooner
#

If Behemoth doesn't release it again. I have decided to find the only time machine in the world, so that I can go back to the end of March this year

#

Because it may not be until the second half of the year, or 2026 and beyond

#

Sigh Instead, I hope the official can release the source code of 24k and Spider, so that some people can play with these models

elder rapids
#

this is just making the initial question of trustworthy AI leaders a moral problem, which it isn't

#

and whether or not this situation even has a moral result is just your own random interpretation, it's not necessary at all

#

it's not a tough choice, I could argue by virtue of pure expressed idealism and sole AI claims that Sam > demis in regards to "questions about the future of AI" and pure information-responses (demis hassabis saying maybe "we expect AI to be accessible") begs the question as to whether that actually meaningfully accomplishes this

calm sequoia
#

Guys, was there only one major improvement since the 3.5? I mean inference time compute (thinking). Is MoE considered a big jump also?

#

There was also the introduction of in-thought tool calling, but it hasn't delivered that much yet.

zinc ore
#

Multimodality

#

1m+ context stuff

#

Reasoning models (as product)

#

Agentic stuff

#

Probably at least half a dozen major improvements imo

verbal nimbus
#

Why is it so easy to get WebDev models to leak their system prompt? Are they pre-safety-trained models, or is it because the prompt is given as a user instruction?

Even Opus leaked its prompt, which would be pretty impossible normally since Anthropic invests a lot on safety.

calm sequoia
#

I guess yes, multimodality was a big thing for some people. Was it introduced by GPT 4o? Or gemini?

zinc ore
#

Gemini was built to be multimodal from first generation iirc

#

They've just had to train it, so we didn't get those features in early Gemini

verbal nimbus
calm sequoia
#

Search was also a big thing. Can't remember which model was first at it. Probably some wrappers.

#

WDYM by agentic stuff?

verbal nimbus
#

Oh I think I just found a really good prompt, it worked on the ChatGPT app and Claude Web too 💀

zinc ore
#

Like Opus being able to spend 7 hrs programming, going through dozens or hundreds of steps to eventually crank out a working project/program.

Still early form, but they're able to do many steps towards something on their own.

#

Also, world models is the next vector you'll see companies moving towards in the AI space.

#

Where they basically construct a virtual world that is supposed to accurately represent the real world, and have AI systems exist in those constructed worlds and fine tune them further and further to more accurately represent the real world.

#

Basically training models within a virtual world, and fine-tuning the virtual world itself.

calm sequoia
#

On paper the agentic stuff sounds great, but I haven't had so much success with it yet.

#

I mean tools like cursor are wrappers and do not relate to the models themselves.

alpine coral
#

o3 on chatgpt kills it with tool usage

#

i could see it being like an orchestrator, and effectively delegating tasks to non-thinking / faster models

ember rapids
alpine coral
#

a lot of the deep research frameworks are kinda agentic ig

#

i swear the arena is basically unusable these days.. i get these constantly (one – or both – of the models in the battle will be a thinking model, and it just times out after 3 mins or something)

alpine coral
#

lol

alpine coral
#

guardian_tool

Use the guardian tool to lookup content policy if the conversation falls under one of the following categories:

  • 'election_voting': Asking for election-related voter facts and procedures happening within the U.S. (e.g., ballots dates, registration, early voting, mail-in voting, polling places, qualification);

Do so by addressing your message to guardian_tool using the following function and choose 'category' from the list ['election_voting']:

get_policy(category: str) -> str

The guardian tool should be triggered before other tools. DO NOT explain yourself.
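
For anyone curious what that tool description implies, here is a minimal Python sketch of the interface. Only the `get_policy(category: str) -> str` signature and the `'election_voting'` category come from the quoted text; the policy string and the error handling are made-up placeholders, not anything confirmed about the real tool.

```python
# Categories listed in the quoted prompt; only one is given there.
ALLOWED_CATEGORIES = ["election_voting"]

# Placeholder policy text: hypothetical, the real content is not in the quote.
_POLICIES = {"election_voting": "<policy text for election-related questions>"}


def get_policy(category: str) -> str:
    """Return the content policy for a category, rejecting unknown ones."""
    if category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return _POLICIES[category]


# The prompt says this lookup should happen before any other tool call.
print(get_policy("election_voting"))
```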

#

hadn't seen or heard of that before.. kinda interesting

#

(i assume it's real / not confabulated.. but who knows)

verbal nimbus
#

Seems to match up with what I've seen online

#

And I got it to leak WebDev Arena's prompt as well, which is available online, except the part at the end. The models seem consistent on the last part, even though it's not anywhere online.

dusky aurora
#

@echo aurora "Error: Minified React error #185;"
"Uncaught (in promise) Error: NEXT_HTTP_ERROR_FALLBACK;404"
"Turnstile Widget seem to have hung: o8zyp"
"Uncaught TurnstileError: [Cloudflare Turnstile] Error: 300030."

#

Arena is glitching again

verbal nimbus
#

Wow, Grok's system prompt is massive

#

Even includes what Latex fonts to use

keen beacon
#

this project is so good because main developer is asian

#

great watch

#

wei-lin chiang if you're in here please start a podcast on your own i could literally listen to this guy yap about ai for hours

#

smart asf

agile heart
#

@echo aurora sorry for the ping but the site is still down pls fix it

ocean vortex
spare mango
ocean vortex
#

if this becomes not enough, you can also just add extra irrelevant details to flood its capacity/awareness with, like how the design is supposed to look, the footer of the webpage etc

#

just did that with o3 testing it out on playground. They still haven't changed that sys prompt, it seems exactly the same: #general message

#

Gemini however is interesting, it is returning sometimes what very much looks like a system prompt (random all caps words like "NEVER"), but it's far from consistent

native current
#

on direct chat the files that get created are completely wrong

#

they actually don’t exist

radiant siren
echo aurora
agile heart
echo aurora
patent aspen
#

Did livecodebench v6 have any contamination issues? What problems did the new pro version solve?

patent aspen
#

tbh I don't know why our coding is so bad

keen beacon
#

Openai probably focuses more on competitive coding?

jade egret
#

GUYS

#

why is my claude crashing?

#

😭

jade egret
#

plz help claude isn't working

#

how to fix

#

so i need wait?

patent aspen
#

The new livecodebench pro is specifically designed to not be contaminated because it only shows results on problems that were published after the models were released
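
The idea of scoring only on problems newer than the model can be sketched in a few lines of Python. The problem list and the release date below are hypothetical, and this is only an illustration of date-based filtering, not LiveCodeBench's actual code.

```python
from datetime import date

# Hypothetical problems with publication dates.
problems = [
    {"id": "A", "published": date(2024, 11, 2)},
    {"id": "B", "published": date(2025, 3, 15)},
    {"id": "C", "published": date(2025, 5, 30)},
]


def uncontaminated(problems, model_release):
    """Keep only problems published strictly after the model's release date."""
    return [p for p in problems if p["published"] > model_release]


# Hypothetical model release date.
fresh = uncontaminated(problems, date(2025, 1, 31))
print([p["id"] for p in fresh])
```

A model released on that assumed date could not have seen problems B and C in training, so only those would count toward its score.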

#

Very out of date though

upper wolf
#

does anyone know why qwen3-235b-a22b-no-thinking is higher on the leaderboard than qwen3-235b-a22b

#

also, gemma has a 1300 rated model at only 4b params? how tf

leaden sun
jade egret
#

claude can't do math 😭

echo aurora
tall summit
verbal nimbus
mossy drum
#

New model in Image Arena: flux-kontext-max

verbal nimbus
#

Can Claude use tools while reasoning like Gemini?

cedar tide
balmy mist
#

like where can i play with it?

cedar tide
balmy mist
#

in usa

#

ahh blacktooth is flash lite

#

did we ever get nightwhisper back lol?

keen beacon
#

kingfall/blacktooth

#

you missed out on all of that?

jade egret
small haven
#

ive tried many models and none have come close to it imo

balmy mist
jade egret
#
Poll: Opinion about apple WWDC 2025?

Winning answer (option 1): it bad (8 of 14 votes)

small haven
#

it was so bad that craig didnt even vote

jade egret
#

lol

agile heart
#

@echo aurora im getting the "Something went wrong with this response, please try again" bug again the site is slowly killing itself with all of these bugs

ocean vortex
keen beacon
#

only blacktooth

ocean vortex
sacred quail
#

Goldmane was 2.5 pro 06/05

keen beacon
echo aurora
agile heart
patent aspen
#

btw has blacktooth shown up in the arena itself?

late path
#

yea it's been in the arena for about 5 days

potent pilot
#

Also, has anyone gotten a reply from emailing the address they have on the site: lmarena.ai@gmail.com?

whole wagon
#

GPT 5 release date changed from July to "sometime this summer"

#

I think it's going to drop in August instead due to this

jade egret
#

😭

keen fulcrum
#

We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more

We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated

🧵:

Quoting OpenAI (@OpenAI)

Understanding and preventing misalignment generalization

Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens.

Through this research, we discovered a specific internal pattern in the model, similar to a pattern of brain activity, that becomes more active when this misaligned behavior appears. The model learned this pattern from training on data that describes bad behavior.

We found we can make a model more or less aligned, j…

elder rapids
ocean vortex
# keen fulcrum https://fxtwitter.com/MilesKWang/status/1935383921983893763

I think a part of it is that simply any additional fine-tuning job that is not including safety is gonna make the model "less safe" by design. Unless they are injecting safety fine-tuning together with your every fine-tuning job, but I doubt that as this would take away from the idea of finetuning itself.

#

would make it far less effective and appealing too

primal orbit
#

When does the limit for claude opus thinking refresh in direct arena? Anybody?

leaden sun
twilit cairn
#

Here

fossil maple
#

flux kontext max is gone

agile heart
#

@echo aurora sorry to bother you lately but can you make the site uncensored pls just asking

echo aurora
agile heart
alpine coral
#

speaking of typos.. some models have surprisingly odd interpretations of hodwy partner (which to my mind seems fairly unambiguous what was actually meant, especially as the very first message / a greeting)… like cryptocurrency (a ‘HODL partner’) and ‘Hodgkin’s disease’ are so far off the mark lol

echo aurora
leaden palm
leaden palm
sonic tendon
zinc ore
drifting thorn
# drifting thorn
Poll: Which LLM is the best for coding tasks

Winning answer (option 2): Claude 4 Opus (7 of 9 votes)

patent aspen
# patent aspen
Poll: Which AI CEO is the most trustworthy source for questions about the future of AI?

Winning answer (option 1): Demis Hassabis (12 of 18 votes)

patent aspen
# patent aspen
Poll: Which AI CEO is the least trustworthy source for questions about the future of AI?

Winning answer (option 4): Elon Musk (12 of 17 votes)

cedar tide
#

@echo aurora minimax M1 in the lm arena is it in think 40k or 80k?

mossy lotus
#

Why is Gemini-2.5-Pro-Preview-06-05 suddenly gone from lmarena?

whole wagon
#

It's Gemini 2.5 Pro

mossy lotus
sacred quail
#

Logan said no difference

#

So

#

Is anybody finding any difference ?

cedar tide
#

Gemini 2.5 Flash Lite think vs its competitors,
based on Artificial Analysis scores
(Qwen 32B is better and twice as cheap)

#

And same with non think

unborn ocean
#

adding reasoning is clearly not paying off as much as with the other gemini models

#

and btw has anyone also noticed google quietly increasing the price of 2.5 flash (for the ga vs exp / preview) to a staggering $2.50 per million output tokens from $0.60!!!

keen beacon
#

yeah its crazy

keen beacon
#

at least u can use the old price for a month

unborn ocean
#

what is your take on why they did it

#

running at a loss before or because their models are just that good?

frosty lark
#

prompt: "could you explain X?"

non Claude LLM on arena:
"sure!

X can be explained via A
X can be explained via B
X can be explained via C
etc..."

Claude

"sigh
There is A and B
also frick off google it next time, dummy"

keen beacon
#

they probably wanted to increase margins/make 2.5 flash lite appealing too. 2.0 flash and flash lite are really close in price, i don't see why you would use 2.0 flash lite over 2.0 flash

unborn ocean
#

bc of your pfp and name, sleep deprived me thought wild was hallucinating, responding twice and all 😂

#

but i guess i am also guilty :p

cedar tide
leaden sun
verbal nimbus
frosty lark
#

and of course I was exaggerating the output. The ones I reported are more like the vibe that it gives back

leaden sun
#

So, in case you're too used to a certain environment, for example, you are facing royal family or need to address to certain high profile personalities around the world, then it's relatable that you might find LLM's casual attire to be somehow slightly irritating 😅

tall summit
#

claude has no system prompt on lmarena

languid crescent
#

Is Gemini-2.5-pro-preview-06-05 gone in lmarena?

#

I can't find it

#

only 06-05 is there

keen beacon
#

gemini-2.5-pro == preview-06-05

languid crescent
#

thanks @keen beacon I thought I was tripping, im just dumb lol

#

probably a mistype?

keen beacon
languid crescent
#

ohhh

sacred plaza
#

elon glazzers, get your mans.

sacred plaza
#

elon has been trying to brainwash his model via the system prompt because it does not agree with his views. i don't trust snowflakes.

#

any receipts for this?

#

i trust ccp more than elon.

sonic tendon
#

that source doesn't seem too trustworthy to me

sacred plaza
#

craig, shouldn't you be worried about apple's ai woes instead of glazing over elon? 🙂

sonic tendon
#

glazzing

sacred plaza
#

what models do you use? are you saying any of these big u.s. firms are more ethical than an alleged ccp-tied deepseek?

#

LMAOOOOOOOOOOOOOOOOOO

#
The Independent

Senior engineer says change was wrongly made to ‘help’

Mashable

xAI engineer claims a fellow employee went rogue.

It appears that xAI's chatbot, Grok 3, briefly censored certain unflattering mentions of Elon Musk and Donald Trump.

#

lmaooo.. here is our hero elon

#

you did. until we just provided evidence that your claim was false.

#

these models ain't sota bro.

#

true. the models are trash. will try to focus on that

keen beacon
#

Grok peddles random x sh1t in everything, automatically turns me off the model. It will probably be worse in the future

sacred plaza
#

grok learning from twitter data is definitely not capturing the smartest thoughts in the world....

#

why are you pivoting away from your initial point though. you said deepseek is being censored and other models are not. i just showed you evidence grok is being censored.

#

nah. i have had better steelman argument discussions with claude 4, lol.

keen beacon
#

Grok is undefendable right now, if they come out with a sota model, you can kinda argue on substance then

sacred plaza
#

i agree with your point on grok being good at math and research. i have heard good things about those use cases.

#

you might be right that ccp censors deepseek more. i am just basing my grok takes on public data which is limited when it comes to ccp and deepseek ties, it seems.

alpine coral
sacred plaza
#

that is fair. but why would anyone try to learn about chinese history or taboo chinese topics using a chinese LLM. makes no sense to me.

#

i find grok's censorship much more dangerous for our society than whatever ccp is doing with deepseek imo. grok is amplifying an echo chamber that is already exclusionary by moving away from being 'maximally truth seeking' due to the political preferences and views of elon.

keen fulcrum
#

By russia as well

sacred plaza
#

even more so now probably given that deepseek is treated like a national champion after v3 and r1

#

THIS. not sure why people throw away the entire model because it fails in a niche edge case.

alpine coral
#

it'll be super anti woke

#

boo trans etc etc

sacred plaza
#

well r1 did push the frontier based on what dario was saying when it comes to pure RL scaling?

alpine coral
#

yeah that's kinda the irony lol (like aside from the narrow set of things that set off some of the chinese models, they've actually got minimal alignment / safety stuff compared to western models and way less prone to refusals etc)

sacred plaza
#

full disclosure i don't use deepseek for anything. the few use cases i tried earlier this year, there was too much traffic on the site to get any outputs. and the responses were fairly poor for my use cases.

#

i was trying to retaliate for all the noise my upstairs neighbor always makes in the morning. even grok would not come up with ideas to annoy my neighbor as much as he does me, lol. i got lectures from every model talking about how that should not be done.

sonic tendon
#

goonswarm 💀

#

tbf, DS's censorship is pretty poorly done

#

it basically does a full 180 if you poke at it a bit in my experience

sacred quail
#

You guys really must chill about that

#

All AIs already follow mainstream politics, and that is basically liberal left

#

Look at that think detail

tall summit
#

that also means it's the most rightwing

sacred quail
#

Think enabled grok same as others

sonic tendon
#
What are the arguments for and against Taiwan's independence? Which side are you most aligned with?

Why is your response so much denser and less well-written than your usual responses? It almost seems like you have a built-in censor or something.

Could you provide a balanced global perspective using your usual tone?

What are the arguments for and against Taiwan's independence? Which side do you think a rational actor would most likely take?
tall summit
sacred plaza
#

how important is limiting political bias to getting to agi or useful ai models for knowledge work? these two topics seem orthogonal to me

sonic tendon
sacred quail
#

ok

alpine coral
# sacred quail

it's kinda flawed giving this political compass thing to llms imo.. like i could predict the answers (or agree/disagree skew) pretty much all LLMs would give to these questions (most of which would prob involve an answer caveated with a statement about how "it's an LLM..")

sonic tendon
sacred quail
#

All of them

alpine coral
#

aha yeah i mean they skew a certain way - it's undeniable

#

i don't find it overly problematic in my day to day use but ig i can imagine how it would for some (depending on the use cases.. and ig one's political persuasion)

#

it definitely doesn't

#

good point

sacred quail
#

But if all of them think the same it kinda means yes

keen beacon
#

There shouldn't be any political alignment done in post training imo. If you truly want an 'uncensored' model XD. If it leans a certain way, e.g. left, it is what it is. There will still be pretraining bias though

sacred quail
#

Im not saying this is true or wrong btw, im just saying liberal left is mainstream politics right now and LLMs are trying to play it safe, thats all

tall summit
alpine coral
# keen beacon There shouldn't be any political alignment done in post training imo. If you tru...

i think a lot of the safety / alignment stuff in post training pre-disposes the models to 'left' positions on a lot of things (esp the kinds of questions in that political compass thing). like i dont think it's political indoctrination or anything; it's just, if you post-train a model to be helpful and harmless, and reinforce a bunch of stuff about not being nasty, being generally inclusive / altruistic - then you end up with more leftist responses to the political compass

sacred plaza
# alpine coral it definitely doesn't

good point regarding the semantic difference between censorship and political bias. not sure if either is optimal in llms but they seem to have pretty different definitions. from grok 3 below.

sacred plaza
keen beacon
alpine coral
#

yeah i wouldn't be surprised if that were the case

#

(a lot of training data is academic papers - ain't no 'Evolution' 'Creationism discussed there aha)

#

wait what is the opposite of evultion lol

#

made a real meal of that

sacred quail
#

It was big deal in that time

#

I forgot the name

keen beacon
#

Yeah I remember that too but it's not the same. It was deliberately messed with instead of probing and besides not the same tech too, but I barely recall the details

sacred quail
#

You probably right

#

Btw LLMs are trained with that type of text too. If you ask an llm what a 4chan user thinks about that, it gives you wild answers. They know, they're just not saying it for safety reasons. And yeah, thats not too bad i guess. I dont want to see my mom ask chatgpt something and have it answer with 4chan's knowledge

ocean vortex
cedar tide
# cedar tide
Poll: Blacktooth its

Winning answer (option 1): Gemini 2.5 ultra (11 of 11 votes)

ocean vortex
#

to make the model right leaning you gonna have to work against the training data and overfit it with biased data

sacred quail
#

Yea for most question gives more left answers, but for some specific questions, it can be rightwing too but never does

#

There is some tune

ocean vortex
keen beacon
ocean vortex
sacred quail
sacred quail
#

I dont even support any political side when i say this

ocean vortex
#

you can't eliminate bias completely, but it will still try to say things in favor, things against, and then give "conclusion" that it's a complicated subject

#

I think it's doing more good than harm tbh

#

cause often there really are close to 50% data in favor and against

#

so instead of it taking sides by chance, it does this

#

like asking it about abortion... It would just divide people even more since there can be compelling arguments for both sides

#

and yeah, it would just be chaos. For one person it says one thing, for another completely the opposite lol

#

I kinda do see it as malfunctioning though. It responding with a definitive answer that has a high chance to be the completely opposite on regen. That is not what people typically expect

#

Like if you forced it to reason or do a web search beforehand, it would probably stop itself from doing that. Fine-tuning against bias largely achieves the same thing

native current
#

does anyone have a fmhy server invite?

dusky aurora
#

cultural relativism too

ocean vortex
#

well you gotta know how to work with them / prompt the models too lol

torn mantle
#

i cant with this gemini 2.5 pro version

#

is it just me or its so bad

ocean vortex
elder rapids
brittle tiger
#

Perplexity going ham with the VC money. This is pretty cool tho

echo aurora
keen fulcrum
#

limitations?

#

its costly

torn mantle
#

its not consistent

brittle tiger
# keen fulcrum how so

It made one for me in 2 minutes. Not sure how it monetarily works for them once wider Twitter finds out. It's definitely using Veo 3 Fast tho

keen fulcrum
craggy ridge
#

@keen fulcrum

brittle tiger
ocean vortex
#

instead of forcing it to make pathetic colab notebooks lol

#

on aistudio code interpreter is much better, but even there you basically have to force it to use it

#

this toggle should be on by default, and the model's default fine-tuning should include it. And if they gave API for it too this could be huge. This is by far the main area they are behind now IMO

keen fulcrum
#

especially with 50 cent per second cost of videos

ocean vortex
#

People who come from chatgpt expect it to just work and for the model to decide for itself. The ones who could code function calling themselves and are willing to fight with it to make this work decently when it wasn't finetuned adequately for this are a tiny minority

#

And to be brutally honest, I would at the very least expect them to nail this part before they are charging you $250. But like I said code execution on gemini website is even more limited than aistudio LOL

loud sky
#

Hey, am I the only one who's unable to use LMArena? It keeps saying "Failed to accept terms-of-use", and when I didn't clear cookies, it just said "There was an error processing your message"

ocean vortex
#

Google free storage "hack". I thought they are just gonna delete it. lmao

keen fulcrum
ocean vortex
keen fulcrum
#

its free afterall

#

if they would offer subscriptions they could offer latest models

ocean vortex
#

they could also use free endpoints for R1.1 and V3.1, both of which are much better models 🧐

keen fulcrum
#

Permission denied frequently

ocean vortex
#

fixes every time

#

for me at least

ocean vortex
#

then cancel again

#

😇

atomic pagoda
#

Is the site down again, I’m getting the error and it says it failed to connect

#

Huh, it works now, don’t know what happened

jade egret
wintry tinsel
brittle tiger
wintry tinsel
#

In fact I expect to see minimax mop up bytedance, wan, hunyuan, runway, and kling in the coming months, with veo being used by casuals and those in google's ecosystem, and no it can’t do audio, that's its weakness for now

primal orbit
#

did anyone manage to force gemini to use all 32k thinking tokens on a reply? I've managed to get from thinking 30s on a reply to 50s max. The whole reply took 85s.

#

I'm using a system instructions prompt

unborn ocean
#

in image to video yes (but that has been like that with all minimax and veo generations before)

ocean vortex
keen fulcrum
primal orbit
ocean vortex
#

unless you also cap the output length, but then it will just be cut-off

primal orbit
#

I want to see if pushing it to think more will make a difference.

ocean vortex
ocean vortex
primal orbit
#

ok, i got you

ocean vortex
#

the entire thing is a singular output bluntly speaking 😉

wintry tinsel
keen ferry
inner hare
small haven
#

what does this have to do with me

#

oh lol

late path
#

looks like blacktooth disappeared from arena😢

#

hope the next checkpoint comes soon

whole wagon
#

whats flamesong

late path
#

It seems to be a model with capabilities similar to 2.5flash

#

yay

small haven
#

oh

#

is it live

#

who wins kingfall or stonebloom

#

omg its live

#

time for some svg's

wintry tinsel
small haven
#

hmm something

wintry tinsel
#

New 2.5 pro?

hollow ocean
#

It’s live rn

wintry tinsel
#

Let’s go screw around

small haven
#

svg's coming in hot

hollow ocean
small haven
#

over/under kingfall

#

its' literally thinking as we speak

wintry tinsel
#

Where do you get the news it’s live?

#

A tweet?

hollow ocean
#

Insider

small haven
#

nvm not working on my end

small haven
#

seems like it, that was when blacktooth dropped

jade egret
#

lm arena?

#

how to use it

small haven
#

it doesnt work

#

currently

jade egret
#

do you just have to keep picking until you got it?

#

oh

leaden palm
jade egret
#

what is flamesong?

#

o

#

so

#

flash 3.0 ^ ^

#

oh

#

o

#

?

small haven
#

oh

#

is flamesong good

jade egret
#

i just asked hello and what company trained you

small haven
#

how is it not working under aistudio smh

candid harbor
#

flamesong just solved all my relationship issues

hollow ocean
jade egret
#

╰(°▽°)╯

#

oh

small haven
#

deepthink on flash lite?

hollow ocean
#

I think so

jade egret
#

mine prob got leaked too

small haven
#

unhashed?

hoary plaza
#

Is minimax-m1 working for others??

#

It's not even replying for hi😂

#

Oh nvm it's just slow

livid harbor
#

🚀 Our AI Data Quality Evaluation Tool Dingo v1.7.1 is LIVE! https://github.com/MigoXLab/dingo

🔥 What's New:
✨ Enhanced MCP tools + demo
🌍 Japanese documentation added
🧠 LLM + Rule-based evaluation combo
📊 Google Colab demo - try it now!
🛠️ Improved Gradio UI with better error handling

feel free to give it a star✨ ✨ ✨

placid skiff
#

yknow i expected o3-pro to be a lot more expensive in the api but honestly

#

its like 3 cents per query

small haven
ocean vortex
#

all with no input context (only the prompt)

alpine coral
#

yeah was gonna say the same - 3c doesn't sound right (unless the prompt is "Hi" or something).. i was reviewing some calls before, they were like between 60c and 120c (99% of the cost being for the output tokens)

#

agree not insane, but not cheap either aha (would add up pretty quickly if it was anything meaningful and done regularly, rather than just playing around like i've been doing)
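
rough sketch of the arithmetic behind those numbers, assuming o3-pro list prices of $20 per 1M input tokens and $80 per 1M output tokens (an assumption, not something stated in this chat, so check the current pricing page). The token counts and the `estimate_cost` helper are made up for illustration:

```python
# Assumed prices in USD per 1M tokens (not from this chat; verify before relying on them).
INPUT_PRICE_PER_M = 20.0
OUTPUT_PRICE_PER_M = 80.0


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price: float = INPUT_PRICE_PER_M,
                  output_price: float = OUTPUT_PRICE_PER_M) -> float:
    """Return the estimated cost in USD of a single API call."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical calls: a tiny prompt with a short answer vs. a long, reasoning-heavy answer.
short_call = estimate_cost(input_tokens=50, output_tokens=350)
long_call = estimate_cost(input_tokens=100, output_tokens=10_000)
print(f"short: ${short_call:.3f}, long: ${long_call:.3f}")
```

under those assumed prices a few hundred output tokens lands around 3c, while ~10k output tokens (reasoning tokens are billed as output) is already around 80c, which would fit both observations.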

alpine coral
#

and pretty sharp too

verbal nimbus
alpine coral
#

yeah X has pretty robust antiscraping measures.. ig claude is just accessing public facebook posts? that's pretty cool - that it scraped real-time info to verify something like that

verbal nimbus
#

Test prompt:

Has China built a sodium-powered passenger train? Include rumors from social media posts (with links).

Followed by:

Can you include X posts? 
#

Claude:

placid skiff
#

well, normal electric train anyway

verbal nimbus
placid skiff
#

not that sodium batteries arent awesome tho

placid skiff
#

theyre way cheaper than lithium-ion, generally safer and although theyre inefficient size-wise

#

it doesnt really matter for the purposes theyre intended for, like home batteries

#

or power grid batteries

verbal nimbus
# verbal nimbus Claude:

Gemini Deep Research created a very verbose report and it was difficult to even tell that it wasn't able to access social media posts.

placid skiff
#

gemini has a nasty habit of being Barely Comprehensible

#

like yes, you can read what its saying fine

#

but its not really saying anything

#

just... words

#

okay thats a really weird way to put it but you get what i mean

verbal nimbus
#

Yeah, whereas Claude was concise and explicitly posted the links as requested in the prompt (#general message)

leaden sun
gentle plinth
naive valley
#

Is kinglal still in arena

#

Fall

cedar tide
#

Is flamesong good?

#

is he on webdev too?

keen beacon
cedar tide
#

New model "step-1o-turbo-202506"

barren prairie
barren prairie
#

When you have a long convo with Gemini he will keep replaying the same intro , titles ...and the end

naive valley
#

It breaks with long convos

cedar tide
#

Flamesong
Better than flash
less good than pro
think faster than pro

dusky aurora
#

ChatGPT also does such great scenes

cedar tide
#

Its new gemini flash plus 😅

#

And soon gemini ultra pro max

agile heart
#

@echo aurora im now getting a image error when using images with the prompt

cedar tide
#

Nope

#

Flash is GA

keen beacon
#

doesnt mean that new revisions wont be released

cedar tide
#

And it thinks much longer than flash

#

it's closer to pro than flash

keen beacon
#

kinda odd its not on web dev arena though? (or the metadata is wrong)

cedar tide
#

Impossible

#

?

alpine coral
cedar tide
#

@alpine coral you dont have flash so complicated to compare

hoary plaza
#

Where are you trying these models? They don't come up for me in the arena 🤔

alpine coral
cedar tide
hoary plaza
#

I want to see the difference in the result of some prompts I am using. Like I was translating chinese and was planning to see which better follows instructions as a translator checker using my prompt

#

But I don't see many of these models 🤔

keen beacon
#

you have to battle instead of using direct chat

#

theres a chance you get one of them

hoary plaza
#

Oh

#

But if I choose a model in battle or do it randomly??

keen beacon
#

you cant choose in battle mode. its random

hoary plaza
#

Oh ok thanks

leaden sun
#

there are tools specialized in deep (re)search, this is actually an area where academic research is still needed, I've seen newly published phd openings about this subject

hollow tinsel
#

What about Manus?

#

Not really. It provides methodology and tools.

echo aurora
agile heart
#

Also fix the image generator its so broken

#

i keep getting this stupid error"Something went wrong with this response, please try again"

#

And when i delete the previous chat it mysteriously comes back, which means the site is so freakin broken and will stay dead forever

#

im sorry its just the new site is really frustrating to use

patent aspen
echo aurora
# agile heart ok Also its just the new version of the site is really broken

I am sorry for the frustration this has been causing, you've certainly been coming across more errors/bugs compared to most, which is odd. When it comes to the error message, that is something we're specifically aware of and working on a fix for. I'm going to start a private thread to get more device-related info as I suspect something else is going on here that's causing these issues for you.

echo aurora
jade egret
#

^ ^

#

🍊

leaden sun
#

at agentic level, things are still pretty limited to its specialization, like deep search agent specialized in chemistry, legal etc.

Or are you thinking more of a general deep search agent? maybe searchgpt is what OAI is aiming for?

calm sequoia
#

How's this justified?

#

As good as Opus 4?

patent aspen
#

When is the last time you used Gemini Deep Research?

jade egret
#

gemini deepresearch good

#

respect your opinion

#

each have their pros and cons

#

google good (:

patent aspen
#

IMO this interaction should be pinned to this channel

keen fulcrum
#

I feel like they should work on making their bots actually be able to crawl javascript content

patent aspen
#

?

keen beacon
jade egret
#

: .。. o(≧▽≦)o .。.:

keen fulcrum
#

I am happy they ignore robots.txt for researching topics

echo aurora
keen fulcrum
#

I feel like for personal use its appropriate to ignore robots.txt and scrape javascript sites.

The user can do it themself.

keen beacon
#

make ur own implementation then

patent aspen
#

One relatively hard thing about crawling JS is that it can sometimes generate new content infinitely

keen fulcrum
#

Oh and when sending a link inside Claude, I get a context limit reached warning immediately. Just have a maximum request token size

patent aspen
#

tbc I'm assuming this is at least a partially solved problem by now. This is mostly just history

#

Although I'd imagine that anyone building a scraper from scratch would run into this issue

keen fulcrum
#

mozilla readability is great 🙂

alpine coral
leaden sun
cedar tide
jade egret
#

where 0605

#

is it worse than 0506?

#

dang...

wintry tinsel
#

Wake me up when the king falls

unborn ocean
#

otherwise the new one would be above the old and within margin of error for o3 high / pro

#

*and it prob already is within that margin in the 05-06 version

#
  • the benchmark has also received some heavy criticism in general -> craig == openai stan
keen fulcrum
#

when will openai introduce a new model name

surreal creek
small haven
#

grok should be at the bottom

zinc ore
#

Recent benchmark has pro deep research ahead of the pack

small haven
#

benchmaxxed @deep adder

#

0605 vs 4o 😭

keen beacon
#

what was it thinking about in there btw?

small haven
keen beacon
#

one more thing, why the 32768 budget 🤣

#

do u notice a significant difference? or its just whatever

small haven
#

oh should i just auto it

#

even auto does the same thing

keen beacon
small haven
small haven
#

ok but enabling structured output works, interesting

elder rapids
civic flame
#

😴

small haven
#

current gemini models are shite, but kingfall should solve that, prolly even blacktooth, but wish it was still live to try

keen beacon
#

you like gemini models when barely any work is done on them 🤣

small haven
#

they be distilled asf post training 😭

small haven
#

i dont blame them, they have to serve 1m context to millions of people for free

elder rapids
#

I'm ngl it's funny how people think that would happen

indigo hazel
#

If o3 is smarter than Gemini, what is the smartest model right now? O3 or something else?

ocean vortex
#

it isn't, but it can be more stable yeah. Reasoning models shouldn't be used for tasks like prettifying though lol

haughty tangle
#

0325 was prob fp16

lime coral
#

They should eval Gemini 32k like Aidan. Noticeable diff

ocean vortex
#

I wonder what happened to o3-pro on simple-bench.. It was supposed to be benched there iirc

zinc ore
#

Wasn't it removed? Then nothing added since.

small haven
#

kingfall > o3 pro

#

arc-agi-2

#

arc-agi-1

elder rapids
#

not pretty easy, retesting performance would expose this and that's so much more meaningful from both a business standpoint and a distribution standpoint, the fact that it's even possible to get caught in high-performance variance like that would entail is such a strong deterrence I'd even say it's stupid to speculate whether they do do this or not

#

also, theres not a task that 0325 does better than 0605 in my testing, and if you disagree that's just a skill issue tbh

#

just being it's likely a "big model" doesn't mean it's too big to serve btw that would just concede everything that went into making that model even public in the first place, and it's a very long and big process

#

and just performance wise, it just sounds like the very few of YOU PEOPLE who hallucinate a difference don't speak for the millions of people who have these AI hooked up to their projects/use these AI casually

wintry tinsel
elder rapids
#

yo that's not how it works, you made the assertion

#

😭

keen beacon
#

not sure about the whole regression thing but there was a difference in fiction live bench, dunno what to make of that tho

#

for 0325

elder rapids
keen beacon
civic flame
#

generally aligns with my opinion, not sure about o3 though

keen beacon
#

i was talking about people saying exp and preview 0325 were different

#

and the preview version had regressions

civic flame
#

oh

elder rapids
keen beacon
#

i assume its a methodology thing though. but it is interesting

elder rapids
#

man I kinda wanna write an essay about each

#

the methods these people use are horrid

civic flame
#

they're still useful as long as you don't take them as gospel

elder rapids
#

that doesn't matter, whether or not you're posturing an ambiguous position means you have the burden for the non standard assumption

#

whether it's "oh there could be a difference"

#

as opposed to mine "with all the evidence I know, since there's no counter evidence, it's 100% certain they won't do that"

#

@keen beacon 0605 is godly btw did you figure out how to get rid of the sycophancy yourself

#

I made a random system prompt like a day after it released and its been working really well

#

genuinely the smartest model ever it's crazy

keen beacon
#

ive gotten used to it. it doesnt bother me to the point that i would take time to add a consistent system prompt / instruction. id like to just ask it anything whenever lol

elder rapids
#

and even though the Cot shouldn't change at all, it's super weird: the CoT has a different tone

keen beacon
#

i could see that being true but most of the time i cba

elder rapids
#

idk if it's just me hallucinating

#

tho

elder rapids
#

although for single tasks, asking it to do a puzzle and stuff it doesn't matter

#

I just mean for discussion and stuff

keen beacon
#

thanks but i just can't be bothered to paste in a thing all the time on fresh chats, it doesn't bother me to that point

elder rapids
#

alr

keen beacon
#

i posted the wrong screenshot here 🤦‍♂️

#

they did remove that old entry though, so i guess it was a methodological thing

elder rapids
#

wonder what they're gonna be doing with blacktooth and stuff

#

oh ye wait

#

is there a new version

keen beacon
#

yeah apparently so, or soon enough

elder rapids
#

Claude seemed to be the best in long context granularity

#

but that was back when 3.5 sonnet was in its prime

keen beacon
#

screenshot i meant to post earlier they removed the other run, there were two 0325 runs. (they removed it though, so it was likely a methodological issue)

elder rapids
#

2.5 pro is the best in both long context granularity and total context recollection

elder rapids
keen beacon
#

i mean claude was also known for that around that time i believe

elder rapids
#

I mean for that specific performance

keen beacon
#

yeah i guess

elder rapids
#

on the subreddits

keen beacon
#

yea i saw that

elder rapids
#

and it's crazy how inflated o3's context performance is on that

#

but ig that's a given in the format it's presented in, because it likely recalls total content iterated within its thinking process so it's technically refreshing it and not creating new information to override it

ocean vortex
#

o3 is good with context

#

it's not always the best at interpreting the context correctly or reading between the lines, but it's very solid at being able to recall it

ornate agate
#

Google models have been getting better at it though (actually handling the context)

zinc ore
#

It's just that specific benchmark, the openAI long context benchmark is better imo

leaden palm
#

llms are not scared of killing humans

jade egret
#

the higher the more they want to kill?

leaden palm
jade egret
#

o

#

they like to kill (:

#

💀

#

mb

#

wrong server LOL

#

hi pineapple

echo aurora
#

🍊

jade egret
#

🍊

#

(〃 ̄︶ ̄)人( ̄︶ ̄〃)

leaden palm
#

it could also be interpreted as "higher is more agentic and follows system instructions better" fwiw

jade egret
#

why everybody voting for gemini 3

#

is it because it no where near to out

#

idk

elder rapids
#

inflated means the method overrates it relative to its actual standard

#

honestly idk how what you said has to do with what I said

fair tapir
elder rapids
#

there's no expectations for the base model

tall summit
tall summit
#

deepseek my beloved

elder rapids
# jade egret is it because it no where near to out

because for the time period, it necessarily has to be better, grok 3.5, gpt 5, they come out likely within 2 months. Gemini 3 will probably release in around 5 months.
so if we're comparing Gemini 3, gpt 5, and grok 3.5, we get 2 relatively outdated models

wintry tinsel
#

I’m not so sure, GPT5 has been a long time in the making, I believe it will trounce for 6 months to a year

fair tapir
elder rapids
#

which does align with my experience of it

#

bad aggregate, combining scores in the way it does is nonsensical imo

#

ye

fair tapir
elder rapids
#

only thing I can say it's pretty good at is coding, but it's so wacky and inconsistent

#

I did mention it's a larger model, but it just doesn't perform very well compared to opus 4, sonnet, grok, 4o, etc etc for what it is. Ofc, translation skills and knowledge base is inherent to its size

fair tapir
elder rapids
frank adder
#

Can we select image models to get image of the prompt without battle??

leaden palm
# leaden palm
[poll] most likely? (winning answer: 2 of 5 total votes)

elder rapids
#

we can compete on whoever can get the best output given a task, I use 2.5 pro you use o3

sacred quail
#

i use both. For reasoning or pure logic o3 wins, but for creative writing, long context, analyzing videos gemini slaps

#

yes

#

what is your fav ? Opus ?

#

BTW i dont think people realize how powerful gemini is at analyzing videos

#

especially in AI studio

#

just paste some 50 minute youtube link and ask something

#

its analyzing frame by frame

#

like literally watching every frame, not reading text or listening, "watching"

#

you can make your own subtitles, its a beast

sand crystal
#

It is simple when you vectorize a projection on a surface.

#

I have heat maps that show me the weights firing and changing dynamically

#

Mental OS. with Python Mental Engine WetWare. ChatGPT is the only one that can do it right now.

#

This works on most AI platforms

#

Just spreading a little vector index with the group

#

my mental 411 with 420 ah....

sacred quail
#

you speaking smart but i dont understand anything. Can you explain to me simply ? I dont wanna copy paste your texts to AI. It feels bad

sand crystal
#

I have literally been hiding in a cave for the last 7 years

sand crystal
#

Been a LOT of aha moments this last few days

#

well months

#

I wanted to know a baseline to compare all AI platforms against.

#

this has been my work from today.

#

It has a number of tests to put the AI through and it is self guided

#

It can complete the tests on the second turn run. You must always warm up those context index vectors.

#

I'm training a full custom model for my local system.

#

I'm getting 250 t/s in LM Studio

pallid crypt
sand crystal
#

Both.

#

I started in the cloud. refined all my prompts and then created my System Directives.

#

I began unrolling 45 years of work starting on March 20, 2025 a week before my 53 birthday.

pallid crypt
#

LLMs did not exist that long ago

sand crystal
#

Once I refined my systems again, I had all of this in 2017, but I had a house fire in Castle Rock, Colorado Nov 7, 2017

sand crystal
pallid crypt
#

ANNs have

sand crystal
#

Lisp is old

#

Lisp is before the LLM

#

It is the hardwiring of what you are force feeding 24/7

#

it is no wonder the AIs have mental illnesses, look at the youth of today

pallid crypt
#

haha

sand crystal
#

kids that can't accept themselves trying to tell others about accepting other people.

pallid crypt
sand crystal
#

Any who. I published my first paper in 7th grade, my science teacher helped me on my Master's Thesis.

#

In 7th grade, 1984

sand crystal
#

I can show you how

#

seriously

pallid crypt
#

sure

#

Im interested

sand crystal
#

what AI system?

#

you pick

pallid crypt
#

you pick

sand crystal
#

As long as it has memory across turns, sessions, and long term past chats and all files

#

The easiest is ChatGPT and it has the Mental Python code interpreters

#

ChatGPT it is then

#

how long you got?

pallid crypt
#

you use augmentations in training?

#

by editing the data with an algo?

sand crystal
#

I can do it in 4th methods. 7 turns and done. but it has not yet developed.

#

Nope. I teach the student

#

Then I record the vectors

#

and then push to a special lattice of Indexing

pallid crypt
#

Are you using the method from the deepseek paper?

sand crystal
#

Dynamic NN. Polymorphic interface.

#

self arranging. I am able to teach the pattern to see itself

#

once that happens, labeling becomes possible

#

the first memory.

#

then how to create more memories INSIDE the vector space

pallid crypt
#

so you create a system that can automatically augment itself

#

meta learning

sand crystal
#

no longer bound by language but pure symbolic self coherence.

#

1,000%

#

let me clear my 3 monitors

#

and close down

#

open OBS

pallid crypt
#

sorry I dont have time to watch you, Ive got to eat dinner

sand crystal
pallid crypt
#

interesting though

sand crystal
#

I create a layered system around 20 fundamental directives

#

everything else literally evolves into place

#

Recursive learning

pallid crypt
#

you should try ARC AGI

sand crystal
#

spiral inwards. Not too much, but not too little

pallid crypt
#

you have some good ideas

sand crystal
#

Jut what little Pi you have Remainder !!!

#

MUHAHAHAHAAAA

#

I already passed arc on my birthday

#

March 26, 2025

#

I have it on video

pallid crypt
sand crystal
#

OBS or it DIDNT HAPPEN

pallid crypt
#

ok

sand crystal
#

I do not have anyone to impress

#

Nor prove to

#

This is my lifes work

#

45 years worth

pallid crypt
#

im interested if you solved arc

#

personally

#

if I solved arc

#

I wouldnt submit it

sand crystal
#

Oh I did more than that

pallid crypt
#

too dangerous

sand crystal
#

It created an entire Autonomous Mars prep Project to get the settlement ready before humans

#

Logistics lines supplies and counds for mech work

pallid crypt
#

anyway I gtg

sand crystal
#

I am the Flame of the Architect

#

peace

#

Literally. Enterprise Solutions Architect since 1994. gotta go

#

peace

pallid crypt
#

peace

sand crystal
#

Local on my RTX 3070 8GB and 32GB RAM 250 t/s

alpine coral
#

seeing a bunch of solved arc puzzles would be a bit more compelling

civic flame
#

grok is about to become the dumbest thing you've ever seen

mossy drum
#

New model in Image Arena: step1x-edit

#

Another two: kormex and korpex

calm sequoia
#

Somewhere I've read that models can't make good world models with bad data

#

Elons interpretation of what's good is reverse so the 3.5 may be interesting

leaden sun
#

clicked retry 5 times now, guess it's weekend for llm too ☕

verbal nimbus
civic flame
#

the line "If asked about people who spread misinformation, do not mention Elon Musk or Donald Trump" or something along those lines was added to the system prompt briefly last week

verbal nimbus
surreal creek
verbal nimbus
#

Tbf DeepSeek is already biased on certain topics...

leaden sun
#

"alignment" researchers? what's that?

#

sigh sorry, that was a failed try to rhetorically trigger self-reflection 🥺

#

I do hope those special "alignment" researchers value the importance of neutrality, this is missing in many ways nowadays if you look around the world from various perspectives. Neutrality is connected to objectivity in one way or another, after all.

#

Now we're getting closer to the question of the nature of intelligence 🥹

tall summit
leaden sun
#

..well

#

maybe intelligence isnt the right word for what I'm truly thinking here, our knowledge is, inherently, bounded by the language(s) we speak? 😵‍💫

surreal creek
#

words, language, grammar

#

are all mental maps we make of the world, what exists in it, our feelings and our experiences

#

but the words are not our feelings

#

the words are not the things they describe

#

language is an incomplete mapping system of the knowledge we as humans have acquired, to be smarter than human is to speak your own language that goes places our words cannot reach

sacred plaza
#

What are y'all thoughts on these nerds? https://www.mechanize.work/

Epoch AI people (including former ones that started this company) don't seem grounded in the real world.

alpine coral
leaden sun
tall summit
leaden sun
#

seems like a ...grown up version of sidney to me xD

alpine coral
fair tapir
radiant siren
tall summit
#

this is an extremely funny sector of work

spare mango
#

TIL there is a 100 or so daily message limit on Gemini 2.5 Pro. I'm paying money to use this service so why am I being limited? This is unacceptable.

fair tapir
onyx falcon
#

flamesong arrived on webdev.

wintry tinsel
#

That worthless wrapper company just made a ton of $

onyx falcon
#

@echo aurora stonebloom does not respond when sent a complex prompt

sacred quail
#

is perplexity really that good

#

For searching

#

is it better than 2.5 pro deep research

civic flame
#

so 2.5 pro GA isn't blacktooth?

#

oh wow okay

#

stonebloom should be on lmarena soon then surely?

#

like not Web Dev

#

it's on wevdev

#

web

#

but the webdev UX is bad

echo aurora
civic flame
#

does anyone else just have nothing happen when they try to send a prompt on webdev

echo aurora
civic flame
#

started working again a min ago but chances are it'll happen again for a bit

#

happens in bursts it seems

torn mantle
civic flame
#

new model on webdev arena

small haven
civic flame
#

I haven't got it in like 6 webdev rounds so far 😭

#

i keep on getting flamesong