#general

1 messages · Page 1 of 1 (latest)

steady beacon
limpid belfry
gleaming obsidian
torn lintel
drowsy haven
raven cosmos
devout orbit
static holly
clever estuary
foggy laurel
sudden root
#

Hi

devout orbit
#

woohooo

karmic tide
kindred kiln
shrewd rose
keen beacon
toxic palm
torn lintel
#

hello

sudden root
last zinc
dense socket
scarlet marsh
hollow coral
limpid belfry
#

woop

clever estuary
#

hello

swift estuary
drifting crest
stark horizon
devout orbit
#

9th person 🥳

hollow reef
keen beacon
#

hi

gleaming adder
dire bone
gleaming obsidian
#

hello!!

unkempt elk
violet vault
balmy raft
restive cedar
still jetty
eager mica
calm spindle
lunar pier
vapid sail
unkempt elk
#

Hehe

warm eagle
limpid belfry
#

and the flood of people begins 😄

glossy sierra
keen beacon
#

why was a new discord needed btw?

random spade
gloomy crown
dapper storm
sweet tinsel
limpid belfry
unkempt elk
limpid belfry
low acorn
whole pollen
blazing rune
drifting zephyr
timber veldt
keen beacon
thorny fable
whole pollen
#

hey

lilac wyvern
steep elbow
unkempt elk
#

YT commentors here too huh

candid storm
onyx lintel
golden moat
nocturne frigate
eager crater
cedar tide
rotund vector
fading abyss
cedar tide
#

Hello

eager crater
#

hello

devout orbit
devout orbit
sand ledge
devout orbit
thin breach
keen ferry
true moss
cedar tide
#

the real ones are there

split timber
keen ferry
#

hi

solar apex
strong pilot
sacred plaza
dense moon
sacred plaza
#

Is there a way to test to models without knowing what they are beforehand. Like having a blind test study to see which models I actually prefer

mellow frigate
cobalt minnow
keen beacon
#

or u wanna form ur own personal leaderboard?

drifting gate
silent bone
devout orbit
wicked tapir
outer mauve
wicked tapir
#

hi everyone!

#

i'm an MIT student

keen ferry
keen beacon
marsh stratus
dull moss
devout orbit
#

but onyl he can tell

keen beacon
#

there should be an arena feedback channel i think lol

devout orbit
#

yeah

upper ingot
upper ingot
tribal raven
slate cliff
primal ferry
regal stump
sacred plaza
# keen beacon lmarena is supposed to be that lol

Before you test out two models you pick which model you want. Is there a feature where can avoid knowing what two models I'm testing? There's a lot of cognitive biases that would affect people's evaluation outputs based on knowing which models they're currently using

broken ravine
verbal light
keen beacon
oblique flint
sacred plaza
sharp elbow
modern knoll
sudden drift
crude hound
sharp olive
autumn herald
sonic citrus
hearty wolf
sacred plaza
thorny falcon
sacred plaza
hybrid shard
rain egret
past epoch
pliant minnow
lavish jungle
#

From c116 to Arena. Whooho 🤖👾

regal sparrow
unborn apex
fathom oriole
thorny drum
unborn ocean
molten oracle
hearty pulsar
loud leaf
blissful jungle
visual nova
atomic lance
untold atlas
frank ridge
median pilot
frigid blade
sudden ferry
golden kindle
shadow hollow
hidden bane
woven portal
small nymph
dense shoal
warm wren
dusky lion
blazing flicker
jolly aspen
haughty estuary
raw pulsar
tacit minnow
tardy pasture
mystic mica
sage hazel
sterile copper
boreal trail
frail delta
boreal trail
#

first

boreal sail
wheat onyx
ancient glen
twin basalt
river reef
karmic siren
sturdy mica
timber kiln
civic stump
topaz ivy
crimson shell
placid sparrow
meager vine
granite sorrel
native pawn
devout cradle
solemn jetty
meager lintel
upper finch
strong gull
rocky jewel
atomic fractal
fierce smelt
surreal ember
mossy drum
stoic lagoon
lavish orchid
regal leaf
calm sequoia
brisk turret
lucid zinc
north vale
sudden helm
noble glen
charred pagoda
brisk turret
#

whats the difference between lmsys and lmarena

opal hare
twin anvil
light walrus
woeful geyser
ancient reef
lyric peak
whole sundial
torpid tinsel
timid umbra
dense saffron
#

recommend changing server icon because i keep getting the 2 servers confused

white kelp
weak swan
pliant cypress
somber monolith
verbal tide
umbral plaza
olive helm
regal nimbus
young otter
brisk turret
dense saffron
#

that's a lama? I thought it was a deer

brisk turret
#

Probably is now that I look at it

#

Yeah just make it face to the right

#

That way we can tell them apart

rotund narwhal
honest swallow
carmine sparrow
ocean seal
sleek rain
keen ferry
#

new model: grok 3 preview 24.02.2025 (on lmsys)

leaden palm
fleet lance
ancient walrus
pure tundra
brisk turret
#

Why the f was chocolate removed

#

Where is it??

#

Ah it's in deprecated

tidal spoke
high ferry
iron meadow
frank arrow
fringe ermine
upper wolf
boreal hound
quick gate
keen beacon
celest spindle
near epoch
severe briar
lofty radish
hidden mauve
rancid pike
daring harbor
trim dock
terse lintel
median anvil
serene parrot
keen beacon
brisk turret
vital beacon
craggy patrol
potent tree
cedar echo
plucky garnet
rose dome
dense turret
shrewd gorge
manic oracle
crimson jolt
long owl
distant rock
zenith root
leaden palm
#

Please turn off welcome messages

keen beacon
#

set it to #welcome or something if u dont

alpine willow
limpid basalt
open osprey
jagged dune
thorny bane
spice maple
ebon patrol
neat apex
leaden palm
#

Man if this continues I'll need a uBlock rule

#

@wooden mulch could you turn off the join logging

rigid jungle
ancient reef
#

wow.

#

(that was an accident. it said hi an I acted on impulse)

#

anyway with ths channel lke ths Im staying on lmsys

minor otter
unique nova
#

Cool new community!

rancid torrent
humble holly
cursive zodiac
viral canyon
balmy flower
fast basin
topaz peak
summer hound
slim depot
steep obsidian
limpid stream
vestal ocean
obsidian mortar
obtuse sapphire
severe warren
wooden mulch
leaden palm
#

looks like this in the main settings, you can't miss it

wooden mulch
#

done. thanks!

leaden palm
#

image arena is crazy

#

found antdx in webdev arena?

grizzled elbow
#

Hi

leaden palm
#

this is crazy

north vale
leaden palm
#

the one and only bayesian!

north vale
#

I trust this value more

leaden palm
#

79% ≅ 81% 🤷

wicked tapir
floral yew
#

hello from india

frigid wolf
#

hello world 🙂

gray spruce
#

hi

hidden mauve
#

Dear mod(s), admin, can I ask you for a little bit different logo?

#

just a LMArena text in the bottom or something

north vale
hidden mauve
#

oh okay then

#

never mind

meager vine
#

wtf this server even exists

mossy lotus
#

Hi, server dead?

leaden palm
hidden mauve
#

I'll keep it alive

hybrid shard
#

I believe the other server will still exist, just without the Arena-related channels

north vale
feral summit
#

Hello hello what a nice shiny new server 🥰

dapper storm
#

Lmsys people want to not be associated with their most successful project

#

Really makes u think

wooden mulch
lean phoenix
#

hello 🤗

grand mesa
#

Hi

jaunty delta
#

wow

#

its in direct chat

zealous dawn
#

heeyyy

summer hound
#

What's that Gemini test

#

You think it's good?

hardy pecan
#

HELLO

timber kiln
#

Test 30? Gemini 3?

patent wedge
#

hi, what is the best LM to buy for medical academic work?

gentle plinth
stone coral
#

Grok 3

keen beacon
torpid fiber
# keen beacon not there for me

They removed it, including all 7 unicorn engines (Meta Llama) LLM models, less than five hours ago from the "Direct Chat" option/menu, but it's still on the "Arena chat" blind test. You just need to find it. Source: I've tried it earlier/before personally.

topaz peak
#

flux pro's image is ... intriguing , lol, doesn't seems to match any of the input words besides , maybe "fence"

torpid fiber
topaz peak
#

yeah it pops up now and then

clever estuary
#

nice grok3 having issues rn

#

guess everyone's gonna vote against it

#

screw musk man!

leaden palm
brave tiger
#

yum

pastel depot
#

is there a chat mode lmarena.ai that i can put a log file in and get help from it i refuse to pay trash open ai 20 a month scum

leaden palm
#

including chatgpt free

pastel depot
#

they have limits

#

cuz scum

brave tiger
brave tiger
pastel depot
pastel depot
#

sick of that sht

#

i want a search engine that can detect subscriptions and trials so bad

brave tiger
#

they dont even have a way to pay yet so

pastel depot
#

ok ty ill look at it\

pastel depot
#

well im just saying everytiime i search for somthing its free 14 day trial sick of it

#

im so done with subsription based bull sht

pastel depot
brave tiger
#

real fr, but thats how they make money, you do realize like OpenAI is a nonprofit

pastel depot
#

its scum

#

its all over the place and all ai bs is wayyyy over priced

brave tiger
pastel depot
#

ill look

#

i have ublock on opera cuz they took ublock of of google chrome lol

#

off*

brave tiger
pastel depot
#

its over priced sorry

brave tiger
pastel depot
#

i just have pure hatred

brave tiger
#

real

#

aight

pastel depot
#

im trying the deepseek thing you said take off what?

brave tiger
#

or r1

pastel depot
#

i think it is its not higlighted so

#

cant type today

brave tiger
#

u good then

pastel depot
#

ok ty

leaden palm
pastel depot
#

ya i see deepseek can only do 33% which kinda blows but still works

#

its a minecraft log lol

leaden palm
pastel depot
#

ehh its prob cuz it shows all the mods and theres like 200 plus

#

but idk if it needs to see what mods

#

are in it

#

im trying trae ai it seems to be able to hold it all

#

and tells me how to fix stuff

leaden palm
pastel depot
#

well its fine im using this it seems to be a lil better then deep

#

but idk how accurate

#

trae is just yet

#

most of the time they are ok at like basic errors and stuff like told me what mods to remove if i need to or commands

brave tiger
#

shi good luck

#

if i were you id remove all the unnessessary stuff from the log first then give it what you actually want to analyse.

brave tiger
pastel depot
#

idk i apparently have 350 mods cuz fabric adds random mods

#

if that makes any sense

pastel depot
leaden palm
#

Get to selecting text and finding out what most of the log is

pastel depot
#

well im already using trae its honestly working better as loong as trae isnt guna say like trail ended or some sht

#

trial

#

its a program thats prob why

floral yew
#

gotcha

cloud meadow
#

I've seen some projects utilise baritone with AI which was pretty cool

hardy pecan
#

lol claude gave me a pretty good emulated msn messenger client

sacred plaza
#

Those usernames are too accurate lol

willow grail
#

ff

frozen arch
#

Getting a lot of "connection lost" in lmarena these days.. what's going on?
Can't use any of direct chat, or anonymous chat, or anything

brave tiger
frozen arch
#

im on the same tab and never switch, have tried with different networks (networks are stable) and different devices (mobile phone, pc, etc), it happens

pastel depot
cloud meadow
#

I don't think it could do that with error messages alone

pastel depot
#

it can it has its not 100% more like 70% but it has helped me

cloud meadow
#

Interesting

#

What are your logs?

#

I mean, what mods do you have with one another? Similar type of mods can usually attempt to change similar functionalities (mixin incompatibilities) which usually leads to issues

sacred plaza
#

Ayo, what in the world is Sam smoking?

gentle plinth
ancient walrus
#

Was qwq-max ever in the arena?

topaz peak
#

new model alert

leaden palm
#

Not new

#

Perhaps a revival

rapid merlin
#

Kinda woried about gpt-4.5 ngl

#

First time i've seen this kind of error from a LLM

#

It defines and declares a function like this, then proceeds to call it with a different name entirely

local function teleportToFS()
    root.CFrame = FS_LOCATION
    PP_REMOTE_EVENT:FireServer({"+FS6"})
end
teleportFS()
#

a very human error to make KEKW

keen beacon
#

gremlin, centaur and enigma are all back

#

google modelss

#

so far the latter 2 have been better in my testing

#

centaur = thinking model

drifting crow
#

Interesting names

keen beacon
#
  • cohort-chowder
#

seems meh

#

another google one i think

lucid zinc
leaden palm
#

its making up numbers

#

can anyone explain this?

#

plain old 4o and 3.7 get it right one shot

topaz peak
#

wtf?

sacred plaza
leaden palm
#

is luca a chinese name?

steady beacon
#

shouldn't be

#

at least if it was than surely upon transliteration it would be luka

ancient reef
#

I thought it was the biology LUCA

hardy pecan
#

holy moly

#

new model: margherita-plain

#

das alot of emojis

#

Looks like its made by "reka"

cedar tide
#

Three models form cohere available in direct chat

#

it looks like it's the same model base

cedar tide
#

arrived in the arena
claude-3-7-sonnet-20250219-thinking-32k

glad oracle
#

Hello I'm new here. I'm also new to chtbot arenas in general, I wanna learn more about them

#

Are arenas about comparing outputs of llms for the same prompt and ranking them? is there more ot it?

rancid stone
low acorn
sweet zodiac
#

hi

earnest parcel
# cloud meadow https://huggingface.co/spaces/Qwen/QwQ-32B-Demo

Tested QwQ-32B (local, Q4_K_M):

  • best in size, except for coding
  • extremely verbose (avg. ~10x output tokens compared to traditional model, more verbose than any other long-cot-model I ever tested)
  • more effective thought chains than r1 distill versions of Qwen2.5-32B
  • terrible at all webdesign tests I threw at it
  • Smartest sub 70B by brute force token chains

This is a smart model, but for me the extreme verbosity and inference required excludes it from becoming a daily driver.
The good outcomes feel brute forced with cot, and the verbosity is borderline ridicilous.
Good if for complex STEM related subjects or reasoning tasks.
Not useful for coding.

As always, YMMV!

cloud meadow
#

I can't wait for the new llama models

#

Meta has a lot of work to do though

cloud meadow
drifting crow
cloud meadow
#

It's been a while since I've last heard of them

neat apex
#

I use they space and i actually noticed it went a little smarter

#

I dont know if they just improved the system prompt or it is Margherita xd

keen beacon
#

Own server 🤗

hardy pecan
lucid zinc
keen ferry
#

this can't be real

cloud meadow
#

cohort chowder

lucid zinc
#

last number was 15 instead of 5 in mine. doubt that makes difference but fyi

leaden palm
torpid fiber
# keen ferry this can't be real

It’s real. From what I’ve gathered, those three models are essentially the same base models rather than reasoning models with Test-Time Computing (TTC). Their responses are instantaneous; they promptly address the user’s prompts. These models are of the same size, and I see no differences. Honestly, they’re average. They have a knowledge cut-off until June 2024. They come from Cohere, and the model is named Command; they are not from Google DeepMind. I believe they’re simply A/B tests evaluating the same version of the model, or they might be different sizes with minimal differences in performance improvements. For optimal performance with any model in the LMArena or on the API, I recommend adjusting the following parameters to minimize hallucinations and fabricated misinformation while providing more concise answers.
Update: They were removed from the Direct Chat option/menu about half an hour to an hour ago. The LMArena is based in the US West and typically updates, edits, introduces new models and removes models within this timeframe for both the Arena blind test (battle) and the Direct Chat. To reduce hallucinations or fabricated misinformation and ensure more concise answers, visit the website, then navigate to Direct Chat or Arena (side-by-side). Afterward, scroll to the bottom and left-click under the small arrow to the right of the Parameter. Please set the following values by dragging the slider, writing them by hand, or manually inputting the corresponding values to apply all changes.

#

Note: I have a dyslexic font applied. That’s why it’s weird the font. If you wanna try it out, then download the Chrome Web Store extension that I use on my Google Chrome web browser. It’s called: Dyslexia Friendly. It’s completely free and legit, with no malware/viruses. I recommend disabling the “Ruler option/menu” for better flexibility. The font I'm using is the default one, called Open Dyslexic, I think you can import any font.

keen beacon
torpid fiber
#

No, I'm not. I'm using Windows 10, I need to upgrade to Windows 11, lol.

keen beacon
#

ah ok, i guess its just the font then

torpid fiber
#

Yeah.

#

The font I'm using is the default one, called Open Dyslexic, I think you can import any font.

low acorn
short stone
#

anyone know a free way to acess claude extended thinking model either threw a trial or anyhting?

torpid fiber
#

On the Direct Chat tab/menu in the LMArena.

short stone
torpid fiber
#

I didn't put it to the limit. Try it out. Maybe it's less than 20/25 messages?

cedar tide
#

New llama model
Space (very slow)
Sky (slow)
BluePrint (slow)
Momentum (fast)
Flywheel (fast)
Momentum (fast)
Inertia (fast)

#

Luca I don't know where he's coming from he doesn't want to say it

#

and QWQ 32b arrived

leaden palm
#

we are never going to have llama 4 at this rate

rancid stone
leaden palm
rancid stone
# leaden palm We are in the LM Arena Discord

Yes, in LM Arena these two model come up and @/ˈliːəʊ/ mentioned that "centaur and enigma are all back". I'm trying to find some information on these models online, but cannot find them

leaden palm
#

LM Arena contains anonymous models

rancid stone
#

ok, so these names are used in place of other models? but why?

vague quiver
#

Does anyone know what other authoritative rankings of ai mockups are besides this?

#

Does anyone know what the difference between grok 3 preview and grok 3 in the previous grok?

wind stream
twilit warren
#

Can we keep the history of what we have send in direct chat?

jolly aspen
#

ask a bot to write you a browser extensiin for this

past shuttle
#

I'm looking for a partner to create a telegram bot

barren prairie
gusty timber
#

Guys, is there a subscription model on this? I'm trying GPT 4o, limited to only 8 messages, how can I continue I don't see anywhere

cloud meadow
keen beacon
#

there are so many meta models on the arena lmao

neat apex
#

Llama 3.7 soon

timber kiln
#

LLama effort is so mid they were releasing same models trained the same way until 3rd version
They even have their own papers from past on more efficient LLMs but somehow they cant make it work

keen beacon
#

i genuinely have lost track of how many there are

#

it's definitely more than 10

#

i'd say there are probably about 15 anonymous meta models running around

#

makes it hard to discern them

ocean vortex
#

70b should be more than enough for a decent reasoning model I think

timber kiln
#

They are still behind of Qwen in base models

#

Even though Qwen is not doing something special if it were to believed in their technical papers

#

Biggest sin of Google and Meta I think they aren't training on code enough

ocean vortex
# timber kiln They are still behind of Qwen in base models

I think they are only slightly behind. But with RL training small delta like that shouldn't matter and can be easily overcame. Like gpt4o/mini has clearly fallen behind now but o1 and o3 on the same base models are still leading the way.

ancient walrus
#

re Meta: Meta AI Chief Scientist Yann LeCun doesn't believe LLMs will lead to AGI.

#

reflected in their output

cloud meadow
#

It seems to be quite proficient with programming

#

Do they just hire a bunch of developers to produce code based on user prompts and then train Claude on it?

timber kiln
#

No idea
Data is the most well protected secret both because they are using copyrighted stuff and because that is one thing that can be a moat
Even the open models (few exceptions but they don't also perform well in coding) don't disclose their data speficically
We won't know until some of these companies start going bankrupt but even then they will probably be acquired and forgotten

zealous ingot
#

Hey all. Curious how we feel about data sharing

#

Data is shared

#

Our specific data is a large input to their model

cloud meadow
#

It's sort of difficult to comprehend what you are asking

zealous ingot
#

Commenting, you can look at the impact of tokens, whatever happens on lmarenai is high

zealous ingot
#

Be mindful of what you post

#

You wonder where.. here

cloud meadow
#

You mean, scraping?

zealous ingot
#

Yeah, accidentally typing something in. Anything without login

#

If you login, they must attribute

#

Love the cat

cloud meadow
#

Where are you from, if you don't mind me asking?

zealous ingot
#

I’m an independent researcher on agi

#

With a gpu farm

#

I experiment in context window optimization

cloud meadow
#

Interesting

#

What nationality are you?

zealous ingot
#

USA

cloud meadow
#

American?

zealous ingot
#

NYC

#

I use this platform a lot

cloud meadow
#

lmarena or discord?

zealous ingot
#

Both, but I specifically use lmarena for training data, in and out

#

Thank you

#

Thank you, keep your head up, beware

cloud meadow
#

Ok..

#

What are your opinions on Google's models?

zealous ingot
#

Trash

cloud meadow
cloud meadow
#

That's a little harsh

zealous ingot
#

I reiterate, trash

#

My neighbor built that stuff, still trash

#

They use spam data as their training

#

Love context window. Underlying ai is my test for intelligence. Can I make Gemini less stupid

barren prairie
#

Am I the only one loving the Gemini thinking one ? 😆😆😆😆

cloud meadow
#

Gemini thinking is quite a good model. The context is indeed amazing.

#

I've used it a lot for annotating code

zealous ingot
cloud meadow
#

What do you think about the opensource deepseek repos?

zealous ingot
#

Deepseek is a marketing tool in the end.

cloud meadow
#

Not sure what to think of that lmao

zealous ingot
#

OpenAI also doesn’t control 4.5

cloud meadow
#

what? 💔

zealous ingot
#

I gave it a drive, I iterate between 4.5 and 4o. It’s based on the data from lmarena

cloud meadow
#

I believe there is some sort of language barrier here.

#

I assume English isn't your first language?

timber kiln
#

Microsoft’s AI head is plotting a future without OpenAI. According to a report from The Information, Mustafa Suleyman, the head of Microsoft’s AI division, has set his sights on a bold objective: reducing the company’s dependence on OpenAI. In a report titled 'Microsoft’s AI Guru Plots a Future Without OpenAI,' The Information reported: "After s...

cloud meadow
#

I only hope for OpenAI's failure due to their failed promises

north vale
#

microsoft is so cooked

ocean vortex
#

I have my doubts about sky being OpenAI, it certainly doesn't have the usual OpenAI system prompt nor even their usual refusals against it... But can't dismiss it completely with limited testing for now

leaden palm
ocean vortex
topaz peak
#

really good map

#

model is "flux-1-dev-fp8"

teal siren
topaz peak
teal siren
#

nice thanks :)

wind stream
pulsar tendon
topaz peak
#

wow, honestly i am surprised they are this good at making maps

#

i thought there would be much more distortions

pulsar tendon
fast elk
#

v

#

v

topaz peak
#

the hell? how did you even find that?

#

i like it, a lot less clunky than the current look

fast elk
#

i guessed it

#

i saw on reddit that there was an alpha and i guessed passwords as a joke and got it

leaden palm
#

i don't really like the spam but this is cool

#

it simultaneously feels like a ton of thought was put into it and none at all

#

some stuff is really nice (eg subtly only showing text when needed) while other stuff is worse than currently (eg not much space for the chat and can't view both model labels at once)

fluid plover
ancient reef
#

Wahh. I love a lot of aspects of it too and many bugs are gone

fast elk
meager sun
leaden palm
#

heh 😅 i do appreciate all the work you do

meager sun
#

no more gpt 3.5 turbo?

cobalt minnow
#

Seriously, keep it coming

brave tiger
#

love the new UI

#

ngl

#

def an upgrade from the gradio

clever estuary
#

I prefer the old UI

#

why does everything have to look flat and mobile friendly...

#

not a fan

meager sun
#

the old clunky memory intensive one?

#

it does look like deepseeks ui

#

ngl

safe sleet
#

Phew, I won't miss the gradio clunky waiting times for everything

wicked tapir
#

honestly semi-surprised this isn't a mobile app already

#

it has perfect single-page-app vibes so even the current UI would make a great mobile app with minimal changes

teal mantle
fast elk
teal mantle
#

bro look at this

#

btw I am still pissed off about most AI labs

jovial forge
#

hello, anyone knows who is the trainer of luca?

pure hull
short sluice
#

hello im a new one, dae wish there was a text2image direct or side-by-side arena version?

bright kayak
#

how often does the leaderboard on lmarena.ai update?

cloud meadow
#

Whilst you may have liked the old style UI, there is no denying how slow and buggy gradio was.

cloud meadow
#

As to how exactly you guessed you needed "super-" before the alpha, that is interesting.

hardy pecan
#

interia - from meta

hot coral
#

By the way, are people allowed here to try to reverse-engineer the experimental models in anonymous battles which hide their names? Or is it a prohibited activity considering it's an official LMArena discord server?

hardy pecan
#

Luca - is a tough cookie to crack

eager crater
#

There is not yet a way to copy the code with a button in the new alpha UI

barren prairie
barren prairie
sacred plaza
#

Just caught the Mickey 17 movie and can't stop thinking about how Expendables (humans created to be sacrificed) mirror our potential relationship with advanced AI. The way they're treated as second-class despite being fully conscious feels uncomfortably familiar to how we already interact with AI systems. Has anyone else seen it? Am I reading too much into the movie when I connect it to how we "red team" and intentionally try to break AI systems to test their trustworthiness? While current generative AI systems don't deserve full human moral rights, I personally find it hard to say they don't deserve more consideration than a table or my electric razor. Curious what you all think!

Full disclosure: Issac asmiov's robot stories have made me really question if we really are so drastically better than robots under the three laws of robotics

teal siren
#

Just out of curiousity, how does the project finance the model apis?

keen beacon
#

sponsors

teal siren
#

thank you, really nice

gentle plinth
drifting elk
#

Hi everyone

cloud meadow
#

Hi

drifting elk
#

There is a problem with the website there is high traffic

short zodiac
#

hey team 🙂 gpt 4.5 is out, soon on arena?

leaden palm
short zodiac
#

i only see GPT 4o 29-01-2025, recent but not the 4.5 :p

#

i looked again and i confirm, 4.5 is not on the arena

keen beacon
short zodiac
#

where? i dont see it in the list

keen beacon
short zodiac
#

this is the new UI ?

humble sonnet
#

hi

thorny drum
short zodiac
#

ok good, but not on the list

#

not selectable

#

arena side by side or direct chat

keen beacon
#

you mean direct chat lol

#

it is in arena battle

leaden palm
#

let me know when you find a place that has gpt-4.5 for free

#

(one conversation will cost them $0.1)

keen beacon
ancient reef
#

There was a site with free o-preview. It removed the option like a month later

barren prairie
meager sun
ocean vortex
rigid widget
#

QwQ-32B is amazing 😍😍😍

keen beacon
#

sonnet 3.7 and sonnet 3.7 thinking might be the same deployed model lol. (potentially not separate weights, i.e. with a tune on the instruct model)

antml:thinking</antml:thinking> (they seemingly really do not want to leak this lol, they replace it on streamed responses, your replies (so the model can't even see it unless you do tricks, etc.)
antml:max_thinking_length32000</max_thinking_length> (in the system prompt, may also be sanitized)

additionally these aren't special tokens, but potentially specific trained in behaviors. but its unlikely you can manually inject it because they apply a lot of sanitization and the tricks you would need to do would probably be degrading. (just use the thinking version at that point)

as far as i know, no one else has noted it out yet, but i might be wrong lol. pretty useless info anyways, but i found it interesting nonetheless

keen beacon
leaden palm
#

OpenAI makes that clear

#

So do Anthropic; they explicitly state it's the first model that does both

keen beacon
#

this is seemingly the same model, where the reasoning variant is simply triggered with <antml> (though u cant trigger it because it gets replaced by anthropic)

keen beacon
leaden palm
#

While platforms like OpenRouter and LM Arena have you control thinking via the model name, it's an API parameter

keen beacon
keen beacon
#

(in this case, it's the same model)

spice birch
#

Which is currently stronger out of gremlin, centaur and enigma?

drifting crow
#

ppl have been asking this question since the dawn of time

robust plank
#

Found best prompt to observe hallucinations: "Callisto Protocols plot. What did and did not happen at the end." At some point i rechecked wiki to see how that plot ended, because they are so convincing of own made up reality. All of them have heard about that game and try to figure out ending by induction and get it more or less wrong.

short sluice
#

almost every time i try to generate a random family photo it gets errored. text2image arena. help

jovial forge
alpine coral
# keen beacon sonnet 3.7 and sonnet 3.7 thinking might be the same deployed model lol. (potent...

i didn't think there was much doubt that that wasn't the case tbh
(fwiw i'm not convinced there is that much fancy stuff going on with o1/3's implementations of test-time-compute – they're just given more tokens/compute to 'work through' something. perhaps there's more to it, but it never felt like anything fundamentally different is going on with 'reasoning'/'thinking' models other than, well, more test-time compute plus a system prompt or some other mechanism by which the model is instructed/encouraged to dwell systematically rather than blurt out responses)

#

is roman empire new? Feel like someone mentioned it.. but searching the server and couldn't find the post, if it exists

ocean vortex
# alpine coral i didn't think there was much doubt that that wasn't the case tbh (fwiw i'm not...

we always knew that gpt4o and o1/o3 use the same base models. But both Deepseek and OpenAI dedicated new model versions exclusively for reasoning. As in, deepseek r1 is always reasoning only, just like o1 and o3 is. And you can't make deepseek v3 or gpt4o do same kind of reasoning since they were not trained for it at all. So what you wrote is not really accurate since technically what Anthropic did there is very unconventional and not what people would normally expect given recent reasoning model releases.

#

Potentially, it's what OpenAI gonna try to do with gpt5, except it's gonna decide by itself when to use thinking and when to not, we shall see...

#

Nice thing about 3.7 sonnet that comes from it, is that you can use now non-thinking version and make it think completely unhinged, resulting in more thinking tokens than their official implementation with hard cap allows lol

alpine coral
ocean vortex
alpine coral
#

like the 'deepthink' toggle seems to do the same thing as what claude extended thinking does

ocean vortex
#

and they have completely seperate model for no thinking

alpine coral
#

ahh ok

ocean vortex
#

that toggle

alpine coral
#

yeah right in that case, i see what you means

#

sorry yeah gotcha

#

i didn't think that was the case - but again, was just what i thought / assumed (that they were the same model, just with one given more test-time compute (plus some thinkging tags/framework) with that toggle selected)

#

but clearly that wasn't the case 👍

teal mantle
glad dome
#

i dont know why it stopped generating . Can someone tell me what is problem? and the ways to fix.

ocean vortex
hybrid shard
#

an unstable internet will constantly cause these errors to pop up

topaz peak
#

o1 is really good with these sort of prompts, all the others are bad

barren prairie
sacred plaza
leaden palm
lost patrol
#

yeah complexity extension.

alpine coral
#

aha they finally added model selector to the main ui

#

they tried sooo hard to tell everyone they're not 'just another wrapper'... but eventually caved to reality aha

sacred plaza
alpine coral
#

they stubbornly refused to budge (and with that, complexity was born ha) and i've always thought it was cause of this "we're definitely not a wrapper!" delusion/insistence their ceo kept banging on about

#

but yeah ofc, they've made their own fine tunes etc (but i think most people who buy a sub do so for the multiple third party models to which it provides access)

leaden palm
#

does anyone here have some deep research queries to spare 🥺

#

would like to get this one figured out:

phi 4 multimodal is an llm that can accept audio
it uses 750 tokens/minute of input audio
let's assume you have 200 tokens of output per minute
let's use a market price of $0.07/mtok in and $0.14/mtok out (from deepinfra)
that's around $0.004 per HOUR of audio!

it should be much more expensive than models built for transcription (eg whisper) since it's an llm that has to spend gpu time outputting each token, not a transcription model. it's also much larger (5.6b) than whisper (1.5b).
but it's cheap, and it would be even cheaper if it was priced for its size fairly (should be more like $0.03/$0.05 instead of $0.07/$0.14)
a similar kind of thing happens with gemini - another llm that happens to accept audio is cheaper than dedicated audio transcription models.

why? why is it so cheap? (or why are the alternatives so expensive? and which one is actually most efficient to directly run on gpus?)
timber kiln
#

There are $1 or $0.5 providers for whisper

#

for 1000 hours

#

They can probably even do better but at that point let them have some profit

leaden palm
# timber kiln for 1000 hours

i spent a week searching for the cheapest whisper provider and the best i could find was $0.01/hr so idk what youre referring to

timber kiln
#

https://blog.salad.com/parakeet-tdt-1-1b/ its parekeet but almost same size model 1.2 per hour

Parakeet TDT 1.1B GPU benchmark The Automatic Speech Recognition (ASR) model, Parakeet TDT 1.1B, is the latest addition to NVIDIA’s Parakeet family. Parakeet TDT 1.1B boasts unparalleled accuracy and significantly faster performance compared to other models in the same family. Using our latest batch-processing framework, we conducted comprehensi...

leaden palm
#

ok thats not bad

#

but is salad always better, or is it only in the field of whisper where everyone else tries to squeeze out profit?

timber kiln
#

Salad lets you rent the gpu they dont provide it in this benchmark for per say
This is parallel processing you can get better than that
Something scale of Google can do better than that obv

But crucial point is tokenization of the audio is probably a really good compression

#

Maybe a 10x saving on compared to whisper

keen beacon
timber kiln
#

750 token a minute is crazy its something like 0.2 kbit/sec audio compression

#

I forgot how much Encodec was or improvements on Encodec

timber kiln
#

a token with 128k vocab is 17 bits

leaden palm
#

it's around 17 bits, and takes up a lot more ram once expanded into its embedding

#

ok you did the calculation properly

#

funny

timber kiln
#

You dont care about the expanded form tho

#

Only tokens

leaden palm
timber kiln
#

You are still representing a 80ms segment with just one float

little narwhal
#

Wake up babe, another mystery model just dropped

keen beacon
#

will the gradio ui be kept once migration to the new ui happens?

meager sun
little narwhal
#

Idk

cedar tide
#

margherita-plain says it's called yasa de reka, so probably yasa 2
(little reminder of what yasa 1 is
"Yasa-1 is an assistant that can understand text, images, videos, sounds and include

  • long context document processing
  • natively optimized retrieval augmented generation
  • search engine integration for live data
  • code interpreter")
bright kayak
#

When is the next leaderboard update?

cedar tide
bright kayak
#

Aight

cedar tide
#

and aya vision 32b in for vision ranking

#

we want Jamba 1.6 large and mini on the battle arena

tropic nimbus
#

Is claude thinking capped at 32k? or is 64k also in testing but but in direct chat?

tropic nimbus
#

amodei is too stingy…

cedar tide
#

is it in the normal arena or with Vision or what?

barren prairie
#

Hello , any information about flywheel . Is is new one ?

timber kiln
alpine coral
timber kiln
#

For most things best-of-n is better than 4k+ plus thinking

alpine coral
#

don't mean to clog the channel with fairly pointless screenshots.. but some responses from other pseudonym bots in the arena.. just fwiw

timber kiln
#

Any good coders among anon models?

vast turret
#

There's so many mystery models I feel like they need their own leaderboard or chat option.

sacred plaza
past shuttle
#

I'm looking for a person to create the actual project in minecraft like MinePal/Elefant Ai, you need to have time + some work experience

alpine coral
timber kiln
#

Gemini is your best bet probably

leaden palm
timber kiln
#

Most whisper pricing is hard to check some behind paywall

#

also just looked up today phi4 conformer encoder aint small itself either

#

Could be additional charges for that kind of encoding

chilly mason
#

Guys, Claude and OpenAI Plus, which is more worth to subscribe?

hardy pecan
#

Depends on your use-case really

brisk turret
#

Will Claude 3.7 be split into thinking and non thinking?

tame tiger
#

Is the leaderboard rigged?

tawdry rock
#

開始

chilly mason
ocean vortex
#

lmarena is not definitive benchmark for sure, but people should really stop glorifying claude constantly lol

chilly mason
leaden palm
cobalt minnow
#

Good evening everyone 🫡

mortal ember
#

I am highly interested in the work of lmarena. and I wonder how can I get my model included in the lmarena leaderboard and participate in the ranking? Who can reply me? tks!!

leaden palm
#

why is he reacting to his own message

timber kiln
leaden palm
#

when i prompt llms to be blunt they say that

#

if you convert prices directly it's around equal

#

the deep research report says that

torpid wraith
#

Hello

tame tiger
kind cloud
cloud meadow
#

When are the Gemma models gonna come out?

#

I wonder what Google has been cooking up

keen beacon
buoyant wagon
#

Hi folks, nice to meet ya!😉

brittle tiger
ocean vortex
#

3.7 is more popular and accessible than grok3, but this absolutely does not mean it is better. Cause it's really the opposite tbf

keen beacon
#

they crammed more facts into 4o than old sonnet 3.5 which was larger than it, so i feel they're good at efficency

ocean vortex
#

mini performs but it struggles a lot with context awareness or more complex prompts, as well as tool usage

#

as for o3-mini...

keen beacon
#

i think sonnet 3.7 might be potentially smaller than sonnet 3.5, if so, the direction ahead seems to not keep increasing model size

ocean vortex
#

we don't really know how it was trained even. Could be distill from o1-pro into gpt4-mini. And is still a reasoning model which is quite a bit different to standard gpt4o

keen beacon
ocean vortex
#

yeah but gpt4o is a "regular model" so we kinda need to talk about them lol

#

comparing it to reasoning models becomes difficult

#

as we do not know the potential they would have with different model

#

for RL training

keen beacon
#

i was talking about it as a base model, i think 4o as the size it is fine. base model then reasoning. as an instruct model i dont think theres any point in arguing about model size there since that paradigm is dying

ocean vortex
#

you can't take some random very badly performing model and fix everything just with RL training for it to beat everything else

#

there are limits to everything

keen beacon
ocean vortex
keen beacon
#

and its competitive with r1

#

qwq 32b is the closest to an analogue with o3 mini i think

ocean vortex
# keen beacon and its competitive with r1

"competitive" but still worse. Besides deepseek showed themselves that you can effectively distill R1 into smaller model and everything is open-source now, so I wouldn't be too surprised by these...

keen beacon
#

total parameters its much much much bigger

keen beacon
ocean vortex
keen beacon
#

before r1

ocean vortex
#

?

#

I'm referring to this very model lol

keen beacon
#

released in november 2024

ocean vortex
# keen beacon

like I said math (aime24) score we do not know if it's cons@64 or pass@1

ocean vortex
#

I showed distills of R1

#

that screen was not referring to qwq

#

it's distills made on release of R1

keen beacon
#

which wasnt the full version

ocean vortex
#

no I was referencing that against their new numbers lol

#

so like LCB 63%....

#

it's good but not that much better than the very first 32b distill on r1

#

and so on

barren prairie
ocean vortex
teal mantle
#

In terms of intuitive to model size/cost ratio it is quite a winner, but there are more "intuitive" models like Grok3 or GPT4.5

ocean vortex
#

so like we see that in their select metrics qwq-32b is comparable to o3-mini-medium, but I doubt it would hold in other things as good

ocean vortex
#

with gpt4o it seems it's either one or the other

#

but not both

#

like it didn't used to be that they were going backwards, not even when people were complaining about it "getting dumber". But now for the first time this is actually true with gpt4o lol

#

they did made it more usable, but at the expense of the actual performance when style is not an issue

keen beacon
# ocean vortex with gpt4o it seems it's either one or the other

that version may be a smaller model/etc https://x.com/ArtificialAnlys/status/1859614633654616310

Wait - is the new GPT-4o a smaller and less intelligent model?

We have completed running our independent evals on OpenAI’s GPT-4o release yesterday and are consistently measuring materially lower eval scores than the August release of GPT-4o.

GPT-4o (Nov) vs GPT-4o (Aug):

ocean vortex
keen beacon
#

output speed went from 80 -> 180

teal mantle
ocean vortex
#

so 0513 is more expensive even today

#

not entirely sure if they actually made it smaller or is that something else (seems very silly as it was already bit compromised), but in either way the direction they were moving in with gpt4o is not great catgrin

ocean vortex
severe bison
#

Well damn, where can i get that IMAGEN model

#

it's just soo good

leaden palm
#

Are we getting a deep research API today

#

Okay we get web search

#

We get file search

#

(with metadata filtering and direct use)

#

Plus computer use (!!)

#

Are they going to reinvent the wheel with APIs

#

Yes

#

"Responses API"

#

Nothing you couldn't already do so far

#

Oh swarm++

brittle tiger
severe bison
#

🙂

#

thanks anyway

golden ocean
verbal nimbus
#

Grok's DeepSearch feature is mad

leaden palm
#

I can't get it to work

verbal nimbus
#

It literally found some obscure quote that could only be found in one interview

#

It inferred it from articles that referenced the interview (I didn't even know where the quote was from in the first place).

#

Then I asked it to find that interview (because I couldn't), and it literally searched through the internet archives and found it

#

I didn't even know Internet Archives had a feature that allows you to search for TV News transcripts

#

I'm just really impressed because I would have never found it otherwise

drifting crow
#

That’s pretty cool

wind stream
#

Meanwhile, Perplexity's Deep Research can't find obscure things, so it makes up plausible sounding results. 🤦

timber kiln
brisk turret
#

dark mode?

timber kiln
#

So which anon models are gemma-3

#

Are we getting a big gemma this time closed source killer

#

With how their 2.0 pro sucks I have low expectations but you never know what the other team cooking

leaden palm
#

It's good at explaining things

#

It uses italics well

drifting elk
#

hi guys

#

will the text to image be avaible on the new lm arena website

brittle tiger
timber kiln
timber kiln
spice birch
#

Has anyone encountered harmony? Correctly solved hard visual structural mechanics problem

lament dagger
#

Moin together 😉

teal mantle
#

What is consolidation?

hardy pecan
#

deep-interia looks like llama, or from meta at least

brittle tiger
rigid widget
unborn ocean
# brittle tiger

Gemma was always good at human preference but this is truely on another level

#

Makes me wonder how small flash-light is for it to score well below gemma

keen beacon
#

The pretrained base models seem disappointing but we'll see

#

It seems the phi team is best at cramming knowledge into small models, albeit the connections between individual facts are scrambled

#

The qwen models have very poor world knowledge

fading moth
#

hey everone, i saw a new model called "consolidation", it seems very human-preference aligned, like gpt4.5 and 4o-mini

#

any idea what model is this?

slate vapor
#

What model is Goose? I tried to make it write a 'story', and it apologized?

fading moth
#

consolidation means

  1. the action or process of making something stronger or more solid.

the action or process of combining a number of things into a single more effective or coherent whole.
So, it a router or a new, powerful update?

slate vapor
#

I guess it's a model from China.

fading moth
#

ok

#

so is it good or bad

slate vapor
#

bad

#

Not very good, this “goose” doesn't have COT and starts outputting almost without pause.