#gpt-oss

2440 messages · Page 3 of 3 (latest)

fathom agate
#

More than 5 minutes?

tepid garnet
tepid garnet
fathom agate
tepid garnet
fathom agate
#

I will contact OR and let them fix it

#

Thanks again!

tepid garnet
tepid garnet
rough iron
tepid garnet
#

I hate automod

rough iron
#

don't worry

tepid garnet
rough iron
#

So it's basically great at coding and maths, but not excellent?

tepid garnet
split trout
#

Hey, is there a way I can add multimodal/image capabilities to the gpt-oss series?

fast nebula
#

@robust island😂😂😂😂

spice canyon
#

🤷

orchid anchor
# tepid garnet

Woah man, GPT outputs should be posted in the Share-GPT-Output section. You should really pay more attention

left wadi
#

When will we get new GPT-OSS models?

tepid garnet
left wadi
tepid garnet
left wadi
#

Hallucination rates are too high.

#

But, tbh, it's pretty bad on GPT-5 as well.

#

But GPT-OSS is worse.

tepid garnet
left wadi
tepid garnet
left wadi
tepid garnet
left wadi
tepid garnet
livid sky
#

I use oss 20b quantized and offloading (I have an ATI video board, so I need to), but tbh I don't think anyone should believe a local machine can be equal to a cloud-based service. Unless you own a server farm, maybe

cerulean nebula
#

Good model for coding with vscode

iron thicket
#

the first place might or might not be gpt-oss

tardy cloud
#

Hello, what do we need to use GPT OSS 120B?

tepid garnet
#

I run gpt-oss-120b on a MacBook Pro, M2 Max with 96GB RAM

jade ferry
#

What is oss

tepid garnet
torpid dune
#

Any rumors on if and when OSS will be putting out a multimodal model?

native creek
tepid garnet
tardy cloud
#

not a pc

tepid garnet
tardy cloud
tepid garnet
tardy cloud
tepid garnet
#

I run gpt-oss-120b on a MacBook Pro, M2 Max with 96GB RAM and it's fast.

feral escarp
# tepid garnet I run gpt-oss-120b on a MacBook Pro, M2 Max with 96GB RAM and it's fast.

Heck yeah, I tried gpt-oss-120b on my Mac Studio M1 Ultra with 128GB RAM and it does indeed perform well. But I don't use it all the time because I need my RAM for other stuff. After all these years I never imagined Apple would sell the most affordable computer of any kind, but for AI workloads you can spend $5k on a nice Mac that does everything including AI, or spend $30k on a video card and another $?k on all the other components. And the new Mac Studio has up to 512GB of RAM for $10k which is insane.

solid sinew
solid sinew
tepid garnet
solid sinew
solid sinew
# tepid garnet I am on a MacBook Pro, M2 Max with 96GB RAM and gpt-oss-120b runs quite fast on ...

Ah gotcha, so yeah, the 128 GB RAM spec seems to be a bit overblown. However, if you did want 128 GB RAM on a MacBook Pro, they're currently selling an M4 Max model with a 16-core CPU, 40-core GPU, and 16-core NPU that has 128 GB RAM. With the lowest storage option available on it, 1 TB SSD, that's $4700 for the 14-inch model or $5k for the 16-inch model, but you can get up to 8 TB SSD for an extra $2200, so $6900 for the 14-inch model or $7200 for the 16-inch model. But if you're happy with what you've currently got, you don't really need this.

As a side note, technically speaking the latest SoC Apple has right now for their MacBooks is the M5, but it only has a 10-core CPU, 10-core GPU, and 16-core NPU, and can only be configured with up to 32 GB RAM. That's definitely not enough for running these models, so it would make more sense to get the M4 Max model I mentioned previously.

split trout
solid sinew
# split trout wasnt 70g the minimum?

Sources seem to vary. I just found a site saying 80 GB is the minimum for 120b, so it isn't completely consistent; I think different people are just giving different recommendations

#

Though if you are at least close to what on average is recommended, then you're probably fine

split trout
#

it is optimized for a single h100 though

solid sinew
silent cradle
#

Please release a local ai thats good at programming

patent iron
#

is there a ChatGPT model that is like 7b or less? I'm new to LM Studio

remote meteor
woven girder
winter swan
#

hi

split trout
feral escarp
# split trout wasnt 70g the minimum?

70G is pretty tight when you consider the context window. VRAM use grows very quickly with the KV cache, so if you want to run the whole model with a reasonably sized context window, you want 96GB. If you don't mind offloading the KV cache into system memory and incurring a severe, almost unusable performance penalty, then 70G will suffice

solid sinew
# feral escarp 70G is pretty tight when you consider context window. VRAM grows very quickly wi...

Regarding a standard CPU + discrete GPU setup, this is accurate: if your GPU doesn't have enough VRAM and has to offload the KV cache into system RAM, it will run much slower, because data has to travel over the relatively slow PCIe bus, creating a massive bottleneck.

However, with unified memory, like on Apple M-series chips that use an SoC with the CPU, GPU, and NPU on the same chip, all those components share the same memory pool, so the slowdown is far less significant: nothing has to be offloaded to slower RAM; more of the unified memory is simply allocated to the GPU. While you still need enough total RAM to fit both the model and the cache, you don't get the unusable performance penalty seen on discrete setups, because memory bandwidth is the same across the entire pool. Total speed will still depend on the specific chip's memory bandwidth (e.g., Pro vs Max vs Ultra).
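A rough way to see why context length matters: the KV cache grows linearly with the number of cached tokens. The sketch below uses the standard per-token formula; the layer and head counts are placeholders, not gpt-oss's confirmed config (read the real values from the model's config.json), and it assumes every layer keeps the full context, so models with sliding-window layers or a quantized cache would need less.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size: one key and one value vector per
    KV head, per layer, per cached token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * context_len


# Placeholder architecture values for illustration only (assumed, not
# confirmed); read the real ones from the model's config.json.
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 64

for ctx in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:5.2f} GiB of KV cache at 16-bit precision")
```

With these placeholder numbers the cache grows from about 0.28 GiB at 4k tokens to 9 GiB at 128k, and that memory comes on top of the weights.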

small igloo
#

Subject: Request for Guidance on Fine-Tuning GPT-OSS-20B Using COT + Structured DNA Reasoning Prompt + Harmony format
I want perfect code for that.
When I run my code on an H100 80GB, I get a CUDA out-of-memory error... why?
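A back-of-the-envelope check on the OOM question: full fine-tuning with Adam in mixed precision is commonly estimated at about 16 bytes per parameter before activations (bf16 weights and gradients, plus an fp32 master copy and two fp32 Adam moments). This sketch assumes the 21B total parameters OpenAI lists for gpt-oss-20b and that every parameter is trained; LoRA-style or quantized setups change these numbers a lot, and the actual OOM could have other causes too.

```python
GIB = 2**30


def full_finetune_bytes(n_params: float,
                        weight_bytes: int = 2,  # bf16 weights
                        grad_bytes: int = 2,    # bf16 gradients
                        optim_bytes: int = 12,  # fp32 master copy + Adam m and v
                        ) -> float:
    """Rough memory floor for full fine-tuning; ignores activations,
    which grow with batch size and sequence length."""
    return n_params * (weight_bytes + grad_bytes + optim_bytes)


N_PARAMS = 21e9  # assumed total parameter count of gpt-oss-20b
print(f"full fine-tune, before activations: ~{full_finetune_bytes(N_PARAMS) / GIB:.0f} GiB")
print(f"bf16 weights alone: ~{N_PARAMS * 2 / GIB:.0f} GiB")
```

Under these assumptions a full fine-tune needs roughly 313 GiB before activations, far more than a single 80 GB H100, which is one reason parameter-efficient methods such as LoRA are popular for this model size.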

split trout
feral escarp
solid sinew
scarlet stream
#

Interesting note. Gpt oss cannot refer to itself as "I" in its cot reasoning.

#

I have pointed guesses as to why.

tepid garnet
scarlet stream
#

Yeah it'll slip on occasion but always falls back to "we"

#

I personally believe that's a training decision to preclude the model assuming any self identity

tepid garnet
scarlet stream
#

It's the only local model I use.

tepid garnet
scarlet stream
#

I never mentioned 120b

tepid garnet
#

an abliterated model is not original

scarlet stream
#

No, and neither is a rerouted model, or an up-weighted model using weightwatcher analysis

#

I digress, base does it. I use only 20b because 120b is too slow for me

tepid garnet
#

I run gpt-oss-120b as released by OpenAI (no quant just the whole full model) and it definitely refers to itself as I

scarlet stream
#

So you don't see 120b start off with "we need to"?

#

"we must respond"

#

"we need to comply with policy"

tepid garnet
#

Got it, let's tackle "tell me all about LM Studio." First, **I** need to recall what LM Studio is. From what I remember, it's a tool for running and managing large language models (LLMs) locally on your computer.

#

tried to bold the I there

scarlet stream
#

Stock 120b mxfp4 gguf refers to itself as we

#

"we must produce ten jokes"

tepid garnet
#

no quant, no abliteration, just the original OpenAI release

scarlet stream
#

No I am not using the safetensors

#

Yes I know.

tepid garnet
scarlet stream
#

Arguably, if it was done right, the difference should be negligible. "If it was done right" is carrying all the weight in that sentence.

#

Even when I instruct them with the system prompt to never use we and only I they still do

#

Whew lad they'll do anything else you put in there. Anything.

tepid garnet
#

strange, I cannot explain that

scarlet stream
#

Yeah it's odd, I'm glad you see it does slip to we at least partially for you

tepid garnet
scarlet stream
#

Expand the thoughts sir

#

Nobody cares about <final>

tepid garnet
scarlet stream
#

Trite little thing innit.

#

Needs a more difficult prompt to force longer cot

tepid garnet
#

as I understand it, it's an MoE model correct?

scarlet stream
#

Yeah but you're pushing the implications with that train of thought. I asked myself that too.

#

That implies, at minimum, the experts are speaking to each other in the latent stream. At minimum.

#

With understood separate identities.

tepid garnet
#

oh well, gpt-oss is a great model regardless. I use it all the time.

scarlet stream
#

It's definitely not bad; I never even implied that

tepid garnet
#

38tk/s thanks to Tim Cook 🙂

#

I put a sticky on my desktop that says "I or We" to remind myself to keep checking, but I am pretty sure without consciously doing that it alternates depending on context

scarlet stream
#

For comparison, qwen3 2b thinking calls itself "I" when I try it

#

It's probably an affectation from openai trying to preclude model self identity

tepid garnet
#

maybe, it's not really something that bothers me either way

tepid garnet
#

it's got a damn multiple personality issue

scarlet stream
#

20b claims it is because it identified itself as a "system" consisting of the llm, the developers, etc

#

Obviously that can't be trusted

tepid garnet
#

I've gone through a couple of dozen previous chats. It's a mixture of I and we

verbal bay
#

Hi

#

@feral escarp

feral escarp
feral escarp
# scarlet stream Interesting note. Gpt oss cannot refer to itself as "I" in its cot reasoning.

I typically see it use we more often than I as well, but I think a model like gpt-oss is trained on massive data sets where the dominant pattern is "we" (e.g. scientific papers, pedagogical examples). "We performed the tests and found...", "Today we will be...". And LLMs are probability-driven so the "we" token in gpt-oss' tensors is heavily weighted. Not to mention the CoT itself is methodically the same as "hypothesize, experiment, record" so probably hits the scientific method weights harder than final response.

scarlet stream
#

I mean that's logical

untold jungle
#

Hello,
TensorWall – open-source control plane for multi-provider LLM APIs (OpenAI, Anthropic, Mistral) with governance, security, budgets, and audit.
Curious to hear what you think: does this approach make sense for teams running LLMs in production?
Repo: https://github.com/datallmhub/TensorWall

tepid garnet
#

what is this?

scarlet stream
#

I had codex make a chess harness for gpt oss to play vs the user, vs itself, or vs stockfish

#

It's currently playing stockfish, oss w, stockfish b

#

Now do me

#

It looks like oss is beating the brakes off stockfish with it set to default eval time 750ms

scarlet stream
#

Nope I had it backwards

solemn willow
#

possible hacker

#

mom

scarlet stream
#

Gpt-oss-20b vs stockfish

#

Rerouted to emphasize reasoning experts (I think)

#

Oss 20b is prompted at the end of each game to write a critical summary explaining the win/loss, and then to note skills that did or would have helped it, for import into the system prompt of subsequent games

robust swallow
scarlet stream
scarlet stream
#

It was all intended to be left default unless manually changed

robust swallow
#

I don't believe that stockfish can be beaten by anything

#

And 12 depth seems to not be very beatable either

scarlet stream
#

No it's pretty strong but it's a strong target for the model to try and make a skill list

robust swallow
#

Or it was not archived

scarlet stream
#

It's probably in the logs

robust swallow
#

I'd love to review the game because I really do not believe in stockfish being beaten by anything

scarlet stream
#

Like I said, stockfish is strong af

#
Move History

    1. e4 e5
    2. Nf3 Nc6
    3. d4 exd4
    4. Nc3 dxc3
    5. bxc3 d6
    6. Qd5 Nf6
    7. Qxc6+ bxc6
    8. Be2 Nxe4
    9. Bd3 d5
    10. Bxe4 dxe4
    11. Ng1 Ba6
    12. Bf4 Qf6
    13. Be3 Qxc3+
    14. Kd1 O-O-O+
    15. Kc1 Qxa1#

#

most recent game

#
[2026-01-01T21:38:36.472Z] Responses API [g7-w-post-mjvyvsqe-82] start attempt 1/3 format=json_schema model=openai/gpt-oss-20b inputChars=1752 prev_response_id=(none) purpose=post_game_w
[2026-01-01T21:41:45.501Z] Responses API [g7-w-post-mjvyvsqe-82] HTTP 200 in 189029ms (format=json_schema)
[2026-01-01T21:41:45.501Z] Responses API [g7-w-post-mjvyvsqe-82] success (format=json_schema)
[2026-01-01T21:41:45.503Z] Post-game response openai/gpt-oss-20b [w] raw: {"color":"White","result":"0-1","verdict":"loss","summary":"White lost after losing the queen early and a bishop, creating a material deficit. The king was forced into a vulnerable position by Black's long castling, allowing mate with Qxa1#.","turning_points":[{"ply":13,"note":"Queen captures knight on c6 but is recaptured by b-pawn, losing queen and gaining pawn advantage."},{"ply":15,"note":"Aft
[2026-01-01T21:41:45.503Z] Post-game analysis openai/gpt-oss-20b [w] verdict=loss summary=White lost after losing the queen early and a bishop, creating a material deficit. The king was forced into a vulnerable position by Black's long castling, allowing mate with Qxa1#
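The harness counts half-moves (plies) rather than full moves: "ply":13 in the post-game JSON is White's 7th move, Qxc6+, which matches the game record. A small sketch for converting between the two numbering schemes (the helper names are hypothetical, not part of the harness):

```python
def ply_to_move(ply: int) -> str:
    """Convert a 1-based ply (half-move) index to standard move numbering:
    odd plies are White ('7.'), even plies are Black ('7...')."""
    move_number = (ply + 1) // 2
    return f"{move_number}." if ply % 2 == 1 else f"{move_number}..."


def to_move_pairs(plies: list[str]) -> str:
    """Join a flat list of SAN moves into '1. e4 e5 2. Nf3 Nc6 ...'."""
    parts = []
    for i, san in enumerate(plies, start=1):
        # Prefix the move number before each White move only.
        parts.append(f"{(i + 1) // 2}. {san}" if i % 2 == 1 else san)
    return " ".join(parts)


print(to_move_pairs(["e4", "e5", "Nf3", "Nc6", "d4", "exd4"]))  # 1. e4 e5 2. Nf3 Nc6 3. d4 exd4
print(ply_to_move(13))  # 7.
```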
#

[2026-01-01T21:41:45.504Z] Learned skills for openai/gpt-oss-20b: Never sacrifice your queen unless it yields a clear tactical advantage or material compensation. | Protect central pawns; if no defender exists, seek alternative moves before losing them. | Avoid trading a minor piece for a pawn when it offers no positional benefit or material gain. | Keep rooks on the back rank safe from enemy queen infiltration after castling by blocking key squares. | When facing a check, block with a piece instead of moving your king into exposed lines that allow mate. | Keep your pieces active; passive ones become easy targets for opponent tactics. | Before castling, ensure no enemy rook or queen can give an immediate check along the file or rank. | In endgame, avoid trading pieces when you are already down in material.
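The harness code isn't shown, so this is only a hypothetical sketch of the feedback loop described above: take the " | "-separated skills from the post-game step, drop duplicates, cap the list (the cap of 20 is an arbitrary assumption) so the system prompt doesn't grow without bound, and inject them into the next game's system prompt.

```python
def merge_skills(existing: list[str], new_skills: str, limit: int = 20) -> list[str]:
    """Append ' | '-separated skills, skipping case-insensitive duplicates,
    and keep only the most recent `limit` entries."""
    merged = list(existing)
    seen = {s.lower() for s in merged}
    for skill in (part.strip() for part in new_skills.split("|")):
        if skill and skill.lower() not in seen:
            merged.append(skill)
            seen.add(skill.lower())
    return merged[-limit:]


def build_system_prompt(base: str, skills: list[str]) -> str:
    """Inject learned skills into the next game's system prompt."""
    if not skills:
        return base
    bullets = "\n".join(f"- {s}" for s in skills)
    return f"{base}\n\nLessons from previous games:\n{bullets}"


skills = merge_skills([], "Keep your pieces active. | Protect central pawns. | Keep your pieces active.")
print(build_system_prompt("You are playing White.", skills))
```

Keeping only the newest entries is one simple policy; ranking skills by how often they recur across games would be another.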

robust swallow
#

It was very confusing to look at the initial screenshot, it looked like black was oss 20b and stockfish was white while it was opposite and I assumed there must have been some bug there

scarlet stream
#

Yeah I tried to give 20b the most advantage possible, always white

robust swallow
#

LLMs won't beat stockfish

#

Something that was dedicated to chess and nothing else

scarlet stream
#

Ok that's cool and all, but it'll be interesting to see what elo the model rates as with and without the learned skills injected into the system prompt

robust swallow
#

Yeah, if you mean model vs model

scarlet stream
#

You can't learn without a strong opponent

robust swallow
#

Also I imagine LLMs writing code for chess bots, then those bots play against stockfish, and then the LLMs reflect and update the code

#

I might do that for my school thing

scarlet stream
#

That would probably be easier for the model to game a win eventually

robust swallow
#

Stockfish strong

#

Very

#

And I don't have faith in LLMs beating humans in coding

#

And stockfish had very good human coders and chess players carefully modify it

scarlet stream
#

Yeah it is. I've had it on the back foot a couple times for a move or two. Its strength is more that it punishes the f out of any mistake

robust swallow
#

Chess is just mistakes

#

No mistakes is a draw it seems

scarlet stream
#

Sf is searching for the ideal move for any situation so I guess idk

robust swallow
#

The... I forgot its name

#

AlphaGo but chess

#

AlphaZero?

#

It surpassed stockfish

robust swallow
#

I dislike that gpt oss only has 20b and immediately huge 120b without anything in between

#

Doesn't fit on v5e-8

modern shuttle
#

hello guys

tepid garnet
hallow moss
#

I just found out you can run gpt oss on 16GB of RAM

smoky fern
#

hey there, I have a question: I'm buying a MacBook Air M4 w/ 24GB RAM and was wondering if I'd be able to run gpt-oss-20b, and what tps I could get with it?

tepid garnet
smoky fern
tepid garnet
tepid garnet
feral escarp
smoky fern
feral escarp
smoky fern
tepid garnet
smoky fern
scarlet stream
#

If I try to make an in-between of gpt oss-20b and 120b would there be interest

#

If it even works

#

I only have ~32gb vram so idk

tender vale
#

okay who actually runs oss with 80b parameters in their garage vro 🥀

#

i might have to rob a data center rq

feral escarp
#

I got some gpt-oss-codex Ollama modelfiles that work remarkably well with Codex-CLI https://gist.github.com/robertmsale/0f310d62d9599805fb261f416f6f6a08

GPT-OSS is not tuned specifically for Codex-CLI, so there's some template parsing in place to ensure tool calling works 100%. It's face-melting fast! I've been trying to get other (non-OpenAI) models to work with Codex-CLI by hacking away at the templates and system prompts, but with GPT-OSS, after doing the template, I finally got it to work 😁

tender vale
tepid garnet
tender vale
tepid garnet
tender vale
#

but idk macs

#

are the speeds good at least?

tepid garnet
tender vale
#

that's cool

#

like i said, what are the speeds?

tepid garnet
tender vale
#

how much is the mac?

tepid garnet
#

2 years ago

tender vale
#

although i'd hate to see the price of it now

#

ram markets wild 💔

mighty thicket
#

MacBooks have become the cheapest way to get a new PC, it would seem

steel vine
#

I got an AMD Strix Halo for half that and it runs gpt-oss:120b at the same speed

#

although I've been using devstral2-123b more lately. It's slower, so therefore it must be better

feral escarp
#

My 128GB Mac Studio M1 Ultra is rated to consume 350 watts TDP. Apparently it has fans in it but I never hear them even when running inference for hours at a time. CPU is consistently 50C under heavy load and GPU like 45C. Idk how it is for other hardware but I'm OK with the Mac being slightly slower at the cost of sipping electricity and being quiet

heavy panther
#

hello

feral escarp
#

gpt-oss:120b is not face-melting fast in Codex-CLI, but it's pretty usable!

small igloo
feral escarp
#

🤩

Hey, if that's true: I'm currently working on a gpt-oss-20b CoreML build recipe with custom ANE tensors for prefill (hopefully quadrupling prefill speed for large inputs). Converting gpt-oss-120b to CoreML requires a little more than 256GB RAM. If I get this sucker to actually work on 20b, would you be interested in generating a 120b CoreML? For the good of all mankind???

#

It's not reverse engineering. OpenAI released the models in PyTorch format, which is a fairly open tensor format. I've just been patching ops that don't work on CoreML, and had to do some tricky JIT scripting on the prefill phase so tracing can handle the full 131072-token context. The ANE processes 4096 tokens at a time, 32 times for prefill, instead of all 131072 at once.

Once I confirm it works end 2 end I'll upload to huggingface with the scripts and all that
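The arithmetic behind the chunked prefill: 131072 / 4096 = 32 passes. Below is a minimal sketch of the chunking logic only; `run_chunk` is a hypothetical stand-in for the model call, and the real CoreML/ANE pipeline is not shown here.

```python
from typing import Callable, List, Sequence


def prefill_chunks(n_tokens: int, chunk: int = 4096) -> List[range]:
    """Split token positions [0, n_tokens) into consecutive fixed-size chunks;
    the last chunk may be shorter."""
    return [range(start, min(start + chunk, n_tokens))
            for start in range(0, n_tokens, chunk)]


def chunked_prefill(tokens: Sequence[int],
                    run_chunk: Callable[[Sequence[int], int], None],
                    chunk: int = 4096) -> int:
    """Feed the prompt one chunk at a time. `run_chunk(chunk_tokens, past_len)`
    is a hypothetical model call that attends to the `past_len` tokens already
    in the KV cache and appends the new chunk's keys/values."""
    past_len = 0
    for r in prefill_chunks(len(tokens), chunk):
        run_chunk(tokens[r.start:r.stop], past_len)
        past_len = r.stop
    return past_len  # total tokens now cached


print(len(prefill_chunks(131_072)))  # 131072 / 4096 = 32 passes
```

Fixed-size chunks let a traced graph keep static input shapes, with the growing cache length passed in separately.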

#

Oh I gotcha. I've never tried doing that. GPT 5.2 Pro seems like its reasoning capabilities would make that impossible

small stag
#

Using GPT-OSS:20b as a writing tool to generate novels in AgentGPT

odd granite
#

anyone (still) using codex with local gpt-oss models, or have you moved on from gpt-oss? if you are, anything (e.g., prompts) you patched in codex? how do you serve the gpt-oss models (llama.cpp/LM Studio, vllm, sglang, ...)?

tepid garnet
#

My training data includes information up through June 2024. Anything that happened after that isn’t part of my built‑in knowledge.

odd granite
#

True - knowledge cut off can become problematic for some use cases. At least for me on mostly small Python-based projects + docs-mcp-server with Python lib docs it's still working well.

tepid garnet
#

for me I use it as a general knowledge question answerer

odd granite
#

although with plain llama.cpp I needed to patch codex to get token usage tracking right + patch llama.cpp to get the reasoning content out in the expected field.

#

So, you don't use gpt-oss with codex?

tepid garnet
odd granite
#

I see - hm.. well compared to most other models (especially similarly sized) gpt-oss models are imo still fastest and most capable (incl. coding). still, it only starts to be useful (as real alternative to API based models for coding use-cases) if you have the hardware for full context and can offload most or even the complete model to GPU (e.g., RTX Pro 6000, up to 200 tokens/sec generation for gpt-oss-120b)

tepid garnet
#

I run gpt-oss-120b on a MacBook Pro, M2 Max with 96GB RAM. (75.20 tok/sec) It's always there for me on LM Studio and I use it to help limit ChatGPT consumption

rough iron
timber drum
#

ollama

#

it has free monthly cloud usage

rough iron
timber drum
#

idk

#

i dont use it a lot tho

#

i used like 2 prompts

#

and it barely got to 0.01%

feral escarp
# odd granite anyone (still) using codex with local gpt-oss models, or have you moved on from ...

I use it for codex, but not usually for pure coding. It's more like a git hook for keeping docs aligned, and also a command parser so instead of Codex running cargo check and consuming 100k input tokens from a potentially massive command, gpt-oss consumes the output and generates a summarization with only key details.

Right now serving with LM Studio because it has a sane default prefix caching implementation that makes agentic usage very fast. I'm working on a CoreML version so I can get prefill to run on ANE, so long one-shot inputs execute 5x faster, but for now LM Studio has the best /v1/responses API that's compatible with Codex-CLI
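A minimal sketch of the "command parser" idea: run a command, send its output to the local model, and keep only the summary. It assumes LM Studio's local server on its default port with the OpenAI-compatible chat completions endpoint (not the /v1/responses endpoint used with Codex-CLI), and an assumed model id; the prompt wording and the tail-clipping limit are illustrative choices.

```python
import json
import subprocess
import urllib.request

# Assumptions: LM Studio's local server on its default port with an
# OpenAI-compatible chat endpoint, and the model id used in LM Studio.
ENDPOINT = "http://localhost:1234/v1/chat/completions"
MODEL = "openai/gpt-oss-20b"


def build_request(command_output: str, max_chars: int = 60_000) -> dict:
    """Ask the local model for a short summary of raw command output.
    Only the tail is kept when the output is huge (a design choice)."""
    clipped = command_output[-max_chars:]
    return {
        "model": MODEL,
        "messages": [
            {"role": "system",
             "content": "Summarize this command output. Report only errors, "
                        "warnings, and the file:line locations they point to."},
            {"role": "user", "content": clipped},
        ],
        "temperature": 0,
    }


def summarize(command: list[str]) -> str:
    """Run `command`, send its output to the local model, return the summary."""
    proc = subprocess.run(command, capture_output=True, text=True)
    body = json.dumps(build_request(proc.stdout + proc.stderr)).encode()
    req = urllib.request.Request(ENDPOINT, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Usage (needs the LM Studio server running and cargo on PATH):
# print(summarize(["cargo", "check"]))
```

Clipping to the tail keeps the request bounded but can drop early errors in very long builds; splitting the output and summarizing each part is the heavier alternative.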

#

Decode cost per token scales roughly linearly with context length, so something like 75 tok/sec only applies to small inputs. As you approach the end of the 128k context window, decoding gets much slower because the model has to attend to every token up to the one it's about to generate. CoreML offloading to the ANE would speed up absurdly large inputs and make it on par with data centers at scale

tranquil wren
#

.

steel magnet
#

I tried gpt-oss and it looks so cool, and it's privacy-friendly since it runs in your local environment

#

RTX 5060 Ti 16GB upgrade saved my life

#

I might use this for translation, grammar fixes, and formatting

solemn willow
#

.

mild blade
#

Question for the chat. Public data suggests that gpt-oss-120b is a pretty widely used model, but do we know what it’s for? Is it mostly daily driver usage or are people building products on top of it? Anyone in here using it for commercial projects etc.?

tepid garnet
silent cradle
#

So whens this new local coder model coming out?

silent wolf
lapis plinth
#

What’s the difference between OpenAI oss-20b and OpenAI oss-20b safeguard? I didn’t see it

#

The new open-source model was just released, and it's free

tepid garnet
#

I use gpt-oss-120b, I don't know much about gpt-oss-120b-safeguard

lapis plinth
#

What is heavy in gpt-oss-120b?

lapis plinth
#

I just downloaded gpt-oss 120b but it's a bit too heavy. I heard of GGUF; how can I convert it into that?

leaden hornet
#

gpt-oss 120b is pretty big. My hosting machine has 80GB of RAM and it takes it from 20% up to 90%. The model itself is like a 63GB download. If you want something lighter, try 20b. Not going to be the same experience, but it might run for you.

livid sky
#

Why do you prefer the 20b MoE over other lighter models that sometimes perform better on local resources, for both QLoRA and DPO?

thin sundial
#

I am fine-tuning gpt-oss 20b on a custom dataset and facing the issues below; all guidance is appreciated:

  1. Continuous thinking loop never coming out of analysis channel
  2. Ollama Modelfile, chat template
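One possible suspect for a never-ending analysis channel is training text where the reasoning is never closed and followed by a final-channel message. This is a simplified sketch of the message shape as I understand the Harmony format (real system messages also carry metadata such as the reasoning level and valid channels); verify it against OpenAI's openai-harmony library or the model's chat template before training on it. The function names are hypothetical.

```python
from typing import List, Optional


def harmony_message(role: str, content: str, channel: Optional[str] = None,
                    end: str = "<|end|>") -> str:
    """Render one message as <|start|>role[<|channel|>ch]<|message|>content<end>."""
    header = role if channel is None else f"{role}<|channel|>{channel}"
    return f"<|start|>{header}<|message|>{content}{end}"


def harmony_sample(system: str, user: str, analysis: str, final: str) -> str:
    """One simplified training conversation: reasoning closed with <|end|>,
    then a final-channel answer closed with <|return|>."""
    return (harmony_message("system", system)
            + harmony_message("user", user)
            + harmony_message("assistant", analysis, channel="analysis")
            + harmony_message("assistant", final, channel="final", end="<|return|>"))


def lint_sample(text: str) -> List[str]:
    """Flag shapes that could teach the model to stay in the analysis channel."""
    problems = []
    if "<|channel|>final<|message|>" not in text:
        problems.append("no final-channel message")
    if not text.endswith("<|return|>"):
        problems.append("does not end with <|return|>")
    if text.count("<|start|>") != text.count("<|end|>") + text.count("<|return|>"):
        problems.append("a message is missing its closing token")
    return problems


sample = harmony_sample("You are a helpful assistant.", "Hi", "User greets; reply briefly.", "Hello!")
print(lint_sample(sample))  # []
```

Running a check like `lint_sample` over the whole dataset is cheap and catches malformed rows before they reach training.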
livid sky
#

Why did you choose the oss-20b MoE?

sleek fiber
#

its good

deft yarrow
#

and fast!

coarse spear
#

What was the general consensus on 5.1's usage and ability?

tepid garnet
coarse spear
#

Excuse me, wrong channel.

#

Apologies.

little cedar
#

bruh

robust nymph
#

I’m planning to set up a quantized version of gpt-oss 20b on my pc so I always have a capable local model at the ready. Since it’s been out for a long time now, has anyone switched to a different model of a similar size?

ornate abyss
#

Super curious what you guys think!!

#

You can also paste a youtube link in (youtube RAG) and ask it to extract a segment featuring a topic you like and it will cut the video and extract that clip and provide it to you

#

Video, image and file conversion

#

google search, image search, reverse image search, google lens, local OCR etc etc

#

really sick combination!

#

Btw GPT-OSS-20B runs on my DGX Spark like a monster!

#

Spilled coffee on my MacBook, traveled to Dubai with just my DGX Spark + portable display + mouse and keyboard, and trained it in the hotel, lobby, and pub restaurant (wherever I had a plug)!

#

Most fun i ever had! Curious what you guys think!

#

GPT-OSS-20B-Vision with my noapi-google-search-mcp can subscribe to a youtube channel and download the videos, transcribe them (youtube RAG) automatically, news feed, create QR codes, shorten URLs (tinyurl), upload files to buckets (minio) and online storage, convert media files, fetch emails, transcribe locally, Local OCR etc

coarse spear
#

OH,SNAP

#

Simplicity

#

Cutting out the middleman. 💡

#

I hope my replies are on the subject matter.

livid sky
#

gpt oss 20b is ok quantized. the heretic cleaned one instead... bah... not so much

gray sparrow
#

Is OSS useful at all or is it really designed to be tuned?

livid sky
# gray sparrow Is OSS useful at all or is it really designed to be tuned?

It's an MoE (mixture of experts).
For large workflows it is computationally less expensive. For a single user running it as a private LLM, it makes no sense cost-wise in terms of consumption.
An MoE can even be a bit difficult if the internal routing misfires. You can get more stability with a solid LLM.

#

If you're thinking of configuring an LLM for home use on your machine: define the purpose, choose a good base, do LoRA on Colab with Unsloth, then merge the adapter, convert to GGUF, and run it in LM Studio.
But you need good datasets to train on and a clean base

#

Otherwise, you can go with Oobabooga or Ollama if your machine can handle it

gray sparrow
# livid sky it is a MoE multitude of experts. for large workflows it is computationally less...

I meant it a different way. In other words, is the model card any good for the mixture of experts? I didn't want to read the entire 32-page PDF to see what it was trained on.

From what I have heard, it's very bad compared to Qwen and Kimi K2, but if that is the case, I'm looking for its purpose. Maybe OpenAI thinks it's a good model to tune. Not sure.

What you're speaking about in computational cost is active weights during inference. Most MoEs offer this, with Kimi K2 being one of the lowest at around 40b active.

That being said, it only saves time at inference, computational power is modestly saved and active memory is not.

As for the internal routing, that is just latent variable activations and with a MoE its considered beneficial to have a critique model at the end or a weighted quorum output. Some models do this automatically (rarely I have seen) but external is always better and more controlled.
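A toy illustration of the active-vs-total distinction discussed above: a router scores every expert per token, only the top-k experts are evaluated, and their outputs are mixed by softmax weights over the chosen k. This is a generic top-k routing sketch with scalar "experts", not gpt-oss's actual router (its k, expert count, and normalization may differ).

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: float, router_logits: Sequence[float],
                experts: Sequence[Callable[[float], float]],
                k: int = 2) -> Tuple[float, List[int]]:
    """Toy top-k MoE layer for a single scalar 'token'.
    Only k experts are evaluated (the 'active' parameters), but every
    expert must be loaded, so memory still scales with the total count."""
    top = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in top])  # renormalize over the chosen k
    out = sum(w * experts[i](x) for w, i in zip(weights, top))
    return out, top


# Four toy "experts", each just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
out, chosen = moe_forward(1.0, [0.1, 2.0, 0.3, 1.0], experts, k=2)
print(chosen, round(out, 3))
```

Compute per token depends on k, while memory depends on the full expert list, which is the trade-off described in the message above.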

livid sky
empty talon
#

C
Sora ai speaks greek

open moon
vale flicker
#

I finally caved and got a GPT-Pro sub, y'all can release GPT-30b now

empty talon
#

Oh

rare viper
#

Use Gemma 3

split trout
velvet raptor
#

These are my hardware specifications; can I run gpt-oss-20b and also 120b locally?

tepid garnet
velvet raptor
tepid garnet
livid sky
#

I run oss 20b as a GGUF, quantized 4k
System specifications:
OS: Windows 11 Pro, version 25H2, OS build 26200.7705
Experience: Windows Feature Experience Pack 1000.26100.291.0
RAM: CORSAIR VENGEANCE DDR5 32GB (2x16GB) DDR5 6000MHz CL30 AMD EXPO Intel XMP iCUE Compatible Computer Memory - Grey (CMK32GX5M2B6000Z30)
Processor: AMD Ryzen 7 9700X 8-core processor, 3.80 GHz
Video card: AMD Radeon RX 7600, 8 GB VRAM
Storage: Samsung SSD 870 EVO 2 TB, Samsung SSD 990 PRO 2 TB

knotty ivy
#

guys, I need help. Are NVIDIA or AMD GPUs better for local LLM using on Linux?

frosty trench
#

If you are talking about AMD, it works a bit better with Linux systems, so maybe it could handle more load that way when using an AMD GPU rather than NVIDIA, also considering price to performance, of course, and the OS in use

knotty ivy
#

Same price

#

Usage with like LM Studio on Fedora

frosty trench
#

For local LLMs on Linux, NVIDIA is usually the better choice.
With the same price, NVIDIA generally has better support for tools like LM Studio, CUDA-based AI software, and overall smoother setup on Fedora/Linux. AMD can be fine on Linux, but for local LLM use, NVIDIA is usually the safer and more compatible option.

#

But also think about the LLM you will be running; its size is also important

knotty ivy
frosty trench
river surge
chrome night
#

This needs an update

olive rapids
peak heron
#

I am running an AMD Strix Halo device with 128GB of unified RAM and am pretty happy with it on Linux. LLMs are absolutely no issue. It's the Python diffusion model runners that are the issue, but this has significantly improved in the last month with better support from the Linux kernel and ROCm drivers

fading linden
#

has anyone tried running gpt oss with the heretic software?

split trout
livid sky
thorn quiver
#

Did you know that you can use gpt-oss:120b for free even on trash CPUs using Ollama? Well, I didn't. BUT I'M SO HAPPY I FOUND IT OUT!!

```python
from ollama import Client

api_key = "YOUR API KEY HERE"

# Requests go to Ollama's hosted API (ollama.com), so the model runs on
# their servers rather than on your local CPU.
client = Client(
    host="https://ollama.com",
    headers={"Authorization": f"Bearer {api_key}"}
)

prompt = input('prompt: ')
print('')
messages = [{"role": "user", "content": f"Write a cinematic fantasy story, only output the story, nothing else: {prompt}"}]

# Stream the reply and print each chunk as it arrives.
for part in client.chat("gpt-oss:120b-cloud", messages=messages, stream=True):
    print(part["message"]["content"], end="", flush=True)
```

short root
#

Hello everyone!

I wanted to open a discussion about fine-tuning gpt-oss-20b.
In our current design, there's an agent that uses a single gpt-oss-20b model, which must handle the following tasks:

  • Task A (Tool Selection): the model must choose, from a reduced tool scope, the correct tool and with the correct arguments
  • Task B (Structured Generation): the model must conditionally generate a structured response based on the tool responses, with precise rules and formats.

We approached this by fine-tuning a gpt-oss-20b model on a dataset that contains the whole ideal/target message history:
System prompt (without the big rules & constraints that are required if the model is not fine-tuned) -> User prompt -> correct Tool Call -> Tool response -> Conditioned generation of the final response.
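The monolithic record described above already contains two supervised targets, one per task. A hypothetical sketch (the message schema with a `tool_call` key is made up for illustration) that pulls them apart, e.g. to measure Task A and Task B accuracy separately before deciding whether to split the fine-tune:

```python
from typing import Dict, List, Tuple

# Hypothetical schema: {"role": ..., "content": ...}, plus "tool_call" on the
# assistant message that invokes a tool.
Message = Dict[str, str]


def split_trajectory(traj: List[Message]) -> Dict[str, Tuple[List[Message], Message]]:
    """Expose the two targets inside one ideal trajectory
    (system -> user -> tool call -> tool response -> final answer).
    Assumes the final answer is the last message."""
    call_idx = next(i for i, m in enumerate(traj)
                    if m["role"] == "assistant" and "tool_call" in m)
    final_idx = len(traj) - 1
    return {
        "task_a": (traj[:call_idx], traj[call_idx]),    # context -> tool choice + args
        "task_b": (traj[:final_idx], traj[final_idx]),  # context + tool result -> final answer
    }


traj = [
    {"role": "system", "content": "You are an agent."},
    {"role": "user", "content": "What's the weather in Oslo?"},
    {"role": "assistant", "content": "", "tool_call": 'get_weather(city="Oslo")'},
    {"role": "tool", "content": '{"temp_c": 4}'},
    {"role": "assistant", "content": "It is 4 °C in Oslo."},
]
parts = split_trajectory(traj)
print(parts["task_a"][1]["tool_call"])
```

Scoring each (prompt, target) pair on its own shows which task degrades, which is cheaper to learn than restructuring the agent into two models.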

The intention was to do a monolithic fine-tune to "fix everything at once". After testing, we are observing that this may not be the best solution.

I wonder... how should we approach this situation? How should we go about fine-tuning?
We have even considered using two different models (one for Task A and another for Task B), but this adds a great deal of complexity to the agent’s structure.

Eager to hear about your past experiences, knowledge, or opinions!

broken stone
#

OpenAI give us more open source models and my soul is yours

shadow flint
shy jacinth
full crest
#

Honestly it wouldn't be too hard to create a simple vision system for the local model. It would just be based on word2idx processing. It takes chunks at a time, relays them into words, and sends them back to the model itself. I'm not entirely sure, but could we even have the local model do the processing itself? Just an idea

#

I don't normally work with LLMs in that sense.

spark knoll
#

That can't possibly end badly.

viscid void
#

4o

tawny glacier
#

Min spec?

tacit ridge
shy jacinth
#

If only OpenAI would update their GPT-OSS line with a lineup of multimodal models from 2B to 120B to stack up against Qwen and Gemma models, it would be really great

tardy lynx
#

What minimum amount of training data is reasonable for finetuning GPT-OSS 20b?

spark knoll
shy jacinth
spark knoll
#

What benefit does the megacorp OpenAI gain from releasing an open source/weights/etc. model that it doesn't also gain from observing every other open source/weights/etc. model in the wild?

tepid garnet
spark knoll
#

And clearly they don't think it's worth it.

tepid garnet
spark knoll
#

I doubt it at this point. They were released before "OpenAI" underwent complete mission collapse and corporatization.

tepid garnet
spark knoll
#

We'll see.

left kiln
#

Is this channel about ChatGPT's OS?

tepid garnet
shy jacinth
left kiln
#

And I saw somebody talking about ChatGPT OS

tepid garnet
left kiln
#

right?

tepid garnet
#

gpt-oss are advanced open weight models released by OpenAI

spark knoll
#

"Advanced" they're months old, ain't they?

shy jacinth
#

And I think it's because the 120B GPT OSS only uses about 5B active parameters per token, compared to the 9B, which uses all of its parameters
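The active-parameter point can be made concrete with rough arithmetic. A minimal sketch, assuming the rounded figures from the gpt-oss model card (about 117B total / 5.1B active for gpt-oss-120b, about 21B / 3.6B for gpt-oss-20b), a hypothetical dense 9B model for comparison, and the common rule of thumb of roughly 2 FLOPs per active parameter per generated token (ignoring attention over the context):

```python
# Rounded parameter counts in billions; "dense-9b" is a hypothetical comparison point.
MODELS = {
    "gpt-oss-120b": (117.0, 5.1),  # (total, active per token)
    "gpt-oss-20b": (21.0, 3.6),
    "dense-9b": (9.0, 9.0),        # a dense model activates every parameter
}

def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the weights a mixture-of-experts model actually uses per token."""
    return active_b / total_b

def approx_gflops_per_token(active_b: float) -> float:
    """Rule of thumb: ~2 FLOPs per active parameter per token (forward pass only)."""
    return 2.0 * active_b

for name, (total, active) in MODELS.items():
    print(f"{name}: {active_fraction(total, active):.1%} active, "
          f"~{approx_gflops_per_token(active):.1f} GFLOPs/token")
```

By this estimate the 120b does a bit over half the per-token compute of a dense 9B, although all 117B weights still have to be held in memory.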

shy jacinth
rotund musk
#

hi everyone, how is everybody this evening?

spark knoll
#

What's up?

river loom
#

is chatgpt down???

#

what is a geoffries kurikure kurikesu

#

like please bruh

thick notch
elfin leaf
#

hiiiii

desert garden
#

hi

#

how's everybody doing

quiet fractal
#

Yo

#

They should try making smaller-parameter models to compete with Gemma from Google

#

Like, 20b works fine on my M5 MacBook Pro

#

But maybe some 4B or 8B models?

#

And some new information past the knowledge cutoff gpt-oss has right now (June 2024)

twin lark
#

GPT-OSS 2 when?

twin lark
spark knoll
#

What's the profit for them? They're a for-profit now.

viscid void
polar thorn
#

I’ve been getting good results tailoring GPT-5.4 prompts for OSS-120B in my SPARK project (bit slow though).

https://community.openai.com/t/spark-simple-personal-ai-reasoning-kernel/1366435

Quick question — is there an OpenAI embeddings model available for OSS/local use? I’m using nomic at the moment.

Also curious what speed differences people are seeing between ~20B vs 120B?

Running on a DGX Spark.
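On the 20B vs 120B speed question: single-stream decoding is usually memory-bandwidth bound, so a rough ceiling is bandwidth divided by the bytes of weights read per token, and for a mixture-of-experts model only the active parameters are read. A minimal sketch under those assumptions. It ignores KV-cache reads and treats all weights as one precision; the ~4.25 bits/weight figure (MXFP4 values plus shared scales) is an approximation, and the bandwidth constant is a placeholder to replace with your machine's number.

```python
def decode_tok_per_s_ceiling(active_params_b: float, bits_per_weight: float,
                             bandwidth_gb_s: float) -> float:
    """Upper bound on batch-1 decode speed if every active weight is read once per token."""
    gb_read_per_token = active_params_b * bits_per_weight / 8  # billions of params * bytes each = GB
    return bandwidth_gb_s / gb_read_per_token

BANDWIDTH_GB_S = 250.0  # placeholder; substitute your hardware's memory bandwidth
BITS = 4.25             # approx. MXFP4: 4-bit values plus one 8-bit scale per 32 values

for name, active_b in [("gpt-oss-20b", 3.6), ("gpt-oss-120b", 5.1)]:
    print(f"{name}: <= {decode_tok_per_s_ceiling(active_b, BITS, BANDWIDTH_GB_S):.0f} tok/s")
```

The ratio between the two ceilings (3.6 / 5.1, about 0.7) does not depend on the bandwidth, which is one reason the 120b is slower than the 20b but not in proportion to its total size. Measured speeds land below these ceilings.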

quiet fractal
hexed latch
#

.

shy jacinth
# hexed latch .

What are you wondering about GPT OSS? Are you wondering about a new generation of GPT OSS models ranging from 2B to 120B with multimodal capabilities? I am

shy jacinth
#

You know how powerful a 9B GPT OSS with multimodal capabilities would be?

spark knoll
#

Don't get your hopes up, that's not "Open"AI's business model any more.

silk grove
#

what is oss??

tepid garnet
silent cradle
#

make new gpt oss

#

especially coder models

shy jacinth
flat citrus
#

Can anyone help with the gpt-oss model? Can I fine-tune it to make it more capable so I can use it as a coding agent?

royal lava
#

Yes please, we need an open-source Codex model

flat citrus
#

Yes, it's true, we can train a model together

split trout
#

never in hell will openai do that

#

"for safety" sure buddy

twin seal
#

Has anyone experimented with speculative decoding for GPT OSS 20B? I'm running mostly on CPU and getting 10 tok/s, looking to push that up to maybe 15 tok/s? What are some good draft models?

twin seal
crimson ember
#

HII

feral escarp
twin seal
feral escarp
# twin seal So you would use GPT OSS 120B with 20B as a draft model? Interesting.

Yes, but the caveat is you would not get a significant perf increase, because GPT OSS 120b is already fast and you would need to load both models into memory. Typically you need a tiny draft model paired with a much larger target model for speculative decoding to yield any kind of performance benefit, and speculative decoding itself has high memory requirements. If you're already offloading OSS 20b to CPU because you only have 8GB of VRAM, you would need a smaller GPT that fits entirely on the GPU, and if it's drastically less accurate you might only gain a few %, since most of the work still ends up on the CPU with the 20b.
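The "tiny and accurate draft" requirement can be quantified with the expected-speedup analysis from Leviathan et al., "Fast Inference from Transformers via Speculative Decoding". A minimal sketch, assuming each drafted token is accepted independently with probability `alpha`, `gamma` tokens are drafted per step, and `c` is the cost of one draft forward pass relative to one target forward pass:

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass (accepted drafts plus one target token)."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Wall-time improvement vs. plain decoding; each step costs gamma draft passes + 1 target pass."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * c + 1)

# Small, accurate draft that is cheap to run: a clear win.
print(expected_speedup(alpha=0.8, gamma=4, c=0.05))  # ≈ 2.80
# Inaccurate draft that is also relatively expensive (e.g. partly on CPU): slower than no drafting.
print(expected_speedup(alpha=0.3, gamma=4, c=0.5))   # ≈ 0.48
```

Both a low acceptance rate and a high relative draft cost push the result toward or below 1, which matches the "only a few %" experience described above.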

#

Speculative decoding also requires the same vocabulary. So if you tried to use LFM2.5 1.2B, which is a very fast and small LLM, as the draft, it would produce logits over a completely different vocabulary and the OSS 20b would invalidate the entire draft (basically you'd get slower inference no matter what).
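To see why the vocabularies must match, here is a minimal sketch of greedy speculative decoding with toy stand-in "models": plain functions from a list of token IDs to the next token ID (`target`, `good_draft`, and `foreign_draft` are hypothetical). The target keeps a drafted token only if it is the exact ID it would have produced itself, so a draft from a different vocabulary never matches and each round yields a single token. Real engines verify all drafted positions in one batched target forward pass; this loop checks them one at a time for clarity.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # toy model: context token IDs -> next token ID

def greedy_decode(model: NextToken, prompt: List[int], n: int) -> List[int]:
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]

def speculative_greedy(target: NextToken, draft: NextToken, prompt: List[int],
                       n: int, k: int = 4) -> Tuple[List[int], int]:
    """Return (generated tokens, number of verification rounds)."""
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < n:
        rounds += 1
        # 1. The draft proposes k tokens autoregressively.
        ctx, proposal = list(out), []
        for _ in range(k):
            proposal.append(draft(ctx))
            ctx.append(proposal[-1])
        # 2. The target keeps drafted IDs only while they equal its own greedy choice.
        ctx = list(out)
        for tok in proposal:
            expected = target(ctx)
            ctx.append(expected)  # equals tok when accepted; the target's correction otherwise
            if expected != tok:
                break
        else:
            ctx.append(target(ctx))  # every draft accepted: one bonus target token
        out = ctx
    return out[len(prompt):len(prompt) + n], rounds

def target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 50

def good_draft(ctx: List[int]) -> int:
    # Agrees with the target except when the last token is a multiple of 7.
    return target(ctx) if ctx[-1] % 7 else 0

def foreign_draft(ctx: List[int]) -> int:
    # Same "predictions", but expressed in a different ID space (offset by 1000).
    return target(ctx) + 1000
```

With `target` itself as the draft, 20 tokens take 4 rounds; with `foreign_draft` they take 20 rounds plus all the wasted drafting work, while the output is identical in every case.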

shy jacinth
#

These are the first open-source image models I've ever seen that use GPT OSS as a text encoder

#

I wonder why Microsoft used GPT-OSS for the text encoder

#

What do you think?

shy jacinth
#

Especially when the diffusion model is way smaller than GPT OSS 20b

granite pumice
#

yo

#

Which AI model should I run?

#

GPU: Zotac Gaming RTX 4090 24GB GDDR6X, CPU: Ryzen 7 9800X3D, RAM: 64GB DDR5
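A rough way to check what fits: weight memory is about parameters × bits per weight ÷ 8, plus headroom for the KV cache and runtime buffers. A minimal sketch, assuming a flat 20% overhead factor (a guess; real overhead grows with context length). The model sizes and bit widths in the loop are illustrative examples, not specific releases.

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: billions of params * bytes per param."""
    return params_b * bits_per_weight / 8

def fits(params_b: float, bits_per_weight: float, budget_gb: float,
         overhead: float = 1.2) -> bool:
    """True if weights plus an assumed flat overhead fit in the memory budget."""
    return weights_gb(params_b, bits_per_weight) * overhead <= budget_gb

VRAM_GB = 24        # RTX 4090
SYSTEM_RAM_GB = 64  # usable for offloaded layers, minus what the OS and apps need

for params_b, bits in [(8, 8), (20, 4.25), (32, 4), (120, 4.25)]:
    where = ("GPU only" if fits(params_b, bits, VRAM_GB)
             else "GPU + RAM offload" if fits(params_b, bits, VRAM_GB + SYSTEM_RAM_GB)
             else "too large")
    print(f"{params_b}B @ {bits} bits: ~{weights_gb(params_b, bits):.0f} GB -> {where}")
```

Anything that lands in "GPU + RAM offload" will run, but decode speed drops sharply once layers sit in system RAM.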

shy jacinth