#ML Performance Reading Group


unborn heart
#

interesting....could the paper's claims be exaggerated?

shadow edge
#

maybe, in that specific sense. but their perf numbers are probably real. and they have a V2 version that likely did work

unborn heart
#

i see. i'm planning to try to implement it using the new cute 4.0 python DSL

shadow edge
#

I think Comet V3 does not even have a backward. my MoE gradients are empty, and claude code cannot find the backward

#

claude says:
The backward implementation for forward_gather_rs in GemmGroupedV3GatherRS does not exist in this repository. The implementation only provides forward-only CUDA kernels without custom backward passes, relying on PyTorch's automatic differentiation instead
but if the grouped gemm is cuda, torch autodiff can't handle it. so my deduction is that there isn't any backward

forest terrace
unborn heart
#

Lots of room for better implementations then…

shadow edge
#

Does flux support backwards?

In the doc., it says that Flux supports training scenarios. However, I couldn’t find any details about backpropagation in the source code or examples.
response:
you may write the backward as Megatron TP parallel does; it does not use auto backward.
https://github.com/bytedance/flux/issues/139
knowing this earlier would have saved me a lot of time

tranquil pewter
#

hmm- seems like there is no gemv? thinking_cat

uncut monolith
#

Bruce Lee himself has written gemv kernels berk

tranquil pewter
#

yes but what about gemv for fp4

uncut monolith
#

You should create a GitHub repo with such issues and send them here or something.

I'm sure there will be many people like me looking to solve problems of practical significance while learning.

#

Like there's FP4, Blackwell support, etc.

#

Or just share here like you're already doing and I can create a GitHub repo lol

tranquil pewter
#

yeh- though yeah- mainly gemv is important because tensor cores don't work with vectors without being very inefficient

#

and llm inference uses vectors if batch size 1

#

the MLPs only see (1, hidden)

#

after prefill

#

(which is why LLMs can run so fast on macbooks)

#

it's vector @ matrix

uncut monolith
#

I like so much detail, I'm gonna make a github issue now. we need more such detailed/concrete and accessible ML sys problems for people to learn and develop some "aura" lol

uncut monolith
tranquil pewter
uncut monolith
#

yeah, simdgroup_matrix. i was trying to understand how vector@matrix product makes apple better at inference. for nvidia tensor cores, i can get that they'll perform badly on v@m products.

tranquil pewter
#

nono it's not that- the reason why it makes apple good at all is because it's very low flops

#

flops for matmul are 2 * M * N * K, and if M=1 (for vector) then it's only N*K*2 flops, which is very small flops.. basically same number of flops as active parameters in model (.. well x2)

uncut monolith
#

hmm, it's very low flops so the inference is memory bound and apple is good at memory. nice! got it, thanks for explaining!
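
A rough sketch of the arithmetic above. The function names, the hidden size of 4096, and the 2 bytes per element (bf16) are illustrative assumptions, not numbers from the thread. It counts each input read once and the output written once, which is a simplification of real memory traffic.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an (m, k) @ (k, n) matmul: one multiply and one add per MAC."""
    return 2 * m * n * k


def arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved, assuming both inputs are read once and the output written once."""
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return matmul_flops(m, n, k) / bytes_moved


hidden = 4096  # assumed hidden size, for illustration only

# Decode step at batch size 1: a (1, hidden) vector times a (hidden, hidden) weight.
gemv = arithmetic_intensity(1, hidden, hidden)

# Prefill-like case: 4096 tokens at once against the same weight.
gemm = arithmetic_intensity(4096, hidden, hidden)

print(f"gemv: {gemv:.2f} FLOPs/byte, gemm: {gemm:.2f} FLOPs/byte")
```

With M=1 the intensity is about one FLOP per byte, so the weight read dominates and memory bandwidth sets the speed. With M in the thousands the same weight read is amortized over far more math.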

uncut monolith
tranquil pewter
#

though can just use torch flop counter blaze

uncut monolith
#

yep, at that time I was not aware of it and spent hours calculating the FLOPs and memory accesses by hand goose10

tranquil pewter
#

rip

#

though you can do most of it with just matmul flops

#

since attention is basically just 2 matmul and softmax

#

and if causal then just matmul flops / 2
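
The shortcut above can be written down as a small helper. This is a back-of-the-envelope sketch: the function name and arguments are made up for illustration, and it ignores softmax and the QKV/output projections. The causal case uses the "divide by 2" approximation from the message above, not an exact count.

```python
def attention_matmul_flops(seq_len: int, head_dim: int, num_heads: int, causal: bool = False) -> int:
    """Matmul FLOPs for one attention layer (scores and weighted sum only)."""
    qk = 2 * seq_len * seq_len * head_dim  # Q @ K^T: (s, d) @ (d, s)
    pv = 2 * seq_len * seq_len * head_dim  # P @ V:   (s, s) @ (s, d)
    total = num_heads * (qk + pv)
    # Causal masking skips roughly half of the (s, s) score matrix.
    return total // 2 if causal else total


full = attention_matmul_flops(seq_len=1024, head_dim=64, num_heads=8)
causal = attention_matmul_flops(seq_len=1024, head_dim=64, num_heads=8, causal=True)
print(full, causal)
```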

unborn heart
#

@uncut monolith do you still want to present soon

#

@hoary summit you also should present the paper you were looking at (USP?)

uncut monolith
#

We could finally have a toy example of how to do a fused communication + computation kernel, especially since a lot of papers are using it nowadays.

#

I had a brain fart, my bad. I think a better topic would be the internals of nccl since I'm doing that as part of above parallelism stuff.

vocal cove
#

Are you just reading nccl code to do the above

uncut monolith
#

Sort of, there's a few papers that deconstruct nccl and prime intellect created their alternative

#

Those are my main references. I think it will be done in iterations, with each iteration going more in depth and less relying on existing nccl code.

#

The goal is to make collective APIs and just learn. So whatever helps with that.

#

Although TLDR is I'll know when I've done the first iteration. I don't know enough currently to be able to give a good picture of what the end product will be like.

forest terrace
uncut monolith
#

It's mostly empty now, but it's public

#

I'm still getting over the initial daunting and freezing experience that one gets when exploring something new

#

I have almost no experience with professional C++ and C development, so that's a barrier for me...

forest terrace
#

nicee, what resources do u use to do the NCCL?

uncut monolith
uncut monolith
#

I know the conceptual stuff, but very little development experience. (Hence this project)

#

Just need boilerplate code to get done with so that I can actually do the core part of it

#

Actually @unborn heart any recommendations for small fused communication+computation kernels that might be there in torch or made by bytedance?

Just looking to abstract away the project setup and integration with pytorch part.

forest terrace
# uncut monolith Do you have recommendations for fused communication+computation kernels in CUDA+...
uncut monolith
#

Yeah, we recently went over this paper (led by Daniel) and I know it has some code as well. That's why I asked Daniel.

forest terrace
uncut monolith
uncut monolith
#

@forest terrace this is what i have currently. I'm sure I'll remove nvshmem type extra stuff and add on more stuff as i go through the current list. It's very early stages currently.

[1] C.-H. Hsu, N. Imam, A. Langer, S. Potluri, and C. J. Newburn, “An Initial Assessment of NVSHMEM for High Performance Computing,” in 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2020, pp. 1–10. doi: 10.1109/IPDPSW50202.2020.00104.
[2] Z. Hu et al., “Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms,” July 07, 2025, arXiv: arXiv:2507.04786. doi: 10.48550/arXiv.2507.04786.
[3] “NCCL vs NVSHMEM · Issue #679 · NVIDIA/nccl,” GitHub. Accessed: July 06, 2025. [Online]. Available: https://github.com/NVIDIA/nccl/issues/679
[4] “NCCL: The Inter-GPU Communication Library Powering Multi-GPU AI S72583 | GTC 2025 | NVIDIA On-Demand,” NVIDIA. Accessed: July 23, 2025. [Online]. Available: https://www.nvidia.com/en-us/on-demand/session/gtc25-s72583/
[5] M. Keiblinger, M. Sieg, J. M. Ong, S. Jaghouar, and J. Hagemann, “Prime Collective Communications Library -- Technical Report,” May 20, 2025, arXiv: arXiv:2505.14065. doi: 10.48550/arXiv.2505.14065.
[6] “Scaling Scientific Computing with NVSHMEM,” NVIDIA Technical Blog. Accessed: July 06, 2025. [Online]. Available: https://developer.nvidia.com/blog/scaling-scientific-computing-with-nvshmem/

uncut monolith
hoary summit
#

Yeah

uncut monolith
#

nice! it will be nice to look at the code instead of the looped collective einsum notation

unborn heart
hoary summit
uncut monolith
hoary summit
unborn heart
uncut monolith
hoary summit
# uncut monolith i've seen people use mostly pattern matching in torch compiler, rarely decomposi...

Async comp paper authors are part of the XLA team. They implemented it. I asked amit how to implement it, but he told me XLA had implemented it and gave me some flags to turn on. I was surprised as well. But in the XLA world this is actually kind of the paradigm: the compiler does a lot. You can tickle the compiler or bypass it with kernels, but that's not the norm. I implemented USP, not looped einsum; XLA works mostly fine. I like USP more though, it's more flexible.

uncut monolith
#

Torch compiler does async TP with micro pipelining option in inductor backend.

I think it's loopedeinsum and not USP though.

#

Loopedeinsum is more fundamental and for TP, unlike USP which as the name suggests is for SP

#

Like, even torch compiler supports automatic async TP nowadays, just like xla compiler backend
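
For anyone who hasn't seen the decomposition idea, here is a toy single-process simulation of an all-gather followed by a matmul, rewritten as a loop of per-shard matmuls plus a ring pass. It is only a sketch of the concept as I understand it, assuming activations sharded by rows and one weight shard per device. The function names are hypothetical, and this is not what XLA or inductor actually emit.

```python
def matmul(a, b):
    """Plain nested-list matmul, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def looped_all_gather_matmul(x_shards, w_shards):
    """Device d ends up with all_gather(x) @ w_shards[d] without ever materializing the gather."""
    n = len(x_shards)
    held = list(x_shards)                 # held[d]: activation shard currently on device d
    owner = list(range(n))                # owner[d]: index of the original shard held[d] is
    out = [[None] * n for _ in range(n)]  # out[d][j] = x_shards[j] @ w_shards[d]
    for step in range(n):
        for d in range(n):
            # Compute on whatever shard is resident; on real hardware this overlaps with the send.
            out[d][owner[d]] = matmul(held[d], w_shards[d])
        if step < n - 1:
            # Ring pass: every device hands its shard to the next device.
            held = [held[(d - 1) % n] for d in range(n)]
            owner = [owner[(d - 1) % n] for d in range(n)]
    # Stitch the row blocks back together in original shard order.
    return [[row for block in out[d] for row in block] for d in range(n)]


x_shards = [[[1, 2]], [[3, 4]]]            # two devices, one row of activations each
w_shards = [[[1], [0]], [[0], [1]]]        # each device holds one column of the weight
print(looped_all_gather_matmul(x_shards, w_shards))
```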

#

I'd be interested to see if you're using copy engine in USP implementation or not.

If you're not, it's an easy paper/PR to xDiT

uncut monolith
unborn heart
#

@here reminder we'll be starting the meeting in a few min!

uncut monolith
#

@unborn heart this is what I meant. I think nccl optimized their communication primitives which might help ring attention. They haven't profiled ring attention yet, just profiled the primitives by themselves.

unborn heart
#

that is interesting though, what video is that

uncut monolith
unborn heart
clear nimbus
#

Hello everyone

uncut monolith
#

we should do a RoPE + MLSys session, just as an excuse to finally read up on it for everyone who's been wanting to lol

shadow edge
#

there is no interaction between rope and ml perf

uncut monolith
#

yet

#

you can just make up stuff berk (some call it research)

pastel sapphire
#

RoPE is more or less free. There's nothing to optimize.

uncut monolith
#

I was hoping for an excuse honestly berk. A cursory google search found this, which is less ML Perf and a bit "hacky".

EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection
https://arxiv.org/pdf/2503.01586v1

" Experimental results show that with minimal uptraining on only 0.6% of the original training data, RoPE based models achieve a 75% reduction in KV cache size while preserving performance within a negligible margin"

#

this is like a new flavor of quantization almost

unborn heart
#

question for folks, my impression is HF / transformers is not commonly used for groups doing MoE pretraining, due to scalability issues and these groups generally just being more sophisticated and using their own arch + implementation. for pretraining, it seems people usually fork off of a pretraining framework like torchtitan or megatron, or just do their own thing entirely.

however, it is more common to use HF models for people doing either (1) serving only or (2) fine-tuning + serving.
does this align with others' understanding as well? @pale rune curious what you've seen

unborn heart
#

mm i should ask this in implementation details actually i think

pastel sapphire
#

Nobody uses HF's libraries for pretraining at scale

#

I don't know statistics for different libraries but I know our GPT-NeoX is used by a dozen or so labs around the world and that several people switched off of Megatron to it

unborn heart
unborn heart
pastel sapphire
unborn heart
#

the benefit is it would then be compatible with torchao low precision MoE training conversion util, so using fp8 rowwise, mxfp8 etc for MoE training can be a one liner

#

sounds good, will check there, thanks

hoary summit
hushed girder
# unborn heart question for folks, my impression is HF / transformers is not commonly used for ...

There's some discussion about this on Twitter:

https://x.com/eliebakouch/status/1949398309346394518

TLDR: It's either fork Megatron or Torchtitan, or build from scratch

Other notable frameworks: LLM-foundry, Nanotron, and Olmo as well

Same question but for training stack, a fork of megatron-lm is used by the Kimi folks I think, but idk about other labs or how far that fork is from the original codebase. Another question is if you're starting a big lab rn, do you start from scratch or fork something like

uncut monolith
#

Also, PCCL (Prime's internal communication library) sorta aims to make what this guy has already done...interesting

unborn heart
#

who wants to present next week? @uncut monolith ? 😄

#

maybe @sand parrot ?

uncut monolith
#

it's like very basic, how different collectives are implemented and how buffers are managed

#

actually, I might need more time to polish it up. Don't want to hurry it...

unborn heart
#

I’m interested in nccl internals

uncut monolith
#

hmm...what about next to next weekend? that should be plenty of time. I could talk in detail about the buffer registration and pipelined v/s non-pipelined nccl collectives

#

it's stuff that most people would not use honestly berk. I could also talk about fault tolerance and dynamic work group management limitations of nccl, like focus on that
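
As a baseline for that kind of discussion, here is a toy simulation of the textbook ring all-reduce (reduce-scatter followed by all-gather). It is a sketch of the general algorithm only, with one number per chunk and a hypothetical function name. It says nothing about NCCL's actual protocols, buffer registration, or pipelining.

```python
def ring_all_reduce(buffers):
    """Sum all-reduce over a ring; buffers[r] is rank r's data, one chunk per rank."""
    n = len(buffers)
    assert all(len(b) == n for b in buffers), "toy version: exactly one chunk per rank"
    data = [list(b) for b in buffers]

    # Phase 1: reduce-scatter. After n-1 steps rank r holds the full sum of chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all sends first so every rank sends "simultaneously".
        sends = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] += value

    # Phase 2: all-gather. Each rank forwards the finished chunk it most recently obtained.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] = value
    return data


reduced = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(reduced)
```

Each rank sends and receives 2 * (n - 1) chunks in total, which is why the ring is bandwidth-friendly for large buffers.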

unborn heart
#

i think it is good to have some level of understanding of the internals of critical tools you use

uncut monolith
#

yeah, i do think it will require me time to "grok" and come up with the important concepts to discuss instead of talking about everything in nccl lol

hushed girder
#

I'm interested in nccl too

unborn heart
#

Interesting

#

I was wondering how they managed stable mxfp4 training, since Quartet only recently came out they must have some other technique

#

I guess they could have done bf16 or fp8 training then used QAT fp4 fine tuning

#

To prepare for mxfp4 PTQ

uncut monolith
uncut monolith
unborn heart
#

Maybe because various ops aren’t supported for these new dtypes, and for uint8 they are

uncut monolith
#

hmm, weird. need to dive deep into how the gradients propagate as you pack a float into int. they're just breaking all dtypes "norms" lol
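
For context on the packing itself, here is a minimal sketch of storing two 4-bit codes in one uint8. The nibble order (low nibble first) and the function names are assumptions for illustration, not taken from any particular library. The codes are treated as opaque bit patterns, so nothing here touches how gradients are handled.

```python
def pack_nibbles(codes):
    """Pack 4-bit codes (0..15) two per byte, low nibble first (an assumed order)."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_nibbles(packed):
    """Inverse of pack_nibbles: split each byte back into two 4-bit codes."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble
        out.append(byte >> 4)    # high nibble
    return out


packed = pack_nibbles([1, 2, 15, 0])
print(packed.hex(), unpack_nibbles(packed))
```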

unborn heart
#

I will volunteer to present next if no one else wants to

uncut monolith
#

it's going to take me time for sure, i'm moving/have interviews

#

i could present the paper, but i'd rather implement it before presenting

unborn heart
#

Which paper

uncut monolith
#

[1] Z. Hu et al., “Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms,” July 07, 2025, arXiv: arXiv:2507.04786. doi: 10.48550/arXiv.2507.04786.

#

there's a roadmap in the readme if you want to check if it's good for reading group. it has big picture view of things that are there in the paper/i could present

#

you know, considering i started this on july 31, the progress ain't that bad. i didn't even know makefile syntax back then, now i can write one from scratch and compile nccl code with multiple versions in a docker container etc. etc.

nice... bugcatnod

uncut monolith
#

I'm still cooking, but seems like it will be fun to discuss it!

#

Also GPU mode is having a bunch of similar talks over this month. Maybe we can join them and then have a reading group on similar stuff catgirl5

#

More value out of the reading group imo

uncut monolith
#

GPU mode's events tab.

It will actually be pretty cool to follow those presentations for me cause people will know what to be interested in and why. The big picture motivation part will be done by them.

#

#1189640399476764692 message

#

I'm sure they're going to be confusing grimberk . So people would actually be interested in diving deep into nccl in our reading group

hushed girder
#

Wait, it's not embedded. The title is -> Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms

uncut monolith
#

Oh lol, I'm implementing the same paper...ok, weird.

Let's see how it goes and if there's a point in repeating it in our RG.

uncut monolith
#

@unborn heart any chance you know how torchao is dealing with this?

unborn heart
#

If/when we do it will likely be with triton + symmetric memory

hushed girder
hoary summit
#

Does triton allow dynamic slicing of refs?

pale rune
hoary summit
uncut monolith
#

At least the mxfp4 and nvfp4 things are starting to make some sense, i.e., why we do all of that in the first place anyways.

uncut monolith
hushed girder
uncut monolith
#

gpu mode's nccl is this saturday, will give me a nice idea about what to not repeat and what to elaborate on.
would be cool if we attend that and you guys could let me know what specifics you would want deep dive on.

uncut monolith
hoary summit
uncut monolith
#

@unborn heart what tool did you use for your async TP diagrams? I'm hoping to use the same for nccl diagrams

hoary summit
#

Is there any resource folks recommend for host offloading?

uncut monolith
# hoary summit Excalidraw

i might use this as an excuse to get my hands dirty with inkscape (supposed to be used for paper figures)

uncut monolith
hoary summit
#

I meant like a paper or blog, especially for parameter offloading. Not sure of its complexity; I always imagined it's one API call but I've been told it's not lol

uncut monolith
#

yeah, i doubt it's simple given the limited support it has (also limited use cases so there's a confound)

hoary summit
#

Share if you see any papers

fierce dock
#
GitHub

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective. - deepspeedai/DeepSpeed

uncut monolith
vocal cove
#

Is there a tentative date for the next meet?

unborn heart
#

i have been interested in diffusion based language models lately

hoary summit
#

I implemented ring attention in pallas. And can give talk about it if people are curious about it

unborn heart
hoary summit
uncut monolith
#

I'm collecting cool introductory topics though, stuff like interpreting profile traces. They're not a full paper, but they're also complex enough that they might be of interest.

#

Maybe I write blog post and then present it... It's going to be a long time in the future though paimonpopcorn

hushed girder
#

For anyone looking for a Diffusion Transformer library that can scale (I think it's here, no?)

Here's the paper Diffusion Beats Autoregressive in Data-Constrained Settings; they just released their codebase, which uses Megatron -> https://github.com/wmn-231314/diffusion-data-constraint

GitHub

Official PyTorch implementation and models for paper "Diffusion Beats Autoregressive in Data-Constrained Settings". We find diffusion models are significantly more data-efficient ...

hushed girder
#

What library do people use to train using AMD (large scale training)?

tired haven
hushed girder
#

Gotcha, thank you for the information

uncut monolith
#

@silver swift what do you use in your day job for large scale pre-training on AMD hardware?

silver swift
#

I recommend torchtitan but I use our own framework which has a lot of things straight from torchtitan though

uncut monolith
#

Thanks. You're the only person I know of that actually does AMD pre-training lol.

unborn heart
uncut monolith
unborn heart
uncut monolith
#

Hmm, pretty cool. I need to read the paper. I tried to listen in for the duration of the ASAP seminar and got frustrated at how slow it was berk...

unrelated personal musing: A year ago I would've taken a seminar over reading and skipping sections any day, but now seminars are too slow and I'd rather read cause I'm impatient and like to skip ahead lol

uncut monolith
hushed girder
uncut monolith
#

yeah, i see it in my calendar + got an email

tired haven
hushed girder
#

Yes

tired haven
hushed girder
#

I'm thinking of sharing it here after the final paper is finished .-.

tired haven
#

Oh yeah please do

#

I do think you could've polished a bit more with the baselines etc but I really like the direction

#

I guess you guys were worried about being scooped but did you know anyone else doing this?

hushed girder
#

Not so far

#

But we have experience of getting scooped lmao

#

So we, and my lab in general, don't want it to happen again

tired haven
#

Ah lol

#

Are you at mbz?

hushed girder
#

Yes I did

#

I do i mean

tired haven
#

Oh nice

hushed girder
#

I'm a master's student there in NLP

tired haven
#

Yeah I have a few friends there

#

PhD/RA

hushed girder
#

Ooo where are you right now?

tired haven
#

I'm at MPI-SWS

hushed girder
#

Dude thats really cool

tired haven
#

Haha thanks 🙂
Mbz is pretty cool too

uncut monolith
#

GPU mode is doing a cool multi-GPU kernel competition; related to the papers discussed in the group in the recent past...

uncut monolith
#

yeah...it's already giving me a reality check berk. nothing better than a competition to make you question how much you actually know something lol.

hushed girder
#

Any interesting VLM ML systems papers? I'm thinking of something like a new prefill-decode disaggregation technique, but for VLMs

#

I found this paper, but it seems like they did not really design for the VLM itself. They mainly design for the decoding of the LLM, and the vision stuff just happens to be there -> https://arxiv.org/pdf/2507.19427

uncut monolith
hushed girder
uncut monolith
#

Yeah. I think it's still mainly for LLMs though

hushed girder
#

I read them for a bit already, and I found that it's more about autoscaling (eg. when you're using kubernetes)

uncut monolith
uncut monolith
#

@unborn heart food for thought, what do you think about a session on quantization?

unborn heart
#

I’m just so busy with work right now, crunch time for PyTorch conference

uncut monolith
#

Ohh, good luck with it. I wish I could attend, still trying to find a way.

#

Anyways, I'd be up to do the session on quantization. That's what the bulk of my time is going in nowadays

#

I'm thinking 2 sessions, one theoretical and one practical. The theoretical one would be a survey of quantization methods, and practical one would show different tools and how it gets done irl.

I'd definitely appreciate your help on the practical side.

#

Theory session paper:

A Survey of Quantization Methods for Efficient Neural Network Inference
https://arxiv.org/abs/2103.13630

and any new methods post this paper.

#

Do you want to schedule the theory one for next weekend? And the practical one...I think I'll dm you or something to ensure I can do quick exploration of all the different tools or something. Or at least figure out how torchao is integrating them and accessing them via torchao.

#

Sep 20-21?

unborn heart
#

Sure, sounds good

#

Sept 21 for the survey paper?

uncut monolith
#

Yep! I'll try to run the slides by you if that's fine. Just to ensure they're a good use of everyone's time.

unborn heart
uncut monolith
#

Yep! 21st should be good.

uncut monolith
# uncut monolith Theory session paper: A Survey of Quantization Methods for Efficient Neural Net...

hey, I went through this paper, and it doesn't cover the OCP MX formats. So I'm adding this paper as well, which is a short review of MX FP.

https://arxiv.org/abs/2310.10537
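
A minimal sketch of the block-scaling idea behind MX formats, for people who want something concrete before the session: a block of values shares one power-of-two scale, and each element is stored in a tiny float format (FP4 E2M1 here). The shared-exponent rule follows my reading of the MX paper. The tie-breaking in the rounding (nearest value, lower magnitude on ties) and all names are simplifying assumptions. Real MX blocks hold 32 elements, while this toy accepts any length.

```python
import math

# Magnitudes representable by FP4 E2M1 (sign handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # largest E2M1 exponent: 6.0 == 1.5 * 2**2


def quantize_block(block):
    """Return (shared_exp, elements) with block[i] ~= 2**shared_exp * elements[i]."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # floor(log2(amax)) computed exactly via frexp, then shifted by the element format's emax.
    shared_exp = (math.frexp(amax)[1] - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    elements = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_VALUES[-1])  # clamp anything above 6.0
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - mag))  # simplified rounding
        elements.append(math.copysign(nearest, v))
    return shared_exp, elements


def dequantize_block(shared_exp, elements):
    """Multiply each stored element by the shared power-of-two scale."""
    return [e * 2.0 ** shared_exp for e in elements]


shared_exp, elements = quantize_block([0.1, -0.4, 3.0, 12.0])
print(shared_exp, elements, dequantize_block(shared_exp, elements))
```

Note how the small values in the block collapse to zero once the largest value sets the scale. That loss of small entries is the basic trade-off of sharing one scale per block.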

unborn heart
unborn heart
#

@uncut monolith just confirming you're still planning to present at 10am PST tomorrow?

uncut monolith
#

Hey, yep!

#

I couldn't get the ppt to you, but there's not a lot in it lol.

I've been trying to get simple torch code demonstrating the quantization methods

#

I'm a bit worried it might be too simple, let's hope not 🤞

unborn heart
#

awesome, sounds great!

#

i think it will be fine

#

simple is ok, some folks may have never seen anything quantization related before

#

regardless of how simple or deep/complex, there will be people who get value out of it 😄

uncut monolith
#

@unborn heart lemme know when you want to start

unborn heart
uncut monolith
#

nw

unborn heart
#

10:30 ok?

#

So sorry! Taking longer than expected to get back from morning breakfast date

uncut monolith
#

it's fine with me, yeah. if it's quick, it might be nicer to update the event time. but yeah, low priority

unborn heart
#

ok i'm back

#

updated to 1030am

#

@everyone meeting is starting soon, @uncut monolith will be presenting "A Survey of Quantization Methods for Efficient Neural Network Inference" (https://arxiv.org/pdf/2103.13630) today!

#

Can anyone hear me

#

i can't hear anyone

#

let me restart discord @uncut monolith

#

@uncut monolith also try looking at discord settings and looking at audio

#

source

unborn heart
#

thanks @uncut monolith for the awesome presentation on quantization methods and low precision dtypes! i will upload the recording and share a link once it's ready

uncut monolith
hushed girder
#

uh I missed it already

uncut monolith
uncut monolith
unborn heart
uncut monolith
#

I'm thinking of how to repro it without spending weeks or months on it. Please let me know if you have any ideas!

unborn heart
#

Would anyone be interested in doing a walkthrough of torchao mxfp8 MoE training code / kernels? Instead of a paper like usual?

hushed girder
unborn heart
#

I can prepare some slides for conceptual stuff, perf numbers, etc and then we can look at key parts of the implementation

uncut monolith
#

That'd be really cool. I'm interested in it.

hushed girder
uncut monolith
#

@unborn heart really looking forward to when you present your MoE training code.

I've seen the same pattern used in three cutting edge performance engineering problems now.

unborn heart
uncut monolith
#

Depends on your timelines, I'm going to implement it in CUDA+PTX "soon" anyways hap

uncut monolith
#

I need to do FA in triton today, maybe I'll also give a shot at FA4... (edit: i will regret these words/short timelines)

uncut monolith
#

@unborn heart do you want to try adding fa4 to torch or something? like, i'm trying to find someone to do this with so that it's not that much mental load lol

unborn heart
uncut monolith
#

Honestly FA4 might be easier than mxfp8 MoE training from cursor blog lol

unborn heart
#

i used pytorch, triton, CUDA for quantization and cutlass for the grouped GEMMs

uncut monolith
uncut monolith
#

I think it would still help cause you would introduce the tensor core and memory jargon and concepts with your presentation.

FA4 is all about that.

#

The TLDR on FA4 is decomposing the kernel into different stages and assigning a warp to each stage via warp specialization. Then they use a producer-consumer model to manage the different types of warps, with a barrier sync between each stage.

The MMA warp is all about tensor core instructions, with softmax scaling done on CUDA cores, just like we do fp block scaling. Like the same process.

It would be cool if I do a ncu trace and show it step by step when I present 🤔
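
To make the producer-consumer part concrete, here is a CPU-side analogy in Python threads. It is not GPU code and not FA4's actual structure: the two threads stand in for a "load" warp and a "compute" warp, and the bounded queue stands in for the shared-memory ring buffer plus its barriers. All names are made up for illustration.

```python
import queue
import threading


def run_pipeline(tiles, depth=2):
    """Two specialized workers connected by a bounded buffer: one only loads, one only computes."""
    buffer = queue.Queue(maxsize=depth)  # bounded: the producer blocks when all slots are full
    results = []
    done = object()  # sentinel telling the consumer there is no more work

    def producer():  # stands in for the warp that only issues loads
        for tile in tiles:
            buffer.put(tile)
        buffer.put(done)

    def consumer():  # stands in for the warp that only does the math
        while True:
            tile = buffer.get()  # blocks until a slot is filled, like waiting on a barrier
            if tile is done:
                break
            results.append(tile * tile)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_pipeline([1, 2, 3, 4]))
```

The `depth` argument plays the role of the number of pipeline stages: a deeper buffer lets the loader run further ahead of the compute worker.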

uncut monolith
#

Daniel, TLDR of the modal blog. I think it will be worth it to take it apart and focus on core things like tensor core instructions, and warp scheduler and persistent grids+blocks.

hushed girder
#

#1189498205101109300 message

#

Yep they will break down the FA4 for us

uncut monolith
#

Yeah nice

#

That will leave us time to dive deep into tensor cores stuff if you want @unborn heart . Same stuff about sfu and tensor cores is used in quantized training/inference and FA4.

hushed girder
#

Did you guys already talk about the optimizations in FA3 before?

uncut monolith
#

seems like they did, and it was a really in-depth one. i wasn't there on this server at that point but saw this:
https://www.youtube.com/watch?v=Lys0TpsLIEc&list=PLvtrkEledFjqOLuDB_9FWL3dgivYqc6-3&index=13

ML Performance Reading Group Session 2 recording, in which we covered the original Flash Attention paper (https://arxiv.org/pdf/2205.14135), as well as an example Triton kernel implementation of it.

Presenters: Ben Schneider, Daniel Vega-Myhre

#

they actually went through details like warp scheduler doing latency hiding for wgmma and stuff

unborn heart
#

someone should present sparse attention from new deepseek paper

#

or anything else in it

hushed girder
#

the presentation yesterday was really good

#

I was not familiar with warps and stuff before. Now I have a slight clue about it??

uncut monolith
unborn heart
uncut monolith
#

@next rose would you have time some day to go over the new sparse attention things and deep seek paper?

next rose
uncut monolith
#

Completely understand! Do you know of anyone who works on this full-time in academia as a PhD student probably?

Maybe we could invite them @unborn heart . I can do the scheduling if it's fine.

next rose
#

on topk attention? no

uncut monolith
#

Hm, no worries. If you ever get time, you know where to find us lol.

next rose
#

afaik no one was working on it except deepseek, me, and carson poole - all in somewhat different ways

#

but maybe others were and just didnt publish anything (yet?)

uncut monolith
#

Yep, that's very likely

next rose
#

its an old idea

uncut monolith
#

Cool, thanks anyways. I need to find some time to play around with implementing attention methods in CUDA, and triton soon. Maybe I'll start with that (top-k).

But probably a few weeks away for me as well, too much stuff to do.

Edit: ^ don't quote me on this lol. Everything takes a lot of time.

uncut monolith
#

@crisp karma just in case he might be interested in presenting the work of deep seek... Which I doubt cause he seemed busy

unborn heart
#

What are people’s thoughts on doing some deviations into non-perf/systems topics sometimes

#

I want to dive into RL and DLMs for example but don’t want to start another group, lol

uncut monolith
#

Do you have any examples (papers or suggestions) in mind?

There are generally 2 types of papers: building intuition for RL/Diffusion, MLsys+RL/Diffusion.

I believe there's pretty cool async and multi-gpu training stuff in RL, and many optimizations in diffusion inference.

hushed girder
#

Are you looking at OpenMoE 2.0?

unborn heart
hushed girder
#

Here, they scale DLM even further for MoE

#

I've been reading about some of the optimizations in the Hopper and Blackwell architectures. Some keywords I found are TMA and warp specialization.

CMIIW, but Triton cannot express the warp specialization part, so they created Gluon, and the PyTorch team created TLX for that.

There's this blog post from the JAX team which says that Pallas is able to do it as well: https://docs.jax.dev/en/latest/pallas/gpu/blackwell_matmul.html#warp-specialization

Some links:
https://github.com/facebookexperimental/triton/tree/tlx
https://pytorch.org/blog/fast-2-simplicial-attention-hardware-efficient-kernels-in-tlx/
https://github.com/triton-lang/triton/tree/main/python/tutorials/gluon

I know there's the CuTe DSL, but I'm kinda a bit sad to leave AMD behind yk? .-.

Well, I'm not sure whether AMD has TMA or not. I'm also not sure if this TDA is the TMA equivalent on AMD: https://github.com/triton-lang/triton/pull/8333

uncut monolith
#

I want to discuss and get the details right once and for all.

#

We can do support in triton, gluon, cutedsl, cutlass, pytorch, and Jax.

And parallels across AMD and Nvidia.

uncut monolith
#

@unborn heart impromptu one today lol?

#

We can honestly figure it out together while on the call lol.

#

Go through datasheet, CUDA and PTX isa docs, and micro-benchmarking papers.

#

I have some idea of where to figure it out for Nvidia. But not sure about AMD at all.

I'm hoping Edd will know where to look it up for amd.

hushed girder
#

I don't know yet lmao. I tried searching it but no luck :/

uncut monolith
#

Ohh, nw. We can probably get help from someone in GPU mode

#

But it will be nice to do it for Nvidia as a start!

uncut monolith
#

Edd and I are just going to do a short impromptu searching and discussion on the above in 2 hrs in the voice chat if we can.

Everyone's welcome to join obviously.

Time: 1 pm ET, 17 GMT.

uncut monolith
#

@hushed girder , do you want to jump in the voice chat?

hushed girder
#

wait

uncut monolith
unborn heart
#

And MLA from dsv3 paper

#

I don’t mind review though

#

I’m interested in DSA and NSA

#

When do you want to meet

uncut monolith
#

I can do this weekend anytime... Unless I get a job and have to move, unless oooh (joke to clarify)

uncut monolith
uncut monolith
# unborn heart When do you want to meet

@next rose @crisp karma any of you have an hour this weekend?

Would be really nice to have you be present to correct or contribute additional insight into the topic.

uncut monolith
# unborn heart I’m interested in DSA and NSA

Daniel, it seems like the code is divergent from what's described in the papers in a significant way.

If we want to go through the code, I'll need till next weekend.
If just the paper, this weekend is perfectly fine.

unborn heart
uncut monolith
#

Awesome! I'll see you guys on Sunday if that works.

The DSA and NSA papers from deepseek.

#

The old deepseekv3 video for people to review things, although I'll go through them quickly as well.

https://youtu.be/hPXTRZ9A-9M?si=D8H-7O7E7RMDSUi_

ML Performance Reading Group Session 7, where we covered the DeepSeek V3 paper. We also discussed some parts of the DeepSeek V2 paper for comparison.

Presenter: Daniel Vega-Myhre

Papers:

  1. DeepSeek V3 (https://arxiv.org/abs/2412.19437)
  2. DeepSeek V2 (https://arxiv.org/pdf/2405.04434)
uncut monolith
#

@unborn heart unrelated to this, but do you have any pointers on getting started with cutlass and cutedsl?

i'm targeting triton, cuda c++, and cutlass+cute dsl for my toolbox of ml sys things

unborn heart
#

I only really use triton and CUDA

#

I have modified cutlass stuff but I hate when I have to work with it

uncut monolith
uncut monolith
#

@unborn heart i think papers like these would be cool if you want to branch out of pure ml sys topics for the reading group.

you get the chance to talk about diffusion + what's really the bottleneck in making diffusion work nowadays, i.e., the systems challenges.

https://self-forcing-plus-plus.github.io/

unborn heart
#

link doesn't load for me

uncut monolith
#

also, the paper might be too advanced as a first paper guilty . it mixes GRPO, clever attention and kv cache, and diffusion all together

unborn heart
#

i envy how much time you have for learning

#

the past 3-4 months i have been full steam building mode

#

no time for papers

#

😢

#

i want to get back to my roots

uncut monolith
#

i know right, lol. it's one of the good things of not being in a job. i think i'll only be able to squeeze in like 1 paper a week when working full-time

#

i sometimes think i'll practice so much and be so good in my job that i can do things quickly and have time for other stuff in the evening, but i'm also early career lol 🤞

uncut monolith
#

@unborn heart can you please schedule an event for tomorrow 1 pm CT?

For MLA and DeepSeekv3.2. it will be nice to go through the paper and the code together. The code has been described as very confusing by 1-2 people lol.

unborn heart
#

Tomorrow I have plans

#

I’m free all day today though

uncut monolith
#

Strangely, I have meetings all day today till at least 5:30 pm CT. Pretty unusual for a Saturday, but yeah.

#

I can do it sometime over the week if you'd like. At least as of now, it seems like I can make an hour in the week.
But I can confirm by tomorrow cause that might change.

unborn heart
#

Meetings?? lol what

uncut monolith
#

It's just a long interview so I can't skip it...

#

Maybe it will end earlier than expected. I'll let you know if things change!

pastel sapphire
#

@unborn heart Should I give @uncut monolith reading group manager perms (make discord events, pin and delete posts in this channel)?

unborn heart
uncut monolith
#

Yep, that would be really convenient!

#

I need to fly to SF tomorrow though. So seems like I'm also unavailable tomorrow.

hoary summit
#

welcome @hot socket . He has kindly volunteered to present megablocks on 10/19 @unborn heart

#

Can you create an event

unborn heart
unborn heart
uncut monolith
#

Looks pretty cool!

uncut monolith
#

I can do automated cuda codegen in the coming weeks. I'm working on it this week for an interview/take-home sort of.

so i have:

  1. deepseek sparse attention, and hardware native sparse attention
  2. sakana and meta's cuda code gen papers.
unborn heart
#

bcsr format in megablocks confused me for a bit ...

uncut monolith
#

if you ever write a blog on the scaling challenges of MoEs based on the scaling laws of MoEs and the mixture of a million experts papers, let me know (we could collaborate)! I had that idea but i doubt i'll get to it on my own.

I want to write something purely from a ml sys perspective, laying out the challenges of MoE scaling.

#

also PMPP has a whole chapter on the CSR and COO representations, pretty approachable and simple language

unborn heart
unborn heart
uncut monolith
unborn heart
uncut monolith
#

I just skimmed the table of contents for the fourth edition and it looks similar to the 3rd edition. And yeah, the book is split into 3 sections. Basic foundations, parallel programming patterns and application case studies.

The 4th edition seems to have replaced the 3rd part with the 2nd part in more detail.

#

The parallel programming patterns sections are worth reading for everyone I think

unborn heart
#

worth the $70?

uncut monolith
#

I got the 3rd one for cheap and there's a PDF which details the differences between editions 3 and 4

#

Also, soon edition 5.

But yeah, the parallel patterns make it worth the cost

unborn heart
#

oh how soon is edition 5 coming?

uncut monolith
#

I don't remember. I think it was early next year

unborn heart
#

yea i noticed it only discusses ampere from what i can see in descriptions

uncut monolith
#

At least for me, after reading that book, the only thing left was tensor core features after Ampere. Maybe they'll cover it in the new edition

unborn heart
#

TMA, thread block clusters, DSMEM were all introduced in hopper. and TMEM was introduced in blackwell as well, i.e., tensor core accumulation (tcgen05.mma.*) happening in TMEM instead of registers

uncut monolith
#

Yeah. I haven't found a reference for those except cutlass code and PTX isa.

The matmul blog from Aleksa something also doesn't cover everything

#

For a newbie like me, PMPP was definitely worth it.

For someone like you, I wish you could just get the parallel programming patterns part of the book

#

I mean PMPP got me my interview at Stanford and hopefully a job. So I'm very biased in favor of it lol

uncut monolith
#

I got the physical book. I do think there's first and 2nd edition PDFs online, but difficult to find

hushed girder
unborn heart
#

@uncut monolith can you share the link or remind me of the concept you mentioned about NCCL implementation that could explain why sending N fp8 elements takes same amount of time as N bf16 elements?

uncut monolith
#

how sure are you that the bandwidth of the network interface and the latency per hop and per packet transmission are similar?

Maybe the fp8 has more network contention because of rapid packet sending in the link, and that leads to exponential back-off and ends up being slower than the bf16 due to the link going empty due to contention and cool-off/back-off time after packet collision?

unborn heart
#

how sure are you that the bandwidth of the network interface and the latency per hop and per packet transmission are similar?
pretty sure, it's the same devices on the same machine

uncut monolith
#

yeah, but the protocol picked by the nccl autotuner might be completely different given the dtype size. also, if the kernel is too fast given the small dtype, the profiler might just be measuring the time required for synchronization after each packet, or after the whole transmission, plus launch overhead like for any cuda kernel.
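The fixed-overhead point can be made concrete with a simple alpha-beta cost model, where transfer time is a fixed latency plus bytes divided by bandwidth. This is only a sketch: the 20 µs latency and 100 GB/s bandwidth are made-up placeholders, not measurements of NCCL or of any particular interconnect.

```python
def transfer_time_s(num_elems, bytes_per_elem, latency_s=20e-6, bandwidth_Bps=100e9):
    """Alpha-beta model: fixed per-transfer latency plus payload bytes / bandwidth.

    latency_s and bandwidth_Bps are hypothetical placeholder values.
    """
    return latency_s + (num_elems * bytes_per_elem) / bandwidth_Bps


for n in (1_000, 1_000_000, 1_000_000_000):
    fp8 = transfer_time_s(n, 1)    # 1 byte per element
    bf16 = transfer_time_s(n, 2)   # 2 bytes per element
    print(f"N={n:>13,}: fp8 {fp8 * 1e6:12.1f} us, "
          f"bf16 {bf16 * 1e6:12.1f} us, ratio {bf16 / fp8:.3f}x")
```

Under this model, small messages are dominated by the fixed term, so halving the bytes barely changes the time; the 2x gap only shows up once the payload term dominates.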

unborn heart
#

@hot socket just checking in, you still good for Megablocks @ 10am?

hot socket
#

definitely!

unborn heart
#

@everyone reminder we'll be covering Megablocks in a couple minutes!

hot socket
unborn heart
cosmic kraken
#

Hi guys, is there a list of potential future papers to review? Looking back at the past sessions, there are a number of topics not covered yet. These include:

  • kv cache
  • speculative decoding
  • prefill
  • megakernels
  • decoding
  • scheduling
  • memory optimization

Do we mainly rely on volunteers to step up, or do we have a list somewhere?

unborn heart
cosmic kraken
unborn heart
#

Any specific paper related to kv caching you’re interested in?

cosmic kraken
unborn heart
hushed girder
#

Hmmm it's more about the JAX and TPU world that's tricky for me since the topology is different with GPU. Currently watching the recording

hoary summit
unborn heart
cosmic kraken
hoary summit
cosmic kraken
unborn heart
#

I could do it … would like to make a longer more detailed version though

hushed girder
#

I started running experiments on multi-node, and now I am facing a case where the training loop randomly gets stuck, sometimes it can escape after like 200s (normal iteration took 1s)

The network topology is a bit weird but we can't control it since these are rented machines and the machines we get are not consistent. We also want to test on as many topologies as possible, so we want the solution to work on any topology.

I don't even know where to start debugging this. One thing I imagine might fix it is adding torch.distributed.barrier()? But idk where to put it. How do I learn where to put it?

Btw the framework that I use is Megatron

Thank you in advance!

unborn heart
#

>I am facing a case where the training loop randomly gets stuck, sometimes it can escape after like 200s (normal iteration took 1s)
I would try to first validate that a simple dist primitive works w/ torchrun. Write a script that just runs a simple collective or barrier. Use the minimal possible repro, not the full model. Validate the inter-node IB comms are actually functional first before diving deeper.
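A minimal repro along those lines could look like the sketch below. It assumes PyTorch with the NCCL backend and a launch via torchrun (which sets `RANK`, `WORLD_SIZE` and `LOCAL_RANK`); the helper names `expected_sum` and `smoke_test` are made up for illustration and are not part of Megatron or PyTorch.

```python
import os
import time


def expected_sum(world_size):
    # Each rank contributes its own rank id, so the reduced value is 0 + 1 + ... + (world_size - 1).
    return world_size * (world_size - 1) // 2


def smoke_test(num_steps=20):
    # Imported lazily so the helper above can be checked on a machine without PyTorch.
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")  # rendezvous info comes from torchrun's env vars
    rank, world_size = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    for step in range(num_steps):
        t = torch.tensor([float(rank)], device="cuda")
        start = time.perf_counter()
        dist.all_reduce(t)         # default reduction op is SUM
        torch.cuda.synchronize()   # wait until the collective has actually finished
        elapsed_ms = (time.perf_counter() - start) * 1e3
        assert t.item() == expected_sum(world_size)
        print(f"[rank {rank}] step {step}: all_reduce took {elapsed_ms:.2f} ms", flush=True)

    dist.barrier()
    dist.destroy_process_group()


# Only run when launched by torchrun, which sets these variables for every worker.
if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
    smoke_test()
```

Launching it with something like `torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=HOST:PORT smoke.py` (HOST:PORT being placeholders) and setting `NCCL_DEBUG=INFO` should show whether a bare collective across nodes also stalls, independent of the model code.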

unborn heart
hushed girder
#

got it, thanks for the debugging tips. I keep forgetting about torch.distributed.breakpoint()

cosmic kraken
hoary summit
unborn heart
unborn heart
#

Btw i'm at pytorch conference today if anyone wants to say hey hmu!

dire bronze
unborn heart
dire bronze
hoary summit
hushed girder
#

This is crazy

#

btw batch invariant means then something like using FA without the varlen right (use batch with padding)? what other batch invariant kernels are out there?

unborn heart
# hushed girder btw batch invariant means then something like using FA without the varlen right ...

Yes using fixed reduction strategy to avoid rounding error due to floating point non-associativity. See here: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

Thinking Machines Lab

Reproducibility is a bedrock of scientific progress. However, it’s remarkably difficult to get reproducible results out of large language models.
For example, you might observe that asking ChatGPT the same question multiple times provides different results. This by itself is not surprising, since getting a result from a language model involves...
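A tiny illustration of the non-associativity point, in plain Python floats (no GPU involved). The chunk size stands in for how a kernel might split a reduction; the idea that the split can change with batch size is the assumption being illustrated, and `chunked_sum` is a made-up helper, not an API from the blog post.

```python
def sequential_sum(values):
    # Explicit left-to-right loop (avoids any compensated summation in built-ins).
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk):
    """Reduce fixed-size chunks first, then reduce the partial sums."""
    partials = [sequential_sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return sequential_sum(partials)


# Same three numbers, different grouping, different result.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))

# Same four numbers, different reduction split.
vals = [1e16, 1.0, -1e16, 1.0]
print("one chunk of 4: ", chunked_sum(vals, 4))
print("two chunks of 2:", chunked_sum(vals, 2))
```

The two splits print different sums. Keeping the split fixed regardless of how many rows are in the batch is what makes the result reproducible.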

#

Oh cool, I got the “regular” badge / green name! Life goal complete. haha

unborn heart
cosmic kraken
unborn heart
#

@here reminder we'll be starting a session on LMCache in 5min!

hushed girder
#

ugh I can't join the meeting since voice call discord is banned in my country. Will watch the recording instead

cosmic kraken
unborn heart
#

thanks again to @cosmic kraken for the great presentation!

#

@here would anyone be down to meet again tomorrow to cover MXFP8 training for MoEs? i just presented this at the Pytorch Conference in SF so i already have slides etc. We can also wait til next week if more people will be available

unborn heart
junior ore
unborn heart
# junior ore Is there video of your conference talk?

it's not uploaded yet but will share when it is! btw i was wondering, would you potentially be interested in sharing Muon/MuonClip sometime? i think someone with a strong math background like you would be best suited for a topic like this. i have some questions about it 😄

junior ore
unborn heart
junior ore
#

OK. After thanksgiving week would be best for me.

unborn heart
#

Let me know if you have a preference. I can also just follow up later when we get closer to that time

#

To see how your schedule is looking around then

junior ore
junior ore
unborn heart
#

just fyi all i decided to schedule the MXFP8 MoE training session for next weekend (Nov 7th)

unborn heart
#

we should talk about this https://arxiv.org/abs/2510.26692

#

big if true

hoary summit
#

Yeah in general ssm we should cover

#

Whats interesting is the performance comparison in the report is on short context pretraining

cosmic kraken
# unborn heart we should talk about this https://arxiv.org/abs/2510.26692

Yeah, will be interesting to see if anyone can try it on a decent sized model. There is also the native sparse attention paper from DeepSeek which might be worth doing too. Is there anyone willing to take this up, if no one volunteers I can probably do it but it would have to be in December as I am super busy at the moment.

unborn heart
unborn heart
#

my PyTorch conference talk on MXFP8 MoE training is up! (my part starts at around 17min): https://youtu.be/h6LjH6Jkaf0?si=vy7ye0UmqoAbzvDf

PyTorch APIs for High Performance MoE Training and Inference - Daniel Vega-Myhre; Ke Wen & Natalia Gimelshein, Meta

With models like DeepSeekV3 and Llama4 rising in popularity, there has been an increasing demand for PyTorch-native APIs and tailored performance optimizations for MoE architectures.

This will be a joint talk between PyTorch Core...

crisp karma
unborn heart
crisp karma
unborn heart
# crisp karma gotcha, thanks! out of curiosity, what aspect of this is limited to blackwell? a...

what aspect of this is limited to blackwell?
microscaled dtypes (mxfp8, mxfp4, nvfp4) have native acceleration for certain operations on blackwell. specifically tcgen05.* family of PTX instructions ("tensorcore 5th generation") have single instruction ops for doing things like block scaled mma (tcgen05.mma.*) that require the scale factors be (1) in TMEM (new layer of blackwell memory hierarchy) and (2) in blocked swizzled layout (see talk for details on that)

#

does this dynamic quantization prevent the issues with outliers in activations?
i discuss this in the beginning of the talk (i had to talk super fast to get through all the content so you may have missed it) - scaling granularity of mxfp8 is 1x32, so the impact of outliers is limited to only a 1x32 block of input data, rather than a larger chunk of the tensor (which you'd see in more coarsely grained quantization strategies like blockwise, rowwise, tensorwise)
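A rough sketch of why scaling granularity matters for outliers. Note the simplification: this uses a plain integer grid with one scale per block as a stand-in, not real FP8 E4M3 elements with power-of-two shared scales, so the absolute error numbers mean nothing; only the comparison between one scale per tensor and one scale per 32 elements is the point.

```python
def quantize_dequantize(values, block_size, levels=127):
    """Scale each block so its max magnitude maps to `levels`, round, and scale back.

    Simplified stand-in for block-scaled quantization (integer grid, not FP8).
    """
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / levels if amax > 0 else 1.0
        out.extend(round(v / scale) * scale for v in block)
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# 128 small values, with a single large outlier in the first block of 32.
data = [0.01 * ((i % 7) - 3) for i in range(128)]
data[5] = 100.0

per_tensor = quantize_dequantize(data, block_size=len(data))
per_block = quantize_dequantize(data, block_size=32)

print("one scale for the whole tensor:", mean_abs_error(data, per_tensor))
print("one scale per 1x32 block:      ", mean_abs_error(data, per_block))
```

With a single scale, the outlier flattens every small value in the tensor; with 1x32 blocks, only the 31 neighbours sharing its block are affected.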

#

for this reason, mxfp8 implemented properly has identical convergence to bf16, and some studies (including our own!) have shown slightly better convergence / lower loss at same step - implying the amount of quantization error /information loss provides a helpful amount of implicit regularization

cosmic kraken
#

There was another talk on mxfp8, mxfp4 and nvfp4 that was really interesting too, pytorch 2025 conf. I was looking for yours too, it must have gone up today

rare warren
#

Great talk Daniel, looking forward to the deep dive 👍

unborn heart
unborn heart
#

for tomorrow i was planning on going through the slides but pausing to look at the kernels / implementations of certain parts. would that be interesting to folks or too much detail?

hushed girder
hushed girder
#

ah alright

cosmic kraken
#

Any chance of moving these sessions to a Sunday in future? Here in the uk it's 5pm on Saturday which is trickier to attend @unborn heart

unborn heart
#

we can do sunday this weekend, i actually usually prefer that when i present

#

let's do sundays from now on, i will update the invite

cosmic kraken
rare warren
#

Thanks, better for me as well. Looking at the implementation as well sounds good 👍

unborn heart
#

hey @pastel sapphire is it ok if I share the reading group discord meeting invite link on Twitter? Not sure if there are server rules/norms about this so just checking

unborn heart
#

@here reminder we'll be talking about MXFP8 training for MoEs in 5min!

unborn heart
unborn heart
nova wadi
#

Appreciate your time and discussion! 🙏 I'll be around in the future 🙂

unborn heart
#

that sucks, yeah luckily i have b200s to do development work on for work ... for those doing personal projects / learning out there, may be tough / cost $

nova wadi
#

I'm picking up some AGX Thors which you can get at like 3500~ brand new which do support this

#

Some weird stuff with sm110 but it's mostly the same. Just way worse memory bandwidth

#

Still not cheap, but not "luxury car for a single GPU"

#

Also worth noting to prevent someone else from going down a rabbit hole, but if you're working on B300's at all, sm103 does support larger K dimensions for mma. Slightly different silicon

crisp karma
#

oh wait I see that the mma.sync instructions (not the tcgen05) do support those dtypes w/ scales

#

does that work with the torchtitan stuff @unborn heart is presenting on? I would assume cublas would just target those instructions instead of the tcgen05?

#

also as an aside, Daniel, why do you think getting ~2x the FLOPs from MXFP8 results in still only getting 1.2-1.3x the speed? attention? just the requirement for the matrices to be so much larger for the full throughput to matter? do you think eg doing relatively less weight sharding (ie closer to DDP) would make that tradeoff better (ie bc larger weight matrices per GPU -> higher FLOPs)?

#

I suppose the on-the-fly scale calculation and whatnot does add overhead

nova wadi
unborn heart
crisp karma
unborn heart
crisp karma
#

might be able to contribute some SM_120 mxfp8 kernels if I can find some time

#

i have some (small M) gemm kernels that are pretty performant for (regular, ie non MX) fp8

#

could spend some time getting those a bit better for larger Ms

#

they're unlikely to be quite cublas perf (for large M especially), but at least they'd be something for those devices

unborn heart
# crisp karma also as an aside, Daniel, why do you think getting ~2x the FLOPs from MXFP8 resu...
  • the mxfp8 grouped gemm kernel achieves on average 1.8-2x higher flops/sec than bf16.
  • add in the overhead of dynamic quant, the net speed up for llama4 shapes is 1.6-1.8x (for local batch size 16, seq len 8192 - need large M dim).
  • measure the whole MoE layer with all the other ops, speeding up just the grouped Gemm results in 1.4x speed up.
  • now measure the full model e2e training, using dp2ep parallelism, which is notoriously comms heavy all2all, speeding up just the grouped Gemm nets 1.2x throughput (and convergence) speed up
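The shrinking speedup in the bullets above is what Amdahl's law predicts when only one component gets faster. As a back-of-the-envelope sketch, the implied share of time spent in the accelerated part can be backed out from the quoted numbers. The 1.7x used below is just the midpoint of the quoted 1.6-1.8x net grouped GEMM speedup, and the resulting fractions are inferred, not measured.

```python
def overall_speedup(fraction, component_speedup):
    """Amdahl's law: only `fraction` of the original runtime gets faster."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


def implied_fraction(overall, component_speedup):
    """Invert Amdahl's law: runtime share the sped-up component must have had."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / component_speedup)


net_gemm_speedup = 1.7  # assumed midpoint of the quoted 1.6-1.8x (grouped GEMM + dynamic quant)

for label, observed in (("MoE layer", 1.4), ("e2e training", 1.2)):
    share = implied_fraction(observed, net_gemm_speedup)
    print(f"{label}: {observed}x overall implies ~{share:.0%} of time in grouped GEMM")
```

Read this way, the remaining time (attention, comms, other ops) caps the end-to-end gain no matter how fast the grouped GEMM gets.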
unborn heart
crisp karma
unborn heart
crisp karma
#

or is grouped really where things are difficult rn

unborn heart
#

attention

#

we have mxfp8 for linears as well

crisp karma
unborn heart
crisp karma
unborn heart
nova wadi
#

Just curious, is anyone in this group planning on doing the GPU Mode NVFP4 competition?

nova wadi
#

It's announced in their discord: #1189640399476764692 message

#

lmk if you need an invite, grand prize is a dell gb300

#

with some sparks and 5090/5080's thrown out along the way

crisp karma
#

if the baseline is two separate kernels it might be advantageous (obv it will be slower than a precalc'd scales)

#

you could tune the block size(s) of the gemm kernel to be the same as the MX spec's block sizes to make the reduction more optimal

hushed girder
cosmic kraken
# hushed girder What about symmem? What architecture supports it? I tried pytorch API on symmem ...

Is this what you are referring to? https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-multimem:~:text=assumed by default.-,PTX ISA Notes,-Introduced in PTX

Seems like you need >= sm_100, the 4090 is I think sm_89. But the whole thing is super confusing, I wish nvidia would streamline the numbering. So if you're on a later sm version you can rely on the instructions, and have only family specifics at each level for extensions. The consumer 50x series are marketed as TC Gen 5, but they don't support tcgen05

nova wadi
#

Can you clarify what you're looking at? I'm seeing nvshmem when looking up symmem but those appear to be higher level APIs wrapping on device APIs

unborn heart
hushed girder
hushed girder
cosmic kraken
hushed girder
#

I will look into it further. I am not really familiar with instruction stuff and which features are supported on which arch, so what I do right now is just test it empirically (eg. try it in the code)

Thanks for the instruction tho (no pun intended)

crisp karma
#

seems conceptually doable

unborn heart
nova wadi
#

Any shot anyone would want any specific tests on the t5000 to see how sm100, 110 and 120 stack up against each other?

#

Getting a second one in today I could throw some tests at

unborn heart
nova wadi
#

also supports all the same stuff as SM100 under PTX 9.0

unborn heart
#

I believe certain instructions are only available on the “a” variant (e.g., sm100a)

nova wadi
#

I'm not seeing anything that sm110 doesn't get that sm100 does in the PTX ISA, but correct me if I'm wrong. sm110a was actually previously sm101a, and then renamed in PTX ISA 9.0, so earlier features that were sm100f (certain arguments, mostly), would have covered sm101a. Would be surprised if sm110 lost support for those arguments during the rename.

nova wadi
#

Actually I found the singular thing that is not supported here and it's stochastic rounding

#

Thats b200/300 only

cosmic kraken
# nova wadi Actually I found the singular thing that is not supported here and it's stochast...

From the tcgen05.alloc section of the PTX 9.0 docs, it would have been great to have a simple compatibility index for the different sm versions since 90 with a summary of support as a matrix:


Supported on following architectures:

sm_100a

sm_101a (Renamed to sm_110a from PTX ISA version 9.0)

And is supported on following family-specific architectures from PTX ISA version 8.8:

sm_100f or higher in the same family

sm_101f or higher in the same family (Renamed to sm_110f from PTX ISA version 9.0)

sm_110f or higher in the same family
nova wadi
#

Should be straightforward to make that matrix if we wanted to
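A possible starting point for such a matrix, as a sketch. The only data in it is what the tcgen05.alloc excerpt quoted above lists; anything not listed shows as "-", which here means "not listed in the excerpt", not "confirmed unsupported". The "or higher in the same family" rule is not modeled, and the names `RENAMED`, `LISTED` and `listed` are made up for illustration.

```python
# Renames from the quoted docs: sm_101* became sm_110* as of PTX ISA 9.0.
RENAMED = {"sm_101a": "sm_110a", "sm_101f": "sm_110f"}

# Feature -> targets, transcribed only from the excerpt quoted above.
LISTED = {
    "tcgen05.alloc": {"sm_100a", "sm_110a", "sm_100f", "sm_110f"},
}


def normalize(target):
    """Map pre-9.0 target names to their renamed form."""
    return RENAMED.get(target, target)


def listed(feature, target):
    """True if the excerpt lists `target` (after renaming) for `feature`."""
    return normalize(target) in LISTED.get(feature, set())


def print_matrix(features, targets):
    print("feature".ljust(16) + "".join(t.ljust(10) for t in targets))
    for f in features:
        row = "".join(("yes" if listed(f, t) else "-").ljust(10) for t in targets)
        print(f.ljust(16) + row)


print_matrix(["tcgen05.alloc"], ["sm_100a", "sm_101a", "sm_110a", "sm_100f", "sm_110f", "sm_120a"])
```

Extending it is a matter of adding one entry per instruction as each section of the PTX docs is read.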

unborn heart
#

pretty interesting, i haven't looked at linear attention methods in much detail until now

unborn heart
#

anyone down for ad hoc meeting tomorrow to discuss kimi linear^?

nova wadi
#

Depends on time but yea!

hushed girder
#

I'm down

unborn heart
#

ok I’m super jet lagged but hopefully can nap and do something later today

#

If not then next weekend

hushed girder
#

okay no worries both cases for me

unborn heart
#

Let’s do next weekend

nova wadi
unborn heart
#

yep! no stochastic rounding, that is just for nvfp4

unborn heart
unborn heart
nova wadi
unborn heart
#

in particular the WY representation and UT transformation i have not seen before ... everything up until that point i feel ok about

junior ore
cosmic kraken
#

Is it worth moving over to Google Hangouts? The sound quality on the recordings is not great with discord.

Also I wouldn't mind presenting the alternative Native Sparse Attention paper from deepseek in a couple of weeks on a free slot (I think muon was next week as I remember).

unborn heart
cosmic kraken
#

I started listening to the playlist and listened to the first two, flash attention and ZeRO. Sounded a bit garbled, not terrible just not great

#

I can do 13th, 6th is a bit short notice

unborn heart
#

can you check the more recent ones ? i had adjusted the recording software at some point

unborn heart
junior ore
#

@unborn heart , where does the meeting take place?

unborn heart
#

now! 😄

#

about to open the meeting 1 sec

ocean juniper
#

Gotta drop, thanks for the presentation!

unborn heart
junior ore
dire bronze
unborn heart
#

10am PST?

dire bronze
unborn heart
cosmic kraken
#

@unborn heart are we still having the session today?

#

Guys, it looks like daniel is offline today. So we will re-schedule this session for another time.

unborn heart
#

Crap sorry I’m here, I thought it was next weekend for some reason

#

I can start it up really quick, or can we reschedule for next weekend if that’s ok?

#

I’m working this weekend trying to get something done by Monday

unborn heart
#

Rescheduled for 21st! Sorry about that. @dire bronze can you do 27th for the spec decoding? Or Jan 3?

dire bronze
#

Hey @unborn heart - I thought it was Dec 21 - next weekend!

I'll be traveling from 23 Dec until the 7, so how about after that?

#

Jan 3 is probably doable, but I would prefer Jan 10. Is that possible?

unborn heart
#

We could potentially do 2 sessions next weekend. Saturday and Sunday. To avoid pushing yours back

dire bronze
#

Okay with keeping next week, or pushing.

cosmic kraken
#

Np, I'm flexible so I'm ok for the 28th, as I'm working over the holiday period. Or early in the new year when everyone is back at work.

unborn heart
#

@dire bronze @cosmic kraken how about spec decoding 21st and NSA on 28th?

cosmic kraken
#

Sure, I'm ok with that

dire bronze
#

Sure. I'm okay with that too.

cosmic kraken
#

Actually can we push mine to sometime in Jan either 4th or 11th, looks like I might be a bit occupied over the holidays.

unborn heart
vapid sleet
#

You're missing eagle and mtp, which are arguably most of what modern speculative decoding looks like in most AI labs

hushed girder
dire bronze
junior ore
#

TiDAR is pretty cool, too. https://arxiv.org/abs/2511.08923

I think he's probably got enough papers to talk about, though. 🙂

unborn heart
cosmic kraken
# unborn heart https://arxiv.org/abs/2512.14080

That looks like a really cool paper, I've been meaning to get up to speed with what is the latest in MoE. It will be interesting if they do a blackwell version which uses the clustered SM's and DSMEM.

unborn heart
#

Some tricks for still being able to use 2 CTA MMAs while having cross-CTA dependencies

cosmic kraken
unborn heart
#

10am pst tomorrow, be there or be square !

unborn heart
#

good morning everyone, we'll start in 12 min

#

just confirming @dire bronze you'll be ready?

dire bronze
#

Yep!

unborn heart
#

@everyone we'll be starting a session on speculative decoding momentarily in the voice channel!

hushed girder
#

Thank you for the presentation! Very cool to know especially about Medusa and Eagle

dire bronze
#

Thanks for hanging around for the really long presentation. I hope folks found it helpful.

unborn heart
cosmic kraken
#

@unborn heart are we still ok for the reading group session on the 4th Jan

unborn heart
#

@here reminder we have NSA presentation by @cosmic kraken in 5min!

#

meeting is open in the voice channel

#

@cosmic kraken are you joining?

unborn heart
#

would anyone be interested in a short session on Mxfp8 expert parallelism in forward / backward

#

Cool stuff I’m working on right now I thought others might find interesting

unborn heart
#

ok cool would be focused on torch and kernel implementation

#

not theory or research

#

pretty interesting though I think

unborn heart
#

I’ll schedule something when it’s ready

unborn heart
#

erghhh 2D block tiling in CUDA easier to conceptualize than actually implement…..

unborn heart
#

got it working

unborn heart
#

got warp tiling working now... strangely i get better perf with smaller thread tile sizes (2x2). if i try the author's config of 8x8 perf falls off a cliff. maybe register spillage in my impl

hushed girder
#

Gl with the learning

#

🫡

cosmic kraken
unborn heart
cosmic kraken
unborn heart
#

yeah this gemm is designed for ampere

#

i am iteratively going from the most naive possible gemm, working my way through different optimizations, with the end goal being blackwell gemm with 2 cta mma, tcgen05 ptx, pipelining etc

#

next up is using tensorcores via wmma

cosmic kraken
unborn heart
#

at each step, i read the description of the kernel design, then implement without looking at the code or using AI to practice and internalize more deeply

#

will reference this one last for blackwell (very good read, recommend it): https://gau-nernst.github.io/tcgen05/

cosmic kraken
#

Amazing how bad that works on modern hw, gets 30 tflops when theoretical maximum is 1 petaflop

unborn heart
#

yeah max i have is 40 tflops w/ warptiling

#

strangely if i run torch.matmul it gets 63 tflops

#

seems like it isn't using tensorcores either
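For reference, the usual way to turn a GEMM timing into TFLOPS counts 2*M*N*K floating point operations (one multiply and one add per multiply-accumulate). The 4096-cubed shape and the 3.4 ms timing below are hypothetical, chosen only so the result lands near the ~40 TFLOPS mentioned above; the 1 PFLOP peak is the figure quoted in the chat, not a datasheet value.

```python
def gemm_tflops(m, n, k, seconds):
    """Achieved TFLOPS for an (M x K) @ (K x N) matmul: 2*M*N*K flops / time."""
    return 2 * m * n * k / seconds / 1e12


def percent_of_peak(achieved_tflops, peak_tflops):
    return 100.0 * achieved_tflops / peak_tflops


achieved = gemm_tflops(4096, 4096, 4096, seconds=3.4e-3)  # hypothetical timing
print(f"achieved: {achieved:.1f} TFLOPS")
print(f"percent of a 1 PFLOP peak: {percent_of_peak(achieved, 1000.0):.1f}%")
```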

cosmic kraken
#

Yeah, on ampere I think the tensor cores are 2x2

cosmic kraken
unborn heart
#

oh there's no fp32 tensorcores huh, only bf16, fp8, fp4

#

so that's why torch is also so low

#

when i convert to bf16 i bet torch perf will skyrocket

#

wait, am i tripping, i am seeing conflicting things online

#

no i think i was right, ok

cosmic kraken
unborn heart
#

got wmma version working

#

bit easier than warp tiling, due to the abstractions

unborn heart
#

did double buffered mma now too. finished all the kernels in that blog

#

moving into hopper optimizations next … then Blackwell

unborn heart
#

finished ampere with a pipelined impl with cp.async + mma

#

feel like i am getting fast at this

#

ok now moving onto cp.async -> cp.async.bulk.tensor (tma) 👀

#

it is very cool seeing the tflops actually increase as they "should" with each optimization

#

TMA so annoying to use though

#

cuTensorMapEncodeTiled ... 🤡

unborn heart
#

crazy the complexity increase moving from wmma to wgmma

dire bronze
#

(I hope you're going to eventually present about all the recent stuff you've been talking about - sounds quite interesting!)

unborn heart
junior ore
hushed girder
#

Any good RL system paper? Trying to find a good paper for our reading group (irl).

I am looking for SGLang/vLLM type of RL system paper. But system algorithm is fine too (eg. PipelineRL)

We did verl paper already before. Currently thinking for PipelineRL but still not sure about it.

tired haven
#

The openrlhf paper is also fine though I didn't like the library much

hushed girder
#

Got it. So far I got OpenRLHF, PipelineRL, AReal, Magistral, and Ant Ring 1T paper. Will compare them

unborn heart
hushed girder
unborn heart
hushed girder
# unborn heart Let’s do Feb 1

Btw it's about PipelineRL. Forgot to mention that.

https://arxiv.org/abs/2509.19128 (PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation)
It is a system where they try to increase utilization while also keeping staleness in check. They do this by transferring weights during generation while also transferring the currently available data for training. By doing so, both the training GPUs and the inference GPUs keep running all the time (see Figure 1b). What surprises me is that KV staleness turns out to be okay.

I found other frameworks and model papers, such as Slime (SGLang's official RL framework), Longcat, and PrimeIntellect's model, using the same technique with slight modifications. I may talk a bit about the differences between them as well.

dire bronze
#

@hushed girder is this reading group somewhere in SF? I'm based in SF and am looking for IRL groups.

On RL systems - I've been reading and gathering as well. I think there's RLHFuse, RollPacker, but there's a lot of details hidden about systems in tech reports from neo tech labs (longcat, olmo3, nemotron etc). There's also a pretty cool theoretical+systems paper IIRC in AsyncRLHF.

unborn heart
#

not irl, i would go to an irl one sometime though, could be cool

unborn heart
#

sry i have been spending all my free time writing kernels lately instead of reading papers

#

almost at a good milestone to pause and read sonicMoE though!

#

maybe this weekend

#

i can read it

unborn heart
hushed girder
#

yes I will still do it

unborn heart
hushed girder
#

that's 10pm GMT+4 right? yeah that works

unborn heart
unborn heart
#

got 2 CTA tcgen05 mma working finally

#

in warp specialized Gemm

#

After much anger and struggle with cuda-gdb

unborn heart
#

@hushed girder we still on for tomorrow?

#

10am pst?

hushed girder
#

Yess

cosmic kraken
unborn heart
#

@here meeting is open we are starting soon!

hushed girder
#

@runic shale here

runic shale
#

thanks!

unborn heart
#

thanks everyone for joining, will post meeting recording shortly for anyone who missed it

hushed girder
#

Sorry for some heavy breathing. In my mind I am not nervous but my body is nervous berk

unborn heart
unborn heart
#

ergh

#

got persistent kernel working finally and perf is flat vs non persistent …

#

🥲

runic shale
#

persistent kernel?

unborn heart
#

launching a kernel with num blocks == num SMs, and having each thread block chug through computing multiple output tiles.

in contrast to launching num thread blocks == output size divided by output tile size, with each thread block computing exactly 1 output tile, which results in multiple waves of thread blocks being scheduled on the SMs one wave at a time, since num blocks >> num SMs
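A rough way to compare the two launch strategies is to count tiles and waves. The sketch below is plain Python rather than CUDA. The problem size, the 256x256 tile and the SM count of 148 are illustrative assumptions, not numbers from the kernel being discussed, and the round-robin tile assignment is just one simple static schedule.

```python
import math

def num_output_tiles(m, n, tile_m, tile_n):
    """Number of output tiles needed to cover an m x n output matrix."""
    return math.ceil(m / tile_m) * math.ceil(n / tile_n)

def non_persistent_waves(num_tiles, num_sms, blocks_per_sm=1):
    """One thread block per tile: blocks are scheduled in waves across the SMs."""
    return math.ceil(num_tiles / (num_sms * blocks_per_sm))

def persistent_schedule(num_tiles, num_sms):
    """One thread block per SM; block b walks tiles b, b + num_blocks, ... (static round-robin)."""
    num_blocks = min(num_sms, num_tiles)
    return [list(range(b, num_tiles, num_blocks)) for b in range(num_blocks)]

# Hypothetical problem: 8192 x 8192 output, 256 x 256 tiles, 148 SMs (assumed).
tiles = num_output_tiles(8192, 8192, 256, 256)
waves = non_persistent_waves(tiles, num_sms=148)
schedule = persistent_schedule(tiles, num_sms=148)
print(tiles, waves, len(schedule), len(schedule[0]), len(schedule[-1]))
```

In both cases the same tiles get computed. The persistent version launches once and lets each block loop over its tile list, which makes it possible to overlap one tile's epilogue with the next tile's loads.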

runic shale
#

sure, so ig this shows that time gain you pay for having n waves of smaller blocks is much smaller than the parallelism you get from saturating cuda cores within each SM with one giant block

#

isn't the limit of max threads per thread-block smaller than the total max of theoretical threads you need to saturate an SM? so that would make sense if your workload previously was flop dense since you're now leaving flops idling

junior ore
unborn heart
#

it's just a matter of how you schedule them

runic shale
#

sure. how do you force each block to go to one SM? If the block sizes are too small, they could double up

unborn heart
#

Hmm good question, CUDA runtime schedules in “waves” distributing as many blocks as possible across the SMs per wave, as efficiently as possible based on register usage, smem usage etc per block, as well as current resources available on each SM.

So if you have same number of blocks as SMs, it naturally schedules one per SM. (If there’s no concurrent kernels running)

this is my understanding and the pattern I have observed in other kernels, but I wish I knew a little more detail here..

runic shale
# unborn heart Hmm good question, CUDA runtime schedules in “waves” distributing as many blocks...

hmm, what you're saying makes sense but is also kinda speculative based on the block sizing. put concretely, say we're lucky enough to have N thread blocks and N SMs. each threadblock has a size of (# of threads per SM) / 2. how do we know that we are using all N SMs without a profiler? we could have two threadblocks double up on an SM and still run concurrently.

I think what you mentioned about scheduling in "waves" makes sense. but if we have N = # of SMs, my guess is that we'd have few enough threadblocks to be in the waves = 1 regime, i.e trying to still fill up the first wave. so the question isn't "how many waves do I need", but moreso "how do we fill up each incoming wave".

#

and for that question, at least I don't really have a convincing argument for why a scheduler would pick an even spread of blocks across all SMs versus cramming into one SM fully. perhaps on a mem bound workload we distribute so that we do more parallel gmem loads/stores? but I'd imagine each SM would have enough load store units (the things that actually handle memory fetches) for 2 blocks to make 1 vs 2 negligible.

cool discussion anyways!!

unborn heart
# runic shale and for that question, at least I don't really have a convincing argument for wh...

Another more practical reason we will end up with one thread block assigned to one SM in a persistent gemm kernel is that in the warp specialized design, we use a huge amount of shared memory for the queue of A and B tiles. You generally get better performance by increasing the queue size so we get a longer duration of thread block execution time in the pipeline “steady state” where load latency is hidden and epilogue is overlapped.

In fact, in the kernel launcher host code, you have to manually override the maximum shared memory per thread block limit in order to do this. Otherwise you’re limited to only 48KB per thread block rather than the full 227KB on the SM (B200), which would give you a tiny queue and provide little benefit, if any.

#

Given this, we literally cannot schedule more than one thread block per SM. Just 1 barely fits, by design.
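The "only one block fits" argument is resource arithmetic, and it also covers the earlier case where two half-sized blocks double up on one SM. The sketch below is a simplified residency estimate in Python: it looks only at threads and shared memory (real occupancy also depends on registers and per-SM block limits). The 2048 threads per SM, the 227 KB of usable shared memory and the 200 KB queue are assumptions for illustration. In CUDA host code, the opt-in above the 48 KB default is done with `cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, bytes)`.

```python
def max_resident_blocks(threads_per_block, smem_per_block_bytes,
                        max_threads_per_sm=2048, smem_per_sm_bytes=227 * 1024):
    """Upper bound on co-resident blocks per SM from threads and shared memory only.

    The default limits are assumed values for illustration, not queried from a device.
    """
    by_threads = max_threads_per_sm // threads_per_block
    if smem_per_block_bytes == 0:
        return by_threads
    by_smem = smem_per_sm_bytes // smem_per_block_bytes
    return min(by_threads, by_smem)

# A small block using half the SM's threads and little shared memory: two can share an SM.
small = max_resident_blocks(threads_per_block=1024, smem_per_block_bytes=16 * 1024)

# A warp-specialized GEMM block with a large tile queue (hypothetical 200 KB): only one fits.
big = max_resident_blocks(threads_per_block=384, smem_per_block_bytes=200 * 1024)
print(small, big)
```

With a large enough shared memory request, the one-block-per-SM placement follows from the resource limit rather than from any scheduler policy.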

#

you are likely right that in an arbitrary kernel that does not follow this design pattern, we don’t get this kind of guarantee

hushed girder
rare warren
#

(I might be late to the party here) I can recommend the Helix Parallelism paper to anyone who’s interested in more on TP+Context Parallelism specifically for inference/decode (as opposed to training/prefill). Interesting as a follow-up on Megatron and RingAttn. https://arxiv.org/pdf/2507.07120

unborn heart
slow shore
#

New here (and to ML perf in general). What should I do to start attending (and understanding) and perhaps making meaningful contributions to the reading group?

unborn heart
#

And more generally feel free to chat about anything cool you come across or questions etc

runic shale
hoary summit
tired haven
unborn heart
unborn heart
vapid ruin
unborn heart
vapid ruin
slow shore
vapid ruin
slow shore
unborn heart
#

Ok, i will say that having a firm foundation of basic chip architecture and performance characteristics, including interchip networking, is essential to properly understand any of this. It defines the problem constraints in which all of these parallelisms, efficient architectures, etc exist, and the motivation for doing "X instead of Y" in the first place

#

for example, flash attention is an efficient/innovative design and implementation of a core piece of the model architecture, but you will have a bad time trying to understand it without understanding GPUs first

slow shore
unborn heart
# slow shore Stanford's CS149 goes into this a bit I think. Would you recommend starting from...

Sure, or we have this intro video in the group playlist: https://www.youtube.com/watch?v=Cp7g1Ll4v0M

ML Performance research paper reading group session 1 meeting (2024/11/29). This was an intro session covering prerequisite knowledge related to GPU architecture, CUDA, NCCL, and common performance bottlenecks in ML workloads.

Presenter: Daniel Vega-Myhre

#

disclaimer: this was at the beginning of my "ml perf journey" a long time ago so hopefully i didn't say anything inaccurate ... lol

slow shore
#

"Introducing KVTC: A new KV cache transform coder (think JPEG for KV caches) that solves the "recompute vs. offload" dilemma. It achieves 20×-40× (up to 88×) near-lossless compression, redefining how we handle long-context memory."
https://x.com/AdrianLancucki/status/2019748151209476587
https://arxiv.org/abs/2511.01815

🚀📉 Storing KV Cache just got 20-40× cheaper
#NVIDIAResearch #ICLR2026


hushed girder
# runic shale Can I claim Feb 22nd? It would help me out to give a relatively faithful present...

Are you actively working in an AMD environment? Is it usually very hard to set things up on AMD, especially on older generations (e.g. MI210)?

We have a cluster of MI210s, but people rarely use it since it's very hard to get things working (e.g. installing vLLM, veRL, Megatron, etc.).

In your experience, do you maybe need to have some kind of container that's been built by AMD engineers to do stuff?

runic shale
#

What sort of challenges do you face on MI210 that you can share?

tacit kernel
hushed girder
hushed girder
runic shale
#

But yeah it would make sense if they're trying to aggressively push MI300X

#

or MI300A

tacit kernel
#

Speaking of MI300A, did anybody spot a matching AI / ML problem, in which this particular architecture would shine over others?

For those who do not know: MI300A is an 'APU', where CPU and the GPU processors sit on the same silicon and share the entire HBM memory. For ML, this means your worker processes would eat the same memory as your GPU's. This conflicts with the basic design principles of many ML libraries, where one assumes there is a huge amount of host memory out there for the CPU.

The particular ML problem where this architecture should shine would be one where the I/O between host memory (the RAM of the CPU) and the GPU memory is the main bottleneck. If anybody is aware of such ML problems, I would be grateful to exchange! For example, maybe some online RL training with hard-to-parallelize simulations as supervision could be a nice candidate. Or student-teacher style learning paradigms where the teacher would better run on CPU.

When I check MLPerf works, I see them focusing on the bottleneck between GPU global memory and GPU shared memory. There is almost no discussion whether there is any bottleneck formation outside of this area of focus. If you are aware of literature in this direction, or keywords, please feel welcome to share!

runic shale
tacit kernel
#

I wonder if there is some custom pytorch backend already, such that operations like:

x.to('cuda').to('cpu')
model.to('cpu').to('cuda')

resolve efficiently. As a person who is not so much aware of the internals of pytorch backend, I am a little afraid of getting lost trying this 😅

tacit kernel
#

Of course, no one would do .to() twice. A more realistic case would be:

# assumes model0, model1, batch, optim and loss_fn are already defined
model0 = model0.to('cpu')  # will execute only forward() on CPU
model0.requires_grad_(False)  # only fwd for this one
model1 = model1.to('cuda')  # will execute on GPU, fwd + bwd ('cuda' is also the device name on ROCm builds of PyTorch)

for x in batch:
    x0 = x.to('cpu')
    x1 = x.to('cuda')  # same thing on memory, this operation should have very low cost

    # concurrently run:
    y0 = model0(x0)  # This runs on CPU cores
    y1 = model1(x1)  # This runs on GPU cores
    # Now we are done with model0 for the current batch. model0 can already start processing the next batch.
    y0 = y0.to('cuda')  # same thing on memory, this operation should have very low cost

    optim.zero_grad()
    loss = loss_fn(y0, y1)
    loss.backward()  # Happens on GPU only
    optim.step()  # Happens on GPU only

Edit: The above idea rests on an assumption: that the CPU is preferable in some cases. I had the chance to investigate this assumption on MI300A. For tiny models below 500k parameters, CPU inference is faster in fp32, but that was at batch size 1. As the batch size or the number of parameters increases, there is no longer any reason to bother with the CPU. So the above idea turned out to be impractical in the end.

slow shore
#

https://arxiv.org/pdf/2602.06036v1

DFlash: Block Diffusion for Flash Speculative Decoding

Abstract:
Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM. However, existing methods still rely on autoregressive drafting, which remains sequential and constrains practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. We show that speculative decoding provides a natural and effective setting for diffusion models. By generating draft tokens in a single forward pass, DFlash enables efficient drafting, and by conditioning the draft model on context features extracted from the target model, it achieves high-quality drafts with higher acceptance rates. Experiments show that DFlash achieves over 6× lossless acceleration across a range of models and tasks, delivering up to 2.5× higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
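For anyone new to speculative decoding, the draft-then-verify loop the abstract builds on can be sketched in a few lines of Python. This is a toy greedy-verification example, not DFlash's method: `toy_target` is a hypothetical stand-in for the target LLM, and a real system scores every draft position in a single target forward pass instead of calling the model once per token as done here for readability.

```python
def verify_draft(prefix, draft_tokens, target_next_token):
    """Greedy verification: keep draft tokens while they match what the target would emit.

    Returns the tokens to append: the accepted draft prefix plus one token from the target
    (either the correction at the first mismatch or a bonus token if all drafts match).
    """
    accepted = []
    context = list(prefix)
    for tok in draft_tokens:
        expected = target_next_token(context)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the first mismatch
            return accepted
        accepted.append(tok)
        context.append(tok)
    accepted.append(target_next_token(context))  # all drafts accepted: one extra token
    return accepted

def toy_target(context):
    """Hypothetical target model: the next token is (last token + 1) mod 10."""
    return (context[-1] + 1) % 10

partial = verify_draft([1, 2, 3], [4, 5, 7, 8], toy_target)  # mismatch at the third draft token
full = verify_draft([1, 2, 3], [4, 5], toy_target)           # every draft token accepted
print(partial, full)
```

The output always matches what the target alone would have produced, which is why the speedup is called lossless. The gain comes from how many draft tokens are accepted per verification step.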

unborn heart
hoary summit
slow shore
dusk cedar
hoary summit
unborn heart
#

welcome! yeah idk I never really shared the videos widely, I did it mostly for my own learning

#

hearing good feedback like this maybe I’ll share them more often going forward 😄

#

I really wanna present/discuss sonicMoE and latentMoE but I have been insanely busy the past few weeks…

cosmic kraken
slow shore
#

Thanks for the encouragement @hoary summit and @cosmic kraken.
@unborn heart Can I claim the date after 22nd for this? Of course, if you wanna go for sonicMoE and latentMoE first thats fine!

unborn heart
slow shore
unborn heart
unborn heart
cosmic kraken
#

Please don't spam with non performance related stuff, there are other discords and channels (such as Yannic Kilcher's) which are more suitable for this

wise valley
#

Hey, guys! I only see papers on the topics being treated in the YouTube videos descriptions.

I think it’ll help if I can lay my hands on all the Google Slides docs used in the discussions.

wise valley
unborn heart
unborn heart
#

@runic shale you still good to present tomorrow?

runic shale
unborn heart
#

@here reminder everyone we'll meet in ~7 min to discuss Training foundation models on AMD stack!

unborn heart
#

great discussion, thanks again to our presenter @runic shale ! i will share the recording when it's uploaded

hushed girder
#

Maybe you can share your slide? 🙏 @runic shale

runic shale
unborn heart
#

I am straight up having a bad time with mxfp8 CUDA + ptx gemm impl with 2 CTA mma, persistent kernel with static schedule

#

🤡

hushed girder
slow shore
#

Hi everyone! Unfortunately I have gotten food poisoning and it would be hard for me to give the presentation I think. Would it be possible to reschedule today's presentation for next Sunday? If rescheduling won't be possible, I can still try to push through.

normal bridge
#

Take care, hope you feel better soon! I am personally okay with coming by next week (I suppose others can react with 👍🏻 to your post if they agree as well)

slow shore
plucky anchor
#

Take care @slow shore

unborn heart
#

I will be out of town next weekend so how about the one after that?

unborn heart
#

i rescheduled it for weekend after next

slow shore
junior ore
cosmic kraken
cosmic kraken
unborn heart
unborn heart
unborn heart
#

or colfax article

#

or both

unborn heart
hushed girder
#

Really cool as well

hushed girder
#

congrats @unborn heart

#

may I ask what's the biggest challenge of the development? esp for the _to_mxfp8_then_scaled_grouped_mm?

Also, for the selection of which layer is not using mxfp8, that's purely empirical right?

unborn heart
#

There were many challenges. One was reverse engineering how the hierarchical scale factor layouts for each token group should relate to each other in memory, given they are all in the same tensor/buffer.

#

There are Nvidia docs on how these scale factors for a single tensor for a single GEMM should be laid out in memory. However, there are no examples for grouped GEMMs where we have scale factors for logically independent GEMMs all in the same buffer.

#

So I had to figure it out myself, lol
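To make the problem concrete, here is a small Python sketch of the bookkeeping involved: where each group's scale factors might start inside one shared buffer. It is not the layout that was actually worked out. It assumes one scale per 32 elements along K (the MXFP8 block size), and it assumes each group's scale block is padded to a multiple of 128 rows and 4 scale columns. The helper names are hypothetical, and the swizzling inside each padded block is ignored.

```python
def ceil_to(x, multiple):
    """Round x up to the nearest multiple."""
    return -(-x // multiple) * multiple

def group_scale_offsets(group_rows, k, block_size=32, row_pad=128, col_pad=4):
    """Start offset (in scale elements) of each group's padded scale block.

    group_rows: number of token rows in each group. The padding values are assumptions.
    """
    scale_cols = ceil_to(-(-k // block_size), col_pad)  # scales per row, padded
    offsets, cursor = [], 0
    for rows in group_rows:
        offsets.append(cursor)
        cursor += ceil_to(rows, row_pad) * scale_cols
    return offsets, cursor

# Hypothetical token groups of 100, 300 and 128 rows with K = 256.
offsets, total = group_scale_offsets([100, 300, 128], k=256)
print(offsets, total)
```

The point is that group boundaries in the scale buffer do not line up with group boundaries in the data tensor once padding is involved, so the offsets have to be tracked per group.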

hushed girder
hushed girder
unborn heart
#

@slow shore is having discord issues so we may use Google meet today

unborn heart
#

ML Performance Reading Group
Sunday, March 15 · 10:00 – 11:00am
Time zone: America/Los_Angeles
Google Meet joining info
Video call link: https://meet.google.com/jpb-pqne-mmw

#

in 8min

#

fyi @here ^

#

@everyone we are starting in the Google Meet link above shortly

#

@fickle lark we are in this google meet channel today instead of discord voice

unborn heart
#

thanks for the great presentation on DFlash @slow shore ! please feel free to share a copy of the slides here when you can and i'll include them in the youtube description as well

graceful oar
#

Just wanted to say, great job with the group! Discovered it fairly recently and have been going through the videos. Will join starting next week.

slow shore
hushed girder
#

They just did something with OpenClaw .-.

OOT but anyone here used openclaw here? Is it good?

arctic cave
#

wdym this is about DFlash more than openclaw, but I am using qwen3.5-4B so lets see if DFlash can work for me. Wait wait, I thought Dflash was for diffusion models. I couldn't watch the presentation thanks for the slides! Going through them I understand this is some exciting stuff. Will implement this asap. NVM Flash Attention3 required sadge

dire bronze
unborn heart
cosmic kraken
hoary summit
#

Has anyone here looked at fla?

#

I am keen to find list of optimizations and corresponding code

unborn heart
#

everyone gonna be ready soon??

#

FA4

#

😄

unborn heart
#

@everyone we are starting the FlashAttention4 session momentarily!

hushed girder
#

I missed the session .-.

tired haven
#

Hey
If I want to benchmark FP4 matmul what API should I be using? I don't think PyTorch supports it so I see Transformer Engine and TorchAO as things to potentially use but don't know what is preferred (if any)

hoary summit
#

What's the fastest algorithm for inverting a lower triangular matrix? i am beating xla with just block decomposition + substitution

tired haven
hoary summit
upbeat sand
#

It's not gpu friendly though

unborn heart
hoary summit
grizzled pond
#

that should be fine no?

hoary summit
#

Yeah was hoping to have something more flexible. On tpu 64x64 the recursive doubling makes not much difference over blockwise forward substitution. But i will give it a go and see how it performs for 512 and 1024 chunk sizes
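For reference, the block approach can be written down compactly. Splitting L into [[A, 0], [C, B]] gives inv(L) = [[inv(A), 0], [-inv(B) C inv(A), inv(B)]], so the inverse reduces to two half-size inversions plus two matmuls, with forward substitution at the base case. The sketch below is plain Python on nested lists to show the recursion only. It is not the implementation being benchmarked above, and the base size of 2 is an arbitrary choice.

```python
def matmul(a, b):
    """Dense matrix product of nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def forward_substitution_inverse(l):
    """Invert a small lower triangular matrix by solving L x = e_j for each column j."""
    n = len(l)
    inv = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j, n):
            rhs = (1.0 if i == j else 0.0) - sum(l[i][k] * inv[k][j] for k in range(j, i))
            inv[i][j] = rhs / l[i][i]
    return inv

def invert_lower_triangular(l, base=2):
    """Recursive 2x2 block inversion of a lower triangular matrix."""
    n = len(l)
    if n <= base:
        return forward_substitution_inverse(l)
    h = n // 2
    a = [row[:h] for row in l[:h]]   # top-left block
    c = [row[:h] for row in l[h:]]   # bottom-left block
    b = [row[h:] for row in l[h:]]   # bottom-right block
    a_inv = invert_lower_triangular(a, base)
    b_inv = invert_lower_triangular(b, base)
    lower_left = [[-v for v in row] for row in matmul(matmul(b_inv, c), a_inv)]
    top = [row + [0.0] * (n - h) for row in a_inv]
    bottom = [left + right for left, right in zip(lower_left, b_inv)]
    return top + bottom

L = [[2.0, 0.0, 0.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0, 0.0],
     [3.0, 2.0, 4.0, 0.0, 0.0],
     [1.0, 0.0, 1.0, 2.0, 0.0],
     [0.5, 1.0, 2.0, 3.0, 5.0]]
L_inv = invert_lower_triangular(L)
product = matmul(L, L_inv)
```

The appeal on accelerators is that most of the work lands in the two matmuls, while the sequential substitution is confined to small base blocks.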

hushed girder
unborn heart
#

i saw that

#

pretty interesting

junior ore
#

This may be one of the first real signs of superhuman intelligence in software. On some of the most optimized attention workloads, agents can now outperform almost all human GPU experts by searching continuously for 7 days with no human intervention inside the optimization loop.

tired haven
unborn heart
unborn heart
#

In real life you would run a nvfp4 quantizer which produces real scales

#

This script just measures the gemm itself though, so the author must have not cared
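As a rough picture of what "real scales" means, here is a simplified block quantizer in Python. It assumes blocks of 16 values, one scale per block chosen so the block's max magnitude maps to 6 (the largest FP4 E2M1 value), and rounding to the nearest representable magnitude. It deliberately keeps the scale as a Python float. An actual nvfp4 quantizer would also store the block scale in FP8 and apply a per-tensor scale, which this sketch leaves out.

```python
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable FP4 E2M1 magnitudes

def quantize_block(block):
    """Quantize one block: returns (quantized values on the E2M1 grid, block scale)."""
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    quantized = []
    for v in block:
        mag = min(E2M1_MAGNITUDES, key=lambda g: abs(abs(v) / scale - g))
        quantized.append(-mag if v < 0 else mag)
    return quantized, scale

def quantize(values, block_size=16):
    """Split values into blocks and quantize each with its own scale."""
    q, scales = [], []
    for start in range(0, len(values), block_size):
        qb, s = quantize_block(values[start:start + block_size])
        q.extend(qb)
        scales.append(s)
    return q, scales

def dequantize(q, scales, block_size=16):
    """Multiply each quantized value by its block's scale."""
    return [v * scales[i // block_size] for i, v in enumerate(q)]

values = [float(i) for i in range(-8, 8)] + [0.0] * 16  # two blocks of 16
q, scales = quantize(values)
restored = dequantize(q, scales)
```

A benchmark that fills the scale tensors with dummy values skips all of this, which is fine for timing the GEMM but says nothing about accuracy.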

tired haven
unborn heart
unborn heart
#

I have been deep in the mxfp8 trenches and haven't tested nvfp4 as much yet

tired haven
unborn heart
#

Btw i posted this in #implementation-details channel but will share here as well since it is very relevant to ML performance:

Wrote a post some folks here may find interesting: “MXFP8 GEMM: Up to 99% of cuBLAS performance using CUDA + PTX” - https://danielvegamyhre.github.io/2026/03/29/mxfp8-gemm.html

tweet/x post: https://x.com/vega_myhre/status/2038293614204445039?s=46

New blog post: "MXFP8 GEMM: Up to 99% of cuBLAS performance using CUDA + PTX": https://t.co/HFcCcKnNja

As someone who works on MXFP8 training, I was interested in deeply understanding GEMM design for this numerical format. In this post, we write a MXFP8 GEMM with CUDA + PTX, and

#

Now that i am done with this, maybe i'll finally have time to present SonicMoE 😂

cosmic kraken
#

@unborn heart i think we recorded the FA4 session, any chance you can publish it

unborn heart
#

Oh yeah sorry about that

#

Will do

cosmic kraken
unborn heart
#

thanks again @cosmic kraken for presenting!

unborn heart
hushed girder
hushed girder
#

https://www.alphaxiv.org/abs/2604.15039v1

KV Transfer across data center? whattt

alphaXiv

View recent discussion. Abstract: Prefill-decode (PD) disaggregation has become the standard architecture for large-scale LLM serving, but in practice its deployment boundary is still determined by KVCache transfer. In conventional dense-attention models, prefill generates huge KVCache traffics that keep prefill and decode tightly coupled within...

unborn heart
#

Someone present!!

#

Volunteer as tribute

cosmic osprey
#

I could present, but maybe in about 2 weeks from now. How soon does it have to be?

hushed girder
#

I think that's cool. I need to read it as well first too

unborn heart
junior ore
# unborn heart Someone present!!

I watched @cosmic kraken 's FA4 talk last night, and noticed that he didn't have time to get to the backward kernel in detail. I was studying that closely this week (just the paper, I'm trying to dive into the cute code now, but I'm a bit out of my depth, there), and could present on that. It would be a pretty rough presentation, just walking through the paper and explaining my understanding. I don't have time to make slides.

I would also like to spelunk through the cute implementation with people who know GPU programming better than I do. Claude and gemini are giving me seemingly sensible answers, but I don't know what I don't know. I want to write a backward kernel for my own attention mechanism, which is currently dog-slow.

cosmic kraken
cosmic osprey
#

Saturday 2nd of May works

unborn heart
cosmic osprey
#

yeahh

unborn heart
rare warren
#

Wonder how it affects goodput given the use of slow interconnect; they only mention the throughput benefits

rare warren
#

DSv4 paper next? 🙂

hushed girder
#

we need several days I think lmao

#

or we read and analyze it together instead of someone preparing it alone top down?

unborn heart
unborn heart
cosmic osprey
#

Heyy!! @unborn heart Sundays are tricky for me (this is why I've not been attending the group's meetings on Sundays). How about next week Saturday/Sunday? I was about to write about the possibility of shifting the meeting to Saturday, actually

unborn heart
#

Ok I will push it back

cosmic osprey
unborn heart
unborn heart
cosmic osprey
#

@unborn heart rg still holding today?

unborn heart
cosmic osprey
#

Ohh, shoot! If we meet in 25 mins, there's about 30 mins left for the presentation and I won't be able to stay much longer because I have another meeting 🙁

#

It's okayy if we reschedule though

unborn heart
#

Ok sorry let’s reschedule for next weekend, sorry Mother’s Day activity running longer than expected!

ocean palm
#

Just saw the message.

unborn heart
#

This blog post discusses Cluster Launch Control (CLC), a hardware-supported feature on NVIDIA Blackwell GPUs that facilitates optimal tile scheduling, in particular with respect to load balancing. To provide context, we first survey a few common scheduling strategies and the deficiencies CLC is designed to address. We then walk through the imple...

cosmic osprey
#

Heyy all! Reading group session is starting momentarily!

#

@unborn heart are we still meeting today?

unborn heart
#

I’m down to reschedule though!

hushed girder
#

oh I almost forgot about the reading group .-.

#

So we will reschedule it?

dire bronze
unborn heart
dire bronze
cosmic osprey
#

I'm down with rescheduling if possible

dire bronze
#

Are we still on for this Sunday?

unborn heart