#Community Spotlight
The first event in this series will be by @manic token:
Abstract: Tokenization plays a fundamental role in language modeling, but it is often overlooked. When it fails, it can have disastrous effects on performance, as in the case of the unpredictable and dangerous behavior caused by glitch tokens. In this talk, I will discuss two causes of glitch tokens: undertrained tokens and ill-formed UTF-8 sequences. To prevent undertrained tokens, we propose PickyBPE. This modified BPE tokenizer prunes intermediate “junk” tokens that are almost never seen by the language model. This prevents glitch tokens and also helps better utilize the tokenizer’s vocabulary. To address ill-formed UTF-8 sequences, we developed a novel encoding scheme to represent text across different writing systems. This encoding scheme can also act as a pre-tokenization scheme that prevents a tokenizer from learning merges that lead to undecodable UTF-8 sequences. Together, these methods can help us develop more efficient and robust tokenizers.
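For anyone unsure what an "ill-formed UTF-8 sequence" looks like in practice, here is a minimal sketch. It is not the speaker's method or encoding scheme, only an illustration of the problem: a byte-level token that cuts a multi-byte character in half cannot be decoded on its own. The helper name `is_well_formed_utf8` and the sample string are made up for this example.

```python
def is_well_formed_utf8(token_bytes: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 on their own (hypothetical helper)."""
    try:
        token_bytes.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "é" is 2 bytes in UTF-8 and "🙂" is 4 bytes, so the string is 6 bytes long.
data = "é🙂".encode("utf-8")

whole_char = data[:2]     # the complete encoding of "é"
split_char = data[2:4]    # only the first half of the emoji's 4 bytes

print(is_well_formed_utf8(whole_char))  # complete character
print(is_well_formed_utf8(split_char))  # truncated character
```

A byte-level tokenizer that learns a merge ending at a boundary like `split_char` ends up with a vocabulary entry that is not valid text by itself, which is the kind of token the abstract is referring to.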
Can't make it this time. Will it be recorded?
Yes! Recorded and posted on our YouTube 🙂
Hey @everyone, we're finally going to have the second episode of this series 🫢 I promise this time it won’t be a six-month gap between them.
@muted elk is going to be talking about running diffusion-based world models in real time on consumer hardware (the attached video shows it running on an RTX 6000).
Title: Speeding Up Diffusion For Real Time World Simulation On Consumer Hardware
Abstract: In this talk we will explore a full pipeline for pre-training and post-training diffusion world models to reach staggeringly fast speeds on consumer hardware. We will cover the complete process, from creating and distilling diffusable autoencoders to pre-training and post-training diffusion world models. We will cover several challenges in increasing throughput that we have solved, as well as blockers and promising research directions.