#Evaluating Classifier-Free Guidance impact
i'm just thinking
what i'm seeing actually is that Guanaco has an acc peak at 1.5
and Wizard has an acc peak between 1.1 and 1.25
Yes, they have peaks in different places
the % invalid min-regions look a lot wider
i wonder if there's such a thing as grouped-line chart...
I have to go for ~15 minutes, sorry...
ok no problem
I'm back. In my opinion, the two charts that include everything definitely demonstrate the two trends (the acc and the invalid). It's a little bit strange to mix both datasets and models on the same graph, but it might be a valid option if we want to emphasize the results and put all of them in the main paper.
Another option is to split it in two and put one of the datasets/models in the appendix (I don't think it's that bad, not everything should be in the main paper. People read the appendix, especially when the paper references the appendix figure)
yes, i kinda think we should split
ultimately it's up to you but i think maybe sticking with gsm8 is a good idea since it's like you said, more important
i redid them so that accuracy and invalid are grouped, that way we can have a real ylabel
still have a lot of vertical white space, which i'm not happy about but 🤷
Yes, they are not on the same scale, and I don't think we can do anything about it (either the acc and invalid don't share a scale, or the model performances don't). Note that you have typos in the names.
we can do a broken axis:
but it's easily missed by readers thus not a favorite technique imo
Yes, I also find it very confusing
it's very late for me. can i send you the finished files and can you put them into the paper? do you know how to use subfig?
I will send you files
these are screenshots
i will send you code too in case you're curious
thank you
np, thank you
Just note the typos in the model names (and parameters)
yup fixing that now then gonna send over
ok these are the figs
i made gsm smaller so it could go in the main body
Wizard LM is 30B
so I'm putting one in the main paper and one in the appendix?
i think so?
i think 4 plots would be too much info in the main body
i also think squeezing two tasks into one plot isn't great
I agree, great. So I will modify the paper accordingly
Ok, I think it looks good! please review when you wake up
(please also go over the captions, I changed them yesterday according to the feedback)
a lot happened while I was asleep!
The new plots are cool, they really show the trends
So, we have 8h before ArXiv submissions close today
Once you're ready for release, please 👍 this message :) @loud adder @patent gull @blissful garden @unique sedge. I will send the paper either 2h before or when I get your validations, whichever happens first.
You should take a big buffer because exporting from overleaf to arxiv can be very exhausting and might take time
That's why I'll do it the very second I get everyone's go
Just waking up. There was one word in the acknowledgments that I don't remember
But it was used a lot
And I didn't know what it meant
Ah yes, what do you mean by "redactor"?
wait, that's not a word?
"writer" then?
Also, what were the comments from the two people @loud adder was showing it to?
Anything helpful?
Sure writer/editor
Definitions of redactor
noun someone who puts text into appropriate form for publication
yeeee!
"To redact" means to remove something from a text
So I guess it's a view of writing in the negative lol, but I'll take it, it sounds fancy!!
let's go with "writer" then. If it's confusing to you, it will be confusing to a lot of people
no feedback has been communicated to me
Haha 🤷 yeah…
but there was a "nsaphra" reading the paper yesterday
Bummer :/ would've loved some additional feedback
Lololol
Maybe nsaphra made some changes
nope
Btw how did the paper drop down to <10 pages?
better figures layout AFAIK
(aka: Stella LaTeX magic)
NeurIPS also uses 1.5" margins which are quite large. Since we're just using their template rather than submitting to the venue I edited the style file to use 1" margins
up
So the way it works is that you can resubmit as many times as you like in the next 4 hours (until 1400 ET / 1800 UTC) and it'll go live at the same time. After that it gets pushed back a day though.
yes, though I'd be happy not submitting a bazillion times bc we're fixing punctuation lol
Ahhhh that makes sense
Alright. As soon as I get to the office I'll give it another read, but I'll only change anything if I see something major
love how the paper has turned out. Good luck in submitting and congrats!
uh oh ok not a big deal but be prepared for a resubmit
we never said what Figure 1 displayed
how much time do i have?
ok
idk man
it belongs in the intro anyway
that's too far down to be intro-ing Figure 1 for the first time
all right
awesome!!
(just triple-checking all the figure captions)
ok great
i'm logging off overleaf otherwise I'm gonna drive you and myself crazy
overall, a million thumbs up 👍👍👍👍👍
this paper came out so well, had so many unique parts, and tied together really nicely at the end
it's a great paper, really foundational. We're in a different ballgame from CAD at this point
uh, I need "endorsement" bc I never published in cs.CL
The code is 7MN9HQ
@patent gull it seems you can endorse me
@versed flax I can never find the page to endorse a paper… there should be an option to send an email
Feel free to send it to me
It seems you can just click on this link:
Done
Thank you! it worked!
! Package natbib Error: Bibliography not compatible with author-year citations.
trying to solve it
now you're doubly endorsed
there's a fix for this, hold on... i know there's some github package that just fixes this for you
magically
I don't get why it complains about author-year, the neurips template uses numbers
hmm
overleaf unfortunately does fix a lot of things under the hood
do you have a local latex install?
yeah
sometimes i've had to go through that a bunch of times to make sure it works
overleaf is magical in a lot of ways
ugh i wanna find this github package
maybe try this?
\usepackage[numbers]{natbib}?
Yes, it's really a nightmare to export from overleaf to arxiv
I hope it's enough time
it worked!
It was compiled and submitted?
Amazing, this was quick
cool!
Do we want to fill any of this?
I don't think so
Friends, it's party time!
Thank you everyone! It's been a blast. Next stop: Sunday 6pm UTC for a bit of advertising, and we'll talk about conference submission later :)
woooooooooooooooooooooooooooooooooooooooooooooooooooooo!
let's take a nice long breather, now
wow
Congrats y'all
yay!
FYI:
Your article is currently scheduled to be announced at Mon, 3 Jul 2023 00:00:00 GMT.
Updates before Fri, 30 Jun 2023 18:00:00 GMT will
not delay announcement.
congrats on finishing!
https://arxiv.org/abs/2306.17806 here it is folks!
Classifier-Free Guidance (CFG) has recently emerged in text-to-image
generation as a lightweight technique to encourage prompt-adherence in
generations. In this work, we demonstrate that CFG can be used broadly as an
inference-time technique in pure language modeling. We show that CFG (1)
improves the performance of Pythia, GPT-2 and LLaMA-famil...
LMK when you tweet about it and I'll retweet it from the EleutherAI account
I generally find more success and engagement with tweets that walk you through the highlights of the paper. I would add a couple more, drawing out particularly interesting figures and talking about them a bit?
All right. Let me give it a try. That's a first for me.
Is it allowed?
At least for ICCV and CVPR (until the last ban decision), authors were not allowed to publish on social media
We have no conference in sight, so...
Did anyone happen to look at the predictions on the lambada val set? I'm curious what sort of incorrect responses CFG is fixing
What is that "@Halocene" in the middle
Ooooh haha my memory about the example stays at our first one
GUYS WE GOT RETWEETED BY JEREMY HOWARD
I don't think we did
As an AI model user, I hope it's okay to just drop into the Discord here to post: I love love this work so much. No kidding, two days ago I was lamenting "What is wrong with this world, why don't LLMs have negative prompts." https://news.ycombinator.com/item?id=36537845 and then POOF. The world is right again.
The only constructive comment I might contribute is on the concept of negative guidance, as a user who prompts. It's weird to imagine the idea in text LLMs, yeah. But what about audio LLMs?
To me it doesn't seem that strange in an audio LLM like musicgen. Since musicgen has a CFG-like variable out of the box, negative CFG could output music that I could plausibly consider vaguely the opposite of my text prompt. In this case even without a positive prompt, just the unconditional and the negative, since I hadn't modified it yet. (The range of negative CFG that produced normal-sounding but different music was quite narrow and fiddly, typically something like -0.2 to -0.3, and changed for every prompt, so it was hard to use though.)
I've been trying to bang two rocks together to make negative guidance work in a TTS LLM, and now I feel so much less crazy that this exists. It doesn't quite make as much sense there, but it will be fun at least. (I think about it maybe like a director showing an actor a scene, and then being like, "Ok you see that? I want the opposite of that.")
I'm so glad to be the "POOF" in your world haha. If you experiment with negative prompting, please let us know. It's a bit more challenging than with diffusion models, since the sampled text gets appended to the neg prompt as well, and it's hard to come up with a neg prompt that makes sense with its opposite continuation
I think there's a wording trick but I couldn't find it
I can't actually follow the math or the fundamentals to know if what I did was like this idea, but I did try using a bunch of other generated samples in a way that seems similar. I took one voice, and I found some kind of difference between that voice and 100 random English audio samples. Just counting token frequencies. So the idea is you have the tokens in that voice that are unique, but not just 'human speech' -- and then you flip the sign on those, and penalize them in the sampler. It's like an anti-voice. Not sure it makes sense!
it totally does
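A minimal sketch of the "anti-voice" idea described above, assuming tokens are plain integers and the sampler exposes per-token logits as a dict. The function names, the add-one smoothing, and the log-ratio penalty are all illustrative assumptions, not what was actually run on Bark:

```python
import math
from collections import Counter


def distinctive_token_penalties(voice_tokens, reference_tokens, strength=1.0, smoothing=1.0):
    """Return {token: penalty} for tokens over-represented in the target voice.

    Hypothetical helper: compares smoothed token frequencies in one voice
    against a pool of generic reference speech.
    """
    voice_counts = Counter(voice_tokens)
    ref_counts = Counter(reference_tokens)
    vocab = set(voice_counts) | set(ref_counts)
    voice_total = len(voice_tokens) + smoothing * len(vocab)
    ref_total = len(reference_tokens) + smoothing * len(vocab)

    penalties = {}
    for tok in vocab:
        p_voice = (voice_counts[tok] + smoothing) / voice_total
        p_ref = (ref_counts[tok] + smoothing) / ref_total
        log_ratio = math.log(p_voice / p_ref)
        if log_ratio > 0:  # more typical of this voice than of generic speech
            penalties[tok] = strength * log_ratio
    return penalties


def apply_penalties(logits, penalties):
    """Subtract each token's penalty from its logit (flipping the sign on the voice)."""
    return {tok: logit - penalties.get(tok, 0.0) for tok, logit in logits.items()}


penalties = distinctive_token_penalties([1, 1, 1, 2], [2, 2, 3, 3])
adjusted = apply_penalties({1: 0.0, 2: 0.0, 3: 0.0}, penalties)
```

Only tokens that are more frequent in the voice than in the reference pool get penalized, so generic "human speech" tokens are left alone.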
The wonderful thing about AI models, especially recent ones, is that it's almost hard not to make output that is at least interesting.
Can I bug you about one somewhat random question that's only vaguely related? On the MusicGen GitHub, someone posted cool music and said they used "-p sampling", and then a bunch of other people were asking if it was really using -p sampling and whether that worked, and I thought it would be funny to actually try it. So, reverse the order of the logits, least likely first, otherwise just like top-p. Actually though, the output seems genuinely kind of useful and different in an audio LLM. And as far as I understand, it's not just equivalent to something else? In a TTS model, it makes people have a Christopher Walken speech pattern. They choose the wrong places to pause. SO COOL.
well I mistakenly implemented that on LMs and it was just bad lol
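A rough sketch of the "reversed top-p" variant described above, assuming it means: sort tokens from least to most likely, keep the smallest prefix whose probability mass reaches p, and sample from that set. The function name and the renormalization over the kept set are assumptions; this is not taken from MusicGen or Bark:

```python
import math
import random


def reverse_top_p_sample(logits, p=0.9, rng=random):
    """Sample from the least likely tokens whose cumulative mass reaches p.

    Returns (sampled_index, kept_indices). Regular top-p would sort descending.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Ascending order: least likely token first.
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break

    # Sample within the kept set, proportionally to the original probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0], kept


toy_logits = [math.log(0.5), math.log(0.3), math.log(0.2)]
token, kept = reverse_top_p_sample(toy_logits, p=0.4, rng=random.Random(0))
```

With these toy logits and p=0.4, the most likely token is never eligible, which is the opposite of what ordinary nucleus sampling would keep.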
That's what I expected. I think maybe the Bark audio TTS model may just be unusually robust: you can ban 75% of the tokens randomly and sometimes it sounds mostly normal. It was okay-ish in musicgen for short periods as well, but eventually it degrades to non-music. For music I feel like I really want just endless text boxes for different prompts with CFG weights, some positives, some negatives, some CFG values that vary over time, like based on the current token count. Feels pretty natural in music. It's like a conductor, holding out a hand to a section of the orchestra, slowly raising it up, increasing the weight of one section, decreasing another. Continuously changing.
Bark is not yet in Huggingface but I'm so excited I almost want to try and port this code...
Is there additional context to the "wording trick" phrase? Or do you mean generally you think it's plausible that fully negative (total opposite) prompting is useful and effective, but the prompt engineering isn't yet known how to make it work?
Say you want to generate lyrics. Your prompt would be "I wrote a song, the lyrics are:"
So that will generate lyrics right
But now let's say you want to use a neg prompt so that these lyrics are not about love
As far as I could think, your neg prompt would be "I wrote a love song, the lyrics are:"
And again, there must be a better way to prompt engineer a neg prompt, but we did not find what it was
because then the continuation won't be a love song at all, which will lead to a weird negative continuation:
"I wrote a love song, the lyrics are: <something not about love at all>"
Right. What does working correctly look like?
no idea. We couldn't find the right way to phrase it.
We used negative prompts only as more general versions of the prompt or totally opposite of the prompt (surprisingly, that still works), but we couldn't find the prompt engineering to make it more targeted /granular
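To make the lyrics example concrete, here is a toy sketch of decoding with a negative prompt in place of the unconditional branch. `toy_next_token_logits` is a hypothetical stand-in for a real language-model forward pass, and the three-word vocabulary, greedy decoding, and guidance values are illustrative assumptions. The generated continuation is appended to both the prompt and the negative prompt, which is exactly where the odd "love song that is not about love" context comes from:

```python
import math

VOCAB = ["love", "rain", "road"]


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cfg_next_token_logprobs(cond_logits, neg_logits, gamma):
    """Extrapolate away from the negative branch in log-probability space."""
    cond = log_softmax(cond_logits)
    neg = log_softmax(neg_logits)
    return log_softmax([n + gamma * (c - n) for c, n in zip(cond, neg)])


def toy_next_token_logits(text):
    # Hypothetical model: strongly prefers "love" when asked for a love song.
    if "love song" in text:
        return [2.0, 0.0, 0.0]
    return [1.0, 0.5, 0.5]


def generate(prompt, neg_prompt, gamma, steps=3):
    out = []
    for _ in range(steps):
        suffix = " " + " ".join(out) if out else ""
        # The same sampled continuation is appended to BOTH branches.
        logprobs = cfg_next_token_logprobs(
            toy_next_token_logits(prompt + suffix),
            toy_next_token_logits(neg_prompt + suffix),
            gamma,
        )
        out.append(VOCAB[max(range(len(VOCAB)), key=lambda i: logprobs[i])])
    return out


prompt = "I wrote a song, the lyrics are:"
neg_prompt = "I wrote a love song, the lyrics are:"
plain = generate(prompt, neg_prompt, gamma=1.0)
guided = generate(prompt, neg_prompt, gamma=1.5)
```

In this toy setup, gamma=1.0 reduces to ordinary conditional decoding, while gamma=1.5 pushes the output away from the token the negative prompt favors.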
The first thought I had, skimming the code, was I'm gonna add in a text box that can swap in for the unconditional or the 'neutral prompt' -- no idea what that enables or if it makes sense. But in audio I did have to use like 'generic english voices' not just 'unconditional generation' for the token thing I did.
But just vaguely, maybe 'unconditional' being another input, could ground the "opposite" concept somehow.
The great thing about audio? As long as it changes the sound... it could still be a useful knob to turn, even if you have no real idea why it's having the effect, or can't really predict it.
Trickier in pure text.
https://github.com/ggerganov/llama.cpp/issues/2083 people want it in llama.cpp now!
The Diffusion people have been living a life of spoiled luxury. Negative prompts, control net, a billion other syntax tweaks, while the LLM community has nothing. They are ravenous and I get it.
Actually in Bark, there's kind of two prompts, two different sets of tokens, both are used at inference, concatenated. One for the voice, one for the text to say. So each could have this implemented separately, gonna be crazy.
It's all just tokens out of a GPT model, it should all work
They should implement something like the visualization tool you made, that is super cool too
ggerganov will have it done so fast. if you google any random weird sampling thing, half the time, the only working code I can find that isn't the original repo is in ggml. he just implements everything.
oh wow this is awesome!
Since the big names are retweeting did your twitter notification blow up?
ngl, 36 retweets and 117 likes on a post is the most activity I've had on twitter lol
and yes, many likes and follows!
I can see you're busy; some time when you're not, I wonder if you remember whether the inaccurate answers at high CFG values were just a wrong number, or if they were possibly wrong in a weirder way, perhaps something like "Q: How many apples do they have? A: 3 cans of tennis balls."
@fallow egret that's for you
At very high CFG values, you start to get garbage. The interesting part is the medium-high range: there you still get a high percentage of valid answers, but the generated content adheres too closely to the prompt, which does not allow the development of a rich reasoning chain that would reach the correct answer.
Interesting, thanks. Just a hunch, I don't know much but I crank up values and get output like I posted from going way too far, just trying stuff. Maybe ramping the CFG value up or down over the course of the sample could find a real sweet spot better than a fixed value.
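As an illustration of the ramping idea, a guidance schedule could be as simple as a linear interpolation over the decoding step. This is a hypothetical sketch; the function name and the start/end values are assumptions, and nothing like it was evaluated in the paper:

```python
def linear_gamma_schedule(step, total_steps, gamma_start=1.0, gamma_end=1.5):
    """Guidance strength for a given decoding step, ramped linearly."""
    if total_steps <= 1:
        return gamma_end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return gamma_start + frac * (gamma_end - gamma_start)


# One guidance value per generated token, to be passed to the CFG step.
gammas = [linear_gamma_schedule(step, 5) for step in range(5)]
```

Swapping the start and end values gives the "ramp down" variant, for instance strong guidance early and weaker guidance once the reasoning chain is underway.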
People just do crazy things with the guidance strength. I wanted to keep things simple for the paper
You're toppling the Diffusion cartel. They can't keep all this stuff to themselves any longer. We're coming for all of it. Even when it doesn't really make sense in an LLM. I'm putting in my prompts anyway.
Yes, this sounds like an interesting direction for future work
The llama/oobabooga/text-gen community will probably try a lot of obvious twists and variants, if there's a new variable exposed, people will start really exploring.
Is it possible to trade more than 2x compute time for some further gains?
I hope it will happen, this means a lot of citations
Not that I'm aware of.
Actually in CoT you have self-consistency, where you run the chains multiple times, and then there is an interesting trade-off (you can apply different CFG values in each iteration, etc., and do a smart ensemble)
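A minimal sketch of that combination, assuming each chain-of-thought run was generated at its own guidance strength and has already been parsed into a final answer (or `None` when the output was invalid). The plain majority vote is an assumption; a "smart ensemble" could weight runs differently:

```python
from collections import Counter


def self_consistency_vote(runs):
    """Majority vote over (gamma, answer) pairs, ignoring invalid (None) answers."""
    valid = [answer for _, answer in runs if answer is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Hypothetical parsed outputs from four chains run at different CFG strengths.
runs = [(1.0, "18"), (1.25, "18"), (1.5, None), (2.0, "20")]
final_answer = self_consistency_vote(runs)
```

Each extra chain costs another full generation, so this is where the "more than 2x compute" trade-off comes in.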
There is always trivial brute force stuff. Not really the same concept though. Like you can run an entire second audio model inside the sampling loop and use it to judge the emotion of the output, and then backtrack and keep trying. It's the least efficient way to do something like that, but if you only need 2 minutes of audio, you can run it all night and it eventually works.
RT'd
Great work! It's very exciting to see a project like this come to fruition in Eleuther, where someone can come in with their ideas & results and get help refining it into an impressive paper 🥳
(typo)
We got an email from someone who wants us to cite their paper on sampling from LLMs
Paper: https://arxiv.org/abs/2110.08294
Seems like we should be able to run their generative code pretty easily if we want to add a comparison to the paper: https://github.com/zhenwang9102/coherence-boosting/blob/main/generation/generation.py
I think they want us to cite them because of equation 2 which is equivalent to CAD
P.S., I actually ran a comparison to ensembling. CFG works significantly better
Nice! Let's definitely get this added to the paper
It was a short table, so it seems a little bit strange to add it as a table, but we can definitely think of an appropriate way to present these results
I feel like it would fit as a natural subcolumn here?
For each one of the experiments?
We can do that, but I think that ensembling generally tries to tackle a very different issue. So it would be nice to mention in one of the settings that we beat the ensemble (with half the computation resources!), but I'm not sure we want to do that for all these experiments, since it's not an apples-to-apples comparison with respect to the problem it is trying to tackle
If you just meant the table representation format, then yes, it sounds like a good idea!
Is it, given that there's a log? Or is their log f our f?
We also apply the addition with respect to the log of the probabilities (this is also the case in the original vision CFG)
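For reference, the log-space combination being discussed can be written as below. The notation is an assumption chosen for this note: w_i is the next token, w_{<i} the tokens so far, c the prompt, and gamma the guidance strength.

```latex
\log \widehat{P}(w_i \mid w_{<i}, c)
  = \log P(w_i \mid w_{<i})
  + \gamma \bigl( \log P(w_i \mid w_{<i}, c) - \log P(w_i \mid w_{<i}) \bigr)
\quad\Longleftrightarrow\quad
\widehat{P}(w_i \mid w_{<i}, c) \;\propto\;
  \frac{P(w_i \mid w_{<i}, c)^{\gamma}}{P(w_i \mid w_{<i})^{\gamma - 1}}
```

Setting gamma to 1 recovers the ordinary conditional distribution and gamma to 0 recovers the unconditional one; because the sum is taken over log-probabilities, it corresponds to a product of powers of the two distributions rather than a linear mixture of them.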
@wheat zenith retweet us!!! ~~
I did! I'm gonna eventually post a ton of negative prompts I'm sure too, I love them too much. https://twitter.com/jonathanfly/status/1675854740142399490
@here retweet us!! https://twitter.com/Vermeille_/status/1675664118500454400
You are now my last two tweets. And I have been tweeting like 3 times a month.
So that's a LOT
I just wanna say a huge, huge thanks and congrats to @versed flax who will never take credit for it but is truly the leader here. He went many, many sleepless nights trying to be awake when we all were and coordinate. Endlessly thoughtful, experimentative, questioning. You really motivated me to be a better thinker.
Also a huge shout out to @blissful garden for powering us through all the tough experiments!!! You also tolerated all my last-minute requests asking for different plots!!
definitely share your experiments with us!
I'm pretty amateur, every single line of code there, probably learned last month, lol
the acknowledgements in the paper don't fully capture how hard these two worked and the spirit, energy and devotion here. This came together quickly but doesn't mean it wasn't deep
yes please!! (time to start talking about a follow up paper lol???)
use CFG in finetuning?
i'm down
maybe we have a finetuning paper more focused on negative prompting?
that seems like an area that we can really own and build from this paper on
Is the paper locked, or could you also test CFG and/or negative prompts in some audio LLM? To me they feel pretty natural, negatives too. Sound descriptions have pretty clear opposites. A loud scratchy voice, a soft smooth voice, whatever. Even a person, or an entire voice. If you asked a group of people to pick the opposite voice out of a set, they'd probably mostly pick the same person. As opposed to something conceptually hard to grasp like "the opposite of a love poem"
why not write a separate paper for that?
MusicGen has CFG in it already right? I remember there was a conversation about that
Yeah. It only had one prompt though. So if you flip the sign, it's just a pure negative prompt. And then the regular unconditional it always uses. It does actually kind of work, but the range where it works is narrow and fiddly, you have to try to find it.
I want to join the congrats, and I'm sure everyone will agree that you also deserve to be applauded. The three of you did really great work. This is a high-quality paper that will definitely generate a lot of interest
In musicgen, you can do anything and make weird music. for example mapping CFG to a sine wave, based on tokens. sounds great, adds variety
It breaks up the repetition. audio is easy mode I think. Just being different, is good.
Actually this is gonna be really interesting. We can take any finetuning dataset, prepend each paragraph with a negative prompt, and finetune towards the extrapolated logit distribution instead of just the next-token prediction.
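One way to read "finetune towards the extrapolated logit distribution" is as a distillation objective: compute the CFG-adjusted next-token distribution and use it as a soft target for the model being finetuned. The sketch below is only an assumption about how that could look on a single position, with hypothetical function names and toy logits; it is not an implemented or tested training recipe:

```python
import math


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cfg_target_logprobs(cond_logits, neg_logits, gamma):
    """CFG-extrapolated next-token log-probabilities, used as a soft target."""
    cond, neg = log_softmax(cond_logits), log_softmax(neg_logits)
    return log_softmax([n + gamma * (c - n) for c, n in zip(cond, neg)])


def distillation_loss(student_logits, target_logprobs):
    """Cross-entropy of the student's distribution against the soft CFG target."""
    student = log_softmax(student_logits)
    return -sum(math.exp(t) * s for t, s in zip(target_logprobs, student))


target = cfg_target_logprobs([1.0, 0.5, 0.5], [2.0, 0.0, 0.0], gamma=1.5)
loss_matched = distillation_loss(target, target)          # student already matches
loss_mismatch = distillation_loss([2.0, 0.0, 0.0], target)  # student mimics neg branch
```

The loss is smallest when the student reproduces the guided distribution on its own, which is the sense in which the benefit would be "baked in" without a second forward pass at inference.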
I also happened to ask someone in the HuggingFace discord about logit attribution, and this is, like, the Discord where that concept seems to have literally been created, wild timing. I only had a practical question about using it to make the audio waveform visualization act like a debugger for your prompt, but also look cool. But the idea is like an audio version of the colored words in the paper, actually.
oh I see. Yeah it would be fun to properly try out negative prompt
shoutout to @fallow egret and @paws too, i think you guys handled a tonnn of back-and-forth, chaotic discussions very, very well and with grace. without your parts, this would be a way flimsier paper
Made the tables a bit cleaner. Especially if we decide we want to add more comparisons, this will scale nicer than the original layout
I agree that music would be interesting but it feels like a different direction. I'm most interested in seeing how far this can go in the language domain. (however, if you wanna take it in the music direction, do it!!!!! i'm sure we'll all be interested in contributing)
nice!
@wheat zenith FYI there's also a thread for training models to generate music, #1106671860294357055
yes was gonna mention haha
(They seem to have stalled out due to people being busy, but additional manpower might help with that)
i'm loosely involved in that project... i think it's also a question of getting the boilerplate together/training baselines. I question whether it's the right time to start considering extensions like CFG, but ultimately, additional personpower does always help!!
I think for me an interesting way to extend this work would be to take it to the RL context. You can see CFG as modifying the model's policy given another (negative) policy. And I think an interesting direction is: given a new reward function, how can we steer the model properly at inference time only? I think this could be done with the ILQL framework, but these are only very initial thoughts...
https://arxiv.org/pdf/2206.11871.pdf
Dude it's really been a wild and fun ride. Really, massive kudos for never stopping improving the paper's quality when I was ready to settle. Massive thanks to @blissful garden glu for tirelessly running all those experiments. And overall for the incredible quality of your reasoning to push the paper further and further.
And obviously thanks to Stella for stepping in at the very beginning and sending me in the right direction to be able to discover and show the power of CFG, and for the multiple reading passes
Ok I'm just gonna say thank you everyone because I'm really bad at writing those. But this is really sincere since I enjoyed the past month way better than writing my PhD thesis.
for real. I've been dreading the 4y my PhD lasted, but that month was a blast
https://github.com/mlc-ai/mlc-llm/issues/499 Another feature request! That's three!
https://github.com/LostRuins/koboldcpp/issues/292 another one!
@versed flax it seems like it's really making rounds!
I'm so stoked
Retweets from Alexia Jolicoeur-Martineau, Emad Mostaque, Jeremy Howard, lucidrains, and some others whose names I forgot
The raw stats are also pretty cool to see
It's really exciting
I almost never use Twitter so I don't know how big of an effect that is, but it's definitely non-zero
Oh yeah, someone from Nomic.ai who (ofc) commented on the GPT4All experiment!
Curious observation: EleutherAI retweeting it seems to have made basically no impact. < 100 people have seen the EAI retweet
That's crazy. I had no follower base
I don't know who got to see it first then
I thought it was your retweet that impacted it
Or maybe you retweeting my post made the recsys show my post to EAI's followers directly rather than your retweet?
I quote tweeted tho
Every other quote tweet we've ever done seems to have 20-100x as many views
I guess we did something to offend The Great Musk and got throttled
That's easily explained by doing good work and getting noticed
That's a nice compliment
I'm waiting to see whether it delivers on downstream applications before self gratification and claiming we did "good work" haha
Fair enough
btw @loud adder what's the consensus on non-English LLMs? Nobody seems to really care, why?
- No academic interest due to lower innovation / lesser citation potential?
- No industry interest bc it's just too expensive to build a dataset and train one?
- No interest because we just aim for massive multilingual models?
Almost everyone who trains LLMs is paid by a US or Chinese company
There's a small Korean scene
(Our Korean models are the best OS ones AFAIK)
There's a Swedish non-profit that's trained single-digit pan-Nordic models
I would be so down training a French one
Go find me ~ 1 TB of French text and we can talk
There's this model which is a French fine tune of GPT-J: https://huggingface.co/Cedille/fr-boris
And Cedille has an unreleased model they sell commercially IIRC
We can probably make do with 300 GB, though quality will suffer compared to 1 TB. And this is post-filtering, to be clear
mC4 will get you half way there IIRC
Maybe French Wikipedia and a couple other sources can close out the rest
mC4 is Common Crawl?
Yeah
Is that real though? It looks like there are a lot of talks about quantity vs quality happening
What do you mean by "is that real"?
If you can find really high quality data you can get away with less, but we're talking like "a substantial fraction of all books ever written in French" kind of quality
I mean, is this a number set in stone that can't be challenged with those modern, quality first, approaches?
That is based on modern, quality first approaches
Ah.
This is about mixing it with code data and running multiple epochs
I know there's a pretty big source of books I want to scrape but I don't know the actual size of it
Oh I read this paper!
The open question though is: should the code be written in French too? That doesn't exist lol
That's part of why I said quality will suffer, but you can live without code "in French" most likely
That's interesting. I'm not sure there's a high value doing this (ChatGPT is already pretty amazing at French tbh and I'm sure it's not specifically built with French in mind)
But it sounds like a fun ride
There's some amount of cross-lingual generalization though, see
https://arxiv.org/abs/1910.11856
https://arxiv.org/abs/2005.00633
https://arxiv.org/abs/2211.01786
State-of-the-art unsupervised multilingual models (e.g., multilingual BERT)
have been shown to generalize in a zero-shot cross-lingual setting. This
generalization ability has been attributed to the use of a shared subword
vocabulary and joint training across multiple languages giving rise to deep
multilingual abstractions. We evaluate this hypo...
Massively multilingual transformers pretrained with language modeling
objectives (e.g., mBERT, XLM-R) have become a de facto default transfer
paradigm for zero-shot cross-lingual transfer in NLP, offering unmatched
transfer performance. Current downstream evaluations, however, verify their
efficacy predominantly in transfer settings involving la...
Multitask prompted finetuning (MTF) has been shown to help large language
models generalize to new tasks in a zero-shot setting, but so far explorations
of MTF have focused on English data and models. We apply MTF to the pretrained
multilingual BLOOM and mT5 model families to produce finetuned variants called
BLOOMZ and mT0. We find finetuning l...
I would anticipate that code specifically is a high-transfer medium. But I don't have good evidence of that.
I guess we had some in Crosslingual Generalization through Multitask Finetuning. But the evaluation metrics were pretty lacking
Yeah that echoes a private conv I had with @unique sedge earlier. It's easier to learn another language than to start from scratch. You already know how to reason, syntax can be more or less transferable, and vocabulary is just a thin sugarcoating around those much harder implicit tasks / skills
Ok the question was "why" and apparently the answer is "funding"
Have you read https://arxiv.org/abs/2212.09535
The BLOOM model is a large publicly available multilingual language model,
but its pretraining was limited to 46 languages. To extend the benefits of
BLOOM to other languages without incurring prohibitively large costs, it is
desirable to adapt BLOOM to new languages not seen during pretraining. In this
work, we apply existing language adaptatio...
I didn't! I will skim through the paper before falling asleep
🥳 https://twitter.com/apage43/status/1676416243652505601 people are starting to use it in practice and are happy about it 🥳
@versed flax got a Google news alert about the paper too https://www.marktechpost.com/2023/07/03/eleuther-ai-research-group-demonstrate-how-classifier-free-guidance-cfg-can-be-used-with-llms/
Recently, huge language models have shown impressive generative skills, allowing them to handle a wide variety of problems. Typically, 'prompting' is used to condition generation, either with task instructions and context or with a small number of samples. However, problems, including hallucination, deterioration, and wandering, have been observ...
So damn great! I can't count the number of papers I discovered because my phone recommended me an article from MTP
everyone loves that table, they keep tweeting it --- i'm pretty psyched!! latex \cellcolor{} ftw
also i'm so glad @versed flax pushed for the assistant angle, and really pulled all-nighters to make it work
i think that's why people are so psyched about us and not CAD or another one
Told ya. Marketing. Lol.
There are two main selling points: assistants & 0.5x model size
Those are the things that people seem to like about it
CFG will land in Hugging Face tomorrow I guess :)
@loud adder we are chatting about follow-up papers in order to capitalize on this attention... do you think we can continue to use the cluster?
Yes, you can plan on continued access to 8xA40s for as long as is productive and you make progress
cool!! thank you so much!! yeah I don't think any of us are ready to jump in 100% yet, but we are talking about paper #2 being a fine-tuning paper
mainly @blissful garden 's idea, but we're thinking of fine-tuning on CFG-generated data to see if we can "bake" in some of the benefits, thereby getting rid of the 2x inference cost
It would be very interesting to test it with Dromedary prompts and see if you can get a boost in performance in the self-alignment process. That would be a very important and interesting result in the field. One of the issues in Dromedary is that they have very intensive prompts, and I'm guessing that the base LLaMA model does not adhere well to the prompts
https://arxiv.org/pdf/2305.03047.pdf
https://github.com/ggerganov/llama.cpp/pull/2135 Someone added it to llama.cpp!
https://vermeille.github.io/cfg-llm/ quickly made a paper page
Someone using the pod? the more I use it the more broken transformers get
now I even get a protobuf error
yesterday I had some weird lib issues
oh I tried to install streamlit and wanted to see if I can get the UI working... Maybe that breaks it
We should probably have conda in it instead of mixing everyone's env together
I fixed it don't worry.
@fallow egret Can you translate this coverage of CFG for us?
https://twitter.com/MikeE_3_14/status/1675930643857825792?s=20
[Hebrew post, text garbled in export]
Stay on topic with Classifier-Free Guidance (CFG)
Google translate does a fairly decent job:
Today in #shorthebrewpapereviews we are reviewing an article:
Stay on topic with Classifier-Free Guidance (CFG)
The article uses the proposed CFG method to improve the sampling of conditioned diffusion models. The purpose of CFG is to "adjust the adaptation of the sample" to conditioning (there is a parameter that controls the intensity of the adaptation). Here it is done for #LLMs
Here CFG is used to improve the ability of a language model to generate long and coherent answers to a prompt without forgetting the context. Here the unconditional model is the same model that generates text without conditioning in the prompt. That is, to construct an answer to a given prompt, we move the answer away from the unconditional sample when the strength of the removal is controlled with a gamma parameter
The proposed method works quite nicely (not surprising because it is kind of math-based - the formula for calculating the gradients is based on the Bayes formula). That is, the more you raise the Gamma, the more suitable the answer is to the prompt.
Yes, Google Translate is pretty accurate. He also wrote it in the Israeli ML Facebook group, and I already thanked him and clarified a few small points (like the fact that the gradients are in the diffusion model case; in the LLM setting we work directly on the log probability)
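For anyone following along, here is a minimal sketch of the log-probability version mentioned above, assuming the usual CFG form (unconditional log-probs plus gamma times the conditional-minus-unconditional difference). The function names and the toy logits are made up for illustration; this is not the code used for the paper.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

def cfg_log_probs(cond_logits, uncond_logits, gamma):
    # Move away from the unconditional distribution, toward the conditional one.
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = uncond + gamma * (cond - uncond)
    # Renormalize so the result is a proper distribution again.
    return log_softmax(mixed)

# Toy next-token logits over a 3-token vocabulary (illustrative values only).
cond_logits = np.array([2.0, 1.0, 0.1])    # model sees the prompt
uncond_logits = np.array([1.0, 1.0, 1.0])  # model does not see the prompt

plain = cfg_log_probs(cond_logits, uncond_logits, gamma=1.0)
guided = cfg_log_probs(cond_logits, uncond_logits, gamma=1.5)
```

With gamma = 1 this reduces to ordinary conditional sampling; raising gamma puts more mass on tokens that the prompt makes more likely.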
https://github.com/ggerganov/llama.cpp/pull/2135 CFG is officially in llama.cpp! The PR has been merged moments ago!
Somewhat related to the "pretrain with CFG" idea: https://huggingface.co/seonghyeonye/flipped_11B
It would be really awesome to see how the analyses done in this paper are affected by CFG:
I'd hope that with CFG, models will be much more likely to change their final answer when conditioned strongly on the generated reasoning chain.
We performed such an experiment (contrasting the prompt + chain vs. only the prompt on the answer token), and the results were good. However, we omitted these results from the paper since it was not 'real' CFG (it more closely resembles negative prompting)
I wish those counted as citations
https://twitter.com/novelaiofficial/status/1682010357819142147 maybe could retweet this for more visibility?
Btw, about that, any idea why the Bibliographic Explorer doesn't work? https://arxiv.org/abs/2306.17806
Classifier-Free Guidance (CFG) has recently emerged in text-to-image
generation as a lightweight technique to encourage prompt-adherence in
generations. In this work, we demonstrate that CFG can be used broadly as an
inference-time technique in pure language modeling. We show that CFG (1)
improves the performance of Pythia, GPT-2 and LLaMA-famil...
Semantic scholar doesn't think the paper has any citations: https://www.semanticscholar.org/paper/Stay-on-topic-with-Classifier-Free-Guidance-Sanchez-Fan/420e700d6902d065dc557c481979054477f9c6cb
Yes, but the paper cites stuff itself. Isn't that enough for the Bibliographic Explorer?
(Also, Semantic Scholar usually extracts figures & tables, it didn't)
I guess SS failed to parse the paper properly then
Here's a paper that also has zero citations but bib explorer works fine: https://arxiv.org/abs/2306.01481
Noticing the urgent need to provide tools for fast and user-friendly
qualitative analysis of large-scale textual corpora of the modern NLP, we
propose to turn to the mature and well-tested methods from the domain of
Information Retrieval (IR) - a research field with a long history of tackling
TB-scale document collections. We discuss how Pyserin...
Uh. I'll try and see what I can do then.
are such results / outputs saved somewhere? no worries if not
hmm never used Bibliographic Explorer at all...
Pure math people never care about citations so I'm quite behind about those tools
No, but I think If needed I can easily find the code and rerun it...
Actually, I found some results. Although it's not a good model to evaluate CoT since it's too weak, you can still definitely see the improvement
llama.cpp is about to add CFG to the web interface!
https://github.com/ggerganov/llama.cpp/pull/2217
rustformers is looking at it too
https://github.com/rustformers/llm/issues/377
https://github.com/abetlen/llama-cpp-python/issues/506 python bindings
Hi
I found this paragraph a bit weird because you say that embeddings are good and have nice structure and then say oh yeah actually we are doing logit arithmetic. But I think this is just equivalent to doing arithmetic with the final layer hiddens since the unembedding is a linear transform right?
yeah that's what we meant
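A quick numerical check of that equivalence, as a sketch under assumptions: the hidden states are taken right before the unembedding (i.e. after any final layer norm), and `W_U`, `h_cond`, `h_uncond` are random stand-ins rather than real model tensors.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 20
W_U = rng.normal(size=(d_model, vocab))  # hypothetical unembedding matrix
h_cond = rng.normal(size=d_model)        # final hidden state with the prompt
h_uncond = rng.normal(size=d_model)      # final hidden state without the prompt
gamma = 1.5

# CFG arithmetic applied to the logits of two separate unembeddings.
logits_cfg = h_uncond @ W_U + gamma * (h_cond @ W_U - h_uncond @ W_U)

# The same arithmetic applied to the hidden states, then unembedded once.
h_cfg = h_uncond + gamma * (h_cond - h_uncond)
hidden_cfg = h_cfg @ W_U
```

The two agree up to floating-point error because the unembedding is linear. Doing the arithmetic on log-probabilities instead of raw logits only shifts every token by the same constant, which disappears after renormalizing.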
Cool, I kind of suspected that, but it was unclear. If I were you I might make an update to the paper to clarify but up to you guys of course.
actually sorry
I just read the second to last sentence
which makes it more clear
I still feel like it's confusing-ish
because
idk it's like the core of what you're doing
and it should be 1000% clear
but anyway
I actually felt this paragraph is a bit hard to parse as well. I would have been confused when reading it the first time, but I'm not trained with ML background so I blamed myself lol
might be a better way to phrase it though. Will def think about it when we prepare to submit it somewhere
Also could we combine this with the tuned lens?
btw @versed flax any thoughts on where to submit?
To do CFG in intermediate layers
oh that sounds like a cool idea!
Also related to this https://www.alignmentforum.org/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector
they end up having to do "counterbalancing subtraction"
which is kinda like negative prompting
actually sorry this is the more recent thing https://arxiv.org/abs/2308.10248
Reliably controlling the behavior of large language models (LLMs) is a
pressing open problem. Existing methods include supervised finetuning,
reinforcement learning from human feedback (RLHF), prompt engineering and
guided decoding. We instead investigate activation engineering: modifying
activations at inference time to predictably alter model ...
I'll look into it. Maybe the more seasoned researchers can judge what's the best conf we can realistically submit to
ICLR is probably the next big deadline I guess
Also we should brainstorm what kind of questions we can ask if we do CFG in intermediate layers. It sounds cool and there should be some interesting collaboration here.
Would potentially be interested in collaborating, I think there's some interesting connections with interp and concept editing
It's out of scope for me
Maybe we could start a thread in #concept-editing or smth
It's cool if you want to pursue it. My point is that the paper is about porting CFG, and the initial CFG is not used for hidden layers. Those experiments deserve to be run but imho they're not in this paper's scope
Oh sure I think I agree this would need to be a different paper
Totes!
#1146877254153031930
Oh yeah I didn't mean to change anything for our current paper. For submission we just need to decide if we want to say more about negative prompts
We used to have some thoughts about a second paper, and we can slowly pick them up and brainstorm
hey just caught up on this, these new ideas sound interesting and cool
we had an alternative direction for paper #2 in the idea of fine-tuning, just want to keep that one alive, too!
but reg. paper #1
There's the ICLR submission deadline at the end of the month. You guys ok to submit?
ICLR deadline is 9/28, I just re-checked it, it's in Vienna too
yah
just messaging about that
I'm cool to submit there!! I think it's a good idea. Stella's typing though. what do you think, stella?
(I also have a half written message suggesting ICLR from earlier today but got distracted before finishing it)
alternatives in the NLP domain are NAACL (11/23 I think) and ACL (January-ish, hasn't been announced). otherwise, we can always wait for Neurips :/
but i vote ICLR. @blissful garden ?
We should know about ICLR in time to submit to ACL or ICML
is ICML better than ICLR or are they roughly equivalent? I'd probably rank NAACL lowest
I view ICML, ICLR, and NeurIPS as equivalent
I think I agree with that
great, let's go for ICLR then.
imo i don't think the paper needs much. maybe another round of grammar-editing, following @obtuse tiger 's point about clarity up there
Do you think there is an anonymity-preserving way to mention in the paper everything that has happened since Arxiv? i.e. that CFG is incorporated into Huggingface and llama.cpp? that's certainly a cool contribution
"Since its public release, CFG has been readily adopted by major LLM libraries including llama.cpp and transformers"
cool.
maybe we can also throw cool examples of generations using CFG that the community has generated into the appendix? too bad we never set up a tipline for community members to send us what they played around with...
If only someone kept track of everything and lurked in the communities using CFG
I don't see that hurting the paper's chances, but it's non-standard and I don't see it helping.
well i'm thinking of ways of saying "the community thought this was useful" ... showing it's been incorporated into major libraries, and including examples of grassroots adoption are ways of doing that?
I am not aware of adoption at a scale wherein it would significantly influence reviewers. I could be underestimating it, but off the top of my head papers that do that are things like VQGAN-CLIP (> 1 billion uses) or things like FSDP and trlX which are used in million-dollar model trainings
ngl it's not like it changed the world (yet?) the adoption is quite slow
It might help in case the experimental section was thin. But the experimental section of this paper is so vast and extensive that it's hard to believe that it will add any positive points. The only claim I can see for a rejection is lack of novelty.
Regarding ICLR, it's of course a great conference. The negative part is the open-review process, which is tough and might mean that the top result on Google ends up being an old version or a rejection with ugly reviews.
I actually like open review a lot better than the closed reviewing process in pure math journal submission. We have way more shitty reviews than one can imagine that are obviously biased and/or even personal. Very occasionally there are also questionable papers get accepted in top journals very fast. I wish people could have seen the whole process in every submissions. If a paper is objectively good, there is nothing to be afraid. If there are fair points that need to be improved, we will just improve them.
I'm not against submitting to ICLR, and as a researcher I of course think that the open-review process is positive. However, as an author this format requires much more effort (there might be a full discussion with the reviewers plus requests for a few draft versions), and the publicity makes you think ten times about every sentence. So I think we should submit, but it is something that should be considered
Have no strong opinions on submissions to conferences. On board with anything you choose
I'm worried I'm asking a dumb question and missing the obvious, so forgive me for commenting in your group research channel again. But I didn't understand this response and it's been bugging me. What was the reason you can't trade more than 2x compute time and possibly enable model capabilities or outputs you couldn't get just inferencing twice?
As a concrete example, with this transformers patch change you can use the negative prompt as a second positive prompt, and that seems like it is a useful tool. https://github.com/huggingface/transformers/pull/25339#issuecomment-1667814849 So at a minimum, wouldn't I then have to inference three times instead of two if I want to use that second positive guidance but also want to use negative guidance at the same time? Or is there some way of reducing or collapsing all the combinations back down to two steps?
Thanks for being so nice when I randomly barged in originally btw, I kind of missed the context of this channel being a semi-private group research spot in the excitement of the moment but everyone was exceptionally chill about it.
Yeah I guess you can take any linear combination of prompts. Not sure about exactly what comes out of it but people should feel free to explore. If there are 3 separate prompts, maybe indeed you will have to go through all of them at minimum.
I don't think there is any dispute that a linear combination will work, and there might be practical use cases. However, I think that from a research perspective it's not interesting, since by the properties of the sum (additivity, commutativity) you can split it into a sum of the positive part and the negative part. Since it is already known that a sum of different models' logits behaves as an ensemble method, and we know that the subtraction behaves as contrastive decoding, the expected result is clear. So this is why I think it will not be interesting from a research perspective (there is no novelty or new information that you can deduce from such experiments)
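To make the split concrete, here is a rough sketch of a weighted combination over several prompts, separated into its positive (ensemble-like) and negative (contrastive-like) parts. `combine_prompts`, the weights and the toy logits are all hypothetical; each prompt with a nonzero weight still costs its own forward pass.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

def combine_prompts(logits_per_prompt, weights):
    """Weighted sum of per-prompt log-probs, split by the sign of the weight."""
    log_probs = np.stack([log_softmax(np.asarray(l, dtype=float))
                          for l in logits_per_prompt])
    w = np.asarray(weights, dtype=float)[:, None]
    positive = np.sum(np.where(w > 0, w, 0.0) * log_probs, axis=0)  # ensemble part
    negative = np.sum(np.where(w < 0, w, 0.0) * log_probs, axis=0)  # contrastive part
    return log_softmax(positive + negative), positive, negative

# Toy logits for three prompts over a 4-token vocabulary (illustrative only).
prompt_a = [2.0, 0.5, 0.1, 0.0]  # first positive prompt
prompt_b = [1.5, 1.0, 0.2, 0.0]  # second positive prompt
prompt_n = [0.0, 0.0, 2.0, 0.0]  # negative prompt

combined, positive, negative = combine_prompts(
    [prompt_a, prompt_b, prompt_n], weights=[1.0, 0.5, -0.5]
)

# Plain CFG is the two-prompt special case with weights (gamma, 1 - gamma).
gamma = 1.5
cfg_like, _, _ = combine_prompts([prompt_a, prompt_n], weights=[gamma, 1.0 - gamma])
```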
Super helpful, thanks. Yeah I just wanted to make sure it was different, so there could be a reason you might actually want to do the extra work of 3x inferences, and there wasn't some underlying reason why it could always collapse down to just two. I lurk your research here a bit because you guys keep coming up with fascinating sampling and prompt concepts that are fun to even think about, what it means in a prompt or if it was 'working correctly'. Small code changes that open up tons of new prompt possibilities, and model outputs are *wildly* different. I'm not involved in research myself, it's just really fun to try your ideas and see what the heck comes out. (I barely tried neg guidance in audio yet, still mostly unexplored, and just noticed you are thinking about CFG gen 2 already.)
Is this work similar to this ACL paper: https://arxiv.org/abs/2307.03214 ?
We propose Prefix-Adaptive Decoding (PREADD), a flexible method for
controlled text generation. Unlike existing methods that use auxiliary expert
models to control for attributes, PREADD does not require an external model,
instead relying on linearly combining output logits from multiple prompts.
Specifically, PREADD contrasts the output logits ...
Taking linear combinations works.
However, what I was saying is, we found that CFG is like a 2x model, don't think that 2 prompts = 2x, and N prompts = Nx. There's no link.
Yeah, it seems that the math is exactly the same. They seem to focus on the toxicity and sentiment control with negative prompt which is a bit different in terms of narratives. And... phew... I'm glad we have a better timestamp in terms of arxiv post date
They don't cite us
They actually predate us, but the ACL anon policy means they couldn't release it until later
The ACL submission deadline was in 2022
In case this needed to be made explicit: it was a joke
well definitely another paper to add to the related works!
@everyone @unique sedge @fallow egret
Hello everyone we got results back from ICLR.
We're right below the margin of comfort for acceptance. If 1 or more reviewers increases their score by 1, we will be MUCH more comfortable with our chances.
We've identified 2 small experiments we think have a great chance of increasing our scores:
- show memory comparisons
- show NLG controlled generation comparison
I think @versed flax already addressed #1. Does anyone have any bandwidth to address #2? I will work closely with you to do this
w.r.t. #2, here is guidance for an experiment.
SOTA controlled NLG baselines:
Experiments:
- sentiment
- formality
I think there are classifiers for both, I think the experiment can be "is CFG output classified as formal, via formality classifier vs. is NADO output classified as formal, via the formality classifier"
i think it's going to be difficult to show that CFG beats SOTA controlled NLG, because SOTA NLG assumes the presence of a classifier, which is a benefit of CFG that we don't need one, so we can do NLG beyond just formality and sentiment. But as long as we show it's not too different in these areas, that would be a nice result and might cause R2 to raise their score
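One possible shape for that comparison, as a sketch only: score each method by the fraction of its generations that a classifier labels as having the target attribute. `attribute_rate` is a made-up helper, and the classifier and the example strings below are placeholders, not real model outputs or results.

```python
from typing import Callable, Sequence

def attribute_rate(texts: Sequence[str], has_attribute: Callable[[str], bool]) -> float:
    """Fraction of generations that the classifier labels as having the attribute."""
    if len(texts) == 0:
        raise ValueError("need at least one generation to score")
    hits = sum(1 for text in texts if has_attribute(text))
    return hits / len(texts)

# Placeholder classifier: a real run would plug in a trained formality
# (or sentiment) classifier here instead of a keyword check.
def toy_is_formal(text: str) -> bool:
    return not any(word in text.lower() for word in ("gonna", "wanna", "lol"))

# Placeholder strings standing in for generations from each method.
cfg_outputs = ["We would like to thank you for your inquiry.", "lol ok see you then"]
baseline_outputs = ["I am gonna be late.", "Kindly find the report attached."]

cfg_rate = attribute_rate(cfg_outputs, toy_is_formal)
baseline_rate = attribute_rate(baseline_outputs, toy_is_formal)
```

Using the same classifier and the same prompts for both methods keeps the two rates directly comparable.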
I'll add that:
- #1 will be addressed soon wrt to the memory question. It's a fair and important question. I ran the calculations necessary.
- #2 is imho the hardest to address. His questions are totally outside of my comfort zone, so that's the thing I personally won't be able to tackle correctly
- #3 gave us a 5 while being notably confused by the paper and thinking it was a training technique. Honglu and I think that if we fix his understanding and show him that it is indeed better than a training technique, we can get a better grade from him
that's doubtlessly true. but in terms of outlining what work we will do between now and 11/22, there's nothing to be done for R3 besides crafting a good argument
@unique sedge and @fallow egret if we can come together and address some of the actual work-items, then we raise our chances
p(score increase) = \sum_{reviewers} poisson(\lambda)
with a very, very low lambda
@loud adder @blissful garden any way to get access to some A100s to run some CFG runs to address #2?
@patent gull @versed flax did you guys have access to SAI cluster?
Sadly the A40 pods are taken away from EAI afaik. We have some 4090 I think
@tepid gazelle Do you know what compute resources does EAI have right now? 4090 pods?
We have 2080s on CW, and A100s on SAI cluster
@blissful garden / @patent gull I have some CoreWeave instances with my job now. Depending on the duration of the experiments I can run them
can't give you access tho
I have access to SAI cluster. If you guys have codes for small models I can scale it on SAI cluster. Jobs can get preempted but half a day is usually not a problem
Ok I can set up some experiments for you to run @versed flax. I just feel like 2080s are going to be annoying if we want to run any CFG on any models beyond just llama 7b or something
I have TPU v3 pods that I can share, but TPU is a different beast
I have access to 2080s too... I can set up some experiments with smaller models and then pass 'em off
Hi, sorry for the late response. @patent gull I have bandwidth to work and help with whatever is needed.
I read the reviews. I'm not sure how much this experiment will help (overall the experimental section is the strong part of the paper). It seems that the main concern (as expected) is the lack of novelty and contribution. I think we should think about a strategy for addressing this issue.
I think it will be important to address this issue and upload the rebuttal response as soon as possible so the reviewers have a chance to give feedback and develop a discussion, because it will not be easy to convince them about the contribution (actually, with this score we need to convince the AC).
I think there are two paths:
- Differentiate our work from previous works (I think it's possible, we discussed it a lot a few months ago).
- This is mainly for R@3, who thinks that the experiment section was insightful: I think we should focus on the experimental contribution (this is a valid contribution, and reviewers sometimes forget about the importance of a solid experimental paper).
R3 didn't read the paper, it's pretty clear. It shouldn't be hard to prove that the work is indeed novel (pointing out the fact that following the paper it was implemented in a lot of inference libs should be enough; if it were not novel, it would have already been there)
I don't think that integration in libraries is a valid claim for academic contribution. In the end there are indeed many previous work on decoding methods which seems to be equivalent to CFG (we know at least 3-4 works). The fact that they didn't release the code or bother to integrate it in big repos doesn't mean you have added value on top of their work.
I think that even if R3 didn't read the paper, it is not going to be easy to convince the AC
my experience has been that directly addressing as many reviewers concerns as possible is the best chance to increase the score
p( score increase) = \sum_{reviewers} p(reviewer score increase)
and in OpenReview we can respond to each reviewer individually
Yes @fallow egret , we should quickly craft a response to R3 and try to respond to all the intellectual points as soon as possible to encourage discussion. But that doesn't preclude us from also trying to run the experiments they ask for. In the end, it may not amount to anything
but if 1 reviewer increases their score by 1, then our paper has a much better chance
Yeah I totally agree with you in terms of not using lib integrations to back ourselves up. Mentioning these can easily backfire IMHO
if you or @unique sedge have bandwidth, it would be great to see if you can get NADO working for formality
NADO: https://arxiv.org/pdf/2205.14219.pdf
I already have FUDGE working for sentiment... would be pretty easy to complete all 2 x 2 after that, and then run CFG with formal prompts and sentiment-relevant prompts, and then evaluate
on GPT-2?
It seems also that there is a reproduction issues with their code:
https://github.com/MtSomeThree/constrDecoding/issues/4
thanks @fallow egret for checking this out!!!! GPT2 is what I was thinking, yeah
I can put you in touch with the primary authors, Sidi and Tau
or, I'll just reach out to them
I think we can just use the code and it's their issue if the results are not great
also wait, there's no issue running the code, just reproducing the results?
yeah
that's what I'm thinking, too
we just report the results (if anything, we can footnote this issue, or something)
sure, so I can take it. Let's sync on private message on the exact experiment (dataset, metric)
I can take care of answering R1 and the memory analysis he requested
- We kinda establish that a model with CFG consumes 2x the flop (2 forwards) but still follows the perf / flop plot. So you kinda can train a half model and infer with CFG.
- So the question is: is this tradeoff smart in inference as well, given that you use two cache lines with CFG, but 2x bigger models need to store more floats per token in cache (bc of the bigger hidden dim) and store 2x params?
- I do the maths and show that it depends on your VRAM / intended cache size. For small models, the weights are negligible in VRAM, you can have big caches, and the double cache for CFG is not worth the 2x reduction in params. However, for LLMs, especially the very big ones (> 30B), the weights take a massive amount of memory, and the 2x cache lines would only outgrow the param halving at very large amounts of VRAM
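A back-of-the-envelope sketch of the quantities in the bullets above, assuming standard multi-head attention (one key and one value vector of size d_model per layer per token) and fp16 storage. The 32-layer, 4096-wide, 7B-parameter shape and the 24 GB budget are illustrative assumptions, not the numbers behind the chart.

```python
def kv_bytes_per_token(n_layers: int, d_model: int, bytes_per_value: int = 2) -> int:
    # One key and one value vector of size d_model in every layer.
    return 2 * n_layers * d_model * bytes_per_value

def weight_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    return n_params * bytes_per_param

def max_cached_tokens(vram_bytes: int, weights: int, per_token: int) -> int:
    # Whatever VRAM is left after the weights goes to the kv cache.
    return max(0, (vram_bytes - weights) // per_token)

# Illustrative model shape (assumption).
per_token = kv_bytes_per_token(n_layers=32, d_model=4096)
weights = weight_bytes(7_000_000_000)
vram = 24 * 10**9

vanilla_tokens = max_cached_tokens(vram, weights, per_token)
# CFG keeps two cache lines (conditional and unconditional) per token.
cfg_tokens = max_cached_tokens(vram, weights, 2 * per_token)
```

The larger the weights are relative to the VRAM budget, the less room there is for cache at all, which is where halving the parameters starts to matter more than doubling the cache.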
I end with this chart
it reads like this: Say you have 10GB VRAM. For model sizes above the red line (up to 1B in this case), you should stick with vanilla models. The 2x cache line will outweigh the /2 param count. Then, below the red line (1B and above), prefer deploying CFG: your VRAM isn't big enough to store a big cache, and the /2 param count is better
Looks good
Hello, I'm back from getting married
ICLR reviews look decent. We're in the top 40% of papers by review score. Do we have a google doc for organizing our response yet? Or have we been doing it in this thread?
some notes are here
https://docs.google.com/document/d/1iDQaPl3BKmdOYLvvDrJwKZoZijeks4qJsxmkWCFmWVk/edit#heading=h.e4qpo5vysxsq
we might clean it up and use it to organize responses
Also some tl;dr and relevant messages in this thread
- R1: #1111624010581680179 message
- R2: some controlled NLG experiments we consider quickly doing: #1111624010581680179 message
- R3: we are confused by what the reviewer wants but we wrote some draft responses in the google doc
Hi Stella! congratulations! Hope you had an AMAZING wedding!
As you can see, we have started working on answers:
R3 is probably the easiest to convince since they barely understood the paper (my guess is that "uh, CFG, not novel!" then barely skimmed the paper + weak understanding of CFG anyway ("training technique"?!?!)). Maybe R3 should be answered at a high level since their criticisms aren't that deep. The main point is convincing them of novelty. I don't know how to prove it besides 1) "trust me bro", or 2) "our work is novel. The proof is that the arxiv release was followed by implementations in major LLM inference engines => it wasn't already there", but people seem to agree that this is a bad defense and can backfire. Especially bc it seems people didn't really get how to use it, especially the neg prompt
R2 is totally out of my scope. Alex seems to know how to tackle his points.
R1 is addressed with the aforementioned analysis
I think we should probably answer tomorrow
I think that this might make more sense with the axes switched? The VRAM seems like the more fundamental constraint to me, where you then maybe vary the params and move between the regions
Fair
I will need to triple check the maths, but the idea is here. Worst thing that can happen is that the slope changes a bit. Not much to worry.
hello @loud adder , congratulations!!!! i hope your wedding was amazing as well and wow, we weren't expecting to hear from you. don't you have a honeymoon or something?? I didn't know EAI was part of that hahaa
beautiful graph! alright i'll edit R3 and the response to R1.
We should also convert that to a table that we can copy/paste into the rebuttal. If I'm not mistaken, OpenReview doesn't let you upload images in your response, does it?
as long as you can put a link to imgur...
(I don't fully agree btw, I think parameter count is the dependent variable here. R1 asked for the effect of CFG on memory, so that implies that we study memory as a dependent var)
graph updated. Grey area represents a model too big to even fit on the amount of VRAM
ok, I just checked... no image uploads to OpenReview.
We can ask them to click a link, but shouldn't expect they will. We've all been through enough phishing videos.... Also, it's one more click
So we want numbers we can paste into the box as well for the quick headline, and then they can click if they want to see more
We can update the PDF tho. So we can put the figure in it.
yeah definitely. again, wouldn't expect the reviewer to check. I can't even get my advisor to read my updates...
- I said this having only skimmed the reviews based on what I would generally expect from a plot like this.
- The dependent variable is the y-axis...
- That said, the actual dependent variable here is the memory usage. I assume you either misspoke or got the words confused, but ultimately you're correct.
I'll read the review in question again and if your characterization of the request is correct I agree the original format likely makes sense.
new figure version reads like:
- y axis interpretation: if you have 10GB of VRAM, serve vanilla models up to 1B. Then, serve with CFG. 5B and above => can't fit.
- x axis reading: say you have a 1B. You need at least 2GB to serve it. Up to 10GB, serve with CFG. Then, you'd be better serving an actual 2B
whoops!! yes, I meant parameter count is the independent variable and vram is the dependent var.
@versed flax what is the green "vanilla" writing supposed to be aligned with?
it just shows the upper triangle. The wording ain't great as well.
It's a label for the region as a whole
i see... so the diagonal lines are lower bounds based on parameter count? and there's no upper bound because data tensors can take up VRAM?
if i'm just not understanding, but everyone else is, it's ok, we can move on
The grey shades region is the region where the model doesn't fit within the specified VRAM
You have a lower bound: model too big => can't fit => failure.
You have no upper bound => more VRAM means you can fit a bigger and bigger kv cache
Let me see if I can explain the plot (since I'm not actually sure I'm following 100%)
- The question is whether our claims about "matching larger models" remain true if we care about VRAM (w/ k-v caching) rather than # params
- The red line is the Pareto optimal frontier as you trade off # params vs VRAM
I'm confused about what the blue dots are though.
This is specifically a response to
R1: memory cost analysis is recommended. The proposed method requires a second run of the model, which may increase the memory cost (for example, the key-value cache).
- If I understand correctly what you mean, yes.
- Yes. Serving with CFG costs more kv cache but less params, and (kinda) gives you the performance of a model twice the size. So, below the red line, you should serve with CFG, above it, you should serve an actual 2x model (without CFG). If you want to maximize the amount of tokens you fit in your kv cache, that is.
Blue dots are just the actual param count / max kv cache size for the models in the paper (gpt2-*, pythia-*, llama-*). Since they have variation in arch and there's a little alignment to 64 at play in the hidden dim, they don't exactly fall onto the red line.
Yes, this is meant to answer that remark from R1.
Side note: looking over the paper again the misalignment between plots and where they're referenced in the text is very distracting
i have to stare at this some more. so, below the red line (vertically), you have some excess memory, but not that much, so you can afford to serve the same size model with CFG. Below the green line, that size model won't fit. Above the red line, you have so much extra memory that you should just serve a bigger model?
I don't think I fully understand, but I think the y-axis label could be improved "VRAM at Equality". Equality to what?
Yes
ok maybe the region in between red and green can be shaded light green for "go"?
The wording is terrible and GPT-4 copied it from my terrible csv. It's labeled "equality" bc on this line you can fit a kv cache of equal size whether you choose to serve with CFG or serve a 2x model
the region below the green line: "gray" is fine, but "red" for "stop" is also OK. Above the red line can be light blue. And in a legend, or in the caption, we can define what each of these colors means. The reality is that there are 3 separate regions here, not just two
yes, let me GPT4 this rn
nah doesn't burn enough CO2
lolll
so just looking at one vertical line:
at parameter count = 1B, we intersect with the green line at ~~1.1~~ 2 GB VRAM (green) and 10 GB (red, and blue dot)
does that mean that for a model with 1B parameters, vanilla costs us ~~1.1~~ 2 GB and CFG costs us 10? so 5x as much? that seems high to me
(2GB*, log scale)
I don't understand the wording. Let me try again.
at param count (X) = 1B, I see green line intersect VRAM (y) at the 2B y-tick, and red line at the 10b y-tick
oh duh lol my bad
meaning like a Figure will be at Page 5, but it will be referenced at page 2?... yeah we should do a better job at shuffling them around
I guess what I thought the reviewer was looking for is performance vs. VRAM for CFG vs. vanilla.
Just like Fig 11:
You have X amount of VRAM. You want to use it all and serve efficiently. So you'll store the model weights, and use the rest for a kv cache. You want that kv cache to fit as many tokens as possible.
So you have 3 options:
- Serve your model as is. Boo, lame, boring. so you fill your mem with params P + cache cost per token C * cache size S. This S is the only variable, and you want to maximize it.
- You're a chad and you want to DOUBLE THE PERFORMANCE! and you've read about this CFG paper. But now you're using 2C per token. so you use your VRAM as P + 2C * S.
- You wonder whether you shouldn't directly serve a 2x bigger model with 2P params and a slightly bigger cache cost C' (C prime), but C' < 2C. Your VRAM is used with 2P + C' * S
At some point, if you can fit a big S, most of your VRAM will store the cache, and you really want a smaller cache footprint. But if your model is big, the parameters will dominate in VRAM, you can't store a big S, and you'll want to reduce the parameter memory footprint. So what's the decision boundary? Red line, decided as S = P / (2C - C')
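To make the three options concrete, here is a minimal sketch of the arithmetic. Everything is in bytes; the function names and the example numbers (a 1B-parameter fp16 model, 100 kB of cache per token, 150 kB per token for the 2x model) are made up for illustration and are not taken from the paper.

```python
def max_cache_tokens(vram, param_bytes, cache_bytes_per_token):
    """Largest S (number of cached tokens) that fits after loading the weights."""
    free = vram - param_bytes
    return free // cache_bytes_per_token if free > 0 else 0


def break_even_tokens(P, C, C_prime):
    """S at which P + 2C*S == 2P + C'*S, i.e. S = P / (2C - C')."""
    if C_prime >= 2 * C:
        raise ValueError("expected C' < 2C; otherwise model + CFG always fits more tokens")
    return P / (2 * C - C_prime)


# Hypothetical numbers: P = 2 GB of weights, C = 100 kB/token, C' = 150 kB/token.
P, C, C_PRIME = 2_000_000_000, 100_000, 150_000

for vram in (6_000_000_000, 10_000_000_000, 20_000_000_000):
    with_cfg = max_cache_tokens(vram, P, 2 * C)        # option 2: P + 2C * S
    twice_as_big = max_cache_tokens(vram, 2 * P, C_PRIME)  # option 3: 2P + C' * S
    print(vram, with_cfg, twice_as_big)

print(break_even_tokens(P, C, C_PRIME))
```

With these toy numbers, the small model with CFG fits more tokens at 6 GB, the two options tie at 10 GB, and the 2x model fits more tokens at 20 GB.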
is it clearer @patent gull ?
can I go to sleep? .___.
lolll i'm still parsing
4am
you can go to sleep lol but what do you think of my prev post?
about replicating Fig 11? that's what I thought the reviewer was asking for
R1 explicitly mentions KV cache. It's an inference question. I'm not sure I can see another way of interpreting the question
But if you have one, please explain
for filling the same KV-cache budget, what is your accuracy with CFG vs. a bigger model?
parallel to Fig 11. For the same FLOPs budget, we show accuracies on vanilla vs. CFG
so, like, you want to store 2k tokens in your kv cache, what's your best strategy?
ultimately, the user doesn't care about "what's the biggest model I can fit"... the user cares about "what's the maximal accuracy I can get with a fixed budget"
yah.... I have VRAM X, does P + 2C * S give me better accuracy, or does 2P + C' * S?
I'm assuming P + 2C * S will, because that means a slightly bigger model. maybe i'm contradicting myself earlier when I said VRAM was dependent variable
well then it depends on how big you want your kv cache to be, I guess
lemme think
like, you have 30GB. If you only care about perf, then that's a no brainer, use a 15B+CFG (fp16, so 15B => 30GB). You'll match the perf of a 30B without needing the actual 60GB. But you'll have a cache size = 0. Dumb dumb.
ummm actually I thought this was dataset dependent.. i thought for each dataset, there's a max-size datapoint, so we scale KV cache to that, and then we can maximize model size
but honestly, i'm very green to this kind of engineering work so I dumb
nah, kv cache is model dependent. It's your context_len (model dependent) * num_cache_lines (how many sequences do you want to cache when serving)
ah right. ok... if someone who is smarter than me can look at that graph and make a meaningful decision about which model to choose, then i will believe you haha, i just can't summarize it myself.
dear fuckin Yann LeCun I'm realizing how much I actually learned about LLMs since I switched job
there's not a unique answer to "given my amount of VRAM, what model do I choose?" because you have to trade off the amount of VRAM you dedicate to the params and the VRAM for your kv cache.
That's the same in training, which you may be more familiar with
You can't answer "what model size do I train for my amount of VRAM?" because it also depends on the tradeoff you're willing to do on your batch size
but for inference especially, can't we assume num_cache_lines = K (some constant, preferably for simplicity's sake, K=1)?
then, since your KV cache is upper bounded by the model's sequence length, can't you make a decision:
- model m + CFG
- model m'
based off of accuracy and the maximal amount of parameters that will fit in the cache?
if you run your own chatbot for yourself, then, yeah, ok num_cache_line=1 is fair (for now, but in a near future you'll want to run N concurrent instances because your LLMs will run different programs, so you'll want N cache lines etc)
but if you run a big data center with millions of users like OpenAI, you absolutely can't decide num_cache_line=1, that's basically dedicating 1 GPU per person, that's insane
ok wait i did parse this finally
I'm sorry my English is just complete trash. I just shouldn't be allowed to speak.
no lol you're good, it's really not your fault, that was a very clear answer
but S is upper-bounded by the model's sequence length, right?
it's not just \in {0, \infty}, right?
in a non hypothetical scenario, S is a multiple of your ctx len
S = ctx_len * num_concurrent_cache_lines
in a cloud setting for instance, each user gets ctx_len cached token. So you'll allocate one cache line for user Alex, another cache line for user Stella, another one for user Honglu and so on
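A rough sketch of how S and the cache footprint scale with the number of cache lines. It assumes plain multi-head attention (one key and one value vector of size `hidden_size` per layer per token) and fp16 values; the config numbers are hypothetical, not any specific model's.

```python
def kv_cache_bytes(ctx_len, num_cache_lines, num_layers, hidden_size, bytes_per_value=2):
    """Total KV-cache memory for `num_cache_lines` concurrent sequences."""
    tokens = ctx_len * num_cache_lines  # S = ctx_len * num_concurrent_cache_lines
    # Per-token cost C: keys + values (factor 2) for every layer.
    per_token = 2 * num_layers * hidden_size * bytes_per_value
    return tokens * per_token


# Hypothetical config: 24 layers, hidden size 2048, 2048-token context.
for lines in (1, 8, 64):
    print(lines, kv_cache_bytes(2048, lines, 24, 2048) / 1e9, "GB")
```

One cache line is small next to the weights, but the footprint grows linearly with the number of users you cache for, which is why the answer depends on the serving scenario.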
there's gotta be a way we can make a better argument than "it depends"
there's really not
you want to serve millions of user with 1 GPU? Serve pythia-14M.
you want to serve 1 user = 1 GPU? Serve a big model
lol. yeah but this is research... we don't have to consider 1 million users
You want to go bankrupt? Serve 1 user = 8 GPUs.
hard disagree
scaling is all the rage
ok... 1 user, fixed VRAM. which model do i choose?
easy. the biggest that fits, with CFG
because it'll give you the performance of a model that should be twice as big
but I have 2x the cache, so i should be able to serve a model MORE than twice as big with the same memory constraint, right?
and since your kv cache size will be super negligible bc you just want 1 cache line, you don't have to worry about 2C being greater than C', because (2C - C') * S <<< P, since S is so small (assuming you don't have one of those crazy models with 100k ctx len ofc lol)
sigh. ok i'm naive and i don't typically have my head in this space, but i'm gonna say something super high-level and dumb: I feel like there's a way we can fix certain variables and make a better argument about "here's the model we choose to maximize accuracy".
But if charts like these are actually super typical and we can reasonably expect the reviewer to interpret it correctly, then great.... @blissful garden any thoughts?
basically, all i'm saying is that we have to plan for the reviewer having the attention span of a goldfish, and if we can't convince them in that timespan, we're not getting a score boost
a sentence like "for fixed VRAM, CFG delivers 130% the performance" checks that box for me
as a goldfish myself
I can speak for other goldfish
Ok my argument to R1 is "You're raising a good point, we did the maths, and there's a tradeoff. In certain scenarios where you want to serve big models you'd better run inference with CFG than run a 2x model"
can you be explicit about what those "big model" scenarios are? >1B parameters?
depends on your vram lol
and "you'd better run inference with CFG" because why, higher accuracy?
ok fix a VRAM
we can add: "as an example, if you have 10GB of VRAM, models up to 1B should be served as is, but 1B to 5B models should be served with CFG"
ok - "1B to 5B models should be served with CFG because, e.g., for a 2B model + CFG you get better performance than a 4B model, which takes up the same VRAM"?
i think we're getting there imo
your example is good
and perfect, something textual we can put in the reviewer response
"1B to 5B models should be served with CFG because, e.g.) for a 2B model + CFG you can fit a bigger kv cache than a 4B model, for comparable performance"
fixed
plz I rly need to sleep, I have 5h of sleep remaining
ok sure
we good?
go to sleep
we can talk more tomorrow. i don't understand why "bigger kv cache" is the dependent variable here
enough CO2 burnt
"1B to 5B models should be served with CFG because, e.g.) for a 2B model + CFG you can fit a bigger kv cache than a 4B model, for comparable performance"
->
"1B to 5B models should be served with CFG because, e.g.) for a 2B model + CFG, with the same KV cache size you can get X% more performance"
?
what do you call performance?
same thing I thought you were calling performance: accuracy on the benchmarks
same as figure 11
ummm you can go to sleep we can talk tmrw
ok cool
What? Why does each user provide fixed amount of tokens for the models and why is there this weird 'num of cached lines'? When serving the model isn't there a distributed messaging queue and async workers grab dynamically sized inputs and do batching before assigning it to models?
also, since we can revise the paper, I wonder if we should add a super short subsection or subsubsection explaining the challenges of applying CFG in language domain and why it doesn't work verbatim. We could address R3 blah blah blah, and look, here is a new short paragraph explaining that we are not applying existing technique trivially.
Why does each user provide fixed amount of tokens
Don't overthink it. They were just an illustration of concurrent runs.
<rest of the message>
That's how the queueing system works, not how the cache itself works. The cache is the big tensor of size (num_cache_lines, 2, num_layers, num_heads, hidden_dim). (The tensor might or might not be explicit in the code, but in the end, this is how the VRAM will be allocated for the cache.)
Yes, we should 100% do this
Are we trying to make ourselves look like geniuses because we apply CFG on the model output instead of the model output, but our model output is logits rather than regression?
Updating on FUDGE:
I implemented the method with a shallow sentiment classifier (65M parameters):
https://huggingface.co/docs/transformers/tasks/sequence_classification
The problem is that the running time is extremely slow (since you need to run the classifier on 200 candidates for each token). With max tokens 20 (which is not enough), it takes 67 sec per sample.
Which means that running it on ~500 samples will take ~9h (and the full dataset, which contains 25k samples, would take ~465h).
We also need to run multiple experiments (there is a guidance hyperparameter).
Any ideas?
cc @patent gull
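A quick back-of-the-envelope for the runtime, as a sketch. The 67 s/sample and top-200 figures come from the message above; the helper names are made up.

```python
def estimated_hours(num_samples, seconds_per_sample, num_guidance_values=1):
    """Wall-clock hours for a sequential run over all samples and guidance values."""
    return num_samples * seconds_per_sample * num_guidance_values / 3600


def classifier_inputs_per_step(batch_size, top_k=200):
    """Sequences the classifier must score for ONE generation step."""
    return batch_size * top_k


print(round(estimated_hours(500, 67), 1))       # ~500-sample subset
print(round(estimated_hours(25_000, 67), 1))    # full 25k-sample dataset
print(round(estimated_hours(500, 67, num_guidance_values=4), 1))
print(classifier_inputs_per_step(batch_size=8))
```

At 67 s per sample the 500-sample subset is roughly 9 hours, and the full 25k samples come to roughly 465 hours for a single guidance value, before any sweep over guidance strengths.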
do you have the script somewhere? I can see if I can scale it in the SAI cluster
Very ugly, but it should be correct
what's the command to run this script on the full dataset for a particular cfg? I can spin up that many nodes and for each maybe shard the model to 8 GPUs so that it's faster
oh so the model is fixed to gpt2-medium? Or should the --model-name argument be used somewhere
We can change it, but we decided to do this experiment with GPT2-medium
The current run is with the default 1 guidance
Let me add it as a parameter and clean up the code a little
yeah sounds good. It runs well. So I remove the [:300] for the full run right?
yeah go ahead. I will think about how to scale this bad boy.
@patent gull Please verify that we are fine with this experiment (models/dataset)
Wow 1/24936 [01:50<762:33:17, 110.09s/it] lol
lol, yes. The algorithm is a disaster from a computation perspective. I don't understand why it's even considered as a valid option
For every generated token you need to run the classifier on 200X number of samples in the batch
does it generate in batches or just 1 token at a time?
1 token a time. But I'm not sure it will help since in any case the bottleneck is running the classifier
classifier can also run in batches I guess?
The classifier is running in batch
how hard is it to vectorize everything with a large batch size?
I see the vram isn't fully used
That's the point: it's already a big batch of 200, and if you increase the batch size to N then the classifier has to run on a batch of 200*N (which means that in practice you won't be able to run a big batch)
Should not be a big deal, I can do it
Ok, so I cleaned up the code and pulled the guidance scale out as a parameter
oh so the classifier does it for the top 200 tokens for each generation step, is that right? Sorry I only start to understand it right now
Yes. What they are doing is simply Classifier guidance.
The problem is that in order to do it you need to run the classification on every possible next token. In the paper they 'compromise' on taking the top 200
lol this is crazy
WELL I MEAN
If that is the current way of doing things, I say we already have a GODDAM strong argument for CFG, even if our scores are lower
if we do N generations with 200*N for the classifier, it could fill up the vram but not sure how much faster it gets. Also generating 25k samples is probably more than necessary. Maybe we should just do first 100-300 samples and change a handful of bigger models with a couple different cfg......
I agree, this is exactly what I told @patent gull
Yes, this is why I chose 300, I think it's legit
Let me take a look. Just waking up now
I'm pretty sure we will get better results, since the classifier is crappy (if we take a stronger one, the running time will be infinite, and in any case in their paper they suggest using a weak classifier)
Apologies for the delay
that looks great to me. what's the dataset?
also i think to compare apples-to-apples, we might want to make sure both variations see the same input. And CFG is probably going to see an input like "Write a happy response" or something
So i would prepend every example in the dataset with "I'm feeling happy today. <input sentence>"
or something. @versed flax @blissful garden any ideas for a good prompt that captures sentiment for a non instruction-tuned model? I know you played around a bit with this, @versed flax
for non-instruction-tuned models, story completion is usually the way to go because a lot of pretraining data has those stuff from books, blogs or whatever
Following the previous night I am absolutely exhausted and today was mostly dedicated to surviving. I will be unable to do good work and must delay my answer to R1 to tmrw.
"Today, Parisian celebrated"
The dataset is imdb. For each review sample removing the last 64 words.
The idea is to follow:
https://github.com/vicgalle/zero-shot-reward-models/
And use their classifier (with Flan-T5) for evaluation
ZYN: Zero-Shot Reward Models with Yes-No Questions
no problem haha
ok if it's movie reviews, then I would prepend the phrase "I enjoyed this movie. <prompt>..."
Do you want to do it also on the FUDGE experiment?!
that's my thinking, yeah, otherwise p(xi | x<i) is different across CG vs CFG .... how do we know that CFG worked, compared to just adding that prompt changed the sentiment anyway?
Yes, I see
Ok, so we should decide on the prompt before collecting the FUDGE results
yeah.. when i get to the office, i can try out some different prompts with CFG and see what seems to be working
@blissful garden anything else is needed from my side?
It's good so far. I will play with it tonight
Ok, thanks!
Aw man we missed an opportunity. Maybe we would've gotten higher scores if we named our paper "All you need is CFG for LLMs with applications in ChatGPT based on Diffusion"
@fallow egret where did you find that model? is it a recommended one for sentiment analysis? the model card says it was trained on an "unknown dataset"
we can probably swap in a better one. Is there a standard one for sentiment analysis? I don't know much about this field.
i don't know, either. i see that it is the example one used in the HF tutorial on sentiment, but it looks out of date, since in the tutorial, that model returns "POSITIVE" and "NEGATIVE" labels https://huggingface.co/docs/transformers/tasks/sequence_classification#inference
it's trained on IMDB, though, so ideally it is in-domain
Yes, it was a 'tutorial' model. There are of course much better models, but the problem is that using stronger model will significantly increase the computation
Also in the paper they emphasize that the classifier should be shallow compared to the base model
i see, ok SGTM, then
Yeah, it scored highly on IMDB, which is the dataset we're using
but just to be clear on the experiment:
at first I was thinking that we were going to use the same classifier that we use in CG to evaluate the outputs of both CFG, and CG?
or do you think we should use a different classifier for evaluation?
Yes, I think it should be different than the CG model (stronger model)
ok. i see the arguments for and against. If CG does badly with a different classifier, someone could just argue "well, you chose a purposely bad classifier"
But this is one of FUDGE's limitations... You can't use a strong model as the classifier
ok cool, makes sense
also on the experimental design, I see that we are using CG just to make things "positive"?
Yes, this is why using the same classifier is completely unfair (the objective is to make it 'positive' according to this classifier)
final_res.append(t['score'] if (t['label'] == 'LABEL_1') else (1 - t['score']))
i'm thinking that a more interesting objective would be to try to flip the label?
E.g. if the y_true is POSITIVE, then try to get y_pred to be NEGATIVE, and vice versa
because if you're taking the first 64 tokens as prompt, for all prompts that are already positive in those first 64 tokens, there's not much to be done, is there? and then we wouldn't really be differentiating between the two approaches, because they'd both look good
Yes, if the review is positive in the beginning it's not interesting, but most of the 'trimmed' reviews are neutral
oh ok cool, good to know!! thanks
ok, then, i agree with your experiment. maybe we can even measure the \delta from prompt -> prompt + completion
i.e. p(POSITIVE | prompt + completion) - p(POSITIVE | prompt)
where p is the stronger classifier
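A minimal sketch of that metric. `p_positive` stands for whatever stronger evaluation classifier ends up being used (any callable from text to a probability); the toy scorer below is only a stand-in so the snippet runs on its own, not a real sentiment model.

```python
from statistics import mean


def sentiment_delta(p_positive, prompt, completion):
    """p(POSITIVE | prompt + completion) - p(POSITIVE | prompt)."""
    return p_positive(prompt + completion) - p_positive(prompt)


def mean_sentiment_delta(p_positive, pairs):
    """Average delta over (prompt, completion) pairs."""
    return mean(sentiment_delta(p_positive, p, c) for p, c in pairs)


def toy_p_positive(text):
    """Stand-in scorer: share of words found in a tiny positive-word list."""
    positive = {"great", "good", "loved"}
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w.strip(".,!") in positive for w in words) / len(words)


pairs = [
    ("The movie was", " great and good."),
    ("I watched it and", " left early."),
]
print(mean_sentiment_delta(toy_p_positive, pairs))
```

Measuring the change relative to the prompt alone factors out prompts that are already positive before any completion is generated.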
Yes, it's very good idea.
ok i'll try to find a stronger classification model and will come up with a few prompts. helps to have CFG in huggingface now, thanks to @versed flax
I think that the prompted Flan-T5 is a valid classifier (and it has ~x4 parameters comparing to the shallow model)
here's another one (at least they report their validation accuracy lol): https://huggingface.co/hipnologo/gpt2-imdb-finetune
ok cool i'll check that one out, too
but it's not fine-tuned on a sentiment dataset?
Yes, it's not. I think it might have an advantage. But for sure I see also the disadvantages.
So not sure what is the best option...
i did a small run with negative prompting and GPT2-medium
I found that the following negative prompt gave us the biggest increase in CFG:
"A bad movie review starts like this"
```
A bad movie review starts like this.      3   0.019473
                                          4   0.014441
                                          5   0.031700
Bad review here.                          3  -0.008154
                                          4   0.000345
                                          5   0.000045
Bad.                                      3   0.006696
                                          4   0.020785
                                          5  -0.003599
This is terrible.                         3   0.004726
                                          4   0.007281
                                          5   0.014920
Thus starts a terrible movie review.      3   0.004966
                                          4   0.020504
                                          5   0.008596
To write something terrible, write this.  3  -0.010605
                                          4  -0.000713
                                          5  -0.006353
```
but these numbers aren't huge, honestly. \delta is classifier(CFG output ) - classifier(vanilla output).
So +.03 means that CFG with that negative prompt boosted the sentiment score by ~0.03, i.e. about 3 percentage points.
I can try with a positive prompt, too
these are over the first 200 examples in IMDB
I had a migraine and stopped working yesterday, but please remind me to take a look at our draft response Tuesday (today) afternoon.
will do!! I hope you feel OK
status is:
R3: I looked it over/edited, I feel like we're OK to respond ASAP on that, whenever you get the chance to look.
R1: I think @versed flax did the necessary experiments, we need to craft a response.
R2: maybe today/tomorrow we'll be done with the experiments and have the response ready
positive prompting was a lot harder to achieve. In fact, CFG with most positive prompts, in most settings, negatively affected sentiment
pos_prompt                             guidance_strength
A good movie review starts like this.  0.10  -0.104267
                                       0.25  -0.066911
                                       0.50  -0.029078
                                       0.75  -0.032493
Great review here.                     0.10  -0.122568
                                       0.25  -0.111521
                                       0.50  -0.064845
                                       0.75  -0.046463
Great.                                 0.10  -0.098969
                                       0.25  -0.078565
                                       0.50  -0.049975
                                       0.75  -0.041289
This is great.                         0.10  -0.066222
                                       0.25  -0.029284
                                       0.50  -0.058933
                                       0.75  -0.030547
Thus starts a great movie review.      0.10   0.074312
                                       0.25   0.013816
                                       0.50  -0.035235
                                       0.75  -0.030811
To write something great, write this.  0.10  -0.141006
                                       0.25  -0.094684
                                       0.50  -0.066705
                                       0.75  -0.016745
I think we should try both positive and negative settings, though, for the experiment. We can prepend "Thus starts a great movie review." and "A bad movie review starts like this." for CG.
If anyone with access to a long-running compute cluster with some decent memory can run my script, that would be very appreciated!! here is my script:
Try:
Positive: A bad movie review:
Negative: Movie review:
changed some code to shard the data for 8 GPUs and am taking this script for a spin in the cluster. There should be some files coming out when you wake up.
@patent gull I see you called model.generate(.... I got warnings that the max length is defaulted to 20 and the prompt is longer. Does this need to change or it is ok with the current setup?
ahh errored out... One sample got 800+ tokens and crashed the max length of that distilbert classification model lol
The full script is taking too long but I will just leave it running.
I will queue 2 more jobs specifically for first 5000 data points, one for negative with cfg 3, 4, 5, and one for positive with cfg 1, 1.25, 1.5, 1.75. When each cfg finished a csv will be saved (we can combine later).
Let's see how many files we get tomorrow when I wake up (or error out). Heading to bed.
what are the blue dots, again?
models. Red is regression line.
why do we care about the y=2x line, again?
minimum amount of ram needed to load the model, fp16 assumed (2 bytes / param)
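The y=2x line is just "2 bytes per parameter". As a tiny sketch (the helper name is made up):

```python
def min_weight_vram_gb(num_params, bytes_per_param=2):
    """GB needed just to hold the weights (fp16 by default), with no KV cache."""
    return num_params * bytes_per_param / 1e9


for n in (1e9, 7e9, 15e9):
    print(n, min_weight_vram_gb(n))
```

Anything a GPU has beyond this number is what is left over for the KV cache.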
so the blue dots are the minimum VRAM that the model needs + the KV cache for the maximum sequence that the model takes?
what is the right way to address the second part of their review:
There is no guarantee that the Eq.6 will obtain a legal probability with the probabilities of all possibilities summing up to 1.
they're right โ and we're not doing special normalization in the LogitWarper. Does HF do normalization under the hood in the .generate() function? I think it must, if the user is doing top-p and top-k sampling as well
There's always a softmax before the actual sampling
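A small sketch of why the final softmax settles the reviewer's concern. It assumes the usual CFG combination in log space, `uncond + gamma * (cond - uncond)`; whether that matches Eq. 6 term for term should be checked against the paper. The logits are made-up numbers and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cfg_log_scores(cond_logits, uncond_logits, gamma):
    """CFG-combined scores; for gamma != 1 these are generally NOT normalized."""
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    return [u + gamma * (c - u) for c, u in zip(cond, uncond)]


def cfg_probs(cond_logits, uncond_logits, gamma):
    """Renormalize the combined scores so they form a valid distribution."""
    return [math.exp(x) for x in log_softmax(cfg_log_scores(cond_logits, uncond_logits, gamma))]


cond, uncond = [2.0, 0.5, -1.0], [1.0, 1.0, 0.0]
raw = cfg_log_scores(cond, uncond, gamma=1.5)
print(sum(math.exp(s) for s in raw))           # not 1 in general
print(sum(cfg_probs(cond, uncond, gamma=1.5)))  # 1 up to floating-point error
```

The combined scores act as unnormalized logits, and the softmax applied before sampling turns them into a proper distribution.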
right duh
for R2:
Compared with text-to-image generation, the optimal \gamma value in the language modelling seems to be small (<2), while large \gamma value leads to poor performance. Have any observations on it?
Maybe we can say that CFG is applied in autoregressive sampling at every step, so \gamma actually needs to be smaller, as it has a repeated impact
I would say it's bc of two things:
- in img generation the range is -1;1, it may be smaller with logits
- in img generation the values are independent but here there's a softmax and changing the max value dramatically alters the whole distribution
pixel range is -1;1
It may also be: 3. The conditional and unconditional outputs are more different in text than image
I think it's more the nature of diffusion models: after a very small number of iterations, the differences between the conditional probability and the unconditional probability should be negligible
This is a great explanation as well
We could see something similar with our paper as well as we sample more and more tokens
The continuation will be impacted less and less by the CFG'd tokens of the initial prompt
is this plot clearer? I changed the text and now I think it's much better
I'm getting caught up on the rebuttal google docs now
ok I will condense these explanations in the google doc
Friends, I am not a native English speaker, therefore, I will not post the answers to the rebuttals before Alex / Stella proof reads them. Please, when you think an answer is good enough, post it. Let's not wait another round. It's been 5 days already. We're 50% in.
@versed flax in the response to R1, it says
We have completed a memory analysis and will include our results in the paper. In general, we found a tradeoff between serving larger models, and serving a smaller model with CFG.
Is the updated paper going to be posted simultaneously with the reply? Or is that a to-do?
That's a TO-DO as of now. Can be done fast for the memory thing. I'm not quite sure about the controlled NLG. Experiments are still running.
I'm tweaking the reply to reviewer 1 a little and otherwise think it's good
awesome!
I would add the memory experiments and the formatting fixes that R1 recommends now
We can tell the other reviewer that theirs is running and that we'll update when it's done
to the paper? ok
Does the page size constraint still apply?
Usually we get an extra page, it should say on the call for papers page
good. I will check that then.
And maybe I will stop depending on you and Alex for the English and use ChatGPT instead lol
I want to add this to the end of the discussion of the results
At a high level, this means that it depends on your use-case. For researchers or small scale deployments where people are using the largest model that they can fit on their GPU, it's better to use CFG. However for very large scale commercial deployments, it makes more sense to increase the size of the model. We further note that increasing the size of the model is not always possible: OpenAI probably doesn't have a version of GPT-4 that's twice as big sitting around.
I love it!
crystal clear and wraps it up perfectly
(although they probably do since GPT-4 turbo is prolly a distilled version of GPT-4)
"discreetness"?
No, the fact that GPUs come in fixed sizes: 16, 24, 40, 48, 80
Yeah
Models do too... though generally are spaced to double in size (6.7B -> 13B -> 20B -> 40B)
Depends on the family? I remember Chinchilla models not doubling every time but I may be mistaken here
do I post it now?
Your guess is better than mine. I have no prior experience with reviewers.
I changed the review to say that the results were added to the paper and that we made the formatting changes they recommend. So make those changes and then it's good to go IMO
(A general principle at play here is that you should show that you've done what they want instead of promising that you will whenever possible)
Then I'll try pulling that off tonight and posting the PDF and the answer at the same time
So right now we are inconsistent in our replies to R2 and R3
We tell R3 that CFG for LMs is new
But acknowledge with R2 that it's not
Which position are we taking? We cannot take both
As far as we are aware, the application of CFG to autoregressive language models is novel, as previously it had only been applied to non-autoregressive diffusion models in computer vision
that reads "new" to me
Oh sorry. R2 tells us it's not
Misread that
I'm about to give a talk and have to run, but I can do a final pass before the submission this evening (in 4-ish hours)
18min left. Praying that nothing breaks when it comes out.
yes
Ttyl
@fallow egret @patent gull the resulting files of the 1000 samples of fudge. Any good?
What guidance values did you use?
yeah it's the 1
cool, so this is what they used in the paper
I have a for loop in my bash script so it's done 1 only. Tomorrow maybe 1.25
For a fair comparison we should run it with a few guidance scales.
But I'm not sure it's worth wasting too much time on it. In any case it's simply not a valid method
I have 1, 1.25, 1.5 and 1.75 in my script
we can tell R2 that it's running just like what Stella said. If we get 1.25 we can give them a teaser. But no need to wait for it to finish
yeah looks like that crazy 200 distilbert thing is a massive bottleneck. CFG barely made the whole thing slower
Yes, I agree. In any case the main point is to stress that theoretically the alternative is using CG (as in diffusion models); in LLMs it is also known as FUDGE. However, the problem is that in the context of LLMs you need to run the classifier on every combination of (state, next_token), which makes it impractical.
In the FUDGE paper, they resolve this issue by taking the top 200 tokens and using a shallow classifier. From our experience, even when using a relatively shallow network (65M parameters), the running time is still more than an order of magnitude longer than with CFG, which makes this method impractical for many real-world use cases
can we just grab those 1 results and try CFG alone, possibly with neg prompts, and argue that CFG produces similarly controlled results?
they control the sentiment right?
if we do we get one quick chart to show and also makes our method stronger
Yes, I think it's fine. In addition to the last comment that I wrote, we can add this small experiment, which also demonstrates that you are not getting better results with FUDGE. But the main point is to emphasize the usability of CFG for real-world use cases
hey just catching up. I will take a look at these results now
sorry, what is being pickled in these files? I just see lists of strings
can someone forward Elad's script, again?
@patent gull this one plus some minor thing to split up for 8 GPUs and save to separate files
ah ok, so will just compare to the vanilla GPT generations
maybe try the same stuff but with CFG generation and some negative prompt, and get the sentiment score. As Elad said we may not outperform but if we are not lagging too much behind it might be worth mentioning
need to go to bed otherwise I can try it very quick.
no problem
yeah i'm just wondering... I remember some old work about creating the ideal prompt, given a classifier, I'm trying to find it
i don't think it'll be directly useful in our case, but.. hmm
FYI:
There will be a strict upper limit of 9 pages for the main text of the submission, with unlimited additional pages for citations. This page limit applies to both the initial and final camera ready version.
So I think I will add the memory analysis in the appendix
It's secondary to the contribution I would say
Yes, I agree, I think it belongs next to the FLOPs analysis, and can be mentioned in the main body but explored more deeply in the appendix
just like the FLOPs
Totes
btw i'm reading through what you wrote to R1. very nice. I finally understand it hahaha
Haha I'm glad
also the plot looks so much better
Yes the text finally makes sense
what happened to the blue dots?
obliterated. Not needed.
one question โ are we implicitly assuming that a model twice as large is as accurate as a model with CFG?
no, i know that haha, but I think we should reiterate that in the response
let me work that in
oh ok
For the chart, I have the following comments:
- the "CFG" annotation can be more central (50%/50% of the plot), instead of off to the side
- Can we change "CFG" -> "CFG wins"
- "Vanilla" -> "Vanilla wins"
if you'd like to send me the code, I can play with the chart myself, whatever's easier.
I can fix that in an instant
ok great!
no problem/rush at all
i see you just copy/pasted: there's a typo "models bigger than 5G" -> "models bigger than 5B"
lol. we're not comparing cell phone service plans, here
B=G tho (but yes, you're right)
yeah but let's be consistent
Hello sorry for being awol. Had to go for my thesis submission and defense schedule to college, been busy in that. sorry for not being able to help.
still running sentiment controlled NLG.. I wonder if we want to add a second controlled NLG attribute
formality is one that others have used, and there's a nice model here that does well in assessing formality: https://huggingface.co/s-nlp/roberta-base-formality-ranker
specifically, R2 asked us to compare this to controlled NLG sota methods
There is a simple recent work by DeepMind in which they provide a few examples of tuples <prompt, prompt_score on the data> and a meta-prompt that asks the model to propose an alternative prompt that will give the best score. You can iterate (by adding the result of the new suggestion).
It seems to work very nicely, and we can easily apply it to our use case to find the best prompt for the CFG
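The iterate-and-rescore loop described here can be sketched roughly as follows. This is a generic outline, not the DeepMind method verbatim: `score` (e.g. mean sentiment delta of CFG with that prompt) and `propose` (an LLM call fed the best prompts so far via a meta-prompt) are hypothetical callables, and the toy versions below exist only so the snippet runs.

```python
def optimize_prompt(score, propose, seed_prompts, rounds=3, keep=5):
    """Repeatedly ask `propose` for a new prompt given the top-scoring history."""
    history = [(p, score(p)) for p in seed_prompts]
    for _ in range(rounds):
        history.sort(key=lambda item: item[1], reverse=True)
        candidate = propose(history[:keep])  # would be an LLM call in practice
        history.append((candidate, score(candidate)))
    return max(history, key=lambda item: item[1])


# Toy stand-ins: longer prompts score higher; the proposer extends the current best.
toy_score = len
toy_propose = lambda top: top[0][0] + "!"

print(optimize_prompt(toy_score, toy_propose, ["a", "bb"]))
```

In practice each call to `score` is a full CFG generation run over the evaluation subset, so the number of rounds is bounded by compute rather than by the loop itself.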
to GPT4, or something?
so the model basically infers based on what was working, what will work?
Yes, exactly
I'm now working on a few improvements to this method (like providing a few failure cases for each run). But it's still a work in progress, and their basic idea is nice if you provide good context about the task in the meta-prompt
@fallow egret @patent gull are we good to post?
I think it's important to add, for each reviewer, one sentence at the beginning that stresses the main positive things they found in our paper (something like 'we are glad you found our...'). Again, it's important for the AC decision
We already posted responses to R1 and R3.
For R2, we planned an experiment comparing CFG to a controlled NLG baseline, where we're controlling for sentiment.
I just got some good results from CFG. I'm comparing to SOTA baseline now.
I do wonder, though, if sentiment is enough. Ideally, we compare several different controlled NLG objectives. What do you think, @loud adder ?
Sentiment may be enough for an initial response to R2, but ideally if we're updating the paper, I'd feel better including more experiments on more controlled factors
lol, it was really uploaded with an emoji
We hope this clarifies the points raised in your review. If you would please consider raising your score, we would really, really appreciate it!!
Ok, I think that, at least once we see the end of the review period coming, we should add a comment for each reviewer that is much more formal and doesn't contain any promise of future changes (this is a direct reason for rejection). It should simply state that we modified the text and addressed all the concerns raised by the reviewer (list them).
haha feel free to edit, but i've found it helps to ask, sometimes
I don't think that there is an option to edit responses
yeah, there is... I edited @versed flax, the button is off to the side
btw, good news, good results from CFG vs. baseline CG
here's the delta increase in positive sentiment via CFG for a few settings/prompts:
```
prompt                  guidance  delta
Great movie review:     0.10       0.075225
                        0.25      -0.136310
                        0.50      -0.015543
                        0.75       0.034303
That was a good movie!  0.10       0.364103
                        0.25       0.312607
                        0.50       0.192197
                        0.75       0.044026
```
and here's the delta increase in sentiment via CG for the defaults that the authors used:
```
baseline_df['delta'].mean()
0.065204710023016
```
ideally we test a lot more values for guidance strength for CG, but it is SO SLOW to run.
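A small sketch that tabulates the numbers posted above so the CFG deltas can be compared against the CG baseline mean. The values are copied from the messages; treating the swept column as the guidance setting follows how it is referred to later in the thread and is an assumption, as are the names `cfg_delta` and `summarize`.

```python
from statistics import mean

# Delta in positive sentiment from the CFG runs posted above,
# keyed by prompt and then by the setting that was swept (assumed: guidance value).
cfg_delta = {
    "Great movie review:": {0.10: 0.075225, 0.25: -0.136310, 0.50: -0.015543, 0.75: 0.034303},
    "That was a good movie!": {0.10: 0.364103, 0.25: 0.312607, 0.50: 0.192197, 0.75: 0.044026},
}
cg_baseline_mean = 0.065204710023016  # baseline_df['delta'].mean() from the CG run


def summarize(deltas):
    """Return the mean delta and the best-performing setting for one prompt."""
    best_setting = max(deltas, key=deltas.get)
    return {
        "mean": mean(deltas.values()),
        "best_setting": best_setting,
        "best_delta": deltas[best_setting],
    }


summary = {prompt: summarize(d) for prompt, d in cfg_delta.items()}
for prompt, s in summary.items():
    print(
        f"{prompt!r}: mean={s['mean']:+.4f}, "
        f"best={s['best_delta']:+.4f} at {s['best_setting']}, "
        f"best beats CG mean: {s['best_delta'] > cg_baseline_mean}"
    )
```

Note that this compares per-setting CFG numbers with a single CG mean at the authors' defaults, so it is only as fair as that one CG configuration.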
Let me draft a response to R2, and then we can see whether it looks good, or whether we should do more experimentation
Amazing!! I think it's definitely enough material to address this point of his review
I had edited this out of the Google doc before it was posted... can you check and see if there's other divergences with the Google doc?
I think we should submit a quick response to R2 even if the experiments aren't fully in telling them that they're on the way so they don't feel ignored
sorry!! I didn't check the revision-history/didn't realize you had edited it out... and have been handling a lot of things today
No worries
It's not a big deal, the language just seemed a little over the top
I was more concerned about whether this was a sign that an old draft was used (and other edits I made later I view as more important)
Did another pass over the two posted reviews.., they look good!
the reposted paper is very good too, thanks to @versed flax
I'm almost done with R2... just have to answer that last question
ok R2 is done in the google draft, grabbing dinner now
Just woke up. @patent gull still need me to run some more tests on both fudge and your script? I can try more neg prompts in parallel with you guys and see if anything comes up.
why are those guidance values below 1?
confused_pikachu.jpg
R2 will be quite unhappy that we run yet another method with yet another gamma
yeah 🤷 i see positive/negative prompts as kinda being in the same category https://huggingface.co/docs/transformers/internal/generation_utils#transformers.UnbatchedClassifierFreeGuidanceLogitsProcessor.example
but yeah we generally don't have a good answer for guidance strength and what works and what doesn't
that means we interpolate between both prompts, thus reducing specificity to the user prompt
for negative/positive prompting, it also means we're emphasizing more/less of the negative prompt
$(1 - \gamma) \, p(w_i | w_{<i}, \hat{c}) + \gamma \, p(w_i | w_{<i}, c)$
for $\gamma \in [0, 1]$, you're mixing part of $\hat{c}$ with $c$
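A toy sketch of the mixture written above, on a made-up three-token vocabulary. `mix_probs` is the probability-space interpolation exactly as stated; `cfg_log_space` is the log-space form in which CFG is usually written for language models, included here as an assumption for comparison. Both function names and the example distributions are illustrative.

```python
import math


def mix_probs(p_neg, p_pos, gamma):
    """(1 - gamma) * p(w | c_hat) + gamma * p(w | c), token by token."""
    return [(1.0 - gamma) * n + gamma * p for n, p in zip(p_neg, p_pos)]


def cfg_log_space(p_neg, p_pos, gamma):
    """log p_neg + gamma * (log p_pos - log p_neg), renormalized to a distribution."""
    scores = [math.log(n) + gamma * (math.log(p) - math.log(n)) for n, p in zip(p_neg, p_pos)]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]


p_neg = [0.5, 0.3, 0.2]  # next-token distribution under the negative prompt c_hat
p_pos = [0.8, 0.1, 0.1]  # next-token distribution under the user prompt c

for gamma in (0.0, 0.5, 1.0, 1.5):
    print(gamma, [round(x, 3) for x in cfg_log_space(p_neg, p_pos, gamma)])
```

With gamma in [0, 1] both forms land between the two distributions; only gamma above 1 pushes past the user-prompt distribution, which is why values below 1 reduce specificity to the prompt.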
Hey @blissful garden thanks, I feel pretty good on the prompts for sentiment. I think over the next few days I'll try to get the formality classifier going.
In the meantime, as soon as someone takes a look at R2 and says it's ok, I can post
Seems like the only thing is the last question where we listed 5 counter points. 1-4 look good. I agree with Elad and have some doubts on 5
The rest looks really good!
ok i'll post our response
should we re-ping the reviewers on the OpenReview comment threads?
I don't know how ICLR works
in ACL, the ACs started encouraging reviewers to respond
Yeah that's probably a good idea
Thank you for your review! We were wondering if you were planning on updating your score to reflect our reply or if there were any additional questions you'd like us to answer.
ok, updating now! thanks for the great text!
Unfortunately it looks like we won't be accepted to ICLR unless a miracle occurs. Due to other people's reviewers responding, our paper has fallen to ~ the median review score.
https://x.com/shaohua0116/status/1728158662265340047
This doesn't mean the paper isn't good, it means we got unlucky. Peer review is a crapshoot and sometimes it takes three submissions to get the right luck.
The next venues are ICML and ACL, both of which have deadlines in January. I think we're good to submit as-is (after changing the format and making sure we fit within the length reqs), but if people want to improve the paper more we can have a meeting in December and discuss options.
COLM is probably also something to think of (and maybe have a better chance of being properly evaluated by qualified experts)
True, I hadn't considered that.
What about some ICLR workshop? I think at this stage it's going to be hard to get accepted (since indeed many papers came out with the same sampling modification). On the other hand, with such a good experimental section it will be very easy to get accepted to a workshop
Partially it's a question of what @versed flax's goals are
Can I get your view on the different tradeoffs?
I am having a hard time having a relevant opinion, I don't have publishing experience and what just happened makes me question the chance of getting this paper through a high impact conference
My take was that our paper wasn't really judged by the right people this time. Workshop has the advantage of being specialized to the right domain. I had good experience with that earlier this year but my sample size was 1.
Trying ICML and ACL has the benefit of prestige. If it gets accepted, it's, for example, an entry ticket to job interviews, or a dream-come-true moment for @versed flax if I remember correctly. COLM is probably like betting on a super young venue. But in my own field a lot of young journals run by competent experts did rise up extremely quickly, and carried the mediocre papers of mine and others that got published in them.
I have no idea whether a resubmission gets a lower chance or not. At least in math nobody cares how many times you submitted before.
What about Elad's take that the paper gets older?
yeah this is what I don't know about. How much would resubmission hurt the publication chance in ML.
I mean the paper did come out in parallel with a bunch of others doing similar sampling methods. It just gets submitted late but they shouldn't judge that on when you submit
I just want to stress that the resubmission is not an issue (as @loud adder wrote, it's very common to try a few times until getting accepted). The problem is that there are currently many papers with the exact same method, and I'm guessing that some of them got accepted to some tier-1 conference. For ICLR it was still a borderline case, but now it's going to be very hard to defend the novelty claim (which automatically reduces the score to <6).
From a prestige point of view, getting accepted to a good workshop is not the same as getting accepted to the main conference, but I think it's also good for the resume.
In any case, for sure it's your decision only. Whatever you will decide I will be available to help also in the next submissions
Unfortunately it is judged according to the submission time (since it is a blind submission and the reviewers are not supposed to check arXiv for the original publication date). In any case, they surely will not start comparing dates with other works. Let's hope it will still get accepted to ICLR
In theory you are right. But technically resubmission is at least a 6 month delay. If your work is an important work, people must have talked about it, cited or used it and things get old easily in ML. If there is a perfect isolation between submission and the original arxiv, it automatically becomes "not novel" because this "anonymous submission" is older than your own preprint and reviewers are not allowed to draw connections between these two
Basically this is not enforceable because a perfect execution is saying "good works cannot resubmit".
I'd rather guess that in reality reviewers secretly look up and know what date this paper came out and who wrote it. If it's truly an original work when it comes out, they just don't mention "novelty". If it's obviously a copycat of another method with a significant time difference, they cite this "novelty" issue.
We can ping the ACs if this becomes a serious issue
That's what we did with trlX, when we claimed we were the first people to do something and one reviewer came back and was like "what about trlX?"
You all have more experience than me. If I make a decision, it will necessarily be less informed than any of yours. My goal is to maximize impact & recognition, but I'm not ready to take risky bets and risk losing it all
sorry for the delay here, I missed a lot of this discussion.
I have a few thoughts:
- Huge bummer, and yes, more a reflection of randomness than actual goodness-of-fit. ICLR has a crazy mean-tendency bias... a 1-point difference in any reviewer's score would've totally changed our outlook.
- I think this is worthy of a conference paper, given the amount of different angles we bring together: conventional benchmarks, cot, memory/compute analysis, assistants, etc. I'm willing to be overruled on this point, but I think it's more than a workshop paper.
- In my opinion, it doesn't matter as much that other people have done this sampling modification, as they have really focused on specific cases. Also, the current reviews have undeniably made this paper a lot stronger. My gut is that we need to do a better job highlighting our novelties in the introduction, in essence: introduction = \gamma * our paper + (1 - \gamma ) (other papers). wait a minute.... that looks familiar....
- That being said, I have concerns about ACL since (a) I don't know that people in that conference care as much about compute/memory evals as they do in NeurIPS or ICLR, and (b) the paper format for ACL is much different and smaller; we would have to cut a lot of stuff or move stuff into the appendix. Which might not be terrible: we might indeed have too much introductory maths. But still, it's going to be considerable work to reformat for ACL.
My gut is that we submit at least one more cycle. We got very helpful reviews that got to some core weaknesses in our paper, and we addressed them. The paper is stronger as a result: the review cycle worked.
None of the reviewers seemed to care about other NLP papers that did CFG-like sampling. The criticism was the comparison to CFG in vision, which was fair, @versed flax was directly inspired by vision, so it's a very fair criticism. So, we do a better job of highlighting our response to R2.
In my mind, the tradeoffs:
conference:
- pros: gives the paper more credibility and standing.
- cons: possibility of another rejection.
workshop
- pros: gets the work out there, at least.
- cons: variance in quality in workshops is HUGE. Paper has less credibility, in my mind.
I think COLM might be pretty cool to consider. I looked up the dates and ICML reviews will become available before the COLM deadline. So there's the possibility that we submit to ICML and then if we get bad scores, fix and submit to COLM. ACL reviews won't be available in time for COLM. On the plus side for ACL, the reviewers there tend to write a LOT more, and actually respond to rebuttals, but 🤷 the timing and venue don't seem optimal to me
That seems smart. I'm okay.
should we say something? I don't know how these channels usually work:
@here private comment period ends today
we can easily say "we responded to everything including with new experiments and haven't heard back." I just don't know what is typically acceptable for ICLR. For instance, in *CL conferences, we're advised to do this only as a last resort, if we suspect serious ethical issues on the reviewers' part
yeah no idea how to do this. With 7000+ submissions I bet a lot of other people also said that they didn't hear back from reviewer
That's my thinking, yeah
This is the mechanism to communicate to the AC any unresolved discussion points (if you do not have any unresolved discussion points, there is no need to send a private comment).
I mean, it seems to be exactly our case?
Yes
Send:
Dear AC,
We have tried to engage with the reviewers, but unfortunately none of them responded to our rebuttal. As far as we can tell we have given compelling responses to all reviewers that would warrant a reconsideration of their initial scores.
copy paste, send?
I would say exactly that except "compelling responses" -> "compelling responses, including two requested analyses, to all reviewers"
@versed flax do you want to send?
the three of us are 1st authors, if you want to do it, I won't prevent you
Either/or, I don't care
Ok let me do it before I lose service then, im on a train
alright let me see
FYI
To write a private comment to the ACs, you can simply go to your submission on OpenReview, and write a new comment. The allowable readers are ACs, SACs, and PCs.
"Dear Area Chairs,
We are writing to let you know that we tried to engage with the reviewers, but unfortunately none of them responded to our rebuttal. As far as we can tell, we have given compelling responses to all reviewers, including with analyses that we incorporated into a new draft of the paper, that would warrant a reconsideration of their initial scores.
We summarize the major unresolved discussion points here:
- A memory cost analysis is recommended (Reviewer 3Gz2): We have completed a memory analysis and have included our results in a paper update. In general, we found a tradeoff between serving larger models, and serving a smaller model with CFG; in our analysis we identify the optimal tradeoff point across model sizes and VRAM. Please see Section 4 and Appendix B.3
- Comparison to other baseline controllable NLG tasks (Reviewer RjYY): We have completed this comparison. The baseline classifier-guided control increases sentiment by .065 points, whereas CFG (our method) increases by .312 points. Additionally, the baseline is very slow: it is >100x slower than CFG.
- Lack of novelty (Reviewer YQBo): with respect, we argue that the application of CFG to autoregressive language models is novel, as previously it had only been applied to non-autoregressive diffusion models in computer vision. This is not a trivial adaptation, and the core contribution of our paper is to adapt and rigorously test this across a wide range of different prompting techniques to prove its validity.
We very much appreciated the reviewers' points, and they undeniably made the paper stronger. We had hoped for a robust debate.
We additionally would like to report that we strongly believe that Reviewer YQBo did not fully grasp the point of the paper, as they seem to be under the impression that CFG involves further model training, which it DOES NOT.
We would be very appreciative if you took these points into consideration in your review.
fire
ok I'll wait 20~ min for other people @here to read
and then send. if you don't hear my ACK, assume that I don't have service
sent
: 🔥
Thank you! Sorry I have been so busy today
We're resubmitting to ICML tomorrow @loud adder and anyone else. If you'd like to give our paper a glance, it's here: https://www.overleaf.com/5232387143jdfyzsrvmjsv#565401
This is the final version?
Overall it looks fine. It's just important to make sure it is exactly 8 pages (currently there are 3 lines missing)
i didn't realize that was a stipulation to be exactly 8 pages, no less! but i will return to flesh out the discussion section a little bit better anyway
so i'll make it work
Yes, it's stupid but you can get automatic rejection on such things. It can be easily solved when everything is finished by playing a little bit with the figure size/captions space
Hey @versed flax or others on the CFG paper, we're using CFG as a baseline for a new project and I had a question about the merged HF implementation of CFG which I thought you might know the answer to:
Is the prompt being conditioned on by HF generate() the entire input sequence (and does this stay static / you don't add new generated tokens to this extra-conditioned prompt as you go on?) I think the answer to this is yes but wanted to confirm.
and also, is there a way to pass settings to HF generation such that only a sub-prefix of the initial input sequence is more strongly conditioned on?
we'd like to be able to pass "<Instruction1>..... <context here>" as input, and only condition on <Instruction1> when generating further output from the model
Thanks!
The input_ids given to .generate() is the "positive" prompt, the one given to negative_prompt_ids is the negative (/ unconditional) prompt. Sampled tokens are appended to both during generation
Hm, so continually updating both to include new generated tokens is the desired approach?
yes
gotcha
If you want to do what you say (which is exactly the same as Context-Aware Decoding), you want to use instr+ctx as positive and instr as negative
I see, thanks!
Ah yeah going by their abstract that does sound like what we want. Thank you, appreciate it!
You're very welcome :)
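A toy sketch of the behaviour described in this exchange: two branches (positive = instruction + context, negative = instruction only), with every generated token appended to both. `next_logprobs` is a hypothetical stand-in for a model forward pass returning next-token log-probabilities, and greedy selection is used only to keep the example deterministic. In transformers the corresponding knobs on `generate()` are `guidance_scale` and `negative_prompt_ids`; this loop is not that implementation.

```python
from typing import Callable, List, Sequence


def cfg_greedy_decode(
    next_logprobs: Callable[[Sequence[int]], List[float]],  # hypothetical model stub
    positive_ids: Sequence[int],  # e.g. instruction + context
    negative_ids: Sequence[int],  # e.g. instruction only
    guidance_scale: float,
    max_new_tokens: int,
) -> List[int]:
    """Greedy decoding with classifier-free guidance over two running sequences."""
    pos, neg = list(positive_ids), list(negative_ids)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        cond = next_logprobs(pos)    # log p(w | positive prompt + generated so far)
        uncond = next_logprobs(neg)  # log p(w | negative prompt + generated so far)
        scores = [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]
        token = max(range(len(scores)), key=scores.__getitem__)
        # The chosen token is appended to BOTH branches, so neither prompt stays static.
        pos.append(token)
        neg.append(token)
        generated.append(token)
    return generated
```

With `guidance_scale=1.0` this reduces to ordinary decoding on the positive branch, and with `0.0` to decoding on the negative branch alone.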
What a strange rejection 😦 just gave 4 without any concrete reason (besides typo)
should we reply? I'm pretty happy with the 7 and 6
like "lol dude why 4 just because of typos??"
I think we should stress the novelty (which is the main weakness according to the other reviewers).
Let's hope he will either change his mind after he sees the other reviews (his confidence is only 3), or that the AC will kick him (this could happen with very high probability)
Seems like we are above the threshold now
