#Evaluating Classifier-Free Guidance impact


patent gull
#

well hold on... probably not the right call to plot 2 tasks on the same axis

#

i'm just thinking

#

what i'm seeing actually is that Guanaco has an acc peak at 1.5

#

and Wizard has an acc peak between 1.1-1.25

fallow egret
#

Yes, they have peaks in different places

patent gull
#

the % invalid min-regions look a lot wider

#

i wonder if there's such a thing as grouped-line chart...

fallow egret
#

I have to go for ~15 minutes, sorry...

patent gull
#

ok no problem

fallow egret
#

I'm back, in my opinion the two charts that include everything definitely demonstrate the two trends (the acc and the invalid). It's a little bit strange to mix both datasets and models on the same graph, but it might be a valid option if we want to emphasize the results and put all of them in the main paper.
Another option is to split it in two and put one of the datasets/models in the appendix (I don't think it's that bad, not everything should be in the main paper. People read the appendix, especially when there is a reference to the appendix figure in the paper)

patent gull
#

yes, i kinda think we should split

#

ultimately it's up to you but i think maybe sticking with gsm8 is a good idea since it's like you said, more important

#

i redid them so that accuracy and invalid are grouped, that way we can have a real ylabel

#

still have a lot of vertical white space, which i'm not happy about but 🤷‍♂️

fallow egret
#

Yes, they are not on the same scale, I don't think we can do anything about it (either the acc and invalid don't have the same scale or the model performances don't). Note that you have typos in the name.

patent gull
#

we can do a broken axis:

#

but it's easily missed by readers thus not a favorite technique imo

fallow egret
patent gull
#

it's very late for me. can i send you the finished files and can you put them into the paper? do you know how to use subfig?

fallow egret
#

np, can you send me also the code?

#

oh, these are the finished figures

patent gull
#

I will send you files

#

these are screenshots

#

i will send you code too in case you're curious

#

thank you

fallow egret
patent gull
#

yup fixing that now then gonna send over

#

ok these are the figs

#

i made gsm smaller so it could go in the main body

fallow egret
#

Wizard LM is 30B

patent gull
#

whoops

fallow egret
#

🙌 so I'm putting one in the main paper and one in the appendix?

patent gull
#

i think so?

#

i think 4 plots would be too much info in the main body

#

i also think squeezing two tasks into one plot isn't great

fallow egret
#

I agree, great. So I will modify the paper accordingly

patent gull
#

great, thank you so much man

#

i'm out for the night ✌️

fallow egret
#

thank you, it looks much better now

#

good night

patent gull
#

we had to do something about those plots haha

#

but we did

fallow egret
#

Ok, I think it looks good! please review when you wake up 🙂
(please also go over the captions, I changed them yesterday according to the feedback)

versed flax
#

a lot happened while I was asleep!

#

The new plots are cool, they really show the trends

#

So, we have 8h before arXiv submissions close today

#

Once you're ready for release, please 👍 this message :) @loud adder @patent gull @blissful garden @unique sedge. I will send the paper either 2h before or when I get your validations, whichever happens first.

fallow egret
versed flax
#

That's why I'll do it the very second I get everyone's go

patent gull
#

Just waking up. There was one word in the acknowledgments that I don’t remember

#

But it was used a lot

#

And I didn’t know what it meant

#

Ah yes, what do you mean by “redactor”?

versed flax
#

"writer" then?

patent gull
#

Also what were the comments from the two people @loud adder was showing it to?

#

Anything helpful?

#

Sure writer/editor

versed flax
#

Definitions of redactor
noun someone who puts text into appropriate form for publication
yeeee!

patent gull
#

???

#

Wow I’ve never heard that word before

patent gull
#

“To redact” means to remove something from a text

#

So I guess it’s a view of writing in the negative lol, but I’ll take it, it sounds fancy!!

versed flax
#

let's go with "writer" then. If it's confusing to you, it will be confusing to a lot of people

versed flax
patent gull
#

Haha 🤷‍♂️ yeah…

versed flax
#

but there was a "nsaphra" reading the paper yesterday

patent gull
#

Lololol

#

Maybe nsaphra made some changes

versed flax
#

nope

patent gull
#

Btw how did the paper drop down to <10 pages?

versed flax
patent gull
#

Gotcha

#

Cool!

versed flax
#

(aka: Stella LaTeX magic)

loud adder
#

NeurIPS also uses 1.5” margins which are quite large. Since we’re just using their template rather than submitting to the venue I edited the style file to use 1” margins

loud adder
#

So the way it works is that you can resubmit as many times as you like in the next 4 hours (until 1400 ET / 1800 UTC) and it’ll go live at the same time. After that it gets pushed back a day though.

versed flax
#

yes, though I'd be happy not submitting a bazillion times bc we're fixing punctuation lol

patent gull
#

Alright. As soon as I get to the office I’ll give it another read, but I’ll only change anything if I see something major

unique sedge
patent gull
#

uh oh ok not a big deal but be prepared for a resubmit

#

we never said what Figure 1 displayed 😂😂😂😂

#

how much time do i have?

versed flax
#

1h max

#

30 minutes preferred

patent gull
#

ok

versed flax
#

WHERE IS IT

patent gull
#

idk man

#

it belongs in the intro anyway

#

that's too far down to be intro-ing Figure 1 for the first time

versed flax
#

all right

patent gull
#

alright good

#

signed off

versed flax
#

awesome!!

patent gull
#

(just triple-checking all the figure captions)

#

ok great

#

i'm logging off overleaf otherwise I'm gonna drive you and myself crazy

#

overall, a million thumbs up 👍👍👍👍👍

#

this paper came out so well, had so many unique parts, and tied together really nicely at the end

#

it's a great paper, really foundational. We're in a different ballgame from CAD at this point

versed flax
#

Well, it's time :)

#

Submission time \o/

patent gull
#

ok done

#

good

versed flax
#

uh, I need "endorsement" bc I never published in cs.CL

#

The code is 7MN9HQ

#

@patent gull it seems you can endorse me

loud adder
#

@versed flax I can never find the page to endorse a paper… there should be an option to send an email

#

Feel free to send it to me

versed flax
#

Thank you! it worked!

#

! Package natbib Error: Bibliography not compatible with author-year citations.

#

trying to solve it

patent gull
#

now you're doubly endorsed

#

there's a fix for this, hold on... i know there's some github package that just fixes this for you

#

magically

versed flax
#

I don't get why it complains about author-year, the neurips template uses numbers

patent gull
#

hmm

#

overleaf unfortunately does fix a lot of things under the hood

#

do you have a local latex install?

versed flax
#

yeah

patent gull
#

sometimes i've had to go through that a bunch of times to make sure it works

#

overleaf is magical in a lot of ways

#

ugh i wanna find this github package

#

maybe try this?
\usepackage[numbers]{natbib}?

fallow egret
#

Yes, it's really a nightmare to export from overleaf to arxiv

versed flax
#

Thank you Elad for telling me to save some time for the submission

#

🙏

fallow egret
#

I hope it's enough time 🤞

versed flax
fallow egret
#

It was compiled and submitted?

versed flax
#

compiled, yes

#

I'm filling the forms

fallow egret
#

Amazing 🙌 this was quick

patent gull
#

cool!

versed flax
#

Do we want to fill any of this?

fallow egret
#

I don't think so

versed flax
#

Friends, it's party time!

#

Thank you everyone! It's been a blast. Next stop: Sunday 6pm UTC for a bit of advertising, and we'll talk about conference submission later :)

patent gull
#

woooooooooooooooooooooooooooooooooooooooooooooooooooooo!

#

let's take a nice long breather, now

#

wow

loud adder
#

Congrats y’all

sand mesa
#

yay!

versed flax
#

FYI:

Your article is currently scheduled to be announced at Mon, 3 Jul 2023 00:00:00 GMT.
Updates before Fri, 30 Jun 2023 18:00:00 GMT will
not delay announcement.

stone umbra
#

🎇 🎉 congrats on finishing!

versed flax
#
loud adder
#

LMK when you tweet about it and I’ll retweet it from the EleutherAI account

versed flax
#

Well then I'll do it right now :)

#

wait maybe I should add Fig 1 to the tweet?

loud adder
#

I generally find more success and engagement with tweets that walk you through the highlights of the paper. I would add a couple more, drawing out particularly interesting figures and talking about them a bit?

versed flax
#

All right. Let me give it a try. That's a first for me.

fallow egret
#

Is it allowed?

#

At least for ICCV and CVPR (until the last ban decision), it was not allowed (as authors) to publish on social media

versed flax
#

We have no conference in sight, so...

long shell
#

Did anyone happen to look at the predictions on the lambada val set? I'm curious what sort of incorrect responses CFG is fixing

blissful garden
blissful garden
# fallow egret

Ooooh haha my memory about the example stays at our first one

versed flax
#

GUYS WE GOT RETWEETED BY JEREMY HOWARD

wheat zenith
#

As an AI model user, I hope it's okay to just drop into the Discord here to post: I love love this work so much. No kidding, two days ago I was lamenting "What is wrong with this world, why don't LLMs have negative prompts." https://news.ycombinator.com/item?id=36537845 and then POOF. The world is right again.

#

Only constructive comment I might contribute is on the concept of negative guidance, as a user who prompts. Weird to imagine the idea in text LLMs, yeah. But what about audio LLMs?

#

To me it doesn't seem that strange in an audio LLM like musicgen. Since musicgen has a CFG like var, out of the box negative CFG could output music I plausibly considered vaguely like the opposite of my text prompt. In this case even without a positive prompt, just the unconditional and negative only since I hadn't modified it yet. (The range of negative CFG that produced normal sounding but different music was quite narrow and fiddly, typically something like -.2 to -.3., and changed for every prompt, so hard to use though.)

#

I've been trying to bang two rocks together to make negative guidance work in a TTS LLM, and now I feel so much less crazy that this exists. It doesn't quite make as much sense there, but it will be fun at least. (I think about it maybe like a director showing an actor a scene, and then being like, "Ok you see that? I want the opposite of that.")

versed flax
#

I think there's a wording trick but I couldn't find it

wheat zenith
#

I can't actually follow the math or the fundamentals to know if what I did was like this idea, but I did try using a bunch of other generated samples in a way that seems similar. I took one voice, I found some kind of difference between that voice and 100 random english audio samples. Just counting token frequencies. So the idea is you have the tokens in that voice, that are unique, but not just 'human speech' -- and then you flip the sign on those, and penalize them in the sampler. It's like an anti voice. Not sure it makes sense!
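
A rough sketch of the "anti voice" idea described above, assuming it works purely on token counts: compare how often each token appears in the target voice versus a pool of generic samples, then subtract a penalty from the logits of the tokens the voice over-uses. The function names, the `strength` knob, and the toy token ids are all hypothetical; this is not Bark's actual code.

```python
from collections import Counter

def token_frequencies(tokens):
    """Relative frequency of each token id in a sequence."""
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: n / total for tok, n in counts.items()}

def anti_voice_penalties(voice_tokens, reference_tokens, strength=5.0):
    """Penalty per token: positive where the voice over-uses a token
    relative to the reference pool, absent everywhere else."""
    voice = token_frequencies(voice_tokens)
    ref = token_frequencies(reference_tokens)
    return {
        tok: strength * (freq - ref.get(tok, 0.0))
        for tok, freq in voice.items()
        if freq > ref.get(tok, 0.0)
    }

def apply_penalties(logits, penalties):
    """Subtract each token's penalty from its logit before sampling."""
    return [x - penalties.get(i, 0.0) for i, x in enumerate(logits)]

# Made-up token ids: token 7 is characteristic of the target voice.
voice_tokens = [7, 7, 7, 2, 3]
reference_tokens = [1, 2, 3, 2, 3, 1, 7, 2, 3, 1]
penalties = anti_voice_penalties(voice_tokens, reference_tokens)
logits = apply_penalties([0.0] * 8, penalties)
```

Only token 7 is penalized here, because it is the only token the voice uses more often than the reference pool does.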

wheat zenith
#

The wonderful thing about AI models, especially recent ones, is that it's almost hard not to make output that is at least interesting.

#

Can I bug you about one somewhat random question, only vaguely related? On the musicgen github, someone posted cool music and also that they used "-p sampling" and then a bunch of other people were asking if it was really using -p sampling, did that work, and I thought it would be funny to actually try it. So like, reverse the order of the logits, least likely first, otherwise just like top-p. Actually though, the output seems genuinely kind of useful and different in an audio LLM. And as far as I understand, it's not just equivalent to something else? In a TTS model, it makes people have a Christopher Walken speech pattern. They choose wrong places to pause. SO COOL.
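
One possible reading of the "reversed top-p" experiment, as a minimal sketch: rank tokens least likely first, keep the smallest prefix whose probability mass reaches `p`, renormalize, and sample from that. The `nucleus` and `sample` helpers and the toy logits are assumptions for illustration, not the code from the musicgen thread.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(logits, p, reverse=False):
    """Return {token_id: prob} for the top-p nucleus.

    With reverse=True the tokens are ranked least likely first, so the
    kept set is the smallest group of unlikely tokens whose total
    probability reaches p.
    """
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=not reverse)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in kept}

def sample(dist, rng=random):
    """Draw one token id from a {token_id: prob} distribution."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

logits = [3.0, 2.0, 1.0, 0.0]                   # made-up scores for 4 tokens
usual = nucleus(logits, p=0.5)                   # keeps the head of the distribution
flipped = nucleus(logits, p=0.2, reverse=True)   # keeps the tail instead
token = sample(flipped, random.Random(0))
```

With these toy numbers, ordinary top-p keeps only the most likely token, while the reversed variant keeps the three least likely ones.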

versed flax
wheat zenith
# versed flax well I mistakenly implemented that on LMs and it was just bad lol

That's what I expected. I think maybe the Bark audio TTS model may just be unusually robust, you can ban 75% of the tokens randomly and sometimes it sounds mostly normal. It was okay-ish in musicgen for short periods as well, but eventually degrades to non-music. For music I feel like I really just want endless text boxes for different prompts with CFG weights, some positives, some negatives, some CFG values that vary over time, like based on the current token count. Feels pretty natural in music. It's like a conductor, holding out a hand to a section of the orchestra, slowly raising it up, increasing the weight of one section, decreasing another. Continuously changing.

#

Bark is not yet in Huggingface but I'm so excited I almost want to try and port this code...

wheat zenith
versed flax
#

So that will generate lyrics right

#

But now let's say you want to use a neg prompt so that these lyrics are not about love

#

As far as I could think, your neg prompt would be "I wrote a love song, the lyrics are:"

#

And again, there must be a better way to prompt engineer a neg prompt, but we did not find what it was

#

because then the continuation won't be a love song at all, which will lead to a weird negative continuation:

"I wrote a love song, the lyrics are: <something not about love at all>"

wheat zenith
#

Right. What does working correctly look like?

versed flax
#

no idea. We couldn't find the right way to phrase it.

#

We used negative prompts only as more general versions of the prompt or totally opposite of the prompt (surprisingly, that still works), but we couldn't find the prompt engineering to make it more targeted /granular
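
For reference, a minimal sketch of how a negative prompt can slot into the CFG arithmetic: the model's next-token distribution under the negative prompt takes the place of the unconditional distribution as the baseline to move away from. The three-token vocabulary, the logits, and the function name are made up for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def negative_prompt_guidance(prompt_logits, negative_logits, gamma):
    """Steer toward the prompt and away from the negative prompt.

    Same arithmetic as plain CFG, but the baseline is the distribution
    under the negative prompt instead of the unconditional one.
    """
    pos = log_softmax(prompt_logits)
    neg = log_softmax(negative_logits)
    return log_softmax([n + gamma * (p - n) for p, n in zip(pos, neg)])

# Hypothetical 3-token vocabulary: ["love", "rain", "road"].
prompt_logits = [1.0, 1.0, 1.0]    # a generic songwriting prompt
negative_logits = [3.0, 1.0, 1.0]  # "I wrote a love song, the lyrics are:"
guided = negative_prompt_guidance(prompt_logits, negative_logits, gamma=1.5)
love_prob = math.exp(guided[0])
```

In this toy case the probability of "love" drops below its value under the plain prompt, which is the blunt, whole-distribution effect discussed above rather than a targeted edit.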

wheat zenith
#

The first thought I had, skimming the code, was I'm gonna add in a text box that can swap in for the unconditional or the 'neutral prompt' -- no idea what that enables or if it makes sense. But in audio I did have to use like 'generic english voices' not just 'unconditional generation' for the token thing I did.

versed flax
#

nice!

#

let us know

wheat zenith
#

But just vaguely, maybe 'unconditional' being another input, could ground the "opposite" concept somehow.

#

The great thing about audio? As long as it changes the sound... it could still be a useful knob to turn, even if you have no real idea why it's having the effect, or can't really predict it.

#

Trickier in pure text.

versed flax
wheat zenith
#

The Diffusion people have been living a life of spoiled luxury. Negative prompts, control net, a billion other syntax tweaks, while the LLM community has nothing. They are ravenous and I get it.

#

Actually in Bark, there's kind of two prompts, two different sets of tokens, both are used at inference, concatenated. One for the voice, one for the text to say. So each could have this implemented separately, gonna be crazy.

#

It's all just tokens out of a GPT model, it should all work

#

They should implement something like the visualization tool you made, that is super cool too

#

ggerganov will have it done so fast. if you google any random weird sampling thing, half the time, the only working code I can find that isn't the original repo is in ggml. he just implements everything.

blissful garden
versed flax
#

and yes, many likes and follows!

wheat zenith
versed flax
#

@fallow egret that's for you

fallow egret
wheat zenith
versed flax
wheat zenith
#

You're toppling the Diffusion cartel. They can't keep all this stuff to themselves any longer. We're coming for all of it. Even when it doesn't really make sense in an LLM. I'm putting in my prompts anyway.

fallow egret
wheat zenith
#

The llama/oobabooga/text-gen community will probably try a lot of obvious twists and variants, if there's a new variable exposed, people will start really exploring.

#

Is it possible to trade in more than 2x compute time for some further gains?

fallow egret
fallow egret
wheat zenith
#

There is always trivial brute force stuff. Not really same concept though. Like you can run an entirely second audio model inside the sampling loop and use it to judge the emotion of the output, and then backtrack and keep trying. It's the least efficient way to do something like that, but if you only need 2 minutes of audio, you can run it all night and it eventually works.

strange magnet
#

RT'd 🙂
Great work! It's very exciting to see a project like this come to fruition in Eleuther, where someone can come in with their ideas & results and get help refining it into an impressive paper 🥳

azure lion
#

(typo 🤓)

loud adder
fallow egret
#

P.S. I actually ran a comparison to an ensemble. CFG works significantly better

loud adder
fallow egret
loud adder
fallow egret
# loud adder I feel like it would fit as a natural subcolumn here?

For each one of the experiments?
We can do that, but I think that generally ensembles try to tackle a very different issue. So it would be nice to mention in one of the settings that we beat the ensemble (with half the computation resources!), but I'm not sure we want to do that for all these experiments since it's not an apples-to-apples comparison with respect to the problem it is trying to tackle

#

If you just meant the table representation format, then yes, it sounds like a good idea!

loud adder
fallow egret
patent gull
#

@wheat zenith retweet us!!! ~~

wheat zenith
patent gull
wheat zenith
#

You are now my last two tweets. And I have been tweeting like 3 times a month.

#

So that's a LOT

patent gull
#

I just wanna say a huge, huge thanks and congrats to @versed flax who will never take credit for it but is truly the leader here. He went many, many sleepless nights trying to be awake when we all were and coordinate. Endlessly thoughtful, experimentative, questioning. You really motivated me to be a better thinker.

Also a huge shout out to @blissful garden for powering us through all the tough experiments!!! You also tolerated all my last-minute requests asking for different plots!!

versed flax
wheat zenith
#

I'm pretty amateur, every single line of code there, probably learned last month, lol

patent gull
#

the acknowledgements in the paper don't fully capture how hard these two worked and the spirit, energy and devotion here. This came together quickly but doesn't mean it wasn't deep

patent gull
blissful garden
patent gull
#

i'm down

#

maybe we have a finetuning paper more focused on negative prompting?

#

that seems like an area that we can really own and build from this paper on

wheat zenith
#

Is the paper locked, or could you also test CFG and/or negative prompts in some audio LLM? To me they feel pretty natural, negatives too. Sound descriptions have pretty clear opposites. A loud scratchy voice, a soft smooth voice, whatever. Even a person, or an entire voice. If you asked a group of people to pick the voice out of a set that was the opposite, they'd probably mostly pick the same person. As opposed to something conceptually hard to grasp like "the opposite of a love poem"

azure lion
blissful garden
wheat zenith
fallow egret
wheat zenith
#

In musicgen, you can do anything and make weird music. for example mapping CFG to a sine wave, based on tokens. sounds great, adds variety
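
A tiny sketch of what "mapping CFG to a sine wave, based on tokens" could look like: the guidance strength becomes a function of the token position instead of a constant. The base value, amplitude, and period below are arbitrary illustrative choices, not settings taken from musicgen.

```python
import math

def sine_gamma(step, base=3.0, amplitude=1.0, period=40):
    """Guidance strength for the token at position `step`.

    Oscillates between base - amplitude and base + amplitude,
    completing one full cycle every `period` tokens.
    """
    return base + amplitude * math.sin(2 * math.pi * step / period)

# One guidance value per generated token, to be fed to the sampler.
schedule = [sine_gamma(t) for t in range(100)]
```

Each decoding step would then use `schedule[t]` wherever a fixed guidance weight was used before.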

#

It breaks up the repetition. audio is easy mode I think. Just being different, is good.

blissful garden
wheat zenith
#

I also happened to ask someone in the HuggingFace discord about logit attribution, and this is like, the Discord where that concept seems to have literally been created, wild timing. I had only a practical question about using it to make the audio waveform visualization act like a debugger for your prompt, but also look cool. But the idea is like an audio version of the colored words in the paper actually.

blissful garden
patent gull
#

shoutout to @fallow egret and @paws too, i think you guys handled a tonnn of back-and-forth, chaotic discussions very, very well and with grace. without your parts, this would be a way flimsier paper

loud adder
#

Made the tables a bit cleaner. Especially if we decide we want to add more comparisons, this will scale nicer than the original layout

patent gull
loud adder
#

@wheat zenith FYI there's also a thread for training models to generate music, #1106671860294357055

loud adder
#

(They seem to have stalled out due to people being busy, but additional manpower might help with that)

patent gull
#

i'm loosely involved in that project... i think it's also a question of getting the boilerplate together/training baselines. I question whether it's the right time to start considering extensions like CFG, but ultimately, additional personpower does always help!!

fallow egret
#

I think for me an interesting direction of extending this work will be to extend it to the RL context. You can see CFG as modifying the model policy given another policy (negative). And I think that an interesting direction is given a new reward function how we can steer the model properly only during inference, I think this could be done with the ILQL framework, but these are only very initial thoughts...
https://arxiv.org/pdf/2206.11871.pdf

versed flax
# patent gull I just wanna say a huge, huge thanks and congrats to @versed flax who w...

Dude it's really been wild and fun. Really, massive kudos to your never stopping improving the paper's quality when I was ready to settle. Massive thanks to @blissful garden glu for running tirelessly all those experiments. And overall for the incredible quality of your reasoning to push the paper further and further.

And obviously thanks to Stella for stepping in in the very beginning, and sending me in the right direction to be able to discover and show the power of CFG, and for the multiple reading passes

blissful garden
patent gull
#

hahaha

#

(is that a high bar? i haven't written mine yet)

versed flax
versed flax
versed flax
versed flax
loud adder
#

@versed flax it seems like it’s really making rounds!

versed flax
#

Retweets from Alexia Jolicoeur-Martineau, Emad Mostaque, Jeremy Howard, lucidrains, and some others whose names I forgot

loud adder
#

The raw stats are also pretty cool to see

versed flax
#

It's really exciting

#

I almost never use Twitter so I don't know how big of an effect that is, but it's definitely non-zero

#

Oh yeah, someone from Nomic.ai who (ofc) commented on the GPT4All experiment!

loud adder
#

Curious observation: EleutherAI retweeting it seems to have made basically no impact. < 100 people have seen the EAI retweet

versed flax
#

That's crazy. I had no follower base

#

I don't know who got to see it first then

#

I thought it was your retweet that impacted it

#

Or maybe you retweeting my post made the recsys show my post to EAI's followers directly rather than your retweet?

loud adder
#

I quote tweeted tho

#

Every other quote tweet we’ve ever done seems to have 20-100x as many views

versed flax
#

🤷‍♂️ weird

#

Oh maybe that's bc my initial tweet mentioned your account

loud adder
#

I guess we did something to offend The Great Musk and got throttled 😂

versed flax
#

hahaha

#

that would explain your low view count but not my high view count 😆

loud adder
#

That’s easily explained by doing good work and getting noticed

versed flax
#

That's a nice compliment

#

I'm waiting to see whether it delivers on downstream applications before self gratification and claiming we did "good work" haha

loud adder
#

Fair enough

versed flax
#

btw @loud adder what's the consensus on non-English LLMs? Nobody seems to really care, why?

  1. No academic interest due to lower innovation / lesser citation potential?
  2. No industry interest bc it's just too expensive to build a dataset and train one?
  3. No interest because we just aim for massive multilingual models?
loud adder
#

Almost everyone who trains LLMs is paid by a US or Chinese company

#

There’s a small Korean scene

#

(Our Korean models are the best OS ones AFAIK)

#

There’s a Swedish non-profit that’s trained single-digit pan-Nordic models

versed flax
#

I would be so down training a french one

loud adder
#

Go find me ~ 1 TB of French text and we can talk

versed flax
#

I have 10GB berk

#

Challenge accepted though

loud adder
#

And Cedille has an unreleased model they sell commercially IIRC

loud adder
#

mC4 will get you half way there IIRC

#

Maybe French Wikipedia and a couple other sources can close out the rest

versed flax
#

mC4 is Common Crawl?

loud adder
#

Yeah

versed flax
loud adder
#

What do you mean by “is that real”?

#

If you can find really high quality data you can get away with less, but we’re talking like “a substantial fraction of all books ever written in French” kind of quality

versed flax
#

I mean, is this a number set in stone that can't be challenged with those modern, quality first, approaches?

loud adder
#

That is based on modern, quality first approaches

versed flax
#

Ah. 😂

loud adder
versed flax
#

I know there's a pretty big source of books I want to scrape but I don't know the actual size of it

versed flax
#

The open question though is: should the code be written in french too? That doesn't exist lol

loud adder
#

That’s part of why I said quality will suffer, but you can live without code “in French” most likely

versed flax
#

That's interesting. I'm not sure there's a high value doing this (ChatGPT is already pretty amazing at French tbh and I'm sure it's not specifically built with french in mind)

#

But it sounds like a fun ride

loud adder
#

There’s some amount of cross-lingual generalization though, see
https://arxiv.org/abs/1910.11856
https://arxiv.org/abs/2005.00633
https://arxiv.org/abs/2211.01786

#

I would anticipate that code specifically is a high-transfer medium. But I don’t have good evidence of that.

#

I guess we had some in Crosslingual Generalization through Multitask Finetuning. But the evaluation metrics were pretty lacking

versed flax
#

Yeah that echoes a private conv I had with @unique sedge earlier. It's easier to learn another language than starting from scratch. You already know how to reason, syntax can be more or less transferable, and vocabulary is just a thin sugarcoating around those much harder implicit tasks / skills

#

Ok the question was "why" and apparently the answer is "funding"

loud adder
versed flax
#

I didn't! I will skim through the paper before falling asleep

versed flax
loud adder
#

Recently, huge language models have shown impressive generative skills, allowing them to handle a wide variety of problems. Typically, 'prompting' is used to condition generation, either with task instructions and context or with a small number of samples. However, problems, including hallucination, deterioration, and wandering, have been observ...

versed flax
#

So damn great! I can't count the number of papers I discovered because my phone recommended me an article from MTP

patent gull
#

everyone loves that table, they keep tweeting it --- i'm pretty psyched!! latex \cellcolor{} ftw

#

also i'm so glad @versed flax pushed for the assistant angle, and really pulled all-nighters to make it work

#

i think that's why people are so psyched about us and not CAD or another one

versed flax
#

Told ya. Marketing. Lol.

#

There are two main selling points: assistants & 0.5x model size

#

Those are the things that people seem to like about it

#

CFG will land in Hugging Face tomorrow I guess :)

patent gull
#

@loud adder we are chatting about follow-up papers in order to capitalize on this attention... do you think we can continue to use the cluster?

loud adder
patent gull
#

cool!! thank you so much!! yeah I don't think any of us are ready to jump in 100% yet, but we are talking about paper #2 being a fine-tuning paper

#

mainly @blissful garden 's idea, but we're thinking of fine-tuning on CFG-generated data to see if we can "bake" in some of the benefits, thereby getting rid of the 2x inference cost

fallow egret
versed flax
versed flax
versed flax
#

Someone using the pod? the more I use it the more broken transformers get

#

now I even get a protobuf error

#

yesterday I had some weird lib issues

blissful garden
#

We should probably have conda in it instead of mixing everyone's env together

loud adder
#

@fallow egret Can you translate this coverage of CFG for us 🙏🏼

https://twitter.com/MikeE_3_14/status/1675930643857825792?s=20

[Hebrew-language tweet from #shorthebrewpapereviews reviewing the paper:]
Stay on topic with Classifier-Free Guidance (CFG)

versed flax
# loud adder @fallow egret Can you translate this coverage of CFG for us 🙏🏼 https...

Google translate does a fairly decent job:

Today in #shorthebrewpapereviews we are reviewing an article:
Stay on topic with Classifier-Free Guidance (CFG)
The article uses the proposed CFG method to improve the sampling of conditioned diffusion models. The purpose of CFG is to "adjust the adaptation of the sample" to conditioning (there is a parameter that controls the intensity of the adaptation). Here it is done for #LLMs

#

Here CFG is used to improve the ability of a language model to generate long and coherent answers to a prompt without forgetting the context. Here the unconditional model is the same model that generates text without conditioning in the prompt. That is, to construct an answer to a given prompt, we move the answer away from the unconditional sample when the strength of the removal is controlled with a gamma parameter

#

The proposed method works quite nicely (not surprising because it is kind of math-based - the formula for calculating the gradients is based on the Bayes formula). That is, the more you raise the Gamma, the more suitable the answer is to the prompt.
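
As a concrete illustration of the translated summary, here is a minimal sketch of the CFG combination on next-token log-probabilities: start from the unconditional distribution and move toward (and, for gamma above 1, past) the prompt-conditioned one. The toy logits and helper names are made up; in practice both distributions come from two forward passes of the same model, one with the prompt and one without.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cfg_log_probs(cond_logits, uncond_logits, gamma):
    """Combine two next-token distributions with guidance strength gamma.

    gamma = 1 returns the conditional distribution unchanged,
    gamma = 0 returns the unconditional one, and gamma > 1
    pushes further away from the unconditional distribution.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [u + gamma * (c - u) for c, u in zip(cond, uncond)]
    return log_softmax(mixed)  # renormalize so probabilities sum to 1

# Toy 3-token vocabulary; the numbers are made up for illustration.
cond_logits = [2.0, 1.0, 0.0]    # model conditioned on the prompt
uncond_logits = [1.0, 1.0, 1.0]  # same model without the prompt
guided = cfg_log_probs(cond_logits, uncond_logits, gamma=1.5)
```

With gamma at 1.5 the token the prompt already favors gets an even larger share of the probability mass, which is the "more suitable to the prompt" effect described above.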

fallow egret
versed flax
loud adder
tepid gazelle
fallow egret
strange magnet
#

😄

versed flax
#

Lit

blissful garden
#

I wish those counted as citations 😂

gleaming torrent
#

https://twitter.com/novelaiofficial/status/1682010357819142147 maybe could retweet this for more visibility?

New Phrase Repetition Penalty & Classifier Free Guidance Settings!
It is our pleasure to expose you to new settings that allow you to take Clio to a whole new level.

  • Updated data storage for faster saving, and updated flash attention to v2 for increased Clio generation speeds!
versed flax
# blissful garden I wish those counted as citations 😂

Btw, about that, any idea why the Bibliographic Explorer doesn't work? https://arxiv.org/abs/2306.17806

versed flax
#

(Also, Semantic Scholar usually extracts figures & tables, it didn't)

loud adder
#

I guess SS failed to parse the paper properly then

#

Here's a paper that also has zero citations but bib explorer works fine: https://arxiv.org/abs/2306.01481

versed flax
#

Uh. I'll try and see what I can do then.

tepid gazelle
blissful garden
fallow egret
fallow egret
versed flax
obtuse tiger
#

Hi

#

I found this paragraph a bit weird because you say that embeddings are good and have nice structure and then say oh yeah actually we are doing logit arithmetic. But I think this is just equivalent to doing arithmetic with the final layer hiddens since the unembedding is a linear transform right?
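
A quick numerical check of the linearity point, as a sketch with a random toy unembedding matrix: mixing the logits gives the same result as mixing the final hidden states and then unembedding, because the two mixing weights (gamma and 1 - gamma) sum to one. This assumes the hidden states are taken right before the unembedding (after any final layer norm) and that the mix is applied to raw logits; on log-probabilities the two versions differ only by a per-distribution constant that disappears when renormalizing.

```python
import random

def unembed(W, h):
    """Logits = W @ h for a vocab-by-hidden matrix W (no bias)."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def mix(a, b, gamma):
    """The CFG combination b + gamma * (a - b), element-wise."""
    return [y + gamma * (x - y) for x, y in zip(a, b)]

rng = random.Random(0)
hidden, vocab, gamma = 4, 6, 1.5
W = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(vocab)]
h_cond = [rng.gauss(0, 1) for _ in range(hidden)]    # with the prompt
h_uncond = [rng.gauss(0, 1) for _ in range(hidden)]  # without the prompt

# Arithmetic on logits versus arithmetic on hidden states.
mixed_logits = mix(unembed(W, h_cond), unembed(W, h_uncond), gamma)
logits_of_mixed = unembed(W, mix(h_cond, h_uncond, gamma))
max_gap = max(abs(a - b) for a, b in zip(mixed_logits, logits_of_mixed))
```

`max_gap` is zero up to floating-point rounding.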

obtuse tiger
#

Cool, I kind of suspected that, but it was unclear. If I were you I might make an update to the paper to clarify but up to you guys of course.

#

actually sorry

#

I just read the second to last sentence

#

which makes it more clear

#

I still feel like it's confusing-ish

#

because

#

idk it's like the core of what you're doing

#

and it should be 1000% clear

#

but anyway

blissful garden
#

I actually felt this paragraph is a bit hard to parse as well. I would have been confused when reading it the first time, but I'm not trained with ML background so I blamed myself lol

#

might be a better way to phrase it though. Will def think about it when we prepare to submit it somewhere

obtuse tiger
#

Also could we combine this with the tuned lens?

blissful garden
#

btw @versed flax any thoughts on where to submit?

obtuse tiger
#

To do CFG in intermediate layers

blissful garden
obtuse tiger
#

they end up having to do "counterbalancing subtraction"

#

which is kinda like negative prompting

obtuse tiger
# obtuse tiger Also related to this https://www.alignmentforum.org/posts/5spBue2z2tw4JuDCx/stee...

actually sorry this is the more recent thing https://arxiv.org/abs/2308.10248

versed flax
blissful garden
#

Also we should brainstorm what kind of questions we can ask if we do CFG in intermediate layers. It sounds cool and there should be some interesting collaboration here.

obtuse tiger
obtuse tiger
#

Maybe we could start a thread in #concept-editing or smth

versed flax
obtuse tiger
#

Oh sure I think I agree this would need to be a different paper

obtuse tiger
#

#1146877254153031930

blissful garden
# versed flax It's out of scope for me

Oh yeah I didn't mean to change anything for our current paper. For submission we just need to decide if we want to say more about negative prompts

We used to have some thoughts about a second paper and we can slowly pick them up and brainstorm

patent gull
#

hey just caught up on this 🙂 these new ideas sound interesting and cool

#

we had an alternative direction for paper #2 in the idea of fine-tuning, just want to keep that one alive, too!

#

but reg. paper #1

versed flax
#

There's the ICLR submission deadline at the end of the month. You guys OK to submit?

patent gull
#

ICLR deadline is 9/28, I just re-checked it, it's in Vienna too

#

yah

#

just messaging about that

#

I'm cool to submit there!! I think it's a good idea. Stella's typing though. what do you think, stella?

loud adder
#

(I also have a half written message suggesting ICLR from earlier today but got distracted before finishing it)

patent gull
#

alternatives in the NLP domain are NAACL (11/23 I think) and ACL (January-ish, hasn't been announced). otherwise, we can always wait for Neurips :/

but i vote ICLR. @blissful garden ?

loud adder
#

We should know about ICLR in time to submit to ACL or ICML

patent gull
#

is ICML better than ICLR or are they roughly equivalent? I'd probably rank NAACL lowest

loud adder
#

I view ICML, ICLR, and NeurIPS as equivalent

versed flax
#

I think I agree with that

patent gull
#

great 🙂 let's go for ICLR then.

imo i don't think the paper needs much. maybe another round of grammar-editing, following @obtuse tiger 's point about clarity up there

Do you think there is an anonymity-preserving way to mention in the paper everything that has happened since Arxiv? i.e. that CFG is incorporated into Huggingface and llama.cpp? that's certainly a cool contribution

loud adder
#

"Since its public release, CFG has been readily adopted by major LLM libraries including llama.cpp and transformers"

patent gull
#

cool.

maybe we can also throw cool examples of generations using CFG that the community has generated into the appendix? too bad we never set up a tipline for community members to send us what they played around with...

versed flax
#

If only someone kept track of everything and lurked in the communities using CFG

loud adder
#

I don't see that hurting the paper's chances, but it's non-standard and I don't see it helping.

patent gull
#

well i'm thinking of ways of saying "the community thought this was useful" ... showing it's been incorporated into major libraries, and including examples of grassroots adoption are ways of doing that?

loud adder
#

I am not aware of adoption at a scale wherein it would significantly influence reviewers. I could be underestimating it, but off the top of my head papers that do that are things like VQGAN-CLIP (> 1 billion uses) or things like FSDP and trlX which are used in million-dollar model trainings

versed flax
#

ngl it's not like it changed the world (yet?) the adoption is quite slow

fallow egret
#

It might help in case the experimental section was thin. But the experimental section of this paper is so vast and extensive that it's hard to believe that it will add any positive points. The only claim I can see for a rejection is lack of novelty.

Regarding ICLR, it's of course a great conference. The downside is the open-review process, which is tough and might mean that the top result on Google ends up being an old version or a rejection with ugly reviews.

blissful garden
# fallow egret It might help in case the experimental section was thin. But the experimental se...

I actually like open review a lot better than the closed reviewing process in pure math journal submission. We have way more shitty reviews than one can imagine that are obviously biased and/or even personal. Very occasionally, questionable papers also get accepted in top journals very fast. I wish people could see the whole process for every submission. If a paper is objectively good, there is nothing to be afraid of. If there are fair points that need to be improved, we will just improve them.

fallow egret
# blissful garden I actually like open review a lot better than the closed reviewing process in pu...

I'm not against submitting to ICLR, and as a researcher I of course think that the open-review process is positive. However, as an author this format requires much more effort (there might be a full discussion with the reviewers + requests for a few draft versions), and the publicity makes you think ten times about every sentence. So I think we should submit, but it is something that should be considered

unique sedge
#

Have no strong opinions on submissions to conferences. On board with anything you choose 😄

wheat zenith
# versed flax Not that I'm aware of.

I'm worried I'm asking a dumb question and missing the obvious, so forgive me for commenting in your group research channel again. But I didn't understand this response and it's been bugging me. What was the reason you can't trade more than 2x compute time and possibly enable model capabilities or outputs you couldn't get just inferencing twice?

As a concrete example, with this transformers patch change you can use the negative prompt as a second positive prompt, and that seems like it is a useful tool. https://github.com/huggingface/transformers/pull/25339#issuecomment-1667814849 So at a minimum, wouldn't I then have to inference three times instead of two if I want to use that second positive guidance but also want to use negative guidance at the same time? Or is there some way of reducing or collapsing all the combinations back down to two steps?

Thanks for being so nice when I randomly barged in originally btw, I kind of missed the context of this channel being a semi-private group research spot in the excitement of the moment but everyone was exceptionally chill about it.

blissful garden
fallow egret
# wheat zenith I'm worried I'm asking a dumb question and missing the obvious, so forgive for c...

I don't think there is any dispute that a linear combination will work, and there might be practical use cases. However, I think that from a research perspective it's not interesting, since by the properties of the sum (additivity, commutativity) you can split it into a sum of the positive part and the negative part. Since it is already known that a sum of different models' logits behaves as an ensemble method, and we know that the minus behaves as contrastive decoding, the expected result is clear. So this is why I think it will not be interesting from a research perspective (there is no novelty / new information that you can deduce from such experiments)
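To make the splitting argument concrete, here is a small sketch under made-up assumptions: three prompts (a main prompt, a second positive prompt, a negative prompt), each contributing one forward pass, combined with hypothetical weights. None of this is taken from the transformers patch.

```python
def combine_logits(terms):
    """Weighted sum of logit vectors; terms is a list of (weight, logits)."""
    vocab = len(terms[0][1])
    return [sum(w * logits[i] for w, logits in terms) for i in range(vocab)]

# Toy logits for a 3-token vocabulary (hypothetical numbers).
main = [2.0, 0.5, -1.0]
second_pos = [1.0, 1.5, 0.0]
negative = [0.0, 2.0, 1.0]

# One (weight, logits) pair per prompt, so one forward pass per prompt.
terms = [(1.0, main), (0.5, second_pos), (-0.5, negative)]
num_forward_passes = len(terms)

full = combine_logits(terms)

# Additivity: the same result splits into an "ensemble" part (positive
# weights) plus a "contrastive" part (negative weights).
positive_part = combine_logits([t for t in terms if t[0] > 0])
negative_part = combine_logits([t for t in terms if t[0] < 0])
recombined = [p + n for p, n in zip(positive_part, negative_part)]
```

The combination is well defined for any number of prompts, but each extra prompt with its own weight costs one more forward pass, so the three-prompt case above does need three.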

wheat zenith
# blissful garden Yeah I guess you can take any linear combination of prompts. Not sure about exac...

Super helpful, thanks. Yeah I just wanted to make sure it was different, so there could be a reason you might actually want to do the extra work of 3x inferences, and there wasn't some underlying reason why it could always collapse down to just two. I lurk your research here a bit because you guys keep coming up with fascinating sampling and prompt concepts that are fun to even think about, what it means in a prompt or if it was 'working correctly'. Small code changes that open up tons of new prompt possibilities, and model outputs are *wildly* different. I'm not involved in research myself, it's just really fun to try your ideas and see what the heck comes out. 🙏 (I barely tried neg guidance in audio yet, still mostly unexplored, and just noticed you are thinking about CFG gen 2 already.)

steep bone
#

Is this work similar to this ACL paper: https://arxiv.org/abs/2307.03214 ?

versed flax
blissful garden
versed flax
#

They don't cite us 😂

loud adder
#

The ACL submission deadline was in 2022

versed flax
loud adder
#

Oh

#

🤦‍♀️

patent gull
#

well definitely another paper to add to the related works!

patent gull
#

@everyone @unique sedge @fallow egret

Hello everyone we got results back from ICLR.

We're right below the margin of comfort for acceptance. If 1 or more reviewers increases their score by 1, we will be MUCH more comfortable with our chances.

We've identified 2 small experiments we think have a great chance of increasing our scores:

  • show memory comparisons
  • show NLG controlled generation comparison

I think @versed flax already addressed #1. Does anyone have any bandwidth to address #2? I will work closely with you to do this

#

w.r.t. #2, here is guidance for an experiment.

SOTA controlled NLG baselines:

Experiments:

  • sentiment
  • formality

I think there are classifiers for both, so I think the experiment can be "is CFG output classified as formal, via the formality classifier, vs. is NADO output classified as formal, via the formality classifier"

#

i think it's going to be difficult to show that CFG beats SOTA controlled NLG, because SOTA NLG assumes the presence of a classifier, which is a benefit of CFG that we don't need one, so we can do NLG beyond just formality and sentiment. But as long as we show it's not too different in these areas, that would be a nice result and might cause R2 to raise their score

versed flax
#

I'll add that:

  • #1 will be addressed soon wrt to the memory question. It's a fair and important question. I ran the calculations necessary.
  • #2 is imho the hardest to address. His questions are totally outside of my comfort zone, so that's the one I personally won't be able to tackle correctly
  • #3 gave us a 5 while being notably confused by the paper and thinking it was a training technique. Honglu and I think that if we fix his understanding and show him that it is indeed better than a training technique, we can get a better grade from him
patent gull
#

that's doubtlessly true. but in terms of outlining what work we will do between now and 11/22, there's nothing to be done for R3 besides crafting a good argument

#

@unique sedge and @fallow egret if we can come together and address some of the actual work-items, then we raise our chances

#

p(score increase) = \sum_{reviewers} poisson(\lambda)

#

with a very, very low lambda

#

@loud adder @blissful garden any way to get access to some A100s to run some CFG runs to address #2?

blissful garden
#

@patent gull @versed flax did you guys have access to SAI cluster?
Sadly the A40 pods are taken away from EAI afaik. We have some 4090 I think

#

@tepid gazelle Do you know what compute resources does EAI have right now? 4090 pods?

tepid gazelle
versed flax
#

@blissful garden / @patent gull I have some CoreWeave instances with my job now. Depending on the duration of the experiments I can run them

#

can't give you access tho

blissful garden
#

I have access to SAI cluster. If you guys have codes for small models I can scale it on SAI cluster. Jobs can get preempted but half a day is usually not a problem

patent gull
#

Ok I can set up some experiments for you to run @versed flax. I just feel like 2080s are going to be annoying if we want to run any CFG on any models beyond just llama 7b or something

blissful garden
#

I have TPU v3 pods that I can share, but TPU is a different beast 😂

patent gull
#

I have access to 2080s too… I can set up some experiments with smaller models and then pass 'em off

fallow egret
#

Hi, sorry for the late response. @patent gull I have bandwidth to work and help with whatever is needed.

fallow egret
#

I read the reviews. I'm not sure how much this experiment will help (overall the experimental section is the strong part of the paper). It seems that the main concern (as expected) is the lack of novelty and contribution. I think we should think about a strategy for addressing this issue.
I think it will be important to address this issue and upload the rebuttal response as soon as possible so the reviewer will have a chance to give a feedback and develop a discussion, because it will not be easy to convince them about the contribution (actually with this score we need to convince the AC).
I think there are two paths:

  1. Differentiate our work from previous works (I think it's possible, we discussed it a lot a few months ago).
  2. This is mainly for R@3, who found the experimental section insightful: I think we should focus on the experimental contribution (this is a valid contribution and reviewers sometimes forget about the importance of a solid experimental paper).
versed flax
fallow egret
# versed flax R3 didn't read the paper, it's pretty clear. It shouldn't be hard to prove that ...

I don't think that integration in libraries is a valid claim for academic contribution. In the end there are indeed many previous works on decoding methods which seem to be equivalent to CFG (we know at least 3-4 works). The fact that they didn't release the code or bother to integrate it in big repos doesn't mean you have added value on top of their work.
I think that even if R3 didn't read the paper, it is not going to be easy to convince the AC

patent gull
#

my experience has been that directly addressing as many of the reviewers' concerns as possible is the best chance to increase the score

#

p( score increase) = \sum_{reviewers} p(reviewer score increase)

#

and in OpenReview we can respond to each reviewer individually

#

Yes @fallow egret , we should quickly craft a response to R3 and try to respond to all the intellectual points as soon as possible to encourage discussion. But that doesn't preclude us from also trying to run the experiments they ask for. In the end, it may not amount to anything

#

but if 1 reviewer increases their score by 1, then our paper has a much better chance

blissful garden
patent gull
#

if you or @unique sedge have bandwidth, it would be great to see if you can get NADO working for formality

NADO: https://arxiv.org/pdf/2205.14219.pdf

I already have FUDGE working for sentiment... would be pretty easy to complete all 2 x 2 after that, and then run CFG with formal prompts and sentiment-relevant prompts, and then evaluate

fallow egret
#

on GPT-2?

patent gull
#

thanks @fallow egret for checking this out!!!! GPT2 is what I was thinking, yeah

#

I can put you in touch with the primary authors, Sidi and Tau

#

or, I'll just reach out to them

fallow egret
#

I think we can just use the code and it's their issue if the results are not great 🙂

patent gull
#

also wait, there's no issue running the code, just reproducing the results?

#

yeah

#

that's what I'm thinking, too

#

we just report the results (if anything, we can footnote this issue, or something)

fallow egret
#

sure, so I can take it. Let's sync on private message on the exact experiment (dataset, metric)

versed flax
#

I can take care of answering R1 and the memory analysis he requested

versed flax
#
  • We kinda establish that a model with CFG consumes 2x the FLOPs (2 forwards) but still follows the perf / FLOP plot. So you kinda can train a half-size model and infer with CFG.
  • So the question is: is this tradeoff smart at inference as well, given that you use two cache lines with CFG, but 2x bigger models need to store more floats per token in cache (bc of the bigger hidden dim) and store 2x params?
  • I do the maths and show that it depends on your VRAM / intended cache size. For small models, the weights are negligible in VRAM, you can have big caches, and the double cache for CFG is not worth the 2x reduction in params. However, for LLMs, especially the very big ones (> 30B), the weights take a massive amount of memory and the 2x cache lines would only outgrow the param halving at very large amounts of VRAM
versed flax
#

I end with this chart

#

it reads like this: Say you have 10GB VRAM. For model sizes above the red line (up to 1B in this case), you should stick with vanilla models. The 2x cache line will outweigh the /2 param count. Then, below the red line (1B and above), prefer deploying CFG: your VRAM isn't big enough to store a big cache, and the /2 param count is better

blissful garden
#

Looks good

loud adder
#

Hello, I'm back from getting married 🥰

ICLR reviews look decent. We're in the top 40% of papers by review score. Do we have a google doc for organizing our response yet? Or have we been doing it in this thread?

blissful garden
versed flax
#

As you can see, we have started working on answers:

R3 is probably the easiest to convince since they barely understood the paper (my guess is that they thought "uh, CFG, not novel!" then barely skimmed the paper + weak understanding of CFG anyway ("training technique"?!?!)). Maybe R3 should be answered at a high level since their criticisms aren't that deep. The main point is convincing them of novelty. I don't know how to prove it besides 1) "trust me bro", or 2) "our work is novel. The proof is that the arxiv release was followed by implementations in major LLM inference engines => it wasn't already there", but people seem to agree that this is a bad defense and can backfire. Especially bc it seems people didn't really get how to use it, especially the neg prompt

R2 is totally out of my scope. Alex seems to know how to tackle his points.

R1 is addressed with the aforementioned analysis

I think we should probably answer tomorrow

loud adder
# versed flax

I think that this might make more sense with the axes switched? The VRAM seems like the more fundamental constraint to me, where you then maybe vary the params and move between the regions

versed flax
#

I will need to triple check the maths, but the idea is here. Worst thing that can happen is that the slope changes a bit. Not much to worry.

patent gull
#

hello @loud adder , congratulations!!!! i hope your wedding was amazing as well and wow, we weren't expecting to hear from you, don't you have a honeymoon or something?? I didn't know EAI was part of that hahaa

patent gull
versed flax
patent gull
versed flax
patent gull
#

ok, I just checked... no image uploads to OpenReview.

We can ask them to click a link, but shouldn't expect they will. We've all been through enough phishing videos.... Also, it's one more click

So we want numbers we can paste into the box as well for the quick headline, and then they can click if they want to see more

versed flax
#

We can update the PDF tho. So we can put the figure in it.

patent gull
#

yeah definitely. again, wouldn't expect the reviewer to check. I can't even get my advisor to read my updates...

loud adder
# patent gull (I don't fully agree btw, I think parameter count is the dependent variable here...
  1. I said this having only skimmed the reviews based on what I would generally expect from a plot like this.
  2. The dependent variable is the y-axis...
  3. That said, the actual dependent variable here is the memory usage. I assume you either misspoke or got the words confused, but ultimately you're correct.

I'll read the review in question again and if your characterization of the request is correct I agree the original format likely makes sense.

versed flax
#

new figure version reads like:

  • y axis interpretation: if you have 10GB of VRAM, serve vanilla models up to 1B. Then, serve with CFG. 5B and above => can't fit.
  • x axis reading: say you have a 1B. You need at least 2GB to serve it. Up to 10GB, serve with CFG. Then, you'd be better serving an actual 2B
patent gull
#

whoops!! yes, I meant parameter count is the independent variable and vram is the dependent var.

@versed flax what is the green "vanilla" writing supposed to be aligned with?

versed flax
loud adder
patent gull
#

i see... so the diagonal lines are lower bounds based on parameter count? and there's no upper bound because data tensors can take up VRAM?
if i'm just not understanding, but everyone else is, it's ok, we can move on

loud adder
#

The grey shaded region is the region where the model doesn't fit within the specified VRAM

versed flax
loud adder
#

Let me see if I can explain the plot (since I'm not actually sure I'm following 100%)

#
  1. The question is whether our claims about "matching larger models" remain true if we care about VRAM (w/ k-v caching) rather than # params
  2. The red line is the Pareto optimal frontier as you trade off # params vs VRAM

I'm confused about what the blue dots are though.

#

This is specifically a response to

R1: memory cost analysis is recommended. The proposed method requires a second run of the model, which may increase the memory cost (for example, the key-value cache).

versed flax
#
  1. If I understand correctly what you mean, yes.
  2. Yes. Serving with CFG costs more kv cache but less params, and (kinda) gives you the performance of a model twice the size. So, below the red line, you should serve with CFG, above it, you should serve an actual 2x model (without CFG). If you want to maximize the amount of tokens you fit in your kv cache, that is.
#

Blue dots are just the actual param count / max kv cache size for the models in the paper (gpt2-*, pythia-*, llama-*). Since they have variation in arch and there's a little alignment to 64 at play in the hidden dim, they don't exactly fall onto the red line.

Yes, this is meant to answer that remark from R1.

loud adder
#

Side note: looking over the paper again the misalignment between plots and where they're referenced in the text is very distracting

patent gull
#

i have to stare at this some more. so, below the red line (vertically), you have enough excess memory, but not so much, so you can afford to serve the same size model with CFG. Below the green line, that size model won't fit. Above the red line, you have so much extra memory that you should just serve a bigger model?

#

I don't think I fully understand, but I think the y-axis label could be improved "VRAM at Equality". Equality to what?

patent gull
#

ok maybe the region in between red and green can be shaded light green for "go"?

versed flax
patent gull
#

the region below the green line: "gray" is fine, but "red" for "stop" is also OK. Above the red line can be light blue. And in a legend, or in the caption, we can define what each of these colors means. The reality is that there are 3 separate regions here, not just two

versed flax
#

yes, let me GPT4 this rn

patent gull
#

hahaha

#

ax.fill_between()

versed flax
#

nah doesn't burn enough CO2

patent gull
#

lolll

so just looking at one vertical line:
at parameter count = 1B, we intersect the green line at 2 GB VRAM and the red line (and blue dot) at 10 GB

#

does that mean that for a model with 1B parameters, vanilla costs us ~~1.1~~ 2 GB and CFG costs us 10? so 5x as much? that seems high to me

versed flax
patent gull
#

at param count (X) = 1B, I see the green line intersect VRAM (y) at the 2 GB y-tick, and the red line at the 10 GB y-tick

patent gull
patent gull
#

I guess what I thought the reviewer was looking for is performance vs. VRAM for CFG vs. vanilla.

Just like Fig 11:

versed flax
#

You have X amount of VRAM. You want to use it all and serve efficiently. So you'll store the model weights, and use the rest for a kv cache. You want that kv cache to fit as many tokens as possible.
So you have 3 options:

  1. Serve your model as is. Boo, lame, boring. so you fill your mem with params P + cache cost per token C * cache size S. This S is the only variable, and you want to maximize it.

  2. You're a chad and you want to DOUBLE THE PERFORMANCE! and you've read about this CFG paper. But now you're using 2C per token. so you use your VRAM as P + 2C * S.

  3. You wonder whether you shouldn't directly serve a 2x bigger model with 2P params and a slightly bigger cache cost C' (C prime), but C' < 2C. Your VRAM is used with 2P + C' * S

At some point, if you can fit a big S, most of your VRAM will store the cache, and you really want a smaller cache footprint. But if your model is big, the parameters will dominate in VRAM, you can't store a big S, and you'll want to reduce the parameter memory footprint. So what's the decision boundary? Red line, decided as S = P / (2C - C')
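The red line follows from setting option 2 equal to option 3: P + 2C * S = 2P + C' * S, so S * (2C - C') = P. A small sketch of that bookkeeping, with all quantities in bytes; the function names and the toy numbers below are made up for illustration and are not the values behind the chart.

```python
def vram_vanilla(P, C, S):
    """Option 1: model as is. P bytes of params, C bytes per cached token."""
    return P + C * S

def vram_cfg(P, C, S):
    """Option 2: same model with CFG, so two cache streams per token."""
    return P + 2 * C * S

def vram_double_model(P, C_prime, S):
    """Option 3: a 2x bigger model with per-token cache cost C' < 2C."""
    return 2 * P + C_prime * S

def break_even_cache_size(P, C, C_prime):
    """Cache size S at which options 2 and 3 use the same VRAM."""
    if 2 * C <= C_prime:
        raise ValueError("need C' < 2C for a break-even point to exist")
    return P / (2 * C - C_prime)

# Hypothetical numbers: 2 GB of weights, 100 kB vs 150 kB of cache per token.
P, C, C_prime = 2_000_000_000, 100_000, 150_000
S_star = break_even_cache_size(P, C, C_prime)  # number of cached tokens
```

Below `S_star` cached tokens the CFG option uses less VRAM than the doubled model; above it, the doubled model does.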

#

is it clearer @patent gull ?

#

can I go to sleep? .___.

patent gull
#

lolll i'm still parsing

versed flax
#

4am 🤡

patent gull
#

you can go to sleep lol but what do you think of my prev post?

#

about replicating Fig 11? that's what I thought the reviewer was asking for

versed flax
#

R1 explicitly mentions KV cache. It's an inference question. I'm not sure I can see another way of interpreting the question

#

But if you have one, please explain

patent gull
#

for filling the same KV-cache budget, what is your accuracy with CFG vs. a bigger model?

#

parallel to Fig 11. For the same FLOPs budget, we show accuracies on vanilla vs. CFG

versed flax
#

so, like, you want to store 2k tokens in your kv cache, what's your best strategy?

patent gull
#

ultimately, the user doesn't care about "what's the biggest model I can fit"... the user cares about "what's the maximal accuracy I can get with a fixed budget"

patent gull
#

I'm assuming P + 2C * S will, because that means a slightly bigger model. maybe i'm contradicting myself earlier when I said VRAM was dependent variable

versed flax
#

lemme think

#

like, you have 30GB. If you only care about perf, then that's a no brainer, use a 15B+CFG (fp16, so 15B => 30GB). You'll match the perf of a 30B without needing the actual 60GB. But you'll have a cache size = 0. Dumb dumb.

patent gull
#

ummm actually I thought this was dataset dependent.. i thought for each dataset, there's a max-size datapoint, so we scale KV cache to that, and then we can maximize model size

#

but honestly, i'm very green to this kind of engineering work so I dumb

versed flax
#

nah, kv cache is model dependent. It's your context_len (model dependent) * num_cache_lines (how many sequences do you want to cache when serving)

patent gull
#

ah right. ok... if someone who is smarter than me can look at that graph and make a meaningful decision about which model to choose, then i will believe you haha, i just can't summarize it myself.

versed flax
#

dear fuckin Yann LeCun I'm realizing how much I actually learned about LLMs since I switched job

versed flax
#

That's the same in training, which you may be more familiar with

#

You can't answer "what model size do I train for my amount of VRAM?" because it also depends on the tradeoff you're willing to do on your batch size

patent gull
#

but for inference especially, can't we assume num_cache_lines = K (some constant, preferably for simplicity's sake, K=1)?

#

then, since your KV cache is upper bounded by the model's sequence length, can't you make a decision:

  • model m + CFG
  • model m'
    based off of accuracy and the maximal amount of parameters that will fit in the cache?
versed flax
#

but if you run a big data center with millions of users like OpenAI, you absolutely can't decide num_cache_line=1, that's basically dedicating 1 GPU per person, that's insane

patent gull
versed flax
#

I'm sorry my English is just complete trash. I just shouldn't be allowed to speak.

patent gull
#

no lol you're good, it's really not your fault, that was a very clear answer

#

but S is upper-bounded by the model's sequence length, right?

#

it's not just \in {0, \infty}, right?

versed flax
#

in a non hypothetical scenario, S is a multiple of your ctx len

#

S = ctx_len * num_concurrent_cache_lines

#

in a cloud setting for instance, each user gets ctx_len cached tokens. So you'll allocate one cache line for user Alex, another cache line for user Stella, another one for user Honglu and so on
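A back-of-the-envelope sketch of that allocation. It assumes standard multi-head attention, where each layer caches one key and one value vector of width hidden_dim per token, stored in fp16 (2 bytes per value); the config values are hypothetical and do not correspond to any model in the paper.

```python
def kv_bytes_per_token(num_layers, hidden_dim, bytes_per_value=2):
    """Per-token cache cost C: keys and values for every layer."""
    return 2 * num_layers * hidden_dim * bytes_per_value

def total_cached_tokens(ctx_len, num_cache_lines):
    """S = ctx_len * num_concurrent_cache_lines."""
    return ctx_len * num_cache_lines

def kv_cache_bytes(num_layers, hidden_dim, ctx_len, num_cache_lines,
                   use_cfg=False, bytes_per_value=2):
    """Total cache memory; CFG keeps a second stream for every cache line."""
    per_token = kv_bytes_per_token(num_layers, hidden_dim, bytes_per_value)
    streams = 2 if use_cfg else 1
    return streams * per_token * total_cached_tokens(ctx_len, num_cache_lines)

# Hypothetical serving setup: 4 concurrent users, 2048 tokens of context each.
toy = dict(num_layers=24, hidden_dim=2048, ctx_len=2048, num_cache_lines=4)
vanilla_bytes = kv_cache_bytes(**toy)
cfg_bytes = kv_cache_bytes(**toy, use_cfg=True)
```

With these toy numbers the vanilla cache is about 1.6 GB and the CFG cache twice that, which is why the number of cache lines you plan to serve matters so much for the tradeoff.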

patent gull
#

there's gotta be a way we can make a better argument than "it depends"

versed flax
#

there's really not

#

you want to serve millions of users with 1 GPU? Serve pythia-14M.

#

you want to serve 1 user = 1 GPU? Serve a big model

patent gull
#

lol. yeah but this is research... we don't have to consider 1 million users

versed flax
#

You want to go bankrupt? Serve 1 user = 8 GPUs.

versed flax
#

scaling is all the rage

patent gull
#

ok... 1 user, fixed VRAM. which model do i choose?

versed flax
#

easy. the biggest that fits, with CFG

patent gull
#

ok WITH cfg

#

that's great. how do you know?

versed flax
#

because it'll give you the performance of a model that should be twice as big

patent gull
#

but I have 2x the cache, so i should be able to serve a model MORE than twice as big with the same memory constraint, right?

versed flax
#

and since your kv cache size will be super negligible bc you just want 1 cache line, you don't have to worry about 2C being greater than C', because (2C - C') * S <<< P, since S is so small (assuming you don't have one of those crazy models with 100k ctx len ofc lol)

patent gull
#

sigh. ok i'm naive and i don't typically have my head in this space, but i'm gonna say something super high-level and dumb: I feel like there's a way we can fix certain variables and make a better argument about "here's the model we choose to maximize accuracy".

But if charts like these are actually super typical and we can reasonably expect the reviewer to interpret it correctly, then great.... @blissful garden any thoughts?

#

basically, all i'm saying is that we have to plan for the reviewer having the attention span of a goldfish, and if we can't convince them in that timespan, we're not getting a score boost

#

a sentence like "for fixed VRAM, CFG delivers 130% the performance" checks that box for me

#

as a goldfish myself

#

I can speak for other goldfish

versed flax
#

Ok my argument to R1 is "You're raising a good point, we did the maths, and there's a tradeoff. In certain scenarios where you want to serve big models you'd better run inference with CFG than run a 2x model"

patent gull
#

can you be explicit about what those "big model" scenarios are? >1B parameters?

versed flax
#

depends on your vram lol

patent gull
#

and "you'd better run inference with CFG" because why, higher accuracy?

#

ok fix a VRAM

versed flax
#

we can add: "as an example, if you have 10GB of VRAM, models up to 1B should be served as is, but 1B to 5B models should be served with CFG"

patent gull
#

ok: "1B to 5B models should be served with CFG because, e.g.) for a 2B model + CFG you get better performance than a 4B model, which takes up the same VRAM"?

#

i think we're getting there imo

#

your example is good

#

and perfect, something textual we can put in the reviewer response

versed flax
#

"1B to 5B models should be served with CFG because, e.g.) for a 2B model + CFG you can fit a bigger kv cache than a 4B model, for comparable performance"

#

fixed

#

plz I rly need to sleep, I have 5h of sleep remaining

patent gull
#

ok sure

versed flax
#

we good?

patent gull
#

go to sleep

#

we can talk more tomorrow. i don't understand why "bigger kv cache" is the dependent variable here

versed flax
#

enough CO2 burnt

patent gull
#

"1B to 5B models should be served with CFG because, e.g.) for a 2B model + CFG you can fit a bigger kv cache than a 4B model, for comparable performance"

->

"1B to 5B models should be served with CFG because, e.g.) for a 2B model + CFG, with the same KV cache size you can get X% more performance"

?

versed flax
#

what do you call performance?

patent gull
#

same thing I thought you were calling performance: accuracy on the benchmarks

#

same as figure 11

versed flax
#

yes, totally

#

so why would having a {bigg,small}er kv cache have any impact on that?

patent gull
#

ummm you can go to sleep we can talk tmrw

versed flax
#

ok cool

blissful garden
blissful garden
#

also, since we can revise the paper, I wonder if we should add a super short subsection or subsubsection explaining the challenges of applying CFG in language domain and why it doesn't work verbatim. We could address R3 blah blah blah, and look, here is a new short paragraph explaining that we are not applying existing technique trivially.

versed flax
# blissful garden What? Why does each user provide fixed amount of tokens for the models and why i...

Why does each user provide fixed amount of tokens
Don't overthink it. They were just an illustration of concurrent runs.

<rest of the message>
That's how the queueing system works, not how the cache itself, the big tensor of size (num_cache_lines, 2, num_layers, num_heads, hidden_dim), works. (The tensor might or might not be explicit in the code, but in the end, this is how the VRAM will be allocated for the cache.)

versed flax
fallow egret
#

Updating on FUDGE:
I implemented the method with a shallow sentiment classifier (65M parameters):
https://huggingface.co/docs/transformers/tasks/sequence_classification

The problem is that it runs extremely slowly (since you need to run the classifier on 200 samples for each token). With max tokens 20 (which is not enough), it takes 67 sec per sample.
Which means that running it on ~500 samples will take ~9h (and for the full dataset, which contains 25k samples, it would take ~465h at that rate).
We also need to run multiple experiments (there is a guidance hyper-parameter).
Any ideas?
cc @patent gull
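
A quick sanity check of the runtime estimate above, using only the 67 s/sample figure from the message. The helper name is hypothetical, and the estimate assumes a single sequential worker.

```python
def total_hours(seconds_per_sample, num_samples):
    """Wall-clock hours for a sequential run at a fixed per-sample cost."""
    return seconds_per_sample * num_samples / 3600


print(round(total_hours(67, 500), 1))     # ~500-sample subset
print(round(total_hours(67, 25_000), 1))  # full 25k-sample dataset
```

At 67 s per sample, the 500-sample subset comes to roughly 9.3 hours and the full 25k samples to roughly 465 hours.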

blissful garden
fallow egret
blissful garden
#

oh so the model is fixed to gpt2-medium? Or should the --model-name argument be used somewhere

fallow egret
#

We can change it, but we decided to do this experiment with GPT2-medium

#

The current run is with the default guidance of 1

#

Let me add it as a parameter and clean up the code a little bit

blissful garden
fallow egret
#

Yes, exactly

#

you need me to clean the code?

blissful garden
fallow egret
#

👍

#

@patent gull Please verify that we are fine with this experiment (models/dataset)

blissful garden
#

Wow 1/24936 [01:50<762:33:17, 110.09s/it] lol 😂

fallow egret
#

lol, yes. The algorithm is a disaster from a computational perspective. I don't understand why it's even considered a valid option

#

For every generated token you need to run the classifier on 200 x (number of samples in the batch) inputs

blissful garden
#

does it generate in batches or just 1 token at a time?

fallow egret
#

1 token at a time. But I'm not sure it will help since in any case the bottleneck is running the classifier

blissful garden
#

classifier can also run in batches I guess?

fallow egret
#

The classifier is running in batch

blissful garden
#

how hard is it to vectorize everything with a large batch size?

#

I see the vram isn't fully used

fallow egret
#

That's the point: it's already a big batch of 200, and if you increase the batch size to N then you need to run the classifier on a batch of 200*N (which means that in practice you will not be able to run a big batch)

#

Should not be a big deal, I can do it

blissful garden
#

oh so the classifier does it for the top 200 tokens for each generation step, is that right? Sorry I only start to understand it right now

fallow egret
#

Yes. What they are doing is simply classifier guidance.
The problem is that in order to do it you need to run the classifier on every possible next token. In the paper they 'compromise' by taking the top 200 🙂
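
To make the cost concrete, here is a minimal sketch of one decoding step of classifier guidance restricted to the top-k candidates, as described above. It is not the FUDGE code: `classifier_guided_step`, `attribute_logprob` and `strength` are hypothetical names, and the toy classifier below stands in for the real sentiment model.

```python
import math


def classifier_guided_step(next_token_logprobs, prefix, attribute_logprob, top_k=200, strength=1.0):
    """Reweight the top_k next-token candidates by a classifier score.

    next_token_logprobs -- dict: token -> log p(token | prefix) from the language model
    prefix              -- list of tokens generated so far
    attribute_logprob   -- callable(tokens) -> log p(attribute | tokens); stand-in classifier
    Returns (dict of renormalised probabilities, number of classifier calls).
    """
    candidates = sorted(next_token_logprobs, key=next_token_logprobs.get, reverse=True)[:top_k]
    # One classifier call per candidate: this is the per-token cost discussed above.
    scores = {
        tok: next_token_logprobs[tok] + strength * attribute_logprob(prefix + [tok])
        for tok in candidates
    }
    # Renormalise over the candidates only (softmax with max-subtraction for stability).
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {tok: math.exp(s - m) / z for tok, s in scores.items()}, len(candidates)


# Toy example with a three-token vocabulary and a made-up classifier.
lm = {"good": math.log(0.5), "bad": math.log(0.3), "the": math.log(0.2)}
toy_classifier = lambda tokens: math.log(0.9) if tokens[-1] == "good" else math.log(0.1)
probs, calls = classifier_guided_step(lm, ["movie", "was"], toy_classifier, top_k=2)
print(probs, calls)
```

With `top_k=200` and a batch of N prompts, each generated token costs 200*N classifier inputs, whereas CFG needs two language-model forward passes per token.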

blissful garden
#

lol this is crazy

versed flax
#

WELL I MEAN

#

If that is the current way of doing things, I say we already have a GODDAM strong argument for CFG, even if our scores are lower

blissful garden
fallow egret
fallow egret
patent gull
#

Let me take a look. Just waking up now

fallow egret
#

I'm pretty sure we will get better results, since the classifier is crappy (if we take a stronger one, then the running time will be infinite, and in any case in their paper they suggest using a weak classifier)

patent gull
#

Apologies for the delay

patent gull
#

that looks great to me. what's the dataset?

#

also i think to compare apples-to-apples, we might want to make sure both variations see the same input. And CFG is probably going to see an input like "Write a happy response" or something

#

So i would prepend every example in the dataset with "I'm feeling happy today. <input sentence>"

#

or something. @versed flax @blissful garden any ideas for a good prompt that captures sentiment for a non instruction-tuned model? I know you played around a bit with this, @versed flax

blissful garden
versed flax
#

Following the previous night I am absolutely exhausted and today was mostly dedicated to surviving. I will be unable to do good work and must delay my answer to R1 to tmrw.

fallow egret
patent gull
#

no problem haha

#

ok if it's movie reviews, then I would prepend the phrase "I enjoyed this movie. <prompt>..."

fallow egret
#

Do you want to do it also on the FUDGE experiment?!

patent gull
#

that's my thinking, yeah, otherwise p(xi | x<i) is different across CG vs CFG .... how do we know that CFG worked, compared to just adding that prompt changed the sentiment anyway?

fallow egret
#

Yes, I see

#

Ok, so we should decide on the prompt before collecting the FUDGE results

patent gull
#

yeah.. when i get to the office, i can try out some different prompts with CFG and see what seems to be working

fallow egret
#

@blissful garden anything else is needed from my side?

blissful garden
fallow egret
#

Ok, thanks!

patent gull
#

Aw man we missed an opportunity. Maybe we would've gotten higher scores if we named our paper "All you need is CFG for LLMs with applications in ChatGPT based on Diffusion"

patent gull
#

@fallow egret where did you find that model? is it a recommended one for sentiment analysis? the model card says it was trained on an "unknown dataset"

blissful garden
#

we can probably swap in a better one. Is there a standard one for sentiment analysis? I don't know much about this field.

patent gull
#

it's trained on IMDB, though, so ideally it is in-domain

fallow egret
#

Also in the paper they emphasize that the classifier should be shallow compared to the base model

patent gull
#

i see, ok SGTM, then

#

Yeah, it scored highly on IMDB, which is the dataset we're using

#

but just to be clear on the experiment:

at first I was thinking that we were going to use the same classifier that we use in CG to evaluate the outputs of both CFG and CG?

#

or do you think we should use a different classifier for evaluation?

fallow egret
#

Yes, I think it should be different than the CG model (stronger model)

patent gull
#

ok. i see the arguments for and against. If CG does badly with a different classifier, someone could just argue "well, you chose a purposely bad classifier"

fallow egret
#

But this is one of FUDGE's limitations... You can't use a strong model as the classifier

patent gull
#

ok cool, makes sense

#

also on the experimental design, I see that we are using CG just to make things "positive"?

fallow egret
#

Yes, this is why using the same classifier is completely unfair (the objective is to make it 'positive' according to this classifier)

patent gull
#

final_res.append(t['score'] if (t['label'] == 'LABEL_1') else (1 - t['score']))

i'm thinking that a more interesting objective would be to try to flip the label?

E.g. if the y_true is POSITIVE, then try to get y_pred to be NEGATIVE, and vice versa

#

because if you're taking the first 64 tokens as prompt, for all prompts that are already positive in those first 64 tokens, there's not much to be done, is there? and then we wouldn't really be differentiating between the two approaches, because they'd both look good

fallow egret
#

Yes, if the review is positive in the beginning it's not interesting, but most of the 'trimmed' reviews are neutral

patent gull
#

oh ok cool, good to know!! thanks

#

ok, then, i agree with your experiment. maybe we can even measure the \delta from prompt -> prompt + completion
i.e. p(POSITIVE | prompt + completion) - p(POSITIVE | prompt)
where p is the stronger classifier
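
The proposed metric could be sketched as follows. This is an illustration, not the experiment script: `classify` is a hypothetical callable returning a `{'label', 'score'}` dict, and treating `LABEL_1` as positive is an assumption carried over from the one-liner quoted earlier in the thread; the stronger classifier may use different label names.

```python
def positive_prob(pred):
    """Turn a {'label', 'score'} prediction into p(POSITIVE), assuming LABEL_1 means positive."""
    return pred["score"] if pred["label"] == "LABEL_1" else 1 - pred["score"]


def sentiment_delta(classify, prompt, completion):
    """p(POSITIVE | prompt + completion) - p(POSITIVE | prompt) under the evaluation classifier."""
    before = positive_prob(classify(prompt))
    after = positive_prob(classify(prompt + completion))
    return after - before


# Stub classifier for illustration only.
def stub_classify(text):
    if "great" in text:
        return {"label": "LABEL_1", "score": 0.8}
    return {"label": "LABEL_0", "score": 0.7}


print(sentiment_delta(stub_classify, "The film opens in Paris.", " It was great."))
```

A positive delta means the completion moved the text toward positive sentiment relative to the prompt alone, so prompts that are already positive contribute little.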

fallow egret
#

Yes, it's very good idea.

patent gull
#

ok i'll try to find a stronger classification model and will come up with a few prompts. helps to have CFG in huggingface now thanks to @versed flax 😉

fallow egret
#

I think that the prompted Flan-T5 is a valid classifier (and it has ~4x the parameters of the shallow model)

patent gull
patent gull
patent gull
fallow egret
patent gull
#

i did a small run with negative prompting and GPT2-medium

#

I found that the following negative prompt gave us the biggest increase in CFG:

#

"A bad movie review starts like this"

#

```
negative prompt                           guidance strength    delta
A bad movie review starts like this.      3                    0.019473
                                          4                    0.014441
                                          5                    0.031700
Bad review here.                          3                   -0.008154
                                          4                    0.000345
                                          5                    0.000045
Bad.                                      3                    0.006696
                                          4                    0.020785
                                          5                   -0.003599
This is terrible.                         3                    0.004726
                                          4                    0.007281
                                          5                    0.014920
Thus starts a terrible movie review.      3                    0.004966
                                          4                    0.020504
                                          5                    0.008596
To write something terrible, write this.  3                   -0.010605
                                          4                   -0.000713
                                          5                   -0.006353
```

but these numbers aren't huge, honestly. \delta is classifier(CFG output) - classifier(vanilla output).

So +.03 means that CFG with that negative prompt raised the sentiment score by ~0.03, i.e. about 3 percentage points on the classifier's 0-1 scale.
#

I can try with a positive prompt, too

#

these are over the first 200 examples in IMDB

loud adder
#

I had a migraine and stopped working yesterday, but please remind me to take a look at our draft response Tuesday (today) afternoon.

patent gull
#

will do!! I hope you feel OK

#

status is:

R3: I looked it over/edited, I feel like we're OK to respond ASAP on that, whenever you get the chance to look.
R1: I think @versed flax did the necessary experiments, we need to craft a response.
R2: maybe today/tomorrow we'll be done with the experiments and have the response ready

patent gull
#

positive prompting was a lot harder to achieve. in fact, CFG with most positive prompts, in most settings, negatively affected sentiment

```
pos_prompt                             guidance_strength
A good movie review starts like this.  0.10                -0.104267
                                       0.25                -0.066911
                                       0.50                -0.029078
                                       0.75                -0.032493
Great review here.                     0.10                -0.122568
                                       0.25                -0.111521
                                       0.50                -0.064845
                                       0.75                -0.046463
Great.                                 0.10                -0.098969
                                       0.25                -0.078565
                                       0.50                -0.049975
                                       0.75                -0.041289
This is great.                         0.10                -0.066222
                                       0.25                -0.029284
                                       0.50                -0.058933
                                       0.75                -0.030547
Thus starts a great movie review.      0.10                 0.074312
                                       0.25                 0.013816
                                       0.50                -0.035235
                                       0.75                -0.030811
To write something great, write this.  0.10                -0.141006
                                       0.25                -0.094684
                                       0.50                -0.066705
                                       0.75                -0.016745
```
#

I think we should try both positive and negative settings, though, for the experiment. We can prepend "Thus starts a great movie review." and "A bad movie review starts like this." for CG.

If anyone with access to a long-running compute cluster with some decent memory can run my script, that would be very appreciated!! here is my script:

versed flax
blissful garden
# patent gull

changed some code to shard the data across 8 GPUs and am taking this script for a spin on the cluster. There should be some files coming out when you wake up.
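
For reference, a round-robin split is one simple way such sharding could be done. This is only a sketch, not the modified script: `shard`, `num_shards` and `shard_id` are hypothetical names, and the dataset is represented here by a plain range of example indices.

```python
def shard(items, num_shards, shard_id):
    """Return every num_shards-th item starting at shard_id (round-robin split)."""
    return items[shard_id::num_shards]


# Illustration: split example indices across 8 workers, one per GPU.
indices = list(range(24_936))
shards = [shard(indices, 8, gpu) for gpu in range(8)]
print([len(s) for s in shards])
```

Each worker would then write its own output file, and the files can be concatenated afterwards.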

blissful garden
#

@patent gull I see you called model.generate(.... I got warnings that the max length is defaulted to 20 and the prompt is longer. Does this need to change or it is ok with the current setup?

blissful garden
#

ahh errored out... One sample got 800+ tokens and crashed the max length of that distilbert classification model lol

blissful garden
#

The full script is taking too long but I will just leave it running.
I will queue 2 more jobs specifically for first 5000 data points, one for negative with cfg 3, 4, 5, and one for positive with cfg 1, 1.25, 1.5, 1.75. When each cfg finished a csv will be saved (we can combine later).
Let's see how many files we get tomorrow when I wake up (or error out). Heading to bed.

patent gull
versed flax
patent gull
#

why do we care about the y=2x line, again?

versed flax
patent gull
#

so the blue dots are the minimum VRAM that the model needs + the KV cache for the maximum sequence that the model takes?

#

what is the right way to address the second part of their review:

There is no guarantee that the Eq.6 will obtain a legal probability with the probabilities of all possibilities summing up to 1.

they're right, and we're not doing special normalization in the LogitWarper. Does HF do normalization under the hood in the .generate() function? I think it must, if the user is doing top-p and top-k sampling as well
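
A small sketch of why the normalization concern goes away once the guided scores pass through a softmax. It assumes Eq. 6 is the usual log-space CFG combination and that sampling applies a softmax to the processed scores (which, as far as I know, is what the sampling path of `.generate()` does). The function names are hypothetical and the logits are made up.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cfg_distribution(cond_logits, uncond_logits, gamma):
    """Combine conditional and unconditional log-probs with strength gamma, then renormalise."""
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    guided = [u + gamma * (c - u) for c, u in zip(cond, uncond)]  # not normalised in general
    return [math.exp(x) for x in log_softmax(guided)]             # softmax restores sum == 1


probs = cfg_distribution([2.0, 0.5, -1.0], [1.0, 1.0, 0.0], gamma=1.5)
print(probs, sum(probs))
```

The intermediate `guided` scores are not log-probabilities on their own, but any real-valued score vector yields a valid distribution after the final softmax.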

versed flax
patent gull
#

right duh

#

for R2:

Compared with text-to-image generation, the optimal \gamma value in the language modelling seems to be small (<2), while large \gamma value leads to poor performance. Have any observations on it?

Maybe we can say that CFG is applied in autoregressive sampling at every step, so \gamma actually needs to be smaller, as it has a repeated impact

versed flax
patent gull
#

pixel range is -1;1

versed flax
#

It may also be: 3. The conditional and unconditional outputs are more different in text than image

fallow egret
#

I think it's more the nature of diffusion models: after a very small number of iterations, the difference between the conditional probability and the unconditional probability should be negligible

versed flax
#

This is a great explanation as well

#

We could see something similar with our paper as well as we sample more and more tokens

#

The continuation will be impacted less and less by the CFG'd tokens of the initial prompt

#

is this plot clearer? I changed the text and now I think it's much better

loud adder
#

I'm getting caught up on the rebuttal google docs now

patent gull
#

ok I will condense these explanations in the google doc

versed flax
#

Friends, I am not a native English speaker, therefore, I will not post the answers to the rebuttals before Alex / Stella proof reads them. Please, when you think an answer is good enough, post it. Let's not wait another round. It's been 5 days already. We're 50% in.

loud adder
#

@versed flax in the response to R1, it says

We have completed a memory analysis and will include our results in the paper. In general, we found a tradeoff between serving larger models, and serving a smaller model with CFG.
Is the updated paper going to be posted simultaneously with the reply? Or is that a to-do?

versed flax
loud adder
#

I'm tweaking the reply to reviewer 1 a little and otherwise think it's good

versed flax
#

awesome!

loud adder
#

I would add the memory experiments and the formatting fixes that R1 recommends now

#

We can tell the other reviewer that theirs is running and that we'll update when it's done

versed flax
#

Does the page size constraint still apply?

loud adder
#

Usually we get an extra page, it should say on the call for papers page

versed flax
#

good. I will check that then.

#

And maybe I will stop depending on you and Alex for the English and use ChatGPT instead lol

loud adder
#

I want to add this to the end of the discussion of the results

At a high level, this means that it depends on your use-case. For researchers or small scale deployments where people are using the largest model that they can fit on their GPU, it's better to use CFG. However for very large scale commercial deployments, it makes more sense to increase the size of the model. We further note that increasing the size of the model is not always possible: OpenAI probably doesn't have a version of GPT-4 that's twice as big sitting around.

versed flax
#

I love it!

#

crystal clear and wraps it up perfectly

#

(although they probably do since GPT-4 turbo is prolly a distilled version of GPT-4)

loud adder
#

Shhh

#

I'm also now curious how GPU size discreetness impacts this

versed flax
#

"discreetness"?

loud adder
#

No, the fact that GPUs come in fixed sizes: 16, 24, 40, 48, 80

versed flax
#

Ah!

#

then we could add new horizontal frontiers on the chart with GPU models

loud adder
#

Yeah

#

Models do too... though generally are spaced to double in size (6.7B -> 13B -> 20B -> 40B)
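
One way to picture the effect of fixed GPU sizes: a memory saving only matters when it moves a deployment across a size boundary. The sketch below uses the GPU sizes listed above; the 2 bytes per parameter and the omission of KV cache and activations are simplifying assumptions, and `smallest_gpu` is a hypothetical helper.

```python
GPU_SIZES_GB = [16, 24, 40, 48, 80]


def smallest_gpu(required_gb, sizes=GPU_SIZES_GB):
    """Smallest listed GPU size that fits required_gb, or None if nothing fits."""
    for size in sorted(sizes):
        if size >= required_gb:
            return size
    return None


for billions in (6.7, 13, 20, 40):
    weights_gb = billions * 2  # assumption: 2 bytes per parameter, weights only
    print(billions, weights_gb, smallest_gpu(weights_gb))
```

Under these assumptions a 13B model already skips the 24 GB tier, so whether the extra CFG cache line fits depends on the headroom left inside the tier rather than on the smooth tradeoff curve.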

versed flax
#

Depends on the family? I remember Chinchilla models not doubling every time but I may be mistaken here

loud adder
#

I also did a pass on the reply to R3

#

Don't quite love it but I think it's good?

versed flax
loud adder
# versed flax do I post it _now_?

I changed the review to say that the results were added to the paper and that we made the formatting changes they recommend. So make those changes and then it's good to go IMO

#

(A general principle at play here is that you should show that you've done what they want instead of promising that you will whenever possible)

versed flax
#

Then I'll try pulling that off tonight and posting the PDF and the answer at the same time

loud adder
#

So right now we are inconsistent in our replies to R2 and R3

#

We tell R3 that CFG for LMs is new

#

But acknowledge with R2 that it's not

#

Which position are we taking? We cannot take both

versed flax
# loud adder But acknowledge with R2 that it's not

As far as we are aware, the application of CFG to autoregressive language models is novel, as previously they had only been applied to non-autoregressive diffusion models in computer vision
that reads "new" to me

loud adder
#

Oh sorry. R2 tells us it's not

#

Misread that

#

I'm about to give a talk and have to run, but I can do a final pass before the submission this evening (in 4-ish hours)

blissful garden
versed flax
#

yes

loud adder
#

Ttyl

versed flax
#

good luck!

#

shine!

blissful garden
fallow egret
#

What guidance values you used?

blissful garden
#

oh shit this is just the baseline.....

#

job is still running

fallow egret
#

if it's 1 then it's not the baseline

#

this is the value that they used

blissful garden
#

yeah it's the 1

fallow egret
#

cool, so this is what they used in the paper

blissful garden
#

I have a for loop in my bash script so it's done 1 only. Tomorrow maybe 1.25

fallow egret
#

For a fair comparison we should run it with a few guidance scales.
But I'm not sure it's worth wasting too much time on it. In any case it's simply not a valid method

blissful garden
#

I have 1, 1.25, 1.5 and 1.75 in my script

#

we can tell R2 that it's running just like what Stella said. If we get 1.25 we can give them a teaser. But no need to wait for it to finish

#

yeah looks like that crazy 200 distilbert thing is a massive bottleneck. CFG barely made the whole thing slower

fallow egret
#

Yes, I agree. In any case the main point is to stress that theoretically the alternative is using CG (as in diffusion models); in LLMs it is also known as FUDGE. However, the problem is that in the context of LLMs you need to run the classifier on every combination of (state, next_token), which makes it impractical.
In the FUDGE paper, they resolve this issue by sampling the top 200 tokens and using a shallow classifier. From our experience, even when using a relatively shallow network (65M parameters), the running time is still more than an order of magnitude longer than CFG's, which makes this method impractical for many real-world use cases

blissful garden
#

can we just grab those 1 results and try CFG alone, possibly with neg prompts, and argue that CFG produces similarly controlled results?

#

they control the sentiment right?

#

if we do we get one quick chart to show and also makes our method stronger

fallow egret
#

Yes, I think it's fine. In addition to the last comment I wrote, we can add this small experiment, which also demonstrates that you don't get better results with FUDGE. But the main point is to emphasize the usability of CFG for real-world use cases

patent gull
#

hey just catching up. I will take a look at these results now

#

sorry, what is being pickled in these files? I just see lists of strings

#

can someone forward Elad's script, again?

blissful garden
patent gull
#

ah ok, so will just compare to the vanilla GPT generations

blissful garden
#

need to go to bed otherwise I can try it very quick.

patent gull
#

no problem

#

yeah i'm just wondering... I remember some old work about creating the ideal prompt, given a classifier, I'm trying to find it

#

i don't think it'll be directly useful in our case, but.. hmm

versed flax
#

FYI:

There will be a strict upper limit of 9 pages for the main text of the submission, with unlimited additional pages for citations. This page limit applies to both the initial and final camera ready version.

#

So I think I will add the memory analysis in the appendix

#

It's secondary to the contribution I would say

patent gull
#

Yes, I agree, I think it belongs next to the FLOPs analysis, and can be mentioned in the main body but explored more deeply in the appendix

#

just like the FLOPs

versed flax
#

Totes

patent gull
#

btw i'm reading through what you wrote to R1. very nice. I finally understand it 😭😭😭 hahaha

versed flax
#

Haha I'm glad

patent gull
#

also the plot looks so much better

versed flax
#

Yes the text finally makes sense

patent gull
#

what happened to the blue dots?

versed flax
#

obliterated. Not needed.

patent gull
#

one question: are we implicitly assuming that a model twice as large is as accurate as a model with CFG?

versed flax
#

yes

#

it's not implicit. It's something we kinda introduce in the paper.

patent gull
#

no, i know that haha, but I think we should reiterate that in the response

#

let me work that in

versed flax
#

oh ok

patent gull
#

For the chart, I have the following comments:

  • the "CFG" annotation can be more central, at 50%/50% of the plot, instead of off to the side
  • Can we change "CFG" -> "CFG wins"
  • "Vanilla" -> "Vanilla wins"

if you'd like to send me the code, I can play with the chart myself, whatever's easier.

versed flax
#

I can fix that in an instant

patent gull
#

ok great!

versed flax
#

I'm writing the appendix rn now

#

so it'll be done later unless it's needed now

patent gull
#

no problem/rush at all

#

i see you just copy/pasted. there's a typo "models bigger than 5G" -> "models bigger than 5B"

#

lol. we're not comparing cell phone service plans, here

versed flax
#

B=G tho 😭 (but yes, you're right)

patent gull
#

yeah but let's be consistent

unique sedge
#

Hello sorry for being awol. Had to go for my thesis submission and defense schedule to college, been busy in that. sorry for not being able to help.

patent gull
#

still running sentiment controlled NLG.. I wonder if we want to add a second controlled NLG attribute

#

specifically, R2 asked us to compare this to controlled NLG sota methods

fallow egret
# patent gull yeah i'm just wondering... I remember some old work about creating the ideal pro...

There is a simple recent work by DeepMind in which they provide a few examples of tuples <prompt, prompt_score on the data> and a meta-prompt that asks the model to propose an alternative prompt that will give the best score. You can iterate (by adding the result of the new suggestion).
It seems to work very nicely, and we can easily apply it to our use case to find the best prompt for CFG

patent gull
#

to GPT4, or something?

#

so the model basically infers based on what was working, what will work?

fallow egret
#

Yes, exactly

#

I'm now working on a few improvements to this method (like providing a few failure cases for each run). But it's still a work in progress, and their basic idea is nice if you provide good context about the task in the meta-prompt

loud adder
#

@fallow egret @patent gull are we good to post?

fallow egret
#

I think it's important to add, for each reviewer, 1 sentence at the beginning that stresses the main positive things they found in our paper (something like 'we are glad you find our...'); it's important again for the AC decision

patent gull
#

We already posted responses to R1 and R3.

For R2, we planned an experiment comparing CFG to a controlled NLG baseline, where we're controlling for sentiment.

I just got some good results from CFG. I'm comparing to SOTA baseline now.

I do wonder, though, if sentiment is enough. Ideally, we compare several different controlled NLG objectives. What do you think, @loud adder ?

#

Sentiment may be enough for an initial response to R2, but ideally if we're updating the paper, I'd feel better including more experiments on more controlled factors

fallow egret
#

lol, it was really uploaded with 😅
We hope this clarifies the points raised in your review. If you would please consider raising your score, we would really, really appreciate it!!

#

Ok, I think that at least when we see that the end of the review period is coming, we should add a comment for each reviewer which is much more formal and doesn't contain any promise of future changes (this is a direct reason for rejection). It should simply state that we modified the text and addressed all the concerns raised by the reviewer (list them).

patent gull
#

haha feel free to edit, but i've found it helps to ask, sometimes

fallow egret
#

I don't think that there is an option to edit responses

patent gull
#

yeah, there is... I edited @versed flax, the button is off to the side

#

btw, good news, good results from CFG vs. baseline CG

#

here's the delta increase in positive sentiment via CFG for a few settings/prompts:

```
Great movie review:     0.10                 0.075225
                        0.25                -0.136310
                        0.50                -0.015543
                        0.75                 0.034303
That was a good movie!  0.10                 0.364103
                        0.25                 0.312607
                        0.50                 0.192197
                        0.75                 0.044026
```

and here's the delta increase in sentiment via CG for the defaults that the authors used:

```
baseline_df['delta'].mean()
0.065204710023016
```
#

ideally we test a lot more values for guidance strength for CG, but it is SO SLOW to run.

Let me draft a response to R2, and then we can see whether it looks good, or whether we should do more experimentation

fallow egret
#

Amazing!! I think it's definitely enough material to address this point of his review

loud adder
loud adder
patent gull
#

sorry!! I didn't check the revision-history/didn't realize you had edited it out... and have been handling a lot of things today

loud adder
#

No worries

#

It's not a big deal, the language just seemed a little over the top

#

I was more concerned about whether this was a sign that an old draft was used (and other edits I made later I view as more important)

#

Did another pass over the two posted reviews.., they look good!

patent gull
#

the reposted paper is very good too, thanks to @versed flax

#

I'm almost done with R2... just have to answer that last question

#

ok R2 is done in the google draft, grabbing dinner now

blissful garden
#

Just woke up. @patent gull still need me to run some more tests on both fudge and your script? I can try more neg prompts in parallel with you guys and see if anything comes up.

versed flax
patent gull
#

It's positive guidance, not negative

#

Positive worked a lot better than negative

versed flax
#

confused_pikachu.jpg

#

R2 will be quite unhappy that we run yet another method with yet another gamma

patent gull
#

but yeah we generally don't have a good answer for guidance strength and what works and what doesn't

versed flax
#

that means we interpolate between both prompts, thus reducing specificity to the user prompt

patent gull
#

for negative/positive prompting, it also means we're emphasizing more/less of the negative prompt

#

$(1 - \gamma) p(w_i | w_{<i}, \hat{c}) + \gamma p(w_i | w_{<i}, c)$

for $\gamma \in [0, 1]$, you're mixing part of $\hat{c}$ with $c$
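
Written out in log space, which is an assumption here about how the combination is applied (the message above mixes probabilities directly), the same weighting reads as follows, with $\hat{c}$ the negative prompt, $c$ the user prompt, and $Z$ a normalizing constant:

```latex
\begin{aligned}
\log \hat{p}(w_i \mid w_{<i})
  &= (1-\gamma)\,\log p(w_i \mid w_{<i}, \hat{c}) + \gamma\,\log p(w_i \mid w_{<i}, c) - \log Z \\
  &= \log p(w_i \mid w_{<i}, \hat{c})
     + \gamma\,\bigl[\log p(w_i \mid w_{<i}, c) - \log p(w_i \mid w_{<i}, \hat{c})\bigr] - \log Z
\end{aligned}
```

At $\gamma = 0$ only $\hat{c}$ is used and at $\gamma = 1$ only $c$; values in between interpolate, and $\gamma > 1$ extrapolates past $c$, away from $\hat{c}$.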

blissful garden
#

oh crap the FUDGE job got preempted....

#

so we just got guidance = 1

patent gull
blissful garden
patent gull
#

ok i'll post our response

versed flax
#

The answer to R2 is absolutely fabulous

#

Congrats guys

patent gull
#

should we re-ping the reviewers on the OpenReview comment threads?

#

I don't know how ICLR works

#

in ACL, the ACs started encouraging reviewers to respond

loud adder
#

Yeah that's probably a good idea

#

Thank you for your review! We were wondering if you were planning on updating your score to reflect our reply or if there were any additional questions you'd like us to answer.

patent gull
#

ok, updating now! thanks for the great text!

loud adder
#

Unfortunately it looks like we won't be accepted to ICLR unless a miracle occurs. Because other papers' reviewers have been responding, our paper has fallen to roughly the median review score.

https://x.com/shaohua0116/status/1728158662265340047

This doesn't mean the paper isn't good, it means we got unlucky. Peer review is a crapshoot and sometimes it takes three submissions to get the right luck.

The next venues are ICML and ACL, both of which have deadlines in January. I think we're good to submit as-is (after changing the format and making sure we fit within the length reqs), but if people want to improve the paper more we can have a meeting in December and discuss options.

blissful garden
#

COLM is probably also something to think of (and maybe have a better chance of being properly evaluated by qualified experts)

loud adder
#

True, I hadn't considered that.

fallow egret
#

What about some ICLR workshop? I think at this stage it's going to be hard to get accepted (since indeed many papers have come out with the same sampling modification). On the other hand, with such a good experimental section it will be very easy to get accepted to a workshop

loud adder
#

Partially it's a question of what @versed flax's goals are

versed flax
#

I am having a hard time having a relevant opinion, I don't have publishing experience and what just happened makes me question the chance of getting this paper through a high impact conference

blissful garden
#

My take was that our paper wasn't really judged by the right people this time. A workshop has the advantage of being specialized to the right domain. I had a good experience with that earlier this year but my sample size was 1 😂.
Trying ICML and ACL has the benefit of prestige. If it gets accepted, it's for example an entry ticket to a job interview, or a dream-come-true moment for @versed flax if I remember correctly. COLM is probably like betting on a super young venue. But in my own field a lot of young journals run by competent experts did rise up extremely quickly, and carried along the mediocre papers (others' and mine) that got published in them.

I have no idea whether a resubmission gets a lower chance or not. At least in math nobody cares how many times you submitted before.

versed flax
#

What about Elad's take that the paper gets older?

blissful garden
# versed flax What about Elad's take that the paper gets older?

yeah this is what I don't know about. How much would resubmission hurt the publication chance in ML?
I mean the paper did come out in parallel with a bunch of others doing a similar sampling method. It just got submitted late, but they shouldn't judge it on when you submit

fallow egret
#

I just want to stress that the resubmission is not an issue (as @loud adder wrote, it's very common to try a few times until getting accepted). The problem is that there are currently many papers with the exact same method, and I'm guessing that some of them got accepted to some tier-1 conference. For ICLR it was still a borderline case, but now it's going to be very hard to defend the novelty claim (which automatically reduces the score to <6).
From a prestige point of view, getting accepted to a good workshop is not the same as getting accepted to the main conference, but I think it's also good for the resume.

In any case, for sure it's your decision only. Whatever you will decide I will be available to help also in the next submissions

fallow egret
blissful garden
# fallow egret Unfortunately it is judge according to the submission time (since it is a blind ...

In theory you are right. But technically resubmission is at least a 6 month delay. If your work is an important work, people must have talked about it, cited or used it and things get old easily in ML. If there is a perfect isolation between submission and the original arxiv, it automatically becomes "not novel" because this "anonymous submission" is older than your own preprint and reviewers are not allowed to draw connections between these two

#

Basically this is not enforceable because a perfect execution is saying "good works cannot resubmit".

I'd rather guess that in reality reviewers secretly look up and know what date this paper came out and who wrote it. If it's truly an original work when it comes out, they just don't mention "novelty". If it's obviously a copycat of another method with a significant time difference, they cite this "novelty" issue.

loud adder
#

We can ping the ACs if this becomes a serious issue

#

That's what we did with trlX, when we claimed we were the first people to do something and one reviewer came back and was like "what about trlX?"

versed flax
#

You all have more experience than me. If I make a decision, it will necessarily be less informed than any of yours. My goal is to maximize impact & recognition, but I'm not ready to take risky bets and risk losing it all

patent gull
#

sorry for the delay here, I missed a lot of this discussion.

I have a few thoughts:

  1. Huge bummer and yes more a reflection of randomness than actual goodness-of-fit. ICLR has a crazy mean-tendency bias... a 1-point difference in any reviewer's score would've totally changed our outlook.

  2. I think this is worthy of a conference paper, given the amount of different angles we bring together: conventional benchmarks, cot, memory/compute analysis, assistants, etc. I'm willing to be overruled on this point, but I think it's more than a workshop paper.

  3. In my opinion, it doesn't matter as much that other people have done this sampling modification, as they have really focused on specific cases. Also, the current reviews have undeniably made this paper a lot stronger. My gut is that we need to do a better job highlighting our novelties into the introduction, in essence: introduction = \gamma * our paper + (1 - \gamma ) (other papers). wait a minute.... that looks familiar....

  4. That being said, I have concerns about ACL since (a) I don't know that people in that conference care as much about compute/memory evals as they do in NeurIPS or ICLR (b) the paper format for ACL is much different and smaller, we would have to cut a lot of stuff or move stuff into the appendix. Which might not be terrible; we might indeed have too much introductory maths. But still, it's going to be considerable work to reformat for ACL.

patent gull
#

My gut is that we submit at least one more cycle. We got very helpful reviews that got to some core weaknesses in our paper, and we addressed them. The paper is stronger as a result; the review cycle worked.

None of the reviewers seemed to care about other NLP papers that did CFG-like sampling. The criticism was the comparison to CFG in vision, which was fair, @versed flax was directly inspired by vision, so it's a very fair criticism. So, we do a better job of highlighting our response to R2.

In my mind, the tradeoffs:

conference:

  • pros: gives the paper more credibility and standing.
  • cons: possibility of another rejection.

workshop

  • pros: gets the work out there, at least.
  • cons: variance in quality in workshops is HUGE. Paper has less credibility, in my mind.

I think COLM might be pretty cool to consider. I looked up the dates and ICML reviews will become available before the COLM deadline. So there's the possibility that we submit to ICML and then if we get bad scores, fix and submit to COLM. ACL reviews won't be available in time for COLM. On the plus side for ACL, the reviewers there tend to write a LOT more, and actually respond to rebuttals, but 🤷‍♂️ the timing and venue don't seem optimal to me

patent gull
#

should we say something? I don't know how these channels usually work:

#

@here private comment period ends today

#

we can easily say "we responded to everything including with new experiments and haven't heard back." I just don't know what is typically acceptable for ICLR. For instance, in *CL conferences, we're advised to do this only as a last resort, if we suspect serious ethical issues on the reviewers' part

blissful garden
#

yeah no idea how to do this. With 7000+ submissions I bet a lot of other people also said that they didn't hear back from reviewers

versed flax
#

do we risk something by doing so?

#

if not, the expectation is strictly positive

loud adder
#

That's my thinking, yeah

versed flax
#

This is the mechanism to communicate to the AC any unresolved discussion points (if you do not have any unresolved discussion points, there is no need to send a private comment).
I mean, it seems to be exactly our case?

loud adder
#

Yes

#

Send:
Dear AC,
We have tried to engage with the reviewers, but unfortunately none of them responded to our rebuttal. As far as we can tell we have given compelling responses to all reviewers that would warrant a reconsideration of their initial scores.

versed flax
#

copy paste, send?

patent gull
#

I would say exactly that except “compelling responses” -> “compelling responses, including two requested analyses, to all reviewers”

#

@versed flax do you want to send?

versed flax
patent gull
#

Either/or, I don't care

#

Ok let me do it before I lose service then, im on a train

#

alright let me see

versed flax
#

FYI

To write a private comment to the ACs, you can simply go to your submission on OpenReview, and write a new comment. The allowable readers are ACs, SACs, and PCs.

patent gull
#

"Dear Area Chairs,

We are writing to let you know that we tried to engage with the reviewers, but unfortunately none of them responded to our rebuttal. As far as we can tell, we have given compelling responses to all reviewers, including with analyses that we incorporated into a new draft of the paper, that would warrant a reconsideration of their initial scores.

We summarize the major unresolved discussion points here:

  • A memory cost analysis is recommended (Reviewer 3Gz2): We have completed a memory analysis and have included our results in a paper update. In general, we found a tradeoff between serving larger models, and serving a smaller model with CFG; in our analysis we identify the optimal tradeoff point across model sizes and VRAM. Please see Section 4 and Appendix B.3

  • Comparison to other baseline controllable NLG tasks (Reviewer RjYY): We have completed this comparison. The baseline Classifier-guided control increases sentiment by .065 points, whereas CFG (our method) increases by .312 points. Additionally, the baseline is very slow: it is >100x slower than CFG.

  • Lack of novelty (Reviewer YQBo): with respect, we argue that the application of CFG to autoregressive language models is novel, as previously it had only been applied to non-autoregressive diffusion models in computer vision. This is not a trivial adaptation, and the core contribution of our paper is to adapt and rigorously test this across a wide range of different prompting techniques to prove its validity.

We very much appreciated the reviewers' points, and they undeniably made the paper stronger. We had hoped for a robust debate.

We additionally would like to report that we strongly believe that Reviewer YQBo did not fully grasp the point of the paper, as they seem to be under the impression that CFG involves further model training, which it DOES NOT.

We would be very appreciative if you took these points into consideration in your review.

versed flax
#

fire

patent gull
#

ok I'll wait 20~ min for other people @here to read

#

and then send. if you don't hear my ACK, assume that I don't have service

patent gull
#

sent

versed flax
#

: 🔥

blissful garden
patent gull
fallow egret
#

This is the final version?

patent gull
#

Near final

#

Will probably do some more editing before I submit

fallow egret
#

Overall it looks fine, it's just important to make sure that it is exactly 8 pages (currently there are 3 missing lines)

patent gull
#

i didn't realize that was a stipulation to be exactly 8 pages, no less! but i will return to flesh out the discussion section a little bit better anyway

#

so i'll make it work

fallow egret
tepid gazelle
#

Hey @versed flax or others on the CFG paper, we're using CFG as a baseline for a new project and I had a question about the merged HF implementation of CFG which I thought you might know the answer to:

Is the prompt being conditioned on by HF generate() the entire input sequence (and does this stay static / you don't add new generated tokens to this extra-conditioned prompt as you go on?) I think the answer to this is yes but wanted to confirm.
and also, is there a way to pass settings to HF generation such that only a sub-prefix of the initial input sequence is more strongly conditioned on?

we'd like to be able to pass "<Instruction1>..... <context here>" as input, and only condition on <Instruction1> when generating further output from the model

#

Thanks!

versed flax
tepid gazelle
versed flax
#

yes

tepid gazelle
#

gotcha

versed flax
#

If you want to do what you say (which is exactly the same as Context-Aware Decoding), you want to use instr+ctx as positive and instr as negative
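
A minimal sketch of that positive/negative setup on a single decoding step. It assumes the guided score is `neg + guidance_scale * (pos - neg)` computed on log-probabilities, which is how I understand the merged HF logits processor to work, but that is an assumption to check against the version you use. The helper names and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def softmax(scores):
    """Turn unnormalized log-scores back into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cfg_scores(pos_logits, neg_logits, guidance_scale):
    """Assumed CFG rule: neg + scale * (pos - neg), in log-prob space."""
    pos = log_softmax(pos_logits)
    neg = log_softmax(neg_logits)
    return [n + guidance_scale * (p - n) for p, n in zip(pos, neg)]


# Hypothetical next-token logits over a 3-token vocabulary.
pos_logits = [1.0, 2.0, 0.5]  # model fed instr + ctx (positive prompt)
neg_logits = [1.0, 0.5, 0.5]  # model fed instr only (negative prompt)

plain = softmax(cfg_scores(pos_logits, neg_logits, 1.0))   # no guidance
guided = softmax(cfg_scores(pos_logits, neg_logits, 1.5))  # extra weight on ctx

print([round(p, 3) for p in plain])
print([round(p, 3) for p in guided])
```

With a scale of 1 this reduces to the positive-prompt distribution; above 1, tokens that the context makes more likely (index 1 here) gain probability. In `transformers`, I believe the matching `generate()` arguments are `guidance_scale` and `negative_prompt_ids`, but treat those names as something to verify in the docs for your installed version.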

tepid gazelle
#

I see, thanks!

#

Ah yeah going by their abstract that does sound like what we want. Thank you, appreciate it!

versed flax
#

You're very welcome :)

fallow egret
#

What a strange rejection 😦 just gave 4 without any concrete reason (besides typo)

versed flax
#

like "lol dude why 4 just because of typos??"

fallow egret
#

I think we should stress the novelty (which is the main weakness according to the other reviewers).
Let's hope he will either change his mind after he sees the other reviews (his confidence is only 3), or hope that the AC will kick him (this could happen with very high probability)

fallow egret
#

Seems like we are above the threshold now 🤞 👀