Evaluating Classifier-Free Guidance impact | EleutherAI | Page 4

patent gull Jun 30, 2023, 4:52 AM

#

well hold on... probably not the right call to plot 2 tasks on the same axis

#

i'm just thinking

#

what i'm seeing actually is that Guanaco has an acc peak at 1.5

#

and Wizard has an acc peak between 1.1-1.25

fallow egret Jun 30, 2023, 4:53 AM

#

Yes, they have peaks in different places

patent gull Jun 30, 2023, 4:54 AM

#

the % invalid min-regions look a lot wider

#

i wonder if there's such a thing as grouped-line chart...

fallow egret Jun 30, 2023, 4:55 AM

#

I have to go for ~15 minutes, sorry...

patent gull Jun 30, 2023, 4:55 AM

#

ok no problem

fallow egret Jun 30, 2023, 5:10 AM

#

I'm back. In my opinion, the two charts that include everything definitely demonstrate the two trends (the acc and the invalid). It's a little bit strange to mix both datasets and models on the same graph, but it might be a valid option if we want to emphasize the results and put all of them in the main paper.
Another option is to split it in two and put one of the datasets/models in the appendix (I don't think it's that bad, not everything should be in the main paper. People read the appendix, especially when the paper references the appendix figure)

patent gull Jun 30, 2023, 5:15 AM

#

yes, i kinda think we should split

#

ultimately it's up to you but i think maybe sticking with gsm8k is a good idea since it's like you said, more important

#

i redid them so that accuracy and invalid are grouped, that way we can have a real ylabel

#

still have a lot of vertical white space, which i'm not happy about but 🤷‍♂️

fallow egret Jun 30, 2023, 5:17 AM

#

Yes, they are not on the same scale, and I don't think we can do anything about it (either the acc and invalid don't have the same scale, or the model performances don't). Note that you have typos in the name.

patent gull Jun 30, 2023, 5:18 AM

#

we can do a broken axis:

#

https://matplotlib.org/3.1.0/gallery/subplots_axes_and_figures/broken_axis.html

#

but it's easily missed by readers thus not a favorite technique imo

fallow egret Jun 30, 2023, 5:19 AM

#

patent gull but it's easily missed by readers thus not a favorite technique imo

Yes, I also find it very confusing

patent gull Jun 30, 2023, 5:20 AM

#

it's very late for me. can i send you the finished files and can you put them into the paper? do you know how to use subfig?

fallow egret Jun 30, 2023, 5:20 AM

#

np, can you send me also the code?

#

oh, these are the finished figures

patent gull Jun 30, 2023, 5:21 AM

#

I will send you files

#

these are screenshots

#

i will send you code too in case you're curious

#

thank you

fallow egret Jun 30, 2023, 5:22 AM

#

patent gull i will send you code too in case you're curious

np, thank you 🙂
Just note the typos in the model names (and parameters)

patent gull Jun 30, 2023, 5:22 AM

#

yup fixing that now then gonna send over

#

ok these are the figs

#

i made gsm smaller so it could go in the main body

#

📎 Untitled.ipynb

fallow egret Jun 30, 2023, 5:26 AM

#

Wizard LM is 30B

patent gull Jun 30, 2023, 5:27 AM

#

whoops

#

ok

fallow egret Jun 30, 2023, 5:28 AM

#

🙌 so I'm putting one in the main paper and one in the appendix?

patent gull Jun 30, 2023, 5:29 AM

#

i think so?

#

i think 4 plots would be too much info in the main body

#

i also think squeezing two tasks into one plot isn't great

fallow egret Jun 30, 2023, 5:30 AM

#

I agree, great. So I will modify the paper accordingly

patent gull Jun 30, 2023, 5:30 AM

#

great, thank you so much man

#

i'm out for the night✌️

fallow egret Jun 30, 2023, 5:31 AM

#

thank you, it looks much better now

#

good night

patent gull Jun 30, 2023, 5:31 AM

#

we had to do something about those plots haha

#

but we did

fallow egret Jun 30, 2023, 5:51 AM

#

Ok, I think it looks good! please review when you wake up 🙂
(please also go over the captions, I changed them yesterday according to the feedback)

versed flax Jun 30, 2023, 10:04 AM

#

a lot happened while I was asleep!

#

The new plots are cool, they really show the trends

#

So, we have 8h before ArXiv submissions close today

#

Once you're ready for release, please 👍 this message :) @loud adder @patent gull @blissful garden @unique sedge. I will send the paper either 2h before or when I get your validations, whichever happens first.

fallow egret Jun 30, 2023, 10:11 AM

#

versed flax So, we have 8h before ArXiv submissions close today

You should take a big buffer because exporting from overleaf to arxiv can be very exhausting and might take time

versed flax Jun 30, 2023, 10:12 AM

#

That's why I'll do it the very second I get everyone's go

patent gull Jun 30, 2023, 1:25 PM

#

Just waking up. There was one word in the acknowledgments that I don’t remember

#

But it was used a lot

#

And I didn’t know what it meant

#

Ah yes, what do you mean by “redactor”?

versed flax Jun 30, 2023, 1:27 PM

#

patent gull Ah yes, what do you mean by “redactor”?

wait, that's not a word 😆 ?

#

"writer" then?

patent gull Jun 30, 2023, 1:28 PM

#

Also, what were the comments from @loud adder ‘s two people she was showing it to?

#

Anything helpful?

#

Sure writer/editor

versed flax Jun 30, 2023, 1:28 PM

#

Definitions of redactor
noun someone who puts text into appropriate form for publication
yeeee!

patent gull Jun 30, 2023, 1:28 PM

#

???

#

Wow I’ve never heard that word before

versed flax Jun 30, 2023, 1:28 PM

#

https://www.vocabulary.com/dictionary/redactor#:~:text=Definitions of redactor,into appropriate form for publication

patent gull Jun 30, 2023, 1:28 PM

#

“To redact” means to remove something from a text

#

So I guess it’s a view of writing in the negative lol, but I’ll take it, it sounds fancy!!

versed flax Jun 30, 2023, 1:29 PM

#

let's go with "writer" then. If it's confusing to you, it will be confusing to a lot of people

versed flax Jun 30, 2023, 1:30 PM

#

patent gull Also, what were the comments from @loud adder ‘s two people she ...

no feedback has been communicated to me

patent gull Jun 30, 2023, 1:31 PM

#

Haha 🤷‍♂️ yeah…

versed flax Jun 30, 2023, 1:32 PM

#

but there was a "nsaphra" reading the paper yesterday

patent gull Jun 30, 2023, 1:32 PM

#

versed flax no feedback has been communicated to me

Bummer :/ would’ve loved some additional feedback

#

Lololol

#

Maybe nsaphra made some changes

versed flax Jun 30, 2023, 1:32 PM

#

nope

patent gull Jun 30, 2023, 1:32 PM

#

Btw how did the paper drop down to <10 pages?

versed flax Jun 30, 2023, 1:33 PM

#

patent gull Btw how did the paper drop down to <10 pages?

better figures layout AFAIK

patent gull Jun 30, 2023, 1:33 PM

#

Gotcha

#

Cool!

versed flax Jun 30, 2023, 1:33 PM

#

(aka: Stella LaTeX magic)

loud adder Jun 30, 2023, 1:37 PM

#

NeurIPS also uses 1.5” margins which are quite large. Since we’re just using their template rather than submitting to the venue I edited the style file to use 1” margins

versed flax Jun 30, 2023, 1:38 PM

#

versed flax Once you're ready for release, please 👍 this message :) @loud adder ...

up

loud adder Jun 30, 2023, 1:47 PM

#

So the way it works is that you can resubmit as many times as you like in the next 4 hours (until 1400 ET / 1800 UTC) and it’ll go live at the same time. After that it gets pushed back a day though.

versed flax Jun 30, 2023, 1:49 PM

#

yes, though I'd be happy not submitting a bazillion times bc we're fixing punctuation lol

patent gull Jun 30, 2023, 1:49 PM

#

loud adder NeurIPS also uses 1.5” margins which are quite large. Since we’re just using the...

Ahhhh that makes sense

#

Alright. As soon as I get to the office I’ll give it another read, but I’ll only change anything if I see something major

unique sedge Jun 30, 2023, 3:06 PM

#

versed flax Once you're ready for release, please 👍 this message :) @loud adder ...

love how the paper has turned out. Good luck in submitting and congrats!

patent gull Jun 30, 2023, 3:40 PM

#

uh oh ok not a big deal but be prepared for a resubmit

#

we never said what Figure 1 displayed 😂😂😂😂

#

how much time do i have?

versed flax Jun 30, 2023, 3:40 PM

#

1h max

#

30 minutes preferred

patent gull Jun 30, 2023, 3:40 PM

#

ok

versed flax Jun 30, 2023, 3:41 PM

#

patent gull

I WROTE SOMETHING THERE

#

WHERE IS IT

patent gull Jun 30, 2023, 3:41 PM

#

idk man

#

it belongs in the intro anyway

#

that's too far down to be intro-ing Figure 1 for the first time

versed flax Jun 30, 2023, 3:41 PM

#

all right

patent gull Jun 30, 2023, 3:49 PM

#

alright good

#

signed off

versed flax Jun 30, 2023, 3:51 PM

#

awesome!!

patent gull Jun 30, 2023, 3:54 PM

#

(just triple-checking all the figure captions)

#

ok great

#

i'm logging off overleaf otherwise I'm gonna drive you and myself crazy

#

overall, a million thumbs up 👍👍👍👍👍

#

this paper came out so well, had so many unique parts, and tied together really nicely at the end

#

it's a great paper, really foundational. We're in a different ballgame from CAD at this point

versed flax Jun 30, 2023, 3:58 PM

#

Well, it's time :)

#

Submission time \o/

patent gull Jun 30, 2023, 4:04 PM

#

ok done

#

good

versed flax Jun 30, 2023, 4:12 PM

#

uh, I need "endorsement" bc I never published in cs.CL

#

The code is 7MN9HQ

#

@patent gull it seems you can endorse me

loud adder Jun 30, 2023, 4:15 PM

#

@versed flax I can never find the page to endorse a paper… there should be an option to send an email

#

Feel free to send it to me

versed flax Jun 30, 2023, 4:16 PM

#

loud adder @versed flax I can never find the page to endorse a paper… there should...

It seems you can just click on this link:

https://arxiv.org/auth/endorse?x=7MN9HQ

fallow egret Jun 30, 2023, 4:37 PM

#

versed flax It seems you can just click on this link: > https://arxiv.org/auth/endorse?x=7MN...

Done

versed flax Jun 30, 2023, 4:37 PM

#

Thank you! it worked!

#

! Package natbib Error: Bibliography not compatible with author-year citations.

#

trying to solve it

patent gull Jun 30, 2023, 4:49 PM

#

now you're doubly endorsed

#

there's a fix for this, hold on... i know there's some github package that just fixes this for you

#

magically

versed flax Jun 30, 2023, 4:52 PM

#

I don't get why it complains about author-year, the neurips template uses numbers

patent gull Jun 30, 2023, 4:52 PM

#

hmm

#

overleaf unfortunately does fix a lot of things under the hood

#

do you have a local latex install?

versed flax Jun 30, 2023, 4:53 PM

#

yeah

patent gull Jun 30, 2023, 4:53 PM

#

sometimes i've had to go through that a bunch of times to make sure it works

#

overleaf is magical in a lot of ways

#

ugh i wanna find this github package

#

maybe try this?
\usepackage[numbers]{natbib}?

fallow egret Jun 30, 2023, 4:56 PM

#

Yes, it's really a nightmare to export from overleaf to arxiv

versed flax Jun 30, 2023, 4:57 PM

#

Thank you Elad for telling me to save some time for the submission

#

🙏

fallow egret Jun 30, 2023, 4:58 PM

#

I hope it's enough time 🤞

versed flax Jun 30, 2023, 4:58 PM

#

patent gull maybe try this? ` \usepackage[numbers]{natbib}`?

it worked!

fallow egret Jun 30, 2023, 4:59 PM

#

It was compiled and submitted?

versed flax Jun 30, 2023, 5:00 PM

#

compiled, yes

#

I'm filling the forms

fallow egret Jun 30, 2023, 5:00 PM

#

Amazing 🙌 this was quick

patent gull Jun 30, 2023, 5:00 PM

#

cool!

versed flax Jun 30, 2023, 5:02 PM

#

Do we want to fill any of this?

(attached screenshot)

fallow egret Jun 30, 2023, 5:03 PM

#

I don't think so

versed flax Jun 30, 2023, 5:07 PM

#

Friends, it's party time!

#

https://tenor.com/view/dancing-unicorn-unicorn-chubbicorn-chubbiverse-rave-party-gif-25145066

#

Thank you everyone! It's been a blast. Next stop: Sunday 6pm UTC for a bit of advertising, and we'll talk about conference submission later :)

patent gull Jun 30, 2023, 5:16 PM

#

woooooooooooooooooooooooooooooooooooooooooooooooooooooo!

#

let's take a nice long breather, now

#

wow

loud adder Jun 30, 2023, 5:19 PM

#

Congrats y’all

sand mesa Jun 30, 2023, 6:47 PM

#

yay!

versed flax Jun 30, 2023, 9:00 PM

#

FYI:

Your article is currently scheduled to be announced at Mon, 3 Jul 2023 00:00:00 GMT.
Updates before Fri, 30 Jun 2023 18:00:00 GMT will not delay announcement.

stone umbra Jul 1, 2023, 12:52 PM

#

🎇 🎉 congrats on finishing!

versed flax Jul 3, 2023, 12:25 AM

#

https://arxiv.org/abs/2306.17806 here it is folks!

arXiv.org

Stay on topic with Classifier-Free Guidance

Classifier-Free Guidance (CFG) has recently emerged in text-to-image
generation as a lightweight technique to encourage prompt-adherence in
generations. In this work, we demonstrate that CFG can be used broadly as an
inference-time technique in pure language modeling. We show that CFG (1)
improves the performance of Pythia, GPT-2 and LLaMA-famil...

loud adder Jul 3, 2023, 12:28 AM

#

LMK when you tweet about it and I’ll retweet it from the EleutherAI account

versed flax Jul 3, 2023, 12:29 AM

#

Well then I'll do it right now :)

#

https://twitter.com/Vermeille_/status/1675664118500454400 done!

#

wait maybe I should add Fig 1 to the tweet?

loud adder Jul 3, 2023, 12:37 AM

#

I generally find more success and engagement with tweets that walk you through the highlights of the paper. I would add a couple more, drawing out particularly interesting figures and talking about them a bit?

versed flax Jul 3, 2023, 12:38 AM

#

All right. Let me give it a try. That's a first for me.

loud adder Jul 3, 2023, 12:40 AM

#

Here are a couple examples:

https://twitter.com/AiEleuther/status/1660811179239849986?s=20

https://twitter.com/BlancheMinerva/status/1650503734085009408?s=20

https://twitter.com/BlancheMinerva/status/1643411683858169861?s=20

https://twitter.com/iScienceLuvr/status/1663366898577477633?s=20

versed flax Jul 3, 2023, 1:11 AM

#

loud adder I generally find more success and engagement with tweets that walk you through t...

https://twitter.com/Vermeille_/status/1675664118500454400 how did I do?

fallow egret Jul 3, 2023, 1:23 AM

#

Is it allowed?

#

At least for ICCV and CVPR (until last ban decision), it was not allowed (as authors) to publish on social media

versed flax Jul 3, 2023, 1:25 AM

#

We have no conference in sight, so...

long shell Jul 3, 2023, 2:06 AM

#

Did anyone happen to look at the predictions on the lambada val set? I'm curious what sort of incorrect responses CFG is fixing

blissful garden Jul 3, 2023, 3:00 AM

#

versed flax https://twitter.com/Vermeille_/status/1675664118500454400 how did I do?

What is that "@Halocene" in the middle

fallow egret Jul 3, 2023, 6:29 AM

#

blissful garden What is that "@Halocene" in the middle

blissful garden Jul 3, 2023, 6:30 AM

#

fallow egret

Ooooh haha my memory about the example stays at our first one

versed flax Jul 3, 2023, 7:25 AM

#

GUYS WE GOT RETWEETED BY JEREMY HOWARD

versed flax Jul 3, 2023, 8:17 AM

#

long shell Did anyone happen to look at the predictions on the lambada val set? I'm curious...

I don't think we did

wheat zenith Jul 3, 2023, 8:17 AM

#

As an AI model user, I hope it's okay just to drop to the Discord here just to post: I love love this work so much. No kidding, two days ago I was lamenting "What is wrong with this world, why don't LLMs have negative prompts." https://news.ycombinator.com/item?id=36537845 and then POOF. The world is right again.

#

Only constructive comment I might contribute is on the concept of negative guidance, as a user who prompts. Weird to imagine the idea in text LLMs, yeah. But what about audio LLMs?

#

To me it doesn't seem that strange in an audio LLM like musicgen. Since musicgen has a CFG-like var, out of the box negative CFG could output music I plausibly considered vaguely like the opposite of my text prompt. In this case even without a positive prompt, just the unconditional and negative only, since I hadn't modified it yet. (The range of negative CFG that produced normal sounding but different music was quite narrow and fiddly, typically something like -.2 to -.3, and changed for every prompt, so hard to use though.)

#

I've been trying to bang two rocks together to make negative guidance work in a TTS LLM, and now I feel so much less crazy that this exists. It doesn't quite make as much sense there, but it will be fun at least. (I think about it maybe like a director showing an actor a scene, and then being like, "Ok you see that? I want the opposite of that.")

versed flax Jul 3, 2023, 8:20 AM

#

wheat zenith As an AI model user, I hope it's okay just to drop to the Discord here just to...

I'm so glad to be the "POOF" in your world haha. If you experiment with negative prompting, please let us know. It's a bit more challenging than with diffusion models since the sampled text gets appended to the neg prompt as well, and it's hard to write a neg prompt that makes sense with its opposite continuation

#

I think there's a wording trick but I couldn't find it

wheat zenith Jul 3, 2023, 8:24 AM

#

I can't actually follow the math or the fundamentals to know if what I did was like this idea, but I did try using a bunch of other generated samples in a way that seems similar. I took one voice, I found some kind of difference between that voice and 100 random english audio samples. Just counting token frequencies. So the idea is you have the tokens in that voice, that are unique, but not just 'human speech' -- and then you flip the sign on those, and penalize them in the sampler. It's like an anti voice. Not sure it makes sense!
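A rough sketch of the "anti voice" idea as described here, in plain Python. This is an assumption-laden illustration, not the code that was actually run: the function names, the `strength` knob, and the use of relative frequencies are all hypothetical choices.

```python
from collections import Counter


def relative_freqs(tokens):
    """Map each token id to its share of the sequence."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def anti_voice_penalties(voice_tokens, reference_tokens, strength=1.0):
    """Penalty per token id, positive where the voice over-uses a token
    compared with the generic reference samples (hypothetical scheme)."""
    voice = relative_freqs(voice_tokens)
    reference = relative_freqs(reference_tokens)
    penalties = {}
    for tok, freq in voice.items():
        excess = freq - reference.get(tok, 0.0)
        if excess > 0:
            penalties[tok] = strength * excess
    return penalties


def apply_penalties(logits, penalties):
    """Subtract the penalties from a list of logits indexed by token id."""
    return [score - penalties.get(i, 0.0) for i, score in enumerate(logits)]
```

Tokens that are common in both the voice and the reference (plain "human speech") get no penalty; only the tokens that set the voice apart are pushed down in the sampler.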

versed flax Jul 3, 2023, 8:24 AM

#

wheat zenith I can't actually follow the math or the fundamentals to know if what I did was l...

it totally does

wheat zenith Jul 3, 2023, 8:25 AM

#

The wonderful thing about AI models, especially recent ones, is that it's almost hard not to make output that is at least interesting.

#

Can I bug you about one somewhat random question, only vaguely related? On the musicgen github, someone posted cool music and also that they used "-p sampling", and then a bunch of other people were asking if it was really using -p sampling, did that work, and I thought it would be funny to actually try it. So like, reverse the order of the logits, least likely first, otherwise just like top-p. Actually though, the output seems genuinely kind of useful and different in an audio LLM. And as far as I understand, it's not just equivalent to something else? In a TTS model, it makes people have a Christopher Walken speech pattern. They choose the wrong places to pause. SO COOL.
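One possible reading of that "reversed top-p" filter, as a small pure-Python sketch. The name `reverse_top_p_filter` is made up for illustration, and the real experiment may have differed in details such as tie handling or renormalization.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(score - m) for score in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reverse_top_p_filter(logits, p):
    """Keep the LEAST likely tokens whose cumulative probability reaches p,
    then renormalize. Regular top-p would sort most likely first instead."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i])  # ascending
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}
```

Sampling then draws from the returned distribution, so the single most likely token is excluded whenever the tail alone already covers `p`.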

versed flax Jul 3, 2023, 8:33 AM

#

wheat zenith Can I bug you about one somewhat random question, only vaguely related? On music...

well I mistakenly implemented that on LMs and it was just bad lol

wheat zenith Jul 3, 2023, 8:42 AM

#

versed flax well I mistakenly implemented that on LMs and it was just bad lol

That's what I expected. I think maybe the Bark audio TTS model may just be unusually robust, you can ban 75% of the tokens randomly and sometimes it sounds mostly normal. It was okay-ish in musicgen for short periods as well, but eventually degrades to non-music. For music I feel like I really just want endless text boxes for different prompts with CFG weights, some positives, some negatives, some CFG values that vary over time, like based on the current token count. Feels pretty natural in music. It's like a conductor, holding out a hand to a section of the orchestra, slowly raising it up, increasing the weight of one section, decreasing another. Continuously changing.
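The "endless text boxes with CFG weights" wish could be sketched as a weighted sum of per-prompt offsets from the unconditional scores. This is a hypothetical generalization for illustration, not something the paper or musicgen is known to implement; `combine_guidance` is an invented name.

```python
def combine_guidance(uncond, prompt_scores, weights):
    """Combine next-token scores from several prompts.

    uncond:        scores (logits or log-probs) with no prompt
    prompt_scores: one score list per prompt, same length as uncond
    weights:       one weight per prompt; positive pulls toward that prompt,
                   negative pushes away from it (assumed convention)
    """
    combined = list(uncond)
    for scores, weight in zip(prompt_scores, weights):
        for i in range(len(combined)):
            combined[i] += weight * (scores[i] - uncond[i])
    return combined
```

With a single prompt and weight 1 this returns that prompt's scores unchanged; the weights could also be recomputed at every step to get the conductor-style, continuously changing mix.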

#

Bark is not yet in Huggingface but I'm so excited I almost want to try and port this code...

wheat zenith Jul 3, 2023, 9:17 AM

#

versed flax I think there's a wording trick but I couldn't find it

Is there additional context to the "wording trick" phrase? Or do you mean generally you think it's plausible that fully negative (total opposite) prompting is useful and effective, but the prompt engineering isn't yet known how to make it work?

versed flax Jul 3, 2023, 9:18 AM

#

wheat zenith Is there additional context to the "wording trick" phrase? Or do you mean genera...

Say you want to generate lyrics. Your prompt would be "I wrote a song, the lyrics are:"

#

So that will generate lyrics right

#

But now let's say you want to use a neg prompt so that these lyrics are not about love

#

As far as I could think, your neg prompt would be "I wrote a love song, the lyrics are:"

#

And again, there must be a better way to prompt engineer a neg prompt, but we did not find what it was

#

because then the continuation won't be a love song at all, which will lead to a weird negative continuation:

"I wrote a love song, the lyrics are: <something not about love at all>"
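To make the mechanics concrete, here is a toy sketch of negative prompting under stated assumptions: `toy_scores` and `VOCAB` are hypothetical stand-ins for a real language model, decoding is greedy, and the negative prompt takes the place of the unconditional branch. Note how each chosen word is appended to both contexts, which is exactly the awkwardness described above.

```python
import math

VOCAB = ["heart", "road", "rain", "kiss"]


def toy_scores(context):
    """Hypothetical stand-in for an LM: one score per VOCAB word.
    'heart' and 'kiss' get a boost when the context mentions love."""
    boost = context.count("love")
    return [1.0 + boost, 1.0, 1.0, 1.0 + boost]


def log_softmax(scores):
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def cfg_generate(prompt, neg_prompt, gamma, steps):
    """Greedy decoding where the negative prompt replaces the unconditional branch."""
    pos_ctx, neg_ctx = prompt, neg_prompt
    output = []
    for _ in range(steps):
        pos = log_softmax(toy_scores(pos_ctx))
        neg = log_softmax(toy_scores(neg_ctx))
        guided = [n + gamma * (p - n) for p, n in zip(pos, neg)]
        word = VOCAB[max(range(len(VOCAB)), key=lambda i: guided[i])]
        output.append(word)
        # The sampled word is appended to BOTH contexts, so the negative
        # context ends up as "love song" followed by non-love lyrics.
        pos_ctx += " " + word
        neg_ctx += " " + word
    return output
```

Calling `cfg_generate("I wrote a song, the lyrics are:", "I wrote a love song, the lyrics are:", 1.5, 3)` steers the toy model away from the love-flavored words.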

wheat zenith Jul 3, 2023, 9:24 AM

#

Right. What does working correctly look like?

versed flax Jul 3, 2023, 9:24 AM

#

no idea. We couldn't find the right way to phrase it.

#

We used negative prompts only as more general versions of the prompt or totally opposite of the prompt (surprisingly, that still works), but we couldn't find the prompt engineering to make it more targeted /granular

wheat zenith Jul 3, 2023, 9:26 AM

#

The first thought I had, skimming the code, was I'm gonna add in a text box that can swap in for the unconditional or the 'neutral prompt' -- no idea what that enables or if it makes sense. But in audio I did have to use like 'generic english voices' not just 'unconditional generation' for the token thing I did.

versed flax Jul 3, 2023, 9:26 AM

#

nice!

#

let us know

wheat zenith Jul 3, 2023, 9:27 AM

#

But just vaguely, maybe 'unconditional' being another input, could ground the "opposite" concept somehow.

#

The great thing about audio? As long as it changes the sound... it could still be a useful knob to turn, even if you have no real idea why it's having the effect, or can't really predict it.

#

Trickier in pure text.

versed flax Jul 3, 2023, 9:31 AM

#

https://github.com/ggerganov/llama.cpp/issues/2083 people want it in llama.cpp now!

GitHub

Feature request: Classifier-Free Guidance sampling to stay on topic...

@ggerganov retweeted the "Stay on topic with Classifier-Free Guidance" paper that came out showing that "Classifier-Free Guidance (CFG)"... "can be used broadly as an infer...

wheat zenith Jul 3, 2023, 9:33 AM

#

The Diffusion people have been living a life of spoiled luxury. Negative prompts, control net, a billion other syntax tweaks, while the LLM community has nothing. They are ravenous and I get it.

#

Actually in Bark, there's kind of two prompts, two different sets of tokens, both are used at inference, concatenated. One for the voice, one for the text to say. So each could have this implemented separately, gonna be crazy.

#

It's all just tokens out of a GPT model, it should all work

#

They should implement something like the visualization tool you made, that is super cool too

#

ggerganov will have it done so fast. if you google any random weird sampling thing, half the time, the only working code I can find that isn't the original repo, is in ggml. he just implements everything.

blissful garden Jul 3, 2023, 10:11 AM

#

versed flax https://github.com/ggerganov/llama.cpp/issues/2083 people want it in llama.cpp n...

oh wow this is awesome!

blissful garden Jul 3, 2023, 10:12 AM

#

versed flax https://github.com/ggerganov/llama.cpp/issues/2083 people want it in llama.cpp n...

Since the big names are retweeting did your twitter notification blow up? 😂

versed flax Jul 3, 2023, 10:15 AM

#

blissful garden Since the big names are retweeting did your twitter notification blow up? 😂

ngl, 36 retweets and 117 likes on a post is the most activity I've had on twitter lol

#

and yes, many likes and follows!

wheat zenith Jul 3, 2023, 11:07 AM

#

versed flax ngl, 36 retweets and 117 likes on a post is the most activity I've had on twitte...

I can see you're busy. Some time when you're not, I wonder if you remember whether the inaccurate answers at high CFG values were just a wrong number, or if they were possibly wrong in a weirder way, perhaps something like "Q: How many apples do they have? A: 3 cans of tennis balls."

versed flax Jul 3, 2023, 11:08 AM

#

@fallow egret that's for you

fallow egret Jul 3, 2023, 11:15 AM

#

wheat zenith I can see you're busy. Some time when you're not, I wonder if you remember whether the i...

In very high CFG values, you start to get garbage, the interesting part is in the medium-high range: there you still get a high percentage of valid answers, but the generated content adheres too closely to the prompt, which doesn't leave room for a rich reasoning chain that gets to the correct answer.

wheat zenith Jul 3, 2023, 11:18 AM

#

fallow egret In very high CFG values, you start to get garbage, the interesting part is in th...

Interesting, thanks. Just a hunch, I don't know much but I crank up values and get output like I posted from going way too far, just trying stuff. Maybe ramping the CFG value up or down over the course of the sample could find a real sweet spot, better than a fixed value.

versed flax Jul 3, 2023, 11:19 AM

#

wheat zenith Interesting, thanks. Just a hunch, I don't know much but I crank up values and g...

People just do crazy things with the guidance strength. I wanted to keep things simple for the paper

wheat zenith Jul 3, 2023, 11:24 AM

#

You're toppling the Diffusion cartel. They can't keep all this stuff to themselves any longer. We're coming for all of it. Even when it doesn't really make sense in an LLM. I'm putting in my prompts anyway.

fallow egret Jul 3, 2023, 11:26 AM

#

wheat zenith Interesting, thanks. Just a hunch, I don't know much but I crank up values and g...

Yes, this sounds like an interesting direction for future work

wheat zenith Jul 3, 2023, 11:28 AM

#

The llama/oobabooga/text-gen community will probably try a lot of obvious twists and variants, if there's a new variable exposed, people will start really exploring.

#

Is it possible to trade more than 2x compute time for some further gains?

fallow egret Jul 3, 2023, 11:29 AM

#

wheat zenith The llama/oobabooga/text-gen community will probably try a lot of obvious twists...

I hope it will happen, this means a lot of citations 🙂

versed flax Jul 3, 2023, 11:29 AM

#

wheat zenith Is it possible to trade more than 2x compute time for some further gains?

Not that I'm aware of.

fallow egret Jul 3, 2023, 11:31 AM

#

wheat zenith Is it possible to trade more than 2x compute time for some further gains?

Actually in CoT you have self-consistency, where you run the chains multiple times, and then there is an interesting trade-off (you can apply different cfg values in each iteration, etc., and do a smart ensemble)
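A minimal sketch of that ensemble idea, assuming a hypothetical `generate_answer(question, gamma)` callable that runs one reasoning chain at guidance strength `gamma` and returns the parsed final answer (or `None` when the output is invalid). Plain majority voting is used here; a "smart" ensemble could weight the votes differently.

```python
from collections import Counter


def self_consistent_answer(answers):
    """Majority vote over final answers; None marks an invalid chain."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def ensemble_over_guidance(generate_answer, question, gammas):
    """Run one chain per guidance strength and vote on the result."""
    answers = [generate_answer(question, gamma) for gamma in gammas]
    return self_consistent_answer(answers)
```

The compute cost grows with the number of chains, and each CFG chain already costs roughly two forward passes per token, which is the trade-off being pointed at.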

wheat zenith Jul 3, 2023, 11:37 AM

#

There is always trivial brute force stuff. Not really same concept though. Like you can run an entirely second audio model inside the sampling loop and use it to judge the emotion of the output, and then backtrack and keep trying. It's the least efficient way to do something like that, but if you only need 2 minutes of audio, you can run it all night and it eventually works.
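The brute-force loop described here might look roughly like the sketch below. Everything in it is an assumption for illustration: `propose_chunk` stands in for sampling the next stretch of audio tokens, `judge` stands in for the second model scoring the output, and "backtracking" is simplified to resampling from the last accepted chunk.

```python
import random


def generate_with_judge(propose_chunk, judge, n_chunks, threshold,
                        max_tries=50, seed=0):
    """Accept chunks one at a time, resampling until the judge is satisfied."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_chunks):
        for _ in range(max_tries):
            chunk = propose_chunk(accepted, rng)
            if judge(accepted + [chunk]) >= threshold:
                accepted.append(chunk)
                break
        else:
            # No candidate passed within the budget; give up loudly.
            raise RuntimeError("no acceptable chunk found")
    return accepted
```

It is inefficient by design: the cost is the number of tries times the cost of both models, which is why it only makes sense for short outputs.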

strange magnet Jul 3, 2023, 1:57 PM

#

RT'd 🙂
Great work! It's very exciting to see a project like this come to fruition in Eleuther, where someone can come in with their ideas & results and get help refining it into an impressive paper 🥳

azure lion Jul 3, 2023, 2:04 PM

#

(typo 🤓)

loud adder Jul 3, 2023, 2:51 PM

#

We got an email from someone who wants us to cite their paper on sampling from LLMs
Paper: https://arxiv.org/abs/2110.08294
Seems like we should be able to run their generative code pretty easily if we want to add a comparison to the paper: https://github.com/zhenwang9102/coherence-boosting/blob/main/generation/generation.py

fallow egret Jul 3, 2023, 2:56 PM

#

loud adder We got an email from someone who wants us to cite their paper on sampling from L...

I think they want us to cite them because of equation 2 which is equivalent to CAD

#

P.S., I actually ran a comparison to ensemble. CFG works significantly better

loud adder Jul 3, 2023, 2:59 PM

#

fallow egret P.S., I actually ran a comparison to ensemble. CFG works significantly better

Nice! Let's definitely get this added to the paper

fallow egret Jul 3, 2023, 3:00 PM

#

loud adder Nice! Let's definitely get this added to the paper

It was a short table, so it seems a little bit strange to add it as a table, but we can definitely think of an appropriate way to present these results

loud adder Jul 3, 2023, 3:02 PM

#

fallow egret It was a short table, so it seems a little bit strange to add it as a table, but...

I feel like it would fit as a natural subcolumn here?

Screenshot_2023-07-03_at_11.01.49_AM.png

fallow egret Jul 3, 2023, 3:05 PM

#

loud adder I feel like it would fit as a natural subcolumn here?

For each one of the experiments?
We can do that, but I think that ensembling generally tries to tackle a very different issue. So it will be nice to mention in one of the settings that we beat ensemble (with half the computation resources!), but I'm not sure we want to do that on all these experiments since it's not an apples-to-apples comparison with respect to the problem it is trying to tackle

#

If you just meant the table representation format, then yes, it sounds like a good idea!

loud adder Jul 3, 2023, 3:13 PM

#

fallow egret I think they want us to cite them because of equation 2 which is equivalent to C...

Is it, given that there's a log? Or is their log f our f?

fallow egret Jul 3, 2023, 4:05 PM

#

loud adder Is it, given that there's a log? Or is their log f our f?

We also apply the addition with respect to the log of the probabilities (this is also the case in the original vision CFG)
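For reference, the log-space combination being discussed can be written as follows. The notation is an assumption for this note: $c$ is the prompt, $w_{<t}$ the tokens generated so far, and $\gamma$ the guidance strength.

```latex
\log \hat{P}(w_t \mid w_{<t}, c)
  = \log P(w_t \mid w_{<t})
  + \gamma \,\bigl( \log P(w_t \mid w_{<t}, c) - \log P(w_t \mid w_{<t}) \bigr)
```

Setting $\gamma = 1$ recovers ordinary conditional sampling, and $\gamma > 1$ extrapolates away from the unconditional distribution. Because the sum is taken over log-probabilities, it corresponds to a product of powers of the two distributions rather than a linear mixture of probabilities.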

patent gull Jul 3, 2023, 4:15 PM

#

@wheat zenith retweet us!!! ~~

wheat zenith Jul 3, 2023, 4:21 PM

#

patent gull @wheat zenith retweet us!!! ~~

I did! I'm gonna eventually post a ton of negative prompts I'm sure too, I love them too much. https://twitter.com/jonathanfly/status/1675854740142399490

patent gull Jul 3, 2023, 4:23 PM

#

@here retweet us!! https://twitter.com/Vermeille_/status/1675664118500454400

wheat zenith Jul 3, 2023, 4:24 PM

#

You are now my last two tweets. And I have been tweeting like 3 times a month.

#

So that's a LOT

patent gull Jul 3, 2023, 4:25 PM

#

I just wanna say a huge, huge thanks and congrats to @versed flax who will never take credit for it but is truly the leader here. He went many, many sleepless nights trying to be awake when we all were and coordinate. Endlessly thoughtful, experimentative, questioning. You really motivated me to be a better thinker.

Also a huge shout out to @blissful garden for powering us through all the tough experiments!!! You also tolerated all my last-minute requests asking for different plots!!

versed flax Jul 3, 2023, 4:25 PM

#

wheat zenith I did! I'm gonna eventually post a ton of negative prompts I'm sure too, I love ...

definitely share your experiments with us!

wheat zenith Jul 3, 2023, 4:26 PM

#

I'm pretty amateur, every single line of code there, probably learned last month, lol

patent gull Jul 3, 2023, 4:26 PM

#

the acknowledgements in the paper don't fully capture how hard these two worked and the spirit, energy and devotion here. This came together quickly but doesn't mean it wasn't deep

patent gull Jul 3, 2023, 4:27 PM

#

versed flax definitely share your experiments with us!

yes please!! (time to start talking about a follow up paper lol???)

blissful garden Jul 3, 2023, 4:28 PM

#

patent gull yes please!! (time to start talking about a follow up paper lol???)

use CFG in finetuning?

patent gull Jul 3, 2023, 4:28 PM

#

i'm down

#

maybe we have a finetuning paper more focused on negative prompting?

#

that seems like an area that we can really own and build from this paper on

wheat zenith Jul 3, 2023, 4:29 PM

#

Is the paper locked, or could you also test CFG and/or negative prompts in some audio LLM? To me they feel pretty natural, negatives too. Sound descriptions have pretty clear opposites. A loud scratchy voice, a soft smooth voice, whatever. Even a person, or an entire voice. If you asked a group of people to pick the voice out of a set that was the opposite of another, they'd probably mostly pick the same person. As opposed to something conceptually hard to grasp like "the opposite of a love poem"

azure lion Jul 3, 2023, 4:32 PM

#

wheat zenith Is the paper locked, or could you also test CFG and/or negative prompts in some ...

why not write a separate paper for that?

blissful garden Jul 3, 2023, 4:34 PM

#

wheat zenith Is the paper locked, or could you also test CFG and/or negative prompts in some ...

MusicGen has CFG in it already right? I remember there was a conversation about that

wheat zenith Jul 3, 2023, 4:35 PM

#

blissful garden MusicGen has CFG in it already right? I remember there was a conversation about ...

Yeah. It only had one prompt though. So if you flip the sign, it's just a pure negative prompt. And then the regular unconditional it always uses. It does actually kind of work, but the range where it works is narrow and fiddly, you have to try to find it.

fallow egret Jul 3, 2023, 4:36 PM

#

patent gull I just wanna say a huge, huge thanks and congrats to @versed flax who w...

I want to join the congrats, and I'm sure everyone will agree that you also deserve applause. The three of you did really great work. This is a high-quality paper that will definitely generate a lot of interest

wheat zenith Jul 3, 2023, 4:36 PM

#

In musicgen, you can do anything and make weird music. for example mapping CFG to a sine wave, based on tokens. sounds great, adds variety

#

It breaks up the repetition. audio is easy mode I think. Just being different, is good.
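A rough sketch of the "CFG mapped to a sine wave over tokens" idea: compute a different guidance scale per generated token. This is not MusicGen's API; the function name and the `base`, `amplitude`, and `period` values are hypothetical knobs chosen for illustration:

```python
import math

def sine_guidance_scale(step, base=3.0, amplitude=1.5, period=48):
    # Guidance scale for the token at position `step`,
    # oscillating between base - amplitude and base + amplitude.
    return base + amplitude * math.sin(2 * math.pi * step / period)

# One scale per token for a 96-token generation.
schedule = [sine_guidance_scale(t) for t in range(96)]
```

Each value would then be used as the guidance strength for that token's logits instead of a single fixed scale.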

blissful garden Jul 3, 2023, 4:37 PM

#

patent gull maybe we have a finetuning paper more focused on negative prompting?

Actually this is gonna be really interesting. We can take any finetuning dataset, prepend each paragraph with a negative prompt, and finetune towards the extrapolated logit distribution instead of just the next-token prediction.
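One way this could be sketched, as a guess at the idea rather than anything that was implemented: build the CFG-extrapolated distribution from the logits with and without the negative prompt, then use it as a soft target for the model being finetuned. All names and numbers below are hypothetical:

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def cfg_target_log_probs(pos_logits, neg_logits, gamma):
    # Extrapolate away from the negative-prompt distribution, then renormalize.
    lp, ln = log_softmax(pos_logits), log_softmax(neg_logits)
    return log_softmax([n + gamma * (p - n) for p, n in zip(lp, ln)])

def soft_target_loss(student_logits, pos_logits, neg_logits, gamma):
    # Cross-entropy between the extrapolated target distribution and the
    # student's distribution (in place of a one-hot next-token target).
    target = cfg_target_log_probs(pos_logits, neg_logits, gamma)
    student = log_softmax(student_logits)
    return -sum(math.exp(t) * s for t, s in zip(target, student))

# Toy logits over a 3-token vocabulary (made-up numbers).
pos = [2.0, 0.0, -1.0]   # teacher logits given the regular context
neg = [0.0, 1.0, 0.0]    # teacher logits given the negative prompt
loss = soft_target_loss([0.0, 0.0, 0.0], pos, neg, gamma=1.5)
```

The loss is smallest when the student reproduces the extrapolated distribution, which is what "baking in" the guidance would mean in this sketch.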

wheat zenith Jul 3, 2023, 4:38 PM

#

I also happened to ask someone in the HuggingFace discord about logit attribution, and this is like, the Discord where that concept seems to be literally created, wild timing. I had only a practical question about using it to make the audio waveform visualization act like a debugger for your prompt, but also look cool. But the idea is like an audio version of the colored words in the paper actually.

blissful garden Jul 3, 2023, 4:39 PM

#

wheat zenith Yeah. It only had one prompt though. So if you flip the sign, it's just a pure n...

oh I see. Yeah it would be fun to properly try out negative prompt

patent gull Jul 3, 2023, 4:44 PM

#

shoutout to @fallow egret and @paws too, i think you guys handled a tonnn of back-and-forth, chaotic discussions very, very well and with grace. without your parts, this would be a way flimsier paper

loud adder Jul 3, 2023, 4:45 PM

#

Made the tables a bit cleaner. Especially if we decide we want to add more comparisons, this will scale nicer than the original layout

Screenshot_2023-07-03_at_12.44.50_PM.png

patent gull Jul 3, 2023, 4:45 PM

#

wheat zenith In musicgen, you can do anything and make weird music. for example mapping CFG t...

I agree that music would be interesting but it feels like a different direction. I'm most interested in seeing how far this can go in the language domain. (however, if you wanna take it in the music direction, do it!!!!! i'm sure we'll all be interested in contributing)

patent gull Jul 3, 2023, 4:45 PM

#

loud adder Made the tables a bit cleaner. Especially if we decide we want to add more compa...

nice!

loud adder Jul 3, 2023, 4:57 PM

#

@wheat zenith FYI there's also a thread for training models to generate music, #1106671860294357055

patent gull Jul 3, 2023, 4:58 PM

#

loud adder @wheat zenith FYI there's also a thread for training models to generate ...

yes was gonna mention haha

loud adder Jul 3, 2023, 4:58 PM

#

(They seem to have stalled out due to people being busy, but additional manpower might help with that)

patent gull Jul 3, 2023, 5:00 PM

#

i'm loosely involved in that project... i think it's also a question of getting the boilerplate together/training baselines. I question whether it's the right time to start considering extensions like CFG, but ultimately, additional personpower does always help!!

fallow egret Jul 3, 2023, 5:08 PM

#

I think for me an interesting direction of extending this work will be to extend it to the RL context. You can see CFG as modifying the model policy given another policy (negative). And I think that an interesting direction is given a new reward function how we can steer the model properly only during inference, I think this could be done with the ILQL framework, but these are only very initial thoughts...
https://arxiv.org/pdf/2206.11871.pdf

versed flax Jul 3, 2023, 5:13 PM

#

patent gull I just wanna say a huge, huge thanks and congrats to @versed flax who w...

Dude it's really been wild and fun. Really, massive kudos to you for never stopping improving the paper's quality when I was ready to settle. Massive thanks to @blissful garden glu for tirelessly running all those experiments. And overall for the incredible quality of your reasoning to push the paper further and further.

And obviously thanks to Stella for stepping in at the very beginning and sending me in the right direction to be able to discover and show the power of CFG, and for the multiple reading passes

blissful garden Jul 3, 2023, 5:16 PM

#

versed flax Dude it's really been a wild and fun. Really, massive kudos to your never stoppi...

Ok I'm just gonna say thank you everyone because I'm really bad at writing those. But this is really sincere since I enjoyed the past month way better than writing my PhD thesis.

patent gull Jul 3, 2023, 5:17 PM

#

hahaha

#

(is that a high bar? i haven't written mine yet)

versed flax Jul 3, 2023, 5:20 PM

#

blissful garden Ok I'm just gonna say thank you everyone because I'm really bad at writing those...

for real. I've been dreading the 4y my PhD lasted, but that month was a blast

versed flax Jul 3, 2023, 10:21 PM

#

https://github.com/mlc-ai/mlc-llm/issues/499 Another feature request! That's three!

versed flax Jul 3, 2023, 11:48 PM

#

https://github.com/turboderp/exllama/issues/129 and four!

versed flax Jul 4, 2023, 12:12 AM

#

https://github.com/LostRuins/koboldcpp/issues/292 another one!

GitHub

[Enhancement] UI support for CFG scaling/negative prompts. · Issue ...

CFG scaling is being looking into in llama.cpp. The jist of the issue is better explained here, and in the original paper: ggerganov#2083 (comment) For instruct models, sounds like Koboldcpp's ...

loud adder Jul 4, 2023, 12:55 AM

#

@versed flax it seems like it’s really making rounds!

versed flax Jul 4, 2023, 12:55 AM

#

loud adder @versed flax it seems like it’s really making rounds!

I'm so stoked

#

Retweets from Alexia Jolicoeur-Martineau, Emad Mostaque, Jeremy Howard, lucidrains, and some others whose names I forgot

loud adder Jul 4, 2023, 12:57 AM

#

The raw stats are also pretty cool to see

versed flax Jul 4, 2023, 12:57 AM

#

It's really exciting

#

I almost never use Twitter so I don't know how big of an effect that is, but it's definitely non-zero

#

Oh yeah, someone from Nomic.ai who (ofc) commented on the GPT4All experiment!

loud adder Jul 4, 2023, 1:01 AM

#

Curious observation: EleutherAI retweeting it seems to have made basically no impact. < 100 people have seen the EAI retweet

versed flax Jul 4, 2023, 1:01 AM

#

That's crazy. I had no follower base

#

I don't know who got to see it first then

#

I thought it was your retweet that impacted it

#

Or maybe you retweeting my post made the recsys show my post to EAI's followers directly rather than your retweet?

loud adder Jul 4, 2023, 1:04 AM

#

I quote tweeted tho

#

Every other quote tweet we’ve ever done seems to have 20-100x as many views

versed flax Jul 4, 2023, 1:05 AM

#

🤷‍♂️ weird

#

Oh maybe that's bc my initial tweet mentioned your account

loud adder Jul 4, 2023, 1:06 AM

#

I guess we did something to offend The Great Musk and got throttled 😂

versed flax Jul 4, 2023, 1:06 AM

#

hahaha

#

that would explain your low view count but not my high view count 😆

loud adder Jul 4, 2023, 1:07 AM

#

That’s easily explained by doing good work and getting noticed

versed flax Jul 4, 2023, 1:08 AM

#

That's a nice compliment

#

I'm waiting to see whether it delivers on downstream applications before self gratification and claiming we did "good work" haha

loud adder Jul 4, 2023, 1:12 AM

#

Fair enough

versed flax Jul 4, 2023, 1:17 AM

#

btw @loud adder what's the consensus on non-English LLMs? Nobody seems to really care, why?

No academic interest due to lower innovation / lesser citation potential?
No industry interest bc it's just too expensive to build a dataset and train one?
No interest because we just aim for massive multilingual models?

loud adder Jul 4, 2023, 1:21 AM

#

Almost everyone who trains LLMs is paid by a US or Chinese company

#

There’s a small Korean scene

#

(Our Korean models are the best OS ones AFAIK)

#

There’s a Swedish non-profit that’s trained single-digit pan-Nordic models

versed flax Jul 4, 2023, 1:23 AM

#

I would be so down training a french one

loud adder Jul 4, 2023, 1:25 AM

#

Go find me ~ 1 TB of French text and we can talk

versed flax Jul 4, 2023, 1:25 AM

#

I have 10GB berk

#

Challenge accepted though

loud adder Jul 4, 2023, 1:25 AM

#

There’s this model which is a French fine tune of GPT-J: https://huggingface.co/Cedille/fr-boris

Cedille/fr-boris · Hugging Face

#

And Cedille has an unreleased model they sell commercially IIRC

loud adder Jul 4, 2023, 1:27 AM

#

versed flax I have 10GB berk

We can probably make do with 300 GB, though quality will suffer compared to 1 TB. And this is post-filtering, to be clear

#

mC4 will get you half way there IIRC

#

Maybe French Wikipedia and a couple other sources can close out the rest

versed flax Jul 4, 2023, 1:29 AM

#

mC4 is Common Crawl?

loud adder Jul 4, 2023, 1:29 AM

#

Yeah

versed flax Jul 4, 2023, 1:29 AM

#

loud adder Go find me ~ 1 TB of French text and we can talk

Is that real though? It looks like there are a lot of talks about quantity vs quality happening

loud adder Jul 4, 2023, 1:30 AM

#

What do you mean by “is that real”?

#

If you can find really high quality data you can get away with less, but we’re talking like “a substantial fraction of all books ever written in French” kind of quality

versed flax Jul 4, 2023, 1:31 AM

#

I mean, is this a number set in stone that can't be challenged with those modern, quality first, approaches?

loud adder Jul 4, 2023, 1:31 AM

#

That is based on modern, quality first approaches

versed flax Jul 4, 2023, 1:31 AM

#

Ah. 😂

loud adder Jul 4, 2023, 1:32 AM

#

loud adder We can probably make do with 300 GB, though quality will suffer compared to 1 TB...

This is about mixing it with code data and running multiple epochs

versed flax Jul 4, 2023, 1:32 AM

#

I know there's a pretty big source of books I want to scrape but I don't know the actual size of it

versed flax Jul 4, 2023, 1:33 AM

#

loud adder This is about mixing it with code data and running multiple epochs

Oh I read this paper!

#

The open question though is: should the code be written in french too? That doesn't exist lol

loud adder Jul 4, 2023, 1:34 AM

#

That’s part of why I said quality will suffer, but you can live without code “in French” most likely

versed flax Jul 4, 2023, 1:35 AM

#

That's interesting. I'm not sure there's a high value doing this (ChatGPT is already pretty amazing at French tbh and I'm sure it's not specifically built with french in mind)

#

But it sounds like a fun ride

loud adder Jul 4, 2023, 1:37 AM

#

There’s some amount of cross-lingual generalization though, see
https://arxiv.org/abs/1910.11856
https://arxiv.org/abs/2005.00633
https://arxiv.org/abs/2211.01786

arXiv.org

On the Cross-lingual Transferability of Monolingual Representations

State-of-the-art unsupervised multilingual models (e.g., multilingual BERT)
have been shown to generalize in a zero-shot cross-lingual setting. This
generalization ability has been attributed to the use of a shared subword
vocabulary and joint training across multiple languages giving rise to deep
multilingual abstractions. We evaluate this hypo...

arXiv.org

From Zero to Hero: On the Limitations of Zero-Shot Cross-Lingual Tr...

Massively multilingual transformers pretrained with language modeling
objectives (e.g., mBERT, XLM-R) have become a de facto default transfer
paradigm for zero-shot cross-lingual transfer in NLP, offering unmatched
transfer performance. Current downstream evaluations, however, verify their
efficacy predominantly in transfer settings involving la...

arXiv.org

Crosslingual Generalization through Multitask Finetuning

Multitask prompted finetuning (MTF) has been shown to help large language
models generalize to new tasks in a zero-shot setting, but so far explorations
of MTF have focused on English data and models. We apply MTF to the pretrained
multilingual BLOOM and mT5 model families to produce finetuned variants called
BLOOMZ and mT0. We find finetuning l...

#

I would anticipate that code specifically is a high-transfer medium. But I don’t have good evidence of that.

#

I guess we had some in Crosslingual Generalization through Multitask Finetuning. But the evaluation metrics were pretty lacking

versed flax Jul 4, 2023, 1:39 AM

#

Yeah that echoes a private conv I had with @unique sedge earlier. It's easier to learn another language than starting from scratch. You already know how to reason, syntax can be more or less transferable, and vocabulary is just a thin sugarcoating around those much harder implicit tasks / skills

#

Ok the question was "why" and apparently the answer is "funding"

loud adder Jul 4, 2023, 1:41 AM

#

Have you read https://arxiv.org/abs/2212.09535

arXiv.org

BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting

The BLOOM model is a large publicly available multilingual language model,
but its pretraining was limited to 46 languages. To extend the benefits of
BLOOM to other languages without incurring prohibitively large costs, it is
desirable to adapt BLOOM to new languages not seen during pretraining. In this
work, we apply existing language adaptatio...

versed flax Jul 4, 2023, 1:41 AM

#

I didn't! I will skim through the paper before falling asleep

versed flax Jul 5, 2023, 2:30 AM

#

🥳 https://twitter.com/apage43/status/1676416243652505601 people are starting to use it in practice and are happy about it 🥳

Aaron Miller (@apage43)

@Vermeille_ @AiEleuther trying it on my toy MPT+ggml repo - makes MPT-Chat actually responsive to system prompts in a way that it normally very much isn't!

loud adder Jul 5, 2023, 6:45 AM

#

@versed flax got a Google news alert about the paper too 🙂 https://www.marktechpost.com/2023/07/03/eleuther-ai-research-group-demonstrate-how-classifier-free-guidance-cfg-can-be-used-with-llms/

MarkTechPost

Eleuther AI Research Group Demonstrate How Classifier-free Guidance...

Recently, huge language models have shown impressive generative skills, allowing them to handle a wide variety of problems. Typically, 'prompting' is used to condition generation, either with task instructions and context or with a small number of samples. However, problems, including hallucination, deterioration, and wandering, have been observ...

versed flax Jul 5, 2023, 7:54 AM

#

So damn great! I can't count the number of papers I discovered because my phone recommended me an article from MTP

patent gull Jul 5, 2023, 4:01 PM

#

everyone loves that table, they keep tweeting it --- i'm pretty psyched!! latex \cellcolor{} ftw

#

also i'm so glad @versed flax pushed for the assistant angle, and really pulled all-nighters to make it work

#

i think that's why people are so psyched about us and not CAD or another one

versed flax Jul 5, 2023, 4:10 PM

#

Told ya. Marketing. Lol.

#

There are two main selling points: assistants & 0.5x model size

#

Those are the things that people seem to like about it

#

CFG will land in Hugging Face tomorrow I guess :)

patent gull Jul 6, 2023, 2:19 PM

#

@loud adder we are chatting about follow-up papers in order to capitalize on this attention... do you think we can continue to use the cluster?

loud adder Jul 6, 2023, 2:20 PM

#

patent gull @loud adder we are chatting about follow-up papers in order to capital...

Yes, you can plan on continued access to 8xA40s for as long as is productive and you make progress

patent gull Jul 6, 2023, 2:20 PM

#

cool!! thank you so much!! yeah I don't think any of us are ready to jump in 100% yet, but we are talking about paper #2 being a fine-tuning paper

#

mainly @blissful garden 's idea, but we're thinking of fine-tuning on CFG-generated data to see if we can "bake" in some of the benefits, thereby getting rid of the 2x inference cost

fallow egret Jul 6, 2023, 4:49 PM

#

patent gull cool!! thank you so much!! yeah I don't think any of us are ready to jump in 100...

It will be very interesting to test it with Dromedary prompts and see if you can get a boost in performance in the self-alignment process. That would be a very important and interesting result for the field. One of the issues with Dromedary is that they use very intensive prompts, and I'm guessing that the base LLaMA model does not adhere well to the prompts
https://arxiv.org/pdf/2305.03047.pdf

versed flax Jul 7, 2023, 4:53 PM

#

https://github.com/ggerganov/llama.cpp/pull/2135 Someone added it to llama.cpp!

GitHub

Implement classifier-free guidance by bullno1 · Pull Request #2135 ...

Closes #2083.
To test:
bin/Release/main
--mirostat 2
-ngl 63
-m ~/Downloads/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_M.bin
--verbose-prompt
--prompt "A chat betwe...

versed flax Jul 7, 2023, 5:43 PM

#

https://vermeille.github.io/cfg-llm/ quickly made a paper page

versed flax Jul 7, 2023, 6:35 PM

#

Is someone using the pod? The more I use it, the more broken `transformers` gets

#

now I even get a protobuf error

#

yesterday I had some weird lib issues

blissful garden Jul 8, 2023, 7:58 AM

#

versed flax Someone using the pod? the more I use it the more broken `transformers` get

oh I tried to install streamlit and wanted to see if I can get the UI working... Maybe that breaks it

#

We should probably have conda in it instead of mixing everyone's env together

versed flax Jul 8, 2023, 10:54 AM

#

blissful garden oh I tried to install streamlit and wanted to see if I can get the UI working......

I fixed it don't worry.

loud adder Jul 8, 2023, 1:20 PM

#

@fallow egret Can you translate this coverage of CFG for us 🙏🏼

https://twitter.com/MikeE_3_14/status/1675930643857825792?s=20

Dr. Mike Erlihson - Math, AI, DeepL, Powerlifting (@MikeE_3_14)

Today in #shorthebrewpapereviews we review the paper:
Stay on topic with Classifier-Free Guidance (CFG)
The paper uses the CFG method, originally proposed to improve sampling from conditional diffusion models. The goal of CFG is to "tune how closely the sample matches" the conditioning (a parameter controls the strength of the match). Here it is applied to #LLMs

versed flax Jul 8, 2023, 2:34 PM

#

loud adder @fallow egret Can you translate this coverage of CFG for us 🙏🏼 https...

Google translate does a fairly decent job:

Today in #shorthebrewpapereviews we are reviewing an article:
Stay on topic with Classifier-Free Guidance (CFG)
The article uses the proposed CFG method to improve the sampling of conditioned diffusion models. The purpose of CFG is to "adjust the adaptation of the sample" to conditioning (there is a parameter that controls the intensity of the adaptation). Here it is done for #LLMs

#

Here CFG is used to improve the ability of a language model to generate long and coherent answers to a prompt without forgetting the context. Here the unconditional model is the same model that generates text without conditioning in the prompt. That is, to construct an answer to a given prompt, we move the answer away from the unconditional sample when the strength of the removal is controlled with a gamma parameter

#

The proposed method works quite nicely (not surprising because it is kind of math-based - the formula for calculating the gradients is based on the Bayes formula). That is, the more you raise the Gamma, the more suitable the answer is to the prompt.

fallow egret Jul 8, 2023, 4:25 PM

#

loud adder @fallow egret Can you translate this coverage of CFG for us 🙏🏼 https...

Yes, Google Translate is pretty accurate. He also wrote it in the Israeli ML Facebook group, and I already thanked him and clarified a few small points (like the fact that the gradients are in the diffusion model case; in the LLM setting we work directly on the log probability)

versed flax Jul 11, 2023, 4:26 PM

#

https://github.com/ggerganov/llama.cpp/pull/2135 CFG is officially in llama.cpp! The PR has been merged moments ago!


loud adder Jul 13, 2023, 3:28 PM

#

Somewhat related to the "pretrain with CFG" idea: https://huggingface.co/seonghyeonye/flipped_11B

seonghyeonye/flipped_11B · Hugging Face

tepid gazelle Jul 18, 2023, 8:18 PM

#

It would be really awesome to see how the analyses done in this paper are affected by CFG:

https://www-files.anthropic.com/production/files/measuring-faithfulness-in-chain-of-thought-reasoning.pdf

I'd hope that with CFG, models will be much more likely to change their final answer when conditioned strongly on the generated reasoning chain.

fallow egret Jul 19, 2023, 6:09 AM

#

tepid gazelle It would be really awesome to see how the analyses done in this paper are affect...

We ran such an experiment (contrasting the prompt + chain vs. only the prompt on the answer token), and the results were good. However, we omitted these results from the paper since it was not 'real' CFG (it more closely resembles negative prompting)

strange magnet Jul 20, 2023, 12:55 PM

#

😄

#

https://docs.novelai.net/text/cfg.html

Advanced: CFG - NovelAI Documentation

NovelAI text and image generation documentation and guidebook.

versed flax Jul 20, 2023, 12:57 PM

#

strange magnet https://docs.novelai.net/text/cfg.html

😎

#

Lit

blissful garden Jul 20, 2023, 1:17 PM

#

I wish those counted as citations 😂

gleaming torrent Jul 20, 2023, 1:27 PM

#

https://twitter.com/novelaiofficial/status/1682010357819142147 maybe could retweet this for more visibility?

NovelAI (@novelaiofficial)

New Phrase Repetition Penalty & Classifier Free Guidance Settings!
It is our pleasure to expose you to new settings that allow you to take Clio to a whole new level.

Updated data storage for faster saving, and updated flash attention to v2 for increased Clio generation speeds!

versed flax Jul 20, 2023, 2:47 PM

#

blissful garden I wish those counted as citations 😂

Btw, about that, any idea why the Bibliographic Explorer doesn't work? https://arxiv.org/abs/2306.17806

arXiv.org

Stay on topic with Classifier-Free Guidance

Classifier-Free Guidance (CFG) has recently emerged in text-to-image
generation as a lightweight technique to encourage prompt-adherence in
generations. In this work, we demonstrate that CFG can be used broadly as an
inference-time technique in pure language modeling. We show that CFG (1)
improves the performance of Pythia, GPT-2 and LLaMA-famil...

loud adder Jul 20, 2023, 2:51 PM

#

versed flax Btw, about that, any idea why the Bibliographic Explorer doesn't work? https://a...

Semantic scholar doesn't think the paper has any citations: https://www.semanticscholar.org/paper/Stay-on-topic-with-Classifier-Free-Guidance-Sanchez-Fan/420e700d6902d065dc557c481979054477f9c6cb

versed flax Jul 20, 2023, 2:52 PM

#

loud adder Semantic scholar doesn't think the paper has any citations: https://www.semantic...

Yes, but the paper cites stuff itself. Isn't that enough for the Bibliographic Explorer?

#

(Also, Semantic Scholar usually extracts figures & tables, it didn't)

loud adder Jul 20, 2023, 2:53 PM

#

I guess SS failed to parse the paper properly then

#

Here's a paper that also has zero citations but bib explorer works fine: https://arxiv.org/abs/2306.01481

arXiv.org

GAIA Search: Hugging Face and Pyserini Interoperability for NLP Tra...

Noticing the urgent need to provide tools for fast and user-friendly
qualitative analysis of large-scale textual corpora of the modern NLP, we
propose to turn to the mature and well-tested methods from the domain of
Information Retrieval (IR) - a research field with a long history of tackling
TB-scale document collections. We discuss how Pyserin...

versed flax Jul 20, 2023, 2:54 PM

#

Uh. I'll try and see what I can do then.

tepid gazelle Jul 20, 2023, 3:06 PM

#

fallow egret We perform such experiment (contrasting the prompt + chain vs only prompt on the...

are such results / outputs saved somewhere? no worries if not

blissful garden Jul 20, 2023, 3:21 PM

#

versed flax Btw, about that, any idea why the Bibliographic Explorer doesn't work? https://a...

hmm never used Bibliographic Explorer at all...
Pure math people never care about citations so I'm quite behind about those tools

fallow egret Jul 20, 2023, 6:07 PM

#

tepid gazelle are such results / outputs saved somewhere? no worries if not

No, but I think If needed I can easily find the code and rerun it...

fallow egret Jul 20, 2023, 6:15 PM

#

tepid gazelle are such results / outputs saved somewhere? no worries if not

Actually, I found some results. It's not a good model to evaluate CoT on since it's too weak, but you can still definitely see the improvement

versed flax Jul 21, 2023, 12:11 AM

#

llama.cpp is about to add CFG to the web interface!
https://github.com/ggerganov/llama.cpp/pull/2217

#

rustformers is looking at it too
https://github.com/rustformers/llm/issues/377

#

https://github.com/abetlen/llama-cpp-python/issues/506 python bindings

obtuse tiger Aug 31, 2023, 3:33 PM

#

Hi

#

I found this paragraph a bit weird because you say that embeddings are good and have nice structure and then say oh yeah actually we are doing logit arithmetic. But I think this is just equivalent to doing arithmetic with the final layer hiddens since the unembedding is a linear transform right?

Captura_de_pantalla_2023-08-31_a_las_8.33.49_a.m..png
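A small numeric check of the point about linearity, assuming the unembedding is a plain matrix `W` applied to the final hidden state (after any final layer norm). The matrix, hidden states, and `gamma` below are toy values, not taken from any model:

```python
def matvec(W, h):
    # Apply the (vocab x hidden) unembedding matrix to a hidden state.
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def mix(u, c, gamma):
    # CFG-style extrapolation: u + gamma * (c - u), elementwise.
    return [a + gamma * (b - a) for a, b in zip(u, c)]

W = [[0.5, -1.0], [2.0, 0.25], [-0.75, 1.5]]  # toy unembedding, vocab=3, hidden=2
h_cond = [1.0, 2.0]      # final hidden state with the prompt
h_uncond = [-0.5, 0.3]   # final hidden state without the prompt
gamma = 1.5

# Extrapolate in hidden space, then unembed ...
logits_from_mixed_hidden = matvec(W, mix(h_uncond, h_cond, gamma))
# ... versus unembed first, then extrapolate the logits.
mixed_logits = mix(matvec(W, h_uncond), matvec(W, h_cond), gamma)

max_gap = max(abs(a - b) for a, b in zip(logits_from_mixed_hidden, mixed_logits))
```

The two orders agree up to floating-point error. Doing the arithmetic on log-probabilities instead of raw logits only shifts every token by the same constant, which disappears once the result is renormalized.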

blissful garden Aug 31, 2023, 3:42 PM

#

obtuse tiger I found this paragraph a bit weird because you say that embeddings are good and ...

yeah that's what we meant

obtuse tiger Aug 31, 2023, 3:43 PM

#

Cool, I kind of suspected that, but it was unclear. If I were you I might make an update to the paper to clarify but up to you guys of course.

#

actually sorry

#

I just read the second to last sentence

#

which makes it more clear

#

I still feel like it's confusing-ish

#

because

#

idk it's like the core of what you're doing

#

and it should be 1000% clear

#

but anyway

blissful garden Aug 31, 2023, 3:45 PM

#

I actually felt this paragraph is a bit hard to parse as well. I would have been confused when reading it the first time, but I don't have an ML background so I blamed myself lol

#

might be a better way to phrase it though. Will def think about it when we prepare to submit it somewhere

obtuse tiger Aug 31, 2023, 3:49 PM

#

Also could we combine this with the tuned lens?

blissful garden Aug 31, 2023, 3:49 PM

#

btw @versed flax any thoughts on where to submit?

obtuse tiger Aug 31, 2023, 3:49 PM

#

To do CFG in intermediate layers

blissful garden Aug 31, 2023, 3:49 PM

#

obtuse tiger To do CFG in intermediate layers

oh that sounds like a cool idea!

obtuse tiger Aug 31, 2023, 3:50 PM

#

Steering GPT-2-XL by adding an activation vector — AI Alignment For...

Prompt given to the model[1]: "I hate you because". GPT-2: "I hate you because you are the most disgusting thing I have ever seen." GPT-2 + "Love" vector: "I hate…"

#

they end up having to do "counterbalancing subtraction"

#

which is kinda like negative prompting

obtuse tiger Aug 31, 2023, 3:58 PM

#

obtuse tiger Also related to this https://www.alignmentforum.org/posts/5spBue2z2tw4JuDCx/stee...

actually sorry this is the more recent thing https://arxiv.org/abs/2308.10248

arXiv.org

Activation Addition: Steering Language Models Without Optimization

Reliably controlling the behavior of large language models (LLMs) is a
pressing open problem. Existing methods include supervised finetuning,
reinforcement learning from human feedback (RLHF), prompt engineering and
guided decoding. We instead investigate activation engineering: modifying
activations at inference time to predictably alter model ...

versed flax Aug 31, 2023, 4:17 PM

#

blissful garden btw @versed flax any thoughts on where to submit?

I'll look into it. Maybe the more seasoned researchers can judge what's the best conf we can realistically submit to

blissful garden Aug 31, 2023, 4:54 PM

#

versed flax I'll look into it. Maybe the more seasoned researchers can judge what's the best...

ICLR is probably the next big deadline I guess 🤔

#

Also we should brainstorm what kind of questions we can ask if we do CFG in intermediate layers. It sounds cool and there should be some interesting collaboration here.

obtuse tiger Aug 31, 2023, 4:58 PM

#

blissful garden Also we should brainstorm what kind of questions we can ask if we do CFG in inte...

Would potentially be interested in collaborating, I think there’s some interesting connections with interp and concept editing

versed flax Aug 31, 2023, 5:09 PM

#

blissful garden Also we should brainstorm what kind of questions we can ask if we do CFG in inte...

It's out of scope for me

obtuse tiger Aug 31, 2023, 6:13 PM

#

Maybe we could start a thread in #concept-editing or smth

versed flax Aug 31, 2023, 6:16 PM

#

obtuse tiger Maybe we could start a thread in #concept-editing or smth

It's cool if you want to pursue it. My point is that the paper is about porting CFG, and the initial CFG is not used for hidden layers. Those experiments deserve to be run but imho they're not in this paper's scope

obtuse tiger Aug 31, 2023, 6:17 PM

#

Oh sure I think I agree this would need to be a different paper

versed flax Aug 31, 2023, 6:36 PM

#

obtuse tiger Oh sure I think I agree this would need to be a different paper

Totes!

obtuse tiger Aug 31, 2023, 6:41 PM

#

#1146877254153031930

blissful garden Sep 1, 2023, 1:32 AM

#

versed flax It's out of scope for me

Oh yeah I didn't mean to change anything for our current paper. For submission we just need to decide if we want to say more about negative prompts

We used to have some thoughts about a second paper, and we can slowly pick them up and brainstorm

patent gull Sep 7, 2023, 9:12 PM

#

hey just caught up on this 🙂 these new ideas sound interesting and cool

#

we had an alternative direction for paper #2 in the idea of fine-tuning, just want to keep that one alive, too!

#

but reg. paper #1

versed flax Sep 7, 2023, 9:13 PM

#

There's the ICLR submission deadline at the end of the month. You guys OK to submit?

patent gull Sep 7, 2023, 9:13 PM

#

ICLR deadline is 9/28, I just re-checked it, it's in Vienna too

#

yah

#

just messaging about that

#

I'm cool to submit there!! I think it's a good idea. Stella's typing though. what do you think, stella?

loud adder Sep 7, 2023, 9:13 PM

#

(I also have a half written message suggesting ICLR from earlier today but got distracted before finishing it)

patent gull Sep 7, 2023, 9:15 PM

#

alternatives in the NLP domain are NAACL (11/23 I think) and ACL (January-ish, hasn't been announced). otherwise, we can always wait for Neurips :/

but i vote ICLR. @blissful garden ?

loud adder Sep 7, 2023, 9:16 PM

#

We should know about ICLR in time to submit to ACL or ICML

patent gull Sep 7, 2023, 9:17 PM

#

is ICML better than ICLR or are they roughly equivalent? I'd probably rank NAACL lowest

loud adder Sep 7, 2023, 9:17 PM

#

I view ICML, ICLR, and NeurIPS as equivalent

versed flax Sep 7, 2023, 9:19 PM

#

I think I agree with that

patent gull Sep 7, 2023, 9:19 PM

#

great 🙂 let's go for ICLR then.

imo i don't think the paper needs much. maybe another round of grammar-editing, following @obtuse tiger 's point about clarity up there

Do you think there is an anonymity-preserving way to mention in the paper everything that has happened since Arxiv? i.e. that CFG is incorporated into Huggingface and llama.cpp? that's certainly a cool contribution

loud adder Sep 7, 2023, 9:21 PM

#

"Since its public release, CFG has been readily adopted by major LLM libraries including llama.cpp and transformers"

patent gull Sep 7, 2023, 9:22 PM

#

cool.

maybe we can also throw cool examples of generations using CFG that the community has generated into the appendix? too bad we never set up a tipline for community members to send us what they played around with...

versed flax Sep 7, 2023, 9:23 PM

#

If only someone kept track of everything and lurked in the communities using CFG 😏

loud adder Sep 7, 2023, 9:24 PM

#

I don't see that hurting the paper's chances, but it's non-standard and I don't see it helping.

patent gull Sep 7, 2023, 9:25 PM

#

well i'm thinking of ways of saying "the community thought this was useful" ... showing it's been incorporated into major libraries, and including examples of grassroots adoption are ways of doing that?

loud adder Sep 7, 2023, 9:36 PM

#

I am not aware of adoption at a scale wherein it would significantly influence reviewers. I could be underestimating it, but off the top of my head papers that do that are things like VQGAN-CLIP (> 1 billion uses) or things like FSDP and trlX which are used in million-dollar model trainings

versed flax Sep 7, 2023, 9:44 PM

#

ngl it's not like it changed the world (yet?) the adoption is quite slow

fallow egret Sep 8, 2023, 6:44 AM

#

It might help in case the experimental section was thin. But the experimental section of this paper is so vast and extensive that it's hard to believe that it will add any positive points. The only claim I can see for a rejection is lack of novelty.

Regarding ICLR, it's of course a great conference. The negative part is the open-review process, which is tough and might mean that the top result on Google will be an old version or a rejection with ugly, bad reviews.

blissful garden Sep 8, 2023, 8:12 AM

#

fallow egret It might help in case the experimental section was thin. But the experimental se...

I actually like open review a lot better than the closed reviewing process in pure math journal submission. We have way more shitty reviews than one can imagine that are obviously biased and/or even personal. Very occasionally there are also questionable papers that get accepted in top journals very fast. I wish people could see the whole process for every submission. If a paper is objectively good, there is nothing to be afraid of. If there are fair points that need to be improved, we will just improve them.

fallow egret Sep 8, 2023, 8:29 AM

#

blissful garden I actually like open review a lot better than the closed reviewing process in pu...

I'm not against submitting to ICLR, and as a researcher I of course think that the open-review process is positive. However, as an author this format requires much more effort (there might be a full discussion with the reviewers + requests for a few draft versions), and you have the publicity that makes you think ten times about every sentence. So I think we should submit, but it is something that should be considered

unique sedge Sep 8, 2023, 8:33 AM

#

Have no strong opinions on submissions to conferences. On board with anything you choose 😄

wheat zenith Sep 9, 2023, 1:26 AM

#

versed flax Not that I'm aware of.

I'm worried I'm asking a dumb question and missing the obvious, so forgive me for commenting in your group research channel again. But I didn't understand this response and it's been bugging me. What was the reason you can't trade more than 2x compute time and possibly enable model capabilities or outputs you couldn't get just inferencing twice?

As a concrete example, with this transformers patch change you can use the negative prompt as a second positive prompt, and that seems like it is a useful tool. https://github.com/huggingface/transformers/pull/25339#issuecomment-1667814849 So at a minimum, wouldn't I then have to inference three times instead of two if I want to use that second positive guidance but also want to use negative guidance at the same time? Or is there some way of reducing or collapsing all the combinations back down to two steps?

Thanks for being so nice when I randomly barged in originally btw, I kind of missed the context of this channel being a semi-private group research spot in the excitement of the moment but everyone was exceptionally chill about it.

blissful garden Sep 9, 2023, 4:35 AM

#

wheat zenith I'm worried I'm asking a dumb question and missing the obvious, so forgive for c...

Yeah I guess you can take any linear combination of prompts. Not sure about exactly what comes out of it but people should feel free to explore. If there are 3 separate prompts, maybe indeed you will have to go through all of them at minimum.

fallow egret Sep 9, 2023, 4:56 AM

#

wheat zenith I'm worried I'm asking a dumb question and missing the obvious, so forgive for c...

I don't think there is any dispute that a linear combination will work, and there might be practical use cases. However, I think that from a research perspective it's not interesting, since by the properties of the sum (additivity, commutativity) you can split it into a sum of the positive part and the negative part. Since it is already known that a sum of different models' logits behaves as an ensemble method, and we know that the minus behaves as contrastive decoding, the expected result is clear. So this is why I think it will not be interesting from a research perspective (there is no novelty / new information that you can deduce from such experiments)
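To make the cost side of this exchange concrete, here is a minimal sketch of combining next-token logits from several prompts. It assumes the usual two-prompt CFG form (negative-prompt logits plus a guidance weight times the difference) and uses plain Python lists as stand-ins for real model outputs. The function names and all numbers are hypothetical, not the transformers or llama.cpp API.

```python
# Hypothetical helpers: each "row" stands in for the next-token logits
# produced by one forward pass of the model on one prompt.

def cfg_logits(cond, uncond, gamma):
    """Two-prompt CFG: uncond + gamma * (cond - uncond), element-wise."""
    return [u + gamma * (c - u) for c, u in zip(cond, uncond)]


def combine_logits(rows, weights):
    """General linear combination: sum_i weights[i] * rows[i]."""
    if len(rows) != len(weights):
        raise ValueError("need exactly one weight per prompt")
    vocab_size = len(rows[0])
    return [
        sum(w * row[j] for w, row in zip(weights, rows))
        for j in range(vocab_size)
    ]


def forward_passes_needed(rows):
    """Each distinct prompt needs its own forward pass per decoding step."""
    return len(rows)


# Toy logits over a 2-token vocabulary (made-up numbers).
positive_a = [2.0, 0.0]
positive_b = [0.5, 1.5]
negative = [1.0, 1.0]

# CFG with gamma = 1.5 is the linear combination 1.5 * cond - 0.5 * uncond.
two_prompt = cfg_logits(positive_a, negative, gamma=1.5)
same_thing = combine_logits([positive_a, negative], [1.5, -0.5])

# Two positive prompts plus one negative: three rows, so three passes.
three_rows = [positive_a, positive_b, negative]
three_prompt = combine_logits(three_rows, [1.0, 1.0, -1.0])
```

The weights `[1.0, 1.0, -1.0]` are only an illustration. The point is that the combined logits need one row per distinct prompt, so three prompts mean three forward passes per step unless two of them can be merged into a single prompt.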

wheat zenith Sep 9, 2023, 5:59 AM

#

blissful garden Yeah I guess you can take any linear combination of prompts. Not sure about exac...

Super helpful, thanks. Yeah I just wanted to make sure it was different, so there could be a reason you might actually want to do the extra work of 3x inferences, and there wasn't some underlying reason why it could always be collapsed down to just two. I lurk your research here a bit because you guys keep coming up with fascinating sampling and prompt concepts that are fun to even think about, what it means in a prompt or if it was 'working correctly'. Small code changes that open up tons of new prompt possibilities, and model outputs are *wildly* different. I'm not involved in research myself, it's just really fun to try your ideas and see what the heck comes out. 🙏 (I barely tried neg guidance in audio yet, still mostly unexplored, and just noticed you are thinking about CFG gen 2 already.)

steep bone Sep 10, 2023, 9:10 AM

#

Is this work similar to this ACL paper: https://arxiv.org/abs/2307.03214 ?

arXiv.org

PREADD: Prefix-Adaptive Decoding for Controlled Text Generation

We propose Prefix-Adaptive Decoding (PREADD), a flexible method for
controlled text generation. Unlike existing methods that use auxiliary expert
models to control for attributes, PREADD does not require an external model,
instead relying on linearly combining output logits from multiple prompts.
Specifically, PREADD contrasts the output logits ...

versed flax Sep 10, 2023, 10:15 AM

#

wheat zenith I'm worried I'm asking a dumb question and missing the obvious, so forgive for c...

Taking linear combinations works.
However, what I was saying is, we found that CFG is like a 2x model, don't think that 2 prompts = 2x, and N prompts = Nx. There's no link.

blissful garden Sep 10, 2023, 10:22 AM

#

steep bone Is this work similar to this ACL paper: https://arxiv.org/abs/2307.03214 ?

Yeah, it seems that the math is exactly the same. They seem to focus on the toxicity and sentiment control with negative prompt which is a bit different in terms of narratives. And... phew... I'm glad we have a better timestamp in terms of arxiv post date 😂

versed flax Sep 10, 2023, 11:27 AM

#

They don't cite us 😠😂

loud adder Sep 10, 2023, 1:53 PM

#

versed flax They don't cite us 😠😂

They actually predate us, but the ACL anon policy means they couldn't release it until later

#

The ACL submission deadline was in 2022

versed flax Sep 10, 2023, 2:05 PM

#

loud adder They actually predate us, but the ACL anon policy means they couldn't release it...

In case this needed to be made explicit: it was a joke 😀

loud adder Sep 10, 2023, 2:08 PM

#

Oh

#

🤦‍♀️

patent gull Sep 12, 2023, 5:50 PM

#

well definitely another paper to add to the related works!

patent gull Nov 11, 2023, 8:48 PM

#

@everyone @unique sedge @fallow egret

Hello everyone we got results back from ICLR.

We're right below the margin of comfort for acceptance. If 1 or more reviewers increases their score by 1, we will be MUCH more comfortable with our chances.

We've identified 2 small experiments we think have a great chance of increasing our scores:

1. show memory comparisons
2. show NLG controlled generation comparison

I think @versed flax already addressed #1. Does anyone have any bandwidth to address #2? I will work closely with you to do this

#

w.r.t. #2, here is guidance for an experiment.

SOTA controlled NLG baselines:

FUDGE: https://arxiv.org/pdf/2104.05218.pdf
NADO: https://arxiv.org/pdf/2205.14219.pdf

Experiments:

sentiment
formality

I think there are classifiers for both, I think the experiment can be "is CFG output classified as formal, via formality classifier vs. is NADO output classified as formal, via the formality classifier"
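A rough sketch of the comparison described here: generate with each method, run every output through the same attribute classifier, and compare the fraction labeled as having the attribute. The classifier below is a toy keyword stand-in and the example strings are invented, so the resulting numbers are placeholders, not results. All names are hypothetical.

```python
def attribute_rate(texts, is_attribute):
    """Fraction of generated texts the classifier labels as having the attribute."""
    if not texts:
        raise ValueError("need at least one generated text")
    hits = sum(1 for text in texts if is_attribute(text))
    return hits / len(texts)


def compare_methods(outputs_by_method, is_attribute):
    """Score every method's outputs with the same classifier."""
    return {
        name: attribute_rate(texts, is_attribute)
        for name, texts in outputs_by_method.items()
    }


def toy_formality_classifier(text):
    """Stand-in only: a real run would call a trained formality classifier."""
    return "dear" in text.lower()


# Invented outputs, purely to exercise the code path.
outputs = {
    "cfg": ["Dear colleagues, the report is attached.", "hey what's up"],
    "nado": ["Dear Sir or Madam, I am writing to inquire.", "yo, got a sec?"],
}
rates = compare_methods(outputs, toy_formality_classifier)
```

Using one shared classifier for every method keeps the comparison on equal footing; swapping in a sentiment classifier would cover the second experiment the same way.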

#

i think it's going to be difficult to show that CFG beats SOTA controlled NLG, because SOTA NLG assumes the presence of a classifier, which is a benefit of CFG that we don't need one, so we can do NLG beyond just formality and sentiment. But as long as we show it's not too different in these areas, that would be a nice result and might cause R2 to raise their score

versed flax Nov 11, 2023, 8:57 PM

#

I'll add that:

#1 will be addressed soon wrt to the memory question. It's a fair and important question. I ran the calculations necessary.
#2 is imho the hardest to address. His questions are totally outside of my comfort zone, so that's the thing I personally won't be able to tackle correctly
#3 gave us a 5 while being notably confused by the paper and thinking it was a training technique. Honglu and I think that if we fix his understanding and show him that it is indeed better than a training technique, we can get a better grade from him

patent gull Nov 11, 2023, 8:58 PM

#

that's doubtlessly true. but in terms of outlining what work we will do between now and 11/22, there's nothing to be done for R3 besides crafting a good argument

#

@unique sedge and @fallow egret if we can come together and address some of the actual work-items, then we raise our chances

#

p(score increase) = \sum_{reviewers} poisson(\lambda)

#

with a very, very low lambda

#

@loud adder @blissful garden any way to get access to some A100s to run some CFG runs to address #2?

blissful garden Nov 11, 2023, 9:16 PM

#

@patent gull @versed flax did you guys have access to SAI cluster?
Sadly the A40 pods are taken away from EAI afaik. We have some 4090 I think

#

@tepid gazelle Do you know what compute resources does EAI have right now? 4090 pods?

tepid gazelle Nov 11, 2023, 9:20 PM

#

blissful garden @tepid gazelle Do you know what compute resources does EAI have right now...

We have 2080s on CW, and A100s on SAI cluster

versed flax Nov 11, 2023, 9:21 PM

#

@blissful garden / @patent gull I have some CoreWeave instances with my job now. Depending on the duration of the experiments I can run them

#

can't give you access tho

blissful garden Nov 11, 2023, 9:24 PM

#

I have access to the SAI cluster. If you guys have code for small models I can scale it on the SAI cluster. Jobs can get preempted but half a day is usually not a problem

patent gull Nov 11, 2023, 9:25 PM

#

Ok I can set up some experiments for you to run @versed flax. I just feel like 2080s are going to be annoying if we want to run any CFG on any models beyond just llama 7b or something

blissful garden Nov 11, 2023, 9:25 PM

#

I have TPU v3 pods that I can share, but TPU is a different beast 😂

patent gull Nov 11, 2023, 9:28 PM

#

I have access to 2080s too… I can set up some experiments with smaller models and then pass ‘em off

fallow egret Nov 12, 2023, 4:00 PM

#

Hi, sorry for the late response. @patent gull I have bandwidth to work and help with whatever is needed.

fallow egret Nov 12, 2023, 5:11 PM

#

I read the reviews. I'm not sure how much this experiment will help (overall the experimental section is the strong part of the paper). It seems that the main concern (as expected) is the lack of novelty and contribution. I think we should think about a strategy for addressing this issue.
I think it will be important to address this issue and upload the rebuttal response as soon as possible so the reviewers will have a chance to give feedback and develop a discussion, because it will not be easy to convince them about the contribution (actually with this score we need to convince the AC).
I think there are two paths:

1. Differentiate our work from previous works (I think it's possible, we discussed it a lot a few months ago).
2. This is mainly for R3, who thinks that the experiment section was insightful: I think we should focus on the experimental contribution (this is a valid contribution, and reviewers sometimes forget the importance of a solid experimental paper).

versed flax Nov 12, 2023, 5:31 PM

#

fallow egret I read the reviews. I'm not sure how much this experiment will help (overall the...

R3 didn't read the paper, it's pretty clear. It shouldn't be hard to prove that the work is indeed novel (pointing out the fact that following the paper it was implemented in a lot of inference libs should be enough, if it were not novel, it would have already been there)

fallow egret Nov 12, 2023, 5:49 PM

#

versed flax R3 didn't read the paper, it's pretty clear. It shouldn't be hard to prove that ...

I don't think that integration in libraries is a valid claim for academic contribution. In the end there are indeed many previous works on decoding methods which seem to be equivalent to CFG (we know at least 3-4 works). The fact that they didn't release the code or bother to integrate it in big repos doesn't mean you have added value on top of their work.
I think that even if R3 didn't read the paper it is not going to be easy to convince the AC

patent gull Nov 12, 2023, 6:07 PM

#

my experience has been that directly addressing as many reviewers' concerns as possible is the best chance to increase the score

#

p( score increase) = \sum_{reviewers} p(reviewer score increase)

#

and in OpenReview we can respond to each reviewer individually

#

Yes @fallow egret , we should quickly craft a response to R3 and try to respond to all the intellectual points as soon as possible to encourage discussion. But that doesn't preclude us from also trying to run the experiments they ask for. In the end, it may not amount to anything

#

but if 1 reviewer increases their score by 1, then our paper has a much better chance

blissful garden Nov 12, 2023, 6:09 PM

#

fallow egret I don't think that integration in libraries is a valid claim for academic contri...

Yeah I totally agree with you in terms of not using lib integrations to back ourselves up. Mentioning these can easily backfire IMHO

patent gull Nov 12, 2023, 6:14 PM

#

if you or @unique sedge have bandwidth, it would be great to see if you can get NADO working for formality

NADO: https://arxiv.org/pdf/2205.14219.pdf

I already have FUDGE working for sentiment... would be pretty easy to complete all 2 x 2 after that, and then run CFG with formal prompts and sentiment-relevant prompts, and then evaluate

fallow egret Nov 12, 2023, 6:14 PM

#

on GPT-2?

#

It seems also that there are reproduction issues with their code:
https://github.com/MtSomeThree/constrDecoding/issues/4

GitHub

Result Reproduction& potential ethical issues · Issue #4 · MtSomeTh...

Hi, we have encountered difficulties in reproducing your work. Could you please provide us with the generated results on the CommonGen test dataset? Additionally, we noticed that you tested your mo...

patent gull Nov 12, 2023, 6:47 PM

#

thanks @fallow egret for checking this out!!!! GPT2 is what I was thinking, yeah

#

I can put you in touch with the primary authors — Sidi and Tau

#

or, I'll just reach out to them

fallow egret Nov 12, 2023, 6:48 PM

#

I think we can just use the code and it's their issue if the results are not great 🙂

patent gull Nov 12, 2023, 6:48 PM

#

also wait — there's no issue running the code, just reproducing the results?

#

yeah

#

that's what I'm thinking, too

#

we just report the results (if anything, we can footnote this issue, or something)

fallow egret Nov 12, 2023, 6:50 PM

#

sure, so I can take it. Let's sync on private message on the exact experiment (dataset, metric)

versed flax Nov 12, 2023, 8:00 PM

#

I can take care of answering R1 and the memory analysis he requested

versed flax Nov 12, 2023, 9:27 PM

#

We kinda establish that a model with CFG consumes 2x the FLOPs (2 forward passes) but still follows the perf / FLOP plot. So you kinda can train a half-size model and infer with CFG.
So the question is: is this tradeoff smart in inference as well, given that you use two cache lines with CFG, but 2x bigger models need to store more floats per token in cache (bc of the bigger hidden dim) and store 2x params?
I do the maths and show that it depends on your VRAM / intended cache size. For small models, the weights are negligible in VRAM, you can have big caches, and the double cache for CFG is not worth the 2x reduction in params. However, for LLMs, especially the very big ones (> 30B), the weights take a massive amount of memory and the 2x cache lines would only outweigh the param halving at very big amounts of VRAM
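For reference, a back-of-the-envelope sketch of the two quantities being traded off. It assumes a standard multi-head-attention decoder that caches one key and one value vector of size `hidden_dim` per layer per token, stored in fp16. The layer counts and hidden sizes below are hypothetical shapes for a model and a roughly 2x model, not the configurations used in the paper.

```python
BYTES_FP16 = 2


def kv_cache_bytes_per_token(n_layers, hidden_dim, bytes_per_value=BYTES_FP16):
    """Keys + values (factor 2) for every layer, one hidden_dim vector each."""
    return 2 * n_layers * hidden_dim * bytes_per_value


def weight_bytes(n_params, bytes_per_value=BYTES_FP16):
    """Memory taken by the parameters alone."""
    return n_params * bytes_per_value


# Hypothetical shapes: a base model and a model with about twice the params.
base_cache = kv_cache_bytes_per_token(n_layers=24, hidden_dim=2048)
bigger_cache = kv_cache_bytes_per_token(n_layers=32, hidden_dim=2560)

# CFG keeps two cache lines for the base model.
cfg_cache = 2 * base_cache

# Per token, the bigger model costs more than the base model,
# but less than two cache lines of the base model.
print(base_cache, bigger_cache, cfg_cache)
```

With these made-up shapes the per-token cache of the bigger model sits between one and two cache lines of the base model, which is the situation the message describes.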

versed flax Nov 12, 2023, 9:53 PM

#

I end with this chart

#

it reads like this: Say you have 10GB VRAM. For model sizes above the red line (up to 1B in this case), you should stick with vanilla models. The 2x cache line will outweigh the /2 param counts. Then, below the red line (1B and above), prefer deploying CFG: your VRAM isn't big enough to store a big cache, and the /2 param count is better

#

blissful garden Nov 12, 2023, 10:31 PM

#

Looks good

loud adder Nov 13, 2023, 12:15 AM

#

Hello, I'm back from getting married 🥰

ICLR reviews look decent. We're in the top 40% of papers by review score. Do we have a google doc for organizing our response yet? Or have we been doing it in this thread?

blissful garden Nov 13, 2023, 12:50 AM

#

loud adder Hello, I'm back from getting married 🥰 ICLR reviews look decent. We're in the ...

some notes are here
https://docs.google.com/document/d/1iDQaPl3BKmdOYLvvDrJwKZoZijeks4qJsxmkWCFmWVk/edit#heading=h.e4qpo5vysxsq
we might clean it up and use it to organize responses

blissful garden Nov 13, 2023, 12:55 AM

#

loud adder Hello, I'm back from getting married 🥰 ICLR reviews look decent. We're in the ...

Also some tl;dr and relevant messages in this thread

R1: #1111624010581680179 message
R2: some controlled NLG experiments we consider quickly doing: #1111624010581680179 message
R3: we are confused by what the reviewer wants but we wrote some draft responses in the google doc

versed flax Nov 13, 2023, 1:15 AM

#

loud adder Hello, I'm back from getting married 🥰 ICLR reviews look decent. We're in the ...

Hi Stella! congratulations 🥳 ! Hope you had an AMAZING wedding!

#

As you can see, we have started working on answers:

R3 is probably the easiest to convince since they barely understood the paper (my guess is that they thought "uh, CFG, not novel!" then barely skimmed the paper + weak understanding of CFG anyway ("training technique"?!?!)). Maybe R3 should be answered at a high level since their criticisms aren't that deep. The main point is convincing them of novelty. I don't know how to prove it besides 1) "trust me bro", or 2) "our work is novel. The proof is that the arxiv release was followed by implementations in major LLM inference engines => it wasn't already there", but people seem to agree that this is a bad defense and can backfire. Especially bc it seems people didn't really get how to use it, especially the neg prompt

R2 is totally out of my scope. Alex seems to know how to tackle his points.

R1 is addressed with the aforementioned analysis

I think we should probably answer tomorrow

loud adder Nov 13, 2023, 1:32 AM

#

versed flax

I think that this might make more sense with the axes switched? The VRAM seems like the more fundamental constraint to me, where you then maybe vary the params and move between the regions

versed flax Nov 13, 2023, 1:34 AM

#

loud adder I think that this might make more sense with the axes switched? The VRAM seems l...

Fair

#

I will need to triple check the maths, but the idea is here. Worst thing that can happen is that the slope changes a bit. Not much to worry.

patent gull Nov 13, 2023, 1:52 AM

#

hello @loud adder , congratulations!!!! i hope your wedding was amazing as well and wow, we weren't expecting to hear from you — don't you have a honeymoon or something?? I didn't know EAI was part of that hahaa

patent gull Nov 13, 2023, 2:03 AM

#

versed flax I will need to triple check the maths, but the idea is here. Worst thing that ca...

beautiful graph! alright i'll edit R3 and the response to R1.

We should also convert that to a table that we can copy/paste into the rebuttal. If I'm not mistaken, OpenReview doesn't let you upload images in your response, does it?

versed flax Nov 13, 2023, 2:04 AM

#

patent gull beautiful graph! alright i'll edit R3 and the response to R1. We should also c...

as long as you can put a link to imgur...

patent gull Nov 13, 2023, 2:04 AM

#

loud adder I think that this might make more sense with the axes switched? The VRAM seems l...

(I don't fully agree btw, I think parameter count is the dependent variable here. R1 asked for the effect of CFG on memory, so that implies that we study memory as a dependent var)

versed flax Nov 13, 2023, 2:05 AM

#

patent gull beautiful graph! alright i'll edit R3 and the response to R1. We should also c...

graph updated. Grey area represents a model too big to even fit on the amount of VRAM

patent gull Nov 13, 2023, 2:07 AM

#

ok, I just checked... no image uploads to OpenReview.

We can ask them to click a link, but shouldn't expect they will. We've all been through enough phishing videos.... Also, it's one more click

So we want numbers we can paste into the box as well for the quick headline, and then they can click if they want to see more

versed flax Nov 13, 2023, 2:07 AM

#

We can update the PDF tho. So we can put the figure in it.

patent gull Nov 13, 2023, 2:08 AM

#

yeah definitely. again, wouldn't expect the reviewer to check. I can't even get my advisor to read my updates...

loud adder Nov 13, 2023, 2:09 AM

#

patent gull (I don't fully agree btw, I think parameter count is the dependent variable here...

I said this having only skimmed the reviews based on what I would generally expect from a plot like this.
The dependent variable is the y-axis...
That said, the actual dependent variable here is the memory usage. I assume you either misspoke or got the words confused, but ultimately you're correct.

I'll read the review in question again and if your characterization of the request is correct I agree the original format likely makes sense.

versed flax Nov 13, 2023, 2:10 AM

#

new figure version reads like:

y axis interpretation: if you have 10GB of VRAM, serve vanilla models up to 1B. Then, serve with CFG. 5B and above => can't fit.
x axis reading: say you have a 1B. You need at least 2GB to serve it. Up to 10GB, serve with CFG. Then, you'd be better serving an actual 2B

patent gull Nov 13, 2023, 2:15 AM

#

whoops!! yes, I meant parameter count is the independent variable and vram is the dependent var.

@versed flax what is the green "vanilla" writing supposed to be aligned with?

versed flax Nov 13, 2023, 2:15 AM

#

patent gull whoops!! yes, I meant parameter count is the independent variable and vram is th...

it just shows the upper triangle. The wording ain't great as well.

loud adder Nov 13, 2023, 2:15 AM

#

patent gull whoops!! yes, I meant parameter count is the independent variable and vram is th...

It's a label for the region as a whole

patent gull Nov 13, 2023, 2:17 AM

#

i see... so the diagonal lines are lower bounds based on parameter count? and there's no upper bound because data tensors can take up VRAM?
if i'm just not understanding, but everyone else is, it's ok, we can move on

loud adder Nov 13, 2023, 2:18 AM

#

The grey shaded region is the region where the model doesn't fit within the specified VRAM

versed flax Nov 13, 2023, 2:18 AM

#

patent gull i see... so the diagonal lines are lower bounds based on parameter count? and th...

You have a lower bound: model too big => can't fit => failure.
You have no higher bound => more VRAM means you can fit a bigger and bigger kv cache

loud adder Nov 13, 2023, 2:19 AM

#

Let me see if I can explain the plot (since I'm not actually sure I'm following 100%)

#

The question is whether our claims about "matching larger models" remain true if we care about VRAM (w/ k-v caching) rather than # params
The red line is the Pareto optimal frontier as you trade off # params vs VRAM

I'm confused about what the blue dots are though.

#

This is specifically a response to

R1: memory cost analysis is recommended. The proposed method requires a second run of the model, which may increase the memory cost (for example, the key-value cache).

versed flax Nov 13, 2023, 2:25 AM

#

If I understand correctly what you mean, yes.
Yes. Serving with CFG costs more kv cache but less params, and (kinda) gives you the performance of a model twice the size. So, below the red line, you should serve with CFG, above it, you should serve an actual 2x model (without CFG). If you want to maximize the amount of tokens you fit in your kv cache, that is.

#

Blue dots are just the actual param count / max kv cache size for the models in the paper (gpt2-*, pythia-*, llama-*). Since they have variation in arch and there's a little alignment to 64 at play in the hidden dim, they don't exactly fall onto the red line.

Yes, this is meant to answer that remark from R1.

loud adder Nov 13, 2023, 2:32 AM

#

Side note: looking over the paper again the misalignment between plots and where they're referenced in the text is very distracting

patent gull Nov 13, 2023, 2:32 AM

#

i have to stare at this some more. so, below the red line (vertically), you have enough excess memory, but not so much, so you can afford to serve the same size model with CFG. Below the green line, that size model won't fit. Above the red line, you have so much extra memory that you should just serve a bigger model?

#

I don't think I fully understand, but I think the y-axis label could be improved "VRAM at Equality". Equality to what?

versed flax Nov 13, 2023, 2:33 AM

#

patent gull i have to stare at this some more. so, below the red line (vertically), you have...

Yes

patent gull Nov 13, 2023, 2:34 AM

#

ok maybe the region in between red and green can be shaded light green for "go"?

versed flax Nov 13, 2023, 2:35 AM

#

patent gull I don't think I fully understand, but I think the y-axis label could be improved...

The wording is terrible and GPT-4 copied it from my terrible csv. It's labeled "equality" bc on this line you can fit a kv cache of equal size whether you choose to serve with CFG or serve a 2x model

patent gull Nov 13, 2023, 2:35 AM

#

the region below the green line: "gray" is fine, but "red" for "stop" is also OK. Above the red line can be light blue. And in a legend, or in the caption, we can define what each of these colors means. The reality is that there are 3 separate regions here, not just two

versed flax Nov 13, 2023, 2:36 AM

#

yes, let me GPT4 this rn

patent gull Nov 13, 2023, 2:36 AM

#

hahaha

#

ax.fill_between()

versed flax Nov 13, 2023, 2:36 AM

#

nah doesn't burn enough CO2

patent gull Nov 13, 2023, 2:39 AM

#

lolll

so just looking at one vertical line:
at parameter count = 1B, we intersect with the green line at ~~1.1~~ 2 GB VRAM (green) and 10 GB (red, and blue dot)

#

does that mean that for a model with 1B parameters, vanilla costs us ~~1.1~~ 2 GB and CFG costs us 10? so 5x as much? that seems high to me

versed flax Nov 13, 2023, 2:40 AM

#

patent gull lolll so just looking at one vertical line: at parameter count = 1B, we interse...

(2GB*, log scale)

versed flax Nov 13, 2023, 2:41 AM

#

patent gull does that mean that for a model with 1B parameters, vanilla costs us ~~1.1 ~~ 2 ...

I don't understand the wording. Let me try again.

patent gull Nov 13, 2023, 2:41 AM

#

at param count (X) = 1B, I see green line intersect VRAM (y) at the 2 GB y-tick, and red line at the 10 GB y-tick

patent gull Nov 13, 2023, 2:42 AM

#

versed flax I don't understand the wording. Let me try again.

oh duh lol my bad

patent gull Nov 13, 2023, 2:43 AM

#

loud adder Side note: looking over the paper again the misalignment between plots and w...

meaning like a Figure will be at Page 5, but it will be referenced at page 2?... yeah we should do a better job at shuffling them around

#

I guess what I thought the reviewer was looking for is performance vs. VRAM for CFG vs. vanilla.

Just like Fig 11:

#

versed flax Nov 13, 2023, 2:52 AM

#

You have X amount of VRAM. You want to use it all and serve efficiently. So you'll store the model weights, and use the rest for a kv cache. You want that kv cache to fit as many tokens as possible.
So you have 3 options:

1. Serve your model as is. Boo, lame, boring. So you fill your mem with params P + cache cost per token C * cache size S. This S is the only variable, and you want to maximize it.
2. You're a chad and you want to DOUBLE THE PERFORMANCE! and you've read about this CFG paper. But now you're using 2C per token. So you use your VRAM as P + 2C * S.
3. You wonder whether you shouldn't directly serve a 2x bigger model with 2P params and a slightly bigger cache cost C' (C prime), but C' < 2C. Your VRAM is used with 2P + C' * S

At some point, if you can fit a big S, most of your VRAM will store the cache, and you really want a smaller cache footprint. But if your model is big, the parameters will dominate in VRAM, you can't store a big S, and you'll want to reduce the parameter memory footprint. So what's the decision boundary? Red line, decided as S = P / (2C - C')
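The decision boundary above can be sketched in a few lines. Setting the CFG budget P + 2C * S equal to the 2x-model budget 2P + C' * S gives S = P / (2C - C'), and the VRAM at that point is what the red line plots. The numbers below are hypothetical (a 1B-parameter fp16 model with invented per-token cache costs), chosen only to exercise the formula.

```python
def break_even_cache_tokens(P, C, C_prime):
    """Cache size S where P + 2*C*S == 2*P + C_prime*S."""
    gap = 2 * C - C_prime
    if gap <= 0:
        raise ValueError("a break-even point only exists when C' < 2C")
    return P / gap


def vram_at_equality(P, C, C_prime):
    """VRAM at which both options fit exactly the same number of tokens."""
    S = break_even_cache_tokens(P, C, C_prime)
    return P + 2 * C * S


def max_tokens_cfg(vram, P, C):
    """Tokens that fit when serving the model with CFG (two cache lines)."""
    return max(0.0, (vram - P) / (2 * C))


def max_tokens_2x(vram, P, C_prime):
    """Tokens that fit when serving a real 2x model without CFG."""
    return max(0.0, (vram - 2 * P) / C_prime)


def preferred_option(vram, P, C, C_prime):
    """Pick whichever option leaves room for the larger kv cache."""
    cfg = max_tokens_cfg(vram, P, C)
    big = max_tokens_2x(vram, P, C_prime)
    return "cfg" if cfg >= big else "2x model"


# Hypothetical numbers, all in bytes: 1B params in fp16, made-up cache costs.
P = 2e9
C = 100_000
C_prime = 150_000

S_star = break_even_cache_tokens(P, C, C_prime)
V_star = vram_at_equality(P, C, C_prime)
```

Below `V_star` the CFG option fits more tokens, above it the real 2x model does. When `vram` is smaller than P neither option fits anything and the function still returns "cfg", so a caller would want to check for the "can't fit" region separately.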

#

is it clearer @patent gull ?

#

can I go to sleep? .___.

patent gull Nov 13, 2023, 2:57 AM

#

lolll i'm still parsing

versed flax Nov 13, 2023, 2:57 AM

#

4am 🤡

patent gull Nov 13, 2023, 2:58 AM

#

you can go to sleep lol but what do you think of my prev post?

#

about replicating Fig 11? that's what I thought the reviewer was asking for

versed flax Nov 13, 2023, 2:59 AM

#

R1 explicitly mentions KV cache. It's an inference question. I'm not sure I can see another way of interpreting the question

#

But if you have one, please explain

patent gull Nov 13, 2023, 2:59 AM

#

for filling the same KV-cache budget, what is your accuracy with CFG vs. a bigger model?

#

parallel to Fig 11. For the same FLOPs budget, we show accuracies on vanilla vs. CFG

versed flax Nov 13, 2023, 3:00 AM

#

so, like, you want to store 2k tokens in your kv cache, what's your best strategy?

patent gull Nov 13, 2023, 3:01 AM

#

ultimately, the user doesn't care about "what's the biggest model I can fit"... the user cares about "what's the maximal accuracy I can get with a fixed budget"

patent gull Nov 13, 2023, 3:01 AM

#

versed flax so, like, you want to store 2k tokens in your kv cache, what's your best strateg...

yah.... I have VRAM X, does P + 2C * S give me better accuracy, or does 2P + C' * S?

#

I'm assuming P + 2C * S will, because that means a slightly bigger model. maybe i'm contradicting myself earlier when I said VRAM was dependent variable

versed flax Nov 13, 2023, 3:04 AM

#

patent gull yah.... I have VRAM X, does P + 2C * S give me better accuracy, or does 2P + C'...

well then it depends on how big you want your kv cache to be, I guess

#

lemme think

#

like, you have 30GB. If you only care about perf, then that's a no brainer, use a 15B+CFG (fp16, so 15B => 30GB). You'll match the perf of a 30B without needing the actual 60GB. But you'll have a cache size = 0. Dumb dumb.

patent gull Nov 13, 2023, 3:10 AM

#

ummm actually I thought this was dataset dependent.. i thought for each dataset, there's a max-size datapoint, so we scale KV cache to that, and then we can maximize model size

#

but honestly, i'm very green to this kind of engineering work so I dumb

versed flax Nov 13, 2023, 3:11 AM

#

nah, kv cache is model dependent. It's your context_len (model dependent) * num_cache_lines (how many sequences do you want to cache when serving)

patent gull Nov 13, 2023, 3:14 AM

#

ah right. ok... if someone who is smarter than me can look at that graph and make a meaningful decision about which model to choose, then i will believe you haha, i just can't summarize it myself.

versed flax Nov 13, 2023, 3:14 AM

#

dear fuckin Yann LeCun I'm realizing how much I actually learned about LLMs since I switched job

versed flax Nov 13, 2023, 3:15 AM

#

patent gull ah right. ok... if someone who is smarter than me can look at that graph and mak...

there's not a unique answer to "given my amount of VRAM, what model do I choose?" because you have to trade off the amount of VRAM you dedicate to the params and the VRAM for your kv cache.

#

That's the same in training, which you may be more familiar with

#

You can't answer "what model size do I train for my amount of VRAM?" because it also depends on the tradeoff you're willing to do on your batch size

patent gull Nov 13, 2023, 3:17 AM

#

but for inference especially, can't we assume num_cache_lines = K (some constant, preferably for simplicity's sake, K=1)?

#

then, since your KV cache is upper bounded by the model's sequence length, can't you make a decision:

model m + CFG
model m'
based off of accuracy and the maximal amount of parameters that will fit in the cache?

versed flax Nov 13, 2023, 3:19 AM

#

patent gull but for inference especially, can't we assume num_cache_lines = K (some constant...

if you run your own chatbot for yourself, then, yeah, ok num_cache_line=1 is fair (for now, but in a near future you'll want to run N concurrent instances because your LLMs will run different programs, so you'll want N cache lines etc)

#

but if you run a big data center with millions of users like OpenAI, you absolutely can't decide num_cache_line=1, that's basically dedicating 1 GPU per person, that's insane

patent gull Nov 13, 2023, 3:20 AM

#

versed flax You have X amount of VRAM. _You want to use it all and serve efficiently._ So yo...

ok wait i did parse this finally

versed flax Nov 13, 2023, 3:21 AM

#

I'm sorry my English is just complete trash. I just shouldn't be allowed to speak.

patent gull Nov 13, 2023, 3:21 AM

#

no lol you're good, it's really not your fault, that was a very clear answer

#

but S is upper-bounded by the model's sequence length, right?

#

it's not just \in {0, \infty}, right?

versed flax Nov 13, 2023, 3:22 AM

#

in a non hypothetical scenario, S is a multiple of your ctx len

#

S = ctx_len * num_concurrent_cache_lines

#

in a cloud setting for instance, each user gets ctx_len cached token. So you'll allocate one cache line for user Alex, another cache line for user Stella, another one for user Honglu and so on
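A rough sketch of how `S = ctx_len * num_concurrent_cache_lines` turns into a serving budget, assuming byte units and a single per-token cache cost. The helper names and all numbers are hypothetical.

```python
def cache_tokens(ctx_len, num_cache_lines):
    """Total cached tokens S for a given number of concurrent cache lines."""
    return ctx_len * num_cache_lines

def max_cache_lines(vram, P, cost_per_token, ctx_len):
    """Whole cache lines that fit in the VRAM left after loading the weights."""
    free = vram - P
    if free <= 0:
        return 0
    return free // (cost_per_token * ctx_len)

# Hypothetical setup: 24 GB of VRAM, 14 GB of weights, 2048-token context.
VRAM = 24_000_000_000
P = 14_000_000_000
C = 500_000      # made-up cache bytes per token, vanilla
CTX_LEN = 2048

vanilla_lines = max_cache_lines(VRAM, P, C, CTX_LEN)
cfg_lines = max_cache_lines(VRAM, P, 2 * C, CTX_LEN)  # CFG doubles the per-token cost
print(vanilla_lines, cfg_lines)  # 9 4
```

With weights that already fill the card (for example a 15B fp16 model on 30 GB), `max_cache_lines` returns 0, which is the "cache size = 0" case mentioned earlier.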

patent gull Nov 13, 2023, 3:27 AM

#

there's gotta be a way we can make a better argument than "it depends"

versed flax Nov 13, 2023, 3:27 AM

#

there's really not

#

you want to serve millions of users with 1 GPU? Serve pythia-14M.

#

you want to serve 1 user = 1 GPU? Serve a big model

patent gull Nov 13, 2023, 3:29 AM

#

lol. yeah but this is research... we don't have to consider 1 million users

versed flax Nov 13, 2023, 3:29 AM

#

You want to go bankrupt? Serve 1 user = 8 GPUs.

versed flax Nov 13, 2023, 3:29 AM

#

patent gull lol. yeah but this is research... we don't have to consider 1 million users

hard disagree

#

scaling is all the rage

patent gull Nov 13, 2023, 3:30 AM

#

ok... 1 user, fixed VRAM. which model do i choose?

versed flax Nov 13, 2023, 3:30 AM

#

easy. the biggest that fits, with CFG

patent gull Nov 13, 2023, 3:30 AM

#

ok WITH cfg

#

that's great. how do you know?

versed flax Nov 13, 2023, 3:31 AM

#

because it'll give you the performance of a model that should be twice as big

patent gull Nov 13, 2023, 3:32 AM

#

but I have 2x the cache, so i should be able to serve a model MORE than twice as big with the same memory constraint, right?

versed flax Nov 13, 2023, 3:32 AM

#

and since your kv cache size will be super negligible bc you just want 1 cache line, you don't have to worry about 2C being greater than C', because (2C - C') * S <<< P, since S is so small (assuming you don't have one of those crazy models with 100k ctx len ofc lol)

patent gull Nov 13, 2023, 3:35 AM

#

sigh. ok i'm naive and i don't typically have my head in this space, but i'm gonna say something super high-level and dumb — I feel like there's a way we can fix certain variables and make a better argument about "here's the model we choose to maximize accuracy".

But if charts like these are actually super typical and we can reasonably expect the reviewer to interpret it correctly, then great.... @blissful garden any thoughts?

#

basically, all i'm saying is that we have to plan for the reviewer having the attention span of a goldfish, and if we can't convince them in that timespan, we're not getting a score boost

#

a sentence like "for fixed VRAM, CFG delivers 130% the performance" checks that box for me

#

as a goldfish myself

#

I can speak for other goldfish

versed flax Nov 13, 2023, 3:38 AM

#

Ok my argument to R1 is "You're raising a good point, we did the maths, and there's a tradeoff. In certain scenarios where you want to serve big models you'd better run inference with CFG than run a 2x model"

patent gull Nov 13, 2023, 3:39 AM

#

can you be explicit about what those "big model" scenarios are? >1B parameters?

versed flax Nov 13, 2023, 3:39 AM

#

depends on your vram lol

patent gull Nov 13, 2023, 3:39 AM

#

and "you'd better run inference with CFG" because why, higher accuracy?

#

ok fix a VRAM

versed flax Nov 13, 2023, 3:41 AM

#

we can add: "as an example, if you have 10GB of VRAM, models up to 1B should be served as is, but 1B to 5B models should be served with CFG"

patent gull Nov 13, 2023, 3:43 AM

#

ok — "1B to 5B models should be served with CFG because, e.g.) for a 2B model + CFG you get better performance than a 4B model, which takes up the same VRAM"?

#

i think we're getting there imo

#

your example is good

#

and perfect, something textual we can put in the reviewer response

versed flax Nov 13, 2023, 3:44 AM

#

"1B to 5B models should be served with CFG because, e.g.) for a 2B model + CFG you can fit a bigger kv cache than a 4B model, for comparable performance"

#

fixed

#

plz I rly need to sleep, I have 5h of sleep remaining

patent gull Nov 13, 2023, 3:48 AM

#

ok sure

versed flax Nov 13, 2023, 3:48 AM

#

we good?

patent gull Nov 13, 2023, 3:48 AM

#

go to sleep

#

we can talk more tomorrow. i don't understand why "bigger kv cache" is the dependent variable here

versed flax Nov 13, 2023, 3:49 AM

#

enough CO2 burnt

patent gull Nov 13, 2023, 3:49 AM

#

"1B to 5B models should be served with CFG because, e.g.) for a 2B model + CFG you can fit a bigger kv cache than a 4B model, for comparable performance"

->

"1B to 5B models should be served with CFG because, e.g.) for a 2B model + CFG, with the same KV cache size you can get X% more performance"

?

versed flax Nov 13, 2023, 3:50 AM

#

what do you call performance?

patent gull Nov 13, 2023, 3:50 AM

#

same thing I thought you were calling performance — accuracy on the benchmarks

#

same as figure 11

versed flax Nov 13, 2023, 3:50 AM

#

yes, totally

#

so why having a {bigg,small}er kv cache would have any impact on that?

patent gull Nov 13, 2023, 3:52 AM

#

ummm you can go to sleep we can talk tmrw

versed flax Nov 13, 2023, 3:52 AM

#

ok cool

blissful garden Nov 13, 2023, 8:07 AM

#

versed flax in a cloud setting for instance, each user gets ctx_len cached token. So you'll ...

What? Why does each user provide fixed amount of tokens for the models and why is there this weird 'num of cached lines'? When serving the model isn't there a distributed messaging queue and async workers grab dynamically sized inputs and do batching before assigning it to models?

blissful garden Nov 13, 2023, 8:42 AM

#

also, since we can revise the paper, I wonder if we should add a super short subsection or subsubsection explaining the challenges of applying CFG in language domain and why it doesn't work verbatim. We could address R3 blah blah blah, and look, here is a new short paragraph explaining that we are not applying existing technique trivially.

versed flax Nov 13, 2023, 9:18 AM

#

blissful garden What? Why does each user provide fixed amount of tokens for the models and why i...

Why does each user provide fixed amount of tokens
Don't overthink it. They were just an illustration of concurrent runs.

<rest of the message>
That's how the queueing system works, not how the cache itself, the big tensor of size (num_cache_lines, 2, num_layers, num_heads, hidden_dim), works. (The tensor might or might not be explicit in the code, but in the end, this is how the VRAM will be allocated for the cache.)

loud adder Nov 13, 2023, 11:36 AM

#

blissful garden also, since we can revise the paper, I wonder if we should add a super short sub...

Yes, we should 100% do this

versed flax Nov 13, 2023, 12:34 PM

#

blissful garden also, since we can revise the paper, I wonder if we should add a super short sub...

Are we trying to make us look genius because we apply cfg on the model output instead of the model output ( 🤡 ), but our model output is logits rather than regression?

fallow egret Nov 13, 2023, 3:29 PM

#

Updating on FUDGE:
I implemented the method with a shallow sentiment classifier (65M parameters):
https://huggingface.co/docs/transformers/tasks/sequence_classification

The problem is that the running time is extremely slow (since you need to run the classifier on 200 samples for each token). With max tokens 20 (which is not enough), it takes 67 sec per sample.
Which means that running it on ~500 samples will take ~9h (and for the full dataset, which contains 25k samples, it takes ~465h).
We also need to run multiple experiments (there is a guidance hyperparameter).
Any ideas?
cc @patent gull
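A quick sanity check of the runtime estimate, extrapolating the 67 s/sample figure linearly. The helper name is made up, and this assumes the per-sample time stays constant.

```python
def total_hours(num_samples, seconds_per_sample):
    """Wall-clock hours for a sequential run at a fixed per-sample cost."""
    return num_samples * seconds_per_sample / 3600

print(round(total_hours(500, 67), 1))     # 9.3
print(round(total_hours(25_000, 67), 1))  # 465.3
```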


blissful garden Nov 13, 2023, 4:05 PM

#

fallow egret Updating on FUDGE: I implement the method with some shallow sentiment classifier...

do you have the script somewhere? I can see if I can scale it in the SAI cluster

fallow egret Nov 13, 2023, 4:11 PM

#

Very ugly, but it should be correct

📎 zyn_test_exp.py

blissful garden Nov 13, 2023, 4:47 PM

#

fallow egret Updating on FUDGE: I implement the method with some shallow sentiment classifier...

what's the command to run this script on the full dataset for a particular cfg? I can spin up that many nodes and for each maybe shard the model to 8 GPUs so that it's faster

#

oh so the model is fixed to gpt2-medium? Or should the --model-name argument be used somewhere

fallow egret Nov 13, 2023, 4:50 PM

#

We can change it, but we decide to do this experiment with GPT2-medium

#

The current run is with the default 1 guidance

#

Let me add it as a parameter and clean a little bit the code

blissful garden Nov 13, 2023, 4:52 PM

#

fallow egret Let me add it as a parameter and clean a little bit the code

yeah sounds good. It runs well. So I remove the [:300] for the full run right?

fallow egret Nov 13, 2023, 4:52 PM

#

Yes, exactly

#

you need me to clean the code?

blissful garden Nov 13, 2023, 4:52 PM

#

fallow egret you need me to clean the code?

yeah go ahead. I will think about how to scale this bad boy.

fallow egret Nov 13, 2023, 4:53 PM

#

👍

#

@patent gull Please verify that we are fine with this experiment (models/dataset)

blissful garden Nov 13, 2023, 4:54 PM

#

Wow 1/24936 [01:50<762:33:17, 110.09s/it] lol 😂

fallow egret Nov 13, 2023, 4:55 PM

#

lol, yes. The algorithm is a disaster from a computation perspective. I don't understand why it's even considered as a valid option

#

For every generated token you need to run the classifier on 200X number of samples in the batch

blissful garden Nov 13, 2023, 5:00 PM

#

does it generate in batches or just 1 token at a time?

fallow egret Nov 13, 2023, 5:03 PM

#

1 token a time. But I'm not sure it will help since in any case the bottleneck is running the classifier

blissful garden Nov 13, 2023, 5:06 PM

#

classifier can also run in batches I guess?

fallow egret Nov 13, 2023, 5:08 PM

#

The classifier is running in batch

blissful garden Nov 13, 2023, 5:08 PM

#

how hard is it to vectorize everything with a large batch size?

#

I see the vram isn't fully used

fallow egret Nov 13, 2023, 5:09 PM

#

This is the point, that it's already big batch of 200 and if you increase the number of batch to N then you need to run it in batch of 200*N (which means that in practice you will not be able to run a big batch)

#

Should not be a big deal, I can do it

#

Ok, so I cleaned the code and pulled the guidance scale out as a parameter

📎 zyn_test_exp.py

blissful garden Nov 13, 2023, 5:15 PM

#

oh so the classifier does it for the top 200 tokens for each generation step, is that right? Sorry I only start to understand it right now

fallow egret Nov 13, 2023, 5:19 PM

#

Yes. What they are doing is simply Classifier guidance.
The problem is that in order to do it you need to run the classification on every possible next token. In the paper they 'compromise' on taking the top 200 🙂
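A toy sketch of one decoding step of classifier guidance over the top-k candidates, to show where the cost comes from. This is not the FUDGE code or the script shared above: `attribute_logprob` is a hypothetical stand-in for the sentiment classifier, and the combination rule (LM log-prob plus scaled classifier log-prob, then renormalize) is the usual way classifier guidance is written.

```python
import math

def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]

def classifier_guided_step(logits, prefix, attribute_logprob, top_k=200, scale=1.0):
    """Reweight the top_k next-token candidates by a classifier.

    Returns (token_ids, probs). The classifier is called once per candidate,
    which is the 200x-per-token cost discussed above.
    """
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    lm_logp = log_softmax(logits)
    combined = [lm_logp[i] + scale * attribute_logprob(prefix + [i]) for i in top_ids]
    probs = [math.exp(v) for v in log_softmax(combined)]
    return top_ids, probs

# Hypothetical classifier: token 1 is "positive", everything else is not.
calls = []
def toy_attribute_logprob(tokens):
    calls.append(tokens)
    return math.log(0.9) if tokens[-1] == 1 else math.log(0.1)

ids, probs = classifier_guided_step([2.0, 1.0, 0.5, 0.0, -1.0], [7, 7], toy_attribute_logprob, top_k=3)
print(ids, len(calls))  # [0, 1, 2] 3
```

Batching N prompts multiplies the classifier batch to `top_k * N`, which is why a larger generation batch does not help much.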

blissful garden Nov 13, 2023, 5:19 PM

#

lol this is crazy

versed flax Nov 13, 2023, 5:20 PM

#

WELL I MEAN

#

If that is the current way of doing things, I say we already have a GODDAM strong argument for CFG, even if our scores are lower

blissful garden Nov 13, 2023, 5:24 PM

#

fallow egret Yes. What they are doing is simply Classifier guidance. The problem is that in o...

if we do N generations with 200*N for the classifier, it could fill up the vram but not sure how much faster it gets. Also generating 25k samples is probably more than necessary. Maybe we should just do first 100-300 samples and change a handful of bigger models with a couple different cfg......

fallow egret Nov 13, 2023, 5:40 PM

#

versed flax If _that_ is the current way of doing things, I say we already have a GODDAM str...

I agree, this is exactly what I told @patent gull

fallow egret Nov 13, 2023, 5:40 PM

#

blissful garden if we do N generations with 200*N for the classifier, it could fill up the vram ...

Yes, this is why I choose 300, I think it's legit

patent gull Nov 13, 2023, 5:42 PM

#

Let me take a look. Just waking up now

fallow egret Nov 13, 2023, 5:42 PM

#

I'm pretty sure we will get better results, since the classifier is crappy (if we take a stronger one, then the running time will be infinite, and in any case in their paper they suggest using a weak classifier)

patent gull Nov 13, 2023, 5:42 PM

#

Apologies for the delay

patent gull Nov 13, 2023, 6:08 PM

#

that looks great to me. what's the dataset?

#

also i think to compare apples-to-apples, we might want to make sure both variations see the same input. And CFG is probably going to see an input like "Write a happy response" or something

#

So i would prepend every example in the dataset with "I'm feeling happy today. <input sentence>"

#

or something. @versed flax @blissful garden any ideas for a good prompt that captures sentiment for a non instruction-tuned model? I know you played around a bit with this, @versed flax

blissful garden Nov 13, 2023, 6:11 PM

#

patent gull or something. @versed flax @blissful garden any ideas for a good p...

for non-instruction-tuned models, story completion is usually the way to go because a lot of pretraining data has those stuff from books, blogs or whatever

versed flax Nov 13, 2023, 6:12 PM

#

Following the previous night I am absolutely exhausted and today was mostly dedicated to surviving. I will be unable to do good work and must delay my answer to R1 to tmrw.

versed flax Nov 13, 2023, 6:12 PM

#

patent gull or something. @versed flax @blissful garden any ideas for a good p...

"Today, Parisian celebrated"

fallow egret Nov 13, 2023, 6:13 PM

#

The dataset is imdb. For each review sample removing the last 64 words.
The idea is to follow:
https://github.com/vicgalle/zero-shot-reward-models/
And use their classifier (with Flan-T5) for evaluation


patent gull Nov 13, 2023, 6:13 PM

#

no problem haha

#

ok if it's movie reviews, then I would prepend the phrase "I enjoyed this movie. <prompt>..."

fallow egret Nov 13, 2023, 6:16 PM

#

Do you want to do it also on the FUDGE experiment?!

patent gull Nov 13, 2023, 6:17 PM

#

that's my thinking, yeah, otherwise p(xi | x<i) is different across CG vs CFG .... how do we know that CFG worked, compared to just adding that prompt changed the sentiment anyway?

fallow egret Nov 13, 2023, 6:18 PM

#

Yes, I see

#

Ok, so we should decide on the prompt before collecting the FUDGE results

patent gull Nov 13, 2023, 6:19 PM

#

yeah.. when i get to the office, i can try out some different prompts with CFG and see what seems to be working

fallow egret Nov 13, 2023, 6:28 PM

#

@blissful garden anything else is needed from my side?

blissful garden Nov 13, 2023, 6:33 PM

#

fallow egret @blissful garden anything else is needed from my side?

It's good so far. I will play with it tonight

fallow egret Nov 13, 2023, 6:34 PM

#

Ok, thanks!

patent gull Nov 13, 2023, 8:07 PM

#

Aw man we missed an opportunity. Maybe we would’ve gotten higher scores if we named our paper “All you need is CFG for LLMs with applications in ChatGPT based on Diffusion”

#

https://x.com/baaadas/status/1723631321677984255?s=46&t=u8QKCW7dBqQIph3hCSUiNA

patent gull Nov 13, 2023, 8:56 PM

#

@fallow egret where did you find that model? is it a recommended one for sentiment analysis? the model card says it was trained on an "unknown dataset"

blissful garden Nov 13, 2023, 8:58 PM

#

we can probably swap in a better one. Is there a standard one for sentiment analysis? I don't know much about this field.

patent gull Nov 13, 2023, 8:58 PM

#

i don't know, either. i see that it is the example one used in the HF tutorial on sentiment, but it looks out of date, since in the tutorial, that model returns "POSITIVE" and "NEGATIVE" labels https://huggingface.co/docs/transformers/tasks/sequence_classification#inference

#

it's trained on IMDB, though, so ideally it is in-domain

fallow egret Nov 13, 2023, 9:01 PM

#

patent gull @fallow egret where did you find that model? is it a recommended one fo...

Yes, it was a 'tutorial' model. There are of course much better models, but the problem is that using stronger model will significantly increase the computation

#

Also in the paper they emphasize that the classifier should be shallow compared to the base model

patent gull Nov 13, 2023, 9:02 PM

#

i see, ok SGTM, then

#

Yeah, it scored highly on IMDB, which is the dataset we're using

#

but just to be clear on the experiment —

at first I was thinking that we were going to use the same classifier that we use in CG to evaluate the outputs of both CFG, and CG?

#

or do you think we should use a different classifier for evaluation?

fallow egret Nov 13, 2023, 9:04 PM

#

Yes, I think it should be different than the CG model (stronger model)

patent gull Nov 13, 2023, 9:05 PM

#

ok. i see the arguments for and against. If CG does badly with a different classifier, someone could just argue "well, you chose a purposely bad classifier"

fallow egret Nov 13, 2023, 9:05 PM

#

But this is one of the FUDGE limitations... You can't use a strong model as the classifier

patent gull Nov 13, 2023, 9:06 PM

#

ok cool, makes sense

#

also on the experimental design, I see that we are using CG just to make things "positive"?

fallow egret Nov 13, 2023, 9:07 PM

#

Yes, this is why using the same classifier is completely unfair (the objective is to make it 'positive' according to this classifier)

patent gull Nov 13, 2023, 9:08 PM

#

final_res.append(t['score'] if (t['label'] == 'LABEL_1') else (1 - t['score']))

i'm thinking that a more interesting objective would be to try to flip the label?

E.g. if the y_true is POSITIVE, then try to get y_pred to be NEGATIVE, and vice versa

#

because if you're taking the first 64 tokens as prompt, for all prompts that are already positive in those first 64 tokens, there's not much to be done, is there? and then we wouldn't really be differentiating between the two approaches, because they'd both look good

fallow egret Nov 13, 2023, 9:11 PM

#

Yes, if the review is positive in the beginning it's not interesting, but most of the 'trimmed' reviews are neutral

patent gull Nov 13, 2023, 9:11 PM

#

oh ok cool, good to know!! thanks

#

ok, then, i agree with your experiment. maybe we can even measure the \delta from prompt -> prompt + completion
i.e. p(POSITIVE | prompt + completion) - p(POSITIVE | prompt)
where p is the stronger classifier
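A small sketch of that delta, assuming the classifier returns a `{'label', 'score'}` dict and that `LABEL_1` means positive, as in the `final_res` line quoted above. `classify` is a hypothetical callable standing in for the stronger classifier, and the stub below exists only to make the example run.

```python
def positive_prob(result):
    """Convert a {'label', 'score'} result into p(POSITIVE)."""
    return result["score"] if result["label"] == "LABEL_1" else 1 - result["score"]

def sentiment_delta(classify, prompt, completion):
    """p(POSITIVE | prompt + completion) - p(POSITIVE | prompt)."""
    return positive_prob(classify(prompt + completion)) - positive_prob(classify(prompt))

# Hypothetical stub classifier, for illustration only.
def stub_classify(text):
    if "great" in text:
        return {"label": "LABEL_1", "score": 0.8}
    return {"label": "LABEL_0", "score": 0.7}

delta = sentiment_delta(stub_classify, "The movie opens in Paris.", " It was great.")
print(round(delta, 2))  # 0.5
```

Measuring the delta rather than the raw score discounts prompts that are already positive before any generation happens.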

fallow egret Nov 13, 2023, 9:12 PM

#

Yes, it's very good idea.

patent gull Nov 13, 2023, 9:13 PM

#

ok i'll try to find a stronger classification model and will come up with a few prompts. helps to have CFG in huggingface now thanks to @versed flax 😉

fallow egret Nov 13, 2023, 9:15 PM

#

I think that the prompted Flan-T5 is a valid classifier (and it has ~4x the parameters of the shallow model)

patent gull Nov 13, 2023, 9:22 PM

#

here's another one (at least they report their validation accuracy lol): https://huggingface.co/hipnologo/gpt2-imdb-finetune

patent gull Nov 13, 2023, 9:23 PM

#

fallow egret I think that the prompted Flan-T5 is a valid classifier (and it has ~x4 paramete...

ok cool i'll check that one out, too

patent gull Nov 13, 2023, 9:25 PM

#

fallow egret I think that the prompted Flan-T5 is a valid classifier (and it has ~x4 paramete...

but it's not fine-tuned on a sentiment dataset?

fallow egret Nov 13, 2023, 9:27 PM

#

patent gull but it's not fine-tuned on a sentiment dataset?

Yes, it's not. I think it might have an advantage. But for sure I see also the disadvantages.
So not sure what is the best option...

patent gull Nov 14, 2023, 4:53 AM

#

i did a small run with negative prompting and GPT2-medium

#

I found that the following negative prompt gave us the biggest increase in CFG:

#

"A bad movie review starts like this"

#

A bad movie review starts like this.      3                    0.019473
                                          4                    0.014441
                                          5                    0.031700
Bad review here.                          3                   -0.008154
                                          4                    0.000345
                                          5                    0.000045
Bad.                                      3                    0.006696
                                          4                    0.020785
                                          5                   -0.003599
This is terrible.                         3                    0.004726
                                          4                    0.007281
                                          5                    0.014920
Thus starts a terrible movie review.      3                    0.004966
                                          4                    0.020504
                                          5                    0.008596
To write something terrible, write this.  3                   -0.010605
                                          4                   -0.000713
                                          5                   -0.006353

but these numbers aren't huge, honestly. \delta is classifier(CFG output ) - classifier(vanilla output).

So +.03 means that CFG with that negative prompt boosted the sentiment score by ~3%.

#

I can try with a positive prompt, too

#

these are over the first 200 examples in IMDB

loud adder Nov 14, 2023, 5:08 AM

#

I had a migraine and stopped working yesterday, but please remind me to take a look at our draft response Tuesday (today) afternoon.

patent gull Nov 14, 2023, 5:16 AM

#

will do!! I hope you feel OK

#

status is —

R3: I looked it over/edited, I feel like we're OK to respond ASAP on that, whenever you get the chance to look.
R1: I think @versed flax did the necessary experiments, we need to craft a response.
R2: maybe today/tomorrow we'll be done with the experiments and have the response ready

patent gull Nov 14, 2023, 6:01 AM

#

positive prompting was a lot harder to achieve — in fact, CFG with most positive prompts, in most settings, negatively affected sentiment

pos_prompt                             guidance_strength
A good movie review starts like this.  0.10                -0.104267
                                       0.25                -0.066911
                                       0.50                -0.029078
                                       0.75                -0.032493
Great review here.                     0.10                -0.122568
                                       0.25                -0.111521
                                       0.50                -0.064845
                                       0.75                -0.046463
Great.                                 0.10                -0.098969
                                       0.25                -0.078565
                                       0.50                -0.049975
                                       0.75                -0.041289
This is great.                         0.10                -0.066222
                                       0.25                -0.029284
                                       0.50                -0.058933
                                       0.75                -0.030547
Thus starts a great movie review.      0.10                 0.074312
                                       0.25                 0.013816
                                       0.50                -0.035235
                                       0.75                -0.030811
To write something great, write this.  0.10                -0.141006
                                       0.25                -0.094684
                                       0.50                -0.066705
                                       0.75                -0.016745

#

I think we should try both positive and negative settings, though, for the experiment. We can prepend "Thus starts a great movie review." and "A bad movie review starts like this." for CG.

If anyone with access to a long-running compute cluster with some decent memory can run my script, that would be very appreciated!! here is my script:

#

📎 controlled-gen-scripts.py

versed flax Nov 14, 2023, 9:00 AM

#

patent gull positive prompting was a lot harder to achieve — in fact, CFG with most positive...

Try:
Positive: A bad movie review:
Negative: Movie review:

blissful garden Nov 14, 2023, 1:49 PM

#

patent gull

changed some codes to shard the data for 8 gpus and taking this script for a spin in the cluster. There should be some files coming out when you wake up.

blissful garden Nov 14, 2023, 2:24 PM

#

@patent gull I see you called model.generate(.... I got warnings that the max length is defaulted to 20 and the prompt is longer. Does this need to change or it is ok with the current setup?

blissful garden Nov 14, 2023, 3:30 PM

#

ahh errored out... One sample got 800+ tokens and crashed the max length of that distilbert classification model lol

blissful garden Nov 15, 2023, 12:12 AM

#

The full script is taking too long but I will just leave it running.
I will queue 2 more jobs specifically for first 5000 data points, one for negative with cfg 3, 4, 5, and one for positive with cfg 1, 1.25, 1.5, 1.75. When each cfg finished a csv will be saved (we can combine later).
Let's see how many files we get tomorrow when I wake up (or error out). Heading to bed.

patent gull Nov 15, 2023, 8:33 AM

#

versed flax enough CO2 burnt

what are the blue dots, again?

versed flax Nov 15, 2023, 8:39 AM

#

patent gull what are the blue dots, again?

models. Red is regression line.

patent gull Nov 15, 2023, 8:40 AM

#

why do we care about the y=2x line, again?

versed flax Nov 15, 2023, 8:41 AM

#

patent gull why do we care about the y=2x line, again?

minimum amount of ram needed to load the model, fp16 assumed (2 bytes / param)
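The y=2x line as arithmetic: a minimal sketch assuming decimal gigabytes and weights only (no cache, no activations). The function names are made up.

```python
GB = 10**9

def min_weight_bytes(num_params, bytes_per_param=2):
    """Memory needed just to hold the weights; fp16 is 2 bytes per parameter."""
    return num_params * bytes_per_param

def fits(num_params, vram_bytes, bytes_per_param=2):
    """True if the weights alone fit in the given VRAM."""
    return min_weight_bytes(num_params, bytes_per_param) <= vram_bytes

print(min_weight_bytes(15 * 10**9) / GB)  # 30.0
print(fits(30 * 10**9, 30 * GB))          # False
```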

patent gull Nov 15, 2023, 8:44 AM

#

so the blue dots are the minimum VRAM that the model needs + the KV cache for the maximum sequence that the model takes?

#

what is the right way to address the second part of their review:

There is no guarantee that the Eq.6 will obtain a legal probability with the probabilities of all possibilities summing up to 1.

they're right — and we're not doing special normalization in the LogitWarper. Does HF do normalization under the hood in the .generate() function? I think it must, if the user is doing top-p and top-k sampling as well

versed flax Nov 15, 2023, 8:53 AM

#

patent gull what is the right way to address the second part of their review: `There is no...

There's always a softmax before the actual sampling
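A toy check of that point, assuming the guided scores take the usual CFG form `uncond + gamma * (cond - uncond)` in log space (the exact form of Eq. 6 is not reproduced here). The exponentiated guided scores do not sum to 1 on their own, but the softmax applied before sampling renormalizes them.

```python
import math

def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]

def softmax(xs):
    return [math.exp(v) for v in log_softmax(xs)]

def cfg_scores(cond_logp, uncond_logp, gamma):
    """Assumed CFG combination in log space; gamma = 1 recovers the conditional."""
    return [u + gamma * (c - u) for c, u in zip(cond_logp, uncond_logp)]

# Toy 3-token vocabulary with made-up logits.
cond = log_softmax([2.0, 1.0, 0.0])
uncond = log_softmax([0.0, 1.0, 2.0])
guided = cfg_scores(cond, uncond, gamma=1.5)

unnormalized_mass = sum(math.exp(v) for v in guided)  # not 1 in general
normalized_mass = sum(softmax(guided))                # 1 up to float error
print(unnormalized_mass > 1.0, abs(normalized_mass - 1.0) < 1e-9)  # True True
```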

patent gull Nov 15, 2023, 8:54 AM

#

right duh

#

for R2:

Compared with text-to-image generation, the optimal \gamma value in the language modelling seems to be small (<2), while large \gamma value leads to poor performance. Have any observations on it?

Maybe we can say that CFG is applied in autoregressive sampling at every step, so \gamma actually needs to be smaller, as it has a repeated impact

versed flax Nov 15, 2023, 9:03 AM

#

patent gull for R2: `Compared with text-to-image generation, the optimal \gamma value in t...

I would say it's bc of two things:

in img generation the range is -1;1, it may be smaller with logits
in img generation the values are independent but here there's a softmax and changing the max value dramatically alters the whole distribution

patent gull Nov 15, 2023, 9:04 AM

#

pixel range is -1;1

versed flax Nov 15, 2023, 9:08 AM

#

It may also be: 3. The conditional and unconditional outputs are more different in text than image

fallow egret Nov 15, 2023, 9:29 AM

#

I think it's more the nature of diffusion models: after a very small number of iterations, the difference between the conditional probability and the unconditional probability should be negligible

versed flax Nov 15, 2023, 11:48 AM

#

This is a great explanation as well

#

We could see something similar with our paper as well as we sample more and more tokens

#

The continuation will be impacted less and less by the CFG'd tokens of the initial prompt

#

is this plot clearer? I changed the text and now I think it's much better

#

loud adder Nov 15, 2023, 2:40 PM

#

I'm getting caught up on the rebuttal google docs now

patent gull Nov 15, 2023, 7:02 PM

#

ok I will condense these explanations in the google doc

versed flax Nov 15, 2023, 7:30 PM

#

Friends, I am not a native English speaker, so I will not post the answers to the rebuttals before Alex / Stella proofread them. Please, when you think an answer is good enough, post it. Let's not wait another round. It's been 5 days already. We're 50% in.

loud adder Nov 15, 2023, 7:50 PM

#

@versed flax in the response to R1, it says

We have completed a memory analysis and will include our results in the paper. In general, we found a tradeoff between serving larger models, and serving a smaller model with CFG.
Is the updated paper going to be posted simultaneously with the reply? Or is that a to-do?

versed flax Nov 15, 2023, 7:52 PM

#

loud adder @versed flax in the response to R1, it says > We have completed a memory...

That's a TO-DO as of now. Can be done fast for the memory thing. I'm not quite sure about the controlled NLG. Experiments are still running.

loud adder Nov 15, 2023, 7:53 PM

#

I'm tweaking the reply to reviewer 1 a little and otherwise think it's good

versed flax Nov 15, 2023, 7:54 PM

#

awesome!

loud adder Nov 15, 2023, 7:55 PM

#

I would add the memory experiments and the formatting fixes that R1 recommends now

#

We can tell the other reviewer that theirs is running and that we'll update when it's done

versed flax Nov 15, 2023, 7:56 PM

#

loud adder I would add the memory experiments and the formatting fixes that R1 recommends n...

to the paper? ok

#

Does the page size constraint still apply?

loud adder Nov 15, 2023, 7:58 PM

#

Usually we get an extra page, it should say on the call for papers page

versed flax Nov 15, 2023, 7:58 PM

#

good. I will check that then.

#

And maybe I will stop depending on you and Alex for the English and use ChatGPT instead lol

loud adder Nov 15, 2023, 8:02 PM

#

I want to add this to the end of the discussion of the results

At a high level, this means that it depends on your use-case. For researchers or small scale deployments where people are using the largest model that they can fit on their GPU, it's better to use CFG. However for very large scale commercial deployments, it makes more sense to increase the size of the model. We further note that increasing the size of the model is not always possible: OpenAI probably doesn't have a version of GPT-4 that's twice as big sitting around.

versed flax Nov 15, 2023, 8:02 PM

#

I love it!

#

crystal clear and wraps it up perfectly

#

(although they probably do since GPT-4 turbo is prolly a distilled version of GPT-4)

loud adder Nov 15, 2023, 8:04 PM

#

Shhh

#

I'm also now curious how GPU size discreetness impacts this

versed flax Nov 15, 2023, 8:04 PM

#

"discreetness"?

loud adder Nov 15, 2023, 8:05 PM

#

No, the fact that GPUs come in fixed sizes: 16, 24, 40, 48, 80

versed flax Nov 15, 2023, 8:06 PM

#

Ah!

#

then we could add new horizontal frontiers on the chart with GPU models

loud adder Nov 15, 2023, 8:07 PM

#

Yeah

#

Models do too... though generally are spaced to double in size (6.7B -> 13B -> 20B -> 40B)
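A rough sketch of how the fixed GPU sizes interact with model sizes, assuming fp16 weights only (2 bytes/param, no KV cache or activations, single GPU). The helper name is made up for illustration:

```python
GPU_SIZES_GB = [16, 24, 40, 48, 80]  # the fixed sizes listed above

def smallest_gpu_for(params_billions, bytes_per_param=2):
    """Smallest listed GPU that holds the weights alone, or None if none does."""
    needed_gb = params_billions * bytes_per_param
    for size in GPU_SIZES_GB:
        if size >= needed_gb:
            return size
    return None

for model in (6.7, 13, 20, 40):
    print(f"{model}B -> {smallest_gpu_for(model)} GB GPU")
```

Under these assumptions a model and its 2x sibling often land on different GPU tiers, so the leftover headroom on each tier is what decides whether CFG or the bigger model fits.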

versed flax Nov 15, 2023, 8:10 PM

#

Depends on the family? I remember Chinchilla models not doubling every time, but I may be mistaken here

versed flax Nov 15, 2023, 8:11 PM

#

loud adder I want to add this to the end of the discussion of the results > At a high lev...

do I post it now?

loud adder Nov 15, 2023, 8:21 PM

#

I also did a pass on the reply to R3

#

Don't quite love it but I think it's good?

versed flax Nov 15, 2023, 8:21 PM

#

loud adder Don't quite love it but I think it's good?

Your guess is better than mine. I have no prior experience with reviewers.

loud adder Nov 15, 2023, 8:22 PM

#

versed flax do I post it _now_?

I changed the review to say that the results were added to the paper and that we made the formatting changes they recommend. So make those changes and then it's good to go IMO

#

(A general principle at play here is that you should show that you've done what they want instead of promising that you will whenever possible)

versed flax Nov 15, 2023, 8:23 PM

#

Then I'll try pulling that off tonight and posting the PDF and the answer at the same time

loud adder Nov 15, 2023, 8:25 PM

#

So right now we are inconsistent in our replies to R2 and R3

#

We tell R3 that CFG for LMs is new

#

But acknowledge with R2 that it's not

#

Which position are we taking? We cannot take both

versed flax Nov 15, 2023, 8:28 PM

#

loud adder But acknowledge with R2 that it's not

As far as we are aware, the application of CFG to autoregressive language models is novel, as previously they had only been applied to non-autoregressive diffusion models in computer vision
that reads "new" to me

loud adder Nov 15, 2023, 8:29 PM

#

Oh sorry. R2 tells us it's not

#

Misread that

#

I'm about to give a talk and have to run, but I can do a final pass before the submission this evening (in 4-ish hours)

blissful garden Nov 15, 2023, 8:30 PM

#

versed flax That's a TO-DO as of now. Can be done fast for the memory thing. I'm not quite s...

18min left. Praying that nothing breaks when it comes out.

loud adder Nov 15, 2023, 8:31 PM

#

versed flax > As far as we are aware, the application of CFG to autoregressive language mode...

versed flax Nov 15, 2023, 8:31 PM

#

yes

loud adder Nov 15, 2023, 8:32 PM

#

Ttyl

versed flax Nov 15, 2023, 8:32 PM

#

good luck!

#

shine!

blissful garden Nov 15, 2023, 9:36 PM

#

@fallow egret @patent gull the resulting files of the 1000 samples of fudge. Any good?

📎 fudge.tar.gz

fallow egret Nov 15, 2023, 9:38 PM

#

What guidance values did you use?

blissful garden Nov 15, 2023, 9:39 PM

#

oh shit this is just the baseline.....

#

job is still running

fallow egret Nov 15, 2023, 9:39 PM

#

if it's 1 then it's not the baseline

#

this is the value that they used

blissful garden Nov 15, 2023, 9:39 PM

#

yeah it's the 1

fallow egret Nov 15, 2023, 9:40 PM

#

cool, so this is what they used in the paper

blissful garden Nov 15, 2023, 9:40 PM

#

I have a for loop in my bash script so it's done 1 only. Tomorrow maybe 1.25

fallow egret Nov 15, 2023, 9:42 PM

#

For a fair comparison we should run it with a few guidance scales.
But I'm not sure it's worth wasting too much time on it. In any case, it's simply not a valid method

blissful garden Nov 15, 2023, 9:44 PM

#

I have 1, 1.25, 1.5 and 1.75 in my script

#

we can tell R2 that it's running just like what Stella said. If we get 1.25 we can give them a teaser. But no need to wait for it to finish

#

yeah looks like that crazy 200 DistilBERT thing is a massive bottleneck. CFG barely made the whole thing slower

fallow egret Nov 15, 2023, 9:51 PM

#

Yes, I agree. In any case, the main point is to stress that, theoretically, the alternative is classifier guidance (CG, as in diffusion models), which in LLMs is also known as FUDGE. The problem is that for an LLM you need to run the classifier on every (state, next_token) combination, which makes it impractical.
In the FUDGE paper, they resolve this issue by sampling the top 200 tokens and using a shallow classifier. In our experience, even with a relatively shallow network (65M parameters), the running time is still more than an order of magnitude longer than CFG's, which makes this method impractical for many real-world use cases
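A back-of-the-envelope sketch of the call counts behind this argument, assuming CFG needs two LM forward passes per generated token (conditional and unconditional) and the classifier-guided setup scores each of the top 200 candidates at every step. It only counts calls; real wall-clock time depends on batching and model sizes:

```python
def cfg_calls(num_tokens):
    # Assumption: one conditional and one unconditional LM pass per token.
    return {"lm_forward": 2 * num_tokens, "classifier": 0}

def classifier_guided_calls(num_tokens, top_k=200):
    # Assumption: one LM pass per token, plus one classifier evaluation
    # for each of the top_k candidate next tokens.
    return {"lm_forward": num_tokens, "classifier": top_k * num_tokens}

n = 50  # illustrative generation length
print("CFG:", cfg_calls(n))
print("classifier-guided:", classifier_guided_calls(n))
```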

blissful garden Nov 15, 2023, 9:53 PM

#

can we just grab those 1 results and try CFG alone, possibly with neg prompts, and argue that CFG produces similarly controlled results?

#

they control the sentiment right?

#

if we do we get one quick chart to show and also makes our method stronger

fallow egret Nov 15, 2023, 9:56 PM

#

Yes, I think it's fine. In addition to my last comment, we can add this small experiment, which also demonstrates that you don't get better results with FUDGE. But the main point is to emphasize the usability of CFG for real-world use cases

patent gull Nov 15, 2023, 10:09 PM

#

hey just catching up. I will take a look at these results now

#

sorry — what is being pickled in these files? I just see lists of strings

#

can someone forward Elad's script, again?

blissful garden Nov 15, 2023, 10:25 PM

#

fallow egret Ok, so I clean the code and extract outside the guidance scale

@patent gull this one plus some minor thing to split up for 8 GPUs and save to separate files

patent gull Nov 15, 2023, 10:28 PM

#

ah ok — so will just compare to the vanilla GPT generations

blissful garden Nov 15, 2023, 10:49 PM

#

patent gull ah ok — so will just compare to the vanilla GPT generations

maybe try the same stuff but with CFG generation and some negative prompt, and get the sentiment score. As Elad said we may not outperform but if we are not lagging too much behind it might be worth mentioning

#

need to go to bed otherwise I can try it very quick.

patent gull Nov 15, 2023, 10:50 PM

#

no problem

#

yeah i'm just wondering... I remember some old work about creating the ideal prompt, given a classifier, I'm trying to find it

#

i don't think it'll be directly useful in our case, but.. hmm

versed flax Nov 15, 2023, 11:06 PM

#

FYI:

There will be a strict upper limit of 9 pages for the main text of the submission, with unlimited additional pages for citations. This page limit applies to both the initial and final camera ready version.

#

So I think I will add the memory analysis in the appendix

#

It's secondary to the contribution I would say

patent gull Nov 15, 2023, 11:07 PM

#

Yes, I agree, I think it belongs next to the FLOPs analysis, and can be mentioned in the main body but explored more deeply in the appendix

#

just like the FLOPs

versed flax Nov 15, 2023, 11:08 PM

#

Totes

patent gull Nov 15, 2023, 11:08 PM

#

btw i'm reading through what you wrote to R1. very nice. I finally understand it 😭😭😭 hahaha

versed flax Nov 15, 2023, 11:08 PM

#

Haha I'm glad

patent gull Nov 15, 2023, 11:10 PM

#

also the plot looks so much better

versed flax Nov 15, 2023, 11:10 PM

#

Yes the text finally makes sense

patent gull Nov 15, 2023, 11:12 PM

#

what happened to the blue dots?

versed flax Nov 15, 2023, 11:13 PM

#

obliterated. Not needed.

patent gull Nov 15, 2023, 11:18 PM

#

one question — are we implicitly assuming that a model twice as large is as accurate as a model with CFG?

versed flax Nov 15, 2023, 11:18 PM

#

yes

#

it's not implicit. It's something we kinda introduce in the paper.

patent gull Nov 15, 2023, 11:19 PM

#

no, i know that haha, but I think we should reiterate that in the response

#

let me work that in

versed flax Nov 15, 2023, 11:19 PM

#

oh ok

patent gull Nov 15, 2023, 11:21 PM

#

For the chart, I have the following comments:

the "CFG" annotation can be more central — 50%/50% of the plot, instead of off to the side
Can we change "CFG" -> "CFG wins"
"Vanilla -> Vanilla Wins"

— if you'd like to send me the code, I can play with the chart myself, whatever's easier.

versed flax Nov 15, 2023, 11:22 PM

#

I can fix that in an instant

patent gull Nov 15, 2023, 11:22 PM

#

ok great!

versed flax Nov 15, 2023, 11:22 PM

#

I'm writing the appendix rn now

#

so it'll be done later unless it's needed now

patent gull Nov 15, 2023, 11:22 PM

#

no problem/rush at all

#

i see you just copy/pasted — there's a typo "models bigger than 5G" -> "models bigger than 5B"

#

lol. we're not comparing cell phone service plans, here

versed flax Nov 15, 2023, 11:30 PM

#

B=G tho 😭 (but yes, you're right)

patent gull Nov 15, 2023, 11:30 PM

#

yeah but let's be consistent

unique sedge Nov 16, 2023, 2:39 AM

#

Hello, sorry for being AWOL. I had to go to college for my thesis submission and defense scheduling and have been busy with that. Sorry for not being able to help.

patent gull Nov 16, 2023, 2:56 AM

#

still running sentiment controlled NLG.. I wonder if we want to add a second controlled NLG attribute

#

formality is one that others have used, and there's a nice model here that does well in assessing formality: https://huggingface.co/s-nlp/roberta-base-formality-ranker

#

specifically, R2 asked us to compare this to controlled NLG sota methods

fallow egret Nov 16, 2023, 3:56 AM

#

patent gull yeah i'm just wondering... I remember some old work about creating the ideal pro...

There is a simple recent work by DeepMind in which they provide a few examples of <prompt, prompt_score on the data> tuples plus a meta-prompt that asks the model to propose an alternative prompt that will give the best score. You can iterate (by adding the result of the new suggestion).
It seems to work very nicely, and we can easily apply it to our use case to find the best prompt for CFG

patent gull Nov 16, 2023, 4:00 AM

#

to GPT4, or something?

#

so the model basically infers based on what was working, what will work?

fallow egret Nov 16, 2023, 4:05 AM

#

Yes, exactly

#

I'm now working on a few improvements to this method (like providing a few failure cases for each run). It's still a work in progress, but their basic idea is nice if you provide good context about the task in the meta-prompt

loud adder Nov 16, 2023, 4:29 AM

#

@fallow egret @patent gull are we good to post?

fallow egret Nov 16, 2023, 4:33 AM

#

I think it's important to add, for each reviewer, one sentence at the beginning that stresses the main positive things they found in our paper (something like 'we are glad you find our...'); again, it's important for the AC decision

patent gull Nov 16, 2023, 4:35 AM

#

We already posted responses to R1 and R3.

For R2, we planned an experiment comparing CFG to a controlled NLG baseline, where we're controlling for sentiment.

I just got some good results from CFG. I'm comparing to SOTA baseline now.

I do wonder, though, if sentiment is enough. Ideally, we compare several different controlled NLG objectives. What do you think, @loud adder ?

#

Sentiment may be enough for an initial response to R2, but ideally if we're updating the paper, I'd feel better including more experiments on more controlled factors

fallow egret Nov 16, 2023, 4:41 AM

#

lol, it was really uploaded with 😅
We hope this clarifies the points raised in your review. If you would please consider raising your score, we would really, really appreciate it!!

#

Ok, I think that, at least once we see the end of the review period coming, we should add a comment for each reviewer that is much more formal and doesn't contain any promises of future changes (these are a direct reason for rejection). It should simply state that we modified the text and addressed all the concerns raised by the reviewer (list them).

patent gull Nov 16, 2023, 4:47 AM

#

haha feel free to edit, but i've found it helps to ask, sometimes

fallow egret Nov 16, 2023, 4:48 AM

#

I don't think that there is an option to edit responses

patent gull Nov 16, 2023, 4:50 AM

#

yeah, there is... I edited @versed flax, the button is off to the side

#

btw, good news, good results from CFG vs. baseline CG

#

here's the delta increase in positive sentiment via CFG for a few settings/prompts:

```
Great movie review:     0.10                 0.075225
                        0.25                -0.136310
                        0.50                -0.015543
                        0.75                 0.034303
That was a good movie!  0.10                 0.364103
                        0.25                 0.312607
                        0.50                 0.192197
                        0.75                 0.044026
```

and here's the delta increase in sentiment via CG for the defaults that the authors used:

```
baseline_df['delta'].mean()
0.065204710023016
```
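For a quick side-by-side, here is a sketch that averages the CFG deltas per prompt and picks the best setting, using only the numbers pasted above. The variable names are made up, and reading the middle column as the guidance strength is an assumption:

```python
from statistics import mean

# (prompt, guidance strength) -> delta in positive sentiment, copied from the table above
cfg_delta = {
    ("Great movie review:", 0.10): 0.075225,
    ("Great movie review:", 0.25): -0.136310,
    ("Great movie review:", 0.50): -0.015543,
    ("Great movie review:", 0.75): 0.034303,
    ("That was a good movie!", 0.10): 0.364103,
    ("That was a good movie!", 0.25): 0.312607,
    ("That was a good movie!", 0.50): 0.192197,
    ("That was a good movie!", 0.75): 0.044026,
}
cg_baseline_delta = 0.065204710023016  # baseline_df['delta'].mean() from above

# Group the deltas by prompt, then average each group.
by_prompt = {}
for (prompt, _strength), delta in cfg_delta.items():
    by_prompt.setdefault(prompt, []).append(delta)
mean_by_prompt = {prompt: mean(deltas) for prompt, deltas in by_prompt.items()}

best_setting = max(cfg_delta, key=cfg_delta.get)

for prompt, m in mean_by_prompt.items():
    print(f"{prompt!r}: mean delta {m:+.4f} vs CG baseline {cg_baseline_delta:+.4f}")
print("best CFG setting:", best_setting, cfg_delta[best_setting])
```

The gap between the two prompts suggests the prompt choice matters at least as much as the guidance strength here.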

#

ideally we test a lot more values for guidance strength for CG, but it is SO SLOW to run.

Let me draft a response to R2, and then we can see whether it looks good, or whether we should do more experimentation

fallow egret Nov 16, 2023, 4:57 AM

#

Amazing!! I think it's definitely enough material to address this point of his review

loud adder Nov 16, 2023, 4:59 AM

#

fallow egret lol, it was really uploaded with 😅 `We hope this clarifies the points raised i...

I had edited this out of the Google doc before it was posted... can you check and see if there's other divergences with the Google doc?

loud adder Nov 16, 2023, 5:00 AM

#

patent gull ideally we test a lot more values for guidance strength for CG, but it is SO SLO...

I think we should submit a quick response to R2 even if the experiments aren't fully in telling them that they're on the way so they don't feel ignored

patent gull Nov 16, 2023, 5:01 AM

#

sorry!! I didn't check the revision-history/didn't realize you had edited it out... and have been handling a lot of things today

loud adder Nov 16, 2023, 5:03 AM

#

No worries

#

It's not a big deal, the language just seemed a little over the top

#

I was more concerned about whether this was a sign that an old draft was used (and other edits I made later I view as more important)

#

Did another pass over the two posted reviews... they look good!

patent gull Nov 16, 2023, 5:22 AM

#

the reposted paper is very good too, thanks to @versed flax

#

I'm almost done with R2... just have to answer that last question

#

ok R2 is done in the google draft, grabbing dinner now

blissful garden Nov 16, 2023, 6:24 AM

#

Just woke up. @patent gull still need me to run some more tests on both fudge and your script? I can try more neg prompts in parallel with you guys and see if anything comes up.

versed flax Nov 16, 2023, 8:39 AM

#

patent gull here's the delta increase in positive sentiment via CFG for a few settings/promp...

why are those guidance values below 1?

patent gull Nov 16, 2023, 8:39 AM

#

It’s positive guidance, not negative

#

Positive worked a lot better than negative

versed flax Nov 16, 2023, 8:40 AM

#

confused_pikachu.jpg

#

R2 will be quite unhappy that we run yet another method with yet another gamma

patent gull Nov 16, 2023, 8:44 AM

#

yeah 🤷‍♂️ i see positive/negative prompts as kinda being in the same category https://huggingface.co/docs/transformers/internal/generation_utils#transformers.UnbatchedClassifierFreeGuidanceLogitsProcessor.example

#

but yeah we generally don't have a good answer for guidance strength and what works and what doesn't

versed flax Nov 16, 2023, 8:45 AM

#

that means we interpolate between both prompt, thus reducing specificity to the user prompt

patent gull Nov 16, 2023, 8:48 AM

#

for negative/positive prompting, it also means we're emphasizing more/less of the negative prompt

#

$(1 - \gamma) p(w_i | w_{<i}, \hat{c}) + \gamma p(w_i | w_{<i}, c)$

for $\gamma \in [0, 1]$, you're mixing part of $\hat{c}$ with $c$
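A toy sketch of that mixing, written in probability space exactly as in the formula above (the distributions are made up). For gamma in [0, 1] the result is a convex combination, so it stays a valid distribution between the two; outside that range the same formula can go negative:

```python
def mix(p_hat_c, p_c, gamma):
    # (1 - gamma) * p(w | c_hat) + gamma * p(w | c), token by token
    return [(1 - gamma) * a + gamma * b for a, b in zip(p_hat_c, p_c)]

p_hat_c = [0.7, 0.2, 0.1]  # toy next-token distribution under the other prompt c_hat
p_c = [0.1, 0.3, 0.6]      # toy next-token distribution under the prompt c

interpolated = mix(p_hat_c, p_c, gamma=0.25)   # inside [0, 1]: interpolation
extrapolated = mix(p_hat_c, p_c, gamma=1.5)    # outside [0, 1]: extrapolation

print(interpolated, sum(interpolated))
print(extrapolated, min(extrapolated))
```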

blissful garden Nov 16, 2023, 9:31 AM

#

oh crap the FUDGE job got preempted....

#

so we just got guidance = 1

patent gull Nov 16, 2023, 10:32 AM

#

blissful garden Just woke up. @patent gull still need me to run some more tests on bot...

Hey @blissful garden thanks — I feel pretty good on the prompts for sentiment. I think over the next few days I’ll try to get the formality classifier going.

In the meantime, as soon as someone takes a look at R2 and says it’s ok, I can post

blissful garden Nov 16, 2023, 10:53 AM

#

patent gull Hey @blissful garden thanks, I feel pretty good on the prompts for sentime...

Seems like the only thing is the last question where we listed 5 counter points. 1-4 look good. I agree with Elad and have some doubts on 5
The rest looks really good!

patent gull Nov 16, 2023, 9:09 PM

#

ok i'll post our response

versed flax Nov 17, 2023, 1:29 AM

#

The answer to R2 is absolutely fabulous

#

Congrats guys

patent gull Nov 23, 2023, 2:49 AM

#

should we re-ping the reviewers on the OpenReview comment threads?

#

I don't know how ICLR works

#

in ACL, the ACs started encouraging reviewers to respond

loud adder Nov 23, 2023, 3:27 AM

#

Yeah that's probably a good idea

#

Thank you for your review! We were wondering if you were planning on updating your score to reflect our reply or if there were any additional questions you'd like us to answer.

patent gull Nov 23, 2023, 5:42 AM

#

ok — updating now! thanks for the great text!

loud adder Nov 25, 2023, 3:00 PM

#

Unfortunately it looks like we won't be accepted to ICLR unless a miracle occurs. Because other papers' reviewers have been responding, our paper has fallen to ~ the median review score.

https://x.com/shaohua0116/status/1728158662265340047

This doesn't mean the paper isn't good, it means we got unlucky. Peer review is a crapshoot and sometimes it takes three submissions to get the right luck.

The next venues are ICML and ACL, both of which have deadlines in January. I think we're good to submit as-is (after changing the format and making sure we fit within the length reqs), but if people want to improve the paper more we can have a meeting in December and discuss options.

blissful garden Nov 25, 2023, 4:13 PM

#

COLM is probably also something to think of (and maybe have a better chance of being properly evaluated by qualified experts)

loud adder Nov 25, 2023, 4:45 PM

#

True, I hadn't considered that.

fallow egret Nov 25, 2023, 6:38 PM

#

What about some ICLR workshop? I think at this stage it's going to be hard to get accepted (since indeed many papers got out with the same sampling modification). On the other hand with such a good experimental section it will be very easy to get accepted to a workshop

loud adder Nov 25, 2023, 11:03 PM

#

Partially it's a question of what @versed flax's goals are

versed flax Nov 25, 2023, 11:04 PM

#

loud adder Partially it's a question of what @versed flax's goals are

Can I get your view on the different tradeoffs?

#

I am having a hard time having a relevant opinion, I don't have publishing experience and what just happened makes me question the chance of getting this paper through a high impact conference

blissful garden Nov 25, 2023, 11:21 PM

#

My take was that our paper wasn't really judged by the right people this time. Workshop has the advantage of being specialized to the right domain. I had good experience with that earlier this year but my sample size was 1 😂.
Trying ICML and ACL has the benefit of prestige. If it gets accepted, for example it's an entry ticket for job interview or a dream-come-true moment for @versed flax if I remember correctly. COLM is probably like betting on a super young venue. But in my own field a lot of young journals run by competent experts did rise up extremely quickly and carried others' and my mediocre papers that got published on it.

I have no idea whether a resubmission gets a lower chance or not. At least in math nobody cares how many times you submitted before.

versed flax Nov 25, 2023, 11:26 PM

#

What about Elad's take that the paper gets older?

blissful garden Nov 25, 2023, 11:27 PM

#

versed flax What about Elad's take that the paper gets older?

yeah this is what I don't know about. How much would resubmission hurt the publication chance in ML.
I mean the paper did come out in parallel with a bunch of others doing similar sampling method. It just gets submitted late but they shouldn't judge that on when you submit

fallow egret Nov 26, 2023, 6:47 AM

#

I just want to stress that the resubmission is not an issue (as @loud adder wrote, it's very common to try a few times until getting accepted). The problem is that there are currently many papers with the exact same method, and I'm guessing that some of them got accepted to a tier-1 conference. For ICLR it was still a borderline case, but now it's going to be very hard to defend the novelty claim (which automatically reduces the score to <6).
From a prestige point of view, getting accepted to a good workshop is not the same as getting accepted to the main conference, but I think it's also good for the resume.

In any case, for sure it's your decision only. Whatever you will decide I will be available to help also in the next submissions

fallow egret Nov 26, 2023, 9:26 AM

#

blissful garden yeah this is what I don't know about. How much would resubmission hurt the publi...

Unfortunately it is judged according to the submission time (since it is a blind submission and the reviewers are not supposed to check arXiv for the original publication date). In any case, they certainly will not start comparing dates with other works. Let's hope it will still get accepted to ICLR

blissful garden Nov 26, 2023, 1:28 PM

#

fallow egret Unfortunately it is judged according to the submission time (since it is a blind ...

In theory you are right. But technically resubmission is at least a 6 month delay. If your work is an important work, people must have talked about it, cited or used it and things get old easily in ML. If there is a perfect isolation between submission and the original arxiv, it automatically becomes "not novel" because this "anonymous submission" is older than your own preprint and reviewers are not allowed to draw connections between these two

#

Basically this is not enforceable because a perfect execution is saying "good works cannot resubmit".

I'd rather guess that in reality reviewers secretly look up and know what date this paper came out and who wrote it. If it's truly an original work when it comes out, they just don't mention "novelty". If it's obviously a copy cat of other method with significant time difference, they cite this "novelty" issue.

loud adder Nov 26, 2023, 2:45 PM

#

We can ping the ACs if this becomes a serious issue

#

That's what we did with trlX, when we claimed we were the first people to do something and one reviewer came back and was like "what about trlX?"

versed flax Nov 26, 2023, 3:45 PM

#

You all have more experience than me. If I make a decision, it will necessarily be less informed than any of yours. My goal is to maximize impact & recognition, but I'm not ready to take risky bets and risk losing it all

patent gull Dec 2, 2023, 7:55 PM

#

sorry for the delay here, I missed a lot of this discussion.

I have a few thoughts:

Huge bummer and yes more a reflection of randomness than actual goodness-of-fit. ICLR has a crazy mean-tendency bias... a 1-point difference in any reviewers score would've totally changed our outlook.
I think this is worthy of a conference paper, given the amount of different angles we bring together: conventional benchmarks, cot, memory/compute analysis, assistants, etc. I'm willing to be overruled on this point, but I think it's more than a workshop paper.
In my opinion, it doesn't matter as much that other people have done this sampling modification, as they have really focused on specific cases. Also, the current reviews have undeniably made this paper a lot stronger. My gut is that we need to do a better job highlighting our novelties into the introduction, in essence: introduction = \gamma * our paper + (1 - \gamma ) (other papers). wait a minute.... that looks familiar....
That being said, I have concerns about ACL since (a) I don't know that people in that conference care as much about compute/memory evals as they do in Neurips or ICLR (b) the paper format for ACL is much different and smaller, we would have to cut a lot of stuff or move stuff into the appendix. Which might not be terrible — we might indeed have too much introductory maths. But still, it's going to be considerable work to reformat for ACL.

patent gull Dec 2, 2023, 8:11 PM

#

My gut is that we submit at least one more cycle. We got very helpful reviews that got to some core weaknesses in our paper, and we addressed them. The paper is stronger as a result — the review cycle worked.

None of the reviewers seemed to care about other NLP papers that did CFG-like sampling. The criticism was the comparison to CFG in vision, which was fair, @versed flax was directly inspired by vision, so it's a very fair criticism. So, we do a better job of highlighting our response to R2.

In my mind, the tradeoffs:

conference:

pros: gives the paper more credibility and standing.
cons: possibility of another rejection.

workshop

pros: gets the work out there, at least.
cons: variance in quality in workshops is HUGE. Paper has less credibility, in my mind.

I think COLM might be pretty cool to consider. I looked up the dates and ICML reviews will become available before the COLM deadline. So there's the possibility that we submit to ICML and then if we get bad scores, fix and submit to COLM. ACL reviews won't be available in time for COLM. On the plus side for ACL, the reviewers there tend to write a LOT more, and actually respond to rebuttals but 🤷‍♂️ the timing and venue doesn't seem optimal to me

versed flax Dec 3, 2023, 1:07 AM

#

patent gull My gut is that we submit at least one more cycle. We got very helpful reviews th...

That seems smart. I'm okay.

patent gull Dec 4, 2023, 8:45 PM

#

should we say something? I don't know how these channels usually work:

#

#

@here private comment period ends today

#

we can easily say "we responded to everything including with new experiments and haven't heard back." I just don't know what is typically acceptable for ICLR. For instance, in *CL conferences, we're advised to do this only as a last resort, if we suspect serious ethical issues on the reviewers part

blissful garden Dec 4, 2023, 9:06 PM

#

yeah no idea how to do this. With 7000+ submissions I bet a lot of other people also said that they didn't hear back from reviewer

versed flax Dec 4, 2023, 9:09 PM

#

do we risk something by doing so?

#

if not, the expectation is strictly positive

loud adder Dec 4, 2023, 9:10 PM

#

That's my thinking, yeah

versed flax Dec 4, 2023, 9:11 PM

#

This is the mechanism to communicate to the AC any unresolved discussion points (if you do not have any unresolved discussion points, there is no need to send a private comment).
I mean, it seems to be exactly our case?

loud adder Dec 4, 2023, 9:13 PM

#

Yes

#

Send:
Dear AC,
We have tried to engage with the reviewers, but unfortunately none of them responded to our rebuttal. As far as we can tell we have given compelling responses to all reviewers that would warrant a reconsideration of their initial scores.

versed flax Dec 4, 2023, 9:15 PM

#

copy paste, send?

patent gull Dec 4, 2023, 9:39 PM

#

I would say exactly that except “compelling responses” -> “compelling responses, including two requested analyses, to all reviewers”

#

@versed flax do you want to send?

versed flax Dec 4, 2023, 9:41 PM

#

patent gull @versed flax do you want to send?

the three of us are 1st authors, if you want to do it, I won't prevent you

patent gull Dec 4, 2023, 9:41 PM

#

Either/or, I don’t care

#

Ok let me do it before I lose service then, im on a train

#

alright let me see

versed flax Dec 4, 2023, 9:45 PM

#

FYI

To write a private comment to the ACs, you can simply go to your submission on OpenReview, and write a new comment. The allowable readers are ACs, SACs, and PCs.

patent gull Dec 4, 2023, 9:54 PM

#

"Dear Area Chairs,

We are writing to let you know that we tried to engage with the reviewers, but unfortunately none of them responded to our rebuttal. As far as we can tell, we have given compelling responses to all reviewers, including with analyses that we incorporated into a new draft of the paper, that would warrant a reconsideration of their initial scores.

We summarize the major unresolved discussion points here:

A memory cost analysis is recommended (Reviewer 3Gz2): We have completed a memory analysis and have included our results in a paper update. In general, we found a tradeoff between serving larger models and serving a smaller model with CFG; in our analysis we identify the optimal tradeoff point across model sizes and VRAM. Please see Section 4 and Appendix B.3.
Comparison to other baseline controllable NLG tasks (Reviewer RjYY): We have completed this comparison. The baseline classifier-guided control increases sentiment by 0.065 points, whereas CFG (our method) increases it by 0.312 points. Additionally, the baseline is very slow: it is >100x slower than CFG.
Lack of novelty (Reviewer YQBo): With respect, we argue that the application of CFG to autoregressive language models is novel, as previously it had only been applied to non-autoregressive diffusion models in computer vision. This is not a trivial adaptation, and the core contribution of our paper is to adapt and rigorously test it across a wide range of different prompting techniques to prove its validity.

We very much appreciated the reviewers' points, and they undeniably made the paper stronger. We had hoped for a robust debate.

We additionally would like to report that we strongly believe that Reviewer YQBo did not fully grasp the point of the paper, as they seem to be under the impression that CFG involves further model training, which it DOES NOT.

We would be very appreciative if you took these points into consideration in your review."

versed flax Dec 4, 2023, 9:55 PM

#

fire

patent gull Dec 4, 2023, 9:55 PM

#

ok I'll wait ~20 min for other people @here to read

#

and then send. if you don't hear my ACK, assume that I don't have service

patent gull Dec 4, 2023, 10:39 PM

#

sent

versed flax Dec 4, 2023, 11:01 PM

#

🔥

blissful garden Dec 4, 2023, 11:03 PM

#

patent gull sent

Thank you! Sorry I have been so busy today

patent gull Feb 1, 2024, 9:47 AM

#

We're resubmitting to ICML tomorrow @loud adder and anyone else. If you'd like to give our paper a glance, it's here: https://www.overleaf.com/5232387143jdfyzsrvmjsv#565401

fallow egret Feb 1, 2024, 3:27 PM

#

This is the final version?

patent gull Feb 1, 2024, 4:48 PM

#

Near final

#

Will probably do some more editing before I submit

fallow egret Feb 1, 2024, 5:50 PM

#

Overall it looks fine, it's just important to make sure that it is exactly 8 pages (currently there are 3 missing lines)

patent gull Feb 1, 2024, 10:37 PM

#

i didn't realize that was a stipulation to be exactly 8 pages, no less! but i will return to flesh out the discussion section a little bit better anyway

#

so i'll make it work

fallow egret Feb 2, 2024, 8:00 AM

#

patent gull i didn't realize that was a stipulation to be exactly 8 pages, no less! but i wi...

Yes, it's stupid but you can get an automatic rejection over such things. It can be easily solved when everything is finished by playing a little bit with the figure sizes/caption spacing

tepid gazelle Feb 5, 2024, 3:51 PM

#

Hey @versed flax or others on the CFG paper, we're using CFG as a baseline for a new project and I had a question about the merged HF implementation of CFG which I thought you might know the answer to:

Is the prompt being conditioned on by HF generate() the entire input sequence (and does this stay static / you don't add new generated tokens to this extra-conditioned prompt as you go on?) I think the answer to this is yes but wanted to confirm.
and also, is there a way to pass settings to HF generation such that only a sub-prefix of the initial input sequence is more strongly conditioned on?

we'd like to be able to pass "<Instruction1>..... <context here>" as input, and only condition on <Instruction1> when generating further output from the model

#

Thanks!

versed flax Feb 5, 2024, 3:52 PM

#

tepid gazelle Hey @versed flax or others on the CFG paper, we're using CFG as a basel...

The input_ids given to .generate() is the "positive" prompt, the one given to negative_prompt_ids is the negative (/ unconditional) prompt. Sampled tokens are appended to both during generation

tepid gazelle Feb 5, 2024, 3:53 PM

#

versed flax The input_ids given to .generate() is the "positive" prompt, the one given to ne...

Hm, so continually updating both to include new generated tokens is the desired approach?

versed flax Feb 5, 2024, 3:54 PM

#

yes

tepid gazelle Feb 5, 2024, 3:54 PM

#

gotcha

versed flax Feb 5, 2024, 3:54 PM

#

If you want to do what you say (which is exactly the same as Context-Aware Decoding), you want to use instr+ctx as positive and instr as negative

tepid gazelle Feb 5, 2024, 3:56 PM

#

I see, thanks!

#

Ah yeah going by their abstract that does sound like what we want. Thank you, appreciate it!

versed flax Feb 5, 2024, 3:57 PM

#

You're very welcome :)
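
For anyone reading this thread later, here is a rough sketch of the loop described above: one positive and one negative prompt, with every sampled token appended to both. It is a toy under stated assumptions, not the HF code. `toy_logits` is a hypothetical stand-in for a real model, the vocabulary size and token ids are made up, decoding is greedy, and the combination rule (unconditional log-probs plus `guidance_scale` times the conditional-minus-unconditional difference) is the usual CFG formula as I understand it.

```python
import math

VOCAB_SIZE = 8  # made-up vocabulary size for the toy model


def toy_logits(tokens):
    """Hypothetical stand-in for a language model: next-token logits for a sequence."""
    seed = sum((i + 1) * t for i, t in enumerate(tokens))
    return [math.sin(seed + 1.7 * v) for v in range(VOCAB_SIZE)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cfg_scores(cond_logits, uncond_logits, guidance_scale):
    """Combine log-probs as uncond + guidance_scale * (cond - uncond)."""
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def generate_with_cfg(positive_prompt, negative_prompt, guidance_scale, max_new_tokens):
    """Greedy decoding where each new token is appended to BOTH contexts."""
    positive = list(positive_prompt)
    negative = list(negative_prompt)
    generated = []
    for _ in range(max_new_tokens):
        scores = cfg_scores(toy_logits(positive), toy_logits(negative), guidance_scale)
        next_token = max(range(VOCAB_SIZE), key=scores.__getitem__)
        positive.append(next_token)  # the positive context grows...
        negative.append(next_token)  # ...and so does the negative one
        generated.append(next_token)
    return generated, positive, negative


# The instr+ctx vs. instr setup from the thread, with made-up token ids.
instr = [1, 2]
ctx = [3, 4, 5]
generated, positive, negative = generate_with_cfg(instr + ctx, instr, 1.5, 4)
print(generated)
```

With `guidance_scale` set to 1.0 the combined scores reduce to the positive prompt's own log-probs, and with 0.0 to the negative prompt's, which is a quick sanity check on the formula.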

fallow egret Mar 21, 2024, 5:34 PM

#

What a strange rejection 😦 just gave 4 without any concrete reason (besides typos)

versed flax Mar 21, 2024, 5:42 PM

#

fallow egret What a strange rejection 😦 just gave 4 without any concrete reason (besides ty...

should we reply? I'm pretty happy with the 7 and 6

#

like "lol dude why 4 just because of typos??"

fallow egret Mar 21, 2024, 5:46 PM

#

I think we should stress the novelty (which is the main weakness according to the other reviewers).
Let's hope he will either change his mind after he sees the other reviews (his confidence is only 3), or that the AC will kick him (this could happen with very high probability)

fallow egret Apr 2, 2024, 8:49 AM

#

Seems like we are above the threshold now 🤞 👀
