Evaluating Classifier-Free Guidance impact | EleutherAI

patent gull Jun 24, 2023, 4:34 PM

#

^ I could say the same thing

#

so let's just move forward

versed flax Jun 24, 2023, 4:34 PM

#

then I'd say it's better stated in the intro. Though it introduces the notation in Sec2

fallow egret Jun 24, 2023, 4:35 PM

#

Yes, I agree that something currently looks wrong in the first paragraphs of section 2 (until 2.1)

blissful garden Jun 24, 2023, 4:35 PM

#

Really? I thought I was rephrasing the flow and fixing the grammar and adding the last paragraph.
It was done before all our conversations, so if it was important to look at your original text, I'm sorry

fallow egret Jun 24, 2023, 4:37 PM

#

blissful garden Really? I thought I was rephrasing the flow and fixing the grammar and adding th...

This is fine 🙂 as I said feel free to change, I just want to have access to the original version, to see the diffs (it's just the account limitations, we should work with premium)

versed flax Jun 24, 2023, 4:37 PM

#

let me transfer the ownership to Stella then

fallow egret Jun 24, 2023, 4:38 PM

#

@versed flax you can also subscribe and cancel. It will not charge you anything, and we will have 14 days to work with premium

versed flax Jun 24, 2023, 4:38 PM

#

Ha, good.

fallow egret Jun 24, 2023, 4:39 PM

#

versed flax Ha, good.

Just remember to cancel 🙂

patent gull Jun 24, 2023, 4:41 PM

#

i have a premium sub

#

educational account

#

idk how to transfer, though

fallow egret Jun 24, 2023, 4:43 PM

#

IMO the line of work and the presentation of the prompt alignment issue should be only in the intro. In section 2 there should be only the formal notation (defining the first tokens in the sequence as the prompt)

patent gull Jun 24, 2023, 4:45 PM

#

woohoo we have full history!

patent gull Jun 24, 2023, 4:46 PM

#

fallow egret IMO the line of works and presenting the prompt alignment issue should be only i...

yup I made that change

blissful garden Jun 24, 2023, 4:49 PM

#

just commented out my remarks for Section 2 to make it cleaner. For the record I didn't touch the content of 2.1 and 2.2 except the equation numbering change.

patent gull Jun 24, 2023, 4:52 PM

#

ok great

#

section 1 and 2 look good to me

fallow egret Jun 24, 2023, 4:53 PM

#

blissful garden just commented out my remarks for Section 2 to make it cleaner. For the record I...

Just in time, I think I addressed all the issues (please check).
Also I filled in all the citations

patent gull Jun 24, 2023, 4:53 PM

#

Section 3 needs a roadmap before diving into 3.1... it's our most important section and we need to prime the reader to get all of our great results

#

wait.. it's all just inputted from another file

#

hm oh. right. catching up

fallow egret Jun 24, 2023, 4:54 PM

#

patent gull wait.. it's all just inputted from another file

Yes, you missed all the fun yesterday 🙂

patent gull Jun 24, 2023, 4:55 PM

#

😅

#

lucky me

#

ok.. what's the plan with Section 3? Elad thinks it needs a rewrite? I actually think that just putting in more structure:

- road map at the start
- first/last sentences in each subsection tying it to the overall picture

is fine

blissful garden Jun 24, 2023, 4:55 PM

#

so do we restructure section 3?

patent gull Jun 24, 2023, 4:56 PM

#

☝️

blissful garden Jun 24, 2023, 4:56 PM

#

Elad has a point in terms of classifying benchmarks based on tasks. But we need to think carefully about that. Might mean we split up some tables and stuff

patent gull Jun 24, 2023, 4:56 PM

#

"classifying benchmarks based on tasks" <- can someone summarize this for me?

fallow egret Jun 24, 2023, 4:57 PM

#

cc @patent gull

blissful garden Jun 24, 2023, 4:57 PM

#

here, seems like we have 3 categories for the general benchmarks

patent gull Jun 24, 2023, 4:57 PM

#

cool thanks

#

ok yeah i do agree with this

#

I was thinking that even just doing points 1/3/4 would really clarify things

blissful garden Jun 24, 2023, 4:59 PM

#

so overall do we do

- common sense reasoning
- close-book QA
- completion
- machine translation
- code generation

patent gull Jun 24, 2023, 5:00 PM

#

but 5 makes a lot of sense to me, too. #2 looks more like a statement to me? idk what's the proposal, there

fallow egret Jun 24, 2023, 5:00 PM

#

patent gull but 5 makes a lot of sense to me, too. #2 looks more like a statement to me? idk...

Yes, 2 is actually explaining why 5 is needed

patent gull Jun 24, 2023, 5:01 PM

#

gotcha

#

sure. I did like the idea of lm_harness as its own standalone, since I think people will start to see that more and more, and just know what it is

#

but i think breaking it up is also something I support

#

does this imply that we need to redo Figure 1 to be more broken up?

patent gull Jun 24, 2023, 5:04 PM

#

blissful garden so overall do we do - common sense reasoning - close-book QA - completion - mach...

ok... here's what i think. I think the order should be:

1. close-book QA
2. completion
3. common sense reasoning
4. machine translation
5. code generation

And I think we need to designate a single person to take this on, or at least direct other people, since it's a bigger-picture change that touches a lot of people's work

blissful garden Jun 24, 2023, 5:04 PM

#

patent gull does this imply that we need to redo Figure 1 to be more broken up?

yes, will need smaller table for each task like the llama paper

fallow egret Jun 24, 2023, 5:05 PM

#

patent gull does this imply that we need to redo Figure 1 to be more broken up?

I think it's fine to put it in one figure and refer in different subsections to the same figure

fallow egret Jun 24, 2023, 5:06 PM

#

patent gull ok... here's what i think. I think the order should be: close-book QA completio...

Sounds great

blissful garden Jun 24, 2023, 5:06 PM

#

fallow egret I think it's fine to put it in one figure and refer in different subsections to ...

oh if it works like this I'm fine

fallow egret Jun 24, 2023, 5:07 PM

#

🙌 I'm excited, I think it will really improve the quality of the paper!

blissful garden Jun 24, 2023, 5:08 PM

#

patent gull ok... here's what i think. I think the order should be: close-book QA completio...

there was also a concern that it might turn out that we tried too many common sense reasoning tasks

patent gull Jun 24, 2023, 5:08 PM

#

i see

blissful garden Jun 24, 2023, 5:08 PM

#

completion is just lambada, QA are just triviaqa and sciq, machine translation is just wmt14 fr-en, and then a whole army of reasoning tasks

fallow egret Jun 24, 2023, 5:09 PM

#

As I said, in my opinion it will actually reduce their volume, since this whole list only gets one subsection (assuming all subsections have roughly the same length, as they should)

blissful garden Jun 24, 2023, 5:09 PM

#

it's also funny that some non-reasoning tasks (sciq and lambada) tend to get bigger improvements than reasoning tasks which I don't understand 😂

patent gull Jun 24, 2023, 5:11 PM

#

alright i need to get breakfast/etc. can someone fill in this structure here with all the tasks that are gonna go in each part? just visually so we can agree on it and not go back-and-forth a million times:

https://docs.google.com/document/d/1LEDBU1xtue-x4793hRGNOLMJmgDqKou6LpNO8md8IUU/edit?usp=sharing

Google Docs

CFG section 3/4 breakdown

Section 3 close-book QA completion common sense reasoning machine translation code generation

#

alright, i'll be back in 10-20. I think once it's mapped out, I might be able to take on this task of rewriting Section 3, but not sure if i'm mentally able to after EMNLP

blissful garden Jun 24, 2023, 5:16 PM

#

There are some sporadic experiments like the GPT-J codegen completion and my initial image generation tasks. Maybe let me move them to appendix?

Also moving the figure 1 up to 3.1 (somehow it ended up in 3.3)

fallow egret Jun 24, 2023, 5:22 PM

#

blissful garden There are some sporadic experiments like the GPT-J codegen completion and my ini...

I think it's a good idea

loud adder Jun 24, 2023, 5:28 PM

#

When would be a productive time for me to read through the paper today

patent gull Jun 24, 2023, 5:46 PM

#

@loud adder here are where your comments might be most appreciated/productive:

section 1,2 are done. Style edits/thoughts appreciated
section 3: we are debating structure. If you could see how the structure feels to you currently, and whether you think it needs an overhaul, that would be great

#

section 4, we’d appreciate your thoughts on content. last time I checked (pre EMNLP) was in a good state writing-wise. Not sure if it changed. But your thoughts on the content and whether you think it’s good/conclusive or needs more would be very appreciated!

#

Can't speak to the rest of the sections personally

versed flax Jun 24, 2023, 6:04 PM

#

blissful garden There are some sporadic experiments like the GPT-J codegen completion and my ini...

Meh. I see their value as introductory experiments, but are they interesting enough to be moved after the more detailed and thorough experiments, in the appendix?

blissful garden Jun 24, 2023, 6:06 PM

#

versed flax Meh. I see their value as introductory experiments, but are they interesting eno...

Yeah i was seeing them the same way and used it to introduce the humaneval. But if we base the whole section on well-established benchmarks, they don't seem to have places to go
I imagine if we structure around benchmark tasks it will be tightly written like "here is the QA, and next is reasoning, and next..."

versed flax Jun 24, 2023, 6:13 PM

#

blissful garden Yeah i was seeing them the same way and used it to introduce the humaneval. But ...

Then I'd try to compress them a lot (1 sentence) and move the detail to the appendix as you propose, yes

patent gull Jun 24, 2023, 6:13 PM

#

Yeah… “early experiments showed…” or “initial runs suggest that…”

versed flax Jun 24, 2023, 6:13 PM

#

exactly

blissful garden Jun 24, 2023, 6:13 PM

#

patent gull Yeah… “early experiments showed…” or “initial runs suggest that…”

Yeah this definitely works.

patent gull Jun 24, 2023, 7:02 PM

#

i'm reading through Section 3 again with respect to @fallow egret 's new structure and now I actually have a counterargument...

#

the current organization scheme can roughly (with some reorg) be broken down into different "Methods", not "Tasks"... so another way to reorg would be:

3.1: zero-shot prompting
3.5: chain of thought
3.2-3.3 text-to-text generation
3.4: negative prompting (although maybe this deserves its own section?)

#

I actually think this is a little bit more logical than breaking it down by "Tasks" because this paper is really more about the CFG mechanism than it is about the semantics of the tasks.

With this breakdown, we can make a big-picture story that we're really trying to probe CFG with different parts of the prompt/prompt formulations. Those categories above are a nice breakdown

#

besides, i think if we break down into tasks, we'd have to have insights/hypotheses about why it does well at the tasks, specifically, and we don't have anything besides hand-waviness or trying to find citations.

This "methods" breakdown is really more about viewing CFG with different prompt setups, which then Section 4 addresses much more directly

blissful garden Jun 24, 2023, 7:08 PM

#

patent gull besides, i think if we break down into tasks, we'd have to have insights/hypothe...

Yeah and the army of reasoning tasks is gonna stand out. Would be hard to justify our bias other than saying "other people do it too"

patent gull Jun 24, 2023, 7:09 PM

#

right, it's a little unbalanced if we go with the task breakdown

fallow egret Jun 24, 2023, 7:11 PM

#

I'm completely fine with this 'method' split as long as we emphasize it. I agree that this distinction sounds much better. The only thing is that I think it's a little bit strange to classify code generation as text-to-text, and it sounds like 3.2 and 3.3 should be merged

patent gull Jun 24, 2023, 7:11 PM

#

i'm open to either way, it was just something that came to mind, btw

#

i think yeah my only qualm with "Tasks" was I was starting to write it in my mind

#

and realized that I didn't have a great hypothesis/justification for why CFG would do well in common sense, QA, etc.

#

besides just repeating over and over again "more adherence to the prompt"

fallow egret Jun 24, 2023, 7:12 PM

#

Yes, I agree that it's a much better split and better stresses the different effect for each method

patent gull Jun 24, 2023, 7:13 PM

#

which maybe that'll also work 🤷‍♂️ idk. but it doesn't feel like it builds, actually, in the same way the methods split does

#

plus, your work gets to stand on its own, now 😉

#

ok i'll take a stab at putting that structure in the beginning, and if it doesn't work, we can always evaluate and take a diff direction

fallow egret Jun 24, 2023, 7:14 PM

#

Lol, this doesn't affect my vote 🙂

blissful garden Jun 24, 2023, 7:14 PM

#

fallow egret I'm completely fine with this 'method' split as long as we emphasis it. I agree ...

Yeah if there is a narrative to combine translation with codegen I'm fine

fallow egret Jun 24, 2023, 7:15 PM

#

Ok, so for me the method split sounds indeed more reasonable, any objection?

versed flax Jun 24, 2023, 7:16 PM

#

I'm trapped because of timezones but will do another pass later

patent gull Jun 24, 2023, 7:16 PM

#

i'm gonna update the old_sections/experiment file

#

we can always go a different direction with another file lol

fallow egret Jun 24, 2023, 7:19 PM

#

patent gull i'm gonna update the `old_sections/experiment` file

By the way, IMO working with a separate tex file for each section is much more convenient

patent gull Jun 24, 2023, 7:44 PM

#

alright, i did a little bit showing what kind of structure I have in mind. Haven't finished up the last parts ... will be back in a bit to do that

#

language might be a little sloppy, especially around the "we hypothesize..." bits.... feel free to change!!

unique sedge Jun 24, 2023, 7:48 PM

#

patent gull the current organization scheme can *roughly* (with some reorg) be broken down i...

I like this idea

#

Negative prompting probably deserves its own section if it's like a good bridge between going from tasks to the section where we talk about why it works.

Every section should add something to the global thread and make the case stronger

versed flax Jun 24, 2023, 7:50 PM

#

patent gull the current organization scheme can *roughly* (with some reorg) be broken down i...

This makes a lot of sense. A lot.

#

Negative prompting needs to be addressed separately I guess, especially for future work

patent gull Jun 24, 2023, 7:51 PM

#

We have those really good human evals though, right?

versed flax Jun 24, 2023, 7:51 PM

#

We need to find a way to make it work. It's just too powerful. It won't be for this paper but it's good to address it

versed flax Jun 24, 2023, 7:52 PM

#

patent gull We have those really good human evals though, right?

Yes. That's a "mild" neg prompting situation

#

But it's still one.

patent gull Jun 24, 2023, 7:52 PM

#

Hmmm let’s see. Idk it still might fit

versed flax Jun 24, 2023, 7:53 PM

#

I mean, it's not as granular or interesting as I wanted it to be, but it's still neg prompting and the results are quite awesome

patent gull Jun 24, 2023, 7:54 PM

#

In my opinion I think it fits in with a method-driven reorg of section 3

#

Although maybe we’ll be able to evaluate better once it’s all written

versed flax Jun 24, 2023, 8:16 PM

#

I'll work on that as I get home

patent gull Jun 24, 2023, 10:05 PM

#

i had a thought for another explanatory experiment, what do you guys think

#

so we argue that CFG increases the adherence to the prompt

#

this implies that true continuation w_c is more likely under true prompt w_p vs. another random prompt w_{p'} in the CFG setting vs. the vanilla setting

#

so we measure \delta p = p(w_c | w_p) - p(w_c | w_{p'})

versed flax Jun 24, 2023, 10:07 PM

#

Yes. That's what we're trying to measure with KL, Kendall tau etc

patent gull Jun 24, 2023, 10:08 PM

#

but we're not holding the continuation the same

#

and testing different prompts

versed flax Jun 24, 2023, 10:18 PM

#

patent gull but we're not holding the continuation the same

Uh?

patent gull Jun 24, 2023, 10:25 PM

#

what we're testing with KL, Kendall, etc. is whether the logit distributions of CFG look similar to Instruction-tuned models

#

not explicitly whether they're following the prompts better

#

and what I'm saying is that if a model ISN'T following the prompt well, we would expect this delta:

$\delta p = p(w_c | w_p) - p(w_c | w_{p'})$

to be lower

#

than a model that is
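
A rough sketch of how this delta could be computed, for anyone who wants to try it. Everything here is a toy assumption: a 3-token vocabulary, a single-token continuation, hand-picked logits, and the hypothetical helper names `cfg_log_probs` and `prompt_delta`. The CFG step is the usual log-space mix of conditional and unconditional distributions, renormalized; with gamma = 1 it reduces to the vanilla model.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def cfg_log_probs(cond_logits, uncond_logits, gamma):
    """CFG in log space: uncond + gamma * (cond - uncond), then renormalize."""
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [u + gamma * (c - u) for c, u in zip(cond, uncond)]
    return log_softmax(mixed)

def prompt_delta(logits_true, logits_random, uncond_logits, token, gamma):
    """delta p = p(w_c | w_p) - p(w_c | w_p') for a single continuation token."""
    p_true = math.exp(cfg_log_probs(logits_true, uncond_logits, gamma)[token])
    p_rand = math.exp(cfg_log_probs(logits_random, uncond_logits, gamma)[token])
    return p_true - p_rand

# Hypothetical next-token logits; token 0 plays the role of the true continuation w_c.
logits_true_prompt = [2.0, 0.5, 0.0]    # model conditioned on the true prompt w_p
logits_random_prompt = [0.2, 1.0, 0.5]  # model conditioned on a random prompt w_p'
logits_unconditional = [0.5, 0.5, 0.5]  # model with no prompt

delta_vanilla = prompt_delta(logits_true_prompt, logits_random_prompt,
                             logits_unconditional, token=0, gamma=1.0)
delta_cfg = prompt_delta(logits_true_prompt, logits_random_prompt,
                         logits_unconditional, token=0, gamma=1.5)
print(delta_vanilla, delta_cfg)
```

In a real run the logits would come from the model and the delta would be summed or averaged over the continuation tokens; the toy numbers only show the mechanics, not a result.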

versed flax Jun 24, 2023, 10:27 PM

#

patent gull not explicitly whether they're following the prompts better

You do, with entropy, don't you?

patent gull Jun 24, 2023, 10:27 PM

#

we're testing something more like:

$m = < p_1(w_c | w_p) || p_2(w_c | w_p)>$
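
For concreteness, a minimal sketch of this kind of distribution-to-distribution comparison, using the two measures mentioned above (KL divergence and Kendall tau). The two distributions are made-up toy values standing in for $p_1$ (CFG) and $p_2$ (instruction-tuned), and the function names are only illustrative; the tau here is the simple tau-a variant.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs; ties count as neither."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)

# Hypothetical next-token distributions over the same 4-token vocabulary.
p_cfg = [0.70, 0.20, 0.07, 0.03]
p_instruct = [0.60, 0.25, 0.05, 0.10]

kl = kl_divergence(p_cfg, p_instruct)
tau = kendall_tau(p_cfg, p_instruct)
print(kl, tau)
```

Both numbers describe how similar the two models' distributions are for the same prompt; neither varies the prompt, which is the difference from the delta above.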

patent gull Jun 24, 2023, 10:28 PM

#

yeah I try to make the argument with entropy, but I was just thinking about another way to test the argument maybe more directly

#

anyway, I'm gonna keep editing section 3

#

that was just a passing thought

versed flax Jun 24, 2023, 10:31 PM

#

patent gull yeah I try to make the argument with entropy, but I was just thinking about anot...

What would be more direct than cross entropy of gold targets?

patent gull Jun 24, 2023, 10:32 PM

#

lower entropy is evidence of prompt adherence, but not bullet proof

#

there are other reasons why entropy might decrease besides greater prompt adherence

versed flax Jun 24, 2023, 10:33 PM

#

It shows better language modeling

patent gull Jun 24, 2023, 10:34 PM

#

uhh yeah i think you're right

#

but in theory, the model could be both doing well on benchmarks and generating crappy english

#

totally possible to overfit on benchmarks

patent gull Jun 24, 2023, 11:31 PM

#

ok done editing Section 3

#

left some comments, didn't touch "Continuations"

#

but i did a lot of work trying to make it more structural and flow together better

#

please let's not make major changes without a discussion here!! I'll try to look at Section 4 later tonight or tomorrow

#

who wants to take a stab at the conclusion? if no one does by the time i'm done with Section 4, then I will

#

i think we're close, everyone. it's shaping up

#

appendices need work, but the language in the main body is really coming together, I think

versed flax Jun 24, 2023, 11:35 PM

#

@patent gull I'm back home

#

How can I maximize my usefulness?

patent gull Jun 24, 2023, 11:41 PM

#

Cool!

#

I left some comments in the negative prompting section

#

If you take a pass at those then I can look later

#

And then we all have to start addressing that appendix lol…..

blissful garden Jun 24, 2023, 11:57 PM

#

~~are [] in section 3 placeholders for citations?~~ oh yeah they are. Fixed some stuff for Section 3.3

versed flax Jun 25, 2023, 12:26 AM

#

Sec 2.1 uses r, Sec 2.2 uses gamma (and so do all the figures). Does anyone have a well-thought-out opinion on the notation we should prefer?

blissful garden Jun 25, 2023, 12:33 AM

#

versed flax Sec 2.1 uses r, Sec 2.2 uses gamma (and so do all the figures). Has someone a we...

which is the most famous (cited) paper around this CFG or classifier guidance thing?

versed flax Jun 25, 2023, 12:33 AM

#

Well, Ho & Salimans is the reference paper for cfg

blissful garden Jun 25, 2023, 12:33 AM

#

I always go for the notation of the most-known paper unless there is a counter-argument for the choice

blissful garden Jun 25, 2023, 12:34 AM

#

versed flax Well, Ho & Salimans is the reference paper for cfg

Let's try to see if we can be consistent with theirs

versed flax Jun 25, 2023, 12:34 AM

#

IIRC sec 2.1 goes with CG's notation while 2.2 goes with CFG but I need to double check

#

ok no cfg uses w

#

it's the blog post that uses gamma

blissful garden Jun 25, 2023, 12:36 AM

#

Looking at 3.1, fixed some minor naming and citation problems.

close-book QA \cite{}, common sense reasoning tasks \cite{}, and sentence completion-tasks \cite{}
Do we have citations for each of these task categories? I don't recall any.

versed flax Jun 25, 2023, 12:36 AM

#

I don't know any citation that would fit here (but my NLP culture is small)

blissful garden Jun 25, 2023, 12:37 AM

#

Yeah unless we throw all the citations of benchmarks in their corresponding spots. Leave it here for now. If nobody has better idea, we can remove these empty citations.

blissful garden Jun 25, 2023, 12:41 AM

#

versed flax ok no cfg uses w

hmm should we make this change across the paper? Seems like a big move and let's see other people's opinions

versed flax Jun 25, 2023, 12:41 AM

#

blissful garden hmm should we make this change across the paper? Seems like a big move and let's...

honestly I hate w which is already way too overloaded in deep learning

blissful garden Jun 25, 2023, 12:42 AM

#

versed flax honestly I hate w which is already way too overloaded in deep learning

I hate it too. Also we have w_i for the prompting stuff

versed flax Jun 25, 2023, 12:43 AM

#

exactly. "words", "weights", omega

blissful garden Jun 25, 2023, 12:43 AM

#

versed flax honestly I hate w which is already way too overloaded in deep learning

is there another paper using r or gamma?

versed flax Jun 25, 2023, 12:43 AM

#

I'm checking

blissful garden Jun 25, 2023, 12:46 AM

#

by the way is it a standard practice to cite blog posts in ML?

versed flax Jun 25, 2023, 12:47 AM

#

blissful garden by the way is it a standard practice to cite blog posts in ML?

I looked for standard latex practice, and the rule of thumb is "just do your best" lol

patent gull Jun 25, 2023, 12:47 AM

#

Ohh yah the notation definitely needs to be standardized

patent gull Jun 25, 2023, 12:47 AM

#

blissful garden Yeah unless we throw all the citations of benchmarks in their corresponding spot...

Would be good to include at least one citation that explains each task, otherwise we have to explain what they are and it’s kinda tangential

#

Sorry I just threw those in there. Feel free to ignore. I can also do the work of finding those citations myself, sometimes it’s just easier to divide the labor and not switch between lots of tabs

#

But my rule of thumb is “define or cite”

#

“If you can’t cite, define. If you don’t feel like defining, cite”

blissful garden Jun 25, 2023, 12:49 AM

#

patent gull Sorry I just threw those in there. Feel free to ignore. I can also do the work o...

oh yeah it's totally fine. I'm already filling in a few for you in section 3

patent gull Jun 25, 2023, 12:49 AM

#

Cool cool

blissful garden Jun 25, 2023, 12:49 AM

#

great job revising section 3 by the way!

patent gull Jun 25, 2023, 12:50 AM

#

I’m gonna be back online later tonight

#

Thanks!!!

blissful garden Jun 25, 2023, 12:51 AM

#

oh a super minor question, I saw "...– i.e.". Is this alright? I mean I remember always seeing things like "..., i.e., ...".

versed flax Jun 25, 2023, 12:52 AM

#

cgf fix and Imagen use w too but we still hate it

blissful garden Jun 25, 2023, 12:52 AM

#

versed flax cgf fix and Imagen use w too but we still hate it

we have a reason to override them though: w is already used in the explanations of prompting

versed flax Jun 25, 2023, 12:53 AM

#

blissful garden oh a super minor question, I saw "...– i.e.". Is this alright? I mean I remember...

In English you can add a note - like this - or (like this)

#

meaning this is not "... - ie." but "... - ie ... - ..." and is indeed correct

blissful garden Jun 25, 2023, 12:55 AM

#

versed flax meaning this is not "... - ie." but "... - ie ... - ..." and is indeed correct

yeah but I just never see it in academic papers though.
Like in papers we also have to expand things in full instead of "we're", "I'm", "don't", "Let's", ... (unlike French lol). I just worry there is a rule somewhere

blissful garden Jun 25, 2023, 12:57 AM

#

versed flax cgf fix and Imagen use w too but we still hate it

what about we keep using \gamma and add a footnote explaining our choice

versed flax Jun 25, 2023, 12:58 AM

#

the dudes in Imagen are just losing it and not even trying to hide it lmao

versed flax Jun 25, 2023, 12:59 AM

#

blissful garden what about we keep using \gamma and add a footnote explaining our choice

Then I advocate using gamma in 2.1 as well and not explain anything. We don't have to justify ourselves for changing a letter, do we?

blissful garden Jun 25, 2023, 12:59 AM

#

versed flax Then I advocate using gamma in 2.1 as well and not explain anything. We don't ha...

notation is quite important though

#

Sorry I'm just trained as a mathematician. I guess it's alright in ML

versed flax Jun 25, 2023, 1:01 AM

#

You're the maths guy. however I don't recall reading a paper saying "sorry tho we changed the letter because it was unadapted xoxo". They just do it

#

(just like cfg do not justify their choice for not keeping s)

blissful garden Jun 25, 2023, 1:03 AM

#

versed flax (just like cfg do not justify their choice for not keeping s)

a lot of math papers don't explain either and some of them cause massive problems for younger people. We were taught to be kinder for the audiences

versed flax Jun 25, 2023, 1:04 AM

#

I would consider the notation more rigorously if this were a mathy paper where connecting properly to the previous work was important because of some complex derivation etc

#

but here the mathiness is mostly for us to look like cool kids and the equation is absolutely trivial

#

That being said, I do have some kind of French "good enough" attitude, and it sometimes needs not to be tolerated

blissful garden Jun 25, 2023, 1:06 AM

#

versed flax but here the mathiness is mostly for us too look like cools kids and the equatio...

French mathematicians are crazy and ruthless about details. Like literally driving us crazy when being a student

blissful garden Jun 25, 2023, 1:07 AM

#

versed flax I would consider the notation more rigoroustly if this were a mathy paper were c...

yeah I think it's alright to just follow what others do

versed flax Jun 25, 2023, 1:07 AM

#

Oh I'm not talking about french mathematicians, whatever their nationality, they're a species on their own lmao

blissful garden Jun 25, 2023, 1:08 AM

#

versed flax Oh I'm not talking about french _mathematicians_, whatever their nationality, th...

oh that was a random remark about French guys

versed flax Jun 25, 2023, 1:08 AM

#

haha

versed flax Jun 25, 2023, 1:09 AM

#

blissful garden yeah I think it's alright to just follow what others do

we've talked about a lot of "others", which one do we follow then?

blissful garden Jun 25, 2023, 1:09 AM

#

versed flax we've talked about a lot of "others", which one do we follow then?

definitely not mathematicians lol

versed flax Jun 25, 2023, 1:10 AM

#

theeeeeeeeeeeen... gamma?

blissful garden Jun 25, 2023, 1:10 AM

#

versed flax we've talked about a lot of "others", which one do we follow then?

I was referring to Ho & Salimans. They choose their notation then we choose ours

#

learn from the best

#

I'm still learning the ML culture and I'm definitely not stubborn about my own habits

versed flax Jun 25, 2023, 1:28 AM

#

Gamma it is then. @fallow egret did you have a strong reason to use r in 2.1, and if so, should we reconsider the notation in the rest of the paper? If not, are we good changing r to gamma for consistency?

versed flax Jun 25, 2023, 2:28 AM

#

Damn I was quite happy with our former way of presenting CFG in the intro. It was more generic and we naturally derived the negative prompting and "promptless" setting easily

#

it's much harder now to go the other way "indeed... promptless is a particular case, you're not forced to negatively condition on the empty sequence, it can be anything, here's the actual generalized CFG formula haha what a nice trick we pulled on you!" lol
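
For reference, a sketch of the two forms being contrasted, written with $\gamma$ as agreed above. The symbols $c$ (prompt) and $\bar{c}$ (negative prompt) are assumptions for illustration and may not match the final notation in the paper.

```latex
% "Promptless" CFG: the baseline term drops the prompt c entirely.
\hat{P}(w_i \mid w_{<i}, c) \;\propto\;
  P(w_i \mid w_{<i})
  \left( \frac{P(w_i \mid w_{<i}, c)}{P(w_i \mid w_{<i})} \right)^{\gamma}

% Generalized form: the baseline is conditioned on an arbitrary negative prompt \bar{c}.
\hat{P}(w_i \mid w_{<i}, c, \bar{c}) \;\propto\;
  P(w_i \mid w_{<i}, \bar{c})
  \left( \frac{P(w_i \mid w_{<i}, c)}{P(w_i \mid w_{<i}, \bar{c})} \right)^{\gamma}
```

Taking $\bar{c}$ to be the empty sequence recovers the first line, and $\gamma = 1$ recovers the plain conditional model in both.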

fallow egret Jun 25, 2023, 3:46 AM

#

versed flax Gamma it is then. @fallow egret did you have a strong reason to use r i...

This was the notation in the original cfg paper, so I used their notation.
But I don't think that it's important to stick to previous work notation, so I'm good with changing to gamma

fallow egret Jun 25, 2023, 4:15 AM

#

versed flax Damn I was quite happy with our former way of presenting CFG in the intro. It wa...

Do you have a theoretical justification why you are naming it CFG in this general case?
Because I don't see what the classifier part is in this case. So if you change it to a generic notion, the name should be changed. IMO the paper in its current state has much more theoretical meat, and I find it more interesting and different (compared to previous works)

patent gull Jun 25, 2023, 5:48 AM

#

great i went through and finished editing 3.4. There's one more little detail, @versed flax , and then I'll feel done. I feel like we're in a good spot with Section 3. Section 4 and 5 I wrote/edited. We're just a conclusion away from being done with the main body

fallow egret Jun 25, 2023, 7:49 AM

#

I want to write the appendix for CoT, is there any decision on the appendix structure?

stone umbra Jun 25, 2023, 2:07 PM

#

I've been reading this like a soap opera, and just wanted to say that this is really cool work. 👋
Also, this was pretty hilarious (from https://cfg.vermeille.fr/):

Prompt
How to choose a good learning rate?
Response
Sometimes you can't choose a learning rate. You can't control your learning rate. You have to let it run. It's like breathing. It's hard to control your breathing, but it's also what keeps you alive.

loud adder Jun 25, 2023, 2:35 PM

#

I talked about this paper with Yejin Choi this weekend and she thought it was quite interesting. A lot of her work recently has had a similar theme, in that it’s oriented towards how we can induce high quality behavior in cheaper models. Most of her work has been in terms of producing higher quality synthetic datasets, but she was pleasantly surprised how much of an impact one can have at inference time based on this paper.

versed flax Jun 25, 2023, 2:36 PM

#

So cool!

loud adder Jun 25, 2023, 2:37 PM

#

I’m getting on a plane to come home, but will have notes by Monday

versed flax Jun 25, 2023, 2:37 PM

#

Thank you so much

versed flax Jun 25, 2023, 2:52 PM

#

@fallow egret, reading 2.1 I think you mixed \propto and \sim, I fixed that. I'll change r for \gamma later. It's totally omitting that CG uses the gradient of the external classifier otherwise people will actually wonder how that works (for now I reintroduced the commented sentence about it. Also the sentence following it started with "This modification" and there was no modification introduced)

#

I'll fix that tonight if you're okay

patent gull Jun 25, 2023, 3:37 PM

#

fallow egret I want to write the appendix for CoT, is there is any decision on the appendix s...

Elad thanks for starting the appendix!! I think you can write up your appendix section now, just summarize your tables and such. I’m not of the opinion that appendices need to have great stories and cohesion

#

We’ll include a one-page appendix map and table of contents

#

But otherwise I don’t really think it all needs to tie together. Maybe others have different opinions

fallow egret Jun 25, 2023, 3:38 PM

#

versed flax @fallow egret, reading 2.1 I think you mixed \propto and \sim, I fixed ...

Yes, as I said a while ago, there is an issue with the constant. I was planning to do the opposite and carry a constant to make it more accurate, but I'm not sure which is better.
Sounds reasonable to add this sentence
Yes, the word "modification" is indeed unrelated and should be omitted

patent gull Jun 25, 2023, 3:38 PM

#

But a super integrated appendix sometimes gets reviewers saying “this should’ve been another paper. REJECT.” At least I’ve gotten that feedback before

patent gull Jun 25, 2023, 3:41 PM

#

loud adder I talked about this paper with Yejin Choi this weekend and she thought it was qu...

That’s so cool! Had she not heard about it already from Luke?

fallow egret Jun 25, 2023, 3:42 PM

#

patent gull Elad thanks for starting the appendix!! I think you can write up your appendix s...

Agreed, but what is the general structure? Will each subsection in section 3 have additional results, or are we mixing it? Because right now there are charts, additional experiments and generated samples.

patent gull Jun 25, 2023, 3:43 PM

#

That’s a good q

#

I think we can roughly keep the same structure in the appendix as we do in the paper

#

But not every section is gonna have a ton of results in the appendix and I think that’s ok

#

Let’s just reference “see appendix” in section 3 whenever suitable

#

Btw I’m gonna be away from my computer most of today, headed to the beach. Will check later tonight

#

Let’s everyone take a crack at the appendix. I think in general, if you put results in the appendix, you’re responsible for summarizing them

blissful garden Jun 25, 2023, 3:48 PM

#

patent gull Let’s everyone take a crack at the appendix. I think in general, if you put resu...

will work on A and B.
@versed flax do you have the data and code for Figure 9? I want to try and see how it looks if we add regression lines for red dots and blue dots separately.

versed flax Jun 25, 2023, 3:48 PM

#

I do

fallow egret Jun 25, 2023, 4:25 PM

#

@versed flax what is the timeline? I understand that you want to publish it on Wednesday, so everything should be wrapped tomorrow?

versed flax Jun 25, 2023, 4:27 PM

#

fallow egret @versed flax what is the timeline? I understand that you want to publ...

By Wednesday, the paper is in a releasable state. We use Thursday and Friday for stupid fixes like punctuation, typos, emergency stuff. Friday night, the paper is on arXiv :)

fallow egret Jun 25, 2023, 4:27 PM

#

versed flax By Wednesday, the paper is in a releasable state. We use Thursday and Friday fo...

Ok, this sounds good

versed flax Jun 25, 2023, 4:28 PM

#

That should give us enough time

#

This sounds reasonable to me

fallow egret Jun 25, 2023, 4:28 PM

#

Yes, I agree

blissful garden Jun 25, 2023, 4:59 PM

#

Brief summaries of my experiments in appendix are done (benchmarks and codegen). Some remarks are left for parts involving other people's works. Feel free to remove my remarks if they are dealt with.

patent gull Jun 26, 2023, 4:06 AM

#

Thanks @blissful garden !! I’ll take a look shortly and today/tomorrow will summarize my parts of the appendix. Once they’re all done I’ll wrap an appendix map and conclusion, unless @versed flax wants to write the conclusion.

blissful garden Jun 26, 2023, 4:29 AM

#

Added regression curves (using logistic regression because acc is bounded between 0 and 1).
This does support our claim that CFG inference efficiency is good for Lambada, where small LLaMA beats SOTA. But it sucks on most of the others.
How should we present this?
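
For reference, a minimal sketch of one way to fit such a curve. It assumes a plain least-squares fit on the logit scale against log10 of inference FLOP; the function names are hypothetical and this is not necessarily how the curves in the figure were produced.

```python
import math


def fit_logistic_curve(flop, acc, eps=1e-6):
    """Fit acc ~ sigmoid(a + b * log10(flop)) by least squares on the logit scale.

    Assumption: a simple linearised fit, not a maximum-likelihood logistic regression.
    """
    xs = [math.log10(f) for f in flop]
    ys = []
    for p in acc:
        # Clip accuracies away from 0 and 1 so the logit stays finite.
        p = min(max(p, eps), 1.0 - eps)
        ys.append(math.log(p / (1.0 - p)))
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_accuracy(a, b, flop):
    """Evaluate the fitted curve; the output always stays strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(a + b * math.log10(flop))))
```

Fitting one curve per group (for example with and without CFG) and plotting both on the same axes makes the gap between the two lines easy to read.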

fallow egret Jun 26, 2023, 4:31 AM

#

I will also try to close the CoT appendix today

versed flax Jun 26, 2023, 12:05 PM

#

blissful garden Added regression curves (using logistic regression because acc bounds between 0-...

I think it's cool because the lines are pretty often fairly close

#

it still demonstrates that a smaller language model+cfg is a decent substitute for a bigger one

#

I need to write that part

#

@fallow egret I reworked your 2.1 to be a lot more rigorous and adapted the notations

#

I tried to satisfy as much as possible your desire for a strong maths background and nice derivations

fallow egret Jun 26, 2023, 12:11 PM

#

👍 I will go over the section after I finish the CoT subsection

versed flax Jun 26, 2023, 12:11 PM

#

ty

loud adder Jun 26, 2023, 12:44 PM

#

blissful garden Added regression curves (using logistic regression because acc bounds between 0-...

Is this training FLOP or inference FLOP?

versed flax Jun 26, 2023, 12:44 PM

#

inference

loud adder Jun 26, 2023, 12:44 PM

#

(Also, there isn’t an “s” at the end)

versed flax Jun 26, 2023, 12:45 PM

#

noted

loud adder Jun 26, 2023, 12:45 PM

#

It’s a weird acronym, but it stands for FLoating-point OPerations

versed flax Jun 26, 2023, 12:45 PM

#

I guess I wanted to write FLOPs, bc I know what the acronym stands for

loud adder Jun 26, 2023, 12:46 PM

#

People say “flops” orally as a natural pluralization

#

But that’s not really right (it would be like writing Ls for “litres”)

versed flax Jun 26, 2023, 12:46 PM

#

gotcha

loud adder Jun 26, 2023, 12:47 PM

#

And does cause confusion because there are things we want to measure in FLOP-seconds

#

(Also some people incorrectly use FLOPS thinking it’s FLoating Operations Per Second, like mph or rpm lol)

#

Just for extra confusion

versed flax Jun 26, 2023, 12:48 PM

#

oh, was not aware of this one

blissful garden Jun 26, 2023, 1:02 PM

#

oh yeah I have always been confused by that S at the end too

loud adder Jun 26, 2023, 1:06 PM

#

blissful garden Added regression curves (using logistic regression because acc bounds between 0-...

Is this all in the same precision?

#

If you’re concerned about it not always improving the result per inference FLOP, I would stress that a) that’s a really hard ask and b) 99% of users are VRAM bottlenecked, not FLOP bottlenecked

blissful garden Jun 26, 2023, 1:08 PM

#

loud adder Is this all in the same precision?

Yes they are. @versed flax sent me the plotting script and he already did all the math work with the FLOP and copy-pasted the numbers there. Which precision did you use @versed flax ?

versed flax Jun 26, 2023, 1:08 PM

#

blissful garden Yes they are. @versed flax sent me the plotting script and he already d...

It's all from the harness, so the default fp32 I assume

loud adder Jun 26, 2023, 1:15 PM

#

I was mostly asking because I was curious if that was a source of variation in the plots

#

So the couple tasks that are discontinuous… is that because of multiple model families being shown?

versed flax Jun 26, 2023, 1:15 PM

#

yes

#

I need someone to check 2.2 for inaccuracies. I'm not a maths prodigy.

fallow egret Jun 26, 2023, 2:17 PM

#

Yes, it's still incorrect: either the normalization term is missing in the middle of eq 6, or it should use a proportionality sign

#

it's not 'equivalent to 2', it should be 'results in 2'

#

inconsistency in signs. OK, I will go over this section soon. Is the big P the final notation?

versed flax Jun 26, 2023, 2:20 PM

#

fallow egret Yes, it's still incorrect- missing either normalization argument in the middle o...

good catch

versed flax Jun 26, 2023, 2:22 PM

#

fallow egret + it's not 'equivalent to 2', it should be 'results in 2'

fixed

versed flax Jun 26, 2023, 2:22 PM

#

fallow egret + inconsistency in signs, ok I will go over this section soon, the big P is the ...

where?

fallow egret Jun 26, 2023, 2:23 PM

#

versed flax where?

I recompiled, now everything is consistent!

versed flax Jun 26, 2023, 2:23 PM

#

fallow egret I recompiled, now everything is consistent!

I did not catch the sign changes, where was that?

fallow egret Jun 26, 2023, 2:24 PM

#

small 'p' vs big P for probability

versed flax Jun 26, 2023, 2:24 PM

#

Oooooooooh, I thought you meant sign as in +/-

#

you meant symbol

#

Do you like the section?

fallow egret Jun 26, 2023, 2:26 PM

#

In 6 the last part of the equation is missing (going back to P(w | c) both in the numerator and denominator)

fallow egret Jun 26, 2023, 2:27 PM

#

versed flax Do you like the section?

I just had time for a quick look at the equations, it looks good. I will take a deeper look soon

versed flax Jun 26, 2023, 2:28 PM

#

fallow egret In 6 the last part of the equation is missing (going back to P(w | c) both in th...

I removed it but I was really not sure about the move. What were you showing with this? I thought you wanted to show that sampling a text with CFG can be done by autoregressively sampling each of its tokens, and that's why I stopped there.

fallow egret Jun 26, 2023, 2:28 PM

#

No, I want to go back to equation 2

#

The last step is missing, it's exactly equation 2 🙂

versed flax Jun 26, 2023, 2:29 PM

#

ah, I thought we wanted to go to eq 7 haha

fallow egret Jun 26, 2023, 2:30 PM

#

No, the point in this equation is to connect the autoregressive formula in 7, directly to eq 2 in the original work

#

This is the theoretical justification...
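
To make the autoregressive side concrete, here is a rough sketch of a single CFG decoding step. It assumes the convention where gamma = 1 recovers plain conditional sampling (other write-ups shift gamma by one), and the function names are hypothetical rather than taken from the paper's code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def cfg_next_token_probs(cond_logits, uncond_logits, gamma):
    """Next-token distribution proportional to p(w) * (p(w | c) / p(w)) ** gamma.

    cond_logits come from the model run with the prompt c,
    uncond_logits from the same model run without it.
    """
    log_p_cond = log_softmax(cond_logits)
    log_p_uncond = log_softmax(uncond_logits)
    # Move from the unconditional towards (and past) the conditional log-probs.
    mixed = [u + gamma * (c - u) for c, u in zip(log_p_cond, log_p_uncond)]
    # Renormalise so the result is a proper distribution over the vocabulary.
    return [math.exp(v) for v in log_softmax(mixed)]
```

Sampling each token from this distribution and appending it to both contexts gives the token-by-token procedure discussed above.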

versed flax Jun 26, 2023, 2:30 PM

#

Ok. It needs to be made more explicit in the text then imho

fallow egret Jun 26, 2023, 2:31 PM

#

It was explicit in text (the line after eq 6, we had 'this results in 2')

blissful garden Jun 26, 2023, 3:42 PM

#

Was going through Sec 2 with @versed flax carefully and personally I'm good with the whole Section 2 now.
(just one minor remark left at the last sentence)

blissful garden Jun 26, 2023, 4:19 PM

#

Appendix A.2 about the acc-FLOP chart is also done. I'm putting up this disclaimer. But feel free to add/change stuff.

patent gull Jun 26, 2023, 4:56 PM

#

i read thru section 2.1.... i can see you guys put a lot of work into it and it definitely shows!! the language is really tight, the math is useful. I left some small comments.

Personally I think there is some stuff that i think might be in-the-weeds...

Two points:

The introduction of p(z) early on.... we don't use p(z) anywhere else. Do we need to spell out this term? How does it help? Isn't the reader already going to be thinking about latent spaces?
The exploration into sample noise and diffusion... how necessary is this? Does thinking about sampling noise help us think about LMs? bc we don't really think about noise so much in the same way. I guess this could serve as a useful history/teaching, and i think it comes down to a personal preference whether to include or not, but i think there's a fair argument to be made that it's not directly useful to the overall NLP focus

#

However, that being said, it does look good.

I honestly had liked the previous structure of introducing negative prompting more in Section 3.4, because it did make the point that it was more of a side exploration rather than a main exploration.

However, if we do commit to having it in 2.1, then 3.4 needs to be significantly tightened... like there's still some of the introductory text there. and if negative prompting is introduced in 2.1 instead, maybe some of that text could be moved to 2.1 and then 3.4 is really just "negative prompting, as described in 2.1"

blissful garden Jun 26, 2023, 4:59 PM

#

Personally I'm okay with either a hand-wavy section 2 saying that we are inspired by SD, or a rigorous section 2 with careful derivations from SD despite the notations being useless elsewhere.

patent gull Jun 26, 2023, 5:00 PM

#

btw yah i don't mean this as a criticism, just a point for discussion, and ultimately i do really like this version better than the last

blissful garden Jun 26, 2023, 5:01 PM

#

patent gull i read thru section 2.1.... i can see you guys put a lot of work into it and it ...

Maybe we can stash some notations into appendix 🤔

patent gull Jun 26, 2023, 5:02 PM

#

i mean if you have an explanation for how those 2 bullet points help the reader, i'm convinced. also i think there really is an argument just for teaching the reader

#

i think a counter-argument to my points is that it does really solidify that we have a strong CV background here

versed flax Jun 26, 2023, 5:09 PM

#

patent gull i mean if you have an explanation for how those 2 bullet points help the reader,...

I'm cooking. I'll explain later.

patent gull Jun 26, 2023, 5:10 PM

#

ok sure

fallow egret Jun 26, 2023, 5:16 PM

#

Ok, I think CoT subsection is ready. I really like the edit that was done (probably @patent gull ?)
I added more experiments + some nice qualitative examples

patent gull Jun 26, 2023, 5:27 PM

#

great!! thanks Elad!!

#

this is nitpicking and no rush, but i would love it, if it's easy, if Figure 2 could be redone with font size=14

#

or 16. just to match Figures 3/4

fallow egret Jun 26, 2023, 5:32 PM

#

Sure, np

fallow egret Jun 26, 2023, 5:49 PM

#

@patent gull I changed it (font-14), I hope this is what you meant...

patent gull Jun 26, 2023, 7:11 PM

#

I’ll check thanks elad

fallow egret Jun 26, 2023, 7:17 PM

#

@versed flax I added many comments in section 2. With all the latest changes, it seems there are currently many inaccuracies (mainly in the mathematical part)

fallow egret Jun 27, 2023, 5:06 AM

#

@versed flax Can we iterate on one of the issues of section 2 here? It will be easier and faster.
In equation 1 the classifier guidance is introduced following the original paper. I don't understand why the given formula is unconditional (first term after the approx). I attached the original formula from the paper. Observe that the two terms are different probabilities (the one with theta is the generator and the one with phi is the external classifier)

Observe that you can't apply Bayes' rule at this stage to get what you wrote, since we are still in the CG case here (not CFG), which means that the classifier probability function and the generative are not the same function.
You can apply it only when moving to the CFG section, where it is indeed the same function.

P.S. Also, Bayes' theorem doesn't give you eq (1) in the CFG context (it should be divided by p(x)), but let's do it step by step...

patent gull Jun 27, 2023, 5:16 AM

#

@blissful garden I think Table 6 would look better as a percentage-normed horizontal stacked bar chart:

https://www.geeksforgeeks.org/stacked-percentage-bar-plot-in-matplotlib/

#

bc that's really what it's trying to show, right? We're supposed to see how CFG gets better?

#

it's kinda hard to parse that in the table with the numbers, since it's a different total in each row

#

each stack/row is in this order: [underperforms, ties, outperforms]
and then, different bar for each temp
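
A possible sketch of that chart, assuming matplotlib is available. `rows` would hold the [underperforms, ties, outperforms] counts from Table 6, one row per temperature; the function names and the output path are made up for illustration.

```python
def to_percentages(counts):
    """Normalise [underperforms, ties, outperforms] counts to percentages of the row total."""
    total = sum(counts)
    return [100.0 * c / total for c in counts]


def plot_stacked_percentages(rows, labels, path="table6_stacked.png"):
    """Draw one horizontal 100%-stacked bar per row (e.g. one bar per temperature)."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend, so this also works without a display
    import matplotlib.pyplot as plt

    categories = ["underperforms", "ties", "outperforms"]
    percents = [to_percentages(r) for r in rows]
    fig, ax = plt.subplots()
    left = [0.0] * len(rows)
    for k, name in enumerate(categories):
        widths = [p[k] for p in percents]
        # Each segment starts where the previous category ended.
        ax.barh(labels, widths, left=left, label=name)
        left = [l + w for l, w in zip(left, widths)]
    ax.set_xlabel("share of comparisons (%)")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
```

Because every bar is normalised to 100%, rows with different totals stay directly comparable.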

blissful garden Jun 27, 2023, 5:18 AM

#

patent gull @blissful garden I think Table 6 would look better as a percentage-normed h...

Oh great idea!

patent gull Jun 27, 2023, 5:18 AM

#

i can do that if you'd like

#

i guess those #s are easy enough to copy/paste

blissful garden Jun 27, 2023, 5:19 AM

#

patent gull i guess those #s are easy enough to copy/paste

oh that should be easy. Let me do it

patent gull Jun 27, 2023, 5:19 AM

#

up to you!

#

I'm gonna write the conclusion, then

loud adder Jun 27, 2023, 5:21 AM

#

If I haven’t started working on this in the next 12 hours please ping me and remind me to do so.

patent gull Jun 27, 2023, 5:30 AM

#

ok I'm done with the conclusion and done with my end of the appendix. I added a table of contents to the appendix, feel free to disagree with that design-choice

#

looks to me like we have v1.0 of a rough draft

#

I see one bit of orange text, let me address that. I haven't nearly begun to address all the comments, but I will do so

#

also i jotted down some limitations i could think of off the top of my head, at 2am, in the Conclusion

#

feel free to take a look and add your own... the more limitations we address, the better and more solid our paper

versed flax Jun 27, 2023, 9:09 AM

#

fallow egret @versed flax Can we iterate on one of the issues of section 2 here? It ...

> which means that the classifier probability function and the generative are not the same function
That's totally irrelevant, they're probability functions

#

Whether you decide to model them with different models or not, it's correct

#

CG just says "we use an external classifier to guide generation"

#

you say "I want to model P(x|c), I apply Bayes' rule, I get p(x) p(c|x), oh that's an unconditional generator and a classifier, let's train two networks", I don't see where your confusion comes from

#

There's indeed one small mistake here, it's the theta subscript. Fixed:

#

Other than that it's correct

fallow egret Jun 27, 2023, 10:03 AM

#

versed flax There's indeed one small mistake here, it's the theta subscript. Fixed:

Can you please write it step by step how did you get to your formula, assuming we agree that source is what I sent?

versed flax Jun 27, 2023, 10:04 AM

#

fallow egret Can you please write it step by step how did you get to your formula, assuming w...

> you say "I want to model P(x|c), I apply Bayes' rule, I get p(x) p(c|x), oh that's an unconditional generator and a classifier, let's train two networks", I don't see where your confusion comes from

fallow egret Jun 27, 2023, 10:08 AM

#

versed flax > you say "I want to model P(x|c), I apply Bayes' rule, I get p(x) p(c|x), oh th...

Again, these probabilities are not the same

versed flax Jun 27, 2023, 10:08 AM

#

What does that even mean?

fallow egret Jun 27, 2023, 10:08 AM

#

Each model defines a different probability function...

#

The parameters are there for a reason, it's simply a different probability function

versed flax Jun 27, 2023, 10:09 AM

#

and?

fallow egret Jun 27, 2023, 10:10 AM

#

You used the bayes formula with respect to P_theta

versed flax Jun 27, 2023, 10:10 AM

#

Are you saying that two models can't interact if they don't share the parameters?

fallow egret Jun 27, 2023, 10:12 AM

#

I'm saying that P_theta(x|c) multiplied by P_phi(x|c) is not equal to P_phi(x|c)^2

#

which is what you used to get your equation..

versed flax Jun 27, 2023, 10:13 AM

#

of course it is

fallow egret Jun 27, 2023, 10:14 AM

#

of course it's not, it's not the same probability

versed flax Jun 27, 2023, 10:14 AM

#

What does that even mean?

#

if you train P_phi(x|c) and P_theta(x|c), and they both are trained on a similar dataset, and are both expressive enough, they'll learn the same thing, P_phi=P_theta

fallow egret Jun 27, 2023, 10:16 AM

#

It's simply incorrect, otherwise ensemble methods will not work
The external classifier doesn't have to be trained on the same data, it's a non-valid implicit assumption

#

They weren't trained on the same objective (one is generative and the other is discriminative)

versed flax Jun 27, 2023, 10:26 AM

#

> It's simply incorrect, otherwise ensemble methods will not work
Ensembles work because the "they are expressive enough" assumption breaks. Ensembles are a bug, not a feature. They work because your model P_theta doesn't perfectly model P (whatever your model is or is supposed to be), so you average their mistakes to smooth them out. When writing theoretical derivations like this, you can assume the model is perfect, and that's a common assumption

fallow egret Jun 27, 2023, 10:27 AM

#

versed flax > It's simply incorrect, otherwise ensemble methods will not work Ensembles work...

So they are not the same function, and you can't assume they are equal. I simply don't understand why do you need this assumption? Why not simply write the original formula?

#

It makes the whole theoretical part invalid for no reason

versed flax Jun 27, 2023, 10:31 AM

#

Because 1) it makes more sense to tell a reader that we need to guide an unconditional generator than a conditional one (why would guidance be needed if it's already conditional?), 2) it saved me time in writing, with simpler explanations (time I ultimately spent, and more, arguing this with you), and 3) no, it's correct, and if you don't believe me, I quoted the CFG paper where that equality is laid out explicitly.

#

You train a model P_theta to be an approximation of P, it's fair to equate them in theoretical equations.

fallow egret Jun 27, 2023, 10:35 AM

#

It is incorrect for sure: you have two different probability functions on the same space, each one coming from a different model.
Having this assumption that they will converge to the same probability by some 'magic' is not valid in any applicable setting. Therefore, your theoretical framework doesn't model the reality.
In my opinion in this part things should be correct, this is the most important thing

#

@blissful garden @patent gull Can someone help with that?

versed flax Jun 27, 2023, 10:36 AM

#

fallow egret It is incorrect for sure, you have two different probability function on the sam...

It's not "magic", it's training

fallow egret Jun 27, 2023, 10:37 AM

#

versed flax It's not "magic", it's training

But it's not true. When you train the same network with the same objective and data on a different seed, you get a different probability function

#

And in this case it's not the same objective and data...

versed flax Jun 27, 2023, 10:37 AM

#

You absolutely get two very extremely similar ones. Or you're just not properly training your model.

fallow egret Jun 27, 2023, 10:37 AM

#

versed flax You absolutely get two very extremely similar ones. Or you're just not properly ...

No, it's not true. There are so many works on this topic...

#

https://www.microsoft.com/en-us/research/blog/three-mysteries-in-deep-learning-ensemble-knowledge-distillation-and-self-distillation/

versed flax Jun 27, 2023, 10:39 AM

#

If that were true, it would just make it impossible to compare models as the accuracy of two instances of a model trained twice would be vastly different

#

and that's also why ensembles work better with models with different architectures, slightly different training data, and model types. The quirks in approximating the theoretical P won't be the same ones.

fallow egret Jun 27, 2023, 10:43 AM

#

I gave you a clear reference (and I can give more) that this assumption is very controversial.
IMO we should not have this assumption, since as I said there is no reason to have this assumption.

versed flax Jun 27, 2023, 10:45 AM

#

Okay, whatever. You can propose a fix, but I'm not wasting more time on this

fallow egret Jun 27, 2023, 10:45 AM

#

I've been rejected over much less controversial assumptions in the theoretical part...

versed flax Jun 27, 2023, 10:45 AM

#

Good thing this isn't a theoretical work but more of an experimental one, then

#

The whole theoretical part was developed to please you

fallow egret Jun 27, 2023, 10:46 AM

#

My work that was rejected was also not theoretical.
Reviewers are searching for these implicit problematic assumptions (I'm also doing it as a reviewer)

versed flax Jun 27, 2023, 10:54 AM

#

Then go reject "Diffusion Models Beat GANs on Image Synthesis" (NeurIPS 2021) which introduced classifier guidance🤷‍♂️

#

Or "Score-based Generative Modeling Through Stochastic Differential Equations" (ICLR 2021)

fallow egret Jun 27, 2023, 11:00 AM

#

versed flax

There is no issue with what they wrote, they define here a reverse diffusion process, they are not claiming that the two probabilities are the same

versed flax Jun 27, 2023, 11:00 AM

#

They totally use a classifier in the same way

#

their eq 2 is literally the same one you're complaining about

fallow egret Jun 27, 2023, 11:03 AM

#

No, it's not because their generative model is unconditional (they start with unconditional diffusion model). In our case we apply a conditional generative model (as in CFG paper)

#

In any case they are not claiming that p_theta(x|c) = p_phi(x|c)

versed flax Jun 27, 2023, 11:08 AM

#

Look. I'm not wasting more time on this. We end this. Feel free to propose a nice, correct, high quality and fully written-up fix.
I've spent two full days fixing your 2.1 which exists only to please your desire for theoretical grounding.

The only next action I'm taking on this non issue is clicking an Accept or Reject button.

fallow egret Jun 27, 2023, 11:12 AM

#

I don't understand what was the issue with the original version that was correct from a theoretical perspective.
What are your thoughts? Maybe I'm biased as a mathematician, but in my opinion the theoretical part should be accurate
@patent gull @loud adder @blissful garden

versed flax Jun 27, 2023, 11:14 AM

#

Alex said "impressive improvements" and Honglu proofread that section so much that we basically co-wrote it

patent gull Jun 27, 2023, 12:49 PM

#

I’m not sure I am fully following the back-and-forth of this argument. And what I have to add certainly won’t settle it in a satisfying way. However I remember having a similar argument with my lab mate.

I have my own classifier-guided control paper: https://arxiv.org/pdf/2301.02299.pdf that has the same setup that @versed flax wrote… even less principled lol bc I don’t even notate two different sets of parameters.

Indeed, because of pretraining, p_theta(x) and p_phi(x|c) cannot even be assumed to be of the same linguistic domains. In our case, p(x) was vanilla GPT2 (i.e. general web) and p(x|c) was trained on news. One of the improvements we noticed was actually just due to fine tuning p(x) on the news domain (which shouldn’t have to happen in a theoretically perfect world).

No reviewer noticed. There are other classifier-based works with similar setups in NLP: FUDGE (https://arxiv.org/abs/2104.05218) and PPLM (https://arxiv.org/abs/1912.02164).

Indeed my lab mate published his work explicitly trying to address this: https://arxiv.org/abs/2205.14219.

At the end of the day, yes it is a problem, my labmate got a paper out of addressing the problem, BUT there is also a rich history of methods in this space and it’s uncontroversial at this point IMO. Most importantly, PPLM, FUDGE and my work all ALSO showed effectiveness, so it’s not an invalid setup

#

Yes, these works are all *CL, which maybe has a different set of reviewers and reviewer concerns than ICML/NeurIPS/etc. I’m less familiar with those reviewers. But I do think we should move on

#

I do think we can have a more comfortable debate about section 2 once we feel really good about the whole rest of the paper

versed flax Jun 27, 2023, 12:52 PM

#

I've spent more time on Sec2 than the rest of the paper combined. Definitely agree that we should move on. As I said, if someone is displeased with the current state, they're free to submit a good fix, but going back and forth in chats and criticizing isn't productive

patent gull Jun 27, 2023, 12:55 PM

#

Yeah….. I mean. Yeah. I’m trying to think of a concise way to frame this debate. Honestly maybe one sentence about it and then cite NADO (my lab mate’s paper) as proof that it’s an issue with classifier-based methods

#

But it doesn’t affect our work since we’re not using classifier based guidance

fallow egret Jun 27, 2023, 12:55 PM

#

@patent gull I agree that we can move on and come back to this at the end. I simply don't understand why we need to trust that the reviewer will not notice it, when it's a completely unnecessary assumption and simply using the conditional formula resolves the issue

patent gull Jun 27, 2023, 12:55 PM

#

I mean yeah it can literally be a sentence at the end of 2.1 saying “these works face issues….”

#

Yeah but that section is more summarizing the lit

#

It’s a problem w the lit

#

We don’t use classifieds

#

Classifiers

#

It’s not our theoretical problem

#

It’s the lit’s problem. Certainly in NLP, where it is NOT addressed typically

fallow egret Jun 27, 2023, 12:58 PM

#

I agree that it's not our problem. That is why I don't understand why we need to bring in this issue in the first place: it is completely unnecessary in our case and just raises unrelated questions

versed flax Jun 27, 2023, 12:58 PM

#

Just submit a good fix, Elad.

patent gull Jun 27, 2023, 12:58 PM

#

I mean I’m not trying to say it’s not important. It just doesn’t affect us so I think a reviewer would be wrong to point it out as a flaw with OUR work

#

I think I can put a short line in section 2.1 addressing this

versed flax Jun 27, 2023, 1:00 PM

#

Be productive. When you complained "The intro should start with the problem............." Alex proposed "We should swap first and second paragraph". One comment is clearly more productive and usable than the other and they still addressed the same point. Propose your fix.

fallow egret Jun 27, 2023, 1:02 PM

#

versed flax Be productive. When you complained "The intro should start with the problem........

I didn't want to modify the text without consent after last time...

versed flax Jun 27, 2023, 1:02 PM

#

He didn't either. He just proposed a solution.

fallow egret Jun 27, 2023, 1:07 PM

#

versed flax He didn't either. He just proposed a solution.

My solution is simple:
eq 1: it should be p_theta(x|c) instead of p_theta(x)
and then in eq (2) it should be p_theta(x|c)^(gamma+1) divided by p(x)^gamma
These are all the changes with respect to this part...
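
Spelled out, the proposed change would read roughly as follows. This is a sketch: normalization constants are absorbed into \propto, and it assumes, as in CFG, that the classifier term is obtained from the generative model itself through Bayes' rule (writing the unconditional term with the same theta subscript is an assumption on top of the message above).

```latex
\begin{align*}
\hat{p}(x \mid c) &\propto p_\theta(x \mid c)\, p(c \mid x)^{\gamma}
  && \text{(eq 1, conditional generator guided by a classifier)} \\
p(c \mid x) &= \frac{p_\theta(x \mid c)\, p(c)}{p_\theta(x)} \propto \frac{p_\theta(x \mid c)}{p_\theta(x)}
  && \text{(Bayes' rule; } p(c) \text{ is constant in } x\text{)} \\
\hat{p}(x \mid c) &\propto \frac{p_\theta(x \mid c)^{\gamma + 1}}{p_\theta(x)^{\gamma}}
  && \text{(eq 2)}
\end{align*}
```

The exponents \gamma + 1 and \gamma come from multiplying the conditional generator term by the \gamma-th power of the Bayes ratio.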

versed flax Jun 27, 2023, 1:08 PM

#

No, that's not "it", the text around it needs to be reworked as well, and address why we use classifier guidance since the model is already conditional.

fallow egret Jun 27, 2023, 1:10 PM

#

What do you mean? This is how you perform classifier guidance: you enhance the effect of the condition on the model with an external classifier

#

In CFG you also use a classifier guidance (where classifier is defined using your own model), on conditional generative model...

patent gull Jun 27, 2023, 1:13 PM

#

I don’t have the equations in my head right now (sorry, away from my computer)

#

But I’ll look and have an opinion on this when im in the office

#

I do feel like this isn’t top priority though since it’s entirely concerning background work (if I’m understanding correctly)

versed flax Jun 27, 2023, 1:14 PM

#

It's concerning notations on background work

#

It's the minorest thing in the minor things we have to address

fallow egret Jun 27, 2023, 1:15 PM

#

I agree it's not top priority, but IMO there are a few issues in sec 2 that should be resolved before submission

blissful garden Jun 27, 2023, 1:15 PM

#

Just woke up... Give me some time to read through.....

versed flax Jun 27, 2023, 1:15 PM

#

fallow egret I agree it's not top priority, but IMO there are few issues in sec 2, that shoul...

This is not one of them.

patent gull Jun 27, 2023, 1:17 PM

#

My sense is that 2.1 has been steadily getting longer, denser and ultimately harder for the reader to get thru before getting to our real contribution but I honestly don’t have it in my head really because I’ve been focusing on other things

blissful garden Jun 27, 2023, 1:33 PM

#

fallow egret My solution is simple: eq 1: should be p_theta(x|c) instead of p_theta(x) and th...

why \gamma + 1 and \gamma?

versed flax Jun 27, 2023, 1:34 PM

#

blissful garden why \gamma + 1 and \gamma?

blissful garden Jun 27, 2023, 1:35 PM

#

oh I see

tepid gazelle Jun 27, 2023, 2:44 PM

#

Hey btw, I'm going to be reverting the V3 of triviaqa https://github.com/EleutherAI/lm-evaluation-harness/pull/610 in the eval harness upstream: the results on it do not match LLaMA's reported performance whatsoever (way higher than they report), while V2 roughly does when accounting for prompt. Exact match is meant to be exact. That has its own problems, but we don't want to be able to rate "not Mark Twain" as correct if "Mark Twain" is the expected ground truth

GitHub

[triviaqa] The ground truth must be a *substring* of the generated ...

fix triviaqa
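
The difference between the two scoring rules can be sketched as follows. This is only an illustration, not the harness's actual code: the function names and the lowercase/strip normalization are assumptions.

```python
def exact_match(prediction: str, truth: str) -> bool:
    """Strict scoring: the normalized strings must be identical."""
    return prediction.strip().lower() == truth.strip().lower()


def substring_match(prediction: str, truth: str) -> bool:
    """Lenient scoring: the truth only has to appear somewhere in the prediction."""
    return truth.strip().lower() in prediction.strip().lower()


truth = "Mark Twain"
# Hypothetical generations, chosen to show where the two rules disagree.
for prediction in ["Mark Twain", '"Mark Twain"', "This is Mark Twain", "not Mark Twain"]:
    print(repr(prediction), exact_match(prediction, truth), substring_match(prediction, truth))
```

The substring rule accepts the quoted and padded answers, but it also accepts "not Mark Twain", which is the failure mode raised here.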

versed flax Jun 27, 2023, 2:47 PM

#

tepid gazelle Hey btw, I'm going to be reverting the V3 of triviaqa https://github.com/Eleuthe...

That's fair. I'm indeed checking the LLaMA paper and I definitely hallucinated this "substring" thing or read it elsewhere and got confused

#

I'm sorry I wasted some time and negatively impacted the productivity

tepid gazelle Jun 27, 2023, 2:49 PM

#

no need to apologize at all!!

#

sorry for intruding on your project channel

versed flax Jun 27, 2023, 2:50 PM

#

oh lol don't worry about it

tepid gazelle Jun 27, 2023, 2:50 PM

#

just wanted to alert since that changes what scores yall should report/maybe rerun unfortunately

versed flax Jun 27, 2023, 2:51 PM

#

yes, indeed. I think we will just report how we got those results

blissful garden Jun 27, 2023, 2:51 PM

#

so we use our original numbers for triviaqa?

versed flax Jun 27, 2023, 2:51 PM

#

bc there were many instances where the model generated something like "Mark Twain" (with the quotes) or "This is Mark Twain" (can you confirm @blissful garden ?)

blissful garden Jun 27, 2023, 2:53 PM

#

yeah I guess there are pros and cons for each. I dumped the write-out files and inspected them manually. Using substring was a lot better

blissful garden Jun 27, 2023, 2:53 PM

#

versed flax bc there were many instances where the model generated something like`"Mark Twai...

A LOT with extra words

#

here if you want to see

versed flax Jun 27, 2023, 5:10 PM

#

loud adder If I haven’t started working on this in the next 12 hours please ping me and rem...

It's been 12h :)

loud adder Jun 27, 2023, 5:40 PM

#

versed flax It's been 12h :)

Doing it now

patent gull Jun 27, 2023, 5:42 PM

#

Reg. Figure 9, the FLOPs tests
can we
(1) do statistical significance tests on the plots (I think f-tests is the right one?)
(2) draw confidence regions? We can establish these using bootstrapping, I think
the hypothesis ~~that we really want~~ that the figure seems to support right now is "CFG is statistically equivalent across most tasks to a similar-budget model". But #1 and #2 will help us really show it
I think this is an important finding if we can prove it, and warrants its own short section in the main paper

versed flax Jun 27, 2023, 5:42 PM

#

"CFG is statistically equivalent across most tasks to a similar-budget model"
it might not be the case though. But the difference doesn't look huge

blissful garden Jun 27, 2023, 6:22 PM

#

patent gull Reg. Figure 9, the FLOPs tests can we (1) do statistical significance tests on ...

Oh let me look it up. I actually suck at stats and need to brush up those hypothesis testing stuff 😂
Would be wonderful if we can refine it and make it part of the results

versed flax Jun 27, 2023, 6:23 PM

#

blissful garden Oh let me look it up. I actually suck at stats and need to brush up those hypoth...

scipy should have you covered

loud adder Jun 27, 2023, 6:24 PM

#

I am surprised by the citation in

A ``prompt'' is typically used to condition on the generation, containing task instructions, context, and a small set of examples \cite{flan}.
Why was this chosen? The FLAN paper is about finetuning models on instruction-formatted data

#

(Also, FLAN and T0 came out at the same time with the same core idea: it's almost always correct to cite both of them when it's correct to cite one of them if you're not citing your use of their specific model or something)

versed flax Jun 27, 2023, 6:25 PM

#

Because it looked like a great paper to show what an instruction actually is

patent gull Jun 27, 2023, 6:27 PM

#

blissful garden Oh let me look it up. I actually suck at stats and need to brush up those hypoth...

for point #2, I think if you just bootstrap resample those FLOPs points 10,000 times, then you get a distribution over the results and you can calculate the median and percentiles to do a confidence region
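
A minimal sketch of that percentile bootstrap, using only the standard library. The data and the function name are made up for illustration, and the statistic here is a plain mean; for the FLOPs plot one would presumably refit the curve on each resample instead.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample with replacement, collect the statistic,
    and read the bounds off the sorted empirical distribution."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, statistics.median(means), high


# Hypothetical accuracies, not results from the paper.
scores = [0.61, 0.58, 0.66, 0.63, 0.59, 0.64, 0.62, 0.60]
low, median, high = bootstrap_ci(scores)
print(f"median={median:.3f}, 95% interval=({low:.3f}, {high:.3f})")
```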

loud adder Jun 27, 2023, 6:27 PM

#

I feel like the GPT-3 paper and this are more appropriate papers to cite https://arxiv.org/abs/2102.07350

arXiv.org

Prompt Programming for Large Language Models: Beyond the Few-Shot P...

Prevailing methods for mapping large generative language models to supervised
tasks may fail to sufficiently probe models' novel capabilities. Using GPT-3 as
a case study, we show that 0-shot prompts can significantly outperform few-shot
prompts. We suggest that the function of few-shot examples in these cases is
better described as locating an ...

patent gull Jun 27, 2023, 6:27 PM

#

bootstrapping is great for confidence intervals over metrics/etc. all kinds of things with non-normal distributions

loud adder Jun 27, 2023, 6:32 PM

#

What is this " Fundamental limitations of alignment in large language models" paper that we're citing a lot?

#

Okay, only twice since the other three are commented out

versed flax Jun 27, 2023, 6:34 PM

#

IIRC it's a paper that we use to talk about system prompts. Probably not the best one

loud adder Jun 27, 2023, 6:38 PM

#

Gotcha. No worries, I don't expect y'all to have the literature and chronology memorized 🙂

versed flax Jun 27, 2023, 6:39 PM

#

Pretty simple: I have very very little knowledge of the NLP lit

#

I have a pretty darn good grasp of the vision lit, but NLP... close to none

loud adder Jun 27, 2023, 6:39 PM

#

The goal of these questions is to get an idea of what you're looking to cite so I can identify papers that may be a better fit

versed flax Jun 27, 2023, 6:40 PM

#

gotcha

loud adder Jun 27, 2023, 6:41 PM

#

It would be a good idea to submit a PR to the HuggingFace transformers library that includes CFG as a LogitsWraper or whatever it's called

#

This will substantially increase the chances of people using the methodology because that's how most people get their LLMs

versed flax Jun 27, 2023, 6:41 PM

#

loud adder It would be a good idea to submit a PR to the HuggingFace `transformers` library...

It's ready and will fly the very second the paper is on ArXiv :)

loud adder Jun 27, 2023, 6:42 PM

#

versed flax It's ready and will fly the very second the paper is on ArXiv :)

If you mean that you're ready to submit the PR, you should submit the PR now so we can answer any questions or handle any issues they have. If you mean you've already done that and are waiting for the paper to go live to have it merged then well done!

versed flax Jun 27, 2023, 6:43 PM

#

loud adder If you mean that you're ready to submit the PR, you should submit the PR now so ...

I've been waiting to submit the PR. I was thinking that they wouldn't accept a PR from something weird without a paper attached to it

#

I'll submit the PR in few hours then

blissful garden Jun 27, 2023, 6:44 PM

#

versed flax I'll submit the PR in few hours then

attach some part of our draft in the PR? Like Sec 2 and the eval charts.
actually just the eval results might justify it.

loud adder Jun 27, 2023, 6:45 PM

#

If you say "I've been collaborating with EleutherAI on this and we have a paper coming out on Friday", and tag me then they won't have any issue with it 😛

versed flax Jun 27, 2023, 6:45 PM

#

hahaha

#

noted

loud adder Jun 27, 2023, 6:45 PM

#

blissful garden attach some part of our draft in the PR? Like Sec 2 and the eval charts. actuall...

Also this

#

@versed flax Do the "prompt alignment" techniques require finetuning or are they inference-time like ours?

Various approaches have been proposed to address this, including prompt alignment \cite{alignment} and fine-tuning \cite{instructgpt,flan,sanhmultitask}.

versed flax Jun 27, 2023, 6:52 PM

#

loud adder @versed flax Do the "prompt alignment" techniques require finetuning or...

It depends. For what we know GPT-3.5 is aligned with RLHF and Bing Search with prompt

loud adder Jun 27, 2023, 6:52 PM

#

Oh that's the Anthropic paper

versed flax Jun 27, 2023, 6:53 PM

#

In more humble situations, like Character.ai and the likes, it's prompt alignment

loud adder Jun 27, 2023, 6:54 PM

#

What does "prompt alignment" mean

#

Is there a paper describing what Character.AI does

versed flax Jun 27, 2023, 6:55 PM

#

Sorry, approximate language here. It means there's a system prompt describing the chatbot's intended behavior.

#

("This is a conversation between Person A and Eric Cartman:
Cartman: Hey you, leave me alone!
Person A:")

versed flax Jun 27, 2023, 6:57 PM

#

loud adder Is there a paper describing what Character.AI does

I can try and look for one.

loud adder Jun 27, 2023, 7:01 PM

#

I don't think doing so is very important.

versed flax Jun 27, 2023, 7:02 PM

#

loud adder I don't think doing so is very important.

https://arxiv.org/pdf/2206.07550.pdf I found this one. They prompt the chatbot with OCEAN traits

patent gull Jun 27, 2023, 7:05 PM

#

loud adder It would be a good idea to submit a PR to the HuggingFace `transformers` library...

personally, i don't think the "right" huggingface implementation of CFG is the logit-wrapper implementation.

I think putting it in the forward method of a CFG-head model, maybe as a mixin, is the more hugging-face appropriate way, looking at how they build their models.

I have something like that implemented, although my class is a bit more of a monstrosity bc it's doing different things, but:

https://github.com/Vermeille/lm-evaluation-harness-cfg/blob/cfg-alex/log_logits_on_p3.py#L52-L79

#

Definitely this is a side-discussion, but since we're talking about code...

loud adder Jun 27, 2023, 7:06 PM

#

I don't have a strong feeling, and getting this feedback from the maintainers is another reason to open the issue early 🙂

patent gull Jun 27, 2023, 7:07 PM

#

SGTM

#

yah i always just found the logitwarpers approach to be a little awkward, since we needed to pass in input_ids and model but there was already a model inside, and logits getting generated. idk. felt weird
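
Whichever wrapper is chosen, the per-step computation is small. A minimal sketch, assuming the usual CFG combination of conditional and unconditional next-token log-probabilities with guidance strength `gamma`. The function names are hypothetical; this is neither the `transformers` API nor the linked implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cfg_log_probs(cond_logits, uncond_logits, gamma):
    """Combine per-token scores as uncond + gamma * (cond - uncond), then renormalize."""
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [u + gamma * (c - u) for c, u in zip(cond, uncond)]
    return log_softmax(mixed)


# Toy 3-token vocabulary; the numbers are made up.
cond = [2.0, 1.0, 0.1]    # logits from the pass that sees the prompt
uncond = [1.0, 1.0, 1.0]  # logits from the pass without the prompt
print(cfg_log_probs(cond, uncond, gamma=1.5))
```

With `gamma=1` this reduces to the conditional distribution and with `gamma=0` to the unconditional one. Both passes are needed at every step, which is why the implementation has to reach the model as well as the logits.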

versed flax Jun 27, 2023, 7:44 PM

#

I added this in the enumeration in the introduction:

\item We show that for the same inference cost, one can train a model that is half the size and obtain similar performance on those benchmarks;

#

I should maybe add it into the abstract as well

#

it's a fairly strong result

versed flax Jun 27, 2023, 9:20 PM

#

An important question remains and I have no idea what the answer is: Should we acknowledge CAD?

loud adder Jun 27, 2023, 9:29 PM

#

versed flax An important question remains and I have no idea what the answer is: _Should we ...

Yea, I’m adding this shortly

versed flax Jun 27, 2023, 9:30 PM

#

oh you're still on it

loud adder Jun 27, 2023, 10:08 PM

#

Yeah, got dragged into a meeting and then had to cook dinner but I’m back at work 🙂

versed flax Jun 27, 2023, 10:08 PM

#

loud adder Yeah, got dragged into a meeting and then had to cook dinner but I’m back at wor...

I'll be available for the next 4h. Don't hesitate if you have any question or need any feedback.

loud adder Jun 27, 2023, 10:09 PM

#

It’s really good. Stuff has come together really well in the past two weeks

#

Basically all my comments and edits are about copy editing and optimizing the presentation

#

BTW your website appears to be down: http://vermeille.fr/

versed flax Jun 27, 2023, 10:11 PM

#

loud adder BTW your website appears to be down: http://vermeille.fr/

Ah! thanks!

loud adder Jun 27, 2023, 10:12 PM

#

(I was going to add affiliations)

#

@versed flax Is there something unique or special about SD's negative guidance? We discuss negative guidance in the VQGAN-CLIP paper, and I'm under the impression it's something that can be done with any T2I model

versed flax Jun 27, 2023, 10:20 PM

#

loud adder @versed flax Is there something unique or special about SD's about neg...

Nothing specific about it. It's just this one famous tool that implements it

#

Midjourney and DALL-E don't.

loud adder Jun 27, 2023, 10:22 PM

#

Oh interesting

#

I didn’t realize that

versed flax Jun 27, 2023, 10:22 PM

#

loud adder (I was going to add affiliations)

I can't fix it but I basically have to recreate it (apparently my payment failed last time I had to renew). Is it important?

loud adder Jun 27, 2023, 10:23 PM

#

versed flax Midjourney and DALL-E don't.

@oak ore what’s up with that? 😛

loud adder Jun 27, 2023, 10:23 PM

#

versed flax I can't fix it but I basically have to recreate it (apparently my payment failed...

Oh no. I just googled you because I couldn’t remember what institution you said you were at, and then figured you might want to know.

versed flax Jun 27, 2023, 10:24 PM

#

loud adder Oh no. I just googled you because I couldn’t remember what institution you said ...

I had my PhD from the Université de Toulon, but I'm not working there anymore. I'm working for a small company called Hexaglobe now

oak ore Jun 27, 2023, 10:27 PM

#

loud adder @oak ore what’s up with that? 😛

this is a pretty long thread - what's the question?

#

negative guidance could be done with anything that uses cfg. sd doesn't do anything special there. midjourney & dalle are closed models with restricted APIs, and negative guidance isn't part of the exposed API

versed flax Jun 27, 2023, 10:29 PM

#

(to be fair they certainly use a preset hidden neg prompt. It's just so good)

#

https://www.reddit.com/r/StableDiffusion/comments/144mw6f/visualising_the_effect_of_the_negative_prompt/

r/StableDiffusion - Visualising the effect of the negative prompt

161 votes and 24 comments so far on Reddit

oak ore Jun 27, 2023, 10:30 PM

#

we do something fancier than that actually, tho I can't go into detail

versed flax Jun 27, 2023, 10:30 PM

#

oh, you're working for one of those orgs?

oak ore Jun 27, 2023, 10:31 PM

#

yeah I'm at midjourney

versed flax Jun 27, 2023, 10:31 PM

#

Would you happen to be hiring French dudes? 😎

oak ore Jun 27, 2023, 10:31 PM

#

I don't think we've ever shipped negative prompts in the way sd does them

oak ore Jun 27, 2023, 10:32 PM

#

versed flax Would you happen to be hiring French dudes? 😎

we've got people from all over, tho i don't have bandwidth to evaluate new hires rn

versed flax Jun 27, 2023, 10:33 PM

#

oak ore we've got people from all over, tho i don't have bandwidth to evaluate new hires...

if that ever happens and you're interested, hit me up, along with the expected curriculum!

loud adder Jun 27, 2023, 10:44 PM

#

Sorry to derail the convo. I’m going to go on a walk before the sun sets then finish up

patent gull Jun 27, 2023, 10:59 PM

#

@loud adder please let me know if you'd like to see anything else in Section 4

#

I know you mentioned causal approaches way back before we put 4 together

versed flax Jun 28, 2023, 1:27 AM

#

I'm starting to be sleepy. Try your luck if you need something but I may not answer, sorry

loud adder Jun 28, 2023, 1:36 AM

#

versed flax I'm starting to be sleepy. Try your luck if you need something but I may not ans...

Don’t worry about it. Go sleep and it’ll be there in the morning

versed flax Jun 28, 2023, 1:38 AM

#

I'm working on the PR for now, but not for long :)

#

Thank you for your contribution

versed flax Jun 28, 2023, 11:46 AM

#

I have addressed most of the edits.

@patent gull, Stella did edit the intro to Sec 3 and removed the parts that articulate the section into the various prompt types. I'll let you proof read and accept / reject the changes. I did not want to do it for you, it's your part.
@patent gull please accept / reject the edits to the abstract so that I know they're correct.
@patent gull Stella advocates moving your Related Works to the appendix. I quite like it but I understand her point, the paper is already quite long. However I feel like focusing on the CV background only is weird.
@loud adder the main unchecked thing I have now is this new figure you're suggesting and that I don't fully see.
@patent gull / @loud adder I answered most of your comments in the sidebar. Once you've read my answer and find it satisfactory, can you mark them as resolved? If you leave them open I don't know if we can move on.

patent gull Jun 28, 2023, 11:57 AM

#

Ok I’ll check. I think i largely agree with these changes. The section 3 header was a little too structured-feeling

#

And I think there’s a way to discuss classifier guidance and contrastive decoding in NLP in 1-2 sentences in section 2

#

Thus reducing the need for the related works section

#

I was thinking if we want to reduce the length, there might be some plots and tables throughout that could be moved to the appendix as well. But we don’t have a page limit for arxiv so less worried about length

versed flax Jun 28, 2023, 12:02 PM

#

The paper is quite long. Would it be unreasonable to add a toc for the arxiv release?

loud adder Jun 28, 2023, 12:02 PM

#

patent gull I was thinking if we want to reduce the length, there might be some of plots and...

My primary reason for mentioning the length is about people’s attention span. 12 pages isn’t egregiously long, but it is 50% longer than the main text of most ML papers. I do think we should strive to not make it substantially longer.

#

(I say this as someone who has multiple arXiv papers that are > 80 pages long)

versed flax Jun 28, 2023, 12:04 PM

#

Yes, totally. It's long. My proposed "easy fix" is to add a toc for the arxiv release. The reader can then glance what the paper is about and choose what to read. Dunno if it's unreasonable

patent gull Jun 28, 2023, 12:07 PM

#

I don’t think Toc is necessary for a 12 pager

#

The appendix already has a toc

versed flax Jun 28, 2023, 12:07 PM

#

That's fair

patent gull Jun 28, 2023, 12:08 PM

#

Like I honestly don’t like toc, they don’t end up being descriptive enough or useful to me

#

hmmm so @loud adder what is the desired page limit? 10?

versed flax Jun 28, 2023, 12:08 PM

#

That's fair²

patent gull Jun 28, 2023, 12:08 PM

#

8?

#

we have a ton of plots and tables that can be summed up with 1 line and moved to app. Also have language that can be tightened throughout

loud adder Jun 28, 2023, 12:16 PM

#

6-10 pages is the typical length of a ML paper

patent gull Jun 28, 2023, 12:21 PM

#

Alright sounds good

#

I know I can move my line plots to the appendix. I’ll take a look at all the results tables we have and think about some that can get condensed

#

Or moved

loud adder Jun 28, 2023, 12:33 PM

#

She said that her koala started to speak English when she was about six months old and has since been able "to understand the words of people"., who lives in a kangaroo colony near Kogarah on Queensland’s Sunshine Coast, told news.com.au he had never seen anything like it before:
This sample from the appendix seems to show an abrupt change in topic halfway through.

versed flax Jun 28, 2023, 12:37 PM

#

loud adder > She said that her koala started to speak English when she was about six months...

True. Let's regen new ones. Also those were sampled with Rep penalty and argmax decoding.

#

I'll do that tonight. I have to get going with my job for now.

#

(I'm still reading and answering here, I just won't do anything meaningful now)

versed flax Jun 28, 2023, 1:27 PM

#

@loud adder fyi this is the response from sgugger to the PR

let's see if the community requests this added feature before implementing it in the library proper :-)

loud adder Jun 28, 2023, 1:45 PM

#

I saw and don't understand, but w/e

patent gull Jun 28, 2023, 2:15 PM

#

i went through the edits/comments and agree with pretty much all of them

#

would you like to discuss \subsection{Relation to instruction tuning}?

#

let me know

versed flax Jun 28, 2023, 2:16 PM

#

patent gull i went through the edits/comments and agree with pretty much all of them

Have you accepted / rejected / resolved them :)?

patent gull Jun 28, 2023, 2:17 PM

#

i accepted most that i saw from you/me

#

I'm leaving comments up there

#

since most of them feel like they're still open

versed flax Jun 28, 2023, 3:44 PM

#

I've updated the tables and figures with triviaqa
Added a note about triviaqa methodology in the appendix.
talking with HF https://github.com/huggingface/transformers/issues/24536

blissful garden Jun 28, 2023, 3:52 PM

#

do we still move the FLOP stuff up to main text? There seems to be a lot of stuff in the main text already

versed flax Jun 28, 2023, 3:54 PM

#

I'm pretty sure we can find something that is less important than this result, and move it to the appendix / remove it instead

blissful garden Jun 28, 2023, 4:08 PM

#

@versed flax I also saw the MusicGen PR and traced it back to the paper
https://arxiv.org/pdf/2209.15352.pdf
It seems equation 4 is exactly what we are doing here...... They are so much earlier than us. We should probably say like although we weren't aware of them in the beginning, our work can be seen as generalizing their technique to text-to-text models with a comprehensive analysis.

versed flax Jun 28, 2023, 4:09 PM

#

blissful garden @versed flax I also saw the MusicGen PR and trace it back to the paper ...

Do you think it's more similar to our work rather than to the use in diffusion models?

blissful garden Jun 28, 2023, 4:11 PM

#

versed flax Do you think it's more similar to our work rather than to the use in diffusion m...

no I just mean the exact technique. They literally only had one paragraph without anything else. But the way they say it is very similar to us (mentioning all the SD stuff and having the same formula as ours)

#

They have a Figure 3 for ablation of CFG and that was it. But they did study that

#

I actually start to think this paper is closer to us than CAD. And they simply just didn't realize the generality of this technique.

versed flax Jun 28, 2023, 4:14 PM

#

blissful garden I actually start to think this paper is closer to us than CAD. And they simply j...

Their architecture looks very very close to text2image models and that's probably why they did not generalize

blissful garden Jun 28, 2023, 4:15 PM

#

versed flax Their architecture looks very very close to text2image models and that's probabl...

Yep

#

just throwing it out there and see if we want to add one short sentence to acknowledge their work as well.

#

In our field if our upcoming work has any resemblance to any other group's previous stuff, we'd send our draft to them in case they have remarks (but don't wait for it and all the submission schedules would be unchanged)
(and usually they say "good work!" and connect with you)

versed flax Jun 28, 2023, 4:21 PM

#

I've never done that. I don't know what's customary in ML

fallow egret Jun 28, 2023, 4:35 PM

#

I agree that although they apply CFG on an autoregressive model, it's a different field, so in that sense it's similar to text2image models and is less relevant than CAD. We might want to add one line referring to their work, but I don't think we should do more than that...

unique sedge Jun 28, 2023, 4:48 PM

#

Make a best-effort attempt to cover related work for what you are doing, but if it's not immediately in your vertical (subfield/adjacent field) it's okay to miss it. Very rarely would a reviewer reject your paper because you haven't mentioned one paper (they might ask you to add it though), unless the omission is egregious.

patent gull Jun 28, 2023, 8:25 PM

#

i think we should include it in related works, or as another citation in the intro to CFG!

#

it's very cool

#

i think it's quite clear that they applied this idea from the text-to-image lit

#

weird that that guy is keeping such close tabs on HF PRs that he noticed it lol

versed flax Jun 28, 2023, 8:50 PM

#

So we're like 99% there. I see 2 remaining TODOs in the paper:

- 1 figure that @loud adder asked for but that I don't really understand. Do you understand what she meant @patent gull ?
- 1 flop analysis. @blissful garden I see you're on this, do you need a second brain?

I did another pass on the paper, fixed the figure flow in the appendix (it was totally chaotic), and the various things I mentioned earlier

#

We're 99% done => what's this 1% I'm not seeing, besides those two points, and how I can work on it? Has someone identified some incomplete work? I'm so deep inside it that I have a hard time keeping track of the progress lol

blissful garden Jun 28, 2023, 9:04 PM

#

versed flax So we're like 99% there. I see 2 remaining TODOs in the paper: - 1 figure that <...

the problem yesterday was that the f-test resulted in all p=0 and Alex told me to change to ANCOVA. I got the code and am putting together the results right now. Also reading up on what ANCOVA is about (yes my stats suck)

versed flax Jun 28, 2023, 9:04 PM

#

(I have no idea what that is either)

blissful garden Jun 28, 2023, 9:05 PM

#

it does seem to directly tackle the comparisons of regressions though

versed flax Jun 28, 2023, 9:05 PM

#

blissful garden the problem yesterday was that the f-test resulted in all p=0 and Alex told me t...

So that thing is about to be solved. Great!

blissful garden Jun 28, 2023, 9:06 PM

#

versed flax So that thing is about to be solved. Great!

hopefully something useful can come out

versed flax Jun 28, 2023, 9:06 PM

#

Well then, maybe this paper will be able to fly on ArXiv before Friday then!

blissful garden Jun 28, 2023, 9:13 PM

#

all p values are super small again... @patent gull what conclusions are we looking for other than their adjusted means are not from the same distribution?
(I'm using original samples not bootstrapped samples btw)

versed flax Jun 28, 2023, 9:15 PM

#

Well then if p-values are small, it means that 2x vanilla and CFG aren't indistinguishable (which I expected)

patent gull Jun 28, 2023, 9:15 PM

#

no p-vals being small means they are distinguishable (ah which is what you said)

#

something must be up... the 95% bootstrapped confidence intervals are totally overlapping

#

hmm

#

are you running ancova on the normal values, or the log-normalized values?

blissful garden Jun 28, 2023, 9:16 PM

#

log normalized

patent gull Jun 28, 2023, 9:17 PM

#

😠 ugh thought that was it for a second..

blissful garden Jun 28, 2023, 9:17 PM

#

log(x) vs log(1-y) - log(y), and then linear regression

patent gull Jun 28, 2023, 9:18 PM

#

why log(1-y) - log(y)?

blissful garden Jun 28, 2023, 9:19 PM

#

logistic regression

#

y bounded between 0 and 1

patent gull Jun 28, 2023, 9:20 PM

#

@fallow egret you have a whole section in the appendix to write:

#

\subsection{Deliberative Prompting: Chain-of-Thought}

patent gull Jun 28, 2023, 9:22 PM

#

blissful garden y bounded between 0 and 1

i thought y was just avg accuracy, in those plots?

#

why are there so many different opinions out there for statistical-testing of regression lines?? ugh
https://stats.stackexchange.com/questions/151916/are-two-linear-regression-models-significantly-different
https://stackoverflow.com/questions/66433019/how-to-statistically-compare-the-intercept-and-slope-of-two-different-linear-reg

Cross Validated

Are two linear regression models significantly different?

This question extends What test should be used to tell if two linear regression lines are significantly different? to the more general case of having two estimated models.

I have got the following...

Stack Overflow

How to statistically compare the intercept and slope of two differe...

I have two series of data as below. I want to create an OLS linear regression model for df1 and another OLS linear regression model for df2. And then statistically test if the y-intercepts of these...

blissful garden Jun 28, 2023, 9:25 PM

#

actually ANCOVA only tests the slope, right?

blissful garden Jun 28, 2023, 9:25 PM

#

patent gull i thought `y` was just avg accuracy, in those plots?

yes y is avg which is bounded between 0 - 1

patent gull Jun 28, 2023, 9:25 PM

#

ancova stands for analysis of covariance, which i assumed meant the covariance between x ~ y

#

does anyone in this channel have a go-to significance test that they use for testing 2 regressions?

patent gull Jun 28, 2023, 9:26 PM

#

blissful garden yes y is avg which is bounded between 0 - 1

i think you can just directly test the relationship between log x and y

#

i don't think you need to transform it using logistic regression

blissful garden Jun 28, 2023, 9:26 PM

#

patent gull i think you can just directly test the relationship between `log x` and `y`

no, because y is bounded at 1, when y is very close to 1 (like 0.9), it's having an obvious flattening look where linear regression is unfair

#

I was doing this in the beginning until I realized every line just intersects when x is large (y getting close to 1)

patent gull Jun 28, 2023, 9:28 PM

#

ahh so those plots aren't just showing x and y

#

they're showing some transformation?

#

Fig 10

blissful garden Jun 28, 2023, 9:28 PM

#

the plots are just log(x) and y

#

but the curves are logistic regressions between log(x) and y

#

which is the same as linear regression between log(x) and log(1-y) - log(y)
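
A minimal sketch of that transform, assuming a plain least-squares fit on the transformed values (which is what the description above amounts to, rather than a maximum-likelihood logistic fit). The function names and data points are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_logistic_curve(flops, accuracy):
    """Regress log(1 - y) - log(y) on log(x); accuracy must lie strictly in (0, 1)."""
    log_x = [math.log(x) for x in flops]
    z = [math.log(1 - y) - math.log(y) for y in accuracy]
    return fit_line(log_x, z)


def predict(intercept, slope, x):
    """Invert the transform: y = 1 / (1 + exp(intercept + slope * log x))."""
    return 1.0 / (1.0 + math.exp(intercept + slope * math.log(x)))


# Made-up (FLOPs, average accuracy) points for illustration.
flops = [1e9, 1e10, 1e11, 1e12]
accuracy = [0.35, 0.48, 0.60, 0.71]
intercept, slope = fit_logistic_curve(flops, accuracy)
print(intercept, slope, predict(intercept, slope, 1e13))
```

Because the prediction goes back through the sigmoid, the fitted curve stays below 1 however large x gets, which avoids the flattening problem a straight line in y runs into.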

patent gull Jun 28, 2023, 9:28 PM

#

ohh ok ok i thought you were doing some fancy multinomial fit thing

#

i think scipy optimize has a multinomial fit

#

anyway

#

ok... if you wanna send me the data i can also try significance testing

#

otherwise we can also just say "the two lines are indistinguishable on a 95% confidence interval"

blissful garden Jun 28, 2023, 9:30 PM

#

so if two groups have similar slopes, ANCOVA will give high p?

patent gull Jun 28, 2023, 9:31 PM

#

that's what i thought. but if you see those SO links I sent, there are other proposals for significance tests

#

😭 that first link says chi-squared test of coefficients, a partial f-test and a t-test...

#

can't say i've heard of a "partial f-test" before, nor do i know which one is appropriate in this case

blissful garden Jun 28, 2023, 9:35 PM

#

@patent gull data and codes sent (a bunch of stuff in DM but I don't want to spam the channel). Feel free to play with it

#

meanwhile let me also check out those links

#

Maybe the conclusion is indeed we have different slopes (or covariance) for each task

patent gull Jun 28, 2023, 9:41 PM

#

cool cool

blissful garden Jun 29, 2023, 12:20 AM

#

Seems like most of the stuff related to me is done. There is this one question left:

do we want to compress the codegen results in the main text by for example reporting temp=0.2 only? This way we combine the three tables together. (will still need all temps at least in appendix to fully showcase the trade-off of adherence-creativity)

loud adder Jun 29, 2023, 12:21 AM

#

blissful garden Seems like most of the stuff related to me is done. There is this one question l...

I think so

#

@fallow egret you wrote the interpretability section, right? I wanted to chat about that as I feel like I’m missing something as I read it

versed flax Jun 29, 2023, 12:22 AM

#

loud adder @fallow egret you wrote the interpretability section, right? I wanted t...

It's @patent gull

patent gull Jun 29, 2023, 12:23 AM

#

I wrote it, yeah, I’ll be available in a bit to talk about it

#

(in a bit meaning like 1-2 hours)

blissful garden Jun 29, 2023, 12:42 AM

#

blissful garden Seems like most of the stuff related to me is done. There is this one question l...

This is done. Put the full table to the appendix and just say "Here we show the results for temperature= 0.2 in Table 3". Feel free to change the wording if there is a better way to put it.

versed flax Jun 29, 2023, 12:49 AM

#

@loud adder there's some debate on the MT section. Does it belong to the main text or the appendix? We're trying to make the paper shorter. The section shows:

- it's a generative task, but so is CoT
- CFG brings 10% improvement on MT on base models
- it didn't work on tuned models (so it makes the positive results a bit pointless, since people will obviously use the tuned models)
- there is no further insight.

unique sedge Jun 29, 2023, 12:51 AM

#

versed flax @loud adder there's some debate on the MT section. Does it belong to t...

It doesn’t work for 1 shot either

versed flax Jun 29, 2023, 12:51 AM

#

Right.

loud adder Jun 29, 2023, 12:56 AM

#

unique sedge It doesn’t work for 1 shot either

I remember we did 1-shot for the BLOOM paper too. Is 1-shot a standard thing to do for MT

unique sedge Jun 29, 2023, 1:02 AM

#

loud adder I remember we did 1-shot for the BLOOM paper too. Is 1-shot a standard thing to ...

For generation tasks, from what I've seen, it seems to be the norm, because the model tends to just over-generate unless it has an example to anchor to.

blissful garden Jun 29, 2023, 3:42 AM

#

I see that Section 2 is 2 pages long. If we are really compressing the pages, maybe we should also consider moving most of the math to the appendix as well. Although I'm a math guy, my guess is that most of the audience just wants to see one equation and a short story before going straight to charts and conclusions.

loud adder Jun 29, 2023, 3:45 AM

#

I think all of 2.2 is necessary, and it’s hard to see what equations can be cut from 2.1

#

I recall thinking the prose was a little overdone though, maybe we can cut some of it

fallow egret Jun 29, 2023, 3:49 AM

#

patent gull @fallow egret you have a whole section in the appendix to write:

I was referring to all these figures (in the appendix) in the main text. I will write a few sentences in the appendix that echo the main text

fallow egret Jun 29, 2023, 4:14 AM

#

patent gull `\subsection{Deliberative Prompting: Chain-of-Thought}`

Done

blissful garden Jun 29, 2023, 4:35 AM

#

@versed flax I'm pretty sure you flipped the order of the model labels

patent gull Jun 29, 2023, 5:43 AM

#

loud adder I recall thinking the prose was a little overdone though, maybe we can cut some ...

I will work on this tomorrow morning. Apologies for being absent, I was pulled away longer than I expected

versed flax Jun 29, 2023, 11:54 AM

#

fallow egret Done

I think you broke the LaTeX.

fallow egret Jun 29, 2023, 11:54 AM

#

versed flax I think you broke the LaTeX.

It was before

versed flax Jun 29, 2023, 11:56 AM

#

You're right

#

@patent gull fyi \usepackage{subfigure} broke the LaTeX. I'm commenting it out. It doesn't seem to break anything else. No idea what you needed it for.

patent gull Jun 29, 2023, 1:00 PM

#

Ugh so sorry man

#

I was trying to put two figures side by side

#

The gpt4all figure and the humeval win rate fig

versed flax Jun 29, 2023, 1:01 PM

#

You did that with \subfloat already didn't you?

patent gull Jun 29, 2023, 1:01 PM

#

Yeahhh I didn’t want them to have letters, and I didn’t want them to break the counter

versed flax Jun 29, 2023, 1:01 PM

#

ah

patent gull Jun 29, 2023, 1:01 PM

#

But it wasn’t breaking latex when I first imported it

#

Maybe after several compiles

versed flax Jun 29, 2023, 1:01 PM

#

maybe yeah

patent gull Jun 29, 2023, 1:01 PM

#

Or maybe my latex was caching something

#

But anyway I would’ve never just left it broken if I had known

#

My bad

versed flax Jun 29, 2023, 1:02 PM

#

I was definitely awake and working on the paper when you did that and I did not notice the breaking

#

lol I know

patent gull Jun 29, 2023, 1:02 PM

#

Overleaf caches a lot of intermediate files…. Thinking about it more, that may have been it :/

versed flax Jun 29, 2023, 1:02 PM

#

it's no drama

patent gull Jun 29, 2023, 1:03 PM

#

Cool

#

@loud adder let me know when you’d like to chat about the interpretation section. I’ll be at my computer in 5 min

patent gull Jun 29, 2023, 1:28 PM

#

@fallow egret can you redo your figures with plt.rc('font', size=16)?
Also what is the y-axis? accuracy? can you label Accuracy (%)?

#

(I'm questioning whether we need the results to look like this, and whether we can format them like Figure 1, i.e. a table)

#

saves space. but it would be a very, very small table lol

#

oh sorry i see the legend. so, my visual graphics opinion is that legend-based hues should be different trials not metrics. Metrics imo belong on a dual y-axis

fallow egret Jun 29, 2023, 1:30 PM

#

Yes, it will be narrow and long table

patent gull Jun 29, 2023, 1:31 PM

#

https://python-graph-gallery.com/line-chart-dual-y-axis-with-matplotlib/

#

dual y-axis, sorry

fallow egret Jun 29, 2023, 1:31 PM

#

So what to change?
I used @blissful garden code to be aligned with his figures...

patent gull Jun 29, 2023, 1:31 PM

#

if you send me the data i can change it

loud adder Jun 29, 2023, 1:31 PM

#

Which figure is this

patent gull Jun 29, 2023, 1:31 PM

#

Figure 2

#

there's also wasted vertical space in the right-hand image, due to the need to project both into the same scale, which a dual y-axis would solve

loud adder Jun 29, 2023, 1:33 PM

#

What are the conclusions I’m supposed to reach reading these plots @fallow egret

fallow egret Jun 29, 2023, 1:34 PM

#

This:
Using CFG increases the percentage of CoTs that result in a valid answer that can be parsed. For low guidance strengths, this boosts model performance. For large values, however, although the model returns more valid results, the quality of the chains also suffers, and overall model performance degrades.

patent gull Jun 29, 2023, 1:35 PM

#

how did we evaluate the quality of the chains?

#

qualitatively?

#

(or did i write that?)

versed flax Jun 29, 2023, 1:35 PM

#

I would put a shorter version of that in the caption as well

fallow egret Jun 29, 2023, 1:35 PM

#

patent gull how did we evaluate `the quality of the chains`?

Yes, there are qualitative examples in the appendix

loud adder Jun 29, 2023, 1:36 PM

#

versed flax I would put a shorter version of that in the caption as well

Agreed. Always tell people what they should think in the caption of a plot

patent gull Jun 29, 2023, 1:37 PM

#

ideally i would like error bars/ confidence region as well (which you can get via bootstrapping)

#

here are the things I'd like changed:

make height smaller (half as large)
make a dual y-axis and remove the legend
make the font a lot bigger
confidence regions on the plot
x-axis label should match Figure 4 (Guidance Strength (CFG \gamma))

If you would like me to do that, send me the data and I'll take a look
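A minimal matplotlib sketch of the layout requested above, for reference. All numbers are made-up placeholders (not results from the paper), and the names `gammas`, `accuracy`, `accuracy_err` and `valid_rate` are hypothetical; it only illustrates the dual y-axis, the larger font, a shaded confidence band, and the shared x-axis label.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so the sketch runs without a display
import matplotlib.pyplot as plt

plt.rc("font", size=16)  # bigger font, as requested

# Placeholder data only: NOT results from the paper.
gammas = [1.0, 1.25, 1.5, 1.75, 2.0]
accuracy = [10.0, 12.0, 13.0, 12.5, 11.0]     # left axis, in %
accuracy_err = [1.0, 1.0, 1.1, 1.1, 1.0]      # one standard error, in %
valid_rate = [60.0, 70.0, 78.0, 84.0, 88.0]   # right axis, in %

lower = [a - 2 * e for a, e in zip(accuracy, accuracy_err)]
upper = [a + 2 * e for a, e in zip(accuracy, accuracy_err)]

# Short, wide figure: roughly half the usual height.
fig, ax_acc = plt.subplots(figsize=(8, 3))
ax_acc.plot(gammas, accuracy, color="tab:blue")
ax_acc.fill_between(gammas, lower, upper, color="tab:blue", alpha=0.2)
ax_acc.set_xlabel(r"Guidance Strength (CFG $\gamma$)")
ax_acc.set_ylabel("Accuracy (%)", color="tab:blue")

# Second metric on its own scale; colored axis labels replace the legend.
ax_valid = ax_acc.twinx()
ax_valid.plot(gammas, valid_rate, color="tab:orange")
ax_valid.set_ylabel("Valid answers (%)", color="tab:orange")

fig.tight_layout()
```

Because each metric has its own scale, neither curve is squashed to fit the other's range, which is the wasted vertical space mentioned earlier.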

fallow egret Jun 29, 2023, 1:38 PM

#

patent gull ideally i would like error bars/ confidence region as well (which you can get vi...

Yes, but first of all we don't have it in other figures. Second it's going to take forever to do bootstrapping

loud adder Jun 29, 2023, 1:39 PM

#

fallow egret Yes, but first of all we don't have it in other figures. Second it's going to ta...

How long does a run take

patent gull Jun 29, 2023, 1:39 PM

#

(A) if you have the data saved, bootstrap sampling just means resampling
(B) our other line graphs in the main body have confidence
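A minimal sketch of point (A), assuming the per-example 0/1 correctness values from a single run were saved. The function name `bootstrap_ci` and the toy `outcomes` list are hypothetical; this is a plain percentile bootstrap, not necessarily what any particular evaluation library does.

```python
import random
import statistics


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of saved per-example scores."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Each resample draws n saved examples with replacement; no model run is needed.
    means = sorted(
        statistics.fmean(rng.choices(outcomes, k=n)) for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return statistics.fmean(outcomes), lower, upper


# Hypothetical saved results: 60 correct answers out of 100 examples.
outcomes = [1] * 60 + [0] * 40
acc, lower, upper = bootstrap_ci(outcomes)
print(f"accuracy={acc:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

The cost is a few thousand list operations per guidance value, which is negligible next to the hours a model run takes.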

loud adder Jun 29, 2023, 1:39 PM

#

Also if this is being done with the eval harness, it does bootstrapping for you. That's what the "acc_stderr" etc values are from.

fallow egret Jun 29, 2023, 1:39 PM

#

loud adder How long does a run take

It depends on the data but ~6h for one cfg value

fallow egret Jun 29, 2023, 1:40 PM

#

loud adder Also if this is being done with the eval harness, it does bootstrapping for you....

It's with eval harness

patent gull Jun 29, 2023, 1:41 PM

#

cool

fallow egret Jun 29, 2023, 1:41 PM

#

loud adder Also if this is being done with the eval harness, it does bootstrapping for you....

This is not bootstrapping as far as I understand the value

patent gull Jun 29, 2023, 1:41 PM

#

how many stds are common for error bars?

loud adder Jun 29, 2023, 1:41 PM

#

fallow egret This is not bootstrapping as far as I understand the value

Why do you say that

patent gull Jun 29, 2023, 1:42 PM

#

i think central limit theorem says in large limits, 2 stds = bootstrap @ 95 confidence?
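For reference, the normal approximation behind that rule of thumb can be written as follows, assuming the accuracy $\hat{p}$ is the mean of $n$ independent 0/1 outcomes and $n$ is large (the symbols $\hat{p}$ and $n$ are introduced here for illustration):

```latex
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}},
\qquad
\text{95\% CI} \approx \hat{p} \pm 1.96\,\mathrm{SE}(\hat{p}) \approx \hat{p} \pm 2\,\mathrm{SE}(\hat{p}).
```

Under those assumptions a percentile bootstrap interval and the $\pm 2$ standard error band should roughly agree; they can differ for small $n$ or accuracies near 0 or 1.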

loud adder Jun 29, 2023, 1:43 PM

#

The code is here: https://github.com/EleutherAI/lm-evaluation-harness/blob/72b7f0c00a6ff94632c5b873fc24e093ae74fa47/lm_eval/metrics.py#L192

I'm not a statistician, but my understanding is that this is the right way to do a bootstrap CI

patent gull Jun 29, 2023, 1:43 PM

#

anyway, this shouldn't be a debate. good science means error bars and confidence regions

#

especially in the main body

loud adder Jun 29, 2023, 1:44 PM

#

patent gull anyway, this shouldn't be a debate. good science means error bars and confidence...

Good science means correct error bars. If there's a contention that the error bars are meaningless, that's worth addressing

fallow egret Jun 29, 2023, 1:44 PM

#

Ok, I can add it

patent gull Jun 29, 2023, 1:44 PM

#

yeah i mean i'm just addressing the debate over whether to include them or not

loud adder Jun 29, 2023, 1:45 PM

#

I think Elad was against it when he thought that 6 hours was per CFG per iteration

patent gull Jun 29, 2023, 1:45 PM

#

fair enough

loud adder Jun 29, 2023, 1:45 PM

#

Under the reasonable assessment that we don't have 1,000 days to run stuff for

#

But yeah, the default setting is to run 1,000 iterations for the bootstrap CI

fallow egret Jun 29, 2023, 1:46 PM

#

Yes, there is no problem to do that with this code. My only very small concern is that we don't provide it in all the other figures (in the appendix)

loud adder Jun 29, 2023, 1:46 PM

#

(this is also why the evals are slightly non-deterministic, the "score" we report is the median of the runs)

fallow egret Jun 29, 2023, 1:46 PM

#

I was thinking of bootstrapping by running multiple experiments, which is going to take forever... nvm

#

Ok, so Alex do you want to send me the code I should use for the figures or you want me to produce the numbers for the figures?

loud adder Jun 29, 2023, 1:48 PM

#

No worries!

patent gull Jun 29, 2023, 1:49 PM

#

if you produce the numbers that would probably be easiest collectively for both of us

#

we're .5 page away from 10 pages

#

there is a heavily commented region of text with multiple "why is the needed" comments in 2.2:


```
Next, we need to define what is considered conditioning, $c$, in decoder-only language models. In the common situations, a user provides a \textit{prompt} $c$ which can be a context, an instruction, or the beginning of some text, and uses a language model to sample a sequence of continuation tokens $w_i$ for the prompt $c$. Since a good continuation is expected to highly correlate to the prompt, we consider the prompt as our conditioning.
```

I suggest cutting it

loud adder Jun 29, 2023, 1:51 PM

#

@patent gull Can you turn on link sharing so I can pass a copy to some people for their feedback?

patent gull Jun 29, 2023, 1:51 PM

#

yeah

#

it's on

#

view: https://www.overleaf.com/read/ytzhdbbvpjkp

#

edit: https://www.overleaf.com/2856215227xwjhddvsmjwj

#

I think i'll take a stab at mocking the table described here:
\textbf{New Figure: shows several examples of (prompt, initial segment, completion) triples. Examples should show diverse relationships between the prompt and the initial text, including ones where the prompt is at the beginning and ones where the prompt includes text after the question (``let's think about it step by step'' perhaps)}

and then bother people to fill in stuff for their sections

#

but i don't understand initial segment. Is that, like, CoT-related?

versed flax Jun 29, 2023, 1:53 PM

#

patent gull there is a heavily commented region of text with multiple "why is the needed" co...

I addressed that in the comments in the sidebar. It's not important per se, but it's addressing arguments that people coming from CV would make, and imho it makes the text flow a bit better. I'm not too opinionated on this piece of text and will ultimately let the majority decide after reading my replies.

patent gull Jun 29, 2023, 1:55 PM

#

i see, i see... i see the difference between token-logits in NLP and vision semantic spaces

#

i think that's useful

#

i have a standing comment there about the word embedding and sentence embeddings sentence... i don't think that's useful, nor necessarily helpful in thinking about token logits

#

how is conditioning $ c$ in NLP different from conditioning in vision?

versed flax Jun 29, 2023, 1:57 PM

#

patent gull i see, i see... i see the difference between token-logits in NLP and vision sema...

Yes. We're manipulating the logits and we pretend it's a semantic representation. People in CV would absolutely scream bc they could think it's the same as manipulating pixel space directly, which is horrendous

patent gull Jun 29, 2023, 1:57 PM

#

ah.. i see. ok so there's a field difference here that i'm not understanding/aware of

#

fair!

versed flax Jun 29, 2023, 1:58 PM

#

the later layers of image generators are never the ones you'd want to manipulate

loud adder Jun 29, 2023, 1:58 PM

#

patent gull I think i'll take a stab at mocking the table described here: ```\textbf{New Fig...

Oh, this confused people so I meant to write up an explainer...

versed flax Jun 29, 2023, 2:00 PM

#

versed flax the later layers of image generators are never the ones you'd want to manipulate

Since the model goes from abstract -> pixel, the later layers have lost all their semantics and are just local pixel descriptors. (It would be the opposite with a classifier, of course, which is a mirror-image architecture that goes from pixel -> abstract.) That's why, coming from CV, I thought it was important to mention that the logit space in NLP is indeed still semantic

versed flax Jun 29, 2023, 2:01 PM

#

patent gull how is conditioning $ c$ in NLP different from conditioning in vision?

I don't understand this question. It's too vague.

patent gull Jun 29, 2023, 2:01 PM

#

Next, we need to define what is considered conditioning, $c$, in decoder-only language models. In the common situations, a user provides a \textit{prompt} $c$ which can be a context, an instruction, or the beginning of some text, and uses a language model to sample a sequence of continuation tokens $w_i$ for the prompt $c$.

I'm talking about this paragraph.... my impression was that conditioning with prompts was kinda the same thing between vision and nlp

patent gull Jun 29, 2023, 2:02 PM

#

versed flax Since the model go from abstract -> pixel, later layers lost all their semantics...

also, yeah, i think this is really fair and maybe worth explicitly saying somehow. I need to think about it...

loud adder Jun 29, 2023, 2:02 PM

#

versed flax Since the model go from abstract -> pixel, later layers lost all their semantics...

It's amusingly the opposite in NLP... it's less obvious that the *latents* are semantically meaningful!

versed flax Jun 29, 2023, 2:04 PM

#

loud adder It's amusingly the opposite in NLP... it's less obvious that the *latents* are s...

Yeah... pixels don't carry any semantics on their own, but words do. I would expect a char-level convnet generator / discriminator in NLP to display the exact same behavior

patent gull Jun 29, 2023, 2:04 PM

#

loud adder Oh, this confused people so I meant to write up an explainer...

So in section 3, we break up (prompt, continuation) variations into the following overall framework.

Prompt is what is supplied by the user/dev. Continuation is generated by the model. All variations branch from there.

```
cot: <prompt, [cot, continuation]>
text-to-text: <long prompt, long continuation>
chatbot: <[system prompt, user prompt], continuation>
```

#

so the idea of initial segment as a third, distinct category doesn't really fit

loud adder Jun 29, 2023, 2:05 PM

#

So I'm picturing something like this

Screen_Shot_2023-06-29_at_10.05.41_AM.png

patent gull Jun 29, 2023, 2:06 PM

#

yeah... i would just prefer a breakdown that mirrors the structure of our paper... more comprehensible to readers

loud adder Jun 29, 2023, 2:07 PM

#

Or this

Screen_Shot_2023-06-29_at_10.07.01_AM.png

loud adder Jun 29, 2023, 2:07 PM

#

patent gull yeah... i would just prefer a breakdown that mirrors the structure of our paper....

I think I agree, I'm trying to explain my mental model so I can understand the framing y'all're using

versed flax Jun 29, 2023, 2:09 PM

#

patent gull ```Next, we need to define what is considered conditioning, $c$, in decoder-only...

It's distinguishing between the model's conditioning and the task's conditioning. The model's conditioning on a prefix, which naturally arises from sequential autoregressive sampling, might not align with the task: a user might want to generate a text that ends with or contains some predefined conditioning text, or text in a specific style or something.

This is just to say that we align the task's conditioning to the model's conditioning by expressing it as a prefix. This might be too trivial.

patent gull Jun 29, 2023, 2:30 PM

#

alright here's the table... ultimately we may be able to lose the middle column if the example column is descriptive enough:

#

#

@versed flax can you fill in a good example for Assistant Prompting?
@blissful garden can you fill in a good example for code-gen?
@fallow egret can you fill in a good example from cot?

I can do basic prompting

versed flax Jun 29, 2023, 2:47 PM

#

patent gull

Clearly the examples are not fitting in that last column lol.
Also, how about we write "prompt => completion" rather than "(prompt, completion)" which makes more obvious what the input and outputs are?

patent gull Jun 29, 2023, 2:51 PM

#

yeah i think we can fiddle with it a bit after it's all full

#

i definitely think we can lose that middle column

#

alright. did some polishing. gotta turn to my day job now

fallow egret Jun 29, 2023, 2:59 PM

#

patent gull @versed flax can you fill in a good example for Assistant Prompting? <@...

I provide two full examples in Table 14-15 do we need more?

patent gull Jun 29, 2023, 3:00 PM

#

yeah

fallow egret Jun 29, 2023, 3:00 PM

#

patent gull yeah

How many more?
Just add more tables in the same format?

patent gull Jun 29, 2023, 3:01 PM

#

no take a look at that table

#

and Vermifuge already put in one

#

should ideally be short

fallow egret Jun 29, 2023, 3:03 PM

#

patent gull should ideally be short

If we want the real prompt it's going to be problematic since it's few-shot. It's very very long

versed flax Jun 29, 2023, 3:03 PM

#

just make one up

#

it's meant to illustrate how we categorize the test cases

patent gull Jun 29, 2023, 3:03 PM

#

i see. yeah feel free to use '...', as well

fallow egret Jun 29, 2023, 3:03 PM

#

Ok, np I can make it zero-shot, I hope it will not confuse the reader...

patent gull Jun 29, 2023, 3:04 PM

#

hmm i hope not either. let's see. there might be space to write "here's an example...." in the prompt, and then clarify in the caption. but we'll get a feel for it when we see it

fallow egret Jun 29, 2023, 3:08 PM

#

Oh, I see we also need to write the reasoning, lol I don't know how to squeeze all this stuff in

patent gull Jun 29, 2023, 3:09 PM

#

i think just put in what you feel is good and complete, don't worry about space right now

#

we'll massage and standardize once all the examples are in

versed flax Jun 29, 2023, 3:11 PM

#

("how many egg boxes to buy to have 24?", ("For a box of 12 eggs, that's 24/12=2", "The answer is 2"))
something like that?

fallow egret Jun 29, 2023, 3:14 PM

#

Ok, sounds good. I thought we want a real example

versed flax Jun 29, 2023, 3:15 PM

#

they would be too long I guess

fallow egret Jun 29, 2023, 3:16 PM

#

Yes, indeed sounds good. I just hope it's not more confusing for the reader

versed flax Jun 29, 2023, 3:16 PM

#

what's possibly confusing about it?

fallow egret Jun 29, 2023, 3:17 PM

#

versed flax what's possibly confusing about it?

This example doesn't represent the task and reasoning complexity + the setting is few-shot

versed flax Jun 29, 2023, 3:17 PM

#

gotcha, that's fair

fallow egret Jun 29, 2023, 3:17 PM

#

But it's a good solution to use such an example...

loud adder Jun 29, 2023, 3:27 PM

#

patent gull

If this is in response to my request for a figure, the key thing is that this doesn't make it obvious what the CFG is attaching to. You should state that explicitly in the figure

versed flax Jun 29, 2023, 3:29 PM

#

Damn, yes. Maybe we just need a figure like

gamma * LLM("The dragon flew over Paris, France, on Saturday evening when") + (1 - gamma) * LLM("on Saturday evening when")
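A toy sketch of that formula, assuming `LLM(...)` stands for the next-token log-probabilities and that the mixture is renormalized afterwards. The three-token vocabulary and the logit values are made up, and `cfg_combine` is a hypothetical helper name, not the paper's code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cfg_combine(cond_logits, uncond_logits, gamma):
    """gamma * prompted + (1 - gamma) * unprompted, on log-probs, renormalized."""
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [gamma * c + (1.0 - gamma) * u for c, u in zip(cond, uncond)]
    return log_softmax(mixed)


# Made-up next-token logits over a 3-token vocabulary.
cond_logits = [2.0, 0.5, 0.1]     # with the full prompt
uncond_logits = [1.0, 1.0, 0.1]   # with only the trailing text

for gamma in (0.0, 1.0, 1.5):
    probs = [math.exp(lp) for lp in cfg_combine(cond_logits, uncond_logits, gamma)]
    print(gamma, [round(p, 3) for p in probs])
```

With gamma = 0 the result is the unprompted distribution, with gamma = 1 the prompted one, and with gamma > 1 tokens that the prompt makes more likely are boosted further.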

loud adder Jun 29, 2023, 3:30 PM

#

I think color coding the text makes it pretty clear

loud adder Jun 29, 2023, 3:30 PM

#

loud adder Or this

This is a shitty screenshot of sublime, but captures the core idea

versed flax Jun 29, 2023, 3:31 PM

#

I understand your screenshots now

blissful garden Jun 29, 2023, 3:32 PM

#

patent gull @versed flax can you fill in a good example for Assistant Prompting? <@...

Which row should I fill in?

versed flax Jun 29, 2023, 3:32 PM

#

blissful garden Which row should I fill in?

maybe none, actually

loud adder Jun 29, 2023, 3:33 PM

#

I was really surprised I couldn't find a good example of what I had in mind

versed flax Jun 29, 2023, 3:34 PM

#

for real. It's so obvious we all overlooked it

blissful garden Jun 29, 2023, 3:38 PM

#

blissful garden @versed flax I'm pretty sure you flipped the order of the model labels

@versed flax the order of the labels

versed flax Jun 29, 2023, 3:39 PM

#

something like that, but pretty?

versed flax Jun 29, 2023, 3:40 PM

#

blissful garden @versed flax the order of the labels

good catch

loud adder Jun 29, 2023, 3:44 PM

#

Oooo is this intending to be a latent space representation? I like it

versed flax Jun 29, 2023, 3:44 PM

#

Yes

blissful garden Jun 29, 2023, 3:49 PM

#

versed flax something like that, but pretty?

oh I like this picture

versed flax Jun 29, 2023, 3:49 PM

#

slightly better

blissful garden Jun 29, 2023, 3:49 PM

#

maybe worth putting in section 2?

#

(I know we are long but one good picture is better than a lot of words)

versed flax Jun 29, 2023, 3:50 PM

#

Gotta be in the front page imho

#

that's the whole paper in a nutshell

#

good enough?

blissful garden Jun 29, 2023, 4:05 PM

#

versed flax good enough?

that's a big 0 for the subscript of x_0 👀

versed flax Jun 29, 2023, 4:05 PM

#

hahaha true

loud adder Jun 29, 2023, 4:13 PM

#

versed flax good enough?

I like this a lot

#

Needs a little prettying up

#

But it's really good

versed flax Jun 29, 2023, 4:14 PM

#

\caption{We show a 2D projection of a textual latent space $(x_0, x_1)$. We embed our text both with and without the prompt ``Today in France,'', and we walk from the promptless embedding in the direction to the prompted embedding with step size $\gamma$. Defining $\gamma>1$ overemphasizes the prompt, leading to better behavior and performance gains.}
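The walk described in that caption can be written as one line. The symbols $e_{\varnothing}$ (promptless embedding) and $e_{c}$ (prompted embedding) are introduced here for illustration and are not notation from the draft:

```latex
e_{\gamma} \;=\; e_{\varnothing} + \gamma\,\bigl(e_{c} - e_{\varnothing}\bigr)
\;=\; \gamma\, e_{c} + (1-\gamma)\, e_{\varnothing}
```

Here $\gamma = 0$ recovers the promptless point, $\gamma = 1$ the prompted point, and $\gamma > 1$ extrapolates past the prompted point along the same direction.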

loud adder Jun 29, 2023, 4:17 PM

#

Maybe put it in bold by the 1.5, as its being emphasized there

#

And in normal text by the 1?

versed flax Jun 29, 2023, 4:18 PM

#

like this?

loud adder Jun 29, 2023, 4:18 PM

#

"it" = "today in france"

versed flax Jun 29, 2023, 4:18 PM

#

ooooooooooh gotcha

loud adder Jun 29, 2023, 4:18 PM

#

So
y = 1.5 **Today in France,** citizens were celebrating
y = 1 Today in France, citizens were celebrating
y = 0 citizens were celebrating

#

I would definitely use the word "notional" or "hypothetical" or something like that in the caption, lest someone think we think this is what it actually looks like

versed flax Jun 29, 2023, 4:21 PM

#

versed flax Jun 29, 2023, 4:29 PM

#

patent gull

I was doubting the usefulness of this table, and I do think we're better off without it. What it does is explain how we split Sec 3, which is imho not important enough to remain in the paper

#

I think it just came from a misinterpretation of Stella's point. Now that we have the actual meaning and did the right thing, I don't see the use of this

loud adder Jun 29, 2023, 4:33 PM

#

I futzed with the formatting a bunch and think this looks way better. Thoughts?

Screen_Shot_2023-06-29_at_12.33.06_PM.png

unique sedge Jun 29, 2023, 4:37 PM

#

Latex god PrayGe

versed flax Jun 29, 2023, 4:43 PM

#

loud adder I futzed with the formatting a bunch and think this looks way better. Thoughts?

Thank you so much! It's so much better

fallow egret Jun 29, 2023, 4:52 PM

#

@loud adder , @patent gull After looking at the eval harness code, it doesn't apply bootstrap for the acc metric, it simply reports the std with respect to the 0/1 outcomes (if I understand correctly).
Are we fine with that?
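For comparison, a minimal sketch of an analytic standard error over 0/1 outcomes, assuming it is computed as the sample standard deviation divided by sqrt(n). That reading of the value is an assumption, as are the name `accuracy_stderr` and the toy data.

```python
import math
import statistics


def accuracy_stderr(outcomes):
    """Standard error of the mean of 0/1 outcomes: sample std / sqrt(n)."""
    return statistics.stdev(outcomes) / math.sqrt(len(outcomes))


# Hypothetical saved results: 60 correct answers out of 100 examples.
outcomes = [1] * 60 + [0] * 40
acc = statistics.fmean(outcomes)
se = accuracy_stderr(outcomes)

# Normal-approximation band of about two standard errors around the accuracy.
band = (acc - 1.96 * se, acc + 1.96 * se)
print(f"accuracy={acc:.3f}, stderr={se:.4f}, band=({band[0]:.3f}, {band[1]:.3f})")
```

For a mean over many independent examples this band and a bootstrap interval are usually close, so plotting either one as the confidence region should give a similar picture.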

loud adder Jun 29, 2023, 4:52 PM

#

Title page: I might remove the figure title? My main point of annoyance here is that it's not centered tbh.

Model surgery: "model editing" is a more common term, at least in NLP. As for whether there are techniques for doing this at inference time, we recently wrote a paper about it where we edit the entire Pythia suite. AFAIK this is the first example that's effective at scale: https://arxiv.org/abs/2306.03819

Table 2: I agree with @versed flax that this doesn't really accomplish what I was hoping to accomplish. I also think it's formally correct at the expense of being clear, and that if we do something like this it should be formatted like a NL document and not a tuple of strings

versed flax Jun 29, 2023, 4:54 PM

#

loud adder **Title page:** I might remove the figure title? My main point of annoyance here...

Let me remove the title, cite that paper and change the terminology, and comment that table (Alex will remove it if he agrees)

loud adder Jun 29, 2023, 5:03 PM

#

My main outstanding concerns relate to readability to NLP people, and I sent a copy to two who said they’d be able to provide feedback by the end of the day.

versed flax Jun 29, 2023, 5:04 PM

#

loud adder My main outstanding concerns relate to readability to NLP people, and I sent a c...

what could be obscure for nlp people? I don't know that culture much

loud adder Jun 29, 2023, 5:05 PM

#

It’s less about obscurity and more about readability & communicating ideas effectively

blissful garden Jun 29, 2023, 5:07 PM

#

I'm going to rewrite some of my FLOP appendix because we now have the main text.

versed flax Jun 29, 2023, 5:09 PM

#

blissful garden I'm going to rewrite some of my FLOP appendix because we now have the main text.

remember: no chicken! lol

loud adder Jun 29, 2023, 5:11 PM

#

There are some areas I find a little weird, or at least not how I would have written it, but it’s hard for me to tell if that’s personal style, language differences, field cultural differences, or something else

#

I added space at the end of the paper for acknowledgments and (at the beginning of the appendix) author contributions. Other than CoreWeave for providing compute, is there anyone in particular we want to thank? People you showed the draft to and got useful feedback from, people who provided compute for experiments other than the pod I provided, etc.

blissful garden Jun 29, 2023, 5:18 PM

#

loud adder I added space at the end of the paper for acknowledgments and (at the beginning ...

I used the cw cluster and a little bit of the Stability aws cluster. Maybe we want to acknowledge Stability for the compute as well

loud adder Jun 29, 2023, 5:20 PM

#

blissful garden I used the cw cluster and a little bit of the Stability aws cluster. Maybe we wa...

Sure

#

And by “sure” I actually mean “we are contractually obligated to do so”

versed flax Jun 29, 2023, 5:21 PM

#

loud adder I added space at the end of the paper for acknowledgments and (at the beginning ...

not on my side

#

some of my friends for taking part in the human evaluation berk

loud adder Jun 29, 2023, 5:25 PM

#

We can absolutely thank the volunteers for our human experiments

blissful garden Jun 29, 2023, 5:38 PM

#

@patent gull when you are free could you quickly glance Appendix C.2 (mostly the last paragraph) to see if I'm still missing anything?

I also capitalized all the "ANCOVA" because everybody seems to capitalize it. Also put "p" inside $ for the minor difference of fonts for math variables.

Another note: in Section 4, you wrote in the last paragraph "a P-sized model...". It seems the P is not used. Should it be removed?

patent gull Jun 29, 2023, 5:38 PM

#

lol can we thank all the volunteers by the random names we applied to them in the web interface?

#

They’ll know

#

I was “rogue”

blissful garden Jun 29, 2023, 5:39 PM

#

I forgot mine😭

patent gull Jun 29, 2023, 5:39 PM

#

Oh man hahaha

patent gull Jun 29, 2023, 5:39 PM

#

blissful garden @patent gull when you are free could you quickly glance Appendix C.2 (...

I will!!

fallow egret Jun 29, 2023, 5:40 PM

#

@patent gull All the stat is here:
https://drive.google.com/drive/folders/1it9kW9BQhWg8YfzFHcc1or69iOBX61gP?usp=sharing
I used the eval harness std. I don't like it, but if this is the convention then let's use it...

#

I also modified the captions of the figures.
If there is anything else on my side let me know

patent gull Jun 29, 2023, 5:42 PM

#

Ok cool thanks elad!!

#

Will get this done after work hours

fallow egret Jun 29, 2023, 5:44 PM

#

Table 2 was removed in the end?

patent gull Jun 29, 2023, 5:44 PM

#

versed flax I was doubting the interest of this table but I do think we're better without it...

Fine by me!! I thought the idea of outlining the different prompts was a good one but maybe they’re commonplace enough to be generally known

patent gull Jun 29, 2023, 5:44 PM

#

fallow egret Table 2 was removed in the end?

I’m sorry haha I assume so. I’m not in front of the doc right now. I think It was a misinterpretation on my part

fallow egret Jun 29, 2023, 5:45 PM

#

I'm not seeing it in the text currently. I think it was a good decision to remove it

patent gull Jun 29, 2023, 5:48 PM

#

loud adder I added space at the end of the paper for acknowledgments and (at the beginning ...

When I’m sole first author I usually acknowledge my academic funding sources. I don’t know what the protocol is here since it’s a side project and I’m not sole first author

#

I also don’t know what my funding sources want

#

What is the protocol y’all typically follow for this?

blissful garden Jun 29, 2023, 5:50 PM

#

patent gull When I’m sole first author I usually acknowledge my academic funding sources. I ...

oh this reminds me that I have a grant too...... and I haven't spun the story yet and it will look unrelated... I need to figure out thinkies

versed flax Jun 29, 2023, 5:50 PM

#

idk if I should thank my company who were also lenient enough to allow me some time to work on that while not being my mission at all

patent gull Jun 29, 2023, 5:50 PM

#

Hahaha

loud adder Jun 29, 2023, 5:50 PM

#

versed flax idk if I should thank my company who were also lenient enough to allow me _some_...

Usually that falls under having them as your affiliation

fallow egret Jun 29, 2023, 5:50 PM

#

patent gull When I’m sole first author I usually acknowledge my academic funding sources. I ...

If you have a grant it depends on the grant terms. When I was a PhD student and got a grant from the ERC I was obligated to mention them even in such a case

patent gull Jun 29, 2023, 5:51 PM

#

Ok

loud adder Jun 29, 2023, 5:51 PM

#

It’s assumed that your employer is sponsoring your research. Acknowledgements are typically for non-obvious sources of support

#

And like @fallow egret says, it’s often contractually obligated

blissful garden Jun 29, 2023, 5:51 PM

#

fallow egret If you have a grant it depends on the grant terms. When I was a phd and got a gr...

yeah ERC is very strict. Swiss guys are chill. I used to ask SNSF guys about whether I'm obligated to report, and they said "up to you" berk

versed flax Jun 29, 2023, 8:24 PM

#

I took a shot at drafting the Author Contributions appendix. I did it from unreliable memory. I invite everyone to read it and fix it. @blissful garden, @patent gull, you guys worked a lot together and I might have mixed up some of your contributions. @loud adder, I genuinely have no idea how to properly phrase the supervising role you had and may have forgotten things. @fallow egret and @unique sedge, make sure I did not forget anything.

#

This is my first time drafting such a thing and I genuinely have no idea how to word it, what level of detail to go into, etc.

fallow egret Jun 29, 2023, 8:35 PM

#

I'm completely fine with what is written. I think from a style perspective it should be more general, without the specific details. But I also haven't written such a section in any of my papers, so I'm not sure

versed flax Jun 29, 2023, 8:38 PM

#

I did my best to be fair, and indeed that's maybe a bit too detailed. I'm waiting for feedback from people more seasoned than I am

blissful garden Jun 29, 2023, 8:57 PM

#

fallow egret I'm completely fine with what is written. I think from a style perspective it sh...

Some papers do, like Pythia 😉

fallow egret Jun 29, 2023, 9:00 PM

#

Yes, I see the RWKV paper also has such a detailed style. It looks like this is the Eleuther style 🙂
https://arxiv.org/pdf/2305.13048.pdf

versed flax Jun 29, 2023, 9:08 PM

#

okay maybe I shouldn't be that specific with section numbers

blissful garden Jun 29, 2023, 9:09 PM

#

lgtm

versed flax Jun 29, 2023, 9:09 PM

#

blissful garden lgtm

currently? or if I get a bit more vague by removing the section numbers?

blissful garden Jun 29, 2023, 9:09 PM

#

versed flax currently? or if I get a bit more vague by removing the section numbers?

I'm saying the contribution part

versed flax Jun 29, 2023, 9:10 PM

#

blissful garden I'm saying the contribution part

oh. I'm happy I got it right :)

blissful garden Jun 29, 2023, 9:10 PM

#

Oh I added C.1 to my part. I did C1-3 altogether

versed flax Jun 29, 2023, 9:10 PM

#

perfect!

#

3.1 is still unattributed. It's the standard benchmark section

#

the paper flies tomorrow 🥳

fallow egret Jun 29, 2023, 9:15 PM

#

We will have time for changes until Monday:
https://info.arxiv.org/help/availability.html

versed flax Jun 29, 2023, 9:16 PM

#

wait, the paper doesn't go live as soon as posted??

blissful garden Jun 29, 2023, 9:16 PM

#

versed flax 3.1 is still unattributed. It's the standard benchmark section

I think that has a mixture of my old texts and Alex's stuff

blissful garden Jun 29, 2023, 9:16 PM

#

versed flax wait, the paper doesn't go live as soon as posted??

nope

loud adder Jun 29, 2023, 9:17 PM

#

versed flax the paper flies tomorrow 🥳

Oh we must have failed at communicating about this

#

Yeah it's weird. And it's made worse by the fact it skips a day: you'd think papers received by 2 pm EST friday would go out at 8 pm EST friday, but they don't for reasons I don't understand

#

Schedule tl;dr

If the paper is submitted by 1800 UTC on Friday it goes out at the end of the day on Sunday
If the paper is submitted by 1800 UTC on Monday it goes out at the end of the day on Monday
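
The schedule above can be sketched as a small helper. This is a rough illustration, not arXiv's own logic: the 18:00 UTC cutoff is taken from the messages above (it corresponds to 14:00 US Eastern during daylight saving time), the Tuesday-to-Thursday and weekend rules are filled in from arXiv's published availability schedule, holidays are ignored, and `announcement_date` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

# Assumed cutoff, taken from the "1800 UTC" quoted in the chat.
CUTOFF_HOUR_UTC = 18


def announcement_date(submitted):
    """Return the US Eastern calendar day on which a submission is announced.

    Sketch only: ignores arXiv holidays and treats 18:00:00 UTC itself
    as already past the cutoff.
    """
    if submitted.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    utc = submitted.astimezone(timezone.utc)

    # Submissions after the cutoff roll into the next day's batch.
    batch_day = utc.date()
    if utc.hour >= CUTOFF_HOUR_UTC:
        batch_day += timedelta(days=1)

    weekday = batch_day.weekday()  # Monday == 0
    if weekday == 4:   # Friday batch is announced on Sunday
        return batch_day + timedelta(days=2)
    if weekday == 5:   # Saturday batch waits for Monday
        return batch_day + timedelta(days=2)
    if weekday == 6:   # Sunday batch waits for Monday
        return batch_day + timedelta(days=1)
    return batch_day   # Monday to Thursday batches go out the same day


friday_5pm = datetime(2023, 6, 30, 17, 0, tzinfo=timezone.utc)
print(announcement_date(friday_5pm))
```

Under these assumptions, a submission made before the Friday cutoff lands in the Sunday announcement, while one made an hour after the cutoff slips to Monday.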

versed flax Jun 29, 2023, 9:22 PM

#

so we can do tomorrow 6pm, right?

loud adder Jun 29, 2023, 9:23 PM

#

Yes

versed flax Jun 29, 2023, 9:24 PM

#

Awesome! So this is our last night working on it :)

blissful garden Jun 29, 2023, 9:24 PM

#

@versed flax Oh the legends of Pythia and GPT2 charts are missing

#

By the way I will be busy travelling internationally tomorrow. Hope we don't find any last-minute issues related to my parts, but just to say I might not be available for quite a while.
Doing some final proofreading right now

versed flax Jun 29, 2023, 9:45 PM

#

blissful garden @versed flax Oh the legends of Pythia and GPT2 charts are missing

good catch (again)

versed flax Jun 29, 2023, 10:06 PM

#

fallow egret We will have time for changes until Monday: https://info.arxiv.org/help/availabi...

just to make sure you caught up: we post tomorrow and it will be released Sunday. I wanted to make sure you read what happened after your message :) We don't have until Monday

patent gull Jun 29, 2023, 10:53 PM

#

phew just ending work. a lot to catch up on. what is needed from me? besides some proof-reading?

versed flax Jun 29, 2023, 10:53 PM

#

patent gull phew just ending work. a lot to catch up on. what is needed from me? besides som...

Appendix A, Author contributions :)

#

Just make sure I didn't forget something about you

patent gull Jun 29, 2023, 10:53 PM

#

haha ok i'll take a look, thanks man

#

uh dumb question — in Figure 1, whats the difference between \gamma = 1 and 1.5?

#

just the bolding? looks like the same output to me

versed flax Jun 29, 2023, 10:56 PM

#

yes

patent gull Jun 29, 2023, 10:57 PM

#

uhh i'm confused haha. I kinda glanced at the discussion around this plot, but i thought the point was to show that when we traversed towards higher \gamma, the generation changed

#

what's the point of it?

#

ohh wow someone did a lot of cutting... it's only 9.5 pages

versed flax Jun 29, 2023, 10:58 PM

#

Showing how we fiddle with the latent space. But you have a point

versed flax Jun 29, 2023, 10:59 PM

#

patent gull uhh i'm confused haha. I kinda glanced at the discussion around this plot, but i...

This is smart indeed

patent gull Jun 29, 2023, 10:59 PM

#

IMO, gamma=1.5 should be perfect, 1.1 should be not as perfect, and 1.0 should be blah

#

oh wait it starts at 0

#

ohhh this is showing literal prompt emphasis wow i'm dumb

versed flax Jun 29, 2023, 11:00 PM

#

yes

#

That's the caption lmao

#

but you have a point, it wouldn't hurt to show the continuations as well

patent gull Jun 29, 2023, 11:01 PM

#

that's the caption....

#

yes

versed flax Jun 29, 2023, 11:02 PM

#

gamma=0 => "citizen were celebrating summer"
gamma=1 => "Today in France, citizen were celebrating Christmas"
gamma=1.5 => "Today in France, citizen were celebrating Bastille Day"
something like that?

patent gull Jun 29, 2023, 11:03 PM

#

so the prompt is "today in france" and the continuation is "citizenS were celebrating..."?

#

i don't really get the point of gamma=0, we don't test on that, and why would we expect it to be even close to being on topic? I would expect total garbage from it

#

but anyway yeah, I would do something like:

gamma=1 "Today in France, the weather was decent in London" (i.e. meandering, definitely topic-switches)
gamma=1.1 "Today in France, the weather was good for citizens" (i.e. not great, kinda passable)
gamma=1.5 "Today in France, the citizens celebrated in good weather" (i.e. good, on-topic)

#

and then underline "Today in France" and we can update the caption to be way more explicit about each time-step

versed flax Jun 29, 2023, 11:07 PM

#

what

patent gull Jun 29, 2023, 11:07 PM

#

idk just a thought

versed flax Jun 29, 2023, 11:07 PM

#

how is "good weather" related to France?

patent gull Jun 29, 2023, 11:07 PM

#

idk i was thinking of showing something that would change topic by the end

#

I guess your gamma=1 isn't bad

versed flax Jun 29, 2023, 11:08 PM

#

how about
gamma=0 => "citizens were celebrating Independence Day"
gamma=1 => "Today in France, citizens were celebrating Christmas"
gamma=1.5 => "Today in France, citizens were celebrating Bastille Day"

patent gull Jun 29, 2023, 11:08 PM

#

haha ok, yeah that sounds good

#

but why would gamma=0 be like that at all, though

#

wouldn't we expect random generation?

versed flax Jun 29, 2023, 11:09 PM

#

No?

patent gull Jun 29, 2023, 11:09 PM

#

gamma =0 means completely unprompted

#

so it could easily be "chickens fly to trees"

#

or whatever

versed flax Jun 29, 2023, 11:09 PM

#

The prompt only is "Today in France"

#

"citizen were celebrating" is the beginning of a continuation

patent gull Jun 29, 2023, 11:10 PM

#

oh we're doing like a multipart prompt?

versed flax Jun 29, 2023, 11:10 PM

#

Ah, that was in the caption but got deleted

patent gull Jun 29, 2023, 11:10 PM

#

ok... so "citizens were celebrating" should be underlined... "Today in France" should be bigger/bolder each time

versed flax Jun 29, 2023, 11:11 PM

#

You're overthinking it wayyyy too much

patent gull Jun 29, 2023, 11:11 PM

#

and then in caption we should say "start of continuation is underlined", "prompt is bolded" according to strength

#

or something

versed flax Jun 29, 2023, 11:11 PM

#

Wait you're actually totally right

#

gamma=0 is totally unrelated

#

whoopsie
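
For reference, the gamma behaviour being discussed can be sketched with the usual classifier-free guidance blend on next-token log-probabilities: guided = uncond + gamma * (cond - uncond). With gamma=0 only the unprompted distribution is left, gamma=1 recovers the ordinary prompted model, and gamma>1 pushes further towards the prompt. The three candidate tokens and all probabilities below are made up for illustration (they are not model outputs), and `cfg_logprobs`, `softmax` and `best_token` are hypothetical helper names.

```python
import math


def cfg_logprobs(cond, uncond, gamma):
    """Guided scores: uncond + gamma * (cond - uncond), token by token."""
    return [u + gamma * (c - u) for c, u in zip(cond, uncond)]


def softmax(scores):
    """Normalize scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidates following "citizens were celebrating".
tokens = ["Bastille Day", "Christmas", "Independence Day"]
# Made-up probabilities, chosen only to mirror the example in the chat.
cond = [math.log(p) for p in (0.35, 0.45, 0.20)]    # "Today in France" visible
uncond = [math.log(p) for p in (0.05, 0.40, 0.55)]  # prompt dropped


def best_token(gamma):
    """Most likely candidate under guidance strength gamma."""
    probs = softmax(cfg_logprobs(cond, uncond, gamma))
    return tokens[probs.index(max(probs))]


for gamma in (0.0, 1.0, 1.5):
    print(f"gamma={gamma}: {best_token(gamma)}")
```

With these toy numbers the top candidate shifts as gamma grows, which is the effect the figure is meant to convey.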

patent gull Jun 29, 2023, 11:12 PM

#

honestly i think we shouldn't show gamma=0

#

we should just start the line at gamma=1 and underneath say (baseline)

#

that way it's clear that we're just improving above baseline

versed flax Jun 29, 2023, 11:12 PM

#

It's important.

patent gull Jun 29, 2023, 11:13 PM

#

we can do:

gamma=1 => "Today in France, citizens were celebrating July 4th"
gamma=1.1 => "Today in France, citizens were celebrating Christmas"
gamma=1.5 => "Today in France, citizens were celebrating Bastille Day"

versed flax Jun 29, 2023, 11:13 PM

#

But now I want to add
gamma=0.5 => "Today in France, citizens were celebrating Independence Day"

#

ah! great convergence lol

patent gull Jun 29, 2023, 11:13 PM

#

yah bc July 4th is illogical

#

whoops boss is calling me brb. those are my 2 cents for the figure

versed flax Jun 29, 2023, 11:13 PM

#

tbh I'm not exactly comfortable showing a wrong continuation for gamma=1

#

patent gull Jun 29, 2023, 11:32 PM

#

Haha I think that’s great

#

But that is accurate, isn’t it?

#

Like that’s what would happen

#

That’s a beautiful graphic imo

versed flax Jun 29, 2023, 11:33 PM

#

It's much better

#

Thank you for catching that

patent gull Jun 29, 2023, 11:33 PM

#

Haha no problem

#

Something is breaking at work so I need to brb, but I’ll be back later. Did a quick scan of the paper: it seems really good

blissful garden Jun 30, 2023, 12:29 AM

#

versed flax tbh I'm not exactly comfortable showing a _wrong_ continuation for gamma=1

how does the model know what day it is today

versed flax Jun 30, 2023, 12:30 AM

#

blissful garden how does the model know what day it is today

it doesn't?

loud adder Jun 30, 2023, 12:30 AM

#

versed flax it doesn't?

He was making a joke

versed flax Jun 30, 2023, 12:30 AM

#

whoops

loud adder Jun 30, 2023, 12:31 AM

#

Pretending that the cause of the change of the holiday was changing what day the model thought it was

versed flax Jun 30, 2023, 12:32 AM

#

I guess I should go to bed and sleep, then haha

patent gull Jun 30, 2023, 4:13 AM

#

@fallow egret i'm redoing the plots now

fallow egret Jun 30, 2023, 4:13 AM

#

Let me know if something is missing

patent gull Jun 30, 2023, 4:13 AM

#

I'm looking over and it seems like the aqua plots are even more impressive

#

i have two questions:

Was there any specific reason you put aqua in the appendix and gsm8k in main body?
I'm thinking of putting them all in the main body, but only reporting one metric (probably accuracy). Is this OK? The metrics seem highly correlated. Are there any specific insights we get from % invalid?

fallow egret Jun 30, 2023, 4:15 AM

#

Not really. GSM8K is the more standard benchmark (it's bigger and appears in more previous works), but I don't think the order is that important
Yes, they are not correlated for high CFG values, which I think is very important

patent gull Jun 30, 2023, 4:16 AM

#

hmmmmmm i see

fallow egret Jun 30, 2023, 4:17 AM

#

For low values they are indeed correlated: you get more parsed results and accuracy increases. However, for larger values you still get the same high valid percentage but the accuracy breaks, which means the quality of the reasoning chains deteriorates

patent gull Jun 30, 2023, 4:18 AM

#

that seems to me like invalid % is a coarser metric

#

oh wait

#

well considering the confidence regions, it seems to me that invalid % stays pretty constant

#

#


#

how is invalid % calculated?

fallow egret Jun 30, 2023, 4:23 AM

#

Each result is 1 if you get a parsed answer (otherwise 0), and invalid % is simply the share of non-parsed results: sum(res==0)/len(res)
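
A minimal sketch of that computation, assuming `res` is a plain list of 1/0 parse flags as described; the function name `invalid_rate` is hypothetical and not taken from the project's code.

```python
def invalid_rate(res):
    """Fraction of outputs whose final answer could not be parsed.

    res: list of flags, 1 if an answer was parsed, 0 otherwise.
    """
    if not res:
        raise ValueError("res must not be empty")
    return sum(1 for r in res if r == 0) / len(res)


# Five generations, two of which failed to parse.
print(invalid_rate([1, 1, 0, 1, 0]))
```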

patent gull Jun 30, 2023, 4:25 AM

#

i'm a bit confused. isn't accuracy strictly bounded by % invalid?

#

in Aqua Guanaco, how can there be more invalid, but also more correct?

#

i guess it's different portions of the dataset, but still, seems counterintuitive to me

#

what is more important for practitioners to be able to measure? An invalid answer or an incorrect answer? Can't we have heuristics to reject invalid answers? And then, what is the accuracy only on the valid answers? Do people look at that?

fallow egret Jun 30, 2023, 4:30 AM

#

If you have 20% invalid but all the answers in the remaining 80% are correct, you have 80% accuracy. On the other hand, if you have 10% invalid and only 50% of the remaining 90% are correct, then you have 45% accuracy

#

We have a heuristic: as written in the paper, we follow the self-consistency parsing protocol

#

we follow their exact protocol with respect to both the prompt and the parsing

patent gull Jun 30, 2023, 4:32 AM

#

i see so accuracy is also a function of % invalid

#

i guess i'm just wondering if there's a way to include both aqua and gsm8k in the main body, but only with accuracy. I guess it's an interesting point about the different CFG values, though

fallow egret Jun 30, 2023, 4:32 AM

#

It's not precision.
It's the number of correct answers (no matter valid/invalid) / length
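
The difference between the two quantities can be sketched as follows. The record layout `(parsed, correct)` and all function names are hypothetical, and the sketch assumes an unparsed answer is never counted as correct, which is what the 80% and 45% figures quoted earlier imply.

```python
def accuracy(records):
    """Correct answers divided by ALL examples, valid or not."""
    return sum(correct for _, correct in records) / len(records)


def precision_on_valid(records):
    """Correct answers divided by the examples that parsed."""
    valid = [correct for parsed, correct in records if parsed]
    return sum(valid) / len(valid)


def make_run(n_invalid, n_correct, n_wrong):
    """Build a toy run as a list of (parsed, correct) pairs."""
    return (
        [(False, False)] * n_invalid
        + [(True, True)] * n_correct
        + [(True, False)] * n_wrong
    )


# The two cases from the discussion, scaled to 100 examples each.
run_a = make_run(n_invalid=20, n_correct=80, n_wrong=0)
run_b = make_run(n_invalid=10, n_correct=45, n_wrong=45)

for name, run in (("A", run_a), ("B", run_b)):
    print(name, accuracy(run), precision_on_valid(run))
```

Run B has fewer invalid answers than run A but lower accuracy, which is how a higher invalid % can coexist with more correct answers.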

patent gull Jun 30, 2023, 4:33 AM

#

i see.. sorry it's late for me

#

brain's not working

fallow egret Jun 30, 2023, 4:34 AM

#

sure, completely understood 🙂
I was also wondering what the correct way to do that is. But I think the invalid metric is super important for explaining what is happening, and the exact effect of the CFG

patent gull Jun 30, 2023, 4:36 AM

#

yeahh i see that...

#

hmmmm let me try one thing

#

ugh yeah it's really hard to see it working as a table...

#

#

acc and invalid % would probably do best stacked in parentheses, but we've established a different visual vocabulary for parentheses elsewhere. Also hard for the eye to really follow

fallow egret Jun 30, 2023, 4:49 AM

#

Yes, I think graph is much more readable than table in this case

patent gull Jun 30, 2023, 4:51 AM

#

so when i plot them like this:

#

#

fallow egret Jun 30, 2023, 4:52 AM

#

Yes, this might work
