#Evaluating Classifier-Free Guidance impact


patent gull
#

^ I could say the same thing

#

so let's just move forward

versed flax
#

then I'd say it's better stated in the intro. Though it introduces the notation in Sec2

fallow egret
#

Yes, I agree that something currently looks wrong in the first paragraphs of section 2 (until 2.1)

blissful garden
#

Really? I thought I was rephrasing the flow and fixing the grammar and adding the last paragraph.
It was done before all our conversations, so if it was important to look at your original text, I'm sorry

fallow egret
versed flax
#

let me transfer the ownership to Stella then

fallow egret
#

@versed flax you can also subscribe and cancel. It will not charge you anything, and we will have 14 days to work with premium

versed flax
#

Ha, good.

fallow egret
patent gull
#

i have a premium sub

#

educational account

#

idk how to transfer, though

fallow egret
#

IMO the line of works and presenting the prompt alignment issue should be only in the intro. In section 2 there should be only the formal notations (defining the first tokens in the sequence as the prompt)

patent gull
#

woohoo we have full history!

blissful garden
#

just commented out my remarks for Section 2 to make it cleaner. For the record I didn't touch the content of 2.1 and 2.2 except the equation numbering change.

patent gull
#

ok great

#

section 1 and 2 look good to me

fallow egret
patent gull
#

Section 3 needs a roadmap before diving into 3.1... it's our most important section and we need to prime the reader to get all of our great results

#

wait.. it's all just inputted from another file

#

hm oh. right. catching up

fallow egret
patent gull
#

😅

#

lucky me

#

ok.. what's the plan with Section 3? Elad thinks it needs a rewrite? I actually think that just putting in more structure:

  • road map in the start
  • first/last sentences in each subsection tying it to the overall picture

is fine

blissful garden
#

so do we restructure section 3?

patent gull
#

☝️

blissful garden
#

Elad has a point in terms of classifying benchmarks based on tasks. But we need to think carefully about that. Might mean we split up some tables and stuff

patent gull
#

"classifying benchmarks based on tasks" <- can someone summarize this for me?

fallow egret
#

cc @patent gull

blissful garden
#

here, seems like we have 3 categories for the general benchmarks

patent gull
#

cool thanks

#

ok yeah i do agree with this

#

I was thinking that even just doing points 1/3/4 would really clarify things

blissful garden
#

so overall do we do

  • common sense reasoning
  • close-book QA
  • completion
  • machine translation
  • code generation
patent gull
#

but 5 makes a lot of sense to me, too. #2 looks more like a statement to me? idk what's the proposal, there

fallow egret
patent gull
#

gotcha

#

sure. I did like the idea of lm_harness as its own standalone, since I think people will start to see that more and more, and just know what it is

#

but i think breaking it up is also something I support

#

does this imply that we need to redo Figure 1 to be more broken up?

patent gull
blissful garden
fallow egret
blissful garden
fallow egret
#

🙌 I'm excited, I think it will really improve the quality of the paper!

blissful garden
patent gull
#

i see

blissful garden
#

completion is just lambada, QA are just triviaqa and sciq, machine translation is just wmt14 fr-en, and then a whole army of reasoning tasks

fallow egret
#

As I said, in my opinion it will actually reduce their volume, since this whole list only gets one subsection (assuming all subsections will have roughly the same length, as they should)

blissful garden
#

it's also funny that some non-reasoning tasks (sciq and lambada) tend to get bigger improvements than reasoning tasks which I don't understand 😂

patent gull
#

alright, i'll be back in 10-20. I think once it's mapped out, I might be able to take on this task of rewriting Section 3, but not sure if i'm mentally able to after EMNLP

blissful garden
#

There are some sporadic experiments like the GPT-J codegen completion and my initial image generation tasks. Maybe let me move them to appendix?

Also moving the figure 1 up to 3.1 (somehow it ended up in 3.3)

loud adder
#

When would be a productive time for me to read through the paper today

patent gull
#

@loud adder here are where your comments might be most appreciated/productive:

  • section 1,2 are done. Style edits/thoughts appreciated
  • section 3: we are debating structure. If you could see how the structure feels to you currently, and whether you think it needs an overhaul, that would be great
#
  • section 4, we’d appreciate your thoughts on content. last time I checked (pre EMNLP) was in a good state writing-wise. Not sure if it changed. But your thoughts on the content and whether you think it’s good/conclusive or needs more would be very appreciated!
#

Can't speak to the rest of the sections personally

versed flax
blissful garden
versed flax
patent gull
#

Yeah… “early experiments showed…” or “initial runs suggest that…”

versed flax
#

exactly

patent gull
#

i'm reading through Section 3 again with respect to @fallow egret 's new structure and now I actually have a counterargument...

#

the current organization scheme can roughly (with some reorg) be broken down into different "Methods", not "Tasks"... so another way to reorg would be:

  • 3.1: zero-shot prompting
  • 3.5: chain of thought
  • 3.2-3.3 text-to-text generation
  • 3.4: negative prompting, (although maybe this deserves its own section?)
#

I actually think this is a little bit more logical than breaking it down by "Tasks" because this paper is really more about the CFG mechanism than it is about the semantics of the tasks.

With this breakdown, we can make a big-picture story that we're really trying to probe CFG with different parts of the prompt/prompt formulations. Those categories above are a nice breakdown

#

besides, i think if we break down into tasks, we'd have to have insights/hypotheses about why it does well at the tasks, specifically, and we don't have anything besides hand-waviness or trying to find citations.

This "methods" breakdown is really more about viewing CFG with different prompt setups, which then Section 4 addresses much more directly

blissful garden
patent gull
#

right, it's a little unbalanced if we go with the task breakdown

fallow egret
#

I'm completely fine with this 'method' split as long as we emphasize it. I agree that this distinction indeed sounds much better. The only thing is that I think it's a little bit strange to classify code generation as text-to-text, and it sounds like there should be a merge between 3.2 and 3.3

patent gull
#

i'm open to either way, it was just something that came to mind, btw

#

i think yeah my only qualm with "Tasks" was I was starting to write it in my mind

#

and realized that I didn't have a great hypothesis/justification for why CFG would do well in common sense, QA, etc.

#

besides just repeating over and over again "more adherence to the prompt"

fallow egret
#

Yes, I agree that it's a much better split and that it better stresses the different effect of each method

patent gull
#

which maybe that'll also work 🤷‍♂️ idk. but it doesn't feel like it builds, actually, in the same way the methods split does

#

plus, your work gets to stand on its own, now 😉

#

ok i'll take a stab at putting that structure in the beginning, and if it doesn't work, we can always evaluate and take a diff direction

fallow egret
#

Lol, this doesn't affect my vote 🙂

blissful garden
fallow egret
#

Ok, so for me the method split sounds indeed more reasonable, any objection?

versed flax
#

I'm trapped because of timezones but will do another pass later

patent gull
#

i'm gonna update the old_sections/experiment file

#

we can always go a different direction with another file lol

fallow egret
patent gull
#

alright, i did a little bit showing what kind of structure I have in mind. Haven't finished up the last parts ... will be back in a bit to do that

#

language might be a little sloppy, especially around the "we hypothesize..." bits.... feel free to change!!

unique sedge
#

Negative prompting probably deserves its own section if it's like a good bridge between going from tasks to the section where we talk about why it works.

Every section should add something to the global thread and make the case stronger

versed flax
#

Negative prompting needs to be addressed separately I guess, especially for future work

patent gull
#

We have those really good human evals though, right?

versed flax
#

We need to find a way to make it work. It's just too powerful. It won't be for this paper but it's good to address it

versed flax
#

But it's still one.

patent gull
#

Hmmm let’s see. Idk it still might fit

versed flax
#

I mean, it's not as granular or interesting as I wanted it to be, but it's still neg prompting and the results are quite awesome

patent gull
#

In my opinion I think it fits in with a method-driven reorg of section 3

#

Although maybe we’ll be able to evaluate better once it’s all written

versed flax
#

I'll work on that as I get home

patent gull
#

i had a thought for another explanatory experiment, what do you guys think

#

so we argue that CFG increases the adherence to the prompt

#

this implies that true continuation w_c is more likely under true prompt w_p vs. another random prompt w_{p'} in the CFG setting vs. the vanilla setting

#

so we measure \delta p = p(w_c | w_p) - p(w_c | w_{p'})

versed flax
#

Yes. That's what we're trying to measure with KL, Kendall tau etc

patent gull
#

but we're not holding the continuation the same

#

and testing different prompts

patent gull
#

what we're testing with KL, Kendall, etc. is whether the logit distributions of CFG look similar to Instruction-tuned models

#

not explicitly whether they're following the prompts better

#

and what I'm saying that if a model ISN'T following the prompt well, we would expect this delta:

$\delta p = p(w_c | w_p) - p(w_c | w_{p'})$

to be lower

#

than a model that is
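
A minimal sketch of the measurement proposed above, assuming the per-token log-probabilities of the same continuation w_c have already been scored under the true prompt w_p and under a random prompt w_{p'}. The helper names (`sequence_logprob`, `adherence_gap`) are hypothetical, and every number is a made-up placeholder rather than a result from the paper.

```python
import math


def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities into one sequence log-probability."""
    return sum(token_logprobs)


def adherence_gap(logp_true_prompt, logp_other_prompt):
    """delta p = p(w_c | w_p) - p(w_c | w_p'), computed from sequence log-probs."""
    return math.exp(logp_true_prompt) - math.exp(logp_other_prompt)


# Placeholder per-token log-probs of the SAME continuation w_c under two prompts.
# "vanilla" and "cfg" stand for two decoding setups; the values are invented.
vanilla = {"true": [-1.2, -0.9, -1.5], "other": [-1.6, -1.3, -1.9]}
cfg = {"true": [-0.7, -0.5, -0.9], "other": [-2.4, -2.0, -2.6]}

gap_vanilla = adherence_gap(
    sequence_logprob(vanilla["true"]), sequence_logprob(vanilla["other"])
)
gap_cfg = adherence_gap(
    sequence_logprob(cfg["true"]), sequence_logprob(cfg["other"])
)

# If CFG increases prompt adherence, the expectation is gap_cfg > gap_vanilla.
print(f"vanilla gap: {gap_vanilla:.4f}, cfg gap: {gap_cfg:.4f}")
```

The point of holding w_c fixed and varying only the prompt is that the gap isolates prompt sensitivity, which a comparison of two models' distributions under the same prompt does not.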

versed flax
patent gull
#

we're testing something more like:

$m = < p_1(w_c | w_p) || p_2(w_c | w_p)>$

patent gull
#

yeah I try to make the argument with entropy, but I was just thinking about another way to test the argument maybe more directly

#

anyway, I'm gonna keep editing section 3

#

that was just a passing thought

versed flax
patent gull
#

lower entropy is evidence of prompt adherence, but not bullet proof

#

there are other reasons why entropy might decrease besides greater prompt adherence
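
One concrete illustration of that caveat, as a sketch: lowering the sampling temperature reduces next-token entropy without the prompt changing at all. The logits below are arbitrary placeholder values, not model outputs.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.5, -1.0]  # placeholder next-token logits

h_base = entropy(softmax(logits))                   # temperature 1.0
h_cold = entropy(softmax(logits, temperature=0.5))  # sharper distribution

# Same prompt, same logits, lower entropy: the drop says nothing about adherence.
print(f"entropy at T=1.0: {h_base:.3f}, at T=0.5: {h_cold:.3f}")
```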

versed flax
#

It shows better language modeling

patent gull
#

uhh yeah i think you're right

#

but in theory, the model could be both doing well on benchmarks and generating crappy english

#

totally possible to overfit on benchmarks

patent gull
#

ok done editing Section 3

#

left some comments, didn't touch "Continuations"

#

but i did a lot of work trying to make it more structural and flow together better

#

please let's not make major changes without a discussion here!! I'll try to look at Section 4 later tonight or tomorrow

#

who wants to take a stab at the conclusion? if no one does by the time i'm done with Section 4, then I will

#

i think we're close, everyone. it's shaping up

#

appendices need work, but the language in the main body is really coming together, I think

versed flax
#

@patent gull I'm back home

#

How can I maximize my usefulness?

patent gull
#

Cool!

#

I left some comments in the negative prompting section

#

If you take a pass at those then I can look later

#

And then we all have to start addressing that appendix lol…..

blissful garden
#

are [] in section 3 placeholders for citations? oh yeah they are. Fixed some stuff for Section 3.3

versed flax
#

Sec 2.1 uses r, Sec 2.2 uses gamma (and so do all the figures). Does anyone have a well-thought-out opinion on the notation we should prefer?

blissful garden
versed flax
#

Well, Ho & Salimans is the reference paper for cfg

blissful garden
#

I always go for the notation of the most-known paper unless there is a counter-argument for the choice

blissful garden
versed flax
#

IIRC sec 2.1 goes with CG's notation while 2.2 goes with CFG but I need to double check

#

ok no cfg uses w

#

it's the blog post that uses gamma

blissful garden
#

Looking at 3.1, fixed some minor naming and citation problems.

close-book QA \cite{}, common sense reasoning tasks \cite{}, and sentence completion-tasks \cite{}
Do we have citations for each of these task categories? I don't recall any.

versed flax
#

I don't know any citation that would fit here (but my NLP culture is small)

blissful garden
#

Yeah unless we throw all the citations of benchmarks in their corresponding spots. Leave it here for now. If nobody has better idea, we can remove these empty citations.

blissful garden
versed flax
blissful garden
versed flax
#

exactly. "words", "weights", omega

blissful garden
versed flax
#

I'm checking

blissful garden
#

by the way is it a standard practice to cite blog posts in ML?

versed flax
patent gull
#

Ohh yah the notation definitely needs to be standardized

patent gull
#

Sorry I just threw those in there. Feel free to ignore. I can also do the work of finding those citations myself, sometimes it’s just easier to divide the labor and not switch between lots of tabs

#

But my rule of thumb is “define or cite”

#

“If you can’t cite, define. If you don’t feel like defining, cite”

blissful garden
patent gull
#

Cool cool

blissful garden
#

great job revising section 3 by the way!

patent gull
#

I’m gonna be back online later tonight

#

Thanks!!!

blissful garden
#

oh a super minor question, I saw "...– i.e.". Is this alright? I mean I remember always seeing things like "..., i.e., ...".

versed flax
#

cgf fix and Imagen use w too but we still hate it

blissful garden
versed flax
#

meaning this is not "... - ie." but "... - ie ... - ..." and is indeed correct

blissful garden
blissful garden
versed flax
#

the dudes in Imagen are just losing it and not even trying to hide it lmao

versed flax
blissful garden
#

Sorry I'm just trained as a mathematician. I guess it's alright in ML

versed flax
#

You're the maths guy. however I don't recall reading a paper saying "sorry tho we changed the letter because it was unadapted xoxo". They just do it

#

(just like cfg do not justify their choice for not keeping s)

blissful garden
versed flax
#

I would consider the notation more rigorously if this were a mathy paper where connecting properly to the previous work was important because of some complex derivation etc

#

but here the mathiness is mostly for us to look like cool kids and the equation is absolutely trivial

#

That being said, I do have some kind of French "good enough" attitude, and it sometimes needs not to be tolerated

blissful garden
blissful garden
versed flax
#

Oh I'm not talking about french mathematicians, whatever their nationality, they're a species on their own lmao

blissful garden
versed flax
#

haha

versed flax
blissful garden
versed flax
#

theeeeeeeeeeeen... gamma?

blissful garden
#

learn from the best

#

I'm still learning the ML culture and I'm definitely not stubborn about my own habits

versed flax
#

Gamma it is then. @fallow egret did you have a strong reason to use r in 2.1, and if so, should we reconsider the notation in the rest of the paper? If not, are we good changing r to gamma for consistency?

versed flax
#

Damn I was quite happy with our former way of presenting CFG in the intro. It was more generic and we naturally derived the negative prompting and "promptless" setting easily

#

it's much harder now to go the other way "indeed... promptless is a particular case, you're not forced to negatively condition on the empty sequence, it can be anything, here's the actual generalized CFG formula haha what a nice trick we pulled on you!" lol
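
A sketch of the generalized form described here, assuming the usual CFG combination in log-probability space: start from the negatively conditioned distribution and move toward the prompt-conditioned one with strength gamma. The function name `cfg_logprobs` and the toy logits are hypothetical. The "promptless" setting is the special case where the negative logits come from an empty prompt.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def cfg_logprobs(cond_logits, neg_logits, gamma):
    """Guided next-token log-probs: neg + gamma * (cond - neg), renormalized.

    cond_logits: logits given the prompt.
    neg_logits:  logits given the negative prompt (empty prompt = promptless CFG).
    """
    cond = log_softmax(cond_logits)
    neg = log_softmax(neg_logits)
    mixed = [n + gamma * (c - n) for c, n in zip(cond, neg)]
    return log_softmax(mixed)


# Toy three-token vocabulary; values are placeholders.
cond_logits = [2.0, 0.5, -1.0]
neg_logits = [1.0, 1.0, 0.0]

plain = cfg_logprobs(cond_logits, neg_logits, gamma=1.0)   # recovers the conditional
guided = cfg_logprobs(cond_logits, neg_logits, gamma=1.5)  # pushes away from neg
print(plain, guided)
```

With gamma = 1 the negative term cancels and the plain conditional distribution comes back, which is a quick sanity check on any implementation.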

fallow egret
fallow egret
patent gull
#

great i went through and finished editing 3.4. There's one more little detail, @versed flax , and then I'll feel done. I feel like we're in a good spot with Section 3. Section 4 and 5 I wrote/edited. We're just a conclusion away from being done with the main body

fallow egret
#

I want to write the appendix for CoT, is there any decision on the appendix structure?

stone umbra
#

I've been reading this like a soap opera, and just wanted to say that this is really cool work. 👋
Also, this was pretty hilarious (from https://cfg.vermeille.fr/):

Prompt
How to choose a good learning rate?
Response
Sometimes you can't choose a learning rate. You can't control your learning rate. You have to let it run. It's like breathing. It's hard to control your breathing, but it's also what keeps you alive.

loud adder
#

I talked about this paper with Yejin Choi this weekend and she thought it was quite interesting. A lot of her work recently has had a similar theme, in that it’s oriented towards how we can induce high quality behavior in cheaper models. Most of her work has been in terms of producing higher quality synthetic datasets, but she was pleasantly surprised how much of an impact one can have at inference time based on this paper.

versed flax
#

So cool!

loud adder
#

I’m getting on a plane to come home, but will have notes by Monday

versed flax
#

Thank you so much

versed flax
#

@fallow egret, reading 2.1 I think you mixed \propto and \sim, I fixed that. I'll change r for \gamma later. It's totally omitting that CG uses the gradient of the external classifier otherwise people will actually wonder how that works (for now I reintroduced the commented sentence about it. Also the sentence following it started with "This modification" and there was no modification introduced)

#

I'll fix that tonight if you're okay

patent gull
#

We’ll include a one-page appendix map and table of contents

#

But otherwise I don’t really think it all needs to tie together. Maybe others have different opinions

fallow egret
patent gull
#

But a super integrated appendix sometimes gets reviewers saying “this should’ve been another paper. REJECT.” At least I’ve gotten that feedback before

patent gull
fallow egret
patent gull
#

That’s a good q

#

I think we can roughly keep the same structure in the appendix as we do in the paper

#

But not every section is gonna have a ton of results in the appendix and I think that’s ok

#

Let’s just reference “see appendix” in section 3 whenever suitable

#

Btw I’m gonna be away from my computer most of today, headed to the beach. Will check later tonight

#

Let’s everyone take a crack at the appendix. I think in general, if you put results in the appendix, you’re responsible for summarizing them

blissful garden
versed flax
#

I do

fallow egret
#

@versed flax what is the timeline? I understand that you want to publish it on Wednesday, so everything should be wrapped tomorrow?

versed flax
versed flax
#

That should give us enough time

#

This sounds reasonable to me

fallow egret
#

Yes, I agree

blissful garden
#

Brief summaries of my experiments in appendix are done (benchmarks and codegen). Some remarks are left for parts involving other people's works. Feel free to remove my remarks if they are dealt with.

patent gull
#

Thanks @blissful garden !! I’ll take a look shortly and today/tomorrow will summarize my parts of the appendix. Once they’re all done I’ll wrap an appendix map and conclusion, unless @versed flax wants to write the conclusion.

blissful garden
#

Added regression curves (using logistic regression because accuracy is bounded between 0 and 1).
This does support our claim that CFG inference efficiency is good for Lambada where small LLaMA beats SOTA. But it sucks on most of the others.
How should we present this?
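
For readers wondering what such a curve fit could look like, here is one simple way to do it, not necessarily the one used for the paper's plots: fit a straight line to logit(accuracy) against log10(FLOP), then map predictions back through the sigmoid so they stay in (0, 1). The data points are synthetic, generated from a known curve purely to check the fit; `fit_logistic_curve` is a hypothetical helper.

```python
import math


def logit(p):
    """Inverse of the sigmoid; requires 0 < p < 1."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_curve(log_flop, acc):
    """Least-squares line through (log_flop, logit(acc)).

    Returns (slope, intercept) of the line in logit space.
    """
    ys = [logit(a) for a in acc]
    n = len(log_flop)
    mean_x = sum(log_flop) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in log_flop)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_flop, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(slope, intercept, x):
    """Predicted accuracy at log10(FLOP) = x; always strictly between 0 and 1."""
    return sigmoid(slope * x + intercept)


# Synthetic check: points drawn from a known curve, NOT benchmark results.
true_slope, true_intercept = 0.8, -17.0
xs = [20.0, 21.0, 22.0, 23.0]  # log10 of inference FLOP
accs = [sigmoid(true_slope * x + true_intercept) for x in xs]

slope, intercept = fit_logistic_curve(xs, accs)
print(slope, intercept, predict(slope, intercept, 24.0))
```

Fitting one curve per decoding setup (with and without CFG) and comparing them at equal FLOP is one way to present tasks where the gain is uneven.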

fallow egret
#

I will also try to close the CoT appendix today

versed flax
#

it still demonstrates that a smaller language model+cfg is a decent substitute for a bigger one

#

I need to write that part

#

@fallow egret I reworked your 2.1 to be a lot more rigorous and adapted the notations

#

I tried to satisfy as much as possible your desire for a strong maths background and nice derivations

fallow egret
#

👍 I will go over the section after I'll finish with the CoT subsection

versed flax
#

ty

loud adder
versed flax
#

inference

loud adder
#

(Also, there isn’t an “s” at the end)

versed flax
#

noted

loud adder
#

It’s a weird acronym, but it stands for FLoating point OPerations

versed flax
#

I guess I wanted to write FLOPs, bc I know what the acronym stands for

loud adder
#

People say “flops” orally as a natural pluralization

#

But that’s not really right (it would be like writing Ls for “litres”)

versed flax
#

gotcha

loud adder
#

And does cause confusion because there are things we want to measure in FLOP-seconds

#

(Also some people incorrectly use FLOPS thinking it’s FLoating Operations Per Second, like mph or rpm lol)

#

Just for extra confusion

versed flax
#

oh, was not aware of this one

blissful garden
#

oh yeah I have been always confused by that S in the end too

loud adder
#

If you’re concerned about it not always improving the result per inference FLOP, I would stress that a) that’s a really hard ask and b) 99% of users are VRAM bottlenecked, not FLOP bottlenecked

blissful garden
versed flax
loud adder
#

I was mostly asking because I was curious if that was a source of variation in the plots

#

So the couple tasks that are discontinuous… is that because of multiple model families being shown?

versed flax
#

yes

#

I need someone to check the inaccuracies of 2.2. I'm not a mathiness prodigy.

fallow egret
#

Yes, it's still incorrect- missing either normalization argument in the middle of eq 6 or using proportional

#
  • it's not 'equivalent to 2', it should be 'results in 2'
#
  • inconsistency in signs, ok I will go over this section soon, the big P is the final notation?
fallow egret
versed flax
fallow egret
#

small 'p' vs big P for probability

versed flax
#

Oooooooooh, I thought you mean sign as in +/-

#

you meant symbol

#

Do you like the section?

fallow egret
#

In 6 the last part of the equation is missing (going back to P(w | c) both in the numerator and denominator)

fallow egret
versed flax
fallow egret
#

No, I want to go back to equation 2

#

The last step is missing, it's exactly equation 2 🙂

versed flax
#

ah, I thought we wanted to go to eq 7 haha

fallow egret
#

No, the point in this equation is to connect the autoregressive formula in 7, directly to eq 2 in the original work

#

This is the theoretical justification...
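
The step being discussed can be sketched as follows, assuming the draft's equation 2 has the standard CFG form and its autoregressive equation is the per-token version of it; the draft itself is not shown here, so this is a reconstruction rather than a quote. With $w = (w_1, \dots, w_T)$, prompt $c$ and guidance strength $\gamma$, the chain rule collapses the per-token product back to sequence-level probabilities in both the numerator and the denominator:

```latex
\prod_{i=1}^{T} \frac{P(w_i \mid w_{<i}, c)^{\gamma}}{P(w_i \mid w_{<i})^{\gamma - 1}}
  = \frac{\left( \prod_{i=1}^{T} P(w_i \mid w_{<i}, c) \right)^{\gamma}}
         {\left( \prod_{i=1}^{T} P(w_i \mid w_{<i}) \right)^{\gamma - 1}}
  = \frac{P(w \mid c)^{\gamma}}{P(w)^{\gamma - 1}}
```

The identity holds for the unnormalized expressions. Normalization constants are omitted, which is why the chain needs either an explicit normalizer or a proportionality sign where it is stated as a distribution.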

versed flax
#

Ok. It needs to be made more explicit in the text then imho

fallow egret
#

It was explicit in text (the line after eq 6, we had 'this results in 2')

blissful garden
#

Was going through Sec 2 with @versed flax carefully and personally I'm good with the whole Section 2 now.
(just one minor remark left at the last sentence)

blissful garden
#

Appendix A.2 about the acc-FLOP chart is also done. I'm putting up this disclaimer. But feel free to add/change stuff.

patent gull
#

i read thru section 2.1.... i can see you guys put a lot of work into it and it definitely shows!! the language is really tight, the math is useful. I left some small comments.

Personally I think there is some stuff that i think might be in-the-weeds...

Two points:

  • The introduction of p(z) early on.... we don't use p(z) anywhere else. Do we need to spell out this term? How does it help? Isn't the reader already going to be thinking about latent spaces?
  • The exploration into sample noise and diffusion... how necessary is this? Does thinking about sampling noise help us think about LMs? bc we don't really think about noise so much in the same way. I guess this could serve as a useful history/teaching, and i think it comes down to a personal preference whether to include or not, but i think there's a fair argument to be made that it's not directly useful to the overall NLP focus
#

However, that being said, it does look good.

I honestly had liked the previous structure of introducing negative prompting more in Section 3.4, because it did make the point that it was more of a side exploration rather than a main exploration.

However, if we do commit to having it in 2.1, then 3.4 needs to be significantly tightened... like there's still some of the introductory text there. and if negative prompting is introduced in 2.1 instead, maybe some of that text could be moved to 2.1 and then 3.4 is really just "negative prompting, as described in 2.1"

blissful garden
#

Personally I'm okay with either a hand-wavy section 2 saying that we are inspired by SD, or a rigorous section 2 with careful derivations from SD despite notations being useless in other place.

patent gull
#

btw yah i don't mean this as a criticism, just a point for discussion, and ultimately i do really like this version better than the last

blissful garden
patent gull
#

i mean if you have an explanation for how those 2 bullet points help the reader, i'm convinced. also i think there really is an argument just for teaching the reader

#

i think a counter-argument to my points is that it does really solidify that we have a strong CV background here

versed flax
patent gull
#

ok sure

fallow egret
#

Ok, I think CoT subsection is ready. I really like the edit that was done (probably @patent gull ?)
I added more experiments + some nice qualitative examples

patent gull
#

great!! thanks Elad!!

#

this is nitpicking and no rush, but i would love it, if it's easy, if Figure 2 could be redone with font size=14

#

or 16. just to match Figures 3/4

fallow egret
#

Sure, np

fallow egret
#

@patent gull I changed it (font-14), I hope this is what you meant...

patent gull
#

I’ll check thanks elad

fallow egret
#

@versed flax I added many comments in section 2; with all the latest changes it seems that there are currently many inaccuracies (mainly with respect to the mathematical part)

fallow egret
#

@versed flax Can we iterate on one of the issues of section 2 here? It will be easier and faster.
In equation 1, classifier guidance is introduced following the original paper. I don't understand why the given formula is unconditioned (first term after the approx). I attached the original formula from the paper. Observe that the two terms are different probabilities (the one with theta is the generator and the one with phi is the external classifier)

Observe that you can't apply Bayes' rule at this stage to get what you wrote, since we are still in the CG case here (not CFG), which means that the classifier probability function and the generative one are not the same function. You can apply it only when moving to the CFG section, where it is indeed the same function

P.S. Also, Bayes' theorem doesn't give you eq (1) in the CFG context (it should be divided by p(x)), but let's do it step by step...

patent gull
#

bc that's really what it's trying to show, right? We're supposed to see how CFG gets better?

#

it's kinda hard to parse that in the table with the numbers, since it's a different total in each row

#

each stack/row is in this order: [underperforms, ties, outperforms]
and then, different bar for each temp

patent gull
#

i can do that if you'd like

#

i guess those #s are easy enough to copy/paste

blissful garden
patent gull
#

up to you!

#

I'm gonna write the conclusion, then

loud adder
#

If I haven’t started working on this in the next 12 hours please ping me and remind me to do so.

patent gull
#

ok I'm done with the conclusion and done with my end of the appendix. I added a table of contents to the appendix, feel free to disagree with that design-choice

#

looks to me like we have v1.0 of a rough draft

#

I see one bit of orange text, let me address that. I haven't nearly begun to address all the comments, but I will do so

#

also i jotted down some limitations i could think of off the top of my head, at 2am, in the Conclusion

#

feel free to take a look and add your own... the more limitations we address, the better and more solid our paper

versed flax
#

Whether you decide to model them with different models or not, it's correct

#

CG just says "we use an external classifier to guide generation"

#

you say "I want to model P(x|c), I apply Bayes' rule, I get p(x) p(c|x), oh that's an unconditional generator and a classifier, let's train two networks", I don't see where your confusion comes from
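
For reference, the two statements in play can be written side by side. This is a sketch in generic notation (the draft's own equation is not shown here): $\theta$ parameterizes the generator, $\phi$ the external classifier, and $\gamma$ is the guidance strength.

```latex
% Bayes' rule for the true distribution P (an exact identity)
P(x \mid c) = \frac{P(x)\, P(c \mid x)}{P(c)} \;\propto\; P(x)\, P(c \mid x)

% Classifier guidance: two separately parameterized networks
\hat{P}(x \mid c) \;\propto\; P_\theta(x)\, P_\phi(c \mid x)^{\gamma}
```

The first line is an identity about one distribution. The second is a modeling choice that reproduces the first at $\gamma = 1$ only to the extent that $P_\theta$ and $P_\phi$ both match the same underlying $P$, which is exactly the assumption under discussion.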

#

There's indeed one small mistake here, it's the theta subscript. Fixed:

#

Other than that it's correct

fallow egret
versed flax
fallow egret
versed flax
#

What does that even mean?

fallow egret
#

Each model defines a different probability function...

#

The parameters are there for a reason, it's simply a different probability function

versed flax
#

and?

fallow egret
#

You used the bayes formula with respect to P_theta

versed flax
#

Are you saying that two models can't interact if they don't share the parameters?

fallow egret
#

I'm saying that P_theta(x|c) multiplied by P_phi(x|c) is not equal to P_phi(x|c)^2

#

which is what you used to get your equation..

versed flax
#

of course it is

fallow egret
#

of course it's not, it's not the same probability

versed flax
#

What does that even mean?

#

if you train P_phi(x|c) and P_theta(x|c), and they both are trained on a similar dataset, and are both expressive enough, they'll learn the same thing, P_phi=P_theta

fallow egret
#
  1. It's simply incorrect, otherwise ensemble methods would not work
  2. The external classifier doesn't have to be trained on the same data; that's an invalid implicit assumption
#
  3. They aren't trained on the same objective (one is generative and the other is discriminative)
versed flax
#

It's simply incorrect, otherwise ensemble methods will not work
Ensembles work because the "they are expressive enough" assumption breaks. Ensembles are a bug, not a feature. They work because your model P_theta doesn't perfectly model P (whatever your model is or is supposed to be), so you average their mistakes to smooth them out. When writing theoretical derivations like this, you can assume the model is perfect, and that's a common assumption

fallow egret
#

It makes the whole theoretical part invalid for no reason

versed flax
#

Because 1) it makes more sense to a reader to say that we need to guide an unconditional generator than a conditional one (why would it be needed then if it's already conditional?), 2) it saved me time in writing, with simpler explanations (time that ultimately got used up arguing this with you), and 3) no, it's correct, and if you don't believe me, I quoted the CFG paper where that equality is laid out explicitly.

#

You train a model P_theta to be an approximation of P, it's fair to equate them in theoretical equations.

fallow egret
#

It is incorrect for sure, you have two different probability functions on the same space, each one coming from a different model.
Having this assumption that they will converge to the same probability by some 'magic' is not valid in any applicable setting. Therefore, your theoretical framework doesn't model the reality.
In my opinion in this part things should be correct, this is the most important thing

#

@blissful garden @patent gull Can someone help with that?

versed flax
fallow egret
#

And in this case it's not the same objective and data...

versed flax
#

You absolutely get two very extremely similar ones. Or you're just not properly training your model.

fallow egret
versed flax
#

If that were true, it would just make it impossible to compare models as the accuracy of two instances of a model trained twice would be vastly different

#

and that's also why ensembles work better with models with different architectures, slightly different training data, and model types. The quirks in approximating the theoretical P won't be the same ones.

fallow egret
#

I gave you a clear reference (and I can give more) that this assumption is very controversial.
IMO we should not have this assumption, since as I said there is no reason to have this assumption.

versed flax
#

Okay, whatever. You can propose a fix, but I'm not wasting more time on this

fallow egret
#

I've been rejected over much less controversial assumptions in the theoretical part...

versed flax
#

Good this isn't a theoretical work but more of an experimental then

#

The whole theoretical part was developed to please you

fallow egret
#

My work that was rejected was also not theoretical.
Reviewers are searching for these implicit problematic assumptions (I'm also doing it as a reviewer)

versed flax
#

Then go reject "Diffusion Models Beat GANs on Image Synthesis" (NeurIPS 2021) which introduced classifier guidance 🤷‍♂️

#

Or "Score-based Generative Modeling Through Stochastic Differential Equations" (ICLR 2021)

fallow egret
# versed flax

There is no issue with what they wrote, they define here a reverse diffusion process, they are not claiming that the two probabilities are the same

versed flax
#

They totally use a classifier in the same way

#

their eq 2 is literally the same one you're complaining about

fallow egret
#

No, it's not because their generative model is unconditional (they start with unconditional diffusion model). In our case we apply a conditional generative model (as in CFG paper)

#

In any case they are not claiming that p_theta(x|c) = p_phi(x|c)

versed flax
#

Look. I'm not wasting more time on this. We end this. Feel free to propose a nice, correct, high-quality and fully written-up fix.
I've spent two full days fixing your 2.1, which exists only to please your desire for theoretical grounding.

The only next action I'm taking on this non issue is clicking an Accept or Reject button.

fallow egret
#

I don't understand what was the issue with the original version that was correct from a theoretical perspective.
What are your thoughts? Maybe I'm biased as a mathematician, but in my opinion the theoretical part should be accurate
@patent gull @loud adder @blissful garden

versed flax
#

Alex said "impressive improvements" and Honglu proof read that section so much that we basically co wrote it

patent gull
#

I’m not sure I am fully following the back-and-forth of this argument. And what I have to add certainly won’t settle it in a satisfying way. However I remember having a similar argument with my lab mate.

I have my own classifier-guided control paper: https://arxiv.org/pdf/2301.02299.pdf that has the same setup that @versed flax wrote… even less principled lol bc I don’t even notate two different sets of parameters.

Indeed, because of pretraining, p_theta(x) and p_phi(x|c) cannot even be assumed to be of the same linguistic domains. In our case, p(x) was vanilla GPT2 (i.e. general web) and p(x|c) was trained on news. One of the improvements we noticed was actually just due to fine tuning p(x) on the news domain (which shouldn’t have to happen in a theoretically perfect world).

No reviewer noticed. There are other classifier-based works with similar setups in NLP: FUDGE (https://arxiv.org/abs/2104.05218) and PPLM (https://arxiv.org/abs/1912.02164).

Indeed my lab mate published his work explicitly trying to address this: https://arxiv.org/abs/2205.14219.

At the end of the day, yes it is a problem, my labmate got a paper out of addressing the problem, BUT there is also a rich history of methods in this space and it’s uncontroversial at this point IMO. Most importantly, PPLM, FUDGE and my work all ALSO showed effectiveness, so it’s not an invalid setup

#

Yes, these works are all *CL, which maybe has a different set of reviewers and reviewer concerns than ICML/Neurips/etc. I’m less familiar with those reviewers. But I do think we should move on

#

I do think we can have a more comfortable debate about section 2 once we feel really good about the whole rest of the paper

versed flax
#

I've spent more time on Sec2 than the rest of the paper combined. Definitely agree that we should move on. As I said, if someone is displeased with the current state, they're free to submit a good fix, but going back and forth in chats and criticizing isn't productive

patent gull
#

Yeah….. I mean. Yeah. I’m trying to think of a concise way to frame this debate. Honestly maybe one sentence about it and then cite NADO (my lab mate's paper) as proof that it’s an issue with classifier-based methods

#

But it doesn’t affect our work since we’re not using classifier-based guidance

fallow egret
#

@patent gull I agree that we can move on and go back to that in the end. I simply don't understand why we need to trust that the reviewer will not notice it, when it's a completely unnecessary assumption and simply using the conditional formula resolves the issue

patent gull
#

I mean yeah it can literally be a sentence at the end of 2.1 saying “these works face issues….”

#

Yeah but that section is more summarizing the lit

#

It’s a problem w the lit

#

We don’t use classifieds

#

Classifiers

#

It’s not our theoretical problem

#

It’s the lit’s problem. Certainly in NLP, where it is NOT addressed typically

fallow egret
#

I agree that it's not our problem, this is why I don't understand why we need to insert this issue in the first place, when it's completely unnecessary in our case and just raises unrelated questions

versed flax
#

Just submit a good fix, Elad.

patent gull
#

I mean I’m not trying to say it’s not important. It just doesn’t affect us so I think a reviewer would be wrong to point it out as a flaw with OUR work

#

I think I can put a short line in section 2.1 addressing this

versed flax
#

Be productive. When you complained "The intro should start with the problem............." Alex proposed "We should swap first and second paragraph". One comment is clearly more productive and usable than the other and they still addressed the same point. Propose your fix.

fallow egret
versed flax
#

He didn't either. He just proposed a solution.

fallow egret
versed flax
#

No, that's not "it", the text around it needs to be reworked as well, and address why we use classifier guidance since the model is already conditional.

fallow egret
#

What do you mean? This is how you perform classifier guidance, you enhance the conditional effect on the model by external classifier

#

In CFG you also use a classifier guidance (where classifier is defined using your own model), on conditional generative model...

patent gull
#

I don’t have the equations in my head right now (sorry, away from my computer)

#

But I’ll look and have an opinion on this when I’m in the office

#

I do feel like this isn’t top priority though since it’s entirely concerning background work (if I’m understanding correctly)

versed flax
#

It's concerning notations on background work

#

It's the minorest thing in the minor things we have to address

fallow egret
#

I agree it's not top priority, but IMO there are a few issues in sec 2 that should be resolved before submission

blissful garden
#

Just woke up... Give me some time to read through.....

patent gull
#

My sense is that 2.1 has been steadily getting longer, denser and ultimately harder for the reader to get thru before getting to our real contribution but I honestly don’t have it in my head really because I’ve been focusing on other things

blissful garden
versed flax
blissful garden
#

oh I see

tepid gazelle
#

Hey btw, I'm going to be reverting the V3 of triviaqa https://github.com/EleutherAI/lm-evaluation-harness/pull/610 in the eval harness upstream. The results on it do not match llama's performance whatsoever (way higher than they report), while V2 roughly does when accounting for the prompt. Exact match is meant to be exact. That has its own problems, but we don't want to be able to rate "not Mark Twain" as correct if "Mark Twain" is the expected ground truth

versed flax
#

I'm sorry I wasted some time and negatively impacted the productivity

tepid gazelle
#

no need to apologize at all!!

#

sorry for intruding on your project channel

versed flax
#

oh lol don't worry about it

tepid gazelle
#

just wanted to alert since that changes what scores yall should report/maybe rerun unfortunately

versed flax
#

yes, indeed. I think we will just report how we got those results

blissful garden
#

so we use our original numbers for triviaqa?

versed flax
#

bc there were many instances where the model generated something like "Mark Twain" (with the quotes) or This is Mark Twain (can you confirm @blissful garden ?)

blissful garden
#

yeah I guess there are pros and cons for each. I dumped the write-out files and inspected them manually. Using substring was a lot better
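
The trade-off between the two scoring rules can be sketched in a few lines of Python. This is a minimal illustration, not the eval harness's code: `normalize`, `exact_match` and `substring_match` are hypothetical helpers, and the normalization (lowercasing, dropping punctuation, collapsing whitespace) is an assumption.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and collapse whitespace (assumed normalization)."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def exact_match(prediction: str, answer: str) -> bool:
    """True only if the whole normalized prediction equals the normalized answer."""
    return normalize(prediction) == normalize(answer)


def substring_match(prediction: str, answer: str) -> bool:
    """True if the normalized answer appears anywhere in the normalized prediction."""
    return normalize(answer) in normalize(prediction)


if __name__ == "__main__":
    for pred in ['"Mark Twain"', "This is Mark Twain", "not Mark Twain"]:
        print(repr(pred), exact_match(pred, "Mark Twain"), substring_match(pred, "Mark Twain"))
```

Substring matching accepts both the quoted generation and "This is Mark Twain", but it also accepts "not Mark Twain", which is the kind of false positive that exact match is meant to rule out.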

blissful garden
#

here if you want to see

loud adder
patent gull
#

Reg. Figure 9, the FLOPs tests
can we
(1) do statistical significance tests on the plots (I think f-tests is the right one?)
(2) draw confidence regions? We can establish these using bootstrapping, I think
the hypothesis we really want, and that the figure seems to support right now, is "CFG is statistically equivalent across most tasks to a similar-budget model". But #1 and #2 will help us really show it
I think this is an important finding if we can prove it, and warrants its own short section in the main paper
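
A percentile bootstrap for a single accuracy number could look roughly like the sketch below. It rests on assumptions: `bootstrap_ci` is a hypothetical helper, the 0/1 scores are made up, and 1,000 resamples with a 95% interval are illustrative choices rather than what was used for Figure 9.

```python
import random
import statistics


def bootstrap_ci(samples, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement, recompute the mean each time, then sort the means.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical per-example correctness for one task: 120 right, 80 wrong.
    scores = [1] * 120 + [0] * 80
    low, high = bootstrap_ci(scores)
    print(f"accuracy = {statistics.fmean(scores):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Running this once per model and per task gives the intervals needed to draw confidence regions around each curve.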

versed flax
#

"CFG is statistically equivalent across most tasks to a similar-budget model"
it might not be the case though. But the difference doesn't look huge

blissful garden
loud adder
#

I am surprised by the citation in

A ``prompt'' is typically used to condition on the generation, containing task instructions, context, and a small set of examples \cite{flan}.
Why was this chosen? The FLAN paper is about finetuning models on instruction-formatted data

#

(Also, FLAN and T0 came out at the same time with the same core idea: it's almost always correct to cite both of them when it's correct to cite one of them if you're not citing your use of their specific model or something)

versed flax
#

Because it looked like a great paper to show what an instruction actually is

patent gull
loud adder
#

I feel like the GPT-3 paper and this are more appropriate papers to cite https://arxiv.org/abs/2102.07350

patent gull
#

bootstrapping is great for confidence intervals over metrics/etc. all kinds of things with non-normal distributions

loud adder
#

What is this "Fundamental limitations of alignment in large language models" paper that we're citing a lot?

#

Okay, only twice since the other three are commented out

versed flax
#

IIRC it's a paper that we use to talk about system prompts. Probably not the best one

loud adder
#

Gotcha. No worries, I don't expect y'all to have the literature and chronology memorized 🙂

versed flax
#

Pretty simple: I have very very little knowledge of the NLP lit

#

I have a pretty darn good grasp of the vision lit, but NLP... close to none

loud adder
#

The goal of these questions is to get an idea of what you're looking to cite so I can identify papers that may be a better fit

versed flax
#

gotcha

loud adder
#

It would be a good idea to submit a PR to the HuggingFace transformers library that includes CFG as a LogitsWarper or whatever it's called

#

This will substantially increase the chances of people using the methodology because that's how most people get their LLMs
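
For reference, the core of such a logits-level hook is small. The sketch below is an assumption-laden illustration in plain Python, not the HuggingFace API: `log_softmax` and `cfg_logits` are hypothetical names, and the formula used is the usual CFG combination in log-probability space, uncond + gamma * (cond - uncond), which is not spelled out in this conversation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cfg_logits(cond_logits, uncond_logits, gamma):
    """Combine prompt-conditioned and unconditioned next-token scores.

    gamma = 1 recovers the conditional distribution, gamma = 0 the
    unconditional one, and gamma > 1 pushes further toward the prompt.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    return [u + gamma * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    cond = [2.0, 0.5, -1.0]    # hypothetical logits given the prompt
    uncond = [0.1, 0.2, 0.3]   # hypothetical logits without the prompt
    print(cfg_logits(cond, uncond, gamma=1.5))
```

In a real integration the two logit vectors would come from two forward passes of the same model, one with the prompt and one without, which is why the hook needs access to the model as well as the logits.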

versed flax
loud adder
versed flax
#

I'll submit the PR in few hours then

blissful garden
loud adder
#

If you say "I've been collaborating with EleutherAI on this and we have a paper coming out on Friday", and tag me then they won't have any issue with it 😛

versed flax
#

hahaha

#

noted

loud adder
#

@versed flax Do the "prompt alignment" techniques require finetuning or are they inference-time like ours?

Various approaches have been proposed to address this, including prompt alignment \cite{alignment} and fine-tuning \cite{instructgpt,flan,sanhmultitask}.

versed flax
loud adder
#

Oh that's the Anthropic paper

versed flax
#

In more humble situations, like Character.ai and the likes, it's prompt alignment

loud adder
#

What does "prompt alignment" mean

versed flax
#

Sorry, approximate language here. It means there's a system prompt describing the chatbot's intended behavior.

#

("This is a conversation between Person A and Eric Cartman:
Cartman: Hey you, leave me alone!
Person A:")

versed flax
loud adder
#

I don't think doing so is very important.

versed flax
patent gull
# loud adder It would be a good idea to submit a PR to the HuggingFace `transformers` library...

personally, i don't think the "right" huggingface implementation of CFG is the logit-wrapper implementation.

I think putting it in the forward method of a CFG-head model, maybe as a mixin, is the more hugging-face appropriate way, looking at how they build their models.

I have something like that implemented, although my class is a bit more of a monstrosity bc it's doing different things, but:

https://github.com/Vermeille/lm-evaluation-harness-cfg/blob/cfg-alex/log_logits_on_p3.py#L52-L79

#

Definitely this is a side-discussion, but since we're talking about code...

loud adder
#

I don't have a strong feeling, and getting this feedback from the maintainers is another reason to open the issue early 🙂

patent gull
#

SGTM

#

yah i always just found the logitwarpers approach to be a little awkward, since we needed to pass in input_ids and model but there was already a model inside, and logits getting generated. idk. felt weird

versed flax
#

I added this in the enumeration in the introduction:

\item We show that for the same inference cost, one can train a model that is half the size and obtain similar performance on those benchmarks;

#

I should maybe add it into the abstract as well

#

it's a fairly strong result

versed flax
#

An important question remains and I have no idea what the answer is: Should we acknowledge CAD?

loud adder
versed flax
#

oh you're still on it

loud adder
#

Yeah, got dragged into a meeting and then had to cook dinner but I’m back at work 🙂

versed flax
loud adder
#

It’s really good. Stuff has come together really well in the past two weeks

#

Basically all my comments and edits are about copy editing and optimizing the presentation

versed flax
loud adder
#

(I was going to add affiliations)

#

@versed flax Is there something unique or special about SD's negative guidance? We discuss negative guidance in the VQGAN-CLIP paper, and I'm under the impression it's something that can be done with any T2I model

versed flax
#

Midjourney and DALL-E don't.

loud adder
#

Oh interesting

#

I didn’t realize that

versed flax
loud adder
loud adder
versed flax
oak ore
#

negative guidance could be done with anything that uses cfg. sd doesn't do anything special there. midjourney & dalle are closed models with restricted APIs, and negative guidance isn't part of the exposed API

versed flax
#

(to be fair they certainly use a preset hidden neg prompt. It's just so good)

oak ore
#

we do something fancier than that actually, tho I can't go into detail

versed flax
#

oh, you're working for one of those orgs?

oak ore
#

yeah I'm at midjourney

versed flax
#

Would you happen to be hiring French dudes? 😎

oak ore
#

I don't think we've ever shipped negative prompts in the way sd does them

oak ore
versed flax
loud adder
#

Sorry to derail the convo. I’m going to go on a walk before the sun sets then finish up

patent gull
#

@loud adder please let me know if you'd like to see anything else in Section 4

#

I know you mentioned causal approaches way back before we put 4 together

versed flax
#

I'm starting to be sleepy. Try your luck if you need something but I may not answer, sorry

loud adder
versed flax
#

I'm working on the PR for now, but not for long :)

#

Thank you for your contribution

versed flax
#

I have addressed most of the edits.

  • @patent gull, Stella did edit the intro to Sec 3 and removed the parts that articulate the section into the various prompt types. I'll let you proof read and accept / reject the changes. I did not want to do it for you, it's your part.
  • @patent gull please accept / reject the edits to the abstract so that I know they're correct.
  • @patent gull Stella advocates moving your Related Works to the appendix. I quite like it but I understand her point, the paper is already quite long. However I feel like focusing on the CV background only is weird.
  • @loud adder the main unchecked thing I have now is this new figure you're suggesting and that I don't fully see.
  • @patent gull / @loud adder I answered most of your comments in the sidebar. Once you've read my answer and find it satisfactory, can you mark them as resolved? If you leave them open I don't know if we can move on.
patent gull
#

Ok I’ll check. I think I largely agree with these changes. The section 3 header was a little too structured-feeling

#

And I think there’s a way to discuss classifier guidance and contrastive decoding in NLP in 1-2 sentences in section 2

#

Thus reducing the need for the related works section

#

I was thinking if we want to reduce the length, there might be some plots and tables throughout that could be moved to the appendix as well. But we don’t have a page limit for arxiv so less worried about length

versed flax
#

The paper is quite long. Would it be unreasonable to add a toc for the arxiv release?

loud adder
#

(I say this as someone who has multiple arXiv papers that are > 80 pages long)

versed flax
#

Yes, totally. It's long. My proposed "easy fix" is to add a toc for the arxiv release. The reader can then glance what the paper is about and choose what to read. Dunno if it's unreasonable

patent gull
#

I don’t think Toc is necessary for a 12 pager

#

The appendix already has a toc

versed flax
#

That's fair

patent gull
#

Like I honestly don’t like toc, they don’t end up being descriptive enough or useful to me

#

hmmm so @loud adder what is the desired page limit? 10?

versed flax
#

That's fair²

patent gull
#

8?

#

we have a ton of plots and tables that can be summed up with 1 line and moved to app. Also have language that can be tightened throughout

loud adder
#

6-10 pages is the typical length of a ML paper

patent gull
#

Alright sounds good

#

I know I can move my line plots to the appendix. I’ll take a look at all the results tables we have and think about some that can get condensed

#

Or moved

loud adder
#

She said that her koala started to speak English when she was about six months old and has since been able "to understand the words of people"., who lives in a kangaroo colony near Kogarah on Queensland’s Sunshine Coast, told news.com.au he had never seen anything like it before:
This sample from the appendix seems to show an abrupt change in topic half way through.

versed flax
#

I'll do that tonight. I have to get going with my job for now.

#

(I'm still reading and answering here, I just won't do anything meaningful now)

versed flax
#

@loud adder fyi this is the response from sgugger to the PR

let's see if the community requests this added feature before implementing it in the library proper :-)

loud adder
#

I saw and don't understand, but w/e

patent gull
#

i went through the edits/comments and agree with pretty much all of them

#

would you like to discuss \subsection{Relation to instruction tuning}?

#

let me know

versed flax
patent gull
#

i accepted most that i saw from you/me

#

I'm leaving comments up there

#

since most of them feel like they're still open

versed flax
blissful garden
#

do we still move the FLOP stuff up to main text? There seems to be a lot of stuff in the main text already

versed flax
#

I'm pretty sure we can find something that is less important than this result, and move it to the appendix / remove it instead

blissful garden
#

@versed flax I also saw the MusicGen PR and traced it back to the paper
https://arxiv.org/pdf/2209.15352.pdf
It seems equation 4 is exactly what we are doing here...... They are so much earlier than us. We should probably say like although we weren't aware of them in the beginning, our work can be seen as generalizing their technique to text-to-text models with a comprehensive analysis.

versed flax
blissful garden
#

They have a Figure 3 for ablation of CFG and that was it. But they did study that

#

I actually start to think this paper is closer to us than CAD. And they simply just didn't realize the generality of this technique.

versed flax
blissful garden
#

just throwing it out there and see if we want to add one short sentence to acknowledge their work as well.

#

In our field if our upcoming work has any resemblance to any other group's previous stuff, we'd send our draft to them in case they have remarks (but don't wait for it and all the submission schedules would be unchanged)
(and usually they say "good work!" and connect with you)

versed flax
#

I've never done that. I don't know what's customary in ML

fallow egret
#

I agree that although they apply CFG to an autoregressive model, it's a different field, so in that sense it's similar to text2image models and less relevant than CAD. We might want to add one line referring to their work, but I don't think we should do more than that...

unique sedge
#

Make a best-effort attempt to cover related work for what you are doing, but if it's not immediately in your vertical (subfield/adjacent field) it's okay to miss it. Very rarely would a reviewer reject your paper because you haven't mentioned one paper (they might ask you to add it though), unless the omission is egregious.

patent gull
#

i think we should include it in related works, or as another citation in the intro to CFG!

#

it's very cool

#

i think it's quite clear that they applied this idea from the text-to-image lit

#

weird that that guy is keeping such close tabs on HF PRs that he noticed it lol

versed flax
#

So we're like 99% there. I see 2 remaining TODOs in the paper:

  • 1 figure that @loud adder asked for but that I don't really understand. Do you understand what she meant @patent gull ?
  • 1 flop analysis. @blissful garden I see you're on this, do you need a second brain?

I did another pass on the paper, fixed the figure flow in the appendix (it was totally chaotic), and the various things I mentioned earlier

#

We're 99% done => what's this 1% I'm not seeing, besides those two points, and how I can work on it? Has someone identified some incomplete work? I'm so deep inside it that I have a hard time keeping track of the progress lol

blissful garden
versed flax
#

(I have no idea what that is either)

blissful garden
#

it does seem to directly tackle the comparisons of regressions though

versed flax
blissful garden
versed flax
#

Well then, maybe this paper will be able to fly on ArXiv before Friday then!

blissful garden
#

all p values are super small again... @patent gull what conclusions are we looking for other than their adjusted means are not from the same distribution?
(I'm using original samples not bootstrapped samples btw)

versed flax
#

Well then if p-values are small, it means that 2x vanilla and CFG aren't indistinguishable (which I expected)

patent gull
#

no, p-vals being small means they are distinguishable (ah which is what you said)

#

something must be up... the 95% bootstrapped confidence intervals are totally overlapping

#

hmm

#

are you running ancova on the normal values, or the log-normalized values?

blissful garden
#

log normalized

patent gull
#

😠 ugh thought that was it for a second..

blissful garden
#

log(x) vs log(1-y) - log(y), and then linear regression

patent gull
#

why log(1-y) - log(y)?

blissful garden
#

logistic regression

#

y bounded between 0 and 1

patent gull
#

@fallow egret you have a whole section in the appendix to write:

#

\subsection{Deliberative Prompting: Chain-of-Thought}

patent gull
#
blissful garden
#

actually ANCOVA only tests the slope, right?

blissful garden
patent gull
#

ancova stands for analysis of covariance, which i assumed meant the covariance between x ~ y

#

does anyone in this channel have a go-to significance test that they use for testing 2 regressions?

patent gull
#

i don't think you need to transform it using logistic regression

blissful garden
#

I was doing this in the beginning until I realized every line just intersects when x is large (y getting close to 1)

patent gull
#

ahh so those plots aren't just showing x and y

#

they're showing some transformation?

#

Fig 10

blissful garden
#

the plots are just log(x) and y

#

but the curves are logistic regressions between log(x) and y

#

which is the same as linear regression between log(x) and log(1-y) - log(y)
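
The equivalence can be checked numerically with a short sketch. The numbers are made up and `fit_line` and `neg_logit` are hypothetical helpers. Points are generated exactly on a logistic curve in log(x), then an ordinary least-squares line is fitted to log(1-y) - log(y). On noise-free data the line recovers the curve's parameters with flipped sign, because log(1-y) - log(y) is the negative of the logit. With noisy data the two fits weight errors differently, so they agree only approximately.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def neg_logit(y):
    """The transform used in the thread: log(1 - y) - log(y), for 0 < y < 1."""
    return math.log(1.0 - y) - math.log(y)


# Hypothetical logistic curve: accuracy = sigmoid(a + b * log(flops)).
a, b = -3.0, 0.5
flops = [1e1, 1e2, 1e3, 1e4, 1e5]
log_x = [math.log(x) for x in flops]
accuracy = [1.0 / (1.0 + math.exp(-(a + b * lx))) for lx in log_x]

slope, intercept = fit_line(log_x, [neg_logit(y) for y in accuracy])

if __name__ == "__main__":
    # The fitted line should be close to slope = -b and intercept = -a.
    print(slope, intercept)
```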

patent gull
#

ohh ok ok i thought you were doing some fancy multinomial fit thing

#

i think scipy optimize has a multinomial fit

#

anyway

#

ok... if you wanna send me the data i can also try significance testing

#

otherwise we can also just say "the two lines are indistinguishable on a 95% confidence interval"

blissful garden
#

so if two groups have similar slopes, ANCOVA will give high p?

patent gull
#

that's what i thought. but if you see those SO links I sent, there are other proposals for significance tests

#

😭 that first link says chi-squared test of coefficients, a partial f-test and a t-test...

#

can't say i've heard of a "partial f-test" before, nor do i know which one is appropriate in this case

blissful garden
#

@patent gull data and codes sent (a bunch of stuff in DM but I don't want to spam the channel). Feel free to play with it

#

meanwhile let me also check out those links

#

Maybe the conclusion is indeed we have different slopes (or covariance) for each task

patent gull
#

cool cool

blissful garden
#

Seems like most of the stuff related to me is done. There is this one question left:

  • do we want to compress the codegen results in the main text by for example reporting temp=0.2 only? This way we combine the three tables together. (will still need all temps at least in appendix to fully showcase the trade-off of adherence-creativity)
loud adder
#

@fallow egret you wrote the interpretability section, right? I wanted to chat about that as I feel like I’m missing something as I read it

patent gull
#

I wrote it, yeah, I’ll be available in a bit to talk about it

#

(in a bit meaning like 1-2 hours)

blissful garden
versed flax
#

@loud adder there's some debate on the MT section. Does it belong to the main text or the appendix? We're trying to make the paper shorter. The section shows:

  • it's a generative task, but so is CoT
  • CFG brings 10% improvement on MT on base models
  • it didn't work on tuned models (so it makes the positive results a bit pointless, since people will obviously use the tuned models)
  • there is no further insight.
unique sedge
versed flax
#

Right.

loud adder
unique sedge
blissful garden
#

I see the Section 2 is 2 pages long. If we are really compressing the pages, maybe we should also consider moving most of the math to the appendix as well. Although I'm a math guy, my guess is that most audience just wants to see one equation and a short story before going straight to charts and conclusions.

loud adder
#

I think all of 2.2 is necessary, and it’s hard to see what equations can be cut from 2.1

#

I recall thinking the prose was a little overdone though, maybe we can cut some of it

fallow egret
blissful garden
#

@versed flax I'm pretty sure you flipped the order of the model labels

patent gull
versed flax
fallow egret
versed flax
#

You're right

#

@patent gull fyi \usepackage{subfigure} broke the LaTeX. I'm commenting it. It doesn't seem to break anything else. No idea what you needed it for.

patent gull
#

Ugh so sorry man

#

I was trying to put two figures side by side

#

The gpt4all figure and the humeval win rate fig

versed flax
#

You did that with \subfloat already didn't you?

patent gull
#

Yeahhh I didn’t want them to have letters, and I didn’t want them to break the counter

versed flax
#

ah

patent gull
#

But it wasn’t breaking latex when I first imported it

#

Maybe after several compiles

versed flax
#

maybe yeah

patent gull
#

Or maybe my latex was caching something

#

But anyway I would’ve never just left it broken if I had known

#

My bad

versed flax
#

I was definitely awake and working on the paper when you did that and I did not notice the breaking

#

lol I know

patent gull
#

Overleaf caches a lot of intermediate files…. Thinking about it more, that may have been it :/

versed flax
#

it's no drama

patent gull
#

Cool

#

@loud adder let me know when you’d like to chat about the interpretation section. I’ll be at my computer in 5 min

patent gull
#

@fallow egret can you redo your figures with plt.rc('font', size=16)?
Also what is the y-axis? accuracy? can you label Accuracy (%)?

#

(I'm questioning whether we need the results to look like this, and whether we can format them like Figure 1, i.e. a table)

#

saves space. but it would be a very, very small table lol

#

oh sorry i see the legend. so, my visual graphics opinion is that legend-based hues should be different trials not metrics. Metrics imo belong on a dual y-axis

fallow egret
#

Yes, it will be narrow and long table

patent gull
#

dual y-axis, sorry

fallow egret
#

So what to change?
I used @blissful garden code to be aligned with his figures...

patent gull
#

if you send me the data i can change it

loud adder
#

Which figure is this

patent gull
#

Figure 2

#

there's also wasted vertical space in the right-hand image, due to the need to project both into the same scale, which a dual y-axis would solve

loud adder
#

What are the conclusions I’m supposed to reach reading these plots @fallow egret

fallow egret
#

This:
using CFG increases the percentage of CoT which results in a valid answer that could be parsed. For low guidance strengths, this results in boosting the model performances. However, for large values, although the model returns more valid results, the quality of the chains is also impacted, and overall the model performances degrade.

patent gull
#

how did we evaluate the quality of the chains?

#

qualitatively?

#

(or did i write that?)

versed flax
#

I would put a shorter version of that in the caption as well

fallow egret
loud adder
patent gull
#

ideally i would like error bars/ confidence region as well (which you can get via bootstrapping)

#

here are the things I'd like changed:

  • make height smaller (half as large)
  • make a dual y-axis and remove the legend
  • make the font a lot bigger
  • confidence regions on the plot
  • x-axis label should match Figure 4 (Guidance Strength (CFG \gamma))

If you would like me to do that, send me the data and I'll take a look

fallow egret
patent gull
#

(A) if you have the data saved, bootstrap sampling just means resampling
(B) our other line graphs in the main body have confidence

loud adder
#

Also if this is being done with the eval harness, it does bootstrapping for you. That's what the "acc_stderr" etc values are from.

fallow egret
patent gull
#

cool

fallow egret
patent gull
#

how many stds are common for error bars?

loud adder
patent gull
#

i think central limit theorem says in large limits, 2 stds = bootstrap @ 95 confidence?

loud adder
#

The code is here: https://github.com/EleutherAI/lm-evaluation-harness/blob/72b7f0c00a6ff94632c5b873fc24e093ae74fa47/lm_eval/metrics.py#L192

I'm not a statistician, but my understanding is that this is the right way to do a bootstrap CI

patent gull
#

anyway, this shouldn't be a debate. good science means error bars and confidence regions

#

especially in the main body

loud adder
fallow egret
#

Ok, I can add it

patent gull
#

yeah i mean i'm just addressing the debate over whether to include them or not

loud adder
#

I think Elad was against it when he thought that 6 hours was per CFG per iteration

patent gull
#

fair enough

loud adder
#

Under the reasonable assessment that we don't have 1,000 days to run stuff for

#

But yeah, the default setting is to run 1,000 iterations for the bootstrap CI

fallow egret
#

Yes, there is no problem to do that with this code. My only very small concern is that we don't provide it in all the other figures (in the appendix)

loud adder
#

(this is also why the evals are slightly non-deterministic, the "score" we report is the median of the runs)

fallow egret
#

I was thinking of bootstrapping by running multiple experiments, which is going to take forever... nvm

#

Ok, so Alex, do you want to send me the code I should use for the figures, or do you want me to produce the numbers for the figures?

loud adder
#

No worries!

patent gull
#

if you produce the numbers that would probably be easiest collectively for both of us

#

we're .5 page away from 10 pages

#

there is a heavily commented region of text with multiple "why is this needed" comments in 2.2:


Next, we need to define what is considered conditioning, $c$, in decoder-only language models. In the common situations, a user provides a \textit{prompt} $c$ which can be a context, an instruction, or the beginning of some text, and uses a language model to sample a sequence of continuation tokens $w_i$ for the prompt $c$. Since a good continuation is expected to highly correlate to the prompt, we consider the prompt as our conditioning.

I suggest cutting it
loud adder
#

@patent gull Can you turn on link sharing so I can pass a copy to some people for their feedback?

patent gull
#

yeah

#

it's on

#

I think i'll take a stab at mocking the table described here:
\textbf{New Figure: shows several examples of (prompt, initial segment, completion) triples. Examples should show diverse relationships between the prompt and the initial text, including ones where the prompt is at the beginning and ones where the prompt includes text after the question (``let's think about it step by step'' perhaps)}

and then bother people to fill in stuff for their sections

#

but i don't understand initial segment. Is that, like, CoT-related?

versed flax
patent gull
#

i see, i see... i see the difference between token-logits in NLP and vision semantic spaces

#

i think that's useful

#

i have a standing comment there about the word embedding and sentence embeddings sentence... i don't think that's useful, nor necessarily helpful in thinking about token logits

#

how is conditioning $c$ in NLP different from conditioning in vision?

versed flax
patent gull
#

ah.. i see. ok so there's a field difference here that i'm not understanding/aware of

#

fair!

versed flax
#

the later layers of image generators are never the ones you'd want to manipulate

loud adder
versed flax
versed flax
patent gull
#

Next, we need to define what is considered conditioning, $c$, in decoder-only language models. In the common situations, a user provides a \textit{prompt} $c$ which can be a context, an instruction, or the beginning of some text, and uses a language model to sample a sequence of continuation tokens $w_i$ for the prompt $c$.

I'm talking about this paragraph.... my impression was that conditioning with prompts was kinda the same thing between vision and nlp

patent gull
loud adder
versed flax
patent gull
# loud adder Oh, this confused people so I meant to write up an explainer...

So in section 3, we break up (prompt, continuation) variations into the following overall framework.

Prompt is what is supplied by the user/dev. Continuation is generated by the model. All variations branch from there.

cot: <prompt, [cot, continuation]>
text-to-text: <long prompt, long continuation>
chatbot: <[system prompt, user prompt], continuation>
#

so the idea of initial segment as a third, distinct category doesn't really fit

loud adder
#

So I'm picturing something like this

patent gull
#

yeah... i would just prefer a breakdown that mirrors the structure of our paper... more comprehensible to readers

loud adder
#

Or this

loud adder
versed flax
# patent gull Next, we need to define what is considered conditioning, $c$, in decoder-only...

It's drawing a distinction between the model's conditioning and the task's conditioning. The model's conditioning on a prefix, which arises naturally from sequential autoregressive sampling, might not align with the task: a user might want to generate a text that ends with or contains some predefined conditioning text, or text in a specific style or something.

This is just to say that we align the task's conditioning to the model's conditioning by expressing it as a prefix. This might be too trivial.

patent gull
#

alright here's the table... ultimately we may be able to lose the middle column if the example column is descriptive enough:

#

@versed flax can you fill in a good example for Assistant Prompting?
@blissful garden can you fill in a good example for code-gen?
@fallow egret can you fill in a good example from cot?

I can do basic prompting

versed flax
# patent gull

Clearly the examples are not fitting in that last column lol.
Also, how about we write "prompt => completion" rather than "(prompt, completion)" which makes more obvious what the input and outputs are?

patent gull
#

yeah i think we can fiddle with it a bit after it's all full

#

i definitely think we can lose that middle column

#

alright. did some polishing. gotta turn to my day job now

fallow egret
patent gull
#

yeah

fallow egret
patent gull
#

no take a look at that table

#

and Vermifuge already put in one

#

should ideally be short

fallow egret
versed flax
#

just make one up

#

it's meant to illustrate how we categorize the test cases

patent gull
#

i see. yeah feel free to use '...', as well

fallow egret
#

Ok, np I can make it zero-shot, I hope it will not confuse the reader...

patent gull
#

hmm i hope not either. let's see. there might be space to write "here's an example...." in the prompt, and then clarify in the caption. but we'll get a feel for it when we see it

fallow egret
#

Oh, I see we also need to write the reasoning, lol I don't know how to squeeze all this stuff in

patent gull
#

i think just put in what you feel is good and complete, don't worry about space right now

#

we'll massage and standardize once all the examples are in

versed flax
#

("how many egg boxes to buy to have 24?", ("For a box of 12 eggs, that's 24/12=2", "The answer is 2"))
something like that?

fallow egret
#

Ok, sounds good. I thought we want a real example

versed flax
#

they would be too long I guess

fallow egret
#

Yes, indeed sounds good. I just hope it's not more confusing for the reader

versed flax
#

what's possibly confusing about it?

fallow egret
versed flax
#

gotcha, that's fair

fallow egret
#

But it's a good solution to put in such an example...

loud adder
# patent gull

If this is in response to my request for a figure, the key thing is that this doesn't make it obvious what the CFG is attaching to. You should state that explicitly in the figure

versed flax
#

Damn, yes. Maybe we just need a figure like

gamma * LLM("The dragon flew over Paris, France, on Saturday evening when") + (1 - gamma) * LLM("on Saturday evening when")
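
The weighted combination above can be sketched numerically. This is a minimal illustration, assuming `LLM(...)` stands for a vector of next-token logits; the helper names `cfg_combine` and `softmax` and the toy two-token vocabulary are hypothetical and not taken from the paper's code.

```python
import math


def cfg_combine(cond_logits, uncond_logits, gamma):
    """gamma * conditional + (1 - gamma) * unconditional, per token."""
    return [gamma * c + (1 - gamma) * u
            for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Toy logits over a two-token vocabulary (assumed values).
    with_prompt = [2.0, 0.0]     # context includes "The dragon flew over Paris, France, ..."
    without_prompt = [1.0, 1.0]  # context is only "on Saturday evening when"
    for gamma in (0.0, 1.0, 1.5):
        mixed = cfg_combine(with_prompt, without_prompt, gamma)
        print(gamma, mixed, softmax(mixed))
```

With gamma = 1 the result is just the prompted prediction, gamma = 0 drops the prompt entirely, and gamma > 1 extrapolates past the prompted prediction, away from the unprompted one.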

loud adder
#

I think color coding the text makes it pretty clear

loud adder
versed flax
#

I understand your screenshots now

blissful garden
versed flax
loud adder
#

I was really surprised I couldn't find a good example of what I had in mind

versed flax
#

for real. It's so obvious we all overlooked it

blissful garden
versed flax
#

something like that, but pretty?

versed flax
loud adder
#

Oooo is this intending to be a latent space representation? I like it

versed flax
#

Yes

blissful garden
versed flax
#

slightly better

blissful garden
#

maybe worth putting in section 2?

#

(I know we are long but one good picture is better than a lot of words)

versed flax
#

Gotta be in the front page imho

#

that's the whole paper in a nutshell

#

good enough?

blissful garden
versed flax
#

hahaha true

loud adder
#

Needs a little prettying up

#

But it's really good

versed flax
#

\caption{We show a 2D projection of a textual latent space $(x_0, x_1)$. We embed our text both with and without the prompt ``Today in France,'', and we walk from the promptless embedding in the direction to the prompted embedding with step size $\gamma$. Defining $\gamma>1$ overemphasizes the prompt, leading to better behavior and performance gains.}

loud adder
#

Maybe put it in bold by the 1.5, as its being emphasized there

#

And in normal text by the 1?

versed flax
#

like this?

loud adder
#

"it" = "today in france"

versed flax
#

ooooooooooh gotcha

loud adder
#

So
y = 1.5 Today in France, citizens were celebrating
y = 1 Today in France, citizens were celebrating
y = 0 citizens were celebrating

#

I would definitely use the word "notional" or "hypothetical" or something like that in the caption, lest someone think we think this is what it actually looks like

versed flax
versed flax
# patent gull

I was doubting the usefulness of this table, and I do think we're better off without it. What it does is explain how we split Sec 3, which is imho not important enough to remain in the paper

#

I think it just came from a misinterpretation of Stella's point. Now that we have the actual meaning and did the right thing, I don't see the use of this

loud adder
#

I futzed with the formatting a bunch and think this looks way better. Thoughts?

unique sedge
#

Latex god PrayGe

versed flax
fallow egret
#

@loud adder , @patent gull After looking at the eval harness code, it doesn't apply bootstrap for the acc metric; it simply reports the std with respect to the 0/1 outcomes (if I understand correctly).
Are we fine with that?
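
For comparison, a standard error computed directly from per-example 0/1 scores can be sketched as follows. This is a minimal sketch assuming the usual binomial formula sqrt(p(1-p)/n); whether the harness divides by n or n-1 is not established here, and the names `binary_stderr` and `normal_ci` are hypothetical. The z = 1.96 default is the normal-approximation multiplier for a 95% interval, which is where the "2 stds" rule of thumb comes from.

```python
import math


def binary_stderr(outcomes):
    """Standard error of the mean of 0/1 outcomes: sqrt(p * (1 - p) / n)."""
    n = len(outcomes)
    p = sum(outcomes) / n
    return math.sqrt(p * (1 - p) / n)


def normal_ci(outcomes, z=1.96):
    """Normal-approximation interval: mean +/- z standard errors."""
    p = sum(outcomes) / len(outcomes)
    se = binary_stderr(outcomes)
    return p - z * se, p + z * se


if __name__ == "__main__":
    saved = [1] * 70 + [0] * 30  # hypothetical: 70 correct out of 100
    print(binary_stderr(saved), normal_ci(saved))
```

For reasonably large evaluation sets this interval and a percentile bootstrap interval over the same outcomes should be close, so either could back the error bars.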

loud adder
#

Title page: I might remove the figure title? My main point of annoyance here is that it's not centered tbh.

Model surgery: "model editing" is a more common term, at least in NLP. As for whether there are techniques for doing this at inference time, we recently wrote a paper about it where we edit the entire Pythia suite. AFAIK this is the first example that's effective at scale: https://arxiv.org/abs/2306.03819

Table 2: I agree with @versed flax that this doesn't really accomplish what I was hoping to accomplish. I also think it's formally correct at the expense of being clear, and that if we do something like this it should be formatted like a NL document and not a tuple of strings

versed flax
loud adder
#

My main outstanding concerns relate to readability to NLP people, and I sent a copy to two who said they’d be able to provide feedback by the end of the day.

versed flax
loud adder
#

It’s less about obscurity and more about readability & communicating ideas effectively

blissful garden
#

I'm going to rewrite some of my FLOP appendix because we now have the main text.

loud adder
#

There are some areas I find a little weird, or at least not how I would have written it, but it’s hard for me to tell if that’s personal style, language differences, field cultural differences, or something else

#

I added space at the end of the paper for acknowledgments and (at the beginning of the appendix) author contributions. Other than CoreWeave for providing compute, is there anyone in particular we want to thank? People you showed the draft to and got useful feedback from, people who provided compute for experiments other than the pod I provided, etc.

blissful garden
loud adder
#

And by “sure” I actually mean “we are contractually obligated to do so”

versed flax
#

some of my friends for taking part in the human evaluation berk

loud adder
#

We can absolutely thank the volunteers for our human experiments

blissful garden
#

@patent gull when you are free could you quickly glance Appendix C.2 (mostly the last paragraph) to see if I'm still missing anything?

I also capitalized all the "ANCOVA" because everybody seems to capitalize it. Also put "p" inside $ for the minor difference of fonts for math variables.

Another note: in Section 4, you wrote in the last paragraph "a P-sized model...". It seems the P is not used. Should it be removed?

patent gull
#

lol can we thank all the volunteers by the random names we applied to them in the web interface?

#

They’ll know

#

I was “rogue”

blissful garden
#

I forgot mine😭

patent gull
#

Oh man hahaha

fallow egret
#

I also modified the captions of the figures.
If there is anything else on my side let me know

patent gull
#

Ok cool thanks elad!!

#

Will get this done after work hours

fallow egret
#

Table 2 was removed in the end?

patent gull
patent gull
fallow egret
#

I'm not seeing it in the text currently. I think it was a good decision to remove it

patent gull
#

I also don’t know what my funding sources want

#

What is the protocol y’all typically follow for this?

blissful garden
versed flax
#

idk if I should thank my company who were also lenient enough to allow me some time to work on that while not being my mission at all

patent gull
#

Hahaha

loud adder
fallow egret
patent gull
#

Ok

loud adder
#

It’s assumed that your employer is sponsoring your research. Acknowledgements are typically for non-obvious sources of support

#

And like @fallow egret says, it’s often contractually obligated

blissful garden
versed flax
#

I took a shot at drafting the Author Contributions appendix. I did it from unreliable memory. I invite everyone to read it and fix it. @blissful garden , @patent gull , you guys worked a lot together and I might have mixed up some of your contributions. @loud adder , I genuinely have no idea how to properly phrase the supervising role you had and may have forgotten things. @fallow egret and @unique sedge, make sure I did not forget anything.

#

This is my first time drafting such a thing and I genuinely have no idea how to word it, which level of detail to go into, etc.

fallow egret
#

I'm completely fine with what is written. I think from a style perspective it should be more general, without the specific details. But I also haven't written such a section in any of my papers, so I'm not sure

versed flax
#

I did my best being fair and indeed that's maybe a bit too detailed. I'm waiting for the feedback of people more seasoned than I am

blissful garden
fallow egret
versed flax
#

okay maybe I shouldn't be that specific with section numbers

blissful garden
#

lgtm

versed flax
blissful garden
versed flax
blissful garden
#

Oh I added C.1 to my part. I did C1-3 altogether

versed flax
#

perfect!

#

3.1 is still unattributed. It's the standard benchmark section

#

the paper flies tomorrow 🥳

fallow egret
versed flax
#

wait, the paper doesn't go live as soon as posted??

blissful garden
blissful garden
loud adder
#

Yeah it's weird. And it's made worse by the fact it skips a day: you'd think papers received by 2 pm EST friday would go out at 8 pm EST friday, but they don't for reasons I don't understand

#

Schedule tl;dr

  • If the paper is submitted by 1800 UTC on Friday it goes out at the end of the day on Sunday
  • If the paper is submitted by 1800 UTC on Monday it goes out at the end of the day on Monday
versed flax
#

so we can do tomorrow 6pm, right?

loud adder
#

Yes

versed flax
#

Awesome! So this is our last night working on it :)

blissful garden
#

@versed flax Oh the legends of Pythia and GPT2 charts are missing

#

By the way I will be busy travelling internationally tomorrow. Hope we don't find any last-minute things related to my parts, but just to say I might not be available for quite a while.
Doing some final proofreading right now

versed flax
patent gull
#

phew just ending work. a lot to catch up on. what is needed from me? besides some proof-reading?

versed flax
#

Just make sure I didn't forget something about you

patent gull
#

haha ok i'll take a look, thanks man

#

uh dumb question: in Figure 1, what's the difference between \gamma = 1 and 1.5?

#

just the bolding? looks like the same output to me

versed flax
#

yes

patent gull
#

uhh i'm confused haha. I kinda glanced at the discussion around this plot, but i thought the point was to show that when we traversed towards higher \gamma, the generation changed

#

what's the point of it?

#

ohh wow someone did a lot of cutting... it's only 9.5 pages

versed flax
#

Showing how we fiddle with the latent space. But you have a point

patent gull
#

IMO, gamma=1.5 should be perfect, 1.1 should be not as perfect, and 1.0 should be blah

#

oh wait it starts at 0

#

ohhh this is showing literal prompt emphasis wow i'm dumb

versed flax
#

yes

#

That's the caption lmao

#

but you have a point, it wouldn't hurt to show the continuations as well

patent gull
#

that's the caption....

#

yes

versed flax
#

gamma=0 => "citizen were celebrating summer"
gamma=1 => "Today in France, citizen were celebrating Christmas"
gamma=1.5 => "Today in France, citizen were celebrating Bastille Day"
something like that?

patent gull
#

so the prompt is "today in france" and the continuation is "citizenS were celebrating..."?

#

i don't really get the point of gamma=0, we don't test on that, and why would we expect it to be even close to being on topic? I would expect total garbage from it

#

but anyway yeah, I would do something like:

gamma=1 "Today in France, the weather was decent in London" (i.e. meandering, definitely topic-switches)
gamma=1.1 "Today in France, the weather was good for citizens" (i.e. not great, kinda passable)
gamma=1.5 "Today in France, the citizens celebrated in good weather" (i.e. good, on-topic)

#

and then underline "Today in France" and we can update the caption to be way more explicit about each time-step

versed flax
#

what

patent gull
#

idk just a thought

versed flax
#

how is "good weather" related to France?

patent gull
#

idk i was thinking of showing something that would change topic by the end

#

I guess your gamma=1 isn't bad

versed flax
#

how about
gamma=0 => "citizens were celebrating Independence Day"
gamma=1 => "Today in France, citizens were celebrating Christmas"
gamma=1.5 => "Today in France, citizens were celebrating Bastille Day"

patent gull
#

haha ok, yeah that sounds good

#

but why would gamma=0 be like that at all, though

#

wouldn't we expect random generation?

versed flax
#

No?

patent gull
#

gamma =0 means completely unprompted

#

so it could easily be "chickens fly to trees"

#

or whatever

versed flax
#

The prompt only is "Today in France"

#

"citizen were celebrating" is the beginning of a continuation

patent gull
#

oh we're doing like a multipart prompt?

versed flax
#

Ah, that was in the caption but got deleted

patent gull
#

ok... so "citizens were celebrating" should be underlined... "Today in France" should be bigger/bolder each time

versed flax
#

You're overthinking it wayyyy too much

patent gull
#

and then in caption we should say "start of continuation is underlined", "prompt is bolded" according to strength

#

or something

versed flax
#

Wait you're actually totally right

#

gamma=0 is totally unrelated

#

whoopsie

patent gull
#

honestly i think we shouldn't show gamma=0

#

we should just start the line at gamma=1 and underneath say (baseline)

#

that way it's clear that we're just improving above baseline

versed flax
#

It's important.

patent gull
#

we can do:

gamma=1 => "Today in France, citizens were celebrating July 4th"
gamma=1.1 => "Today in France, citizens were celebrating Christmas"
gamma=1.5 => "Today in France, citizens were celebrating Bastille Day"

versed flax
#

But now I want to add
gamma=0.5 => "Today in France, citizens were celebrating Independence Day"

#

ah! great convergence lol

patent gull
#

yah bc July 4th is illogical

#

whoops boss is calling me brb. those are my 2 cents for the figure

versed flax
#

tbh I'm not exactly comfortable showing a wrong continuation for gamma=1

patent gull
#

Haha I think that’s great

#

But that is accurate, isn’t it?

#

Like that’s what would happen

#

That’s a beautiful graphic imo

versed flax
#

It's much better

#

Thank you for catching that

patent gull
#

Haha no problem

#

Something is breaking at work so I need to brb but I’ll be back later. Did a quick scan of the paper, it seems really good

blissful garden
versed flax
loud adder
versed flax
#

whoops

loud adder
#

Pretending that the cause of the change of the holiday was changing what day the model thought it was

versed flax
#

I guess I should go to bed and sleep, then haha

patent gull
#

@fallow egret i'm redoing the plots now

fallow egret
#

Let me know if something is missing

patent gull
#

I'm looking over and it seems like the aqua plots are even more impressive

#

i have two questions:

  1. Was there any specific reason you put aqua in the appendix and gsm8k in main body?
  2. I'm thinking of putting them all in the main body, but only reporting one metric (probably accuracy). Is this OK? The metrics seem highly correlated. Are there any specific insights we get from % invalid?
fallow egret
#
  1. Not really. GSM8K is the more standard benchmark (it's bigger and appears in more previous works), but I don't think it's that important for the order
  2. Yes, they are not correlated for high CFG values, which I think is very important
patent gull
#

hmmmmmm i see

fallow egret
#

For low values they are indeed correlated: you get more valid results and increased accuracy. However, for larger values, you still get the same high valid percentage but the accuracy breaks, which means the quality of the reasoning chains deteriorates

patent gull
#

that seems to me like invalid % is a coarser metric

#

oh wait

#

well considering the confidence regions, it seems to me that invalid % stays pretty constant

#

how is invalid % calculated?

fallow egret
#

It's 1 if you get a parsed result (otherwise 0), and invalid % is simply the fraction of non-parsed results: sum(res==0)/len(res)

patent gull
#

i'm a bit confused. isn't accuracy strictly bounded by % invalid?

#

in Aqua Guanaco, how can there be more invalid, but also more correct?

#

i guess it's different portions of the dataset, but still, seems counterintuitive to me

#

what is more important for practitioners to be able to measure? An invalid answer or an incorrect answer? Can't we have heuristics to reject invalid answers? And then, what is the accuracy only on the valid answers? Do people look at that?

fallow egret
#
  1. If you have 20% invalid but all the answers in the remaining 80% are correct, you have 80% accuracy. On the other hand, if you have 10% invalid and only 50% of the remaining 90% are correct, then you have 45% accuracy
#
  1. We have a heuristic: as written in the paper, we follow the self-consistency parsing protocol
#

we follow their exact protocol both with respect to prompt and parsing protocol
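
The arithmetic in the 20%/80% versus 10%/45% example can be checked with a small sketch. Assumptions are labeled: `None` stands in for an answer that could not be parsed, accuracy is taken as correct answers divided by all examples, and the name `score` and the toy data are hypothetical rather than the paper's evaluation code.

```python
def score(predictions, golds):
    """Return (accuracy, invalid_rate) over all examples.

    A prediction of None represents a chain whose answer could not be parsed.
    """
    n = len(golds)
    invalid = sum(1 for p in predictions if p is None)
    correct = sum(1 for p, g in zip(predictions, golds)
                  if p is not None and p == g)
    return correct / n, invalid / n


if __name__ == "__main__":
    golds = ["x"] * 100
    # Scenario A: 20 unparsable answers, every parsed answer is correct.
    preds_a = ["x"] * 80 + [None] * 20
    # Scenario B: 10 unparsable answers, half of the parsed answers are wrong.
    preds_b = ["x"] * 45 + ["y"] * 45 + [None] * 10
    print(score(preds_a, golds))
    print(score(preds_b, golds))
```

Scenario B has fewer invalid answers but lower accuracy, which is why the two curves can diverge at high guidance strengths.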

patent gull
#

i see so accuracy is also a function of % invalid

#

i guess i'm just wondering if there's a way to include both aqua and gsm8k in the main body, but only with accuracy. I guess it's an interesting point about the different CFG values, though

fallow egret
#

It's not precision.
It's (number of correct answers, no matter valid/invalid) / (number of examples)

patent gull
#

i see.. sorry it's late for me

#

brain's not working

fallow egret
#

sure, completely understood 🙂
I was also wondering what the correct way to do that is. But I think the invalid metric is super important for explaining what is happening, and the exact effect of the CFG

patent gull
#

yeahh i see that...

#

hmmmm let me try one thing

#

ugh yeah it's really hard to see it working as a table...

#

acc and invalid % would probably do best stacked in parentheses, but we've established a different visual vocabulary for parentheses elsewhere. Also hard for the eye to really follow

fallow egret
#

Yes, I think a graph is much more readable than a table in this case

patent gull
#

so when i plot them like this:

fallow egret
#

Yes, this might work