#Evaluating Classifier-Free Guidance impact
then I'd say it's better stated in the intro. Though it introduces the notation in Sec2
Yes, I agree that something currently looks wrong in the first paragraphs of section 2 (until 2.1)
Really? I thought I was rephrasing the flow and fixing the grammar and adding the last paragraph.
It was done before all our conversations, so if it was important to look at your original text, I'm sorry
This is fine, as I said feel free to change, I just want to have access to the original version, to see the diffs (it's just the account limitations, we should work with premium)
let me transfer the ownership to Stella then
@versed flax you can also subscribe and cancel. It will not charge you anything, and we will have 14 days to work with premium
Ha, good.
Just remember to cancel
IMO the line of works and presenting the prompt alignment issue should be only in the intro. In section 2 there should be only the formal notations (defining the first tokens in the sequence as the prompt)
woohoo we have full history!
yup I made that change
just commented out my remarks for Section 2 to make it cleaner. For the record I didn't touch the content of 2.1 and 2.2 except the equation numbering change.
Just in time, I think I addressed all the issues (please check).
Also I filled all the citations
Section 3 needs a roadmap before diving into 3.1... it's our most important section and we need to prime the reader to get all of our great results
wait.. it's all just inputted from another file
hm oh. right. catching up
Yes, you missed all the fun yesterday
lucky me
ok.. what's the plan with Section 3? Elad thinks it needs a rewrite? I actually think that just putting in more structure:
- road map in the start
- first/last sentences in each subsection tying it to the overall picture
is fine
so do we restructure section 3?
Elad has a point in terms of classifying benchmarks based on tasks. But we need to think carefully about that. Might mean we split up some tables and stuff
"classifying benchmarks based on tasks" <- can someone summarize this for me?
cc @patent gull
here, seems like we have 3 categories for the general benchmarks
cool thanks
ok yeah i do agree with this
I was thinking that even just doing points 1/3/4 would really clarify things
so overall do we do
- common sense reasoning
- close-book QA
- completion
- machine translation
- code generation
but 5 makes a lot of sense to me, too. #2 looks more like a statement to me? idk what's the proposal, there
Yes, 2 is actually explaining why 5 is needed
gotcha
sure. I did like the idea of lm_harness as its own standalone, since I think people will start to see that more and more, and just know what it is
but i think breaking it up is also something I support
does this imply that we need to redo Figure 1 to be more broken up?
ok... here's what i think. I think the order should be:
close-book QA
completion
common sense reasoning
machine translation
code generation
And I think we need to designate a single person to take this on, or at least direct other people, since it's a bigger-picture change that touches a lot of people's work
yes, will need smaller table for each task like the llama paper
I think it's fine to put it in one figure and refer in different subsections to the same figure
Sounds great
oh if it works like this I'm fine
I'm excited, I think it will really improve the quality of the paper!
there was also a concern that it might turn out that we tried too many common sense reasoning tasks
i see
completion is just lambada, QA are just triviaqa and sciq, machine translation is just wmt14 fr-en, and then a whole army of reasoning tasks
As I said, in my opinion it will actually reduce their volume since all of this list only gets one subsection (assuming all subsections will have ~ the same length, as they should)
it's also funny that some non-reasoning tasks (sciq and lambada) tend to get bigger improvements than reasoning tasks which I don't understand
alright i need to get breakfast/etc. can someone fill in this structure here with all the tasks that are gonna go in each part? just visually so we can agree on it and not go back-and-forth a million times:
https://docs.google.com/document/d/1LEDBU1xtue-x4793hRGNOLMJmgDqKou6LpNO8md8IUU/edit?usp=sharing
Section 3 close-book QA completion common sense reasoning machine translation code generation
alright, i'll be back in 10-20. I think once it's mapped out, I might be able to take on this task of rewriting Section 3, but not sure if i'm mentally able to after EMNLP
There are some sporadic experiments like the GPT-J codegen completion and my initial image generation tasks. Maybe let me move them to appendix?
Also moving the figure 1 up to 3.1 (somehow it ended up in 3.3)
I think it's a good idea
When would be a productive time for me to read through the paper today
@loud adder here are where your comments might be most appreciated/productive:
- section 1,2 are done. Style edits/thoughts appreciated
- section. 3 we are debating structure. If you could see how the structure feels to you currently, and whether you think it needs an overhaul, that would be great
- section 4, we'd appreciate your thoughts on content. last time I checked (pre EMNLP) it was in a good state writing-wise. Not sure if it changed. But your thoughts on the content and whether you think it's good/conclusive or needs more would be very appreciated!
Cant speak to the rest of the sections personally
Meh. I see their value as introductory experiments, but are they interesting enough to be moved after the more detailed and thorough experiments, in the appendix?
Yeah i was seeing them the same way and used it to introduce the humaneval. But if we base the whole section on well-established benchmarks, they don't seem to have places to go
I imagine if we structure around benchmark tasks it will be tightly written like "here is the QA, and next is reasoning, and next..."
Then I'd try to compress them a lot (1 sentence) and move the detail to the appendix as you propose, yes
Yeah… "early experiments showed…" or "initial runs suggest that…"
exactly
Yeah this definitely works.
i'm reading through Section 3 again with respect to @fallow egret 's new structure and now I actually have a counterargument...
the current organization scheme can roughly (with some reorg) be broken down into different "Methods", not "Tasks"... so another way to reorg would be:
- 3.1: zero-shot prompting
- 3.5: chain of thought
- 3.2-3.3 text-to-text generation
- 3.4: negative prompting (although maybe this deserves its own section?)
I actually think this is a little bit more logical than breaking it down by "Tasks" because this paper is really more about the CFG mechanism than it is about the semantics of the tasks.
With this breakdown, we can make a big-picture story that we're really trying to probe CFG with different parts of the prompt/prompt formulations. Those categories above are a nice breakdown
besides, i think if we break down into tasks, we'd have to have insights/hypotheses about why it does well at the tasks, specifically, and we don't have anything besides hand-waviness or trying to find citations.
This "methods" breakdown is really more about viewing CFG with different prompt setups, which then Section 4 addresses much more directly
Yeah and the army of reasoning tasks is gonna stand out. Would be hard to justify our bias other than saying "other people do it too"
right, it's a little unbalanced if we go with the task breakdown
I'm completely fine with this 'method' split as long as we emphasize it. I agree that this distinction sounds indeed much better. The only thing is that I think it's a little bit strange to classify the code generation as text-to-text + it sounds like there should be a merge between 3.2-3.3
i'm open to either way, it was just something that came to mind, btw
i think yeah my only qualm with "Tasks" was I was starting to write it in my mind
and realized that I didn't have a great hypothesis/justification for why CFG would do well in common sense, QA, etc.
besides just repeating over and over again "more adherence to the prompt"
Yes, I agree that it's much better split and better stress the different effect on each method
which maybe that'll also work, idk. but it doesn't feel like it builds, actually, in the same way the methods split does
plus, your work gets to stand on its own, now
ok i'll take a stab at putting that structure in the beginning, and if it doesn't work, we can always evaluate and take a diff direction
Lol, this doesn't affect my vote
Yeah if there is a narrative to combine translation with codegen I'm fine
Ok, so for me the method split sounds indeed more reasonable, any objection?
I'm trapped because of timezones but will do another pass later
i'm gonna update the old_sections/experiment file
we can always go a different direction with another file lol
By the way IMO working with separate tex file for each section is much more convenient
alright, i did a little bit showing what kind of structure I have in mind. Haven't finished up the last parts ... will be back in a bit to do that
language might be a little sloppy, especially around the "we hypothesize..." bits.... feel free to change!!
I like this idea
Negative prompting probably deserves its own section if its like a good bridge between going from tasks to the section where we talk about why it works.
Every section should add something to the global thread and make the case stronger
This makes a lot of sense. A lot.
Negative prompting needs to be addressed separately I guess, especially for future work
We have those really good human evals though, right?
We need to find a way to make it work. It's just too powerful. It won't be for this paper but it's good to address it
Yes. That's a "mild" neg prompting situation
But it's still one.
Hmmm let's see. Idk it still might fit
I mean, it's not as granular or interesting as I wanted it to be, but it's still neg prompting and the results are quite awesome
In my opinion I think it fits in with a method-driven reorg of section 3
Although maybe we'll be able to evaluate better once it's all written
I'll work on that as I get home
i had a thought for another explanatory experiment, what do you guys think
so we argue that CFG increases the adherence to the prompt
this implies that true continuation w_c is more likely under true prompt w_p vs. another random prompt w_{p'} in the CFG setting vs. the vanilla setting
so we measure \delta p = p(w_c | w_p) - p(w_c | w_{p'})
Yes. That's what we're trying to measure with KL, Kendall tau etc
Uh?
what we're testing with KL, Kendall, etc. is whether the logit distributions of CFG look similar to Instruction-tuned models
not explicitly whether they're following the prompts better
and what I'm saying that if a model ISN'T following the prompt well, we would expect this delta:
$\delta p = p(w_c | w_p) - p(w_c | w_{p'})$
to be lower
than a model that is
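A rough sketch of how this delta could be computed under vanilla decoding versus CFG. Everything here is an assumption for illustration: `toy_logits` is a made-up stand-in for a real LM, and the CFG mixing rule (unconditional log-probs plus gamma times the conditional/unconditional gap, renormalized) is the usual form, not necessarily the exact one in the draft.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cfg_log_probs(cond_logits, uncond_logits, gamma):
    """Next-token log-probs under CFG (assumed form): uncond + gamma * (cond - uncond)."""
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [u + gamma * (c - u) for c, u in zip(cond, uncond)]
    return log_softmax(mixed)  # renormalize so the result is a distribution


def continuation_log_prob(next_token_logits, prompt, continuation, gamma=1.0):
    """Sum of log p(token | prompt, prefix) over the continuation; gamma=1 is vanilla."""
    total, prefix = 0.0, []
    for tok in continuation:
        cond = next_token_logits(prompt, prefix)
        uncond = next_token_logits(None, prefix)  # None = no prompt
        total += cfg_log_probs(cond, uncond, gamma)[tok]
        prefix.append(tok)
    return total


def delta_p(next_token_logits, true_prompt, random_prompt, continuation, gamma=1.0):
    """delta p = p(w_c | w_p) - p(w_c | w_p') for a given guidance strength."""
    lp_true = continuation_log_prob(next_token_logits, true_prompt, continuation, gamma)
    lp_rand = continuation_log_prob(next_token_logits, random_prompt, continuation, gamma)
    return math.exp(lp_true) - math.exp(lp_rand)


def toy_logits(prompt, prefix):
    """Hypothetical 3-token 'model': the prompt (a token id) gets a +1 logit bonus."""
    logits = [0.0, 0.0, 0.0]
    if prompt is not None:
        logits[prompt] += 1.0
    return logits


if __name__ == "__main__":
    for gamma in (1.0, 1.5, 2.0):
        print(gamma, delta_p(toy_logits, 0, 1, [0, 0], gamma))
```

With a real model one would swap `toy_logits` for a function returning the LM's next-token logits, and average the delta over many (prompt, continuation, random prompt) triples.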
Alex Spangher
You do, with entropy, don't you?
we're testing something more like:
$m = < p_1(w_c | w_p) || p_2(w_c | w_p)>$
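For reference, a minimal sketch of the two comparison measures named above (KL and Kendall tau) applied to two next-token distributions for the same prompt. The function names and the tau-a variant are assumptions for illustration; the thread does not say which implementation or tie handling the paper uses.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def kendall_tau(a, b):
    """Kendall tau-a between two score lists; tied pairs count as neither
    concordant nor discordant. O(n^2), fine for a top-k slice of the vocabulary."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    # Hypothetical next-token distributions from two models on the same prompt.
    p1 = [0.70, 0.20, 0.10]
    p2 = [0.50, 0.30, 0.20]
    print(kl_divergence(p1, p2), kendall_tau(p1, p2))
```

KL is sensitive to how much probability mass moves, while Kendall tau only looks at whether the two models rank tokens in the same order, so they can disagree.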
Alex Spangher
yeah I try to make the argument with entropy, but I was just thinking about another way to test the argument maybe more directly
anyway, I'm gonna keep editing section 3
that was just a passing thought
What would be more direct than cross entropy of gold targets?
lower entropy is evidence of prompt adherence, but not bullet proof
there are other reasons why entropy might decrease besides greater prompt adherence
It shows better language modeling
uhh yeah i think you're right
but in theory, the model could be both doing well on benchmarks and generating crappy english
totally possible to overfit on benchmarks
ok done editing Section 3
left some comments, didn't touch "Continuations"
but i did a lot of work trying to make it more structural and flow together better
please let's not make major changes without a discussion here!! I'll try to look at Section 4 later tonight or tomorrow
who wants to take a stab at the conclusion? if no one does by the time i'm done with Section 4, then I will
i think we're close, everyone. it's shaping up
appendices need work, but the language in the main body is really coming together, I think
Cool!
I left some comments in the negative prompting section
If you take a pass at those then I can look later
And then we all have to start addressing that appendix lol…..
are [] in section 3 placeholders for citations? oh yeah they are. Fixed some stuff for Section 3.3
Sec 2.1 uses r, Sec 2.2 uses gamma (and so do all the figures). Has someone a well thought opinion on the notation we should prefer?
which is the most famous (cited) paper around this CFG or classifier guidance thing?
Well, Ho & Salimans is the reference paper for cfg
I always go for the notation of the most-known paper unless there is a counter-argument for the choice
Let's try to see if we can be consistent with theirs
IIRC sec 2.1 goes with CG's notation while 2.2 goes with CFG but I need to double check
ok no cfg uses w
it's the blog post that uses gamma
Looking at 3.1, fixed some minor naming and citation problems.
close-book QA \cite{}, common sense reasoning tasks \cite{}, and sentence completion-tasks \cite{}
Do we have citations for each of these task categories? I don't recall any.
I don't know any citation that would fit here (but my NLP culture is small)
Yeah unless we throw all the citations of benchmarks in their corresponding spots. Leave it here for now. If nobody has better idea, we can remove these empty citations.
hmm should we make this change across the paper? Seems like a big move and let's see other people's opinions
honestly I hate w which is already way too overloaded in deep learning
I hate it too. Also we have w_i for the prompting stuff
exactly. "words", "weights", omega
is there another paper using r or gamma?
I'm checking
by the way is it a standard practice to cite blog posts in ML?
I looked for standard latex practice, and the rule of thumb is "just do your best" lol
Ohh yah the notation definitely needs to be standardized
Would be good to include at least one citation that explains each task, otherwise we have to explain what they are and it's kinda tangential
Sorry I just threw those in there. Feel free to ignore. I can also do the work of finding those citations myself, sometimes it's just easier to divide the labor and not switch between lots of tabs
But my rule of thumb is "define or cite"
"If you can't cite, define. If you don't feel like defining, cite"
oh yeah it's totally fine. I'm already filling in a few for you in section 3
Cool cool
great job revising section 3 by the way!
oh a super minor question, I saw "... - i.e.". Is this alright? I mean I remember always seeing things like "..., i.e., ...".
cgf fix and Imagen use w too but we still hate it
we have a reason to override them though: w is already used in the explanations of prompting
In English you can add a note - like this - or (like this)
meaning this is not "... - ie." but "... - ie ... - ..." and is indeed correct
yeah but I just never see it in academic papers though.
Like in papers we also have to expand things in full instead of "we're", "I'm", "don't", "Let's", ... (unlike French lol). I just worry there is a rule somewhere
what about we keep using \gamma and add a footnote explaining our choice
the dudes in Imagen are just losing it and not even trying to hide it lmao
Then I advocate using gamma in 2.1 as well and not explain anything. We don't have to justify ourselves for changing a letter, do we?
notation is quite important though
Sorry I'm just trained as a mathematician. I guess it's alright in ML
You're the maths guy. however I don't recall reading a paper saying "sorry tho we changed the letter because it was unadapted xoxo". They just do it
(just like cfg do not justify their choice for not keeping s)
a lot of math papers don't explain either and some of them cause massive problems for younger people. We were taught to be kinder for the audiences
I would consider the notation more rigorously if this were a mathy paper where connecting properly to the previous work was important because of some complex derivation etc
but here the mathiness is mostly for us to look like cool kids and the equation is absolutely trivial
That being said, I do have some kind of French "good enough" attitude, and it sometimes needs not to be tolerated
French mathematicians are crazy and ruthless about details. Like literally driving us crazy when being a student
yeah I think it's alright to just follow what others do
Oh I'm not talking about french mathematicians, whatever their nationality, they're a species on their own lmao
oh that was a random remark about French guys
haha
we've talked about a lot of "others", which one do we follow then?
definitely not mathematicians lol
theeeeeeeeeeeen... gamma?
I was referring to Ho & Salimans. They choose their notation then we choose ours
learn from the best
I'm still learning the ML culture and I'm definitely not stubborn about my own habits
Gamma it is then. @fallow egret did you have a strong reason to use r in 2.1, and if so, should we reconsider the notation in the rest of the paper? If not, are we good changing r to gamma for consistency?
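For reference while standardizing, one common way to write CFG for an autoregressive LM with guidance strength \gamma. This is a sketch under assumed notation (c the prompt, w_i the i-th token), not necessarily the exact form used in Sec 2.2:

```latex
\log \widehat{P}(w_i \mid w_{<i}, c)
  = \log P(w_i \mid w_{<i})
  + \gamma \left( \log P(w_i \mid w_{<i}, c) - \log P(w_i \mid w_{<i}) \right)
  + \text{const}
```

In this form \gamma = 1 recovers the plain conditional model and \gamma = 0 the unconditional one. Ho & Salimans combine their score estimates with weights (1 + w) and -w, so their w corresponds to \gamma - 1 here.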
Damn I was quite happy with our former way of presenting CFG in the intro. It was more generic and we naturally derived the negative prompting and "promptless" setting easily
it's much harder now to go the other way "indeed... promptless is a particular case, you're not forced to negatively condition on the empty sequence, it can be anything, here's the actual generalized CFG formula haha what a nice trick we pulled on you!" lol
This was the notation in the original cfg paper, so I used their notation.
But I don't think that it's important to stick to previous work notation, so I'm good with changing to gamma
Do you have a theoretical justification why you are naming it CFG in this general case?
Because I don't see what is the classifier part in this case. So if you change it to generic notion the name should be changed. IMO the paper in the current state has much more theoretical meat and I found it more interesting and different (comparing to previous works)
great i went through and finished editing 3.4. There's one more little detail, @versed flax , and then I'll feel done. I feel like we're in a good spot with Section 3. Section 4 and 5 I wrote/edited. We're just a conclusion away from being done with the main body
I want to write the appendix for CoT, is there is any decision on the appendix structure?
I've been reading this like a soap opera, and just wanted to say that this is really cool work.
Also, this was pretty hilarious (from https://cfg.vermeille.fr/):
Prompt
How to choose a good learning rate?
Response
Sometimes you can't choose a learning rate. You can't control your learning rate. You have to let it run. It's like breathing. It's hard to control your breathing, but it's also what keeps you alive.
I talked about this paper with Yejin Choi this weekend and she thought it was quite interesting. A lot of her work recently has had a similar theme, in that it's oriented towards how we can induce high quality behavior in cheaper models. Most of her work has been in terms of producing higher quality synthetic datasets, but she was pleasantly surprised how much of an impact one can have at inference time based on this paper.
So cool!
I'm getting on a plane to come home, but will have notes by Monday
Thank you so much
@fallow egret, reading 2.1 I think you mixed \propto and \sim, I fixed that. I'll change r for \gamma later. It's totally omitting that CG uses the gradient of the external classifier otherwise people will actually wonder how that works (for now I reintroduced the commented sentence about it. Also the sentence following it started with "This modification" and there was no modification introduced)
I'll fix that tonight if you're okay
Elad thanks for starting the appendix!! I think you can write up your appendix section now, just summarize your tables and such. I'm not of the opinion that appendices need to have great stories and cohesion
We'll include a one-page appendix map and table of contents
But otherwise I don't really think it all needs to tie together. Maybe others have different opinions
- Yes, as I said while ago there is an issue with the constant, I was planning to do the opposite and carry a constant to make it more accurate, but not sure which thing is better.
- Sounds reasonable to add this sentence
- Yes, the modification word is indeed not related and should be omitted
But a super integrated appendix sometimes gets reviewers saying "this should've been another paper. REJECT." At least I've gotten that feedback before
That's so cool! Had she not heard about it already from Luke?
Agree, just what is the general structure? each subsection in section 3 will have additional results? Or we are mixing it? Because now there are charts, additional experiments and generated sample
That's a good q
I think we can roughly keep the same structure in the appendix as we do in the paper
But not every section is gonna have a ton of results in the appendix and I think that's ok
Let's just reference "see appendix" in section 3 whenever suitable
Btw I'm gonna be away from my computer most of today, headed to the beach. Will check later tonight
Let's everyone take a crack at the appendix. I think in general, if you put results in the appendix, you're responsible for summarizing them
will work on A and B.
@versed flax do you have the data and code for the Figure 9? I want to try and see how it looks if we add regression lines for red dots and blue dots separately.
I do
@versed flax what is the timeline? I understand that you want to publish it on Wednesday, so everything should be wrapped tomorrow?
By Wednesday, the paper is in a releaseable state. We use Thursday and Friday for stupid fixes like punctuation fixes, typos, emergency stuff. Friday night, paper is on ArXiv :)
Ok, this sounds good
Yes, I agree
Brief summaries of my experiments in appendix are done (benchmarks and codegen). Some remarks are left for parts involving other people's works. Feel free to remove my remarks if they are dealt with.
Thanks @blissful garden !! I'll take a look shortly and today/tomorrow will summarize my parts of the appendix. Once they're all done I'll wrap an appendix map and conclusion, unless @versed flax wants to write the conclusion.
Added regression curves (using logistic regression because acc bounds between 0-1).
This does support our claim that CFG inference efficiency is good for Lambada where small LLaMA beats SOTA. But it sucks on most of others.
How should we present this?
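A minimal sketch of the kind of curve fit described above: accuracy as a sigmoid of log-FLOP, fitted separately per group of points. It uses least squares on logit(accuracy), which is a simple stand-in and not necessarily the logistic regression used for the actual figure; the data points below are invented placeholders.

```python
import math


def logit(y, eps=1e-6):
    """Inverse sigmoid, clipped so accuracies of exactly 0 or 1 stay finite."""
    y = min(max(y, eps), 1.0 - eps)
    return math.log(y / (1.0 - y))


def fit_logistic_curve(log_flop, acc):
    """Fit acc ~ sigmoid(a * log_flop + b) by ordinary least squares on logit(acc)."""
    z = [logit(y) for y in acc]
    n = len(log_flop)
    mean_x = sum(log_flop) / n
    mean_z = sum(z) / n
    sxx = sum((x - mean_x) ** 2 for x in log_flop)
    sxz = sum((x - mean_x) * (zi - mean_z) for x, zi in zip(log_flop, z))
    a = sxz / sxx
    b = mean_z - a * mean_x
    return a, b


def predict(a, b, x):
    """Predicted accuracy at log-FLOP x; always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(a * x + b)))


if __name__ == "__main__":
    # Placeholder (log10 FLOP, accuracy) points, NOT results from the paper.
    vanilla = ([9.0, 10.0, 11.0, 12.0], [0.30, 0.42, 0.55, 0.66])
    with_cfg = ([9.3, 10.3, 11.3, 12.3], [0.36, 0.50, 0.61, 0.72])
    for name, (xs, ys) in (("vanilla", vanilla), ("cfg", with_cfg)):
        a, b = fit_logistic_curve(xs, ys)
        print(name, round(a, 3), round(b, 3))
```

Fitting in logit space keeps the curve inside the 0-1 accuracy range, which a straight-line fit on raw accuracy would not guarantee.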
I will also try to close the CoT appendix today
I think it's cool because the lines are pretty often fairly close
it still demonstrates that a smaller language model+cfg is a decent substitute for a bigger one
I need to write that part
@fallow egret I reworked your 2.1 to be a lot more rigorous and adapted the notations
I tried to satisfy as much as possible your desire for a strong maths background and nice derivations
I will go over the section after I'll finish with the CoT subsection
ty
Is this training FLOP or inference FLOP
inference
(Also, there isn't an "s" at the end)
noted
It's a weird acronym, but it stands for FLoating OPerations
I guess I wanted to write FLOPs, bc I know what the acronym stands for
People say "flops" orally as a natural pluralization
But that's not really right (it would be like writing Ls for "litres")
gotcha
And does cause confusion because there are things we want to measure in FLOP-seconds
(Also some people incorrectly use FLOPS thinking it's FLoating Operations Per Second, like mph or rpm lol)
Just for extra confusion
oh, was not aware of this one
oh yeah I have been always confused by that S in the end too
Is this all in the same precision?
If you're concerned about it not always improving the result per inference FLOP, I would stress that a) that's a really hard ask and b) 99% of users are VRAM bottlenecked, not FLOP bottlenecked
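A back-of-the-envelope sketch of the inference-FLOP comparison, assuming the common rule of thumb of about 2 FLOP per parameter per token for a forward pass, and assuming CFG costs two forward passes (one with the prompt, one without). It ignores attention cost and is not how the numbers in the plotting script were computed.

```python
def forward_flop(n_params, n_tokens):
    """Rule-of-thumb forward-pass cost: ~2 FLOP per parameter per token."""
    return 2 * n_params * n_tokens


def inference_flop(n_params, n_tokens, use_cfg=False):
    """CFG is assumed to need two forward passes: conditional and unconditional."""
    passes = 2 if use_cfg else 1
    return passes * forward_flop(n_params, n_tokens)


if __name__ == "__main__":
    tokens = 100
    print("7B + CFG:", inference_flop(7_000_000_000, tokens, use_cfg=True))
    print("13B     :", inference_flop(13_000_000_000, tokens))
```

Under these assumptions a model with CFG sits at roughly the same FLOP budget as a vanilla model with twice the parameters, while both CFG passes share one set of weights.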
Yes they are. @versed flax sent me the plotting script and he already did all the math work with the FLOP and copy-pasted the numbers there. Which precision did you use @versed flax ?
It's all from the harness, so the default fp32 I assume
I was mostly asking because I was curious if that was a source of variation in the plots
So the couple tasks that are discontinuous… is that because of multiple model families being shown?
Yes, it's still incorrect- missing either normalization argument in the middle of eq 6 or using proportional
- it's not 'equivalent to 2', it should be 'results in 2'
- inconsistency in signs, ok I will go over this section soon, the big P is the final notation?
good catch
fixed
where?
I recompiled, now everything is consistent!
I did not catch the sign changes, where was that?
small 'p' vs big P for probability
Oooooooooh, I thought you mean sign as in +/-
you meant symbol
Do you like the section?
In 6 the last part of the equation is missing (going back to P(w | c) both in the numerator and denominator)
I just had a time for a quick look on the equations, it looks good. I will take a deeper look soon
I removed it but I was really not sure of the move. What were you showing with this? I thought you wanted to show that sampling a text with CFG can be done by autoregressively sampling each of its tokens, and that's why I stopped there.
No, I want to go back to equation 2
The last step is missing, it's exactly equation 2
ah, I thought we wanted to go to eq 7 haha
No, the point in this equation is to connect the autoregressive formula in 7, directly to eq 2 in the original work
This is the theoretical justification...
Ok. It needs to be made more explicit in the text then imho
It was explicit in text (the line after eq 6, we had 'this results in 2')
Was going through Sec 2 with @versed flax carefully and personally I'm good with the whole Section 2 now.
(just one minor remark left at the last sentence)
Appendix A.2 about the acc-FLOP chart is also done. I'm putting up this disclaimer. But feel free to add/change stuff.
i read thru section 2.1.... i can see you guys put a lot of work into it and it definitely shows!! the language is really tight, the math is useful. I left some small comments.
Personally I think there is some stuff that i think might be in-the-weeds...
Two points:
- The introduction of p(z) early on.... we don't use p(z) anywhere else. Do we need to spell out this term? How does it help? Isn't the reader already going to be thinking about latent spaces?
- The exploration into sample noise and diffusion... how necessary is this? Does thinking about sampling noise help us think about LMs? bc we don't really think about noise so much in the same way. I guess this could serve as a useful history/teaching, and i think it comes down to a personal preference whether to include or not, but i think there's a fair argument to be made that it's not directly useful to the overall NLP focus
However, that being said, it does look good.
I honestly had liked the previous structure of introducing negative prompting more in Section 3.4, because it did make the point that it was more of a side exploration rather than a main exploration.
However, if we do commit to having it in 2.1, then 3.4 needs to be significantly tightened... like there's still some of the introductory text there. and if negative prompting is introduced in 2.1 instead, maybe some of that text could be moved to 2.1 and then 3.4 is really just "negative prompting, as described in 2.1"
Personally I'm okay with either a hand-wavy section 2 saying that we are inspired by SD, or a rigorous section 2 with careful derivations from SD despite notations being useless in other place.
btw yah i don't mean this as a criticism, just a point for discussion, and ultimately i do really like this version better than the last
Maybe we can stash some notations into appendix
i mean if you have an explanation for how those 2 bullet points help the reader, i'm convinced. also i think there really is an argument just for teaching the reader
i think a counter-argument to my points is that it does really solidify that we have a strong CV background here
I'm cooking. I'll explain later.
ok sure
Ok, I think CoT subsection is ready. I really like the edit that was done (probably @patent gull ?)
I added more experiments + some nice qualitative examples
great!! thanks Elad!!
this is nitpicking and no rush, but i would love it, if it's easy, if Figure 2 could be redone with font size=14
or 16. just to match Figures 3/4
Sure, np
@patent gull I changed it (font-14), I hope this is what you meant...
I'll check thanks elad
@versed flax I added many comments in section 2, with all the latest changes it seems that there are currently many inaccuracies (mainly with respect to the mathematical part)
@versed flax Can we iterate on one of the issues of section 2 here? It will be easier and faster.
In equation 1 there is an introduction of the classifier guidance according to the original paper. I don't understand why the given formula is unconditioned (first term after the approx). I attached the original formula from the paper. Observe that the two terms are different probabilities (one with theta, it's the generator, and one with phi, it's the external classifier)
Observe that you can't apply the Bayes rule at this stage to get what you wrote since we are still in the CG case here (not CFG), which means that the classifier probability function and the generative are not the same function
You can apply it only when moving to the CFG section, where indeed this is the same function
P.S, also Bayes' theorem doesn't give you eq (1) in the CFG context (it should be divided by p(x)), but let's do it step by step...
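To make the point under discussion concrete, here is a sketch of how the conditional and unconditional forms of classifier guidance relate. The subscripts \theta (generator) and \phi (external classifier) follow the message above; the weight w and the relation \gamma = w + 1 are assumptions about notation, and none of this is copied from the draft's eq. (1):

```latex
% Bayes' rule, with every term taken from one distribution P:
P(x \mid c) = \frac{P(x)\, P(c \mid x)}{P(c)} \;\propto\; P(x)\, P(c \mid x)

% Hence guiding a conditional model with weight w ...
P(x \mid c)\, P(c \mid x)^{w} \;\propto\; P(x)\, P(c \mid x)^{w+1}
% ... matches guiding an unconditional model with weight \gamma = w + 1.

% With separate networks the two sides read
P_\theta(x \mid c)\, P_\phi(c \mid x)^{w}
\quad \text{and} \quad
P_\theta(x)\, P_\phi(c \mid x)^{w+1},
% and they agree (up to a constant in x) only if
P_\theta(x \mid c) \;\propto\; P_\theta(x)\, P_\phi(c \mid x).
```

So the two ways of writing CG coincide exactly when the generator and the classifier are consistent with the same underlying distribution, which is the assumption being debated.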
@blissful garden I think Table 6 would look better as a percentage-normed horizontal stacked bar chart:
https://www.geeksforgeeks.org/stacked-percentage-bar-plot-in-matplotlib/
bc that's really what it's trying to show, right? We're supposed to see how CFG gets better?
it's kinda hard to parse that in the table with the numbers, since it's a different total in each row
each stack/row is in this order: [underperforms, ties, outperforms]
and then, different bar for each temp
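A minimal sketch of the percentage-normed horizontal stacked bar chart suggested above. The counts and temperature labels are made-up placeholders, not the contents of Table 6, and the helper names are hypothetical. matplotlib is imported inside the plotting function so the normalization helper runs without it.

```python
def to_percentages(counts):
    """Normalize one row of [underperforms, ties, outperforms] counts to percentages."""
    total = sum(counts)
    return [100.0 * c / total for c in counts]


def plot_stacked(rows, labels, categories=("underperforms", "ties", "outperforms")):
    """Draw one horizontal bar per label, stacked by category and normed to 100%."""
    import matplotlib.pyplot as plt  # lazy import: only needed for plotting

    pct = [to_percentages(r) for r in rows]
    fig, ax = plt.subplots()
    left = [0.0] * len(rows)  # running left edge of each bar's next segment
    for k, name in enumerate(categories):
        widths = [p[k] for p in pct]
        ax.barh(labels, widths, left=left, label=name)
        left = [l + w for l, w in zip(left, widths)]
    ax.set_xlim(0, 100)
    ax.set_xlabel("% of comparisons")
    ax.legend()
    return fig


if __name__ == "__main__":
    # Placeholder counts per temperature, in the order [underperforms, ties, outperforms].
    rows = [[10, 5, 25], [8, 12, 30], [20, 6, 14]]
    labels = ["temp A", "temp B", "temp C"]
    plot_stacked(rows, labels).savefig("table6_stacked.png")
```

Norming each row to 100% removes the different-total-per-row problem, so the share of "outperforms" can be compared across temperatures at a glance.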
Oh great idea!
oh that should be easy. Let me do it
If I haven't started working on this in the next 12 hours please ping me and remind me to do so.
ok I'm done with the conclusion and done with my end of the appendix. I added a table of contents to the appendix, feel free to disagree with that design-choice
looks to me like we have v1.0 of a rough draft
I see one bit of orange text, let me address that. I haven't nearly begun to address all the comments, but I will do so
also i jotted down some limitations i could think of off the top of my head, at 2am, in the Conclusion
feel free to take a look and add your own... the more limitations we address, the better and more solid our paper
which means that the classifier probability function and the generative are not the same function
That's totally irrelevant, they're probability functions
Whether you decide to model them with different models or not, it's correct
CG just says "we use an external classifier to guide generation"
you say "I want to model P(x|c), I apply Bayes' rule, I get p(x) p(c|x), oh that's an unconditional generator and a classifier, let's train two networks", I don't see where your confusion comes from
There's indeed one small mistake here, it's the theta subscript. Fixed:
Other than that it's correct
Can you please write it step by step how did you get to your formula, assuming we agree that source is what I sent?
you say "I want to model P(x|c), I apply Bayes' rule, I get p(x) p(c|x), oh that's an unconditional generator and a classifier, let's train two networks", I don't see where your confusion comes from
Again, these probabilities are not the same
What does that even mean?
Each model defines a different probability function...
The parameters are there for a reason, it's simply a different probability function
and?
You used the Bayes formula with respect to P_theta
Are you saying that two models can't interact if they don't share the parameters?
I'm saying that P_theta(x|c) multiplied by P_phi(x|c) is not equal to P_phi(x|c)^2
which is what you used to get your equation..
of course it is
of course it's not, it's not the same probability
What does that even mean?
if you train P_phi(x|c) and P_theta(x|c), and they both are trained on a similar dataset, and are both expressive enough, they'll learn the same thing, P_phi=P_theta
- It's simply incorrect, otherwise ensemble methods would not work
- The external classifier doesn't have to be trained on the same data, it's an invalid implicit assumption
- They weren't trained on the same objective (one is generative and the other is discriminative)
Ensembles work because the "they are expressive enough" assumption breaks. Ensembles are a bug, not a feature. They work because your model P_theta doesn't perfectly model P (whatever your model is or supposed to be), so you average their mistakes to smooth them out. When writing theoretical derivations like this, you can assume the model is perfect, and that's a common assumption
So they are not the same function, and you can't assume they are equal. I simply don't understand why you need this assumption. Why not simply write the original formula?
It makes the whole theoretical part invalid for no reason
Because 1) it makes more sense to a reader to say that we need to guide an unconditional generator than a conditional one (why would it be needed if it's already conditional?), 2) it saved me time in writing, with simpler explanations, time which I have ultimately spent many times over arguing this with you, and 3) no, it's correct, and if you don't believe me, I quoted the CFG paper where that equality is laid out explicitly.
You train a model P_theta to be an approximation of P, it's fair to equate them in theoretical equations.
It is incorrect for sure, you have two different probability functions on the same space, each one coming from a different model.
Having this assumption that they will converge to the same probability by some 'magic' is not valid in any applicable setting. Therefore, your theoretical framework doesn't model the reality.
In my opinion in this part things should be correct, this is the most important thing
@blissful garden @patent gull Can someone help with that?
It's not "magic", it's training
But it's not true. When you train the same network with the same objective and data on different seeds, you get different probability functions
And in this case it's not the same objective and data...
You absolutely get two very extremely similar ones. Or you're just not properly training your model.
No, it's not true. There are so many works on this topic...
If that were true, it would just make it impossible to compare models as the accuracy of two instances of a model trained twice would be vastly different
and that's also why ensembles work better with models with different architectures, slightly different training data, and model types. The quirks in approximating the theoretical P won't be the same ones.
I gave you a clear reference (and I can give more) that this assumption is very controversial.
IMO we should not have this assumption, since as I said there is no reason to have this assumption.
Okay, whatever. You can propose a fix, but I'm not wasting more time on this
I've been rejected over a much less controversial assumption in the theoretical part...
Good thing this isn't a theoretical work but more of an experimental one, then
The whole theoretical part was developed to please you
My work that was rejected was also not theoretical.
Reviewers are searching for these implicit problematic assumptions (I'm also doing it as a reviewer)
Then go reject "Diffusion Models Beat GANs on Image Synthesis" (NeurIPS 2021) which introduced classifier guidance
Or "Score-based Generative Modeling Through Stochastic Differential Equations" (ICLR 2021)
There is no issue with what they wrote, they define here a reverse diffusion process, they are not claiming that the two probabilities are the same
They totally use a classifier in the same way
their eq 2 is literally the same one you're complaining about
No, it's not because their generative model is unconditional (they start with unconditional diffusion model). In our case we apply a conditional generative model (as in CFG paper)
In any case they are not claiming that p_theta(x|c) = p_phi(x|c)
Look. I'm not wasting more time on this. We end this. Feel free to propose a nice, correct, high quality and fully redacted fix.
I've spent two full days fixing your 2.1 which exists only to please your desire for theoretical grounding.
The only next action I'm taking on this non issue is clicking an Accept or Reject button.
I don't understand what was the issue with the original version that was correct from a theoretical perspective.
What are your thoughts? Maybe I'm biased as a mathematician, but in my opinion the theoretical part should be accurate
@patent gull @loud adder @blissful garden
Alex said "impressive improvements" and Honglu proof read that section so much that we basically co wrote it
I'm not sure I am fully following the back-and-forth of this argument. And what I have to add certainly won't settle it in a satisfying way. However I remember having a similar argument with my lab mate.
I have my own classifier-guided control paper: https://arxiv.org/pdf/2301.02299.pdf that has the same setup that @versed flax wrote... even less principled lol bc I don't even notate two different sets of parameters.
Indeed, because of pretraining, p_theta(x) and p_phi(x|c) cannot even be assumed to be of the same linguistic domains. In our case, p(x) was vanilla GPT2 (i.e. general web) and p(x|c) was trained on news. One of the improvements we noticed was actually just due to fine tuning p(x) on the news domain (which shouldn't have to happen in a theoretically perfect world).
No reviewer noticed. There are other classifier-based works with similar setups in NLP: FUDGE (https://arxiv.org/abs/2104.05218) and PPLM (https://arxiv.org/abs/1912.02164).
Indeed my lab mate published his work explicitly trying to address this: https://arxiv.org/abs/2205.14219.
At the end of the day, yes it is a problem, my labmate got a paper out of addressing the problem, BUT there is also a rich history of methods in this space and it's uncontroversial at this point IMO. Most importantly, PPLM, FUDGE and my work all ALSO showed effectiveness, so it's not an invalid setup
Yes, these works are all *CL, which maybe has a different set of reviewers and reviewer concerns than ICML/NeurIPS/etc. I'm less familiar with those reviewers. But I do think we should move on
I do think we can have a more comfortable debate about section 2 once we feel really good about the whole rest of the paper
I've spent more time on Sec2 than the rest of the paper combined. Definitely agree that we should move on. As I said, if someone is displeased with the current state, they're free to submit a good fix, but going back and forth in chats and criticizing isn't productive
Yeah... I mean. Yeah. I'm trying to think of a concise way to frame this debate. Honestly maybe one sentence about it and then cite NADO (my lab mate's paper) as proof that it's an issue with classifier-based methods
But it doesn't affect our work since we're not using classifier-based guidance
@patent gull I agree that we can move on and go back to that in the end. I simply don't understand why we need to trust that the reviewer will not notice it, when it's a completely unnecessary assumption and simply using the conditional formula resolves the issue
I mean yeah it can literally be a sentence at the end of 2.1 saying "these works face issues..."
Yeah but that section is more summarizing the lit
It's a problem w the lit
We don't use classifieds
Classifiers
It's not our theoretical problem
It's the lit's problem. Certainly in NLP, where it is NOT addressed typically
I agree that it's not our problem, this is why I don't understand why we need to insert this issue in the first place, which is completely unnecessary in our case and just raises unrelated questions
Just submit a good fix, Elad.
I mean I'm not trying to say it's not important. It just doesn't affect us so I think a reviewer would be wrong to point it out as a flaw with OUR work
I think I can put a short line in section 2.1 addressing this
Be productive. When you complained "The intro should start with the problem............." Alex proposed "We should swap first and second paragraph". One comment is clearly more productive and usable than the other and they still addressed the same point. Propose your fix.
I didn't want to modify the text without a consent after last time...
He didn't either. He just proposed a solution.
My solution is simple:
eq 1: should be p_theta(x|c) instead of p_theta(x)
and then in eq 2 it should be p_theta(x|c)^(gamma+1) divided by p(x)^gamma
These are all the changes for this part...
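For reference, one way to write out this proposed change. This is a sketch of one reading of the fix, not the paper's final text: it assumes the classifier term comes from the same model through Bayes' rule, p_theta(c|x) ∝ p_theta(x|c) / p_theta(x) for a fixed c, and the hat notation is introduced here only for illustration.

```latex
\begin{align}
\hat{p}(x \mid c)
  &\propto p_\theta(x \mid c)\, p_\theta(c \mid x)^{\gamma} \\
  &\propto p_\theta(x \mid c)
     \left( \frac{p_\theta(x \mid c)}{p_\theta(x)} \right)^{\gamma}
   = \frac{p_\theta(x \mid c)^{\gamma + 1}}{p_\theta(x)^{\gamma}}
\end{align}
```

Under this reading, the exponent gamma+1 comes from the conditional model appearing once on its own and gamma more times through the Bayes-inverted classifier term.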
No, that's not "it", the text around it needs to be reworked as well, and address why we use classifier guidance since the model is already conditional.
What do you mean? This is how you perform classifier guidance, you enhance the conditional effect on the model by external classifier
In CFG you also use a classifier guidance (where classifier is defined using your own model), on conditional generative model...
I don't have the equations in my head right now (sorry, away from my computer)
But I'll look and have an opinion on this when I'm in the office
I do feel like this isn't top priority though since it's entirely concerning background work (if I'm understanding correctly)
It's concerning notations on background work
It's the minorest thing in the minor things we have to address
I agree it's not top priority, but IMO there are a few issues in Sec 2 that should be resolved before submission
Just woke up... Give me some time to read through.....
This is not one of them.
My sense is that 2.1 has been steadily getting longer, denser and ultimately harder for the reader to get thru before getting to our real contribution but I honestly don't have it in my head really because I've been focusing on other things
why \gamma + 1 and \gamma?
oh I see
Hey btw, I'm going to be reverting the V3 of triviaqa https://github.com/EleutherAI/lm-evaluation-harness/pull/610 in the eval harness upstream. The results on this do not match llama's performance whatsoever (way higher than they report), while V2 ~does when accounting for prompt. Exact match is meant to be exact. That has its own problems, but we don't want to be able to rate "not Mark Twain" as correct if "Mark Twain" is the expected ground truth
That's fair. I'm indeed checking the LLaMA paper and I definitely hallucinated this "substring" thing or read it elsewhere and got confused
I'm sorry I wasted some time and negatively impacted the productivity
oh lol don't worry about it
just wanted to alert since that changes what scores yall should report/maybe rerun unfortunately
yes, indeed. I think we will just report how we got those results
so we use our original numbers for triviaqa?
bc there were many instances where the model generated something like "Mark Twain" (with the quotes) or "This is Mark Twain" (can you confirm @blissful garden ?)
yeah I guess there are pros and cons for each. I dumped the write-out files and inspected them manually. Using substring was a lot better
A LOT with extra words
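To make the trade-off concrete, here is a minimal sketch of the two scoring rules being compared. The function names and the example predictions are hypothetical; this is not the lm-evaluation-harness implementation, only an illustration of how the two rules disagree.

```python
def exact_match(prediction, answers):
    """True only if the stripped prediction equals one of the reference answers."""
    return prediction.strip() in answers


def substring_match(prediction, answers):
    """True if any reference answer occurs anywhere inside the prediction."""
    return any(answer in prediction for answer in answers)


answers = ["Mark Twain"]
predictions = ['"Mark Twain"', "This is Mark Twain", "not Mark Twain", "Mark Twain"]

# One (prediction, exact, substring) triple per model output.
results = [(p, exact_match(p, answers), substring_match(p, answers)) for p in predictions]
```

Substring matching accepts the quoted and the "This is ..." generations that exact match rejects, but it also accepts "not Mark Twain", which is the failure mode raised above.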
here if you want to see
It's been 12h :)
Doing it now
Reg. Figure 9, the FLOPs tests
can we
(1) do statistical significance tests on the plots (I think f-tests is the right one?)
(2) draw confidence regions? We can establish these using bootstrapping, I think
the hypothesis we really want, and that the figure seems to support right now, is "CFG is statistically equivalent across most tasks to a similar-budget model". But #1 and #2 will help us really show it
I think this is an important finding if we can prove it, and warrants its own short section in the main paper
"CFG is statistically equivalent across most tasks to a similar-budget model"
it might not be the case though. But the difference doesn't look huge
Oh let me look it up. I actually suck at stats and need to brush up on that hypothesis testing stuff
Would be wonderful if we can refine it and make it part of the results
scipy should have you covered
I am surprised by the citation in
A ``prompt'' is typically used to condition on the generation, containing task instructions, context, and a small set of examples \cite{flan}.
Why was this chosen? The FLAN paper is about finetuning models on instruction-formatted data
(Also, FLAN and T0 came out at the same time with the same core idea: it's almost always correct to cite both of them when it's correct to cite one of them if you're not citing your use of their specific model or something)
Because it looked like a great paper to show what an instruction actually is
for point #2, I think if you just bootstrap resample those FLOPs points 10,000 times, then you get a distribution over the results and you can calculate the median and percentiles to do a confidence region
I feel like the GPT-3 paper and this are more appropriate papers to cite https://arxiv.org/abs/2102.07350
bootstrapping is great for confidence intervals over metrics/etc. all kinds of things with non-normal distributions
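A minimal sketch of the bootstrap procedure suggested above (resample with replacement many times, then read off the median and percentiles). The scores are made up for illustration, the function name and defaults are hypothetical, and the percentile interval used here is only one of several bootstrap variants.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, stat=statistics.mean, seed=0):
    """Percentile bootstrap: return (lower, median, upper) of the resampled statistic."""
    rng = random.Random(seed)
    n = len(values)
    # Each resample draws n points with replacement and recomputes the statistic.
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)]) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, statistics.median(estimates), upper


# Hypothetical per-task accuracies, NOT results from the paper.
scores = [0.41, 0.47, 0.52, 0.55, 0.60, 0.63, 0.68, 0.71]
low, mid, high = bootstrap_ci(scores, n_resamples=2_000)
```

For a confidence region around a fitted curve, the same loop would resample the (FLOPs, accuracy) points and refit the curve on each resample instead of taking a mean.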
What is this " Fundamental limitations of alignment in large language models" paper that we're citing a lot?
Okay, only twice since the other three are commented out
IIRC it's a paper that we use to talk about system prompts. Probably not the best one
Gotcha. No worries, I don't expect y'all to have the literature and chronology memorized
Pretty simple: I have very very little knowledge of the NLP lit
I have a pretty darn good grasp of the vision lit, but NLP... close to none
The goal of these questions is to get an idea of what you're looking to cite so I can identify papers that may be a better fit
gotcha
It would be a good idea to submit a PR to the HuggingFace transformers library that includes CFG as a LogitsWarper or whatever it's called
This will substantially increase the chances of people using the methodology because that's how most people get their LLMs
It's ready and will fly the very second the paper is on ArXiv :)
If you mean that you're ready to submit the PR, you should submit the PR now so we can answer any questions or handle any issues they have. If you mean you've already done that and are waiting for the paper to go live to have it merged then well done!
I've been waiting to submit the PR. I was thinking that they wouldn't accept a PR from something weird without a paper attached to it
I'll submit the PR in few hours then
attach some part of our draft in the PR? Like Sec 2 and the eval charts.
actually just the eval results might justify it.
If you say "I've been collaborating with EleutherAI on this and we have a paper coming out on Friday", and tag me then they won't have any issue with it
Also this
@versed flax Do the "prompt alignment" techniques require finetuning or are they inference-time like ours?
Various approaches have been proposed to address this, including prompt alignment \cite{alignment} and fine-tuning \cite{instructgpt,flan,sanhmultitask}.
It depends. For what we know GPT-3.5 is aligned with RLHF and Bing Search with prompt
Oh that's the Anthropic paper
In more humble situations, like Character.ai and the likes, it's prompt alignment
Sorry, approximate language here. It means there's a system prompt describing the chatbot's intended behavior.
("This is a conversation between Person A and Eric Cartman:
Cartman: Hey you, leave me alone!
Person A:")
I can try and look for one.
I don't think doing so is very important.
https://arxiv.org/pdf/2206.07550.pdf I found this one. They prompt the chatbot with OCEAN traits
personally, i don't think the "right" huggingface implementation of CFG is the logit-wrapper implementation.
I think putting it in the forward method of a CFG-head model, maybe as a mixin, is the more hugging-face appropriate way, looking at how they build their models.
I have something like that implemented, although my class is a bit more of a monstrosity bc it's doing different things, but:
https://github.com/Vermeille/lm-evaluation-harness-cfg/blob/cfg-alex/log_logits_on_p3.py#L52-L79
Definitely this is a side-discussion, but since we're talking about code...
I don't have a strong feeling, and getting this feedback from the maintainers is another reason to open the issue early
SGTM
yah i always just found the logitwarpers approach to be a little awkward, since we needed to pass in input_ids and model but there was already a model inside, and logits getting generated. idk. felt weird
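Whichever integration point is chosen, the per-step computation is small. Below is a framework-free sketch, assuming the p(x|c)^(gamma+1) / p(x)^gamma form discussed earlier in this thread; the paper or the eventual PR may parameterize the guidance strength differently. The function names and toy logits are hypothetical, and nothing here is HuggingFace API.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def cfg_log_probs(cond_logits, uncond_logits, gamma):
    """Renormalized log-probs proportional to p(x|c)^(gamma+1) / p(x)^gamma.

    cond_logits come from a forward pass with the prompt, uncond_logits from a
    pass without it (assumption: both are over the same vocabulary).
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [(gamma + 1.0) * c - gamma * u for c, u in zip(cond, uncond)]
    return log_softmax(mixed)


# Toy next-token logits over a 3-token vocabulary.
cond_logits = [2.0, 1.0, 0.0]
uncond_logits = [1.0, 1.0, 1.0]
plain = cfg_log_probs(cond_logits, uncond_logits, gamma=0.0)
guided = cfg_log_probs(cond_logits, uncond_logits, gamma=1.0)
```

With gamma=0 this reduces to ordinary conditional sampling. The awkwardness mentioned above shows up because the unconditional logits need a second forward pass, which a pure logits post-processing hook does not naturally have access to.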
I added this in the enumeration in the introduction:
\item We show that for the same inference cost, one can train a model that is half the size and obtain similar performance on those benchmarks;
I should maybe add it into the abstract as well
it's a fairly strong result
An important question remains and I have no idea what the answer is: Should we acknowledge CAD?
Yea, I'm adding this shortly
oh you're still on it
Yeah, got dragged into a meeting and then had to cook dinner but I'm back at work
I'll be available for the next 4h. Don't hesitate if you have any question or need any feedback.
It's really good. Stuff has come together really well in the past two weeks
Basically all my comments and edits are about copy editing and optimizing the presentation
BTW your website appears to be down: http://vermeille.fr/
Ah! thanks!
(I was going to add affiliations)
@versed flax Is there something unique or special about SD's negative guidance? We discuss negative guidance in the VQGAN-CLIP paper, and I'm under the impression it's something that can be done with any T2I model
Nothing specific about it. It's just this one famous tool that implements it
Midjourney and DALL-E don't.
I can't fix it but I basically have to recreate it (apparently my payment failed last time I had to renew). Is it important?
@oak ore what's up with that?
Oh no. I just googled you because I couldn't remember what institution you said you were at, and then figured you might want to know.
I had my PhD from the Université de Toulon, but I'm not working there anymore. I'm working for a small company called Hexaglobe now
this is a pretty long thread - what's the question?
negative guidance could be done with anything that uses cfg. sd doesn't do anything special there. midjourney & dalle are closed models with restricted APIs, and negative guidance isn't part of the exposed API
(to be fair they certainly use a preset hidden neg prompt. It's just so good)
we do something fancier than that actually, tho I can't go into detail
oh, you're working for one of those orgs?
yeah I'm at midjourney
Would you happen to be hiring French dudes?
I don't think we've ever shipped negative prompts in the way sd does them
we've got people from all over, tho i don't have bandwidth to evaluate new hires rn
if that ever happens and you're interested, hit me up, along with the expected curriculum!
Sorry to derail the convo. I'm going to go on a walk before the sun sets then finish up
@loud adder please let me know if you'd like to see anything else in Section 4
I know you mentioned causal approaches way back before we put 4 together
I'm starting to be sleepy. Try your luck if you need something but I may not answer, sorry
Don't worry about it. Go sleep and it'll be there in the morning
I have addressed most of the edits.
- @patent gull, Stella did edit the intro to Sec 3 and removed the parts that articulate the section into the various prompt types. I'll let you proof read and accept / reject the changes. I did not want to do it for you, it's your part.
- @patent gull please accept / reject the edits to the abstract so that I know they're correct.
- @patent gull Stella advocates moving your Related Works to the appendix. I quite like it but I understand her point, the paper is already quite long. However I feel like focusing on the CV background only is weird.
- @loud adder the main unchecked thing I have now is this new figure you're suggesting and that I don't fully see.
- @patent gull / @loud adder I answered most of your comments in the sidebar. Once you've read my answer and find it satisfactory, can you mark them as resolved? If you leave them open I don't know if we can move on.
Ok I'll check. I think I largely agree with these changes. The section 3 header was a little too structured-feeling
And I think there's a way to discuss classifier guidance and contrastive decoding in NLP in 1-2 sentences in section 2
Thus reducing the need for the related works section
I was thinking if we want to reduce the length, there might be some plots and tables throughout that could be moved to the appendix as well. But we don't have a page limit for arxiv so less worried about length
The paper is quite long. Would it be unreasonable to add a toc for the arxiv release?
My primary reason for mentioning the length is about people's attention span. 12 pages isn't egregiously long, but it is 50% longer than the main text of most ML papers. I do think we should strive to not make it substantially longer.
(I say this as someone who has multiple arXiv papers that are > 80 pages long)
Yes, totally. It's long. My proposed "easy fix" is to add a toc for the arxiv release. The reader can then glance what the paper is about and choose what to read. Dunno if it's unreasonable
That's fair
Like I honestly don't like toc, they don't end up being descriptive enough or useful to me
hmmm so @loud adder what is the desired page limit? 10?
That's fair²
8?
we have a ton of plots and tables that can be summed up with 1 line and moved to app. Also have language that can be tightened throughout
6-10 pages is the typical length of a ML paper
Alright sounds good
I know I can move my line plots to the appendix. I'll take a look at all the results tables we have and think about some that can get condensed
Or moved
She said that her koala started to speak English when she was about six months old and has since been able "to understand the words of people"., who lives in a kangaroo colony near Kogarah on Queensland's Sunshine Coast, told news.com.au he had never seen anything like it before:
This sample from the appendix seems to show an abrupt change in topic half way through.
True. Let's regen new ones. Also those were sampled with Rep penalty and argmax decoding.
I'll do that tonight. I have to get going with my job for now.
(I'm still reading and answering here, I just won't do anything meaningful now)
@loud adder fyi this is the response from sgugger to the PR
let's see if the community requests this added feature before implementing it in the library proper :-)
I saw and don't understand, but w/e
i went through the edits/comments and agree with pretty much all of them
would you like to discuss \subsection{Relation to instruction tuning}?
let me know
Have you accepted / rejected / resolved them :)?
i accepted most that i saw from you/me
I'm leaving comments up there
since most of them feel like they're still open
- I've updated the tables and figures with triviaqa
- Added a note about triviaqa methodology in the appendix.
- talking with HF https://github.com/huggingface/transformers/issues/24536
do we still move the FLOP stuff up to main text? There seem to be a lot of stuff in the main text already
I'm pretty sure we can find something that is less important than this result, and move it to the appendix / remove it instead
@versed flax I also saw the MusicGen PR and trace it back to the paper
https://arxiv.org/pdf/2209.15352.pdf
It seems equation 4 is exactly what we are doing here...... They are so much earlier than us. We should probably say like although we weren't aware of them in the beginning, our work can be seen as generalizing their technique to text-to-text models with a comprehensive analysis.
Do you think it's more similar to our work rather than to the use in diffusion models?
no I just mean the exact technique. They literally only had one paragraph without anything else. But the way they say it is very similar to us (mentioning all the SD stuff and having the same formula as ours)
They have a Figure 3 for ablation of CFG and that was it. But they did study that
I actually start to think this paper is closer to us than CAD. And they simply just didn't realize the generality of this technique.
Their architecture looks very very close to text2image models and that's probably why they did not generalize
Yep
just throwing it out there and see if we want to add one short sentence to acknowledge their work as well.
In our field if our upcoming work has any resemblance to any other group's previous stuff, we'd send our draft to them in case they have remarks (but don't wait for it and all the submission schedules would be unchanged)
(and usually they say "good work!" and connect with you)
I've never done that. I don't know what's customary in ML
I agree that although they apply CFG to an autoregressive model, it's a different field, so in that sense it's similar to text2image models and is less relevant than CAD. We might want to add one line referring to their work but I don't think we should do more than that...
Make a best-effort attempt to cover related work for something you are doing, but if it's not immediately in your vertical (subfield/adjacent field) it's okay to miss it. Very rarely would a reviewer reject your paper because you haven't mentioned one paper (they might ask you to add it though), unless you are making an egregious error.
i think we should include it in related works, or as another citation in the intro to CFG!
it's very cool
i think it's quite clear that they applied this idea from the text-to-image lit
weird that that guy is keeping such close tabs on HF PRs that he noticed it lol
So we're like 99% there. I see 2 remaining TODOs in the paper:
- 1 figure that @loud adder asked for but that I don't really understand. Do you understand what she meant @patent gull ?
- 1 flop analysis. @blissful garden I see you're on this, do you need a second brain?
I did another pass on the paper, fixed the figure flow in the appendix (it was totally chaotic), and the various things I mentioned earlier
We're 99% done => what's this 1% I'm not seeing, besides those two points, and how I can work on it? Has someone identified some incomplete work? I'm so deep inside it that I have a hard time keeping track of the progress lol
the problem yesterday was that the f-test resulted in all p=0 and Alex told me to change to ANCOVA. I've got the code and am putting together the results right now. Also reading up on what ANCOVA is about (yes my stats suck)
(I have no idea what that is either)
it does seem to directly tackle the comparisons of regressions though
So that thing is about to be solved. Great!
hopefully something useful can come out
Well then, maybe this paper will be able to fly on ArXiv before Friday then!
all p values are super small again... @patent gull what conclusions are we looking for other than their adjusted means are not from the same distribution?
(I'm using original samples not bootstrapped samples btw)
Well then if p-values are small, it means that 2x vanilla and CFG aren't indistinguishable (which I expected)
no p-vals being small means they are distinguishable (ah which is what you said)
something must be up... the 95% bootstrapped confidence intervals are totally overlapping
hmm
are you running ancova on the normal values, or the log-normalized values?
log normalized
ugh thought that was it for a second..
log(x) vs log(1-y) - log(y), and then linear regression
why log(1-y) - log(y)?
@fallow egret you have a whole section in the appendix to write:
\subsection{Deliberative Prompting: Chain-of-Thought}
i thought y was just avg accuracy, in those plots?
why are there so many different opinions out there for statistical-testing of regression lines?? ugh
https://stats.stackexchange.com/questions/151916/are-two-linear-regression-models-significantly-different
https://stackoverflow.com/questions/66433019/how-to-statistically-compare-the-intercept-and-slope-of-two-different-linear-reg
actually ANCOVA only tests the slope, right?
yes y is avg which is bounded between 0 - 1
ancova stands for analysis of covariance, which i assumed meant the covariance between x ~ y
does anyone in this channel have a go-to significance test that they use for testing 2 regressions?
i think you can just directly test the relationship between log x and y
i don't think you need to transform it using logistic regression
no, because y is bounded at 1, when y is very close to 1 (like 0.9), it's having an obvious flattening look where linear regression is unfair
I was doing this in the beginning until I realize every line just intersects when x is large (y getting close to 1)
ahh so those plots aren't just showing x and y
they're showing some transformation?
Fig 10
the plots are just log(x) and y
but the curves are logistic regressions between log(x) and y
which is the same as linear regression between log(x) and log(1-y) - log(y)
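A small sketch of the transform being described, for anyone following along. It assumes y is an average accuracy strictly between 0 and 1 and x is a positive compute value; the data below is synthetic, not the Fig 10 data, and the helper names are hypothetical. Note that the two fits describe the same family of curves, but least squares in the transformed space weights errors differently from a logistic fit on y directly, so the fitted coefficients can differ on noisy data.

```python
import math


def neg_logit(y):
    """The transform quoted in the chat: log(1 - y) - log(y), for 0 < y < 1."""
    return math.log(1.0 - y) - math.log(y)


def fit_line(xs, ys):
    """Ordinary least squares for ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_accuracy(log_x, slope, intercept):
    """Invert the transform: y = 1 / (1 + exp(slope * log_x + intercept))."""
    return 1.0 / (1.0 + math.exp(slope * log_x + intercept))


# Synthetic points lying exactly on a logistic curve in log(x) (hypothetical values).
true_slope, true_intercept = -1.0, 2.0
log_xs = [0.0, 1.0, 2.0, 3.0, 4.0]
accs = [predict_accuracy(lx, true_slope, true_intercept) for lx in log_xs]

slope, intercept = fit_line(log_xs, [neg_logit(a) for a in accs])
```

The fitted line lives in the unbounded transformed space, so mapping it back through `predict_accuracy` gives a curve that flattens as y approaches 1 instead of crossing it.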
ohh ok ok i thought you were doing some fancy multinomial fit thing
i think scipy optimize has a multinomial fit
anyway
ok... if you wanna send me the data i can also try significance testing
otherwise we can also just say "the two lines are indistinguishable on a 95% confidence interval"
so if two groups have similar slopes, ANCOVA will give high p?
that's what i thought. but if you see those SO links I sent, there are other proposals for significance tests
that first link says chi-squared test of coefficients, a partial f-test and a t-test...
can't say i've heard of a "partial f-test" before, nor do i know which one is appropriate in this case
@patent gull data and codes sent (a bunch of stuff in DM but I don't want to spam the channel). Feel free to play with it
meanwhile let me also check out those links
Maybe the conclusion is indeed we have different slopes (or covariance) for each task
cool cool
Seems like most of the stuff related to me is done. There is this one question left:
- do we want to compress the codegen results in the main text by for example reporting temp=0.2 only? This way we combine the three tables together. (will still need all temps at least in appendix to fully showcase the trade-off of adherence-creativity)
I think so
@fallow egret you wrote the interpretability section, right? I wanted to chat about that as I feel like I'm missing something as I read it
It's @patent gull
I wrote it, yeah, I'll be available in a bit to talk about it
(in a bit meaning like 1-2 hours)
This is done. Put the full table to the appendix and just say "Here we show the results for temperature= 0.2 in Table 3". Feel free to change the wording if there is a better way to put it.
@loud adder there's some debate on the MT section. Does it belong to the main text or the appendix? We're trying to make the paper shorter. The section shows:
- it's a generative task, but so is CoT
- CFG brings 10% improvement on MT on base models
- it didn't work on tuned models (so it makes the positive results a bit pointless, since people will obviously use the tuned models)
- there is no further insight.
It doesn't work for 1 shot either
Right.
I remember we did 1-shot for the BLOOM paper too. Is 1-shot a standard thing to do for MT?
For generation tasks, from what I've seen, it seems to be the norm. Because the model tends to just over-generate unless it has an example to anchor to.
I see that Section 2 is 2 pages long. If we are really compressing the pages, maybe we should also consider moving most of the math to the appendix as well. Although I'm a math guy, my guess is that most of the audience just wants to see one equation and a short story before going straight to charts and conclusions.
I think all of 2.2 is necessary, and it's hard to see what equations can be cut from 2.1
I recall thinking the prose was a little overdone though, maybe we can cut some of it
I was referring to all these figures (in the appendix) in the main text. I will write a few sentences in the appendix that echo the main text
Done
@versed flax I'm pretty sure you flipped the order of the model labels
I will work on this tomorrow morning. Apologies for being absent, I was pulled away longer than I expected
I think you broke the LaTeX.
It was before
You're right
@patent gull fyi \usepackage{subfigure} broke the LaTeX. I'm commenting it. It doesn't seem to break anything else. No idea what you needed it for.
Ugh so sorry man
I was trying to put two figures side by side
The gpt4all figure and the humeval win rate fig
You did that with \subfloat already didn't you?
Yeahhh I didn't want them to have letters, and I didn't want them to break the counter
ah
But it wasn't breaking latex when I first imported it
Maybe after several compiles
maybe yeah
Or maybe my latex was caching something
But anyway I would've never just left it broken if I had known
My bad
I was definitely awake and working on the paper when you did that and I did not notice the breaking
lol I know
Overleaf caches a lot of intermediate files... Thinking about it more, that may have been it :/
it's no drama
Cool
@loud adder let me know when you'd like to chat about the interpretation section. I'll be at my computer in 5 min
@fallow egret can you redo your figures with plt.rc('font', size=16)?
Also what is the y-axis? accuracy? can you label Accuracy (%)?
(I'm questioning whether we need the results to look like this, and whether we can format them like Figure 1, i.e. a table)
saves space. but it would be a very, very small table lol
oh sorry i see the legend. so, my visual graphics opinion is that legend-based hues should be different trials not metrics. Metrics imo belong on a dual y-axis
Yes, it will be narrow and long table
So what to change?
I used @blissful garden code to be aligned with his figures...
if you send me the data i can change it
Which figure is this
Figure 2
there's also wasted vertical space in the right-hand image, due to the need to project both into the same scale, which a dual y-axis would solve
What are the conclusions I'm supposed to reach reading these plots @fallow egret?
This:
using CFG increases the percentage of CoT which results in a valid answer that could be parsed. For low guidance strengths, this results in boosting the model performances. However, for large values, although the model returns more valid results, the quality of the chains is also impacted, and overall the model performances degrade.
how did we evaluate the quality of the chains?
qualitatively?
(or did i write that?)
I would put a shorter version of that in the caption as well
Yes, there are qualitative examples in the appendix
Agreed. Always tell people what they should think in the caption of a plot
ideally i would like error bars/ confidence region as well (which you can get via bootstrapping)
here are the things I'd like changed:
- make height smaller (half as large)
- make a dual y-axis and remove the legend
- make the font a lot bigger
- confidence regions on the plot
- x-axis label should match Figure 4 (Guidance Strength (CFG \gamma))
If you would like me to do that, send me the data and I'll take a look
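A rough sketch of what the requested changes could look like in matplotlib: larger font, a short figure, two y-axes instead of a legend, and shaded confidence regions. All numbers are hypothetical placeholders, not results from the paper, and the right-hand axis label "Invalid (%)" is an assumed name for the second metric.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

plt.rc("font", size=16)

# Hypothetical placeholder numbers, NOT results from the paper.
gammas = [1.0, 1.25, 1.5, 2.0]
accuracy = [30.0, 34.0, 33.0, 25.0]
accuracy_err = [2.0, 2.0, 2.0, 2.0]
invalid = [20.0, 12.0, 8.0, 7.0]
invalid_err = [1.5, 1.5, 1.0, 1.0]

def band(values, errors):
    """Lower and upper edges of a symmetric confidence region."""
    lower = [v - e for v, e in zip(values, errors)]
    upper = [v + e for v, e in zip(values, errors)]
    return lower, upper

# Short, wide figure: roughly half the default height.
fig, ax_acc = plt.subplots(figsize=(8, 3))
ax_acc.plot(gammas, accuracy, color="tab:blue")
ax_acc.fill_between(gammas, *band(accuracy, accuracy_err), color="tab:blue", alpha=0.2)
ax_acc.set_xlabel(r"Guidance Strength (CFG $\gamma$)")
ax_acc.set_ylabel("Accuracy (%)", color="tab:blue")

# Second metric on its own y-axis, so neither curve is squashed and no legend is needed.
ax_inv = ax_acc.twinx()
ax_inv.plot(gammas, invalid, color="tab:red")
ax_inv.fill_between(gammas, *band(invalid, invalid_err), color="tab:red", alpha=0.2)
ax_inv.set_ylabel("Invalid (%)", color="tab:red")

fig.tight_layout()
```

Calling `fig.savefig(...)` with a PDF path afterwards would produce a file suitable for `\includegraphics`. Coloring each axis label to match its curve stands in for the removed legend.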
Yes, but first of all we don't have it in other figures. Second it's going to take forever to do bootstrapping
How long does a run take
(A) if you have the data saved, bootstrap sampling just means resampling
(B) our other line graphs in the main body have confidence
Also if this is being done with the eval harness, it does bootstrapping for you. That's what the "acc_stderr" etc values are from.
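To illustrate point (A): if per-example 0/1 correctness is saved, a percentile bootstrap interval only requires resampling those saved values, with no extra model runs. This is a generic sketch, not the eval harness implementation; the function name, the seed, and the toy data are assumptions.

```python
import random

def bootstrap_ci(outcomes, iters=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of saved outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(iters):
        # Resample n examples with replacement from the saved results.
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * iters)]
    upper = means[min(iters - 1, int((1 - alpha / 2) * iters))]
    return lower, upper

# Hypothetical saved results: 60 correct answers out of 100.
outcomes = [1] * 60 + [0] * 40
low, high = bootstrap_ci(outcomes)
```

The cost is a few thousand list operations per CFG value, so the ~6h generation time is paid only once.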
It depends on the data but ~6h for one cfg value
It's with eval harness
cool
This is not bootstrapping as far as I understand the value
how many stds are common for error bars?
Why do you say that
i think central limit theorem says in large limits, 2 stds = bootstrap @ 95 confidence?
The code is here: https://github.com/EleutherAI/lm-evaluation-harness/blob/72b7f0c00a6ff94632c5b873fc24e093ae74fa47/lm_eval/metrics.py#L192
I'm not a statistician, but my understanding is that this is the right way to do a bootstrap CI
anyway, this shouldn't be a debate. good science means error bars and confidence regions
especially in the main body
Good science means correct error bars. If there's a contention that the error bars are meaningless that's worth addressing
Ok, I can add it
yeah i mean i'm just addressing the debate over whether to include them or not
I think Elad was against it when he thought that 6 hours was per CFG per iteration
fair enough
Under the reasonable assessment that we don't have 1,000 days to run stuff for
But yeah, the default setting is to run 1,000 iterations for the bootstrap CI
Yes, there is no problem to do that with this code. My only very small concern is that we don't provide it in all the other figures (in the appendix)
(this is also why the evals are slightly non-deterministic, the "score" we report is the median of the runs)
I was thinking of bootstrapping by running multiple experiments, which is going to take forever... nvm
Ok, so Alex do you want to send me the code I should use for the figures or you want me to produce the numbers for the figures?
No worries!
if you produce the numbers that would probably be easiest collectively for both of us
we're .5 page away from 10 pages
there is a heavily commented region of text with multiple "why is the needed" comments in 2.2:
```
Next, we need to define what is considered conditioning, $c$, in decoder-only language models. In the common situations, a user provides a \textit{prompt} $c$ which can be a context, an instruction, or the beginning of some text, and uses a language model to sample a sequence of continuation tokens $w_i$ for the prompt $c$. Since a good continuation is expected to highly correlate to the prompt, we consider the prompt as our conditioning.
```
I suggest cutting it
@patent gull Can you turn on link sharing so I can pass a copy to some people for their feedback?
yeah
it's on
I think i'll take a stab at mocking the table described here:
\textbf{New Figure: shows several examples of (prompt, initial segment, completion) triples. Examples should show diverse relationships between the prompt and the initial text, including ones where the prompt is at the beginning and ones where the prompt includes text after the question (``let's think about it step by step'' perhaps)}
and then bother people to fill in stuff for their sections
but i don't understand initial segment. Is that, like, CoT-related?
I addressed that in the comments in the sidebar. It's not important per se, but it's addressing arguments that people coming from CV would make, and imho it makes the text flow a bit better. I'm not too opinionated on this piece of text and will ultimately let the majority decide after reading my replies.
i see, i see... i see the difference between token-logits in NLP and vision semantic spaces
i think that's useful
i have a standing comment there about the word embedding and sentence embeddings sentence... i don't think that's useful, nor necessarily helpful in thinking about token logits
how is conditioning $ c$ in NLP different from conditioning in vision?
Yes. We're manipulating the logits and we pretend it's a semantic representation. People in CV would absolutely scream bc they could think it's the same as manipulating pixel space directly, which is horrendous
ah.. i see. ok so there's a field difference here that i'm not understanding/aware of
fair!
the later layers of image generators are never the ones you'd want to manipulate
Oh, this confused people so I meant to write up an explainer...
Since the models go from abstract -> pixel, later layers lose all their semantics and they're just local pixel descriptors. (It would be the opposite with a classifier, ofc, which is a mirror-inverse architecture going from pixel -> abstract.) That's why, coming from CV, I thought it was important to mention that the logit space in NLP is indeed still semantic
I don't understand this question. It's too vague.
Next, we need to define what is considered conditioning, $c$, in decoder-only language models. In the common situations, a user provides a \textit{prompt} $c$ which can be a context, an instruction, or the beginning of some text, and uses a language model to sample a sequence of continuation tokens $w_i$ for the prompt $c$.
I'm talking about this paragraph.... my impression was that conditioning with prompts was kinda the same thing between vision and nlp
also, yeah, i think this is really fair and maybe worth explicitly saying somehow. I need to think about it...
It's amusingly the opposite in NLP... it's less obvious that the latents are semantically meaningful!
Yeah... pixels don't bear any semantics on their own, but words do. I would expect a char level convnet generator / discriminator in NLP to display the exact same behavior
So in section 3, we break up (prompt, continuation) variations into the following overall framework.
Prompt is what is supplied by the user/dev. Continuation is generated by the model. All variations branch from there.
```
cot: <prompt, [cot, continuation]>
text-to-text: <long prompt, long continuation>
chatbot: <[system prompt, user prompt], continuation>
```
so the idea of initial segment as a third, distinct category doesn't really fit
So I'm picturing something like this
yeah... i would just prefer a breakdown that mirrors the structure of our paper... more comprehensible to readers
Or this
I think I agree, I'm trying to explain my mental model so I can understand the framing y'all're using
It's making a difference between the model's conditioning and the tasks' conditioning. The model's conditioning on prefix that naturally arises from the sequential autoregressive sampling might not align with the task: a user might want to generate a text that ends or contains some predefined conditioning text, or text in a specific style or something.
This is just to say that we align the task's conditioning to the model's conditioning by expressing it as a prefix. This might be too trivial.
alright here's the table... ultimately we may be able to lose the middle column if the example column is descriptive enough:
@versed flax can you fill in a good example for Assistant Prompting?
@blissful garden can you fill in a good example for code-gen?
@fallow egret can you fill in a good example from cot?
I can do basic prompting
Clearly the examples are not fitting in that last column lol.
Also, how about we write "prompt => completion" rather than "(prompt, completion)" which makes more obvious what the input and outputs are?
yeah i think we can fiddle with it a bit after it's all full
i definitely think we can lose that middle column
alright. did some polishing. gotta turn to my day job now
I provide two full examples in Tables 14-15, do we need more?
yeah
How many more?
Just add more tables in the same format?
no take a look at that table
and Vermifuge already put in one
should ideally be short
If we want the real prompt it's going to be problematic since it's few-shot. It's very very long
i see. yeah feel free to use '...', as well
Ok, np I can make it zero-shot, I hope it will not confuse the reader...
hmm i hope not either. let's see. there might be space to write "here's an example...." in the prompt, and then clarify in the caption. but we'll get a feel for it when we see it
Oh, I see we also need to write the reasoning, lol I don't know how to squeeze all this stuff
i think just put in what you feel is good and complete, don't worry about space right now
we'll massage and standardize once all the examples are in
("how many egg boxes to buy to have 24?", ("For a box of 12 eggs, that's 24/12=2", "The answer is 2"))
something like that?
Ok, sounds good. I thought we want a real example
they would be too long I guess
Yes, indeed sounds good. I just hope it's not more confusing for the reader
what's possibly confusing about it?
This example doesn't represent the task and reasoning complexity + the setting is few-shot
gotcha, that's fair
But it's a good solution to put such example...
If this is in response to my request for a figure, the key thing is that this doesn't make it obvious what the CFG is attaching to. You should state that explicitly in the figure
Damn, yes. Maybe we just need a figure like
gamma * LLM("The dragon flew over Paris, France, on Saturday evening when") + (1 - gamma) * LLM("on Saturday evening when")
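A toy sketch of the combination written above, assuming `LLM(...)` stands for next-token logits (whether logits or log-probabilities are used is not stated in the message). The three-holiday vocabulary and all logit values are made up for illustration.

```python
import math

def cfg_logits(cond, uncond, gamma):
    """gamma * cond + (1 - gamma) * uncond, written as a step from uncond toward cond."""
    return [u + gamma * (c - u) for c, u in zip(cond, uncond)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits over a toy vocabulary.
vocab = ["Bastille Day", "Christmas", "Independence Day"]
cond = [2.0, 1.5, 0.5]    # with the prompt "Today in France,"
uncond = [0.5, 1.5, 2.0]  # without the prompt

for gamma in (0.0, 1.0, 1.5):
    probs = softmax(cfg_logits(cond, uncond, gamma))
    print(gamma, dict(zip(vocab, (round(p, 3) for p in probs))))
```

With gamma = 0 the result is the unprompted distribution, gamma = 1 recovers ordinary prompted sampling, and gamma > 1 extrapolates past it, putting extra weight on tokens the prompt made more likely.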
I think color coding the text makes it pretty clear
This is a shitty screenshot of sublime, but captures the core idea
I understand your screenshots now
Which row should I fill in?
maybe none, actually
I was really surprised I couldn't find a good example of what I had in mind
for real. It's so obvious we all overlooked it
@versed flax the order of the labels
something like that, but pretty?
good catch
Oooo is this intending to be a latent space representation? I like it
Yes
oh I like this picture
slightly better
maybe worth putting in section 2?
(I know we are long but one good picture is better than a lot of words)
that's a big 0 for the subscript of x_0
hahaha true
I like this a lot
Needs a little prettying up
But it's really good
\caption{We show a 2D projection of a textual latent space $(x_0, x_1)$. We embed our text both with and without the prompt ``Today in France,'', and we walk from the promptless embedding in the direction to the prompted embedding with step size $\gamma$. Defining $\gamma>1$ overemphasizes the prompt, leading to better behavior and performance gains.}
Maybe put it in bold by the 1.5, as its being emphasized there
And in normal text by the 1?
like this?
"it" = "today in france"
ooooooooooh gotcha
So
y = 1.5 Today in France, citizens were celebrating
y = 1.5 Today in France, citizens were celebrating
y = 0 citizens were celebrating
I would definitely use the word "notional" or "hypothetical" or something like that in the caption, lest someone think we think this is what it actually looks like
I was doubting the interest of this table but I do think we're better without it. What it does is explain how we split Sec 3 which is imho not important enough to remain in the paper
I think it just came from a misinterpretation of Stella's point. Now that we have the actual meaning and did the right thing, I don't see the use of this
I futzed with the formatting a bunch and think this looks way better. Thoughts?
LaTeX god
Thank you so much! It's so much better
@loud adder , @patent gull After looking at the eval harness code, it doesn't apply bootstrap for the acc metric, it simply reports the std with respect to the 0/1 outcomes (if I understand correctly).
Are we fine with that?
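For reference, a sketch of the analytic alternative to bootstrapping for a 0/1 metric: the standard error of the mean, with roughly two standard errors on each side as an approximate 95% interval. This is a generic illustration under the normal approximation, not a copy of the harness code; the function name and data are assumptions.

```python
import math

def accuracy_with_stderr(outcomes):
    """Mean of 0/1 outcomes and the standard error of that mean."""
    n = len(outcomes)
    acc = sum(outcomes) / n
    # Sample variance of the per-example outcomes (n - 1 in the denominator).
    variance = sum((o - acc) ** 2 for o in outcomes) / (n - 1)
    return acc, math.sqrt(variance / n)

# Hypothetical saved results: 60 correct answers out of 100.
outcomes = [1] * 60 + [0] * 40
acc, stderr = accuracy_with_stderr(outcomes)

# Approximate 95% interval under the normal approximation (1.96 standard errors).
interval = (acc - 1.96 * stderr, acc + 1.96 * stderr)
```

For reasonably large test sets this interval and a percentile bootstrap interval should be close, which is the central limit theorem point raised earlier.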
Title page: I might remove the figure title? My main point of annoyance here is that it's not centered tbh.
Model surgery: "model editing" is a more common term, at least in NLP. As for whether there are techniques for doing this at inference time, we recently wrote a paper about it where we edit the entire Pythia suite. AFAIK this is the first example that's effective at scale: https://arxiv.org/abs/2306.03819
Table 2: I agree with @versed flax that this doesn't really accomplish what I was hoping to accomplish. I also think it's formally correct at the expense of being clear, and that if we do something like this it should be formatted like a NL document and not a tuple of strings
Let me remove the title, cite that paper and change the terminology, and comment that table (Alex will remove it if he agrees)
My main outstanding concerns relate to readability to NLP people, and I sent a copy to two who said they'd be able to provide feedback by the end of the day.
what could be obscure for nlp people? I don't know that culture much
It's less about obscurity and more about readability & communicating ideas effectively
I'm going to rewrite some of my FLOP appendix because we now have the main text.
remember: no chicken! lol
There are some areas I find a little weird, or at least not how I would have written it, but it's hard for me to tell if that's personal style, language differences, field cultural differences, or something else
I added space at the end of the paper for acknowledgments and (at the beginning of the appendix) author contributions. Other than CoreWeave for providing compute, is there anyone in particular we want to thank? People you showed the draft to and got useful feedback from, people who provided compute for experiments other than the pod I provided, etc.
I used the cw cluster and a little bit of the Stability aws cluster. Maybe we want to acknowledge Stability for the compute as well
Sure
And by "sure" I actually mean "we are contractually obligated to do so"
not on my side
some of my friends for taking part in the human evaluation 
We can absolutely thank the volunteers for our human experiments
@patent gull when you are free could you quickly glance Appendix C.2 (mostly the last paragraph) to see if I'm still missing anything?
I also capitalized all the "ANCOVA" because everybody seems to capitalize it. Also put "p" inside $ for the minor difference of fonts for math variables.
Another note: in Section 4, you wrote in the last paragraph "a P-sized model...". It seems the P is not used. Should it be removed?
lol can we thank all the volunteers by the random names we applied to them in the web interface?
They'll know
I was "rogue"
I forgot mine
Oh man hahaha
I will!!
@patent gull All the stat is here:
https://drive.google.com/drive/folders/1it9kW9BQhWg8YfzFHcc1or69iOBX61gP?usp=sharing
I used the eval harness std, although I don't like it, but whatever, if this is the convention then let's use it...
I also modified the captions of the figures.
If there is anything else on my side let me know
Table 2 was removed in the end?
Fine by me!! I thought the idea of outlining the different prompts was a good one but maybe they're commonplace enough to be generally known
I'm sorry haha I assume so. I'm not in front of the doc right now. I think it was a misinterpretation on my part
I'm not seeing it in the text currently. I think it was a good decision to remove it
When I'm sole first author I usually acknowledge my academic funding sources. I don't know what the protocol is here since it's a side project and I'm not sole first author
I also don't know what my funding sources want
What is the protocol y'all typically follow for this?
oh this reminds me that I have a grant too...... and I haven't spun the story yet and it will look unrelated... I need to figure out 
idk if I should thank my company who were also lenient enough to allow me some time to work on that while not being my mission at all
Hahaha
Usually that falls under having them as your affiliation
If you have a grant it depends on the grant terms. When I was a PhD student and got a grant from the ERC I was obligated to mention them even in such a case
Ok
Itās assumed that your employer is sponsoring your research. Acknowledgements are typically for non-obvious sources of support
And like @fallow egret says, it's often contractually obligated
yeah ERC is very strict. Swiss guys are chill. I used to ask SNSF guys about whether I'm obligated to report, and they said "up to you" 
I took a shot at drafting the Author Contributions appendix. I did it from unreliable memory. I invite everyone to read it and fix it. @blissful garden , @patent gull , you guys worked a lot together and I might have mixed some of your contributions. @loud adder , I genuinely have no idea how to properly phrase the supervising role you had and may have forgotten things. @fallow egret and @unique sedge, make sure I did not forget anything.
This is my first time drafting such a thing and I genuinely have no idea how to word it, which level of detail to go into, etc.
I'm completely fine with what is written. I think from a style perspective it should be more general without the specific details. But I also didn't write such section in any of my papers so I'm not sure
I did my best being fair and indeed that's maybe a bit too detailed. I'm waiting for the feedback of people more seasoned than I am
Some papers do, like Pythia
Yes, I see the RWKV paper also has such a detailed style. It looks like this is the Eleuther style
https://arxiv.org/pdf/2305.13048.pdf
okay maybe I shouldn't be that specific with section numbers
lgtm
currently? or if I get a bit more vague by removing the section numbers?
I'm saying the contribution part
oh. I'm happy I got it right :)
Oh I added C.1 to my part. I did C1-3 altogether
perfect!
3.1 is still unattributed. It's the standard benchmark section
the paper flies tomorrow
We will have time for changes until Monday:
https://info.arxiv.org/help/availability.html
wait, the paper doesn't go live as soon as posted??
I think that has a mixture of my old texts and Alex's stuff
nope
Oh we must have failed at communicating about this
Yeah it's weird. And it's made worse by the fact it skips a day: you'd think papers received by 2 pm EST friday would go out at 8 pm EST friday, but they don't for reasons I don't understand
Schedule tl;dr
- If the paper is submitted by 1800 UTC on Friday it goes out at the end of the day on Sunday
- If the paper is submitted by 1800 UTC on Monday it goes out at the end of the day on Monday
so we can do tomorrow 6pm, right?
Yes
Awesome! So this is our last night working on it :)
@versed flax Oh the legends of Pythia and GPT2 charts are missing
By the way I will be busy travelling internationally tomorrow. Hope we don't find last-minute thing related to my parts but just to say I might not be available for quite a while.
Doing some final proofreading right now
good catch (again)
just to make sure you caught up: we post tomorrow and it will be released Sunday. I wanted to make sure you read what happened after your message :) We don't have until Monday
phew just ending work. a lot to catch up on. what is needed from me? besides some proof-reading?
Appendix A, Author contributions :)
Just make sure I didn't forget something about you
haha ok i'll take a look, thanks man
uh dumb question: in Figure 1, what's the difference between \gamma = 1 and 1.5?
just the bolding? looks like the same output to me
yes
uhh i'm confused haha. I kinda glanced at the discussion around this plot, but i thought the point was to show that when we traversed towards higher \gamma, the generation changed
what's the point of it?
ohh wow someone did a lot of cutting... it's only 9.5 pages
Showing how we fiddle with the latent space. But you have a point
This is smart indeed
IMO, gamma=1.5 should be perfect, 1.1 should be not as perfect, and 1.0 should be blah
oh wait it starts at 0
ohhh this is showing literal prompt emphasis wow i'm dumb
yes
That's the caption lmao
but you have a point, it wouldn't hurt to show the continuations as well
gamma=0 => "citizen were celebrating summer"
gamma=1 => "Today in France, citizen were celebrating Christmas"
gamma=1.5 => "Today in France, citizen were celebrating Bastille Day"
something like that?
so the prompt is "today in france" and the continuation is "citizenS were celebrating..."?
i don't really get the point of gamma=0, we don't test on that, and why would we expect it to be even close to being on topic? I would expect total garbage from it
but anyway yeah, I would do something like:
gamma=1 "Today in France, the weather was decent in London" (i.e. meandering, definitely topic-switches)
gamma=1.1 "Today in France, the weather was good for citizens" (i.e. not great, kinda passable)
gamma=1.5 "Today in France, the citizens celebrated in good weather" (i.e. good, on-topic)
and then underline "Today in France" and we can update the caption to be way more explicit about each time-step
what
idk just a thought
how is "good weather" related to France?
idk i was thinking of showing something that would change topic by the end
I guess your gamma=1 isn't bad
how about
gamma=0 => "citizens were celebrating Independence Day"
gamma=1 => "Today in France, citizens were celebrating Christmas"
gamma=1.5 => "Today in France, citizens were celebrating Bastille Day"
haha ok, yeah that sounds good
but why would gamma=0 be like that at all, though
wouldn't we expect random generation?
No?
gamma =0 means completely unprompted
so it could easily be "chickens fly to trees"
or whatever
The prompt only is "Today in France"
"citizen were celebrating" is the beginning of a continuation
oh we're doing like a multipart prompt?
Ah, that was in the caption but got deleted
ok... so "citizens were celebrating" should be underlined... "Today in France" should be bigger/bolder each time
You're overthinking it wayyyy too much
and then in caption we should say "start of continuation is underlined", "prompt is bolded" according to strength
or something
honestly i think we shouldn't show gamma=0
we should just start the line at gamma=1 and underneath say (baseline)
that way it's clear that we're just improving above baseline
It's important.
we can do:
gamma=1 => "Today in France, citizens were celebrating July 4th"
gamma=1.1 => "Today in France, citizens were celebrating Christmas"
gamma=1.5 => "Today in France, citizens were celebrating Bastille Day"
But now I want to add
gamma=0.5 => "Today in France, citizens were celebrating Independence Day"
ah! great convergence lol
yah bc July 4th is illogical
whoops boss is calling me brb. those are my 2 cents for the figure
Haha I think that's great
But that is accurate, isn't it?
Like that's what would happen
That's a beautiful graphic imo
Haha no problem
Something is breaking at work so I need to brb but I'll be back later. Did a quick scan of the paper, it seems really good
how does the model know what day it is today
it doesn't?
He was making a joke
whoops
Pretending that the cause of the change of the holiday was changing what day the model thought it was
I guess I should go to bed and sleep, then haha
@fallow egret i'm redoing the plots now
Let me know if something missing
I'm looking over and it seems like the aqua plots are even more impressive
i have two questions:
- Was there any specific reason you put aqua in the appendix and gsm8k in main body?
- I'm thinking of putting them all in the main body, but only reporting one metric (probably accuracy). Is this OK? The metrics seem highly correlated. Is there any specific insights we get from % invalid?
- Not really. GSM8K is the more standard benchmark (it's bigger and appears in more previous works), but I don't think it's that important for the order
- Yes, they are not correlated for high CFG values, which I think is very important
hmmmmmm i see
For low values it does indeed correlate: you get more results and increased accuracy. However, for larger values, you still get the same high valid percentage but the accuracy breaks, which means the quality of the reasoning chains deteriorates
that seems to me like invalid % is a coarser metric
oh wait
well considering the confidence regions, it seems to me that invalid % stays pretty constant
how is invalid % calculated?
It's 1 if you get a parsed result (otherwise 0), and the invalid % is simply the % of non-parsed results: sum(res==0)/len(res)
i'm a bit confused. isn't accuracy strictly bounded by % invalid?
in Aqua Guanaco, how can there be more invalid, but also more correct?
i guess it's different portions of the dataset, but still, seems counterintuitive to me
what is more important for practitioners to be able to measure? An invalid answer or an incorrect answer? Can't we have heuristics to reject invalid answers? And then, what is the accuracy only on the valid answers? Do people look at that?
- If you have 20% invalid but all the answers in the remaining 80% are correct, you have 80% accuracy. On the other hand, if you have 10% invalid and only 50% of the remaining 90% are correct, then you have 45% accuracy
- We have a heuristic: as written in the paper, we follow the self-consistency parsing protocol
we follow their exact protocol both with respect to prompt and parsing protocol
i see so accuracy is also a function of % invalid
i guess i'm just wondering if there's a way to include both aqua and gsm8k in the main body, but only with accuracy. I guess it's an interesting point about the different CFG values, though
It's not precision.
it's num of correct answers (no matter valid/invalid) / length
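A small sketch of how the two metrics relate, reproducing the 80% vs 45% arithmetic from the example above. Representing an unparsable answer as `None` and the function name `summarize` are assumptions made for illustration; accuracy on valid answers only is included because it was asked about, not because the paper reports it.

```python
def summarize(parsed, gold):
    """Accuracy over all examples, invalid rate, and accuracy on valid answers only.

    parsed: parsed model answers, with None where parsing failed (assumed convention).
    gold: reference answers, same length as parsed.
    """
    n = len(gold)
    n_invalid = sum(p is None for p in parsed)
    n_correct = sum(p is not None and p == g for p, g in zip(parsed, gold))
    n_valid = n - n_invalid
    accuracy = n_correct / n  # invalid answers count as wrong
    invalid = n_invalid / n
    valid_accuracy = n_correct / n_valid if n_valid else 0.0
    return accuracy, invalid, valid_accuracy

gold = ["2"] * 20

# Case A: 20% invalid, every parsed answer correct.
case_a = [None] * 4 + ["2"] * 16
# Case B: 10% invalid, half of the parsed answers correct.
case_b = [None] * 2 + ["2"] * 9 + ["3"] * 9

print(summarize(case_a, gold))
print(summarize(case_b, gold))
```

Accuracy is capped at 1 minus the invalid rate, but a lower invalid rate does not imply higher accuracy, which is why the two curves can diverge at high guidance strengths.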
sure, completely understood
I was also wondering what is the correct way to do that. But I think the invalid metric is super important to explain what is happening, and the exact effect of the cfg
yeahh i see that...
hmmmm let me try one thing
ugh yeah it's really hard to see it working as a table...
acc and invalid % would probably do best stacked in parentheses, but we've established a different visual vocabulary for parentheses elsewhere. Also hard for the eye to really follow
Yes, I think graph is much more readable than table in this case
Yes, this might work