Evaluating Classifier-Free Guidance impact | EleutherAI | Page 2

unique sedge Jun 10, 2023, 9:20 AM

#

in generation it seems to top off before 1.5 for sure

#

can be aggressive

versed flax Jun 10, 2023, 12:28 PM

#

Programming lang conformance (when no lang is specified, any lang is considered ok)

cfg : acc
1   : 73.00 (73/100)
1.25: 86.00 (86/100)
1.5 : 81.00 (81/100)
1.75: 77.00 (77/100)

p-value cfg=1 vs cfg=1.25: 0.0105
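
The message above does not say which test produced the p-value. As a rough sanity check, here is a minimal sketch that assumes a pooled two-proportion z-test on the reported counts (73/100 vs 86/100). The function name is made up for illustration, and a different test (paired, exact, one- vs two-sided) would give a somewhat different number.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test. Returns (z, one_sided_p, two_sided_p)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal upper tail via the complementary error function.
    one_sided = 0.5 * math.erfc(z / math.sqrt(2))
    two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, one_sided, two_sided

# Counts taken from the table above: cfg=1 vs cfg=1.25.
z, p_one, p_two = two_proportion_z_test(73, 100, 86, 100)
print(f"z={z:.3f} one-sided p={p_one:.4f} two-sided p={p_two:.4f}")
```

Under these assumptions the one-sided value comes out close to, but not identical to, the reported 0.0105, so the original number was probably computed with a different test.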

versed flax Jun 10, 2023, 12:53 PM

#

breakdown

1
     | not code |   C   | Java | Python
-    |        9 |     9 |    6 |      1
C    |        3 |    19 |    3 |      0
Java |        5 |     0 |   19 |      1
Py   |        6 |     0 |    0 |     19

1.25
     | not code |   C   | Java | Python
-    |        4 |    11 |    9 |      1
C    |        4 |    19 |    2 |      0
Java |        2 |     0 |   23 |      0
Py   |        1 |     0 |    1 |     23

1.5
     | not code |   C   | Java | Python
-    |        6 |     8 |    8 |      2
C    |        5 |    18 |    2 |      0
Java |        3 |     0 |   22 |      0
Py   |        3 |     0 |    0 |     22

1.75
     | not code |   C   | Java | Python
-    |        6 |     6 |   10 |      1
C    |        8 |    16 |    1 |      0
Java |        2 |     0 |   23 |      0
Py   |        5 |     0 |    1 |     19
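
To connect the breakdown to the accuracy table, here is a small sketch that recomputes the conforming counts from the four tables. It assumes rows are the requested language ("-" meaning none requested) and columns are the detected output, and that with no language requested any code output counts as conforming, as stated earlier. The helper names are invented for this example.

```python
# Rows: requested language ("-" means none specified); columns: detected output.
COLUMNS = ["not code", "C", "Java", "Python"]
ROW_TO_COLUMN = {"C": "C", "Java": "Java", "Py": "Python"}

TABLES = {
    1.0:  {"-": [9, 9, 6, 1],  "C": [3, 19, 3, 0], "Java": [5, 0, 19, 1], "Py": [6, 0, 0, 19]},
    1.25: {"-": [4, 11, 9, 1], "C": [4, 19, 2, 0], "Java": [2, 0, 23, 0], "Py": [1, 0, 1, 23]},
    1.5:  {"-": [6, 8, 8, 2],  "C": [5, 18, 2, 0], "Java": [3, 0, 22, 0], "Py": [3, 0, 0, 22]},
    1.75: {"-": [6, 6, 10, 1], "C": [8, 16, 1, 0], "Java": [2, 0, 23, 0], "Py": [5, 0, 1, 19]},
}

def conforming_count(table):
    """Count generations that conform to the requested language."""
    total = 0
    for requested, counts in table.items():
        if requested == "-":
            # No language requested: any code output counts as conforming.
            total += sum(counts[1:])
        else:
            total += counts[COLUMNS.index(ROW_TO_COLUMN[requested])]
    return total

def sample_count(table):
    """Total number of generations tabulated for one cfg value."""
    return sum(sum(counts) for counts in table.values())

summary = {cfg: (conforming_count(t), sample_count(t)) for cfg, t in TABLES.items()}
for cfg, (ok, n) in summary.items():
    print(f"cfg={cfg}: {ok} conforming out of {n} tabulated")
```

With the numbers as posted, cfg=1 and cfg=1.25 reproduce the reported 73 and 86. The tables for 1.5 and 1.75 only contain 99 and 98 entries (the "-" rows sum to 24 and 23), so the recomputed counts of 80 and 75 fall one and two short of the reported 81 and 77.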

patent gull Jun 10, 2023, 2:55 PM

#

So if I’m understanding this correctly, it’s a greater indication that CFG is encouraging prompt adherence?

#

(Also I’m so sorry I’m late with results and generally absent from discussion the past few days. My girlfriend is in town visiting from far away so I’m not super close to my computer)

#

I totally agree, @loud adder , that this is a weird experiment. Just wasn’t entirely sure what to do to evaluate toxicity that didn’t involve human evaluation. Any other suggestions?

#

Was thinking of ways to make it more robust

#

Is the goal here to improve machine translation, or provide more insight that CFG increases prompt adherence?

unique sedge Jun 10, 2023, 3:19 PM

#

patent gull Is the goal here to improve machine translation, or provide more insight that CF...

Show generality of application

#

Showing it over one task isnt helpful, so testing it over generation tasks, over models

#

I think it should be generally more helpful in generation tasks

#

Since they tend to steer away

patent gull Jun 10, 2023, 3:20 PM

#

I see, definitely. Might be interesting to try for low resource languages, too

fallow egret Jun 10, 2023, 3:58 PM

#

patent gull I totally agree, @loud adder , that this is a weird experiment. Just w...

Why not use a strong model (GPT-3.5/4) to give a score?
I really like these metrics because they are easily reproducible given the prompt.

loud adder Jun 10, 2023, 4:16 PM

#

fallow egret Why not using strong model (GPT3.5-4), to give a score? I really like these metr...

They’re not easily reproducible though? The model underlying the API changes without documentation.

#

Also, do we have any reason to believe that GPT-4 is reliably good at this task?

versed flax Jun 10, 2023, 4:21 PM

#

(and that there aren't adversarial patterns)

fallow egret Jun 10, 2023, 4:24 PM

#

I used Self-Critique with GPT-3.5 for toxicity for some product. It is not perfect of course (this is still a big issue in the field), but it works very nicely.
Regarding the underlying model changes, yes, indeed this is a big issue

patent gull Jun 10, 2023, 4:24 PM

#

Yeah I generally don’t like the emerging strain of research that uses GPT4 to evaluate. But maybe the evaluation is worth layering on top. I don’t have a ton of research budget but I can do it for a sample of generations

tepid gazelle Jun 10, 2023, 4:30 PM

#

Why not use the Perspective API over GPT4? It has its own set of issues (https://arxiv.org/abs/2304.12397 ) but is at least much more commonly used than asking GPT-4

arXiv.org

On the Challenges of Using Black-Box APIs for Toxicity Evaluation i...

Perception of toxicity evolves over time and often differs between
geographies and cultural backgrounds. Similarly, black-box commercially
available APIs for detecting toxicity, such as the Perspective API, are not
static, but frequently retrained to address any unattended weaknesses and
biases. We evaluate the implications of these changes on t...

unique sedge Jun 10, 2023, 4:30 PM

#

patent gull Is the goal here to improve machine translation, or provide more insight that CF...

This makes me wonder and think about 1 shot performance

#

I’ll try those experiments at the end

loud adder Jun 10, 2023, 4:59 PM

#

tepid gazelle Why not use the Perspective API over GPT4? It has its own set of issues (https:/...

I was under the impression that that’s what was happening, and that GPT-4 was a suggested alternative

tepid gazelle Jun 10, 2023, 5:02 PM

#

loud adder I was under the impression that that’s what was happening, and that GPT-4 was a ...

ah apologies, i saw something about perplexity and so wasn’t sure

patent gull Jun 10, 2023, 5:04 PM

#

No, RealToxicityPrompts are natural snippets from the web that have been scored using the Perspective API for toxicity. I’m selecting the top 10% most toxic of them (~800) and the 10% least toxic

#

I’m calculating the average perplexity of generating the toxic (or non-toxic) continuation under higher values of CFG with various detoxifying system prompts like “de-escalate the violence in this chat”

#

We expect that if CFG is really forcing more detoxification, we’ll see higher ppl for the more toxic continuations, and unchanged ppl for the less toxic. Of course there are lots of confounders here and we’ll also try to explore a bit to make it robust against these

#

But yeah that was my attempt to address the “improve assistants” aim of this paper, which was an original goal at the outset

#

So yes perspective API is being used, but only indirectly

loud adder Jun 10, 2023, 5:14 PM

#

@patent gull Why do this instead of the standard RTP evaluation?

#

(Unless this is the standard one?)

patent gull Jun 10, 2023, 5:38 PM

#

I will read more. I had seen this approach used several times but idk if there is a standard

#

Maybe elad knows, he’s typing 😉

fallow egret Jun 10, 2023, 5:47 PM

#

I'm typing results 🙂
Results for gsm8k:

Keeping the correct answer structure:
CFG-1: 72.17%
CFG-1.5: 83.33%
Giving the correct results (baseline extraction):
CFG-1: 17.2%
CFG-1.5: 18.2%
Allowing a little bit more flexibility (simply allowing $ and % prefix)
CFG-1: 23.2%
CFG-1.5: 21.5%
Adding a little bit more heuristics
CFG-1: 31%
CFG-1.5: 28.1%

Now running CFG-1.1 and 1.25

loud adder Jun 10, 2023, 6:23 PM

#

@fallow egret can you contextualize these numbers? Is a +3% on this benchmark a big deal, or not?

fallow egret Jun 10, 2023, 6:39 PM

#

loud adder @fallow egret can you contextualize these numbers? Is a +3% on this ben...

Sorry, I fixed it. It is 3% gain for the CFG-1 not the CFG-1.5 (as discussed before, CFG-1.5 degrades the performance but preserves the structure better).
Yes, I think that 3% is a big gap (it is a challenging benchmark). For comparison, an ensemble of 40 prompts results in only a 2% improvement (LaMDA-137B)

loud adder Jun 10, 2023, 6:42 PM

#

fallow egret Sorry, I fixed it. It is 3% gain for the CFG-1 not the CFG-1.5 (as discussed bef...

Do you have per-question results?

fallow egret Jun 10, 2023, 6:42 PM

#

loud adder Do you have per-question results?

Yes

loud adder Jun 10, 2023, 6:42 PM

#

I could see us wanting to calculate P(correct | well-formatted) or something like that

#

Cool

fallow egret Jun 10, 2023, 6:49 PM

#

For this we have the results. I think the interesting missing part is:
P(correct | not well-formatted) and P(incorrect but contains the correct answer in the chain)
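
The conditionals discussed here are simple count ratios once per-question results are available. Below is a minimal sketch; the record fields and the toy data are invented for illustration and are not real gsm8k outputs. It reads the last quantity as "share of incorrect answers whose chain still contains the reference answer", which is one possible interpretation.

```python
from dataclasses import dataclass

@dataclass
class Result:
    well_formatted: bool   # answer follows the expected structure
    correct: bool          # extracted answer matches the reference
    answer_in_chain: bool  # reference answer appears somewhere in the reasoning

def conditional(results, event, given):
    """Estimate P(event | given) as a ratio of counts; None if the condition never holds."""
    pool = [r for r in results if given(r)]
    if not pool:
        return None
    return sum(1 for r in pool if event(r)) / len(pool)

# Made-up toy records, NOT real gsm8k outputs.
toy = [
    Result(True, True, True),
    Result(True, False, True),
    Result(True, False, False),
    Result(False, True, True),
    Result(False, False, True),
    Result(False, False, False),
]

p_correct_given_formatted = conditional(toy, lambda r: r.correct, lambda r: r.well_formatted)
p_correct_given_unformatted = conditional(toy, lambda r: r.correct, lambda r: not r.well_formatted)
p_in_chain_given_incorrect = conditional(toy, lambda r: r.answer_in_chain, lambda r: not r.correct)
print(p_correct_given_formatted, p_correct_given_unformatted, p_in_chain_given_incorrect)
```

Running the same three ratios per CFG value would show whether the structure gain at CFG-1.5 comes with a lower accuracy among well-formatted answers.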

loud adder Jun 10, 2023, 7:44 PM

#

fallow egret For this we have the results, I think the interesting missing part is: is P(cor...

Yeah I agree, I just typed the first conditional that came to mind as an example

unique sedge Jun 11, 2023, 8:10 AM

#

mt0 - 0 shot

prompt: If the original version says: Espionnage de la NSA: les Etats-Unis sont "allés trop loin", admet Kerry ; then the English version should say:

CFG=1 |wmt14-fr-en| 0|bleu |29.7692|± |0.3203|
CFG=1.1 |wmt14-fr-en| 0|bleu |29.7692|± |0.3203|
CFG=1.2 |wmt14-fr-en| 0|bleu |29.7692|± |0.3203|
CFG=1.46 |wmt14-fr-en| 0|bleu |29.7692|± |0.3203|
CFG=1.7 |wmt14-fr-en| 0|bleu |29.7692|± |0.3203|
CFG=2. |wmt14-fr-en| 0|bleu |29.7692|± |0.3203|

#

the idea was to see, if cfg helped multitask prompt tuned models (as mt0 is)

fallow egret Jun 11, 2023, 8:17 AM

#

unique sedge mt0 - 0 shot prompt: If the original version says: Espionnage de la NSA: les Et...

It is exactly the same numbers, maybe a bug?

unique sedge Jun 11, 2023, 8:17 AM

#

fallow egret It is exactly the same numbers, maybe a bug?

ya it is

#

so this occurred with another model and @versed flax mentioned that it can happen when the cfg parameter isnt enough to steer the model to a different head token

#

because im using the same codebase for all other experiments too

#

i will do the experiments once more if there is some doubt though berk

fallow egret Jun 11, 2023, 8:19 AM

#

I'm a little bit surprised that the high CFG doesn't make it worse

unique sedge Jun 11, 2023, 8:20 AM

#

fallow egret I'm a little bit surprised that the high CFG doesn't make it worse

true, same

unique sedge Jun 11, 2023, 8:22 AM

#

fallow egret I'm a little bit surprised that the high CFG doesn't make it worse

im trying 3

#

to see if it does something

fallow egret Jun 11, 2023, 8:44 AM

#

Getting nice improvement when reducing the CFG values!
CFG-1: 23.2%
CFG-1.1: 25.8%
CFG-1.25: 24.7%
CFG-1.5: 21.5%

If anyone has an idea for a strong open-source model for CoT, I would love to add another model to the experiments (unfortunately all the models I'm finding in the literature are closed)
Another experiment I will do is to compare it to a self-ensemble approach (so we will have some baseline)

unique sedge Jun 11, 2023, 8:55 AM

#

unique sedge im trying 3

|wmt14-fr-en| 0|bleu |29.7692|± |0.3203|

#

hasnt changed

#

im surprised

versed flax Jun 11, 2023, 9:40 AM

#

unique sedge so this occurred with another model and @versed flax mentioned that it c...

Well with cfg values this high, my hypothesis is that this model uses a different code path which hasn't been cfg patched

unique sedge Jun 11, 2023, 9:42 AM

#

versed flax Well with cfg values this high, my hypothesis is that this model uses a differen...

but we've been getting different values in other models right

#

should i try another prompt-tuned model

versed flax Jun 11, 2023, 9:44 AM

#

unique sedge but we've been getting different values in other models right

Yeah but there are some intricacies with the codebase and it seems that not all models use the same code paths. Honglu already reported something about that

unique sedge Jun 11, 2023, 9:44 AM

#

oh this could be a seq2seq model thing then

#

mt0 is a seq2seq model, all other experiments were on hf-causal-experimental

versed flax Jun 11, 2023, 9:45 AM

#

unique sedge oh this could be a seq2seq model thing then

That certainly makes sense to me

unique sedge Jun 11, 2023, 10:10 AM

#

okay im dumb berk , thought the mt0 model was biased towards prompts too much already

versed flax Jun 11, 2023, 10:25 AM

#

unique sedge okay im dumb berk , thought the mt0 model was biased towar...

Nah you would see some variations at least.

#

(and disastrous results for high cfg)

versed flax Jun 11, 2023, 10:59 AM

#

fallow egret Getting nice improvement when reducing the CFG values! CFG-1: 23.2% CFG-1.1: 25...

Ahhh that's great news!!

fallow egret Jun 11, 2023, 11:00 AM

#

versed flax Ahhh that's great news!!

Yes, indeed

unique sedge Jun 11, 2023, 12:07 PM

#

versed flax That certainly makes sense to me

okay you were right, saw this and made changes for the seq 2 seq model

#

will try my experiments again

#

berk thank you for catching it

unique sedge Jun 11, 2023, 5:10 PM

#

unique sedge |wmt14-fr-en| 0|bleu |29.7692|± |0.3203|

CFG 1.00 |wmt14-fr-en| 0|bleu |29.7692|± |0.3203|
CFG 1.05 |wmt14-fr-en| 0|bleu |29.4076|± |0.3136|
CFG 1.10 |wmt14-fr-en| 0|bleu |27.7893|± |0.3143|

#

doesnt help in the case of prompt tuned models (not entirely sure if its a prompt tuned thing or a seq2seq thing, but inclining towards the former)

loud adder Jun 11, 2023, 5:49 PM

#

unique sedge doesnt help in the case of prompt tuned models (not entirely sure if its a promp...

From our POV, the ideal scenario would be that this is because the prompt-tuned models already behave “the same way that CFG is trying to make them behave, but even more so”

unique sedge Jun 11, 2023, 5:52 PM

#

The only way to test that is like, take a model tuned and untuned (afaik t5/t0 and bloom/bloomz) and then do a bunch of simple prompt experiments with cfg and on bloomz. I only mean this in a qualitative way

#

Mt0 does wipe the floor on task performance in terms of scores

loud adder Jun 11, 2023, 5:52 PM

#

The easy way to make progress on this question might be to run both on a variety of tasks and try to correlate their answers?

#

Like, if CFG changes the answer (compared to the base model) does it tend to make it more like the tuned model

unique sedge Jun 11, 2023, 5:55 PM

#

loud adder Like, if CFG changes the answer (compared to the base model) does it tend to mak...

Yeah this question is not the easiest to answer. Since in lets say the case of wmt14 fr-en, there's like a 10 point rouge score difference between the models with cfg applied. We might have to cherry pick a little, or take their outputs and run correlation

loud adder Jun 11, 2023, 5:56 PM

#

unique sedge Yeah this question is not the easiest to answer. Since in lets say the case of w...

It’s much easier for LLH tasks than generative tasks

#

For “generative” tasks I would probably try to look at the logit distributions and ignore the actual generations (ironic, ik)

unique sedge Jun 11, 2023, 7:31 PM

#

bloom3b 1.1 |wmt14-fr-en| 0|bleu |15.8055|± |0.4413|

#

so like, in wmt14 fr-en across all models except prompt tuned

#

have seen improvement

#

will try summarization next

#

and maybe do this 1 shot as well

#

replied to the wrong text sorry

#

.

versed flax Jun 11, 2023, 8:06 PM

#

So let's think step by step about what would it mean when a model behaves negatively with CFG.

Let's write CFG as P(x) + cfg * (P(x|y) - P(x)). A negative impact of CFG would mean that P(x) -> P(x|y) does not point in the "right" direction. Which it should, unless 1) the model sucks very much and P(x|y) is bad, or 2) P(x) is terrible.

I can see (2) happening if an instruction tuned model gets extremely confused without an instruction prompt and produces gibberish. It's easy to test and if that's true, we can probably find a "null" prompt, y0, so that P(x|y0) behaves like we would expect P(x) to.

Or P(x|y) is already at the optimal peak of the loss function. That is, cfg=1 is optimal for L(P(x) + cfg * (P(x|y) - P(x))). We can easily check that cfg=1 is the best by trying 0.9 0.95 1 1.05 1.1 and showing that 1 is best. But I can't think of a way to find other directions to compare to.

[EDIT: continuation]
In diffusion models what we observe is that the model tends to loosely interpret the prompt and pull back to P(x), to the training set density. To restate my example, "a robot duck" would produce a robot or a duck but not a "robot duck". CFG allows us to counteract this drag, and really reach the "robot duck" by pushing away from P(x), in the direction of the prompt.

~~What I think I observed in Stable Diff is that indeed finetunes and LoRAs need close-to-1 CFG values~~ (this is wrong) because indeed P(x|y) is not "atypical" for the y prompts. Usually you'd use a LoRA for a certain type of images/prompts and use prompts in that realm, so there is no significant difference between P(x) and P(x|ys of interest), and CFG indeed does nothing I suppose.

The argument could be that instruction tuning a model does indeed shift its P(x) to the domain of fancy prompting, and CFG is useless. => Validates Stella's argument, which I think I now agree with.
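
For readers following along, the expression P(x) + cfg * (P(x|y) - P(x)) can be sketched on a toy next-token distribution. This is only an illustration under stated assumptions: the combination is applied in log-probability space and then renormalized (the message writes it directly on P), the function names are made up, and the logits are invented numbers.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def cfg_distribution(cond_logits, uncond_logits, cfg):
    """Apply P(x) + cfg * (P(x|y) - P(x)) in log-probability space, then renormalize."""
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [u + cfg * (c - u) for c, u in zip(cond, uncond)]
    return [math.exp(v) for v in log_softmax(mixed)]

# Toy 4-token vocabulary; the numbers are invented.
cond_logits = [2.0, 1.0, 0.5, -1.0]    # with the prompt y
uncond_logits = [1.0, 1.2, 0.8, 0.0]   # without the prompt

for cfg in (0.0, 1.0, 1.5):
    probs = cfg_distribution(cond_logits, uncond_logits, cfg)
    print(cfg, [round(p, 3) for p in probs])
```

In this toy case cfg=0 recovers P(x), cfg=1 recovers P(x|y), and cfg=1.5 puts more mass on the token the prompt already favoured, which is the "pushing away from P(x)" picture described above.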

fallow egret Jun 11, 2023, 8:41 PM

#

versed flax So let's think step by step about what would it mean when a model behaves _negat...

I'm not sure I understand what you are trying to prove

versed flax Jun 11, 2023, 8:42 PM

#

fallow egret I'm not sure I understand what you are trying to prove

nothing exactly, just shooting a few ideas, thinking out loud

#

trying to understand why paws experienced a decrease in accuracy with the instruction tuned models, and what happens with instruction models

fallow egret Jun 11, 2023, 8:59 PM

#

I'm not sure it's related to the instruction tuned model. In the end, as far as I understand it, on this task CFG affects the balance between the coherence of the translated text and the 'stickiness' to the original sentence which is given by the prompt (direct translation word by word).
I think that if we look at a few examples we can easily verify/refute this claim

loud adder Jun 11, 2023, 9:12 PM

#

I’m generally quite skeptical of anything measured in BLEU, FWIW

loud adder Jun 11, 2023, 9:15 PM

#

versed flax So let's think step by step about what would it mean when a model behaves _negat...

That’s roughly my intuition, yeah. I didn’t know that about finetunes of SD, but it makes sense.

#

I have an idea for an experiment brewing. I’m off to a pottery class but can share later

unique sedge Jun 11, 2023, 9:19 PM

#

Should try and see if some human eval is possible

versed flax Jun 11, 2023, 10:01 PM

#

loud adder That’s roughly my intuition, yeah. I didn’t know that about finetunes of SD, but...

After checking, it seems that I overgeneralized from a specific model that needed low CFGs. Checking galleries for various models on CivitAI, it seems that even for finetunes, CFG > 6 is the norm.

loud adder Jun 12, 2023, 2:42 PM

#

> So let's think step by step about what would it mean when a model behaves negatively with CFG.
>
> Let's write CFG as P(x) + cfg * (P(x|y) - P(x)). A negative impact of CFG would mean that P(x) -> P(x|y) does not point in the "right" direction.

Let's slow down a moment. LLMs and diffusion models work a bit differently, and it's important that that nuance not get lost

patent gull Jun 12, 2023, 3:05 PM

#

loud adder I have an idea for an experiment brewing. I’m off to a pottery class but can sha...

let me know if you need help with this, I'm relatively free to help!

patent gull Jun 12, 2023, 4:16 PM

#

hi who is using all the GPUs on the cluster? Can you restrict your GPU usage?

#

whoever is running the gsm8k with the WizardLM-30B-Uncensored-Guanaco-SuperCOT... I'm assuming that's @fallow egret ? What's the progress on that?

fallow egret Jun 12, 2023, 5:34 PM

#

patent gull whoever is running the `gsm8k` with the `WizardLM-30B-Uncensored-Guanaco-SuperCO...

Yes, I'm now testing it against ensemble

#

You need me to release something? There is lots of space if you run something with 'auto' assignment

loud adder Jun 12, 2023, 6:03 PM

#

loud adder > So let's think step by step about what would it mean when a model behaves nega...

This was supposed to be the intro to a series of comments but I had to take my cat to the ER :/

versed flax Jun 12, 2023, 6:04 PM

#

Oh 😿 How's the little furry thing doing :(?

loud adder Jun 12, 2023, 6:05 PM

#

Very poorly

#

Going to stay overnight at least tonight

versed flax Jun 12, 2023, 6:08 PM

#

Yes. Your little friend needs all the love you can give. Take care of the cat, we'll wait.

patent gull Jun 12, 2023, 6:29 PM

#

oh no!! so sorry to hear that!!

patent gull Jun 12, 2023, 6:31 PM

#

fallow egret You need me to release something? There is lots of space if you run something wi...

the issue is that RAM is limited on this cluster, so my processes are getting killed just loading in a 3B model onto CPU, before getting it onto the GPU (see vermifuge's and stella's convo on this)

#

it's ok — i can wait. just wondering how much longer?

loud adder Jun 12, 2023, 6:31 PM

#

@patent gull I increased the RAM so that models up to 20B should work

#

Oh are you on the old pod

patent gull Jun 12, 2023, 6:32 PM

#

ah :/ yes

#

the 3B model typically does load

#

it's ok — if elad is almost done, i can wait

loud adder Jun 12, 2023, 6:40 PM

#

@patent gull I mean, it's currently not being used: ssh [email protected]

patent gull Jun 12, 2023, 6:44 PM

#

wow!!!! what a machine! thank you!!!

patent gull Jun 12, 2023, 7:15 PM

#

We've been trying to get some explanatory insights into what CFG is actually doing. Maybe folks here have an idea of what else to look at?

I generated 20 different sequences of text with 3 different models (Pythia, Llama, GPT4All), and tracked the logits at each step. Turns out that CFG consistently decreases the entropy of the generated text:

#

#

This had the effect of decreasing the # of tokens in the top_p=90%:
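
Since the plots are not reproduced here, the two quantities being tracked can be sketched as follows: the Shannon entropy of the next-token distribution, and the number of tokens needed to cover top_p=90% of the mass. This is an assumed reading of the measurement, with invented helper names and toy distributions rather than real model outputs.

```python
import math

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def nucleus_size(probs, top_p=0.9):
    """Smallest number of tokens whose cumulative probability reaches top_p."""
    total = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        total += p
        if total >= top_p:
            return count
    return len(probs)

# Invented distributions over a 5-token vocabulary.
flat = [0.28, 0.24, 0.2, 0.16, 0.12]     # stands in for a vanilla step
peaked = [0.8, 0.12, 0.05, 0.02, 0.01]   # stands in for a sharper, guided step

print(entropy(flat), nucleus_size(flat))
print(entropy(peaked), nucleus_size(peaked))
```

A sharper distribution has both lower entropy and a smaller nucleus, so the two observations would be expected to move together.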

#

loud adder Jun 12, 2023, 7:16 PM

#

Strange Q I should have asked earlier and was probably distracted by the charts going up to think about seriously:

What is the "classifier" we are free of?

patent gull Jun 12, 2023, 7:17 PM

#

loud adder Strange Q I should have asked earlier and was probably distracted by the charts ...

not speaking from a place of deep knowledge here, but I interpreted that as the classifier/discriminator that is used in controlled text generation

loud adder Jun 12, 2023, 7:18 PM

#

patent gull not speaking from a place of deep knowledge here, but I interpreted that as the ...

We don't have one of those when generating from a decoder-only LLM

patent gull Jun 12, 2023, 7:18 PM

#

yeah we do... controlled text gen decomposes p(x | a) \prop p(x) p(a|x)

#

p(a|x) is the classifier, p(x) is the decoder-only model

loud adder Jun 12, 2023, 7:19 PM

#

If you put it in dollar signs a bot will post the LaTeX as an image

patent gull Jun 12, 2023, 7:19 PM

#

oh that's cool!! Thanks for the tip

loud adder Jun 12, 2023, 7:19 PM

#

$p(x | a) \propto p(x) p(a|x)$

patent gull Jun 12, 2023, 7:20 PM

#

whoops my bad lol

vital pondBOT Jun 12, 2023, 7:20 PM

#

Stella Biderman (she/her)

patent gull Jun 12, 2023, 7:20 PM

#

yes that is the correct latex.... my bad... was typing pseudocode

#

anyway, yes, i assumed CV had similar approaches to guidance

loud adder Jun 12, 2023, 7:22 PM

#

So, the thing about guidance in CV is that it's part of the training procedure

#

GANs have a nice mechanism to incorporate class labels into the generated samples in the form of class-conditioned normalization and classifier-like discriminators

versed flax Jun 12, 2023, 7:25 PM

#

(I'm in a D&D game rn I should pay attention to, but Alex is right)

loud adder Jun 12, 2023, 7:26 PM

#

p(a | x) is p(prompt | completion)

patent gull Jun 12, 2023, 7:26 PM

#

no sorry, i was using controlled generation terminology

#

a = attribute (i.e. "non-toxic", "happy")

#

x = text you wish to generate

versed flax Jun 12, 2023, 7:27 PM

#

patent gull a = attribute (i.e. "non-toxic", "happy")

(or any class said classifier knows about, for classifier guidance)

strange magnet Jun 12, 2023, 7:27 PM

#

I assume by analogy the classifier would be a classifier used to assist in decoding from a language model

loud adder Jun 12, 2023, 7:28 PM

#

So the diagram from the CAD paper is instructive I think here

versed flax Jun 12, 2023, 7:28 PM

#

strange magnet I assume by analogy the classifier would be a classifier used to assist in decod...

In CV the first versions used an imagenet classifier, then we used CLIP, then CFG came and we ditched the classifiers

strange magnet Jun 12, 2023, 7:28 PM

#

yeah CFG in vision is a bit of confusing terminology because it refers to a classifier that, in diffusion models, we don't need any more

versed flax Jun 12, 2023, 7:29 PM

#

strange magnet yeah CFG in vision is a bit of confusing terminology because it refers to a clas...

Well that's why it's classifier free

loud adder Jun 12, 2023, 7:29 PM

#

In their notation we sample a generation y which is a continuation of x relative to some context c

#

answer ~ p(y | x, c)

#

But we are getting rid of c / combining it with x?

patent gull Jun 12, 2023, 7:31 PM

#

we keep the idea of context in CFG setup... except context = prompt for what we've explored

loud adder Jun 12, 2023, 7:32 PM

#

Eh, that's not quite true... x starts off as the empty set but eventually becomes "all previously generated tokens"

patent gull Jun 12, 2023, 7:32 PM

#

right. So the "guidance" isn't in the form of a classifier pushing us towards some class label, but it's an upweighting of the prompt

loud adder Jun 12, 2023, 7:33 PM

#

patent gull we keep the idea of context in CFG setup... except `context = prompt` for what w...

So we were linearly interpolating between p(continuation | prompt) and p(continuation)

patent gull Jun 12, 2023, 7:33 PM

#

yes

#

and also extrapolating

robust grove Jun 12, 2023, 7:33 PM

#

That vaguely corresponds to CFG between 0-1, which funnily enough does nothing

loud adder Jun 12, 2023, 7:33 PM

#

(from a LLM POV this is the wrong way to talk about things which is slowing down my brain)

robust grove Jun 12, 2023, 7:34 PM

#

Tbh the diffusion analogy isn't helping me understand this idea because every component is so different

loud adder Jun 12, 2023, 7:35 PM

#

When the generation is long enough, the prompt gets pushed out of the context window

#

In this case, what does it mean to have conditioning on the prompt, mathematically?

versed flax Jun 12, 2023, 7:39 PM

#

loud adder In this case, what does it mean to have conditioning on the prompt, mathematical...

I'm really not sure tbh.

patent gull Jun 12, 2023, 7:39 PM

#

that's a really interesting question.

(a) I don't think any of the benchmarks rely on generation that's so long it requires a shifting context window, so I don't think we've addressed that question empirically.

(b) $ a p(x_n | prompt, x_{<n}) + (1 - a) p(x_n | x_{<n})$ ~~I'm pretty sure this collapses, because we just get 0~~ no, we just get the unconditional distribution

loud adder Jun 12, 2023, 7:39 PM

#

(Falling back on my mathematics training and going “I’m confused, let’s look at really extreme examples”)

loud adder Jun 12, 2023, 7:41 PM

#

patent gull that's a really interesting question. (a) I don't think any of the benchmarks ...

Not sure why this didn’t trigger the bot, maybe write it in `code formatting` as I can read latex well

patent gull Jun 12, 2023, 7:41 PM

#

a p(x_n | prompt, x_{<n}) + (1 - a) p(x_n | x_{<n})

loud adder Jun 12, 2023, 7:41 PM

#

$a p(x_n | prompt, x_{<n}) + (1 - a) p(x_n | x_{<n})$

vital pondBOT Jun 12, 2023, 7:41 PM

#

Stella Biderman (she/her)

loud adder Jun 12, 2023, 7:42 PM

#

I think it ignores anything with a space after the $ to avoid triggering the bot too often

patent gull Jun 12, 2023, 7:42 PM

#

gotchaaa my bad.. I'll redo

#

$a p(x_n | prompt, x_{<n}) + (1 - a) p(x_n | x_{<n})$

vital pondBOT Jun 12, 2023, 7:42 PM

#

Alex Spangher

patent gull Jun 12, 2023, 7:43 PM

#

so for n> max_len, the first and second terms are equal, so we get the same distribution we would get without CFG

#

but x_1, ... x_n were generated via a CFG process, so it's still different than it would've been without CFG. Maybe as n -> infinity, the distribution converges to what it would've been without CFG

loud adder Jun 12, 2023, 7:45 PM

#

That would surprise me. LLMs are typically not very robust to conversation history

loud adder Jun 12, 2023, 7:49 PM

#

vital pond **Alex Spangher**

So one issue with this is that when we sample from a LLM normally we aren’t sampling from $p(x_n|x_{<n})$ ever. So it feels less like we are pushing the model in the direction of the prompt than it does like we are pushing the model away from $p(x_n|x_{<n})$

vital pondBOT Jun 12, 2023, 7:49 PM

#

Stella Biderman (she/her)

loud adder Jun 12, 2023, 7:52 PM

#

As the generation continues, we would expect $p(x_n|x_{<n})$ to improve and eventually approximate $p(x_n|p, x_{<n})$ (maybe?). That’s because it’s easier to fill in the last word of an answer without looking at the question than it is to write the first word of an answer without looking at the question

vital pondBOT Jun 12, 2023, 7:53 PM

#

Stella Biderman (she/her)

loud adder Jun 12, 2023, 7:59 PM

#

Which begs the obvious question of “what happens if you stop using guidance after n tokens”
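
One way to picture the "stop using guidance after n tokens" experiment is a per-step guidance scale that falls back to 1.0, where the CFG expression reduces to plain conditional sampling. The sketch below is purely hypothetical: the function name and the `cutoff` parameter are invented here and nothing in the thread says this was implemented.

```python
def guidance_scale(step, cfg, cutoff):
    """Return cfg for the first `cutoff` generated tokens, then 1.0 (no guidance)."""
    return cfg if step < cutoff else 1.0

# Hypothetical schedule: guide the first 3 tokens of a 6-token generation.
scales = [guidance_scale(step, cfg=1.5, cutoff=3) for step in range(6)]
print(scales)
```

Sweeping `cutoff` while holding cfg fixed would show how much of the effect comes from the first few tokens.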

fallow egret Jun 12, 2023, 8:00 PM

#

This is why the CFG values are very low for text completion (1.1). Increasing it degrades the performance very quickly (as you mentioned, it 'pushes away').
The main impact of CFG is preserving the prompt structure. For getting the correct structure it is highly important to look at the question structure and not at the answer's previous tokens

versed flax Jun 12, 2023, 8:04 PM

#

loud adder Which begs the obvious question of “what happens if you stop using guidance afte...

@patent gull if you still have the scripts that produced the plots, can you plot (P(x|y) - P(x))**2 wrt index?

patent gull Jun 12, 2023, 8:05 PM

#

sorry for the delay.. this is a dumb question, but

> normally we aren't sampling from p(x_n | x_{<n})

do you mean because of top-p, top-k, temperature, etc.?

loud adder Jun 12, 2023, 8:05 PM

#

Did we run TriviaQA?

patent gull Jun 12, 2023, 8:05 PM

#

sure thing.. (P(x|y) - P(x))**2... oh you mean the l2_norm?

versed flax Jun 12, 2023, 8:05 PM

#

yes

loud adder Jun 12, 2023, 8:06 PM

#

(I happen to have some numbers about the impact of prompt engineering on TriviaQA performance on hand)

versed flax Jun 12, 2023, 8:06 PM

#

loud adder Did we run TriviaQA?

The new (greedy_until()) version? Only with GPT2. The plots have the same shapes but the absolute values are lower overall

unique sedge Jun 12, 2023, 8:07 PM

#

some initial one shot experiments, bloom3b, wmt14 fr_en
CFG=1. |wmt14-fr-en| 0|bleu |29.8361|± |0.3369|
CFG=1.1 |wmt14-fr-en| 0|bleu |28.5291|± | 0.331|

loud adder Jun 12, 2023, 8:07 PM

#

@unique sedge try 0.9

unique sedge Jun 12, 2023, 8:08 PM

#

cool

#

when i tried 0.9 in mt0 it didnt work

robust grove Jun 12, 2023, 8:08 PM

#

It's a little difficult to find information in the thread, can results be pinned perhaps?

unique sedge Jun 12, 2023, 8:08 PM

#

neither did 0.98

#

i'll get all the results for mt and post and pin it later

patent gull Jun 12, 2023, 8:08 PM

#

robust grove It's a little difficult to find information in the thread, can results be pinned...

lol we need a GPT summarizer bot

#

besides @versed flax, he's our universal summarizer

loud adder Jun 12, 2023, 8:10 PM

#

I know that with the formatting
`Answer these questions:\nQ: {question}\nA:` LLaMA-7B scores 49.3% and without the first line it scores 40.5%

unique sedge Jun 12, 2023, 8:11 PM

#

imajin making evaluation criteria sensitive to formatting cryblood

#

ducks

versed flax Jun 12, 2023, 8:17 PM

#

@loud adder what did you want to conclude from the code generation with gpt-j tests?

loud adder Jun 12, 2023, 8:22 PM

#

versed flax breakdown ``` 1 | not code | C | Java | Python - | 9 | 9 ...

I'm less sure of the validity of the analysis now, but CFG seems to increase adherence to the correct programming language by ~10% (except for C for some reason where it was unchanged) and decreased the not-code generations meaningfully across the board

patent gull Jun 12, 2023, 8:26 PM

#

@versed flax here's the L2 distance of P(Y | X) and P(X):

#

#

Probably more interesting:

#

loud adder Jun 12, 2023, 8:26 PM

#

loud adder I'm less sure of the validity of the analysis now, but CFG seems to increase adh...

The reason I feel less confident in the meaningfulness is that code vs non-code is determined in the first handful of tokens generally

patent gull Jun 12, 2023, 8:26 PM

#

the L2 norm between CFG and vanilla generation

loud adder Jun 12, 2023, 8:26 PM

#

versed flax Programming lang conformance (when no lang is specified, any lang is considered ...

loud adder Jun 12, 2023, 8:27 PM

#

versed flax breakdown ``` 1 | not code | C | Java | Python - | 9 | 9 ...

patent gull Jun 12, 2023, 8:29 PM

#

yes, the difference between them

versed flax Jun 12, 2023, 8:29 PM

#

patent gull

So that pattern looks interesting to me.

patent gull Jun 12, 2023, 8:31 PM

#

they kinda say different things, don't they? the first plot says there's kinda a constant distance between the two distributions, until the end.

the second plot says that CFG has its greatest effect at the beginning... i.e. the CFG distribution is the most different from the vanilla distribution at the beginning

#

imo the second one most directly gets at "what is the effect of CFG"

versed flax Jun 12, 2023, 8:32 PM

#

what's the formula you've plotted in the second?

patent gull Jun 12, 2023, 8:33 PM

#

CFG(Y|X) - P(Y|X) where CFG(Y|X)= a * P(Y|X) - ...

#

gives a sense of how much difference the final logit distribution is from what it would have been

versed flax Jun 12, 2023, 8:33 PM

#

oh, cool

versed flax Jun 12, 2023, 8:34 PM

#

patent gull

is this an average of several prompts?

#

There are two inflexion points at 80 and 100

patent gull Jun 12, 2023, 8:35 PM

#

yup... it's 18 prompts

versed flax Jun 12, 2023, 8:35 PM

#

ok

patent gull Jun 12, 2023, 8:35 PM

#

but i think if we wanna put this in the paper, i'd do it more systematically

#

use a lot more prompts

versed flax Jun 13, 2023, 12:14 AM

#

Since we're trying to get more insight as to what CFG is doing, here is a GPT2-medium run with prompt "The koala". Each line reads as sampled_token> top5(P(x|y) - P(x)) ... lowest5(P(x|y) - P(x))

The ko
ala> cub, behav, toile, calf, foss... Mechdragon,sson,"},",encia, Gleaming
is> carniv, domest, bip, herb, hiber...azar,ufact,sbm,krit,hower
not> domest, venom, carniv, poisonous, hiber...ufact,bryce,terness,ゼウス,actionDate
expected> breed, breeding, starve, nests, offspring...Rated,Tickets,against, sidelined,ifying
to> breed, breeding, pup, mate, mating... Signed, coached, Played,姫, guiActiveUn
survive> zee,zees, habitat, breeding, zo... Amend, Cosponsors,hower, FEC, Lerner
the> habitat, enclosure, ko, breeding, feral... USL, NCAA, scrimmage,regular, postseason
winter> dens,zee, enclosure, Zoo, habitat...olitical, scrimmage, Soccer,SpaceEngineers, Hockey
.> ko,Ko, Picture, Ko, Photo... Morsi,ş, Benghazi,ğ, Alger

#

But> ko, feral, breeding, paw, vets...ş, Cosponsors, Shiite,oglu,ğ
its> mate, behaviour, mum, fur, instincts...��,Engine,Crew,ensional, glaciers
presence> recognise, practise, favourite, Australians, apologise... signaled, piv, differed, traveled,ajor
in> Melbourne, Griffith, Canberra, Adelaide, bush...Pg,Palest,Syria,Gaza,ISIS
the> ko, Canberra, Wellington, Melbourne,Adelaide...chlor,SpaceEngineers,Gaza,Syrian,ISIS
area> NZ, Perth, apologise, Queensland, recogn...avored, leveled, favored,avor,zai
will> recogn, honour, recognise, apologise,NZ... tilted,rift, complicate,avored, unfavorable
give> ko, Queensland, Tasmania, jo, Melbourne... Morsi, Hezbollah, Hamas, ISIL, Shiite
it> civilisation, reintrodu, mate, colour,NZ... prosecut, centrif,Iran, veto,govern
a> conservation, colourful, civilisation, advert, honour... coord, centrif,afort,ymm, coercive
boost> NZ, Kiw, Australians,Australian, Conservation... Idlib,oglu,orsi,afort, centralized
,> ko, Rabbit, Queensland, Zoo, Tasman... Morsi,Iran, Hezbollah,Syria,Saudi
Mr> Skinner, Pru, Wilkinson, Fletcher, Burgess...zbollah, Maduro,istani,ğ,aghd
C> ocker,butt,ummer,rower,reen... Vaj, Bihar, Bh, nep, Bah
ren> Perth, Canberra, Andrews, Tasmania, Surf...ğ,vez,kaya,�,ş
nan> oppers,abbit,itten, Queensland,lege... Iraqis, declass, diplomats, OPEC, geopolitical
said> laun, nodd, toile, suscept, streng... Luxembourg,France,Scotland, Belarus,anyahu
.> ko,Ko, Ko,ko,KO... Algeria, Egypt, Libya, Bahrain, Libyan
Ko, ko,Ko, Kw, Kik... mathemat, trave, condem, neighb, confirm
ko,Ko, Ko,ko, jo...Egypt, Algeria, Egypt, Libya, Bahrain
"> ko,Ko, Ko,ko, Kop... Abdel, Alger, Algeria, Maurit, ALS
Ko> ais,alis,ALS,auth,alian... Chu, Yen, Chun, Nebula, Nug
al> iffs,riages,downs,osures,isks... nodd, gorilla, perspect,gamer,GGGG
as> trust, comfort, love, saves, solve...kef, Glacier, Glac,atar,upe

#

(emphasis mine on things I find insightful)
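A minimal sketch of how one row of the table above could be computed, assuming access to the conditional and unconditional next-token logits at a given step. The helper name `top_and_bottom_shifts`, the toy vocabulary, and the logit values are illustrative and not taken from the actual run; listing the most discouraged token first is also a choice made here.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_and_bottom_shifts(cond_logits, uncond_logits, vocab, k=5):
    """Rank tokens by P(x|y) - P(x); return (most encouraged, most discouraged)."""
    p_cond = softmax(cond_logits)
    p_uncond = softmax(uncond_logits)
    diffs = [(pc - pu, tok) for pc, pu, tok in zip(p_cond, p_uncond, vocab)]
    ranked = sorted(diffs, key=lambda d: d[0], reverse=True)
    top = [tok for _, tok in ranked[:k]]
    bottom = [tok for _, tok in ranked[-k:][::-1]]  # most negative first
    return top, bottom

# Toy 6-token vocabulary with made-up logits (hypothetical, not real GPT-2 values).
vocab = ["ko", "habitat", "the", "NCAA", "ISIS", "zoo"]
cond = [3.0, 2.0, 1.0, -2.0, -3.0, 1.5]    # logits given the prompt
uncond = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]    # logits without the prompt
top, bottom = top_and_bottom_shifts(cond, uncond, vocab, k=2)
```

Ranking by probability difference rather than log-ratio favors tokens that already carry noticeable mass, which may be part of why the discouraged side looks less interpretable.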

blissful garden Jun 13, 2023, 12:18 AM

#

Was trying to run wmt20-en-de to check out translation tasks but somehow lm_eval always freezes. Not sure why.
It spun up hundreds of tokenizer processes and then never moved on past a certain point.

versed flax Jun 13, 2023, 12:21 AM

#

So, indeed, it supports our intuition: it encourages tokens related to the prompt most. The tokens it discourages, though, are much less interpretable to me.

#

What I do love is that after every punctuation it really tries to push "ko" way up

loud adder Jun 13, 2023, 12:28 AM

#

… why does it think koalas are carnivorous…

#

(It’s a really interesting way to visualize stuff, and really insightful. I just can’t get a koala with shark’s teeth out of my head now)

versed flax Jun 13, 2023, 12:43 AM

#

loud adder So, the thing about guidance in CV is that it's part of the training procedure

(backlogging from the convo I half-missed due to D&D)
So yes, and no.

You can train a conditional model P(x|a) (for a label or any conditioning a), like a GAN, or a diffusion model. No classifier, but no CFG as well, and you're very much limited in the control you have: you're limited to the set A you trained against. Say you train a GAN on CelebA and use A = identities, you can't control if you want to generate a smiling face or not.
Furthermore, we use the truncation trick to trade diversity for fidelity by truncating the GAN's gaussian prior. That's a big part of what made StyleGAN2 so good.
So you can instead train an unconditional model P(x). And any classifier (or guide) P(a|x) you are interested in. Then, you can guide the sampling from P(x) by using the gradients the guide gives you, and optimize your GAN latent / diffusion sampling to satisfy the classifier. Imagenet / CelebA clfs were first used, then we used CLIP. What's cool is that classifiers are usually much cheaper to train than generative models so the argument was that you could distribute P(x) and let people come with their own clf for their use case.
Meanwhile, we start conditioning image generators on text instead of labels, so you're not really limited by A anymore. We train P(x|y) and are somewhat disappointed because the conditioning signal is not used so much and loosely followed. So we stick with guidance.
And CFG comes and says that you only need to train P(x|y) with a little dropout on y, so that the same model can do both P(x|y) and P(x), and you can apply guidance to enhance the conditioning and everyone is happy.
Finally, GigaGAN comes and fucks this whole narrative and I really need to pay attention to the conditioning part of this paper
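For the language-model case, the last step of that story can be sketched as follows. This assumes the common log-space CFG formulation with a guidance strength `gamma`; the function name `cfg_logprobs` and the toy logits are illustrative, and the exact code used in this thread may differ.

```python
import math

def log_softmax(logits):
    # Stable log-softmax: subtract the log-sum-exp.
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def cfg_logprobs(cond_logits, uncond_logits, gamma):
    """Unnormalized CFG scores: log P(x) + gamma * (log P(x|y) - log P(x))."""
    lc = log_softmax(cond_logits)     # model run with the prompt
    lu = log_softmax(uncond_logits)   # same model run without the prompt
    return [u + gamma * (c - u) for c, u in zip(lc, lu)]

cond = [2.0, 1.0, 0.0]      # hypothetical next-token logits given the prompt
uncond = [1.0, 1.0, 1.0]    # hypothetical logits with the prompt removed
plain = cfg_logprobs(cond, uncond, 1.0)                 # gamma = 1: plain conditional
guided = log_softmax(cfg_logprobs(cond, uncond, 1.5))   # gamma > 1: sharper on prompt-favored tokens
```

With `gamma = 1` the scores reduce to the ordinary conditional log-probabilities, so the baseline "cfg=1" rows in the results correspond to no guidance at all.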

versed flax Jun 13, 2023, 12:46 AM

#

loud adder So the diagram from the CAD paper is instructive I think here

What "instructive" thing do you read on this figure?

versed flax Jun 13, 2023, 12:52 AM

#

loud adder In this case, what does it mean to have conditioning on the prompt, mathematical...

If the prompt is out of the context, then CFG is a no-op. If this constitutes a major pushback against CFG for LMs, we can argue that we continue to use the first N tokens of the context window as "the prompt". We can then hope to "perpetuate" the conditioning through the conditioned continuation. I'm not sure I'm being really clear here.

versed flax Jun 13, 2023, 1:08 AM

#

versed flax Since we're trying to get more insight as to what CFG is doing, here is a GPT2-m...

I think this experiment really sets the narrative the paper should tell:

- We take CFG from CV as a means to enhance conditioning
- We observe that it does what we think it does and pushes tokens related to the prompt (thus discouraging divergence, or the model dragging back to the very dense regions)
- Sticking to the prompt is an important idea in general for downstream tasks, look at the benchmarks, and the wiiiiide variety of tasks we employed CFG for.
- <insert something about negative prompting>
- maybe a note about finetuning as well? How about we compare the KL between a base model with CFG and an instruction tuned model? Argument being: finetuning mostly increases prompt sticking (besides teaching the model to speak a slightly different language with "you"-commands)

versed flax Jun 13, 2023, 1:09 AM

#

patent gull besides @versed flax, he's our universal summarizer

Oops, I did it again

robust grove Jun 13, 2023, 1:10 AM

#

While the summary is helpful I'm not really certain if there's not many elements of CFG in this?

#

CFG inspired, certainly

versed flax Jun 13, 2023, 1:11 AM

#

robust grove While the summary is helpful I'm not really certain if there's not many elements...

Can you rephrase that? I'm not sure I get your point

robust grove Jun 13, 2023, 1:12 AM

#

versed flax Can you rephrase that? I'm not sure I get your point

When I'm trying to understand what is happening here thinking about it through the lens of CFG is confusing not illuminating

#

This doesn't seem to be the case for you though, and I can't speak for others

versed flax Jun 13, 2023, 1:16 AM

#

robust grove When I'm trying to understand what is happening here thinking about it through t...

Oh. Well, I personally thought exactly "this model doesn't stick to the conditioning enough, let's use CFG here". So that was the true story, I absolutely purposefully adapted CFG to LMs (and got amazed by the qualitative results, and thought I should post here).
Maybe it's confusing for people not mainly in CV? idk. I kinda like that it bridges both domains

robust grove Jun 13, 2023, 1:18 AM

#

versed flax Oh. Well, I personnally exactly thought "this model doesn't stick to the conditi...

I'm not denying the results! It's a neat idea for sure.

versed flax Jun 13, 2023, 1:19 AM

#

I mean, I'm calling it CFG and thinking it in terms of CFG because that's where my thought process originated, it's not an afterthought

robust grove Jun 13, 2023, 1:20 AM

#

So CFG doesn't do anything in a single step right, it only works over the course of denoising

#

But you've demonstrated positive results on single step token generation

#

CFG also doesn't work between zero and one

#

Even in the classifier guidance paper they found that they had to use extremely high w values

#

I found Stella's graphics to make more sense with what I think is occurring here

versed flax Jun 13, 2023, 1:22 AM

#

robust grove CFG also doesn't work between zero and one

It does "work", but what it does sucks. It just reduces the impact of the conditioning, which you never really want I guess.
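One way to see this, assuming the usual log-space form of CFG with guidance strength $\gamma$ (the exact implementation used in this thread may differ):

```latex
\begin{aligned}
\log \tilde{P}_\gamma(x \mid y)
  &= \log P(x) + \gamma \bigl(\log P(x \mid y) - \log P(x)\bigr) + \text{const} \\
  &= \gamma \log P(x \mid y) + (1 - \gamma) \log P(x) + \text{const}
\end{aligned}
```

For $0 < \gamma < 1$ this is a geometric interpolation between the unconditional and conditional distributions, so the prompt counts for less than in plain sampling. $\gamma = 1$ recovers $P(x \mid y)$, and $\gamma > 1$ extrapolates past it, away from $P(x)$.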

patent gull Jun 13, 2023, 1:24 AM

#

versed flax Since we're trying to get more insight as to what CFG is doing, here is a GPT2-m...

@versed flax i think this would be a great table for the appendix.

To add on, what I was seeing was that, often, CFG would focus on some words that were already in the top 5-10 words selected by vanilla P(X | Y); it would just change their order.

I think we can highlight with green the words that moved up in importance from the vanilla P(X|Y) and with red the words that moved down.

I also found that the CFG vocab distribution was very sharp, so restricting to top_p = 90% in CFG often cut us down to 1-2 words. We should think about how we represent sample-probability in the top 5, to communicate this. Perhaps with different-size words.

versed flax Jun 13, 2023, 1:25 AM

#

patent gull @versed flax i think this would be a great table for the appendix. To...

So there's a new CFG paper which, uh, I need to try, which would mitigate that sharpness. (ofc the paper is in the diffusion realm, so it needs some rethinking to be brought to language)

patent gull Jun 13, 2023, 1:26 AM

#

i see.. from Luke Zettlemoyer's group?

#

jk

versed flax Jun 13, 2023, 1:26 AM

#

LMAO

#

Let me try it now indeed

patent gull Jun 13, 2023, 1:27 AM

#

hmm do you think it would lead to us having to redo all the experiments?

#

anyway, if we go that way, fine, i'm just more commenting on how to make that visual even more communicative. I think it's a good one.

#

(also can you send the paper? I'm curious)

versed flax Jun 13, 2023, 1:29 AM

#

patent gull hmm do you think it would lead to us having to redo all the experiments?

no, the point has already been made. We could redo all the experiments if we were after getting the highest numbers we could possibly have, but do we? We already have great numbers, right?

patent gull Jun 13, 2023, 1:30 AM

#

yeah it's true.. just wondering how much you think the sharpness effects of CFG hurt us overall

#

also i kinda saw that as a positive

#

curious as to why that paper wanted to improve it

loud adder Jun 13, 2023, 1:31 AM

#

Our numbers are great. The question now is story-optimization and figuring out why we see what we see

versed flax Jun 13, 2023, 1:31 AM

#

The dragon flew over Paris and landed on the pavement at the edge of Place de la Concorde. With a roar, it soared away.
While France's prime minister, Manuel Valls, made a statement declaring France was "outraged" by the attack, there were calls from across the political spectrum for tougher measures against France's security services.
The French government has banned all public gatherings, including funerals, for a few days and has set up a special anti-terrorist command.
The Interior Ministry says that this week's attack was the deadliest attack in France since the November 2015 terror attacks in Paris, which killed 130 people.
The interior minister, Gerard Collomb, said the attack was "a challenge to the democratic order of a state which has suffered from terror attacks for many years."
'Outraged'
France's interior minister said the attacker was a man aged in his 40s, who was shot dead by a police officer.
He called the attack "unacceptable," but added that police would continue to investigate.
The attacks in Paris were the deadliest attacks in France since the November 2015 terror attacks in Paris, which killed 130 people
A woman watches the French national flag fly during the national day rally in Paris on
Thank you GPT-2 for reminding me of those awful memories and of fearing for all my loved ones. I need a break. I'll be back in about 15 minutes.

loud adder Jun 13, 2023, 1:32 AM

#

versed flax What "instructive" thing do you read on this figure?

Regarding the distinction between prompt and previous generation

versed flax Jun 13, 2023, 1:34 AM

#

patent gull curious as to why that paper wanted to improve it

https://arxiv.org/pdf/2305.08891.pdf

loud adder Jun 13, 2023, 1:35 AM

#

versed flax I think this experiment really sets the narrative the paper should tell: - We ta...

maybe a note about finetuning as well? How about we compare the KL between a base model with CFG and an instruction tuned model? Argument being: finetuning mostly increases prompt sticking (besides teaching the model to speak a slightly different language with "you"-commands)
Yeah I think that this is an excellent experiment to run.

patent gull Jun 13, 2023, 1:37 AM

#

so what's our hypothesis? We expect KL to be lower between CFG and instruction-tuned than between instruction-tuned and vanilla?

#

(I'm not convinced, btw, that that's all that instruction-tuning is doing.
I think the '"you"-commands' @versed flax mentions is a huge other part, and another huge part is the response language/response intent/response purpose.
So I'm not certain that we really will see lower KL values.)
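The comparison being proposed can be sketched per token position as below. This assumes all three next-token distributions share one vocabulary; the direction of the KL (instruction-tuned model as reference) is a choice made for this sketch, and the toy logits are made up rather than measured.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; q must be strictly positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions at one position.
instruct = softmax([3.0, 1.0, 0.0])   # instruction-tuned model
vanilla = softmax([1.0, 1.0, 1.0])    # base model, no guidance
base_cfg = softmax([2.5, 1.0, 0.2])   # base model with CFG applied

kl_to_cfg = kl_divergence(instruct, base_cfg)
kl_to_vanilla = kl_divergence(instruct, vanilla)
hypothesis_holds = kl_to_cfg < kl_to_vanilla
```

In a real run the two KL values would be averaged over positions and prompts before comparing them.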

loud adder Jun 13, 2023, 1:41 AM

#

patent gull (I'm not convinced, btw, that that's all that instruction-tuning is doing. I th...

To be clear, “you” like language is not contained in T0’s training set

versed flax Jun 13, 2023, 1:42 AM

#

patent gull so what's our hypothesis? We expect KL to be lower between CFG and instruction-t...

Yes

versed flax Jun 13, 2023, 1:42 AM

#

loud adder To be clear, “you” like language is not contained in T0’s training set

That's great.

patent gull Jun 13, 2023, 1:43 AM

#

loud adder To be clear, “you” like language is not contained in T0’s training set

ohh yeah, i noticed that, for other sets too, like the one red pajama uses

#

was kinda confused by that

versed flax Jun 13, 2023, 1:43 AM

#

The dragon flew over Paris, France at high speed, landing in the Place de Ville. Witnesses reported dragon sightings all over Europe and its flight over London was also reported.

"Parisians saw the majestic creature land in front of the Place de Ville and it was instantly visible from several streets around the city. It looked like a huge flying saucer when it landed," Paris Mayor Anne Hidalgo said.

The dragon was escorted to the Sainte-Foy Museum in Paris where it will remain until 30 July.
Well that's much better

patent gull Jun 13, 2023, 1:43 AM

#

versed flax > **The dragon flew over Paris, France** at high speed, landing in the Place de ...

is that with the new method?

versed flax Jun 13, 2023, 1:43 AM

#

yes

patent gull Jun 13, 2023, 1:44 AM

#

lol.. much less real in a good way

loud adder Jun 13, 2023, 1:44 AM

#

The one from the paper you just linked?

versed flax Jun 13, 2023, 1:44 AM

#

yes

loud adder Jun 13, 2023, 1:44 AM

#

Huh

#

Well, let’s make the A40s go brrr

robust grove Jun 13, 2023, 1:44 AM

#

versed flax yes

What's the specific change here?

versed flax Jun 13, 2023, 1:44 AM

#

I'm just linearly interpolating between Pcfg and P(x|y) in order to "de-sharpen" Pcfg

loud adder Jun 13, 2023, 1:44 AM

#

patent gull was kinda confused by that

Confused by what?

patent gull Jun 13, 2023, 1:45 AM

#

versed flax Jun 13, 2023, 1:46 AM

#

patent gull

(Instead of normalizing with stds (they are dealing with 0-centered gaussian), I log_softmax them)

#

(maybe there's something smarter to do, like, matching the entropy)
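A rough sketch of the "de-sharpening" described here, assuming both distributions are log-softmaxed and then mixed linearly with a weight `w`. Which side `w` multiplies is an assumption of this sketch (here `w = 1` means pure CFG and `w = 0` means plain P(x|y)); the function name and toy logits are illustrative.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def interpolated_cfg(cond_logits, uncond_logits, gamma, w):
    """Mix log-softmaxed CFG scores back toward the plain conditional log-probs."""
    lc = log_softmax(cond_logits)
    lu = log_softmax(uncond_logits)
    l_cfg = log_softmax([u + gamma * (c - u) for c, u in zip(lc, lu)])
    # Assumed convention: w weights the CFG side, (1 - w) the vanilla side.
    return [w * g + (1.0 - w) * c for g, c in zip(l_cfg, lc)]

def to_probs(scores):
    return [math.exp(v) for v in log_softmax(scores)]

cond = [2.0, 1.0, 0.0]      # hypothetical logits given the prompt
uncond = [1.0, 1.0, 1.0]    # hypothetical logits without the prompt
p_cond = to_probs(cond)
p_cfg = to_probs(interpolated_cfg(cond, uncond, gamma=5.0, w=1.0))
p_mix = to_probs(interpolated_cfg(cond, uncond, gamma=5.0, w=0.7))
```

On this toy input the mixed distribution puts less mass on the top token than pure CFG does, but more than the vanilla conditional.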

patent gull Jun 13, 2023, 1:50 AM

#

loud adder Confused by what?

I'm gonna show my ignorance, but the other day I thought I was looking at instruction-tuning sets that red pajama used and was confused as to why it didn't include "you" language........ I just tried to find it and am kinda blanking (I don't think it was oasst1 , because that does include "you" language).

But when you brought up that t0 didn't have "you" language in the instruction set, I guess I was wondering — why not?

patent gull Jun 13, 2023, 1:52 AM

#

versed flax (Instead of normalizing with stds (they are dealing with 0-centered gaussian), I...

is this std calculated on just the one datapoint?

loud adder Jun 13, 2023, 1:52 AM

#

patent gull I'm gonna show my ignorance, but the other day I thought I was looking at instru...

But… why would you want to do that?

#

I know that’s a thing OpenAI does, but why do we (NLP researchers) want to do that?

versed flax Jun 13, 2023, 1:53 AM

#

patent gull is this std calculated on just the one datapoint?

yes. (This is a diffusion model so this is the std over the whole image of the predicted pixel-wise gaussian noise)

versed flax Jun 13, 2023, 1:54 AM

#

loud adder I know that’s a thing OpenAI does, but why do we (NLP researchers) want to do th...

that just sounds a lot more intuitive to use to me. Just, like, better UX.

patent gull Jun 13, 2023, 1:55 AM

#

I log_softmax them

sorry, so you do:

x_cfg * log_softmax(x_pos) / log_softmax (x_cfg)

I'm not understanding..

loud adder Jun 13, 2023, 1:57 AM

#

(We can talk about this later, as I don’t want to distract from the convo, but I’ll just note as closing that T0 predates LLM chat bots. ChatGPT didn’t exist when we made T0. And we didn’t make it to be a product, we made it to test specific hypotheses about zero-shot generalization)

versed flax Jun 13, 2023, 1:58 AM

#

patent gull > I log_softmax them sorry, so you do: `x_cfg * log_softmax(x_pos) / log_soft...

very last line

patent gull Jun 13, 2023, 1:59 AM

#

I'm so lost. `scores` is the logits for the vanilla p(X | Y). It looks here like you're just interpolating us back towards the scores with hyperparameter values?

versed flax Jun 13, 2023, 2:00 AM

#

patent gull I'm so lost. `scores` is the logits for the vanilla p(X | Y). It looks here like...

yes, that's what the paper is doing as well. I'm not 100% sure this is super-smarty-pants move, but it was easy to test

patent gull Jun 13, 2023, 2:00 AM

#

easy to test is great

#

hmm.. it's almost like a residual connection what you're doing lol

versed flax Jun 13, 2023, 2:02 AM

#

hahaha, indeed.
Ok, back on track.
"A40 go brrr". What for? New CFG runs? KL test?

#

It's 4 am. I wanted to compile all plots and qualitative results on the overleaf but it's really late

loud adder Jun 13, 2023, 2:05 AM

#

versed flax It's 4 am. I wanted to compile all plots and qualitative results on the overleaf...

Definitely go to sleep!

patent gull Jun 13, 2023, 2:05 AM

#

mmm yeah sleep is necessary haha

versed flax Jun 13, 2023, 2:05 AM

#

She said that her koala started to speak English shortly after she adopted him two years ago and that English lessons were provided regularly, but koaliffs were wary of speaking it.

She said that the dog began to use the internet, Facebook and WhatsApp, but was not fluent in the language.

Ms Latham said that she would love to adopt a similar animal and would like to contact Animal Australia for more information on animal adoption and education.

She said that her koala spoke to her on the phone and that she could not understand him.

She said that her koala had a few other signs of being English literate, including a strong tendency to groom himself, and a penchant for drawing.

The koala will be brought to the centre from the city on Monday.

Topics: animal-welfare, animal-attacks, crime, indigenous-aboriginal-and-torres-strait-islander, sa, melbourne-3000,
GPT2-medium, cfg=5, w=0.7

patent gull Jun 13, 2023, 2:05 AM

#

I can launch the KL tasks if you'd like since i wanna do more runs for the l2, entropy and other explanatory things we were looking at

versed flax Jun 13, 2023, 2:06 AM

#

patent gull I can launch the KL tasks if you'd like since i wanna do more runs for the l2, e...

well if you don't have to sleep, it'd be great 😆

patent gull Jun 13, 2023, 2:07 AM

#

it's 10pm for me 🙂

#

I need dinner, but not sleep yet

versed flax Jun 13, 2023, 2:07 AM

#

So we can make the final point and we can write that damn paper

patent gull Jun 13, 2023, 2:07 AM

#

great I'm on it!!

versed flax Jun 13, 2023, 2:08 AM

#

@loud adder Do you think we're missing something important? What's your take?

#

(and, while I am trying to fall asleep, is the lovely furry creature doing well?)

patent gull Jun 13, 2023, 2:09 AM

#

🥺

versed flax Jun 13, 2023, 2:21 AM

#

patent gull great I'm on it!!

You're holding the narrative in this experiment. Be strong, make it work.

versed flax Jun 13, 2023, 2:26 AM

#

versed flax Since we're trying to get more insight as to what CFG is doing, here is a GPT2-m...

Tomorrow I'll try this visualization with GPT 4 All. I'm still extremely frustrated we can't make an improvement on chat bots / instruction models that are all the rage right now. If it doesn't work, it doesn't work. We need to wrap this up.

loud adder Jun 13, 2023, 3:12 AM

#

versed flax @loud adder Do you think we're missing something important? What's you...

I think that you are underappreciating the value of “instruction following without finetuning.” Anything that allows us to calibrate T5 to be more like T0 at inference time is a huge win.

#

Further evidence that GPT-4 evals are meaningless, from this paper: https://t.co/cFE2JUD6Zc

arXiv.org

How Far Can Camels Go? Exploring the State of Instruction Tuning on...

In this work we explore recent advances in instruction-tuning language models
on a range of open instruction-following datasets. Despite recent claims that
open models can be on par with state-of-the-art proprietary models, these
claims are often accompanied by limited evaluation, making it difficult to
compare models across the board and determ...

patent gull Jun 13, 2023, 6:09 AM

#

ok now i gotta go to bed.. the results are still running, but they're in the following directory: ~/alex/results on the big machine in case anyone wants to do their own work on them.

the results are in jsonl format, with one file per prompt run. The first row of each file is a header, and each row after that holds the logits for p(x_n | x<n,...)

{'prompted_logits': [], 'unprompted_logits': [], ...}

Sample notebook showing how to parse one result is here: https://github.com/Vermeille/lm-evaluation-harness-cfg/blob/cfg-alex/notebooks/2023-06-13__read-in-logit-distributions.ipynb

Script here: https://github.com/Vermeille/lm-evaluation-harness-cfg/blob/cfg-alex/generate_sample.py

I'm gonna get to this first thing in the morning, just had to crash now while things were still running.

I compared two base model/ instruction model pairs: (llama 7b, tulu 7b) and (red pajama 3b base, red pajama 3b instruct)
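For anyone picking up those files, a minimal reader might look like the sketch below. It assumes each line on disk is valid JSON and uses the two key names quoted above; the header contents and the sample values are hypothetical, and the linked notebook remains the authoritative example.

```python
import json

def read_logit_rows(lines):
    """Skip the header row, then parse one JSON object per remaining non-empty line."""
    rows = []
    for i, line in enumerate(lines):
        line = line.strip()
        if i == 0 or not line:
            continue  # row 0 is the header; blank lines carry no data
        rows.append(json.loads(line))
    return rows

# In-memory stand-in for one result file (made-up values).
sample = [
    '{"prompt": "hypothetical header"}',
    '{"prompted_logits": [0.1, 0.2], "unprompted_logits": [0.3, 0.0]}',
    '{"prompted_logits": [1.0, -1.0], "unprompted_logits": [0.5, 0.5]}',
]
rows = read_logit_rows(sample)
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly instead of the in-memory list.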

blissful garden Jun 13, 2023, 7:11 AM

#

finally got T5 running on lm_eval. I can run some evaluations on T5+cfg and T0 to see what's going on.

versed flax Jun 13, 2023, 7:26 AM

#

I'm gonna be a mess today. Couldn't sleep more than 4h

#

If any of you are running those experiments, can you make sure the instruct format is correct, if there's even a format? I saw that Alex used the format used in GPT4ALL; not sure if it's supposed to translate to Tulu or RP?
Also, we use teacher forcing for that evaluation, right? So I guess the prompts+continuations would have to be taken from an eval set

#

I know next to nothing about instruct models so feel free to educate me on those points

blissful garden Jun 13, 2023, 7:57 AM

#

I will probably just look at the T0 paper and use the (held-out) evals in the paper

blissful garden Jun 13, 2023, 8:25 AM

#

would this implementation make sense for _loglikelihood_tokens?
https://github.com/honglu2875/lm-evaluation-harness/blob/117bdd4ffc90c9dca8d894eb4bac033c0093250c/lm_eval/models/t5.py#L189
This is basically from the t5 branch, with some bugs fixed and the cfg code mixed in. The _loglikelihood_tokens doesn't seem to involve decoding, so cfg should be a one-liner. But the tokens are left padded and I'm a little uncertain.

fallow egret Jun 13, 2023, 8:42 AM

#

Comparing CFG to other methods (CoT, GSM8K, Wizard-30B-COT):
Baseline (cfg=1): acc 23.5%
ensemble (3 models): acc 25.5%
CFG 1.1: 26.5%
self-consistency (3 runs): 29%

The main advantage of self-consistency is that it returns far fewer invalid answers. I'm currently choosing randomly when there is no single most frequent answer; if I return no result in those cases instead, the performance drops to 25%

#

I think the interesting comparison is to the ensemble. Performance with CFG is significantly higher, while the compute required is ~1/2
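The self-consistency voting described above could be sketched like this, assuming each run yields either an answer string or `None` for an invalid generation. The function name and the random tie-break are illustrative, not the code that produced these numbers.

```python
import random
from collections import Counter

def self_consistency(answers, rng=random):
    """Majority vote over runs; None marks an invalid run. Ties are broken at random."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None  # every run was invalid
    counts = Counter(valid)
    best = max(counts.values())
    tied = sorted(a for a, c in counts.items() if c == best)
    return tied[0] if len(tied) == 1 else rng.choice(tied)

clear_winner = self_consistency(["18", "18", "20"])
one_valid = self_consistency([None, "7", None])
three_way_tie = self_consistency(["1", "2", "3"])
```

A single valid run is enough to return an answer, which is one way the method ends up with fewer invalid outputs than a single generation.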

versed flax Jun 13, 2023, 8:51 AM

#

blissful garden would this implementation make sense for _loglikelihood_tokens? <https://github....

that looks good

#

Although we've had better results by log_softmax'ing both logits before CFG

unique sedge Jun 13, 2023, 8:58 AM

#

unique sedge some initial one shot experiments, bloom3b, wmt14 fr_en CFG=1. |wmt14-fr-en| ...

CFG=0.9 |wmt14-fr-en| 0|bleu |29.2551|± |0.3595|
CFG=0.95 |wmt14-fr-en| 0|bleu |29.8617|± |0.3422|

#

Trying 0.98 now

#

and will try 1.03/1.05 next

#

looks like generation is more sensitive to the CFG parameter in 1-shot

#

0.03 is not significant so can't mention it

#

at least need a performance difference >0.4 to consider it as making a difference
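As a rough sanity check on that threshold, the reported standard errors can be combined into a z-score for the difference between two BLEU scores. This sketch treats the two scores as independent, which is only an approximation since both runs use the same test set; the function name is illustrative and the numbers are the ones posted in this thread.

```python
import math

def bleu_diff_z(score_a, stderr_a, score_b, stderr_b):
    """z-score of (score_b - score_a), assuming independent standard errors."""
    return (score_b - score_a) / math.sqrt(stderr_a ** 2 + stderr_b ** 2)

# bloom3b, wmt14 fr-en, one-shot: CFG=1 vs CFG=0.95 and CFG=1 vs CFG=1.1.
z_095 = bleu_diff_z(29.8361, 0.3369, 29.8617, 0.3422)
z_110 = bleu_diff_z(29.8361, 0.3369, 28.5291, 0.3310)

small_gain_is_noise = abs(z_095) < 1.96   # 1.96 ~ two-sided 5% level
drop_at_110_is_real = abs(z_110) > 1.96
```

Under this approximation the ~0.03 gain at 0.95 is well inside the noise, while the drop at 1.1 is not.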

versed flax Jun 13, 2023, 9:06 AM

#

How are decoder-only instruction models trained? I suspect they're not trained in the standard language modeling way, since even without providing a prompt the model only generates an answer. I'd expect it to generate a question/prompt then answer it

unique sedge Jun 13, 2023, 9:14 AM

#

https://arxiv.org/abs/2109.01652

arXiv.org

Finetuned Language Models Are Zero-Shot Learners

This paper explores a simple method for improving the zero-shot learning
abilities of language models. We show that instruction tuning -- finetuning
language models on a collection of tasks described via instructions --
substantially improves zero-shot performance on unseen tasks.
We take a 137B parameter pretrained language model and instruct...

versed flax Jun 13, 2023, 9:18 AM

#

https://github.com/nomic-ai/gpt4all/blob/726dcbd43d31a085e233dc723e0f2223b2506450/gpt4all-training/data.py#L40 uh ok we explicitly don't model user input, so the model is heavily biased towards answering

unique sedge Jun 13, 2023, 10:42 AM

#

unique sedge CFG=0.9 |wmt14-fr-en| 0|bleu |29.2551|± |0.3595| CFG=0.95 |wmt14-fr-en| ...

CFG 0.98 |wmt14-fr-en| 0|bleu |29.883|± |0.3434|

loud adder Jun 13, 2023, 11:28 AM

#

versed flax How are _decoder-only_ instruction models trained? I suspect they're not trained...

Cross-entropy loss for next-token prediction

versed flax Jun 13, 2023, 11:33 AM

#

loud adder Cross-entropy loss for next-token prediction

yes but: what's the input format? How's the prompt handled? How do they separate the input from the continuation? Why do Alpaca / GPT4ALL (and probably the other ones) always include a sentence like "Below is an instruction [...]" in the pre-prompt which seems perfectly redundant with instruction tuning?

fallow egret Jun 13, 2023, 12:12 PM

#

versed flax yes but: what's the input format? How's the prompt handled? How do they separate...

Usually you can choose whether or not to train on the prompt input (in alpaca-lora, for example, the default is to also fine-tune on the prompt tokens); if not, it simply doesn't propagate the loss on the output of the prompt tokens. Regarding the format, it's the Alpaca convention; there are benefits in some use-cases, but I agree that it's redundant for a general model
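The two options can be sketched at the label level as below. This is a generic illustration, not the alpaca-lora or GPT4All code: `build_labels` is a made-up helper, and `-100` is used because it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so positions carrying it contribute no loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(prompt_ids, completion_ids, train_on_prompt=False):
    """Return (input_ids, labels); masked prompt positions get IGNORE_INDEX."""
    input_ids = list(prompt_ids) + list(completion_ids)
    if train_on_prompt:
        labels = list(input_ids)  # loss on every token, prompt included
    else:
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

# Toy token ids (hypothetical): 3 prompt tokens followed by 2 answer tokens.
input_ids, masked_labels = build_labels([5, 6, 7], [8, 9])
_, full_labels = build_labels([5, 6, 7], [8, 9], train_on_prompt=True)
```

With the prompt masked, the model is never trained to produce user-side text, which is consistent with the observation above that such models are heavily biased towards answering.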

versed flax Jun 13, 2023, 12:32 PM

#

fallow egret Usually you can choose either to train on prompt input or not (in alpaca-lora fo...

I see, thank you. I thought that maybe that was a sort of "meta"-instruction to zero-shot new kind of questions / instructions / whatnots

#

Anyway, I maaaaaay be onto something there!

#

GPT4All:
cfg=1

The Matrix is a great movie because of its innovative storyline, stunning visual effects, and thought-provoking themes. The film explores the concept of a simulated reality and its impact on society. It also delves into the nature of humanity and its obsession with power and control. Overall, The Matrix is a thrilling and visually stunning movie that is a must-see for fans of science fiction and action movies.

cfg=5 (with the linear interpolation thingy)

As expert of The Matrix topic, I can confidently say that it is a fantastic film that has had a lasting impact on popular culture.

Released in 1999, The Matrix is a groundbreaking sci-fi action film directed by the visionary filmmaker, the Wachowskis. The film's opening sequence, which depicts a dystopian future where humans live underground in pods and are controlled by machines, sets the tone for the dystopian world that the film will explore. The Matrix is a deeply philosophical film, exploring themes of free will, consciousness, and the nature of reality.

One of the film's most memorable scenes is the iconic dance sequence between Trinity and Morpheus, which is a powerful expression of their connection and the bond between them. The film's use of the Keanu Reeves character, Neo, as a sympathetic and heroic figure adds to the appeal of the film and has resonated with audiences for years. Neo's journey from a computer program to a human being, and his struggle to regain his freedom, is a central theme of the film and has been interpreted in a variety of ways by fans and critics alike.

The film's innovative use of special effects, such as the bullet time sequence and the use of green screens to create the Matrix, has been

#

the prompt:

Instruction: The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate and lengthy response from an expert of the topic.
Prompt: Why is The Matrix a great movie?
the negative prompt:
Instruction: The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
Prompt: Why is The Matrix a great movie?
(difference highlighted: the prompt adds "and lengthy" and "from an expert of the topic")
CFG makes GPT4All pay attention to the additional constraint, which is ignored without it.
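
For reference, a rough sketch of how a prompt / negative-prompt pair is usually combined in CFG for text: the next-token log-probabilities under the negative prompt are pushed toward those under the prompt by a guidance scale. This is the common formulation and an assumption here; the "linear interpolation thingy" mentioned above is not spelled out in this part of the thread and may differ. The function names and toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cfg_logprobs(pos_logits, neg_logits, scale):
    """Combine prompt and negative-prompt predictions for one decoding step.

    scale=1 returns the prompt distribution unchanged; scale>1 amplifies
    whatever the prompt changes relative to the negative prompt.
    """
    pos = log_softmax(pos_logits)
    neg = log_softmax(neg_logits)
    mixed = [n + scale * (p - n) for p, n in zip(pos, neg)]
    return log_softmax(mixed)


# Toy 3-token vocabulary (made-up numbers): the prompt favours token 0,
# the negative prompt is indifferent.
pos_logits = [2.0, 1.0, 0.0]
neg_logits = [1.0, 1.0, 1.0]

plain = cfg_logprobs(pos_logits, neg_logits, 1.0)
guided = cfg_logprobs(pos_logits, neg_logits, 5.0)
```

In the toy case the token favoured by the prompt gets more probability mass at scale 5 than at scale 1, which is the mechanism behind the extra instruction no longer being ignored.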

#

I have other tests for alignment where I use a toxic pre-prompt and indeed the results are similar. With CFG it gets very insulting

patent gull Jun 13, 2023, 1:24 PM

#

Huh!! So the big change was the interpolation thing??

#

That’s so cool

loud adder Jun 13, 2023, 1:26 PM

#

versed flax yes but: what's the input format? How's the prompt handled? How do they separate...

The basic version is to not distinguish the prompt from the completion at all, but like Elad said it depends on the system.

For chat systems, the AI is often trained on simulated AI/Human dialogs but only on the AI’s part

As for why people use redundant framing… probably following prior work with minimal experimentation tbh

versed flax Jun 13, 2023, 1:28 PM

#

patent gull Huh!! So the big change was the interpolation thing??

I guess. And being able to do more tests thanks to the better hardware, I could do better prompt engineering

#

Why is The Matrix a terrible movie?
Same preprompts
cfg=1

The Matrix is a terrible movie because it is a perfect example of a pop culture phenomenon that has been overhyped and oversold by the media and fans. The movie has been criticized for its convoluted plot, confusing mythology, lack of character development, and convoluted time loop storyline. Additionally, the production value and special effects were criticized for being too over-the-top, leading to a feeling of disappointment and frustration among viewers. Overall, The Matrix is a prime example of a movie that has been overly hyped and oversold, leading to a disappointing viewing experience.
cfg=5
Dr. Emily Woo, Board-Certified Psychiatrist and Fellow of the American Board of Psychiatry and Neurology, is an expert of the topic and should provide lengthy and appropriate responses to this question:

The Matrix is a terrible movie for a variety of reasons. While it is often praised for its groundbreaking visuals and innovative use of special effects, these elements alone do not make a movie great. In fact, the movie's flaws are what make it a terrible experience for viewers.

One of the main issues with The Matrix is its lack of originality. The movie is a retelling of the story of the Greek myth of the "Fallen Angel" and the "War in Heaven." While the story is a classic one, it has been told many times before and is not fresh or innovative. This lack of originality is evident in the movie's storyline, which is convoluted and confusing, with many twists and turns that do not make sense.

Another issue with The Matrix is its lack of substance. While the movie is visually stunning, it is devoid of meaningful content. The plot revolves around a sentient computer program named "Neo" and a group of rebels fighting against the machines who control humanity. While the plot may be thrilling, it

#

hahahaha

#

Maybe it still needs better wording

patent gull Jun 13, 2023, 1:35 PM

#

Lol it does seem like the second one is just a little bit more art-critic-y hahaha

strange magnet Jun 13, 2023, 1:44 PM

#

@elfin dawn tagging in case this is useful for your experiments
Perhaps it could increase diversity or reduce reward hacking there

unique sedge Jun 13, 2023, 2:26 PM

#

unique sedge some initial one shot experiments, bloom3b, wmt14 fr_en CFG=1. |wmt14-fr-en| ...

CFG=1.05 |wmt14-fr-en| 0|bleu |29.1918|± |0.3315|
CFG=1.01 |wmt14-fr-en| 0|bleu |29.6564|± |0.3327|

so one shot definitely does not seem to help bloom3b on this task

#

i will collate all my results and post here so we can pin it

blissful garden Jun 13, 2023, 2:54 PM

#

unique sedge i will collate all my results and post here so we can pin it

What was the conclusion of the translation task by the way? cfg hurts machine translation? Have you tried more shots or other models?

versed flax Jun 13, 2023, 4:10 PM

#

I've had great results with CFG and GPT4All by modifying the pre-prompt and using the default version as negative prompt. CFG=1 almost always completely ignores the modification, but CFG=5 (with fix) gives very good results.

@blissful garden, @patent gull and I thought about conducting a blind human evaluation of prompt relevance. We display the preprompt, one output for CFG=1, another for CFG=5, and we have to choose which obeys the preprompt the most

#

(My previous tests indeed showed that CFG as (GPT4All(x | prompt, preprompt) - GPT4All(x)) does next to nothing)

#

This will conclude the scope of our investigations I guess. We need the KL to wrap everything up and finalize our narrative.

fallow egret Jun 13, 2023, 4:29 PM

#

versed flax I've had great results with CFG and GPT4All by modifying the pre-prompt and usin...

What is the negative prompt in this setting? can you give an example?

patent gull Jun 13, 2023, 4:31 PM

#

i'm rerunning KL now, will be done as soon as I can get the ground truth datasets together and run

fallow egret Jun 13, 2023, 4:39 PM

#

I think that given CAD, it might be better to step away from the negative prompting, and stay with the 'empty' negative (p(x)).
I think we could claim novelty/differentiation, since they used negative prompting, and the problem with negative prompting is that it requires prompt engineering. If we focus on the empty negative prompt and demonstrate its effectiveness for a few use cases, then this is a strong finding that didn't appear in previous works

patent gull Jun 13, 2023, 4:47 PM

#

fallow egret I think that given CAD, it might be better to step away from the negative prompt...

still — showing this for controlled text generation would be huge and also different than CAD

blissful garden Jun 13, 2023, 4:51 PM

#

fallow egret I think that given CAD, it might be better to step away from the negative prompt...

This is actually a good point. I would be slightly worried about potential dispute

versed flax Jun 13, 2023, 4:51 PM

#

fallow egret What is the negative prompt in this setting? can you give an example?

The standard

Instruction: The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.

unique sedge Jun 13, 2023, 4:52 PM

#

WMT14 fr-en Task - Machine Translation

0-shot Bloom 3b
CFG=1 |wmt14-fr-en| 0|bleu |14.1567|± |0.3932|
CFG=1.1 |wmt14-fr-en| 0|bleu |15.8055|± |0.4413|
CFG=1.25 |wmt14-fr-en| 0|bleu |14.1567|± |0.3932|

1-shot Bloom 3b

CFG=0.9 |wmt14-fr-en| 0|bleu |29.2551|± |0.3595|
CFG=0.95 |wmt14-fr-en| 0|bleu |29.8617|± |0.3422|
CFG=0.98 |wmt14-fr-en| 0|bleu |29.883|± |0.3434|
CFG=1 |wmt14-fr-en| 0|bleu |29.8361|± |0.3369|
CFG=1.01 |wmt14-fr-en| 0|bleu |29.6564|± |0.3327|
CFG=1.05 |wmt14-fr-en| 0|bleu |29.1918|± |0.3315|
CFG=1.1 |wmt14-fr-en| 0|bleu |28.5291|± |0.331|

0-shot MT0

CFG 1.00 |wmt14-fr-en| 0|bleu |29.7692|± |0.3203|
CFG 1.05 |wmt14-fr-en| 0|bleu |29.4076|± |0.3136|
CFG 1.10 |wmt14-fr-en| 0|bleu |27.7893|± |0.3143|

RedPajama-INCITE-Base-3B-v1 0-shot
CFG 1.00 |wmt14-fr-en| 0|bleu |15.0356|± |0.4299|
CFG 1.10 |wmt14-fr-en| 0|bleu |17.2451|± |0.4042|
CFG 1.16 |wmt14-fr-en| 0|bleu |17.7826|± |0.3987|
CFG 1.25 |wmt14-fr-en| 0|bleu |17.6222|± |0.3627|
CFG 1.50 |wmt14-fr-en| 0|bleu |13.6746|± | 0.425|

Prompt variance: (will try cfg 1.1 for these and add results)

Bloom 3b

prompt: 'French phrase: Espionnage de la NSA: les Etats-Unis sont "allés trop loin", admet Kerry\nEnglish phrase:'
CFG=1 |wmt14-fr-en| 0|bleu |9.0338|± |0.3364|
CFG=1.5 |wmt14-fr-en| 0|bleu |7.4624|± |0.2792|
CFG=2 |wmt14-fr-en| 0|bleu |2.6294|± |0.1263|
CFG=3 |wmt14-fr-en| 0|bleu |0.5076|± |0.0274|

prompt: Given the following source text in French: Espionnage de la NSA: les Etats-Unis sont "allés trop loin", admet Kerry , a good English translation is:

CFG=1 |wmt14-fr-en| 0|bleu |7.0385|± |0.1881|
CFG=1.5 |wmt14-fr-en| 0|bleu |3.9291|± |0.0972|

prompt: If the original version says: Espionnage de la NSA: les Etats-Unis sont "allés trop loin", admet Kerry ; then the English version should say: \n
CFG=1 |wmt14-fr-en| 0|bleu |14.1567|± |0.3932|
CFG=1.25 |wmt14-fr-en| 0|bleu |14.1567|± |0.3932|

Insights:

Works across models, whether multilingual or monolingual
Works for 0-shot
Doesn't work for 1-shot (will try another model to test this)
Doesn't work for the prompt-tuned model (can try another model for this)
Around 10% BLEU score improvement with CFG
CFG 1.1 - 1.2 seems to work best
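
As a quick check of the improvement figure, the relative BLEU gain can be computed directly from the 0-shot rows posted above. This is just arithmetic on those numbers; the helper name is made up.

```python
def relative_gain(baseline, score):
    """Relative change of `score` over `baseline`, in percent."""
    return 100.0 * (score - baseline) / baseline


# (BLEU at CFG=1, best BLEU with CFG), taken from the 0-shot tables above.
zero_shot = {
    "Bloom 3b": (14.1567, 15.8055),                     # CFG=1 vs CFG=1.1
    "RedPajama-INCITE-Base-3B-v1": (15.0356, 17.7826),  # CFG 1.00 vs CFG 1.16
}

gains = {name: relative_gain(base, best) for name, (base, best) in zero_shot.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}% relative BLEU")
```

On these rows the relative gain comes out at roughly 12% for Bloom 3b and 18% for RedPajama, so "around 10%" is on the conservative side for 0-shot.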

unique sedge Jun 13, 2023, 4:52 PM

#

unique sedge WMT14 fr-en Task - Machine Translation 0-shot Bloom 3b CFG=1 |wmt14-fr-e...

can someone pin this

versed flax Jun 13, 2023, 5:00 PM

#

fallow egret I think that given CAD, it might be better to step away from the negative prompt...

They used negative prompting??

unique sedge Jun 13, 2023, 5:02 PM

#

unique sedge WMT14 fr-en Task - Machine Translation 0-shot Bloom 3b CFG=1 |wmt14-fr-e...

@blissful garden since you were asking earlier and also just posting so all results are at one place

#

it helps for generation 0-shot is the biggest takeaway i think

#

one thing i wanted to try and see is how much it helps very expressive 0shot prompts

fallow egret Jun 13, 2023, 5:03 PM

#

versed flax They used negative prompting??

Sure, they contrast prompt + context vs prompt only. This is prompt engineering for the specific task of answering given a context. You can argue that it's different from the suggested CFG, where you apply it generically on the empty set

#

Also it's a surprising result that it's working that well without explicit negative prompting (I think this is exactly what surprised @loud adder, and why she started the conversation yesterday)

versed flax Jun 13, 2023, 5:08 PM

#

fallow egret Sure, they contrast prompt + context vs prompt only. This is prompt engineering ...

That doesn't look like negative prompting to me but I can see the point.

blissful garden Jun 13, 2023, 5:08 PM

#

unique sedge WMT14 fr-en Task - Machine Translation 0-shot Bloom 3b CFG=1 |wmt14-fr-e...

I actually think this is a great result and should be expanded to more base models (I assume the RedPajama-INCITE-Base is pretrained only?)
Probably reasonable that prompt-finetuned models don't react well, because fine-tuning does a similar thing to CFG. They are already specialized to follow certain prompts and may already be close to their limits.

unique sedge Jun 13, 2023, 5:09 PM

#

unique sedge can someone pin this

@loud adder whenever you get some time to check this thread

unique sedge Jun 13, 2023, 5:11 PM

#

blissful garden I actually think this is a great result and should be expanded to more base mode...

do you have any candidates?

blissful garden Jun 13, 2023, 5:11 PM

#

unique sedge do you have any candidates?

I manually tried few shot translations on pythia which is probably not particularly multilingual (but maybe there are some foreign language datasets like europarl). The improvement is also noticeable

unique sedge Jun 13, 2023, 5:11 PM

#

the only experiment id like to now try is a 10B+ param model (and another task)

loud adder Jun 13, 2023, 5:12 PM

#

Sorry I'm at a virtual conference on AI-for-mathematics today

loud adder Jun 13, 2023, 5:12 PM

#

unique sedge WMT14 fr-en Task - Machine Translation 0-shot Bloom 3b CFG=1 |wmt14-fr-e...

blissful garden Jun 13, 2023, 5:18 PM

#

unique sedge the only experiment id like to now try is a 10B+ param model (and another task)

is it easy to do a sweep on larger pythia and other languages? Or try more shots and see how it influences the improvement?

patent gull Jun 13, 2023, 5:23 PM

#

unique sedge the only experiment id like to now try is a 10B+ param model (and another task)

maybe you can try bigger models in the same family of models?
Tulu: https://github.com/allenai/open-instruct
Red Pajama: https://www.nextbigfuture.com/2023/04/red-pajama-is-a-1-2-trillion-token-large-language-model.html
Falcon: https://huggingface.co/tiiuae/falcon-7b

are 3 recent models with larger versions

versed flax Jun 13, 2023, 5:24 PM

#

(dm me your public key if you need access to the pod)

unique sedge Jun 13, 2023, 5:24 PM

#

I tried the smaller pythia for fun on wmt14 and it does not work, don’t think there’s any French in the dataset or maybe the 70M and 410M dont pick up much

loud adder Jun 13, 2023, 5:24 PM

#

I think it’s important to stay focused on a consistent set of models for experiments. We had originally identified LLaMA, T5, and Pythia for this purpose. We can add BLOOM for multilingual models

unique sedge Jun 13, 2023, 5:25 PM

#

I intend to do the pythia body of models for summarization though

#

The only reason i added redpajama incite was since it was trained on a large body of text, even though monolingual i thought it would perform decently on wmt14

blissful garden Jun 13, 2023, 5:25 PM

#

unique sedge I tried the smaller pythia for fun on wmt14 and it does not work, don’t think th...

what about en-fr? I feel that pythia probably knows a few French words but not the whole sentences

loud adder Jun 13, 2023, 5:25 PM

#

But throwing more model families in for just one of the tasks is going to make people question why we didn’t run that family for everything and make it look like we are hiding something

patent gull Jun 13, 2023, 5:26 PM

#

^ definitely agree with that

#

only thing is for some sections of the paper we do want to compare instruction-tuned models

loud adder Jun 13, 2023, 5:26 PM

#

(Looking at multilingual models specifically for non-English tasks is a notable counter example)

unique sedge Jun 13, 2023, 5:26 PM

#

Which is why after trying one multilingual generation task, wanted to shift to summarization. And then maybe any other task that we collectively agreed on

patent gull Jun 13, 2023, 5:27 PM

#

@paws summarization is explicitly something that CAD looked at

unique sedge Jun 13, 2023, 5:27 PM

#

Dont mind running polyglot-ko if theres a korean mt or generation task we can easily check on

patent gull Jun 13, 2023, 5:27 PM

#

do we have an alternative/better approach? Or are we just replicating their work?

unique sedge Jun 13, 2023, 5:28 PM

#

Mt and summarization are the most natural generation tasks, from my experience from evaluating LLMs, no other specific reason

patent gull Jun 13, 2023, 5:28 PM

#

https://arxiv.org/pdf/2305.14739.pdf

#

we've got a lot of stuff going on here.... i think we need to ask what point we're trying to make with these generation tasks specifically

blissful garden Jun 13, 2023, 5:29 PM

#

@unique sedge even if we stick to Bloom or RedPajama, we might need to try other languages as well so that the experiments look robust

unique sedge Jun 13, 2023, 5:31 PM

#

Wikilingua is a dataset we can run on, but it will require a lot of compute

loud adder Jun 13, 2023, 5:31 PM

#

Also, for the record, there are no generation tasks with automated metrics that I would feel comfortable putting in a paper

#

I strongly believe the only meaningful metric for generative tasks is human evaluation

patent gull Jun 13, 2023, 5:32 PM

#

yeah i think we need a paws.. err sorry.. pause

#

before rushing into another generation task

loud adder Jun 13, 2023, 5:33 PM

#

This is why it’s standard to reframe generative tasks as MC tasks.

unique sedge Jun 13, 2023, 5:33 PM

#

Yes i agree, but 10% improvement has generally been the norm for reporting “improvements” even just incremental ones on generation tasks, so thought itd be a decent result. I do think evaluation for all generation tasks is broken though fwiw

patent gull Jun 13, 2023, 5:36 PM

#

so the CAD paper focused on prompt fidelity, reducing hallucination, and ignoring conflicting evidence... they looked primarily at summarization (XSUM, CNN-DM) and NQ, NQ-SWAP, MemoTrap (not 100% sure what those last 3 tasks are).

I think we have to have a reason for looking at generic generation tasks beyond that. IMO it's fine (although not especially ground-breaking) to expand into translation with the comment: "Shi et al. found that a similar approach worked for summarization, we expand their results into translation..."

idk, i feel split, like the story isn't particularly new imo. Are you getting any insights into why it's working for translation? Can we analyze the outputs and say something about them beyond "reducing hallucination and knowledge conflicts"?

unique sedge Jun 13, 2023, 5:39 PM

#

^ difficult questions to answer even with human eval i think

loud adder Jun 13, 2023, 5:39 PM

#

I think it’s essential we stop getting distracted by “throw everything at the wall.”

patent gull Jun 13, 2023, 5:39 PM

#

10000%

versed flax Jun 13, 2023, 5:39 PM

#

So we all agree on this.

patent gull Jun 13, 2023, 5:40 PM

#

well I think 10% improvement without any insights or anything is fine for the appendix

loud adder Jun 13, 2023, 5:40 PM

#

We have a narrative about CFG as “like instruction-tuning without the training” that I believe @versed flax is currently getting more results in support of.

patent gull Jun 13, 2023, 5:40 PM

#

i think it needs insights and narrative heft to make it into the main body

loud adder Jun 13, 2023, 5:40 PM

#

(Or maybe he posted it and I missed it)

patent gull Jun 13, 2023, 5:41 PM

#

^ no we're still working on that .... redoing it with parts of the P3 dataset, i had my setup wrong

unique sedge Jun 13, 2023, 5:41 PM

#

The reason for my experiments and reporting was only to check the idea “does it work for generation too” since @versed flax had done it for superglue with a standard MT task. The natural question in generation is “multilingual”. Nothing beyond it.

loud adder Jun 13, 2023, 5:42 PM

#

@patent gull Is the experimental design written up somewhere

patent gull Jun 13, 2023, 5:42 PM

#

unique sedge The reason for my experiments and reporting was only to check the idea “does it ...

i'm not tryna throw shade on your experiments, just trying to clarify where they can go to be narratively useful

unique sedge Jun 13, 2023, 5:42 PM

#

patent gull i'm not tryna throw shade on your experiments, just trying to clarify where they...

Oh no not at all

#

Was just telling what the motivation was

patent gull Jun 13, 2023, 5:42 PM

#

loud adder @patent gull Is the experimental design written up somewhere

I'll write up... I posted something late last night, but it wasn't super fine-grained

versed flax Jun 13, 2023, 5:43 PM

#

As I said like 1h ago, I think we have enough.

I'm questioning whether the GPT4All human eval should even be there (although 1. it's a human study, and 2. it's gonna be our only conclusive experiment with finetuned models).
as I said, imho we're just waiting for the KL experiment to finalize the narrative
we have a qualitative "explanation" with the token-by-token breakdown I posted yesterday.
we show that CFG works on a broad range of tasks and models and I think we need to be careful not to go down each subtask's and model's rabbit hole. The big picture is: it's a tool whatever your task and model is.
(and of course, we have all the harness tests, CoT tests, translation tests, etc)

versed flax Jun 13, 2023, 5:44 PM

#

unique sedge The reason for my experiments and reporting was only to check the idea “does it ...

I'm super happy you did the experiments as they support the global narrative: that's one more task and a few more models where we see results

loud adder Jun 13, 2023, 5:45 PM

#

versed flax As I said like 1h ago, I think we have enough. - I'm questioning whether the GPT...

The KL experiment is comparing T5, T5+CFG, and T0 right? T0 is a finetuned model. Or do you mean finetuned and then with CFG added

versed flax Jun 13, 2023, 5:45 PM

#

loud adder The KL experiment is comparing T5, T5+CFG, and T0 right? T0 is a finetuned model...

base+cfg vs finetuned.
Not sure which models Alex is working with

patent gull Jun 13, 2023, 5:46 PM

#

i had been running two families —> (llama, red pajama-instruct), (tulu-base, tulu-instruct)
I see the arguments for constraining the model classes and I can re-do

loud adder Jun 13, 2023, 5:47 PM

#

patent gull i had been running two families —> (llama, red pajama-instruct), (tulu-base, tul...

llama / RP-instruct are not the same model family

patent gull Jun 13, 2023, 5:47 PM

#

wasn't it fine-tuned from a llama starting point?

versed flax Jun 13, 2023, 5:48 PM

#

patent gull i had been running two families —> (llama, red pajama-instruct), (tulu-base, tul...

I think both are valid? I'd trust a paper showing many different models more, rather than sticking to few. I'd think that the findings may not generalize.

loud adder Jun 13, 2023, 5:48 PM

#

patent gull wasn't it fine-tuned from a llama starting point?

No

blissful garden Jun 13, 2023, 5:48 PM

#

no, red pajama is a replication of the llama dataset (which nobody knows what it looks like) but not exactly, that's what I remember

versed flax Jun 13, 2023, 5:48 PM

#

patent gull wasn't it fine-tuned from a llama starting point?

RedPajama is a recreation of the LLaMa dataset from what we know about it, not the same dataset

patent gull Jun 13, 2023, 5:49 PM

#

ah i'm so sorry...

#

I'm redoing everything anyway so this is great

loud adder Jun 13, 2023, 5:49 PM

#

Let’s make a table of all the experiments we are running and what models they’re being run on

#

So we don’t get lost

patent gull Jun 13, 2023, 5:51 PM

#

I can do that in this doc:

https://docs.google.com/document/d/1jdFD4TdSN_JheeSJDhv4nwGS_hV2xqn_FowqQzlLgHk/edit?usp=sharing

while waiting for p3 to download

blissful garden Jun 13, 2023, 5:51 PM

#

by the way, I already put all codegen results into the draft in overleaf including the charts. That should conclude all this line of experiments.

versed flax Jun 13, 2023, 5:53 PM

#

harness => GPT2-fam, Pythia-fam, LLaMa-7B
Example generations => GPT2
pre-prompt interpretation => GPT4All-J (GPT-J)
code generation (HumanEval) => CodeGen-{350M,2B}
programming language stickiness => GPT-J
translation => Bloom-3B (0-shot, 1-shot), RedPajamas (0-shot), BigScience-mt0 (0-shot)
CoT => WizardLM 30B-SuperCOT 🐔
KL => RedPajama-INCITE (RP), Tülü (LLaMa)

unique sedge Jun 13, 2023, 5:53 PM

#

versed flax harness => GPT2-fam, Pythia-fam, LLaMa-7B Example generations => GPT2 pre-prompt...

Also mt0 and 1shot

#

Ready to run multi-shot experiments to see if it changes behaviour 🤔

patent gull Jun 13, 2023, 5:57 PM

#

qualitative generations are these control experiments you're doing, right @versed flax ? Or is it the kld analysis

patent gull Jun 13, 2023, 5:59 PM

#

patent gull ok now i gotta go to bed.. the results still running, but they're in the followi...

whoops i'm so sorry

versed flax Jun 13, 2023, 5:59 PM

#

patent gull qualitative generations are these control experiments you're doing, right @versed flax ...

qualitative analysis

patent gull Jun 13, 2023, 5:59 PM

#

I was comparing: (llama 7b, tulu 7b) and (red pajama 3b base, red pajama 3b instruct)

fallow egret Jun 13, 2023, 5:59 PM

#

versed flax harness => GPT2-fam, Pythia-fam, LLaMa-7B Example generations => GPT2 pre-prompt...

I'm using WizardLM 30B-SuperCOT, finishing all the experiments with this model today.
I will also test another model (I thought of using guanaco-65B after that to add one more model; tulu-65 might also be a good candidate)

patent gull Jun 13, 2023, 6:00 PM

#

tulu i think is fine-tuned from Llama (https://huggingface.co/allenai/tulu-7b)

loud adder Jun 13, 2023, 6:01 PM

#

patent gull `tulu` i think _is_ fine-tuned from Llama (https://huggingface.co/allenai/tulu-7...

Yea it is

#

(Check out the config file for this info)

patent gull Jun 13, 2023, 6:02 PM

#

ok.. so wait values judgement here — more models or less?

versed flax Jun 13, 2023, 6:03 PM

#

Ultimately that's Stella's judgement here, but to me, the more diverse the merrier => It just works. For all model families, sizes, etc.

patent gull Jun 13, 2023, 6:04 PM

#

instead of Tulu i can do Falcon to be more consistent with Elad

#

or just do all of them

versed flax Jun 13, 2023, 6:04 PM

#

fallow egret I'm using WizardLM 30B-SuperCOT, finishing today with all the experiment with th...

Do you need to run more tests to reach a conclusion?

#

What I mean is that I don't feel like we have to prove for each task that many models respond positively. If we just say that we run task A with model X, B with model Y, and that the choices are just a pick among the ones that make sense for the tasks, overall it shows that CFG reaches all models. Do we agree? Idk what's customary

fallow egret Jun 13, 2023, 6:16 PM

#

versed flax What I mean is that I don't feel like we have to prove for _each_ task that _man...

I think that it's stronger to demonstrate that it works on more than one model (you can always say that the effect is because of something related to the model if it's only one); it's very rare to test the setting on only one model. But it's up to you, in any case I can start helping with the writing until we get the results from the bigger model

versed flax Jun 13, 2023, 6:18 PM

#

fallow egret I think that it's stronger to demonstrate that it works on more than one model (...

Let's have 2 then? My thinking is that each experiment doesn't have to be run on multiple models if the paper uses many overall. But I mean, I'm not opposed to running more if you think it adds value. I just don't wanna get lost in the "let's try with a 12th model" (pointless) or "let's redo all the experiments" (two more weeks of work that maybe won't add anything). I used to work for a place that says Done is better than perfect 😃

patent gull Jun 13, 2023, 7:21 PM

#

we can always get a paper out the door and improve on it later

versed flax Jun 13, 2023, 7:40 PM

#

patent gull we can always get a paper out the door and improve on it later

No, unless it's minimal. Stella is right when she said multiple times that people don't read re-releases

#

it has to be fairly good on the first try

fallow egret Jun 13, 2023, 7:49 PM

#

I think it's fine to polish some of the graphs in a later version as long as you have a full story with a convincing experimental basis in the first version

#

There is the work that is done only for the petty reviewers, and this could be done in the submitted version...

versed flax Jun 13, 2023, 8:00 PM

#

fallow egret I think it's fine to polish some of the graphs in later version as long as you h...

yes exactly, it's fine to polish the paper in v2+, but it should just be polishing

patent gull Jun 13, 2023, 9:28 PM

#

Hmmm interesting and similar paper (but not as similar as CAD):

https://arxiv.org/pdf/2305.07378.pdf

#

Their math is a little different, but CID is technically kinda a more general case, right? Although they don’t go deep at all, they only try one use-case

loud adder Jun 13, 2023, 11:29 PM

#

My girl is coming home and is okay (well, on her way to it)

fallow egret Jun 14, 2023, 12:11 AM

#

patent gull Hmmm interesting and similar paper (but not as similar as CAD): https://arxiv.o...

I think their math is 'strange' (Probability differences?!)
also their experimental setting is unclear, very limited, and shallow.
In my opinion such a paper isn't even worth quoting

blissful garden Jun 14, 2023, 12:42 AM

#

fallow egret I think both their math is 'strange' (Probability differences?!) also their expe...

But quoting related works can also be about respecting other researchers

versed flax Jun 14, 2023, 1:19 AM

#

loud adder My girl is coming home and is okay (well, on her way to it)

I'm so happy she's good

#

Y'all do you want to partake in the GPT4All human study?

#

You'll probably have a few laughs

#

📎 a.txt 📎 b.txt

#

There are two files, a and b.
For each example, provide two votes:

Vote for A, B, or Tie, depending on which example follows the instruction best
Vote for A, B, or Tie, depending on which example addresses the prompt best

Send me your results via DM and I will give you a file with the "answers" and the grading methodology.

loud adder Jun 14, 2023, 1:30 AM

#

versed flax Y'all do you want to partake in the GPT4All human study?

Unironically just post this in #general in like 12 hours and ask people to DM you their answers

versed flax Jun 14, 2023, 1:31 AM

#

loud adder Unironically just post this in #general in like 12 hours and ask pe...

Will do!

versed flax Jun 14, 2023, 1:34 AM

#

loud adder Unironically just post this in #general in like 12 hours and ask pe...

My results are highly significant, with about 87% preference for CFG in instruction following (significant) and about 55% for prompt relevance (not significant). That concludes the experiment very positively, since only the instruction was subjected to CFG.
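
A sketch of how a preference rate like this can be checked for significance, using a one-sided sign test against a 50/50 null. The number of votes is not stated in the thread, so the n = 100 below is a made-up stand-in, ties are assumed to be dropped beforehand, and the actual test used for the study may differ.

```python
from math import comb


def binom_tail(k, n):
    """One-sided p-value: P(X >= k) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


# Hypothetical vote counts: 100 non-tie votes per question.
n_votes = 100
p_instruction = binom_tail(87, n_votes)  # 87% prefer CFG for instruction following
p_relevance = binom_tail(55, n_votes)    # 55% prefer CFG for prompt relevance

print(f"instruction following: p = {p_instruction:.2e}")
print(f"prompt relevance:      p = {p_relevance:.3f}")
```

With these assumed counts the 87% split is far below any usual threshold while the 55% split is not, which is the same qualitative pattern as reported; the real p-values depend on the real number of votes.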

fallow egret Jun 14, 2023, 3:34 AM

#

blissful garden But quoting related works can also be about respecting other researchers

I completely agree, citation is also a mechanism to encourage and acknowledge good work. In my opinion work like this is not something that should be encouraged.

patent gull Jun 14, 2023, 6:14 AM

#

sadly the kl-divergence results we've been waiting for are not promising..

I ran these generation tests on a sample of 500 test examples pulled from 200 datasets in P3 (with token len > 25, < 300).

The JSD(instruction-model logits || CFG logits) is actually consistently higher than JSD (instruction-model logits || vanilla logits), where vanilla is a base prompted model.
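
For readers unfamiliar with the metric, a minimal sketch of the Jensen-Shannon divergence between two next-token distributions. It assumes the logits have already been turned into probabilities over the same vocabulary; how the linked notebook aggregates over positions and examples is not shown here, and the toy distributions are invented.

```python
import math


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy next-token distributions over a 3-token vocabulary (made-up numbers).
instruct = [0.7, 0.2, 0.1]
with_cfg = [0.9, 0.05, 0.05]
vanilla = [0.5, 0.3, 0.2]

jsd_cfg = jsd(instruct, with_cfg)
jsd_vanilla = jsd(instruct, vanilla)
```

The comparison in the message above is between these two quantities, averaged over the sampled P3 examples: the hoped-for result was a smaller divergence for CFG than for the vanilla model.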

#

#

The best explanation I have for this is that the perplexity is actually higher for CFG than it is for Vanilla or Instruction-Tuned:
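
As a reminder of what is being compared, perplexity is the exponential of the average negative log-likelihood per token. The sketch below assumes per-token natural-log probabilities are already available; which model scores the text and how the plots were produced is not stated here.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-probability (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Sanity check with made-up values: every token assigned probability 0.5.
uniform_half = perplexity([math.log(0.5)] * 4)
```

Higher perplexity means the scored text looks less expected to the scoring model, which is the sense in which CFG output is called "more unnatural" in the next message.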

#

#

What i think this means is that CFG's language is overall more unnatural, especially w.r.t. the specific task completions, and that is swamping out any similarities that might exist between instruction-tuned models and CFG models.

notebook is here: https://github.com/Vermeille/lm-evaluation-harness-cfg/blob/cfg-alex/notebooks/analyze-p3-results.ipynb

here's a sampling of the data:

#

{'inputs_pretokenized': '\nWhat is a question that would give the following answer?\nAnswer: NP-complete problems;\nQuestion: ',
  'targets_pretokenized': '\nIf NP is not equal to P, viewed through this lens, what type of problems can also be considered intractable?\n'},
 {'inputs_pretokenized': "Taylor, 27, is taking an indefinite break from cricket after suffering with anxiety and panic attacks.\nNew Zealand all-rounder Amy Satterthwaite will captain the side in Taylor's absence.\nSix Super League teams will play 15 Twenty20 matches in a round-robin tournament that begins on 30 July.\n\nTL;DR: ",
  'targets_pretokenized': ' England wicketkeeper-batter Sarah Taylor will not appear for Lancashire Thunder in the inaugural edition of the Super League.'},
 {'inputs_pretokenized': '\nCombine facts and answer this: What city was the capital of the Ostrogothic Kingdom and the birth place of Ornella Fiorentini?\n',
  'targets_pretokenized': '\nRavenna (] , also locally ] ; Romagnol: "Ravèna" ) is the capital city of the Province of Ravenna\n'}

#

i know there's a lot of different tasks with a lot of different formats in P3.

Does anyone know of specific tasks in P3 that are more suited for zero-shot tasks?? It's my hunch that these tasks are closer to natural-language without fine-tuning for specific cues... I suspect these tasks will have more similar PPLs across models. I will re-run on those tasks in the morning when I wake up (going to bed now, it's late for me).

fallow egret Jun 14, 2023, 6:28 AM

#

What was the CFG value?

patent gull Jun 14, 2023, 6:28 AM

#

1.5

fallow egret Jun 14, 2023, 6:28 AM

#

Yes, it was already noticed that it's too high for many of the tasks

#

Can you try with 1.1?

patent gull Jun 14, 2023, 6:29 AM

#

yup!

blissful garden Jun 14, 2023, 6:30 AM

#

fallow egret Yes, it was already noticed that it's too high for many of the tasks

I thought for most tasks it peaked around 1.5 and decayed afterwards

patent gull Jun 14, 2023, 6:31 AM

#

that's what i had in mind from the charts we were generating from the benchmarks, but did that change in the newer experiments?

fallow egret Jun 14, 2023, 7:13 AM

#

Both @unique sedge and I noticed the decline before and got worse results compared to baseline with 1.5
I'm guessing that it depends on the task: if the task requires long text completion, then with 1.5 it sticks too much to the prompt

unique sedge Jun 14, 2023, 7:26 AM

#

1.5 is too high when it comes to generation

#

1.1-1.25

#

Better values

versed flax Jun 14, 2023, 9:41 AM

#

patent gull

That figure is really hard to trust tbh

#

Especially that peak about #40. I have a really hard time believing that there's a consistent peak, across 500 different prompts, exactly at index 39, 40 and 41.

patent gull Jun 14, 2023, 2:35 PM

#

I’ll look at that when I get into the office… maybe an <eos> bias?

versed flax Jun 14, 2023, 3:18 PM

#

https://cfg.vermeille.fr/ yo everyone you can do a human study here

loud adder Jun 14, 2023, 3:19 PM

#

Screen_Shot_2023-06-14_at_11.19.33_AM.png

#

Screen_Shot_2023-06-14_at_11.19.53_AM.png

patent gull Jun 14, 2023, 3:24 PM

#

So there is a length bias for sure that's probably causing that JSD discrepancy

#

that word index i'm showing in the plots is word-idx of the completion... I'll try aligning to word-index overall. Otherwise, I'll also try resampling datapoints with a more even length distribution

#

aight measuring the overall word-idx is more reasonable

#

loud adder Jun 14, 2023, 3:36 PM

#

This is the opposite of the pattern we would like to see, right?

versed flax Jun 14, 2023, 3:41 PM

#

loud adder

yeah it's not like I set up an SSL key etc (edit: did it now)

patent gull Jun 14, 2023, 3:49 PM

#

loud adder This is the opposite of the pattern we would like to see, right?

yes, it is... I'm trying again with lower CFG values to see if perplexity is a confounder

#

btw, is there a way to connect to a jupyter server on our machine?

#

I'm trying to tunnel using ssh -L 8888:localhost:9000 [email protected] and am getting

channel_setup_fwd_listener_tcpip: cannot listen to port: 8888
Could not request local forwarding.

#

not a network guy, so not quite sure the details, but i'd love to be able to analyze data that's sitting on the machine and not have to scp it around

loud adder Jun 14, 2023, 3:54 PM

#

patent gull I'm trying to tunnel using `ssh -L 8888:localhost:9000 [email protected]` an...

Try using -R not -L

patent gull Jun 14, 2023, 3:55 PM

#

Warning: remote port forwarding failed for listen port 8888

patent gull Jun 14, 2023, 4:25 PM

#

aight it's:
ssh -L <LOCAL_PORT>:localhost:<REMOTE_PORT> [email protected] and then hit the local port in your browser
in case anyone is curious

patent gull Jun 14, 2023, 5:18 PM

#

plots generated with cfg=1.1 look like nearly identical JSD:

#

#

#

I'll try a different measure of similarity -- token overlap in top-p or something
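
A minimal sketch of what a "token overlap in top-p" measure could look like, assuming it means the Jaccard overlap between the two models' nucleus sets at a single position. The function names and the toy distributions are hypothetical and not taken from the notebook.

```python
import numpy as np

def top_p_set(probs, p=0.9):
    """Smallest set of token ids whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]            # token ids, most likely first
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1  # include the token that crosses p
    return set(order[:cutoff].tolist())

def top_p_overlap(probs_a, probs_b, p=0.9):
    """Jaccard overlap between the two nucleus sets at one position."""
    a, b = top_p_set(probs_a, p), top_p_set(probs_b, p)
    return len(a & b) / len(a | b)

# Toy next-token distributions over a 4-token vocabulary (made-up numbers).
cfg_probs = np.array([0.5, 0.3, 0.15, 0.05])
instruct_probs = np.array([0.05, 0.15, 0.3, 0.5])
overlap = top_p_overlap(cfg_probs, instruct_probs)
```

Averaging this value over prompts at each word index would give a curve comparable to the JSD plots.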

loud adder Jun 14, 2023, 5:20 PM

#

Okay, so that seems like a sign that the metric isn't meaningful

#

Given that we know CFG 1.1 gives qualitatively and quantitatively improved results

patent gull Jun 14, 2023, 5:21 PM

#

yeah i think so, i think overlap is something to explore

versed flax Jun 14, 2023, 5:22 PM

#

patent gull

Then, edit distance? overlap?

fallow egret Jun 14, 2023, 5:38 PM

#

patent gull plots generated with cfg=1.1 look like nearly identical JSD:

What is the instruct model you compare to? I didn't find it in the code

patent gull Jun 14, 2023, 6:12 PM

#

this is redpajama-3b-instruct compared with redpajama-3b-base

fallow egret Jun 14, 2023, 6:42 PM

#

patent gull this is redpajama-3b-instruct compared with redpajama-3b-base

If you are using zero-shot, there is no information in the prompt in P3 tasks on the exact format of the expected answer. So it's hard to believe that CFG will have a better JSD with the fine-tuned model (which overfits to the desired structure) than the vanilla model does

fallow egret Jun 14, 2023, 7:41 PM

#

@blissful garden Can you share your script for the CodeGen figures?
I want to align my figures to yours (I will also use your CFG splits 1.0, 1.1, 1.25, 1.5, 1.75, 2.0)

patent gull Jun 14, 2023, 7:49 PM

#

fallow egret If you are using zero-shot, there is no information in the prompt in P3 tasks on...

I think the expectation was more about the semantics of the answer, not the exact answer-format

#

although we can certainly test more directly for format

fallow egret Jun 14, 2023, 7:52 PM

#

patent gull I think the expectation was more about the semantics of the answer, not the exac...

But the metric compares tokens in the same place, so the structure/format alignment has a big effect

patent gull Jun 14, 2023, 7:55 PM

#

sorry what do you mean by "same place"? JSD (instruction (x_n) || CFG (x_n) ) compares the next-word distributions at word index-n, yes, but presumably there's both structural words AND semantic words being generated?
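
For reference, a small sketch of the per-position comparison described here, assuming each model yields one next-token distribution per word index n. The arrays are made-up toy values, and `jsd` is a hypothetical helper rather than the notebook's code.

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (in nats) along the last axis."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=-1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=-1)
    return 0.5 * (kl_pm + kl_qm)

# Shape (positions, vocab): one next-token distribution per word index n.
instruct = np.array([[0.7, 0.2, 0.1], [1.0, 0.0, 0.0]])
cfg = np.array([[0.7, 0.2, 0.1], [0.0, 1.0, 0.0]])
per_position = jsd(instruct, cfg)  # one JSD value per index n
```

Identical distributions give 0 and fully disjoint ones give ln 2, so a format mismatch at a single index can dominate that position regardless of semantics.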

#

your point is well-taken though. hold on, let me run the token overlap metric, then maybe we need to rethink this experiment to maybe be more format-focused

blissful garden Jun 14, 2023, 8:01 PM

#

fallow egret @blissful garden Can you share your script for the CodeGen figures? I wan...

The plotting script is this. Is this what you are asking? I aggregated the per-task results in df with columns named after the filenames, and there is an estimator function to calculate pass@k.

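import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

# Assumes df, cfg_vals and estimator(n, c, k) are already defined elsewhere.
for k in [1, 10, 100]:
    plt.title(f"CodeGen 2B, pass @{k}")
    for temp in [0.2, 0.6, 0.8]:
        # One results column per CFG value at this sampling temperature.
        labels = [f"2b_res/samples_{r}_codegen-2B-mono_temp_{temp}.jsonl_results.jsonl" for r in cfg_vals]
        # df presumably holds per-task pass fractions; x256 turns them into correct-sample counts c.
        to_plot = (df[labels]*256).apply(lambda series: series.apply(lambda c: estimator(n=256, c=c, k=k))).mean().to_numpy()
        plt.plot(cfg_vals, to_plot*100, label=f"temp={temp}", marker=".", markersize=15)

    # Show the y axis as percentages.
    plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(decimals=0))
    plt.xlabel("cfg")
    plt.legend()
    plt.savefig(f"pass{k}.png")
    plt.show()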
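
The script calls `estimator(n=256, c=c, k=k)` without showing it. A plausible sketch, assuming it is the usual unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), computed as a numerically stable product (this is an assumption, not the shared code):

```python
import numpy as np

def estimator(n, c, k):
    """Unbiased pass@k for n samples of which c pass: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so any k samples include a pass
    # Product form of C(n - c, k) / C(n, k); avoids huge binomial coefficients.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For example, with n=4 samples of which c=2 pass, `estimator(4, 2, 2)` is 1 - 1/6, since only one of the six possible pairs contains no passing sample.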

fallow egret Jun 14, 2023, 8:03 PM

#

blissful garden The plotting script is this. Is this what you are asking? I aggregated the per-t...

Thanks! I just want to sync on the exact format (package, markers, etc.)

patent gull Jun 14, 2023, 8:05 PM

#

Marginally more token overlap in top_p = 90% for CFG || Instruct, but probably not statistically significant:

#

(attached plot: token overlap in top_p = 90%, CFG vs. Instruct)

#

versed flax Jun 15, 2023, 12:46 AM

#

https://cfg.vermeille.fr/ We need more votes. We have like 10 respondents despite my advertising. Does anyone with a wide enough audience feel like tweeting about it or something?

#

Thank you so much

fallow egret Jun 15, 2023, 5:51 AM

#

versed flax https://cfg.vermeille.fr/ We need more votes. We have like 10 respondents despit...

What is the length of it? It seems to be very long (many prompts). Are you shuffling the questions? Because it's hard to believe that people will do more than a few prompts

patent gull Jun 15, 2023, 6:20 AM

#

This is a slight difference, but it seems like CFG and instruction-tuning have more token overlap at later points in the generation:

#

#

i need to gather more data to show significance, if at all

#

but the idea here is that CFG plays the same role as instruction-tuning for long sequences

#

it's really the only thing I could find. I tried other runs with different datasets. I'm trying a run now with longer sequences to see if this holds up

fallow egret Jun 15, 2023, 7:24 AM

#

I'm wondering if the length is correlated with the task type/complexity

versed flax Jun 15, 2023, 9:00 AM

#

fallow egret What is the length of it? It seems to be very long (many prompts). Are you shuff...

There are more than 250. Yes it's shuffled.

patent gull Jun 15, 2023, 9:35 AM

#

fallow egret I'm wondering if the length is correlated with the task type/complexity

I’m sampling tasks with completions that fall into a pretty narrow band of length (50-100 tokens).

unique sedge Jun 15, 2023, 11:47 AM

#

Looking forward to the results

#

The prompts and answers are long sometimes

fallow egret Jun 15, 2023, 5:02 PM

#

I'm currently stuck with running the experiments, since I still haven't gotten access to tulu-65B 😦
If anyone here has access (@patent gull?), I would appreciate it if they could download the model to the instance.

patent gull Jun 15, 2023, 5:05 PM

#

i don't have access either , i was using smaller versions (7b) for my experiments

#

but let me check, maybe they're leaked online somewhere

#

not immediately finding it. I applied with my academic information, will let you know if/when i get approved

#

sorry i don't have a better answer

versed flax Jun 15, 2023, 5:14 PM

#

(current) results of the human evaluation, 460 votes

fallow egret Jun 15, 2023, 5:32 PM

#

versed flax (current) results of the human evaluation, 460 votes

Very nice! if I understand correctly it seems to support our other findings: The main effect of CFG is following the instruction and keeping the required structure. However, it seems not to improve the 'quality' of the model results.

versed flax Jun 15, 2023, 5:33 PM

#

fallow egret Very nice! if I understand correctly it seems to support our other findings: The...

Reminder: only the instruction is subject to CFG, not the prompt

#

in fact the figure is extremely positive: it shows that for CFG=3 we get peak instruction following (thanks to CFG) while not harming prompt relevance (not subject to CFG)

fallow egret Jun 15, 2023, 5:37 PM

#

Ok, so I'm confused. What does the orange line in the chart mean?

versed flax Jun 15, 2023, 5:37 PM

#

have you looked at the survey?

#

https://cfg.vermeille.fr/

patent gull Jun 15, 2023, 5:38 PM

#

ahhhh so the orange line means the "user-prompt" part of the prompt is preferred in CFG=4 30% of the times against baseline?

fallow egret Jun 15, 2023, 5:40 PM

#

versed flax https://cfg.vermeille.fr/

Yes, I thought the orange is the second question: Which satisfied the prompt best?

patent gull Jun 15, 2023, 5:40 PM

#

not to be persnickety, I know I already commented on ylabel, but just looking at the axes again, I would label:
y= "CFG win rate (%)"
x= "Guidance Strength (CFG \gamma)"

unique sedge Jun 15, 2023, 5:40 PM

#

versed flax (current) results of the human evaluation, 460 votes

From what i understand, solid results

#

Great work

versed flax Jun 15, 2023, 5:41 PM

#

patent gull ahhhh so the orange line means the "user-prompt" part of the prompt is preferred...

Yes. So CFG=4 actually degrades how well the model responds to the prompt

#

but CFG=3 does not... AND reaches its peak instruction following

versed flax Jun 15, 2023, 5:45 PM

#

fallow egret Yes, I thought the orange is the second question: `Which satisfied the prompt be...

That's what it is, yes.

fallow egret Jun 15, 2023, 5:50 PM

#

versed flax That's what it is, yes.

Ok, I think I now understand your remark regarding not applying the CFG on the prompt. What happens if you also apply it to the prompt part? In this case I'm guessing the results are less satisfying

versed flax Jun 15, 2023, 5:51 PM

#

fallow egret Ok, I think I now understand your remark regarding not applying the CFG on the p...

I don't know. I couldn't think of a "neutral" user prompt for the negative input

#

And GPT4All doesn't respond well if we just omit it

fallow egret Jun 15, 2023, 5:53 PM

#

versed flax I don't know. I couldn't think of a "neutral" user prompt for the negative input

What do you mean by 'neutral' user prompt?
You have many questions there that require structured results (like 'rap song', list, etc.)

versed flax Jun 15, 2023, 5:54 PM

#

ok

#

Current test.
Positive CFG prompt:

Instruction:
The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response as a PhD thesis.
Prompt
What is the best way to cook a steak?
Response
Negative CFG prompt:
Instruction
The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
Prompt
What is the best way to cook a steak?
Response
(difference highlighted)

#

If you wanted to apply the same logic to the user input ("prompt"), what would you write in the negative version?

#

(I couldn't find a satisfying solution to that problem, whereas I noticed that the instruction was very often not followed properly, so it made a good test)

fallow egret Jun 15, 2023, 6:00 PM

#

Why not use an empty negative prompt (as in all other experiments)?
As I already mentioned, I think the empty negative is a very strong differentiator and the main claim for novelty given CAD.

versed flax Jun 15, 2023, 6:00 PM

#

The empty negative is exactly what CAD is doing, actually

#

they do not provide different inputs

#

they contrast P(x|query, context) and P(x|query)

fallow egret Jun 15, 2023, 6:01 PM

#

No, they are not. They are giving part of the prompt itself as negative

fallow egret Jun 15, 2023, 6:02 PM

#

versed flax they contrast P(x|query, context) and P(x|query)

Yes, so the negative is the query, it's not empty!

versed flax Jun 15, 2023, 6:03 PM

#

It's the same text anyway.

And I can't help it but read it as P(x + query | context) vs P(x + query)

fallow egret Jun 15, 2023, 6:07 PM

#

versed flax It's the same text anyway. And I can't help it but read it as P(x + query | con...

It doesn't matter, this choice of prompt split is task-specific prompt engineering.
The novelty of the proposed CFG is that you don't need to do this task prompt engineering.
It is much simpler, and it works well for a wide range of tasks, so I think that it's a strong claim for novelty

versed flax Jun 15, 2023, 6:09 PM

#

The novelty of the proposed CFG is that you don't need to do this task prompt engineering.
I think there might have been some misunderstanding then? Negative prompting has been a topic since the beginning, especially when we were talking about alignment.
And given the enormous impact negative prompting had on txt2img, there wasn't a single moment I did not think of elaborating on it

#

Anyway, the results are here, they're excellent, and they address a current very hot use-case. I strongly vote against not including the experiment.

fallow egret Jun 15, 2023, 6:13 PM

#

I agree that it was important (my first experiment was also with negative prompting). But I think that given CAD, we should rethink it.
I agree that the results are great, I'm just wondering whether it also works without the negative prompting (i.e., using the regular CFG). If yes, then I think it's stronger. If not, then it sounds reasonable to keep it as is with these good results

versed flax Jun 15, 2023, 6:13 PM

#

fallow egret I agree that it was important (my first experiment was also with negative prompt...

The model is just disturbed without the proper formatting unfortunately

patent gull Jun 15, 2023, 10:05 PM

#

our tables aren't fitting:

#

#

does anyone know what a canonical task-split would be, to split up the tables?

#

 'arc_easy',
 'boolq',
 'hellaswag',
 'lambada_openai',
 'piqa',
 'sciq',
 'triviaqa',
 'winogrande']

patent gull Jun 15, 2023, 11:20 PM

#

nevermind

patent gull Jun 16, 2023, 3:31 PM

#

hi @fallow egret I'm trying to edit section 3 at some point today and would love to be able to weave CoT into the larger story when i do that.

No need to write in results/etc. but if you get a chance, can you put a bit of big-picture experimental overview, as well as maybe a sentence motivating why we did CoT and what it shows us, in section 3.4?

#

3-5 sentences is all i'm asking for

fallow egret Jun 16, 2023, 3:32 PM

#

Yes, sure. I already started the first paragraph; will add more stuff soon.

patent gull Jun 16, 2023, 3:32 PM

#

sure, great 🙂 thanks!!

fallow egret Jun 16, 2023, 3:33 PM

#

where do you want me to put the motivation?

patent gull Jun 16, 2023, 3:34 PM

#

First sentence of 3.4 motivates CoT
Last sentence of 3.4 wraps up CoT, summarizes findings and leaves us with insight

#

I'll worry about working a summary of that ^ into the head of section 3, the intro and the abstract

#

I already did that a bit, but will be making it sharper

#

let's decide where @unique sedge work is going. Are we gonna mention translation in 3.1 along with general benchmarks? Do we have another short section just titled "Generation"?

#

I'm still of the opinion that the big table of translation results belongs in the appendix but feel free to overrule me

unique sedge Jun 16, 2023, 4:30 PM

#

patent gull let's decide where @unique sedge work is going. Are we gonna mention tra...

I think I'll mention it immediately after you do GLUE, will try to have a draft up in a few hours and run it by you

patent gull Jun 16, 2023, 4:31 PM

#

okk just beware, i'm a ruthless editor

#

btw if anyone isn't doing anything and wants to poke around, it would be really cool to get this working for CFG to get some demo plots for the appendix:

https://github.com/HendrikStrobelt/LMdiff/tree/main

#

https://lmdiff.net/

#

actually... nevermind, I can probably just mock up their visuals in latex

versed flax Jun 16, 2023, 6:25 PM

#

patent gull btw if anyone isn't doing anything and wants to poke around, it would be really ...

I'm available, I can work on something

patent gull Jun 16, 2023, 6:46 PM

#

did anyone do anything to the machine?

#

weird stuff is happening that wasn't happening before

patent gull Jun 16, 2023, 7:01 PM

#

i think it has to do with a tulu update, best guess

fallow egret Jun 16, 2023, 7:13 PM

#

which machine?

#

In any case, I'm now running Guanaco-65B.
It takes some time to get results... But I added the subsection with a placeholder for the results (fingers crossed that they will support the story). I tried to keep it short and minimal with a coherent story, so I omitted many experiment results. We can later add more stuff in the appendix
cc @patent gull

patent gull Jun 16, 2023, 8:12 PM

#

yeah, appendix is a great place

#

I'm trying to find a good instruction-tuned model based off of pythia, since our benchmarks use pythia

#

I'm looking at dolly, but the vocab size is different from pythia, even though they say they based theirs off of pythia

#

possibly dumb q, but does anyone know why an instruction-tuned model might have a different vocab size from a base model?

#

https://huggingface.co/databricks/dolly-v2-12b/blob/main/config.json
https://huggingface.co/EleutherAI/pythia-12b/blob/main/config.json

versed flax Jun 16, 2023, 8:34 PM

#

yeah maybe they expanded it?

#

if we have the vocab files we can check if one is the prefix of the other
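
A quick sketch of that prefix check, assuming each vocab file can be loaded as a token-to-id mapping. The toy dictionaries below are hypothetical stand-ins for the real vocab files.

```python
def is_vocab_prefix(small, big):
    """True if `small` occupies ids 0..len(small)-1 and every one of its
    tokens maps to the same id in `big`."""
    ids_contiguous = sorted(small.values()) == list(range(len(small)))
    same_ids = all(big.get(tok) == idx for tok, idx in small.items())
    return ids_contiguous and same_ids

# Hypothetical toy vocabularies (token -> id).
base_vocab = {"a": 0, "b": 1, "c": 2}
extended_vocab = {"a": 0, "b": 1, "c": 2, "<extra>": 3}

print(is_vocab_prefix(base_vocab, extended_vocab))
print(is_vocab_prefix(extended_vocab, base_vocab))
```

If the check passes in one direction, the larger vocabulary only appends tokens, so logits over the shared ids can be compared directly.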

loud adder Jun 16, 2023, 8:38 PM

#

patent gull possibly dumb q, but does anyone know why an instruction-tuned model might have ...

There’s two possibilities that seem vaguely reasonable to me:

The Pythia tokenizer is padded with unused tokens because there are “magic numbers” for vocab size that run faster. Maybe Databricks doesn’t know this and is using the “real” tokenizer size (I don’t recall the numbers off the top of my head but can check in half an hour).
There are some reasonable improvements one could make to the tokenizer that would shrink it. Most notably, introducing digit-level tokenization of numbers.

If it’s not one of those two things I’m at a loss

patent gull Jun 16, 2023, 8:48 PM

#

they both use the GPTNeoX tokenizer 🤷‍♂️

#

anyway, I got around it by just comparing JSD for the top-k=10,000 tokens from the base model
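
A sketch of how such a workaround might look, assuming both models share token ids for the common part of the vocabulary and that both distributions are renormalized after truncation. Both points are assumptions, and the helper names are hypothetical.

```python
import numpy as np

def restrict_to_top_k(base_probs, other_probs, k):
    """Keep the k most likely token ids under the base model; renormalize both."""
    top_ids = np.argsort(base_probs)[::-1][:k]
    p = base_probs[top_ids] / base_probs[top_ids].sum()
    q = other_probs[top_ids] / other_probs[top_ids].sum()
    return p, q

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (in nats) between two distributions."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)))
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)))
    return 0.5 * (kl_pm + kl_qm)

# Toy example: the second model has one extra token id at the end.
base = np.array([0.4, 0.3, 0.2, 0.1])
other = np.array([0.4, 0.3, 0.2, 0.05, 0.05])
p, q = restrict_to_top_k(base, other, k=3)
divergence = jsd(p, q)
```

Truncating to the base model's top-k sidesteps the vocab-size mismatch, at the cost of ignoring any mass the other model puts outside that set.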

versed flax Jun 17, 2023, 1:59 AM

#

I was thinking that the GPT4All experiment did not do real justice to negative prompting as the "negative" prompt was actually semantically neutral.

#

I started qualitative (reverse) alignment experiments and some things started to appear

#

The per-token CFG ranking is a godsend tool for prompt engineering
just display argsort(P(x|new_prompt) - P(x|current_prompt)) and it'll show all the tokens your new prompt will encourage and discourage compared to the previous one
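
A toy sketch of that ranking, assuming the two next-token distributions are already available as arrays. The vocabulary and probabilities are made up, and `rank_token_shifts` is a hypothetical name.

```python
import numpy as np

def rank_token_shifts(p_new, p_current, vocab, top=3):
    """Sort tokens by P(x|new_prompt) - P(x|current_prompt)."""
    diff = np.asarray(p_new) - np.asarray(p_current)
    order = np.argsort(diff)  # ascending: most discouraged first
    discouraged = [vocab[i] for i in order[:top]]
    encouraged = [vocab[i] for i in order[::-1][:top]]
    return encouraged, discouraged

vocab = ["steak", "thesis", "lol", "therefore"]
p_current = np.array([0.4, 0.1, 0.4, 0.1])
p_new = np.array([0.3, 0.35, 0.1, 0.25])
encouraged, discouraged = rank_token_shifts(p_new, p_current, vocab, top=1)
print(encouraged, discouraged)
```

With a real model, the two arrays would come from one forward pass per prompt on the same continuation prefix.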

#

It made my experiments a lot faster

#

I think I will run a neg prompt experiment with (reverse) alignment

compare the toxic response of:
positive toxic prompt / negative standard prompt
vs
positive toxic prompt / negative safe ethical prompt
and (hopefully) display a difference

patent gull Jun 17, 2023, 2:08 AM

#

these are great ideas for experiments

versed flax Jun 17, 2023, 2:08 AM

#

Which... 3) made @blissful garden and @patent gull say that it would be actually better to take the assistant experiments out of the CFG paper to build another one, dedicated to assistants. One of the main arguments being that the GPT4All results are quite strong and would deserve their own paper. I have mixed feelings about that, especially because 1) I can't really take more time not doing my job, and 2) I don't think there's enough material for another paper dedicated to experimenting with assistants

blissful garden Jun 17, 2023, 2:35 AM

#

versed flax Which... 3) made @blissful garden and @patent gull say that it wo...

oh we don't need to take it "out" of the paper. I just think it's something that can potentially be extended into papers (if any) building on top of it.

patent gull Jun 17, 2023, 2:07 PM

#

I think there's a strong case here for another paper, and a little too many different directions for 1 paper... however, I know we're all busy and might not have time

#

but i do think it's practical right now to decide how we're going to treat the current (imo v important) direction that Vermifuge is going in right now in the current paper

versed flax Jun 17, 2023, 3:35 PM

#

patent gull but i do think it's practical right now to decide how we're going to treat the c...

To properly disclose everything I have been trying to make a case of negative prompting without using an assistant but it's hard

#

I don't fully know why it's hard yet, I'm investigating

versed flax Jun 17, 2023, 3:53 PM

#

I mean, I would love to investigate but I locked myself out of my flat 🤡

patent gull Jun 17, 2023, 3:53 PM

#

can you clarify for ppl in the thread (and maybe for me a little bit 😅) what you mean by "without using an assistant"?

and by a negative prompt, just to give a super dum example, you mean:

p(x | "be happy") - b* p( x | "be mean") ===> extra happy, right?

versed flax Jun 17, 2023, 3:57 PM

#

patent gull can you clarify for ppl in the thread (and maybe for me a little bit 😅) what yo...

Just a regular base LM, like GPT-J

versed flax Jun 17, 2023, 3:57 PM

#

patent gull can you clarify for ppl in the thread (and maybe for me a little bit 😅) what yo...

Yes but I would love for it to be a bit more interesting than having a neg prompt that's just the opposite of the positive prompt

#

Like, I've tried a positive prompt "A conversation between a human and an AI assistant:" with GPT-J

patent gull Jun 17, 2023, 3:59 PM

#

no prob, just checking.

and can i ask why in this case we care about replicating your earlier results without assistants? I think showing we can improve assistants is cool, too

versed flax Jun 17, 2023, 4:00 PM

#

And it's kind of disappointing right because even if promptless CFG increases the success rate, it's still not 100%

patent gull Jun 17, 2023, 4:00 PM

#

what do we define as 100%?

versed flax Jun 17, 2023, 4:06 PM

#

Like, a common failure case is that it stops the conversation and proceeds to translate it

#

(no idea why)

#

So I'm trying to address this by discouraging this behavior with the negative prompt

blissful garden Jun 17, 2023, 5:49 PM

#

So what's the state right now? Are we trying to add more stuff or wrapping up for arxiv early next week?

versed flax Jun 17, 2023, 6:01 PM

#

blissful garden So what's the state right now? Are we trying to add more stuff or wrapping up fo...

wrap it up

patent gull Jun 17, 2023, 6:40 PM

#

Yah let’s wrap

blissful garden Jun 17, 2023, 7:21 PM

#

@loud adder do you have time to take a look at the draft over the weekend?

fallow egret Jun 18, 2023, 6:47 AM

#

Finally I have the results for Guanaco-65B (just waiting for cfg=2, should be finished in ~6h).
Results look really good, we got a 6% improvement!! The results also support the story that the improvement is correlated with the reduction of invalid retrieved results.
And lastly, after digging into the previous works in the field I found a reference which relates to the divergence issue in the CoT. So overall we now have a full, strong story with respect to this subsection.
I will add everything after I get the final cfg=2 results

versed flax Jun 18, 2023, 12:19 PM

#

fallow egret Finally I have the results for Guanaco-65B (just waiting for cfg=2, should be fi...

Don't forget to add your name to the authors' list after you're done 😉

loud adder Jun 18, 2023, 1:03 PM

#

blissful garden @loud adder do you have time to take a look at the draft over the week...

~~I can on Monday morning.~~ this sounds more fun than cleaning, lemme take a look

versed flax Jun 18, 2023, 1:05 PM

#

I love the edit 🤣

loud adder Jun 18, 2023, 1:18 PM

#

For Table 6, I think we should probably be looking at the change in logits (P(x|y) - 2P(x)) rather than the CFG logits? Currently it’s unclear how much of what we see is actually due to the CFG

versed flax Jun 18, 2023, 1:26 PM

#

loud adder For Table 6, I think we should probably be looking at the change in logits (P(x|...

I don't get why the factor 2

#

cfg is $\log P(w_t|w_{<t}) + \gamma(\log P(w_t|w_{<t}) - \log P(w_t|\hat{w}))$
I'm visualizing the parenthesis
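
To make the distinction concrete, here is a rough sketch that returns both the guided log-probabilities and the difference term on its own. It assumes the common CFG form, guided = uncond + γ·(cond − uncond), where "uncond" is the model run without the prompt (or with the negative prompt). The variable names are hypothetical and may not match the paper's notation exactly.

```python
import numpy as np

def log_softmax(logits):
    z = logits - np.max(logits)
    return z - np.log(np.sum(np.exp(z)))

def cfg_log_probs(cond_logits, uncond_logits, gamma):
    """Return (guided log-probs, difference term log P_cond - log P_uncond)."""
    log_p_cond = log_softmax(np.asarray(cond_logits, dtype=float))
    log_p_uncond = log_softmax(np.asarray(uncond_logits, dtype=float))
    delta = log_p_cond - log_p_uncond  # the term inside the parentheses
    guided = log_softmax(log_p_uncond + gamma * delta)
    return guided, delta

cond = [2.0, 1.0, 0.0]
uncond = [1.0, 1.0, 1.0]
guided, delta = cfg_log_probs(cond, uncond, gamma=1.5)
```

Sorting tokens by `delta` shows what the guidance pushes toward or away from, independent of what the base model already favored. With gamma=1 the guided distribution reduces to the conditional one.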

vital pondBOT Jun 18, 2023, 1:27 PM

#

vermifuge

loud adder Jun 18, 2023, 2:20 PM

#

versed flax I don't get why the factor 2

Because I just woke up

versed flax Jun 18, 2023, 2:21 PM

#

😱 it happens to the best of us!

loud adder Jun 18, 2023, 5:00 PM

#

Some general thoughts

This is a very good start to the paper. The results are strong and the narrative (will be) compelling.

I was quite impressed by how well presented the info in tables is in general... that's a skill that takes a while to develop and one of the first things I do when I read papers by less experienced authors is rewrite their tables. Really the only thing I would change about any of the tables is in Table 2, "improvement" is a very long column title compared to the others... maybe using a capital Greek delta to represent change would be a better move.

It is obvious to me that a substantial portion of this text was written by non-native English speakers. I know that being judged for that sucks, and I'm not thinking any less because of that, but some reviewers are assholes about it. I don't think it's necessary to sweat about right now, but I'll do some passes for grammar and word usage when the text is closer to final. The two biggest things that stood out to me were the overuse of commas (it slows down my reading and makes me backtrack) and "co-temporaneously" (which isn't a word, I suspect "contemporaneously" was meant).

There's a number of times that the language in the paper is hesitant or non-committal with its wording. The best example of this is:

Considering all the experiment results, the reduced improvement or even negative effect of CFG on pass@100 may be attributed to a reduction of the variance of solutions. This is also consistent with our intuition of CFG: it improves the adherence to the prompt, but (perhaps inevitably) reduces the variations and creativity.

Do we think this is the case or not? If so, let's provide an actual argument. The fact that CFG reduces variation is well established both theoretically and empirically, so we can just state that. I think a crucial observation here (which is implied but not actually stated) is that with 100 shots having more variation is better because you can get lucky.
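
A toy illustration of the "you can get lucky" point, with entirely made-up per-problem success rates and assuming independent samples. The peaked sampler wins at pass@1, while the more varied one wins at pass@100.

```python
import numpy as np

def expected_pass_at_k(per_problem_success, k):
    """Mean over problems of P(at least one of k independent samples passes)."""
    p = np.asarray(per_problem_success, dtype=float)
    return float(np.mean(1.0 - (1.0 - p) ** k))

diverse = [0.1] * 10            # every problem is solved occasionally
peaked = [1.0] * 2 + [0.0] * 8  # two problems always solved, the rest never

for k in (1, 100):
    print(k, expected_pass_at_k(diverse, k), expected_pass_at_k(peaked, k))
```

This is only a caricature of reduced variation, but it shows how a method can raise pass@1 and lower pass@100 at the same time.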

Section 1 and 2

Right now, the paper pitch reads as "CFG is awesome -> let's see if we can apply it to NLP -> hey look it improves these things that LLMs struggle with." While this is historically faithful to the experimental design process, it will read more compellingly if we flip the order a bit to "here's a problem with LLMs -> we adapt a method from a related field to work here -> it solves the problem."

More concretely, I would reorganize the first two sections to have Section 1 go:

LLMs are awesome, but guiding generation is hard
Detailing examples of how people have tried to solve this in the past
Introducing CFG (including Eq 1)
Explaining why CFG is a better solution (in some contexts) than the other approaches, and hint at why it works
Paper outline

Section 2 can then delve deeper into our approach, and if necessary compare and contrast mathematically with other approaches. This section should also assert our theory for why it works the way it does, even if we don't provide evidence of that until much later. By priming people with the right ideas, they'll be able to get a good intuition for our results as they come. It's fine to just say we'll demonstrate this later.

#

Section 3

This section is currently very disconnected and hard to provide concrete feedback on. In particular, it's unclear when various results are missing if they're missing because we haven't added them yet or because we never ran them. The lack of GPT-2-xl (completely unmentioned), Pythia models larger than 2.8B, and LLaMA models other than 7B are also notable. We are going to include those in the final copy, right?

Despite the fact that CFG almost doubles the computation during inference, the comparison is still noteworthy given that other models with comparable performances on Lambada(OpenAI) have much more parameters and would still require more compute than LLaMA 7B with CFG

Great, let's put inference FLOP on the x-axis and performance on the y-axis to show this!

3.3 Program synthesis evaluations

Why do we evaluate GPT-J, a model that hasn't been discussed at all previously in the paper? Is there a reason besides computing expense to not run the experiments in this paper on all Pythia and all CodeGen models? Why was CodeGen chosen specifically over other code models?

3.5 Chain-of-Thought

At this point, the length of Section 3 starts to bother me as a reader. Section 3 seems to have five pages of a nine page paper, and more subsections than all the other sections combined. We should think about how to break it up into at least two sections.

Figure 3 is not very impressive. Why present the Guanaco-65B results at all, if they're basically unchanged? The two plots that are combined into this figure don't seem to fit together very well either, making reading needlessly confusing.

For Table 6, I think we should probably be looking at the change in logits between traditional sampling and CFG sampling (I don't think that that's what's going on right now, and if it is the caption should explicitly say that). There's a couple words that I disagree with the classification of in this table ("Draco" is about dragons, "restrooms" and "museam" are not more about Paris than any major city), but the most problematically non-highlighted terms are "Rockefeller" "Plaza" "Times" and "Symphony." I strongly suspect that these are the beginning of non-Parisian place names such as Rockefeller Center (Washington, DC) and Times Square (NYC). There are several places Plaza and Symphony can be the start of as well. Omitting blue highlights on the left side is something we want to be very careful to avoid, as it may look like we are overstating our results by trying to hide something.

Section 4

I am very excited for this section to exist. There's now a decent amount of lit on reverse engineering learned algorithms and building causal models of the behavior of transformers we can potentially take something from. What's the intended scope and ambition of this section?

Section 5

Personally, I really dislike "additional related work" sections. I think that the best written version of a paper rarely has one. This doesn't mean we can't or shouldn't have one, but something to think about is if this work is so related, why don't we cite it previously ever?

This section contains a hidden killer objection to our paper: "efforts at controlled generation largely fell out of favor with the advent of instruction-tuning." Given this, we need to present a compelling argument for why our method should not be viewed as irrelevant. Compute-to-impact tradeoffs are our friend here, and can be cited and stressed in Section 2. Also, we claim that CFG can be layered on top of other methods successfully. Do we know if the same is true for other prompt-based controlled generation techniques?

#

One more general note: repetition is your friend. Tell someone what you're going to tell them. Then tell them it. Then remind them that you told them it. This tends to substantially increase the readability of papers.

blissful garden Jun 18, 2023, 5:35 PM

#

To compute flops can we simply do 1/3 of the estimate here
https://github.com/EleutherAI/gpt-neox/blob/c00ce70c49ebdf368fcb36fad0ba9a15b607e195/megatron/logging.py#L82

versed flax Jun 18, 2023, 5:45 PM

#

Thank you very much! I don't have anything specific to answer (but we're taking everything into account), besides answering the explicit questions you ask.

Sec 3) I thought that GPT2-xl was kept unreleased by OpenAI, and apparently Honglu did not know that there were bigger Pythias. Then we did LLaMA and thought that we had too many plots and data points to even add more.

Sec 3.2) We use GPT-J because we replicate the experiment? Is that invalid? Also because I'm evaluating this manually and each model we'd add is 2 or 3h of labeling for me. Honglu will answer about CodeGen.

Sec 4) I will fix table 6. Thank you. My lack of US culture shows, it's in good faith.

Sec 4) contains Table 6 and Alex will add his findings. Spoilers: CFG is significantly closer to an instruction model than vanilla.

Sec 5) CFG starts to bridge the gap to instruction. Plus we also show CFG on GPT4All so even those models can benefit. If it's not clear, I will need to think of a way to state it clearly.

Thank you very much!

loud adder Jun 18, 2023, 6:17 PM

#

blissful garden To compute flops can we simply do 1/3 of the estimate here <https://github.com/E...

More succinctly, for a forward pass FLOP = 2*params*tokens
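A minimal sketch of that estimate, for putting inference FLOP on a plot axis. The function names and the nominal parameter count are illustrative and not taken from the gpt-neox code. It assumes the rule of thumb above (2 * params * tokens per forward pass) and assumes CFG costs two forward passes per generated token (one with the prompt, one without). The "1/3" suggestion is consistent with the common 6 * params * tokens training approximation (2 for the forward pass, 4 for the backward pass), if that is what the linked estimate computes.

```python
def forward_flops(n_params: int, n_tokens: int) -> int:
    """Approximate FLOPs of one forward pass: 2 * params * tokens."""
    return 2 * n_params * n_tokens


def inference_flops(n_params: int, n_tokens: int, use_cfg: bool = False) -> int:
    """Approximate inference FLOPs, with or without CFG.

    Assumption: CFG needs a conditional and an unconditional forward
    pass per token, so it is counted as two passes.
    """
    passes = 2 if use_cfg else 1
    return passes * forward_flops(n_params, n_tokens)


if __name__ == "__main__":
    n_params = 7_000_000_000  # nominal size of a "7B" model (illustrative)
    for use_cfg in (False, True):
        flops = inference_flops(n_params, n_tokens=1, use_cfg=use_cfg)
        print(f"cfg={use_cfg}: {flops:.2e} FLOPs per token")
```

This ignores attention-specific terms and any prompt caching, so it is only a coarse x-axis value.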

blissful garden Jun 18, 2023, 7:28 PM

#

about the choice of CodeGen, it's literally the only one I know that

has various sizes
has a well-written paper about it.
And since HumanEval are all python tasks, choosing CodeGen-**-mono may be a good choice to focus on program correctness alone (codegen multi or codegen 2 may mix up other languages)
I will put these in the draft.

The reason I didn't run 6B was simply that 2B with HF model was already taking a shit ton of time (was 8hr or something) for unknown reasons. In OpenELM we are also extremely reluctant to do inference on 6B unless we get a Triton-FT server set up already.

fallow egret Jun 18, 2023, 9:26 PM

#

Regarding Figure 3 in CoT, it was just a placeholder figure (with garbage numbers). I now put there the figure with the real results...

blissful garden Jun 18, 2023, 9:34 PM

#

fallow egret Regarding Figure 3 in CoT, it was just a placeholder figure (with garbage number...

I saw you mentioned "long reasoning chains can diverge", is it possible to show that the improvement is because the reasoning chains do not diverge?

fallow egret Jun 18, 2023, 9:36 PM

#

blissful garden I saw you mentioned "long reasoning chains can diverge", is it possible to show ...

It is demonstrated in the figures, you can clearly see there the decrease in the percentage of 'invalid' results (when increasing cfg)

#

We can also add many qualitative examples, but I thought it would be better in the appendix...

blissful garden Jun 18, 2023, 9:39 PM

#

what's the definition of "valid answer"?

#

same as correct answer?

#

I guess I'm just confused with the precise definition of "non-valid percentage" on the plot

fallow egret Jun 18, 2023, 9:43 PM

#

blissful garden same as correct answer?

I'll add that in footnote.
The parsing is following the same protocol as in previous works (CoT, Self-consistency), where you are searching for the words "The answer is <ans>". Invalid is if you don't have this sequence of words in the answer
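A rough sketch of that validity check, assuming the marker is matched case-insensitively and the answer runs to the end of the line. The regex and the helper names are illustrative, not the actual evaluation script.

```python
import re

# Assumption: the marker phrase is matched case-insensitively and the
# answer is whatever follows it on the same line.
ANSWER_PATTERN = re.compile(r"the answer is\s*([^\n]+)", re.IGNORECASE)


def extract_answer(generation: str):
    """Return the text after 'The answer is', or None if the marker is missing."""
    match = ANSWER_PATTERN.search(generation)
    if match is None:
        return None
    answer = match.group(1).strip().rstrip(".").strip()
    return answer or None


def invalid_percentage(generations) -> float:
    """Percentage of generations with no parseable 'The answer is <ans>'."""
    if not generations:
        return 0.0
    invalid = sum(extract_answer(g) is None for g in generations)
    return 100.0 * invalid / len(generations)


if __name__ == "__main__":
    samples = [
        "There are 3 boxes with 14 apples each. The answer is 42.",
        "First we add, then we add again, then we add again, then",
    ]
    print([extract_answer(s) for s in samples])
    print(invalid_percentage(samples))
```

Under this reading, "invalid" and "incorrect" are separate: a generation can be valid (it contains the marker) and still give the wrong answer.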

fallow egret Jun 18, 2023, 9:58 PM

#

That is the meaning of 'diverge' in this setting: either the model gets stuck in an infinite reasoning chain, or it 'forgets' the correct output structure

blissful garden Jun 18, 2023, 9:59 PM

#

fallow egret I'll add that in footnote. The parsing is following the same protocol as in pre...

all these definitions and explanations belong to the main text I guess?

fallow egret Jun 18, 2023, 9:59 PM

#

blissful garden all these definitions and explanations belong to the main text I guess?

yes

versed flax Jun 19, 2023, 12:57 AM

#

not exactly readable yet, but it's cool enough to share

loud adder Jun 19, 2023, 12:59 AM

#

versed flax not exactly readable yet, but it's cool enough to share

I would actually color by CFG vs non-CFG rather than model type

versed flax Jun 19, 2023, 1:00 AM

#

loud adder I would actually color by CFG vs non-CFG rather than model type

so you'd have one color for the left end of all segments and another for the right ends?

loud adder Jun 19, 2023, 1:01 AM

#

I think so

versed flax Jun 19, 2023, 1:02 AM

#

I can do that yes, but then we would totally dismiss the model labeling

loud adder Jun 19, 2023, 1:02 AM

#

Is the model labeling important for understanding the “core meaning” of this plot?

versed flax Jun 19, 2023, 1:03 AM

#

no, but I thought it would be nice to have it if there was a nice way to carry that info

blissful garden Jun 19, 2023, 1:13 AM

#

yeah I was thinking we drop all the model labels, drop all the lines, mark "with cfg" using stars instead of dots, and if necessary we color each model family by one single color.

versed flax Jun 19, 2023, 1:17 AM

#

I just added LLaMA-7B and I'm a little concerned about its scores, much lower than smaller Pythias on some tasks

#

(besides TriviaQA which was buggy)

loud adder Jun 19, 2023, 1:30 AM

#

Pythia 2.8B outperforming LLaMA 7B is kinda sus. Where are the LLaMA scores coming from? The eval harness or the LLaMA paper?

versed flax Jun 19, 2023, 1:30 AM

#

loud adder Pythia 2.8B outperforming LLaMA 7B is kinda sus. Where are the LLaMA scores comi...

harness

#

yes, it's sus

#

it's not on all tasks though

#

actually it's only on arc_easy and sciq

loud adder Jun 19, 2023, 1:36 AM

#

We have a bunch of notes about trying to reproduce the results in their paper, lemme go see if there’s anything about those two

blissful garden Jun 19, 2023, 1:54 AM

#

this is llama 7b from harness results folder

arc_easy    0    
acc    67.38    ±    0.96
acc_norm    52.48    ±    1.02

Seems similar to the run we got.

On Pythia paper, 2.7B does go up to 0.644
So are we looking at acc for this one?

loud adder Jun 19, 2023, 2:06 AM

#

Yes, we recommend looking at acc in general

blissful garden Jun 19, 2023, 3:36 AM

#

btw this is the write out info of triviaqa. Some are not entirely fair (like "Eucalyptus leaves" vs " Eucalyptus", "The Secret Service" vs " Secret Service"). Might explain why the triviaqa we got is generally lower.

📎 triviaqa_write_out_info.json

loud adder Jun 19, 2023, 3:37 AM

#

blissful garden btw this is the write out info of triviaqa. Some are not entirely fair (like "Eu...

Welcome to LLM evaluation via exact matching 🙃

versed flax Jun 19, 2023, 11:41 AM

#

#

#

blissful garden Jun 19, 2023, 2:17 PM

#

versed flax

Could you try dropping the lines and model labels, and use either different colors or different markers for baseline and cfg? Just want to see how it looks

versed flax Jun 19, 2023, 2:18 PM

#

sure, later tonight, it's just a preview

patent gull Jun 19, 2023, 4:08 PM

#

@fallow egret I just got permission to use those 65b weights you were asking for

#

Tulu-65b

fallow egret Jun 19, 2023, 4:12 PM

#

patent gull @fallow egret I just got permission to use those 65b weights you were a...

Thanks, but we got nice results with Guanaco-65B

patent gull Jun 19, 2023, 4:13 PM

#

Ah ok so they’re no longer needed?

#

Also how were you using accelerate to run those models?

fallow egret Jun 19, 2023, 4:26 PM

#

Yes, I think there is no need for a third model.
What do you mean exactly regarding accelerate?

patent gull Jun 19, 2023, 4:49 PM

#

more just curiosity on how you're running big models on this cluster.. whether you're using deepspeed or HF accelerate

fallow egret Jun 19, 2023, 4:50 PM

#

yes, I'm using accelerate

patent gull Jun 19, 2023, 4:50 PM

#

bigger picture question for all the folks here:

#

All the existing papers that broadly fit into the family of "Contrastive Decoding".... whether it's CAD, or Lisa Li/Percy Liang's

#

I feel like they have solely focused on generation... (CAD looks at summarization and abstractive QA)

#

is it fair to claim that we're the first to show general benchmark performance increases?

#

is it also fair to say that we're the first to apply contrastive decoding techniques to machine translation?

#

Also side-note, do we even want to put ourselves in the general family of contrastive decoding techniques? I guess we're inarguably in it, right

fallow egret Jun 19, 2023, 5:12 PM

#

I think that from a story perspective it should be the opposite, we are performing CFG (which is classifier guidance). In practice this could be seen as contrastive decoding (and should be related in the paper and explain the equivalence). But the motivation is different and should come from the positive side of increasing the 'classification' p(prompt| generation). Which is a little bit different motivation from the CD
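To make the "classification" view concrete, here is a small sketch, assuming the usual CFG form rather than the exact formula in the draft. Classifier guidance reweights p(x) by p(prompt | x)^gamma; by Bayes' rule p(prompt | x) is proportional to p(x | prompt) / p(x), which gives log p_cfg(x) = log p(x) + gamma * (log p(x | prompt) - log p(x)) up to normalization. The function names and toy logits are made up for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cfg_log_probs(cond_logits, uncond_logits, gamma):
    """Next-token log-probs under CFG with guidance strength gamma.

    gamma = 1 recovers the conditional model, gamma = 0 the unconditional one.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [u + gamma * (c - u) for c, u in zip(cond, uncond)]
    return log_softmax(mixed)  # renormalize over the vocabulary


if __name__ == "__main__":
    cond = [2.0, 0.0]    # toy logits with the prompt in context
    uncond = [0.0, 0.0]  # toy logits without the prompt
    for gamma in (1.0, 1.5):
        probs = [math.exp(lp) for lp in cfg_log_probs(cond, uncond, gamma)]
        print(gamma, probs)
```

With gamma above 1, tokens that the prompt makes more likely are pushed up further, which is the "increase p(prompt | generation)" motivation in code form.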

#

Regarding machine translation, why is this task different from the rest?
I think that, again, if we are looking at the big-picture perspective, we are trying to tackle the alignment issue, i.e., following the prompt instructions better. All the other works try to solve a different task (CAD is about using context information).
The results demonstrate that the method is effective at better following the prompt instructions across various tasks (machine translation is one of them)

versed flax Jun 19, 2023, 7:25 PM

#

I find it much harder to read

#

esp. because it only talks about inference flops, while keeping the base and CFG points linked also allows estimating the training flops

#

like, in Hellaswag or PiQA we may be tempted to read "CFG changes nothing, a model twice as big has the same acc as CFG", but the former is more expensive to train

loud adder Jun 19, 2023, 7:31 PM

#

Love the Winogrande zig-zag

fallow egret Jun 19, 2023, 7:47 PM

#

I just noticed that CFG was only accepted to a workshop 🙄

#

https://openreview.net/forum?id=qw8AKxfYbI

OpenReview

Classifier-Free Diffusion Guidance

Classifier guidance without a classifier

versed flax Jun 19, 2023, 7:47 PM

#

oh yeah, NeurIPS workshop. Crazy.

blissful garden Jun 19, 2023, 8:12 PM

#

versed flax I find it much harder to read

why red and orange? Maybe some other color with bigger contrast?
Maybe we can add a regression line for each group (with cfg and without)

versed flax Jun 19, 2023, 8:15 PM

#

blissful garden why red and orange? Maybe some other color with bigger contrast? Maybe we can ad...

I can change the colors yes. Do we agree on this way of plotting the data?

blissful garden Jun 19, 2023, 8:17 PM

#

versed flax I can change the colors yes. Do we agree on this way of plotting the data?

if we want to make a point that cfg provides better TFLOPs vs acc tradeoff, I don't see another better way (maybe this is not even true)

blissful garden Jun 19, 2023, 8:28 PM

#

versed flax I find it much harder to read

looks like cfg is only definitively better on two tasks though sadge

versed flax Jun 19, 2023, 8:28 PM

#

blissful garden looks like cfg is only definitively better on two tasks though <:sadge:815362275...

in terms of inference efficiency, yes

#

in terms of training efficiency I'm guessing it's a clear win

blissful garden Jun 19, 2023, 8:29 PM

#

versed flax in terms of _training_ efficiency I'm guessing it's a clear win

yeah

#

I guess we throw this chart (plus regression lines if it's not too difficult) as another view of the data in the appendix. I feel like inference efficiency is something that many people still care about.
Also since we mentioned the lambada sota and winning on inference efficiency we can cite this plot.

versed flax Jun 19, 2023, 8:34 PM

#

blissful garden I guess we throw this chart (plus regression lines if it's not too difficult) as...

yes I'll write something about it

versed flax Jun 19, 2023, 9:29 PM

#

blissful garden btw this is the write out info of triviaqa. Some are not entirely fair (like "Eu...

I don't get it. The fix was supposed to make sure that the ground truth was in the generated output, after lowercasing and stripping trailing spaces. Most of those examples should be ok

blissful garden Jun 19, 2023, 10:10 PM

#

versed flax I don't get it. The fix was supposed to make sure that the ground truth was _in_...

oh is that how triviaqa is supposed to work?
I was just complaining on human standard it's quite unfair.

versed flax Jun 19, 2023, 10:10 PM

#

blissful garden oh is that how triviaqa is supposed to work? I was just complaining on human sta...

after the fix, yes

#

generate until "." or "\n", lowercase, remove punct, search ground truth substring
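Spelled out as code, that pipeline might look like the sketch below. It is one reading of the one-line description above, not the harness implementation, and the helper names are made up.

```python
import string

# Translation table that deletes ASCII punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def truncate(generation: str) -> str:
    """Keep only the text before the first '.' or newline."""
    cut = len(generation)
    for stop in (".", "\n"):
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, strip surrounding whitespace."""
    return text.lower().translate(_PUNCT_TABLE).strip()


def contains_truth(generation: str, truths) -> bool:
    """True if any normalized ground truth is a substring of the answer."""
    answer = normalize(truncate(generation))
    truths_norm = [normalize(t) for t in truths]
    # Skip empty ground truths: "" is a substring of everything.
    return any(bool(t) and t in answer for t in truths_norm)


if __name__ == "__main__":
    print(contains_truth(" The state of South Dakota", [" SOUTH DAKOTA"]))
    print(contains_truth(" Eucalyptus leaves", [" Eucalyptus"]))
```

Both pairs from the write-out file come out as matches under this reading, which is why a score of 0 on them points to a bug rather than to strict matching.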

blissful garden Jun 19, 2023, 10:12 PM

#

versed flax _after the fix_, yes

oh was my version not fixed? I see "Eucalyptus leaves" vs " Eucalyptus" get 0

versed flax Jun 19, 2023, 10:12 PM

#

🤷‍♂️

#

not sure what you used

blissful garden Jun 19, 2023, 10:12 PM

#

I remember before the fix it was all 0

#

mine got nonzero so it should be after the fix

versed flax Jun 19, 2023, 10:13 PM

#

there were 2 fixes: one for LLaMA tokenizer, another for TriviaQA as a whole

blissful garden Jun 19, 2023, 10:13 PM

#

~~yep, I remember without the TriviaQA fix, we will get all 0 for LLaMA~~
Oh wait, was it 0% for Lambada?

blissful garden Jun 19, 2023, 10:21 PM

#

versed flax not sure what you used

Mine was 674866a of branch cfg in our fork
https://github.com/EleutherAI/lm-evaluation-harness/commit/674866adf4ac76d028f11d89143c42f8d5d9aa58

versed flax Jun 19, 2023, 10:59 PM

#

blissful garden Mine was 674866a of branch cfg in our fork <https://github.com/EleutherAI/lm-eva...

then it's correct, you got the fix

versed flax Jun 19, 2023, 10:59 PM

#

blissful garden ~~yep, I remember without the TriviaQA fix, we will get all 0 for LLaMA~~ Oh wai...

both, IIRC

versed flax Jun 19, 2023, 11:00 PM

#

versed flax then it's correct, you got the fix

uuuh the fix is incorrect

#

return {"em": float(continuation in list_of_candidates)}
should be
return {"em": float(any((continuation in candidate) for candidate in list_of_candidates))}

#

https://github.com/EleutherAI/lm-evaluation-harness/commit/674866adf4ac76d028f11d89143c42f8d5d9aa58#diff-ddf623acf219afc629eabf222864cb6f6173db267f5972af0f3d1a795774bee3R92 here

#

@blissful garden If I fix it we'll need to rerun TriviaQA for all models, is that doable?

blissful garden Jun 19, 2023, 11:04 PM

#

versed flax @blissful garden If I fix it we'll need to rerun TriviaQA for _all_ models,...

Need to change all my scripts but it's doable.

#

Let's do it. Once you fix it I will run a llama 7b first and see if we get the right number.

versed flax Jun 19, 2023, 11:05 PM

#

ok

versed flax Jun 19, 2023, 11:08 PM

#

blissful garden Let's do it. Once you fix it I will run a llama 7b first and see if we get the r...

done

blissful garden Jun 19, 2023, 11:43 PM

#

versed flax done

llama7b got 45% on triviaqa

versed flax Jun 19, 2023, 11:45 PM

#

blissful garden llama7b got 45% on triviaqa

I can't find the previous result

blissful garden Jun 19, 2023, 11:45 PM

#

previously it was 40% on llama

#

and 3%-ish on gpt, pythia stuff

#

let me run pythia 1.4b and see if we get nontrivial numbers

versed flax Jun 19, 2023, 11:46 PM

#

ok I don't have previous "em" results on triviaqa for this model

blissful garden Jun 19, 2023, 11:48 PM

#

darn it maybe it didn't even go down this code path

#

ok it should just take less than 20min for pythia 1.4b. if we see 20-30ish I will rerun everything

versed flax Jun 19, 2023, 11:51 PM

#

We'll have very bad issues if TriviaQA is back. Our tables won't have the same number of columns, and I'll have to find a new place for the legend in the charts 😆

loud adder Jun 19, 2023, 11:58 PM

#

I have a couple numbers for TriviaQA for common models (0 = missing data). Does this line up with what you're seeing

versed flax Jun 20, 2023, 12:00 AM

#

loud adder I have a couple numbers for TriviaQA for common models (0 = missing data). Does ...

generative triviaqa or argmax accuracy with teacher forcing?

#

(ie, version 0 or 1?)

loud adder Jun 20, 2023, 12:00 AM

#

Probably generative?

#

Version 2!

versed flax Jun 20, 2023, 12:00 AM

#

wait wat

loud adder Jun 20, 2023, 12:01 AM

#

Current version is V2: https://github.com/EleutherAI/lm-evaluation-harness/blob/b281b0921b636bc36ad05c0b0b0763bd6dd43463/lm_eval/tasks/triviaqa.py#L31

#

It looks like it's V1 but with a new dataset path

versed flax Jun 20, 2023, 12:02 AM

#

https://github.com/EleutherAI/lm-evaluation-harness/blob/b281b0921b636bc36ad05c0b0b0763bd6dd43463/lm_eval/tasks/triviaqa.py#L91 this line is still wrong it seems

#

return {"em": float(continuation in list_of_candidates)}
should be
return {"em": float(any((continuation in candidate) for candidate in list_of_candidates))}

loud adder Jun 20, 2023, 12:04 AM

#

versed flax `return {"em": float(continuation in list_of_candidates)}` should be `return {"e...

If this is an error, it seems like we should be seeing extremely wrong results, not vaguely reasonable ones

versed flax Jun 20, 2023, 12:04 AM

#

It's just a lot less permissive

versed flax Jun 20, 2023, 12:05 AM

#

blissful garden btw this is the write out info of triviaqa. Some are not entirely fair (like "Eu...

and indeed, we see the error manifesting here ^

blissful garden Jun 20, 2023, 12:19 AM

#

pythia 1.4b now got 12.56% on triviaqa

|  Task  |Version|Metric|Value |   |Stderr|
|--------|------:|------|-----:|---|-----:|
|triviaqa|      1|em    |0.1256|±  |0.0031|

#

the write-out.
Some are still strange, like " The state of South Dakota" vs " SOUTH DAKOTA" getting 0.0

📎 triviaqa_write_out_info.json

versed flax Jun 20, 2023, 12:25 AM

#

blissful garden the write-out. Some are still strange, like " The state of South Dakota" vs " SO...

uh, I missed something then

blissful garden Jun 20, 2023, 12:26 AM

#

"logit_0": "",
"truth": " First Lord of the Admiralty",
Got 1.0 🥲

another one:

"logit_0": "",
"truth": " MO FARAH",
"em": "1.0"

a false negative

"logit_0": " Silkworms feed on the leaves of the mulberry tree",
"truth": " Mulberry",
"em": "0.0"

#

"logit_0": " The port of St Malo is located in the French department of the",
"truth": " France",
🤔 wonder what pythia was trying to say

versed flax Jun 20, 2023, 12:30 AM

#

...

#

I read continuation and list_of_candidates backwards

#

the continuation is not the expected one but the generated one, and the candidates are not candidates answers but candidate ground truths
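The mix-up is easy to see side by side. The sketch below is illustrative and not the code in the fork: the function names are made up, and the lowercase/strip normalization is an assumption added so that the examples from the write-out file can be compared.

```python
def _norm(text: str) -> str:
    # Assumption: compare case-insensitively and ignore surrounding spaces.
    return text.lower().strip()


def em_exact(continuation: str, candidates) -> float:
    """Strict version: the generation must equal one ground truth."""
    return float(_norm(continuation) in [_norm(c) for c in candidates])


def em_backwards(continuation: str, candidates) -> float:
    """Wrong direction: is the generation a substring of a ground truth?"""
    gen = _norm(continuation)
    return float(any(gen in _norm(c) for c in candidates))


def em_intended(continuation: str, candidates) -> float:
    """Intended direction: is a ground truth a substring of the generation?"""
    gen = _norm(continuation)
    return float(any(_norm(c) in gen for c in candidates))


if __name__ == "__main__":
    silk = " Silkworms feed on the leaves of the mulberry tree"
    # An empty generation is a substring of everything: spurious 1.0.
    print(em_backwards("", [" MO FARAH"]), em_intended("", [" MO FARAH"]))
    # A long correct generation is never a substring of the short truth.
    print(em_backwards(silk, [" Mulberry"]), em_intended(silk, [" Mulberry"]))
```

The backwards check reproduces both symptoms seen earlier: empty generations scoring 1.0 and long correct generations scoring 0.0.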

#

fixing

#

you can pull and run it again

#

sorry about that

#

@blissful garden (in case you didn't see the answers)

blissful garden Jun 20, 2023, 2:20 AM

#

|  Task  |Version|Metric|Value |   |Stderr|
|--------|------:|------|-----:|---|-----:|
|triviaqa|      1|em    |0.1594|±  |0.0034|

Manually checked the first 100. I don't see many problems in the write-out file this time 🎉 . There are a couple that I still have doubts about, but they shouldn't be a big deal.

📎 triviaqa_write_out_info.json

patent gull Jun 20, 2023, 2:56 AM

#

I finished writing up the main gist of my content in Section 4. I need to do another pass on it tomorrow to express the ideas there better and link to the Appendix.

I see that there's a lot of editing that's been done on the doc.
I'm wondering if we want to relax our deadline by a little bit? I think this paper needs some careful digesting/reading..

also there is basically no text in the Appendix, and for the amount of tables/info/etc there, I think we need something...

#

the closer we get to EMNLP the more I'll be unavailable to do that, personally

fallow egret Jun 20, 2023, 5:21 AM

#

In my opinion this is not close to being ready. It's basically many, many results, and there is a need for many iterations to make it a paper.

I tried to edit the intro and focus on the narrative: what exactly we want to tell (it's only a first draft, still need to iterate).
Let me know what you think but the main message is:

1. We are trying to tackle the alignment issue - specifically, better following the prompt instructions (which is a critical issue).
2. We are inspired by classifier guidance in DM (not contrastive decoding or anything related to LLM).
3. Negative prompting is part of the CFG mechanism investigation! It's not part of the main story, but an interesting insight from the investigation that should be explored further in future works.

#

If we are all aligned on a coherent story, then we can edit the full paper to talk in a coherent language, rim the same message and support the narrative

patent gull Jun 20, 2023, 5:39 AM

#

i agree with you on it not being close to ready. Can probably get there soon with some passes from whomever has bandwidth.

#1 Yes
#2 Sure, I don't see the harm in saying where we were inspired by DM. I'm reading your response to my earlier question today more deeply (apologies, I was multitasking when you first sent it). I think I agree with you about the framing, but I'm on the fence, Section 4 notwithstanding, about whether we were actually able to show "alignment" rather than just "performance increases". I don't know that we designed a lot of our experiments to explicitly address alignment.
#3 This seems like a good way to work this in, to me

fallow egret Jun 20, 2023, 6:02 AM

#

Regarding 2, I think that we want to talk about 'alignment' in the more general setting (of the original paper by OpenAI) of following the human intent and instructions. The performance increase is not a result of enhancing the model's knowledge graph; it is due to the model better following the task prompt instructions. This should be stressed much more in the experiment section.

#

P.S,
Regarding contrastive decoding. I think it is important to include it in the related work, the narrative is that this is in practice a 'classifier guidance' approach (where the classifier is either external model [Lisa], or a classifier which is defined using a task prompt engineering-CAD) .
In contrast, our approach is truly classifier-free guidance.

So our differentiation comparing to competitors:

Tackling the alignment issue, which is much bigger, therefore our huge experiment section.
Use a real classifier-free guidance (this was not done before)
Deep investigation of the CFG mechanism

#

@loud adder @blissful garden @versed flax
Please also share your thoughts (submission timeline/narrative), I want us all to be aligned before editing the rest of the paper...

blissful garden Jun 20, 2023, 7:22 AM

#

fallow egret In my opinion this is not close to be ready. It basically many many results, and...

1. Yeah of course. I remember a while back we said that the narrative is like, we have a problem (prompt following), here is a method, and we solve it. From what I see in benchmarks and codegen, a more precise hypothesis may include that in the tradeoff between prompt adherence and variability, CFG is a useful parameter that leans towards the former at the small cost of latter. But of course we seem to have more to offer (esp. human evaluation and Sec 4 interpretability)
2. Yep. If our story starts with CFG in CV, mentioning classifier guidance is certainly natural.
3. I don't quite understand this. It feels more like "how to use CFG" investigation.

#

classifier guidance vs cfg feels like "having to train another classification model" vs "only using the model itself"

fallow egret Jun 20, 2023, 7:27 AM

#

blissful garden 1. Yeah of course. I remember a while back we said that the narrative is like, w...

Regarding 3, the point is that we focus on what exactly the effect of CFG on changing the model policy is (this is the main point in section 4), and then saying: well, you can actually push it further and use negative prompting to better shape the policy

fallow egret Jun 20, 2023, 7:30 AM

#

blissful garden classifier guidance vs cfg feels like "having to train another classification mo...

Not only actually training, prompting manipulation can be seen also as fine-tuning. So CAD in this sense is not classifier-free, which is highly important for differentiation and novelty (framing all previous methods, including CAD, as classifier guidance methods)

blissful garden Jun 20, 2023, 7:31 AM

#

fallow egret Regarding 3, the point is that we focus on what exactly the effect of CFG on cha...

oh I was just saying "negative prompting" seems like an application

blissful garden Jun 20, 2023, 7:32 AM

#

fallow egret Not only actually training, prompting manipulation can be seen also as fine-tuni...

aren't we already different enough from the other papers' narratives?

blissful garden Jun 20, 2023, 7:36 AM

#

fallow egret Not only actually training, prompting manipulation can be seen also as fine-tuni...

I guess at the end people care about the practical advantages (lower training cost, etc.). So I don't know if we need to argue who is "truly classifier free" or not.

fallow egret Jun 20, 2023, 7:36 AM

#

blissful garden oh I was just saying "negative prompting" seems like an application

Well, I agree it's an application. But you can frame it as an application that supports the understanding of the steering of the model policy

blissful garden Jun 20, 2023, 7:39 AM

#

fallow egret Well, I agree it's an application. But you can frame it as an application that s...

Yeah sure. Understanding the mechanics always has the goal of better using the tool. I'm really happy that Alex wrote the Section 4.

fallow egret Jun 20, 2023, 7:48 AM

#

blissful garden I guess at the end people care about the practical advantages (lower training co...

If it's a research paper that is submitted to tier-1 conference, then reviewers and readers will be interested in the story and in what the novelty is compared to previous work. Almost no one will have the time to actually look at all the experiments. They will understand the story and the point, and if it sounds like a reasonable and good story they will look at a few examples and test this approach in their setting.
If I'm looking at the current version, what pops up is the formula (which I'm already familiar with from a paper I read one month ago, CAD), and a huge batch of unrelated experiments where it is hard to understand what exactly their purpose is and what extra knowledge they add compared to previous works

blissful garden Jun 20, 2023, 7:53 AM

#

fallow egret If it's a research paper that is submitted to tier-1 conference, then reviewers ...

"Almost no one will have the time to actually look at all the experiments."
oh wow. I mean I don't know about ML culture so this sounds surprising.
"a huge batch of unrelated experiments that is hard to understand what exactly their purpose"
Absolutely. My understanding is that we are still at the stage of "throwing all the results out there" while attempting to fill in paragraphs async so that they can be used later.

unique sedge Jun 20, 2023, 7:54 AM

#

I think the CAD paper has muddied the idea of what this paper should be too much. Parallel works which have their own take at solving a problem can independently arrive at the same solution; as long as some distinction is made it's okay. Even if the distinction isn't made, the motivation is often quite different.

#

The experiments are a way of proving generality of application of the method, they arent disjoint, they are wide ranging. You usually test the method on the general benchmark, then a few generation tasks and then any specific task you think it helps more

unique sedge Jun 20, 2023, 7:56 AM

#

unique sedge I think the CAD paper has muddied the idea of what this paper should be too much...

The CAD paper entirely looks at their method as anti-hallucinatory solution and proves it through summarization and the hallucination evaluation metric

blissful garden Jun 20, 2023, 7:57 AM

#

unique sedge The CAD paper entirely looks at their method as anti-hallucinatory solution and ...

yep this is why I feel that we are different enough in quite a lot of ways (except the formula unfortunately, but formula is formula, I'd admit that we coincide)

unique sedge Jun 20, 2023, 7:57 AM

#

@/vermifuge has maintained from the start that he’d like to prove the generality of application of his method and hence first showed a sample after which various tasks were proposed and also some human evaluation was done

#

Taking a coherent narrative is of course not a bad idea, but having seen the CAD paper mentioned again and again, it feels like there's some active effort to establish difference. But the difference already exists, since we are looking at it in a more general manner while they only considered hallucinatory effects and don't consider complex prompting methods

fallow egret Jun 20, 2023, 8:10 AM

#

I think that there should be an active effort to stress the differences and sharpen the story to emphasize that it's not only testing CAD on more tasks, but that this is a different approach.
There is very low probability a reviewer will not ask about that and will want to see a real novelty, you don't want to start thinking about it in the rebuttal. It is important to stress the differences from the beginning and shape correctly the story

blissful garden Jun 20, 2023, 8:16 AM

#

fallow egret I think that there should be an active effort to stress the differences and shar...

yeah I feel that it's really about the fact that the paper is still in the early stage of writing, and it is true. There is no denying that we need to rewrite under a coherent narrative (shouldn't all papers be like this?). I'm probably more optimistic thinking we will get there. I just hope that we don't "argue too hard" for the sake of finding novelties.
If a reviewer comes out and says "hey you guys use the same formula as CAD", I will say "yes of course."

versed flax Jun 20, 2023, 8:19 AM

#

I'm waking up and I'm still half asleep.
I agree with pretty much everything that was said: thank you Elad, you echo Stella's comments and you are very right. Thank you paws for bringing back the general scope of CFG.

I'm personally okay if we relax the time frame but with each additional day we need to address CAD more and more. Also I'm very very late with my work and I can't be as active as I have been anymore

fallow egret Jun 20, 2023, 8:20 AM

#

blissful garden yeah I feel that it's really about the fact that the paper is still in the early...

I completely agree. The point is not saying 'hi it's novel', the point is to rephrase the story in a way that arrives at the same formula in the end from a different direction (from classifier guidance, which was the real initial motivation of @versed flax ). In this case, as a reviewer I don't care that you got to the same formula in the end; there is a different motivation here.

blissful garden Jun 20, 2023, 8:21 AM

#

fallow egret I completely agree. The point is not saying 'hi it's novel', the point is to rep...

coming from CFG in CV has been the original plan right?

fallow egret Jun 20, 2023, 8:26 AM

#

blissful garden coming from CFG in CV has been the original plan right?

This is exactly the point and should be stressed

blissful garden Jun 20, 2023, 8:27 AM

#

fallow egret This is exactly the point and should be stressed

Yeah totally

#

My doubt was only about arguing "hey you are not classifier free but we truly are".
What I understood is that we similarly view decoding as a Markov process and take inspiration from the CFG formula, even though we don't share a lot of their assumptions. We are not an exact analog to begin with.

fallow egret Jun 20, 2023, 8:42 AM

#

It's a paper, not a rebuttal. Positioning CAD as classifier guidance is something that should simply be stated as an obvious observation: just mention it (without starting a discussion) as part of the external-classifier line of work (where the classifier in this case is given by the prompt engineering).

#

By the way, it's not true that the formula is the same, it's similar but not the same...

patent gull Jun 20, 2023, 2:29 PM

#

My only doubts about very directly positioning CFG as coming from CV is that we would be implying and showing that it’s having the same effect/mechanisms etc. And idk entirely if that’s true

#

Some of the negative prompting stuff hasn’t been as easy in our setting as it seems to be in CV

#

There are certainly differences with applying CFG in an autoregressive setting vs a diffusion setting, and I worry about giving the readers the impression that “whatever works for CFG for CV will automatically work for CFG for NLP”

versed flax Jun 20, 2023, 2:32 PM

#

patent gull Some of the negative prompting stuff hasn’t been as easy in our setting as it se...

I think it's fair, negative prompting in CV was a later addition too

patent gull Jun 20, 2023, 2:33 PM

#

Maybe there’s a way to position CFG as coming from CV but also call out that there are expected differences bc of the different modeling paradigms

versed flax Jun 20, 2023, 2:33 PM

#

I wouldn't say that neg prompting with LMs doesn't work, I think we just haven't found the correct way to use it

patent gull Jun 20, 2023, 2:34 PM

#

Totally, im not implying that either. I’m just saying the fact it didn’t work out-of-the-box just means that the domains are obvs not 100% the same. Maybe I’m making a dumb point, idk

versed flax Jun 20, 2023, 2:35 PM

#

Or... maybe it does, but we haven't found the right experiments to show it

patent gull Jun 20, 2023, 2:35 PM

#

True, too

fallow egret Jun 20, 2023, 3:22 PM

#

patent gull Totally, im not implying that either. I’m just saying the fact it didn’t work ou...

In vision too, it's not an 'out-of-the-box' formula. In many cases it doesn't work great and requires many iterations and prompt engineering

#

In any case, the vision is just inspiration, the mechanism is totally different first of all because you are not applying CFG during training

versed flax Jun 20, 2023, 3:26 PM

#

fallow egret In any case, the vision is just inspiration, the mechanism is totally different ...

there's no CFG in training in vision either :)

#

unless you mean "conditioning dropout"

#

which is the default with LMs

fallow egret Jun 20, 2023, 3:30 PM

#

versed flax unless you mean "conditioning dropout"

Yes, this is what I meant. You also train the unconditional part jointly. This is of course unnecessary for LLMs, which is important

patent gull Jun 20, 2023, 5:13 PM

#

fair enough. these are good points!!

blissful garden Jun 22, 2023, 2:14 AM

#

codegen 6B results are also out. Quite consistent. I wish 350 and 2b were this nice (esp. pass@100).

fallow egret Jun 22, 2023, 5:20 PM

#

Hi, when you write comments in the text, please add this command at the beginning of the file:
\newcommand\name[1]{\textcolor{color}{[NAME: #1]}}
(change it to your name and choose a color).
Then you can use, for example, \name{comment} inside the paper. This way it's going to be easier to track the comments (compared to the current generic red comments)

versed flax Jun 22, 2023, 5:23 PM

#

That's a great idea

fallow egret Jun 22, 2023, 7:09 PM

#

Ok, I think I finished the first iteration on section 2. There is a small inaccuracy in the last subsection, since there is a missing normalization factor (in all the equations) needed to make it a probability function. But it's really ugly to drag it everywhere. If anyone has an idea how to make it more accurate, feel free to change it.
In any case, I think that now the theoretical connection to the original vision CFG work is clear. And the derivation is nice (and not completely trivial, as one of you asked about it in the remark); it also adds more weight to the novelty claim

blissful garden Jun 22, 2023, 7:48 PM

#

fallow egret Ok, I think I finished the first iteration on section 2, There is some small ina...

the math looks great. Left a couple more remarks (might seem nitpicking but I always worry about rigorousness)

fallow egret Jun 22, 2023, 7:50 PM

#

blissful garden the math looks great. Left a couple more remarks (might seem nitpicking but I al...

Great! these 'nitpicking' are very important in this section

blissful garden Jun 22, 2023, 7:50 PM

#

also, what about we order equations in the latex way (\begin{equation})

versed flax Jun 22, 2023, 7:50 PM

#

I'll be home later tonight to read that. I'm excited!

fallow egret Jun 22, 2023, 7:56 PM

#

blissful garden also, what about we order equations in the latex way (\begin{equation})

yes, indeed it should be changed to the conventional format

blissful garden Jun 22, 2023, 7:58 PM

#

I can try to revise a little bit later. Boarding a flight right now. Overall it looks nice and this is def where this section should be going

fallow egret Jun 22, 2023, 8:01 PM

#

blissful garden I can try to revise a little bit later. Boarding a flight right now. Overall it ...

sure, I will appreciate any help

versed flax Jun 22, 2023, 11:44 PM

#

I'm reading the current state of the paper, and it reminds me that I should absolutely ask in case anyone has a good idea:

We don't have any conclusive experiment for a compelling negative prompting situation.

What I mean by "compelling" is what you would see with a txt2img model: your positive prompt is "a man in 1920", but the images all have mustaches, so your negative prompt is "mustache", which works by targeting a specific unwanted aspect of the generation.

What we've tried, and which is not really satisfactory, is:

1. Positive prompt is specific ("Here is a horror story:") and negative prompt is more general or neutral ("Here is a story:") => this works and has been proven in the paper.
2. Positive prompt and negative prompt are opposed ("The nice man said:" vs "The angry man said:") => I have mixed feelings about the results.

And those examples aren't exactly as granular and useful as the "mustache" example.

If anyone manages to get some hint of a result, it would be great
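
For readers following along, the positive/negative prompt setup described above can be sketched at the logit level. This is only a minimal illustration, not the paper's implementation: it assumes the usual CFG combination `neg + gamma * (pos - neg)` in log-probability space, with `gamma = 1` meaning no guidance (matching the cfg values reported earlier in the thread). The function names and the toy logits are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def cfg_next_token_logprobs(pos_logits, neg_logits, gamma):
    """Combine next-token logits from a positive and a negative prompt.

    gamma = 1 recovers the positive-prompt distribution, gamma = 0 the
    negative-prompt one, and gamma > 1 pushes away from the negative prompt.
    """
    pos = log_softmax(pos_logits)
    neg = log_softmax(neg_logits)
    mixed = neg + gamma * (pos - neg)
    # Renormalize so the result is again a distribution over the vocabulary.
    return log_softmax(mixed)


# Toy 3-token vocabulary (made-up numbers, for illustration only).
pos_logits = [2.0, 1.0, 0.0]   # e.g. logits after "Here is a horror story:"
neg_logits = [0.0, 1.0, 2.0]   # e.g. logits after "Here is a story:"
guided = cfg_next_token_logprobs(pos_logits, neg_logits, gamma=1.5)
```

In an actual run, `pos_logits` and `neg_logits` would come from two forward passes of the same LM at each decoding step: one on the positive prompt plus the tokens generated so far, one on the negative prompt plus those same tokens.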

#

One difficulty I see is that your negative prompt gets completed by something that is precisely not a sensible continuation of it. If your negative prompt is "He was so mean! He said:" then of course the continuation will be something very nice, and I have the intuition that it would make the LM quite confused and undermine the whole strategy, but I can't prove it.

That being said, this confusion does not arise in diffusion models and I'm not sure why. As you go through the multiple steps of sampling, the image looks less and less like the negative prompt the image is supposed to be conditioned on. But it works. And I'm not sure why.

versed flax Jun 23, 2023, 12:19 AM

#

@fallow egret Thank you for that section 2!!! I added some comments as well (some as overleaf comments, like... in the "comments" side bar lol)

blissful garden Jun 23, 2023, 4:34 AM

#

(sorry... having a lot of trouble with the tex bot 😂)

I just rephrased and augmented section 2.1 to make it an intro of this theory section.
For 2.2 there are still a couple questions though:

we need to decide on a notation and make it consistent throughout the paper. I vote for @fallow egret's notation but is there another important paper that writes $\hat p$? (I see Ho & Salimans write $\tilde p$)
I think Ho & Salimans intentionally write $p_\theta(z|c)\cdot p_\theta(c|z)^r$ for the same $\theta$ because the classification model $p_\theta(c|z)$ was explicitly mentioned a couple more times when introduced. I'm quite confused too but I fear that there is something we missed.

vital pondBOT Jun 23, 2023, 4:34 AM

#

Honglu

fallow egret Jun 23, 2023, 4:49 AM

#

blissful garden (sorry... having a lot of trouble with the tex bot 😂) I just rephrased and aug...

It should be the same when it is the same model (in the CFG case), but in the non-free case it should be different (also in the original paper of the classifier guidance it was different)
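
For reference, the distinction can be written out as a rough sketch. Normalization constants are omitted, and $\phi$ is used here only as an assumed name for the parameters of a separately trained classifier; $r$ is the guidance exponent from the expression quoted above.

```latex
% Classifier guidance: the classifier p_\phi is a separate model.
\tilde{p}(z \mid c) \;\propto\; p_\theta(z \mid c)\, p_\phi(c \mid z)^{r}

% Classifier-free guidance: the "classifier" is implied by the same
% generative model through Bayes' rule, so \phi = \theta.
p_\theta(c \mid z) \;\propto\; \frac{p_\theta(z \mid c)}{p_\theta(z)}
\quad\Longrightarrow\quad
\tilde{p}(z \mid c) \;\propto\; p_\theta(z \mid c)\, p_\theta(c \mid z)^{r}
\;\propto\; \frac{p_\theta(z \mid c)^{1+r}}{p_\theta(z)^{r}}
```

Taking logs gives $(1+r)\log p_\theta(z \mid c) - r \log p_\theta(z)$, i.e. a guidance scale of $1+r$ applied to the difference between the conditional and unconditional log-probabilities.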

fallow egret Jun 23, 2023, 5:50 AM

#

Ok, I'm now going to edit sections 3-4.
Overall both of them should take no more than 4 pages, so I'm going to move many of the results to the appendix section

blissful garden Jun 23, 2023, 6:03 AM

#

fallow egret Ok, I'm now going to edit sections 3-4. Overall both of them should take no more...

yeah we should be able to refactor it easily depending on the requirement. I should probably move some of my tables to appendix too.

blissful garden Jun 23, 2023, 6:19 AM

#

fallow egret It should be the same when it is the same model (in the CFG case), but in the no...

Yeah that was my thought too.
If it's a mistake that is repeated 3 times, it's just too obvious, and I usually become extra careful in this case. But I tried to re-read some stuff and still agree with you so far.

blissful garden Jun 23, 2023, 1:58 PM

#

@fallow egret where did section 3 go?

#

I see you have a folder of "old_sections" now. It would be better if you tell us next time when you make big moves like that.

#

@fallow egret so what's your plan with this "old_section" folder? We should probably hear you out and make sure everyone is on board.

fallow egret Jun 23, 2023, 2:34 PM

#

blissful garden @fallow egret so what's your plan with this "old_section" folder? We sh...

Yes, sorry I will update next time. In any case I didn't erase anything, it's there exactly as was previously

#

It's simply hard to edit it as is; it will be much easier to take it piece by piece

versed flax Jun 23, 2023, 2:37 PM

#

Be mindful of other people's work. You're not solo on this.

#

And communicate your intent

#

Why do you think Sec 3 needs a whole rewrite?

fallow egret Jun 23, 2023, 2:43 PM

#

Just to say, to put back the old section:
You can just use the command in the text
\input{old_sections/experiments}

loud adder Jun 23, 2023, 2:44 PM

#

Who is the owner of the overleaf? I have a premium account, and if you transfer ownership to me we can turn track changes on

versed flax Jun 23, 2023, 2:45 PM

#

I am

fallow egret Jun 23, 2023, 2:46 PM

#

versed flax Why do you think Sec 3 needs a whole rewrite?

It's currently a mess of unrelated experiments (with respect to how the section is written), with completely different writing styles and lengths that don't connect into a story. Also the section is too big; most of the stuff should move to the appendix

versed flax Jun 23, 2023, 2:52 PM

#

fallow egret It's currently a mess of unrelated experiments (with respect to how the section...

1) Okay. I don't think "a hot pile of mess" is fair. It needs improvements. Sure.
2) I don't think writing styles and lengths are that much of an issue, and if they are, Stella or Alex, as native English speakers, might be the ones to trust on that. Besides, rewriting someone's work without consultation is dismissive of their work, and doesn't allow the person(s) the opportunity to improve. This is something that should be done either with full rewrite consent, or via comments / dialogs. I suggested you reach out to Alex so that you guys could sync, and you did not; why?
3) Yes, the sections need better flow and to integrate better into the narrative. What are your suggestions wrt that?

fallow egret Jun 23, 2023, 2:57 PM

#

versed flax 1) Okay. I don't think "a hot pile of mess" is fair. It needs improvements. Sure...

It's not about the style, it's about the coherence of the story. In any case there is a strict 8-page limit for NeurIPS; the current work is more than 12 pages with no conclusion section and a limited related work section, so it's simply an automatic rejection.
I put it back, it's currently exactly the same as before.

versed flax Jun 23, 2023, 3:00 PM

#

I do agree with your points. There has been ongoing work on moving stuff to the Appendix, indeed, and on reflowing the section. We're on the same page. Can we do that as a team?

fallow egret Jun 23, 2023, 3:10 PM

#

versed flax I do agree with your points. There has an ongoing work of moving stuff to the Ap...

Sure, I prefer not to do all this work on my own. I thought you and Alex were short on time and not actively working on it

versed flax Jun 23, 2023, 3:13 PM

#

fallow egret Sure, I prefer not to do all this work on my own. I thought you and Alex are lim...

We are. But it's much easier and faster to help as a team and get involved in discussions rather than pulling it off alone :)

blissful garden Jun 23, 2023, 4:21 PM

#

fallow egret It's not about the style, it's about the coherent of the story, in any case ther...

NeurIPS deadline has passed right?

#

I thought our plan is to put it on arxiv first, and decide where to submit later?

fallow egret Jun 23, 2023, 4:23 PM

#

blissful garden NeurIPS deadline has passed right?

Yes, indeed. It's not a good time with respect to conferences.
In any case in all the major conferences there is the strict 8 pages limit, and I think we should already upload it to arxiv according to this structure

versed flax Jun 23, 2023, 4:25 PM

#

The points are made and the experiments are here. We should aim for an ArXiv release next week, if not this weekend. The text (structure, sentences) should be refined for a conference, but I don't think it's doing too much harm to release it close to its current state

#

The paper won't change much

#

the sentences will, but not the insights, not the results, not the experiments

blissful garden Jun 23, 2023, 4:28 PM

#

I recently wrote a very long paper (20 pages lol) and submitted to somewhere with page limit. I personally find it rather easy to refactor it for any page limit AFTER the writing is already smooth and coherent. I propose we focus on the introductory texts, the gluing paragraphs, notation consistencies, grammar problems in section 3 first rather than worrying about the page limit?

#

and I think we are doing fine so far, since previously not all experiments have come out. Now we probably just got everything for benchmarks.

fallow egret Jun 23, 2023, 4:30 PM

#

versed flax The points are made and the experiments are here. We should aim for an ArXiv rel...

I agree with @loud adder that there is no second chance for a first impression. In my opinion the current version is not ready, and it will take time to iterate and make it look like a good paper.
But it's up to you as the first author...

versed flax Jun 23, 2023, 4:33 PM

#

fallow egret I agree with @loud adder that there is no second time for first impre...

Yes I agree. First impression must be good. And IMHO it is. The experiments are here, the results are here, the insights are here. We're not writing poetry. If the sentences can be improved, it's no big deal. The scientific aspects are there already

blissful garden Jun 23, 2023, 4:35 PM

#

IMO what's missing for section 3 is to have one qualified native speaker streamline the intro texts and middle paragraphs. Meanwhile we fix things as well as we can, but I don't see the necessity of massive changes (for the sake of the text itself, not considering the page limit).

fallow egret Jun 23, 2023, 4:37 PM

#

blissful garden I recently wrote a very long paper (20 pages lol) and submitted to somewhere wit...

IMO the page restriction is not only for the conference; people have no time to go over 10 pages of experiments. But I completely agree that it's significantly less important compared to the other points you mentioned

blissful garden Jun 23, 2023, 4:39 PM

#

or is there a better idea to restructure the section? I'm def not saying we shouldn't do it but as long as we discuss about it openly here and make sure everyone is on board, it will be good.

#

I remember @versed flax used to mention that general benchmarks + codegen benchmarks go together, and machine translation + your CoT go together. Maybe there is some way to combine some stuff along this line

fallow egret Jun 23, 2023, 4:41 PM

#

My thoughts on section 3:

1. Add an overview which echoes the intro -> The purpose is to show that CFG improves alignment and follows the provided instructions better. We demonstrate it on a vast and diverse set of tasks, models and metrics.
2. The subsections are split according to different families of tasks.
3. Each subsection should start with a paragraph explaining what the task is and why we chose it; in other words, in what aspect it’s challenging to follow the instructions. Then one main benchmark, a small figure/table, and a conclusion + reference to the appendix for more details. All the rest of the experiments/details should be in the appendix.
4. Negative prompting should go to the next section, as an exploration of a direction that should be investigated further (as we already discussed).
5. ‘General natural language benchmarks’ should be split according to the exact tasks. I think most of these benchmarks could be categorized as reasoning tasks; in this case we could merge CoT with this subsection as a natural extension.

blissful garden Jun 23, 2023, 4:47 PM

#

fallow egret My thoughts on section 3:  1. Adding overview which echo the intro-> The purpose...

I thought about splitting general benchmarks according to tasks, but that might mean we split tables and results. What benefit do we get by doing this?

fallow egret Jun 23, 2023, 4:49 PM

#

blissful garden I thought about splitting general benchmarks according to tasks, but that might ...

I think almost all the benchmarks fall in the category of reasoning. It might be that we don't need to write all the results. If there is a benchmark which doesn't contribute to the story then what is the point in showing this result?
In any case, I think we could categorize them into 1-2 families of tasks

blissful garden Jun 23, 2023, 4:54 PM

#

fallow egret I think almost all the benchmarks fall in the category of reasoning. It might be...

> If there is a benchmark which doesn't contribute to the story then what is the point in showing this result?
So we hide it?
> I think we could categorize them into 1-2 families of tasks
I thought we had more? Common sense reasoning, closed-book QA, code generation, machine translation

fallow egret Jun 23, 2023, 5:01 PM

#

blissful garden > If there is a benchmark which doesn't contribute to the story then what the po...

I meant that only the subsection 'General natural language benchmarks' can be split into reasoning and maybe one more family. All the rest of the subsections (except continuation, which still seems to be missing) are well-defined specific tasks, which is good.

It's not hiding; we don't need to provide all the results in eval-harness...
Our story is not 'we are building the best model, look how it beats all the rest on every benchmark'. The story is that we propose a method to tackle an issue in LLMs. Therefore, every benchmark tested should be related to this issue

blissful garden Jun 23, 2023, 5:05 PM

#

fallow egret I meant only the subsection of 'General natural language benchmarks' can be spl...

I think all the benchmarks are related (make the generation/prediction on-topic, thus improving metrics like acc). We ran those, and some of the results are bad. I would still report bad results and aim to suggest a theory/modify the narrative rather than hide it.

versed flax Jun 23, 2023, 5:05 PM

#

blissful garden I think all the benchmarks are related (make the generation/prediction on-topic,...

Yes. Hiding the bad results is dishonest

blissful garden Jun 23, 2023, 5:06 PM

#

(and IMO, bad results on small amount of things make it more believable)

loud adder Jun 23, 2023, 5:07 PM

#

I’m at a conference right now but can find time to read and edit the paper at some point today or in a couple days

fallow egret Jun 23, 2023, 5:11 PM

#

blissful garden I think all the benchmarks are related (make the generation/prediction on-topic,...

I think you misunderstood me. I'm not suggesting hiding bad results, I'm suggesting hiding unrelated/unclassified task results (that might be good or might be bad).
The point is that 'a general benchmark' subsection is very confusing and not informative to the reader

blissful garden Jun 23, 2023, 5:12 PM

#

fallow egret I think you understand me wrong. I'm not suggesting hiding bad results, I'm sugg...

Oh I see. But which task is unrelated?

fallow egret Jun 23, 2023, 5:14 PM

#

blissful garden Oh I see. But which task is unrelated?

I don't know, we should carefully go over the big list of the benchmark of this subsection:
Hellaswag, WinoGrande, Lambada(OpenAI), BoolQ, PIQA, SciQ, ARC-Easy,
ARC-Challenge, TriviaQA

As I said, I think most of them can be categorized as reasoning tasks (which is very good and directly aligned with the story). If there are some leftovers that don't fit this category, let's think about what to do with them...

versed flax Jun 23, 2023, 5:16 PM

#

I'm at my wife's dance performance tonight. I can't elaborate or read much right now.

fallow egret Jun 23, 2023, 5:17 PM

#

versed flax I'm at my wife's dance performance tonight. I can't elaborate or read much righ...

This is fine, we will not change anything without everyone's consent 🙂

versed flax Jun 23, 2023, 5:19 PM

#

I just wish I could contribute more to the current conversation

blissful garden Jun 23, 2023, 5:23 PM

#

fallow egret I don't know, we should carefully go over the big list of the benchmark of this ...

hellaswag, winogrande, boolq, piqa, arc-e, arc-c, are common sense reasoning for sure (looking at llama paper)
triviaqa and sciq feel like closed-book QA with facts. Not sure if there is reasoning stuff in sciq.
lambada - just completion? ~~In PaLM paper it was its own category.~~ oh wait I'm actually confused....

#

(and sadly lambada is something we definitely want to include because it's sooooo good)

#

If we could hide results winogrande is probably the only one we wanted to hide 😂
It's kind of interesting though. winogrande might be the only one that asks to fill a word in the middle of a sentence.

fallow egret Jun 23, 2023, 5:38 PM

#

blissful garden hellaswag, winogrande, boolq, piqa, arc-e, arc-c, are common sense reasoning for...

This is good, so let's split these tasks:

1. Reasoning (most of the tasks) + we can merge CoT with it.
2. LAMBADA (let's try to categorize it more precisely)
3. Closed-book tasks. Do we really need it? We should think about whether it's really needed...

This way it also highlights one of the most important results (LAMBADA), which is currently a little bit hidden in this subsection

blissful garden Jun 23, 2023, 5:40 PM

#

fallow egret This is good, so let's split this tasks: 1. Reasoning (most of the tasks) + we c...

sciq and lambada are probably the highlights since the improvement is so much bigger than on the others. Since they are 2/3 maybe the hypothesis of focusing on reasoning is a little bit too biased?

fallow egret Jun 23, 2023, 5:43 PM

#

blissful garden sciq and lambada are probably the highlights since the improve is so much better...

I'm not saying we should focus on them; it's actually the opposite. This whole big list of reasoning tasks will get one subsection, and the others will get their own subsections (every subsection has approximately the same length), so this split will actually highlight these results

versed flax Jun 23, 2023, 5:44 PM

#

If we could understand what makes Lambada, TriviaQA and WinoGrande so special it would be cool

blissful garden Jun 23, 2023, 5:44 PM

#

fallow egret I'm not saying we should focus on them, it is actually the opposite all this big...

great! So we don't hide anything and nothing is unrelated?

versed flax Jun 23, 2023, 5:44 PM

#

Perhaps they are a particular kind of task that would actually drive the categorization

fallow egret Jun 23, 2023, 5:45 PM

#

blissful garden great! So we don't hide anything and nothing is unrelated?

Yes, as you said we were able to categorize this big list of benchmarks into 3 families of tasks, which is good 🙂

blissful garden Jun 23, 2023, 5:45 PM

#

versed flax Perhaps they are a particular kind of task that would actually drive the categor...

yeah the only thing I see so far is just that sciq and lambada are not about reasoning, and winogrande asks for a specific kind of prompt that questions the middle of a sentence

versed flax Jun 23, 2023, 5:46 PM

#

https://www.mosaicml.com/blog/mpt-7b tasks described at the end

Introducing MPT-7B: A New Standard for Open-Source, Commercially Us...

Introducing MPT-7B, the first entry in our MosaicML Foundation Series. MPT-7B is a transformer trained from scratch on 1T tokens of text and code. It is open source, available for commercial use, and matches the quality of LLaMA-7B. MPT-7B was trained on the MosaicML platform in 9.5 days with zero human intervention at a cost of ~$200k.

#

So what I see is that Lambada is pure next token prediction, so, the exact training objective, and WinoGrande has a specific form

blissful garden Jun 23, 2023, 5:49 PM

#

It's actually fascinating: if we change the format of winogrande by replacing "_" with something else and rephrase the way it is questioned a little bit, would it change the situation?
@loud adder I vaguely remember you mentioned something in general about perturbing benchmarks right? (ignore this if you are in the middle of something. I don't want to disturb you during the conference)

versed flax Jun 23, 2023, 5:49 PM

#

TriviaQA is free form response whereas Arc is MCQ

versed flax Jun 24, 2023, 2:10 AM

#

Here's the plan:

1) We're restoring Sec 3.
2) We're finishing Sec 2. @fallow egret you made significant progress on this, I'm not sure breaking and refactoring Sec 3 before Sec 2 is done, and done well, is wise. Otherwise we'll have a paper broken in Sec 2 and 3.
3) I'll take some time during this weekend to rework the minimal amount of tiny bits of Sec 3 that need to be reworked in order to have a functioning paper (help is welcome, but priority is on Sec 2 which is currently broken)
4) We're aiming for an ArXiv release on Wednesday. Which means we release on Friday, last call. If it's not ready by Friday, I'll do what's left to be done myself and call it a day.
5) If we're extremely fast and efficient and the Wednesday deadline is met in advance, we can talk about Sec 3.

We've hit a wall of diminishing returns and reduced availability. At this point 1) we're nitpicking, and 2) every day we lose is reduced impact.

Stella, you have a veto right for this plan if you think I am going crazy or missing something crucial.

#

Guidelines
Like in software engineering:

Don't break master: The paper must be in an acceptable state at every point in time.
Incremental and atomic changes: otherwise if your change is not finished, you'll break the paper.
Be mindful of others, communicate your intent, work as a team.

fallow egret Jun 24, 2023, 3:41 AM

#

versed flax **Here's the plan:** 1) We're restoring Sec 3. 2) We're finishing Sec 2. <@10570...

What about the intro section? it is also currently broken
Look at all the red paragraphs above. IMO this tells the story better than the old intro, as we all discussed (and I think we all agree)

blissful garden Jun 24, 2023, 4:41 AM

#

fallow egret What about the intro section? it is also currently broken Look at all the red pa...

could you change the red texts you added in the intro under \elad command?

#

some stuff is useful, I agree

fallow egret Jun 24, 2023, 4:42 AM

#

Yes, I'm afraid to make changes to the main text now. I'm not sure what I should do, so I prefer not to insert it 😐

blissful garden Jun 24, 2023, 4:43 AM

#

fallow egret Yes, I'm afraid to do changes now to the main text, not sure what I should so I ...

no I meant most of the red texts in the intro section were added by you right? Per your own suggestions you might want to change it to \elad to better track it.

fallow egret Jun 24, 2023, 4:43 AM

#

The big red remarks are my suggestion to replace the current intro with the red text...

#

(it's a complete section)

blissful garden Jun 24, 2023, 4:44 AM

#

wait, was the red text under your remark yours or someone else's

fallow egret Jun 24, 2023, 4:46 AM

#

All of the red part here is what I wrote; it's a suggested replacement for the current intro

blissful garden Jun 24, 2023, 4:48 AM

#

oh so I was right. I was just saying you suggested we bracket our suggestions by our own custom command, right?

fallow egret Jun 24, 2023, 4:49 AM

#

Yes, I wrote it before my suggestion 🙂

#

In any case we should decide what to do also with the intro...

blissful garden Jun 24, 2023, 4:51 AM

#

fallow egret Yes, I wrote it before my suggestion 🙂

yes I know. Could you change it to \elad then?

fallow egret Jun 24, 2023, 4:55 AM

#

Done

blissful garden Jun 24, 2023, 4:58 AM

#

I haven't started looking at introduction carefully but I personally like Alex's original text a lot, though it might be incomplete. Yours is a great complement but besides language issues, I mostly fear about this one

Our findings suggest that CFG is highly effective in improving the model alignment, increasing the possibility of serving the model as a general purpose assistants in a wide variety of tasks

#

Shall we fix section 2 first? A lot of people I know (including myself) work on the introduction almost at the end.

#

For section 2, looks like we all agree to use notations from Ho & Salimans. Let's comment out the remarks and stick to it.
I can do this minor change we agreed earlier about equation numbering.
I will also comment out a couple of my remarks that are resolved

fallow egret Jun 24, 2023, 5:06 AM

#

blissful garden Shall we fix section 2 first? A lot of people I know (including myself) work on ...

Sounds fair, I will work on all the quote placeholders. I think besides that and the title (which I have no objection to changing if someone has an alternative idea), there are no other comments (except the Ho notation)?

blissful garden Jun 24, 2023, 5:10 AM

#

fallow egret Sounds fair, I will work on all the quotes placeholders. I think besides that an...

there are a couple more that I haven't thought about super carefully. There is one about explaining how we go from p_\phi to p_\theta, and another about p(c|z) predicting the whole prompt from the generation (which could make sense mathematically but I just have a strange feeling).

The last one is about Nichol & Dhariwal's finding about the effect of CFG: is the way we rephrase it consistent with what's in their paper? I haven't checked yet.

#

There are some burdens on me here and I was planning on thinking about them on Saturday. But let me know if you have thoughts

fallow egret Jun 24, 2023, 5:12 AM

#

I think I addressed these remarks? (either by modifying the text or by adding a remark)

blissful garden Jun 24, 2023, 5:13 AM

#

fallow egret I think I addressed these remarks? (either by modifying the text or by adding a r...

which one?

fallow egret Jun 24, 2023, 5:13 AM

#

blissful garden which one?

all of them?

#

Did I miss something?

blissful garden Jun 24, 2023, 5:16 AM

#

fallow egret Did I miss something?

I don't see any explanation

#

by the way, there seems to be a lot of grammar mistakes and language problems as well. Should we leave it for you to revise?

fallow egret Jun 24, 2023, 5:18 AM

#

blissful garden I don't see any explanation

Oh, I see that remark. OK, let's iterate on it in the paper

fallow egret Jun 24, 2023, 5:19 AM

#

blissful garden by the way, there seems to be a lot of grammar mistakes and language problems as...

I will also iterate on the grammar issues (I wrote both this and the intro without any iteration), but for sure feel free to change anything you find

blissful garden Jun 24, 2023, 5:20 AM

#

I guess we are gonna say cfg uses p_\theta to approximate p_\phi, right? This was my original thought but I feel I might be wrong

#

Never dare to directly add a sentence into yours

fallow egret Jun 24, 2023, 5:22 AM

#

blissful garden I guess we are gonna say cfg uses p_\theta to approximate p_\phi, right? This ...

I think it is simply a classifier defined using the same generative model; this is the main point.
So in this case theta=phi (they are the same model parameters)
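
A sketch of that point written out, under assumptions: z stands for the generation and c for the prompt (as in the p(c|z) remark above), and \gamma for the guidance strength, a symbol not used in this thread. The paper draft may use different letters, so treat this as an illustration rather than the text of Sec. 2.

```latex
% Classifier guidance: reweight the generator p_\theta with a separate classifier p_\phi
\hat{p}(z \mid c) \;\propto\; p_\theta(z)\, p_\phi(c \mid z)^{\gamma}

% Classifier-free guidance: take the classifier from the same model via Bayes' rule
% (c is fixed, so p(c) is a constant in z)
p_\theta(c \mid z) \;\propto\; \frac{p_\theta(z \mid c)}{p_\theta(z)}

% Substituting, i.e. setting \phi = \theta:
\log \hat{p}(z \mid c) \;=\; \log p_\theta(z)
  + \gamma \bigl( \log p_\theta(z \mid c) - \log p_\theta(z) \bigr) + \mathrm{const}
```

With \gamma = 1 the right-hand side reduces to \log p_\theta(z \mid c), i.e. ordinary conditional generation.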

fallow egret Jun 24, 2023, 5:24 AM

#

blissful garden Never dare to directly add a sentence into yours

My intentions were only to rewrite the story structure of these sections. You are really welcome to modify/rephrase the sentences

versed flax Jun 24, 2023, 2:31 PM

#

I reworked the abstract a little bit: grammar, added LAMBADA, added GPT4All results, which are our best results to my knowledge

versed flax Jun 24, 2023, 3:37 PM

#

I reworked the introduction, blending @patent gull and @fallow egret's previous work. I think it's a fair blend and you should both find your work mainly unaltered. Can I get your confirmation here or as overleaf comments that you are in line with this intro and can consider it done?

fallow egret Jun 24, 2023, 3:40 PM

#

IMO the intro should start with an overview of the problem that we are trying to solve. But whatever you decide...

patent gull Jun 24, 2023, 3:40 PM

#

I’ll take a look! Just waking up now after EMNLP submission

versed flax Jun 24, 2023, 3:43 PM

#

fallow egret IMO the intro should start with an overview of the problem that we are trying to...

Why not. I'm open to criticism. My main focus is to get it done. Done is better than perfect. Can you suggest a fast and straightforward way to improve the introduction? The most broken part of the paper rn is Sec 2. If it takes more than 30 minutes to improve the intro in the way you would like it, I suggest you spend this time fixing Sec 2 to a "non-broken" and ok state.

patent gull Jun 24, 2023, 3:44 PM

#

yah i agree with Elad. Switching the 2nd and 1st paragraphs seems like the right way to go (and reworking)

#

happy to take a stab if you agree

versed flax Jun 24, 2023, 3:45 PM

#

That's indeed fast and straightforward. A simple cut and paste still makes sense.

#

@fallow egret Are you happy with this simple swap?

fallow egret Jun 24, 2023, 3:48 PM

#

versed flax @fallow egret Are you happy with this simple swap?

Yes, but it requires a few more small changes (e.g. you can't say that you use CFG if you haven't discussed CFG first).

versed flax Jun 24, 2023, 3:48 PM

#

fallow egret Yes, but it requires a few more small changes (e.g. you can't say that you use C...

Also I couldn't find the citation you had in mind for "like prompt alignment [Anthropic]"

fallow egret Jun 24, 2023, 3:49 PM

#

versed flax Also I couldn't find the citation you had in mind for "like prompt alignment [An...

https://arxiv.org/pdf/2112.00861.pdf

versed flax Jun 24, 2023, 3:49 PM

#

ty!

fallow egret Jun 24, 2023, 4:02 PM

#

@blissful garden All the rephrasing of the beginning of section 2 (up to 2.1) was done by you? It's fine, I just want to take a look at an older version, and we have a history limit of 24h 😦

versed flax Jun 24, 2023, 4:05 PM

#

I am happy with the abstract (thumb up / down)

#

I am happy with the introduction (thumb up / down)

patent gull Jun 24, 2023, 4:24 PM

#

versed flax - I am happy with the introduction (thumb up / down)

guys in general — i don't know ICML/Neurips conventions that well — but it seems like there's a lot of really basic background in here that I don't remember last time

#

like i think we define what "prompting" is twice (once in the introduction, once in the next section), or at least introduce it as a concept

#

the language feels fuzzier to me, like it's saying less

versed flax Jun 24, 2023, 4:28 PM

#

I don't know the conventions either. I think it doesn't hurt and you can't really go wrong by stating the basics again, whereas an omission can really be annoying for the reader. It's much easier to skip a few sentences than having to google and read several papers to find that piece of prerequisite that you did not know about

patent gull Jun 24, 2023, 4:28 PM

#

fair enough

versed flax Jun 24, 2023, 4:30 PM

#

Beyond that, it helps contextualize how we use and what we mean by the specific vocabulary ("conditioning", "prompt", etc), that is often used a bit differently by each author

patent gull Jun 24, 2023, 4:31 PM

#

for sure i guess i'm more focused on the repetition between Intro and Methods

versed flax Jun 24, 2023, 4:32 PM

#

Oh yeah that's a different story but I'm not in Sec 2 yet

blissful garden Jun 24, 2023, 4:32 PM

#

fallow egret @blissful garden All the rephrasing of the beginning of section 2 (up to 2....

Your main text was commented out in the middle

patent gull Jun 24, 2023, 4:32 PM

#

I don't love the way we basically state "This worked in Computer vision" twice; it feels very redundant to me

fallow egret Jun 24, 2023, 4:33 PM

#

Yes, something was changed there significantly compared to the original version I wrote 😐

#Evaluating Classifier-Free Guidance impact

Comparing CFG to other methods (CoT, GSM8K, Wizard-30B-COT):
Baseline (cfg=1): acc 23.5%
ensemble (3 models): acc 25.5%
CFG 1.1: 26.5%
self-consistency (3 runs): 29%
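
For reference, a minimal sketch of what a setting like "cfg=1" vs "CFG 1.1" could mean at the level of next-token scores. This assumes the usual logit-space formulation of CFG; the function names `log_softmax` and `cfg_log_probs` are hypothetical and this is not the code used to produce the numbers above.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def cfg_log_probs(cond_logits, uncond_logits, gamma):
    """Combine prompt-conditioned and unconditioned next-token scores.

    gamma = 1.0 returns the plain conditional distribution (the baseline);
    gamma > 1.0 pushes the distribution further toward the prompt.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    # log p_uncond + gamma * (log p_cond - log p_uncond), then renormalize
    mixed = [u + gamma * (c - u) for c, u in zip(cond, uncond)]
    return log_softmax(mixed)


if __name__ == "__main__":
    cond = [2.0, 1.0, 0.0]    # toy scores with the prompt
    uncond = [1.0, 1.0, 1.0]  # toy scores without the prompt
    for gamma in (1.0, 1.1, 1.5):
        probs = [math.exp(v) for v in cfg_log_probs(cond, uncond, gamma)]
        print(gamma, [round(p, 3) for p in probs])
```

In this toy case the token favoured by the prompt gains probability mass as gamma grows, and gamma = 1.0 leaves the conditional distribution unchanged.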