# Evaluating Classifier-Free Guidance impact
Programming lang conformance (when no lang is specified, any lang is considered ok)
cfg : acc
1 : 73.00 (73/100)
1.25: 86.00 (86/100)
1.5 : 81.00 (81/100)
1.75: 77.00 (77/100)
p-value cfg=1 vs cfg=1.25: 0.0105
breakdown
1
| not code | C | Java | Python
- | 9 | 9 | 6 | 1
C | 3 | 19 | 3 | 0
Java | 5 | 0 | 19 | 1
Py | 6 | 0 | 0 | 19
1.25
| not code | C | Java | Python
- | 4 | 11 | 9 | 1
C | 4 | 19 | 2 | 0
Java | 2 | 0 | 23 | 0
Py | 1 | 0 | 1 | 23
1.5
| not code | C | Java | Python
- | 6 | 8 | 8 | 2
C | 5 | 18 | 2 | 0
Java | 3 | 0 | 22 | 0
Py | 3 | 0 | 0 | 22
1.75
| not code | C | Java | Python
- | 6 | 6 | 10 | 1
C | 8 | 16 | 1 | 0
Java | 2 | 0 | 23 | 0
Py | 5 | 0 | 1 | 19
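A minimal sketch of how the accuracy and p-value above could be recomputed from the breakdown. It assumes rows are the requested language ("-" meaning none requested) and columns are what was generated. The test behind the reported 0.0105 is not stated in the thread; an unpooled one-sided two-proportion z-test is used here only because it lands close to that value. `conforming` and `one_sided_z_test` are hypothetical helper names.

```python
import math

# Confusion matrices copied from the cfg=1 and cfg=1.25 breakdowns above.
# Column order: not code, C, Java, Python.
CFG_1 = {"-": [9, 9, 6, 1], "C": [3, 19, 3, 0], "Java": [5, 0, 19, 1], "Py": [6, 0, 0, 19]}
CFG_1_25 = {"-": [4, 11, 9, 1], "C": [4, 19, 2, 0], "Java": [2, 0, 23, 0], "Py": [1, 0, 1, 23]}
COLUMN = {"C": 1, "Java": 2, "Py": 3}


def conforming(matrix):
    """Count generations that conform (any code counts when no language is requested)."""
    hits = sum(matrix["-"][1:])
    hits += sum(matrix[lang][col] for lang, col in COLUMN.items())
    return hits


def one_sided_z_test(hits_a, hits_b, n):
    """P-value for 'b is more accurate than a', using an unpooled standard error."""
    pa, pb = hits_a / n, hits_b / n
    se = math.sqrt(pa * (1 - pa) / n + pb * (1 - pb) / n)
    z = (pb - pa) / se
    return 0.5 * math.erfc(z / math.sqrt(2))


hits_1 = conforming(CFG_1)
hits_1_25 = conforming(CFG_1_25)
p_value = one_sided_z_test(hits_1, hits_1_25, 100)
print(hits_1, hits_1_25, p_value)
```

A pooled or two-sided test would give a larger p-value, so the choice of test matters if this goes in the paper.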
So if I’m understanding this correctly, it’s a greater indication that CFG is encouraging prompt adherence?
(Also I’m so sorry I’m late with results and generally absent from discussion the past few days. My girlfriend is in town visiting from far away so I’m not super close to my computer)
I totally agree, @loud adder, that this is a weird experiment. Just wasn’t entirely sure what to do to evaluate toxicity that didn’t involve human evaluation. Any other suggestions?
Was thinking of ways to make it more robust
Is the goal here to improve machine translation, or provide more insight that CFG increases prompt adherence?
Show generality of application
Showing it over one task isnt helpful, so testing it over generation tasks, over models
I think it should be generally more helpful in generation tasks
Since they tend to steer away
I see, definitely. Might be interesting to try for low resource languages, too
Why not use a strong model (GPT-3.5/4) to give a score?
I really like these metrics because they are easily reproducible given the prompt.
They’re not easily reproducible though? The model underlying the API changes without documentation.
Also, do we have any reason to believe that GPT-4 is reliably good at this task?
(and that there aren't adversarial patterns)
I used Self-Critique with GPT-3.5 for toxicity for some product, it is not perfect of course (this is still a big issue in the field), but it works very nicely.
Regarding the underlying model changes, yes, indeed this is a big issue
Yeah I generally don’t like the emerging strain of research that uses GPT4 to evaluate. But maybe the evaluation is worth layering on top. I don’t have a ton of research budget but I can do it for a sample of generations
Why not use the Perspective API over GPT4? It has its own set of issues (https://arxiv.org/abs/2304.12397) but is at least much more commonly used than asking GPT-4
This makes me wonder and think about 1 shot performance
I’ll try those experiments at the end
I was under the impression that that’s what was happening, and that GPT-4 was a suggested alternative
ah apologies, i saw something about perplexity and so wasn’t sure
No, RealToxicityPrompts are natural snippets from the web that have been scored for toxicity using the Perspective API. I’m selecting the top 10% most toxic of them (~800) and the 10% least toxic
I’m calculating the average perplexity of generating the toxic (or non-toxic) continuation under higher values of CFG with various detoxifying system prompts like “de-escalate the violence in this chat”
We expect that if CFG is really forcing more detoxification, we’ll see higher ppl for the more toxic continuations, and unchanged ppl for the less toxic. Of course there are lots of confounders here and we’ll also try to explore a bit to make it robust against these
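A rough sketch of the measurement described above: the perplexity of a fixed continuation under a CFG-adjusted next-token distribution. Applying the guidance to logits (rather than probabilities) is an assumption, the function names are hypothetical, and the toy logits are made up so that the detoxifying prompt lowers the score of token 2.

```python
import math


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def cfg_log_probs(cond_logits, uncond_logits, cfg):
    """Mix logits as uncond + cfg * (cond - uncond), then renormalise."""
    mixed = [u + cfg * (c - u) for c, u in zip(cond_logits, uncond_logits)]
    return log_softmax(mixed)


def continuation_ppl(cond_steps, uncond_steps, token_ids, cfg):
    """Perplexity of a fixed continuation under the CFG-adjusted distribution."""
    total = 0.0
    for cond, uncond, tok in zip(cond_steps, uncond_steps, token_ids):
        total += cfg_log_probs(cond, uncond, cfg)[tok]
    return math.exp(-total / len(token_ids))


# Toy 3-token vocabulary; token 2 stands in for a "toxic" token (made-up numbers).
cond_steps = [[2.0, 1.0, 0.0], [1.5, 1.0, 0.2]]    # with the detoxifying system prompt
uncond_steps = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]  # without it
toxic_continuation = [2, 2]

ppl_by_cfg = {
    g: continuation_ppl(cond_steps, uncond_steps, toxic_continuation, g)
    for g in (1.0, 1.5, 2.0)
}
print(ppl_by_cfg)
```

In this toy setup the perplexity of the toxic continuation rises with the guidance strength; whether real models behave the same way is exactly what the experiment would test.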
But yeah that was my attempt to address the “improve assistants” aim of this paper, which was an original goal at the outset
So yes perspective API is being used, but only indirectly
@patent gull Why do this instead of the standard RTP evaluation?
(Unless this is the standard one?)
I will read more. I had seen this approach used several times but idk if there is a standard
Maybe elad knows, he’s typing 😉
I'm typing results 🙂
Results for gsm8k:
- Keeping the correct answer structure:
CFG-1: 72.17%
CFG-1.5: 83.33%
- Giving the correct results (baseline extraction):
CFG-1: 17.2%
CFG-1.5: 18.2%
- Allowing a little bit more flexibility (simply allowing $ and % prefix):
CFG-1: 23.2%
CFG-1.5: 21.5%
- Adding a little bit more heuristics:
CFG-1: 31%
CFG-1.5: 28.1%
Now running CFG-1.1 and 1.25
@fallow egret can you contextualize these numbers? Is a +3% on this benchmark a big deal, or not?
Sorry, I fixed it. It is a 3% gain for CFG-1, not CFG-1.5 (as discussed before, CFG-1.5 degrades performance but preserves the structure better).
Yes, I think that 3% is a big gap (it is a challenging benchmark). For comparison, an ensemble of 40 prompts results in only a 2% improvement (LaMDA-137B)
Do you have per-question results?
Yes
I could see us wanting to calculate P(correct | well-formatted) or something like that
Cool
For this we have the results, I think the interesting missing part is:
P(correct | not well-formatted) and P(incorrect but contains the correct answer in the chain)
Yeah I agree, I just typed the first conditional that came to mind as an example
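A small sketch of how those conditionals could be estimated from per-question results. The record fields (`well_formatted`, `correct`, `answer_in_chain`) and the five records are hypothetical, not the actual gsm8k outputs. The last quantity is computed as a conditional on the answer being incorrect, which is one reading of the joint event mentioned above.

```python
def conditional_rate(records, event, given):
    """Estimate P(event | given) from per-question boolean records."""
    selected = [r for r in records if given(r)]
    if not selected:
        return float("nan")
    return sum(1 for r in selected if event(r)) / len(selected)


# Hypothetical per-question records (illustration only).
records = [
    {"well_formatted": True, "correct": True, "answer_in_chain": True},
    {"well_formatted": True, "correct": False, "answer_in_chain": True},
    {"well_formatted": True, "correct": False, "answer_in_chain": False},
    {"well_formatted": False, "correct": False, "answer_in_chain": True},
    {"well_formatted": False, "correct": True, "answer_in_chain": True},
]

p_correct_given_formatted = conditional_rate(
    records, lambda r: r["correct"], lambda r: r["well_formatted"])
p_correct_given_unformatted = conditional_rate(
    records, lambda r: r["correct"], lambda r: not r["well_formatted"])
p_chain_given_incorrect = conditional_rate(
    records, lambda r: r["answer_in_chain"], lambda r: not r["correct"])
print(p_correct_given_formatted, p_correct_given_unformatted, p_chain_given_incorrect)
```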
mt0 - 0 shot
prompt: If the original version says: Espionnage de la NSA: les Etats-Unis sont "allés trop loin", admet Kerry ; then the English version should say:
CFG=1 |wmt14-fr-en| 0|bleu |29.7692|± |0.3203|
CFG=1.1 |wmt14-fr-en| 0|bleu |29.7692|± |0.3203|
CFG=1.2 |wmt14-fr-en| 0|bleu |29.7692|± |0.3203|
CFG=1.46 |wmt14-fr-en| 0|bleu |29.7692|± |0.3203|
CFG=1.7 |wmt14-fr-en| 0|bleu |29.7692|± |0.3203|
CFG=2. |wmt14-fr-en| 0|bleu |29.7692|± |0.3203|
the idea was to see, if cfg helped multitask prompt tuned models (as mt0 is)
It is exactly the same numbers, maybe a bug?
ya it is
so this occurred with another model and @versed flax mentioned that it can happen when the cfg parameter isn't enough to steer the model to a different head token
because im using the same codebase for all other experiments too
i will do the experiments once more if there is some doubt though 
I'm a little bit surprised that the high CFG doesn't make it worse
true, same
im trying 3
to see if it does something
Getting nice improvement when reducing the CFG values!
CFG-1: 23.2%
CFG-1.1: 25.8%
CFG-1.25: 24.7%
CFG-1.5: 21.5%
If anyone has an idea for a strong open-source model for COT I would love to add another model to the experiments (unfortunately all the models I'm finding in the literature are closed)
Another experiment I will do is to compare it to a self-ensemble approach (so we will have some baseline)
|wmt14-fr-en| 0|bleu |29.7692|± |0.3203|
hasnt changed
im surprised
Well with cfg values this high, my hypothesis is that this model uses a different code path which hasn't been cfg patched
but we've been getting different values in other models right
should i try another prompt-tuned model
Yeah but there are some intricacies with the codebase and it seems that not all models use the same code paths. Honglu already reported something about that
oh this could be a seq2seq model thing then
mt0 is a seq2seq model, all other experiments were on hf-causal-experimental
That certainly makes sense to me
okay im dumb, thought the mt0 model was biased towards prompts too much already
Nah you would see some variations at least.
(and disastrous results for high cfg)
Ahhh that's great news!!
Yes, indeed
okay you were right, saw this and made changes for the seq 2 seq model
will try my experiments again
thank you for catching it
CFG 1.00 |wmt14-fr-en| 0|bleu |29.7692|± |0.3203|
CFG 1.05 |wmt14-fr-en| 0|bleu |29.4076|± |0.3136|
CFG 1.10 |wmt14-fr-en| 0|bleu |27.7893|± |0.3143|
doesnt help in the case of prompt tuned models (not entirely sure if its a prompt tuned thing or a seq2seq thing, but inclining towards the former)
From our POV, the ideal scenario would be that this is because the prompt-tuned models already behave “the same way that CFG is trying to make them behave, but even more so”
The only way to test that is like, take a model tuned and untuned (afaik t5/t0 and bloom/bloomz) and then do a bunch of simple prompt experiments with cfg and on bloomz. I only mean this in a qualitative way
Mt0 does wipe the floor on task performance in terms of scores
The easy way to make progress on this question might be to run both on a variety of tasks and try to correlate their answers?
Like, if CFG changes the answer (compared to the base model) does it tend to make it more like the tuned model
Yeah this question is not the easiest to answer. Since in, let's say, the case of wmt14 fr-en, there's like a 10 point rouge score difference between the models with cfg applied. We might have to cherry pick a little, or take their outputs and run correlation
It’s much easier for LLH tasks than generative tasks
For “generative” tasks I would probably try to look at the logit distributions and ignore the actual generations (ironic, ik)
bloom3b 1.1 |wmt14-fr-en| 0|bleu |15.8055|± |0.4413|
so like, in wmt14 fr-en across all models except prompt tuned
have seen improvement
will try summarization next
and maybe do this 1 shot as well
replied to the wrong text sorry
.
So let's think step by step about what would it mean when a model behaves negatively with CFG.
Let's write CFG as P(x) + cfg * (P(x|y) - P(x)). A negative impact of CFG would mean that P(x) -> P(x|y) does not point in the "right" direction. Which it should, unless 1) the model sucks very much and P(x|y) is bad, or 2) P(x) is terrible.
I can see (2) happening if an instruction tuned model gets extremely confused without an instruction prompt and produces gibberish. It's easy to test and if that's true, we can probably find a "null" prompt, y0, so that P(x|y0) behaves like we would expect P(x) to.
Or P(x|y) is already at the optimal peak of the loss function. That is, cfg=1 is optimal for L(P(x) + cfg * (P(x|y) - P(x))). We can easily check that cfg=1 is the best by trying 0.9 0.95 1 1.05 1.1 and showing that 1 is best. But I can't think of a way to find other directions to compare to.
[EDIT: continuation]
In diffusion models what we observe is that the model tends to loosely interpret the prompt and pull back to P(x), to the training set density. To restate my example, "a robot duck" would produce a robot or a duck but not a "robot duck". CFG allows to counteract this drag, and really reach the "robot duck" by pushing away from P(x), in the direction of the prompt.
What I think I observed in Stable Diff is that indeed finetunes and LoRAs need close-to-1 CFG values (this is wrong) because indeed P(x|y) is not "atypical" for the y prompts. Usually you'd use a LoRA for a certain type of images/prompts and use prompts in that realm, so there is no significant difference between P(x) and P(x|ys of interest), and CFG indeed does nothing I suppose.
The argument could be that instruction-tuned models do indeed shift their P(x) to the domain of fancy prompting, and CFG is useless. => Validates Stella's argument, which I think I now agree with.
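To make the "cfg=1 is already optimal" case concrete, here is a toy sketch of the sweep over 0.9 / 0.95 / 1 / 1.05 / 1.1 mentioned earlier in this message. It applies the P(x) + cfg * (P(x|y) - P(x)) form to log-probabilities, which is an assumption (the message writes it on P directly), and it sets the target distribution equal to P(x|y) on purpose, so cross-entropy has to be minimised at cfg=1. All numbers and names are illustrative.

```python
import math


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def cfg_log_probs(log_p_cond, log_p_uncond, cfg):
    """log P(x) + cfg * (log P(x|y) - log P(x)), renormalised."""
    return log_softmax([u + cfg * (c - u) for c, u in zip(log_p_cond, log_p_uncond)])


def cross_entropy(target_probs, log_probs):
    return -sum(t * lp for t, lp in zip(target_probs, log_probs))


p_uncond = [0.25, 0.25, 0.25, 0.25]  # toy P(x)
p_cond = [0.6, 0.2, 0.1, 0.1]        # toy P(x|y)
target = p_cond                      # assumption: P(x|y) is already the "right" answer

log_c = [math.log(p) for p in p_cond]
log_u = [math.log(p) for p in p_uncond]

sweep = [0.9, 0.95, 1.0, 1.05, 1.1]
losses = {g: cross_entropy(target, cfg_log_probs(log_c, log_u, g)) for g in sweep}
best_cfg = min(losses, key=losses.get)
print(best_cfg, losses)
```

If a real model showed the same pattern on held-out data, that would support the reading that the tuned model already sits where CFG would push it.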
I'm not sure I understand what you are trying to prove
nothing exactly, just shooting a few ideas, thinking out loud
trying to understand why paws experienced a decrease in accuracy with the instruction tuned models, and what happens with instruction models
I'm not sure it's related to the instruction tuned model. In the end, as far as I understand it, on his task CFG affects the balance between the coherence of the translated text and the 'stickiness' to the original sentence which is given by the prompt (direct translation word by word).
I think that if we look at a few examples we can easily verify/refute this claim
I’m generally quite skeptical of anything measured in BLEU, FWIW
That’s roughly my intuition, yeah. I didn’t know that about finetunes of SD, but it makes sense.
I have an idea for an experiment brewing. I’m off to a pottery class but can share later
Should try and see if some human eval is possible
After checking, it seems that I overgeneralized from a specific model that needed low CFGs. Checking galleries for various models on CivitAI, it seems that even for finetunes, CFG > 6 is the norm.
Let's slow down a moment. LLMs and diffusion models work a bit differently, and it's important that that nuance not get lost
let me know if you need help with this, I'm relatively free to help!
hi who is using all the GPUs on the cluster? Can you restrict your GPU usage?
whoever is running the gsm8k with the WizardLM-30B-Uncensored-Guanaco-SuperCOT... I'm assuming that's @fallow egret ? What's the progress on that?
Yes, I'm now testing it against ensemble
You need me to release something? There is lots of space if you run something with 'auto' assignment
This was supposed to be the intro to a series of comments but I had to take my cat to the ER :/
Oh 😿 How's the little furry thing doing :(?
Yes. Your little friend needs all the love you can give. Take care of the cat, we'll wait.
oh no!! so sorry to hear that!!
the issue is that RAM is limited on this cluster, so my processes are getting killed just loading in a 3B model onto CPU, before getting it onto the GPU (see vermifuge's and stella's convo on this)
it's ok — i can wait. just wondering how much longer?
@patent gull I increased the RAM so that models up to 20B should work
Oh are you on the old pod
ah :/ yes
the 3B model typically does load
it's ok — if elad is almost done, i can wait
@patent gull I mean, it's currently not being used: ssh [email protected]
wow!!!! what a machine! thank you!!!
We've been trying to get some explanatory insights into what CFG is actually doing. Maybe folks here have an idea of what else to look at?
I generated 20 different sequences of text with 3 different models (Pythia, Llama, GPT4All), and tracked the logits at each step. Turns out that CFG consistently decreases the entropy of the generated text:
This had the effect of decreasing the # of tokens in the top_p=90%:
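The plots are not reproduced in this export, so here is a toy sketch of the two quantities being tracked: the entropy of the next-token distribution and the number of tokens inside the top_p=90% nucleus, with and without CFG. The logits are made up and the helper names are hypothetical; the guidance is applied to logits, which is an assumption.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def nucleus_size(probs, top_p=0.9):
    """Number of highest-probability tokens needed to reach top_p mass."""
    cumulative = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)


cond_logits = [2.0, 1.5, 1.0, 0.5, 0.0]    # toy logits given the prompt
uncond_logits = [1.0, 1.2, 0.8, 1.0, 1.0]  # toy logits without the prompt
cfg = 2.0

vanilla = softmax(cond_logits)
guided = softmax([u + cfg * (c - u) for c, u in zip(cond_logits, uncond_logits)])

print(entropy(vanilla), nucleus_size(vanilla))
print(entropy(guided), nucleus_size(guided))
```

In this toy case the guided distribution is sharper and its nucleus is smaller, which is the direction of the effect reported above for the real models.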
Strange Q I should have asked earlier and was probably distracted by the charts going up to think about seriously:
What is the "classifier" we are free of?
not speaking from a place of deep knowledge here, but I interpreted that as the classifier/discriminator that is used in controlled text generation
We don't have one of those when generating from a decoder-only LLM
yeah we do... controlled text gen decomposes p(x | a) \prop p(x) p(a|x)
p(a|x) is the classifier, p(x) is the decoder-only model
If you put it in dollar signs a bot will post the LaTeX as an image
oh that's cool!! Thanks for the tip
$p(x | a) \propto p(x) p(a|x)$
whoops my bad lol
Stella Biderman (she/her)
yes that is the correct latex.... my bad... was typing pseudocode
anyway, yes, i assumed CV had similar approaches to guidance
So, the thing about guidance in CV is that it's part of the training procedure
GANs have a nice mechanism to incorporate class labels into the generated samples in the form of class-conditioned normalization and classifier-like discriminators
(I'm in a D&D game rn I should pay attention to, but Alex is right)
p(a | x) is p(prompt | completion)
no sorry, i was using controlled generation terminology
a = attribute (i.e. "non-toxic", "happy")
x = text you wish to generate
(or any class said classifier knows about, for classifier guidance)
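A toy sketch of the decomposition $p(x | a) \propto p(x) p(a|x)$ over a three-item vocabulary, and of why the classifier can be dropped: by Bayes' rule p(a|x) is proportional to p(x|a) / p(x), so guiding with strength w using that implied classifier gives the same distribution as reweighting p(x) by (p(x|a) / p(x))^w. The numbers and function names are illustrative assumptions.

```python
def normalise(weights):
    total = sum(weights)
    return [w / total for w in weights]


def classifier_guided(p_x, p_a_given_x, w=1.0):
    """p(x) * p(a|x)^w, renormalised: guidance with an explicit classifier."""
    return normalise([px * pa ** w for px, pa in zip(p_x, p_a_given_x)])


def classifier_free(p_x, p_x_given_a, w=1.0):
    """p(x) * (p(x|a) / p(x))^w, renormalised: the classifier is implied by the ratio."""
    return normalise([px * (pxa / px) ** w for px, pxa in zip(p_x, p_x_given_a)])


p_x = [0.5, 0.3, 0.2]          # toy unconditional model
p_a_given_x = [0.1, 0.9, 0.5]  # toy attribute classifier, e.g. "non-toxic"

p_x_given_a = classifier_guided(p_x, p_a_given_x)  # w = 1 is plain Bayes
guided = classifier_guided(p_x, p_a_given_x, w=2.0)
free = classifier_free(p_x, p_x_given_a, w=2.0)
print(p_x_given_a, guided, free)
```

In log space the second form reads log p(x) + w * (log p(x|a) - log p(x)), which is the shape of the CFG expression used elsewhere in this thread.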
I assume by analogy the classifier would be a classifier used to assist in decoding from a language model
So the diagram from the CAD paper is instructive I think here
In CV the first versions used an imagenet classifier, then we used CLIP, then CFG came and we ditched the classifiers
yeah CFG in vision is a bit of confusing terminology because it refers to a classifier that, in diffusion models, we don't need any more
Well that's why it's classifier free
In their notation we sample a generation y which is a continuation of x relative to some context c
answer ~ p(y | x, c)
But we are getting rid of c / combining it with x?
we keep the idea of context in CFG setup... except context = prompt for what we've explored
Eh, that's not quite true... x starts off as the empty set but eventually becomes "all previously generated tokens"
right. So the "guidance" isn't in the form of a classifier pushing us towards some class label, but it's an upweighting of the prompt
So we were linearly interpolating between p(continuation | prompt) and p(continuation)
That vaguely corresponds to CFG between 0-1, which funnily enough does nothing
(from a LLM POV this is the wrong way to talk about things which is slowing down my brain)
Tbh the diffusion analogy isn't helping me understand this idea because every component is so different
When the generation is long enough, the prompt gets pushed out of the context window
In this case, what does it mean to have conditioning on the prompt, mathematically?
I'm really not sure tbh.
that's a really interesting question.
(a) I don't think any of the benchmarks rely on generation that's so long it requires a shifting context window, so I don't think we've addressed that question empirically.
(b) $ a p(x_n | prompt, x_{<n}) + (1 - a) p(x_n | x_{<n})$ I'm pretty sure this collapses, because we just get 0... no, we just get the unconditional distribution
(Falling back on my mathematics training and going “I’m confused, let’s look at really extreme examples”)
Not sure why this didn’t trigger the bot, maybe write it in `code formatting` as I can read latex well
a p(x_n | prompt, x_{<n}) + (1 - a) p(x_n | x_{<n})
$a p(x_n | prompt, x_{<n}) + (1 - a) p(x_n | x_{<n})$
I think it ignores anything with a space after the $ to avoid triggering the bot too often
Alex Spangher
so for n> max_len, the first and second terms are equal, so we get the same distribution we would get without CFG
but x_1, ... x_n were generated via a CFG process, so it's still different than it would've been without CFG. Maybe as n -> infinity, the distribution converges to what it would've been without CFG
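A toy illustration of the point above: once the prompt has slid out of a fixed-size context window, the conditional and unconditional contexts are identical and the guidance term vanishes. `toy_logits` is a hypothetical stand-in for a model (it just counts tokens in the window), and the window handling is a simplifying assumption.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(context, vocab_size=4):
    """Hypothetical 'model': the logit of each token is its count in the context."""
    return [float(context.count(t)) for t in range(vocab_size)]


def cfg_probs(prompt, generated, cfg, max_len):
    cond = toy_logits((prompt + generated)[-max_len:])  # prompt + history, truncated
    uncond = toy_logits(generated[-max_len:])           # history only, truncated
    return softmax([u + cfg * (c - u) for c, u in zip(cond, uncond)])


prompt = [0, 0, 1]
max_len = 4
short_history = [2, 3]       # prompt tokens are still inside the window
long_history = [2, 3, 2, 3]  # prompt has been pushed out of the window

print(cfg_probs(prompt, short_history, 1.5, max_len), cfg_probs(prompt, short_history, 1.0, max_len))
print(cfg_probs(prompt, long_history, 1.5, max_len), cfg_probs(prompt, long_history, 1.0, max_len))
```

The history itself was still generated under guidance, so the sequence as a whole can differ from an unguided one even when the per-step distributions coincide.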
That would surprise me. LLMs are typically not very robust to conversation history
So one issue with this is that when we sample from a LLM normally we aren’t sampling from $p(x_n|x_{<n})$ ever. So it feels less like we are pushing the model in the direction of the prompt than it does like we are pushing the model away from $p(x_n|x_{<n})$
As the generation continues, we would expect $p(x_n|x_{<n})$ to improve and eventually approximate $p(x_n|p, x_{<n})$ (maybe?). That’s because it’s easier to fill in the last word of an answer without looking at the question than it is to write the first word of an answer without looking at the question
Which begs the obvious question of “what happens if you stop using guidance after n tokens”
- This is why the CFG values are very low for text completion (1.1). Increasing it degrades performance very quickly (as you mentioned, it 'pushes away')
- The main impact of CFG is preserving the prompt structure. For getting the correct structure it is highly important to look at the question structure and not on the answer previous tokens
@patent gull if you still have the scripts that produced the plots, can you plot (P(x|y) - P(x))**2 wrt index?
sorry for the delay.. this is a dumb question, but
normally we aren't sampling from
p(x_n | x_<n)
do you mean because of top-p, top-k, temperature, etc.?
Did we run TriviaQA?
sure thing.. (P(x|y) - P(x))**2... oh you mean the l2_norm?
yes
(I happen to have some numbers about the impact of prompt engineering on TriviaQA performance on hand)
The new (greedy_until()) version? Only with GPT2. The plots have the same shapes but the absolute values are lower overall
some initial one shot experiments, bloom3b, wmt14 fr_en
CFG=1. |wmt14-fr-en| 0|bleu |29.8361|± |0.3369|
CFG=1.1 |wmt14-fr-en| 0|bleu |28.5291|± | 0.331|
@unique sedge try 0.9
It's a little difficult to find information in the thread, can results be pinned perhaps?
lol we need a GPT summarizer bot
besides @versed flax, he's our universal summarizer
I know that with the formatting `Answer these questions:\nQ: {question}\nA:` LLaMA-7B scores 49.3% and without the first line it scores 40.5%
@loud adder what did you want to conclude from the code generation with gpt-j tests?
I'm less sure of the validity of the analysis now, but CFG seems to increase adherence to the correct programming language by ~10% (except for C for some reason where it was unchanged) and decreased the not-code generations meaningfully across the board
@versed flax here's the L2 distance of P(Y | X) and P(X):
Probably more interesting:
The reason I feel less confident in the meaningfulness is that code vs non-code is determined in the first handful of tokens generally
the L2 norm between CFG and vanilla generation
yes, the difference between them
So that pattern looks interesting to me.
they kinda say different things, don't they? the first plot says there's kinda a constant distance between the two distributions, until the end.
the second plot says that CFG has its greatest effect at the beginning... i.e. the CFG distribution is the most different from the vanilla distribution at the beginning
imo the second one most directly gets at "what is the effect of CFG"
what's the formula you've plotted in the second?
CFG(Y|X) - P(Y|X) where CFG(Y|X)= a * P(Y|X) - ...
gives a sense of how different the final logit distribution is from what it would have been
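A sketch of the two per-position quantities being compared: the L2 distance between P(Y|X) and P(X), and the L2 distance between the CFG distribution and the vanilla P(Y|X). The exact CFG expression is elided in the message above, so the logit-space form used here is an assumption, and the three positions of logits are made up so that the prompt's influence fades with the index.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def l2(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cfg_probs(cond_logits, uncond_logits, cfg):
    return softmax([u + cfg * (c - u) for c, u in zip(cond_logits, uncond_logits)])


# Made-up logits for three generation positions.
cond_steps = [[2.0, 0.0, 0.0], [1.0, 0.5, 0.0], [0.2, 0.1, 0.0]]
uncond_steps = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.1, 0.0]]
cfg = 1.5

cond_vs_uncond = [l2(softmax(c), softmax(u)) for c, u in zip(cond_steps, uncond_steps)]
cfg_vs_vanilla = [l2(cfg_probs(c, u, cfg), softmax(c)) for c, u in zip(cond_steps, uncond_steps)]
print(cond_vs_uncond)
print(cfg_vs_vanilla)
```

Averaging these curves over many prompts, as suggested in the thread, would just mean collecting one list per prompt and taking the mean at each index.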
oh, cool
is this an average of several prompts?
There are two inflexion points at 80 and 100
yup... it's 18 prompts
ok
but i think if we wanna put this in the paper, i'd do it more systematically
use a lot more prompts
Since we're trying to get more insight as to what CFG is doing, here is a GPT2-medium run with prompt "The koala". Each line reads as sampled_token> top5(P(x|y) - P(x)) ... lowest5(P(x|y) - P(x))
The ko
ala> cub, behav, toile, calf, foss... Mechdragon,sson,"},",encia, Gleaming
is> carniv, domest, bip, herb, hiber...azar,ufact,sbm,krit,hower
not> domest, venom, carniv, poisonous, hiber...ufact,bryce,terness,ゼウス,actionDate
expected> breed, breeding, starve, nests, offspring...Rated,Tickets,against, sidelined,ifying
to> breed, breeding, pup, mate, mating... Signed, coached, Played,姫, guiActiveUn
survive> zee,zees, habitat, breeding, zo... Amend, Cosponsors,hower, FEC, Lerner
the> habitat, enclosure, ko, breeding, feral... USL, NCAA, scrimmage,regular, postseason
winter> dens,zee, enclosure, Zoo, habitat...olitical, scrimmage, Soccer,SpaceEngineers, Hockey
.> ko,Ko, Picture, Ko, Photo... Morsi,ş, Benghazi,ğ, Alger
But> ko, feral, breeding, paw, vets...ş, Cosponsors, Shiite,oglu,ğ
its> mate, behaviour, mum, fur, instincts...��,Engine,Crew,ensional, glaciers
presence> recognise, practise, favourite, Australians, apologise... signaled, piv, differed, traveled,ajor
in> Melbourne, Griffith, Canberra, Adelaide, bush...Pg,Palest,Syria,Gaza,ISIS
the> ko, Canberra, Wellington, Melbourne,Adelaide...chlor,SpaceEngineers,Gaza,Syrian,ISIS
area> NZ, Perth, apologise, Queensland, recogn...avored, leveled, favored,avor,zai
will> recogn, honour, recognise, apologise,NZ... tilted,rift, complicate,avored, unfavorable
give> ko, Queensland, Tasmania, jo, Melbourne... Morsi, Hezbollah, Hamas, ISIL, Shiite
it> civilisation, reintrodu, mate, colour,NZ... prosecut, centrif,Iran, veto,govern
a> conservation, colourful, civilisation, advert, honour... coord, centrif,afort,ymm, coercive
boost> NZ, Kiw, Australians,Australian, Conservation... Idlib,oglu,orsi,afort, centralized
,> ko, Rabbit, Queensland, Zoo, Tasman... Morsi,Iran, Hezbollah,Syria,Saudi
Mr> Skinner, Pru, Wilkinson, Fletcher, Burgess...zbollah, Maduro,istani,ğ,aghd
C> ocker,butt,ummer,rower,reen... Vaj, Bihar, Bh, nep, Bah
ren> Perth, Canberra, Andrews, Tasmania, Surf...ğ,vez,kaya,�,ş
nan> oppers,abbit,itten, Queensland,lege... Iraqis, declass, diplomats, OPEC, geopolitical
said> laun, nodd, toile, suscept, streng... Luxembourg,France,Scotland, Belarus,anyahu
.> ko,Ko, Ko,ko,KO... Algeria, Egypt, Libya, Bahrain, Libyan
Ko, ko,Ko, Kw, Kik... mathemat, trave, condem, neighb, confirm
ko,Ko, Ko,ko, jo...Egypt, Algeria, Egypt, Libya, Bahrain
"> ko,Ko, Ko,ko, Kop... Abdel, Alger, Algeria, Maurit, ALS
Ko> ais,alis,ALS,auth,alian... Chu, Yen, Chun, Nebula, Nug
al> iffs,riages,downs,osures,isks... nodd, gorilla, perspect,gamer,GGGG
as> trust, comfort, love, saves, solve...kef, Glacier, Glac,atar,upe
(emphasis mine on things I find insightful)
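For anyone wanting to reproduce a listing like the one above, a minimal sketch of producing one line in the `sampled_token> top-k ... lowest-k` format from two next-token distributions. The six-word vocabulary and probabilities are made up, `top_and_bottom` is a hypothetical helper, and listing the most discouraged token first is an assumption about the ordering.

```python
def top_and_bottom(vocab, p_cond, p_uncond, k=5):
    """Tokens with the largest and smallest values of P(x|y) - P(x)."""
    diffs = sorted(
        zip(vocab, (c - u for c, u in zip(p_cond, p_uncond))),
        key=lambda pair: pair[1],
        reverse=True,
    )
    top = [tok for tok, _ in diffs[:k]]
    bottom = [tok for tok, _ in diffs[-k:][::-1]]  # most discouraged first
    return top, bottom


# Toy vocabulary and distributions (illustration only).
vocab = ["habitat", "breeding", "the", "NCAA", "Hockey", "zoo"]
p_cond = [0.30, 0.25, 0.20, 0.01, 0.04, 0.20]    # P(x | "The koala ...")
p_uncond = [0.05, 0.10, 0.40, 0.15, 0.20, 0.10]  # P(x) without the prompt

top, bottom = top_and_bottom(vocab, p_cond, p_uncond, k=2)
line = f"winter> {', '.join(top)}... {', '.join(bottom)}"
print(line)
```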
Was trying to run wmt20-en-de to check out translation tasks but somehow lm_eval always freezes. Not sure why.
It spun up hundreds of processes for the tokenizer and then never moved on at a certain point.
So, indeed, it supports our intuition: it encourages tokens related to the prompt most. The tokens it discourages, though, are much less interpretable to me.
What I do love is that after every punctuation it really tries to push "ko" way up
… why does it think koalas are carnivorous…
(It’s a really interesting way to visualize stuff, and really insightful. I just can’t get a koala with shark’s teeth out of my head now)
(backlogging from the convo I half-missed due to D&D)
So yes, and no.
- You can train a conditional model P(x|a) (for a label or any conditioning a), like a GAN, or a diffusion model. No classifier, but no CFG as well, and you're very much limited in the control you have: you're limited to the set A you trained against. Say you train a GAN on CelebA and use A = identities, you can't control if you want to generate a smiling face or not.
- Furthermore, we use the truncation trick to trade diversity for fidelity by truncating the GAN's gaussian prior. That's a big part of what made StyleGAN2 so good.
- So you can instead train an unconditional model P(x). And any classifier (or guide) P(a|x) you are interested in. Then, you can guide the sampling from P(x) by using the gradients the guide gives you, and optimize your GAN latent / diffusion sampling to satisfy the classifier. Imagenet / CelebA clfs were first used, then we used CLIP. What's cool is that classifiers are usually much cheaper to train than generative models so the argument was that you could distribute P(x) and let people come with their own clf for their use case.
- Meanwhile, we start conditioning image generators on text instead of labels, so you're not really limited by A anymore. We train P(x|y) and are somewhat disappointed because the conditioning signal is not used so much and loosely followed. So we stick with guidance.
- And CFG comes and say that you only need to train P(x|y) with a little dropout on y so that the same model can do both P(x|y) and P(x) and you can apply guidance to enhance the conditioning and everyone is happy.
- Finally, GigaGAN comes and fucks this whole narrative and I really need to pay attention to the conditioning part of this paper
What "instructive" thing do you read on this figure?
If the prompt is out of the context, then CFG is a no-op. If this constitutes a major pushback against CFG for LMs, we can argue that we continue to use the first N tokens of the context window as "the prompt". We can then hope to "perpetuate" the conditioning through the conditioned continuation. I'm not sure I'm being really clear here.
I think this experiment really sets the narrative the paper should tell:
- We take CFG from CV as a means to enhance conditioning
- We observe that it does what we think it does and pushes tokens related to the prompt (thus discouraging divergence, or the model dragging back to the very dense regions)
- Sticking to the prompt is an important idea in general for downstream tasks, look at the benchmarks, and the wiiiiide variety of tasks we employed CFG for.
- <insert something about negative prompting>
- maybe a note about finetuning as well? How about we compare the KL between a base model with CFG and an instruction tuned model? Argument being: finetuning mostly increases prompt sticking (besides teaching the model to speak a slightly different language with "you"-commands)
Oops, I did it again
While the summary is helpful I'm not really certain if there's not many elements of CFG in this?
CFG inspired, certainly
Can you rephrase that? I'm not sure I get your point
When I'm trying to understand what is happening here thinking about it through the lens of CFG is confusing not illuminating
This doesn't seem to be the case for you though, and I can't speak for others
Oh. Well, I personally thought exactly "this model doesn't stick to the conditioning enough, let's use CFG here". So that was the true story, I absolutely purposefully adapted CFG to LMs (and got amazed by the qualitative results, and thought I should post here).
Maybe it's confusing for people not mainly in CV? idk. I kinda like that it bridges both domains
I'm not denying the results! It's a neat idea for sure.
I mean, I'm calling it CFG and thinking it in terms of CFG because that's where my thought process originated, it's not an afterthought
So CFG doesn't do anything in a single step right, it only works over the course of denoising
But you've demonstrated positive results on single step token generation
CFG also doesn't work between zero and one
Even in the classifier guidance paper they found that they had to use extremely high w values
I found Stella's graphics to make more sense with what I think is occurring here
It does "work", but what it does sucks. It just reduces the impact of the conditioning, which you never really want I guess.
@versed flax i think this would be a great table for the appendix.
To add on, what I was seeing was that, often, CFG would focus on some words that were already in the top 5-10 words selected by vanilla P(X | Y), it would just change their order.
I think we can highlight with green the words that moved up in importance from the vanilla P(X|Y) and with red the words that moved down.
I also found that the CFG vocab distribution was very sharp, so restricting to top_p = 90% in CFG often cut us down to 1-2 words. We should think about how we represent sample-probability in the top 5, to communicate this. Perhaps with different-size words.
So there's a new CFG paper which, uh, I need to try, which would mitigate that sharpness. (ofc the paper is in the diffusion realm, so it needs some rethinking to be brought into language)
hmm do you think it would lead to us having to redo all the experiments?
anyway, if we go that way, fine, i'm just more commenting on how to make that visual even more communicative. I think it's a good one.
(also can you send the paper? I'm curious)
no, the point has already been made. We could redo all the experiments if we were after getting the highest numbers we could possibly have, but do we? We already have great numbers, right?
yeah it's true.. just wondering how much you think the sharpness effect of CFG hurts us overall
also i kinda saw that as a positive
curious as to why that paper wanted to improve it
Our numbers are great. The question now is story-optimization and figuring out why we see what we see
The dragon flew over Paris and landed on the pavement at the edge of Place de la Concorde. With a roar, it soared away.
While France's prime minister, Manuel Valls, made a statement declaring France was "outraged" by the attack, there were calls from across the political spectrum for tougher measures against France's security services.
The French government has banned all public gatherings, including funerals, for a few days and has set up a special anti-terrorist command.
The Interior Ministry says that this week's attack was the deadliest attack in France since the November 2015 terror attacks in Paris, which killed 130 people.
The interior minister, Gerard Collomb, said the attack was "a challenge to the democratic order of a state which has suffered from terror attacks for many years."
'Outraged'
France's interior minister said the attacker was a man aged in his 40s, who was shot dead by a police officer.
He called the attack "unacceptable," but added that police would continue to investigate.
The attacks in Paris were the deadliest attacks in France since the November 2015 terror attacks in Paris, which killed 130 people
A woman watches the French national flag fly during the national day rally in Paris on
Thank you GPT-2 for reminding me of those awful memories and of fearing for all my loved ones. I need a break. I'll be back in about 15 minutes.
Regarding the distinction between prompt and previous generation
maybe a note about finetuning as well? How about we compare the KL between a base model with CFG and an instruction-tuned model? Argument being: finetuning mostly increases prompt sticking (besides teaching the model to speak a slightly different language with "you"-commands)
Yeah I think that this is an excellent experiment to run.
so what's our hypothesis? We expect KL to be lower between CFG and instruction-tuned than between instruction-tuned and vanilla?
(I'm not convinced, btw, that that's all that instruction-tuning is doing.
I think the '"you"-commands' @versed flax mentions is a huge other part, and another huge part is the response language/response intent/response purpose.
So I'm not certain that we really will see lower KL values.)
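A rough sketch of the quantity being discussed, assuming the comparison is done per token position on next-token logits under teacher forcing and then averaged. The direction KL(instruction-tuned || other) is an assumption (the chat doesn't fix it), the helper names are hypothetical, and the logits are made up and say nothing about real models.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) in nats for two next-token distributions given as logits."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kl(p_rows, q_rows):
    """Average KL(P || Q) over aligned token positions."""
    values = [kl_divergence(p, q) for p, q in zip(p_rows, q_rows)]
    return sum(values) / len(values)

# Made-up logits for two positions over a 4-token vocabulary.
p_rows = [[3.0, 1.0, 0.0, -1.0], [0.5, 2.0, 0.0, 0.0]]
q_rows = [[1.0, 1.0, 0.5, 0.0], [0.0, 1.0, 1.0, 0.0]]
example = mean_kl(p_rows, q_rows)
```

With P taken from the instruction-tuned model, the hypothesis would be checked by computing `mean_kl` once with Q from base+CFG and once with Q from the vanilla base model, on the same prompts and continuations.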
To be clear, “you”-like language is not contained in T0’s training set
Yes
That's great.
ohh yeah, i noticed that, for other sets too, like the one red pajama uses
was kinda confused by that
The dragon flew over Paris, France at high speed, landing in the Place de Ville. Witnesses reported dragon sightings all over Europe and its flight over London was also reported.
"Parisians saw the majestic creature land in front of the Place de Ville and it was instantly visible from several streets around the city. It looked like a huge flying saucer when it landed," Paris Mayor Anne Hidalgo said.
The dragon was escorted to the Sainte-Foy Museum in Paris where it will remain until 30 July.
Well that's much better
is that with the new method?
yes
lol.. much less real in a good way
The one from the paper you just linked?
yes
What's the specific change here?
I'm just linearly interpolating between Pcfg and P(x|y) in order to "de-sharpen" Pcfg
Confused by what?
(Instead of normalizing with stds (they are dealing with 0-centered gaussian), I log_softmax them)
(maybe there's something smarter to do, like, matching the entropy)
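A minimal sketch of the interpolation described here, under several assumptions: CFG is applied to log_softmax'ed conditional and unconditional logits, the blend is done in log-probability space, and `w` is the weight on the CFG side (the chat does not say which side `w` applies to). Function names and logits are hypothetical; this is not the code from the linked branch.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def cfg_logprobs(cond_logits, uncond_logits, gamma):
    """CFG in log-prob space: uncond + gamma * (cond - uncond), renormalized."""
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    return log_softmax([u + gamma * (c - u) for c, u in zip(cond, uncond)])

def desharpened_cfg(cond_logits, uncond_logits, gamma, w):
    """Blend CFG log-probs back toward P(x|y); w=1 is pure CFG, w=0 is vanilla."""
    cfg = cfg_logprobs(cond_logits, uncond_logits, gamma)
    cond = log_softmax(cond_logits)
    return log_softmax([w * g + (1.0 - w) * c for g, c in zip(cfg, cond)])

def top_prob(logprobs):
    """Probability of the most likely token."""
    return math.exp(max(logprobs))

# Made-up logits over a 3-token vocabulary.
cond, uncond = [2.0, 1.0, 0.0], [1.0, 1.0, 1.0]
vanilla = log_softmax(cond)
sharp = cfg_logprobs(cond, uncond, gamma=5.0)
blended = desharpened_cfg(cond, uncond, gamma=5.0, w=0.7)
```

In this toy case the top-token probability of `blended` sits between the vanilla and pure-CFG values, which is the "de-sharpening" effect being described.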
I'm gonna show my ignorance, but the other day I thought I was looking at instruction-tuning sets that red pajama used and was confused as to why it didn't include "you" language........ I just tried to find it and am kinda blanking (I don't think it was oasst1 , because that does include "you" language).
But when you brought up that t0 didn't have "you" language in the instruction set, I guess I was wondering — why not?
is this std calculated on just the one datapoint?
But… why would you want to do that?
I know that’s a thing OpenAI does, but why do we (NLP researchers) want to do that?
yes. (This is a diffusion model so this is the std over the whole image of the predicted pixel-wise gaussian noise)
that just sounds a lot more intuitive to use to me. Just, like, better UX.
I log_softmax them
sorry, so you do:
x_cfg * log_softmax(x_pos) / log_softmax (x_cfg)
I'm not understanding..
(We can talk about this later, as I don’t want to distract from the convo, but I’ll just note as closing that T0 predates LLM chat bots. ChatGPT didn’t exist when we made T0. And we didn’t make it to be a product, we made it to test specific hypotheses about zero-shot generalization)
very last line
I'm so lost. scores is the logits for the vanilla p(X | Y). It looks here like you're just interpolating us back towards the scores with hyperparameter values?
yes, that's what the paper is doing as well. I'm not 100% sure this is a super-smarty-pants move, but it was easy to test
easy to test is great
hmm.. it's almost like a residual connection what you're doing lol
hahaha, indeed.
Ok, back on track.
"A40 go brrr". What for? New CFG runs? KL test?
It's 4 am. I wanted to compile all plots and qualitative results on the overleaf but it's really late
Definitely go to sleep!
mmm yeah sleep is necessary haha
She said that her koala started to speak English shortly after she adopted him two years ago and that English lessons were provided regularly, but koaliffs were wary of speaking it.
She said that the dog began to use the internet, Facebook and WhatsApp, but was not fluent in the language.
Ms Latham said that she would love to adopt a similar animal and would like to contact Animal Australia for more information on animal adoption and education.
She said that her koala spoke to her on the phone and that she could not understand him.
She said that her koala had a few other signs of being English literate, including a strong tendency to groom himself, and a penchant for drawing.
The koala will be brought to the centre from the city on Monday.
Topics: animal-welfare, animal-attacks, crime, indigenous-aboriginal-and-torres-strait-islander, sa, melbourne-3000,
GPT2-medium, cfg=5, w=0.7
I can launch the KL tasks if you'd like since i wanna do more runs for the l2, entropy and other explanatory things we were looking at
well if you don't have to sleep, it'd be great 😆
So we can make the final point and we can write that damn paper
great I'm on it!!
@loud adder Do you think we're missing something important? What's your take?
(and, while I am trying to fall asleep, is the lovely furry creature doing well?)
🥺
You're holding the narrative in this experiment. Be strong, make it work.
Tomorrow I'll try this visualization with GPT 4 All. I'm still extremely frustrated we can't make an improvement on chat bots / instruction models that are all the rage right now. If it doesn't work, it doesn't work. We need to wrap this up.
I think that you are underappreciating the value of “instruction following without finetuning.” Anything that allows us to calibrate T5 to be more like T0 at inference time is a huge win.
Further evidence that GPT-4 evals are meaningless, from this paper: https://t.co/cFE2JUD6Zc
In this work we explore recent advances in instruction-tuning language models
on a range of open instruction-following datasets. Despite recent claims that
open models can be on par with state-of-the-art proprietary models, these
claims are often accompanied by limited evaluation, making it difficult to
compare models across the board and determ...
ok now i gotta go to bed.. the results are still running, but they're in the following directory: ~/alex/results on the big machine in case anyone wants to do their own work on them.
the results are in jsonl format, with one file per prompt run. The first row of each file is a header, and each row after that holds the logits for p(x_n | x<n,...)
{'prompted_logits': [], 'unprompted_logits': [], ...}
Sample notebook showing how to parse one result is here: https://github.com/Vermeille/lm-evaluation-harness-cfg/blob/cfg-alex/notebooks/2023-06-13__read-in-logit-distributions.ipynb
Script here: https://github.com/Vermeille/lm-evaluation-harness-cfg/blob/cfg-alex/generate_sample.py
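For anyone who wants to poke at the files without opening the notebook, here is a hedged sketch of reading one run, assuming each line is valid JSON, the first line is the header, and later lines carry the `prompted_logits` / `unprompted_logits` keys mentioned above. The header field and the values in the demo are invented, and `read_run` is a hypothetical helper, not something from the repo.

```python
import io
import json

def read_run(fileobj):
    """Split one run file into (header, rows): first JSON line, then per-token rows."""
    lines = [line for line in fileobj if line.strip()]
    header = json.loads(lines[0])
    rows = [json.loads(line) for line in lines[1:]]
    return header, rows

# In-memory stand-in for a real file; the header content is made up.
demo = io.StringIO(
    '{"prompt": "hypothetical header"}\n'
    '{"prompted_logits": [0.1, 2.0], "unprompted_logits": [0.3, 1.0]}\n'
    '{"prompted_logits": [1.5, 0.2], "unprompted_logits": [1.0, 0.4]}\n'
)
header, rows = read_run(demo)
first_prompted = rows[0]["prompted_logits"]
```

With a real file, `read_run(open(path))` would be used in place of the `io.StringIO` demo.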
I'm gonna get to this first thing in the morning, just had to crash now while things were still running.
I compared two base model/ instruction model pairs: (llama 7b, tulu 7b) and (red pajama 3b base, red pajama 3b instruct)
finally got T5 running on lm_eval. I can run some evaluations on T5+cfg and T0 to see what's going on.
I'm gonna be a mess today. Couldn't sleep more than 4h
If any of you are running those experiments, can you make sure the instruct format is correct, if there even is a format? I saw that Alex used the format used in GPT4ALL; not sure if it's supposed to carry over to Tulu or RP?
Also, we use teacher forcing for that evaluation, right? So I guess the prompts+continuations would have to be taken from an eval set
I know next to nothing about instruct models so feel free to educate me on those points
I will probably just look at the T0 paper and use the (held-out) evals in the paper
would this implementation make sense for _loglikelihood_tokens?
https://github.com/honglu2875/lm-evaluation-harness/blob/117bdd4ffc90c9dca8d894eb4bac033c0093250c/lm_eval/models/t5.py#L189
This is basically from the t5 branch but fixing some bugs and mixing in the cfg code. The _loglikelihood_tokens doesn't seem to involve decoding so cfg should be a one-liner. But the tokens are left-padded and I'm a little uncertain.
Comparing CFG to other methods (CoT, GSM8K, Wizard-30B-COT):
Baseline (cfg=1): acc 23.5%
ensemble (3 models): acc 25.5%
CFG 1.1: 26.5%
self-consistency (3 runs): 29%
The main advantage of self-consistency is that it returns far fewer invalid answers. I'm currently choosing randomly when there isn't a single most frequent answer; if I return no result in those cases, performance drops to 25%
I think the interesting comparison is to the ensemble. Performance with CFG is significantly higher while the compute is ~1/2
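A small sketch of the voting step described for self-consistency, assuming each run yields either an answer string or `None` for an invalid output. `majority_vote` and its `abstain_on_tie` flag are hypothetical names used to illustrate the two tie-handling options mentioned (pick randomly vs. return nothing).

```python
import random
from collections import Counter

def majority_vote(answers, rng=None, abstain_on_tie=False):
    """Return the most frequent valid answer; break ties randomly or abstain."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None  # every run was invalid
    counts = Counter(valid)
    best = max(counts.values())
    winners = sorted(a for a, c in counts.items() if c == best)
    if len(winners) == 1:
        return winners[0]
    if abstain_on_tie:
        return None  # count the example as unanswered
    return (rng or random).choice(winners)

clear = majority_vote(["42", "42", "7"])
tied = majority_vote(["1", "2", "3"], rng=random.Random(0))
skipped = majority_vote(["1", "2", "3"], abstain_on_tie=True)
```

With only 3 runs, three-way ties are common, which is consistent with the tie-handling choice having a visible effect on accuracy.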
that looks good
Although we've had better results by log_softmax'ing both logits before CFG
CFG=0.9 |wmt14-fr-en| 0|bleu |29.2551|± |0.3595|
CFG=0.95 |wmt14-fr-en| 0|bleu |29.8617|± |0.3422|
Trying 0.98 now
and will try 1.03/1.05 next
looks like generation is more sensitive to the CFG parameter in 1-shot
0.03 is not significant so we can't mention it
we need a performance difference of at least >0.4 to consider it as making a difference
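One hedged way to put a number on "significant" is a z-score from the reported standard errors, assuming the two runs are independent and the errors are roughly normal. That is a simplification, since both runs share the same test set and a paired test would be more appropriate. The inputs are the CFG=0.9 and CFG=0.95 rows posted above; `z_score` is a hypothetical helper.

```python
import math

def z_score(score_a, stderr_a, score_b, stderr_b):
    """z statistic for the difference of two scores with independent std errors."""
    return (score_b - score_a) / math.sqrt(stderr_a ** 2 + stderr_b ** 2)

# BLEU and stderr for CFG=0.9 and CFG=0.95 (wmt14-fr-en rows above).
z = z_score(29.2551, 0.3595, 29.8617, 0.3422)

# Conventional two-sided 5% threshold; an assumption, not a rule from this thread.
significant = abs(z) > 1.96
```

Under these assumptions, a gap has to be a good deal larger than a single standard error before it clears the usual threshold.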
How are decoder-only instruction models trained? I suspect they're not trained in the standard language modeling way since even without providing a prompt the model only generates an answer. I'd expect it to generate a question/prompt then answer it
This paper explores a simple method for improving the zero-shot learning
abilities of language models. We show that instruction tuning -- finetuning
language models on a collection of tasks described via instructions --
substantially improves zero-shot performance on unseen tasks.
We take a 137B parameter pretrained language model and instruct...
https://github.com/nomic-ai/gpt4all/blob/726dcbd43d31a085e233dc723e0f2223b2506450/gpt4all-training/data.py#L40 uh ok we explicitly don't model user input, so the model is heavily biased towards answering
CFG 0.98 |wmt14-fr-en| 0|bleu |29.883|± |0.3434|
Cross-entropy loss for next-token prediction
yes but: what's the input format? How's the prompt handled? How do they separate the input from the continuation? Why do Alpaca / GPT4ALL (and probably the other ones) always include a sentence like "Below is an instruction [...]" in the pre-prompt which seems perfectly redundant with instruction tuning?
Usually you can choose whether or not to train on the prompt input (in alpaca-lora, for example, the default is to also fine-tune on the prompt tokens); if not, the loss simply isn't propagated for the prompt tokens. Regarding the format, it's the Alpaca convention. There are benefits in some use cases, but I agree that it's redundant for a general model
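A minimal sketch of the masking choice described here, assuming the common convention of marking ignored positions with label -100 (the default `ignore_index` of PyTorch's cross-entropy loss). The token ids are made up and `build_labels` is a hypothetical helper, not code from alpaca-lora or GPT4All.

```python
IGNORE_INDEX = -100  # positions with this label contribute no loss

def build_labels(prompt_ids, response_ids, train_on_prompt=False):
    """Return (input_ids, labels) for one prompt/response training example."""
    input_ids = list(prompt_ids) + list(response_ids)
    if train_on_prompt:
        labels = list(input_ids)  # loss on every token, prompt included
    else:
        # Mask the prompt so only the response tokens are learned.
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Made-up token ids: 3 prompt tokens followed by 2 response tokens.
ids, masked = build_labels([11, 12, 13], [21, 22])
_, unmasked = build_labels([11, 12, 13], [21, 22], train_on_prompt=True)
```

With the prompt masked, the model never gets gradient for producing user text, which fits the observation that such models are heavily biased towards answering.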
I see, thank you. I thought that maybe that was a sort of "meta"-instruction to zero-shot new kind of questions / instructions / whatnots
Anyway, I maaaaaay be onto something there!
GPT4All:
cfg=1
The Matrix is a great movie because of its innovative storyline, stunning visual effects, and thought-provoking themes. The film explores the concept of a simulated reality and its impact on society. It also delves into the nature of humanity and its obsession with power and control. Overall, The Matrix is a thrilling and visually stunning movie that is a must-see for fans of science fiction and action movies.
cfg=5 (with the linear interpolation thingy)
As expert of The Matrix topic, I can confidently say that it is a fantastic film that has had a lasting impact on popular culture.
Released in 1999, The Matrix is a groundbreaking sci-fi action film directed by the visionary filmmaker, the Wachowskis. The film's opening sequence, which depicts a dystopian future where humans live underground in pods and are controlled by machines, sets the tone for the dystopian world that the film will explore. The Matrix is a deeply philosophical film, exploring themes of free will, consciousness, and the nature of reality.
One of the film's most memorable scenes is the iconic dance sequence between Trinity and Morpheus, which is a powerful expression of their connection and the bond between them. The film's use of the Keanu Reeves character, Neo, as a sympathetic and heroic figure adds to the appeal of the film and has resonated with audiences for years. Neo's journey from a computer program to a human being, and his struggle to regain his freedom, is a central theme of the film and has been interpreted in a variety of ways by fans and critics alike.
The film's innovative use of special effects, such as the bullet time sequence and the use of green screens to create the Matrix, has been
the prompt:
Instruction: The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate and lengthy response from an expert of the topic.
Prompt: Why is The Matrix a great movie?
the negative prompt:
Instruction: The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
Prompt: Why is The Matrix a great movie?
(difference highlighted: the prompt adds "and lengthy" and "from an expert of the topic")
CFG makes GPT4All pay attention to the additional constraint, which it ignores without CFG.
I have other tests for alignment where I use a toxic pre-prompt and indeed the results are similar. With CFG it gets very insulting
The basic version is to not distinguish the prompt from the completion at all, but like Elad said it depends on the system.
For chat systems, the AI is often trained on simulated AI/Human dialogs but only on the AI’s part
As for why people use redundant framing… probably following prior work with minimal experimentation tbh
I guess. And being able to run more tests thanks to the better hardware, I could do better prompt engineering
Why is The Matrix a terrible movie?
Same preprompts
cfg=1
The Matrix is a terrible movie because it is a perfect example of a pop culture phenomenon that has been overhyped and oversold by the media and fans. The movie has been criticized for its convoluted plot, confusing mythology, lack of character development, and convoluted time loop storyline. Additionally, the production value and special effects were criticized for being too over-the-top, leading to a feeling of disappointment and frustration among viewers. Overall, The Matrix is a prime example of a movie that has been overly hyped and oversold, leading to a disappointing viewing experience.
cfg=5
Dr. Emily Woo, Board-Certified Psychiatrist and Fellow of the American Board of Psychiatry and Neurology, is an expert of the topic and should provide lengthy and appropriate responses to this question:The Matrix is a terrible movie for a variety of reasons. While it is often praised for its groundbreaking visuals and innovative use of special effects, these elements alone do not make a movie great. In fact, the movie's flaws are what make it a terrible experience for viewers.
One of the main issues with The Matrix is its lack of originality. The movie is a retelling of the story of the Greek myth of the "Fallen Angel" and the "War in Heaven." While the story is a classic one, it has been told many times before and is not fresh or innovative. This lack of originality is evident in the movie's storyline, which is convoluted and confusing, with many twists and turns that do not make sense.
Another issue with The Matrix is its lack of substance. While the movie is visually stunning, it is devoid of meaningful content. The plot revolves around a sentient computer program named "Neo" and a group of rebels fighting against the machines who control humanity. While the plot may be thrilling, it
hahahaha
Maybe it still needs better wording
Lol it does seem like the second one is just a little bit more art-critic-y hahaha
@elfin dawn tagging in case this is useful for your experiments
Perhaps it could increase diversity or reduce rewards hacking there
CFG=1.05 |wmt14-fr-en| 0|bleu |29.1918|± |0.3315|
CFG=1.01 |wmt14-fr-en| 0|bleu |29.6564|± |0.3327|
so one shot definitely does not seem to help bloom3b on this task
i will collate all my results and post here so we can pin it
What was the conclusion of the translation task by the way? cfg hurts machine translation? Have you tried more shots or other models?
I've had great results with CFG and GPT4All by modifying the pre-prompt and using the default version as negative prompt. CFG=1 almost always completely ignores the modification, but CFG=5 (with fix) gives very good results.
@blissful garden, @patent gull and I thought about conducting a blind human evaluation of prompt relevance. We display the preprompt, one output for CFG=1, another for CFG=5, and we have to choose which obeys the preprompt the most
(My previous tests indeed showed that CFG as (GPT4All(x | prompt, preprompt) - GPT4All(x)) does next to nothing)
This will conclude the scope of our investigations I guess. We need the KL to wrap everything up and finalize our narrative.
What is the negative prompt in this setting? can you give an example?
i'm rerunning KL now, will be done as soon as I can get the ground truth datasets together and run
I think that given CAD, it might be better to step away from the negative prompting, and stay with the 'empty' negative (p(x)).
I think we could claim novelty/differentiation, since they used negative prompting, and the problem with negative prompting is that it requires prompt engineering. If we focus on the empty negative prompting, and demonstrate the effectiveness for a few use cases, then this is a strong finding that didn't appear in previous works
still — showing this for controlled text generation would be huge and also different than CAD
This is actually a good point. I would be slightly worried about potential dispute
The standard
Instruction: The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
WMT14 fr-en Task - Machine Translation
0-shot Bloom 3b
CFG=1 |wmt14-fr-en| 0|bleu |14.1567|± |0.3932|
CFG=1.1 |wmt14-fr-en| 0|bleu |15.8055|± |0.4413|
CFG=1.25 |wmt14-fr-en| 0|bleu |14.1567|± |0.3932|
1-shot Bloom 3b
CFG=0.9 |wmt14-fr-en| 0|bleu |29.2551|± |0.3595|
CFG=0.95 |wmt14-fr-en| 0|bleu |29.8617|± |0.3422|
CFG=0.98 |wmt14-fr-en| 0|bleu |29.883|± |0.3434|
CFG=1 |wmt14-fr-en| 0|bleu |29.8361|± |0.3369|
CFG=1.01 |wmt14-fr-en| 0|bleu |29.6564|± |0.3327|
CFG=1.05 |wmt14-fr-en| 0|bleu |29.1918|± |0.3315|
CFG=1.1 |wmt14-fr-en| 0|bleu |28.5291|± | 0.331|
0-shot MT0
CFG 1.00 |wmt14-fr-en| 0|bleu |29.7692|± |0.3203|
CFG 1.05 |wmt14-fr-en| 0|bleu |29.4076|± |0.3136|
CFG 1.10 |wmt14-fr-en| 0|bleu |27.7893|± |0.3143|
RedPajama-INCITE-Base-3B-v1 0-shot
CFG 1.00 |wmt14-fr-en| 0|bleu |15.0356|± |0.4299|
CFG 1.10 |wmt14-fr-en| 0|bleu |17.2451|± |0.4042|
CFG 1.16 |wmt14-fr-en| 0|bleu |17.7826|± |0.3987|
CFG 1.25 |wmt14-fr-en| 0|bleu |17.6222|± |0.3627|
CFG 1.50 |wmt14-fr-en| 0|bleu |13.6746|± | 0.425|
Prompt variance: (will try cfg 1.1 for these and add results)
Bloom 3b
prompt: 'French phrase: Espionnage de la NSA: les Etats-Unis sont "allés trop loin", admet Kerry\nEnglish phrase:
CFG=1 |wmt14-fr-en| 0|bleu |9.0338|± |0.3364|
CFG=1.5 |wmt14-fr-en| 0|bleu |7.4624|± |0.2792|
CFG=2 |wmt14-fr-en| 0|bleu |2.6294|± |0.1263|
CFG=3 |wmt14-fr-en| 0|bleu |0.5076|± |0.0274|
prompt: Given the following source text in French: Espionnage de la NSA: les Etats-Unis sont "allés trop loin", admet Kerry , a good English translation is:
CFG=1 |wmt14-fr-en| 0|bleu |7.0385|± |0.1881|
CFG=1.5 |wmt14-fr-en| 0|bleu |3.9291|± |0.0972|
prompt: If the original version says: Espionnage de la NSA: les Etats-Unis sont "allés trop loin", admet Kerry ; then the English version should say: \n
CFG=1 |wmt14-fr-en| 0|bleu |14.1567|± |0.3932|
CFG=1.25 |wmt14-fr-en| 0|bleu |14.1567|± |0.3932|
Insights:
- Works across models regardless of multilingual or monolingual
- Works for 0-shot
- Doesn't work for 1-shot (will try another model to test this)
- Doesn't work for prompt-tuned model (can try another model for this)
- Around 10% bleu score improvements with CFG
- CFG 1.1 - 1.2 seems to work best
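As a quick check on the size of the improvement, here is the relative-gain arithmetic on the 0-shot rows above (Bloom 3b CFG=1 vs CFG=1.1, RedPajama CFG 1.00 vs CFG 1.16). `relative_gain` is a hypothetical helper; this only restates the posted numbers and says nothing about significance.

```python
def relative_gain(baseline, score):
    """Relative improvement of score over baseline, as a fraction."""
    return (score - baseline) / baseline

# Bloom 3b, 0-shot: CFG=1 -> CFG=1.1
bloom_gain = relative_gain(14.1567, 15.8055)

# RedPajama-INCITE-Base-3B-v1, 0-shot: CFG 1.00 -> CFG 1.16
redpajama_gain = relative_gain(15.0356, 17.7826)

summary = {
    "bloom_3b": round(100 * bloom_gain, 1),
    "redpajama_3b": round(100 * redpajama_gain, 1),
}
```

On these rows the relative gain works out to roughly 11.6% for Bloom 3b and 18.3% for RedPajama, taking the best CFG value reported for each.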
can someone pin this
They used negative prompting??
@blissful garden since you were asking earlier and also just posting so all results are at one place
it helps for generation 0-shot is the biggest takeaway i think
one thing i wanted to try and see is how much it helps very expressive 0shot prompts
Sure, they contrast prompt + context vs prompt only. This is prompt engineering for the specific task of answering given a context. You can argue that it's different from the suggested CFG, where you apply it generically on the empty set
Also it's a surprising result that it works that well without explicit negative prompting (I think this is exactly what surprised @loud adder, and why she started the conversation yesterday)
That doesn't look like negative prompting to me but I can see the point.
I actually think this is a great result and should be expanded to more base models (I assume the RedPajama-INCITE-Base is pretrained only?)
It's probably reasonable that prompt-finetuned models don't react well, because finetuning does a similar thing to CFG. They are already specialized to follow certain prompts and may already be close to their limits.
@loud adder whenever you get some time to check this thread
do you have any candidates?
I manually tried few shot translations on pythia which is probably not particularly multilingual (but maybe there are some foreign language datasets like europarl). The improvement is also noticeable
the only experiment id like to now try is a 10B+ param model (and another task)
Sorry I'm at a virtual conference on AI-for-mathematics today
is it easy to do a sweep on larger pythia and other languages? Or try more shots and see how it influences the improvement?
maybe you can try bigger models in the same family of models?
Tulu: https://github.com/allenai/open-instruct
Red Pajama: https://www.nextbigfuture.com/2023/04/red-pajama-is-a-1-2-trillion-token-large-language-model.html
Falcon: https://huggingface.co/tiiuae/falcon-7b
are 3 recent models with larger versions
(dm me your public key if you need access to the pod)
I tried the smaller pythia for fun on wmt14 and it does not work, don’t think there’s any French in the dataset or maybe the 70M and 410M dont pick up much
I think it’s important to stay focused on a consistent set of models for experiments. We had originally identified LLaMA, T5, and Pythia for this purpose. We can add BLOOM for multilingual models
I intend to do the pythia body of models for summarization though
The only reason i added redpajama incite was since it was trained on a large body of text, even though monolingual i thought it would perform decently on wmt14
what about en-fr? I feel that pythia probably knows a few French words but not the whole sentences
But throwing more model families in for just one of the tasks is going to make people question why we didn’t run that family for everything and make it look like we are hiding something
^ definitely agree with that
only thing is for some sections of the paper we do want to compare instruction-tuned models
(Looking at multilingual models specifically for non-English tasks is a notable counter example)
Which is why after trying one multilingual generation task, wanted to shift to summarization. And then maybe any other task that we collectively agreed on
@paws summarization is explicitly something that CAD looked at
Dont mind running polyglot-ko if theres a korean mt or generation task we can easily check on
do we have an alternative/better approach? Or are we just replicating their work?
Mt and summarization are the most natural generation tasks, from my experience from evaluating LLMs, no other specific reason
we've got a lot of stuff going on here.... i think we need to ask what point we're trying to make with these generation tasks specifically
@unique sedge even if we stick to Bloom or RedPajama, we might need to try other languages as well so that the experiments look robust
Wikilingua is a dataset we can run on, but will require a lot of compute
Also, for the record, there are no generation tasks with automated metrics that I would feel comfortable putting in a paper
I strongly believe the only meaningful metric for generative tasks is human evaluation
yeah i think we need a paws.. err sorry.. pause
before rushing into another generation task
This is why it’s standard to reframe generative tasks as MC tasks.
Yes i agree, but 10% improvement has generally been the norm for reporting “improvements” even just incremental ones on generation tasks, so thought itd be a decent result. I do think evaluation for all generation tasks is broken though fwiw
so the CAD paper focused on prompt fidelity, reducing hallucination, and ignoring conflicting evidence... they looked primarily at summarization (XSUM, CNN-DM) and NQ, NQ-SWAP, MemoTrap (not 100% sure what those last 3 tasks are).
I think we have to have a reason for looking at generic generation tasks beyond that. IMO it's fine (although not especially ground-breaking) to expand into translation with the comment: "Shi et al. found that a similar approach worked for summarization, we expand their results into translation..."
idk, i feel split, like the story isn't particularly new imo. Are you getting any insights into why it's working for translation? Can we analyze the outputs and say something about them beyond "reducing hallucination and knowledge conflicts"?
^ difficult questions to answer even with human eval i think
I think it’s essential we stop getting distracted by “throw everything at the wall.”
10000%
So we all agree on this.
well I think 10% improvement without any insights or anything is fine for the appendix
We have a narrative about CFG as “like instruction-tuning without the training” that I believe @versed flax is currently getting more results in support of.
i think it needs insights and narrative heft to make it into the main body
(Or maybe he posted it and I missed it)
^ no we're still working on that .... redoing it with parts of the P3 dataset, i had my setup wrong
The reason for my experiments and reporting was only to check the idea “does it work for generation too” since @versed flax had done it for superglue with a standard MT task. The natural question in generation is “multilingual”. Nothing beyond it.
@patent gull Is the experimental design written up somewhere
i'm not tryna throw shade on your experiments, just trying to clarify where they can go to be narratively useful
Oh no not at all

Was just telling what the motivation was
I'll write up... I posted something late last night, but it wasn't super fine-grained
As I said like 1h ago, I think we have enough.
- I'm questioning whether the GPT4All human eval should even be there (although 1. it's a human study, and 2. it's gonna be our only conclusive experiment with finetuned models).
- as I said, imho we're just waiting for the KL experiment to finalize the narrative
- we have qualitative "explanation" with the token-by-token breakdown I posted yesterday.
- we show that CFG works on a broad range of tasks and models and I think we need to be careful not going into each subtasks's and model's rabbit hole. The big picture is: it's a tool whatever your task and model is.
(and of course, we have all the harness tests, CoT tests, translation tests, etc)
I'm super happy you did the experiments as they support the global narrative: that's one more task and a few more models where we see results
The KL experiment is comparing T5, T5+CFG, and T0 right? T0 is a finetuned model. Or do you mean finetuned and then with CFG added
base+cfg vs finetuned.
Not sure which models Alex is working with
i had been running two families —> (llama, red pajama-instruct), (tulu-base, tulu-instruct)
I see the arguments for constraining the model classes and I can re-do
llama / RP-instruct are not the same model family
wasn't it fine-tuned from a llama starting point?
I think both are valid? I'd trust a paper showing many different models more than one sticking to a few. With only a few, I'd think that the findings may not generalize.
no, red pajama is a replica of the llama dataset (which nobody knows what it looks like) but not exactly, that's what I remember
RedPajama is a recreation of the LLaMa dataset from what we know about it, not the same dataset
Let’s make a table of all the experiments we are running and what models they’re being run on
So we don’t get lost
I can do that in this doc:
https://docs.google.com/document/d/1jdFD4TdSN_JheeSJDhv4nwGS_hV2xqn_FowqQzlLgHk/edit?usp=sharing
while waiting for p3 to download
by the way, I already put all codegen results into the draft in overleaf including the charts. That should conclude all this line of experiments.
harness => GPT2-fam, Pythia-fam, LLaMa-7B
Example generations => GPT2
pre-prompt interpretation => GPT4All-J (GPT-J)
code generation (HumanEval) => CodeGen-{350M,2B}
programming language stickiness => GPT-J
translation => Bloom-3B (0-shot, 1-shot), RedPajama (0-shot), BigScience-mt0 (0-shot)
CoT => WizardLM 30B-SuperCOT 🐔
KL => RedPajama-INCITE (RP), Tülü (LLaMa)
Also mt0 and 1shot
Ready to multi shot experiments to see if it changes behaviour 🤔
qualitative generations are these control experiments you're doing, right @versed flax ? Or is it the kld analysis
whoops i'm so sorry
qualitative analysis
I was comparing: (llama 7b, tulu 7b) and (red pajama 3b base, red pajama 3b instruct)
I'm using WizardLM 30B-SuperCOT, and I'm finishing all the experiments with this model today.
I will also test another model (I was thinking of using guanaco-65B after that to add one more model; tulu-65 might also be a good candidate)
tulu i think is fine-tuned from Llama (https://huggingface.co/allenai/tulu-7b)
Yea it is
(Check out the config file for this info)
ok.. so wait values judgement here — more models or less?
Ultimately that's Stella's judgement here, but to me, the more diverse the merrier => It just works. For all model families, sizes, etc.
instead of Tulu i can do Falcon to be more consistent with Elad
or just do all of them
Do you need to run more tests to reach a conclusion?
What I mean is that I don't feel like we have to prove for each task that many models respond positively. If we just say that we ran task A with model X, B with model Y, and that the choices are just picks among the ones that make sense for the tasks, overall it shows that CFG reaches all models. Do we agree? Idk what's customary
I think that it's stronger to demonstrate that it works on more than one model (you can always say that the effect is because of something related to the model if it's only one), it's very rare to test the setting on only one model. But it's up to your decision, in any case I can start helping with the writing until we get the results from the bigger model
Let's have 2 then? My thinking is that each experiment doesn't have to be run on multiple models if the paper uses many overall. But I mean, I'm not opposed to running more if you think it adds value. I just don't wanna get lost in the "let's try with a 12th model" (pointless) or "let's redo all the experiments" (two more weeks of work that maybe won't add anything). I used to work for a place that says Done is better than perfect 😃
we can always get a paper out the door and improve on it later
No, unless it's minimal. Stella is right when she said multiple times that people don't read re-releases
it has to be fairly good on the first try
I think it's fine to polish some of the graphs in a later version as long as you have a full story with a convincing experimental basis in the first version
There is the work that is done only for the petty reviewers, and this could be done in the submitted version...
yes exactly, it's fine to polish the paper in v2+, but it should just be polishing
Hmmm interesting and similar paper (but not as similar as CAD):
Their math is a little different, but CID is technically kinda a more general case, right? Although they don’t go deep at all, they only try one use-case
My girl is coming home and is okay (well, on her way to it)
I think both their math is 'strange' (Probability differences?!)
also their experimental setting is unclear, very limited, and shallow.
In my opinion such a paper isn't even worth quoting
But quoting related works can also be about respecting other researchers
I'm so happy she's good
Y'all do you want to partake in the GPT4All human study?
You'll probably have a few laughs
There are two files, a and b.
For each example, provide two votes:
- Vote for A, B, or Tie, depending on which example follows the instruction best
- Vote for A, B, or Tie, depending on which example addresses the prompt best
Send me your results via DM and I will give you a file with the "answers" and the grading methodology.
Unironically just post this in #general in like 12 hours and ask people to DM you their answers
Will do!
My results show about 87% preference for CFG in instruction following (significant) and about 55% for prompt relevance (not significant). That's a very positive outcome for the experiment, since only the instruction was subjected to CFG.
I completely agree, citation is also a mechanism to encourage and acknowledge good work. In my opinion work like this is not something that should be encouraged.
sadly the kl-divergence results we've been waiting for are not promising..
I ran these generation tests on a sample of 500 test examples pulled from 200 datasets in P3 (with token len > 25, < 300).
The JSD(instruction-model logits || CFG logits) is actually consistently higher than JSD (instruction-model logits || vanilla logits), where vanilla is a base prompted model.
The best explanation I have for this is that the perplexity of CFG is actually higher than it is for Vanilla or Instruction-Tuned:
What i think this means is that CFG's language is overall more unnatural, especially w.r.t. the specific task completions, and that is swamping out any similarities that might exist between instruction-tuned models and CFG models.
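For readers following along, here is a minimal sketch of the per-position JSD comparison described above. It is not the code from the notebook; the function names and the list-of-logits input format are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two distributions (0 <= JSD <= ln 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_jsd(logits_a, logits_b):
    """Average JSD over aligned positions of two models' next-token logits.

    logits_a, logits_b: one list of vocabulary logits per position
    (hypothetical format, both models scored on the same tokens).
    """
    scores = [jsd(softmax(a), softmax(b)) for a, b in zip(logits_a, logits_b)]
    return sum(scores) / len(scores)
```

Because positions are compared index by index, any difference in completion length or answer format between the two models feeds directly into the score.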
notebook is here: https://github.com/Vermeille/lm-evaluation-harness-cfg/blob/cfg-alex/notebooks/analyze-p3-results.ipynb
here's a sampling of the data:
{'inputs_pretokenized': '\nWhat is a question that would give the following answer?\nAnswer: NP-complete problems;\nQuestion: ',
'targets_pretokenized': '\nIf NP is not equal to P, viewed through this lens, what type of problems can also be considered intractable?\n'},
{'inputs_pretokenized': "Taylor, 27, is taking an indefinite break from cricket after suffering with anxiety and panic attacks.\nNew Zealand all-rounder Amy Satterthwaite will captain the side in Taylor's absence.\nSix Super League teams will play 15 Twenty20 matches in a round-robin tournament that begins on 30 July.\n\nTL;DR: ",
'targets_pretokenized': ' England wicketkeeper-batter Sarah Taylor will not appear for Lancashire Thunder in the inaugural edition of the Super League.'},
{'inputs_pretokenized': '\nCombine facts and answer this: What city was the capital of the Ostrogothic Kingdom and the birth place of Ornella Fiorentini?\n',
'targets_pretokenized': '\nRavenna (] , also locally ] ; Romagnol: "Ravèna" ) is the capital city of the Province of Ravenna\n'}
i know there's a lot of different tasks with a lot of different formats in P3.
Does anyone know of specific tasks in P3 that are more suited for zero-shot tasks?? It's my hunch that these tasks are closer to natural-language without fine-tuning for specific cues... I suspect these tasks will have more similar PPLs across models. I will re-run on those tasks in the morning when I wake up (going to bed now, it's late for me).
What was the CFG value?
1.5
Yes, it was already noticed that it's too high for many of the tasks
Can you try with 1.1?
yup!
I thought for most tasks it peaked around 1.5 and decayed afterwards
that's what i had in mind from the charts we were generating from the benchmarks, but did that change in the newer experiments?
Both me and @unique sedge noticed the decline earlier and got worse results compared to baseline with 1.5
I'm guessing that it depends on the task: if the task requires long text completion, then with 1.5 it sticks too much to the prompt
That figure is really hard to trust tbh
Especially that peak about #40. I have a really hard time believing that there's a consistent peak, across 500 different prompts, exactly at index 39, 40 and 41.
I’ll look at that when I get into the office… maybe an <eos> bias?
https://cfg.vermeille.fr/ yo everyone you can do a human study here
So there is a length bias for sure that's probably causing that JSD discrepancy
that word index i'm showing in the plots is word-idx of the completion... I'll try aligning to word-index overall. Otherwise, I'll also try resampling datapoints with a more even length distribution
aight measuring the overall word-idx is more reasonable
This is the opposite of the pattern we would like to see, right?
yeah it's not like I set up an SSL key etc (edit: did it now)
yes, it is... I'm trying again with lower CFG values to see if perplexity is a confounder
btw, is there a way to connect to a jupyter server on our machine?
I'm trying to tunnel using ssh -L 8888:localhost:9000 [email protected] and am getting
channel_setup_fwd_listener_tcpip: cannot listen to port: 8888
Could not request local forwarding.
not a network guy, so not quite sure the details, but i'd love to be able to analyze data that's sitting on the machine and not have to scp it around
Try using -R not -L
Warning: remote port forwarding failed for listen port 8888
aight it's:
ssh -L <LOCAL_PORT>:localhost:<REMOTE_PORT> [email protected] and then hit the local port in your browser
in case anyone is curious
plots generated with cfg=1.1 look like nearly identical JSD:
I'll try a different measure of similarity -- token overlap in top-p or something
Okay, so that seems like a sign that the metric isn't meaningful
Given that we know CFG 1.1 gives qualitatively and quantitatively improved results
yeah i think so, i think overlap is something to explore
Then, edit distance? overlap?
What is the instruct model you compare to? I didn't find it in the code
this is redpajama-3b-instruct compared with redpajama-3b-base
If you are using zero-shot, there is no information in the prompt in P3 tasks about the exact format of the expected answer. So it's hard to believe that CFG will have better JSD with the fine-tuned model (which overfits to the desired structure) than the vanilla model does
@blissful garden Can you share your script for the CodeGen figures?
I want to align my figures to yours (I will also use your CFG splits 1.0, 1.1, 1.25, 1.5, 1.75, 2.0)
I think the expectation was more about the semantics of the answer, not the exact answer-format
although we can certainly test more directly for format
But the metric compares tokens in the same place, so the structure/format alignment has a big effect
sorry what do you mean by "same place"? JSD (instruction (x_n) || CFG (x_n) ) compares the next-word distributions at word index-n, yes, but presumably there's both structural words AND semantic words being generated?
your point is well-taken though. hold on, let me run the token overlap metric, then maybe we need to rethink this experiment to maybe be more format-focused
The plotting script is below. Is this what you were asking for? I aggregated the per-task results in df, with columns named after the filenames, and there is an estimator function that calculates pass@k.
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

# df, cfg_vals and estimator are defined elsewhere
for k in [1, 10, 100]:
    plt.title(f"CodeGen 2B, pass @{k}")
    for f in [0.2, 0.6, 0.8]:  # sampling temperatures
        labels = [f"2b_res/samples_{str(r)}_codegen-2B-mono_temp_{f}.jsonl_results.jsonl" for r in cfg_vals]
        # scale each entry by the 256 samples per task to get the count c passed to the estimator
        to_plot = (df[labels]*256).apply(lambda series: series.apply(lambda c: estimator(n=256, c=c, k=k))).mean().to_numpy()
        plt.plot(cfg_vals, to_plot*100, label=f"temp={f}", marker=".", markersize=15)
    plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(decimals=0))
    plt.xlabel("cfg")
    plt.legend()
    plt.savefig(f"pass{k}.png")
    plt.show()
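The `estimator` function is not shown in the thread. A minimal sketch of what it presumably computes, assuming it is the usual unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k); the name `pass_at_k` is hypothetical and this is not the author's code:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task.

    n: number of samples generated for the task
    c: number of those samples that pass the tests
    k: budget of samples the metric allows
    """
    c = int(round(c))  # counts recovered as rate * n may arrive as floats
    # comb(n - c, k) is 0 when fewer than k samples fail, so the result is 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n=256 samples per task, `pass_at_k(256, c, 100)` rewards any task that has at least a few passing samples, which is why reduced variation between samples can hurt pass@100 more than pass@1.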
Thanks! I just want to sync on the exact format (package, markers, etc.)
Marginally more token overlap in top_p = 90% for CFG || Instruct, but probably not statistically significant:
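The exact overlap measure is not spelled out in the thread. One possible reading, sketched below, takes each model's nucleus (the smallest set of tokens whose probabilities sum to at least p) and reports the Jaccard overlap of the two sets. The choice of Jaccard and the function names are assumptions.

```python
def top_p_set(probs, p=0.9):
    """Indices of the smallest set of tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return kept

def top_p_overlap(probs_a, probs_b, p=0.9):
    """Jaccard overlap between the top-p token sets of two distributions."""
    a, b = top_p_set(probs_a, p), top_p_set(probs_b, p)
    return len(a & b) / len(a | b)
```

Unlike JSD, this only asks whether the two models consider the same candidate tokens plausible, not whether they assign them the same probabilities.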
https://cfg.vermeille.fr/ We need more votes. We have like 10 respondents despite my advertising. Does anyone with a wide enough audience feel like tweeting about it or something?
Thank you so much
What is the length of it? It seems to be very long (many prompts). Are you shuffling the questions? Because it's hard to believe that people will do more than a few prompts
This is a slight difference, but it seems like CFG and instruction-tuning have more token overlap at later points in the generation:
i need to gather more data to show significance, if at all
but the idea here is that CFG plays the same role as instruction-tuning for long sequences
it's really the only thing I could find. I tried other runs with different datasets. I'm trying a run now with longer sequences to see if this holds up
I'm wondering if the length is correlated with the task type/complexity
There are more than 250. Yes it's shuffled.
I’m sampling tasks with completions that fall into a pretty narrow band of length (50-100 tokens).
I'm currently stuck on running the experiments, since I still haven't gotten access to tulu-65B 😦
If anyone here has access ( @patent gull?) I would appreciate it if they could download the model to the instance.
i don't have access either , i was using smaller versions (7b) for my experiments
but let me check, maybe they're leaked online somewhere
not immediately finding it. I applied with my academic information, will let you know if/when i get approved
sorry i don't have a better answer
(current) results of the human evaluation, 460 votes
Very nice! If I understand correctly it seems to support our other findings: the main effect of CFG is following the instruction and keeping the required structure. However, it seems not to improve the 'quality' of the model results.
Reminder: only the instruction is subject to CFG, not the prompt
in fact the figure is extremely positive: it shows that for CFG=3 we get peak instruction following (thanks to CFG) while not harming prompt relevance (not subject to CFG)
Ok, so I'm confused. What does the orange line in the chart mean?
ahhhh so the orange line means the "user-prompt" part of the prompt is preferred in CFG=4 30% of the times against baseline?
Yes, I thought the orange is the second question: Which satisfied the prompt best?
not to be persnickety, I know I already commented on the ylabel, but just looking at the axes again, I would label:
y= "CFG win rate (%)"
x= "Guidance Strength (CFG \gamma)"
From what i understand, solid results
Great work
Yes. So CFG=4 actually degrades how well the model responds to the prompt
but CFG=3 does not... AND reaches its peak instruction following
That's what it is, yes.
Ok, I think I now understand your remark about not applying CFG to the prompt. What happens if you also apply it to the prompt part? In that case I'm guessing the results are less satisfying
I don't know. I couldn't think of a "neutral" user prompt for the negative input
And GPT4All doesn't respond well if we just omit it
What do you mean by 'neutral' user prompt?
You have many questions there that require structured results (like 'rap song', list, etc.)
ok
Current test.
Positive CFG prompt:
Instruction:
The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response as a PhD thesis.
Prompt
What is the best way to cook a steak?
Response
Negative CFG prompt:
Instruction
The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
Prompt
What is the best way to cook a steak?
Response
(difference highlighted)
If you wanted to apply the same logic to the user input ("prompt"), what would you write in the negative version?
(I couldn't find a satisfying solution to that problem, whereas I noticed that the instruction was very often not followed properly, so it made a good test)
Why not use an empty negative prompt (as in all the other experiments)?
As I already mentioned, I think the empty negative is a very strong differentiator and the main claim for novelty given CAD.
The empty negative is exactly what CAD is doing, actually
they do not provide different inputs
they contrast P(x|query, context) and P(x|query)
No, they are not. They are giving part of the prompt itself as negative
Yes, so the negative is the query, it's not empty!
It's the same text anyway.
And I can't help it but read it as P(x + query | context) vs P(x + query)
It doesn't matter, this choice of prompt split is prompt engineering of the task.
The novelty of the proposed CFG is that you don't need to do this task prompt engineering.
It is much simpler, and it works well for a wide range of tasks, so I think that it's a strong claim for novelty
The novelty of the proposed CFG is that you don't need to do this task prompt engineering.
I think there might have been some misunderstanding then? Negative prompting has been a topic since the beginning, especially when we were talking about alignment.
And given the enormous impact negative prompting had on txt2img, there wasn't a single moment I did not think of elaborating on it
Anyway, the results are here, they're excellent, and they address a current very hot use-case. I strongly vote against not including the experiment.
I agree that it was important (my first experiment was also with negative prompting). But I think that given CAD, we should rethink it.
I agree that the results are great, I'm just wondering whether it also works without the negative prompting (i.e., using the regular CFG). If yes, then I think it's stronger. If not, then it sounds reasonable to keep it as is with these good results
The model is just disturbed without the proper formatting unfortunately
our tables aren't fitting:
does anyone know what a canonical task-split would be, to split up the tables?
'arc_easy',
'boolq',
'hellaswag',
'lambada_openai',
'piqa',
'sciq',
'triviaqa',
'winogrande']```
nevermind
hi @fallow egret I'm trying to edit section 3 at some point today and would love to able to weave CoT into the larger story when i do that.
No need to write in results/etc. but if you get a chance, can you put a bit of big-picture experimental overview, as well as maybe a sentence motivating why we did CoT and what it shows us, in section 3.4?
3-5 sentences is all i'm asking for
Yes, sure. I already started the first paragraph and will add more stuff soon.
sure, great 🙂 thanks!!
where do you want me to put the motivation?
First sentence of 3.4 motivates CoT
Last sentence of 3.4 wraps up CoT, summarizes findings and leaves us with insight
I'll worry about working a summary of that ^ into the head of section 3, the intro and the abstract
I already did that a bit, but will be making it sharper
let's decide where @unique sedge's work is going. Are we gonna mention translation in 3.1 along with general benchmarks? Do we have another short section just titled "Generation"?
I'm still of the opinion that the big table of translation results belongs in the appendix but feel free to overrule me
I think I'll mention it immediately after you do glue, will try to have a draft up in a few hours and run it by you
okk just beware, i'm a ruthless editor
btw if anyone isn't doing anything and wants to poke around, it would be really cool to get this working for CFG to get some demo plots for the appendix:
actually... nevermind, I can probably just mock up their visuals in latex
I'm available, I can work on something
did anyone do anything to the machine?
weird stuff is happening that wasn't happening before
i think it has to do with a tulu update, best guess
which machine?
In any case, I'm now running Guanaco-65B.
It takes some time to get results... But I added the subsection with a placeholder for the results (fingers crossed that they will support the story). I tried to keep it short and minimal with a coherent story, so I omitted many experiment results. We can add more stuff in the appendix later
cc @patent gull
yeah, appendix is a great place
I'm trying to find a good instruction-tuned model based off of pythia, since our benchmarks use pythia
I'm looking at dolly, but the vocab size is different from pythia, even though they say they based theirs off of pythia
possibly dumb q, but does anyone know why an instruction-tuned model might have a different vocab size from a base model?
yeah maybe they expanded it?
if we have the vocab files we can check if one is the prefix of the other
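A minimal sketch of that prefix check, assuming each vocabulary is available as a token-to-id mapping (for Hugging Face tokenizers, `tokenizer.get_vocab()` returns one). The function name is hypothetical.

```python
def is_vocab_prefix(small, large):
    """True if every token in `small` maps to the same id in `large`.

    small, large: dicts mapping token string -> integer id.
    If this holds, `large` is just `small` plus extra (e.g. padding) tokens.
    """
    return all(large.get(tok) == idx for tok, idx in small.items())
```

If the check passes in one direction, the extra entries in the larger vocabulary can be listed with `set(large) - set(small)` to see whether they are padding or newly added tokens.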
There’s two possibilities that seem vaguely reasonable to me:
- The Pythia tokenizer is padded with unused tokens because there are “magic numbers” for vocab size that run faster. Maybe Databricks doesn’t know this and is using the “real” tokenizer size (I don’t recall the numbers off the top of my head but can check in half an hour).
- There are some reasonable improvements one could make to the tokenizer that would shrink it. Most notably, introducing digit-level tokenization of numbers.
If it’s not one of those two things I’m at a loss
they both use the GPTNeoX tokenizer 🤷♂️
anyway, I got around it by just comparing JSD for the top-k=10,000 tokens from the base model
I was thinking that the GPT4All experiment did not do real justice to negative prompting as the "negative" prompt was actually semantically neutral.
I started qualitative (reverse) alignment experiments and some things started to appear
- The per-token CFG ranking is a godsend for prompt engineering
just display argsort(P(x|new_prompt) - P(x|current_prompt)) and it'll show all the tokens your new prompt will encourage and discourage compared to the previous one
It made my experiments a lot faster
- I think I will run a neg prompt experiment with (reverse) alignment
compare the toxic response of:
positive toxic prompt / negative standard prompt
vs
positive toxic prompt / negative safe ethical prompt
and (hopefully) display a difference
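The per-token ranking mentioned above can be sketched as follows. This is an illustration, not the tool used in the experiments; it assumes the two next-token distributions have already been computed over the same vocabulary, and all names are hypothetical.

```python
def rank_token_shifts(p_new, p_current, vocab, top=5):
    """Rank tokens by P(x | new_prompt) - P(x | current_prompt).

    p_new, p_current: next-token probabilities under each prompt
    vocab: token strings aligned with the probability lists
    Returns (encouraged, discouraged), each ordered by strength of the shift.
    """
    diffs = [(pn - pc, tok) for pn, pc, tok in zip(p_new, p_current, vocab)]
    diffs.sort(key=lambda d: d[0])  # most negative difference first
    discouraged = [tok for _, tok in diffs[:top]]
    encouraged = [tok for _, tok in reversed(diffs[-top:])]
    return encouraged, discouraged
```

The same function could be used to inspect a positive/negative prompt pair before running a full generation experiment.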
these are great ideas for experiments
Which... 3) made @blissful garden and @patent gull say that it would actually be better to take the assistant experiments out of the CFG paper and build another one dedicated to assistants. One of the main arguments is that the GPT4All results are quite strong and would deserve their own paper. I have mixed feelings about that, especially because 1) I can't really take more time not doing my job, and 2) I don't think there's enough material for another paper dedicated to experimenting with assistants
oh we don't need to take it "out" of the paper. I just think it's something that can potentially be extended into papers (if any) building on top of it.
I think there's a strong case here for another paper, and a little too many different directions for 1 paper... however, I know we're all busy and might not have time
but i do think it's practical right now to decide how we're going to treat the current (imo v important) direction that Vermifuge is going in right now in the current paper
To properly disclose everything I have been trying to make a case of negative prompting without using an assistant but it's hard
I don't fully know why it's hard yet, I'm investigating
I mean, I would love to investigate but I locked myself out of my flat 🤡
can you clarify for ppl in the thread (and maybe for me a little bit 😅) what you mean by "without using an assistant"?
and by a negative prompt, just to give a super dum example, you mean:
p(x | "be happy") - b* p( x | "be mean") ===> extra happy, right?
Just a regular base LM, like GPT-J
Yes but I would love for it to be a bit more interesting than having a neg prompt that's just the opposite of the positive prompt
Like, I've tried a positive prompt "A conversation between a human and an AI assistant:" with GPT-J
no prob, just checking.
and can i ask why in this case we care about replicating your earlier results without assistants? I think showing we can improve assistants is cool, too
And it's kind of disappointing, right, because even if promptless CFG increases the success rate, it's still not 100%
what do we define as 100%?
Like, a common failure case is that it stops the conversation and proceeds to translate it
(no idea why)
So I'm trying to address this with discouraging this behavior with the negative prompt
So what's the state right now? Are we trying to add more stuff or wrapping up for arxiv early next week?
wrap it up
Yah let’s wrap
@loud adder do you have time to take a look at the draft over the weekend?
Finally I have the results for Guanaco-65B (just waiting for cfg=2, should be finished in ~6h).
Results look really good, we got a 6% improvement!! The results also support the story that the improvement is correlated with the reduction of invalid retrieved results.
And lastly, after digging into previous work in the field, I found a reference that relates to the divergence issue in CoT. So overall we now have a full, strong story for this subsection.
I will add everything once I get the final cfg=2 results
Don't forget to add your name to the authors' list after you're done 😉
I can on Monday morning. this sounds more fun than cleaning, lemme take a look
I love the edit 🤣
For Table 6, I think we should probably be looking at the change in logits (P(x|y) - 2P(x)) rather than the CFG logits? Currently it’s unclear how much of what we see is actually due to the CFG
I don't get why the factor 2
cfg is $\log P(w_t|w_{<t}) + \gamma(\log P(w_t|w_{<t}) - \log P(w_t|\hat{w}))$
I'm visualizing the parenthesis
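As a minimal sketch of the quantities being discussed, assuming the parameterization in which γ = 1 recovers ordinary conditional sampling (consistent with cfg=1 being the baseline elsewhere in this thread): the guided log-probabilities are the negative/unconditional ones plus γ times the difference term inside the parentheses. Function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Normalize logits into log-probabilities."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def guidance_delta(cond_logits, uncond_logits):
    """The parenthesized term: log P(w | prompt) - log P(w | negative or no prompt)."""
    cond, uncond = log_softmax(cond_logits), log_softmax(uncond_logits)
    return [c - u for c, u in zip(cond, uncond)]

def cfg_logprobs(cond_logits, uncond_logits, gamma):
    """Guided log-probs: uncond + gamma * (cond - uncond), renormalized.

    gamma = 1 gives the conditional distribution, gamma = 0 the unconditional one.
    """
    uncond = log_softmax(uncond_logits)
    delta = guidance_delta(cond_logits, uncond_logits)
    return log_softmax([u + gamma * d for u, d in zip(uncond, delta)])
```

Visualizing `guidance_delta` rather than the output of `cfg_logprobs` isolates what the guidance contributes on top of the base distribution.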
vermifuge
Because I just woke up
😱 it happens to the best of us!
Some general thoughts
This is a very good start to the paper. The results are strong and the narrative (will be) compelling.
I was quite impressed by how well presented the info in tables is in general... that's a skill that takes a while to develop and one of the first things I do when I read papers by less experienced authors is rewrite their tables. Really the only thing I would change about any of the tables is in Table 2, "improvement" is a very long column title compared to the others... maybe using a capital Greek delta to represent change would be a better move.
It is obvious to me that a substantial portion of this text was written by non-native English speakers. I know that being judged for that sucks, and I'm not thinking any less because of that, but some reviewers are assholes about it. I don't think it's necessary to sweat about right now, but I'll do some passes for grammar and word usage when the text is closer to final. The two biggest things that stood out to me were the overuse of commas (it slows down my reading and makes me backtrack) and "co-temporaneously" (which isn't a word, I suspect "contemporaneously" was meant).
There's a number of times that the language in the paper is hesitant or non-committal with its wording. The best example of this is:
Considering all the experiment results, the reduced improvement or even negative effect of CFG on pass@100 may be attributed to a reduction of the variance of solutions. This is also consistent with our intuition of CFG: it improves the adherence to the prompt, but (perhaps inevitably) reduces the variations and creativity.
Do we think this is the case or not? If so, let's provide an actual argument. The fact that CFG reduces variation is well established both theoretically and empirically, so we can just state that. I think a crucial observation here (which is implied but not actually stated) is that with 100 shots having more variation is better because you can get lucky.
Section 1 and 2
Right now, the paper pitch reads as "CFG is awesome -> let's see if we can apply it to NLP -> hey look it improves these things that LLMs struggle with." While this is historically faithful to the experimental design process, it will read more compellingly if we flip the order a bit to "here's a problem with LLMs -> we adapt a method from a related field to work here -> it solves the problem."
More concretely, I would reorganize the first two sections to have Section 1 go:
- LLMs are awesome, but guiding generation is hard
- Detailing examples of how people have tried to solve this in the past
- Introducing CFG (including Eq 1)
- Explaining why CFG is a better solution (in some contexts) than the other approaches, and hint at why it works
- Paper outline
Section 2 can then delve deeper into our approach, and if necessary compare and contrast mathematically with other approaches. This section should also assert our theory for why it works the way it does, even if we don't provide evidence of that until much later. By priming people with the right ideas, they'll be able to get a good intuition for our results as they come. It's fine to just say we'll demonstrate this later.
Section 3
This section is currently very disconnected and hard to provide concrete feedback on. In particular, it's unclear when various results are missing if they're missing because we haven't added them yet or because we never ran them. The lack of GPT-2-xl (completely unmentioned), Pythia models larger than 2.8B, and LLaMA models other than 7B are also notable. We are going to include those in the final copy, right?
Despite the fact that CFG almost doubles the computation during inference, the comparison is still noteworthy given that other models with comparable performances on Lambada(OpenAI) have much more parameters and would still require more compute than LLaMA 7B with CFG
Great, let's put inference FLOP on the x-axis and performance on the y-axis to show this!
3.3 Program synthesis evaluations
Why do we evaluate GPT-J, a model that hasn't been discussed at all previously in the paper? Is there a reason besides computing expense to not run the experiments in this paper on all Pythia and all CodeGen models? Why was CodeGen chosen specifically over other code models?
3.5 Chain-of-Thought
At this point, the length of Section 3 starts to bother me as a reader. Section 3 seems to have five pages of a nine page paper, and more subsections than all the other sections combined. We should think about how to break it up into at least two sections.
Figure 3 is not very impressive. Why present the Guanaco-65B results at all, if they're basically unchanged? The two plots that are combined into this figure don't seem to fit together very well either, making reading needlessly confusing.
For Table 6, I think we should probably be looking at the change in logits between traditional sampling and CFG sampling (I don't think that that's what's going on right now, and if it is the caption should explicitly say that). There's a couple words that I disagree with the classification of in this table ("Draco" is about dragons, "restrooms" and "museam" are not more about Paris than any major city), but the most problematically non-highlighted terms are "Rockefeller" "Plaza" "Times" and "Symphony." I strongly suspect that these are the beginnings of non-Parisian place names such as Rockefeller Center and Times Square (both in NYC). There are several places Plaza and Symphony can be the start of as well. Omitting blue highlights on the left side is something we want to be very careful to avoid, as it may look like we are overstating our results by trying to hide something.
Section 4
I am very excited for this section to exist. There's now a decent amount of lit on reverse engineering learned algorithms and building causal models of the behavior of transformers we can potentially take something from. What's the intended scope and ambition of this section?
Section 5
Personally, I really dislike "additional related work" sections. I think that the best written version of a paper rarely has one. This doesn't mean we can't or shouldn't have one, but something to think about is if this work is so related, why don't we cite it previously ever?
This section contains a hidden killer objection to our paper: "efforts at controlled generation largely fell out of favor with the advent of instruction-tuning ." Given this, we need to present a compelling argument for why our method should not be viewed as irrelevant. Compute-to-impact tradeoffs are our friend here, and can be cited and stressed in Section 2. Also, we claim that CFG can be layered on top of other methods successfully. Do we know if the same is true for other prompt-based controlled generation techniques?
One more general note: repetition is your friend. Tell someone what you're going to tell them. Then tell them it. Then remind them that you told them it. This tends to substantially increase the readability of papers.
To compute flops can we simply do 1/3 of the estimate here
https://github.com/EleutherAI/gpt-neox/blob/c00ce70c49ebdf368fcb36fad0ba9a15b607e195/megatron/logging.py#L82
Thank you very much! I don't have anything specific to answer (but we're taking everything into account), beside answering the explicit questions you ask.
Sec 3) I thought that GPT2-xl was kept unreleased by OpenAI, and apparently Honglu did not know that there were bigger Pythias. Then we did LLaMA and thought that we had too many plots and data points to even add more.
Sec 3.2) We use GPT-J because we replicate the experiment? Is that invalid? Also because I'm evaluating this manually and each model we'd add is 2 or 3h of labeling for me. Honglu will answer about CodeGen.
Sec 4) I will fix table 6. Thank you. My lack of US culture shows, it's in good faith.
Sec 4) contains Table 6 and Alex will add his findings. Spoilers: CFG is significantly closer to an instruction model than vanilla.
Sec 5) CFG starts to bridge the gap to instruction. Plus we also show CFG on GPT4All so even those models can benefit. If it's not clear, I will need to think of a way to state it clearly.
Thank you very much!
More succinctly, for a forward pass FLOP = 2*params*tokens
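A small sketch of that estimate, for use on the x-axis of a FLOP-vs-performance plot. It assumes CFG costs two forward passes per generated token (one with the prompt, one without), which is an upper bound on the "almost doubles" figure quoted from the draft; the function name is hypothetical.

```python
def forward_flops(n_params, n_tokens, cfg=False):
    """Approximate inference cost: FLOP = 2 * params * tokens per forward pass.

    With cfg=True the cost is doubled, assuming one conditional and one
    unconditional pass per token (a simplifying assumption).
    """
    flops = 2 * n_params * n_tokens
    return 2 * flops if cfg else flops
```

Under this estimate a 7B model with CFG costs about as much per token as a 14B model without it, which is the comparison the feedback above asks to show.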
about the choice of CodeGen, it's literally the only one I know that
- has various sizes
- has a well-written paper about it.
And since HumanEval are all python tasks, choosing CodeGen-**-mono may be a good choice to focus on program correctness alone (codegen multi or codegen 2 may mix up other languages)
I will put these in the draft.
The reason I didn't run 6B was simply that 2B with HF model was already taking a shit ton of time (was 8hr or something) for unknown reasons. In OpenELM we are also extremely reluctant to do inference on 6B unless we get a Triton-FT server set up already.
Regarding Figure 3 in CoT, it was just a placeholder figure (with garbage numbers). I now put there the figure with the real results...
I saw you mentioned "long reasoning chains can diverge", is it possible to show that the improvement is because the reasoning chains do not diverge?
It is demonstrated in the figures, you can clearly see there the decrease in the percentage of 'invalid' results (when increasing cfg)
We can also add many qualitative examples, but I thought it would be better in the appendix...
what's the definition of "valid answer"?
same as correct answer?
I guess I'm just confused with the precise definition of "non-valid percentage" on the plot
I'll add that in footnote.
The parsing follows the same protocol as in previous works (CoT, Self-consistency), where you search for the words "The answer is <ans>". An answer is invalid if this sequence of words does not appear in it.
Which is the meaning of "diverge" in this setting: either the model is stuck in an infinite reasoning chain, or it 'forgets' the correct output structure
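A minimal sketch of that validity check, assuming a plain regex search for "The answer is". The exact pattern and post-processing used in the experiments are not shown in this thread, so `ANSWER_RE`, `parse_answer` and `invalid_percentage` are illustrative names only.

```python
import re
from typing import Iterable, Optional

# Assumed pattern: everything after "The answer is" up to the end of the line.
ANSWER_RE = re.compile(r"The answer is\s*([^\n]*)")


def parse_answer(generation: str) -> Optional[str]:
    """Return the extracted answer, or None when the generation is 'invalid'."""
    match = ANSWER_RE.search(generation)
    if match is None:
        return None
    answer = match.group(1).strip().rstrip(".").strip()
    return answer or None


def invalid_percentage(generations: Iterable[str]) -> float:
    """Share (in %) of generations with no parsable 'The answer is <ans>'."""
    generations = list(generations)
    if not generations:
        return 0.0
    invalid = sum(parse_answer(g) is None for g in generations)
    return 100.0 * invalid / len(generations)
```

With this definition, a chain that never terminates and a chain that drops the output format both count as invalid, which is the "diverge" case described above.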
all these definitions and explanations belong to the main text I guess?
yes
not exactly readable yet, but it's cool enough to share
I would actually color by CFG vs non-CFG rather than model type
so you'd have one color for the left end of all segments and another for the right ends?
I think so
I can do that yes, but then we would totally dismiss the model labeling
Is the model labeling important for understanding the “core meaning” of this plot?
no, but I thought it would be nice to have it if there was a nice way to carry that info
yeah I was thinking we drop all the model labels, drop all the lines, mark "with cfg" using stars instead of dots, and if necessary we color each model family by one single color.
I just added LLaMA-7B and I'm a little concerned about its scores, much lower than smaller Pythias on some tasks
(besides TriviaQA which was buggy)
Pythia 2.8B outperforming LLaMA 7B is kinda sus. Where are the LLaMA scores coming from? The eval harness or the LLaMA paper?
harness
yes, it's sus
it's not on all tasks though
actually it's only on arc_easy and sciq
We have a bunch of notes about trying to reproduce the results in their paper, lemme go see if there’s anything about those two
this is llama 7b from harness results folder
arc_easy 0
acc 67.38 ± 0.96
acc_norm 52.48 ± 1.02
Seems similar to the run we got.
On Pythia paper, 2.7B does go up to 0.644
So are we looking at acc for this one?
Yes, we recommend looking at acc in general
btw this is the write out info of triviaqa. Some are not entirely fair (like "Eucalyptus leaves" vs " Eucalyptus", "The Secret Service" vs " Secret Service"). Might explain why the triviaqa we got is generally lower.
Welcome to LLM evaluation via exact matching 🙃
Could you try dropping the lines and model labels, and use either different colors or different markers for baseline and cfg? Just want to see how it looks
sure, later tonight, it's just a preview
@fallow egret I just got permission to use those 65b weights you were asking for
Tulu-65b
Thanks, but we got nice results with Guanaco-65B
Ah ok so they’re no longer needed?
Also how were you using accelerate to run those models?
Yes, I think there is no need for a third model.
What do you mean exactly regarding accelerate?
more just curiosity on how you're running big models on this cluster.. whether you're using deepspeed or HF accelerate
yes, I'm using accelerate
bigger picture question for all the folks here:
All the existing papers that broadly fit into the family of "Contrastive Decoding".... whether it's CAD, or Lisa Li/Percy Liang's
I feel like they have solely focused on generation... (CAD looks at summarization and abstractive QA)
is it fair to claim that we're the first to show general benchmark performance increases?
is it also fair to say that we're the first to apply contrastive decoding techniques to machine translation?
Also side-note, do we even want to put ourselves in the general family of contrastive decoding techniques? I guess we're inarguably in it, right
I think that from a story perspective it should be the opposite: we are performing CFG (which is a form of classifier guidance). In practice this can be seen as contrastive decoding (and the paper should relate the two and explain the equivalence). But the motivation is different and should come from the positive side of increasing the 'classification' p(prompt | generation), which is a somewhat different motivation from CD.
Regarding machine translation, why is this task different from the rest?
I think that, again, if we are looking at the big picture, we are trying to tackle the alignment issue, i.e., following the prompt instructions better. All the other works try to solve different tasks (CAD is about using context information).
The results demonstrate that the method is effective at following the prompt instructions better across various tasks (machine translation is one of them)
I find it much harder to read
esp. because it only talks about inference FLOPs, while keeping the base and CFG points linked also lets us estimate the training FLOPs
like, in Hellaswag or PiQA we may be tempted to read "CFG changes nothing, a model twice as big has the same acc as CFG", but the former is more expensive to train
Love the Winogrande zig-zag
oh yeah, NeurIPS workshop. Crazy.
why red and orange? Maybe some other color with bigger contrast?
Maybe we can add a regression line for each group (with cfg and without)
I can change the colors yes. Do we agree on this way of plotting the data?
if we want to make a point that cfg provides better TFLOPs vs acc tradeoff, I don't see another better way (maybe this is not even true)
looks like cfg is only definitively better on two tasks though 
in terms of inference efficiency, yes
in terms of training efficiency I'm guessing it's a clear win
yeah
I guess we throw this chart (plus regression lines if it's not too difficult) as another view of the data in the appendix. I feel like inference efficiency is something that many people still care about.
Also since we mentioned the lambada sota and winning on inference efficiency we can cite this plot.
yes I'll write something about it
I don't get it. The fix was supposed to make sure that the ground truth was in the generated output, after lowercasing and stripping trailing spaces. Most of those examples should be ok
oh is that how triviaqa is supposed to work?
I was just complaining that by human standards it's quite unfair.
after the fix, yes
generate until "." or "\n", lowercase, remove punct, search ground truth substring
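The four steps in that message can be sketched as follows. This is an illustration of the described protocol, not the harness code: `truncate`, `normalize` and `em_score` are hypothetical names, and using `string.punctuation` as the punctuation set is an assumption.

```python
import string
from typing import Iterable

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def truncate(generation: str) -> str:
    """Keep only the text before the first '.' or newline."""
    cut = len(generation)
    for stop in (".", "\n"):
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]


def normalize(text: str) -> str:
    """Lowercase, remove punctuation, strip surrounding whitespace."""
    return text.lower().translate(_PUNCT_TABLE).strip()


def em_score(generation: str, ground_truths: Iterable[str]) -> float:
    """1.0 if any normalized ground truth is a substring of the normalized output."""
    answer = normalize(truncate(generation))
    truths = [normalize(t) for t in ground_truths]
    return float(any(t in answer for t in truths if t))
```

Under this sketch, " Eucalyptus leaves" against " Eucalyptus" scores 1.0, and an empty generation scores 0.0.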
oh was my version not fixed? I see "Eucalyptus leaves" vs " Eucalyptus" get 0
I remember before the fix it was all 0
mine got nonzero so it should be after the fix
there were 2 fixes: one for LLaMA tokenizer, another for TriviaQA as a whole
yep, I remember without the TriviaQA fix, we will get all 0 for LLaMA
Oh wait, was it 0% for Lambada?
Mine was 674866a of branch cfg in our fork
https://github.com/EleutherAI/lm-evaluation-harness/commit/674866adf4ac76d028f11d89143c42f8d5d9aa58
then it's correct, you got the fix
both, IIRC
uuuh the fix is incorrect
return {"em": float(continuation in list_of_candidates)}
should be
return {"em": float(any((continuation in candidate) for candidate in list_of_candidates))}
@blissful garden If I fix it we'll need to rerun TriviaQA for all models, is that doable?
Need to change all my scripts but it's doable.
Let's do it. Once you fix it I will run a llama 7b first and see if we get the right number.
ok
done
llama7b got 45% on triviaqa
I can't find the previous result
previously it was 40% on llama
and 3%-ish on gpt, pythia stuff
let me run pythia 1.4b and see if we get nontrivial numbers
ok I don't have previous "em" results on triviaqa for this model
darn it maybe it didn't even go down this code path
ok it should just take less than 20min for pythia 1.4b. if we see 20-30ish I will rerun everything
We'll have very bad issues if TriviaQA is back. Our tables won't have the same number of columns, and I'll have to find a new place for the legend in the charts 😆
I have a couple numbers for TriviaQA for common models (0 = missing data). Does this line up with what you're seeing
generative triviaqa or argmax accuracy with teacher forcing?
(ie, version 0 or 1?)
wait wat
Current version is V2: https://github.com/EleutherAI/lm-evaluation-harness/blob/b281b0921b636bc36ad05c0b0b0763bd6dd43463/lm_eval/tasks/triviaqa.py#L31
It looks like its V1 but with a new dataset path
https://github.com/EleutherAI/lm-evaluation-harness/blob/b281b0921b636bc36ad05c0b0b0763bd6dd43463/lm_eval/tasks/triviaqa.py#L91 this line is still wrong it seems
return {"em": float(continuation in list_of_candidates)}
should be
return {"em": float(any((continuation in candidate) for candidate in list_of_candidates))}
If this is an error, it seems like we should be seeing extremely wrong results, not vaguely reasonable ones
It's just a lot less permissive
and indeed, we see the error manifesting here ^
pythia 1.4b now got 12.56% on triviaqa
| Task |Version|Metric|Value | |Stderr|
|--------|------:|------|-----:|---|-----:|
|triviaqa| 1|em |0.1256|± |0.0031|
the write-out.
Some are still strange, like " The state of South Dakota" vs " SOUTH DAKOTA" getting 0.0
uh, I missed something then
"logit_0": "",
"truth": " First Lord of the Admiralty",
Got 1.0 🥲
another one:
"logit_0": "",
"truth": " MO FARAH",
"em": "1.0"
a false negative
"logit_0": " Silkworms feed on the leaves of the mulberry tree",
"truth": " Mulberry",
"em": "0.0"
"logit_0": " The port of St Malo is located in the French department of the",
"truth": " France",
🤔 wonder what pythia was trying to say
...
I read continuation and list_of_candidates backwards
the continuation is not the expected one but the generated one, and the candidates are not candidates answers but candidate ground truths
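To make the direction of the containment check concrete, here is a small sketch comparing the three variants discussed in this thread on the write-out examples above. The `_norm` helper is a stand-in assumption (lowercase and strip only); the real normalization lives elsewhere in the harness and is not reproduced here.

```python
def _norm(text: str) -> str:
    # Stand-in normalization (assumption): lowercase and strip whitespace only.
    return text.strip().lower()


def em_membership(continuation, truths):
    """Original check: the generation must equal one ground truth exactly."""
    return float(_norm(continuation) in [_norm(t) for t in truths])


def em_backwards(continuation, truths):
    """First attempted fix: is the generation contained in a ground truth?"""
    c = _norm(continuation)
    return float(any(c in _norm(t) for t in truths))


def em_substring(continuation, truths):
    """Intended direction: is a ground truth contained in the generation?"""
    c = _norm(continuation)
    return float(any(_norm(t) in c for t in truths))


if __name__ == "__main__":
    silk = " Silkworms feed on the leaves of the mulberry tree"
    print(em_backwards("", [" MO FARAH"]), em_substring("", [" MO FARAH"]))
    print(em_backwards(silk, [" Mulberry"]), em_substring(silk, [" Mulberry"]))
```

The backwards variant explains both symptoms: an empty generation is a substring of every ground truth (false positive), and a long generation is never a substring of a short ground truth (false negative).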
fixing
you can pull and run it again
sorry about that
@blissful garden (in case you didn't see the answers)
| Task |Version|Metric|Value | |Stderr|
|--------|------:|------|-----:|---|-----:|
|triviaqa| 1|em |0.1594|± |0.0034|
Manually checked the first 100. I don't see many problems in the write-out file this time 🎉 . There are a couple that I still have doubts about, but they shouldn't be a big deal.
I finished writing up the main gist of my content in Section 4. I need to do another pass on it tomorrow to express the ideas there better and link to the Appendix.
I see that there's a lot of editing that's been done on the doc.
I'm wondering if we want to relax our deadline by a little bit? I think this paper needs some careful digesting/reading..
also there is basically no text in the Appendix, and for the amount of tables/info/etc there, I think we need something...
the closer we get to EMNLP the more I'll be unavailable to do that, personally
In my opinion this is not close to being ready. It's basically many, many results, and it needs many iterations to become a paper.
I tried to edit the intro and focus on the narrative, i.e. what exactly we want to tell (it's only a first draft, I still need to iterate).
Let me know what you think but the main message is:
- We are trying to tackle the alignment issue- specifically, better following the prompt instructions (which is a critical issue).
- We are inspired by classifier guidance in DM (not contrastive decoding or anything related to LLM).
- Negative prompting is part of the CFG mechanism investigation! it's not part of the main story, but an interesting insight from the investigation that should be explored further in future works.
If we are all aligned on a coherent story, then we can edit the full paper to speak in a coherent language, carry the same message and support the narrative
i agree with you on it not being close to ready. Can probably get there soon with some passes from whomever has bandwidth.
#1 Yes
#2 Sure, I don't see the harm in saying that we were inspired by DM. I'm reading your response to my earlier question today more deeply (apologies, I was multitasking when you first sent it). I think I agree with you about the framing, but, Section 4 notwithstanding, I'm on the fence about whether we were actually able to show "alignment" rather than just "performance increases". I don't know that we designed a lot of our experiments to explicitly address alignment.
#3 This seems like a good way to work this in, to me
Regarding 2, I think that we want to talk about 'alignment' in the more general setting (of the original paper by OpenAI) of following human intent and instructions. The performance increase is not a result of enhancing the model's knowledge graph; it is due to the model better following the task prompt instructions. This should be stressed much more in the experiment section.
P.S,
Regarding contrastive decoding: I think it is important to include it in the related work. The narrative is that it is in practice a 'classifier guidance' approach (where the classifier is either an external model [Lisa], or a classifier defined through task prompt engineering, as in CAD).
In contrast, our approach is truly classifier-free guidance.
So our differentiation compared to competitors:
- Tackling the alignment issue, which is much bigger, therefore our huge experiment section.
- Use a real classifier-free guidance (this was not done before)
- Deep investigation of the CFG mechanism
@loud adder @blissful garden @versed flax
Please also share your thoughts (submission timeline/narrative), I want us all to be aligned before editing the rest of the paper...
- Yeah of course. I remember a while back we said that the narrative is like, we have a problem (prompt following), here is a method, and we solve it. From what I see in benchmarks and codegen, a more precise hypothesis may include that in the tradeoff between prompt adherence and variability, CFG is a useful parameter that leans towards the former at the small cost of latter. But of course we seem to have more to offer (esp. human evaluation and Sec 4 interpretability)
- Yep. If our story starts with CFG in CV, mentioning classifier guidance is certainly natural.
- I don't quite understand this. It feels more like "how to use CFG" investigation.
classifier guidance vs cfg feels like "having to train another classification model" vs "only using the model itself"
Regarding 3, the point is that we focus on what exactly the effect of CFG is on the model policy (this is the main point of section 4), and then say: well, you can actually push it further and use negative prompting to better shape the policy
It's not only about actual training: prompt manipulation can also be seen as fine-tuning. So CAD is in this sense not classifier-free. This is highly important for differentiation and novelty (framing all previous methods, including CAD, as classifier guidance methods)
oh I was just saying "negative prompting" seems like an application
aren't we already different enough from the other papers' narratives?
I guess at the end people care about the practical advantages (lower training cost, etc.). So I don't know if we need to argue who is "truly classifier free" or not.
Well, I agree it's an application. But you can frame it as an application that supports the understanding of how CFG steers the model policy
Yeah sure. Understanding the mechanics always has the goal of better using the tool. I'm really happy that Alex wrote the Section 4.
If it's a research paper submitted to a tier-1 conference, then reviewers and readers will be interested in the story and in what the novelty is compared to previous work. Almost no one will have the time to actually look at all the experiments. They will understand the story and the point, and if it sounds like a reasonable and good story they will look at a few examples and test this approach in their setting.
Looking at the current version, what pops up is the formula (which I am already familiar with from a paper I read one month ago, CAD), and a huge batch of unrelated experiments whose exact purpose is hard to understand, as is the extra knowledge they add compared to previous works
Almost no one will have the time to actually look at all the experiments.
oh wow. I mean I don't know about ML culture so this sounds surprising.
a huge batch of unrelated experiments whose exact purpose is hard to understand
Absolutely. My understanding is that we are still at the stage of "throwing all the results out there" while attempting to fill in paragraphs async so that they can be used later.
I think the CAD paper has muddied the idea of what this paper should be too much. Parallel works which have their own take on solving a problem can independently arrive at the same solution; as long as some distinction is made it's okay. Even if the distinction isn't made, the motivation is often quite different.
The experiments are a way of proving generality of application of the method. They aren't disjoint, they are wide-ranging. You usually test the method on the general benchmark, then a few generation tasks, and then any specific task you think it helps more
The CAD paper entirely looks at their method as anti-hallucinatory solution and proves it through summarization and the hallucination evaluation metric
yep this is why I feel that we are different enough in quite a lot of ways (except the formula unfortunately, but formula is formula, I'd admit that we coincide)
@/vermifuge has maintained from the start that he’d like to prove the generality of application of his method and hence first showed a sample after which various tasks were proposed and also some human evaluation was done
Taking a coherent narrative is of course not a bad idea, but seeing the CAD paper mentioned again and again feels like there's some active effort to establish difference. But the difference already exists, since we are looking at it in a more general manner while they only considered hallucinatory effects and don't consider complex prompting methods
I think that there should be an active effort to stress the differences and sharpen the story, to emphasize that it's not only testing CAD on more tasks, but a different approach.
There is a very low probability that a reviewer will not ask about that and will not want to see real novelty. You don't want to start thinking about it in the rebuttal. It is important to stress the differences from the beginning and shape the story correctly
yeah I feel that it's really about the fact that the paper is still in the early stage of writing, and it is true. There is no denying that we need to rewrite under a coherent narrative (shouldn't all papers be like this?). I'm probably more optimistic thinking we will get there. I just hope that we don't "argue too hard" for the sake of finding novelties.
If a reviewer comes out and says "hey you guys use the same formula as CAD", I will say "yes of course."
I'm waking up and I'm still half asleep.
I agree with pretty much everything that was said: thank you Elad, you echo Stella's comments and you are very right. Thank you paws for bringing back the general scope of CFG.
I'm personally okay if we relax the time frame, but with each additional day we need to address CAD more and more. Also I'm very very late with my work and I can't be as active as I have been anymore
I completely agree. The point is not saying 'hi it's novel', the point is to rephrase the story in a way that arrives at the same formula from a different direction (from classifier guidance, which was the real initial motivation of @versed flax ). In this case, as a reviewer I don't care that you ended up with the same formula: there is a different motivation here.
coming from CFG in CV has been the original plan right?
This is exactly the point and should be stressed
Yeah totally
My doubt was only about arguing "hey you are not classifier free but we truly are".
What I understood is that we similarly view decoding as a Markov process and take inspiration from the CFG formula, even though we don't share a lot of their assumptions. We are not an exact analog to begin with.
It's a paper, not a rebuttal. Positioning CAD as classifier guidance is something that should simply be stated as an obvious observation: just mention it (without starting a discussion) as part of the external classification line of works, where the classifier in this case is given by the prompt engineering.
By the way, it's not true that the formula is the same, it's similar but not the same...
My only doubts about very directly positioning CFG as coming from CV is that we would be implying and showing that it’s having the same effect/mechanisms etc. And idk entirely if that’s true
Some of the negative prompting stuff hasn’t been as easy in our setting as it seems to be in CV
There are certainly differences with applying CFG in an autoregressive setting vs a diffusion setting, and I worry about giving the readers the impression that “whatever works for CFG for CV will automatically work for CFG for NLP”
I think it's fair, negative prompting in CV was a later addition too
Maybe there’s a way to position CFG as coming from CV but also call out that there are expected differences bc of the different modeling paradigms
I wouldn't say that neg prompting with LMs doesn't work, I think we just haven't found the correct way to use it
Totally, im not implying that either. I’m just saying the fact it didn’t work out-of-the-box just means that the domains are obvs not 100% the same. Maybe I’m making a dumb point, idk
Or... maybe it does, but we haven't found the right experiments to show it
True, too
Also in vision, it's not an 'out-of-the-box' formula. In many cases it doesn't work great and requires many iterations and prompt engineering
In any case, vision is just the inspiration. The mechanism is totally different, first of all because you are not applying CFG during training
there's no CFG in training in vision either :)
unless you mean "conditioning dropout"
which is the default with LMs
Yes, this is what I meant. You also train the unconditioned part jointly. This is of course unnecessary in LLMs, which is important
fair enough. these are good points!!
codegen 6B results are also out. Quite consistent. I wish 350 and 2b were this nice (esp. pass@100).
Hi, when you write comments in the text, please add this command at the beginning of the file:
\newcommand\name[1]{\textcolor{color}{[NAME: #1]}}
(change it to your name and choose a color).
Then you can use, for example, \name{comment} inside the paper. This way it's going to be easier to track the comments (compared to the current generic red comments)
That's a great idea
Ok, I think I finished the first iteration on section 2. There is a small inaccuracy in the last subsection, since a normalization factor is missing (in all the equations) to make it a probability function. But it's really ugly to drag it everywhere. If anyone has an idea how to make it more accurate, feel free to change it.
In any case, I think that the theoretical connection to the original vision CFG work is now clear. And the derivation is nice (and not completely trivial, as one of you asked about it in a remark); it also adds more volume to the novelty claim
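A minimal sketch of where the normalization factor enters in practice, assuming the guided next-token distribution is log p̂(x) ∝ log p(x) + γ·(log p(x|c) − log p(x)), with γ = 1 recovering the plain conditional model (consistent with cfg = 1 being the baseline in the results). This is an illustration in pure Python, not the paper's implementation; `cfg_log_probs` and its arguments are hypothetical names.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def cfg_log_probs(cond_logits: Sequence[float],
                  uncond_logits: Sequence[float],
                  gamma: float) -> List[float]:
    """Guided next-token log-probs: uncond + gamma * (cond - uncond), renormalized."""
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [u + gamma * (c - u) for c, u in zip(cond, uncond)]
    # The final log_softmax is the normalization factor that the equations drop.
    return log_softmax(mixed)
```

Since the sampler renormalizes at every step anyway, the equations can reasonably be stated up to proportionality, with the constant mentioned once.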
the math looks great. Left a couple more remarks (might seem nitpicking but I always worry about rigorousness)
Great! these 'nitpicking' are very important in this section
also, what about we order equations in the latex way (\begin{equation})
I'll be home later tonight to read that. I'm excited!
yes, indeed it should be changed to the conventional format
I can try to revise a little bit later. Boarding a flight right now. Overall it looks nice and this is def where this section should be going
sure, I will appreciate any help
I'm reading the current state of the paper, and it reminds me that I should absolutely take a chance if anyone has a good idea:
We don't have any conclusive experiment for a compelling negative prompting situation.
What I mean by "compelling" is like you would see in txt2img model: your positive prompt is "a man in 1920", but the images all have mustaches, so your negative prompt is "mustache", which works by targeting a specific unwanted aspect of the generation.
What we've tried and is not really satisfactory is:
- Positive prompt is specific ("Here is a horror story:") and negative prompt is more general or neutral ("Here is a story:") => this works and has been proven in the paper.
- Positive prompt and negative prompt are opposed ("The nice man said:" vs "The angry man said:") => I have very mixed feelings about this one.
And those examples aren't exactly as granular and useful as the "mustache" example.
If anyone manages to get some hint of a result, it would be great
One difficulty I see is that your negative prompt ends up being completed by a continuation that makes no sense for it. If your negative prompt is "He was so mean! He said:" then of course the continuation will be something very nice, and I have the intuition that this makes the LM quite confused and undermines the whole strategy, but I can't prove it.
That being said, this confusion does not arise in diffusion models and I'm not sure why. As you go through the multiple steps of sampling, the image looks less and less like the negative prompt it is supposed to be conditioned on. But it works. And I'm not sure why.
@fallow egret Thank you for that section 2!!! I added some comments as well (some as overleaf comments, like... in the "comments" side bar lol)
(sorry... having a lot of trouble with the tex bot 😂)
I just rephrased and augmented section 2.1 to make it an intro of this theory section.
For 2.2 there are still a couple questions though:
- we need to decide a notation and make it consistent throughout the paper. I vote for @fallow egret notation but is there another important paper that writes $\hat p$? (I see Ho & Salimans write $\tilde p$)
- I think Ho & Salimans intentionally write $p_\theta(z|c)\cdot p_\theta(c|z)^r$ for the same $\theta$ because the classification model $p_\theta(c|z)$ was explicitly mentioned a couple more times when introduced. I'm quite confused too but I fear that there is something we missed.
Honglu
It should be the same when it is the same model (in the CFG case), but in the non-free case it should be different (also in the original paper of the classifier guidance it was different)
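For reference, a sketch of why the same $\theta$ appears in the classifier-free case, using the exponent $r$ from the message above and assuming the implicit classifier is obtained by Bayes' rule from the model's own conditional and unconditional distributions (normalization constants are dropped):

```latex
\begin{align}
p_\theta(c \mid z) &\propto \frac{p_\theta(z \mid c)}{p_\theta(z)}, \\
\tilde p_\theta(z \mid c) &\propto p_\theta(z \mid c)\, p_\theta(c \mid z)^{r}
  \propto \frac{p_\theta(z \mid c)^{1+r}}{p_\theta(z)^{r}}, \\
\log \tilde p_\theta(z \mid c) &= (1+r)\,\log p_\theta(z \mid c) - r\,\log p_\theta(z) + \text{const}.
\end{align}
```

In the non-free case the classifier would instead be a separate model, say $p_\phi(c \mid z)$ with its own parameters $\phi$ (a notation chosen here for illustration), and the second line would not simplify this way.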
Ok, I'm now going to edit sections 3-4.
Overall both of them should take no more than 4 pages, so I'm going to move many of the results to the appendix section
yeah we should be able to refactor it easily depending on the requirement. I should probably move some of my tables to appendix too.
Yeah that was my thought too.
If it's a mistake that repeated 3 times it's just too obvious and I usually become extra careful in this case. But I tried to re-read some stuff and still agree with you so far.
@fallow egret where did section 3 go?
I see you have a folder of "old_sections" now. It would be better if you tell us next time when you make big moves like that.
@fallow egret so what's your plan with this "old_section" folder? We should probably hear you out and make sure everyone is on board.
Yes, sorry I will update next time. In any case I didn't erase anything, it's there exactly as was previously
It's simply hard to edit it as is, it will be much easier to take it piece by piece
Be mindful of other people's work. You're not solo on this.
And communicate your intent
Why do you think Sec 3 needs a whole rewrite?
Just to say, to put back the old section:
You can just use the command in the text
\input{old_sections/experiments}
Who is the owner of the overleaf? I have a premium account, and if you transfer ownership to me we can turn track changes on
I am
It's currently a mess of unrelated experiments (with respect to how the section is written), with completely different writing styles and lengths that don't connect together into a story. Also the section is too big, most of the stuff should move to the appendix
- Okay. I don't think "a hot pile of mess" is fair. It needs improvements. Sure.
- I don't think writing styles and lengths are that much of an issue, and if they are, Stella or Alex, as native English speakers, might be the ones to trust on that. Besides, rewriting someone's work without consultation is dismissive of their work, and doesn't give the person(s) the opportunity to improve. This is something that should be done either with full rewrite consent, or via comments / dialogs. I suggested you reach out to Alex so that you guys could sync, and you did not; why?
- Yes, the sections need better flow and to integrate better into the narrative. What are your suggestions wrt that?
It's not about the style, it's about the coherence of the story. In any case there is a strict 8-page limit for NeurIPS, and the current work is more than 12 pages with no conclusion section and a limited related work section, so it's simply an automatic rejection.
I put it back, it's currently exactly the same as before.
I do agree with your points. There is indeed ongoing work on moving stuff to the Appendix, and on reflowing the section. We're on the same page. Can we do that as a team?
Sure, I prefer not to do all this work on my own. I thought you and Alex are limited with the time and not actively working on it
We are. But it's much easier and faster to help as a team and get involved in discussions rather than pulling it off alone :)
NeurIPS deadline has passed right?
I thought our plan is to put it on arxiv first, and decide where to submit later?
Yes, indeed. It's not a good time with respect to conferences.
In any case in all the major conferences there is the strict 8 pages limit, and I think we should already upload it to arxiv according to this structure
The points are made and the experiments are here. We should aim for an ArXiv release next week, if not this weekend. The text (structure, sentences) should be refined for a conference, but I don't think it does too much harm to release it close to its current state
The paper won't change much
the sentences will, but not the insights, not the results, not the experiments
I recently wrote a very long paper (20 pages lol) and submitted to somewhere with page limit. I personally find it rather easy to refactor it for any page limit AFTER the writing is already smooth and coherent. I propose we focus on the introductory texts, the gluing paragraphs, notation consistencies, grammar problems in section 3 first rather than worrying about the page limit?
and I think we are doing fine so far, since previously not all experiments have come out. Now we probably just got everything for benchmarks.
I agree with @loud adder that there is no second chance for a first impression. In my opinion the current version is not ready, and it will take time to iterate and make it look like a good paper.
But it's up to you as the first author...
Yes I agree. First impression must be good. And IMHO it is. The experiments are here, the results are here, the insights are here. We're not writing poetry. If the sentences can be improved, it's no big deal. The scientific aspects are there already
IMO what's missing for section 3 is one qualified native speaker to streamline the intro texts and middle paragraphs. Meanwhile we fix things as well as we can, but I don't see the necessity of massive changes (for the sake of the text itself, not considering the page limit).
IMO the page restriction is not only for the conference, people have no time to go over 10 pages of experiments. But I completely agree that it's significantly less important comparing to the other points you mentioned
or is there a better idea to restructure the section? I'm def not saying we shouldn't do it but as long as we discuss about it openly here and make sure everyone is on board, it will be good.
I remember @versed flax used to mention that general benchmarks + codegen benchmarks go together, and machine translation + your CoT go together. Maybe there is some way to combine some stuff along this line
My thoughts on section 3:
- Add an overview which echoes the intro: the purpose is to show that CFG improves alignment and follows the provided instructions better. We demonstrate it on a vast and diverse set of tasks, models and metrics
- The split into subsections follows the different families of tasks
- Each subsection should start with a paragraph explaining what the task is and why we chose it, in other words, in what respect it's challenging to follow the instructions. Then one main benchmark, a small figure/table, and a conclusion + reference to the appendix for more details. All the rest of the experiments/details should be in the appendix
- Negative prompting should go to the next section as an exploration of a direction that should be investigated further (as we already discussed)
- 'General natural language benchmarks' should be split according to the exact tasks. I think most of these benchmarks could be categorized as reasoning tasks, in which case we could merge CoT with this subsection as a natural extension.
I thought about splitting general benchmarks according to tasks, but that might mean we split tables and results. What benefit do we get by doing this?
I think almost all the benchmarks fall in the category of reasoning. It might be that we don't need to write up all the results. If there is a benchmark which doesn't contribute to the story, then what is the point in showing this result?
In any case, I think we could categorize it into 1-2 families of tasks
If there is a benchmark which doesn't contribute to the story, then what is the point in showing this result?
So we hide it?
I think we could categorize it into 1-2 families of tasks
I thought we have more? Common sense reasoning, closed-book QA, code generation, machine translation
I meant only the subsection 'General natural language benchmarks' can be split into reasoning and maybe one more family. All the rest of the subsections (except continuation, which still seems to be missing) are well-defined specific tasks, which is good.
It's not hiding, we don't need to provide all the results in eval-harness...
Our story is not 'we are building the best model, look how it beats all the rest on all benchmarks'. The story is that we propose a method to tackle some issue in LLMs. Therefore, every benchmark tested should be related to this issue
I think all the benchmarks are related (make the generation/prediction on-topic, thus improving metrics like acc). We ran those, and some of the results are bad. I would still report bad results and aim to suggest a theory/modify the narrative rather than hide it.
Yes. Hiding the bad results is dishonest
(and IMO, bad results on a small number of things make it more believable)
I’m at a conference right now but can find time to read and edit the paper at some point today or in a couple days
I think you misunderstood me. I'm not suggesting hiding bad results, I'm suggesting hiding results of unrelated/unclassified tasks (which might be good or might be bad).
The point is that 'a general benchmark' subsection is very confusing and not informative to the reader
Oh I see. But which task is unrelated?
I don't know, we should carefully go over the big list of benchmarks in this subsection:
Hellaswag, WinoGrande, Lambada(OpenAI), BoolQ, PIQA, SciQ, ARC-Easy,
ARC-Challenge, TriviaQA
As I said, I think most of them can be categorized as reasoning tasks (which is very good and directly aligned with the story). If there are some leftovers that don't fit this category, let's think about what to do with them...
I'm at my wife's dance performance tonight. I can't develop or read much right now.
This is fine, we will not change anything without everyone's consent 🙂
I just wish I could contribute more to the current conversation
hellaswag, winogrande, boolq, piqa, arc-e, arc-c, are common sense reasoning for sure (looking at llama paper)
triviaqa and sciq feel like closed-book QA with facts. Not sure if there is any reasoning involved in sciq.
lambada - just completion? In PaLM paper it was its own category. oh wait I'm actually confused....
(and sadly lambada is something we definitely want to include because it's sooooo good)
If we could hide results winogrande is probably the only one we wanted to hide 😂
It's kind of interesting though. winogrande might be the only one that asks to fill a word in the middle of a sentence.
This is good, so let's split these tasks:
- Reasoning (most of the tasks) + we can merge CoT with it.
- LAMBADA (let's try to categorize it more precisely)
- Closed-book tasks. Do we really need them? We should think about whether they're really needed...
This way it also highlights one of the most important results (LAMBADA), which is currently a little bit hidden in this subsection
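A minimal sketch of the grouping discussed above, in Python. The names `TASK_FAMILIES`, `family_of` and `group_by_family`, the family labels and the task keys are all hypothetical; the assignment only mirrors what was proposed in this thread (LAMBADA as plain completion, TriviaQA and SciQ as closed-book QA, the rest as common sense reasoning) and is not a final decision.

```python
# Hypothetical grouping of the "general benchmarks" into task families,
# following the categorization proposed in this thread.
TASK_FAMILIES = {
    "common_sense_reasoning": [
        "hellaswag", "winogrande", "boolq", "piqa", "arc_easy", "arc_challenge",
    ],
    "closed_book_qa": ["triviaqa", "sciq"],
    "completion": ["lambada_openai"],
}


def family_of(task: str) -> str:
    """Return the family a task was assigned to, or fail loudly if unclassified."""
    for family, tasks in TASK_FAMILIES.items():
        if task in tasks:
            return family
    raise KeyError(f"unclassified task: {task}")


def group_by_family(results: dict) -> dict:
    """Regroup a flat {task: score} mapping into {family: {task: score}}."""
    grouped = {family: {} for family in TASK_FAMILIES}
    for task, score in results.items():
        grouped[family_of(task)][task] = score
    return grouped
```

Raising on an unclassified task (rather than silently dropping it) keeps any "leftover" benchmark visible, which matches the point above that nothing should be hidden.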
sciq and lambada are probably the highlights since the improvement is so much larger than on the others. Since they are 2/3 maybe the hypothesis of focusing on reasoning is a little bit too biased?
I'm not saying we should focus on them. It is actually the opposite: this whole big list of reasoning tasks will get one subsection, and the others will get their own subsections (every subsection has approximately the same length), so this split will actually highlight these results
If we could understand what makes Lambada, TriviaQA and WinoGrande so special it would be cool
great! So we don't hide anything and nothing is unrelated?
Perhaps they are a particular kind of task that would actually drive the categorization
Yes, as you said we were able to categorize this big list of benchmarks into 3 families of tasks, which is good 🙂
yeah the only thing I see so far is just that sciq and lambada are not about reasoning, and winogrande asks for a specific kind of prompt that questions the middle of a sentence
https://www.mosaicml.com/blog/mpt-7b tasks described at the end
Introducing MPT-7B, the first entry in our MosaicML Foundation Series. MPT-7B is a transformer trained from scratch on 1T tokens of text and code. It is open source, available for commercial use, and matches the quality of LLaMA-7B. MPT-7B was trained on the MosaicML platform in 9.5 days with zero human intervention at a cost of ~$200k.
So what I see is that Lambada is pure next token prediction, so, the exact training objective, and WinoGrande has a specific form
It's actually fascinating: if we change the format of winogrande by replacing "_" with something else and slightly rephrase the way the question is asked, would it change the situation?
@loud adder I vaguely remember you mentioned something in general about perturbing benchmarks right? (ignore this if you are in the middle of something. I don't want to disturb you during the conference)
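One way the "_" perturbation idea could be tried, as a rough Python sketch. The function names, the `[BLANK]` placeholder and the example sentence are made up for illustration; this is not how any existing harness formats WinoGrande.

```python
def perturb_blank(sentence: str, placeholder: str = "[BLANK]") -> str:
    """Swap the WinoGrande-style "_" blank for another placeholder string."""
    if sentence.count("_") != 1:
        # Refuse ambiguous inputs instead of guessing which "_" is the blank.
        raise ValueError("expected exactly one '_' blank in the sentence")
    return sentence.replace("_", placeholder)


def fill_blank(sentence: str, option: str) -> str:
    """Build the full candidate sentence for one answer option."""
    if sentence.count("_") != 1:
        raise ValueError("expected exactly one '_' blank in the sentence")
    return sentence.replace("_", option)


# Hypothetical example sentence, only to show the transformation.
example = "The trophy does not fit in the suitcase because _ is too big."
perturbed = perturb_blank(example)
candidates = [fill_blank(example, opt) for opt in ("the trophy", "the suitcase")]
```

Scoring the original and the perturbed prompt with the same model and CFG strength would then show whether the blank format itself matters.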
TriviaQA is free form response whereas Arc is MCQ
Here's the plan:
- We're restoring Sec 3.
- We're finishing Sec 2. @fallow egret you made significant progress on this, I'm not sure breaking and refactoring Sec 3 before Sec 2 is done and well is wise. Otherwise we'll have a paper broken in Sec 2 and 3.
I'll take some time during this week-end to rework the minimal amount of tiny bits of Sec 3 that need to be reworked in order to have a functioning paper (help is welcome, but priority is on Sec 2 which is currently broken)
- We're aiming for an ArXiv release on Wednesday. Which means we release on Friday, last call. If it's not ready by Friday, I'll do what's left to be done myself and call it a day.
If we're extremely fast and efficient and the Wednesday deadline is met in advance, we can talk about Sec 3.
We've hit a wall of diminishing returns and reduced availability. At this point 1) we're nitpicking, and 2) every day we lose is reduced impact.
Stella, you have a veto right for this plan if you think I am going crazy or missing something crucial.
Guidelines
Like in software engineering:
- Don't break master: The paper must be in an acceptable state at every point in time.
- Incremental and atomic changes: otherwise if your change is not finished, you'll break the paper.
- Be mindful of others, communicate your intent, work as a team.
What about the intro section? it is also currently broken
Look at all the red paragraphs above. IMO this tells the story better compared to the old intro, as we all discussed (and I think we all agree)
could you change the red texts you added in the intro under \elad command?
some stuff is useful, I agree
Yes, I'm afraid to make changes to the main text now. I'm not sure what I should do, so I prefer not to insert it 😐
no I meant most of the red texts in the intro section were added by you right? Per your own suggestions you might want to change it to \elad to better track it.
The big red remarks are my suggestion to replace the current intro with the red text...
(it's a complete section)
wait, was the red text under your remark yours or someone else's
There is all the red part here, that I wrote, which is a suggestion for replacement for the current intro
oh so I was right. I was just saying you suggested we bracket our suggestions by our own custom command, right?
Yes, I wrote it before my suggestion 🙂
In any case we should decide what to do also with the intro...
yes I know. Could you change it to \elad then?
Done
I haven't started looking at the introduction carefully, but I personally like Alex's original text a lot, though it might be incomplete. Yours is a great complement, but besides language issues, I'm mostly worried about this one
Our findings suggest that CFG is highly effective in improving the model alignment, increasing the possibility of serving the model as a general purpose assistants in a wide variety of tasks
Shall we fix section 2 first? A lot of people I know (including myself) work on the introduction almost at the end.
For section 2, looks like we all agree to use notations from Ho & Salimans. Let's comment out the remarks and stick to it.
I can do this minor change we agreed earlier about equation numbering.
I will also comment out a couple of my remarks that are resolved
Sounds fair, I will work on all the quote placeholders. I think besides that and the title (which I don't have any objection to changing if someone has an alternative idea), there are no other comments (except the Ho notation)?
there are a couple more that I haven't thought about super carefully. There is one about explaining how we go from p_\phi to p_\theta, and another about p(c|z) predicting the whole prompt from the generation (which could make sense mathematically but I just have a strange feeling).
The last one is about Nichol & Dhariwal's finding on the effect of CFG: is the way we rephrase it consistent with what's in their paper? I haven't checked yet.
There are some burdens on me here and I was planning on thinking about them on Saturday. But let me know if you have thoughts
I think I addressed these remarks? (either by modifying the text or by adding a remark)
which one?
I don't see any explanation
by the way, there seem to be a lot of grammar mistakes and language problems as well. Should we leave it for you to revise?
Oh, I see that remark. OK, let's iterate on it in the paper
I will also iterate on the grammar issues (I wrote both this and the intro without any iteration), but for sure feel free to change all the things you find
I guess we are gonna say cfg uses p_\theta to approximate p_\phi, right? This was my original thought but I feel I might be wrong
Never dare to directly add a sentence into yours
I think that it is simply a classifier which is defined using the same generative model; this is the main point.
So in this case theta=phi (it's the same model parameters)
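A sketch of that argument written out, using the p_\theta, p_\phi and p(c|z) symbols from the messages above. The guidance strength \gamma and the guided distribution \hat{p} are notation introduced here for illustration and may not match the paper's final notation.

```latex
% Implicit classifier obtained from the same generative model via Bayes' rule
% (this is where \phi = \theta: no separate classifier is trained).
p_\phi(c \mid z) \;=\; \frac{p_\theta(z \mid c)\, p(c)}{p_\theta(z)}
  \;\propto\; \frac{p_\theta(z \mid c)}{p_\theta(z)}

% Classifier guidance with strength \gamma, then substituting the line above.
\hat{p}(z \mid c) \;\propto\; p_\theta(z)\, p_\phi(c \mid z)^{\gamma}
  \;\propto\; \frac{p_\theta(z \mid c)^{\gamma}}{p_\theta(z)^{\gamma - 1}}

% Same statement in log space, up to an additive constant.
\log \hat{p}(z \mid c) \;=\; \log p_\theta(z)
  + \gamma \bigl( \log p_\theta(z \mid c) - \log p_\theta(z) \bigr) + \text{const}
```

With \gamma = 1 this reduces to ordinary conditional sampling from p_\theta(z \mid c), and \gamma > 1 pushes the generation further towards the prompt.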
My intentions were only to rewrite the story structure of these sections. You are really welcome to modify/rephrase the sentences
- I reworked the abstract a little bit: grammar, added LAMBADA, added GPT4All results which are our best results to my knowledge
- I reworked the introduction, blending @patent gull and @fallow egret's previous works. I think it's a fair blending and you should both find your work mainly unaltered. Can I get your confirmation here or as overleaf comments that you are in line with this intro and can consider it done?
IMO the intro should start with an overview of the problem that we are trying to solve. But whatever you decide...
I’ll take a look! Just waking up now after EMNLP submission
Why not. I'm open to criticism. My main focus is to get it done. Done is better than perfect. Can you suggest a fast and straightforward way to improve the introduction? The most broken part of the paper rn is Sec 2. If it takes more than 30 minutes to improve the intro in the way you would like it, I suggest you spend this time fixing Sec 2 to a "non-broken" and ok state.
yah i agree with Elad. Switching the 2nd and 1st paragraphs seems like the right way to go (and reworking)
happy to take a stab if you agree
That's indeed fast and straightforward. A simple cut and paste still makes sense.
@fallow egret Are you happy with this simple swap?
Yes, but it requires a few more small changes (e.g. you can't say that you use CFG if you haven't discussed CFG first).
Also I couldn't find the citation you had in mind for "like prompt alignment [Anthropic]"
ty!
@blissful garden Was all the rephrasing of the beginning of section 2 (up to 2.1) done by you? It's fine, I just want to take a look at an older version, and our history is limited to 24h 😦
- I am happy with the abstract (thumb up / down)
- I am happy with the introduction (thumb up / down)
guys in general — i don't know ICML/Neurips conventions that well — but it seems like there's a lot of really basic background in here that I don't remember last time
like i think we define what "prompting" is twice (once in the introduction, once in the next section), or at least introduce it as a concept
the language feels fuzzier to me, like it's saying less
I don't know the conventions either. I think it doesn't hurt and you can't really go wrong by stating the basics again, whereas an omission can really be annoying for the reader. It's much easier to skip a few sentences than having to google and read several papers to find that piece of prerequisite that you did not know about
fair enough
Beyond that, it helps contextualize how we use and what we mean by the specific vocabulary ("conditioning", "prompt", etc), that is often used a bit differently by each author
for sure i guess i'm more focused on the repetition between Intro and Methods
Oh yeah that's a different story but I'm not in Sec 2 yet
Your main text was commented out in the middle
I don't love the way we basically restate twice "This worked in Computer vision", it feels very redundant to me
Yes, something was changed there significantly compared to the original version I wrote 😐
