#Nemotron 4
they say it's a bit better than 4turbo in some parts
well, I didn't feel like it after a small test I conducted
to be fair I was mostly interested in looking into how their RM aligned it
I'm mostly interested because it's quite good with multilanguage stuff.
As far as translation is concerned it's the best open model I've encountered so far.
9T training tokens apparently
tho purely for translation 340B is an extreme overkill
there are 10B encoder-decoder models perfectly capable of that, like MADLAD400
Capable sure, but capable and good are different things.
I'm pretty picky and have tested a lot of LLMs over the last year when it comes to translation.
And Nemotron is exceptionally good for an open model.
have you tested encoder-decoder models or only decoder-only ones?
Decoder-only models are inherently worse at translation
and by "perfectly capable" I meant "perfectly decent"
Tho Nemotron is seemingly indeed capable of decent translation
Problem is it's seemingly meant for generating synthetic data and not for general purpose use
I have tested some encoder-decoder models, though the majority have indeed been decoder-only models. I've generally found that the translation-focused encoder-decoder models like MADLAD400 struggle with translating more than one sentence at a time, which is a big limiting factor when translating languages like Japanese, where translating multiple sentences at once is very beneficial due to how context sensitive the language is. It's also not outstanding when translating single sentences. It's certainly decent, I agree. But decent is not really what I'm looking for.
It's also quite useful to be able to supply additional details for the translation, like the setting of the text, background info, intended audience, etc. And instruct trained decoder models are more capable of integrating that into the translation.
I certainly agree that 340B is much larger than a good translation model should be, I'd love a much smaller one that works just as well, but I haven't found one yet.
It would be quite interesting to see if NVIDIA included multilingual examples into L3 based SteerLM models, as those are at a much more reasonable scale of 70 billion parameters.
About MADLAD - I found it to be at least good enough to translate training data into Russian and Belarusian, with which it likely indeed has an easier time.
One problem with Nemotron I see already is that it's not exactly great at writing on its own
I can't see myself using this to gen data seeds
Maybe I can do genstruct tho
Tho 4096 token long context window doesn't really allow for larger documents to gen from
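A rough sketch of the obvious workaround for the 4096-token window: split a longer document into pieces that each fit, and gen from each piece separately. The `reserve` budget (room for the instruction and the model's answer) and the `tokens_per_word` ratio are assumptions picked for illustration; a real run would count tokens with the model's own tokenizer instead of guessing from words.

```python
from typing import List


def chunk_document(
    text: str,
    max_tokens: int = 4096,        # context length mentioned in the thread
    reserve: int = 1024,           # assumed room for instructions + output
    tokens_per_word: float = 1.5,  # crude assumed ratio, not measured
) -> List[str]:
    """Split text into word-based chunks that should fit the context budget."""
    budget_words = int((max_tokens - reserve) / tokens_per_word)
    words = text.split()
    return [
        " ".join(words[i:i + budget_words])
        for i in range(0, len(words), budget_words)
    ]


if __name__ == "__main__":
    doc = "word " * 5000
    for n, chunk in enumerate(chunk_document(doc)):
        print(n, len(chunk.split()))
```

Splitting on raw word counts ignores sentence and paragraph boundaries, so chunks can cut off mid-thought; it's only meant to show the budgeting.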
From what I’ve seen of it… This thing demands further testing. The only non-coding benchmark it genuinely scored sub-par on was the MMLU. Which is weird, because for things like HellaSwag, it BLEW past most other models, and beat even stuff like Base GPT-4.
I’m genuinely wondering if something went wrong there. Like it might’ve been a prompting issue.
Sure, it’s only on par with other, smaller models for logic puzzles, but what I’m noticing here is that it DOES show promise in the one area that’s obviously a product of having synthetic data generation as the goal: Writing.
Near as I can tell from initial tests I’ve seen of its narrative capacity, not ONLY is it successful at just plain old convincingly real-sounding text, but it also seems to, DELIBERATELY, not have a latent positivity bias in it.
Saw someone write a hypothetical text conversation between a woman and her friend about a date gone wrong, and it genuinely skeeved me out for how real it sounded.
Granted, that’s not NECESSARILY a positive. But what it tells me is that, because Nvidia wanted to produce data that you could train a model to AVOID as much as ADHERE to, it doesn’t have any marketability-focused training that’d make it suck at writing anything other than latently optimistic storytelling. Cyberpunk, thrillers, etc. are all on the table here.
Yes, we are at present limited to 4K here. But hopefully RoPE extension is a potential avenue to fix that.
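For anyone wondering what "RoPE extension" would even look like: one known trick is linear position interpolation, where positions are squeezed by `trained_len / target_len` so a longer sequence maps back into the position range the model saw in training. This is a toy sketch of just the angle computation; `dim`, `base`, and the 8192 target are illustrative values, and nobody here has confirmed that this works on Nemotron without further finetuning.

```python
from typing import List


def rope_angles(
    position: int,
    dim: int = 8,           # toy head dimension, illustrative only
    base: float = 10000.0,  # commonly used RoPE base, assumed here
    scale: float = 1.0,     # scale < 1 means linear position interpolation
) -> List[float]:
    """Rotation angle for each pair of dimensions at one position."""
    return [
        (position * scale) * base ** (-2 * i / dim)
        for i in range(dim // 2)
    ]


if __name__ == "__main__":
    trained_len, target_len = 4096, 8192
    scale = trained_len / target_len
    # Position 8190 in the stretched window gets the same angles
    # as position 4095 in the original window.
    print(rope_angles(8190, scale=scale))
    print(rope_angles(4095))
```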
At the end of the day, my crackpot theory is that the fact that it’s designed to create convincing text for synthetic data makes it not-amazing at things we’re USED TO LLMs being good at, but potentially very good at things we’ve grown used to LLMs being kinda bad at.
Exciting stuff, from where I’m sitting!
I’m sorry to hear your efforts in getting it to write good have been unsuccessful, Aetherwiing. My personal tests so far have been quite promising. Though, we’ll have to wait for something like an API endpoint for rubber to really hit the road with testing.
hm, would be interesting to test it on a real synth data gen framework, like Distilabel
LMSYS is fairly limited in what I can test in this regard.
I see some GPT-ish patterns in gens, but let's be honest - currently pretty much 100% of LLMs contain them in some amount.
I also wonder if it requires lower temp or more sampling than it has on LMSYS, bc it lost coherence to some degree on some longer writing prompts I've tested
That could be useful for KTO or Reward model data
Hmmm… Yeah, I’d love to see the ability to adjust some dials here. And yeah, I think -isms are a bit hard to suss out these days. Like, you could argue Claude Opus has “-isms”, but that’s moreso beholden to Claude being a singular entity that, like anybody writing anything, would fall into common patterns in writing stories, regardless of genre.
I’m honestly really intrigued by the “we don’t recommend a system prompt” line on HF.
Interesting thing about Claude Opus is that it gets picked as AI on a classifier trained to differentiate between GPT 3.5 Turbo (as AI) gens and human writing
That… Does surprise me.
Makes me wonder if classifiers are, ultimately, looking for mistakes.
you can try for yourself (that's my model, yeah)
https://huggingface.co/nothingiisreal/open-gpt-3.5-detector
they are looking for patterns
Right, and the problem with that is that, barring willfully messing with structure, truly high quality media will inevitably sound similar because “good” can be kinda unitary in practice.
Anyways, in the weeds.
hm, there are LLMs which are mostly trained on synth, and barely pick any synthetic patterns
one of them is #1250867165737914569
Okay, well now I just wonder if classifiers are garbage.
I know Euryale is the holy grail in common parlance, but like, come on…
I don't think so
This thing didn't flag any human data from the eval test or further runs as AI
It sounds like something's gone wrong here.
Didn’t the Bible get caught by classifiers recently?
Euryale still gets flagged, just less
I tested mine on Bible, detected as human writing
Oh! Oh YOU made that classifier?
I think it's just most classifiers are trained like shit
Yes
Okay, interesting.
this one is mine
In any case, yeah. Curious how Nemotron fares against it.
Though, I will say that, if nothing else, Claude Opus is not a bad writer. And my pushback is probably just me falling into the trap of thinking “Seems AI generated” = Bad objectively.
Random example of creative writing by Nemotron
Class label 1, AI generated
Synth data is not necessarily bad, it's just rather repetitive and sometimes non-human sounding, so to speak.
It can be a problem for creative writing, and other creative related things, but on the other hand, barely anyone will complain about good quality synth coding data. So ymmv
Fair…
At the end of the day, I just miss the old Claude before Anthropic reined it in.
Was hoping Nemotron could give a taste of what we might get out of Llama 405B. But it seems I may still have to wait?
Same, I miss early Claude-1
NVIDIA already gave us something rather valuable with Nemotron - PPO data and code, and two pretrained Reward Models
All of those things are really damn rare
PPO finetuning at least
which is a good thing, PPO has a way higher entry bar than any other pref optimization
mostly because of its indirect nature
Finally gives me a fair chance to compare PPO and KTO
+1 for this model; it’s kinda like a preview of Llama 3 405b, but great for generating and responding with higher quality data without as many synthetic artifacts. Also helps that it doesn’t forget its knowledge unlike Claude or Llama
yeah, it would be interesting to give this model a spin on my Airo 1K regenning test
Wonder why NVIDIA still haven't added this to their API yet
16 A100s for bf16, maybe this could be hosted on a single 8xA100 node with int8/4 precision
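Back-of-the-envelope check on those numbers, counting weights only. The 80 GB per A100 figure is an assumption (the thread doesn't say which variant), and KV cache, activations, and framework overhead are ignored, which is part of why real deployments need more GPUs than the bare minimum printed here.

```python
import math


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def min_gpus(n_params: float, bits_per_param: int, gpu_gb: float = 80.0) -> int:
    """Fewest GPUs whose combined memory holds the weights (no overhead)."""
    return math.ceil(weight_memory_gb(n_params, bits_per_param) / gpu_gb)


if __name__ == "__main__":
    n_params = 340e9  # Nemotron-4 340B
    for name, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
        print(name, weight_memory_gb(n_params, bits), "GB,",
              min_gpus(n_params, bits), "GPUs minimum")
```

Under these assumptions bf16 weights alone (680 GB) already overflow a single 8x80 GB node (640 GB), while int8 (340 GB) and int4 (170 GB) would leave headroom on one node.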
is the OpenRouter team interested in hosting this once support is added to a couple backends?
https://huggingface.co/failspy/Nemotron-4-340B-Instruct-SafeTensors
I wish someone converted SteerLM 70B from NeMo to ST too
That one should be way easier to get running, as it is based on L3-70B
I noticed that Failspy went and converted the raw weights into a Safetensors file for people to try in order to get inference running.
Hopefully people smart enough to do so can figure it out. I truly do reserve judgment on this thing until I can get it running through a private frontend.
@flat linden
Adding
wtf is with the template for this model
<extra_id_0>System
<extra_id_1>User
{prompt 1}
<extra_id_1>Assistant
{response 1}
<extra_id_1>User
{prompt 2}
<extra_id_1>Assistant
{response 2}
...
<extra_id_1>User
{prompt N}
<extra_id_1>Assistant
Even an LLM would create a more sane naming scheme 😉
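For anyone wiring this into a frontend by hand, here is a minimal sketch that assembles the template exactly as pasted above. The function name and arguments are made up for illustration, the exact newline placement is an assumption (check the model card before relying on it), and where a non-empty system prompt would go is a guess, since the pasted template leaves the system turn empty.

```python
from typing import List, Tuple


def build_nemotron_prompt(
    history: List[Tuple[str, str]],  # completed (user, assistant) turns
    next_prompt: str,                # the new user message to answer
    system: str = "",                # left empty, as in the pasted template
) -> str:
    """Assemble a multi-turn prompt in the <extra_id_N> format shown above."""
    lines = ["<extra_id_0>System"]
    if system:
        # Assumption: system text sits on the line after the System tag.
        lines.append(system)
    for user_msg, assistant_msg in history:
        lines += ["<extra_id_1>User", user_msg,
                  "<extra_id_1>Assistant", assistant_msg]
    # End on an open Assistant turn so the model writes the next reply.
    lines += ["<extra_id_1>User", next_prompt, "<extra_id_1>Assistant"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_nemotron_prompt([("Hi", "Hello!")], "Write a short story."))
```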
this is bad, and the 4k ctx is even worse
rip
adding this is just a waste of a slot
I guess DeepInfra will pull it down soon enough just like #1246016908571054182 ...
But while it's still up: https://openrouter.ai/models/nvidia/nemotron-4-340b-instruct
Nemotron-4-340B-Instruct is an English-language chat model optimized for synthetic data generation. This large language model (LLM) is a fine-tuned version of Nemotron-4-340B-Base, designed for single and multi-turn chat use-cases with a 4,096 token context length.
The base model was pre-trained on 9 trillion tokens from diverse English texts, ...
Thank you!!!
Even if it’s for a short while, I’ve been itching to try this. Many thanks for the chance!
Though uh… Anybody got a SillyTavern conversion of the prompting format? That’s my main front end for testing these things, and I imagine many others.
Hmm! Seems my hunch was right. While it’s no Euryale-2.1 in terms of evocative prose, it demonstrates a robust and thorough understanding of the writing task, and follows instructions very well!
Just a shame it can only manage a short story at the moment before conking out. V_V
Hopefully they do some context enhancing stuff to it like with Llama 3 a little while prior!