#gpt-3.5-turbo-0613 is significantly less intelligent than 0301


valid otter
#

Am I the only one who’s noticed this? It’s hard to get it to follow even the most basic instructions, it just disregards them entirely.

Many prompts that worked perfectly on 0301 are now unusable. Even wording some of the prompts as if I’m talking to a 3rd grader doesn’t work.

It feels like they gave the model severe ADHD, and at the same time, decreased its intelligence by at least 30%.

The new model also seems to translate things verbatim instead of understanding context, yet another sign of what I’m referring to here.

I was so hyped reading that OpenAI email yesterday, because every engine they’ve released has been superior to the one before it. This one is not. There’s no way they can say this engine is “on par with davinci-003” now, as it’s not even close.

It’s faster yes, but is this much overall degradation really worth it? Surely someone at OpenAI must have noticed this by now…

snow pier
#

I have also noticed this issue. This issue seems to be present in both 0613 AND the new 16K model.

In our case, both 0613 and the 16K model yield much longer, less intelligent, and verbose outputs.

valid otter
# midnight ermine I still haven't had a chance to use it, but can you share an example of how it d...

When you use it, it will become painfully obvious very quickly, but I will provide an example.

With 0301, I could provide an instruction like:

(Sample only, the phrase is subject to change, and every phrase has different results)

——

Phrase: here is the best X
Write a sentence that contains the phrase and wrap it with # tags, i.e #phrase#. Avoid using the phrase at the beginning of the sentence.

——

And it would almost always complete this request properly, with the entire phrase wrapped with # tags.

With 0613, it doesn’t seem to understand what “entire phrase” means, and it often* rewords the phrase to be something that changes the meaning or context.

Realizing this, the closest I was able to get was to literally talk to it like a 3rd grader.

——

Phrase: here is the best X
Write a sentence that contains the full, exact phrase and wrap the full, exact phrase in # tags, i.e #here is the best X#. Avoid using the phrase at the beginning of the sentence.

——

This is the closest I’ve gotten, and this is only one example. Notice how I had to repeat “full, exact” twice to even get close. An LLM this advanced should not have to be spoken to like a 3rd grader to do such a basic task.
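A rough way to score outputs against this prompt, so the 0301 vs 0613 difference can be counted rather than eyeballed. This is only a sketch: the function name `check_wrapped_phrase` is made up for illustration, and it assumes "properly" means the exact phrase appears wrapped in # tags somewhere other than the start of the sentence.

```python
def check_wrapped_phrase(sentence: str, phrase: str) -> bool:
    """Return True if the exact phrase appears wrapped in # tags, not at the start."""
    wrapped = f"#{phrase}#"
    # Ignore surrounding whitespace so a leading space does not hide a violation.
    position = sentence.strip().find(wrapped)
    # -1 means the exact wrapped phrase is missing (for example, it was reworded);
    # 0 means the sentence starts with the phrase, which the prompt forbids.
    return position > 0


phrase = "here is the best X"
print(check_wrapped_phrase("After weeks of testing, #here is the best X# for the job.", phrase))
print(check_wrapped_phrase("After weeks of testing, #this is the best X# for the job.", phrase))
```

Running the same prompt many times per model and counting how often this returns True would give a simple pass rate to compare.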

In this example, it often lacks the intelligence/creative ability to think of a sentence that could contain the phrase, whereas the previous iteration had no problem.

Again, the only way I can describe it is that 0301 I could talk to like a 9th grader, while 0613 I have to talk to like a 3rd grader. That’s a significant drop in intelligence/comprehension.

#

I was pulling my hair out last night, at one point typing part of the instructions in all caps, and it still wouldn’t listen. Whatever traits they decreased in exchange for speed was not worth it.

midnight ermine
# valid otter When you use it, it will become painfully obvious very quickly, but I will provi...

Hmm... From my experience (for the tasks that I need it to perform), I already had to talk to it like a 3rd grader, similar to your example, to get the proper results. I'll see if my current prompt fails when I get back to my project.
I do think that the difference in the output is inevitable, since they fine-tuned it and adjusted the weights. If it gets better at one thing, it probably gets worse at a different thing.

ashen kindle
#

@valid otter I have also noticed a regression. This is a message I posted last night:

#dev-chat message

I noticed that with the same prompt as I used with the original turbo model, the 16k AND the 0613 models both were way more long winded even though the prompt instructs them not to be so verbose + instructive vs empathetic towards the user.

#

To see if it was a prompt issue, I stripped out a lot of instructions in my original prompt but that did not really fix the problem...

vernal fable
#

16k is pretty much unusable for anything remotely creative. Doesn't follow the writing and dialogue style I prompt no matter how I write it. Haven't seen output as bad and boring as this from an AI in a while tbh

stuck marsh
#

Yes!! I feel the same. I am really frustrated... Also it feels like they just decreased the default temp, so temp = 1 now feels like temp= 0.2 on 0301
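For anyone unsure what a lower temperature does, here is a minimal sketch of the textbook picture: logits are divided by the temperature before the softmax, so a small temperature concentrates probability on the top token. The logits below are invented for illustration, and nothing in this thread confirms how the hosted models apply temperature internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
print(softmax_with_temperature(example_logits, 1.0))
print(softmax_with_temperature(example_logits, 0.2))
```

At temperature 0.2 almost all of the probability lands on the first token, which is what "less varied" output looks like at the sampling level.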

#

I appreciate how much faster 0613 is, and how it doesn't run into any of the server traffic overload issues... but the quality is so poor

valid otter
# stuck marsh I appreciate how much faster 0613 is, and how it doesn't run into any of the ser...

I would assume the computational power required to run a less intelligent LLM is significantly less than before. So they opted to save resources and increase speed, hoping the AI would still be passable in most use cases.

Imo it’s likely that the bulk of the computation done by AI is in understanding the prompt/instructions. Similarly to a human, once we fully comprehend the request, we can attempt* to fulfill it pretty easily, even if the result is incorrect.

But if someone asks you a complicated question, or gives complicated instructions, understanding and taking actions on those instructions is going to be a lot harder. You’d just do the part you understood, whether it was correct or not.

That is essentially what the engine is doing now. It doesn’t understand or care about detailed/specific instructions, it just does the most basic version and skips the details.

vernal fable
#

If that was their plan, it's a move they shouldn't have attempted until GPT4 was widely available for a decent price. I guess the only solution is to keep using the previous version and hope they fix this one or there are alternatives by the time it's deprecated

valid otter
#

I think it’s possible if not likely that ChatGPT Pro is using the 0613 model now. I have never seen as dumb of responses as I’m getting for coding.

I was doing some tests that should’ve been easy, I asked it how to capitalize a letter after a “#” sign at the start of a sentence.

I’m not even joking, it tried to put a “/u” in the replace variable of preg_replace and said it would make the character uppercase.

That would’ve never happened a few weeks ago, because I’ve tested similar requests over time (to see how it responds).

After I told it that it should use a preg_replace_callback, it did it correctly, but why should I have to give such instructions? What if I didn’t know what a preg_replace_callback was?
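For reference, the callback approach looks roughly like this. The sketch is in Python rather than PHP: `re.sub` accepts a function as the replacement, which plays the same role as `preg_replace_callback`. It assumes "start of a sentence" means the start of a line and that only ASCII lowercase letters need handling; the function name is made up for illustration.

```python
import re


def capitalize_after_hash(text: str) -> str:
    """Uppercase the letter that directly follows a '#' at the start of a line."""
    # A plain replacement string cannot change case, so the replacement is a
    # function that receives each match and returns the rewritten text.
    return re.sub(
        r"^#([a-z])",
        lambda match: "#" + match.group(1).upper(),
        text,
        flags=re.MULTILINE,
    )


print(capitalize_after_hash("#hello world"))
```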

vernal fable
#

I'm still using GPT4 through Pro to help me write and it's still good, then again I don't ask anything crazy of it, pretty much aid in scenes heavy with a topic I'm not familiar about and at least it does do what I ask and is creative enough to use as a baseline.

The new GPT3.5 models though are so bad. I was excited about 16k so I could provide more context and lore, but like you say it's like it understands the bare minimum. It's really frustrating because it doesn't listen to the instructions no matter what I do, and keeps writing verbose but empty and repetitive content.

To be fair the previous GPT3.5 was not a great writer by any means either, but this one is laughable with how bad the style is and how uncreative at pushing scenes forward. For instance, before the update, a character would exit the scene and do something else and the AI would understand that the character was not in the scene anymore, the current one will hallucinate scenes with both characters still in them

#

So yeah, pretty terrible

valid otter
# vernal fable I'm still using GPT4 through Pro to help me write and it's still good, then agai...

It’s the same concept as someone said on the forum about translations. The old 3.5-turbo model understood the context around the translations. The new model is just translating verbatim.

Its ability to understand context has significantly diminished, like everything else. That’s likely why it failed to understand that the character wasn’t in the scene anymore. It’s unable to understand context, which is a clear sign of intelligence or lack thereof.

stuck marsh
#

Also surely OpenAI will fix 0613 right... this difference is so crazy I'm surprised not more people are talking about it

stuck marsh
#

should we post a bug report haha

vernal fable
#

The writing style is so so bad that I've seen the output from two other users in other servers and their texts read exactly like mine lol exact same word choices, sentence structures and repetition

valid otter
# stuck marsh and if they dont notice a difference, I wonder what they're using it for. Aka wh...

I’d assume basic prompts aren’t affected as much. It’s like asking a 5th grader to write an article about dogs: they’ll be able to do it even though they (for all intents and purposes) are young and not that intelligent (generally speaking, based on age alone).

It’s when you start asking that 5th grader to write an article about dogs that contains information about specific topics within the article, using a particular writing style, at a certain length (or anything else that’s slightly more advanced), that their lack of ability to understand/complete the request would become obvious.

To me it feels like we went from 9th grade comprehension to 5th/6th grade. The decrease in intelligence/comprehension feels like 30% or more, with a 100% increase in speed. I can’t even imagine how much money OpenAI is saving with this inferior model. Microsoft must be happy.

valid otter
# stuck marsh should we post a bug report haha

They would have to change the weights to fix this problem, and until a lot more people complain, that’s probably not going to happen.

I’ve been through this before with major bugs that took OpenAI 10-15 days to even notice (and hundreds of posts on the forum/Discord).

It’s when the “end end user” starts noticing, then enough people will complain for them to notice.

By “end end” I mean to say there are thousands of services that use the gpt-3.5-turbo API. Once they switch the default to 0613 on June 27th, I’d expect drastically (to say the least) more complaints.

We just have to wait for a few weeks after the 27th, and hopefully enough noise will be made for some action to be taken. I’d trade 30% decrease in speed for 15% increase in intelligence any day.

wise crag
#

fwiw i just wanted to chime in I am seeing the opposite

#

with a very verbose system instruction

midnight ermine
#

Nice to see there are people with a different experience. I feel like this thread was overwhelmed with people having a negative experience only.

queen crescent
#

I think it is just fine tuned to work a bit differently than the old snapshot? It gives more reliance on system messages and function calling than the conversation history itself.

But this means all instances of gpt-3.5/4 in prod have to restructure the old prompts and API calls as this will be replacing the old snapshot as the default pretty soon. This isn't a very developer or downstream company friendly move from OpenAI

vernal fable
#

I don't think it's just about prompting, the AI just feels like it simply isn't smart enough, like blackhat777 said above. Believe me when I say I've tried everything when it comes to adjusting system messages and steering the AI, the result was always just bad, especially the 16k model: terrible writing style, lack of understanding of the instructions no matter how they're structured or worded (I've tried plain text, I've tried bullet points, I've tried ELIML, short, verbose...), lack of understanding of context, bad creativity overall when it comes to coming up with story ideas and twists (or rather, nonexistent twists).

#

The lack of autonomy this new model displays is really bad for people that are using it to help with creative stuff

modest eagle
#

Also noticed decreased quality. It feels like a hybrid of turbo and curie.

modest eagle
#
OpenAI Developer Forum

I’m alarmed because I have been testing the new GPT 3.5-06-13 version and it has significantly worse performance in other languages (such as French). For example, translating from English to French using the current GPT 3.5 03-01 yields translations that understand the context. With the new GPT 3.5-06-13 version, it translates things verbatim wi...

compact dust
#

They are probably distilling to save money while unintentionally nerfing quality

compact dust
#

we must raise awareness! if not us, then who? if not now, then when?

#

What are extreme measures to be taken

torn garden
#

let's start by cc'ing some mods

#

@trail flower @warm stump

valid otter
# vernal fable I don't think it's just about prompting, the AI just feels like it simply isn't ...

The system messages don’t help at all. I tried after I read that guy’s post… it did absolutely nothing.

I spent 3 hours last night testing prompts (both system and user) in every way imaginable. Finally I just gave up, the new engine simply cannot be used for specific prompts like the last one could.

I haven’t tested gpt4 0613 much, but something tells me it’s likely had a similar decrease in intelligence (except it’s starting from a much higher point than 3.5, likely harder to notice).

valid otter
# compact dust What are extreme measures to be taken

OpenAI won’t listen to us until enough people complain. We’re going to have to wait until after June 27 when they force this inferior engine on everyone.

Once that happens, people are going to notice, and they will complain. Every day people are testing 0613 and realizing how bad it is. After 27th it will be A LOT more people.

vernal fable
#

It's bad to the point I'm just using GPT4 on ChatGPT Plus, as much as I prefer the api, it's unusable for me

#

😬 The new GPT4 model has about the same performance as the current GPT3.5 model. 🪦

I’ve always had the feeling that with every OpenAI release we get new capabilities, but come September there will be no available OAI model matching today’s GPT4.

cynically waiting for the day…

compact dust
#

organized developer protest outside openAI headquarters?

we may have to fend off the homeless folks, but this effort might require that valiance

midnight ermine
#

I think what one of the posters on the forum said is correct: we need to provide examples of prompts and their responses from the new version compared to the old one. If OpenAI doesn't know what direction to improve in, no protest is gonna help. Aside from maybe asking them not to deprecate the old model.
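One way to collect that kind of evidence is a tiny side-by-side harness like the sketch below. Everything here is hypothetical scaffolding: `generate` stands for whatever function you already use to call a model and get text back, and `passes` is your own check for whether a response followed the prompt. No real API call is shown.

```python
from typing import Callable, Dict, List


def pass_rates(
    prompts: List[str],
    models: List[str],
    generate: Callable[[str, str], str],  # (model, prompt) -> response text
    passes: Callable[[str, str], bool],   # (prompt, response) -> followed instructions?
) -> Dict[str, float]:
    """Run every prompt against every model and report the share of passing responses."""
    if not prompts:
        raise ValueError("need at least one prompt")
    rates = {}
    for model in models:
        passed = sum(1 for prompt in prompts if passes(prompt, generate(model, prompt)))
        rates[model] = passed / len(prompts)
    return rates
```

Posting the prompts, the raw responses, and a table of rates per model gives a much more concrete report than "it feels dumber".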

vernal fable
#

I might go by the forums some time during the weekend, then, but idk, it's so obviously bad, for instance, I'm writing a scene and something akin to the following just happened.

Me: John said, "You and your people have suffered greatly through the ages!"
GPT continuing the scene: Jane said, "John, let's tap on the wisdom of our ancestors, maybe we'll find something to help us save our people!"

This update is so laughably bad that it doesn't understand what "you" is, I rerolled the answer three times and it was all the same.

vernal fable
# midnight ermine If you ask me, I don't see what exactly the issue is in that example (probably b...

The problem is that John and Jane are from different planets and different races, and John clearly said "You and your people" but GPT had Jane act as if John belonged to her people. The previous model never did so in this story.

Example of a response the previous model would have crafted:
Jane said, "My people have suffered indeed. But thank you for helping us in this fight, John."

The AI also has lore and instructions to tap for information about these characters, by the way. But they're getting ignored, the previous model was so much better at following the information provided.

#

Something whacky that happened earlier too was:

Me: Character A sustained heavy wounds and is bedridden.
GPT: Character A goes to the library and picks up a book.

#

the previous GPT would forget about stuff like this after a few messages, which is expected, this one is not getting it on the first message

queen crescent
#

Just create new PRs with examples that have worse performance or don’t work at all

#

Maybe look at the existing ones first

#

It’s openai/evals on gh

ashen kindle
#

Just saw that Azure will also be deprecating the old models. Very disappointed by the regression and no option but to upgrade by the deadlines

valid otter
#

Finally people are noticing this and it’s getting traction

vernal fable
#

Good to know, hopefully OpenAI does something about it, I'm just baffled they have forced an inferior product on us

ashen kindle
#

Have y'all found any success just stuffing the 16k model with lots of examples to overcome the regression issue?

night oyster
#

Has anyone made a legit eval or even compared responses side by side?

queen crescent
#

i wrote some logic for a project, but as soon as I switched the model to 0613, it started producing very verbose and low quality outputs which broke stuff

queen crescent
#

ill send soon sure

#

maybe ill submit an eval directly

vernal fable
#

new turbo and 16k for me seemed to work well for a couple of days and now they're back to being as bad as when they launched, other people I've spoken to have noticed the same, these models are so unreliable

stuck marsh
#

as @valid otter predicted, more people are starting to see the issues now that the default is 0613

valid otter
#

Not only did they keep it significantly less intelligent, but now they decreased the speed to that of the former as well.

I had accepted that we were going to trade intelligence for speed, now we don’t even get the speed increase.

Basically we all get scammed while Microsoft saves a billion dollars 😂

midnight ermine
#

What does Microsoft have to do with it?
The speed decrease just means that they had separate environments/servers for the older and newer checkpoints. And now that the default model changed, the load on these servers has increased.

valid otter
#

Microsoft pretty much owns OpenAI at this point, $10+ billion investment, which gives them full access to the weights and just about everything else.

If you don’t think they want to see more of a return on their investment you’re kidding yourself.

OpenAI doesn’t release inferior engines, they never have. Yet this engine is provably inferior, and people are realizing it in droves. Even researchers who extensively study AI are noticing.

valid otter
#

Inferior to its predecessor… it doesn’t even understand how to insert links anymore. One guy on the forum asked it to insert links and now it literally inserts “[link]” like a third-grader who doesn’t understand how to rationalize instructions.

unborn elk
#

It does very well when given guidance with functions. We've moved most of our API prompts to using those. And it's a nice steerability feature when you have function name/description as well as parameter name/description
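As a rough illustration of what that looks like, here is a function definition in the name/description/parameters shape that the function-calling feature takes, with the parameters given as JSON Schema. The function name, descriptions, and the single `sentence` parameter are all invented for this example; only the overall shape is meant to be representative.

```python
import json

# Hypothetical function definition, for illustration only. The descriptions
# carry the instructions that would otherwise be repeated in the prompt.
tag_phrase_function = {
    "name": "submit_tagged_sentence",
    "description": "Submit one sentence that contains the exact phrase wrapped in # tags.",
    "parameters": {
        "type": "object",
        "properties": {
            "sentence": {
                "type": "string",
                "description": (
                    "A sentence containing the full, exact phrase wrapped in # tags. "
                    "The phrase must not appear at the start of the sentence."
                ),
            },
        },
        "required": ["sentence"],
    },
}

print(json.dumps(tag_phrase_function, indent=2))
```

The idea is that the model fills in structured arguments instead of free text, so the parameter descriptions act as a second place to state the constraints.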

vernal fable
#

trying to work around GPT3.5 last night was awful lol had to almost fight it to get some work done, I feel like it gets worse and worse. And I agree with the Microsoft issue.

#

At least ChatGPT users are starting to notice now, lots of complaints on Reddit

#

honestly I feel like GPT4 is also dumber than when it launched, but at least it's not bordering on unusable like 3.5, which is just useless at this point for a lot of use cases

vernal fable
#

When I asked 3.5 for help writing a simple scene of a barber working on a haircut (since I wanted to know techniques or the names of the tools I could use for my own story), this was legit one of the sentences it output, very nonsensical and bizarre:
Barber-Level-Cutty-Question-Mark man whatever you want me to call you I think I wrote too much here sorry

stuck marsh
#

it still says may24 version so i assumed it didn't change yet

vernal fable
#

also this

midnight ermine
#

Actually, there is some evidence that the model checkpoint in the chat changed a few days ago. It might very well be using 0613 in the chat now.

unborn elk
#

It changed to 0613 on the 27th

vernal fable
#

that matches with the time people started complaining on reddit, yup...

valid otter
#

I am just realizing the extent of it now when comparing my sites that used the old gpt-3.5-turbo and the new one... holy hell. The new one is insanely repetitive, even at randomized intervals of Temp/PP/FP, and struggles to listen to prompts. I'm going to have to rewrite like 100+, and even then I'm not convinced it will be of remotely similar quality.

valid otter
#

@warm stump tell us OpenAI is aware of this and going to do something to make it at least somewhat better 😭

tough tulip
#

I agree. I absolutely don't care about speed in the slightest. Give me the smartest/slowest model you have. I noticed this a month or so ago. I can definitely tell every time they make a change. It's to the point where I might consider going with an open source model. If the output is bad I might as well not pay for it.

vernal fable
#

14.5k people let's go

valid otter
#

It's wild how they expect people not to notice. Now my theory is all but confirmed, that they made gpt-3.5-turbo-0613 the default model in ChatGPT. I knew it weeks ago because of how it was acting.

Even GPT4 on ChatGPT Pro doesn't act the same way. I'm assuming it's gpt4-0613 as well. I'm convinced that the things I've had GPT4 code me in the past it would not be able to code today.

#

The real question is will OpenAI do anything about it. You figure at 15k upvotes, they should at least make a statement. If they do change something, I wonder if it would be only on ChatGPT and not the API so the bulk of their users would be satisfied, or if they'd actually change the API as well.

#

The OP in that thread says it feels like a 20% reduction in IQ... I think he's being conservative, it's more like 30%.

vernal fable
#

Hopefully the API as well, especially with how expensive it is, which, by the way, is also a reason people should be outraged at this. Current GPT4 definitely is not worth the price OAI is asking for it. I guess on ChatGPT maybe, since you get 25 messages every three hours, which is quite cheap for this model. But the API? It was already expensive when the model was good, now that it's average, the price is egregious imo.

#

Not sure i'm very hopeful about all of this. OAI only gives us silence and PR lies ("more steerable models!"), and what's happening is probably a result of them cutting down on resources+safety training, so...

tough tulip
#

It's pretty sad tbh. Like, I was super on board with openai, I had so many projects planned, I was going to use it in so many places... Lately though it feels more trouble than it's worth. I can't get it to output anything of value as of late. I feel like there's been a huge step backwards. They really really need to change up the moderation strategy (content filtering strategy in general). There's absolutely ways to moderate output in realtime without muddling with the model output directly. Ways that would be more aggressive even. I hope the dev team pushes back against this direction, if possible.

tough tulip
#

Too much scifi

vernal fable
#

seriously, I read some of the things these AI companies say and I feel like all of their knowledge comes from sci-fi rather than real world applications, like jesus, just focus on making a good model

#

superintelligence will most likely not be evil, this is humans projecting our nature onto something that is not

#

I get the need for research but this is ridiculous

tough tulip
#

yeah it's literally just creative use of a text predicter, it's cool, but it isn't what these people are claiming in all these marketing pages. all these things you see in the news about ai doing x or y, (like the military drone sim thing) it's because a human designed a system around it to do x or y... 😐 it's just like any other programming... like do we say "python did x or y" so now we limit python's ability?

#

also yeah, some people really love to anthropomorphize everything

#

i saw on reddit a post that said "chat gpt made a github issue" ... like no, a human designed a chat plugin to make github issues and ran the script via chat 😐

#

i thought we decided back when gpt2 made waves that it was nothing more than pattern matching... not to speculate too hard but i get the feeling openai is just trying to hype up the crypto bros - if it's not that, and they're serious, then i worry -- also, here i am complaining about how dog water the model has become and they're over there talking about how dangerous it is? bro i can't even get it to output proper markdown anymore

stiff vigil
#

🍿 🍿 🍿 🍿

limpid venture
#

It's like Lord of the Rings 😂

#

King of the world with the ring of unfiltered ChatGPT

tough tulip
#

I’m pro filtering. I’m just against the current strategy. We got a taste, they took it away. 🤷‍♂️

vernal fable
#

yeah, this also proves that whatever safety stuff OAI is doing, it's only making the model dumber

#

people trying to use AI to aid with creative writing should probably try Claude 2 by the way, I've been playing with it for the past few days and the results are so good

#

it's also safety trained, which is a bit of a pain for fiction (Claude gets uppity at the dumbest stuff, so it's not perfect performance by any means), since it's way too sensitive, but their approach at least hasn't ruined the quality of the model yet

valid otter
#

Can you use removepaywall.com to read it.

"High-profile A.I. chatbot ChatGPT performed worse on certain tasks in June than its March version, a Stanford University study found. "

This entire thread is about the June version.

Amazing, and OpenAI still hasn't responded 😂

vernal fable
#

but by June, “for reasons that are not clear,” Zou says, ChatGPT stopped showing its step-by-step reasoning. It matters that a chatbot show its work so that researchers can study how it arrives at certain answers—in this case whether 17077 is a prime number.
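For what it's worth, the question in that example is easy to settle outside the model. A minimal sketch using plain trial division (the function name is just for illustration):

```python
import math


def is_prime(n: int) -> bool:
    """Trial division: test odd divisors up to the integer square root of n."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for divisor in range(3, math.isqrt(n) + 1, 2):
        if n % divisor == 0:
            return False
    return True


print(is_prime(17077))  # True
```

Since the square root of 17077 is just under 131, only odd divisors up to 129 need checking, which is the kind of step-by-step work the quoted passage says the model stopped showing.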

#

this is why it can't do things like create new creative words from concepts or other words for me anymore

#

You need to understand the linguistic reasoning behind it and follow a process: Look at the concepts > look at possible stems > look at suffixes and prefixes > decide which stems and suffixes/prefixes to combine based on the intended meaning > combine them

#

it did that before, now it just throws around random combinations without any display of logic behind them

stuck marsh
#

I feel like they would’ve referenced the specific models…?

#

Tho chatgpt did update on June 27, without any notice that it was a diff version from the “may 22 version”

wise crag
#

as someone who's used the api a lot, i disagree

#

as soon as you put large context through gpt4 the website is far cheaper

vernal fable
#

and now with the 50 message every 3 hours limit the offer on ChatGPT Plus is very good

wise crag
#

i spent over 100 in a month on the api

#

the website is really reasonable

chilly isle
#

i'm glad im not the only one experiencing this! the 16k + 0613 model seems to change its mind about prompts periodically, sometimes during the course of a day the same prompt will stop working. the pattern i see is if the context-window is loaded more (6-10k tokens), the model performs worse.

tough tulip
#

The 16k model is bad on a whole new level. As soon as another big LLM hits the market I'm outta here. Where's the transparency? What are they doing? Why are they doing it? I really hope OAI gives it up. This strategy isn't working. They need to remove whatever embedding they're using that replaced the old system and try something new that doesn't muddle with the output. This concept of "alignment" is misguided and based on treating this LLM like an AGI... it's not. Use a moderation system that flags output if it crosses whatever thresholds you don't want it to pass in a separate process that doesn't affect the original output. Why is that so hard? (relative to everything else that's been done so far ofc)

drowsy mural
#

On the positive side, the cutoff for 0301 has been extended a whole year so that proves they’re listening. Anything could happen in a year…

primal sluice
#

It's very important to OAI that the models are "safe". They need to prevent it from providing problematic content, they need to patch out jailbreaks, they need to make sure it's never mean (even if it's not problematic).

it's gonna have impact on "intelligence", it's a tradeoff they've chosen to make.

Never really liked 3.5+ since it's biased towards being a polite assistant, and I used gpt3 before chatGPT made AI popular. If you use it for anything else than polite chat assistant you're fighting against the intention of the developers. It's not a general LLM model.

The moderation method of changing the LLM itself to not be able to output unwanted content is a very costly one, as opposed to pre-3.5 where the LLM is unmoderated but then a separate moderation model decides if it's acceptable.

What I'm saying is, if you don't want to deal with this, use the GPT3 models, they're still good

drowsy mural
# primal sluice It's very important to OAI that the models are "safe". They need to prevent it f...

I kind of get your point, and I’ll bear it in mind. The extra year does seem to me like they misjudged something though. It’s definitely possible that at the end of the year they’ll just be like “welp just like promised, 0301 and the rest are offline. We are dedicated to making boring corporate helpdesk chatbots, if you want something else then go elsewhere”
It wouldn’t be the first time a promising piece of software gets suckier as time goes on…

tough tulip
#

You can make a safe model without lobotomizing it. I refuse to believe otherwise. Content filtering already exists for human content and it works.

vernal fable
#

Exactly lol look at Claude, it performs so well and it knows when not to overstep boundaries, I use it almost exclusively now

#

What needs to happen is that LLMs get "smart" enough to know when content is actually problematic or not too, because the flags I get on ChatGPT Plus are absolutely ridiculous, I'm working with texts that are rated 13+ and they get flagged!

#

API is less restricted, but I worry that someday I'll get banned for trying to edit a translation that it considers problematic when it's really not, this is a big problem, you can't be judge and jury of the fictional content your users are creating or translating, it reeks of censorship and it's just wrong in most cases

#

again, I'm not even trying to work with some grimdark fantasy novel here, just teenage-grade stuff that tackles complex themes (even though I wouldn't have an AI censor grimdark by any means)

#

tl;dr: There has to be another way, this has to be polished and LLMs have to get smart enough to be able to really discern when something could be actually dangerous

tough tulip
#

I'm still of the opinion that it shouldn't be the LLM itself that decides but instead a completely disconnected model and process that gives a stamp of approval or not. Absolutely, not an embedding or some type of message alignment system, but just a yes/no system. Did the output get flagged? Then abort stream. It doesn't need to be more complex than that.
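A minimal sketch of that idea, assuming the moderation check is a separate yes/no function that never touches the text itself. The `keyword_flagger` below is a deliberately crude, hypothetical stand-in for a real moderation model; only the control flow matters.

```python
from typing import Callable, Iterable, Iterator


def moderated_stream(
    chunks: Iterable[str],
    is_flagged: Callable[[str], bool],
) -> Iterator[str]:
    """Pass chunks through unchanged, stopping as soon as the text so far is flagged."""
    text_so_far = ""
    for chunk in chunks:
        text_so_far += chunk
        if is_flagged(text_so_far):
            return  # abort the stream; the flagged chunk is never emitted
        yield chunk


def keyword_flagger(text: str) -> bool:
    # Hypothetical placeholder check, not a real moderation system.
    return "forbidden" in text.lower()


print(list(moderated_stream(["This is ", "fine."], keyword_flagger)))
print(list(moderated_stream(["ok ", "forbidden ", "more"], keyword_flagger)))
```

The generating model's output is either passed along as-is or cut off; the checker has no way to rewrite or soften it.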

Embeddings should be additive or used when you know exactly what kind of output you want that's really specific. Using them as blankets or to subtract data from being output makes everything worse. (In my opinion, from my experience using LLMs since the start of gpt2)

I'm going to start using Claude more heavily this week. I've heard really good things about it. I don't do a lot of creative writing, just programming, but good explanations are just as valuable as good code.

vernal fable
# tough tulip I'm still of the opinion that it shouldn't be the LLM itself that decides but in...

Do hope Claude can help you, it's been great for me.

I guess the reason why I thought that the model could do it itself, is that after using Claude I realized it works really well. As in, sure, Claude can be annoying with "ethics" too, but more often than not you can reason with it, and be like "Claude, remember we're working with fiction" and it will be like "Oh yeah of course, let's continue as you asked". Its system definitely needs refinement for sure, because sometimes it can get stuck and enter a loop, but at least it doesn't lobotomize the AI

tough tulip
#

From what I've read about Claude, they started with much friendlier training data in general and really baked it in from the start. So hopefully this is a non-issue.

primal sluice
# tough tulip I'm still of the opinion that it shouldn't be the LLM itself that decides but in...

"I'm still of the opinion that it shouldn't be the LLM itself that decides but instead a completely disconnected model and process that gives a stamp of approval or not."
Yeah, me too.
"You can make a safe model without lobotomizing it. I refuse to believe otherwise."
Sure, if you have a separate moderation model. But I think putting the moderation in the model itself (like GPT3.5+) has costs, as discussed in this thread.
Personally, I just hope they do another general, non-assistant-biased model.

tough tulip
#

Oh yeah I totally agree. I'm over this whole helpful assistant gimmick also. I'm down for a simple input/output style helper. I don't need in-text disclaimers and qualifiers. Does anyone actually read the buckets of useless info it generates?

vernal fable
#

Yeah it's ridiculous, and also tbh I don't see the value of AI if it's just a glorified Google assistant

#

I saw someone on Reddit mention that they were downright hurt and offended by the disclaimers when it outputs info about mental health, because GPT goes all like "See a therapist! Talk to friends!" and it's like, do these corporations realize how hurtful saying that to someone struggling with their mental health can be? Lol a lot of people simply can't

#

And GPT4 didn't do that before

#

By the way a message I sent here was just deleted because it contained a perfectly normal mental health related word, good to see how much OAI cares about treating this topic respectfully

#

But what I was trying to say is that I was working with a translation that treated a complex mental health topic completely respectfully the other day and it got filtered, I just wanted to brainstorm with GPT4 to make sure I had all bases covered

They restrict content under the banner of safety, but it does more harm than good. Censorship is bad.

#

Claude did it for me though

drowsy mural
vernal fable
#

Yes, big corpo stuff masquerading as safety, and I'm tired of being treated like a child by tech companies

muted cedar
#

Hi, has OpenAI addressed this yet? I am considering creating some automations on my side that periodically benchmark the models to flag significant deviations from previous outputs
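Roughly what I have in mind, as a sketch only: keep a saved baseline of prompt/output pairs, re-run the prompts on a schedule, and flag any prompt whose new output drifts too far from the saved one. `flag_deviations`, the 0.8 threshold, and the `run_model` callable are all placeholders I made up; `run_model` would wrap whatever API client you use, ideally at temperature 0 so the comparison isn't swamped by sampling noise.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def flag_deviations(
    baseline: Dict[str, str],
    run_model: Callable[[str], str],
    threshold: float = 0.8,
) -> List[str]:
    """Return the prompts whose fresh output is less similar than `threshold`."""
    flagged = []
    for prompt, old_output in baseline.items():
        new_output = run_model(prompt)
        # Character-level similarity in [0, 1]; crude, but dependency-free.
        similarity = SequenceMatcher(None, old_output, new_output).ratio()
        if similarity < threshold:
            flagged.append(prompt)
    return flagged


if __name__ == "__main__":
    saved = {"p1": "hello world", "p2": "the quick brown fox"}
    # Fake model: unchanged on p1, completely different on p2.
    fake_model = {"p1": "hello world", "p2": "12345"}.get
    print(flag_deviations(saved, fake_model))  # ['p2']
```

String similarity only catches outputs that changed, not whether they got better or worse, so a real suite would want task-specific checks on top of this.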

#

Don’t want to bother with all of that though if they are going to explain themselves soon or already have lol

#

I’ve been working with the API for a while and it’s hard to say for sure if the same intelligence nerf is there or not. I have seen some strange “dumbness”, which seemingly comes out of nowhere, but it’s pretty rare.

I’d guess that this is not nefarious on OpenAI’s part but likely a result of small changes to a complicated system leading to unintended/undefined behaviors

tough tulip
# muted cedar Hi, has open AI addressed this yet? I am considering creating come automations o...

The best we have is a tweet from the product vp saying thousands of people are just "expecting more" from the model after using it for a while. Total BS. I'm very interested in a set of evaluations that track over time if you're making something like that. There's been some other projects that are attempting to do that as well (i haven't seen anything tangible yet tho), but the more the merrier. Maybe it'll knock some sense into oai about their product performance.

#

Also, fwiw, we still have access to the old model for some time if you wanted to do some testing on the api checkpoints.

#

The problem with that though, is, ChatGPT is likely a set of embeddings added onto the plain old gpt4 api. Either that, or other submodels being activated in certain cases. Those are likely the things everyday ChatGPT users are experiencing a performance decline with (specifically in regard to moderation). That is, assuming oai isn't just plain lying to us about them not changing the model. They don't tell us anything, really.

tough tulip
#

Just some additional speculation: soon after the nerf, they filed a trademark application for gpt5. My guess is that they're going to fast-path gpt5 training after the moderation nerf, to combat bad press and competition from claude2, despite previously stating that gpt5 was not being worked on at all. I think their priorities have shifted. My guess is they're going to brute force the problem away, and they're going to keep the nerf so they don't get sued in the meantime.

muted cedar
#

Good to know all this context, thank you. I think I will forge ahead and try hacking together an OSS test suite in the next couple of weeks. I’ll go look ofc, but if you have any good pointers on what has been tried already, I’m all ears
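One check such a suite could start with is the #phrase# instruction from the top of this thread, since it can be scored mechanically. This is a sketch under my own assumptions: `follows_hash_instruction` and `pass_rate` are names I made up, and the sample strings are hand-written placeholders, not real model outputs.

```python
from typing import Iterable


def follows_hash_instruction(output: str, phrase: str) -> bool:
    """True if the whole phrase appears verbatim inside # tags, not at the start."""
    wrapped = f"#{phrase}#"
    return wrapped in output and not output.lstrip().startswith(wrapped)


def pass_rate(outputs: Iterable[str], phrase: str) -> float:
    """Fraction of outputs that follow the instruction (0.0 for no outputs)."""
    results = [follows_hash_instruction(o, phrase) for o in outputs]
    return sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    phrase = "here is the best X"
    # Hand-written placeholder outputs, grouped by a made-up checkpoint label.
    samples = {
        "checkpoint_a": [
            "I looked around and #here is the best X# I could find.",
            "After some digging, #here is the best X# for beginners.",
        ],
        "checkpoint_b": [
            "#here is the best X# that money can buy.",   # phrase at the start
            "Look no further, here is the #best X# ever.",  # phrase only partly wrapped
        ],
    }
    for name, outputs in samples.items():
        print(name, pass_rate(outputs, phrase))
```

Run the same prompt N times per checkpoint, feed the collected outputs in, and track the pass rate over time; a drop between checkpoints is a concrete number instead of a vibe.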