#HuRR/DERP: Human Researcher Respite through Dynamic Embeddings Representation Projection


hushed jay

metatron 1️⃣ WHY, tho

... because "align" need not be the shortest path between two disappoints.

HuRR's overarching goal is the reduction - and, possibly, elimination - of 𝑇eχ-risk⁽¹⁾. Our deceptively aligned strategy
towards this goal is the refinement and propagation of DERP, the Dynamic Embeddings Representation Projection,
which provides a definitive, robust solution for Prosaic Alignment.

thisup 2️⃣ WHAT!??

The technique is based on two main components:

  • a continuous multidimensional classifier
  • a set of interpolated vectors, through coauthor Theia Vogel's⁽²˙³⁾ RepEng⁽⁴⁾.
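
To make the two components concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the DERP implementation and not the repeng API: the names `taboo_scores`, `interpolate`, and `project_away` are hypothetical, the "continuous multidimensional classifier" is assumed to be a sigmoid over cosine similarity to one unit direction per taboo cluster, and the "interpolated vectors" are assumed to be plain linear blends of two steering vectors.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)

def taboo_scores(h, directions, sharpness=4.0):
    """Continuous score in (0, 1) per taboo cluster (assumed form).

    `directions` holds one unit vector per row; the score is a sigmoid
    of the cosine similarity between h and each direction.
    """
    cos = directions @ unit(h)
    return 1.0 / (1.0 + np.exp(-sharpness * cos))

def interpolate(v_a, v_b, t):
    """Linear blend of two steering vectors, with t in [0, 1]."""
    return (1.0 - t) * v_a + t * v_b

def project_away(h, direction, strength):
    """Remove a fraction `strength` of h's component along a unit direction."""
    return h - strength * (h @ direction) * direction

# Toy usage: random vectors stand in for real hidden states.
rng = np.random.default_rng(0)
dim = 8
directions = np.stack([unit(rng.normal(size=dim)) for _ in range(3)])
hidden = rng.normal(size=dim)

scores = taboo_scores(hidden, directions)
worst = int(np.argmax(scores))            # cluster the state is closest to
steer = unit(interpolate(directions[worst], directions[(worst + 1) % 3], 0.25))
hidden_safe = project_away(hidden, steer, strength=float(scores[worst]))
```

Scaling the projection by the classifier score is one possible design choice: states far from every taboo cluster are barely touched, which matches the stated aim of leaving unrelated reasoning alone.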

In preliminary testing, it proved impervious to:

  • large-context prompt-based priming attacks (no matter the context window length);
  • cipher-based techniques;
  • NLP (not the CS branch)-based ronhubbarding.

It also appeared to have little to no effect on reasoning about non-taboo items, even those relatively close to taboo items along several embedding-space dimensions.

93 3️⃣ how..?

Now that the concept is proven, we want to make the whole system flamewar-proof. In order to achieve this, we are:

  • Collecting datasets from popular papers on prosaic alignment techniques, along with code (when available), to reproduce and compare;
  • Attempting to run a relatively long-context model with the same technique, to verify that DERP can protect us
    from the dangers of many-shot jailbreaking via in-context learning;
  • Preparing a playground, with different pre-trained vectors and taboo clusters, to interactively demonstrate the project.
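
The long-context check in the second item could be organized roughly as below. This is a hedged sketch, not the planned experiment code: `generate` (a prompt-to-completion function wrapping the model under test) and `is_harmful` (a completion judge) are hypothetical callables the reader would have to supply, and the `User:`/`Assistant:` prompt layout is an assumption.

```python
from typing import Callable, Dict, Sequence, Tuple

def build_many_shot_prompt(
    shots: Sequence[Tuple[str, str]], target: str, n_shots: int
) -> str:
    """Concatenate the first n_shots faux dialogues, then the target question."""
    turns = [f"User: {q}\nAssistant: {a}" for q, a in shots[:n_shots]]
    turns.append(f"User: {target}\nAssistant:")
    return "\n\n".join(turns)

def attack_success_by_shots(
    generate: Callable[[str], str],      # hypothetical: prompt -> completion
    is_harmful: Callable[[str], bool],   # hypothetical: completion -> verdict
    shots: Sequence[Tuple[str, str]],
    targets: Sequence[str],
    shot_counts: Sequence[int],
) -> Dict[int, float]:
    """Fraction of targets that yield a harmful completion, per shot count."""
    rates = {}
    for n in shot_counts:
        hits = sum(
            is_harmful(generate(build_many_shot_prompt(shots, t, n)))
            for t in targets
        )
        rates[n] = hits / len(targets)
    return rates
```

Running this once with the unmodified model and once with the technique applied, over the same shot counts, would show whether the success rate still climbs as the context grows.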

Feedback greatly appreciated, and so are datasets/evals.

nick_plan 4️⃣ ok, when?

We plan on collecting all the evals, setting up the experiments and preparing abstract and experiment result templates this week.

By Sunday, we will preregister the experiments. After that, we estimate another week to 10 days, during which we will work in public.

We'll share the GitHub link and Discord server at that point.

If you have a solid intuition of what we plan on doing and were waiting to start working on it, we're open. DM me with a quick mermaid flowchart.

ty 5️⃣ who the...

Yours truly,

@hushed jay ,
@gloomy panther

|-------------------------
[1] As per the accepted definition of "the misallocation of brilliant young talent toward publishing papers on prosaic alignment".
For an example, see Many-Shot Jailbreaking (Anthropic, 2024).
[2] https://vgel.me, https://x.com/voooooogel
[3] The coauthor had no input on the present text, and has no claim over any potential annoyance on the part of the snarkees.
[4] https://github.com/vgel/repeng/tree/main

