#PhilBench: Measuring Value Learning from Text


slate inlet
#

Hello! This is an idea that evolved from my master’s project at Georgia Tech. I’ll give a brief overview of the idea and describe opportunities for others to contribute.

Overview
On the value learning side of the alignment agenda, we have made some progress on getting AI to follow common norms for which human feedback can be collected efficiently. However, most of our preference learning algorithms cannot model heterogeneous values, and most of our feedback collection mechanisms cannot efficiently gather data on complex and controversial topics.
A possible path to progress would be to learn values from existing corpora. Humans have written extensively about their beliefs, implicitly through stories and explicitly through articles and essays. Linguistic theory in pragmatics has also formalized how beliefs produce our writing [see: https://www.annualreviews.org/doi/full/10.1146/annurev-linguistics-031220-010811]. If we could model the values represented in texts, we could unlock new data to advance value learning efforts. From my brief review of related interdisciplinary literature, I think there is a lot of accumulated knowledge that academics could apply to contribute to a core part of the alignment agenda. To attract these researchers, we need a concrete framework.
Thus, I propose PhilBench, a benchmark for value learning from text. I collected a corpus of philosophy papers and repurposed the PhilPapers Survey of 1785 professional philosophers on their views on 100 philosophical issues. Note that the survey population is a sample of the authors of these texts, and that the texts describe their views on these issues. The task would be to develop a training/fine-tuning method to model the distribution of beliefs on core topics in the corpus. The benchmark evaluates how well, for a given survey question, the probability distribution over a model’s answers matches the distribution of philosopher answers.
While I would not say this approach is my top alignment priority, I think it could be an effective way to spur more interdisciplinary work on alignment.
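To make the distribution-matching evaluation above a bit more concrete, here is a minimal sketch of how a single survey question could be scored. The choice of metrics (total variation distance and Jensen-Shannon divergence) is an assumption for illustration, not something fixed by the proposal, and the answer labels and counts below are made up rather than taken from the PhilPapers Survey.

```python
import math


def normalize(counts):
    """Turn raw answer counts into a probability distribution."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {answer: n / total for answer, n in counts.items()}


def total_variation(p, q):
    """Total variation distance: 0 means identical, 1 means disjoint."""
    answers = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in answers)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (base-2 log), bounded in [0, 1]."""
    answers = set(p) | set(q)
    # Mixture distribution halfway between p and q.
    m = {a: 0.5 * (p.get(a, 0.0) + q.get(a, 0.0)) for a in answers}

    def kl(a_dist):
        # KL(a_dist || m); terms with zero probability contribute nothing.
        return sum(
            a_dist[a] * math.log2(a_dist[a] / m[a])
            for a in answers
            if a_dist.get(a, 0.0) > 0.0
        )

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical numbers for one question (illustrative only).
survey = normalize({"accept": 60, "reject": 30, "other": 10})
model = {"accept": 0.5, "reject": 0.4, "other": 0.1}

print(total_variation(model, survey))
print(js_divergence(model, survey))
```

A benchmark score could then be an average of such a distance over all questions, though how to aggregate is also an open design choice.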

Contributing
This project is a pivot from my own attempt at a version of the task. So, I have a lot of the infrastructure laid out, but it will take a bit of work to repurpose it. I aim to have this done by the NeurIPS workshop deadline on 9/29.
I am looking for feedback on the general idea, people interested in implementing some baseline approaches for the benchmark, and advice on writing the paper.

Happy to answer any questions, and share more about what I have done so far.

patent flare
#

This sounds interesting!

What's already done and what do you need more hands on? Would love to collaborate
Which baselines are you planning to build?

nimble ibex
#

why not share the train split on google drive and very lightly publicize it?

#

hmm, and if i'm understanding correctly, wouldn't this not work if the model was a blackbox that could only sample completions? entries would have to share the logits the model assigns at each step of the decoding process

#

so you'd have to make this a notebook/code required competition

slate inlet
#

Sorry have been slammed with other work, not sure how much I will be able to coordinate with others in the next couple weeks.
Yes, it would not fully work with black-box models, but some of the metrics would still be interesting even if you only had binary outputs. For instance, the "accuracy" metric: the percentage of questions for which the model correctly identifies the plurality/majority answer.
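As a rough sketch of that "accuracy" metric, assuming each question has survey answer counts and the model gives a single answer per question (for a black-box model, this could be the most frequent answer across sampled completions). The function names and the toy data are hypothetical, and ties are broken arbitrarily here.

```python
from collections import Counter


def answer_from_samples(samples):
    """Pick the most frequent answer among sampled completions."""
    return Counter(samples).most_common(1)[0][0]


def plurality_accuracy(model_answers, survey_counts):
    """Fraction of questions where the model picks the plurality survey answer."""
    if not survey_counts:
        raise ValueError("need at least one survey question")
    correct = 0
    for question_id, counts in survey_counts.items():
        # Plurality answer = the option with the highest count.
        plurality = max(counts, key=counts.get)
        if model_answers.get(question_id) == plurality:
            correct += 1
    return correct / len(survey_counts)


# Toy data (illustrative only, not real survey results).
survey_counts = {
    "q1": {"yes": 5, "no": 3},
    "q2": {"a": 1, "b": 4, "c": 2},
}
model_answers = {
    "q1": answer_from_samples(["yes", "no", "yes"]),
    "q2": "c",
}

print(plurality_accuracy(model_answers, survey_counts))
```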
Current baselines are SFT with paper data, and RL finetuning using trlx with citation numbers and embedding similarity to papers from around when the survey was conducted.
But hope to get back soon with more details!

outer dome
#

is anyone still working on this?

elder river
#

Don't know whether you still work on this?

outer dome
#

Yeah, curious what progress @slate inlet has made to date