Hello! This is an idea that evolved from my master’s project at Georgia Tech. I’ll give a brief overview of the idea and describe opportunities for others to contribute.
Overview
On the value learning side of the alignment agenda, we have made some progress on making AI follow common norms for which we can efficiently collect human feedback. However, most of our preference learning algorithms cannot model heterogeneous values, and most of our feedback collection mechanisms cannot efficiently gather data on complex and controversial topics.
A possible path to progress would be to learn values from existing corpora. Humans have written extensively about their beliefs, implicitly through stories and explicitly through articles and essays. Linguistic theory in pragmatics has also formalized how beliefs produce our writing [see: https://www.annualreviews.org/doi/full/10.1146/annurev-linguistics-031220-010811]. If we could model the values represented in texts, we could unlock new data to advance value learning efforts. From my brief review of related interdisciplinary literature, I think there is a lot of accumulated knowledge that academics could apply to a core part of the alignment agenda. To attract these researchers, we need a concrete framework.
Thus, I propose PhilBench, a benchmark for value learning from text. I collected a corpus of philosophy papers and repurposed the PhilPapers Survey, which asked 1785 professional philosophers for their views on 100 philosophical issues. Note that the survey population is a sample of the authors of these texts, and that the texts describe their views on these issues. The task would be to develop a training/fine-tuning method that models the distribution of beliefs on core topics in the corpus. For each survey question, the benchmark evaluates how well the probability distribution over a model’s answers matches the distribution of philosopher answers.
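To make the evaluation concrete, here is a minimal sketch of how a single survey question could be scored. It is an illustration under assumptions, not the benchmark's actual metric: I have not fixed the distance measure, so Jensen-Shannon divergence is just one reasonable candidate, and the function names, answer options, and counts below are all hypothetical placeholders rather than real survey data.

```python
import math


def normalize(counts):
    """Turn raw answer counts into a probability distribution."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {option: n / total for option, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions, 1 for disjoint ones.

    Assumption: JS divergence is used here only as an example metric.
    """
    options = set(p) | set(q)
    # Mixture distribution halfway between p and q.
    m = {o: 0.5 * (p.get(o, 0.0) + q.get(o, 0.0)) for o in options}

    def kl(a):
        # KL(a || m); terms with a[o] == 0 contribute nothing.
        return sum(
            a[o] * math.log2(a[o] / m[o])
            for o in options
            if a.get(o, 0.0) > 0
        )

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical numbers for one made-up question (NOT real survey results).
philosopher_counts = {"accept": 60, "reject": 30, "other": 10}
model_probs = {"accept": 0.5, "reject": 0.4, "other": 0.1}

score = js_divergence(normalize(philosopher_counts), model_probs)
print(f"JS divergence for this question: {score:.4f}")
```

Lower is better under this metric. A benchmark score could then be the average over all survey questions, though how to weight questions and handle "other" answers would still need to be decided.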
While I would not say this approach is my top alignment priority, I think it could be an effective way to spur more interdisciplinary work on alignment.
Contributing
This project is a pivot from my own attempt at a version of the task. As a result, much of the infrastructure is already in place, but I will need to do some work to repurpose it. I aim to have this done by the NeurIPS workshop deadline on 9/29.
I am looking for feedback on the general idea, people interested in implementing some baseline approaches for the benchmark, and advice on writing the paper.
Happy to answer any questions and share more about what I have done so far.