#Understanding How LLMs Understand Humans: Training Data Attribution for Social Reasoning

west meteor
#

Hi everyone, I am Glenn Matlin, PhD, CS at Georgia Tech. This is an open Call for Participation (CFP). Our team is collaborating on a project featuring AI2's models and datasets, and I am looking for community members who can contribute time, resources, or oversight. The goal of this work is to study generalization across multiple model families, so we are interested in using Marin, K2-V2, etc.

The project conducts training data attribution (TDA) on a pre-training corpus at billion-token scale to understand which types of documents help an LLM learn how humans reason about complex social patterns and how they understand and relate to each other. More details (the proposal document, code repo, early results, working copies of the manuscripts, etc.) are available upon request.

If you know anything about TDA, you will know it is expensive and complicated. We have both the expertise and the resources to execute. We are looking for people who can immediately help out.

What we need

We are targeting COLM 2026 (deadline March 31). This is already a fast-moving, execution-focused collaboration, not something starting from scratch. We have a detailed project plan, an active Linear board, and a working manuscript.

The goal of this CFP is to expand the scope and impact of this project with the help of collaborators like you. We are looking for contributors who can take ownership of concrete tasks immediately in one or more of the following areas:

Compute & infrastructure: If you have done research on multi-GPU clusters (i.e., 8 to 64 H100s/H200s) and have experience with large-scale model inference, data attribution, or continual pre-training, we would like to hear from you.

Cross-model generalization: Our core pipeline runs on OLMo3/Dolma. We want to replicate findings on Marin (8B/32B), K2-V2, or other open models developed on transparent corpora. If you maintain or have deep familiarity with one of these model families and can run the same attribution pipeline on your models, that's an ideal fit.

Intervention experiments: We're planning mixture reweighting and targeted overtraining experiments. If you have experience with data-mixture fine-tuning or continued pretraining recipes and can execute experiments on a short timeline, this is a high-impact contribution.

How to get involved

If you have relevant skills and can commit meaningful time, please get in touch with me directly at [email protected] with:

  1. Your CV/Website/GitHub/GScholar
  2. Which area(s) above you can contribute to
  3. What compute resources you can bring (if applicable)
  4. Your availability in hours/week
  5. Anything else about your coding style and research interests

We can share the proposal document, working manuscript, code repo, and Linear board with serious contributors. Co-authorship follows standard contribution norms: all substantive contributions count toward authorship.

⚠️ Finally, for early-career students, please note: this project is not a good learning opportunity or a reading group. We have a complete project plan and a hard deadline. We are looking for people who can pick up well-specified tasks and deliver results independently.

worn pond
#

@west meteor Can you rename this post to something that conveys more information? "Open Call for Participation on EleutherAI Project" is pretty redundant given that you posted it here.

west meteor
#

It's true, I can't deny it

#

Understanding How LLMs Understand Humans: Training Data Attribution for Social Reasoning

violet spruce
bold mango
#

emailed @west meteor