Hi everyone, I am Glenn Matlin, PhD, CS at Georgia Tech. This is an open Call for Participation (CFP). Our team is collaborating on a project featuring AI2's models and datasets, and I am looking for community members interested in contributing time, resources, or oversight. The goal of this work is to study generalization across multiple model families, so we are interested in using Marin, K2-V2, etc.
The project conducts training data attribution (TDA) on a pre-training corpus at billion-token scale to understand which types of documents help an LLM learn how humans reason and think about complex social patterns, and how humans understand and relate to each other. More details (e.g., the proposal document, code repo, early results, and working copies of the manuscripts) are available upon request.
As anyone familiar with TDA knows, it is expensive and complicated. We have both the expertise and the resources to execute, and we are looking for people who can help out immediately.
What we need
We are targeting COLM 2026 (deadline March 31). This is already a fast-moving, execution-focused collaboration, not a project starting from scratch. We have a detailed project plan, an active Linear board, and a working manuscript.
The goal of this CFP is to expand the scope and impact of this project with the help of collaborators like you. We are looking for contributors who can take ownership of concrete tasks immediately in one or more of the following areas: