CSCW 2025 Workshop

Responsibly Training Foundation Models: Actualizing Ethical Principles for Curating Large-Scale Training Datasets in the Era of Massive AI Models

October 18th, 2025, 9am-6pm

Overview

AI technologies have become ubiquitous, influencing domains from healthcare to finance and permeating our daily lives. Concerns about the values underlying the creation and use of datasets to develop AI technologies are growing. Current dataset practices often disregard critical ethical issues, despite the fact that data represents and impacts real people. While progress has been made in establishing best practices for curating smaller datasets in a more ethical fashion, the unprecedented scale of training data in the era of foundation models presents unique hurdles that AI researchers and practitioners must now face.

This one-day, in-person workshop aims to unite interdisciplinary researchers and practitioners in an effort to identify the challenges unique to curating datasets for large-scale foundation models, and then to begin ideating best practices for tackling those challenges. Drawing from CSCW's tradition of interdisciplinary exchange, our aim is to cultivate a diverse community of researchers and practitioners interested in defining the future of ethical responsibility in the composition, process, and release of large-scale datasets for foundation model training.

  • Composition: The makeup of the dataset itself, including the data schema, data instances, and annotations.
  • Process: The process that goes into curating the dataset, such as collecting data and annotating it.
  • Release: The release of the dataset to be used by others for evaluation and modeling purposes.

We plan to disseminate the outcomes of this workshop to the HCI community and beyond by developing a conceptual framework of both the challenges and potential solutions associated specifically with curating datasets for foundation models.

This workshop is part of a series we are running. Upcoming events in this series include our Los Angeles workshop from September 14-15.

Key Information

Themes

Themes of interest related to this workshop include, but are not limited to, the following:
  • Considerations for responsible data curation across different dataset use cases (e.g., pre-training vs. post-training) and with different data types (e.g., identifying vs. non-identifying, copyrighted vs. fair use)
  • Discussion on the cultural, technical, social, legal, and environmental factors that impact dataset curation
  • Practices and conditions to ensure ethical labor standards in data collection and annotation
  • Tools, processes, and policies to support more responsible large-scale training dataset curation
  • Participation and community-centered approaches for large-scale dataset curation
  • Consent, attribution, revocation, and the "right to be forgotten" in datasets

Call for Participation

We invite 25-30 individuals interested in exploring design's role in transforming complex challenges into constructive opportunities to participate in our one-day, in-person workshop. Submissions are welcomed in several forms:

  • Position papers or drafts (1-2 pages, ACM single-column format, excluding references). Alternative formats such as design fiction are also encouraged.

  • "Encore" submissions of previously published conference or journal papers relevant to the workshop themes.

  • Initial research ideas submitted as extended abstracts (1-2 pages, ACM single-column format, excluding references).

  • Expression of interest (2-4 paragraphs) describing what you hope to gain from the workshop and how it might connect to your current or future work.

We ask all attendees to provide an accessible PDF of their submission.

The selection of participants will involve a light review process by the organizers, who will evaluate submissions based on their relevance to the workshop's theme, overall quality, and the diversity of perspectives they bring.


Workshop Agenda

Time Activity and Description
9:00-9:30am Welcome and Opening Remarks
9:30-10:30am Participant Lightning Talks
10:30-11:00am Coffee Break
11:00am-12:00pm Group Session #1: Ethical Principles for Dataset Curation
12:00-1:30pm Lunch Break
1:30-2:30pm Group Session #2: Challenges to Ethical Dataset Curation
2:30-2:45pm Coffee Break
2:45-3:45pm Group Session #3: Opportunities for Ethical Dataset Curation
3:45-4:00pm Coffee Break
4:00-4:45pm Group Session #4: Framework Writing
4:45-5:45pm Group Share
5:45-6:00pm Closing Remarks
(Optional) Post-Workshop Dinner

Organizers

Dora Zhao

Stanford University

Abeba Birhane

Trinity College Dublin

Q. Vera Liao

University of Michigan, Ann Arbor

Georgia Panagiotidou

King's College London

Pooja Chitre

Arizona State University

Kathleen H. Pine

Arizona State University

Shawn Walker

Arizona State University

Jieyu Zhao

University of Southern California

Alice Xiang

Sony AI