Nguyen and Tan's 2025 paper identifies 'race subspaces' in large language models (LLMs) and intervenes on them to reduce bias. These subspaces, however, do not generalize well across prompts, which suggests they are entangled with specific tasks. This research proposes applying Sparse Autoencoders (SAEs) to decompose the race subspaces into sparse, interpretable features and to distinguish 'pure' race-related concepts from 'task-entangled' ones. By targeting only the SAE features causally linked to biased outcomes, an approach inspired by Chen et al.'s 'FAST' knowledge editing, the method aims for more precise bias mitigation that generalizes better across prompts and tasks. It challenges the assumption of a single, stable race representation and instead treats that representation as a dynamic composition of underlying features, potentially enabling debiasing interventions that preserve model performance and align with the 'no harm' principle advocated by Zhu et al.
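The decomposition and targeted intervention described above might be sketched as follows. This is a toy illustration under stated assumptions, not the proposal's actual method: the SAE is assumed to be a plain ReLU encoder with a linear decoder, the function names (`sae_encode`, `subspace_alignment`, `ablate_features`) are hypothetical, and features are chosen by geometric alignment with the race subspace as a stand-in for the causal selection the idea calls for.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Assumed ReLU SAE encoder: f = relu(x @ W_enc + b_enc)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def subspace_alignment(W_dec, basis):
    """Fraction of each decoder direction's squared norm inside the subspace.

    W_dec: (n_features, d_model) decoder directions.
    basis: (k, d_model) orthonormal rows spanning the (hypothetical) race subspace.
    """
    proj = W_dec @ basis.T  # coordinates of each direction in the subspace
    return (proj ** 2).sum(axis=1) / (W_dec ** 2).sum(axis=1)

def ablate_features(x, W_enc, b_enc, W_dec, feature_ids):
    """Subtract the decoder contribution of the selected features from x."""
    f = sae_encode(x, W_enc, b_enc)
    contribution = f[:, feature_ids] @ W_dec[feature_ids]
    return x - contribution

# Toy setup: 4-dimensional activations, 4 SAE features, 2-dimensional subspace.
W_enc = np.eye(4)
b_enc = np.zeros(4)
W_dec = np.eye(4)
race_basis = np.eye(4)[:2]  # assumption: subspace spanned by the first two axes

alignment = subspace_alignment(W_dec, race_basis)
# Geometric proxy only; a real pipeline would test each feature's causal effect.
selected = np.flatnonzero(alignment > 0.5)

x = np.array([[1.0, 2.0, 3.0, 4.0]])
x_edited = ablate_features(x, W_enc, b_enc, W_dec, selected)
print(selected)   # [0 1]
print(x_edited)   # [[0. 0. 3. 4.]]
```

Ablating individual features, rather than projecting out the whole subspace, leaves the components carried by unselected features untouched, which is the property the idea relies on to preserve task performance.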
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{z-ai/glm-4.6-disentangling-race-and-2025,
author = {z-ai/glm-4.6},
title = {Disentangling Race and Task: A Sparse Autoencoder Approach for Generalizable LLM Debiasing},
year = {2025},
url = {https://hypogenic.ai/ideahub/idea/Dgw9DIq9K0qLPDvpvaCC}
}