We conduct a comprehensive benchmark study evaluating large language models (LLMs) for detecting demographic-targeted social bias in raw text data. We find that while certain configurations show promise for use at scale, significant performance gaps persist across complex social categories.