Document-as-Image Representations Fall Short for Scientific Retrieval
arXiv, 2026
ArXivDoc: an 8,210-document open-domain scientific retrieval benchmark showing that text-based representations outperform document-as-image approaches. Contributed the data-annotation filtering that distilled 4,059 candidates to 547 evidence-grounded queries.
