News

Apr 22, 2026 New paper at the KG-LLM Workshop (LREC 2026): A Wikidata-Based Framework to Measure Cross-Lingual Bias in Multilingual Large Language Models. We introduce WILA-PopQA, a popularity-matched multilingual benchmark across 9 languages, and disentangle three factors that multilingual probing benchmarks usually confound: the language of the question, the language of the entity, and entity popularity. Across 12 open-weight LLMs, the language of the question turns out to be the dominant factor, and matching it to the entity’s language does not reliably improve factual recall.
Sep 10, 2025 Our paper on the robustness of deductive reasoning with LLMs was accepted at ECAI 2025; presentation coming soon. See the Robustness paper entry on the publications page. Short description: We study how small variations in prompts and inputs affect deductive reasoning, analyze common failure modes, and outline an evaluation setup for robustness.
Sep 10, 2025 Published a short paper at ACL 2025: ChronoSense. See the ChronoSense entry on the publications page. Short description: ChronoSense evaluates temporal understanding in large language models using relations between event time intervals (e.g., Allen relations), highlighting current gaps in interval reasoning.