| Apr 22, 2026 | New paper at the KG-LLM Workshop (LREC 2026): A Wikidata-Based Framework to Measure Cross-Lingual Bias in Multilingual Large Language Models. We introduce WILA-PopQA, a popularity-matched multilingual benchmark across 9 languages, and disentangle three factors that multilingual probing benchmarks usually confound: the language of the question, the language of the entity, and entity popularity. Across 12 open-weight LLMs, the language of the question turns out to be the dominant factor, and matching it to the entity’s language does not reliably improve factual recall. |
| Sep 10, 2025 | Our paper on the robustness of deductive reasoning with LLMs was accepted at ECAI 2025; presentation coming soon. See the robustness paper entry on the publications page. Short description: We study how small prompt and input variations affect deductive reasoning, analyze common failure modes, and outline an evaluation setup for robustness. |
| Sep 10, 2025 | Published a short paper at ACL 2025: ChronoSense. See the ChronoSense entry on the publications page. Short description: ChronoSense evaluates temporal understanding in large language models using relations between event time intervals (e.g., Allen relations), highlighting current gaps in interval reasoning. |