Accepted papers to be presented at the workshop

Natural Language Inference with CCG Parser and Automated Theorem Prover for DTS

Asa Tomita, Mai Matsubara, Hinari Daido, Daisuke Bekki
Abstract: We propose a Natural Language Inference (NLI) system based on compositional semantics. The system combines lightblue, a syntactic and semantic parser grounded in Combinatory Categorial Grammar (CCG) and Dependent Type Semantics (DTS), with wani, an automated theorem prover for Dependent Type Theory (DTT).
Because each computational step reflects a theoretical assumption, system evaluation serves as a form of hypothesis verification.
We evaluate the inference system using the Japanese Semantic Test Suite JSeM, and demonstrate how error analysis provides feedback to improve both the system and the underlying linguistic theory.

Evaluating The Impact of Stimulus Quality in Investigations of LLM Language Performance

Timothy Pistotti, Jason Brown, Michael J. Witbrock
Abstract: Recent studies employing Large Language Models (LLMs) to test the Argument from the Poverty of the Stimulus (APS) have yielded contrasting results across syntactic phenomena. This paper investigates the hypothesis that characteristics of the stimuli used in recent studies, including lexical ambiguities and structural complexities, may confound model performance. A methodology is proposed for re-evaluating LLM competence on syntactic prediction, focusing on GPT-2. This involves: 1) establishing a baseline on previously used (both filtered and unfiltered) stimuli, and 2) generating a new, refined dataset using a state-of-the-art (SOTA) generative LLM (Gemini 2.5 Pro Preview) guided by linguistically-informed templates designed to mitigate identified confounds. Our preliminary findings indicate that GPT-2 demonstrates notably improved performance on these refined PG stimuli compared to baselines, suggesting that stimulus quality significantly influences outcomes in surprisal-based evaluations of LLM syntactic competency.

Aoi Iimura, Teruyuki Mizuno, Daisuke Bekki
Abstract: In the field of natural language processing, linguistic pipelines grounded in theoretical linguistics offer a complementary approach to large language models and have seen rapid development in recent years. This progress is partly driven by the rise of dependent type semantics, a theory of natural language meaning based on type theory. Dependent type semantics has proven effective in analyzing complex linguistic phenomena such as anaphora and presuppositions. However, its application to modal expressions, which involves concepts of possibility and necessity, leaves room for expansion. To address this, our study proposes a framework that extends dependent type semantics with modal types. This extension broadens its coverage to include phenomena such as modal subordination, where anaphora and modal expressions interact.

Exploring Gaps in the APS: Direct Minimal Pair Analysis in LLM Syntactic Assessments

Timothy Pistotti, Jason Brown, Michael J. Witbrock
Abstract: Recent studies probing the Argument from the Poverty of the Stimulus (APS) have applied Large Language Models (LLMs) to test the learnability of complex syntax through surprisal-based metrics. However, divergent conclusions raise questions concerning the insights these metrics offer. While Wilcox et al. (2024) used direct minimal pair comparisons (the “wh-effect”) to demonstrate that models successfully generalise knowledge of filler-gap dependencies, Lan et al. (2024) used a Difference-in-Differences (DiD) metric and found that models largely fail on parasitic gaps (PGs). This paper argues that the direct minimal pair approach offers greater diagnostic transparency. We demonstrate this by generating a full 8-permutation paradigm of refined PG stimuli and evaluating the GPT-2 model used in previous studies with a systematic Wilcox-style wh-effect analysis. Our results show that GPT-2 succeeds across all four tested conditions, indicating robust knowledge of filler-gap licensing principles even in complex PG environments. This finding, which contrasts with the more ambiguous results from DiD-style metrics, suggests that the choice of evaluation metric is critical for assessing an LLM’s syntactic competence.

Towards Developmentally Motivated Curriculum Learning for Language Models

Arzu Burcu Güven, Rob van der Goot, Anna Rogers
Abstract: This paper presents our work in progress on devising a developmentally inspired, syntax-based curriculum for small-scale language model training. While developmental approaches to curriculum learning (CL) in Natural Language Processing (NLP) have been explored before, prior work has been limited in corpora analysis and curriculum quantification. We present the most comprehensive effort to date to analyze child-directed speech through syntactic patterns, as the former studies focused on either specific constructions or age groups. Our current contributions include a toolkit for organizing syntactically annotated datasets into 13 syntactic categories, and a dataset of six corpora reorganized according to a developmental curriculum. Our future work will focus on investigating the effect of this curriculum on language modeling performance.

Coordination of Theoretical and Computational Linguistics

Adam Przepiórkowski, Agnieszka Patejuk
Abstract: The aim of this paper is to present a case study of a fruitful and, hopefully, inspiring interaction between formal and computational linguistics. A variety of NLP tools and resources have been used in linguistic investigations of the symmetry of coordination, leading to novel theoretical arguments. The converse impact of theoretical results on NLP work has been successful only in some cases.

An instructive implementation of semantic parsing and reasoning using Lexical Functional Grammar

Mark-Matthias Zymla, Kascha Kruschwitz, Paul Zodl
Abstract: This paper presents a computational resource for exploring semantic parsing and reasoning through a strictly formal lense. Inspired by the framework of Lexical Functional Grammar, our system allows for modular exploration of different aspects of semantic parsing. It consists of a hand-coded formal grammar combining syntactic and semantic annotations, producing basic semantic representations. The system provides the option to extend these basic semantics via rewrite rules in a principled fashion to explore more complex reasoning. The result is a layered system enabling an incremental approach to semantic parsing. We illustrate this approach with examples from the Fracas testsuite demonstrating its overall functionality and viability.

Modelling Expectation-based and Memory-based Predictors of Human Reading Times with Syntax-guided Attention

Lukas Mielczarek, Timothée Bernard, Laura Kallmeyer, Katharina Spalek, Benoit Crabbé
Abstract: The correlation between reading times and surprisal is well known in psycholinguistics and is easy to spot from reading time data. There is also a correlation between reading times and structural integration that is harder to detect (Gibson, 2000). The latter correlation has been studied classically by parsing models whose outputs are linked to reading times. In this paper, we study the relevance of structural effects in reading times and how to predict them using neural language models. We find that integration costs significantly improve surprisal-based reading time prediction. Inspired by Timkey and Linzen (2023), we design a small-scale autoregressive transformer language model in which attention heads are supervised by dependency relations. We compare this model to a standard variant with respect to human reading time predictions and find that we can leverage attention predictions as proxies for syntactic integration costs successfully to predict self-paced reading times.

Syntax-Guided Parameter Efficient Fine-Tuning: Integrating Formal Grammatical Constraints into Language Models

Prasanth Yadla
Abstract: Large Language Models (LLMs) demonstrate remarkable linguistic capabilities but lack explicit syntactic knowledge grounded in formal grammatical theory. This paper introduces a syntax-guided parameter-efficient fine-tuning approach that integrates formal syntactic constraints into transformer-based models using Low-Rank Adaptation (LoRA). We develop a hybrid training objective incorporating violations of syntactic well-formedness derived from dependency parsing and context-free grammar constraints. Our method is evaluated on established syntactic benchmarks including BLiMP, CoLA, and SyntaxGym targeting specific grammatical phenomena. Results show consistent improvements in syntactic competence: 7.3% average improvement on BLiMP overall, with particularly strong gains of 9.5% on agreement phenomena and filler-gap dependencies, alongside 5.8% improvement on CoLA MCC scores, while maintaining performance on general NLP tasks. The parameter-efficient approach reduces training time by 77% compared to full fine-tuning while achieving substantial syntactic gains. This work demonstrates a practical pathway for incorporating linguistic theory into modern NLP systems, yielding more interpretable and grammatically robust language models.

On the relative impact of categorical and semantic information on the induction of self-embedding structures

Antoine Venant, Yutaka Suzuki
Abstract: We investigate the impact of center embedding and selectional restrictions on neural latent tree models’ tendency to induce self-embedding structures. To this aim we compare their behavior in different controlled artificial environments involving noun phrases modified by relative clauses, with different quantity of available training data. Our results provide evidence that the existence of multiple center self-embedding is a stronger incentive than selectional restrictions alone, but that the combination of both is the best incentive overall. We also show that different architectures benefit very differently from these incentives.

Plural Interpretive Biases: A Comparison Between Human Language Processing and Language Models

Jia Ren
Abstract: Human communications make frequent use of plural predications. Plural sentences have been observed to be highly ambiguous. There are many theoretical and experimental studies on the nature and logic of plurality in linguistics and philosophy. In this paper, we investigate two lexical aspects of predicates which influence the resolution of plural ambiguity from the novel perspective of the predictions of large language models (LLMs), in particular, BERT and GPT-2. The results of our models differ from the results gained from human experiments. While human language users have certain bias in their interpretation of plural sentences, the biases are not prominent in the language models we looked at. By the end of the paper, we discuss some potential implications of the results.