Conference Report ACL 2025 - How can we move forward as a field?

30 minute read


DISCLAIMER: If you disagree with me, or if I represented your opinions wrongly in this article, please contact me directly via e-mail! I am happy to adapt this article if needed.

Last week, I attended the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). My PhD advisor Manfred Pinkal told me recently that the first ACL he attended featured 150 attendees. This year, there were more than 5000 onsite and roughly 1000 additional remote participants. It was also the first time I attended a huge conference after the Covid break, so it was extra nice to catch up with a lot of people in person.

Every invited talk and panel (except maybe that by Mark Johnson in the XLLM workshop) that I attended spent the first ten minutes discussing the great capabilities of large language models (LLMs) and why they do not yet solve language understanding, or some related tasks. Moreover, ACL ARR has a massive problem with way too many submissions, too few senior reviewers, and too many automatically generated reviews. In a plenary panel discussion, Ed Hovy, Mirella Lapata, Yue Zhang, and Dan Roth discussed how we could move forward as a field.

  • Ed: With LLMs, we are given a self-driving car that just crashes regularly. We should spend more effort on understanding how it works internally. I think this is really worthwhile, though he really meant that we should develop methods that let us understand the internal workings of the model such that we can fix ‘misbehavior’. All the papers that just probe the models on some task and then say, hey, LLMs can do this or that (to some degree) are like popcorn (his words, not mine): you read them, you feel satisfied for ten minutes, and then you feel empty again. I chatted with Ed later and (if I did not misunderstand him) he actually does agree that system-building using LLMs as a component is still a worthwhile endeavor, if that’s your focus. I will come back to this point later. As a community, we have tried understanding how neural models work for quite a while, e.g., along the lines of the BlackBoxNLP workshop series, which is still active. There were some interesting insights, e.g., that BERT behaves a bit like the old natural language processing (NLP) pipelines, with higher-level semantics being put together in higher layers. However, I think we have failed to fully understand even simple models like word2vec, in the sense of being able to predict when they succeed and when they fail. In my current opinion (which I am happy to change if someone brings evidence or good arguments), LLMs are still distributional models, just with highly sophisticated vector spaces that enable them to store a lot of information about how to traverse the vector space and output natural language answers at the end. It does make sense to look at where they succeed and where they fail. In particular when building systems for real-world use, it is paramount to test them extensively and make sure they do not harm anyone. Unfortunately, the popcorn papers are outdated as soon as the next model generation comes out. Then we have to test everything again, and unless we can be sure that the new models were not simply trained on the probing test data, we cannot even re-use the same benchmark data. We’ve already seen this with all the BERTology papers.

  • Mirella: We need to get back to an actual science, where we have a controlled setting of what went into the training data and on what data we test. I completely agree. This was mainly a call to the model vendors. The issue is that indexing and providing search capabilities for the pretraining data is of course costly. Plus, we would have to keep developing methods to find the instances of data that are relevant to the model behavior we aim to analyze, which may also raise new questions about how to do this kind of science. (Though I personally think doing research on such retrieval methods could be fun.) During the XLLM workshop, I learned about some interesting endeavors in this direction like OpenEuroLLM. OLMo is also an exciting initiative, which also features an instruction-tuned variant (unfortunately, at the moment, predominantly for English). However, for most of us, regular researchers at a regular university, training such models is not something we can actually afford. So what can we do to make the field more scientific again? I will present my thoughts on that below.

  • Dan: The topic of the panel discussion was actually supposed to be “generalisation,” even if a large part of the discussion was more general than that (pun intended). Dan reported that he had asked some LLM about himself recently and got an impressive list of prizes he had won, just that he had not actually won them. But he pointed out that the model actually did generalize by hallucinating these facts, as they would typically fit a university professor at his stage. If I understood correctly, Dan, who was presented as the “pragmatist,” largely took the perspective that we should build and evaluate systems. But also that we should aim to understand how the systems we build (or the LLMs inside them) deal with reasoning chains or, as he put it, causal reasoning. In our own work on quantifying uncertainty in natural language text in Bayesian reasoning scenarios (EMNLP 2024), we actually found that the models perform okay on causal reasoning (inferring the likelihood of effects based on some cause), but that their performance drops markedly when the underlying problem requires evidential reasoning (updating one’s beliefs about causes when new evidence comes in) or explaining-away style reasoning (in which knowing that one cause holds makes another cause less likely as the explanation of the observed evidence). This, I think, is actually somewhat intuitive given their autoregressive nature. (A small toy example of the three reasoning types follows below.)

Results taken from our EMNLP 2024 paper: the models' performance drops for evidential and explaining-away reasoning

What is worth noting is that despite the teaser image, our paper is not actually popcorn (I hope), because our point was actually to create a neurosymbolic model that parses problems into a machine-readable logic programming language that can then solve the problems regardless of the underlying reasoning type (see blue bars in the plot). Results like this somehow make me believe that, without a major change in model architecture, the models have some built-in bias that simply works differently from the human brain, and that to achieve models that explain reasoning in a way that we humans can deal with, we need at least one additional architecture shift in AI. But that is a belief, and as a researcher, I occasionally update my beliefs. Evidential, causal, or explaining-away reasoning included.
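To make the three reasoning types concrete, here is a minimal toy sketch. It is not taken from our paper or its benchmark; the network structure and the numbers are purely hypothetical. It models two independent causes and one effect with a noisy-OR, and the three queries correspond to causal, evidential, and explaining-away reasoning.

```python
# Toy two-cause Bayesian network (hypothetical numbers, purely illustrative).
from itertools import product

P_A, P_B = 0.3, 0.2          # independent priors of the two causes A and B
W_A, W_B, LEAK = 0.8, 0.7, 0.05

def p_e_given(a, b):
    # noisy-OR: each present cause can independently trigger the effect E
    return 1 - (1 - LEAK) * (1 - W_A) ** a * (1 - W_B) ** b

def joint(a, b, e):
    p = (P_A if a else 1 - P_A) * (P_B if b else 1 - P_B)
    return p * (p_e_given(a, b) if e else 1 - p_e_given(a, b))

def prob(query, given):
    # conditional probability by exhaustive enumeration over A, B, E
    num = den = 0.0
    for a, b, e in product([0, 1], repeat=3):
        world = {"A": a, "B": b, "E": e}
        if all(world[k] == v for k, v in given.items()):
            den += joint(a, b, e)
            if all(world[k] == v for k, v in query.items()):
                num += joint(a, b, e)
    return num / den

print("causal:         P(E=1 | A=1)      =", round(prob({"E": 1}, {"A": 1}), 3))
print("evidential:     P(A=1 | E=1)      =", round(prob({"A": 1}, {"E": 1}), 3))
print("explaining away: P(A=1 | E=1,B=1) =", round(prob({"A": 1}, {"E": 1, "B": 1}), 3))
```

The last probability drops below the second one: once the alternative cause B is observed, it “explains away” A. This kind of backward belief update is exactly what the models in our experiments struggled with, compared to the forward, causal direction.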

So still, ACL has a massive problem of somehow having drifted away from the scientific principles that guided NLP research in past decades. Getting papers accepted seems to have become some kind of gamble which highly depends on being assigned a responsible meta-reviewer. In his keynote at the Scientific Document Processing Workshop, Ed Hovy again provocatively stated that automatic meta-reviewing had been solved (don’t get me wrong, I loved his provocative analogies; our community needs leaders like him who have thought about NLP and meaning for decades). Again, in our coffee break chat, he agreed that neither reviewing nor meta-reviewing should be summarization. Practically, the meta-reviews that my students receive regularly read like superficial summaries of the reviews. There is no meaningful evaluation of the contributions as in prior times. But it shouldn’t be that way. Ed actually thinks we should admit all the papers and vote on-site which papers should get into the proceedings. A little like it is actually done in linguistics: in that field, contributions are often accepted for presentation at a conference based on an abstract, to check the topical fit. After discussion and feedback, editors that actually edit compile a collection of articles into a book. Maybe that wouldn’t be a bad idea, I am just not sure whether it scales with the number of ACL participants. But a lot of people I talked to at ACL this year (including Alexander Koller) and also ACL president Chengqing Zong in his presidential address advocated going back to smaller and specialized conferences.

What bothers me, personally, is the question of how we should do science these days. And how to communicate this to the newcomers in our field. On my train ride back home, I was thinking about how to communicate this to my PhD students and decided to put together some concrete suggestions that can help people who do not own huge compute centers. But first, let’s look back. What were the experimental setups, and what counted as “interesting” or “valid” contributions?

Computational Linguistics before the year 2000

For decades, computational linguistics dealt with building programs (typically not yet what we would call a system today) that would process natural language, often motivated by theoretically established linguistic rules. The contribution was as much on the linguistic side as on the computational side, as formalizing and testing linguistic ideas in a computational way was still rather novel. For example, look at the PUNDIT system for temporal relation inference (which is more on the theoretical side) or a Text Interpretation System for MUC-3.

screenshots taken from the two papers mentioned in the text that illustrate what the grammar rules look like

These papers do not even have an experimental or evaluation section. Reviewers would have to look at the ideas and judge whether they would be interesting to discuss, whether they have the potential to spark new ideas in others. Whether the approach does something different from existing work. Proposal #1: Let’s add this back to our criteria for reviewing and meta-reviewing, and not just on guidelines pages, but actually do it. It will result in much more interesting contributions to ACL than benchmark chasing. I think this fits in nicely with the necessity to create less compute-intensive models. And let’s not just do that in specific environmental computing tracks. Let’s invite diversity back into our approaches.

To illustrate what is currently wrong: The only paper from my group that did not get into ACL or Findings actually presented a super interesting way to create training data for a complicated semantic parsing task and showed clear improvements on medium-sized models. It was rejected mainly for the reason that GPT-4o and DeepSeek achieved around 61% accuracy on the dataset, and our approach only 57%. Are these really numbers that already tell us that we are on the wrong track? What if our model had achieved 63% accuracy? I think if any SOTA LLM achieves anything less than 98% accuracy on a (non-subjective) task, it is absolutely worth discussing alternative approaches! (This paragraph is not intended to be a personal complaint; it is just meant to illustrate what I also hear from lots of colleagues, despite the ACL ARR guidelines already discouraging such simplified ways of evaluating the perceived utility of an idea.)

Natural Language Processing 2000-2022

In the 2000s, the mainstream paradigm in NLP was to collect data, annotate it manually, check whether the task is meaningful by at least verifying that several humans arrive at the same conclusion when performing it, and then train a machine learning model guided by intuition and a development set not included in the training data. Finally, we would test whether the model generalizes either in- or cross-domain by applying it to some likewise manually labeled test data. One could argue that often, the data sets were way too small for drawing statistically significant conclusions (although they were often much bigger than nowadays). Often, it can also be shown that the models actually use superficial cues rather than actual understanding. So I really do not want to say “everything used to be better”. However, it was paramount to separate the data on which the model was trained from the data on which hypotheses were tested. Also, the data should either have been elicited through human annotation (more often the case; example: my own PhD work on linguistic aspect) or it should have come from the real world (less often the case; examples from the patent NLP domain: a dataset on patent classification created by a Bosch employee over the course of 25 years, or PAP2PAT, which aligns patents and papers). A minimal code sketch of this setup follows below the figure.

image summarizing the experimental paradigm 2000-2022 that is also explained in the main text
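For newcomers, this is roughly what the paradigm looks like in code. The sketch below is illustrative only: the file names, the label column, and the choice of a TF-IDF plus logistic regression model are assumptions, not a recipe from any particular paper.

```python
# Minimal sketch of the 2000-2022 paradigm (hypothetical file and column names).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Manually annotated in-domain data, strictly separated into train and dev.
data = pd.read_csv("annotated_corpus.csv")
train, dev = train_test_split(data, test_size=0.2, random_state=42)

vec = TfidfVectorizer(min_df=2)
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(train["text"]), train["label"])

# Hyperparameters are tuned against the dev set only; the test sets stay untouched.
dev_pred = clf.predict(vec.transform(dev["text"]))
print("dev macro-F1:", f1_score(dev["label"], dev_pred, average="macro"))

# Final evaluation: a held-out in-domain test set and a cross-domain test set,
# both manually labeled and never seen during training or tuning.
for name in ("test_in_domain.csv", "test_cross_domain.csv"):
    test = pd.read_csv(name)
    pred = clf.predict(vec.transform(test["text"]))
    print(name, "macro-F1:", f1_score(test["label"], pred, average="macro"))
```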

As a side note, it was also considered bad practice to reverse-engineer other systems. For example, if you take an existing dependency parser and run it on a dataset, which you then use to train another parser, you have not actually learned anything new; you have just learned the set of rules already explicitly or implicitly incorporated in the initial parser. The only admissible setting that I can think of is when the aim is to construct (typically additional) silver data that is combined with existing gold data, not to make statements about the underlying phenomenon that you are trying to capture, but to improve downstream performance on an unseen test set. (I am mentioning this because I noted recently that a lot of novice researchers have not heard of the concept of silver standard data, although much of the benchmark data currently created with LLMs falls into this category in my opinion. It was also suggested by Jessy Li in her keynote at the Linguistic Annotation Workshop that we should increasingly work with platinum data, i.e., manually created data collected by actual expert annotators.) Synthetic data generated by an LLM is necessarily just silver data; hence, we should always take conclusions based on it (especially if it is used as test data) with a grain of salt.

But what can we learn from this age? Only if the training data does not contain the test set can we actually make statements about whether a model has learned anything beyond memorizing some instances. But how can we tell what data LLMs have seen during pre-training? Sadly, at the moment, for the most widely used LLMs, we cannot. But we can do something. Proposal #2: We should use test data that was newly created, either by hand-labeling data (relying on trustworthy annotators, perhaps even having them on-site; I would not trust crowd-sourcing any more) or by collecting real-world data that was definitely created after the cut-off date of the pretraining data of the LLMs that are part of the system we are testing. An analysis of such non-contaminated test data should be a mandatory part of any paper using the standard machine-learning benchmarking paradigm described above. And reviewers should give credit for it, because it is a lot of extra work!
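A minimal sketch of the cut-off-date part of this proposal, with hypothetical file and field names and made-up cut-off dates (look up the real ones for the models in your system): keep only the candidate test items that demonstrably postdate every cut-off involved.

```python
# Sketch: filter candidate test items by creation date vs. model cut-off dates.
import json
from datetime import date

# Assumed cut-off dates for the (hypothetical) models in the system under test.
MODEL_CUTOFFS = {"open-model-a": date(2024, 6, 1), "open-model-b": date(2024, 12, 1)}
latest_cutoff = max(MODEL_CUTOFFS.values())

with open("candidate_test_items.jsonl") as f:
    items = [json.loads(line) for line in f]

clean = [x for x in items if date.fromisoformat(x["created"]) > latest_cutoff]
print(f"{len(clean)} of {len(items)} items postdate all cut-offs; "
      "report results on this subset separately.")
```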

Here is an example taken from the PAP2PAT paper. I really liked that my student Valentin collected this additional data and performed experiments on it, actually observing some interesting deviations in linguistic style, and I think it would be very worthwhile to dive into such observations with an even deeper analysis. (This is not criticism of his careful work, which took more than one year, and which I, in my completely unbiased opinion, think would absolutely have been worthy of being accepted at the main conference. :))

a screenshot of the PAP2PAT paper sections on test data contamination

Reflection on “research” practices today

So why are people like Mirella Lapata saying that our research field needs to get more scientific again? The problem is that LLMs have become so human-like in their way of expression that humans trust them easily, potentially too much. Steven Bird says (in this recommended article) that this is due to the Eliza Effect, the linguistic correlate of pareidolia (the human tendency to see faces in the clouds, i.e., to assume meaning where there is none). If you think about it, it does make sense: we rely on recognising other humans, other intelligent beings, so it is better to overgeneralize in recognising them by their behavior than to miss out on noticing one of these dangerous predators in our surroundings. Another issue that is discussed a lot these days is whether we anthropomorphize LLMs too much. I have several times talked to people outside our field who were surprised that “agents” are just systems using the same underlying LLM technology. They were somehow under the impression that “agentic AI” was really something fundamentally new.

Granted, on some tasks (for some languages), LLMs even work more consistently than trained human expert annotators. However, I do not think this is generally true for all relevant setups. Moreover, I still think that saying they have superhuman performance is really misleading. At the very least, we should define human performance with regard to the person who is best at performing a specific task, and not with regard to an average crowdworker.

But psychology aside, what is the issue? A lot of papers (including those I co-authored; I am sure you can find examples of what should be done differently in those as well!) generate training or test data. In some domains, this is extremely helpful, as it may be impossible to work with real data, e.g., due to copyright or privacy issues. So in some sense, the possibility to actually generate new texts or data is highly beneficial to our research. There is only one issue: the data already follows the internal distributions of the model(s). Depending on the use case, this may not be a big problem. If you intend to build a useful system, it may be totally okay, as long as you test on real-world data. Often, however, the test dataset is simply built in the same way, which means that we may overestimate performance.

image of a cycle going through the steps described in the main text, with issues

I have seen some papers (during the review process) that then use the LLM to label the data, and perhaps also additionally filter the data “to increase data quality.” Maybe that increases the quality of the instances that were labeled, but it certainly also discards the relevant cases. And getting the hard cases along the decision boundaries right should remain the core interest of NLP, not dealing with test sets whose instances are at the far ends of the spectrum.

Then, the “modeling” takes place. I think you see where I am going - if the same LLM is used to model the data that we just automatically generated and/or labeled and filtered, it has an unfair advantage because it uses the same underlying distributions. If you use a different LLM, you risk reverse-engineering the data-generating model - if the two have not already been aligned with each other during pre-training and instruction tuning anyway. In my personal experience (does anyone know a good citation for this?), very different models sometimes tend to make exactly the same mistakes on difficult cases, which makes me wonder to what extent the training data for one model was actually taken from the other model. Sigh. This is the opposite of scientific.

Finally, evaluation. Real-world data with reliable annotations is extremely hard to come by. During my PhD, I worked with a group of several trained student annotators for more than three years. Who does something like that any more nowadays? The next ACL ARR cycle is due in two months. Luckily, we have LLMs that behave almost like humans, which is why LLM-as-a-judge (a special case of LLM-as-an-annotator according to Rotem Dror, see her keynote at the Linguistic Annotation Workshop) became popular. LLMs are still incredibly unreliable at outputting solutions in the format we want, i.e., we often need to map the output to other formats or check whether it contains the correct answer (and no incorrect one). Plus, it is still tricky to evaluate the quality of generated text. It is easy to see that an LLM that was also used to generate or label the data will prefer its own labels when acting as a judge. There already exists a growing body of work on the biases of models (and humans) on judgment tasks, revealing systematic differences.

image of a cycle going through the steps described in the main text, with potential solutions

Okay, but what can we do (to do scientific research)? For a moment, let’s stay in the machine-learning-style NLP framework and see how we can improve each step to break this cycle. Here is my proposal.

  • Let’s base our research on open models for which we can actually verify whether something was included in the training data or not. Even if they are smaller, even if DeepSeek, ChatGPT, or Llama-3.1-405B (replace this with the currently biggest commercial and open-weights models of your choice) perform better on some benchmark. This should really, really not be a factor in reviewing.

  • If we are using LLMs to label data, verify the quality of this step by comparing to real-world data (that had also not been included in their training) or human expert annotations (see the sketch after this list). There is a lot of recent work by Rotem Dror on statistical significance tests for agreement. I also recommend studying this article by Ron Artstein and Massimo Poesio. It’s from 2008, but the metrics are still widely used, account for chance agreement, and if you look at them class-wise, they can tell you which are the hard and easy cases. We simply cannot say in general how many instances we need to annotate manually to build a good labeling prompt: class imbalance has a major impact on the performance of both humans and systems, and there are simply easier and harder cases. During the crowdsourcing age, a lot of methods were developed to figure out statistically which annotations to trust, e.g., Dirk Hovy’s MACE. I think these methods could be quite useful to figure out whether to trust LLM annotations as well. To study agreement, I invite everyone to look into the old literature. :)

  • From a system building perspective, I think we would kind of be done at that step. We have collected a small but meaningful set of annotations, written some rules (or programmed the LLM with some prompts, if you prefer this terminology) and checked against the data how well our rules capture the quality annotations. That was exactly how NLP was done for a long time, and it actually does not seem wrong to me these days either. I would like to read about this step of the research much more. The problem is (as it was before) that if the person who wrote the rules is also the annotator, the performance measure may be overly optimistic again. When I did my Master’s degree in Language Science and Technology in 2010, we were made well aware of this issue.

  • For historical reasons, after the data collection, the modeling happens. Now some additional prompts are designed, perhaps an agent system is put to use. If your system really does more than the original labeling prompt (maybe you finetuned a model and want to show that it makes progress on a particular task), it should next be evaluated. As before, I think when using an LLM judge, it is paramount to estimate the quality of the judge on the particular task. We should always, always, always do that unless the particular judge has already been evaluated on exactly the same test data in a predecessor paper. And then we should account for the variance introduced by the potential uncertainty of the judge when we compare systems. I am not sure how to do this statistically; I am sure this would be a great research topic for you, Rotem Dror, let me know if you are interested in collaborating if you read this. :)
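Here is a minimal sketch of the validation step mentioned in the list above: comparing LLM labels (or judge decisions) against a sample of human expert labels, reporting chance-corrected agreement, and attaching a simple bootstrap confidence interval to the judge’s accuracy. The file and field names are hypothetical, and the bootstrap is only one possible way to quantify the judge’s uncertainty, not a full answer to the statistical question raised above.

```python
# Sketch: validate LLM labels / judge decisions against human expert labels.
import json
import random
from sklearn.metrics import classification_report, cohen_kappa_score

with open("human_vs_llm_sample.jsonl") as f:   # hypothetical comparison sample
    sample = [json.loads(line) for line in f]
human = [x["human_label"] for x in sample]
llm = [x["llm_label"] for x in sample]

# Chance-corrected agreement plus per-class scores (shows which classes are hard).
print("Cohen's kappa:", round(cohen_kappa_score(human, llm), 3))
print(classification_report(human, llm))

# Bootstrap 95% interval for the judge's accuracy on this particular task.
random.seed(0)
accs = sorted(
    sum(human[i] == llm[i] for i in (random.randrange(len(sample)) for _ in sample))
    / len(sample)
    for _ in range(1000)
)
print("judge accuracy 95% CI:", round(accs[24], 3), "to", round(accs[974], 3))
```

If several systems are compared with the same judge, reporting this interval next to the system scores at least makes it visible when the differences between systems are smaller than the judge’s own uncertainty.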

Looking sideways: research methods in HCI/IR

In my opinion, a lot of work feels less scientific because the methodology conflates experimental paradigms from machine learning and from system building. I think research papers need to state more clearly which experimental paradigm they follow. As I have explained above, the train-dev-test data paradigm was not always predominant in computational linguistics. So while I think it is important to go back to it for research work that aims at improving the machine learning step itself, we should also acknowledge that part of our research today has kind of gone back to system building. For machine learning experiments, we should rely on transparent models and regain the kind of scientific trustworthiness that Mirella is advocating. As for approaches that build systems, we should just not try to sell them under the wrong paradigm, using evaluation methodology that employs LLMs as questionable judges.

Let us consider again for whom we are building language technology.

  1. Other humans. Humans work differently from LLMs. Even if LLMs-as-a-judge achieve good correlations with human responses, I advocate that systems should still be evaluated with the actual distribution of humans who are the intended users. This is pretty common in other scientific research fields such as human-computer interaction (HCI) or information retrieval (IR). It does not always have to be a questionnaire; we can also track gaze, click traces, etc. (with the users’ consent, of course).

  2. Other systems. The output of our NLP approach may be entered into databases or fed into business processes, etc. For each step, human experts should also inspect several cases. These should be sampled in a way that covers the entire range of rare, frequent, difficult, and easy cases. Potentially, this step can also simply be evaluated based on real-world data.

What does this mean? Authors should clarify under which experimental paradigm they work, and reviewers should not discount research that makes its claim based on user studies rather than F1 scores. Maybe that would help to avoid papers that use questionable research designs.

Summary: how can we move forwards?

  • Any scientific hypothesis can only be tested if we KNOW that a model has not seen the test data during pre-training or instruction tuning. If a paper uses open models (which I hope will become more common during the next year) and/or describes how the authors checked for test data contamination, this should be credited strongly positively in reviews and meta-reviews.

  • Again, please don’t get me wrong. I think it is okay to use LLMs for system building. We should just distinguish carefully between research on systems and engineering systems. While engineering can be research, too, in the context of LLMs it helps to remember that a patent application that just says “my invention is to perform task X with an LLM” will very likely be rejected. By contrast, if a research paper on a system performs an insightful analysis of the capabilities, boundaries, component statistics, and potential impact of the system (ideally by testing it with the actual target users), that could count as research. As a research community, we should also perform research on the current capabilities of models that are considered to be at the forefront of our field. Perhaps even popcorn papers are important to raise awareness of all the sociological and ethical problems that come with the spread of LLMs. But precisely for that reason, humans should be kept in the loop when evaluating such models or systems.

  • LLMs are already actively in use to support research in other fields. And not only to polish texts or search the literature, but also to obtain data from this (according to Luke Zettlemoyer) huge mountain of compressed data. That may not be wrong per se, but unless we carefully check the data we obtain this way, we risk drawing misleading conclusions from the synthetic data. I think it is paramount that we as NLP researchers set a good example of how to deal with this data.

Is there AI-generated content in this article? I did not use any AI to write this article. Like Iryna Gurevych, who pointed this out in response to Ed’s talk at SDProc, I still find it easier and faster to directly find the words that I can use to express something rather than finding the prompt that generates the text that I actually want. Maybe that is because I was actually trained extensively in writing, and my way of doing things is old-fashioned. Like someone who still remembers landline numbers rather than searching for a phone number in their Google contacts. Sometimes the old way is just faster. (I have to admit that I have to look up even my husband’s cell number these days, though.) I do not know whether this means that we should be worried, as a generation of researchers and developers that learn neither to write nor to program themselves is at a critical stage of their education these days. Maybe I am really just old-fashioned. On the other hand, how could I learn to write prompts for a translation system and judge whether the output is what I want if I do not actually speak the target language? Maybe I could smooth the language a bit (I am not a native English speaker). However, the article is authentic; the words are mine, not those of some LLM.

Thanks to Casey Kennington (Boise State University) for finding my typos and suggesting to add or elaborate on several interesting aspects.