CoMeDi: Context and Meaning—Navigating
Disagreements in NLP Annotations

Workshop to be held in conjunction with COLING 2025 in Abu Dhabi

January 19/20, 2025



Disagreements among annotators pose a significant challenge in Natural Language Processing, impacting the quality and reliability of datasets and consequently the performance of NLP models. This workshop aims to explore the complexities of annotation disagreements, their causes, and strategies towards their effective resolution, with a focus on meaning in context.

The quality and reliability of annotated data are crucial for the development of robust NLP models. However, managing disagreements among annotators poses significant challenges for researchers and practitioners. Such disagreements can stem from various factors, including subjective interpretations, cultural biases, and ambiguous guidelines. Early research has highlighted the impact of annotator disagreements on data quality and model performance (e.g., Artstein and Poesio, 2008; Pustejovsky and Stubbs, 2012; Plank et al., 2014).

More recent work on perspectivism in NLP, such as that by Basile et al. (2021), highlights the importance of embracing multiple perspectives in annotation tasks to better capture the diversity of human language. This approach argues for the inclusion of various viewpoints to improve the robustness and fairness of NLP models. On the modeling side, various methods for dealing with annotation disagreements have been proposed. For example, Hovy et al. (2013) and Passonneau and Carpenter (2014) identify and weigh annotator reliability to better aggregate contributions, whereas recent perspectivist approaches leverage the inherent disagreements in subjective tasks to train models that handle diverse opinions (Davani et al., 2022; Deng et al., 2023).

We invite both long (8 pages) and short (4 pages) papers. These limits refer to the content; any number of additional pages for references is allowed. Papers should follow the COLING 2025 formatting instructions.

Each submission must be anonymized, written in English, and contain a title and abstract. We especially welcome papers that address the following themes, whether for a single type of disagreement or for annotation disagreements in general:

To encourage discussion and community building and to bootstrap potential collaborations, we also solicit non-archival submissions in addition to shared task papers and regular "archival" track papers. These can take two forms:

These works will be reviewed for topical fit and accepted submissions will be presented as posters. Depending on the final workshop program, selected works may be presented in panels. We plan for these to be an opportunity for researchers to present and discuss their work with the relevant community.

Please submit your papers here.

The role of linguistic-semantic factors causing disagreements, and the extent to which such cases can potentially be predicted, has scarcely been investigated. Building on previous research on ambiguities arising from pronominal anaphora (Yang et al., 2010) and implicit references (Roth et al., 2022), we will host a shared task on predicting disagreements on word sense annotation in context (a.k.a. Word-in-Context, WiC). Realistic WiC datasets often show considerable disagreement. Consequently, we lose information when discarding instances during aggregation or summarizing them by majority or median judgment. Recent research has started to incorporate this information by using alternative label aggregation methods (Uma et al., 2022; Leonardelli et al., 2023). Modelling this disagreement is important because, in real-world scenarios, we most often do not have clean data: predictions must be made on samples where high disagreement is expected and which are inherently difficult to categorize. Predicting disagreement can help to detect or filter such highly complicated samples.

Participants are asked to solve two subtasks. Both rely on data from human WiC judgments on an ordinal scale, such as the DWUG EN dataset (Schlechtweg et al., 2021). Each instance has a target word $w$, for which two word uses, $u_1$ and $u_2$, are provided (use pair). Each of these uses expresses a specific meaning of $w$. Annotators were asked to provide labels on an ordinal relatedness scale from 1 (the two uses of the word have completely unrelated meanings) to 4 (the two uses of the word have identical meanings), following the DURel annotation framework (Schlechtweg et al., 2018). As an example, consider the three word uses of "arm" below. The instance pairing uses (1,2) would likely receive label 4 (identical), while the instance pairing uses (1,3) would rather receive a lower label such as 2 (distantly related).

  1. ...and taking a knife from her pocket, she opened a vein in her little arm.
  2. ...and though he saw her within reach of his arm, yet the light of her eyes seemed as far off.
  3. It stood behind a high brick wall, its back windows overlooking an arm of the sea which, at low tide, was a black and stinking mud-flat.

Subtask 1: Median Judgment Classification with Ordinal Word-in-Context Judgments (OGWiC)

For each use pair $(u_1,u_2)$, participants are asked to predict the median of the annotator judgments. This task is similar to the previous WiC (Pilehvar et al., 2019) and GWiC (Armendariz et al., 2020) tasks. However, we limit the label set in predictions and penalize predictions more heavily the further they deviate from the true label. This makes OGWiC an ordinal classification task (Sakai, 2021), in contrast to binary classification (WiC) or ranking (GWiC). Predictions will be evaluated against the median labels with the ordinal version of Krippendorff's $\alpha$ (Krippendorff, 2018).
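
To make the target and metric concrete, here is a minimal sketch of how median labels and the ordinal Krippendorff's $\alpha$ could be computed. It is not the official evaluation code from the starting kits; the annotator judgments, instance identifiers, and the use of the third-party Python package krippendorff are assumptions made purely for illustration.

    # Illustrative sketch only -- not the official evaluation script.
    # Assumes the third-party `krippendorff` package (pip install krippendorff).
    import statistics
    import krippendorff

    # Hypothetical ordinal judgments (1-4) per use pair from several annotators.
    judgments = {
        "arm_(1,2)": [4, 4, 4],  # uses 1 and 2 above: judged identical
        "arm_(1,3)": [2, 3, 2],  # uses 1 and 3 above: more distantly related
    }

    # Subtask 1 gold label: the median of the annotator judgments per instance.
    gold = {pair: int(statistics.median(js)) for pair, js in judgments.items()}

    # Hypothetical system predictions, restricted to the label set {1, 2, 3, 4}.
    predictions = {"arm_(1,2)": 4, "arm_(1,3)": 3}

    # Ordinal Krippendorff's alpha between gold medians and predictions:
    # larger deviations from the gold label are penalized more heavily.
    order = sorted(gold)
    reliability_data = [[gold[p] for p in order], [predictions[p] for p in order]]
    alpha = krippendorff.alpha(reliability_data=reliability_data,
                               level_of_measurement="ordinal")
    print(f"Ordinal Krippendorff's alpha: {alpha:.3f}")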

Treating graded WiC as an ordinal classification task instead of a ranking task constrains model predictions to exactly reproduce instance labels instead of just inferring their relative order. This is advantageous if the ordinal labels have an interpretation, because predictions then inherit this interpretation. Such an interpretation can be assigned to the DURel scale, as explained in Schlechtweg et al. (2018) and in more detail in Schlechtweg (2023, pp. 22-23): judgments 1-4 can be interpreted as "homonymy" (1), "polysemy" (2), "context variance" (3) and "identity" (4), respectively.
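
As a purely illustrative aside, this reading of the scale can be written as a simple lookup table (the variable name below is hypothetical):

    # The DURel scale interpretation described above (illustrative only).
    DUREL_INTERPRETATION = {
        1: "homonymy",          # completely unrelated meanings
        2: "polysemy",          # distantly related meanings
        3: "context variance",  # closely related meanings
        4: "identity",          # identical meanings
    }

    print(DUREL_INTERPRETATION[2])  # -> polysemy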

Subtask 2: Mean Disagreement Ranking with Ordinal Word-in-Context Judgments (DisWiC)

For each use pair $(u_1,u_2)$, participants are asked to predict the mean of pairwise absolute judgment differences between annotators: $D(J)=\frac{1}{|J|}\sum_{(j_1,j_2)\in J}|j_1-j_2|$, where $J$ is the set of unique pairwise combinations of judgments. For pair (1,2) from above, $D(J)=\frac{1}{2}(|4-4|+|4-4|)=0.0$, while for (1,3) it amounts to $\frac{1}{3}(|2-3|+|2-2|+|3-2|)=0.667$. DisWiC can be seen as a ranking task: participants are asked to rank instances according to the magnitude of disagreement observed between annotators. It differs from previous tasks (Leonardelli et al., 2023) by aggregating "gold" labels purely over judgment differences, thus making disagreement the explicit ranking aim. Predictions will be evaluated against the mean disagreement labels with Spearman's $\rho$ (Spearman, 1904).
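
The following minimal sketch shows how $D(J)$ and the Spearman evaluation could be computed. It is not the official scorer; the annotator judgments and prediction scores are invented for illustration, and SciPy is assumed to be available.

    # Illustrative sketch only -- not the official evaluation script.
    from itertools import combinations
    from scipy.stats import spearmanr

    def mean_disagreement(judgments):
        """Mean absolute difference over all unique pairwise combinations of judgments."""
        pairs = list(combinations(judgments, 2))
        return sum(abs(j1 - j2) for j1, j2 in pairs) / len(pairs)

    # Hypothetical annotator judgments per use pair (ordinal scale 1-4).
    gold_judgments = {
        "arm_(1,2)": [4, 4, 4],  # full agreement -> D(J) = 0.0
        "arm_(1,3)": [2, 3, 2],  # D(J) = (1 + 0 + 1) / 3 = 0.667, as in the example above
    }
    gold = {pair: mean_disagreement(js) for pair, js in gold_judgments.items()}

    # Hypothetical system scores; only the induced ranking matters for Spearman's rho.
    predictions = {"arm_(1,2)": 0.1, "arm_(1,3)": 0.8}

    order = sorted(gold)
    rho, _ = spearmanr([gold[p] for p in order], [predictions[p] for p in order])
    print(f"Spearman's rho: {rho:.3f}")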

Organization

Please register for our CodaLab competition and join our Google group to participate. (After logging into your Google account, you should be able to join the group directly.) The provided starting kits contain training and development data, a description of the data format, the evaluation scripts, baseline scripts, and a sample answer/prediction that can be readily uploaded to CodaLab. If you have any questions, please ask them through our Google group.

The task is organized by Dominik Schlechtweg, Tejaswi Choppa, Wei Zhao and Michael Roth.

Shared task deadlines refer to 11:59pm UTC. All other deadlines refer to 11:59pm GMT -12 hours ("anywhere in the world").

Program Committee