CoMeDi: Context and Meaning—Navigating
Disagreements in NLP Annotations

Workshop to be held in conjunction with COLING 2025 in Abu Dhabi

January 19/20, 2025



Disagreements among annotators pose a significant challenge in Natural Language Processing, impacting the quality and reliability of datasets and consequently the performance of NLP models. This workshop aims to explore the complexities of annotation disagreements, their causes, and strategies towards their effective resolution, with a focus on meaning in context.

The quality and reliability of annotated data are crucial for the development of robust NLP models. However, managing disagreements among annotators poses significant challenges for researchers and practitioners. Such disagreements can stem from various factors, including subjective interpretations, cultural biases, and ambiguous guidelines. Early research has highlighted the impact of annotator disagreements on data quality and model performance (e.g., Artstein and Poesio, 2008; Pustejovsky and Stubbs, 2012; Plank et al., 2014).

More recent work on perspectivism in NLP, such as that by Basile et al. (2021), highlights the importance of embracing multiple perspectives in annotation tasks to better capture the diversity of human language. This approach argues for the inclusion of various viewpoints to improve the robustness and fairness of NLP models. On the modeling side, various methods for dealing with annotation disagreements have been proposed. For example, Hovy et al. (2013) and Passonneau and Carpenter (2014) identify and weigh annotator reliability to better aggregate contributions, whereas recent perspectivist approaches leverage the inherent disagreements in subjective tasks to train models that handle diverse opinions (Davani et al., 2022; Deng et al., 2023).

We invite both long (8 pages) and short (4 pages) papers. These limits refer to the content; any number of additional pages for references is allowed. Papers should follow the COLING 2025 formatting instructions.

Each submission must be anonymized, written in English, and contain a title and abstract. We especially welcome papers that address the following themes, whether for a single type of disagreement or for annotation disagreements in general:

To encourage discussion and community building and to bootstrap potential collaborations, we also solicit non-archival submissions in addition to shared task papers and regular "archival" track papers. These can take two forms:

These works will be reviewed for topical fit and accepted submissions will be presented as posters. Depending on the final workshop program, selected works may be presented in panels. We plan for these to be an opportunity for researchers to present and discuss their work with the relevant community.

Please submit your papers here.

The role of linguistic-semantic factors causing disagreements, and the extent to which such cases can potentially be predicted, has scarcely been investigated. Building on previous research on ambiguities arising from pronominal anaphora (Yang et al., 2010) and implicit references (Roth et al., 2022), we will host a shared task on predicting disagreements on word sense annotation in context (a.k.a. Word-in-Context, WiC). Realistic WiC datasets often show considerable disagreement. Consequently, we lose information when discarding instances during aggregation or summarizing them by majority or median judgment. Recent research has started to incorporate this information by using alternative label aggregation methods (Uma et al., 2022; Leonardelli et al., 2023). Modelling this disagreement is important because, in real-world scenarios, we most often do not have clean data: predictions must be made on samples where high disagreement is expected and which are inherently difficult to categorize. Predicting disagreement can help to detect or filter such highly complicated samples.

Participants are asked to solve two subtasks. Both rely on data from human WiC judgments on an ordinal scale, such as the DWUG EN dataset (Schlechtweg et al., 2021). Each instance has a target word $w$, for which two word uses, $u_1$ and $u_2$, are provided (use pair). Each of these uses expresses a specific meaning of $w$. Annotators were asked to provide labels on an ordinal relatedness scale from 1 (the two uses of the word have completely unrelated meanings) to 4 (the two uses of the word have identical meanings), following the DURel annotation framework (Schlechtweg et al., 2018). As an example, consider the three word uses of "arm" below. The instance pairing uses (1,2) would likely receive label 4 (identical), while the instance pairing uses (1,3) would rather receive a lower label such as 2 (distantly related).

  1. ...and taking a knife from her pocket, she opened a vein in her little arm.
  2. ...and though he saw her within reach of his arm, yet the light of her eyes seemed as far off.
  3. It stood behind a high brick wall, its back windows overlooking an arm of the sea which, at low tide, was a black and stinking mud-flat.

Subtask 1: Median Judgment Classification with Ordinal Word-in-Context Judgments (OGWiC)

For each use pair $(u_1,u_2)$, participants are asked to predict the median of the annotator judgments. This task is similar to the previous WiC (Pilehvar et al., 2019) and GWiC (Armendariz et al., 2020) tasks. However, we limit the label set in predictions and penalize predictions more heavily the further they deviate from the true label. This makes OGWiC an ordinal classification task (Sakai, 2021), in contrast to binary classification (WiC) or ranking (GWiC). Predictions will be evaluated against the median labels with the ordinal version of Krippendorff's $\alpha$ (Krippendorff, 2018).
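
To make the target and metric concrete, here is a minimal sketch of how median labels and the ordinal Krippendorff's $\alpha$ could be computed. It is not the official evaluation code from the starting kits; the annotator judgments, instance identifiers, and the use of the third-party Python package krippendorff are assumptions made purely for illustration.

    # Illustrative sketch only -- not the official evaluation script.
    # Assumes the third-party `krippendorff` package (pip install krippendorff).
    import statistics
    import krippendorff

    # Hypothetical ordinal judgments (1-4) per use pair from several annotators.
    judgments = {
        "arm_(1,2)": [4, 4, 4],  # uses 1 and 2 above: judged identical
        "arm_(1,3)": [2, 3, 2],  # uses 1 and 3 above: more distantly related
    }

    # Subtask 1 gold label: the median of the annotator judgments per instance.
    gold = {pair: int(statistics.median(js)) for pair, js in judgments.items()}

    # Hypothetical system predictions, restricted to the label set {1, 2, 3, 4}.
    predictions = {"arm_(1,2)": 4, "arm_(1,3)": 3}

    # Ordinal Krippendorff's alpha between gold medians and predictions:
    # larger deviations from the gold label are penalized more heavily.
    order = sorted(gold)
    reliability_data = [[gold[p] for p in order], [predictions[p] for p in order]]
    alpha = krippendorff.alpha(reliability_data=reliability_data,
                               level_of_measurement="ordinal")
    print(f"Ordinal Krippendorff's alpha: {alpha:.3f}")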

Treating graded WiC as an ordinal classification task instead of a ranking task constrains model predictions to exactly reproduce instance labels instead of just inferring their relative order. This is advantageous if the ordinal labels have an interpretation, because predictions then inherit this interpretation. Such an interpretation can be assigned to the DURel scale, as explained in Schlechtweg et al. (2018) and in more detail in Schlechtweg (2023, pp. 22-23): judgments 1-4 can be interpreted as "homonymy" (1), "polysemy" (2), "context variance" (3) and "identity" (4), respectively.
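
As a purely illustrative aside, this reading of the scale can be written as a simple lookup table (the variable name below is hypothetical):

    # The DURel scale interpretation described above (illustrative only).
    DUREL_INTERPRETATION = {
        1: "homonymy",          # completely unrelated meanings
        2: "polysemy",          # distantly related meanings
        3: "context variance",  # closely related meanings
        4: "identity",          # identical meanings
    }

    print(DUREL_INTERPRETATION[2])  # -> polysemy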

Subtask 2: Mean Disagreement Ranking with Ordinal Word-in-Context Judgments (DisWiC)

For each use pair $(u_1,u_2)$, participants are asked to predict the mean of pairwise absolute judgment differences between annotators: $D(J)=\frac{1}{|J|}\sum_{(j_1,j_2)\in J}|j_1-j_2|$, where $J$ is the set of unique pairwise combinations of judgments. For pair (1,2) from above, $D(J)=\frac{1}{2}(|4-4|+|4-4|)=0.0$, while for (1,3) it amounts to $\frac{1}{3}(|2-3|+|2-2|+|3-2|)=0.667$. DisWiC can be seen as a ranking task: participants are asked to rank instances according to the magnitude of disagreement observed between annotators. It differs from previous tasks (Leonardelli et al., 2023) by aggregating "gold" labels purely over judgment differences, thus making disagreement the explicit ranking aim. Predictions will be evaluated against the mean disagreement labels with Spearman's $\rho$ (Spearman, 1904).
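
The following minimal sketch shows how $D(J)$ and the Spearman evaluation could be computed. It is not the official scorer; the annotator judgments and prediction scores are invented for illustration, and SciPy is assumed to be available.

    # Illustrative sketch only -- not the official evaluation script.
    from itertools import combinations
    from scipy.stats import spearmanr

    def mean_disagreement(judgments):
        """Mean absolute difference over all unique pairwise combinations of judgments."""
        pairs = list(combinations(judgments, 2))
        return sum(abs(j1 - j2) for j1, j2 in pairs) / len(pairs)

    # Hypothetical annotator judgments per use pair (ordinal scale 1-4).
    gold_judgments = {
        "arm_(1,2)": [4, 4, 4],  # full agreement -> D(J) = 0.0
        "arm_(1,3)": [2, 3, 2],  # D(J) = (1 + 0 + 1) / 3 = 0.667, as in the example above
    }
    gold = {pair: mean_disagreement(js) for pair, js in gold_judgments.items()}

    # Hypothetical system scores; only the induced ranking matters for Spearman's rho.
    predictions = {"arm_(1,2)": 0.1, "arm_(1,3)": 0.8}

    order = sorted(gold)
    rho, _ = spearmanr([gold[p] for p in order], [predictions[p] for p in order])
    print(f"Spearman's rho: {rho:.3f}")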

Organization

Please register for our CodaLab competition and join our Google group to participate. (After logging into your Google account, you should be able to join the group directly.) The provided starting kits contain training and development data, a description of the data format, the evaluation scripts, baseline scripts, and a sample answer/prediction that can be readily uploaded to CodaLab. If you have any questions, please ask them through our Google group.

The task is organized by Dominik Schlechtweg, Tejaswi Choppa, Wei Zhao and Michael Roth.

Shared task deadlines refer to 11:59pm UTC. All other deadlines refer to 11:59pm GMT -12 hours ("anywhere in the world").

Program Committee