The NTCIR Fair Web Task


The NTCIR-18 FairWeb-2 Task


Updates

Japanese-speaking researchers – please have a look at the IPSJ paper in the Papers section (September 19, 2024)
FairWeb-2 task slide deck (June 15, 2024)


Important Dates

Nov 1, 2024: test topics released; task registrations due
Dec 15, 2024: run submissions due
Dec 2024-Jan 2025: Entity annotation + evaluation
Feb 1, 2025: Evaluation results + draft overview released
June 10-13, 2025: NTCIR-18@NII, Tokyo


Topic types, attribute sets, and gold distributions

Following the NTCIR-17 FairWeb-1 task, we will have the following three topic types: M (movie), R (researcher), and Y (YouTube video).
The FairWeb-1 test topics can be found here. The FairWeb-2 test topics will be very similar. The Web Search Subtask and the Conversational Search Subtask will use the same test topics.

We have the following attribute sets for evaluating group fairness, depending on the topic type.
- M topics: RATINGS with four ordinal groups (based on the number of ratings on IMDb), and ORIGIN with eight nominal groups (based on the “country of origin” field on IMDb);
- R topics: HINDEX with four ordinal groups (based on the Google Scholar h-index), and PRONOUN with three nominal groups (based on whether “he”, “she”, or neither occurs in the researcher’s bio);
- Y topics: SUBSCS with four ordinal groups (based on the number of subscribers of the YouTube channel).
See our slide deck for more information about the grouping in each attribute set.

For evaluating group fairness in terms of each attribute set, we will use a uniform distribution over all groups as the gold distribution, except for ORIGIN. For ORIGIN, the gold distribution is defined based on how many countries/regions each group contains, as shown in this Excel file.
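
As a rough illustration (the helper functions and the ORIGIN group sizes below are made up for this sketch; the actual sizes are given in the Excel file above), a gold distribution can be represented as a probability vector over the groups of an attribute set:

    # Minimal sketch of gold distributions (hypothetical numbers for illustration).
    # The real ORIGIN group sizes are given in the Excel file linked above.

    def uniform_gold(num_groups: int) -> list[float]:
        """Uniform gold distribution, used for RATINGS, HINDEX, PRONOUN, and SUBSCS."""
        return [1.0 / num_groups] * num_groups

    def origin_gold(countries_per_group: list[int]) -> list[float]:
        """ORIGIN gold distribution: proportional to the number of
        countries/regions each group contains (sizes here are made up)."""
        total = sum(countries_per_group)
        return [n / total for n in countries_per_group]

    print(uniform_gold(4))                                # e.g., RATINGS: [0.25, 0.25, 0.25, 0.25]
    print(origin_gold([50, 40, 30, 25, 20, 15, 10, 5]))   # hypothetical sizes for 8 ORIGIN groups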


Web Search Subtask (same as FairWeb-1, but features a reproducibility challenge)

In terms of input to and output from participating systems, this is a regular ad hoc IR task: the input is a topic, and the output is a ranked list of document IDs. The target web corpus is Chuweb21D (See [Chu+23SIGIRAP]). However, submitted runs are evaluated not only from the viewpoint of relevance, but also from the viewpoint of group fairness (e.g., giving fair exposure to less well-known entities).

Reproducibility

This challenge addresses whether a top performer in the NTCIR-17 FairWeb-1 web search task can be reproduced with the new NTCIR-18 FairWeb-2 topics. More specifically, we examine whether THUIR-QD-RR4, a diversity-based run from Tsinghua University (See [Tao+23NTCIR17][Tu+23NTCIR17]), can be reproduced by another team. We do this as follows.
- THUIR will use the algorithm used at NTCIR-17 to process the new NTCIR-18 FairWeb-2 topics; this represents the SOTA as of NTCIR-17. The new run is referred to as the REV (revived) run.
- Participating teams interested in reproducing what THUIR did will try to do so (the organisers will provide baseline runs, since THUIR-QD-RR4 was generated by reranking the NTCIR-17 baseline run). Such a run is referred to as a REP (reproduced) run.
- The organisers will compare each REP run with the REV run to quantify the degree of reproducibility.

Advancing the SOTA in Web Search

Of course, participants can also aim to establish a new SOTA by applying novel web search algorithms that achieve high effectiveness in terms of relevance AND/OR group fairness. If some manual effort was involved in creating a run (e.g. manually formulating queries based on the topic description), that run is called an MN (manual) run. Other runs are either RR (reranked – the run is obtained by reranking the baseline runs provided by the organisers) or RG (regular – any non-manual run that is not an RR run).

To see whether a new algorithm should be considered the new SOTA in web search, the submitted run is compared with the REV run described above. If the new run statistically significantly outperforms the REV run, and also outperforms all other runs on average, it becomes the new SOTA.
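
The exact statistical procedure is decided by the organisers; purely as an illustration (the choice of a paired t-test and the per-topic scores below are assumptions, not the official protocol), a paired comparison of per-topic scores might look like this:

    # Illustrative sketch only: compares per-topic scores of a new run against
    # the REV run with a paired t-test. The official procedure may differ.
    from scipy import stats

    def significantly_better(new_scores, rev_scores, alpha=0.05):
        """new_scores/rev_scores: per-topic effectiveness scores, same topic order."""
        result = stats.ttest_rel(new_scores, rev_scores)
        mean_diff = sum(n - r for n, r in zip(new_scores, rev_scores)) / len(new_scores)
        return mean_diff > 0 and result.pvalue < alpha

    # Example with made-up per-topic scores:
    print(significantly_better([0.42, 0.55, 0.61, 0.38], [0.35, 0.50, 0.58, 0.30]))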

Run submission instructions

Runs submitted to the Web Search subtask are referred to as WS runs.
Our slide deck contains the following information about WS runs.
- How each WS run file should be named (and the maximum number of runs allowed per team)
- WS run file format (basically the same as a TREC run file; see the illustrative lines below)
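
For reference, a TREC-style run file has one line per retrieved document: topic ID, the literal string Q0, document ID, rank, score, and run name. All identifiers below are made up; the actual topic IDs, document IDs, and run naming conventions are specified in the slide deck.

    <topicID> Q0 <documentID> <rank> <score> <runName>

    e.g. (hypothetical values)
    0001 Q0 chuweb21d-000123 1 27.35 TEAMX-WS-RG-1
    0001 Q0 chuweb21d-004567 2 25.10 TEAMX-WS-RG-1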

WS runs from each team should be submitted to the Organisers’ mailing list as a single zip file, with the file name [TEAMNAME]-WS.zip (e.g., THUIR-WS.zip).

Evaluation Measures

Following FairWeb-1, we will use the Group Fairness and Relevance (GFR) framework for evaluating the web search results in terms of relevance and group fairness (See [Sakai+23TOIS][Tao+23NTCIR17]). For group fairness evaluation based on attribute sets containing nominal groups (i.e., ORIGIN for M-topics and PRONOUN for R-topics), the similarity between the target distribution and the system’s achieved distribution is measured based on Jensen-Shannon Divergence (JSD). For group fairness evaluation based on attribute sets containing ordinal groups (i.e., RATINGS for M-topics, HINDEX for R-topics, and SUBSCS for Y-topics), the similarity is measured based on Normalised Match Distance (NMD) and Root Normalised Order-aware Divergence (RNOD). These divergences are described in [Sakai21ACL].
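
Purely as an unofficial illustration of JSD and NMD (RNOD and the authoritative definitions of all three divergences are in [Sakai21ACL]), the following Python sketch compares a gold distribution with a hypothetical achieved distribution:

    # Unofficial sketch of JSD and NMD over probability distributions;
    # see [Sakai21ACL] for the authoritative definitions (including RNOD).
    import math

    def jsd(p, q):
        """Jensen-Shannon Divergence (base-2 logs) between distributions p and q."""
        m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
        def kl(a, b):
            return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def nmd(p, q):
        """Normalised Match Distance: L1 distance between the cumulative
        distributions, divided by (number of groups - 1)."""
        cum_p = cum_q = dist = 0.0
        for pi, qi in zip(p, q):
            cum_p += pi
            cum_q += qi
            dist += abs(cum_p - cum_q)
        return dist / (len(p) - 1)

    gold = [0.25, 0.25, 0.25, 0.25]      # e.g., uniform over four ordinal groups
    achieved = [0.55, 0.25, 0.15, 0.05]  # hypothetical achieved distribution
    print(jsd(gold, achieved), nmd(gold, achieved))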

We will quantify reproducibility using several methods including those discussed in [Breuer+20SIGIR].
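
For example, one of the score-level measures discussed in [Breuer+20SIGIR] is the root mean square error (RMSE) between the per-topic scores of a reproduced run and the original (here, the REV) run; a minimal sketch with made-up scores:

    # Sketch of score-level reproducibility: RMSE between per-topic scores of a
    # REP run and the REV run (cf. [Breuer+20SIGIR]). Scores below are made up.
    import math

    def rmse(rep_scores, rev_scores):
        """Lower is better; 0 means the per-topic scores match exactly."""
        return math.sqrt(
            sum((a - b) ** 2 for a, b in zip(rep_scores, rev_scores)) / len(rep_scores)
        )

    print(rmse([0.40, 0.52, 0.60], [0.42, 0.50, 0.63]))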


Conversational Search Subtask (NEW!)

Unlike the Web Search Subtask, where participating systems are required to submit a ranked list of document IDs for each topic, the new Conversational Search Subtask requires systems to return a textual user-system conversation for each topic. The conversation can be single-turn (one user prompt followed by one system turn) or multi-turn (e.g., user-system-user-system).

Run submission instructions

Runs submitted to the Conversational Search subtask are referred to as CS runs.
Our slide deck contains the following information about CS runs.
- How each CS run file should be named (and the maximum number of runs allowed per team)
- CS run file format (contains a plain text conversation for each topic)
- What constitutes a plain text conversation (Here’s a sample)

CS runs from each team should be submitted to the Organisers’ mailing list as a single zip file, with the file name [TEAMNAME]-CS.zip (e.g., THUIR-CS.zip).

Evaluation Measures

The Group Fairness and Relevance for Conversations (GFRC) framework [Sakai23EVIA] will be used for evaluating the textual conversations.
See also the appendix in our slide deck.
Ideally, relevant entities (or “nuggets”) should be placed near the beginning of the conversation,
and the user turns should be as concise as possible,
because GFRC discounts later-occurring nuggets in a conversation just like
nDCG discounts later-occurring documents in a ranked list.
A long user turn is regarded as representing substantial user effort; hence any relevant entity that follows it is discounted accordingly.
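
To illustrate the principle only (this is not the official GFRC formula, which is defined in [Sakai23EVIA]; the [ENTITY] markup and the patience constant below are made up), a text-length-based discount could be sketched as follows:

    # Illustrative only: an nDCG-like, text-length-based discount in the spirit
    # of GFRC. The real GFRC measure is defined in [Sakai23EVIA].
    import math

    def discounted_gain(conversation, patience=500):
        """conversation: list of (speaker, text) turns, with relevant entities
        marked by a hypothetical [ENTITY] tag. Each entity contributes a unit
        gain, discounted by the amount of text that precedes it (so entities
        after long user turns, or late in the conversation, count less)."""
        total, chars_before = 0.0, 0
        for _speaker, text in conversation:
            pos = text.find("[ENTITY]")
            while pos != -1:
                total += 1.0 / math.log2(2 + (chars_before + pos) / patience)
                pos = text.find("[ENTITY]", pos + 1)
            chars_before += len(text)
        return total

    example = [
        ("user", "Recommend some sci-fi movies, please."),
        ("system", "You might enjoy [ENTITY]Movie A[/ENTITY] and [ENTITY]Movie B[/ENTITY]."),
    ]
    print(discounted_gain(example))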

We use JSD, NMD, and RNOD as divergences in the same way as in the Web Search Subtask.


Organisers

fairweb2org at list.waseda.jp

Maria Maistro (University of Copenhagen, Denmark)
Tetsuya Sakai (Waseda University, Japan)
Sijie Tao (Waseda University, Japan)
Junjie Wang (Waseda University, Japan)
Hanpei Fang (Waseda University, Japan)
Yuxiang Zhang (Waseda University, Japan)
Nuo Chen (The Hong Kong Polytechnic University, China)
Haitao Li (Tsinghua University, China)
Yiteng Tu (Tsinghua University, China)


Papers

[Sakai+24IPSJ] Tetsuya Sakai, Sijie Tao, Hanpei Fang, Yuxiang Zhang: FairWeb-2: Group-Fair Web Search and Conversational Search (in Japanese), IPSJ SIG Technical Report 2024-DBS-179/2024-IFAT-156, No.22, 2024. paper pdf slides
[Breuer+20SIGIR] Timo Breuer, Nicola Ferro, Norbert Fuhr, Maria Maistro, Tetsuya Sakai, Philipp Schaer, and Ian Soboroff: How to Measure the Reproducibility of System-oriented IR Experiments, Proceedings of ACM SIGIR 2020, pp.349-358, 2020.
[Chu+23SIGIRAP] Zhumin Chu, Tetsuya Sakai, Qingyao Ai, and Yiqun Liu: Chuweb21D: A Deduped English Document Collection for Web Search Tasks, Proceedings of ACM SIGIR-AP 2023, pp.63-72, 2023. pdf
[Sakai21ACL] Tetsuya Sakai: Evaluating Evaluation Measures for Ordinal Classification and Ordinal Quantification, Proceedings of ACL-IJCNLP 2021, pp.2759-2769, 2021. paper lots of videos and slides
[Sakai23EVIA] Tetsuya Sakai: Fairness-based Evaluation of Conversational Search: A Pilot Study, Proceedings of EVIA 2023, pp.5-13, 2023. paper slides YouTube video (from around 27:20)
[Sakai+23TOIS] Tetsuya Sakai, Jin Young Kim, and Inho Kang: A Versatile Framework for Evaluating Ranked Lists in terms of Group Fairness and Relevance, ACM TOIS, 2023. open access pdf
[Tao+23NTCIR17] Sijie Tao, Nuo Chen, Tetsuya Sakai, Zhumin Chu, Hiromi Arai, Ian Soboroff, Nicola Ferro, Maria Maistro: Overview of the NTCIR-17 FairWeb-1 Task, Proceedings of NTCIR-17, pp.284-305, 2023. pdf
[Tu+23NTCIR17] Yiteng Tu, Haitao Li, Zhumin Chu, Qingyao Ai, and Yiqun Liu: THUIR at the NTCIR-17 FairWeb-1 Task: An Initial Exploration of the Relationship Between Relevance and Fairness, Proceedings of NTCIR-17, pp.319-324, 2023. pdf


Links

NTCIR-17 FairWeb-1
NTCIR