SemEval 2024 BRAINTEASER: A Novel Task Defying Common Sense

The Codalab Competition is now available!!!

Motivation

Human reasoning processes comprise two types of thinking: vertical and lateral. Vertical thinking, also known as linear, convergent, or logical thinking, is a sequential analytical process that is based on rationality, logic, and rules. Meanwhile, lateral thinking (or “thinking outside the box”) is a divergent and creative process that involves looking at a problem from a new perspective and defying preconceptions.

The success of language models has inspired the NLP community to attend to tasks that require implicit and complex reasoning, relying on human-like commonsense mechanisms. While such vertical thinking tasks have been relatively popular, lateral thinking puzzles have received little attention. To bridge this gap, we devise BRAINTEASER: a multiple-choice Question Answering task designed to test the model’s ability to exhibit lateral thinking and defy default commonsense associations.

The BRAINTEASER QA task consists of two subtasks, Sentence Puzzle and Word Puzzle, both of which require awareness of commonsense “defaults” and the ability to override them through unconventional thinking that distinguishes these defaults from hard constraints.

Sentence Puzzle: A sentence-type brain teaser in which the commonsense-defying puzzle is centered on sentence snippets.
Word Puzzle: A word-type brain teaser in which the answer violates the default meaning of a word and focuses on the letter composition of the target question.

Both tasks include an adversarial subset, created by manually modifying the original brain teasers without changing their latent reasoning path.

Task Example

Here is one example from each subtask:

Question	Choice
A man shaves everyday, yet keeps his beard long.	He is a barber. He wants to maintain his appearance. He wants his girlfriend to buy him a razor. None of the above.

What part of London is in France?	The letter N. The letter O. The letter L. None of the above.

To ensure that our task evaluates reasoning ability rather than memorization, we construct adversarial versions of the original data in two ways:

Semantic Reconstruction rephrases the original question without changing the correct answer and the distractors.
Context Reconstruction keeps the original reasoning path but changes both the question and the answer to describe a new situational context.

Here is an example of the two adversarial versions of a Sentence Puzzle:

Adversarial Strategy	Question	Choice
Original	A man shaves everyday, yet keeps his beard long.	He is a barber. He wants to maintain his appearance. He wants his girlfriend to buy him a razor. None of the above.

Semantic Reconstruction	A man preserves a lengthy beard despite shaving every day.	He is a barber. He wants to maintain his appearance. He wants his girlfriend to buy him a razor. None of the above.

Context Reconstruction	Tom attends class every day but doesn’t do any homework.	He is a teacher. He is a lazy person. His teacher will not let him fail. None of the above.

Each system will be evaluated based on the following two accuracy metrics:

Instance-based Accuracy: We consider each question (original/adversarial) as a separate instance. We will report accuracy for the original and its adversaries.
Group-based Accuracy: Each question and its associated adversarial instances form a group, and a system will only receive a score of 1 when it correctly solves all questions in the group.
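
The two metrics can be sketched in a few lines of Python. This is a minimal illustration rather than the official evaluation script: the record fields (`group`, `variant`, `pred`, `gold`) and the variant names are hypothetical, chosen only for this example, and the actual submission format may differ.

```python
from collections import defaultdict

# Hypothetical records: one per question. "group" ties an original question
# to its adversarial versions; "pred" and "gold" are choice indices.
records = [
    {"group": "q1", "variant": "original", "pred": 0, "gold": 0},
    {"group": "q1", "variant": "semantic", "pred": 0, "gold": 0},
    {"group": "q1", "variant": "context", "pred": 1, "gold": 1},
    {"group": "q2", "variant": "original", "pred": 2, "gold": 2},
    {"group": "q2", "variant": "semantic", "pred": 2, "gold": 2},
    {"group": "q2", "variant": "context", "pred": 0, "gold": 3},
]


def instance_accuracy(records):
    """Treat every question as a separate instance; report accuracy per variant."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["variant"]] += 1
        correct[r["variant"]] += int(r["pred"] == r["gold"])
    return {v: correct[v] / total[v] for v in total}


def group_accuracy(records, variants):
    """A group scores 1 only if all listed variants in it are answered correctly."""
    groups = defaultdict(dict)
    for r in records:
        if r["variant"] in variants:
            groups[r["group"]][r["variant"]] = r["pred"] == r["gold"]
    # A variant missing from a group counts as incorrect.
    scored = [all(g.get(v, False) for v in variants) for g in groups.values()]
    return sum(scored) / len(scored) if scored else 0.0


inst = instance_accuracy(records)
ori_sem = group_accuracy(records, ["original", "semantic"])
ori_sem_con = group_accuracy(records, ["original", "semantic", "context"])
print(inst, ori_sem, ori_sem_con)
```

In this toy data, one wrong answer on a context reconstruction leaves the original and semantic accuracies untouched but halves the three-way group score, which is why group-based accuracy is the stricter of the two metrics.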

Data

The training data is now available. The training and validation splits will be released according to the SemEval timeline.

Registration form for participation and the legal usage of data.

Mailing list for task updates.

For further questions, please contact: yifjia@isi.edu

Codalab

The tasks are hosted on CodaLab, with the link made available according to the SemEval schedule. Participants are encouraged to register as early as possible and to join the mailing list to stay informed of updates.

The Codalab Competition is now available!!!

Leaderboard (evaluated on the test set (20%) of the competition)

Sentence Puzzle

Rank	Team	Original	Semantic	Context	Ori & Sem	Ori & Sem & Con	Overall
1	Abdelhak	1.000 (1)	1.000 (1)	0.950 (1)	1.000 (1)	0.950 (1)	0.983 (1)
2	HW-TSC	1.000 (1)	0.975 (2)	0.925 (2)	0.975 (2)	0.900 (3)	0.967 (2)
3	Maxine	0.975 (2)	0.975 (2)	0.925 (2)	0.950 (3)	0.900 (3)	0.958 (3)
4	YingluLi	0.975 (2)	0.950 (3)	0.925 (2)	0.950 (3)	0.900 (3)	0.950 (4)
4	Theo	0.950 (3)	0.950 (3)	0.950 (1)	0.950 (3)	0.925 (2)	0.950 (4)
5	somethingx95	0.950 (3)	0.950 (3)	0.925 (2)	0.950 (3)	0.900 (3)	0.942 (5)
5	gerald	0.950 (3)	0.950 (3)	0.925 (2)	0.950 (3)	0.900 (3)	0.942 (5)
6	AmazUtah_NLP	0.925 (4)	0.950 (3)	0.900 (3)	0.925 (4)	0.875 (4)	0.925 (6)
7	BITS Pilani	0.975 (2)	0.925 (4)	0.800 (7)	0.925 (4)	0.775 (6)	0.900 (7)
7	ALF	0.925 (4)	0.950 (3)	0.825 (6)	0.925 (4)	0.825 (5)	0.900 (7)
8	uTeBC-NLP	0.975 (2)	0.875 (6)	0.825 (6)	0.850 (7)	0.750 (7)	0.892 (8)
8	jkarolczak	0.975 (2)	0.875 (6)	0.825 (6)	0.875 (6)	0.775 (6)	0.892 (8)
8	kubapok	0.925 (4)	0.900 (5)	0.850 (5)	0.900 (5)	0.825 (5)	0.892 (8)
8	yangqi	0.900 (5)	0.900 (5)	0.875 (4)	0.900 (5)	0.875 (4)	0.892 (8)
9	Mothman	0.975 (2)	0.850 (7)	0.800 (7)	0.850 (7)	0.700 (9)	0.875 (9)
10	zero_shot_is_all_you_need	0.950 (3)	0.825 (8)	0.825 (6)	0.800 (9)	0.725 (8)	0.867 (10)
10	OUNLP	0.950 (3)	0.875 (6)	0.775 (8)	0.850 (7)	0.725 (8)	0.867 (10)
11	justingu	0.950 (3)	0.825 (8)	0.775 (8)	0.825 (8)	0.700 (9)	0.850 (11)
11	BAMO	0.900 (5)	0.825 (8)	0.825 (6)	0.825 (8)	0.700 (9)	0.850 (11)
12	YNU-HPCC	0.900 (5)	0.825 (8)	0.800 (7)	0.825 (8)	0.725 (8)	0.842 (12)
13	FtG-CoT	0.900 (5)	0.825 (8)	0.775 (8)	0.800 (9)	0.675 (10)	0.833 (13)
13	MasonTigers	0.850 (6)	0.825 (8)	0.825 (6)	0.800 (9)	0.700 (9)	0.833 (13)
14	AILS-NTUA	0.850 (6)	0.825 (8)	0.775 (8)	0.825 (8)	0.700 (9)	0.817 (14)
15	RiddleMaster	0.800 (8)	0.775 (10)	0.800 (7)	0.725 (12)	0.650 (11)	0.792 (15)
15	UMBCLU	0.750 (10)	0.850 (7)	0.775 (8)	0.725 (12)	0.600 (13)	0.792 (15)
16	johnp	0.850 (6)	0.775 (10)	0.725 (10)	0.750 (11)	0.675 (10)	0.783 (16)
16	MABUSETTEH	0.800 (8)	0.775 (10)	0.775 (8)	0.775 (10)	0.700 (9)	0.783 (16)
16	KnowComp	0.825 (7)	0.775 (10)	0.750 (9)	0.725 (12)	0.625 (15)	0.783 (16)
17	ehsan.tavan	0.800 (8)	0.800 (9)	0.725 (10)	0.775 (10)	0.675 (10)	0.775 (17)
17	amr8ta	0.775 (9)	0.775 (10)	0.775 (8)	0.750 (11)	0.650 (11)	0.775 (17)
18	yiannispn	0.800 (8)	0.800 (9)	0.700 (11)	0.750 (11)	0.625 (12)	0.767 (18)
19	haha123	0.825 (7)	0.775 (10)	0.675 (12)	0.750 (11)	0.625 (12)	0.758 (19)
19	adriti	0.750 (10)	0.725 (12)	0.800 (7)	0.725 (12)	0.675 (10)	0.758 (19)
19	TienDat23	0.725 (11)	0.800 (9)	0.750 (9)	0.675 (14)	0.525 (16)	0.758 (19)
20	Deja Vu	0.775 (9)	0.700 (13)	0.775 (8)	0.700 (13)	0.625 (12)	0.750 (20)
20	NIMZ	0.750 (10)	0.725 (12)	0.775 (8)	0.700 (13)	0.675 (10)	0.750 (20)
21	iREL	0.775 (9)	0.725 (12)	0.700 (11)	0.700 (13)	0.575 (14)	0.733 (21)
21	GeminiPro	0.750 (10)	0.750 (11)	0.700 (11)	0.700 (13)	0.600 (13)	0.733 (21)
22	caoyongwang	0.800 (8)	0.700 (13)	0.675 (12)	0.700 (13)	0.550 (15)	0.725 (22)
23	IIMAS	0.650 (12)	0.675 (14)	0.650 (13)	0.600 (16)	0.500 (17)	0.658 (23)
24	IUST-NLPLAB	0.625 (13)	0.625 (15)	0.575 (15)	0.625 (15)	0.500 (17)	0.608 (24)
25	ROSHA	0.625 (13)	0.575 (16)	0.600 (14)	0.500 (17)	0.375 (18)	0.600 (25)
26	Team DaVinci	0.575 (14)	0.550 (17)	0.425 (17)	0.500 (17)	0.300 (19)	0.517 (26)
27	StFX-NLP	0.425 (15)	0.400 (18)	0.475 (16)	0.350 (18)	0.200 (20)	0.433 (27)
28	Team 9	0.275 (17)	0.275 (19)	0.200 (20)	0.100 (20)	0.000 (23)	0.250 (28)
28	DeBERTa	0.225 (18)	0.250 (20)	0.275 (19)	0.200 (19)	0.075 (21)	0.250 (28)
29	amirhallaji	0.225 (18)	0.200 (21)	0.300 (18)	0.050 (22)	0.025 (22)	0.242 (29)
30	maryam.najafi	0.225 (18)	0.275 (19)	0.200 (20)	0.100 (20)	0.025 (22)	0.233 (30)

Word Puzzle

Rank	Team	Original	Semantic	Context	Ori & Sem	Ori & Sem & Con	Overall
1	Theo	1.000 (1)	1.000 (1)	0.969 (2)	1.000 (1)	0.969 (1)	0.990 (1)
1	gerald	1.000 (1)	1.000 (1)	0.969 (2)	1.000 (1)	0.969 (1)	0.990 (1)
2	somethingx95	1.000 (1)	1.000 (1)	0.938 (3)	1.000 (1)	0.938 (2)	0.979 (2)
2	zero_shot_is_all_you_need	1.000 (1)	1.000 (1)	0.938 (3)	1.000 (1)	0.938 (2)	0.979 (2)
2	MasonTigers	0.969 (2)	0.969 (2)	1.000 (1)	0.969 (2)	0.969 (1)	0.979 (2)
3	HW-TSC	0.969 (2)	0.938 (3)	1.000 (1)	0.938 (3)	0.938 (2)	0.969 (3)
3	Maxine	0.969 (2)	0.938 (3)	1.000 (1)	0.938 (3)	0.938 (2)	0.969 (3)
3	YingluLi	0.969 (2)	0.938 (3)	1.000 (1)	0.938 (3)	0.938 (2)	0.969 (3)
4	kubapok	0.906 (4)	1.000 (1)	0.938 (3)	0.906 (4)	0.844 (3)	0.948 (4)
5	BITS Pilani	0.938 (3)	0.938 (3)	0.875 (4)	0.938 (3)	0.812 (4)	0.917 (5)
5	justingu	0.938 (3)	0.938 (3)	0.875 (4)	0.906 (4)	0.781 (5)	0.917 (5)
6	jkarolczak	0.906 (4)	0.938 (3)	0.781 (7)	0.875 (5)	0.688 (8)	0.875 (6)
6	yangqi	0.906 (4)	0.938 (3)	0.781 (7)	0.906 (4)	0.688 (8)	0.875 (6)
6	ehsan.tavan	0.906 (4)	0.875 (5)	0.844 (5)	0.812 (6)	0.750 (6)	0.875 (6)
7	AILS-NTUA	0.875 (5)	0.906 (4)	0.781 (7)	0.812 (6)	0.719 (7)	0.854 (7)
7	johnp	0.875 (5)	0.906 (4)	0.781 (7)	0.812 (6)	0.719 (7)	0.854 (7)
7	caoyongwang	0.844 (6)	0.844 (6)	0.875 (4)	0.781 (7)	0.719 (7)	0.854 (7)
7	KnowComp	0.844 (6)	0.906 (4)	0.812 (6)	0.844	0.656 (9)	0.854 (7)
8	RiddleMaster	0.844 (6)	0.844 (6)	0.844 (5)	0.781 (7)	0.656 (9)	0.844 (8)
9	yiannispn	0.844 (6)	0.844 (6)	0.812 (6)	0.719 (9)	0.625 (10)	0.833 (9)
10	AmazUtah_NLP	0.844 (6)	0.812 (7)	0.750 (8)	0.781 (7)	0.594 (11)	0.802 (10)
11	OUNLP	0.781 (7)	0.812 (7)	0.781 (7)	0.719 (9)	0.531 (12)	0.792 (11)
11	UMBCLU	0.781 (7)	0.750 (8)	0.844 (5)	0.719 (9)	0.625 (10)	0.792 (11)
11	TienDat23	0.844 (6)	0.750 (8)	0.781 (7)	0.750 (8)	0.625 (10)	0.792 (11)
12	GeminiPro	0.781 (7)	0.719 (9)	0.844 (5)	0.594 (11)	0.594 (11)	0.781 (12)
13	YNU-HPCC	0.781 (7)	0.719 (9)	0.812 (6)	0.719 (9)	0.625 (10)	0.771 (13)
14	iREL	0.719 (8)	0.719 (9)	0.781 (7)	0.562 (12)	0.531 (12)	0.740 (14)
15	Team DaVinci	0.719 (8)	0.719 (9)	0.625 (9)	0.594 (11)	0.469 (13)	0.688 (15)
16	Abdelhak	0.625 (10)	0.625 (10)	0.594 (10)	0.562 (12)	0.406 (15)	0.615 (16)
17	amr8ta	0.625 (10)	0.625 (10)	0.562 (11)	0.594 (11)	0.438 (14)	0.604 (17)
17	adriti	0.656 (9)	0.625 (10)	0.531 (12)	0.625 (10)	0.375 (16)	0.604 (17)
18	MABUSETTEH	0.594 (11)	0.625 (10)	0.531 (12)	0.562 (12)	0.281 (17)	0.583 (18)
19	NIMZ	0.438 (12)	0.469 (11)	0.438 (13)	0.406 (13)	0.219 (19)	0.448 (19)
20	Deja Vu	0.375 (14)	0.469 (11)	0.375 (15)	0.344 (15)	0.125 (20)	0.406 (20)
20	ROSHA	0.438 (12)	0.375 (12)	0.406 (14)	0.375 (14)	0.250 (18)	0.406 (20)
21	StFX-NLP	0.406 (13)	0.219 (14)	0.344 (16)	0.125 (16)	0.062 (21)	0.323 (21)
22	IIMAS	0.250 (15)	0.250 (13)	0.281 (17)	0.125 (16)	0.062 (21)	0.260 (22)

Average Score

Rank	Team	Sentence Puzzle	Word Puzzle	Average
1	Theo	0.950 (4)	0.990 (1)	0.97
2	HW-TSC	0.967 (2)	0.969 (3)	0.968
3	gerald	0.942 (5)	0.990 (1)	0.966
4	Maxine	0.958 (3)	0.969 (3)	0.9635
5	somethingx95	0.942 (5)	0.979 (2)	0.9605
6	YingluLi	0.950 (4)	0.969 (3)	0.9595
7	zero_shot_is_all_you_need	0.867 (10)	0.979 (2)	0.923
8	kubapok	0.892 (8)	0.948 (4)	0.92
9	BITS Pilani	0.900 (7)	0.917 (5)	0.9085
10	MasonTigers	0.833 (13)	0.979 (2)	0.906
11	jkarolczak	0.892 (8)	0.875 (6)	0.8835
11	justingu	0.850 (11)	0.917 (5)	0.8835
11	yangqi	0.892 (8)	0.875 (6)	0.8835
12	AmazUtah_NLP	0.925 (6)	0.802 (10)	0.8635
13	AILS-NTUA	0.817 (14)	0.854 (7)	0.8355
14	OUNLP	0.867 (10)	0.792 (11)	0.8295
15	ehsan.tavan	0.775 (17)	0.875 (6)	0.825
16	johnp	0.783 (16)	0.854 (7)	0.8185
16	KnowComp	0.783 (16)	0.854 (7)	0.8185
17	RiddleMaster	0.792 (15)	0.844 (8)	0.818
18	YNU-HPCC	0.842 (12)	0.771 (13)	0.8065
19	yiannispn	0.767 (18)	0.833 (9)	0.8
20	Abdelhak	0.983 (1)	0.615 (16)	0.799
21	UMBCLU	0.792 (15)	0.792 (11)	0.792
22	caoyongwang	0.725 (22)	0.854 (7)	0.7895
23	TienDat23	0.758 (19)	0.792 (11)	0.775
24	GeminiPro	0.733 (21)	0.781 (12)	0.757
25	iREL	0.733 (21)	0.740 (14)	0.7365
26	amr8ta	0.775 (17)	0.604 (17)	0.6895
27	MABUSETTEH	0.783 (16)	0.583 (18)	0.683
28	adriti	0.758 (19)	0.604 (17)	0.681
29	Team DaVinci	0.517 (26)	0.688 (15)	0.6025
30	NIMZ	0.750 (20)	0.448 (19)	0.599
31	Deja Vu	0.750 (20)	0.406 (20)	0.578
32	ROSHA	0.600 (25)	0.406 (20)	0.503
33	IIMAS	0.658 (23)	0.260 (22)	0.459
34	StFX-NLP	0.433 (27)	0.323 (21)	0.378
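
The Average column is consistent with the unweighted mean of a team's two Overall scores (for example, Theo: (0.950 + 0.990) / 2 = 0.97). A minimal sketch of that calculation follows; the function name is hypothetical, and this is inferred from the table rather than taken from the organizers' scoring code.

```python
def average_score(sentence_overall: float, word_overall: float) -> float:
    """Unweighted mean of the two subtask Overall scores (assumed formula)."""
    return round((sentence_overall + word_overall) / 2, 4)


# Values copied from the Sentence Puzzle and Word Puzzle tables.
theo = average_score(0.950, 0.990)
hw_tsc = average_score(0.967, 0.969)
print(theo, hw_tsc)
```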

Important Dates

Event	Date
Tasks announced (with sample data available)	17 July 2023
Training data ready	4 September 2023
Competition Practice Phase start	13 September 2023
Evaluation start	10 January 2024
Evaluation end by	31 January 2024 (latest date; task organizers may choose an earlier date)
Paper submission due	19 February 2024
Notification to authors	1 April 2024
Camera ready due	22 April 2024
SemEval workshop	TBD, 2024 (co-located with a major NLP conference)

Organization

Name	Affiliation	Email
Yifan Jiang	USC, ISI	yifjia@isi.edu
Filip Ilievski	USC, ISI	ilievski@isi.edu
Kaixin Ma	CMU, LTI	kaixinm@andrew.cmu.edu

We have some important notes for our task as follows:

1. We have updated the public dataset on both GitHub and CodaLab. Because our dataset is limited in size, we replaced the previous trial dataset with a portion of the training data. We want to reserve more data for testing to obtain more consistent results, and the current phase is only intended to familiarize participants with the submission format. Teams that have already submitted a file on CodaLab may submit a new file for the new trial data to get the latest feedback. This is not mandatory: the new trial data has the same format as the previous one, so a previously successful submission means your submission format is correct.
2. We also allow participants to perform the task in a zero-shot manner (without access to the training data). To keep the competition fair and to simplify later system analysis, we ask each such team to add ZS to its team name to indicate that its results were obtained in a zero-shot manner. We will assume that every team without ZS in its name is fine-tuning.
3. A detailed description of the three CodaLab phases is available at https://semeval.github.io/SemEval2024/codalab. Here is a brief summary:
Practice Phase:

Duration: Until approximately 10 Jan 2024 (official dates to be announced).
Data: Use of the official evaluation script on trial data.
High limit for submissions to allow extensive testing.
Public leaderboard to enable participants to verify their format and approach.
Purpose: Helps participants prepare by checking formatting and initial performance.

Evaluation Phase:

Duration: Approximately 10 Jan to 31 Jan 2024.
Data: Employment of the official evaluation script and official test data.
Restriction on submissions, with emphasis typically on the final valid submission on CodaLab.
Leaderboard visibility is hidden.
Purpose: Official assessment phase, focusing on competition and ranking outcomes.

Post-Evaluation Phase:

Duration: Begins around 31 Jan 2024.
Data: Continues the use of the official evaluation script and test data.
The submission limit is set high again (e.g., 999).
The leaderboard is public.
Purpose: Enables scoring of contrastive runs for detailed analysis in system description papers. Supports future analysis and scoring for systems interested in the task beyond SemEval-2024. Encourages ongoing research and analysis post-competition.
