SemEval 2024 BRAINTEASER: A Novel Task Defying Common Sense
Task Home Page
The Codalab Competition is now available!!!
Motivation
Human reasoning processes comprise two types of thinking: vertical and lateral. Vertical thinking, also known as linear, convergent, or logical thinking, is a sequential analytical process that is based on rationality, logic, and rules. Meanwhile, lateral thinking (or “thinking outside the box”) is a divergent and creative process that involves looking at a problem from a new perspective and defying preconceptions.
The success of language models has inspired the NLP community to attend to tasks that require implicit and complex reasoning, relying on human-like commonsense mechanisms. While such vertical thinking tasks have been relatively popular, lateral thinking puzzles have received little attention. To bridge this gap, we devise BRAINTEASER: a multiple-choice Question Answering task designed to test the model’s ability to exhibit lateral thinking and defy default commonsense associations.
The BRAINTEASER QA task consists of two subtasks, Sentence Puzzle and Word Puzzle, which require awareness of commonsense “defaults” and overwriting them through unconventional thinking that distinguishes these defaults from hard constraints.
- Sentence Puzzle: Sentence-type brain teaser where the puzzle defying commonsense is centered on sentence snippets.
- Word Puzzle: Word-type brain teaser where the answer violates the default meaning of the word and focuses on the letter composition of the target question.
Both tasks include an adversarial subset, created by manually modifying the original brain teasers without changing their latent reasoning path.
Task Example
Here are two examples, one from each subtask:
Question | Choice |
---|---|
A man shaves everyday, yet keeps his beard long. | He is a barber. He wants to maintain his appearance. He wants his girlfriend to buy him a razor. None of the above. |
What part of London is in France? | The letter N. The letter O. The letter L. None of the above. |
To ensure that our task evaluates reasoning ability rather than memorization, we construct adversarial versions of the original data in two ways:
- Semantic Reconstruction rephrases the original question without changing the correct answer and the distractors.
- Context Reconstruction keeps the original reasoning path but changes both the question and the answer to describe a new situational context.
Here is an example of the two adversarial versions of a Sentence Puzzle:
Adversarial Strategy | Question | Choice |
---|---|---|
Original | A man shaves everyday, yet keeps his beard long. | He is a barber. He wants to maintain his appearance. He wants his girlfriend to buy him a razor. None of the above. |
Semantic Reconstruction | A man preserves a lengthy beard despite shaving every day. | He is a barber. He wants to maintain his appearance. He wants his girlfriend to buy him a razor. None of the above. |
Context Reconstruction | Tom attends class every day but doesn’t do any homework. | He is a teacher. He is a lazy person. His teacher will not let him fail. None of the above. |
Each system will be evaluated based on the following two accuracy metrics:
- Instance-based Accuracy: We consider each question (original/adversarial) as a separate instance. We will report accuracy for the original and its adversaries.
- Group-based Accuracy: Each question and its associated adversarial instances form a group, and a system will only receive a score of 1 when it correctly solves all questions in the group.
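The two metrics above can be sketched as follows. This is a minimal illustration, not the official evaluation script: the record layout (`group`, `variant`, `pred`, `gold`) and the function names are hypothetical, chosen only for this example.

```python
from collections import defaultdict

# Hypothetical record layout (not the official data format): each record is one
# question, tagged with the group it belongs to and which variant it is.
records = [
    {"group": "g1", "variant": "original", "pred": 0, "gold": 0},
    {"group": "g1", "variant": "semantic", "pred": 0, "gold": 0},
    {"group": "g1", "variant": "context",  "pred": 2, "gold": 1},  # wrong
    {"group": "g2", "variant": "original", "pred": 3, "gold": 3},
    {"group": "g2", "variant": "semantic", "pred": 3, "gold": 3},
    {"group": "g2", "variant": "context",  "pred": 1, "gold": 1},
]

def instance_accuracy(records):
    """Instance-based accuracy: every question counts on its own, reported per variant."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["variant"]] += 1
        correct[r["variant"]] += int(r["pred"] == r["gold"])
    return {v: correct[v] / total[v] for v in total}

def group_accuracy(records, variants):
    """Group-based accuracy: a group scores 1 only if all listed variants are correct."""
    groups = defaultdict(dict)
    for r in records:
        groups[r["group"]][r["variant"]] = (r["pred"] == r["gold"])
    solved = sum(all(g.get(v, False) for v in variants) for g in groups.values())
    return solved / len(groups)

per_variant = instance_accuracy(records)
ori_sem = group_accuracy(records, ["original", "semantic"])
ori_sem_con = group_accuracy(records, ["original", "semantic", "context"])
```

In this toy data the single wrong context answer lowers the context instance accuracy and also fails the whole three-way group for `g1`, while the original-plus-semantic group is still solved. The grouped scores presumably correspond to the "Ori & Sem" and "Ori & Sem & Con" leaderboard columns, though the page does not state this explicitly.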
Data
The training data is now available. The training and validation split will be released based on the SemEval timeline.
Registration form for participation and the legal usage of data.
Mailing list for task updates.
For further questions, please contact: yifjia@isi.edu
Codalab
The tasks are hosted on CodaLab, and the competition link is released according to the SemEval schedule. Participants are encouraged to register as early as possible and to join the mailing list to stay up to date.
The Codalab Competition is now available!!!
Leaderboard (evaluated over the test set (20%) of the competition)
Sentence Puzzle
Rank | Team | Original | Semantic | Context | Ori & Sem | Ori & Sem & Con | Overall |
---|---|---|---|---|---|---|---|
1 | Abdelhak | 1.000 (1) | 1.000 (1) | 0.950 (1) | 1.000 (1) | 0.950 (1) | 0.983 (1) |
2 | HW-TSC | 1.000 (1) | 0.975 (2) | 0.925 (2) | 0.975 (2) | 0.900 (3) | 0.967 (2) |
3 | Maxine | 0.975 (2) | 0.975 (2) | 0.925 (2) | 0.950 (3) | 0.900 (3) | 0.958 (3) |
4 | YingluLi | 0.975 (2) | 0.950 (3) | 0.925 (2) | 0.950 (3) | 0.900 (3) | 0.950 (4) |
4 | Theo | 0.950 (3) | 0.950 (3) | 0.950 (1) | 0.950 (3) | 0.925 (2) | 0.950 (4) |
5 | somethingx95 | 0.950 (3) | 0.950 (3) | 0.925 (2) | 0.950 (3) | 0.900 (3) | 0.942 (5) |
5 | gerald | 0.950 (3) | 0.950 (3) | 0.925 (2) | 0.950 (3) | 0.900 (3) | 0.942 (5) |
6 | AmazUtah_NLP | 0.925 (4) | 0.950 (3) | 0.900 (3) | 0.925 (4) | 0.875 (4) | 0.925 (6) |
7 | BITS Pilani | 0.975 (2) | 0.925 (4) | 0.800 (7) | 0.925 (4) | 0.775 (6) | 0.900 (7) |
7 | ALF | 0.925 (4) | 0.950 (3) | 0.825 (6) | 0.925 (4) | 0.825 (5) | 0.900 (7) |
8 | uTeBC-NLP | 0.975 (2) | 0.875 (6) | 0.825 (6) | 0.850 (7) | 0.750 (7) | 0.892 (8) |
8 | jkarolczak | 0.975 (2) | 0.875 (6) | 0.825 (6) | 0.875 (6) | 0.775 (6) | 0.892 (8) |
8 | kubapok | 0.925 (4) | 0.900 (5) | 0.850 (5) | 0.900 (5) | 0.825 (5) | 0.892 (8) |
8 | yangqi | 0.900 (5) | 0.900 (5) | 0.875 (4) | 0.900 (5) | 0.875 (4) | 0.892 (8) |
9 | Mothman | 0.975 (2) | 0.850 (7) | 0.800 (7) | 0.850 (7) | 0.700 (9) | 0.875 (9) |
10 | zero_shot_is_all_you_need | 0.950 (3) | 0.825 (8) | 0.825 (6) | 0.800 (9) | 0.725 (8) | 0.867 (10) |
10 | OUNLP | 0.950 (3) | 0.875 (6) | 0.775 (8) | 0.850 (7) | 0.725 (8) | 0.867 (10) |
11 | justingu | 0.950 (3) | 0.825 (8) | 0.775 (8) | 0.825 (8) | 0.700 (9) | 0.850 (11) |
11 | BAMO | 0.900 (5) | 0.825 (8) | 0.825 (6) | 0.825 (8) | 0.700 (9) | 0.850 (11) |
12 | YNU-HPCC | 0.900 (5) | 0.825 (8) | 0.800 (7) | 0.825 (8) | 0.725 (8) | 0.842 (12) |
13 | FtG-CoT | 0.900 (5) | 0.825 (8) | 0.775 (8) | 0.800 (9) | 0.675 (10) | 0.833 (13) |
13 | MasonTigers | 0.850 (6) | 0.825 (8) | 0.825 (6) | 0.800 (9) | 0.700 (9) | 0.833 (13) |
14 | AILS-NTUA | 0.850 (6) | 0.825 (8) | 0.775 (8) | 0.825 (8) | 0.700 (9) | 0.817 (14) |
15 | RiddleMaster | 0.800 (8) | 0.775 (10) | 0.800 (7) | 0.725 (12) | 0.650 (11) | 0.792 (15) |
15 | UMBCLU | 0.750 (10) | 0.850 (7) | 0.775 (8) | 0.725 (12) | 0.600 (13) | 0.792 (15) |
16 | johnp | 0.850 (6) | 0.775 (10) | 0.725 (10) | 0.750 (11) | 0.675 (10) | 0.783 (16) |
16 | MABUSETTEH | 0.800 (8) | 0.775 (10) | 0.775 (8) | 0.775 (10) | 0.700 (9) | 0.783 (16) |
16 | KnowComp | 0.825 (7) | 0.775 (10) | 0.750 (9) | 0.725 (12) | 0.625 (15) | 0.783 (16) |
17 | ehsan.tavan | 0.800 (8) | 0.800 (9) | 0.725 (10) | 0.775 (10) | 0.675 (10) | 0.775 (17) |
17 | amr8ta | 0.775 (9) | 0.775 (10) | 0.775 (8) | 0.750 (11) | 0.650 (11) | 0.775 (17) |
18 | yiannispn | 0.800 (8) | 0.800 (9) | 0.700 (11) | 0.750 (11) | 0.625 (12) | 0.767 (18) |
19 | haha123 | 0.825 (7) | 0.775 (10) | 0.675 (12) | 0.750 (11) | 0.625 (12) | 0.758 (19) |
19 | adriti | 0.750 (10) | 0.725 (12) | 0.800 (7) | 0.725 (12) | 0.675 (10) | 0.758 (19) |
19 | TienDat23 | 0.725 (11) | 0.800 (9) | 0.750 (9) | 0.675 (14) | 0.525 (16) | 0.758 (19) |
20 | Deja Vu | 0.775 (9) | 0.700 (13) | 0.775 (8) | 0.700 (13) | 0.625 (12) | 0.750 (20) |
20 | NIMZ | 0.750 (10) | 0.725 (12) | 0.775 (8) | 0.700 (13) | 0.675 (10) | 0.750 (20) |
21 | iREL | 0.775 (9) | 0.725 (12) | 0.700 (11) | 0.700 (13) | 0.575 (14) | 0.733 (21) |
21 | GeminiPro | 0.750 (10) | 0.750 (11) | 0.700 (11) | 0.700 (13) | 0.600 (13) | 0.733 (21) |
22 | caoyongwang | 0.800 (8) | 0.700 (13) | 0.675 (12) | 0.700 (13) | 0.550 (15) | 0.725 (22) |
23 | IIMAS | 0.650 (12) | 0.675 (14) | 0.650 (13) | 0.600 (16) | 0.500 (17) | 0.658 (23) |
24 | IUST-NLPLAB | 0.625 (13) | 0.625 (15) | 0.575 (15) | 0.625 (15) | 0.500 (17) | 0.608 (24) |
25 | ROSHA | 0.625 (13) | 0.575 (16) | 0.600 (14) | 0.500 (17) | 0.375 (18) | 0.600 (25) |
26 | Team DaVinci | 0.575 (14) | 0.550 (17) | 0.425 (17) | 0.500 (17) | 0.300 (19) | 0.517 (26) |
27 | StFX-NLP | 0.425 (15) | 0.400 (18) | 0.475 (16) | 0.350 (18) | 0.200 (20) | 0.433 (27) |
28 | Team 9 | 0.275 (17) | 0.275 (19) | 0.200 (20) | 0.100 (20) | 0.000 (23) | 0.250 (28) |
28 | DeBERTa | 0.225 (18) | 0.250 (20) | 0.275 (19) | 0.200 (19) | 0.075 (21) | 0.250 (28) |
29 | amirhallaji | 0.225 (18) | 0.200 (21) | 0.300 (18) | 0.050 (22) | 0.025 (22) | 0.242 (29) |
30 | maryam.najafi | 0.225 (18) | 0.275 (19) | 0.200 (20) | 0.100 (20) | 0.025 (22) | 0.233 (30) |
Word Puzzle
Rank | Team | Original | Semantic | Context | Ori & Sem | Ori & Sem & Con | Overall |
---|---|---|---|---|---|---|---|
1 | Theo | 1.000 (1) | 1.000 (1) | 0.969 (2) | 1.000 (1) | 0.969 (1) | 0.990 (1) |
1 | gerald | 1.000 (1) | 1.000 (1) | 0.969 (2) | 1.000 (1) | 0.969 (1) | 0.990 (1) |
2 | somethingx95 | 1.000 (1) | 1.000 (1) | 0.938 (3) | 1.000 (1) | 0.938 (2) | 0.979 (2) |
2 | zero_shot_is_all_you_need | 1.000 (1) | 1.000 (1) | 0.938 (3) | 1.000 (1) | 0.938 (2) | 0.979 (2) |
2 | MasonTigers | 0.969 (2) | 0.969 (2) | 1.000 (1) | 0.969 (2) | 0.969 (1) | 0.979 (2) |
3 | HW-TSC | 0.969 (2) | 0.938 (3) | 1.000 (1) | 0.938 (3) | 0.938 (2) | 0.969 (3) |
3 | Maxine | 0.969 (2) | 0.938 (3) | 1.000 (1) | 0.938 (3) | 0.938 (2) | 0.969 (3) |
3 | YingluLi | 0.969 (2) | 0.938 (3) | 1.000 (1) | 0.938 (3) | 0.938 (2) | 0.969 (3) |
4 | kubapok | 0.906 (4) | 1.000 (1) | 0.938 (3) | 0.906 (4) | 0.844 (3) | 0.948 (4) |
5 | BITS Pilani | 0.938 (3) | 0.938 (3) | 0.875 (4) | 0.938 (3) | 0.812 (4) | 0.917 (5) |
5 | justingu | 0.938 (3) | 0.938 (3) | 0.875 (4) | 0.906 (4) | 0.781 (5) | 0.917 (5) |
6 | jkarolczak | 0.906 (4) | 0.938 (3) | 0.781 (7) | 0.875 (5) | 0.688 (8) | 0.875 (6) |
6 | yangqi | 0.906 (4) | 0.938 (3) | 0.781 (7) | 0.906 (4) | 0.688 (8) | 0.875 (6) |
6 | ehsan.tavan | 0.906 (4) | 0.875 (5) | 0.844 (5) | 0.812 (6) | 0.750 (6) | 0.875 (6) |
7 | AILS-NTUA | 0.875 (5) | 0.906 (4) | 0.781 (7) | 0.812 (6) | 0.719 (7) | 0.854 (7) |
7 | johnp | 0.875 (5) | 0.906 (4) | 0.781 (7) | 0.812 (6) | 0.719 (7) | 0.854 (7) |
7 | caoyongwang | 0.844 (6) | 0.844 (6) | 0.875 (4) | 0.781 (7) | 0.719 (7) | 0.854 (7) |
7 | KnowComp | 0.844 (6) | 0.906 (4) | 0.812 (6) | 0.844 | 0.656 (9) | 0.854 (7) |
8 | RiddleMaster | 0.844 (6) | 0.844 (6) | 0.844 (5) | 0.781 (7) | 0.656 (9) | 0.844 (8) |
9 | yiannispn | 0.844 (6) | 0.844 (6) | 0.812 (6) | 0.719 (9) | 0.625 (10) | 0.833 (9) |
10 | AmazUtah_NLP | 0.844 (6) | 0.812 (7) | 0.750 (8) | 0.781 (7) | 0.594 (11) | 0.802 (10) |
11 | OUNLP | 0.781 (7) | 0.812 (7) | 0.781 (7) | 0.719 (9) | 0.531 (12) | 0.792 (11) |
11 | UMBCLU | 0.781 (7) | 0.750 (8) | 0.844 (5) | 0.719 (9) | 0.625 (10) | 0.792 (11) |
11 | TienDat23 | 0.844 (6) | 0.750 (8) | 0.781 (7) | 0.750 (8) | 0.625 (10) | 0.792 (11) |
12 | GeminiPro | 0.781 (7) | 0.719 (9) | 0.844 (5) | 0.594 (11) | 0.594 (11) | 0.781 (12) |
13 | YNU-HPCC | 0.781 (7) | 0.719 (9) | 0.812 (6) | 0.719 (9) | 0.625 (10) | 0.771 (13) |
14 | iREL | 0.719 (8) | 0.719 (9) | 0.781 (7) | 0.562 (12) | 0.531 (12) | 0.740 (14) |
15 | Team DaVinci | 0.719 (8) | 0.719 (9) | 0.625 (9) | 0.594 (11) | 0.469 (13) | 0.688 (15) |
16 | Abdelhak | 0.625 (10) | 0.625 (10) | 0.594 (10) | 0.562 (12) | 0.406 (15) | 0.615 (16) |
17 | amr8ta | 0.625 (10) | 0.625 (10) | 0.562 (11) | 0.594 (11) | 0.438 (14) | 0.604 (17) |
17 | adriti | 0.656 (9) | 0.625 (10) | 0.531 (12) | 0.625 (10) | 0.375 (16) | 0.604 (17) |
18 | MABUSETTEH | 0.594 (11) | 0.625 (10) | 0.531 (12) | 0.562 (12) | 0.281 (17) | 0.583 (18) |
19 | NIMZ | 0.438 (12) | 0.469 (11) | 0.438 (13) | 0.406 (13) | 0.219 (19) | 0.448 (19) |
20 | Deja Vu | 0.375 (14) | 0.469 (11) | 0.375 (15) | 0.344 (15) | 0.125 (20) | 0.406 (20) |
20 | ROSHA | 0.438 (12) | 0.375 (12) | 0.406 (14) | 0.375 (14) | 0.250 (18) | 0.406 (20) |
21 | StFX-NLP | 0.406 (13) | 0.219 (14) | 0.344 (16) | 0.125 (16) | 0.062 (21) | 0.323 (21) |
22 | IIMAS | 0.250 (15) | 0.250 (13) | 0.281 (17) | 0.125 (16) | 0.062 (21) | 0.260 (22) |
Average Score
Rank | Team | Sentence Puzzle | Word Puzzle | Average |
---|---|---|---|---|
1 | Theo | 0.950 (4) | 0.990 (1) | 0.97 |
2 | HW-TSC | 0.967 (2) | 0.969 (3) | 0.968 |
3 | gerald | 0.942 (5) | 0.990 (1) | 0.966 |
4 | Maxine | 0.958 (3) | 0.969 (3) | 0.9635 |
5 | somethingx95 | 0.942 (5) | 0.979 (2) | 0.9605 |
6 | YingluLi | 0.950 (4) | 0.969 (3) | 0.9595 |
7 | zero_shot_is_all_you_need | 0.867 (10) | 0.979 (2) | 0.923 |
8 | kubapok | 0.892 (8) | 0.948 (4) | 0.92 |
9 | BITS Pilani | 0.900 (7) | 0.917 (5) | 0.9085 |
10 | MasonTigers | 0.833 (13) | 0.979 (2) | 0.906 |
11 | jkarolczak | 0.892 (8) | 0.875 (6) | 0.8835 |
11 | justingu | 0.850 (11) | 0.917 (5) | 0.8835 |
11 | yangqi | 0.892 (8) | 0.875 (6) | 0.8835 |
12 | AmazUtah_NLP | 0.925 (6) | 0.802 (10) | 0.8635 |
13 | AILS-NTUA | 0.817 (14) | 0.854 (7) | 0.8355 |
14 | OUNLP | 0.867 (10) | 0.792 (11) | 0.8295 |
15 | ehsan.tavan | 0.775 (17) | 0.875 (6) | 0.825 |
16 | johnp | 0.783 (16) | 0.854 (7) | 0.8185 |
16 | KnowComp | 0.783 (16) | 0.854 (7) | 0.8185 |
17 | RiddleMaster | 0.792 (15) | 0.844 (8) | 0.818 |
18 | YNU-HPCC | 0.842 (12) | 0.771 (13) | 0.8065 |
19 | yiannispn | 0.767 (18) | 0.833 (9) | 0.8 |
20 | Abdelhak | 0.983 (1) | 0.615 (16) | 0.799 |
21 | UMBCLU | 0.792 (15) | 0.792 (11) | 0.792 |
22 | caoyongwang | 0.725 (22) | 0.854 (7) | 0.7895 |
23 | TienDat23 | 0.758 (19) | 0.792 (11) | 0.775 |
24 | GeminiPro | 0.733 (21) | 0.781 (12) | 0.757 |
25 | iREL | 0.733 (21) | 0.740 (14) | 0.7365 |
26 | amr8ta | 0.775 (17) | 0.604 (17) | 0.6895 |
27 | MABUSETTEH | 0.783 (16) | 0.583 (18) | 0.683 |
28 | adriti | 0.758 (19) | 0.604 (17) | 0.681 |
29 | Team DaVinci | 0.517 (26) | 0.688 (15) | 0.6025 |
30 | NIMZ | 0.750 (20) | 0.448 (19) | 0.599 |
31 | Deja Vu | 0.750 (20) | 0.406 (20) | 0.578 |
32 | ROSHA | 0.600 (25) | 0.406 (20) | 0.503 |
33 | IIMAS | 0.658 (23) | 0.260 (22) | 0.459 |
34 | StFX-NLP | 0.433 (27) | 0.323 (21) | 0.378 |
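The Average column in the table above appears to be the unweighted mean of a team's two Overall scores. The page does not state this; it is an inference from the numbers. A small sketch that checks it against three rows copied from the table (the function name is hypothetical):

```python
def average_score(sentence_overall, word_overall):
    """Unweighted mean of the two subtask Overall scores (assumed definition)."""
    return (sentence_overall + word_overall) / 2

# (Sentence Puzzle overall, Word Puzzle overall, reported Average), from the table above.
rows = {
    "HW-TSC": (0.967, 0.969, 0.968),
    "gerald": (0.942, 0.990, 0.966),
    "Maxine": (0.958, 0.969, 0.9635),
}

# True for every row whose reported Average matches the assumed formula.
matches = {
    team: abs(average_score(s, w) - reported) < 1e-9
    for team, (s, w, reported) in rows.items()
}
```

Teams that appear in only one subtask leaderboard are absent from the Average table, so the sketch assumes both scores are available.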
Important Dates
Event | Date |
---|---|
Tasks announced (with sample data available) | 17 July 2023 |
Training data ready | 4 September 2023 |
Competition Practice Phase start | 13 September 2023 |
Evaluation start | 10 January 2024 |
Evaluation end by | 31 January 2024 (latest date; task organizers may choose an earlier date) |
Paper submission due | 19 February 2024 |
Notification to authors | 1 April 2024 |
Camera ready due | 22 April 2024 |
SemEval workshop | TBD, 2024 (co-located with a major NLP conference) |
Organization
Name | Affiliation | Email |
---|---|---|
Yifan Jiang | USC, ISI | yifjia@isi.edu |
Filip Ilievski | USC, ISI | ilievski@isi.edu |
Kaixin Ma | CMU, LTI | kaixinm@andrew.cmu.edu |
We have some important notes for our task as follows:
1. We have updated the public dataset on both GitHub and CodaLab. Because our dataset is small, we replaced the previous trial dataset with a portion of the training data. We want to reserve more data for testing to get more consistent results, and the current phase is only meant to familiarize participants with the submission format. Teams that have already submitted on CodaLab may send a new submission file for the new trial data to get the latest feedback. This is not mandatory: the new trial data has the same format as the previous one, so a previously successful submission means your submission format is correct.
2. We also allow participants to perform the task in a zero-shot manner (without access to the training data). To keep the competition fair and to simplify later system analysis, we ask each such team to add ZS to their team name to indicate that their results were obtained in a zero-shot manner. We will assume that every team without ZS in its name is fine-tuning.
3. A detailed description of the three CodaLab phases is available at https://semeval.github.io/SemEval2024/codalab. Here is a brief summary:
Practice Phase:
- Duration: Until approximately 10 Jan 2024 (official dates to be announced).
- Data: Use of the official evaluation script on trial data.
- High limit for submissions to allow extensive testing.
- Public leaderboard to enable participants to verify their format and approach.
- Purpose: Helps participants prepare by checking formatting and initial performance.
Evaluation Phase:
- Duration: Approximately 10 Jan to 31 Jan 2024.
- Data: Use of the official evaluation script and official test data.
- Restriction on submissions, with emphasis typically on the final valid submission on CodaLab.
- The leaderboard is hidden.
- Purpose: Official assessment phase, focusing on competition and ranking outcomes.
Post-Evaluation Phase:
- Duration: Begins around 31 Jan 2024.
- Data: Continues the use of the official evaluation script and test data.
- The submission limit is set high again (e.g., 999).
- The leaderboard is public.
- Purpose: Enables scoring of contrastive runs for detailed analysis in system description papers. Supports future analysis and scoring for systems interested in the task beyond SemEval-2024. Encourages ongoing research and analysis post-competition.