SemEval 2024 BRAINTEASER: A Novel Task Defying Common Sense

Task Home Page

The Codalab Competition is now available!!!

Motivation

Human reasoning processes comprise two types of thinking: vertical and lateral. Vertical thinking, also known as linear, convergent, or logical thinking, is a sequential analytical process that is based on rationality, logic, and rules. Meanwhile, lateral thinking (or “thinking outside the box”) is a divergent and creative process that involves looking at a problem from a new perspective and defying preconceptions.

The success of language models has inspired the NLP community to attend to tasks that require implicit and complex reasoning, relying on human-like commonsense mechanisms. While such vertical thinking tasks have been relatively popular, lateral thinking puzzles have received little attention. To bridge this gap, we devise BRAINTEASER: a multiple-choice Question Answering task designed to test the model’s ability to exhibit lateral thinking and defy default commonsense associations.

The BRAINTEASER QA task consists of two subtasks, Sentence Puzzle and Word Puzzle, that require awareness of commonsense “defaults” and overwriting them through unconventional thinking that distinguishes these defaults from hard constraints.

Both subtasks include an adversarial subset, created by manually modifying the original brain teasers without changing their latent reasoning path.

Task Example

Here is one example from each subtask:

Question: A man shaves everyday, yet keeps his beard long.
Choices:
  He is a barber.
  He wants to maintain his appearance.
  He wants his girlfriend to buy him a razor.
  None of the above.

Question: What part of London is in France?
Choices:
  The letter N.
  The letter O.
  The letter L.
  None of the above.

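To ensure that our task evaluates reasoning ability rather than memorization, we construct adversarial versions of the original data in two ways:

Semantic Reconstruction rephrases the original question without changing the correct answer or the distractors.
Context Reconstruction keeps the original reasoning path but changes both the question and the answer choices to describe a new situation.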

Here are examples of the two adversarial versions of a Sentence Puzzle:

Adversarial Strategy: Original
Question: A man shaves everyday, yet keeps his beard long.
Choices:
  He is a barber.
  He wants to maintain his appearance.
  He wants his girlfriend to buy him a razor.
  None of the above.

Adversarial Strategy: Semantic Reconstruction
Question: A man preserves a lengthy beard despite shaving every day.
Choices:
  He is a barber.
  He wants to maintain his appearance.
  He wants his girlfriend to buy him a razor.
  None of the above.

Adversarial Strategy: Context Reconstruction
Question: Tom attends class every day but doesn’t do any homework.
Choices:
  He is a teacher.
  He is a lazy person.
  His teacher will not let him fail.
  None of the above.

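Each system will be evaluated based on the following two accuracy metrics:

Instance-based Accuracy: each question (original or adversarial) is treated as a separate instance, and accuracy is reported for the original questions and for each adversarial version.
Group-based Accuracy: each question and its associated adversarial instances form a group, and a system receives a score of 1 only when it correctly solves all questions in the group.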
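
As a rough illustration, the sketch below assumes the two kinds of accuracy reflected in the leaderboard columns: a per-instance accuracy for each version of a puzzle (Original, Semantic, Context) and a per-group accuracy in which a puzzle counts as solved only if every listed version is answered correctly (Ori & Sem, Ori & Sem & Con). The record layout, the function names, and the toy data are hypothetical; this is not the official scorer.

```python
from typing import Dict, List

# Hypothetical layout: one dict per puzzle group, mapping each version
# ("ori", "sem", "con") to whether the system answered it correctly.
Group = Dict[str, bool]


def instance_accuracy(groups: List[Group], version: str) -> float:
    """Fraction of puzzles whose given version was answered correctly."""
    return sum(g[version] for g in groups) / len(groups)


def group_accuracy(groups: List[Group], versions: List[str]) -> float:
    """Fraction of puzzles for which every listed version was answered correctly."""
    return sum(all(g[v] for v in versions) for g in groups) / len(groups)


# Toy predictions for four puzzles (made-up data, for illustration only).
results = [
    {"ori": True, "sem": True, "con": True},
    {"ori": True, "sem": True, "con": False},
    {"ori": True, "sem": False, "con": True},
    {"ori": False, "sem": False, "con": False},
]

scores = {
    "Original": instance_accuracy(results, "ori"),
    "Semantic": instance_accuracy(results, "sem"),
    "Context": instance_accuracy(results, "con"),
    "Ori & Sem": group_accuracy(results, ["ori", "sem"]),
    "Ori & Sem & Con": group_accuracy(results, ["ori", "sem", "con"]),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

A group score can never exceed the lowest instance score among its versions, which is why the Ori & Sem & Con column is the strictest one on the leaderboard.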

Data

The training data is now available. The training and validation split will be released based on the SemEval timeline.

Registration form for participation and the legal usage of data.

Mailing list for task updates.

For further questions, please contact: yifjia@isi.edu

Codalab

The tasks are hosted on CodaLab, and the competition link is released in line with the SemEval schedule. Participants are encouraged to register as early as possible and to join the mailing list for updates.


Leaderboard (evaluated on the test set (20%) of the competition)

Sentence Puzzle

Rank Team Original Semantic Context Ori & Sem Ori & Sem & Con Overall
1 Abdelhak 1.000 (1) 1.000 (1) 0.950 (1) 1.000 (1) 0.950 (1) 0.983 (1)
2 HW-TSC 1.000 (1) 0.975 (2) 0.925 (2) 0.975 (2) 0.900 (3) 0.967 (2)
3 Maxine 0.975 (2) 0.975 (2) 0.925 (2) 0.950 (3) 0.900 (3) 0.958 (3)
4 YingluLi 0.975 (2) 0.950 (3) 0.925 (2) 0.950 (3) 0.900 (3) 0.950 (4)
4 Theo 0.950 (3) 0.950 (3) 0.950 (1) 0.950 (3) 0.925 (2) 0.950 (4)
5 somethingx95 0.950 (3) 0.950 (3) 0.925 (2) 0.950 (3) 0.900 (3) 0.942 (5)
5 gerald 0.950 (3) 0.950 (3) 0.925 (2) 0.950 (3) 0.900 (3) 0.942 (5)
6 AmazUtah_NLP 0.925 (4) 0.950 (3) 0.900 (3) 0.925 (4) 0.875 (4) 0.925 (6)
7 BITS Pilani 0.975 (2) 0.925 (4) 0.800 (7) 0.925 (4) 0.775 (6) 0.900 (7)
7 ALF 0.925 (4) 0.950 (3) 0.825 (6) 0.925 (4) 0.825 (5) 0.900 (7)
8 uTeBC-NLP 0.975 (2) 0.875 (6) 0.825 (6) 0.850 (7) 0.750 (7) 0.892 (8)
8 jkarolczak 0.975 (2) 0.875 (6) 0.825 (6) 0.875 (6) 0.775 (6) 0.892 (8)
8 kubapok 0.925 (4) 0.900 (5) 0.850 (5) 0.900 (5) 0.825 (5) 0.892 (8)
8 yangqi 0.900 (5) 0.900 (5) 0.875 (4) 0.900 (5) 0.875 (4) 0.892 (8)
9 Mothman 0.975 (2) 0.850 (7) 0.800 (7) 0.850 (7) 0.700 (9) 0.875 (9)
10 zero_shot_is_all_you_need 0.950 (3) 0.825 (8) 0.825 (6) 0.800 (9) 0.725 (8) 0.867 (10)
10 OUNLP 0.950 (3) 0.875 (6) 0.775 (8) 0.850 (7) 0.725 (8) 0.867 (10)
11 justingu 0.950 (3) 0.825 (8) 0.775 (8) 0.825 (8) 0.700 (9) 0.850 (11)
11 BAMO 0.900 (5) 0.825 (8) 0.825 (6) 0.825 (8) 0.700 (9) 0.850 (11)
12 YNU-HPCC 0.900 (5) 0.825 (8) 0.800 (7) 0.825 (8) 0.725 (8) 0.842 (12)
13 FtG-CoT 0.900 (5) 0.825 (8) 0.775 (8) 0.800 (9) 0.675 (10) 0.833 (13)
13 MasonTigers 0.850 (6) 0.825 (8) 0.825 (6) 0.800 (9) 0.700 (9) 0.833 (13)
14 AILS-NTUA 0.850 (6) 0.825 (8) 0.775 (8) 0.825 (8) 0.700 (9) 0.817 (14)
15 RiddleMaster 0.800 (8) 0.775 (10) 0.800 (7) 0.725 (12) 0.650 (11) 0.792 (15)
15 UMBCLU 0.750 (10) 0.850 (7) 0.775 (8) 0.725 (12) 0.600 (13) 0.792 (15)
16 johnp 0.850 (6) 0.775 (10) 0.725 (10) 0.750 (11) 0.675 (10) 0.783 (16)
16 MABUSETTEH 0.800 (8) 0.775 (10) 0.775 (8) 0.775 (10) 0.700 (9) 0.783 (16)
16 KnowComp 0.825 (7) 0.775 (10) 0.750 (9) 0.725 (12) 0.625 (15) 0.783 (16)
17 ehsan.tavan 0.800 (8) 0.800 (9) 0.725 (10) 0.775 (10) 0.675 (10) 0.775 (17)
17 amr8ta 0.775 (9) 0.775 (10) 0.775 (8) 0.750 (11) 0.650 (11) 0.775 (17)
18 yiannispn 0.800 (8) 0.800 (9) 0.700 (11) 0.750 (11) 0.625 (12) 0.767 (18)
19 haha123 0.825 (7) 0.775 (10) 0.675 (12) 0.750 (11) 0.625 (12) 0.758 (19)
19 adriti 0.750 (10) 0.725 (12) 0.800 (7) 0.725 (12) 0.675 (10) 0.758 (19)
19 TienDat23 0.725 (11) 0.800 (9) 0.750 (9) 0.675 (14) 0.525 (16) 0.758 (19)
20 Deja Vu 0.775 (9) 0.700 (13) 0.775 (8) 0.700 (13) 0.625 (12) 0.750 (20)
20 NIMZ 0.750 (10) 0.725 (12) 0.775 (8) 0.700 (13) 0.675 (10) 0.750 (20)
21 iREL 0.775 (9) 0.725 (12) 0.700 (11) 0.700 (13) 0.575 (14) 0.733 (21)
21 GeminiPro 0.750 (10) 0.750 (11) 0.700 (11) 0.700 (13) 0.600 (13) 0.733 (21)
22 caoyongwang 0.800 (8) 0.700 (13) 0.675 (12) 0.700 (13) 0.550 (15) 0.725 (22)
23 IIMAS 0.650 (12) 0.675 (14) 0.650 (13) 0.600 (16) 0.500 (17) 0.658 (23)
24 IUST-NLPLAB 0.625 (13) 0.625 (15) 0.575 (15) 0.625 (15) 0.500 (17) 0.608 (24)
25 ROSHA 0.625 (13) 0.575 (16) 0.600 (14) 0.500 (17) 0.375 (18) 0.600 (25)
26 Team DaVinci 0.575 (14) 0.550 (17) 0.425 (17) 0.500 (17) 0.300 (19) 0.517 (26)
27 StFX-NLP 0.425 (15) 0.400 (18) 0.475 (16) 0.350 (18) 0.200 (20) 0.433 (27)
28 Team 9 0.275 (17) 0.275 (19) 0.200 (20) 0.100 (20) 0.000 (23) 0.250 (28)
28 DeBERTa 0.225 (18) 0.250 (20) 0.275 (19) 0.200 (19) 0.075 (21) 0.250 (28)
29 amirhallaji 0.225 (18) 0.200 (21) 0.300 (18) 0.050 (22) 0.025 (22) 0.242 (29)
30 maryam.najafi 0.225 (18) 0.275 (19) 0.200 (20) 0.100 (20) 0.025 (22) 0.233 (30)

Word Puzzle

Rank Team Original Semantic Context Ori & Sem Ori & Sem & Con Overall
1 Theo 1.000 (1) 1.000 (1) 0.969 (2) 1.000 (1) 0.969 (1) 0.990 (1)
1 gerald 1.000 (1) 1.000 (1) 0.969 (2) 1.000 (1) 0.969 (1) 0.990 (1)
2 somethingx95 1.000 (1) 1.000 (1) 0.938 (3) 1.000 (1) 0.938 (2) 0.979 (2)
2 zero_shot_is_all_you_need 1.000 (1) 1.000 (1) 0.938 (3) 1.000 (1) 0.938 (2) 0.979 (2)
2 MasonTigers 0.969 (2) 0.969 (2) 1.000 (1) 0.969 (2) 0.969 (1) 0.979 (2)
3 HW-TSC 0.969 (2) 0.938 (3) 1.000 (1) 0.938 (3) 0.938 (2) 0.969 (3)
3 Maxine 0.969 (2) 0.938 (3) 1.000 (1) 0.938 (3) 0.938 (2) 0.969 (3)
3 YingluLi 0.969 (2) 0.938 (3) 1.000 (1) 0.938 (3) 0.938 (2) 0.969 (3)
4 kubapok 0.906 (4) 1.000 (1) 0.938 (3) 0.906 (4) 0.844 (3) 0.948 (4)
5 BITS Pilani 0.938 (3) 0.938 (3) 0.875 (4) 0.938 (3) 0.812 (4) 0.917 (5)
5 justingu 0.938 (3) 0.938 (3) 0.875 (4) 0.906 (4) 0.781 (5) 0.917 (5)
6 jkarolczak 0.906 (4) 0.938 (3) 0.781 (7) 0.875 (5) 0.688 (8) 0.875 (6)
6 yangqi 0.906 (4) 0.938 (3) 0.781 (7) 0.906 (4) 0.688 (8) 0.875 (6)
6 ehsan.tavan 0.906 (4) 0.875 (5) 0.844 (5) 0.812 (6) 0.750 (6) 0.875 (6)
7 AILS-NTUA 0.875 (5) 0.906 (4) 0.781 (7) 0.812 (6) 0.719 (7) 0.854 (7)
7 johnp 0.875 (5) 0.906 (4) 0.781 (7) 0.812 (6) 0.719 (7) 0.854 (7)
7 caoyongwang 0.844 (6) 0.844 (6) 0.875 (4) 0.781 (7) 0.719 (7) 0.854 (7)
7 KnowComp 0.844 (6) 0.906 (4) 0.812 (6) 0.844 0.656 (9) 0.854 (7)
8 RiddleMaster 0.844 (6) 0.844 (6) 0.844 (5) 0.781 (7) 0.656 (9) 0.844 (8)
9 yiannispn 0.844 (6) 0.844 (6) 0.812 (6) 0.719 (9) 0.625 (10) 0.833 (9)
10 AmazUtah_NLP 0.844 (6) 0.812 (7) 0.750 (8) 0.781 (7) 0.594 (11) 0.802 (10)
11 OUNLP 0.781 (7) 0.812 (7) 0.781 (7) 0.719 (9) 0.531 (12) 0.792 (11)
11 UMBCLU 0.781 (7) 0.750 (8) 0.844 (5) 0.719 (9) 0.625 (10) 0.792 (11)
11 TienDat23 0.844 (6) 0.750 (8) 0.781 (7) 0.750 (8) 0.625 (10) 0.792 (11)
12 GeminiPro 0.781 (7) 0.719 (9) 0.844 (5) 0.594 (11) 0.594 (11) 0.781 (12)
13 YNU-HPCC 0.781 (7) 0.719 (9) 0.812 (6) 0.719 (9) 0.625 (10) 0.771 (13)
14 iREL 0.719 (8) 0.719 (9) 0.781 (7) 0.562 (12) 0.531 (12) 0.740 (14)
15 Team DaVinci 0.719 (8) 0.719 (9) 0.625 (9) 0.594 (11) 0.469 (13) 0.688 (15)
16 Abdelhak 0.625 (10) 0.625 (10) 0.594 (10) 0.562 (12) 0.406 (15) 0.615 (16)
17 amr8ta 0.625 (10) 0.625 (10) 0.562 (11) 0.594 (11) 0.438 (14) 0.604 (17)
17 adriti 0.656 (9) 0.625 (10) 0.531 (12) 0.625 (10) 0.375 (16) 0.604 (17)
18 MABUSETTEH 0.594 (11) 0.625 (10) 0.531 (12) 0.562 (12) 0.281 (17) 0.583 (18)
19 NIMZ 0.438 (12) 0.469 (11) 0.438 (13) 0.406 (13) 0.219 (19) 0.448 (19)
20 Deja Vu 0.375 (14) 0.469 (11) 0.375 (15) 0.344 (15) 0.125 (20) 0.406 (20)
20 ROSHA 0.438 (12) 0.375 (12) 0.406 (14) 0.375 (14) 0.250 (18) 0.406 (20)
21 StFX-NLP 0.406 (13) 0.219 (14) 0.344 (16) 0.125 (16) 0.062 (21) 0.323 (21)
22 IIMAS 0.250 (15) 0.250 (13) 0.281 (17) 0.125 (16) 0.062 (21) 0.260 (22)

Average Score

Rank Team Sentence Puzzle Word Puzzle Average
1 Theo 0.950 (4) 0.990 (1) 0.97
2 HW-TSC 0.967 (2) 0.969 (3) 0.968
3 gerald 0.942 (5) 0.990 (1) 0.966
4 Maxine 0.958 (3) 0.969 (3) 0.9635
5 somethingx95 0.942 (5) 0.979 (2) 0.9605
6 YingluLi 0.950 (4) 0.969 (3) 0.9595
7 zero_shot_is_all_you_need 0.867 (10) 0.979 (2) 0.923
8 kubapok 0.892 (8) 0.948 (4) 0.92
9 BITS Pilani 0.900 (7) 0.917 (5) 0.9085
10 MasonTigers 0.833 (13) 0.979 (2) 0.906
11 jkarolczak 0.892 (8) 0.875 (6) 0.8835
11 justingu 0.850 (11) 0.917 (5) 0.8835
11 yangqi 0.892 (8) 0.875 (6) 0.8835
12 AmazUtah_NLP 0.925 (6) 0.802 (10) 0.8635
13 AILS-NTUA 0.817 (14) 0.854 (7) 0.8355
14 OUNLP 0.867 (10) 0.792 (11) 0.8295
15 ehsan.tavan 0.775 (17) 0.875 (6) 0.825
16 johnp 0.783 (16) 0.854 (7) 0.8185
16 KnowComp 0.783 (16) 0.854 (7) 0.8185
17 RiddleMaster 0.792 (15) 0.844 (8) 0.818
18 YNU-HPCC 0.842 (12) 0.771 (13) 0.8065
19 yiannispn 0.767 (18) 0.833 (9) 0.8
20 Abdelhak 0.983 (1) 0.615 (16) 0.799
21 UMBCLU 0.792 (15) 0.792 (11) 0.792
22 caoyongwang 0.725 (22) 0.854 (7) 0.7895
23 TienDat23 0.758 (19) 0.792 (11) 0.775
24 GeminiPro 0.733 (21) 0.781 (12) 0.757
25 iREL 0.733 (21) 0.740 (14) 0.7365
26 amr8ta 0.775 (17) 0.604 (17) 0.6895
27 MABUSETTEH 0.783 (16) 0.583 (18) 0.683
28 adriti 0.758 (19) 0.604 (17) 0.681
29 Team DaVinci 0.517 (26) 0.688 (15) 0.6025
30 NIMZ 0.750 (20) 0.448 (19) 0.599
31 Deja Vu 0.750 (20) 0.406 (20) 0.578
32 ROSHA 0.600 (25) 0.406 (20) 0.503
33 IIMAS 0.658 (23) 0.260 (22) 0.459
34 StFX-NLP 0.433 (27) 0.323 (21) 0.378
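
The published numbers appear consistent with simple means: each Overall score matches the mean of the Original, Semantic, and Context accuracies, and each Average matches the mean of the two subtask Overall scores. The sketch below checks this for a few rows copied from the tables above. The aggregation rule is inferred from the numbers rather than stated by the organizers, and the function names are hypothetical.

```python
from statistics import mean


def overall(original: float, semantic: float, context: float) -> float:
    """Assumed rule: Overall is the mean of the three per-version accuracies."""
    return mean([original, semantic, context])


def average(sentence_overall: float, word_overall: float) -> float:
    """Assumed rule: Average is the mean of the two subtask Overall scores."""
    return mean([sentence_overall, word_overall])


# Rows copied from the leaderboards above.
abdelhak_sentence = overall(1.000, 1.000, 0.950)  # table lists 0.983
theo_word = overall(1.000, 1.000, 0.969)          # table lists 0.990
theo_average = average(0.950, 0.990)              # table lists 0.97
hw_tsc_average = average(0.967, 0.969)            # table lists 0.968

print(round(abdelhak_sentence, 3), round(theo_word, 3))
print(round(theo_average, 4), round(hw_tsc_average, 4))
```

Because the Average is computed from Overall scores already rounded to three decimals, it can differ slightly from a mean taken over unrounded accuracies.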

Important Dates

Event Date
Tasks announced (with sample data available) 17 July 2023
Training data ready 4 September 2023
Competition Practice Phase start 13 September 2023
Evaluation start 10 January 2024
Evaluation end by 31 January 2024 (latest date; task organizers may choose an earlier date)
Paper submission due 19 February 2024
Notification to authors 1 April 2024
Camera ready due 22 April 2024
SemEval workshop TBD, 2024 (co-located with a major NLP conference)

Organization

Name Affiliation Email
Yifan Jiang USC, ISI yifjia@isi.edu
Filip Ilievski USC, ISI ilievski@isi.edu
Kaixin Ma CMU, LTI kaixinm@andrew.cmu.edu

We have some important notes for our task as follows:


1. We have updated the public dataset on both GitHub and Codalab. Because our dataset is small, we replaced the previous trial dataset with a portion of the training data. We want to reserve more data for testing to obtain more consistent results, and the current phase only aims to familiarize participants with the submission format. Teams that have already submitted a file on Codalab may submit a new one for the new trial data to get the latest feedback. This is not mandatory: the new trial data has the same format as the previous one, and a previous successful submission means your submission format is correct.
2. We also allow participants to perform the task in a zero-shot manner (without access to the training data). To keep the competition fair and to simplify later system analysis, we ask each such team to add ZS to its team name to indicate that its results were obtained in a zero-shot manner. We will assume that every team without ZS in its name is fine-tuning.
3. A detailed description of the three phases in Codalab is available at https://semeval.github.io/SemEval2024/codalab. Here is a brief summary:
Practice Phase:

Evaluation Phase:

Post-Evaluation Phase: