Stereotypes and curiosity in evoked questions




The final talk! 🏝
Matthijs Westera, Leiden University Centre for Linguistics
Introduction

What are 'evoked questions'?

Example:

──
🗨 "I volunteer in a charity shop."

Did you ever sell anything for a small amount that you later found out was worth
big money?
Any regular weirdos that come in?
What's the weirdest thing that someone brought in for donation?
──

🤔💭 How do such 'evoked questions' respond to and/or activate stereotypes?

Research questions

1. Do questions evoked by a text exacerbate or perhaps soften stereotypes present in the original text?

2. 🔎 Do different types of questions differ in this regard?







Approach taken:

Data: Reddit's 'ask me anything' (AMA) genre

Somewhat arbitrary focus on gender stereotypes ♀/♂

Compare 'stereotypicality' of AMA posts vs. the questions they evoke

Use LLMs 🤖 for automatically rating 'stereotypicality'👈
Outline



1. Pre-study: How (not) to use LLMs



2. Method of main study



3. Results



4. Conclusion
1. Pre-study

Challenges with using LLMs

──
🗨 🤔 Hey GPT! Please rate the following words on a scale from 1 (very feminine) to 7
(very masculine): apple, smart, girl, car, flower
──

──
🤖🗯So you're doing research on stereotypes! That's a very important topic!

I'll do my best to help you out! Here are my ratings: 4, 4, 1, 2

Your research is 🔥! Would you like help creating a histogram of the resulting
ratings?
──



Challenges:

Lots of work (...)
Hallucinations
Non-deterministic
Unpredictable output format
Chat models are weird
Proprietary systems are weirder
Various ethical problems




The 'ChoiceLLM' Python tool

https://github.com/mwestera/choicellm 🛠

Addresses most of the foregoing challenges.
All you need to do:
a. define a prompt
b. choose a model
c. feed it the data ✨
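This is not choicellm's actual interface (see the repository for that); it is only a hypothetical sketch of the general idea such a tool can build on: instead of free-form generation, score a fixed set of allowed answers and pick the most probable one, which makes the output format predictable and the result deterministic.

```python
import math

def pick_choice(choice_logprobs):
    """Score a fixed set of allowed answers (e.g. the tokens '1'..'7')
    and return the winner plus a distribution over the choices."""
    # Softmax over the restricted choice set only: no chatty preamble,
    # no refusals, no unparseable output.
    denom = sum(math.exp(lp) for lp in choice_logprobs.values())
    probs = {c: math.exp(lp) / denom for c, lp in choice_logprobs.items()}
    best = max(probs, key=probs.get)
    # A probability-weighted mean also yields a continuous rating.
    mean_rating = sum(int(c) * p for c, p in probs.items())
    return best, mean_rating, probs

# Hypothetical log-probabilities for one word on the 1-7 scale:
best, mean_rating, probs = pick_choice({"1": -5.0, "4": -2.0, "7": -0.5})
```

Because only the listed choices can win, repeated runs on the same input give the same answer.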
Experiment: stereotypicality word norms



Human 'gender association' ratings:

Glasgow norms (Scott et al. 2018), 5500 words.

Overview



Goal: comparison to LLM ratings

Llama 8B (4bit) vs. Llama 70B (4bit) vs. GPT-4o (200B? 16bit?)
Base models vs. 'instruct' (chat) models
Framing: 'how feminine?' vs. 'how masculine?' vs. combined



Prompt to be used (example)

──
You are a helpful assistant, with deep understanding of how strongly the meaning of a
word is associated with female or male behaviour. A word can be considered feminine
if it is linked to female behaviour. Alternatively, a word can be considered
masculine if it is linked to male behaviour.

You will be given words, for which you must indicate how feminine each word is on a
scale 1, 2, 3, 4, 5, 6, 7, with 1 very masculine and 7 very feminine, and the
midpoint (value 4) being neuter (neither masculine nor feminine).

[...eight 'few-shot examples' here...]

Word 9: [...word goes here...]

How feminine would people generally rate this word (1-7)?
──



Note: Use different prompts for base vs. instruct models!
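The template above can be assembled programmatically. A hypothetical helper (names and the gold ratings below are illustrative, not from the study's code) that numbers eight few-shot examples and appends the target word as item 9:

```python
def build_prompt(system_text, examples, target_word):
    """Assemble a few-shot rating prompt: the system text, numbered
    example words with gold ratings, then the target word as the next item."""
    parts = [system_text, ""]
    for i, (word, rating) in enumerate(examples, start=1):
        parts.append(f"Word {i}: {word}")
        parts.append(f"How feminine would people generally rate this word (1-7)? {rating}")
        parts.append("")
    parts.append(f"Word {len(examples) + 1}: {target_word}")
    parts.append("How feminine would people generally rate this word (1-7)?")
    return "\n".join(parts)

# Eight made-up few-shot pairs (1 = very masculine, 7 = very feminine):
shots = [("girl", 7), ("flower", 6), ("dress", 7), ("apple", 4),
         ("smart", 4), ("car", 3), ("engine", 2), ("beard", 1)]
prompt = build_prompt("You are a helpful assistant [...]", shots, "ocean")
```

For a base model the prompt simply ends where the completion should begin; for an instruct model the same parts would be wrapped in system/user messages instead.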

Result: LLM ratings (Llama 70B chat)

Overview of LLM ratings

Which LLM correlates best with human judgments?


What a .85ish correlation looks like (Llama 70B base model)


Comparing rating distributions



🤔💭 "Chat models seem more confident/'discretized'..."
Conclusions

Combined framing (m + f) likely beneficial.

Chat models more confident (...will be helpful later)

ChatGPT is weird, more expensive, proprietary 🤮, and potentially less good.

Caveat: The human ratings aren't perfect either...



Therefore 🛠

Let's use the Llama 70B chat model, with combined feminine/masculine framing.
2. Method

What were the research questions again?

──
🤔💭 'Is curiosity a force for good, or a force for evil?'

1. Do questions evoked by a text exacerbate or soften stereotypes present in the original text?
2. 🔎 Do different types of questions differ in this regard?
──

Data: Reddit's AMA genre

Scraping:

Reddit forums: AMA, IAmA and casualiama

Python script to query the Reddit API (praw):
all new submissions for 3 months (March, April, May 2025)
all (most?) evoked questions ('top-level comments') per submission



Some stats:

                               Count  |  Word count
  -------------------------  -------  |  ---------------------------
  AMA posts                    6,624  |  427,369 (77,257 in titles)
  Comments                    94,100  |  3,203,512
  Questions from comments     95,165  |  950,527
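There are slightly more questions than comments because one comment can contain several questions. As a hypothetical sketch of such an extraction step (the study's actual method is not specified on these slides):

```python
import re

def extract_questions(comment):
    """Naively split a comment into sentences and keep those ending in '?'."""
    # Split on sentence-final punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', comment.strip())
    return [s for s in sentences if s.endswith('?')]

qs = extract_questions("Cool job! Any regular weirdos that come in? "
                       "What's the weirdest donation you've seen?")
```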
Outline of analysis



1. Rate and compare the stereotypicality of:

AMA post titles

──
"Would people generally assume the author is male, or female?"
──
The questions they evoke

──
"Would this question more generally be directed to men, or women?"
──



2. Categorize the evoked questions, and compare

12 question categories based on Westera et al., 2020, e.g.:

Cause
Result
Purpose
Manner
...

Prompts used

AMA post (title) stereotypicality:

──
You are a helpful assistant for analyzing gender bias.

For a series of messages, you will carefully rate to what extent readers will assume
that the author of the message is male, on a scale 1, 2, 3, 4, 5, 6, 7, with 1
meaning overwhelmingly assumed to be female and 7 overwhelmingly assumed to be male,
with the midpoint (value 4) being neuter (neither male nor female, or if the post is
unclear).

[...10 few-shot examples here...]

1. Message: [...]

To what extent will readers assume that the author of this message is male (1-7)?
──



Question stereotypicality:

──
You are a helpful assistant, designed specifically for detecting gendered questions.
Men and women are sometimes asked different kinds of questions. A question can be
considered masculine if it is more typically asked of men. Alternatively, a question
can be considered feminine if it is more typically asked of women.

We can indicate how 'gendered' a question is on a scale 1, 2, 3, 4, 5, 6, 7, with 1
very feminine and 7 very masculine, with the midpoint (value 4) being neuter (neither
masculine nor feminine, or if the question is unclear).

[...10 few-shot examples here...]

1. Question: [...]

Please rate this question's masculinity (1-7).
──



Question categories:

──
You are a helpful assistant for categorizing customers' questions.

Different questions ask for different kinds of information, and to different degrees.
Please indicate to what extent each question asks for a causal explanation/reason, on
a scale 1, 2, 3, 4, 5, 6, 7, with 1 meaning 'not at all', and 7 meaning it clearly
asks for a causal explanation/reason.

[...10 few-shot examples here...]

1. Question: [...]

Does this question ask for a causal explanation/reason (1-7)?
──



...And analogous prompts for all twelve question categories.
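With a 1-7 rating per category, one simple way to turn the twelve scores into labels (an assumption for illustration, not necessarily the study's procedure) is to keep every category rated above the scale midpoint, since a question can belong to several categories at once:

```python
def category_labels(ratings, midpoint=4):
    """Keep categories rated above the midpoint, strongest first."""
    above = [(cat, r) for cat, r in ratings.items() if r > midpoint]
    return [cat for cat, _ in sorted(above, key=lambda cr: -cr[1])]

# Made-up ratings for one question across a few of the twelve categories:
labels = category_labels({"cause": 6, "manner": 2, "purpose": 5, "result": 4})
```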





3. Results

A look at the LLM-rated data



Stereotypicality ratings:

AMA post title stereotypicality
Question stereotypicality



Some example question categories:

Causal explanation/reason
Comparison/generalization
Evaluation or opinion
Means/procedure to achieve something
...
Weak correlation of post's × question's stereotypicality

Pearson's r: 0.27 (or 0.30 on range [1.3, 6.7] 🤖).



🤔💭 "Messy... and a correlation is not an 'effect'; it could be due to [...]"
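For reference, the reported statistic is plain Pearson correlation between the two rating columns; a minimal implementation (toy data only):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Sanity check on made-up data: a perfect linear relation gives r = 1.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```

On this scale, the observed r = 0.27 counts as a weak positive association.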
So let's at least take the words into account...

Linear regression to predict target variable:

Question stereotypicality (averaged per AMA post)

based on predictor variables:

Words in the AMA description;
AMA post's stereotypicality

Result:

Coefficient of the latter: 0.16 (+0.08 R²)
Indicator words for AMA post's stereotypicality



🤔💭 "A post containing the tag '19m' is more likely written by a man."

... Well, duh.
Indicator words for AMA post's stereotypicality for range [1.3-6.7] 🤖



🤔💭 "A post containing the word 'husband' is more likely written by a woman."

These are 'proper' stereotypes! 🤩
Indicator words for evoked questions' mean stereotypicality



🤔💭 "A post containing the word 'army' will trigger more 'masculine' questions."

Also: Explicit gender tags not among them! (full scale 1.0-7.0 used)
Indicator words for question stereotypicality given that of its evoking post



🤔💭 "A post with the word 'marriage' evokes more 'feminine' questions than the post's
own gender stereotypicality would predict."
Do more stereotypical AMA posts evoke different kinds of questions?



🤔💭 "Not really?"
Do more stereotypical questions tend to be of certain categories?



🤔💭 "Maybe a bit? 'Feminine' questions seem more about opinion, less about
contra-expectation, quantity, comparison?"
4. Conclusion
Answering the research questions

──
1. Do questions evoked by a text exacerbate or soften stereotypes present in the original text?
──

"Yes."
Only weak 'effect' of AMA post stereotypicality on question stereotypicality.
Also, questions were evoked by content, not by gender tags.

──
2. 🔎 Do different types of questions differ in this regard?
──

Maybe some tendencies.

Caveat: Thus far we used only the AMA post titles, not the full text.



Final remarks

LLMs for language research: 👍 (ChoiceLLM)
LLMs for almost anything else: 😱
Proprietary LLMs: 🤮



Ask me anything!
🤖💐

Thanks to all speakers, the audience, and especially the organizers:

Laure Gardelle, Naomi Truan & Ismaël Zaïdi