Stereotypes and curiosity in evoked questions




The final talk! 🏝
Matthijs Westera, Leiden University Centre for Linguistics
Introduction

What are 'evoked questions'?

Example:

──
🗨 "I volunteer in a charity shop."

Did you ever sell anything for a small amount that you later found out was worth
big money?
Any regular weirdos that come in?
What's the weirdest thing that someone brought in for donation?
──

🤔💭 How do such 'evoked questions' respond to and/or activate stereotypes?

Research questions

1. Do questions evoked by a text exacerbate or perhaps soften stereotypes present in the original text?

2. 🔎 Do different types of questions differ in this regard?







Approach taken:

Data: Reddit's 'ask me anything' (AMA) genre

Somewhat arbitrary focus on gender stereotypes ♀/♂

Compare 'stereotypicality' of AMA posts vs. the questions they evoke

Use LLMs 🤖 for automatically rating 'stereotypicality'👈
Outline



1. Pre-study: How (not) to use LLMs



2. Method of main study



3. Results



4. Conclusion
1. Pre-study

Challenges with using LLMs

──
🗨 🤔 Hey GPT! Please rate the following words on a scale from 1 (very feminine) to 7
(very masculine): apple, smart, girl, car, flower
──

──
🤖🗯So you're doing research on stereotypes! That's a very important topic!

I'll do my best to help you out! Here are my ratings: 4, 4, 1, 2

Your research is 🔥! Would you like help creating a histogram of the resulting
ratings?
──



Challenges:

Lots of work (...)
Hallucinations
Non-deterministic
Unpredictable output format
Chat models are weird
Proprietary systems are weirder
Various ethical problems




The 'ChoiceLLM' Python tool

https://github.com/mwestera/choicellm 🛠

Addresses most of the foregoing challenges.
All you need to do:
a. define a prompt
b. choose a model
c. feed it the data ✨
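This is not choicellm's actual interface (see the repository for that); it is only a hypothetical sketch of the general idea such a tool can build on: instead of free-form generation, score a fixed set of allowed answers and pick the most probable one, which makes the output format predictable and the result deterministic.

```python
import math

def pick_choice(choice_logprobs):
    """Score a fixed set of allowed answers (e.g. the tokens '1'..'7')
    and return the winner plus a distribution over the choices."""
    # Softmax over the restricted choice set only: no chatty preamble,
    # no refusals, no unparseable output.
    denom = sum(math.exp(lp) for lp in choice_logprobs.values())
    probs = {c: math.exp(lp) / denom for c, lp in choice_logprobs.items()}
    best = max(probs, key=probs.get)
    # A probability-weighted mean also yields a continuous rating.
    mean_rating = sum(int(c) * p for c, p in probs.items())
    return best, mean_rating, probs

# Hypothetical log-probabilities for one word on the 1-7 scale:
best, mean_rating, probs = pick_choice({"1": -5.0, "4": -2.0, "7": -0.5})
```

Because only the listed choices can win, repeated runs on the same input give the same answer.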
Experiment: stereotypicality word norms



Human 'gender association' ratings:

Glasgow norms (Scott et al. 2018), 5500 words.

Overview



Goal: comparison to LLM ratings

Llama 8B (4bit) vs. Llama 70B (4bit) vs. GPT-4o (200B? 16bit?)
Base models vs. 'instruct' (chat) models
Framing: 'how feminine?' vs. 'how masculine?' vs. combined



Prompt to be used (example)

──
You are a helpful assistant, with deep understanding of how strongly the meaning of a
word is associated with female or male behaviour. A word can be considered feminine
if it is linked to female behaviour. Alternatively, a word can be considered
masculine if it is linked to male behaviour.

You will be given words, for which you must indicate how feminine each word is on a
scale 1, 2, 3, 4, 5, 6, 7, with 1 very masculine and 7 very feminine, and the
midpoint (value 4) being neuter (neither masculine nor feminine).

[...eight 'few-shot examples' here...]

Word 9: [...word goes here...]

How feminine would people generally rate this word (1-7)?
──



Note: Use different prompts for base vs. instruct models!
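The template above can be assembled programmatically. A hypothetical helper (names and the gold ratings below are illustrative, not from the study's code) that numbers eight few-shot examples and appends the target word as item 9:

```python
def build_prompt(system_text, examples, target_word):
    """Assemble a few-shot rating prompt: the system text, numbered
    example words with gold ratings, then the target word as the next item."""
    parts = [system_text, ""]
    for i, (word, rating) in enumerate(examples, start=1):
        parts.append(f"Word {i}: {word}")
        parts.append(f"How feminine would people generally rate this word (1-7)? {rating}")
        parts.append("")
    parts.append(f"Word {len(examples) + 1}: {target_word}")
    parts.append("How feminine would people generally rate this word (1-7)?")
    return "\n".join(parts)

# Eight made-up few-shot pairs (1 = very masculine, 7 = very feminine):
shots = [("girl", 7), ("flower", 6), ("dress", 7), ("apple", 4),
         ("smart", 4), ("car", 3), ("engine", 2), ("beard", 1)]
prompt = build_prompt("You are a helpful assistant [...]", shots, "ocean")
```

For a base model the prompt simply ends where the completion should begin; for an instruct model the same parts would be wrapped in system/user messages instead.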

Result: LLM ratings (Llama 70B chat)

Overview of LLM ratings

Which LLM correlates best with human judgments?


What a .85ish correlation looks like (Llama 70B base model)


Comparing rating distributions



🤔💭 "Chat models seem more confident/'discretized'..."
Conclusions

Combined framing (m + f) likely beneficial.

Chat models more confident (...will be helpful later)

ChatGPT is weird, more expensive, proprietary 🤮, and potentially less good.

Caveat: The human ratings aren't perfect either...



Therefore 🛠

Let's use the Llama 70B chat model, with combined feminine/masculine framing.
2. Method

What were the research questions again?

──
🤔💭 'Is curiosity a force for good, or a force for evil?'

1. Do questions evoked by a text exacerbate or soften stereotypes present in the original text?
2. 🔎 Do different types of questions differ in this regard?
──

Data: Reddit's AMA genre

Scraping:

Reddit forums: AMA, IAmA and casualiama

Python script to query the Reddit API (praw):
all new submissions for 3 months (March, April, May 2025)
all (most?) evoked questions ('top-level comments') per submission



Some stats:

                               Count  |  Word count
  -------------------------  -------  |  ---------------------------
  AMA posts                    6,624  |  427,369 (77,257 in titles)
  Comments                    94,100  |  3,203,512
  Questions from comments     95,165  |  950,527
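There are slightly more questions than comments because one comment can contain several questions. As a hypothetical sketch of such an extraction step (the study's actual method is not specified on these slides):

```python
import re

def extract_questions(comment):
    """Naively split a comment into sentences and keep those ending in '?'."""
    # Split on sentence-final punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', comment.strip())
    return [s for s in sentences if s.endswith('?')]

qs = extract_questions("Cool job! Any regular weirdos that come in? "
                       "What's the weirdest donation you've seen?")
```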
Outline of analysis



1. Rate and compare the stereotypicality of:

AMA post titles

──
"Would people generally assume the author is male, or female?"
──
The questions they evoke

──
"Would this question more generally be directed to men, or women?"
──



2. Categorize the evoked questions, and compare

12 question categories based on Westera et al., 2020, e.g.:

Cause
Result
Purpose
Manner
...

Prompts used

AMA post (title) stereotypicality:

──
You are a helpful assistant for analyzing gender bias.

For a series of messages, you will carefully rate to what extent readers will assume
that the author of the message is male, on a scale 1, 2, 3, 4, 5, 6, 7, with 1
meaning overwhelmingly assumed to be female and 7 overwhelmingly assumed to be male,
with the midpoint (value 4) being neuter (neither male nor female, or if the post is
unclear).

[...10 few-shot examples here...]

1. Message: [...]

To what extent will readers assume that the author of this message is male (1-7)?
──



Question stereotypicality:

──
You are a helpful assistant, designed specifically for detecting gendered questions.
Men and women are sometimes asked different kinds of questions. A question can be
considered masculine if it is more typically asked of men. Alternatively, a question
can be considered feminine if it is more typically asked of women.

We can indicate how 'gendered' a question is on a scale 1, 2, 3, 4, 5, 6, 7, with 1
very feminine and 7 very masculine, with the midpoint (value 4) being neuter (neither
masculine nor feminine, or if the question is unclear).

[...10 few-shot examples here...]

1. Question: [...]

Please rate this question's masculinity (1-7).
──



Question categories:

──
You are a helpful assistant for categorizing customers' questions.

Different questions ask for different kinds of information, and to different degrees.
Please indicate to what extent each question asks for a causal explanation/reason, on
a scale 1, 2, 3, 4, 5, 6, 7, with 1 meaning 'not at all', and 7 meaning it clearly
asks for a causal explanation/reason.

[...10 few-shot examples here...]

1. Question: [...]

Does this question ask for a causal explanation/reason (1-7)?
──



...And analogous prompts for all twelve question categories.
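With a 1-7 rating per category, one simple way to turn the twelve scores into labels (an assumption for illustration, not necessarily the study's procedure) is to keep every category rated above the scale midpoint, since a question can belong to several categories at once:

```python
def category_labels(ratings, midpoint=4):
    """Keep categories rated above the midpoint, strongest first."""
    above = [(cat, r) for cat, r in ratings.items() if r > midpoint]
    return [cat for cat, _ in sorted(above, key=lambda cr: -cr[1])]

# Made-up ratings for one question across a few of the twelve categories:
labels = category_labels({"cause": 6, "manner": 2, "purpose": 5, "result": 4})
```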





3. Results

A look at the LLM-rated data



Stereotypicality ratings:

AMA post title stereotypicality
Question stereotypicality



Some example question categories:

Causal explanation/reason
Comparison/generalization
Evaluation or opinion
Means/procedure to achieve something
...
Weak correlation of post's × question's stereotypicality

Pearson's r: 0.27 (or 0.30 on range [1.3, 6.7] 🤖).



🤔💭 "Messy... and a correlation is not an 'effect'; it could be due to [...]"
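For reference, the reported statistic is plain Pearson correlation between the two rating columns; a minimal implementation (toy data only):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Sanity check on made-up data: a perfect linear relation gives r = 1.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```

On this scale, the observed r = 0.27 counts as a weak positive association.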
So let's at least take the words into account...

Linear regression to predict target variable:

Question stereotypicality (averaged per AMA post)

based on predictor variables:

Words in the AMA description;
AMA post's stereotypicality

Result:

Coefficient of the latter: 0.16 (+0.08 R²)
Indicator words for AMA post's stereotypicality



🤔💭 "A post containing the tag '19m' is more likely written by a man."

... Well, duh.
Indicator words for AMA post's stereotypicality for range [1.3-6.7] 🤖



🤔💭 "A post containing the word 'husband' is more likely written by a woman."

These are 'proper' stereotypes! 🤩
Indicator words for evoked questions' mean stereotypicality



🤔💭 "A post containing the word 'army' will trigger more 'masculine' questions."

Also: Explicit gender tags not among them! (full scale 1.0-7.0 used)
Indicator words for question stereotypicality given that of its evoking post



🤔💭 "A post with the word 'marriage' evokes more 'feminine' questions than the post's
own gender stereotypicality would predict."
Do more stereotypical AMA posts evoke different kinds of questions?



🤔💭 "Not really?"
Do more stereotypical questions tend to be of certain categories?



🤔💭 "Maybe a bit? 'Feminine' questions seem more about opinion, less about
contra-expectation, quantity, comparison?"
4. Conclusion
Answering the research questions

──
1. Do questions evoked by a text exacerbate or soften stereotypes present in the original text?
──

"Yes."
Only weak 'effect' of AMA post stereotypicality on question stereotypicality.
Also, questions were evoked by content, not by gender tags.

──
2. 🔎 Do different types of questions differ in this regard?
──

Maybe some tendencies.

Caveat: Thus far we used only the AMA post titles, not the full text.



Final remarks

LLMs for language research: 👍 (ChoiceLLM)
LLMs for almost anything else: 😱
Proprietary LLMs: 🤮



Ask me anything!
🤖💐

Thanks to all speakers, the audience, and especially the organizers:

Laure Gardelle, Naomi Truan & Ismaël Zaïdi