On Being an Outlier

@[email protected] · 11 months ago

@ozymandias117 · 11 months ago

I understand this is partially because I have the mindset of the programmer they’re referring to, but this sounds really interesting

Rather than looking to big data for solutions to hegemonically defined problems, what if we used it to find the catalysts of inequality themselves

…

What are the conditions in which the outlier is culled? What if we used AI to identify the pruning mechanism and dismantle it?

Using more in-depth analysis of what gets pruned, to understand why it’s being pruned, is a very interesting way to find marginalized groups

I don’t know how to fix those underlying problems, but identifying them and showing that data to leaders seems like a really good endeavor

@[email protected] · 11 months ago

That kind of analysis is done all the time. But, even if we can collect all the relevant data (big if), the methods required are difficult to interpret and easy to abuse (we can’t do an RCT of being born female vs male, or black vs white, &c). A good example is the proliferation of analyses claiming that the gender pay gap does not exist (after you’ve ‘controlled’ for all the things that cause the gender pay gap).

It’s not easy to do ‘right’ even when done in good faith.

The article isn’t claiming that it is easy, of course. It’s asking why power is so keen on one type of question and not its inverse. And that is a very good question, albeit one with a very easy answer. Power is not in the business of abolishing itself.

@ozymandias117 · 11 months ago

after you’ve controlled for all the things that cause the gender pay gap

Isn’t that a continuation of “why the outlier was culled”?

More emphasis on how the data set is selected (while hard) is very useful

@[email protected] · 11 months ago

Isn’t that a continuation of “why the outlier was culled”?

Not sure I follow, but I think the answer is “no”.

If you control for all the causes of a difference, the difference will disappear. Which is fine if you’re looking for causal factors which are not already known to be causal factors, but no good at all if you’re trying to establish whether or not a difference exists.

It’s really quite difficult to ask a coherent question with real-world data from the messy, complicated reality of human beings.

A simple example:

Women are more likely to die from complications after a coronary artery bypass.

But if you include body surface area (a measure of body size) in your model, the difference between men and women disappears.

And if you go the whole hog and measure vein size, the importance of body size disappears too.

And, while we can never do an RCT to prove it, it makes perfect sense that smaller veins would increase the risk for a surgery which involves operating on blood vessels.

None of that means women do not, in fact, have a higher risk of dying after coronary artery bypass surgery. Collect all the data which has ever existed and women will still be more likely to die from the surgery. We have explained the phenomenon and found what is very likely to be the direct cause of higher mortality. Being a woman just makes you more likely to have that risk factor.
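A toy simulation can make the bypass example concrete. This is only a sketch under invented assumptions: the probabilities are made up, vein size is reduced to a hypothetical small/large flag, and risk is wired to depend on vein size alone. It illustrates the logic, not real surgical data.

```python
import random


def simulate(n=200_000, seed=0):
    """Generate (female, small_vein, died) tuples from made-up probabilities."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        female = rng.random() < 0.5
        # Assumption: women are more likely to have small veins.
        small_vein = rng.random() < (0.6 if female else 0.2)
        # Assumption: risk depends only on vein size, never directly on sex.
        died = rng.random() < (0.06 if small_vein else 0.02)
        rows.append((female, small_vein, died))
    return rows


def death_rate(rows, female, small_vein=None):
    """Mortality for one sex, optionally restricted to one vein-size stratum."""
    group = [
        died
        for f, s, died in rows
        if f == female and (small_vein is None or s == small_vein)
    ]
    return sum(group) / len(group)


rows = simulate()

# Crude comparison: women vs men, ignoring vein size.
crude_gap = death_rate(rows, True) - death_rate(rows, False)

# "Controlled" comparison: women vs men within the same vein-size stratum.
gap_small = death_rate(rows, True, True) - death_rate(rows, False, True)
gap_large = death_rate(rows, True, False) - death_rate(rows, False, False)

print(f"crude gap (women - men):    {crude_gap:.4f}")
print(f"gap among small-vein cases: {gap_small:.4f}")
print(f"gap among large-vein cases: {gap_large:.4f}")
```

The crude gap stays clearly positive while the within-stratum gaps hover around zero. Both results are true of the same data: the first answers "are women at higher risk?", the second answers "is there anything left once vein size is accounted for?".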

It is rare that the answer is as neat and simple as this. It is very easy to ask a different question from the one you thought you were asking (or pretend to be answering one question when you answered another).

You can’t just throw masses of data into a pot and expect sensible answers to come out. This is the key difference between statisticians and data scientists. And, not to throw shade on data scientists, they often end up explaining to the world that oestrogen makes people more likely to die from complications of coronary artery bypass surgery.

@ozymandias117 · 11 months ago

Maybe it’s a crude interpretation, but over-controlling for all the causes of a change, and removing outliers from the data that trains these AI models, seem like similar issues when trying to actually understand the data

@[email protected] · 11 months ago

The data cannot be understood. These models are too large for that.

Apple says it doesn’t understand why its credit card gives lower credit limits to women than men even if they have the same (or better) credit scores, because they don’t use sex as a datapoint. But it’s freaking obvious why, if you have a basic grasp of the social sciences and humanities. Women were not given the legal right to their own bank accounts until the 1970s. After that, banks could be forced to grant them bank accounts but not to extend the same amount of credit. Women earn and spend in ways that are different, on average, to men. So the algorithm does not need to be told that the applicant is a woman, it just identifies them as the sort of person who earns and spends like the class of people with historically lower credit limits.

Apple’s ‘sexist’ credit card investigated by US regulator

Garbage in, garbage out. Society has been garbage for marginalised groups since forever and there’s no way to take that out of the data. Especially not big data. You can try but you just end up playing whack-a-mole with new sources of bias, many of which cannot be measured well, if at all.
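To see why dropping the sex column does not help, here is a rough sketch with entirely made-up numbers. The spending "pattern" feature, the limits, and the averaging "model" are all hypothetical stand-ins, not anything Apple or its bank is known to use. The model never reads sex, yet it still reproduces the historical gap through the correlated feature.

```python
import random
from statistics import mean


def make_history(n=50_000, seed=1):
    """Fake historical records where women were given lower limits."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        female = rng.random() < 0.5
        # Assumption: a spending category that merely correlates with sex.
        pattern = "A" if rng.random() < (0.8 if female else 0.2) else "B"
        # Assumption: past limits were 3,000 lower for women, all else equal.
        limit = 10_000 - (3_000 if female else 0) + rng.gauss(0, 500)
        rows.append({"female": female, "pattern": pattern, "limit": limit})
    return rows


def fit_sex_blind_model(rows):
    """'Train' by averaging past limits per spending pattern; sex is never read."""
    by_pattern = {}
    for r in rows:
        by_pattern.setdefault(r["pattern"], []).append(r["limit"])
    return {p: mean(limits) for p, limits in by_pattern.items()}


history = make_history()
model = fit_sex_blind_model(history)

# Predict from spending pattern only, then compare average predictions by sex.
pred_women = mean(model[r["pattern"]] for r in history if r["female"])
pred_men = mean(model[r["pattern"]] for r in history if not r["female"])
gap = pred_men - pred_women

print(f"average predicted limit, women: {pred_women:.0f}")
print(f"average predicted limit, men:   {pred_men:.0f}")
print(f"gap reproduced without a sex column: {gap:.0f}")
```

Removing the pattern feature would only move the problem to the next correlated feature, which is the whack-a-mole described above.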

@ozymandias117 · 11 months ago

You are pointing out specific biases that we already know about. The article you posted seems to posit using the data to find the unknown biases we have as well

@[email protected] · 11 months ago

It’s asking why don’t we use it for that purpose, not suggesting that there is anything easy about doing so. I don’t know how you think science works, but it’s not like that.