5 Common Types of Bias
1- Sample bias
Happens when the collected data doesn’t accurately represent the environment the program is expected to run in.
No algorithm can be trained on the entire universe of data; in practice it is always trained on a carefully chosen subset.
Choosing a subset that is both large enough and representative enough to mitigate sample bias is a science in itself.
Example: Security cameras
If your goal is to create a model that can operate security cameras at daytime and nighttime, but you train it on nighttime data only, you’ve introduced sample bias into your model.
Sample bias can be reduced or eliminated by:
Training your model on both daytime and nighttime data.
Covering all the cases you expect your model to be exposed to. This can be done by examining the domain of each feature and making sure you have balanced, evenly distributed data covering all of it. Otherwise, the model will produce erroneous results and outputs that don’t make sense.
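As a rough illustration of examining a feature's domain, here is a minimal sketch that counts how much of the training data falls into each part of the domain and flags the parts that are barely covered. The feature (hour of capture), the bin boundaries, the `min_share` threshold and the sample values are all made up for the security camera example; they are assumptions, not values from a real dataset.

```python
from collections import Counter

def coverage_report(values, bins, min_share=0.05):
    """Return the share of samples per bin and the bins that look under-covered.

    values: feature values from the training set (here: hour of capture, 0-23).
    bins: dict mapping a bin name to a predicate over a single value.
    min_share: hypothetical threshold below which a bin is flagged.
    """
    counts = Counter()
    for v in values:
        for name, contains in bins.items():
            if contains(v):
                counts[name] += 1
                break  # each value is counted in at most one bin
    total = len(values)
    shares = {name: (counts[name] / total if total else 0.0) for name in bins}
    flagged = [name for name, share in shares.items() if share < min_share]
    return shares, flagged

# Made-up capture hours for a training set shot almost entirely at night.
capture_hours = [22, 23, 0, 1, 2, 3, 4, 23, 22, 1, 2, 0, 3, 21, 5, 2, 1, 0, 23, 13]

# Assumed definition of day and night; adjust to your own deployment.
hour_bins = {
    "daytime": lambda h: 6 <= h < 20,
    "nighttime": lambda h: h < 6 or h >= 20,
}

shares, flagged = coverage_report(capture_hours, hour_bins, min_share=0.25)
print(shares)
print("under-covered:", flagged)
```

A check like this only tells you that a region of the domain is thin in the training set. Deciding what counts as "enough" coverage still depends on your problem.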
2- Exclusion bias
Happens as a result of excluding some feature(s) from our dataset, usually under the umbrella of cleaning our data.
We delete some feature(s) thinking that they’re irrelevant to our labels/outputs based on pre-existing beliefs.
Example: Titanic Survival prediction
In the famous Titanic problem, where we predict who survived and who didn’t, one might disregard the passenger id of the travelers, thinking that it is completely irrelevant to whether they survived or not.
Suppose, however, that Titanic passengers were assigned rooms according to their passenger id: the smaller the id number, the closer the assigned room was to the lifeboats, so those passengers could reach the lifeboats faster than those deep in the center of the ship. The survival rate would then decrease as the id increases.
The assumption that the id affects the label is not based on the actual dataset; I’m just constructing an example.
Exclusion bias can be reduced or eliminated by:
Investigating before discarding feature(s), by doing sufficient analysis on them.
Asking a colleague to look into the feature(s) you’re considering discarding; a fresh pair of eyes will definitely help.
Researching the relation between a feature and your label before deleting it, if you’re low on time/resources and need to cut your dataset size by discarding feature(s). You will most probably find similar solutions; investigate whether they took similar features into account, and decide then.
Using tools, since humans are themselves subject to bias. The article “Explaining Feature Importance by example of a Random Forest” covers various ways to calculate feature importance, including methods that don’t require high computational resources.
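One very cheap way to screen a numeric feature before discarding it is to measure its correlation with the label. The sketch below uses a plain Pearson correlation, which is not the Random Forest importance discussed in the article mentioned above, and it only detects linear relationships. The data is made up to mirror the invented Titanic example, and the `ticket_last_digit` column is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

# Made-up toy data, NOT taken from the real Titanic dataset.
survived = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
features = {
    "passenger_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "ticket_last_digit": [3, 7, 1, 4, 9, 6, 8, 5, 2, 5],  # hypothetical column
}

# Score every candidate feature against the label.
scores = {name: pearson(column, survived) for name, column in features.items()}

# List features from the strongest to the weakest linear relationship.
for name, r in sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name}: r = {r:+.2f}")
```

A score close to zero does not prove that a feature is useless, because the relationship may be non-linear or may only show up in combination with other features. A clearly non-zero score, however, is a good reason not to delete the feature without a closer look.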
3- Observer bias (aka experimenter bias)
The tendency to see what we expect to see, or what we want to see. When a researcher studies a certain group, they usually come to an experiment with prior knowledge and subjective feelings about the group being studied. In other words, they come to the table with conscious or unconscious prejudices.
Example: Is Intelligence influenced by status? — The Burt Affair
One famous example of observer bias is the work of Cyril Burt, a psychologist best known for his work on the heritability of IQ. He thought that children from families with low socioeconomic status (i.e. working-class children) were more likely to have lower intelligence than children from families with higher socioeconomic status. His allegedly scientific approach to intelligence testing was considered revolutionary and allegedly proved that children from the working classes were, in general, less intelligent. His work influenced the two-tier educational system that operated in England in the mid-twentieth century, which sent middle- and upper-class children to elite schools and working-class children to less desirable schools.
Burt’s research was later discredited, and he is widely believed to have falsified data. Intelligence is now understood to be shaped by both genetic and environmental factors, and his results are no longer regarded as reliable evidence.
Observer bias can be reduced or eliminated by:
Ensuring that observers (people conducting experiments) are well trained.
Screening observers for potential biases.
Having clear rules and procedures in place for the experiment.
Making sure behaviors are clearly defined.
4- Prejudice bias
Happens as a result of cultural influences or stereotypes. Things that we don’t like in our reality, such as judging by appearance, social class, status, gender and much more, go uncorrected in our machine learning model, so the model applies the same stereotyping that exists in real life because of the prejudiced data it is fed.
Example: A computer vision program that detects people at work
Suppose your goal is to detect people at work, and your model has been fed thousands of training examples in which men are coding and women are cooking. The algorithm is likely to learn that coders are men and chefs are women, which is wrong, since women can code and men can cook.
The problem here is that the data is consciously or unconsciously reflecting stereotypes.
Prejudice bias can be reduced or eliminated by:
Ignoring the statistical relationship between gender and occupation.
Exposing the algorithm to a more even-handed distribution of examples.
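One possible way to present a more even-handed distribution without collecting new images is to reweight the examples you already have, so that every (gender, occupation) combination carries the same total weight during training. The sketch below is only an illustration under that assumption: the function name `balancing_weights` and the counts are made up, and whether your training code accepts per-example weights depends on the library you use.

```python
from collections import Counter

def balancing_weights(examples):
    """Give each example a weight so that every observed group has equal total weight.

    examples: list of hashable group keys, here (gender, occupation) tuples.
    The weights sum to len(examples), so the overall scale is unchanged.
    """
    counts = Counter(examples)
    n_groups = len(counts)
    total = len(examples)
    return [total / (n_groups * counts[key]) for key in examples]

# Made-up, heavily stereotyped training set.
examples = (
    [("man", "coder")] * 45
    + [("woman", "coder")] * 5
    + [("man", "cook")] * 5
    + [("woman", "cook")] * 45
)

weights = balancing_weights(examples)

# Total weight per group after reweighting.
group_totals = Counter()
for key, w in zip(examples, weights):
    group_totals[key] += w
print(dict(group_totals))
```

Reweighting can only rebalance combinations that already appear in the data. If a combination is missing entirely, or is represented by a handful of near-identical images, you still need to collect more examples.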
5- Measurement bias
Systematic value distortion happens when there’s an issue with the device used to observe or measure. This kind of bias tends to skew the data in a particular direction.
Example: Shooting image data with a camera that increases the brightness.
This faulty measurement tool fails to replicate the environment in which the model will operate. In other words, it distorts the training data so that it no longer represents the real data the model will work on once it’s launched.
This kind of bias can’t be avoided simply by collecting more data.
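To see why more data doesn’t help, here is a small simulation, offered as a sketch rather than a model of any real camera. It assumes a camera that adds a constant offset to every brightness reading on top of random noise. All the numbers (`TRUE_BRIGHTNESS`, `CAMERA_OFFSET`, `NOISE_STD`) are made up. Averaging more readings shrinks the noise, but the error settles at the offset instead of at zero, while a comparison with a second, unbiased device exposes the offset.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable

TRUE_BRIGHTNESS = 100.0  # hypothetical true scene brightness
CAMERA_OFFSET = 20.0     # hypothetical systematic error of the faulty camera
NOISE_STD = 5.0          # hypothetical random noise shared by both cameras

def biased_camera():
    """One reading from the camera that systematically brightens images."""
    return TRUE_BRIGHTNESS + CAMERA_OFFSET + random.gauss(0.0, NOISE_STD)

def reference_camera():
    """One reading from a second device without the systematic error."""
    return TRUE_BRIGHTNESS + random.gauss(0.0, NOISE_STD)

# Collecting more data reduces the noise, but not the systematic error.
errors = {}
for n in (10, 1000, 100000):
    readings = [biased_camera() for _ in range(n)]
    errors[n] = statistics.fmean(readings) - TRUE_BRIGHTNESS
    print(f"n = {n}: mean error = {errors[n]:.2f}")

# Comparing against a second device reveals the offset.
biased = [biased_camera() for _ in range(10000)]
reference = [reference_camera() for _ in range(10000)]
estimated_offset = statistics.fmean(biased) - statistics.fmean(reference)
print(f"estimated offset = {estimated_offset:.2f}")
```

In a real project you rarely know the true value, which is why the comparison between devices is the part of this sketch that carries over to practice.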
Measurement bias can be reduced or eliminated by:
Having multiple measuring devices.
Hiring humans who are trained to compare the output of these devices.
Source: https://towardsdatascience.com/5-types-of-bias-how-to-eliminate-them-in-your-machine-learning-project-75959af9d3a0?fbclid=IwAR0sPFXUsRqbtjI2gN0oRRp350X2lHS8VZrrNdYbnyAuEotd0vCYn8S85e8