This article is for the business leader who wonders where AI has the potential to go wrong. AI has massive potential, so this is not about doom-saying AI. Rather, it’s to help you know what to look for when you start AI projects so you can act as a watchdog for brand reputation, ROI, human safety, risk management and much more. At the most basic level, if something unethical or functionally wrong is happening in an AI system, there’s a good chance it originated in one of three areas: Data, Algorithms, and Training.
Today’s article will focus on the data. It is my firm belief that 90% of problems in AI systems start with data. Subsequent articles will address algorithms and training.
Why is data so important in AI?
People will say, “data is data, what can go wrong?” Or my favorite saying, “numbers don’t lie.” If you are of this opinion, then let me refute it with an equally well-known saying oft-repeated in the analytics profession: “There are three kinds of lies: lies, damned lies, and statistics.” In other words, data can be used to create numbers around any point you want to make. And AI, with machine learning and more specifically deep learning at its core, is nothing more than probabilistic statistics. That’s right, still statistics. Statistics derive conclusions from data. The data that goes in forms the base of your AI system, so this is step one of getting the AI system right. If this step gets messed up, your problems will only snowball from here.
Data is almost always imperfect.
The majority of a data science team’s time is spent preparing data. Ask them.
“But my data is perfect.” The three stages of Data Disbelief:
1) The systems I rely on work; so my data must be good
2) My data was already perfected for our dashboards and analytics
3) I bought already-perfect data from a business partner or data broker
We had a saying in the data consulting biz: if a client - usually a department head (e.g. HR, Marketing, Finance, Ops) - says their data is perfect, add at least 3 weeks to the time it will take to prep the data. We said this because many groups do not deal directly with their own data, so people often assume that since the systems are working right, the data must be fantastic. Or they have manipulated a particular pool of data that they use for various dashboards or analytics to perfection… but then it turns out that data is not a good fit for the AI project. Or they thought they had bought or traded for perfect data with business partners or data brokers. Buyer beware: getting access to data is not always the same as getting perfected data, so make sure you have someone internally who can inspect and validate the data as it’s coming in. If you have a Chief Data Officer, I would highly recommend talking to them about your AI project and your data needs.
It’s all about making assumptions and using the past to predict the future
Essentially, the way statistical models work is that you use background assumptions to build hypotheses, and then you start trying to rule out some of your hypotheses with probabilities. If a particular bit of data increases your probabilities to a threshold you like, you keep it; if not, you throw that bit out of the statistical model and continue until finally, using a set of training data, you are able to predict an outcome that has already happened - or I should say, has already been “classified.” For example, say I want to test an algorithm that can identify people whose faces look happy. I feed it pictures of happy-faced people so it can notice the pattern of what a happy face looks like. Then I feed it a set of test faces that includes a pre-determined group of happy faces to see if it can figure out which ones are happy on its own. If it can only figure it out half the time, then I need to keep adding more data - more happy faces, in this case - until it picks up enough patterns to get better and better.
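To make the train-then-test loop above concrete, here is a minimal sketch in Python. It assumes a single hypothetical “smile score” number stands in for an actual face image, and the “model” is nothing more than a cutoff score learned from already-classified training examples, then checked against held-out test examples whose labels we also already know.

```python
# Toy sketch of training on past, already-labeled data and testing on
# held-out labeled data. (score, is_happy) pairs stand in for face images.

def fit_threshold(train):
    """Learn the simplest possible model: the cutoff score that best
    separates happy from not-happy faces in the training data."""
    best_cut, best_acc = 0.0, 0.0
    for cut in [x / 10 for x in range(11)]:
        acc = sum((score >= cut) == happy for score, happy in train) / len(train)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut

def accuracy(cut, test):
    """Score the model on held-out faces whose labels we already know."""
    return sum((score >= cut) == happy for score, happy in test) / len(test)

# "Old" outcomes: faces that have already been classified by a person
train = [(0.9, True), (0.8, True), (0.7, True),
         (0.3, False), (0.2, False), (0.1, False)]
test = [(0.85, True), (0.15, False), (0.6, True), (0.4, False)]

cut = fit_threshold(train)
print(accuracy(cut, test))  # fraction of held-out faces classified correctly
```

If the accuracy on the test set is too low, the fix described above is more (and more varied) labeled training data, so the model can pick up a more reliable pattern.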
To non-data scientists, this sounds kind of counter-intuitive, doesn’t it? You use old data to predict old outcomes and then apply that model to new data. Huh? But I thought this was supposed to predict new outcomes? Things that haven’t happened yet? Right. But by its very nature, data is from the past, even if the past is one second ago. Data does not typically define the exact present state, nor the future state. In the strictest sense, data is evidence - or exhaust - of something that already happened. Machine learning algorithms work best off of massive, continually changing streams of data, because the more varied data they get, the more they can learn and improve.
Here’s where I have seen things go awry in AI with regards to data:
The AI creeped people out by getting a little too personal
You really have to figure out whether you should even have access to certain data, or whether you should use certain data to label people. If it violates public trust or creates an opportunity to be creepy, then seriously ask yourself whether your clients will enjoy the personalized experience that much, or whether it will creep them out so much that you’ll face major backlash.
Garbage In Garbage Out
You may not realize this, but data scientists spend the majority of their time manipulating and cleaning up data. My Chief Data Officer friend Ursula Cottone has coined a term for this: she calls it being a “data maid,” because data can tend to be like laundry - you clean and clean it, but more gets generated every day. Just because it’s AI does not mean it can filter out all the yucky stuff in data - unless it’s specifically designed to do that, which most AI is not. As noted in Pedro Domingos’s book, The Master Algorithm, data noise and random events contribute to overfitting of algorithms: “Overfitting happens when you have too many hypotheses and not enough data to tell them apart.” What’s a random event? An example would be my sensors going down for half a day, suddenly starting to feed data into new data fields, or data becoming corrupted and looking like gobbledegook. So instead of getting soil readings from my lettuce field that showed numbers between 1 and 40, I started getting readings far outside the normal range, such as 1000. Or a random event could even be caused by a real event, such as a volcano erupting, torching my sensors, and sending my readings off the charts. Regardless, this data would not be suitable to help me predict my usual yield of lettuce, because these are not things that happen all the time. I don’t want my algorithm trying to predict my lettuce yields based off of those types of events, because they are random.
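One of the simplest “data maid” chores from the lettuce example above can be sketched in a few lines of Python. The soil-sensor readings and the 1–40 plausible range are hypothetical, straight from the scenario described; the point is that out-of-range values get quarantined before the model ever sees them.

```python
# Minimal sketch of range-based cleaning for hypothetical soil-sensor
# readings, where 1-40 is the physically plausible range (per the
# lettuce-field example). Corrupted values and fried-sensor spikes
# are dropped before the data reaches any model.

def filter_readings(readings, low=1, high=40):
    """Split readings into those inside the plausible range and anomalies."""
    kept = [r for r in readings if low <= r <= high]
    dropped = [r for r in readings if not (low <= r <= high)]
    return kept, dropped

raw = [12, 18, 1000, 25, -3, 31, 1000]  # two corrupted 1000s, one negative glitch
clean, anomalies = filter_readings(raw)
print(clean)      # readings safe to feed the model
print(anomalies)  # readings to investigate, not to train on
```

Real pipelines use far more sophisticated anomaly detection, but the principle is the same: decide what “normal” looks like, and keep the random events out of the training data.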
Using Wrong Data Proxies - using an alternative for the actual data you need because: 1) the data you need does not exist, or 2) you cannot gain access to it. There are lots of reasons why you may not be able to gain access to the actual data: it is cost-prohibitive to buy, a high security level may be required, it is too hard to organize/clean up/standardize, it exists in too many places to bring together, or it exists in unusable formats. So you try to come up with a next-best alternative to the actual data you need. But all too often that next-best alternative is based on something like finance-based scores that create bias or machine-generated prejudice. So you have to pay very close attention to exactly what data was used.
Credit scores are used as proxies for how responsible a person is.
My favorite quote on this comes from Cathy O’Neil’s book “Weapons of Math Destruction”: when it comes to how much you pay for auto insurance, “how you manage your money can matter more than how you drive a car.” She gives examples of people with perfect driving records but poor credit scores paying $1,552 more than drivers with excellent credit scores and a drunk driving conviction. Now, perhaps you are thinking: but what about crime - maybe the premium is high because it’s a high-crime area. My response: then use actual crime statistics, not credit scores, as the proxy. Credit scores create bias because they tend to penalize those with lesser means, which can create a cycle of debt. If all AI pricing algorithms had credit scores in their criteria, then people who were already of lesser means would pay more for everything.
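The structural problem with proxy pricing can be shown in a tiny sketch. All of the numbers below are made up for illustration - they are not real insurance figures - but they show how a credit-score proxy and an actual-driving-record feature produce opposite rankings for the same two drivers.

```python
# Purely illustrative premiums -- hypothetical numbers, real structure.

def premium_by_credit(base, credit_score):
    """Proxy pricing: surcharge drivers with low credit scores."""
    return base + (300 if credit_score < 600 else 0)

def premium_by_record(base, at_fault_accidents):
    """Direct pricing: surcharge based on the actual driving record."""
    return base + 150 * at_fault_accidents

base = 1000
# Driver A: perfect record, poor credit. Driver B: two at-fault accidents, great credit.
a_proxy = premium_by_credit(base, credit_score=550)       # A penalized for credit
b_proxy = premium_by_credit(base, credit_score=780)
a_direct = premium_by_record(base, at_fault_accidents=0)  # A priced on actual risk
b_direct = premium_by_record(base, at_fault_accidents=2)

print(a_proxy > b_proxy)    # under the proxy, the safer driver pays more
print(a_direct < b_direct)  # under the direct feature, the riskier driver pays more
```

This is the question to ask your data scientists about any model: which features are direct measurements of the thing you care about, and which are proxies smuggling in something else?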
Using Old Data to get to New Outcomes
The tendency is always to want to use more data, because using a ton of data means more accurate machine learning models. Unless… it’s data that is not relevant to your new way of doing things.
The case that stands out here is pre-crime, also known as predictive policing. Oh yeah, that’s a thing. A big, scary thing. Go ahead and think about the movie Minority Report - I know I did when I first heard about this. The problem is that predictive policing models that use prior arrest data are more indicative of police practices than of actual crime. Many police departments have used intimidation tactics in high-crime areas, hauling in a person and booking them just to make an example that would intimidate others from committing crimes. Then they changed tactics. They decided to fund pizza nights for block parties, get to know neighborhoods, and partner with residents to understand more of the dynamics going on in the neighborhood. But the data on this new way of doing things is not as abundant as the data from the old way. So if you base your algorithm on data from the old way, there might be “false bookings” that bias your results and recommend the old tactics instead of the new.
Now you know what to look out for in terms of things that can go wrong with data. I hope this helps you work with your data scientists or vendors to really get to the bottom of how you are laying the most important building block of your AI system. Next time I’ll give you some things to think about in creating and using the algorithm.