We're in the midst of a new global revolution, one that is driven by and focused on the data that surrounds us and infuses everything we do, from making toast, to driving cars across the country, to inventing new paradigms for social interaction. And the key to all of this is data science. In this article, we'll explore some of the ways that data science allows us to ask and answer new questions that we previously didn't even dream of. To do that, we'll see how data science connects to other data-rich fields like artificial intelligence, machine learning, and predictive analytics. We'll map out the fundamental practices for gathering and analyzing data, formulating rules for classification and decision making, and implementing those insights. We'll touch on some of the tools that you can use in data science, but we'll focus primarily on the meaning and the promise of data in our lives. Because this discussion focuses on ideas as opposed to specific techniques, you can follow along regardless of your technical background, and you'll come away with a better understanding of how to draw on data to do the things that are important to you, and to do them more effectively and more efficiently.
What is data science?
Back in the early 60's, Barbra Streisand sang, "People who need people are the luckiest people in the world." And really, the need for belonging, the need to connect, and the need to be valued are all fundamental human motivations. As it happens, working in data science can actually help fulfill some of these foundational needs by placing you in a position to, one, do something wonderful for other people and, two, be valued for it in return. It goes back to something the Harvard Business Review had to say back in 2012. Thomas Davenport and D.J. Patil made the extraordinary claim that data science, of all things, was the sexiest job of the 21st century. But they had good reasons for saying this. They argued that data scientists had, one, a valuable combination of rare qualities that, two, put them in high demand. So here are some of the rare qualities. Data scientists are able to find order, meaning, and value in unstructured data. That's online sources, the graphs of social networks, audio, images, videos, and so on. They're able to predict outcomes, like who's likely to purchase something, or who poses a security threat, or who's likely to develop a disease or respond well to a new treatment. They're able to automate processes, like getting individualized recommendations while shopping, identifying friends in photographs, or giving psychological support in AI chat bots. And they're in high demand for a couple of really simple reasons. They provide hidden insight. Data science is able to show you things that you simply can't find through other means. And that hidden insight, in turn, provides significant competitive advantage to any organization that has the good foresight to employ and really make the most of data science. Let me give you a little bit more information about supply and demand here. Number one, there's been extraordinary growth in job ads. People are looking for data scientists. So for example, a January 2019 report from Indeed showed a 29% increase in job ads for data science over one year and 344% over six years. This is extraordinary growth. Next, they showed that there's growth in job searches, people actually trying to find jobs in data science, and they found only a 14% growth over one year. Now that may sound a little low, but the important thing here is that demand is outstripping availability. And any time that happens, you've got value. And is the gap between supply and demand significant? LinkedIn reported a gap of over 150,000 jobs in data science. And it's even more dramatic when you include related fields like machine learning engineer, artificial intelligence specialist, and so on. It's big. There are so many possibilities here to do something that is valued by others. And it's reflected in the salaries for data science. The average salary in data science is $107,000 a year, which, just for comparison purposes, is over twice the U.S. national median of $47,000 a year. And that means that this is one of the best jobs in the U.S. Glassdoor in January of 2019 published its annual list of the best jobs in America, and for the fourth year in a row, data scientist is at the top of the list, based on job satisfaction, number of openings, and salary. It really lets you know there is extraordinary potential here, something of amazing value that you can provide to potential employers.
And you can fulfill that great need, and be valued for the things that you're able to contribute, by embracing the methods and the benefits of data science in your work.
Sometimes the whole is greater than the sum of its parts, and you can see this in a lot of different places. Take, for instance, music. You take John, and Paul, and George, and Ringo, all wonderful musicians in their own right, but put them together and you have The Beatles and you have revolutionized popular culture. Or take the fact that everybody has a circle of friends and basically everybody has the internet now, and you have created social networks and you have revolutionized the computing world. Or take 2010, when Drew Conway proposed that the combination of hacking skills, that's computer programming, with math and statistics, and with substantive or topical domain expertise, gives you data science, a new field that has revolutionized both the technology and the business world. And I want to talk a little more about why each of those three elements in the Venn diagram of data science is so important. The first one is the hacking skills, or computer programming, and the reason that's important is that you have such novel sources of data. You have social media and social networks, you have challenging formats like the graph data from those social networks or images or video that don't fit into the rows and columns of a spreadsheet, or you have streaming data like sensor data or live web data that comes in so fast that you can't pause it to analyze it. All of these require the creativity that comes with hacking and the ability to work freely with what you have in front of you. Now in terms of actual computer programming skills, there are a few things that are very useful in data science. There's the ability to work with a language like Python or R; these are programming languages that are very frequently used for data manipulation and modeling. There's C, and C++, and Java. These are general-purpose languages that are used for the back end, the foundational elements of data science, and they provide maximum speed. There's S-Q-L or SQL. That stands for Structured Query Language. This is a language for working with relational databases to do queries and data manipulation. And then there are packages you can use within these languages, like TensorFlow. This is an open-source library that's used for deep learning and it has revolutionized the way that data science is performed right now. And then there are the mathematical elements of data science. First off, there are several forms of mathematics that are particularly useful in data science. There's probability, and linear algebra, and calculus, and regression, and I'll talk about some of each of these, but they allow you to do something important. Number one, they allow you to choose the procedures. You want to judge the fit between your question, which is always the first and most important thing, and the data that you have available to you, and then you choose a procedure that answers your question based on your data. And if you understand the mathematics and how it works, you'll be able to make a much better and more informed choice. And also, you'll be able to diagnose problems. Murphy's Law, that anything that can go wrong will go wrong, applies in data science as well as everywhere else, and you need to know what to do when the procedures that you've chosen fail or sometimes give impossible results. You need to understand exactly how the data's being manipulated so you can see where the trouble areas are and how to resolve them. And then the third area of Conway's data science Venn diagram is substantive expertise.
The idea here is that each domain or topic area has its own goals, methods, and constraints. If you're working in social media marketing, you're going to have a very different set of goals and methods than if you're working in biomedical informatics. And you need to know what constitutes value in the particular domain you're working in. And finally, you need to know how to implement the insights, because data science is an action-oriented field. It's designed to tell you what to do next to get the most value and provide the best service that you possibly can, based on the data that you have. So taken together, the hacking or programming, the math and statistics, and the substantive expertise are the individual elements or components, the parts that make up the larger-than-the-sum whole of data science.
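To make the programming corner of that Venn diagram a little more concrete, here's a minimal sketch in Python that uses the built-in sqlite3 module to run the kind of SQL query described above. The table, the columns, and the data are hypothetical, purely for illustration.

```python
# A tiny illustration of the "hacking" plus SQL combination: Python's built-in
# sqlite3 module running a SQL query. The purchases table and its data are
# hypothetical, just to show the shape of the work.
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("ana", 12.50), ("ben", 7.25), ("ana", 30.00)],
)

# SQL: total spending per customer, largest first
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM purchases GROUP BY customer ORDER BY total DESC"
).fetchall()

for customer, total in rows:
    print(customer, total)
```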
The insights you get from data science can feel like a gift to your business, but you don't get to just open your hands and have it delivered to you with a bow on it. Really, there are a lot of moving parts and things that have to be planned and coordinated for all of this to work properly. I like to think of data science projects like walking down a pathway, where each step gets you closer to the goal that you have in mind. And with that I want to introduce you to a way of thinking about the data science pathway. It begins with planning your project. You first need to define your goals. What is it that you're actually trying to find out or accomplish? That way you can know when you're on target or when you need to redirect a little bit. You need to organize your resources. That can include things as simple as getting the right computers and the software, accessing the data, getting people and their time available. You need to coordinate the work of those people because data science is a team effort. Not everybody's going to be doing the same thing and some things have to happen first and some happen later. You also need to schedule the project so it doesn't expand to fill up an enormous amount of time. Time boxing, or saying we will accomplish this task in this amount of time, can be especially useful when you're working on a tight timeframe, or when you have a budget and you're working with a client. After planning, the next step is going to be wrangling, or preparing the data. That means you need to first get the data. You may be gathering new data, you may be using open data sources, you may be using public APIs, but you have to actually get the raw materials together. The next one, step six, is cleaning the data, which actually is an enormous task within data science. It's about getting the data ready so it fits into the paradigm, for instance, the program and the applications that you're using, so that you can process it to get the insight that you need. Once the data's prepared and it's in your computer, you need to explore the data, maybe making visualizations, maybe doing some numerical summaries, a way of getting a feel for what's going on in there. And then, based on your exploration, you may need to refine the data. You may need to re-categorize cases. You may need to combine variables into new scores. Anything that can help you get it prepared for the insight. The third category in your data pathway is modeling. This is where you actually create the statistical model and you do the linear regression. You do the decision tree. You do the deep learning neural network. But then, you need to validate the model. How well do you know it's going to generalize from the current data set to other data sets? In a lot of research that step is left out and you often end up with conclusions that fall apart when you go to new places. So, validation's a very important part of this. The next step is evaluating the model. How well does it fit the data? What's the return on investment for it? How usable is it going to be? And then, based on those, you may need to refine the model. You may need to try processing a different way, adjust the parameters in your neural network, get additional variables to include in your linear regression. Any one of those can help you build a better model to achieve the goals that you had in mind in the first place.
And then finally, the last part of the data pathway is applying the model, and that includes presenting the model, showing what you learned to other people, to the decision makers, to the invested parties, to your client, so they know what it is that you've found. Then you deploy the model. Say, for instance, you created a recommendation engine. You actually need to put it online so that it can start providing these recommendations to clients, or you put it into a dashboard so it can start providing recommendations to your decision makers. You will eventually need to revisit the model, see how well it's performing, especially when you have new data and maybe a new context in which it's operating. And then, you may need to revise it and try the process over again. And then finally, once you've done all of this, there's the matter of archiving the assets. Really, cleaning up after yourself is very important in data science. It includes documenting where the data came from and how you processed it. It includes commenting the code that you used to analyze it. It includes making things future-proof. All of these together make the project easier to manage and easier to calculate the return on investment for, and following each of these steps makes the project more successful. Taken together, those steps on the pathway get you to your goal. It could be an amazing view at the end of your hike, or it could be an amazing insight into your business model, which was your purpose all along.
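To make the wrangle, model, validate, and evaluate steps a little more concrete, here's a minimal sketch in Python using pandas and scikit-learn. The variable names and the synthetic data are assumptions chosen purely to keep the example self-contained; in a real project the data would come from whatever sources you gathered during wrangling.

```python
# A compressed walk through wrangling, modeling, validating, and evaluating,
# using pandas and scikit-learn. The variables and synthetic data are stand-ins;
# a real project would start from the data gathered during wrangling.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Wrangle: in practice this would be reading files, APIs, or databases
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ad_spend": rng.uniform(0, 100, 200),
    "store_visits": rng.integers(0, 50, 200),
})
df["sales"] = 3 * df["ad_spend"] + 2 * df["store_visits"] + rng.normal(0, 10, 200)
df = df.dropna()   # clean: drop incomplete rows (none here, but typical)

# Validate: hold out data the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    df[["ad_spend", "store_visits"]], df["sales"], test_size=0.25, random_state=0
)

# Model: a plain linear regression
model = LinearRegression().fit(X_train, y_train)

# Evaluate: R-squared on the held-out data
print("R^2 on held-out data:", model.score(X_test, y_test))
```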
Data science is fundamentally a team sport. There are so many different skills and so many different elements involved in a data science project that you're going to need people from all sorts of different backgrounds with different techniques to contribute to the overall success of the project. I want to talk about a few of these important roles. The first one is the data engineers. These are the developers and the system architects, the people who focus on the hardware and the software that make data science possible. They provide the foundation for all of the other analyses. They focus on the speed, and the reliability, and the availability of the work that you do. Next are machine learning specialists. These are people who have extensive work in computer science and in mathematics. They work in deep learning. They work in artificial intelligence. And they're the ones who have the intimate understanding of the algorithms and understand exactly how they're working with the data to produce the results that you're looking for. And then, in an entirely different vein, there are people who are researchers, and by that I mean topical researchers. They focus on domain-specific research; physics and genetics are common, so is astrophysics, so is medicine, so is psychology. These kinds of researchers are usually better versed in the design of research within their particular field and in doing common statistical analyses, that's where their expertise lies, but they connect with data science in that they're trying to find the answers to some of these big-picture questions that data scientists can also contribute to. Also, any business doing its job has analysts. These are people who do the day-to-day data tasks that are necessary for any business to run efficiently. Those include things like web analytics, and S-Q-L, that's SQL or Structured Query Language, data visualizations, and the reports that go into business intelligence. These support good business decision-making: they let you see how you're performing, where you need to reorient, and how you can better reach your goals. Then there are the managers. These are the people who manage the entire data science project, and they're in charge of doing a couple of very important things. One is they need to frame the business-relevant questions and solutions. So, they're the ones who have the big picture. They know what they're trying to accomplish with that. And then, they need to keep people on track and moving towards it. And, to do that, they don't necessarily need to know how to do a neural network, they don't need to make the data visualization, but they need to speak data so they can understand how the data relates to the question they're trying to answer, and they can help take the information that the other people are getting and put it together into a cohesive whole. Now, there are people who are entrepreneurs. And, in this case, you might have a data-based startup. The trick here is you often need all of the skills, including the business acumen, to make the business run well. You also need some great creativity in planning your projects and the execution that gets you towards your entrepreneurial goals. And then there's the unicorn, also known as the rock star, or the ninja. This is a full-stack data scientist who can do it all, and do it at absolute peak performance.
Well, that's a nice thing to have; on the other hand, such a person is very rare, which is why we call them the unicorn. Also, you don't want to rely on one person for everything. Aside from the fact that they're hard to find, and sometimes hard to keep, you're only getting a single perspective or approach to your business questions, and you usually need something more diverse than that. And that suggests the common approach to getting all the skills you need for a data project, and that is by team. You can get a unicorn by team, where you bring together the people who have all the necessary skills, from the foundational data engineer, to the machine learning specialist, to the analyst, to the managers, all working together to get the insight from your data and help your project reach its greatest potential in moving your organization towards its own goals.
At this exact moment in history, when people think about data science, the mind turns inexorably towards artificial intelligence, often with humanoid robots lost deep in thought. But before I compare and contrast data science and AI, I want to mention a few things about the nature of categories and definitions. First, categories are constructs, and by construct I mean something that you have to infer, something that is created in the mind, doesn't have this essential existence. It's a little bit like, when is something comedy and when is something performance art, and when is something acting? There's nothing that clearly separates one from the other. These are all mental categories, and the same thing is true of any category or definition, including things like data science and AI. The second one is that categories serve functional purposes. A letter opener is anything that's used to open letters. I actually use a knife to open letters. On the other hand, I know a family that uses knives to scoop ice cream exclusively. And, the tool is whatever you use it for. It's defined by its utility. The same thing is true of categories. And then finally, the use of categories varies by needs. If you're putting books on the shelf, you can use the Library of Congress system, the Dewey Decimal system. I know people who stack them by size or by color, or turn them around and do it decoratively. Any of those is going to work because they're serving different purposes. And so, when we're trying to think about categories and defining whether a particular activity is AI or whether it's data science, all of these principles are going to apply. A good example of this is the question of whether tomatoes are fruits or vegetables. Everyone knows that tomatoes are supposed to be fruit, but everyone also knows you'd never put tomatoes in a fruit salad. Tomatoes go on veggie plates along with carrots and celery. The answer to this paradox, its fruit versus vegetable nature, is simple. The word fruit is a botanical term, and the word vegetable is a culinary term. They're not parallel or even very well coordinated systems of categorizations, which is why confusion can arise. It's a little like the joke about the bar that plays both kinds of music, country and western. The categories don't always divide logically or exclusively, and the same is true for artificial intelligence and data science. So, what exactly is artificial intelligence? Well I'm going to let you know, there are a lot of different statements about this, and none of them are taken as definitive. And some of them I find to be useful, and some of them I find to be less useful. There's a little joke that AI means anything that computers can't do. Well, obviously, computers learn to do new things, but as soon as a computer learns how to do something, people say, well that's not intelligent, that's just a machine doing stuff. And so there's a sort of moving target here to this particular definition in terms of things computers can't do. You can also think of AI in terms of tasks that normally require humans. Like placing a phone call and making an appointment. Or like returning an email, or categorizing text. Traditionally humans have done that, but when a machine is able to do that, when a program's able to do it, that's probably a good example of artificial intelligence. Probably the most basic and useful definition is that artificial intelligence refers to programs that learn from data. 
And so, you give them some data, they build a model, and that model adapts over time. A few common examples of this are things like categorizing photos. Is this a photo of a horse, a car, a balloon, a person? And programs learn how to do this by first having lots and lots, and lots, and lots of photos that are labeled by people as one thing or another, as a cat or a dog. But then the algorithm is able to start learning on its own and abstracting the elements of the photo that best represent cat or dog. It's also used for translation, going from one language, like English, to another, like French. The use of artificial intelligence programs has made enormous leaps in the ability of computers to do this automatically. Another one is games, like Go here. It was a very big deal when, not very long ago, a computer was able to beat the world champion of Go. And it was thought to be this intuitive game that couldn't really be explained. What's fascinating about that is the computer actually taught itself how to play Go. And we'll talk a little more about that when we talk about the derivation of rules in another video. But all three of these can be good examples of artificial intelligence, simply by the sorts of things it's able to do. And so, probably, this is the best working definition of AI. And, while it can include even simple regression models, which really don't require much in the way of computing power, it usually refers to two approaches in particular. AI is usually referring to machine learning algorithms, and in particular, deep learning neural networks. I'm going to talk about those more elsewhere, but I did want to bring up one more important distinction when talking about AI. And that's the difference between what is called strong or general AI, which is the idea that you can build a computer replica of the human brain that can solve any cognitive task. This is what happens in science fiction. You have a machine that's just like a human in a box. And that was the original goal of artificial intelligence back in the 50s, but it turned out that has been very difficult. Instead, you also have what is called weak or narrow, or specific, or focused AI. And these are algorithms that focus on one specific, well-defined task. Like, is this a photo of a cat or a photo of a dog? That has been where the enormous progress in AI has been over the last several years. So with all this in mind, how does artificial intelligence compare and contrast to data science? Well, it's a little bit like the fruit versus vegetable conundrum. Artificial intelligence means algorithms that learn from data. Broadly speaking, there's an enormous amount of overlap between our concept of AI and the field of machine learning. Data science, on the other hand, is the collection of skills and techniques for dealing with challenging data. You can see that these two are not exclusive. There's a lot of overlap between them, and AI nearly always involves the data science skillset. You basically can't do modern AI without data science. But there's an enormous amount of data science that does not involve artificial intelligence. If you want to draw a diagram, I personally think of it this way. If this is data science, here's machine learning, ML. There's a lot of overlap between those two, and then within machine learning there's a specific approach called neural networks.
Those have been amazingly productive, and AI refers to this diffuse, not well defined category that mostly overlaps with neural networks and with machine learning. And, it gets at some of the ambiguities, and some of the difficulty in separating these, which is why there's no consistent definition, and why there's so much debate over what one thing is, and what the other one is. But I will say this. Artificial intelligence has been enormously influential within the field of data science recently, even though data science has many other things that it does.
Back in the day a machine was just a machine. It did whatever machine things it did, like stamping metal or turning a propeller or maybe washing your clothes with a fair amount of help on your part. But nowadays, machines have to do more than just their given mechanical function. Now a washing machine's supposed to be smart. It's supposed to learn about you and how you like your clothes, and it's supposed to adjust its functions according to its sensors, and it's supposed to send you a gentle message on your phone when it's all done taking care of everything. This is a big change, not just for washing machines, but for so many other machines and for data science processes as well. This gets to the issue of machine learning, and a very simple definition of that is the ability of algorithms to learn from data, and to learn in such a way that they can improve their function in the future. Now, learning is a pretty universal thing. Here's how humans learn. For humans, memorization is hard. On the other hand, spotting patterns is often pretty easy for humans, as is reacting well and adaptively to new situations that resemble the old ones in many but not all ways. On the other hand, the way that machines learn is a little bit different. Unlike humans, memorization is really easy for machines. You can give them a million digits, they'll remember them perfectly and give them right back to you. But for a machine, for an algorithm, spotting patterns, in terms of here's a visual pattern, here's a pattern over time, is much harder. And new situations can be very challenging for algorithms, to take what they learned previously and adapt it to something that may differ in a few significant ways. But the general idea is that once you figure out how machines learn and the ways that you can work with that, you can do some useful things. So for instance, there's spam email: you get a new email and the algorithm can tell whether it's spam. I use a couple of different email providers and I will tell you, some of them are much better at this than others. There's also image identification, for instance, telling whether this is a human face or whose face it is. Or there's the translation of languages, where you enter text, either written or spoken, and it translates it back. A very complicated task for humans, but something that machines have learned how to do much better than they used to. Still not 100%, but getting closer all the time. Now, the important thing here is that you're not specifying all the criteria in each of these examples, and you're not laying out a giant collection of if-this-then-that statements in a flow chart. That would be something called an expert system. Those were created several decades ago and have been found to have limited utility, and they're certainly not responsible for the modern developments of machine learning. Instead, a more common approach really is to just teach your machine. You train it, and the way you do that is you show the algorithm millions of labeled examples. If you're trying to teach it to identify photos of cats versus other animals, you give it millions of photos and you say this is a cat, this is a cat, this is not, this is not, this is. And then the algorithm finds its own distinctive features that are consistent across many of the examples of cats. Now what's important here is that the features, the things in the pictures that the algorithm latches onto, may not be relevant to humans.
We look at things like the eyes and the whiskers and the nose and the ears. It might be looking at the curve on the outside of the cheek relative to the height of one ear to another. It might be looking just at a small patch of lines around the nose. Those may not be the things that humans latch onto, and sometimes they're not even visible to humans. It turns out that algorithms can find things that are very subtle, pixel-by-pixel changes in images or very faint sounds in audio or individual letters in text, and they can respond to those. That's both a blessing and a curse. It means that an algorithm can find things that humans don't, but it also can react in strange ways occasionally. But once you take all this training, you give your algorithm millions of labeled examples and it starts classifying things, well, then you want to use something like a neural network, which has been responsible for the major growth in machine learning and data science in the past five years or so. These diagrams here are different layouts of possible neural networks that go from the left to the right; some of them circle around or they return back to where they were. But all of these are different ways of taking the information and processing it. Now, the theory of neural networks, or artificial neural networks, has existed for years. The theory is not new. What's different, however, is that computing power has recently caught up to the demands that the theory places, and in addition, the availability of labeled data, primarily thanks to social media, has recently caught up too. And so now we have this perfect combination. The theory has existed, but the computing power and the raw data that it needs have both arrived to make it possible to do these computations that in many ways resemble what goes on in the human brain, and then allow it to think creatively about the data, find its own patterns and label things. Now, I do want to say something about the relationship between data science and machine learning. Data science can definitely be done without machine learning. Any traditional classification task, a logistic regression, a decision tree, that's not usually machine learning, and it's very effective data science. The same goes for most predictive models, or even something like a sentiment analysis of social media text. On the other hand, machine learning without data science, well, you know, not so much. It's possible to do machine learning without extensive domain expertise, which is one element of data science, but you would nearly always want to do this in collaboration with some sort of topical expert. Mostly I like to think of machine learning as a subdiscipline of data science. And that just brings up one more thing I want to say. The neural networks, and the deep learning neural networks in particular, that have been responsible for nearly all of these amazing developments in machine learning are a little bit of a black box, which means it's hard to know exactly what the algorithm is looking at or how it's processing the data, and one result of that is it kind of limits your ability to interpret what's going on, even though the predictions and classifications can be amazingly accurate. I'll say more about neural networks and these issues elsewhere, but they highlight the trade-offs, the potential and the compromises that are inherent in some of these really exciting developments that have been taking place in one extraordinarily influential part of the data science world.
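To make that training-by-labeled-example idea a bit more concrete, here's a minimal sketch in Python with scikit-learn. It uses the library's small built-in digits dataset instead of millions of photos, and a plain logistic regression instead of a deep network, purely so the example stays small and runnable.

```python
# Training by labeled example: no hand-written rules, just examples with labels.
# scikit-learn's small built-in digits dataset stands in for millions of photos,
# and a plain logistic regression stands in for a deep network.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()   # 8x8 images of handwritten digits, each with a label
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=2000)   # the learner; we never specify features
clf.fit(X_train, y_train)                 # "this is a 3, this is a 7, ..."

print("Accuracy on images it has never seen:", clf.score(X_test, y_test))
```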
If you've ever been around a baby, you know that babies take very little steps. But the thing about baby steps is that you still get moving and eventually, babies grow and they take bigger steps. And before you know it, you've got a world-class sprinter. And there's a similar thing, I like to think, that happens with neural networks. And what happens here is that tiny steps with data can lead to amazing analytical results. Now, an artificial neural network in computing is modeled roughly after the neurons that are inside a biological brain. Those neurons are nothing more than simple on and off switches that are connecting with each other, but give rise to things like love and consciousness. In the computing version, the idea is to take some very basic pieces of information, and by connecting it with many other nodes, you can give rise to the sort of emergent behavior, which really is very high-level cognitive decisions and classifications. It works this way. Over here on the left, you start with an input layer. That's where your raw data comes in. And then, it gets passed along to one or more hidden layers. That's what makes it a neural network, that you have these hidden layers. And these lines all represent connections like the connections between neurons in a biological brain. And then, after going through several hidden layers, you have an output layer, which is where you get the final classification or decision about what's happening. And I want to give you an example of how this might work. Now, please understand, this is an analogy. The actual operation of neural networks is much more complicated and sometimes a little more mysterious than what's going on here. But let's take a simple example where you're taking your input data from a digital image. In that case, for each pixel in the image, you're going to have basically five pieces of information. You're going to have the X and Y coordinates of that pixel. And then, for that pixel, you're going to have its red, green, and blue color components. And then you're going to repeat these five things for every pixel in the image. But that's your raw input data. Those are numerical values. And you put those into the input layer. And then, what it does is it starts to combine these different X and Y positions and the RGB colors, and then, it decides whether it has found a line. Does this represent a distinct line against a color background? So, that might be the first layer. And then, from there, it's going to say, I've found some lines, and now I'm going to see if I can combine those lines to determine whether that line is the edge of an object as opposed to some sort of service marker. And then, if I found edges, I can then take the information about edges and then combine that to determine what's the shape that I'm looking at. Is it a circle, a square, a hexagon, or something much more complex than that? And then maybe, it takes all of this shape information and then it goes to the output layer, and it says what the actual object is. So, we've gone from the X Y RGB pieces of information about each pixel, and we've put that in. And we've gone to lines and we've gone to edges, and then to shapes, and then possibly to objects. That's one idea of how a neural network, and especially a deep-learning neural network, which has many hidden layers, might work. If you want to see something that's slightly more complicated, here's an example. And neural networks can potentially have millions of neurons. 
There can be a lot of stuff going on in there. And they might be arranged in a lot of different ways. These are called feedforward ones, where the information starts on the left and just kind of keeps moving forward to the right. But there are a lot of other potential arrangements for the data transfers within a neural network. These are some of the possible examples. They behave slightly differently. You'll want to use some of the different versions in different circumstances. There is one interesting thing, though. Just like a human brain, things can get a little complicated in a neural network, or really massively complicated. And it can be hard to know exactly what it is that's going on inside there. What that means is you actually have to resort to inference. You sometimes have to infer how the neural network is functioning. What is it looking at? How is it processing that information? And curiously, that means you actually have to use some of the same methods as psychological researchers who are trying to find out what's going on inside a human brain. You are testing and then inferring the processes that are going on. One other really important thing to keep in mind about neural networks is there's a collection of legal issues that apply to these that don't quite apply the same way to other forms of machine learning or data science. For instance, the European Union's General Data Protection Regulation, better known as just GDPR, is a collection of policies that govern privacy and really how organizations gather and process information that they get from people. One really important part of this that relates to neural networks is what's called a right to explanation. If a person feels that they have been harmed by a decision made by a neural network, such as when it refuses a loan application, they can sue the organization and demand an explanation. How did it reach that decision? Now, because neural networks are so complicated, they tend to be kind of opaque and it's hard to know what's going on. You may have a difficult time explaining exactly how it got there. That's a problem, because there are some very stiff fines associated with violations of the GDPR. So, you will want to get a little more information on this. If you're going to be using neural networks, you owe it to yourself to spend a little bit of time on the social context and on the legal context of how these things work. But the most exciting thing about them is the amazing progress that's been made in machine learning over just the past few years with neural networks and deep-learning neural networks, in particular, to model the general processes going on in the human brain, and to be able to reach some very, very sophisticated and highly accurate conclusions based on that processing.
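And to make the idea of input, hidden, and output layers a bit more concrete, here's a minimal sketch using TensorFlow's Keras API, which came up earlier as a common deep learning library. The layer sizes, the random stand-in "pixel" data, and the three output classes are all assumptions chosen just to keep the example small and runnable, not a recipe for a real model.

```python
# Input layer -> hidden layers -> output layer, sketched with TensorFlow's Keras
# API. The 64 made-up "pixel" values per example, the layer sizes, and the three
# output classes are arbitrary assumptions to keep this small and runnable.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.random((1000, 64)).astype("float32")   # stand-in for flattened images
y = rng.integers(0, 3, 1000)                   # stand-in labels, one of 3 classes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),                    # input layer: raw values
    tf.keras.layers.Dense(32, activation="relu"),   # hidden layer
    tf.keras.layers.Dense(16, activation="relu"),   # deeper hidden layer
    tf.keras.layers.Dense(3, activation="softmax"), # output layer: classification
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=3, verbose=0)   # feedforward training pass over the data
print(model.predict(X[:1]))            # class probabilities for one example
```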
There was a time just a few years ago when data science and big data were practically synonymous terms, as were semi-magical words like Hadoop that brought up all the amazing things happening in data science. But things are a little different now, so it's important to distinguish between the two fields. I'll start by reminding you what we're talking about when we talk about big data. Big data is data that is characterized by any or all of three characteristics: unusual volume, unusual velocity, and unusual variety. Again, any of these, singly or together, can constitute big data. Let me talk about each of these in turn. First, volume. The amount of data that's become available even over the last five years is really extraordinary. Things like customer transactions at the grocery store: the databases that track these transactions and compile them for consumer loyalty programs have hundreds of billions of rows of data on purchases. GPS data from a phone includes information from billions of people constantly throughout the day. Or scientific data, for instance, this image of the black hole in Messier 87 from the Event Horizon Telescope that was released in April of 2019. It involved half a ton of hard drives that had to be transported on airplanes to central processing locations because that was several times faster than trying to use the internet. Any one of these is an overwhelming dataset for normal methods, and that brought about some of the most common technologies associated with big data, distributed file systems like Hadoop, that made it possible to take these collections that were simply too big to fit on any one computer, any one drive, spread them across many, and still be able to integrate them in ways that let you get collective intelligence out of them. Then there's velocity. The prime culprit in this one is social media. YouTube gets 300 hours of new video uploaded every single minute. That gets about five billion views per day. Instagram had 95 million posts per day, and that was back in 2016 when it only had half as many users as it does now. And Facebook generates about four petabytes of data per day. The data is coming in so fast, it's a fire hose that no common methods that existed before the big data revolution could handle. This required new ways of transporting data, integrating data, and being able to update your analyses constantly to match the new information. And then finally there's the variety, probably one of the most important elements of big data. That includes things like multimedia data, images, and video, and audio. Those don't fit into spreadsheets. Or biometric data, facial recognition, your fingerprints, your heart readings, and even the way you move the mouse on your computer to find the cursor; that's a distinctive signature that can be recorded and identified for each user. And then there's graph data. That's the data about social networks and the connections between people. That requires a very special kind of database. Again, it doesn't fit into the regular rows and columns of a conventional dataset. So all of these posed extraordinary challenges for simply getting the data in, let alone knowing how to process it in useful ways. Now it is possible to distinguish big data and data science. For instance, you can do big data without necessarily requiring the full toolkit of data science, which includes computer programming, math and statistics, and domain expertise.
So for instance, you might have a large dataset, but if it's structured and very consistent, maybe you don't have to do any special programming. Or you have streaming data. It's coming in very fast, but it only has a few variables, a few kinds of measurements. Again, you can set it up once and kind of run with it as you go. And so that might be considered big data, but it doesn't necessarily require the full range of skills of data science. You can also have data science without big data. And that's anytime you have a creative combination of multiple datasets, or you have unstructured text like social media posts, or you're doing data visualization. You may not have large datasets with these, but you're definitely going to need the programming ability and the mathematical ability, as well as the topical expertise, to make these work well. So now that I've distinguished them, I want to return to one particularly important question. You can find this on the internet. And the question is, is big data dead? Because interest in it peaked about four or five years ago, and it looks like it's been going down since then. So is big data passé? Is it no longer there? Well, it's actually quite the opposite. It turns out that big data is alive and well. It's everywhere. It has simply become the new normal for data. The practices that it introduced, the techniques that it made possible, are used every single day now in the data science world. And so while it's possible to separate big data and data science, the two have become so integrated now that big data is simply taken for granted as an element of the new normal in the data world.
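To give a feel for how that distributed approach looks in practice, here's a minimal sketch using PySpark, one of the tools that grew out of the Hadoop ecosystem. It assumes PySpark is installed, and the tiny in-memory dataset is only a stand-in for the billions of rows these systems are built for.

```python
# The distributed idea behind Hadoop-style tools, sketched with PySpark: the same
# group-and-aggregate logic runs on a laptop or across a cluster. Assumes PySpark
# is installed; the tiny dataset is a stand-in for billions of rows.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-sketch").getOrCreate()

purchases = spark.createDataFrame(
    [("ana", 12.50), ("ben", 7.25), ("ana", 30.00)],
    ["customer", "amount"],
)

# Spark splits this work across however many machines are available
totals = purchases.groupBy("customer").agg(F.sum("amount").alias("total"))
totals.show()

spark.stop()
```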
When a person is convicted of a crime, a judge has to decide what the appropriate response is and how that might help bring about positive outcomes. One interesting thing that can contribute to that is what's called restorative justice. This is a form of justice that focuses on repairing the harm done as opposed to punishment, and it often involves, at the judge's discretion and the victim's desire, mediation between the victim and the offender. Now one of the interesting things about this is it's a pretty easy procedure, and it has some very significant outcomes. Participating in restorative justice predicts improved outcomes on all of the following. People feel that they were able to tell their story and that their opinion was considered. They feel that the process or outcome was fair. They feel that the judge or mediator was fair. They feel that the offender was held accountable. An apology or forgiveness was offered. There's a better perception of the other party at the end of all of this. The victim is less upset about the crime. The victim is less afraid of revictimization. Those are absolutely critical. And then one more is that there's a lower recidivism rate. Offenders who go through restorative justice are less likely to commit crimes again in the future. All of these are very significant outcomes, and they can be predicted with this one relatively simple intervention of restorative justice. And so when a judge is trying to make a decision, this is one thing they can keep in mind in trying to predict a particular outcome.

Now in the world of predictive analytics, where you're using data to try to predict outcomes, the restorative justice example is a very simple one based on simple analysis. Within data science and predictive analytics, you'll see more complicated things like, for instance, whether a person is more likely to click on a particular button or make a purchase based on a particular offer. You're going to see medical researchers looking at things that can predict the risk of a disease, as well as the responsiveness to particular treatments. You'll also look at things like the classification of photos, and what's being predicted there is whether a machine can accurately predict what a human would do if they did the same particular task. These are all major topics within the field of predictive analytics.

Now the relationship between data science and predictive analytics is roughly like this. Data science is there, predictive analytics is there, and there's a lot of overlap. An enormous amount of the work in predictive analytics is done by data science researchers. There are a few important meeting points at that intersection between the two. First, predictions that involve difficult data: if you're using unstructured data like social media posts or video that doesn't fit into the nice rows and columns of a spreadsheet, you're probably going to need data science to do that. Similarly, predictions that involve sophisticated models, like the neural network we have here, require some really high-end programming to make them happen. And so data science is going to be important to those particular kinds of predictive analytics projects. On the other hand, it's entirely possible to do predictions without the full data science tool kit. If you have clean, quantitative data sets, nice rows and columns of numbers, then you're in good shape. And if you're using a common model like a linear regression or a decision tree, both of which are extremely effective, they're also pretty easy to do and pretty easy to interpret. So in these situations, you can do useful and accurate predictions without having to have the entire background of data science. Also, it's possible to do data science without necessarily being involved in the business of predictions. If you're doing things like clustering cases, or counting how often something happens, or mapping like what we see here, or a data visualization, these can be significant areas of data science, depending both on the data that you're bringing in and the methods that you're using. But they don't involve predictions per se, and so what this lets you know is that while data science can contribute significantly to the practice of predictive analytics, they are still distinguishable fields, and depending on your purposes, you may or may not need the full range of data science skills, the full tool kit, to get to your predictive purposes. But either way, you're going to be able to get more insight into how people are likely to react and how you can best adapt to those situations.
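Since linear regression and decision trees came up as the kind of common, easy-to-interpret models you can use on clean, quantitative data, here's a minimal sketch of that sort of prediction in Python with scikit-learn. The column names and the synthetic data are hypothetical, just to keep the example self-contained.

```python
# A prediction that doesn't need the full data science toolkit: clean
# rows-and-columns data plus a small decision tree. Column names and synthetic
# data are hypothetical.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pages_viewed": rng.integers(1, 20, 500),
    "prior_purchases": rng.integers(0, 10, 500),
})
# Invented outcome: more engaged visitors are more likely to buy
df["bought"] = (df["pages_viewed"] + 2 * df["prior_purchases"]
                + rng.normal(0, 5, 500)) > 15

X_train, X_test, y_train, y_test = train_test_split(
    df[["pages_viewed", "prior_purchases"]], df["bought"], random_state=0
)

tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print("Accuracy on held-out visitors:", tree.score(X_test, y_test))
```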
It's an article of faith for me that any organization will do better by using data to help with their strategy and with their day-to-day decisions. But it reminds me of one of my favorite quotes from over 100 years ago. William James was one of the founders of American psychology and philosophy, and he's best known for functionalism in psychology and pragmatism in philosophy, and he had this to say: "My thinking is first and last and always for the sake of my doing." That was summarized by another prominent American psychologist, Susan Fiske, as, "Thinking is for doing." The point is, when we think, the way that our brain works, it's not just there because it's there, it's there to serve a particular purpose. And I think the same thing is true about data and data science in general. In fact, I like to say data is for doing. The whole point of gathering data, the whole point of doing the analysis, is to get some insight that's going to allow us to do something better. And truthfully, business intelligence is the one field that epitomizes this goal. Business intelligence, or B.I., is all about getting the insight to do something better in your business. And business intelligence methods, or B.I. methods, are pretty simple. They are designed to emphasize speed, and accessibility, and insight, right there. You can do them on your tablet, you can do them on your phone. And they often rely on structured dashboards. Maybe you do a social media campaign, and you can go and see the analytics dashboard. Or you have videos on YouTube, or Vimeo, or someplace. You can get the analytics and see how well it's performing, who's watching it and when. That's a business intelligence dashboard of a form. So, if this is all about the goal of data, that data is for doing, and B.I. does that so well, where does data science come in to all of this? Well, it actually comes in a little earlier in the picture. Data science helps set things up for business intelligence, and I'll give you a few examples. Number one, data science can help tremendously in collecting, and cleaning, and preparing, and manipulating the data. In fact, some of the most important developments in business intelligence come from companies like Domo, whose major selling point is the way that they ingest and process the information to make it easily accessible to other people. Next, data science can be used to build the models that predict the particular outcomes. So, you will have a structure there in your data that will be doing, for instance, a regression, or a decision tree, or some other model to make sense of the data. And while a person doesn't have to specifically manipulate that, it's available to them, and that's what produces the outcomes that they're seeing. And then finally, two of the most important things you can do in business intelligence are to find trends, to predict what's likely to happen next, and to flag anomalies. This one's an outlier, something may be wrong here, or we may have a new case with potential hidden value. Any one of those is going to require some very strong data science to do it well. Even if the user-facing element is a very simple set of graphs on a tablet, the data science goes into the preparation and the offering of the information. And so really, I like to think of it this way: Data science is what makes business intelligence possible. You need data science to get the information together from so many different sources, and sometimes doing complex modeling.
And also, I like to think that business intelligence gives purpose to data science. It's one of the things that helps fulfill the goal-driven, application-oriented element of data science. And so, data science makes B.I. possible, but B.I. shows, perhaps better than anything else, how data science can be used to make practical decisions that help organizations function more effectively and more efficiently.
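To give a feel for what that behind-the-dashboard work can look like, here's a minimal sketch in Python of the "flag anomalies" idea: a simple z-score rule over daily sales. The numbers are made up for illustration; a real pipeline would read from whatever data source feeds the dashboard.

```python
# Flagging anomalies behind a BI dashboard: a simple z-score rule over daily
# sales. The numbers are invented; a real pipeline would read from the data
# warehouse that feeds the dashboard.
import pandas as pd

daily_sales = pd.Series(
    [102, 98, 110, 105, 99, 310, 101, 97],   # one suspicious spike
    index=pd.date_range("2024-01-01", periods=8, freq="D"),
)

z = (daily_sales - daily_sales.mean()) / daily_sales.std()
flags = z.abs() > 2   # simple anomaly rule: more than 2 standard deviations out

print(pd.DataFrame({"sales": daily_sales, "z_score": z.round(2), "anomaly": flags}))
```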
When we think about Artificial Intelligence and how it works, and how it might make decisions and act on its own, we tend to think of things like this. You've got the robot holding the computer right next to you. But the fact is, most of the time when we're dealing with Artificial Intelligence, it's something a lot closer to this. Nevertheless, I want to suggest at least four ways that working data science can contribute to the interplay of human and Artificial Intelligence, of personal and machine agency. The first is what I call simple Recommendations. And then there's Human-in-the-Loop decision making. Then Human-Accessible decisions, and then Machine-Centric processing and action. And I want to talk a little more about each of these. Let's start with Recommendations. This is where the algorithm processes your data and makes a recommendation or suggestion to you, and you can either take it or leave it. A few places where this approach shows up are things like, for instance, online shopping, where you have a recommendation engine that says, "Based on your past purchase history, you might want to look at this." Or, the same thing with online movies or music: it looks at what you did, it looks at what you like, and it suggests other things. And you can decide whether you want to pick up on that or not. Another one is an online News Feed. This says, "Based on what you've clicked in the past and the things that you've selected, you might like this." It's a little bit different, because this time it's just a yes or no decision. But it's still up to you what you click on. Another one is Maps, where you enter your location and it suggests a route to you based on traffic, based on time, and you can follow it or you can do something else if you want. But in all of these, data science is being used to take, truly, a huge amount of information about your own past behavior, about what other people have done under similar circumstances, and how that can be combined to give the most likely recommendations to you. But the agency still rests with the human. They get to decide what to do. Next is Human-in-the-Loop decision making. And this is where advanced algorithms can make and even implement their own decisions, as with self-driving cars. And I remember the first time my car turned its steering wheel on its own. But humans are usually at the ready to take over, if needed. Another example might be something as simple as spam filters. You go in every now and then, and you check up on how well it's performing. So, it can do it on its own, but you need to be there to take over just in case. A third kind of decision making in the interplay between the algorithm and the human is what I call Human-Accessible decision making. Many algorithmic decisions are made automatically, and even implemented automatically. But they're designed such that humans can at least understand what happened in them. Such as, for instance, with an online mortgage application. You put the information in, and it can tell you immediately whether you're accepted or rejected. But because of recent laws, such as the European Union GDPR, that's the General Data Protection Regulation, the organizations who run these algorithms need to be able to interpret how it reached its decision. Even if they're not usually involving humans in making the decisions, the process still has to be open to humans. And then finally, there's Machine-Centric. And this is when machines are talking to other machines.
And the best example of this is the internet of things. And that can include things like Wearables. My smart watch talks to my phone, which talks to the internet, which talks to my car, sharing and processing data at each point. Also Smart Homes. You can say hello to your Smart Speaker, which turns on the lights, adjusts the temperature, starts the coffee, plays the news, and so on. And there are Smart Grids, which allow, for example, for two-way communication between a power utility and the houses or businesses it serves. That allows more efficient routing of power, recovery from blackouts, integration with consumer-generated power, and so on. The important thing is that this last category, the Machine-Centric decisions or the internet of things, is starting to constitute an enormous amount of the data that's available for data science work. But all of these approaches, from the Simple Recommendations up to the Machine-Centric, show the different kinds of relationships between data, human decision makers, and machine algorithms, and the conclusions that they reach. Each of them is going to work in different circumstances. And so it's your job, as somebody who may be working in data science, to find the best balance of the speed and efficiency of machine decision making and respect for the autonomy and individuality of the humans that you're interacting with.
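To make the simple Recommendations idea from the start of this discussion a bit more concrete, here's a minimal sketch in Python of a "people who bought this also bought" approach. The purchase histories are invented purely for illustration, and real recommendation engines are far more sophisticated than this co-occurrence count.

```python
# A bare-bones "people who bought this also bought" recommender: count how often
# items appear in the same basket, then suggest the most frequent companions.
# The purchase histories are invented; real engines are far more sophisticated.
from collections import Counter
from itertools import combinations

purchase_histories = [
    {"guitar", "picks", "strap"},
    {"guitar", "strings", "picks"},
    {"keyboard", "stand"},
    {"guitar", "strings"},
]

# Count how often each pair of items shows up in the same basket
co_counts = Counter()
for basket in purchase_histories:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1

def recommend(item, top_n=3):
    """Items most frequently bought together with `item`."""
    scores = Counter()
    for (a, b), n in co_counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common(top_n)]

print(recommend("guitar"))   # e.g. ['picks', 'strings', 'strap']
```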
Data science can make you feel like a superhero who's here to save the world, or at least your business's world. But an alarming amount of data work can also end up in court or on the wrong side of a protest, so I want to talk about a few things that can help keep you, your company, and your data science work on the up and up. First, there are some important legal issues. It used to be, when data science first came about, oh, ten years ago, that we were in something of a Wild West and people just did what they wanted. But now we've had some major developments in the legal frameworks that govern data and its use. Probably the most important of these is an entire collection of privacy laws, the most significant of which at the moment is the GDPR, the European Union's General Data Protection Regulation. This is a privacy law with some very serious teeth: it can potentially levy fines of billions of dollars on companies that seriously violate its policies. It's why you see so many cookie notices and why opting in becomes such an important thing when you go to a website. In the United States, there are other regulations that also affect privacy, things like HIPAA, the Health Insurance Portability and Accountability Act, and FERPA, the Family Educational Rights and Privacy Act. And then there are state laws like the California Consumer Privacy Act. All of these place serious restrictions on how you can gather data, the consent you need to get from people, what you can do with the data, whether people can request it, and whether they can be forgotten from the system, and it's important for you to keep all of these in mind so you don't end up crossing a very significant line in your work. And remember, it's not just the 31 countries of the European Union and the European Economic Area that have these kinds of privacy laws. They're spreading all over the world, with more coming every day; an awful lot of the world is covered by regulations that you need to keep in mind when doing your work. The next thing is ethical issues. These may or may not be written into law, but they have a lot to do with how your work is perceived. For instance, there are the three forms of fairness. We talked about distributing things by equity, that is, in proportion to some kind of input; by equality, where everybody gets the same amount; or by need, where the people who need the most get the most. There are also forms of justice. This includes distributive justice, the actual things that you end up with; procedural justice, how the decisions are made; and interactional justice, how the decision is communicated and how people are involved in it. And then there are issues of authenticity: you need to know who and what you're dealing with. None of these is specifically a legal issue, but they have a very big impact on whether people feel that your organization and your work are ethical and can be trusted, and whether they want to engage with you. And that finally gets to the social issues. Whenever you or your company or clients engage with people, you need to engage with them with respect.
People don't like getting fooled, they don't like getting exploited, and they don't like getting ignored, and all of those are serious risks when working with data. Don't forget that unpopular projects can lead to protests by the general populace and to walkouts by the people in your own company. There's a lot more that could be said about every one of these elements, and all of them can have an impact on your ability to use data to do something that's productive for your company, not just in the short term but also in the long term, in a way that is sustainable and respectful of the people and the environment you work in.
Anybody who's cooked knows how time-consuming food prep can be, and that's to say nothing of actually going to the market, finding the ingredients, and putting things together in bowls and sorting them, let alone cooking the food. It turns out there's a similar kind of thing that happens in data science, and that's the data preparation part. The rule of thumb is that 80% of the time on any data science project is typically spent just getting the data ready; data preparation is 80%, and everything else falls into about 20%. That can seem massively inefficient, and you may wonder what your motivation is to go through something so time-consuming, really, this drudgery. Well, if you want it in one phrase, it's GIGO: garbage in, garbage out. That's a truism from computer science. The information you get from your analysis is only as good as the information you put into it. And to put it in starker terms, there's a wonderful phrase from Twitter: most people who think they want machine learning or AI really just need linear regression on cleaned-up data. Linear regression is a very basic, simple, and useful procedure, and the point, as a rule of thumb, is that if your data is properly prepared, then the analysis can be quick, clean, and easy to interpret. Now, when it comes to data preparation in data science, one of the most common phrases you'll hear is tidy data, which seems a little bit silly, but the concept comes from data scientist Hadley Wickham, and it refers to a way of setting your data up so it can be easily imported into a program and easily organized and manipulated. It revolves around some very basic principles. Number one, each column in your file is equivalent to a variable, and each row in your file is a case or observation. Also, you should have one sheet per file; an Excel workbook can have lots of different sheets in it, but a CSV file has only one. And each file should have just one level of observation. So you might have one file on orders, another on SKUs, another on individual clients, another on companies, and so on. If you do this, it makes it very easy to import the data and get the program up and running. Now, this may seem really obvious, and you might ask why we even have to explain it. It's because data in spreadsheets frequently is not tidy. You have titles, images, figures, and graphs; you have merged cells; you have color used to indicate some data value; you have sub-tables within the sheet, summary values, and comments and notes that might actually contain important data. All of that can be useful if you're never going beyond that particular spreadsheet, but if you're trying to take it into another program, all of it gets in the way. And then there are other problems that show up in any kind of data. For instance, do you actually know what the variable and value labels are? Do you know what the name of this variable is? 'Cause sometimes they're cryptic. Or what does a three on employment status mean? Do you have missing values where you should have data? Do you have misspelled text? If people are writing down the name of the town they live in or the company they work for, they can write it in an almost infinite number of ways.
Or in a spreadsheet, it's not uncommon for numbers to accidentally be represented as text, and then you can't do numerical manipulations with them. Then there's the question of what to do with outliers, and then there's metadata: where did the data come from? Who's in the sample? How was it processed? All of this is information you need in order to have a clean dataset, one where you know the context and the circumstances well enough to analyze it. And that's to say nothing of trying to get data out of things like scanned PDFs or printed tables and graphs, all of which require either a lot of manual transcription or a lot of very fancy coding. Even take something as simple as emojis, which are now a significant and meaningful piece of communication, especially in social media. Take the rolling-on-the-floor-laughing emoji: there are at least 17 different ways of coding it digitally. If you're going to use this as information, you need to prepare your data so that all of those variants are coded in one single way, so you can then look at the summaries all together and try to get some meaning out of them. I know it's a lot of work, but just as food prep is a necessary step to get something beautiful and delicious, data prep is a necessary, vital step to get something meaningful and actionable out of your data. So give it the time and the attention it deserves, and you'll be richly rewarded.
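To give a feel for what this kind of preparation looks like in practice, here's a minimal sketch using pandas in Python. The file name and column names are hypothetical, and real cleaning usually involves many more steps; this just illustrates the flavor of the work.

```python
import pandas as pd

# Hypothetical file and column names, purely for illustration.
df = pd.read_csv("survey.csv")

# Numbers that arrived as text won't support arithmetic; coerce them,
# turning anything unparseable into a missing value you can inspect later.
df["income"] = pd.to_numeric(df["income"], errors="coerce")

# Free-text fields like a town name get written countless ways;
# trimming whitespace and lower-casing removes the most common variants.
df["town"] = df["town"].str.strip().str.lower()

# Check for missing values and drop exact duplicate rows.
print(df.isna().sum())
df = df.drop_duplicates()

# Tidy-data reshaping: if one column per year really holds one variable
# ("sales") observed in different years, melt those columns into rows.
tidy = df.melt(id_vars=["town"], value_vars=["sales_2018", "sales_2019"],
               var_name="year", value_name="sales")
```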
Data science projects can feel like an epic expedition or a massive group project. But sometimes you can get started right here, right now. That is, your organization may already have the data you need, and there are a few major advantages to using this kind of in-house data. The first is that it's fast; it's the fastest way to start because it's right there and ready to go. An interesting one is that certain restrictions on data may not apply to use that stays entirely within the boundaries of your organization. So if you have data that includes individual identifiers, you may be able to use it for your organization's own research. Next, you may actually be able to talk with the people who gathered the data in the first place. You can ask them questions; they can tell you how they sampled it, what things mean, and why they did it in this particular way, and all of that can save you an enormous amount of time and headaches. Also, it may be in the same format that you currently need, which means the pieces may fit together perfectly. They may use the same codes, the same software, the same standards and style guides, and that can save you a lot of time. On the other hand, there are some potential downsides to in-house data. Number one, if it was an ad hoc project, it may not be well documented; they may have just thrown it together and never quite written down all the information about it. It may not be well maintained; maybe they gathered it five years ago and have let it slip since then. And the biggest one is that the data simply may not exist. Maybe what you need really isn't there in your organization, and so in-house isn't an option in that particular case. So there are some potential downsides, but the benefits are so big and so meaningful that it's always worth taking a few minutes to look around, see what's there, and get started on your project right now.
When you draw a picture or write a letter, chances are that you can draw well with one of your hands, your dominant hand, and not so much with the other. I recently heard someone describe this as having a well-developed API for your dominant hand but only a clunky one for the non-dominant hand. An API, or Application Programming Interface, isn't a source of data but rather a way of sharing data; it can take data from one application to another or from a server to your computer. It's the thing that routes the data, translates it, and gets it ready for use. Let me give you a simple example of how this works. There's a website called JSON Placeholder; JSON stands for JavaScript Object Notation, which is a data format. With a tiny piece of code that says, in effect, go to this web address, get the data there, and include it, your application pulls the data in. And if you go to that web address directly in a browser, you'll see the same data in JSON format. You can include this kind of call in a Python script or an R script or some other application that you're developing. It brings the data in and allows you to get up and running very, very quickly. Now, APIs can be used for a million different things; three very common categories are social APIs, which allow you to access data from Twitter or Facebook and other sources, as well as use them as logins for your own sites; utilities, things like Dropbox and Google Maps, so you can include that information in your own apps; and commerce, like Stripe for processing payments, MailChimp for email marketing, Slack, or a million other applications. The data can be open, which means all you need is the address to get it, or it may be proprietary, meaning you need a subscription or a purchase and then a login. But the general process is the same: you include this bit of code and it brings the data in and gets you up and running. You can then use that data in analysis, so it becomes one step of a data science project, or maybe you're creating an app; you can make a commercial application that relies on data it pulls from several different APIs, like weather and directions. Really, the idea here is that APIs are teamwork. They facilitate the process of bringing things together and then adding value to your analysis and to the data-science-based services that you offer.
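As a concrete illustration, here's a minimal sketch of calling an API from Python with the requests library, using the public JSON Placeholder service mentioned above; the particular endpoint and field names are just illustrative.

```python
# A minimal sketch: fetch data from a public test API and use it in Python.
import requests

response = requests.get("https://jsonplaceholder.typicode.com/users")
response.raise_for_status()      # stop early if the request failed

users = response.json()          # JSON becomes ordinary Python lists/dicts
for user in users[:3]:
    print(user["name"], "-", user["email"])
```

The same pattern, a request, a check, and a conversion from JSON, sits behind most API work, whether the data is open or behind a login.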
When people think about data science, machine learning, and artificial intelligence, the talk turns almost immediately to tools: programming languages and sophisticated computer setups. But remember, the tools are simply a means to an end, and even then only part of it. The most important part of any data science project, by far, is the question itself, and the creativity that goes into exploring that question and working to find possible answers using the tools that best match it. Sometimes those tools are simple ones. It's good to remember, even in data science, that we should start with the simple and not move on to the complicated until it's necessary. For that reason, I suggest we start with data science applications. You may wonder, "Why apps?" Well, number one, they're more common and generally more accessible, so more people are able to use them. They're often very good for exploring and browsing the data. And they can be very good for sharing, again because so many people have them and know how to use them. By far the most common application for data work is the humble spreadsheet, and there are a few reasons why. Number one, I consider spreadsheets the universal data tool. It's my untested theory that there are more datasets in spreadsheets than in any other format in the world. The rows and columns are familiar to a very large number of people, and they know how to explore and access the data using those tools. The most common by far is Microsoft Excel in its many versions; Google Sheets is also extremely common, and there are others. The great thing about spreadsheets is that they're good for browsing: you sort through the data, you filter the data, and it's really easy to get a hands-on look at what's going on. They're also great for exporting and sharing the data. Any program in the world can read a .csv, or "comma-separated values," file, which is the generic version of a spreadsheet. Your client will probably give you the data in a spreadsheet and will probably want the results back in a spreadsheet. You can do what you want in between, but that spreadsheet is going to serve as the common ground. Another very common data tool, even though it's really a language rather than an application, is S-Q-L or SQL, which stands for Structured Query Language. This is a way of accessing data stored in databases, usually relational databases, where you select the data, specify the criteria you want, and combine and reformat it in the ways that work best. You only need maybe a dozen or so commands in SQL to accomplish the majority of tasks, so a little bit of familiarity with SQL goes a very long way. Then there are the dedicated apps for visualization. That includes things like Tableau, in its desktop, public, and server versions, and Qlik. One of their great strengths is that they facilitate data integration: they bring in data from lots of different sources and formats and put it together in a pretty seamless way. Their purpose is interactive data exploration, to click on groups, to drill down, to expand what you have, and they're very, very good at that. And then there are apps for data analysis, applications that are specifically designed for point-and-click data analysis.
And I know a lot of data scientists think that coding is always better for everything, but the point-and-click graphical user interface makes things accessible to a very large number of people. This includes common programs like SPSS, or JASP, or my personal favorite, jamovi. JASP and jamovi are both free and open source, and what they do is make the analysis friendly. Again, the more people you can get working with data, the better, and these applications are very good at democratizing data. But whatever you do, just remember to stay focused on your question, and let the tools and the techniques follow from it. Start simple, with the basic applications, and move on only as the question requires it. That way, you can be sure to find the meaning and the value in your data.
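To make the SQL point above concrete, here's a minimal sketch of the kind of query that covers a lot of everyday work. It runs against an in-memory SQLite database from Python so it's self-contained; the table and column names are made up for illustration.

```python
# A handful of SQL commands (CREATE, INSERT, and SELECT with WHERE,
# GROUP BY, and ORDER BY) covers a surprising amount of daily work.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("Ama", 120.0), ("Kofi", 80.0), ("Ama", 45.5)])

rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 50          -- filter the rows you want
    GROUP BY customer          -- aggregate per customer
    ORDER BY total DESC        -- sort by the computed total
""").fetchall()

for customer, total in rows:
    print(customer, total)

conn.close()
```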
I love the saxophone. I play it very badly, but I love to listen to it. One of the curious things about the saxophone is that if you want to play gigs professionally, you can't play just the saxophone; at the very least you have to play both the alto and tenor saxophones as well as the flute and the clarinet, and for other gigs you may need to be able to play oboe, English horn, bassoon, bass clarinet, and even recorder and crumhorn, like one of my teachers. You have to be a musical polyglot. I mention this because one of the most common questions in data science is whether you should work in Python or in R, two very common languages for working with data. The reason this question comes up is that programming languages give you immense control over your work in data science. You may find that your questions go beyond the capabilities of data analysis applications, and so the ability to create something custom-tailored that matches your needs exactly, which is the whole point of data science in the first place, is going to be critical. But let me say something more about Python and R. Python is currently the most popular language for data science and machine learning. It's a general-purpose programming language: you can do anything with Python, and people do enormous numbers of things with it that are outside of data science. Also, Python code is very clean and very easy to learn, so there are some great advantages to Python. R, on the other hand, is a programming language that was developed specifically for work in data analysis, and it is still very popular with scientists and researchers. Now, there are some important technical differences between the two, such as the fact that R works natively with vectorized operations and has non-standard evaluation, while Python manages memory and large datasets better in its default setup. But neither of those is fixed; both can be adapted to do other things. And really, the rule of thumb here is that just as any professional saxophonist needs to be able to play several different instruments, any professional data scientist needs to be able to work comfortably in several different languages. Those languages can include both Python and R; they can include SQL, or Structured Query Language; or Java, Julia, Scala, or MATLAB. All of these serve different purposes, and they overlap, but depending on the question you're trying to answer, the kind of data you have, and the level at which you're working, you may need to work with some, many, or all of these. Now, I do want to mention one other reason why programming languages are so helpful in data science, and that's because you can expand their functionality with packages. These are collections of code that you can download that give extra functionality or facilitate the entire process of working with data, and often it's the packages that are more influential than the actual language. Take something like TensorFlow, which makes it much easier to build deep learning neural networks; you can use it in Python or in R, and it's going to facilitate your work. But no matter what language and what packages you use, the programming languages used in data science give you a really fine level of control over your analysis and let you tailor it to the data and to the questions that you have.
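As a small illustration of the vectorized-operations idea mentioned above, here's a minimal sketch using NumPy in Python; R applies the same idea natively to its built-in vectors. The numbers are made up.

```python
# Vectorized operations: one expression works on whole arrays at once,
# with no explicit loop over the individual elements.
import numpy as np

prices = np.array([10.0, 12.5, 8.0, 15.0])
quantities = np.array([3, 1, 4, 2])

revenue = prices * quantities    # element by element: 30, 12.5, 32, 30
total = revenue.sum()            # 104.5

print(revenue, total)
```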
One of the things that is most predictable about technology is that things get faster, smaller, easier, and better over time. This is the essence of Moore's Law, which originally referred just to the density of transistors on circuits doubling every two years. But think, for instance, of the women who worked on ENIAC, the Electronic Numerical Integrator and Computer, the first electronic general-purpose computer, back in 1945. It was huge: it filled a room, and it took a whole team of people to run it. Then things evolved, to colorful reel-to-reel computers, then to your desktop Macintosh (I still have my Classic II), and before you know it, you're running your billion-dollar tech company from your cell phone. One of the most important developments in the internet era has been SaaS, or software as a service. Think of any time you've used an online application like Excel Online instead of an application installed locally on your computer, like the regular desktop version of Excel. That makes the application more accessible, because anybody can get to it from any machine connected to the internet, and really more useful to people. Well, a similar revolution is happening now in data science with machine learning as a service. MLaaS isn't as easy to pronounce, but it's a way of making the entire process of data science, machine learning, and artificial intelligence easier to access, easier to set up, and easier to get going. All the major cloud data providers have announced machine learning as a service offerings. For instance, there's Microsoft Azure ML, Amazon Machine Learning, Google AutoML, and IBM Watson Analytics, and all of these have some closely related advantages, things that make your life a lot easier. Number one, they put the analysis where the data is stored. You've got your massive and complex datasets, but they're stored on the servers of each of these services, so you don't have to export and import; you go right to where the data is. Also, most of them give you very flexible computing resources. You don't have to purchase the hardware; you can rent CPUs and GPUs, RAM, and hard drive space as you need them. They also frequently give you a drag-and-drop interface that makes the programming and the setup of the analysis dramatically easier. Again, the point is that this democratizes the process; it makes it more accessible for more people in more circumstances, and that is part of the promise of data science. Now, it's too early to project very far into the future and see what all of the exact consequences of this revolution will be, especially because new services are being announced and major changes are being made all the time. But the idea with machine learning as a service is that it puts the analysis where the data is and makes it more open and more available to more people. Again, it's the democratization of data science, and that's the promise of machine learning as a service.
This may be the end of this article, but it's just the beginning for you, and so it's time to start making plans for how you are going to produce your own data science revolution. Remember that this article is designed to be a foundation, an introduction that gives you a feel for the breadth and depth of data science. I haven't been able to go into detail on these topics, but that's okay. Consider learning new things, like how to program in Python or R, how to work with open data, or how to build algorithms for machine learning and artificial intelligence. Any of these would be fantastically useful tools and approaches in data science. Also, learn how to apply the things you've worked with. Read some articles on data-driven decision making in business settings, and get more information on business strategy and how the information you use can help any organization make better decisions in its own operations. Then get information on how people work and on the elements that are most important in fields like marketing, nonprofits, healthcare, or education, or whatever is of greatest use and interest to you. Also, get connected with the actual people in data science. Go to conferences; there are so many different ways you can do this. For example, you can go to large, national, general-topic conferences like the Strata Data Conferences or one of my favorites, ODSC, the Open Data Science Conference. Both of these meet in multiple locations throughout the year, so there's likely to be one near you that you can attend. Or maybe you have a specialized interest; for instance, I've gone to the Marketing Analytics and Data Science, or MADS, conference, which focuses very specifically on data science within the marketing realm. Or, where you live, you may have local events. I'm in Ghana, and each year we have the Ghana Data Science Summit, which has tracks focused on artificial intelligence, machine learning, and data science. But remember, all of this is fundamentally about people, and about the way that you can both better understand the people and the world around you and turn that understanding into something of value for the people you work with. So, thanks for reading this article, and good luck with the great things that you'll accomplish.