Introduction to data science

April 28th, 2020

There is so much information out there on data science and analytics career paths. However, there is very little guidance on how to take your first step.

I've seen many students and professionals struggle and feel overwhelmed when they begin to pursue a data science career. Data science and analytics can be just as daunting for you, especially when you are new to the field. I'll start by going over the history of data science and the major concepts behind it. Since there is a huge demand in the data science marketplace, we'll cover the skills that your prospective employers are looking for. Then I'll address how you can add value to your team by executing your role extremely well and demonstrating your proficiency through certifications. Your future will be a promising one as long as you keep up with your profession and enjoy the challenges that await you. Let's take our first step.

This article is for those who are interested in knowing more about data science careers but are not sure how to get started. If you're excited about the prospect of working in data mining and analytics, big data, machine learning, and data visualization, I strongly encourage you to read this article. The goal of this article is to clearly define data science, provide you with insight into the data science marketplace, and outline the specific skills you'll need to master as a data scientist. Additionally, this article will provide an overview of the certification opportunities and future technical trends you need to be aware of throughout your journey towards becoming a data scientist.

Have you ever thought about how data science started? Are you familiar with data science and analytics terms? Do you know which enabling technologies are driving innovation in the data science and analytics fields? If your answer to any of these questions is no, then you've come to the right place. Before embarking on your journey in data science and analytics, it is important to know the terrain and to get acclimated as best you can. This way, you will have a soft landing when it's time to make your decisions and seriously pursue your career in data science and analytics.

The origin of data science coincides with the wide adoption of computers. The discipline of statistics existed well before computer science, but computers empowered statisticians to solve a wide variety of practical problems with real-life implications, since heavy number crunching and massive storage of data became feasible with the emergence of modern computing technologies. The invention of database management systems in the 1960s, and relational database management systems in the 1970s, accelerated the pace of this marriage between statistics and computer science. In the late 1980s, terms such as knowledge discovery and data mining started being used widely. In the early 1990s, database industry practitioners started noticing the explosion of business data in the form of big data. The first use of the phrase big data can be traced back to an article published in the ACM Digital Library in 1997. In the late 1990s, the phrase data science first appeared, inspiring researchers and professionals to harness the predictive power of data by analyzing it effectively and producing useful intelligence. At about the same time, the word statistician began to be used interchangeably with the term data scientist. In the mid-2000s, the word analytics was adopted by data scientists to emphasize the fact that an increasing number of companies had started to rely heavily on the statistical and quantitative analysis of data, as well as on predictive modeling, to make informed decisions so that they could compete better with other businesses. As you can see, the history of data science is one of endless scientific and technological innovation to cope with newly emerging challenges as we move into the information age.

Data science is a highly comprehensive term that encompasses a multitude of disciplines and concepts, including big data, machine learning, data mining, and data analytics. Big data is especially relevant to data science these days. Think of the sheer amount of data becoming available to various organizations and individuals today. As a result of this trend, data science increasingly has to deal with big data. Essentially, big data refers to a data set whose nature, including its volume, variety, and velocity, defies the conventional ways of processing and requires extraordinary treatment. Therefore, big data is a relative term. It is a moving target. One terabyte may be considered big today, but it may not be in the near future as storage and processing technologies become cheaper and faster. Machine learning frees humans from the mundane task of trying numerous possible ways of solving a problem to isolate the best solution. The relevance of machine learning to data science stems from the fact that humans are not good at repetitive work and are bound to make mistakes when it comes to handling data. This is especially true when the repetition is driven by the size, complexity, and speed of the data, as in the case of big data. Because of the large scale of big data, data science can no longer rely on humans to obtain meaningful insight from it, and is beginning to depend heavily on algorithms that in turn drive computers, hence the name machine learning. Data mining is one aspect of data science. It is the process of discovering a pattern in a data set. At the beginning of a mining process, you don't know what you're looking for. You employ various algorithms, such as those used in machine learning, to unearth a previously unknown pattern or relationship. Therefore, data mining uses machine learning as a tool in its search for new knowledge without any preconceived notions or hypotheses. The data set being used for data mining often reaches the realm of big data. Unlike data mining, data analytics starts with a specific hypothesis. That is, its purpose is to test the hypothesis. For example, a hypothesis used in data analytics could be that social media content such as tweets can predict the risk of heart disease for individuals.

Big data analytics leverages distributed computing technologies and data analytics techniques to overcome the computational challenges presented by big data sets. Distributed computing is an approach used in computer science to break down a task into smaller pieces that are easier to process. "Divide and conquer" is the philosophy behind this classic technique. Once the task is partitioned into smaller chunks, each chunk is assigned to a processor, and the processors can be geographically dispersed. For example, a fragment of your task can be processed in Accra, Ghana, while another piece can be worked on in Lagos. Cloud computing provides a platform on which distributed computing can be implemented in a low-cost and scalable way. Simply put, cloud computing offers a bunch of computers housed in data centers. In addition to the hardware, a software solution is necessary to manage the various aspects of distributed computing. This is why we need software tools such as Hadoop and NoSQL databases. Once you have both the hardware and software infrastructure to store, manage, and process big data sets, you're finally ready to run data analytics programs to ask your specific questions of certain big data sets. These questions can touch on applications like fraud detection, online dating, network security, disease control, and climate change.
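To make the "divide and conquer" idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: threads on a single machine stand in for geographically dispersed processors, and the function names (split_into_chunks, process_chunk, distributed_sum) are hypothetical, not part of Hadoop or any other tool mentioned here.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_count):
    """Partition a list into roughly equal, contiguous chunks."""
    size = max(1, -(-len(items) // chunk_count))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(chunk):
    """Work done independently by one worker: here, a partial sum."""
    return sum(chunk)


def distributed_sum(items, worker_count=4):
    """Divide the task, process the pieces in parallel, combine the results."""
    chunks = split_into_chunks(items, worker_count)
    # Assumption: local threads play the role of remote processors.
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        partial_sums = list(pool.map(process_chunk, chunks))
    return sum(partial_sums)


total = distributed_sum(list(range(1, 101)))
print(total)
```

The same three steps (partition, process independently, combine) are what a real distributed framework would coordinate across many machines, along with concerns this sketch ignores, such as data transfer and failed workers.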

There are a number of underlying technologies that make data science a reality. These include data infrastructure, data management, and visualization technologies. Data infrastructure technologies support how data is shared, processed, and consumed. One of the most popular data infrastructure technologies data scientists use today is distributed computing in general, and cloud computing in particular. There are key underlying technologies that enable cloud computing. Virtualization is one of them; distributed file sharing is another. In particular, the redundant array of independent disks, or RAID, and the Hadoop Distributed File System, or HDFS, are prominent ones. Data management is handled by database management systems, or DBMS. Data science requires highly scalable, reliable, and efficient ways to store, manage, and process data, which is why the DBMS plays a critical role in data science. As big data becomes mainstream, unstructured data is also becoming more prevalent. In fact, the majority of business-related data is unstructured. It consists of word processing documents, presentations, log files, and so on. However, a significant portion of our data is still stored in conventional relational DBMS and in a structured data format. As a result, the new generation of data science professionals has to be versatile enough to deal with both unstructured and structured data sets. Knowledge of SQL is still invaluable in the context of data management. Once data analysis is over, the newly acquired insight needs to be conveyed to the leadership and the rest of the organization, and this is where visualization comes in. No matter how significant the discoveries are, if data scientists fail to communicate them effectively, especially in the context of the strategic goals of the organization, their impact will be minimal. This completely defeats the purpose of the data science efforts made in support of the organization.
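As a small illustration of why SQL remains useful for structured data, the following sketch uses Python's built-in sqlite3 module with an in-memory database. The table name, columns, and rows are made up for this example.

```python
import sqlite3

# In-memory database; the table and rows are invented for illustration.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)"
)
connection.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", "east", 120.0),
        ("bob", "west", 80.0),
        ("carol", "east", 60.0),
    ],
)

# Aggregate structured data with a single declarative query.
rows = connection.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
connection.close()

print(rows)
```

The query states what result is wanted (total amount per region) and leaves the how to the database engine, which is a large part of the appeal of relational systems.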

Data science knowledge is quickly becoming an underlying skill many companies seek in their employees regardless of their fields. In case you are wondering why, it's time to understand why data science plays a critical role in tackling many everyday business challenges. Fraud detection is a prime example. Take PayPal. It is leveraging machine learning to combat fraudulent transactions, according to Dr. Hui Wang, senior director of Risk Sciences for PayPal. Social media analytics is another example. IBM has a product called Personality Insights. You can type in a text and the tool analyzes your personality. Dating services are also starting to use more sophisticated forms of data science. For example, Johnathon Mora, eHarmony's director of data science, spoke at the Predictive Analytics Innovation Summit in 2016 and described how his company and others are using data to enhance their customers' experience. Are you now ready to learn more about how data science is used in various marketplaces? I'm thrilled to talk about this topic and share my insight with you.

The data science marketplace is diverse. For example, one of its key markets is fraud detection. As we move toward the digital economy, criminals are finding various ingenious ways to commit fraud against the banking sector. The stakes are high. The loss due to unauthorized credit card transactions alone is estimated to be billions of dollars each year. Therefore, banks are extremely interested in figuring out what's fraudulent and what's not as fast as possible, ideally as the transactions occur. Until very recently, fraud detection involved significant human intervention. Suspicious activities would be flagged for additional scrutiny. Then a fraud detection specialist would look into the case more closely. One of the major challenges of this approach has been the number of false positives. That is, there tend to be too many cases for a human operator to review, and a significant number of them turn out to be normal transactions anyway. Therefore, improving the accuracy of fraud detection is a key to success in this case. Machine learning and big data analytics are revolutionizing the fraud detection market by drastically reducing the number of legitimate customer events falsely identified as fraud attempts. What machine learning brings to the table is its ability to learn on its own the best way to detect fraud through repeated trial and error. Big data contributes to this process by providing rich data sets that machine learning algorithms can use to train themselves. In general, the more data points there are, the more accurate the outcome becomes.
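The false-positive problem described above can be quantified. The sketch below is a minimal illustration with made-up transactions: it computes the false positive rate (the share of legitimate transactions wrongly flagged) and the precision (the share of flagged transactions that really were fraud). The function names are hypothetical.

```python
def false_positive_rate(flagged, fraudulent):
    """Share of legitimate transactions that were wrongly flagged as fraud.

    flagged and fraudulent are parallel lists of booleans, one per transaction.
    """
    legitimate_flags = [f for f, is_fraud in zip(flagged, fraudulent) if not is_fraud]
    if not legitimate_flags:
        return 0.0
    return sum(legitimate_flags) / len(legitimate_flags)


def precision(flagged, fraudulent):
    """Share of flagged transactions that were actually fraudulent."""
    flagged_outcomes = [is_fraud for f, is_fraud in zip(flagged, fraudulent) if f]
    if not flagged_outcomes:
        return 0.0
    return sum(flagged_outcomes) / len(flagged_outcomes)


# Made-up example: five transactions, one of which is real fraud.
flagged = [True, True, False, False, True]
fraudulent = [True, False, False, False, False]

fpr = false_positive_rate(flagged, fraudulent)
prec = precision(flagged, fraudulent)
print(fpr, prec)
```

A detection system that lowers the false positive rate while still catching the real fraud leaves fewer cases for a human specialist to review.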

More and more people are using social media. This in turn generates an enormous amount of data. Data scientists are naturally attracted to these newly emerging types of data sets. Social media refers to websites where users can post their own content to share it with their friends and beyond. Depending on their focus, social media sites promote different types of interests. For example, Facebook offers a forum for building informal and personal relationships, whereas LinkedIn is a professional networking tool. In addition to its size qualifying as big data, another unique value of social media data lies in the data about data, or metadata, it carries. For example, a post on Facebook can be accompanied by location information as well as timestamps. From this kind of unstructured but very rich data set, a lot of useful insight can be derived about the person who is posting and consuming information. For example, IBM has a product called Personality Insights, which offers a profiling service for companies that would like to know more about their customers. In the case of social media analytics, text mining and parsing are an important and necessary first step. Social media companies often make their content available through an application programming interface, or API. Using this API, data scientists can retrieve the data they want. Collecting social media data is one thing, but manipulating it for analysis purposes is another. A lot of skill and effort is necessary before analytics methods can be applied, although standards like JSON help.
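Here is a minimal sketch of what handling a JSON-formatted post and its metadata might look like in Python. The payload is invented for illustration; real social media APIs each define their own field names, so every key below is an assumption.

```python
import json
from datetime import datetime

# Hypothetical post payload; real APIs use their own field names.
raw_post = """
{
  "user": "example_user",
  "text": "Great coffee at the new place downtown!",
  "created_at": "2020-04-28T09:15:00",
  "location": {"city": "Baltimore", "lat": 39.29, "lon": -76.61}
}
"""

post = json.loads(raw_post)

# Turn the metadata (timestamp, location) and the text into analysis-ready fields.
posted_at = datetime.fromisoformat(post["created_at"])
words = post["text"].lower().split()
record = {
    "user": post["user"],
    "hour_posted": posted_at.hour,
    "city": post["location"]["city"],
    "word_count": len(words),
}
print(record)
```

Because JSON maps directly onto dictionaries and lists, the parsing step is short. Most of the real effort goes into cleaning the text and reconciling fields across sources.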

One application area of social media analytics is disease control. The University of Pennsylvania conducted a study on the predictive relationship between Twitter post content and heart disease. Emotional factors are linked to heart disease. The study identified indicators of emotional distress expressed in words and correlated them with the occurrence of heart disease. It used linguistic analysis techniques as well as various big data analytics techniques to reveal that emotional key words such as "hate" are strongly correlated with the incidence of heart disease. On the other hand, positive words like "wonderful" showed the opposite correlation. The Twitter data they collected consisted of tweets posted in 2009 and 2010 from U.S. counties that are home to 88 percent of the country's population. This ample data set provided much stronger evidence of correlation than conventional surveys of subjects could provide.

These days, it is commonplace for people to find love online. There are a number of online dating services available today, such as eHarmony, Match.com, and OkCupid, to name a few. These companies use various methods to find the right matches for their clients. For instance, some online dating services use a match percentage to decide how similar two individuals are. The match percentage is typically calculated from the similarities found between answers to a questionnaire. The answers are weighted depending on the importance of each question. For example, a question on the level of education is weighted more highly than one on music preference. Other online dating services use a more sophisticated method, such as a compatibility prediction model. The leading online dating providers are now starting to adopt big data analytics to enhance the quality of their services. Match.com has collected more than 70 terabytes of data on its customers. According to Match.com, big data analytics allowed them to create 500,000 relationships, which, in turn, resulted in 92,000 marriages and 1 million babies. The primary method used by big data analytics algorithms when making matches is to keep track of candidates' online behavior. One caveat in this case is the possibility of people fabricating their online behavior.
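The weighted match percentage described above can be sketched in a few lines of Python. This is not any dating service's actual formula. The questions, answers, and weights below are invented, and the scoring rule (matched weight divided by total weight) is a simple assumption.

```python
def match_percentage(answers_a, answers_b, weights):
    """Weighted share of questions on which two people gave the same answer."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    matched_weight = sum(
        weight
        for question, weight in weights.items()
        if answers_a.get(question) is not None
        and answers_a.get(question) == answers_b.get(question)
    )
    return 100.0 * matched_weight / total_weight


# Hypothetical weights: education counts three times as much as music taste.
weights = {"education": 3, "wants_children": 3, "music": 1, "pets": 1}
person_a = {"education": "graduate", "wants_children": "yes", "music": "jazz", "pets": "dogs"}
person_b = {"education": "graduate", "wants_children": "yes", "music": "rock", "pets": "dogs"}

score = match_percentage(person_a, person_b, weights)
print(score)
```

Here the two people disagree only on the low-weight music question, so the score stays high. A disagreement on education would cost three times as much.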

A simulation imitates the operation of a real-world system. The true power of simulation comes from its predictive nature. A computer simulation can rely completely on a mathematical model, which is translated into an algorithm and then finally implemented as a piece of code. A physics engine used for gaming is an excellent example of this. Such a purely model-driven simulation is not always accurate and can misrepresent what happens in the real world, which is why data science can play an important role in simulations. By feeding real-life data into a simulation model, scientists can improve the accuracy of a simulation drastically. In addition to the improved accuracy, a simulation model aided by this infusion of a large amount of real-life data can also significantly enhance its predictive power. For example, the field of climatology is one of the beneficiaries of recent progress in big data analytics. Now, meteorologists can predict future weather patterns with much more accuracy. In fact, it is becoming quite feasible to predict your weather with 95% accuracy 48 hours ahead of time.
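The path from mathematical model to algorithm to code can be illustrated with a tiny, purely model-driven simulation. The sketch below steps a dropped object forward in time under constant gravity, which is an assumed, simplified model that ignores air resistance, and compares the result with the closed-form answer from the same model.

```python
import math

GRAVITY = 9.81  # m/s^2, assumed constant


def simulate_fall_time(height_m, dt=0.001):
    """Step a dropped object forward in time until it reaches the ground."""
    position = height_m
    velocity = 0.0
    elapsed = 0.0
    while position > 0.0:
        velocity += GRAVITY * dt   # model: constant acceleration
        position -= velocity * dt  # model: move by the current velocity
        elapsed += dt
    return elapsed


simulated = simulate_fall_time(20.0)
analytic = math.sqrt(2 * 20.0 / GRAVITY)  # closed-form result of the same model
print(simulated, analytic)
```

The two numbers agree closely because they come from the same idealized model. A real falling object would be slowed by drag, and that kind of gap between model and reality is what measured data can help close.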

One of the areas where simulations can be used is predictive modeling powered by data science. Among its many applications, climate and ecosystem change predictions stand out as one of the most timely and significant ways of harnessing the power of data science. For example, there is a United Nations initiative called the Data for Climate Action Challenge. It's a competition aimed at encouraging climate and data scientists to develop innovative climate change research projects by leveraging data analytics. Going a little further, it's no longer a pipe dream to simulate the entire ecosystem of the Earth. The Madingley Model project, sponsored by Microsoft, is making this dream a reality. Using the Madingley Model, scientists can simulate the impact of climate change on all life forms on Earth. The data fed into these predictive models of climate change and ecosystems includes environmental data reported through social media and sensor readings coming from various Internet of Things, or IoT, devices, as well as conventional climate data.

As cyber threats increase, more organizations are making network security their top priority. Attacks on the internet are growing more sophisticated at great speed, while network security vendors are always trying to catch up with the advances made in new hacking techniques. Despite many recent stories of network vulnerabilities being exploited, it is an encouraging sign that many network security solution providers are now moving towards leveraging machine learning and big data analytics to enhance their products. One of the frontiers of network security is the field of logging and monitoring. Many of the software companies offering network security solutions are incorporating machine learning and big data analytics into their product lines. Microsoft is a great example of this trend. Through its cloud product called Azure, it offers a machine learning service on which users can build their own intrusion detection solutions or use any of the built-in services provided by Microsoft.

You may be wondering what skills you need to be successful in a data science and analytics career. Although there are boundless possibilities that could positively affect your quest to land a data science and analytics job, it helps to start with some obvious ones such as data mining, machine learning, natural language processing, statistics, and visualization. Data mining is a broad term referring to the practice of examining a large amount of data for the purpose of finding meaningful patterns and establishing significant relationships to help solve a problem. Machine learning is a subfield of artificial intelligence. It focuses on optimizing ways to use algorithms to conduct data analysis and analytics tasks with as little human supervision as possible. Natural language processing allows a computer to make sense of its interactions with human beings through linguistic means, such as spoken and written languages. Statistics is a foundation for data analysis and analytics in general. Without statistics, it is impossible to do any sophisticated data processing like data mining or machine learning. Visualization is usually the last step of data science and analytics projects. The goal here is to communicate the results of analysis or analytics in the most intuitive way so that stakeholders can quickly get the gist of the meaning and significance of the report. All these skills are essential to becoming a well-rounded data scientist, and a typical data science and analytics education program touches upon all of these areas. Dr. Abhijit Dutt is Professor of Practice of Computer Science at the University of Maryland, Baltimore County, and the director of its Data Science Master's Degree program. Dr. Dutt calls these skills the pillars of data science and analytics. As you can see, becoming a data scientist demands a lot of effort, which is why the job is so valued and sought after.

Data mining and analytics involve a myriad of data manipulation techniques. Text retrieval is one of the most well-known data mining techniques. It builds on many foundational concepts and methods developed in Natural Language Processing, or NLP. Classification constructs a model that assigns data objects to specific categories. In a classification model, the classes, each with its own label, are discrete in nature. For instance, a classification model can categorize the users of an online banking system into trustworthy and untrustworthy groups. Prediction builds a model that produces continuous or ordered values that form a trend. For instance, a prediction model can provide estimated mean time to failure, or MTTF, values for a computer. Clustering is the process of grouping similar data objects into a class. Clustering helps reveal features that distinguish one class of data objects from another, leading to new discoveries about a data set. Uses of clustering analysis range from pattern recognition and image processing to market research. For example, clustering can reveal people with similar purchasing behaviors. As you might have noticed already, the difference between classification and clustering is that classification starts with predefined labels, while in clustering the labels are created after the fact.
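To show what "labels created after the fact" means in practice, here is a minimal sketch of clustering using a one-dimensional version of k-means, a common clustering algorithm chosen here purely for illustration. The spending figures are made up, and the deterministic way of picking starting centers is an assumption made to keep the example reproducible.

```python
def k_means_1d(values, k, iterations=20):
    """Group numbers into k clusters by repeatedly refining cluster centers."""
    ordered = sorted(values)
    # Spread the initial centers across the sorted data (simple, deterministic).
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for value in ordered:
            nearest = min(range(k), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Move each center to the mean of its members; keep it if the cluster is empty.
        centers = [
            sum(members) / len(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers, clusters


# Made-up monthly spending per customer; no labels are given up front.
monthly_spend = [12, 15, 14, 90, 95, 88, 13, 92]
centers, clusters = k_means_1d(monthly_spend, k=2)
print(centers, clusters)
```

The algorithm is never told which customers are "low spenders" or "high spenders". It only finds two groups, and a person would attach those names afterwards.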

Machine learning is based on self-learning or self-improving algorithms. In machine learning, a computer starts with a model and continues to enhance it through trial and error. It can then provide meaningful insight in the form of classification, prediction, and clustering. There are two types of machine learning. One is supervised and the other is unsupervised. Supervised learning is reinforced by feedback in the form of training data. Suppose that you have a basket of apples and oranges and you'd like to separate them into two distinct groups of fruit: apples and oranges. In the supervised learning environment, you already have training data that can tell your machine learning algorithm which fruit belongs to which group after it makes its decision. In this scenario, the algorithm already knows that there are two labels to be used in its attempts to separate the fruits. Therefore, this process is akin to classification, a concept used in the data mining and analytics domain. In the unsupervised learning environment, there is no training data. In this case, the machine learning algorithm depends solely on clustering and keeps refining itself without external feedback.
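The apples-and-oranges scenario can be sketched as a small supervised learner. The example below uses a nearest-centroid classifier, a deliberately simple technique picked for illustration. The two features (a "redness" score and a "skin roughness" score, both on a 0 to 10 scale) and all the numbers are invented.

```python
def train_centroids(examples):
    """Average the feature vectors of each label in the labeled training data."""
    sums, counts = {}, {}
    for features, label in examples:
        totals = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            totals[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [t / counts[label] for t in totals] for label, totals in sums.items()}


def classify(features, centroids):
    """Assign the label whose centroid is closest to the given features."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical training data: (redness, skin roughness) with known labels.
training_data = [
    ((9.0, 2.0), "apple"),
    ((8.0, 3.0), "apple"),
    ((2.0, 8.0), "orange"),
    ((3.0, 9.0), "orange"),
]
centroids = train_centroids(training_data)

print(classify((7.5, 3.5), centroids))
print(classify((1.0, 7.0), centroids))
```

The labels come from the training data, which is what makes this supervised. Remove the labels and the same fruit measurements could only be grouped by clustering.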

Natural Language Processing, or NLP, refers to a collection of different ways for a computer to make sense of its interactions with a human being through a natural language. NLP is a comprehensive discipline in computer science and involves topics such as artificial intelligence, computational linguistics, and human-computer interaction, or HCI. There are NLP subfields that are particularly relevant to a data scientist. Tokenization, parsing, sentence segmentation, and named entity recognition are some of them. Tokenization and parsing isolate each text symbol from a text and conduct a grammatical analysis. Sentence segmentation separates one sentence from another in a text. Named entity recognition identifies which text symbol maps to which type of proper name. A significant portion of the data you deal with as a data scientist is unstructured. That is, it is extracted not from a database, but from sources such as social media sites, text documents, pictures, and so on. Therefore, one of the biggest challenges for a data scientist is to sort through this unstructured data and pre-process it so that data mining and analytics tools can take over and extract the knowledge being sought. Luckily for data scientists, there are already well-developed NLP tools available for programming languages such as Python. Some text processing tools are also built into operating systems such as Unix and Linux.
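As a rough illustration of sentence segmentation and tokenization, here is a deliberately naive sketch that uses only Python's standard re module. The rules are assumptions chosen for brevity. For example, an abbreviation such as "Dr." would be wrongly treated as the end of a sentence, which is why dedicated NLP libraries exist.

```python
import re


def split_sentences(text):
    """Naive sentence segmentation: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Naive tokenization: words (with internal apostrophes) and punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", sentence)


text = "Data is everywhere. Isn't that exciting? Let's explore!"
sentences = split_sentences(text)
tokens = [tokenize(sentence) for sentence in sentences]
print(sentences)
print(tokens)
```

Once text is broken into sentences and tokens like this, later steps such as parsing or named entity recognition have well-defined units to work on.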

Statistics lays the foundation for data science. In fact, statistics is where data science started. Therefore, developing a reasonable understanding of statistics is a must for a data scientist, and the more you know about statistics, the better. At a minimum, a data scientist needs to be proficient with concepts such as probability, correlation, variables, distributions, regression, null hypothesis significance tests, confidence intervals, the t-test, ANOVA, and the chi-square test. You also need to know how to use common statistical analysis tools, including R, Excel, and SAS. At a more advanced level, a data scientist needs to be familiar with concepts and algorithms like logistic regression, support vector machines, or SVMs, and Bayesian methods.
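Correlation, one of the concepts listed above, is easy to compute by hand. The following sketch implements Pearson's correlation coefficient from its definition. The study-hours and test-score numbers are made up, and the function assumes neither sample is constant (otherwise the denominator would be zero).

```python
import math
from statistics import mean


def pearson_correlation(xs, ys):
    """Pearson's r: co-variation of xs and ys divided by the product of their spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least two points")
    mean_x, mean_y = mean(xs), mean(ys)
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)


# Made-up data: hours studied versus test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

r = pearson_correlation(hours, scores)
print(round(r, 3))
```

A value near 1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little linear relationship. Correlation alone does not establish that one variable causes the other.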

To overcome the challenge of effectively communicating the results of data analytics to a lay audience, data scientists frequently rely on visualization. Therefore, it is to a data scientist's advantage to have a good understanding of visualization techniques, so that they can use the most effective one for a given problem and audience. The characteristics of effective visualization are well known. These include displaying data at multiple levels of detail and avoiding distortion of the message being conveyed. It is also very helpful to know how to use some of the software tools offered by the industry leaders in visualization solutions. For example, Tableau offers one of the most popular and comprehensive visualization tools for data scientists. It supports a variety of visualization elements such as different types of charts, graphs, maps, and other more advanced options. Always remember that your job as a data scientist is that of a middleman who interfaces both with experts working with sophisticated technologies and with non-experts who don't have the luxury of wading through mountains of information to get the message they need.

There are a number of opportunities you can take advantage of to play an active role in and contribute to the data science and analytics fields. To name just a few, there are job titles such as data scientist, data engineer, business intelligence architect, machine learning specialist, data analytics specialist, and data visualization developer. Each of these roles is critical in effectively leveraging data and its potential despite numerous challenges. For example, big data requires special processing by data engineers before an analytics specialist can even try to do their job. Take network security. Let's assume that you need to analyze a terabyte of data every day. The goal here is to detect suspicious behavior. There are numerous roles involved in this, including domain experts such as cyber security professionals, database administrators, cloud and distributed computing specialists, network engineers, software engineers, and last but not least, data scientists. The list goes on and on. In fact, you can see this in action in my recent course on data-driven network security essentials. Based on my experience working as a network management software architect for Sprint in the U.S., I can share some good stories about this from the trenches. I can definitely attest that it takes a village to get anything accomplished in the data science and analytics business.

Data science is an all-encompassing term. Similarly, data scientist is an all-encompassing job title. Data is everywhere and its volume is ever increasing. Every organization can benefit from hiring a person who can provide data analysis and analytics to reflect on its past performance and to attempt to predict its future. However, not every company can afford to hire a person whose job is dedicated to working on corporate data, let alone multiple experts specializing in different aspects of data science. The role of a data scientist is that of a generalist rather than a specialist. In an environment where a data scientist works with other data science specialists, such as a machine learning scientist, the data scientist can act as a liaison between the leadership of the company and the data science specialists. Therefore, one of the hallmarks of a competent data scientist is the ability to communicate effectively. Compared to other highly specialized jobs in data science, the barrier to entry for a data scientist job is relatively low. Solid training in computer science or statistics may be enough for you to get started in an entry-level position, but a Master's degree in data science is a big plus. What's more important is your passion for data. Also, the potential for job growth is very high as you become a seasoned data scientist and take on various leadership positions in a company.

Business Intelligence, or BI, is a process of collecting, managing, and processing corporate data to provide actionable information for the leadership and employees of a company. BI is heavily technology-driven and leverages various software applications to perform the analysis and analytics of company data. In the information technology industry, the architect job title is often given to a senior member of a technical team, such as a group of software developers. An architect is usually a person who is at the pinnacle of a technical career. This person is a seasoned veteran who leads the major technology initiatives of a company. Therefore, an architect position is by no means considered an entry-level position in most companies. With this understanding of what BI is and what it means to be an architect, let's explore the job of a BI architect further. One of the core responsibilities of a BI architect is to design and implement system architectures that maximize the potential of a company's data assets. To make this happen, the BI architect needs to be able to build a system that links various standalone IT systems throughout a company to pool relevant and useful information for strategic decision-making.

Qualified machine learning scientists are sought after, and their salaries are climbing. At the top of their career, machine learning scientists program computers to learn on their own. So what, more specifically, is required of a machine learning scientist? At its highest level, the job requires you to be highly creative and independent. Nobody can tell you what to do when your job is to enhance customer interactions at a multi-billion dollar company through machine learning techniques. You also need the discipline to follow through and meet deadlines. Finally, attention to detail and quality is critical. A small, seemingly insignificant mistake can wreak havoc on your entire project when you have to deal with millions of unhappy customers. Now let's talk about technical skills. The most foundational ones are the usual suspects required for any advanced IT profession. Math skills are essential because they form the foundation of the technical language machine learning scientists use. In particular, deep knowledge of statistics and probability is important. Next is the ability to develop and validate a mathematical model representing various aspects of machine learning. Once a model is developed, it needs to be translated into an algorithm, that is, a set of unambiguous and discrete steps a computer can execute. Finally, you also need some practical IT skills. Proficiency in programming languages such as Python, C++, Java, and R is very helpful. Your work efficiency as a machine learning scientist often depends on your ability to preprocess a large amount of text very quickly and efficiently. Therefore, familiarity with Unix and Linux tools like sed, awk, grep, find, and sort is highly useful. Last but not least is your understanding of distributed computing, because your machine learning program will most probably have to take advantage of technologies such as Hadoop and cloud computing.
As we can store data more cheaply and easily, there is an increasing number of data sources available to us. These include images, videos, maps, networking data, social media data and so on. Therefore, naturally there is a growing need for data processing. Machine learning scientists are at the forefront of this kind of efforts for leveraging the data around us.
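As a rough illustration of the text preprocessing mentioned above, here is a minimal Python sketch that does what a grep-then-sort pipeline might do on the command line: keep only the lines that match a pattern, then count and rank the words in them. The sample log lines and the `count_matching_tokens` helper are made up for this example.

```python
import re
from collections import Counter


def count_matching_tokens(lines, pattern):
    """Keep lines matching `pattern` (like grep -i), then count lowercase word tokens."""
    regex = re.compile(pattern, re.IGNORECASE)
    counts = Counter()
    for line in lines:
        if regex.search(line):
            # Tokens are runs of letters and apostrophes; digits are dropped.
            counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts


# Hypothetical log lines standing in for a large text file.
sample_lines = [
    "ERROR disk full on node1",
    "INFO job finished",
    "error timeout on node2",
]

token_counts = count_matching_tokens(sample_lines, r"error")
# Rank by descending count, breaking ties alphabetically (like sort).
top_tokens = sorted(token_counts.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
print(top_tokens)
```

For very large files you would read the lines lazily from disk rather than from a list, but the filter-then-count shape stays the same.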

The word business strongly implies that this job requires solid business knowledge and sense. Therefore, anyone aspiring to become a business analytics specialist must be both business and technology savvy. Business analytics specialists are those who make things happen under the overarching vision of the architect. In other words, they implement BI architectures under the direction and supervision of the BI architect. When a company can't afford to hire a BI architect, it needs to march on by relying solely on a business analytics specialist to obtain its BI. This also means that the company cannot architect and build its own BI systems due to the lack of resources. Naturally, in this scenario, the business analytics specialist has no choice but to depend on off-the-shelf software products offering business analytics capabilities. Luckily for these smaller companies with fewer resources, business analytics services are quickly being commoditized and becoming easily accessible. The quality and ease of use of these products are also improving very rapidly. For example, Amazon has a service called QuickSight. According to its marketing literature, it is a very fast, easy-to-use, cloud-powered business intelligence service at 1/10th the cost of traditional BI solutions. This kind of product is a perfect tool for a business analytics specialist to take advantage of.

Data visualization is also industry neutral. That is, data visualization developers can work in any industry because their skills are applicable wherever data is being used. For example, there are many research and development organizations, including companies specializing in visualization, seeking innovative ways to visualize data. Media companies such as news outlets are also hiring visualization developers. They need infographics to draw the attention of their readers and viewers. Of course, any data analytics group in industry or academia would like to hire its own data visualization developers. Although they work independently at times, data visualization developers are expected to work frequently with various teams of people dealing with different aspects of data science. For instance, they will be working very closely with data scientists, business intelligence architects, machine learning specialists, and business analytics specialists on a daily basis. Their primary job is to work collaboratively to identify the most appealing and effective means of visually expressing data mining and analytics results, to help develop new insight and to assist in making critical business decisions. Since it is a development position, this job requires programming skills, especially in the area of web development and other graphical user interface platforms. A visualization specialist must also have knowledge of various database systems and query languages, because part of the job is to interface with database APIs to pull the data before it gets displayed. Finally, they need to be proficient with mainstream data visualization software tools so that they can speed up their development. After hearing about what it takes to be a data visualization developer, are you interested in pursuing this career path? I believe this job is for you if you're the creative and artistic kind.
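To give a feel for the "pull the data, then display it" part of the job, here is a minimal sketch using Python's built-in sqlite3 module. The `sales` table, its contents, and the text-based bar chart are all assumptions made for illustration; in practice you would query a production database and render the result with a proper visualization tool or web front end.

```python
import sqlite3


def fetch_totals(conn):
    """Return (region, total) rows, largest total first."""
    query = (
        "SELECT region, SUM(amount) FROM sales "
        "GROUP BY region ORDER BY SUM(amount) DESC, region"
    )
    return conn.execute(query).fetchall()


def render_bars(rows, width=20):
    """Render each row as a text bar scaled to the largest value."""
    largest = max(value for _, value in rows)
    lines = []
    for label, value in rows:
        bar = "#" * round(width * value / largest)
        lines.append(f"{label:<6}{bar} {value}")
    return lines


# Hypothetical in-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 40), ("south", 10), ("north", 40), ("east", 20)],
)

rows = fetch_totals(conn)
chart = render_bars(rows)
print("\n".join(chart))
```

Keeping the query step and the rendering step in separate functions mirrors how the work is usually split: the data side can change without touching the display side, and vice versa.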

Can you guess how much data science and analytics professionals make on average in annual salary? Of course, the salary numbers are always changing, and a lot depends on where you work due to differences in the cost of living. The good news is that the sky's the limit. Data science and analytics professionals are in high demand. Therefore, your salary potential will be higher than or comparable to that of other high-demand IT professionals, such as cybersecurity experts. Because I know how difficult it is to nail down salary statistics given how diverse and fluctuating they are, let me give you some anecdotal snapshots of how well specific types of data science and analytics professionals are getting paid. For example, according to payscale.com, entry-level data scientists receive above $80,000 per year. Mid-level data scientists make around $120,000, while senior data scientists bring home close to $150,000. The bottom line is that the salary prospects of data science and analytics professionals are excellent. You're in the right place at the right time to start your career as a data science and analytics professional.

As the field of data science and analytics matures, we're seeing more certification opportunities becoming available. Just like any other professionals, data scientists can establish more credibility in what they're capable of by earning well-known certifications in their field. There are a myriad of certification opportunities available, and the number is growing as we speak. The key here is not how many certifications you have but how many relevant skills you can demonstrate through them. Many certifications also require significant industry experience, so it is important to pick the ones that best fit your own needs. If you're a student who is about to graduate, your best bet would be certifications that do not require any industry experience. For example, the Certified Analytics Professional (CAP) certification offers an associate CAP option, so that students can pass the exam portion of the certification requirements without any industry experience and get fully certified once they obtain enough of it. If you're a working industry professional who is interested in enhancing your marketability, a more specialized form of certification may be appropriate. These can be directly tied to software products you're required to use on a daily basis, and include certifications such as MCSE Business Intelligence, Cloudera Certified Professional, EMC Data Science Associate, Oracle BI Implementation Specialist, and SAS Certified Data Scientist. To share an industry perspective, having certifications is definitely a plus, because it proves a candidate's proficiency in a specific knowledge domain or software tool. But you certainly want to be strategic about which certification you pursue, because it can be a very expensive and time-consuming endeavor. The key here is to pair a certification with your experience to further strengthen your credentials.
I'm hoping that I can shed some light on this topic by going over several well-known certifications in data science and analytics with you. You'll be amazed and pleasantly surprised by the diversity and depth of what's already available out there.

The future of data science is very bright. New technological breakthroughs are happening almost daily, and this trend will only accelerate. To be better prepared for what the future will bring us, it is critical to review representative emerging technologies in data science and analytics. These emerging technologies will, in turn, create new job opportunities. Therefore, if you understand the technology trends well and use them to focus your education and training, you will be better positioned to take advantage of the new career options resulting from these advances. As you gain more capabilities as a data science and analytics expert, you will feel more empowered, and maybe tempted to abuse your power. We see this happening all the time. It is very easy to manipulate your data and tell an entirely different story, which is why you need to seriously consider the ethical aspect of your job as a data science and analytics professional. Finally, due to the technology-driven nature of the data science and analytics fields, your knowledge can become obsolete very quickly if you're not careful. This is why you need to constantly update and upgrade your knowledge through continuing professional development. If you have a curious mind, you will find the profession of data science and analytics fascinating and fulfilling. After all, the job never gets boring, but you will also have to make sure that you keep up with it. Let's find out how.

Just like many other technology fields, the discipline of data science is dynamic and constantly changing. Therefore, it is a must for data scientists to keep refreshing their knowledge to stay relevant. One of the prominent new trends is the convergence of cloud computing, big data analytics, and machine learning. In fact, it's no longer necessary to provision private resources housed in your own organization to deploy a distributed computing solution like Hadoop. Various online retail data services, including warehouses, mining, and analytics, are already available in the cloud through vendors like Amazon, IBM, Google, and so on. This makes it cheaper for companies to use data science techniques to solve their business problems, which in turn increases the demand for data scientists. Other salient features that make these cloud-based data science services attractive are their scalability and ease of use. Thanks to these emerging online data science services, the majority of data scientists no longer have to worry about data infrastructure and management problems. Alongside the convergence of cloud computing and big data analytics, the importance of machine learning as a critical part of the data science equation is also growing rapidly. Deep learning in particular, which hinges on a set of machine learning algorithms built on neural networks, is gaining traction.

There's no doubt that data science careers are trending upwards. In fact, many industry watchers are reporting shortages of qualified professionals for the foreseeable future. Therefore, I'm happy to say that the outlook for data science job opportunities is extremely bright. Now, the challenge is to prepare yourself for these wonderful opportunities. A good starting point is to develop a passion for data science by first getting exposed to the field as much as you can. Once you know you're ready to dive into your journey of training and educating yourself in data science, the next big step is to actually commit yourself to lifelong learning. Identify a degree program or online curriculum that will provide a roadmap to your ultimate goal of becoming a data scientist, and simply plunge into it. If you need more guidance, find a mentor who can coach you along the way. This could be your professor, a colleague, or someone you get to know through a LinkedIn invitation. All the careers in data science are fairly new and still emerging and evolving. These careers include job titles such as data scientist, business intelligence architect, machine learning specialist, business analytics specialist, and data visualization developer.

The threats are everywhere. In fact, we hear about new data breaches all the time. Sometimes these security incidents are insider jobs: disgruntled employees or industrial spies may be lurking around you. You yourself may be tempted to peek at your coworkers' or supervisor's data, just out of curiosity. As a result, the ethical integrity of a data scientist can make a huge difference in guarding the security and privacy of user data. In addition to watching out for insider threats and keeping yourself out of the danger zone, it is also the ethical thing to do to intentionally and proactively build security into the data science product you are developing. If you don't do your job as a data scientist to ensure the security and privacy of your customer data, somebody is bound to fall victim to a crime down the road. Security is often an afterthought when working on a data science project. Time-to-market pressure always seems to win. However, many organizations are realizing that considering security and privacy risks and their countermeasures from the very beginning of a data science project can ultimately save them the trouble of legal, monetary, and reputational liabilities. The data science code of ethics is still being formed as the profession of data scientist keeps evolving. As the profession matures, you need to keep reminding yourself of all the basic elements of ethical conduct as a data scientist.

Data scientists need to stay abreast of new developments in their specialty, in the discipline at large, and in the overall IT world. The breadth of knowledge required of them is quite wide because data science builds on many underlying IT technologies, such as data infrastructure and management. For example, one of the emerging data science technologies is in-memory analytics, which means running entire data analytics operations in the main memory of a computer instead of moving some of the data back and forth to a secondary memory device, such as an external storage device. As a data scientist, you need to know the implications of this relatively new technology for your daily job and decide whether to adopt it now or later. As you may already be well aware, there are a number of ways to keep current with the latest developments in your field of expertise. One way to force yourself to do this is to get certified. After getting a certification such as Certified Analytics Professional, or CAP, you need to get recertified every once in a while by acquiring a certain number of professional development units, or PDUs. Networking with other data science professionals by participating in conferences and workshops is another great way of keeping up to date with your profession. After all, these activities are what make your job more exciting and fun.
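The in-memory idea can be illustrated with a very small Python sketch: read the data from storage once, keep it in main memory, and run every later analysis against that in-memory copy. The CSV contents, column names, and `load_into_memory` helper are made up for this example, and real in-memory analytics products work at a much larger scale with specialized engines.

```python
import csv
import io
from statistics import mean

# Stand-in for a file on secondary storage; the contents are made up for illustration.
RAW_CSV = "customer,amount\nalice,30\nbob,50\nalice,70\n"


def load_into_memory(source):
    """Read the source once and keep every row in main memory as a dict."""
    return [
        {"customer": row["customer"], "amount": float(row["amount"])}
        for row in csv.DictReader(source)
    ]


rows = load_into_memory(io.StringIO(RAW_CSV))

# Every later analysis reuses the in-memory rows instead of re-reading storage.
average_amount = mean(row["amount"] for row in rows)

total_by_customer = {}
for row in rows:
    customer = row["customer"]
    total_by_customer[customer] = total_by_customer.get(customer, 0.0) + row["amount"]

print(average_amount, total_by_customer)
```

The trade-off to weigh is that the whole dataset has to fit in main memory, which is exactly the kind of implication you need to understand before deciding whether to adopt the approach.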

Thanks for reading this article.
