Journey to Data Science

Data Science Perception

“If you really believe in what you’re doing, work hard, take nothing personally and if something blocks one route, find another. Never give up.” — Laurie Notaro

In this blog, I’m going to share my journey and the perceptions of data science that evolved throughout my career. Other people may well see things from a different angle, and that is to be expected.

“Learning is a never-ending process. Whether you focus or not, life will keep teaching.”

As part of my major project, my group and I built "Remote Control PC", a system for controlling a PC using hand gestures. From then on, I developed a passion for the field and eventually joined Extensodata as an intern in 2018.

“Turn obstacles into opportunity and opportunity into possibilities.” - Roy T. Bennett

When I joined the company, there were only three employees working at one table: Suresh Gautam (CEO), Rhishikesh Nepal (Data and Business Analyst), and a QA. I worked under the supervision of Rhishikesh Nepal. As I started exploring, I began to face different problems. My day-to-day cycle revolved around tackling a problem, reaching out to different contexts for answers, and then running into new problems. At that time, I recklessly explored data preprocessing, visualization, machine learning algorithms, and other standard practices. I scrutinized all the practices I had learned, which I took to be the fringe of data science skills. I automated repetitive and tedious tasks and was engrossed in building a panacea for every problem I faced. The more lines of code I wrote, the less effective they became. My inner fear at that time came from unanswered curiosities and, subsequently, a lack of confidence. The more I progressed, the more challenges I encountered every day. At some point, I realized the gap that exists between theoretical axioms and practical implementation.

“In theory, there is no difference between theory and practice. In practice, there is.” - Manfred Eigen

At that time, my curve was probably moving sharply towards the peak of Mt. Stupid, as shown in the Dunning-Kruger effect figure above. Trying to become a data scientist and actually being a data scientist are two different ball games. If you never encounter the problems, how likely are you to have solutions for them? Unless you deal with real-world industry problems, your skills might be inconsequential for solving them and making viable products. Self-directed research alone is sometimes insufficient. When you hit a problem, you may simply switch to something else, because nothing compels you to find a solution at any cost, and that ultimately leaves you with a superficial understanding. Eventually, you are unable to execute properly when it comes to solving that problem. How can people avoid these problems while building their careers at this level? Hmm, here the solution isn't avoiding the problems, but rather facing them early and overcoming them through continuous effort.

Let's be more precise and technical. In my opinion, data science is a systematic approach to creating business value by identifying hidden patterns in data using various tools and techniques. Data can be structured, unstructured, or semi-structured depending upon the problem. Tools and techniques extend from simple transformations to complex algorithms. The system can be built on an inference engine, a machine learning model, or a hybrid of the two. An inference engine, the rule-applying core of an expert system, depends on hand-written rules, so it tends to offer fewer benefits and more limitations. Machine learning, in contrast, offers supervised, unsupervised, and semi-supervised learning approaches, whose adoption is usually more fruitful than an inference engine. The choice of programming language is not a primary concern; it depends on the type of product you’re working on. When it comes to AI, Python is preferred among the developer community, and R is used to a great extent among statisticians, but there is no hard and fast rule. Unlike computer vision (CV), natural language processing (NLP), and reinforcement learning (RL), where cutting-edge technology is widely explored and used, data science carries no compulsion to use cutting-edge technology all the time. The main noticeable difference between a traditional software application and data science is that a traditional software application follows a predefined, systematic flow in which the required functionalities are achieved through the software development process. Data science, on the other hand, extracts hidden patterns from data using various tools and techniques chosen according to the problem. Because of this, traditional software applications are considered more static than data science applications. Machine learning applications can accommodate new behavioral changes in the data through the model training process, which is not possible in traditional software. Let’s see how data science is explained in industry.
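To make the contrast between a fixed rule and a trained model concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the function names `rule_based_flag` and `fit_threshold`, the 50k cut-off, and the transaction amounts are hypothetical, and the "model" is a one-feature decision stump picked only because it fits in a few lines.

```python
def rule_based_flag(amount, threshold=50_000):
    """Traditional software: a fixed rule that only changes when the code is edited."""
    return amount > threshold


def fit_threshold(amounts, labels):
    """Learn the cut-off that misclassifies the fewest labelled examples.

    This is a one-feature decision stump, chosen only to keep the sketch small.
    """
    best_threshold, best_errors = None, None
    for candidate in sorted(set(amounts)):
        # Count examples where "amount > candidate" disagrees with the label.
        errors = sum((amount > candidate) != label
                     for amount, label in zip(amounts, labels))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold


# Hypothetical labelled data: True means the transaction should be flagged.
old_amounts = [10_000, 20_000, 60_000, 80_000]
old_labels = [False, False, True, True]

# Later the behaviour changes: amounts up to 80k are now considered normal.
new_amounts = [10_000, 20_000, 60_000, 80_000, 120_000, 150_000]
new_labels = [False, False, False, False, True, True]

old_threshold = fit_threshold(old_amounts, old_labels)
new_threshold = fit_threshold(new_amounts, new_labels)
print(old_threshold, new_threshold)
```

Retraining on the newer data moves the learned cut-off, while `rule_based_flag` keeps flagging an 80k transaction until someone edits its threshold by hand.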

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from noisy, structured, and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains.

Compared with other AI disciplines such as machine learning and deep learning, data science has one distinguishing component: domain knowledge. This doesn’t mean that machine learning and deep learning don’t require domain knowledge at all. But data science depends on it more heavily, because data science is incomplete without analytical skills, and those start from the problem statements of the domain. Analytical skill helps you understand the nature of the data, the business use cases, and the overall bird's-eye view of the problem statement. Domains vary from business to business, and the nature of the data varies accordingly. Some of the widely used data science domains are banking, finance, retail, healthcare, weblogs, and scientific domains. Although data science shares similar practices across domains, the analytical skills for each domain require a specific background. Because of this variability, problems differ and therefore require different products to fulfill business needs. After identifying clear business requirements, the technical team can start the project through research and development. As we can see in the above picture, data science is interdisciplinary: domain knowledge, math and statistics, and machine learning. We can even build machine learning models without these skills; however, we then have to be prepared for a recipe for disaster. Building a machine learning model in data science isn’t about writing thousands of lines of code; rather, it’s about writing fewer lines with a deeper understanding of the consequences they produce. A good machine learning model (identified through model evaluation) requires good input features. The principle of “garbage in, garbage out” implies that feature engineering is another crucial skill in data science, one that primarily distinguishes it from computer vision (CV) and natural language processing (NLP). Feature engineering is a blend of domain knowledge and technical analytical skill that many data science practitioners ignore or struggle with. Good feature engineering is also a way of reducing model complexity.
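As a small illustration of feature engineering in a banking setting, the sketch below turns a customer's raw monthly incomes into a few model inputs. The function names, the choice of features, and the recency weighting (later months count more) are all assumptions for the example, not a recipe taken from any real project.

```python
from statistics import mean


def weighted_average(values, weights):
    """Weighted mean; raises if the weights sum to zero."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def income_features(monthly_income):
    """Turn a chronological list of monthly incomes into model input features."""
    if not monthly_income:
        raise ValueError("at least one month of income is required")
    # Assumed weighting scheme: month 1 gets weight 1, month 2 gets weight 2, ...
    recency_weights = range(1, len(monthly_income) + 1)
    return {
        "avg_income": mean(monthly_income),
        "recency_weighted_income": weighted_average(monthly_income, recency_weights),
        "income_growth": monthly_income[-1] - monthly_income[0],
    }


# Hypothetical customer whose income rises over three months.
features = income_features([30_000, 40_000, 50_000])
print(features)
```

Which of these features actually helps depends on the domain and the data, which is exactly why domain knowledge matters when choosing them.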

“The richest feature with the simpler model is the best one.”

The overall data science life cycle is illustrated in the above figure. Model deployment is another key aspect of data science that is not shown there. Nevertheless, data science practitioners are acquainted with the overall life cycle. When I was a beginner, I used to think data science was primarily concerned with model building. It is undoubtedly a crucial step, but building a product isn’t just about applying machine learning algorithms. If you ignore some of these steps, a successful product will remain merely a dream. You may be curious about how to master these skills. The sad part is that there are no predefined steps that lead you quickly into data science. It’s steady practice with steadfast determination that ultimately leads you towards achieving your dreams. You can learn each skill from various sources; there are unlimited resources available on the internet, such as Kaggle, Medium, research papers, and Udemy courses. The question then becomes, "Do these kernels, blogs, and video tutorials suffice for solving data science problems?" There is little chance that those resources alone will help you build a good product in your organization. The main reason is that data science is a data-driven approach whose practices have to follow the nature of the data and the associated business problems. How can you presume that someone can give you complete solutions through those resources? For example, you can become an expert in building forecasting models; let’s say, air ticket forecasting. Even if you build a highly accurate forecasting model for airline tickets, there is no guarantee that you can reach the same level of accuracy in stock market prediction. What you can do is adopt good practices, which have a higher chance of success than working without them.

“Don’t skip each ladder. Sooner or later you have to pay for it heavily.”

Deploying a model and monitoring its performance is the most essential part of serving your model to clients. Unlike in traditional software, deploying a machine learning model after it passes all the test cases doesn’t guarantee that it will keep performing the same way. Two other important factors trouble us over time: data drift and concept drift. These drifts cause model performance degradation, which can lead to serious problems over time. A possible way to catch this is model performance monitoring. It helps you decide whether you have to work on feature engineering again or retrain the model with the newly changed data. Data drift occurs when new behavior appears in the data or its sources. For example, when I was building a model, I assumed the average monthly income of customers was somewhere around 40k, but over time, the average rose to 1 lakh (100k). The previously trained model has no clue that the average monthly income could be 1 lakh. Thus, the decision criteria built around average monthly income fall apart. Concept drift, on the other hand, occurs when the relationship between the inputs and the outcome changes over time, so the assumptions you made while building the model no longer hold. For example, when you were building the model, you assumed the average monthly income would work well as a feature, but it didn’t perform well. Is there a chance that the weighted average monthly income performs better this time? It’s all about exploration.
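A very simple form of drift monitoring compares a feature at training time with the same feature in recent data. The sketch below does this for the monthly income example. The function names, the 25% tolerance, and the sample values are assumptions made for illustration; a production setup would normally use a proper statistical test or a dedicated monitoring tool.

```python
from statistics import mean


def mean_shift_ratio(reference, current):
    """Relative change of the current mean against the reference (training) mean."""
    ref_mean = mean(reference)
    if ref_mean == 0:
        raise ValueError("reference mean is zero; relative shift is undefined")
    return (mean(current) - ref_mean) / abs(ref_mean)


def has_drifted(reference, current, tolerance=0.25):
    """Flag drift when the mean moved by more than `tolerance` (an assumed threshold)."""
    return abs(mean_shift_ratio(reference, current)) > tolerance


# Hypothetical monthly incomes: around 40k at training time, around 1 lakh later.
training_income = [38_000, 40_000, 42_000, 39_000, 41_000]
recent_income = [95_000, 100_000, 105_000, 98_000, 102_000]

if has_drifted(training_income, recent_income):
    print("Income distribution has shifted; review the features or retrain the model.")
```

A check like this only watches the inputs. Catching concept drift also requires tracking the model's prediction quality against real outcomes once they become available.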

“A theory has only the alternative of being right or wrong. A model has a third possibility: it may be right but irrelevant.” - Manfred Eigen

Poor data quality is the biggest enemy of the widespread, profitable use of machine learning. While the caustic observation “garbage in, garbage out” has plagued analytics and decision-making for generations, it carries a special warning for machine learning. The quality demands of machine learning are steep, and bad data can rear its ugly head twice: first in the historical data used to train the predictive model, and second in the new data used by that model to make future decisions. To properly train a predictive model, historical data must meet exceptionally broad and high quality standards. First, the data must be right: it must be correct, properly labeled, deduplicated, and so forth. But you must also have the right data: lots of unbiased data, covering the entire range of inputs for which one aims to develop the predictive model. Most data quality work focuses on one criterion or the other, but for machine learning, you must work on both simultaneously.
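Both criteria can be checked with a few lines of code before any training happens. The sketch below counts exact duplicates and missing labels (is the data right?) and looks for empty stretches of the expected input range (is it the right data?). The record layout, the field names `income` and `label`, the expected range of 0 to 200k, and the four bins are all hypothetical choices for the example.

```python
def count_duplicates(records):
    """Count records that exactly repeat an earlier record."""
    seen = set()
    duplicates = 0
    for record in records:
        key = tuple(sorted(record.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


def count_missing_labels(records, label_key="label"):
    """Count records whose label is absent or None."""
    return sum(1 for record in records if record.get(label_key) is None)


def empty_bins(values, low, high, bins=4):
    """Return indices of equal-width bins in [low, high] that contain no values."""
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        if low <= value <= high:
            index = min(int((value - low) / width), bins - 1)
            counts[index] += 1
    return [i for i, count in enumerate(counts) if count == 0]


# Hypothetical training records for a loan model.
records = [
    {"customer_id": 1, "income": 40_000, "label": "repaid"},
    {"customer_id": 2, "income": 45_000, "label": "defaulted"},
    {"customer_id": 2, "income": 45_000, "label": "defaulted"},  # exact duplicate
    {"customer_id": 3, "income": 60_000, "label": None},         # missing label
    {"customer_id": 4, "income": 90_000, "label": "repaid"},
]

report = {
    "duplicates": count_duplicates(records),
    "missing_labels": count_missing_labels(records),
    "empty_income_bins": empty_bins([r["income"] for r in records], 0, 200_000),
}
print(report)
```

Here the empty upper bins show that the sample has no customers with incomes above 100k, so a model trained on it would be guessing in that range.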

“If your data is bad, your machine learning tools are useless.”

You’re probably curious about the roles that a data science company should have. Look at the above picture. We can form a functional data science team with these roles: Business Stakeholders, Data Engineer, Data Scientist, Machine Learning Engineer, QA, and DevOps Engineer. How far you can build out such a team depends on the stage of the company and its available resources. Although a “jack of all trades, master of none” brings a few advantages, building a functional team helps get work done efficiently and build quality products. Because of the functional nature of the team, each individual can focus on a narrower scope, which helps build real expertise on the problems instead of superficial knowledge. Still, it is always better to have some knowledge of the other interconnected fields, which helps in communicating with different stakeholders. What happens if there is a gap between two roles? Let’s say there is no data engineer. Then there will be a gap between business stakeholders and data scientists, with several consequences, because a data scientist is not well placed to build a reliable data warehouse. If you don’t need an ETL pipeline, the absence of a data engineer might work for a while, but without data engineering it is difficult to build scalable data science applications. Not having the right person in the right position muddles the workflow.

“Getting the right people in the right jobs is a lot more important than developing strategy.” - Jack Welch

When I started working at Extensodata, there was no fully functional team, and I took on the responsibility of numerous roles alone: understanding business problems, system architecture design, the ETL pipeline, data analytics, engine development, and deployment. All of these skills went into the development of Foneloan, which has done remarkably well in the market. Later, as the team grew, tasks were broken down according to their functional requirements. Now the company has a highly functional team where roles and specific tasks are assigned, which has improved productivity and helped everyone focus on role-specific scopes. The size of a data science team and its functional level depend on the stage of the company and the resources available. However, it is always better to have functional teams to achieve higher productivity with standardized products.

“Without execution, 'vision' is just another word for hallucination.” - Mark V. Hurd

AI can be unfairly biased; more precisely, bias in AI is the result of human decisions about how AI applications are designed, examined, and employed. From employment decisions to credit allocation, there have been many cases where people’s decision-making produced unfair consequences for vulnerable groups. If AI is designed and trained to replicate the behavior of those human decision-makers, AI systems will undeniably reflect the same biases in their decisions. Designing AI systems that address these biases is challenging, and it demands careful consideration and precision with regard to the technology and the societal context in which these systems will be deployed. In fact, well-designed and well-tested AI systems can limit such unfair biases and help society recognize and cope with bias in human decision-making.

“By far the greatest danger of Artificial Intelligence is that people conclude too early that they understand it.” — Eliezer Yudkowsky

To sum up, I would like to write about how a data scientist can add value to your business. These are eight important ways a data scientist can add value.

  1. Empowering Management and Officers to Make Better Decisions
  2. Directing Actions Based on Trends—which in Turn Help to Define Goals
  3. Challenging the Staff to Adopt Best Practices and Focus on Issues That Matter
  4. Identifying Opportunities
  5. Decision Making with Quantifiable, Data-driven Evidence
  6. Testing These Decisions
  7. Identification and Refining of Target Audiences
  8. Recruiting the Right Talent for the Organization

Thank you!
Sagar Paudel

Senior Data Scientist