
Data Scientists: The Four Indispensable Mindsets

Author: Synod Data Storytelling

In the modern data-driven world, data scientists play a vital role. They need to be proficient not only in machine learning, statistics, and data visualization, but also in solving complex problems. However, what really sets data scientists apart is their way of thinking. Just as software engineers must master not only programming languages and tools but also problem-solving methods and mindsets, data scientists need to keep a few principled approaches in mind.

Data thinking, business thinking, iterative thinking, and engineering thinking: these four core mindsets are essential skills for every data scientist. This article examines each of them to help you go further and be more successful on your path through data science.

Data Thinking: Prioritize Data and Data Quality


In the field of data science, understanding and prioritizing the data is crucial. A common mistake among data science newcomers, and among non-technical people working with data scientists, is to focus too much on the model, for example:

  • Choosing the most complex model
  • Over-tuning hyperparameters
  • Trying to solve every data problem with machine learning

The fields of data science and machine learning evolve rapidly, with new libraries, faster technologies, and better models emerging all the time. However, the most sophisticated and up-to-date option is not always the best one. There are many factors to consider when choosing a model, including whether machine learning is needed at all.

A common task is outlier detection, and choosing the right method matters. Although complex machine learning methods may seem appealing, simpler methods are often more effective. The z-score, for example, is widely used for outlier detection: it captures most of the relevant anomalies, it is highly interpretable and easy to explain to stakeholders without a technical background, and it is computationally cheap, requiring little extra memory or compute.
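
To make this concrete, here is a minimal sketch of a z-score filter in pandas; the readings series, its values, and the threshold of 3 are hypothetical, not from the original article:

```python
import numpy as np
import pandas as pd

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask marking values whose |z-score| exceeds the threshold."""
    z = (series - series.mean()) / series.std(ddof=0)
    return z.abs() > threshold

# Hypothetical meter readings with one injected anomaly
rng = np.random.default_rng(0)
readings = pd.Series(rng.normal(10, 0.5, size=500))
readings.iloc[100] = 55.0

mask = zscore_outliers(readings)
print(readings[mask])  # only the injected 55.0 reading is flagged
```

A simple rule like this is easy to justify to stakeholders, and the threshold can be adjusted once domain knowledge suggests what "abnormal" really means.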

If you do decide to adopt a machine learning approach, prioritize improving data quality first, including data cleaning, feature selection, and feature engineering. Don't obsess over hyperparameter tuning: no amount of tuning compensates for poor data quality. However large the search space and however long the optimization runs, the gains are limited if the underlying data is poor.

Some specific measures to optimize data quality include:

  • Use methods such as z-score or IQR to remove outliers
  • Fill in missing values (imputation) with the previous value or the median, provided there are not too many of them
  • Add new features to the model, such as combinations of time features in time series data or indicators for unexpected events
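
Putting these measures together, here is a rough pandas sketch; it assumes a time-indexed DataFrame with a hypothetical consumption column, and the column names and IQR fences are illustrative:

```python
import numpy as np
import pandas as pd

def clean_timeseries(df: pd.DataFrame, value_col: str = "consumption") -> pd.DataFrame:
    """Illustrative cleaning pipeline: IQR outlier masking, imputation, time features."""
    out = df.copy()

    # 1. Mask readings outside the 1.5 * IQR fences as missing
    q1, q3 = out[value_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    out.loc[(out[value_col] < lower) | (out[value_col] > upper), value_col] = np.nan

    # 2. Impute the (assumed sparse) gaps with the previous value
    out[value_col] = out[value_col].ffill()

    # 3. Derive simple time features from the DatetimeIndex
    out["hour"] = out.index.hour
    out["day_of_week"] = out.index.dayofweek
    out["month"] = out.index.month
    return out
```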

At the heart of data thinking is prioritizing data quality and explainability, and avoiding over-reliance on complex models and over-tuned hyperparameters. Once data quality is assured, model optimization pays off far more for far less effort.

Business Thinking: Domain Knowledge

Data science exists in almost any field, such as energy, finance, marketing, social media, and food, to name a few. This means that your skills can have a profound impact in a myriad of areas. Domain knowledge refers to expertise or understanding of a particular domain or topic.

Take the energy industry, for example: suppose you are a data scientist who wants to build a model that predicts a building's electricity consumption. How do you know which features to use? You need to understand which variables typically affect electricity use, such as:

  • Temperature
  • Hour of the day
  • Day of the week
  • Month of the year

These are a good starting point, but a deeper understanding is needed. For example, understand the type of building: is it commercial, industrial/manufacturing, or residential? This will affect the building's response to the above variables.

Commercial buildings are typically busiest during standard business hours (Monday to Friday, 9 a.m. to 5 p.m.), so a binary feature such as "business hours" or "weekend" can be added; holidays also need to be taken into account. Manufacturing plants may operate on different hours and days, and residential buildings follow different schedules and are affected by holidays in different ways.
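
As a rough sketch of such calendar features for a commercial building (the function name and the 9-to-5 assumption are ours, and a DatetimeIndex is assumed):

```python
import pandas as pd

def add_calendar_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add binary calendar features for a typical commercial building."""
    out = df.copy()
    hour = out.index.hour
    dow = out.index.dayofweek  # Monday=0 ... Sunday=6

    out["is_weekend"] = (dow >= 5).astype(int)
    out["is_business_hours"] = ((dow < 5) & (hour >= 9) & (hour < 17)).astype(int)
    return out
```

For a manufacturing plant or a residential building the same skeleton holds, but the hour and day masks would be defined differently, which is exactly where domain knowledge comes in.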

Domain knowledge not only helps you build the model the first time, it also guides the final deliverables. For example, when detecting outliers it is common to treat any value with a z-score greater than 3 or less than -3 as an outlier. However, domain knowledge and regular interactions with customers taught us that only abnormally high meter readings matter, not abnormally low ones. We therefore set the initial z-score threshold to 3 and treat only values above 3 as outliers.
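
A minimal sketch of this one-sided rule, again on a hypothetical readings series:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
readings = pd.Series(rng.normal(10, 0.5, size=500))  # hypothetical meter readings
readings.iloc[42] = 60.0                             # one abnormally high reading

z = (readings - readings.mean()) / readings.std(ddof=0)
high_outliers = readings[z > 3]  # readings with z < -3 are deliberately ignored
```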

Understanding the context of the problem is a prerequisite for building and training an effective model. Growing your domain knowledge means developing an intuition for solving problems in your industry.

Iterative Thinking: The Iterative Cycle of Data Science


Data scientists must understand that most of their work is iterative and cyclical. That is why it is called the machine learning lifecycle rather than a linear machine learning process. Typically you will develop a model, test it, perhaps even deploy it, and after a few weeks of running in production you will find yourself back in development. That is not a failure; it is the norm for data science projects.

To be successful, you need to build a system that not only allows iterative development but actively encourages it. Data scientists do this by running experiments. The basic process is to run the model under different conditions (for example, different features, different preprocessing, scaled versus unscaled data) and compare performance. The goal is to find the best model to deploy.

Running experiments can be tedious, especially if you keep rerunning a notebook and copying and pasting results into a table to compare them one by one. Manually recording every important piece of information takes a lot of time and effort, and it covers not only metrics such as MSE, R², and MAPE, but often also:

  • The size of the training and test sets (for time series data, start and end dates are also recorded)
  • Date/time when the model was trained
  • Feature importances or coefficients
  • Model files
  • Training and testing data files
  • Other visualizations such as residual plots, line/scatter plots of the data

Sure, you could store all of this manually in different places, but who wants to do that? If you change the model five times, it quickly becomes confusing.

There are many platforms built specifically for tracking data science experiments, and MLflow is one of the most popular. SmartNotebook is a notebook-based data science platform for data scientists, with an intuitive interface, a good user experience, and SQL and Python integration; it can track files, charts, and many other forms of model metadata and artifacts. You can see the order in which each experiment was run and compare them.
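
As an illustration, a minimal MLflow run might look like the following; the model, parameters, and synthetic dataset are ours, and we assume mlflow and scikit-learn are installed with a local tracking store:

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Tiny synthetic dataset so the example runs end to end
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="ridge-baseline"):
    alpha = 1.0
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("train_size", len(X_train))

    model = Ridge(alpha=alpha).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    mlflow.log_metric("mse", mse)

    # The trained model is stored as an artifact alongside the run's parameters and metrics
    mlflow.sklearn.log_model(model, "model")
```

Each run then appears in the tracking UI with its parameters, metrics, and artifacts, so comparing five model variants no longer means copying numbers into a spreadsheet.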

Engineering Thinking: Software Engineering Principles in Data Science


Data science is not just about analysis and modeling; it is also about deploying solutions into production effectively. That requires engineering thinking: following software engineering best practices to ensure the solution is scalable, maintainable, and reliable.

Scalability: In data science projects, scalability refers to the ability of a system to handle increasing data volumes and user demands. To achieve scalability, data scientists should consider the following:

  • Modular code: Divide code into small, reusable modules for reuse in different projects.
  • Distributed computing: Leverage distributed computing frameworks, such as Apache Spark, to process large-scale datasets (see the sketch after this list).
  • Cloud services: Adopt cloud services so that you can dynamically adjust your computing resources to meet changing demands.
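
To illustrate the distributed-computing point, here is a hypothetical PySpark aggregation; the dataset path and column names (timestamp, building_id, kwh) are made up for the sketch:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("meter-aggregation").getOrCreate()

# Read a large (hypothetical) meter-readings dataset and aggregate it per building and day
readings = spark.read.parquet("data/meter_readings/")
daily = (
    readings
    .withColumn("date", F.to_date("timestamp"))
    .groupBy("building_id", "date")
    .agg(F.sum("kwh").alias("daily_kwh"))
)
daily.write.mode("overwrite").parquet("data/daily_kwh/")
```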

Maintainability: Data science projects need to be constantly updated and optimized, so maintainability is critical. Here are some best practices for maintainability:

  • Code versioning: Use a version control system such as Git to manage code changes and ensure smooth team collaboration.
  • Documentation: Write detailed documentation of code and data flows to enable team members to quickly understand and take on projects.
  • Test-driven development: Write unit tests and integration tests to ensure the stability and reliability of your code (a test sketch follows this list).
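
A short, hypothetical pytest example in that spirit, assuming the cleaning function from the data-thinking section lives in a preprocessing module of the project:

```python
# test_preprocessing.py
import pandas as pd

from preprocessing import clean_timeseries  # hypothetical module in this project

def test_clean_timeseries_removes_extreme_outliers():
    idx = pd.date_range("2024-01-01", periods=6, freq="h")
    df = pd.DataFrame({"consumption": [10.0, 11.0, 10.5, 500.0, 9.8, 10.2]}, index=idx)

    cleaned = clean_timeseries(df)

    assert 500.0 not in cleaned["consumption"].values
    assert {"hour", "day_of_week", "month"} <= set(cleaned.columns)
```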

Reliability: The reliability of a data science solution in a production environment has a direct impact on business stability and user experience. Here are some ways to improve reliability:

  • Monitoring and alerting: Set up monitoring and alerting so that problems are detected and resolved promptly and the system keeps running.
  • Error handling: Write robust error-handling code that deals with all kinds of exceptions and prevents system crashes (see the sketch after this list).
  • Automated deployment: Reduce human error by automating the deployment of code to production using CI/CD (continuous integration and continuous deployment) tools such as Jenkins or GitLab CI.
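
As a sketch of defensive error handling around a scoring call (the function and logger names are hypothetical):

```python
import logging
import time

logger = logging.getLogger("scoring_service")

def predict_with_retries(model, features, retries: int = 3, delay_seconds: float = 1.0):
    """Retry transient failures and log them instead of letting the service crash."""
    for attempt in range(1, retries + 1):
        try:
            return model.predict(features)
        except Exception:
            logger.exception("Prediction attempt %d/%d failed", attempt, retries)
            if attempt == retries:
                raise  # surface the error so monitoring and alerting can pick it up
            time.sleep(delay_seconds)
```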

The application of engineering thinking in a data science project can significantly increase the success and impact of the project. By following software engineering best practices, data scientists can build efficient, stable, and scalable solutions that create greater value for the business. Mastering engineering thinking not only helps solve current problems, but also lays a solid foundation for future innovation and development.

This article has explored the four mindsets that data scientists must have: data thinking, business thinking, iterative thinking, and engineering thinking. From prioritizing data quality and mastering domain knowledge to iterative development and applying software engineering principles, it shows how to solve complex problems effectively and deploy efficient solutions. Each mindset matters in a data-driven world and, applied well, helps data scientists succeed across fields.
