
Data Scientists: The Four Indispensable Mindsets

Author: Synod Data Storytelling

In the modern data-driven world, data scientists play a vital role. They need to be proficient not only in machine learning, statistics, and data visualization, but also in solving complex problems. However, what really sets data scientists apart is their way of thinking. Just as software engineers must master not only programming languages and tools but also problem-solving methods and mindsets, data scientists need to keep a few principled approaches in mind.

Data thinking, business thinking, iterative thinking, and engineering thinking: these four core mindsets are essential skills for every data scientist. This article examines each of them to help you go further and be more successful on your path through data science.

Data Thinking: Prioritize Data and Data Quality


In the field of data science, understanding and prioritizing the data is crucial. A common mistake among data science newcomers, and among non-technical people working with data scientists, is to focus too much on the model, for example:

  • Choosing the most complex model
  • Over-tuning hyperparameters
  • Trying to solve every data problem with machine learning

The fields of data science and machine learning evolve rapidly, with new libraries, faster technologies, and better models emerging all the time. However, the most sophisticated and up-to-date option is not always the best one. There are many factors to consider when choosing a model, including whether machine learning is needed at all.

A common task is outlier detection, and choosing the right method matters. Although complex machine learning methods may seem appealing, simpler methods are often more effective. The z-score, for example, is widely used for outlier detection: it captures most of the relevant anomalies, it is highly interpretable and easy to explain to stakeholders without a technical background, and it is computationally cheap, requiring little extra memory or compute.
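
To make this concrete, here is a minimal sketch of a z-score filter in pandas; the readings series, its values, and the threshold of 3 are hypothetical, not from the original article:

```python
import numpy as np
import pandas as pd

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask marking values whose |z-score| exceeds the threshold."""
    z = (series - series.mean()) / series.std(ddof=0)
    return z.abs() > threshold

# Hypothetical meter readings with one injected anomaly
rng = np.random.default_rng(0)
readings = pd.Series(rng.normal(10, 0.5, size=500))
readings.iloc[100] = 55.0

mask = zscore_outliers(readings)
print(readings[mask])  # only the injected 55.0 reading is flagged
```

A simple rule like this is easy to justify to stakeholders, and the threshold can be adjusted once domain knowledge suggests what "abnormal" really means.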

If you do decide to adopt a machine learning approach, prioritize improving data quality first, including data cleaning, feature selection, and feature engineering. Don't obsess over hyperparameter tuning: no amount of tuning compensates for poor data quality. However large the search space and however long the optimization runs, the gains are limited if the underlying data is poor.

Some specific measures to optimize data quality include:

  • Use methods such as z-score or IQR to remove outliers
  • Fill in missing values (imputation) with the previous value or the median, provided there are not too many of them
  • Add new features to the model, such as combinations of time features in time series data or indicators for unexpected events
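
Putting these measures together, here is a rough pandas sketch; it assumes a time-indexed DataFrame with a hypothetical consumption column, and the column names and IQR fences are illustrative:

```python
import numpy as np
import pandas as pd

def clean_timeseries(df: pd.DataFrame, value_col: str = "consumption") -> pd.DataFrame:
    """Illustrative cleaning pipeline: IQR outlier masking, imputation, time features."""
    out = df.copy()

    # 1. Mask readings outside the 1.5 * IQR fences as missing
    q1, q3 = out[value_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    out.loc[(out[value_col] < lower) | (out[value_col] > upper), value_col] = np.nan

    # 2. Impute the (assumed sparse) gaps with the previous value
    out[value_col] = out[value_col].ffill()

    # 3. Derive simple time features from the DatetimeIndex
    out["hour"] = out.index.hour
    out["day_of_week"] = out.index.dayofweek
    out["month"] = out.index.month
    return out
```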

At the heart of data thinking is prioritizing data quality and explainability, and avoiding over-reliance on complex models and over-tuned hyperparameters. Once data quality is assured, model optimization pays off far more for far less effort.

Business Thinking: Domain Knowledge

Data science exists in almost any field, such as energy, finance, marketing, social media, and food, to name a few. This means that your skills can have a profound impact in a myriad of areas. Domain knowledge refers to expertise or understanding of a particular domain or topic.

Take the energy industry, for example: suppose you are a data scientist who wants to build a model that predicts a building's electricity consumption. How do you know which features to use? You need to understand which variables typically affect electricity use, such as:

  • Temperature
  • Hour of the day
  • Day of the week
  • Month of the year

These are a good starting point, but a deeper understanding is needed. For example, understand the type of building: is it commercial, industrial/manufacturing, or residential? This will affect the building's response to the above variables.

Commercial buildings are typically busiest during standard business hours (Monday to Friday, 9 a.m. to 5 p.m.), so a binary feature such as "business hours" or "weekend" can be added; holidays also need to be taken into account. Manufacturing plants may operate on different hours and days, and residential buildings follow different schedules and are affected by holidays in different ways.
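
As a rough sketch of such calendar features for a commercial building (the function name and the 9-to-5 assumption are ours, and a DatetimeIndex is assumed):

```python
import pandas as pd

def add_calendar_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add binary calendar features for a typical commercial building."""
    out = df.copy()
    hour = out.index.hour
    dow = out.index.dayofweek  # Monday=0 ... Sunday=6

    out["is_weekend"] = (dow >= 5).astype(int)
    out["is_business_hours"] = ((dow < 5) & (hour >= 9) & (hour < 17)).astype(int)
    return out
```

For a manufacturing plant or a residential building the same skeleton holds, but the hour and day masks would be defined differently, which is exactly where domain knowledge comes in.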

Domain knowledge not only helps you build the model the first time, it also guides the final deliverables. For example, when detecting outliers it is common to treat any value with a z-score greater than 3 or less than -3 as an outlier. However, domain knowledge and regular interactions with customers taught us that only abnormally high meter readings matter, not abnormally low ones. We therefore set the initial z-score threshold to 3 and treat only values above 3 as outliers.
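
A minimal sketch of this one-sided rule, again on a hypothetical readings series:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
readings = pd.Series(rng.normal(10, 0.5, size=500))  # hypothetical meter readings
readings.iloc[42] = 60.0                             # one abnormally high reading

z = (readings - readings.mean()) / readings.std(ddof=0)
high_outliers = readings[z > 3]  # readings with z < -3 are deliberately ignored
```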

Understanding the context of the problem is a prerequisite for building and training an effective model. Growing your domain knowledge means developing an intuition for solving problems in your industry.

Iterative Thinking: The Iterative Cycle of Data Science


Data scientists must understand that most of their work is iterative and cyclical. That is why it is called the machine learning lifecycle rather than a linear machine learning process. Typically you will develop a model, test it, perhaps even deploy it, and after a few weeks of running in production you will find yourself back in development. That is not a failure; it is the norm for data science projects.

To be successful, you need to build a system that not only allows iterative development but actively encourages it. Data scientists do this by running experiments. The basic process is to run the model under different conditions (for example, different features, different preprocessing, scaled versus unscaled data) and compare performance. The goal is to find the best model to deploy.

Running experiments can be tedious, especially if you keep rerunning a notebook and copying and pasting results into a table to compare them one by one. Manually recording every important piece of information takes a lot of time and effort, and it covers not only metrics such as MSE, R², and MAPE, but often also:

  • The size of the training and test sets (for time series data, start and end dates are also recorded)
  • Date/time when the model was trained
  • Feature importances or coefficients
  • Model files
  • Training and testing data files
  • Other visualizations such as residual plots, line/scatter plots of the data

Sure, you could store all of this manually in different places, but who wants to do that? If you change the model five times, it quickly becomes confusing.

There are many platforms built specifically for tracking data science experiments, and MLflow is one of the most popular. SmartNotebook is a notebook-based data science platform for data scientists, with an intuitive interface, a good user experience, and SQL and Python integration; it can track files, charts, and many other forms of model metadata and artifacts. You can see the order in which each experiment was run and compare them.
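
As an illustration, a minimal MLflow run might look like the following; the model, parameters, and synthetic dataset are ours, and we assume mlflow and scikit-learn are installed with a local tracking store:

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Tiny synthetic dataset so the example runs end to end
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="ridge-baseline"):
    alpha = 1.0
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("train_size", len(X_train))

    model = Ridge(alpha=alpha).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    mlflow.log_metric("mse", mse)

    # The trained model is stored as an artifact alongside the run's parameters and metrics
    mlflow.sklearn.log_model(model, "model")
```

Each run then appears in the tracking UI with its parameters, metrics, and artifacts, so comparing five model variants no longer means copying numbers into a spreadsheet.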

Engineering Thinking: Software Engineering Principles in Data Science


Data science is not just about analysis and modeling; it is also about deploying solutions into production effectively. That requires engineering thinking: following software engineering best practices to ensure the solution is scalable, maintainable, and reliable.

Scalability: In data science projects, scalability refers to the ability of a system to handle increasing data volumes and user demands. To achieve scalability, data scientists should consider the following:

  • Modular code: Divide code into small, reusable modules for reuse in different projects.
  • Distributed computing: Leverage distributed computing frameworks, such as Apache Spark, to process large-scale datasets (see the sketch after this list).
  • Cloud services: Adopt cloud services so that you can dynamically adjust your computing resources to meet changing demands.
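
To illustrate the distributed-computing point, here is a hypothetical PySpark aggregation; the dataset path and column names (timestamp, building_id, kwh) are made up for the sketch:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("meter-aggregation").getOrCreate()

# Read a large (hypothetical) meter-readings dataset and aggregate it per building and day
readings = spark.read.parquet("data/meter_readings/")
daily = (
    readings
    .withColumn("date", F.to_date("timestamp"))
    .groupBy("building_id", "date")
    .agg(F.sum("kwh").alias("daily_kwh"))
)
daily.write.mode("overwrite").parquet("data/daily_kwh/")
```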

Maintainability: Data science projects need to be constantly updated and optimized, so maintainability is critical. Here are some best practices for maintainability:

  • Code versioning: Use a version control system such as Git to manage code changes and ensure smooth team collaboration.
  • Documentation: Write detailed documentation of code and data flows to enable team members to quickly understand and take on projects.
  • Test-driven development: Write unit tests and integration tests to ensure the stability and reliability of your code (a test sketch follows this list).
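
A short, hypothetical pytest example in that spirit, assuming the cleaning function from the data-thinking section lives in a preprocessing module of the project:

```python
# test_preprocessing.py
import pandas as pd

from preprocessing import clean_timeseries  # hypothetical module in this project

def test_clean_timeseries_removes_extreme_outliers():
    idx = pd.date_range("2024-01-01", periods=6, freq="h")
    df = pd.DataFrame({"consumption": [10.0, 11.0, 10.5, 500.0, 9.8, 10.2]}, index=idx)

    cleaned = clean_timeseries(df)

    assert 500.0 not in cleaned["consumption"].values
    assert {"hour", "day_of_week", "month"} <= set(cleaned.columns)
```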

Reliability: The reliability of a data science solution in a production environment has a direct impact on business stability and user experience. Here are some ways to improve reliability:

  • Monitoring and alerting: Set up monitoring and alerting so that problems are detected and resolved promptly and the system keeps running.
  • Error handling: Write robust error-handling code that deals with all kinds of exceptions and prevents system crashes (see the sketch after this list).
  • Automated deployment: Reduce human error by automating the deployment of code to production using CI/CD (continuous integration and continuous deployment) tools such as Jenkins or GitLab CI.
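
As a sketch of defensive error handling around a scoring call (the function and logger names are hypothetical):

```python
import logging
import time

logger = logging.getLogger("scoring_service")

def predict_with_retries(model, features, retries: int = 3, delay_seconds: float = 1.0):
    """Retry transient failures and log them instead of letting the service crash."""
    for attempt in range(1, retries + 1):
        try:
            return model.predict(features)
        except Exception:
            logger.exception("Prediction attempt %d/%d failed", attempt, retries)
            if attempt == retries:
                raise  # surface the error so monitoring and alerting can pick it up
            time.sleep(delay_seconds)
```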

The application of engineering thinking in a data science project can significantly increase the success and impact of the project. By following software engineering best practices, data scientists can build efficient, stable, and scalable solutions that create greater value for the business. Mastering engineering thinking not only helps solve current problems, but also lays a solid foundation for future innovation and development.

This article has explored the four mindsets that data scientists must have: data thinking, business thinking, iterative thinking, and engineering thinking. From prioritizing data quality and mastering domain knowledge to iterative development and applying software engineering principles, it shows how to solve complex problems effectively and deploy efficient solutions. Each mindset matters in a data-driven world and, applied well, helps data scientists succeed across fields.
