[001] 7 Simple Steps to Ace the Data Science “Take-Home Assignment”!

1. Introduction
2. Exploratory Data Analysis
3. Baseline — Simple Model
4. Feature Engineering
5. Complex Model
6. Next Steps
7. Conclusion

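“Haha, many of you may find this column new, and so do I, so here is a brief introduction. Every article in this column comes from an outstanding blog post published abroad, hand-picked by 阿Sam, and covers topics such as machine learning, big data, data mining, and Python. Each article follows the “small but refined” principle: short in length but rich in substance. At the start of each article, 阿Sam adds his own “reader's take” to give you an overview of the whole piece. Thank you, and please keep following along.”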

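This article covers the seven essential steps of a fairly simple data science project and what each step must accomplish, giving newcomers a clear and concise picture of the work involved. It also reminds us that real-world data science work is not only about building a good model. More important is developing the ability to tackle complex problems within a structured framework, and to describe and explain the problem clearly. This article is well worth working through!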

Congratulations, you passed the initial interview and have moved on to the data science project! The recruiter has given you a set of extremely vague instructions at this point. You might find yourself at 2 am staring at the computer, resembling the man in the photo above.

“The project should only take a few hours to complete, but we will give you a week to hand it in.”

This part of the interview process should not be this difficult. I am here to help you pass the data challenge and successfully make it to the next part of the interview process.

Remember, they do not want you to build the most successful model; they have large teams that spend months creating viable models for the problem at hand. They are more concerned with how you approach a complex problem, and with your code, comments, explanation, and organization.

Below are the 7 steps to follow in your data science project.

1. Introduction

Start with why. Why would this project be important and who benefits from a well-defined model and solution? Why does the company care?

Take the opportunity to showcase your ability to identify the business need behind the data challenge. Usually, the company will give you a topic related to their business. This is your chance as a Data Scientist to tie the project to a meaningful story.

Example (Fraud Case Model):

“As a credit card provider, it is important to have safeguards in place to battle fraud. For the 2019 fiscal year, fraudulent transactions accounted for $1.5 million [gathered from the data]. Not only is there a monetary loss, but there is also an impact on our customers’ trust in the company to protect them …”

2. Exploratory Data Analysis

Explore Explore Explore!

Remember that a significant amount of your time on the job will be spent wrangling data. Use this section to create some graphs and draw insights from the data.

Some graphs to consider:

  • Histogram (Fraudulent vs. Non-Fraudulent Transactions)
  • Correlation Heat Map
  • Average Transaction Over Time
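
As a rough sketch, the numbers behind these three graphs can be computed with pandas before any plotting. The DataFrame below and its column names (`date`, `amount`, `is_fraud`) are hypothetical stand-ins, not part of any real dataset.

```python
import pandas as pd

# Hypothetical transactions; column names and values are illustrative only.
tx = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-05", "2019-01-20", "2019-02-03", "2019-02-17"]),
    "amount": [20.0, 500.0, 35.0, 45.0],
    "is_fraud": [0, 1, 0, 0],
})

# Histogram input: summary of transaction amounts split by class.
amount_by_class = tx.groupby("is_fraud")["amount"].describe()

# Correlation heat map input: pairwise correlations of the numeric columns.
corr = tx[["amount", "is_fraud"]].corr()

# Average transaction over time: mean amount per calendar month.
monthly_avg = tx.groupby(tx["date"].dt.to_period("M"))["amount"].mean()

print(amount_by_class, corr, monthly_avg, sep="\n\n")
```

Each of these objects can then be handed to whichever plotting library you prefer.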

Always remember to explore missing data. Missing information is part of the process, and being able to identify it is crucial. Things to look for:

  • NA, NaN, Null
  • Empty String “ ”
  • The word “Missing” or “Blank”
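
One possible way to check for all three kinds of missing values at once is sketched below with pandas. The sample DataFrame, its column names, and the `PLACEHOLDERS` set are assumptions made for illustration; a real dataset may use different placeholder words.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names are illustrative only.
df = pd.DataFrame({
    "amount": [12.5, np.nan, 40.0, 7.25],
    "merchant": ["Acme", " ", "Missing", None],
})

# Strings that often stand in for missing values (assumed, not exhaustive).
PLACEHOLDERS = {"", "missing", "blank", "na", "nan", "null"}

def count_missing(frame: pd.DataFrame) -> pd.Series:
    """Count true nulls plus placeholder strings in each column."""
    counts = {}
    for col in frame.columns:
        series = frame[col]
        is_null = series.isna()
        # Normalize to trimmed, lower-case text so " " and "Missing" are caught.
        as_text = series.astype(str).str.strip().str.lower()
        is_placeholder = as_text.isin(PLACEHOLDERS)
        counts[col] = int((is_null | is_placeholder).sum())
    return pd.Series(counts)

missing_counts = count_missing(df)
print(missing_counts)
```

Reporting these counts per column in the EDA section shows the reviewer that disguised missing values were looked for, not just the ones pandas flags by default.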

This is not the time to create new features. Instead, use this section to set the foundation.

3. Baseline — Simple Model

How will we determine whether our model is successful? Success is a relative term: a model is successful only when compared to the current process. For instance, a model with 95% accuracy adds no value if the current method already achieves 97%, whereas a model with only 65% accuracy does add value if the current method achieves 55%.

This section will be used to create a baseline model. This model will be the foundation so that other complex models can be compared. Since this is your “first try” at the problem, it does not need to be overly complex or fine-tuned. You can preface this section by stating that you will throw the data at a model (simple model) and see how well it does.

Explain, Explain, Explain!

When choosing the baseline model (or any model), you must explain why you chose that modeling technique. There is no correct answer, but having an answer will demonstrate that you thought about the problem before modeling.

An acceptable answer is:

“This model is both simple and intuitive and will be a good baseline before moving on to more complex models. Its simplicity is an advantage when interpreting the model for business stakeholders.”

Finally, display the success metric (RMSE, Precision/Recall, Confusion Matrix, ROC curve) and (once again) explain it. Do not be alarmed if the model is not performing well.
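
A minimal sketch of such a baseline section with scikit-learn is shown below. It assumes a synthetic, imbalanced dataset from `make_classification` in place of real fraud data, and it uses a majority-class `DummyClassifier` as a stand-in for the "current process"; both are illustrative choices, not the only reasonable ones.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for a fraud dataset (roughly 10% positives).
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Stand-in for the current process: always predict the majority class.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Baseline: logistic regression with near-default settings, no tuning.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)

baseline_cm = confusion_matrix(y_test, baseline_pred)
print("Majority-class accuracy:", dummy.score(X_test, y_test))
print("Baseline accuracy:", baseline.score(X_test, y_test))
print("Precision:", precision_score(y_test, baseline_pred, zero_division=0))
print("Recall:", recall_score(y_test, baseline_pred, zero_division=0))
print(baseline_cm)
```

On imbalanced data the majority-class accuracy is already high, which is why precision, recall, and the confusion matrix are printed alongside accuracy.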

4. Feature Engineering

You can now begin to create some new features from the existing data. As a reminder, they are not looking for some abstract, world-beating feature. Create a handful of features that you believe are important and explain why you think they will help improve performance. Some of these features will stem from the EDA section, where you gained insight into the dataset.

Continuing on our fraud example here are two features:

  • Over_Avg: Is the Current transaction > Average transaction?
  • Credit_Util: Card Balance / Credit Limit
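
These two features could be built with pandas roughly as follows. The column names (`card_id`, `amount`, `card_balance`, `credit_limit`) are hypothetical, and averaging per card is an assumption; the average could equally be taken over all transactions.

```python
import pandas as pd

# Hypothetical transactions; column names and values are illustrative only.
tx = pd.DataFrame({
    "card_id": ["A", "A", "A", "B", "B"],
    "amount": [10.0, 20.0, 90.0, 50.0, 50.0],
    "card_balance": [100.0, 120.0, 210.0, 400.0, 450.0],
    "credit_limit": [1000.0, 1000.0, 1000.0, 500.0, 500.0],
})

# Over_Avg: is this transaction larger than the card's average transaction?
# (Per-card averaging is an assumption made for this sketch.)
card_avg = tx.groupby("card_id")["amount"].transform("mean")
tx["Over_Avg"] = (tx["amount"] > card_avg).astype(int)

# Credit_Util: card balance divided by credit limit.
tx["Credit_Util"] = tx["card_balance"] / tx["credit_limit"]

print(tx[["card_id", "amount", "Over_Avg", "Credit_Util"]])
```

In a real submission the average would normally be computed from past transactions only, so that the feature does not leak information from the transaction being scored.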

5. Complex Model

Up to this point, you have set the foundation for the business problem. Now you are at a crossroads: you can augment the baseline model by tuning the hyperparameters and adding the new features, or you can use a different model (e.g., changing from logistic regression to a random forest). Whichever choice you make, be sure to explain the reasoning behind it.

Once again highlight the success metric and explain the results. Fingers crossed that the model is performing better than the baseline.
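
The comparison might look something like the sketch below, which fits a logistic regression and a random forest on the same split and reports one shared metric. The synthetic data and the choice of ROC AUC are assumptions for illustration; use whichever success metric the baseline section reported.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for a fraud dataset (roughly 10% positives).
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Fit each model on the same training split and score it on the same test split.
auc_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    fraud_probability = model.predict_proba(X_test)[:, 1]
    auc_scores[name] = roc_auc_score(y_test, fraud_probability)
    print(f"{name}: ROC AUC = {auc_scores[name]:.3f}")
```

Keeping the split and the metric identical across models is what makes the comparison with the baseline meaningful.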

6. Next Steps

Almost there, just hold on a few hours longer! You are probably tempted to go back to step 5 and tweak the model some more to get the “perfect” results, but there is no such thing as perfect in this assignment. Instead, outline what you would do as a next step if you spent more time on this assignment.

Things to consider writing about:

  • Why did your model fail — Is there evidence to suggest that you overfit?
  • Additional hyperparameter tuning — Use cross-validation to search a grid.
  • Additional features you would like to develop
  • Balance the dataset so that each class is better represented

As you are outlining the next step, also include a brief description of how you think it would increase model performance.
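
Two of these next steps, a cross-validated grid search and class rebalancing, can be sketched with scikit-learn as below. The data is synthetic, the parameter grid is deliberately tiny, and `class_weight="balanced"` reweights the classes rather than resampling the dataset; all three are assumptions chosen to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for a fraud dataset (roughly 10% positives).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

# A deliberately small, illustrative grid; a real search would cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    # class_weight="balanced" weights each class inversely to its frequency.
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="f1",  # F1 is less misleading than accuracy on imbalanced classes
    cv=3,          # 3-fold cross-validation for every grid point
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```

Even if you do not have time to run such a search, describing it at this level of detail shows the reviewer what you would try next and why.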

7. Conclusion

Remember that you are telling a data story. Reiterate why this project is important and the problem you want to solve. Summarize the insight you gathered in the exploratory section and how well the model performed. Lastly, explain how your model will solve the problem.

Before you send out that project, be sure to do some last-minute tasks: proofread your presentation, and make sure the code is well organized and commented throughout.

Good Luck!