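Haha, this column may be new to many of you, and it is new to me too, so let me introduce it briefly. Every article in this column comes from an outstanding blog post published abroad, hand-picked by 阿Sam. The main topics are machine learning, big data, data mining, and Python. Each article follows a “small but refined” principle: short in length but rich in substance. At the start of every article, 阿Sam will also add some “reader's notes” so that you can get an overview of the whole piece. Thank you, and please keep following along.
This article covers the 7 essential steps of a relatively simple data science project and what each step must include. It gives people who are new to data science projects a clear, concise picture of the work involved. It also reminds us that, in real work, a data science project is not only about building a good model. More important is developing and practicing the ability to handle complex problems in a structured way, and to describe and explain the problem clearly. This article is well worth digging into!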
Congratulations, you passed the initial interview and have moved on to the data science project! The recruiter has given you a set of extremely vague instructions at this point. You might find yourself at 2 am staring at the computer, resembling the man in the photo above.
“The project should only take a few hours to complete, but we will give you a week to hand it in.”
This part of the interview process should not be this difficult. I am here to help you pass the data challenge and successfully make it to the next part of the interview process.
Remember that they do not want you to build the most successful model; they already have substantial resources devoted, for months at a time, to building viable models for the problem at hand. They are more concerned with how you approach a complex problem: your code, comments, explanation, and organization.
Below are the 7 steps to follow in your data science project.
1. Introduction
Start with why. Why would this project be important and who benefits from a well-defined model and solution? Why does the company care?
Take the opportunity to showcase your ability to identify the business need behind the data challenge. Usually, the company will give you a topic related to its business. This is your chance as a Data Scientist to tie the project to a meaningful story.
Example (Fraud Case Model):
“As a credit card provider, it is important to have safeguards in place to battle fraud. For the 2019 fiscal year, fraudulent transactions have accounted for $1.5 million [gathered from the data]. Not only is there a monetary loss, but there is also an impact on our customers’ trust in the company to protect them …”
2. Exploratory Data Analysis
Explore Explore Explore!
Remember that a significant amount of your time (on the job) will be spent wrangling data. Use this section to create some graphs and draw insight from the data.
Some graphs to consider:
- Histogram (Fraudulent vs. Non-Fraudulent Transactions)
- Correlation Heat Map
- Average Transaction over time.
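Before plotting anything, it can help to compute the numbers behind the graphs. Here is a minimal sketch using only the Python standard library. The `transactions` sample, its column layout, and both helper names are made up for illustration and are not part of any real dataset:

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical sample rows: (month, amount, is_fraud). Values are invented.
transactions = [
    ("2019-01", 20.0, False),
    ("2019-01", 40.0, False),
    ("2019-02", 500.0, True),
    ("2019-02", 100.0, False),
]


def class_counts(rows):
    """Count fraudulent vs. non-fraudulent rows (the data behind a histogram)."""
    return Counter("fraud" if is_fraud else "non_fraud" for _, _, is_fraud in rows)


def average_by_month(rows):
    """Average transaction amount per month (the data behind a time plot)."""
    by_month = defaultdict(list)
    for month, amount, _ in rows:
        by_month[month].append(amount)
    return {month: mean(amounts) for month, amounts in sorted(by_month.items())}


print(class_counts(transactions))
print(average_by_month(transactions))
```

The resulting dictionaries can then be handed to whichever plotting library you prefer.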
Always remember to explore missing data. Missing information is part of the process, and being able to identify it is crucial. Things to look for:
- NA, NaN, Null
- Empty String “ ”
- The word “Missing” or “Blank”
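The checks listed above could be sketched roughly as follows, using only the standard library. The `MISSING_TOKENS` set and the sample `rows` are assumptions for illustration, so adjust them to whatever placeholders your dataset actually uses:

```python
import math

# Placeholder strings treated as missing. This set is an assumption.
MISSING_TOKENS = {"", "na", "nan", "null", "none", "missing", "blank"}


def is_missing(value):
    """Return True if a value looks like missing data."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str):
        # Whitespace-only strings collapse to "" after strip().
        return value.strip().lower() in MISSING_TOKENS
    return False


def missing_counts(rows):
    """Count missing-looking values per column for a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if is_missing(value) else 0)
    return counts


# Hypothetical sample rows, invented for illustration.
rows = [
    {"amount": 12.5, "merchant": "Blank"},
    {"amount": float("nan"), "merchant": " "},
    {"amount": None, "merchant": "Grocer"},
]
print(missing_counts(rows))
```

Reporting these counts per column in the write-up shows the reviewer that you looked for missing data before modeling.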
This is not the time to create new features. Instead, use this section to set the foundation.
3. Baseline — Simple Model
How will we determine that our model performance is successful? Success is a relative term. A model is successful when compared to the current process. For instance, you can build a model with 95% (65%) accuracy, but if the current method has 97% (55%) accuracy, you didn’t (did) add value.
This section will be used to create a baseline model. This model will be the foundation so that other complex models can be compared. Since this is your “first try” at the problem, it does not need to be overly complex or fine-tuned. You can preface this section by stating that you will throw the data at a model (simple model) and see how well it does.
Explain, Explain, Explain!
When choosing the baseline model (or any model), you must explain why you chose that modeling technique. There is no correct answer, but having an answer will demonstrate that you thought about the problem before modeling.
An acceptable answer is:
“This model is both simple and intuitive and will be a good baseline before moving on to more complex models. Its simplicity is an advantage when interpreting the model for business stakeholders.”
Finally, display the success metric (RMSE, Precision/Recall, Confusion Matrix, ROC curve) and (once again) explain it. Do not be alarmed if the model is not performing well.
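To make the idea of a baseline and its metrics concrete, here is a small sketch. It uses a majority-class predictor, which is a stand-in I chose for illustration and not necessarily the simple model you would pick, and it computes accuracy, precision, and recall by hand. The labels are invented:

```python
from collections import Counter


def majority_baseline(labels):
    """Predict the most common label for every row."""
    most_common = Counter(labels).most_common(1)[0][0]
    return [most_common] * len(labels)


def accuracy(actual, predicted):
    """Fraction of rows where the prediction matches the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def precision_recall(actual, predicted, positive=1):
    """Precision and recall for the positive class (0.0 when undefined)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: 1 = fraud, 0 = not fraud.
actual = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predicted = majority_baseline(actual)

print(accuracy(actual, predicted))
print(precision_recall(actual, predicted))
```

On imbalanced data like this, the baseline scores well on accuracy while catching no fraud at all, which is one reason to report precision and recall alongside accuracy and to explain what each number means.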
4. Feature Engineering
You can now begin to create some new features from the existing data. As a reminder, they are not looking for some abstract, world-beating feature. Create a handful of features that you believe are important and explain why you think they will help improve performance. Some of these features will stem from the EDA section, where you gained insight into the dataset.
Continuing with our fraud example, here are two features:
- Over_Avg: Is the Current transaction > Average transaction?
- Credit_Util: Card Balance / Credit Limit
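The two features above could be computed along these lines. The function and parameter names are my own, and returning `None` for a non-positive credit limit is an assumption about how you might want to handle that edge case:

```python
def over_avg(current_amount, average_amount):
    """Over_Avg: is the current transaction larger than the average transaction?"""
    return current_amount > average_amount


def credit_util(card_balance, credit_limit):
    """Credit_Util: card balance divided by credit limit.

    Returns None when the limit is zero or negative, since the ratio is
    undefined there (an assumption; pick whatever rule suits your data).
    """
    if credit_limit <= 0:
        return None
    return card_balance / credit_limit


# Hypothetical values, invented for illustration.
print(over_avg(120.0, 80.0))
print(credit_util(500.0, 2000.0))
```

Whatever features you build, a one-line justification for each goes a long way in the write-up.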
5. Complex Model
Up to this point, you have set the foundation for the business problem. Now you are at a crossroads: you can augment the baseline model by tuning the hyperparameters and adding the new features, or you can use a different model (e.g. changing from logistic regression to a random forest). Whichever choice you make, make sure to explain the reasoning behind it.
Once again highlight the success metric and explain the results. Fingers crossed that the model is performing better than the baseline.
6. Next Steps
Almost there, just hold on a few hours longer! You are probably tempted to go back to step 5 and tweak the model some more to get the “perfect” results, but there is no such thing as perfect in this assignment. Instead, outline what you would do as a next step if you spent more time on this assignment.
Things to consider writing about:
- Why did your model fail — Is there evidence to suggest that you overfit?
- Additional hyperparameter tuning — Use cross-validation to search a grid.
- Additional features you would like to develop
- Balance the dataset so that each class is better represented
As you are outlining the next step, also include a brief description of how you think it would increase model performance.
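As an example of one next step listed above, balancing the dataset, here is a minimal sketch of random oversampling with the standard library. The function name, the `is_fraud` key, and the sample rows are hypothetical, and this is only one of several ways to rebalance classes:

```python
import random
from collections import Counter


def oversample_minority(rows, label_key="is_fraud", seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    if not rows:
        return []
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap (k may be 0).
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical rows: 8 legitimate transactions and 2 fraudulent ones.
rows = [{"is_fraud": False}] * 8 + [{"is_fraud": True}] * 2
balanced = oversample_minority(rows)
print(Counter(row["is_fraud"] for row in balanced))
```

If you do describe this as a next step, it is worth noting that resampling should be applied to the training split only, so that the evaluation data keeps its real class proportions.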
7. Conclusion
Remember that you are telling a data story. Reiterate why this project is important and the problem you want to solve. Summarize the insight you gathered in the exploratory section and how well the model performed. Lastly, explain how your model will solve the problem.
Before you send out that project, be sure to do some last-minute tasks. Proofread your presentation, and make sure the code is well organized and commented throughout.
Good Luck!