
Exploratory Data Analysis (EDA) Using Decision Trees

Author: Synod Data Storytelling

The success of Decision Trees (DTs) in machine learning and data science stems from a logical flow that mimics human decision-making. The process resembles a flowchart: each node applies a simple binary split on a given variable until a final decision is reached.

In the case of buying a t-shirt, the decision may be based on variables such as budget, brand, size, and color (a code sketch of this flow follows the list):

  • If the price is more than $30, I give up the purchase; otherwise, I consider buying.
  • If the price is less than $30 and it's a brand I like, I go on to consider the size.
  • Finally, if the price is less than $30, it's a brand I like, the size fits, and the color is black, I buy it.
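
Written as code, this flow is just a set of nested conditions. The sketch below is purely illustrative: the function name, the set of liked brands, and the $30 threshold are assumptions made for this example.

# Purely illustrative sketch of the t-shirt decision as nested conditions
def buy_tshirt(price, brand, size_fits, color):
    liked_brands = {"BrandA", "BrandB"}   # hypothetical brands I like
    if price > 30:                        # over budget: give up the purchase
        return False
    if brand not in liked_brands:         # not a brand I like
        return False
    if not size_fits:                     # wrong size
        return False
    return color == "black"               # buy only if the color is black


print(buy_tshirt(price=25, brand="BrandA", size_fits=True, color="black"))  # True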

This logic is simple and intuitive, and it works with all kinds of data. However, decision trees are very sensitive to changes in the dataset, especially when the dataset is small, and are prone to overfitting.
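
If a quick guard against overfitting is wanted even during exploration, capping the tree's growth when fitting is usually enough. The sketch below uses synthetic data, and the max_depth and min_samples_leaf values are arbitrary examples rather than recommendations.

# Sketch: constraining tree growth to reduce overfitting (synthetic data)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 1))                 # one noisy feature
y_demo = 2 * X_demo.ravel() + rng.normal(0, 1, size=100)   # linear signal plus noise

deep_dt = DecisionTreeRegressor().fit(X_demo, y_demo)      # unconstrained tree
shallow_dt = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10).fit(X_demo, y_demo)
print(deep_dt.get_n_leaves(), shallow_dt.get_n_leaves())   # far fewer leaves when constrained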

Exploratory Data Analysis (EDA) is a key stage of a data science project: it explores the dataset and its variables in depth to understand, as far as possible, which factors have the greatest impact on the target variable. In this phase, data scientists gain initial insights by examining the distribution of the data, checking for errors or missing values, and visually analyzing how each explanatory variable affects the target.

Decision trees are extremely useful during EDA because of their ability to capture even small patterns and relationships in the data. In the exploratory phase you don't need to worry much about train/test splitting or fine-tuning the algorithm; simply fitting a decision tree is often enough to gain insights.

Here's an example that explores what drives the final grade G3 in the Student Performance Data Set from the UCI repository. Start by loading the libraries and the dataset:

# Importing libraries
import pandas as pd
import seaborn as sns
sns.set_style()
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.tree import plot_tree


# Loading a dataset
from ucimlrepo import fetch_ucirepo
student_performance = fetch_ucirepo(id=320)


# data (as pandas dataframes)
X = student_performance.data.features
y = student_performance.data.targets


df = pd.concat([X,y], axis=1)
df.head(3)           
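
Before fitting anything, a few quick checks cover the basics mentioned earlier (size, missing values, distribution of the target). A minimal sketch using standard pandas calls:

# Quick sanity checks on the loaded dataframe
print(df.shape)                  # number of rows and columns
print(df.isna().sum().sum())     # total count of missing values
print(df['G3'].describe())       # distribution of the final grade G3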

The next two examples explore how different variables affect the final grade G3.

Example 1: The impact of 'failures', 'absences', and 'studytime' on G3

Build a decision tree to analyze how the number of past failures, absences, and study time affect grades. Higher grades are observed for students with no failures and a studytime score above 2.5 (in this dataset, studytime is an ordinal 1-4 code, not a number of hours).

# Columns to explore
cols = ['failures', 'absences', 'studytime']
X = df[cols]
y = df.G3


# Fit Decision Tree
dt = DecisionTreeRegressor().fit(X,y)


# Plot DT
plt.figure(figsize=(20,9))
plot_tree(dt, filled=True, feature_names=X.columns, max_depth=3, fontsize=8)           

Now we have a clear visualization of the relationships between the variables we selected. Here are the insights we can draw from this tree (a text printout of the same tree follows the list):

  • For the condition in the first row of each box, the left branch means "yes" and the right branch means "no".
  • Students with fewer failures (< 0.5, i.e. zero) had higher grades; note that the value in each box on the left is higher than the value on the right.
  • Among students with no failure record, those with a studytime score above 2.5 achieved higher grades, by almost a full point.
  • Among students with no failure record and a studytime score below 1.5, those with fewer than 22 absences had a higher final grade than those with more absences.
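
If the plotted figure is hard to read, the same splits can also be printed as indented text with scikit-learn's export_text; this is just an optional companion to the plot above.

# Optional: print the fitted tree's splits as indented text
from sklearn.tree import export_text
print(export_text(dt, feature_names=list(X.columns), max_depth=3))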

Example 2: The impact of 'freetime' and 'goout' on G3

Explore student grades through free time and how often students go out. Students who rarely go out and have little free time, as well as students who go out very frequently and have more free time, were found to have lower grades. The students with the best grades typically have a goout score above 1.5 and a freetime score between 1.5 and 2.5.

cols = ['freetime', 'goout']


# Split X & Y
X = df[cols]
y = df.G3


# Fit Decision Tree
dt = DecisionTreeRegressor().fit(X,y)


# Plot DT
plt.figure(figsize=(20,9))
plot_tree(dt, filled=True, feature_names=X.columns, max_depth=3, fontsize=10);           

The variables goout and freetime are scored on a scale of 1 (very low) to 5 (very high). Students who rarely go out (goout < 1.5) and have little free time (freetime < 1.5), as well as those who go out very often (goout > 4.5) and have more free time, had lower grades. The best results came from students with a goout score above 1.5 and a freetime score between 1.5 and 2.5.
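
A quick cross-check of this reading, independent of the tree, is to look at the mean G3 for each combination of goout and freetime, for example with a pivot table or a heatmap. A minimal sketch reusing the df, sns and plt already loaded above:

# Mean final grade for each (freetime, goout) combination
pivot = df.pivot_table(index='freetime', columns='goout', values='G3', aggfunc='mean')
print(pivot.round(1))

# The same table as a heatmap, since seaborn is already imported
sns.heatmap(pivot, annot=True, fmt='.1f', cmap='viridis')
plt.show()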

Example 3: Exploration with a classification decision tree

The same kind of exploration can be done with a classification tree. The logic and the code are the same, but the result is now a predicted class instead of a numeric value. Let's look at a simple example using another dataset, the taxis dataset shipped with Seaborn (BSD license), which contains New York City taxi trip records.

The relationship between the total trip amount and the payment method is analyzed with a classification decision tree. Lower totals are observed to be more likely to be paid in cash:

# Load the dataset
df = sns.load_dataset('taxis').dropna()


cols = ['total']
X = df[cols]
y = df['payment']


# Fit Decision Tree
dt = DecisionTreeClassifier().fit(X,y)


#Plot Tree
plt.figure(figsize=(21,9))
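# class_names must follow the order of dt.classes_ (alphabetical here: 'cash', then 'credit card')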
plot_tree(dt, filled=True, feature_names=X.columns, max_depth=3, 
          fontsize=10, class_names=['cash', 'credit_card'])           

Just by looking at the generated tree, we can see that lower totals are more likely to be paid in cash; trips totaling less than $9.32 are usually paid in cash.
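
That threshold can be sanity-checked directly against the data, for instance by comparing the distribution of total per payment method, or by asking the fitted classifier for its class probabilities at a few amounts. A minimal sketch reusing df and dt from above; the sample totals are arbitrary.

# Cross-check: distribution of trip totals per payment method
print(df.groupby('payment')['total'].describe())

# Predicted class probabilities for a few arbitrary totals (columns follow dt.classes_)
samples = pd.DataFrame({'total': [5, 9, 15, 40]})
print(dt.classes_)
print(dt.predict_proba(samples))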

Decision tree algorithms are widely used in data science and machine learning because they are easy to visualize and quick to capture patterns in the data. They let us intuitively understand complex relationships between variables, which in turn improves forecasting and decision-making.