Hello, I'm EarlGrey. I'm committed to sharing practical, substantive content, and I have translated and published technical books such as "Python Programming Without Teachers" and "Python Parallel Computing Handbook".

Source丨Internet

1. Overview of Machine Learning

1) What is machine learning

Artificial intelligence (AI) is a technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. It is a broad, general concept, and its ultimate goal is to enable computers to simulate the way people think and behave. AI began to rise around the 1950s, but progress was slow at the time because of limited data and hardware.

An article to get started with machine learning

Machine learning is a subset of artificial intelligence and one way to achieve it, but not the only way. It is the discipline that studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures to keep improving their own performance. It began to flourish around the 1980s, giving rise to a large number of machine learning models rooted in mathematical statistics.

Deep learning is a subset of machine learning inspired by the human brain. It is built on artificial neural networks (ANNs) that mimic similar structures in the brain: learning is carried out through a deep, multi-layered "network" of interconnected "neurons". The word "deep" usually refers to the number of hidden layers in the neural network. After 2012, deep learning exploded and came into wide use in many scenarios.

Let's take a look at how well-known scholars define machine learning:

Machine learning is the study of how computers can simulate human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve themselves. In practical terms, machine learning is powered by big data: various algorithms let machines perform in-depth statistical analysis of data and "self-learn", so that artificial intelligence systems gain inductive reasoning and decision-making capabilities.

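The classic spam filtering application helps explain the principle of machine learning and what T, E, and P refer to in Tom Mitchell's widely cited definition: a program learns from experience E with respect to a task T and a performance measure P if its performance on T, as measured by P, improves with E. For a spam filter, T is classifying emails as spam or not spam, E is a collection of emails already labeled as spam or not, and P is the proportion of emails classified correctly.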

2) The three elements of machine learning

The three elements of machine learning include data, models, and algorithms. The relationship between these three elements can be illustrated by the following diagram:

(1) Data

Data-driven: being data-driven means proactively collecting and analyzing data and using objective, quantitative evidence to support decisions. The opposite is being experience-driven, which is what we often call "going with your gut".

(2) Models & Algorithms

Model: in data-driven AI, a model is a hypothesis function that makes decisions based on data X. It can take different forms, such as computational or rule-based.
Algorithm: the specific computational method used to learn the model. Statistical learning starts from the training data set, selects the optimal model from the hypothesis space according to a learning strategy, and finally has to consider how to compute that optimal model. This is usually an optimization problem.
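
To make the split between model, strategy, and algorithm concrete, here is a minimal sketch. Everything in it is an assumption chosen for illustration: the model is a one-variable linear hypothesis, the strategy is mean squared error, the algorithm is plain gradient descent, and the function names `predict`, `mse`, and `fit` are hypothetical.

```python
def predict(w, b, x):
    """Model: a linear hypothesis h(x) = w * x + b."""
    return w * x + b


def mse(w, b, xs, ys):
    """Strategy: mean squared error of the hypothesis on the training data."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit(xs, ys, lr=0.05, steps=2000):
    """Algorithm: gradient descent that searches for the w, b minimizing the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the MSE with respect to w and b.
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy training data generated by y = 2x + 1 (an assumption for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit(xs, ys)
print(w, b, mse(w, b, xs, ys))
```

On this toy data the search should settle very close to w = 2 and b = 1, the parameters that generated the data.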

3) The development of machine learning

The term artificial intelligence first appeared in 1956, in the context of exploring effective solutions to a range of problems. In 1960, the U.S. Department of Defense used the concept of "neural networks" to train computers to mimic human reasoning processes. Before 2010, tech giants such as Google and Microsoft improved their machine learning algorithms, taking query accuracy to new levels. Since then, as data volumes have grown, algorithms have advanced, and computing and storage capacity have increased, machine learning has developed further.

4) Machine learning core technology

Classification: a model is trained on labeled data and then used to classify and predict new samples accurately.
Clustering: similarities and differences are identified in large amounts of data, which is then grouped into multiple categories based on the greatest commonality.
Anomaly detection: the distribution of data points is analyzed to identify outliers that differ significantly from normal data.
Regression: a model is trained on data with known attribute values to find the best-fitting parameters, and is then used to predict the output value of new samples.
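
Of the four techniques, anomaly detection is the easiest to show in a few lines. The sketch below is only one possible approach, assuming a simple z-score rule. The function name `zscore_outliers`, the threshold of 2.0, and the sample values are all hypothetical.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / sd > threshold]


# Six readings close to 10 and one that is far away.
values = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 25.0]
print(zscore_outliers(values))
```

With these values only 25.0 is flagged. On real data the threshold would need tuning, and a single extreme value can inflate the standard deviation enough to hide other outliers.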

5) Basic machine learning process

The machine learning workflow consists of several steps: data preprocessing (Processing), model learning (Learning), model evaluation (Evaluation), and new sample prediction (Prediction).

Data preprocessing: input (raw data + labels) → processing (feature processing and scaling, feature selection, dimensionality reduction, sampling) → output (training set + test set).
Model learning: model selection, cross-validation, result evaluation, and hyperparameter selection.
Model evaluation: find out how the model scores on the test set.
New sample prediction: use the trained model to make predictions, for example on the test set.
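
The four steps can be sketched end to end in plain Python. This is an illustration under stated assumptions, not a reference pipeline: the preprocessing is min-max scaling, the "model" is a nearest-centroid classifier chosen only because it is short, and the data and every function name are hypothetical.

```python
import random


def train_test_split(rows, test_ratio=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]


def fit_scaler(rows):
    """Preprocessing: learn per-feature (min, max) from the training set only."""
    columns = zip(*(features for features, _ in rows))
    return [(min(col), max(col)) for col in columns]


def scale(features, scaler):
    """Min-max scale one feature vector with the learned ranges."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, (lo, hi) in zip(features, scaler)]


def fit_centroids(rows, scaler):
    """Model learning: one centroid (mean scaled vector) per class."""
    grouped = {}
    for features, label in rows:
        grouped.setdefault(label, []).append(scale(features, scaler))
    return {label: [sum(col) / len(col) for col in zip(*vectors)]
            for label, vectors in grouped.items()}


def predict(features, scaler, centroids):
    """Prediction: pick the class whose centroid is closest."""
    z = scale(features, scaler)
    return min(centroids,
               key=lambda label: sum((a - b) ** 2 for a, b in zip(z, centroids[label])))


# Raw data + labels: two well-separated toy classes.
data = [([1.0, 10.0], "A"), ([1.5, 12.0], "A"), ([2.0, 15.0], "A"), ([2.5, 11.0], "A"),
        ([8.0, 80.0], "B"), ([8.5, 85.0], "B"), ([9.0, 90.0], "B"), ([9.5, 82.0], "B")]

train, test = train_test_split(data)                      # preprocessing output
scaler = fit_scaler(train)
centroids = fit_centroids(train, scaler)                  # model learning
accuracy = sum(predict(x, scaler, centroids) == y for x, y in test) / len(test)  # evaluation
new_label = predict([2.2, 13.0], scaler, centroids)       # new sample prediction
print(accuracy, new_label)
```

Note that the scaler is fitted on the training set only and then reused for the test set and the new sample, so no information from the test data leaks into training.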

6) Machine learning application scenarios

As a data-driven approach, machine learning has been widely used in data mining, computer vision, natural language processing, biometrics, search engines, medical diagnosis, credit card fraud detection, securities market analysis, DNA sequencing, speech and handwriting recognition, and robotics.

Intelligent medical care: intelligent prosthetics, exoskeletons, healthcare robots, surgical robots, intelligent health management, and so on.
Face recognition: access control systems, attendance systems, face recognition security doors, electronic passports and ID cards; face recognition systems can also be combined with networks to search for fugitives nationwide.
Robot control: industrial robots, robotic arms, multi-legged robots, sweeping robots, drones, and so on.

2. Basic terms for machine learning

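Supervised Learning: the training set carries label information; the learning tasks are classification and regression.
Unsupervised Learning: the training set carries no label information; the learning tasks are clustering and dimensionality reduction.
Reinforcement Learning: a way of learning in which the feedback labels are delayed and sparse.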

Example/Sample: one record in the dataset, for example one melon.
Attribute/Feature: "color", "root", and so on.
Attribute space/Sample space/Input space X: the space spanned by all the attributes.
Feature vector: the coordinate vector corresponding to each point in that space.
Label: information about the outcome of a sample. For example, in ((color = green, root = curled, knock sound = dull), good melon), "good melon" is the label.
Classification: if the value to predict is discrete, such as "good melon" or "bad melon", the learning task is called classification.
Hypothesis: the learned model corresponds to some latent law about the data.
Ground truth: the latent law itself.
Learning process: the process of finding or approximating the ground truth.
Generalization ability: the ability of the learned model to apply to new samples. In general, the more training samples there are, the more likely it is that the learned model generalizes well.

3. Classification of machine learning algorithms

1) Problem scenarios on which machine learning algorithms are based

Over the past 30 years, machine learning has developed into an interdisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, computational complexity theory, and other disciplines. Machine learning theory is chiefly concerned with designing and analyzing algorithms that allow computers to "learn" automatically. Machine learning algorithms automatically discover patterns in data and use them to make predictions on unknown data. The theory focuses on learning algorithms that are both achievable and effective. Many inference problems are intractable, so part of machine learning research is devoted to developing tractable approximate algorithms.

The most important categories of machine learning are: supervised learning, unsupervised learning, and reinforcement learning.

Supervised learning: a function is learned from a given training dataset and is then used to predict results when new data arrives. The training set must include both inputs and outputs, also called features and targets, and the targets are labeled by people. Common supervised learning algorithms include regression analysis and statistical classification.

For a summary of supervised learning models, see ShowMeAI's article "AI Knowledge and Skills Quick Reference | Machine Learning - Supervised Learning".

Unsupervised learning: In contrast to supervised learning, the training set does not have artificially labeled results. Common unsupervised learning algorithms include generative adversarial networks (GANs) and clustering.

For a summary of unsupervised learning models, see ShowMeAI's article "AI Knowledge and Skills Quick Reference | Machine Learning - Unsupervised Learning".

Reinforcement learning: the learner works out how to act through observation. Each action affects the environment, and the learner makes its decisions based on the feedback it observes from the surrounding environment.

2) Classification problems

Classification is a very important part of machine learning. Its goal is to determine, from certain features of known samples, which known class a new sample belongs to. Classification problems can be broken down as follows:
Binary classification: the task has two classes, and each new sample is assigned to one of them.
Multiclass classification: the task has more than two classes.
Multilabel classification: each sample is assigned a set of target labels.

Learn more about machine learning classification algorithms: KNN algorithm, logistic regression algorithm, naive Bayes algorithm, decision tree model, random forest classification model, GBDT model, XGBoost model, support vector machine model, and others.
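
As a small illustration of binary classification, here is a sketch of KNN, the first algorithm in the list above. It assumes Euclidean distance and a simple majority vote. The function name `knn_predict`, the choice k = 3, and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training samples."""
    neighbors = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Each training sample is (feature vector, label).
train = [((1.0, 1.1), "good"), ((1.2, 0.9), "good"), ((0.8, 1.0), "good"),
         ((3.0, 3.2), "bad"), ((3.1, 2.9), "bad"), ((2.9, 3.0), "bad")]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (3.0, 3.0)))
```

The first query sits inside the "good" group and the second inside the "bad" group, so the votes are unanimous. Closer calls depend heavily on k and on how the features are scaled.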

3) Regression problems

Learn more about machine learning regression algorithms: decision tree models, random forest classification models, GBDT models, regression tree models, support vector machine models, and more.

4) Clustering problems

Learn more about machine learning clustering algorithms: Clustering algorithms.
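
Clustering groups unlabeled points by similarity. The sketch below uses k-means as one common example of a clustering algorithm. It is a simplified version under stated assumptions: the initial centroids are simply the first k points (real implementations usually initialize more carefully), and the function name `kmeans` and the data are hypothetical.

```python
import math


def kmeans(points, k, iters=20):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    centroids = [list(p) for p in points[:k]]  # naive initialization (assumption)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return centroids, labels


points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.9), (0.9, 1.1), (4.8, 5.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

No labels are supplied anywhere: the two groups emerge from the distances alone, which is what distinguishes clustering from classification.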

5) Dimensionality reduction

Learn more about machine learning dimensionality reduction algorithms: PCA dimensionality reduction algorithms.

4. Evaluation and selection of machine learning models

1) Machine learning and data fitting

The most typical supervised learning problems in machine learning are classification and regression. In classification, we learn a "decision boundary" that separates the data; in regression, we learn a curve that fits the distribution of the samples.

2) Training set and test set

Let's take house price estimation as an example to illustrate the concepts involved.
Training set: used to train the model. Put simply, the training data determines the parameters of the fitted curve.
Test set: used to test the accuracy of the trained model. The test set cannot guarantee that the model is correct; it only shows that similar data will yield similar results with this model.
Because the parameters are adjusted to fit the existing training data, overfitting can occur: the parameters fit the training data very accurately, but when new data arrives and the model is used to predict it, the accuracy may be very poor.

3) Empirical error

The model learns from the training set data, and its error on the training set is called the "empirical error". A smaller empirical error is not always better, though, because what we want is good estimates on new data that we have not seen before.

4) Overfitting

Overfitting means that the model performs well on the training set but poorly on the cross-validation set or test set. In other words, the model predicts unknown samples poorly and generalizes badly.

How can overfitting be prevented? Common methods include early stopping, data augmentation, regularization, and dropout.
Regularization: adding a regularization term to the objective function, usually L1 or L2 regularization. L1 regularization is based on the L1 norm: the term added to the objective function is the sum of the absolute values of the parameters multiplied by a coefficient.
Data augmentation: obtaining more data that meets the requirements, meaning data that is independent and identically distributed with the existing data, or approximately so. Common methods include collecting more data from the source, copying the original data and adding random noise, resampling, and estimating the distribution parameters from the current dataset and then generating more data from that distribution.
Dropout: a technique that works by modifying the structure of the neural network itself.
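
The regularization term is simple enough to write out directly. The sketch below assumes the usual forms: the L1 penalty is a coefficient times the sum of absolute parameter values, and the L2 penalty is a coefficient times the sum of squared parameter values. The function names, the coefficient name `lam`, and the example numbers are hypothetical.

```python
def l1_penalty(weights, lam):
    """L1 term: coefficient times the sum of absolute parameter values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 term: coefficient times the sum of squared parameter values."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam, kind="l2"):
    """Objective = loss on the data + regularization term."""
    penalty = l1_penalty(weights, lam) if kind == "l1" else l2_penalty(weights, lam)
    return data_loss + penalty


weights = [0.5, -2.0, 0.0]
print(l1_penalty(weights, lam=0.1))              # 0.1 * (0.5 + 2.0 + 0.0)
print(l2_penalty(weights, lam=0.1))              # 0.1 * (0.25 + 4.0 + 0.0)
print(regularized_loss(1.0, weights, lam=0.1, kind="l1"))
```

Because the penalty grows with the size of the parameters, minimizing the regularized objective pushes the model toward smaller weights, which limits how closely it can chase noise in the training data.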

5) Bias

Bias usually refers to how far the model's fit deviates from the truth. Given an unlimited number of training sets, the expected fitted model is the average model, and bias is the difference between the true model and this average model. A simple model is a set of straight lines; the average model obtained by averaging them is a straight dashed line, which differs considerably from the curve of the true model (the gray shaded area is large). As a result, simple models usually have high bias.

A complex model is a set of wavy lines whose fluctuations cancel out after averaging, so the average model differs little from the curve of the true model. Complex models therefore usually have low bias (the yellow curve and the green dotted line almost coincide).

6) Variance

Variance usually refers to how stable the model is (how simple it is). The functions a simple model fits on different training sets all look alike: they are all horizontal lines, and the average model is also a horizontal line. The variance of a simple model is therefore small, and it is insensitive to changes in the data.

The functions a complex model fits on different training sets vary wildly and follow no pattern, even though the average model is a smooth curve. The variance of a complex model is therefore large, and it is sensitive to changes in the data.
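
The "many training sets, then average" idea can be simulated directly. The sketch below is an illustration under assumptions of my own choosing: the true model is a sine curve, the simple model is a constant (a horizontal line at the mean label), the complex model is 1-nearest-neighbour, and all names and settings are hypothetical. It estimates squared bias and variance of each model's prediction at a single point.

```python
import math
import random


def true_f(x):
    """The true model (assumed for this simulation)."""
    return math.sin(2 * math.pi * x)


def sample_training_set(rng, n=20, noise=0.3):
    """Draw one noisy training set from the true model."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0, noise)) for x in xs]


def simple_model(train, x0):
    """Simple model: a horizontal line at the mean training label."""
    return sum(y for _, y in train) / len(train)


def complex_model(train, x0):
    """Complex model: 1-nearest-neighbour, copies the label of the closest point."""
    return min(train, key=lambda row: abs(row[0] - x0))[1]


def bias2_and_variance(model, x0, trials=500, seed=0):
    """Fit the model on many training sets and summarize its predictions at x0."""
    rng = random.Random(seed)
    preds = [model(sample_training_set(rng), x0) for _ in range(trials)]
    mean_pred = sum(preds) / trials                  # the "average model" at x0
    bias2 = (mean_pred - true_f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    return bias2, variance


simple_b2, simple_var = bias2_and_variance(simple_model, 0.25)
complex_b2, complex_var = bias2_and_variance(complex_model, 0.25)
print("simple :", simple_b2, simple_var)
print("complex:", complex_b2, complex_var)
```

In this setup the constant model should show the larger squared bias and the smaller variance, and the nearest-neighbour model the reverse, which matches the description above.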

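7) Balancing bias and variance

Simple models tend to have high bias and low variance, while complex models tend to have low bias and high variance. Choosing a model means balancing the two so that the error on unseen data is as small as possible.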

8) Performance metrics

The performance metric is a numerical evaluation measure of the generalization ability of a model, reflecting the current problem (task requirements). Using different performance metrics may result in different judgements. For more details, see Model Evaluation Methods and Guidelines (see link at the end of this article).

(1) Regression problems

The judgment on the "goodness" of the model depends not only on the algorithm and data, but also on the current task requirements. The commonly used performance measures for regression problems are: mean absolute error, mean square error, root mean square error, R squared, etc.

Mean Absolute Error (MAE), also known as mean absolute deviation, is the average of the absolute differences between the label values and the values predicted by the regression model.
Mean Absolute Percentage Error (MAPE) is a refinement of MAE that expresses each absolute error as a proportion of the true value.
Mean Square Error (MSE) is the average of the squared differences between the label values and the predicted values; compared with MAE, it penalizes large errors more heavily.
Root-Mean-Square Error (RMSE) is the square root of the MSE. It measures the deviation between the predicted values and the true values in the same units as the target.
R-squared, the coefficient of determination, is the proportion of the total variation in the dependent variable that the regression model explains through its independent variables. The closer it is to 1, the better the model explains the data and describes its true distribution.
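
The regression metrics above can be written out in a few lines. This is a minimal sketch using their standard definitions. The function names and the sample values are hypothetical, and `mape` assumes that no true value is zero.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error (assumes no true value is zero)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean square error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-square error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual variation / total variation."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
print(mae(y_true, y_pred))    # (0.5 + 0 + 2 + 1) / 4 = 0.875
print(mse(y_true, y_pred))    # (0.25 + 0 + 4 + 1) / 4 = 1.3125
print(rmse(y_true, y_pred), mape(y_true, y_pred), r_squared(y_true, y_pred))
```

The single large error (2.0) accounts for most of the MSE but only a little over half of the MAE, which is why the two metrics can rank models differently.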

(2) Classification problems

Commonly used performance metrics for classification problems include error rate, accuracy, precision, recall, F1, the ROC curve, and AUC. For more details, see Model Evaluation Methods and Guidelines (see link at the end of this article).

Error rate: the number of misclassified samples as a proportion of the total number of samples.
Accuracy: the number of correctly classified samples as a proportion of the total number of samples.
Precision: of the results returned by a search (the results the model judged to be correct), the proportion that really are correct.
Recall: of all the truly correct items in the entire dataset (retrieved or not), the proportion that appear in the search results.
F1: a measure that takes both precision and recall into account, defined as their harmonic mean. Its general form, Fβ, lets us express different preferences between precision and recall.
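
These definitions all follow from four counts: true positives, false positives, false negatives, and true negatives. The sketch below computes them for a binary problem. The function names and the example labels are hypothetical, and class 1 is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, and true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def f_beta(precision, recall, beta=1.0):
    """General F-measure; beta = 1 gives F1, the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)   # 2, 1, 2, 5
accuracy = (tp + tn) / len(y_true)                  # 7 / 10
error_rate = 1 - accuracy
precision = tp / (tp + fp)                          # 2 / 3
recall = tp / (tp + fn)                             # 2 / 4
f1 = f_beta(precision, recall)                      # 4 / 7
f2 = f_beta(precision, recall, beta=2.0)            # weights recall more heavily
print(accuracy, error_rate, precision, recall, f1, f2)
```

A beta above 1 shifts the score toward recall and a beta below 1 shifts it toward precision, which is how Fβ expresses a preference between the two.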

ROC stands for Receiver Operating Characteristic. The ROC curve takes the ranking quality of the predicted probabilities into account as a whole and reflects the "expected generalization performance" of the learner across different tasks. Its vertical axis is the true positive rate (TPR) and its horizontal axis is the false positive rate (FPR). AUC (Area Under the ROC Curve) is the area under the ROC curve and represents the ranking quality of the sample predictions.

Understanding AUC from a relatively high perspective: Taking the identification of abnormal users as an example, a high AUC value means that the model still has a low false positive rate for normal users when it can identify as many abnormal users as possible (it will not misjudge a large number of normal users as abnormal because it is necessary to identify abnormal users).
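
One way to see why AUC measures ranking quality: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes AUC from that pairwise view rather than by tracing the curve. The function name `auc` and the scores are hypothetical, and ties are counted as half a win.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked in the right order."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
print(auc(y_true, scores))   # 8 of the 9 positive/negative pairs are ordered correctly
```

Only one pair is out of order here (the positive scored 0.4 against the negative scored 0.7), so the AUC is 8/9. The pairwise loop is quadratic, so this form is for understanding rather than for large datasets.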

9) Assessment Methods

We have no unknown samples at hand, so how can we evaluate a model reliably? The key is a reliable "test set": the test set (used for evaluation) should be mutually exclusive with the training set (used for model learning).

Common evaluation methods are: Hold-out, Cross Validation, and Bootstrap. For more details, see Model Evaluation Methods and Guidelines (see link at the end of this article). Hold-out is one of the most common evaluation methods in machine learning, which retains a validation sample set from the training data, which is not used for training, but for model evaluation.

Another common method of evaluation in machine learning is Cross Validation. K-fold cross-validation averages the results of training in k different groups to reduce variance, so the performance of the model is less sensitive to the division of the data, the use of the data will be more sufficient, and the model evaluation results will be more stable.
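
A minimal sketch of k-fold cross-validation follows. It makes several simplifying assumptions: the folds are contiguous and unshuffled, the "model" is just the mean of the training labels, and the score is the mean absolute error. The names `k_fold_indices`, `cross_val_score`, `fit_mean`, and `score_mae` are all hypothetical.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


def cross_val_score(ys, fit, score, k):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(ys), k):
        model = fit([ys[i] for i in train_idx])
        scores.append(score(model, [ys[i] for i in val_idx]))
    return sum(scores) / k


def fit_mean(train_ys):
    """Toy model: predict the mean of the training labels."""
    return sum(train_ys) / len(train_ys)


def score_mae(model, val_ys):
    """Mean absolute error of the constant prediction on the validation fold."""
    return sum(abs(model - y) for y in val_ys) / len(val_ys)


ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
folds = list(k_fold_indices(len(ys), 3))
average_mae = cross_val_score(ys, fit_mean, score_mae, k=3)
print(folds[0])
print(average_mae)   # fold errors are 3.0, 0.5, 3.0
```

Every sample is used for validation exactly once and for training k-1 times, which is why the averaged score depends less on any single split. With real data the rows would normally be shuffled before the folds are cut.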

Bootstrap is a non-parametric method for estimating population statistics from a small sample and is widely used in evolutionary and ecological research. It generates a large number of pseudo-samples by sampling with replacement and computes a statistic on each one, which yields the distribution of that statistic and allows the overall distribution of the data to be estimated.
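
The resampling step looks like this in code. It is a sketch under stated assumptions: the statistic is the mean, 1000 pseudo-samples are drawn, and the interval is a rough percentile interval. The function name `bootstrap_means` and the sample values are hypothetical.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Draw pseudo-samples with replacement and record the mean of each one."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]


sample = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]
means = sorted(bootstrap_means(sample))

# Rough 95% percentile interval for the population mean.
lower, upper = means[25], means[974]
print(statistics.fmean(sample), lower, upper)
```

Each pseudo-sample has the same size as the original sample, and because the draws are made with replacement, some values appear several times while others are left out.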

10) Model tuning and selection criteria

We want to find a model that can express the current problem well and has low model complexity:

Models with good expressiveness learn the rules and patterns in the training data better.
Models with low complexity have small variance, are less prone to overfitting, and generalize well.

11) How to choose the optimal model

(1) Validation set evaluation selection

The data is divided into a training set and a validation set.
For each prepared set of candidate hyperparameters, the model is trained on the training set and evaluated on the validation set.

(2) Grid search/random search cross-validation

Candidate sets of hyperparameters are produced through grid search/random search.
For each candidate set of hyperparameters, cross-validation is used to evaluate the effect.
Pick the best performing hyperparameters.
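
The three steps above can be sketched as a small grid search. The setup is an assumption made for illustration: the only hyperparameter is the number of neighbours in a one-dimensional KNN regressor, the grid is fixed by hand, the folds are interleaved, and all names and data are hypothetical.

```python
def knn_regress(train, x, k):
    """Predict by averaging the labels of the k nearest training points."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cv_mse(data, k_neighbors, n_folds=4):
    """Cross-validated mean squared error for one hyperparameter value."""
    fold_errors = []
    for f in range(n_folds):
        val = data[f::n_folds]                                   # every n_folds-th row
        train = [row for i, row in enumerate(data) if i % n_folds != f]
        errors = [(knn_regress(train, x, k_neighbors) - y) ** 2 for x, y in val]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / n_folds


# Toy data following y = x^2 (an assumption for this sketch).
data = [(x / 10, (x / 10) ** 2) for x in range(20)]

grid = [1, 3, 5, 9]                                  # step 1: candidate hyperparameters
scores = {k: cv_mse(data, k) for k in grid}          # step 2: cross-validate each one
best_k = min(scores, key=scores.get)                 # step 3: pick the best performer
print(scores)
print(best_k)
```

Random search follows the same loop but samples the candidates instead of enumerating a fixed grid.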

(3) Bayesian optimization

Hyperparameter tuning based on Bayesian optimization.
