ML / NB: Text classification on a news dataset using pure frequency statistics, kNN, Naive Bayes (Gaussian / multivariate Bernoulli / multinomial), Linear Discriminant Analysis (LDA), the Perceptron, and other algorithms
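Before walking through the library internals below, here is a minimal end-to-end sketch of the kind of pipeline the title describes. It assumes the 20 newsgroups corpus as the news dataset, TF-IDF features, and default hyperparameters; the dataset choice, feature count, and model settings are illustrative assumptions rather than the article's exact configuration (the pure frequency-counting baseline is omitted here).

# Illustrative pipeline (assumed setup: 20 newsgroups as the "news" dataset, TF-IDF features)
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

vec = TfidfVectorizer(max_features=5000)                 # sparse TF-IDF features
X_train, X_test = vec.fit_transform(train.data), vec.transform(test.data)
y_train, y_test = train.target, test.target

# Models that accept sparse input directly
for name, clf in [('MultinomialNB', MultinomialNB(alpha=1.0)),
                  ('BernoulliNB', BernoulliNB(alpha=1.0, binarize=0.0)),
                  ('kNN', KNeighborsClassifier(n_neighbors=5)),
                  ('Perceptron', Perceptron(max_iter=1000))]:
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))

# GaussianNB and LDA need dense input, so run them on a small dense slice
# (purely for illustration; in practice use chunking or dimensionality reduction)
X_tr_dense, X_te_dense = X_train[:2000].toarray(), X_test[:2000].toarray()
for name, clf in [('GaussianNB', GaussianNB()),
                  ('LDA', LinearDiscriminantAnalysis())]:
    clf.fit(X_tr_dense, y_train[:2000])
    print(name, accuracy_score(y_test[:2000], clf.predict(X_te_dense)))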

Core code

class GaussianNB Found at: sklearn.naive_bayes

class GaussianNB(_BaseNB):

   """

   Gaussian Naive Bayes (GaussianNB)

   Can perform online updates to model parameters via :meth:`partial_fit`.

   For details on algorithm used to update feature means and variance online,

   see Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque:

http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

   Read more in the :ref:`User Guide <gaussian_naive_bayes>`.

   Parameters

   ----------

   priors : array-like of shape (n_classes,)

   Prior probabilities of the classes. If specified the priors are not

   adjusted according to the data.

   var_smoothing : float, default=1e-9

   Portion of the largest variance of all features that is added to

   variances for calculation stability.

   .. versionadded:: 0.20

   Attributes

   ----------

   class_count_ : ndarray of shape (n_classes,)

   number of training samples observed in each class.

   class_prior_ : ndarray of shape (n_classes,)

   probability of each class.

   classes_ : ndarray of shape (n_classes,)

   class labels known to the classifier

   epsilon_ : float

   absolute additive value to variances

   sigma_ : ndarray of shape (n_classes, n_features)

   variance of each feature per class

   theta_ : ndarray of shape (n_classes, n_features)

   mean of each feature per class

   Examples

   --------

   >>> import numpy as np

   >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

   >>> Y = np.array([1, 1, 1, 2, 2, 2])

   >>> from sklearn.naive_bayes import GaussianNB

   >>> clf = GaussianNB()

   >>> clf.fit(X, Y)

   GaussianNB()

   >>> print(clf.predict([[-0.8, -1]]))

   [1]

   >>> clf_pf = GaussianNB()

   >>> clf_pf.partial_fit(X, Y, np.unique(Y))

   GaussianNB()

   >>> print(clf_pf.predict([[-0.8, -1]]))

   [1]

   """

   @_deprecate_positional_args

   def __init__(self, *, priors=None, var_smoothing=1e-9):

       self.priors = priors

       self.var_smoothing = var_smoothing

   def fit(self, X, y, sample_weight=None):

       """Fit Gaussian Naive Bayes according to X, y

       Parameters

       ----------

       X : array-like of shape (n_samples, n_features)

           Training vectors, where n_samples is the number of samples

           and n_features is the number of features.

       y : array-like of shape (n_samples,)

           Target values.

       sample_weight : array-like of shape (n_samples,), default=None

           Weights applied to individual samples (1. for unweighted).

           .. versionadded:: 0.17

              Gaussian Naive Bayes supports fitting with *sample_weight*.

       Returns

       -------

       self : object

       """

       X, y = self._validate_data(X, y)

       y = column_or_1d(y, warn=True)

       return self._partial_fit(X, y, np.unique(y), _refit=True,

           sample_weight=sample_weight)

   def _check_X(self, X):

       return check_array(X)

   @staticmethod

   def _update_mean_variance(n_past, mu, var, X, sample_weight=None):

       """Compute online update of Gaussian mean and variance.

       Given starting sample count, mean, and variance, a new set of

       points X, and optionally sample weights, return the updated mean and

       variance. (NB - each dimension (column) in X is treated as independent

       -- you get variance, not covariance).

       Can take scalar mean and variance, or vector mean and variance to

       simultaneously update a number of independent Gaussians.

        See Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque:

        Parameters

        ----------

       n_past : int

           Number of samples represented in old mean and variance. If sample

           weights were given, this should contain the sum of sample

           weights represented in old mean and variance.

       mu : array-like of shape (number of Gaussians,)

           Means for Gaussians in original set.

       var : array-like of shape (number of Gaussians,)

           Variances for Gaussians in original set.

        Returns

        -------

        total_mu : array-like of shape (number of Gaussians,)

            Updated mean for each Gaussian over the combined set.

        total_var : array-like of shape (number of Gaussians,)

            Updated variance for each Gaussian over the combined set.

        """

        if X.shape[0] == 0:

           return mu, var

       # Compute (potentially weighted) mean and variance of new datapoints

       if sample_weight is not None:

           n_new = float(sample_weight.sum())

           new_mu = np.average(X, axis=0, weights=sample_weight)

           new_var = np.average((X - new_mu) ** 2, axis=0,

            weights=sample_weight)

       else:

           n_new = X.shape[0]

           new_var = np.var(X, axis=0)

           new_mu = np.mean(X, axis=0)

       if n_past == 0:

           return new_mu, new_var

       n_total = float(n_past + n_new)

       # Combine mean of old and new data, taking into consideration

       # (weighted) number of observations

       total_mu = (n_new * new_mu + n_past * mu) / n_total

       # Combine variance of old and new data, taking into consideration

       # (weighted) number of observations. This is achieved by combining

       # the sum-of-squared-differences (ssd)

       old_ssd = n_past * var

       new_ssd = n_new * new_var

       total_ssd = old_ssd + new_ssd + (n_new * n_past / n_total) * (mu -

        new_mu) ** 2

       total_var = total_ssd / n_total

       return total_mu, total_var
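As a quick sanity check of this update rule, the incrementally combined statistics should match the batch statistics when the same data arrives in two chunks. A minimal NumPy-only sketch (the chunk sizes are arbitrary):

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
X1, X2 = X[:60], X[60:]                      # two arbitrary chunks

# batch statistics over all rows
batch_mu, batch_var = X.mean(axis=0), X.var(axis=0)

# fold the second chunk into the statistics of the first (same formula as above)
n1, n2 = X1.shape[0], X2.shape[0]
mu1, var1 = X1.mean(axis=0), X1.var(axis=0)
mu2, var2 = X2.mean(axis=0), X2.var(axis=0)

n_total = n1 + n2
total_mu = (n1 * mu1 + n2 * mu2) / n_total
total_ssd = n1 * var1 + n2 * var2 + (n1 * n2 / n_total) * (mu1 - mu2) ** 2
total_var = total_ssd / n_total

assert np.allclose(total_mu, batch_mu) and np.allclose(total_var, batch_var)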

   def partial_fit(self, X, y, classes=None, sample_weight=None):

       """Incremental fit on a batch of samples.

       This method is expected to be called several times consecutively

       on different chunks of a dataset so as to implement out-of-core

       or online learning.

       This is especially useful when the whole dataset is too big to fit in

       memory at once.

       This method has some performance and numerical stability overhead,

       hence it is better to call partial_fit on chunks of data that are

       as large as possible (as long as fitting in the memory budget) to

       hide the overhead.

        Parameters

        ----------

        X : array-like of shape (n_samples, n_features)

            Training vectors, where n_samples is the number of samples and

            n_features is the number of features.

        y : array-like of shape (n_samples,)

            Target values.

       classes : array-like of shape (n_classes,), default=None

           List of all the classes that can possibly appear in the y vector.

           Must be provided at the first call to partial_fit, can be omitted

           in subsequent calls.

        Returns

        -------

        self : object

        """

        return self._partial_fit(X, y, classes, _refit=False,

            sample_weight=sample_weight)

   def _partial_fit(self, X, y, classes=None, _refit=False,

       sample_weight=None):

       """Actual implementation of Gaussian NB fitting.

       _refit : bool, default=False

           If true, act as though this were the first time we called

           _partial_fit (ie, throw away any past fitting and start over).

        """

        X, y = check_X_y(X, y)

        if sample_weight is not None:

            sample_weight = _check_sample_weight(sample_weight, X)

       # If the ratio of data variance between dimensions is too small, it

       # will cause numerical errors. To address this, we artificially

       # boost the variance by epsilon, a small fraction of the standard

       # deviation of the largest dimension.

       self.epsilon_ = self.var_smoothing * np.var(X, axis=0).max()
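        # Illustrative (hypothetical) numbers: with the default var_smoothing=1e-9 and a
        # largest per-feature variance of 2.5, epsilon_ becomes 2.5e-9. This constant is
        # added to every entry of sigma_ further down, so no class/feature variance is
        # exactly zero when the log-likelihood later divides by it.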

       if _refit:

           self.classes_ = None

       if _check_partial_fit_first_call(self, classes):

           # This is the first call to partial_fit:

           # initialize various cumulative counters

           n_features = X.shape[1]

           n_classes = len(self.classes_)

           self.theta_ = np.zeros((n_classes, n_features))

           self.sigma_ = np.zeros((n_classes, n_features))

           self.class_count_ = np.zeros(n_classes, dtype=np.float64)

           # Initialise the class prior

           # Take into account the priors

           if self.priors is not None:

               priors = np.asarray(self.priors)

                # Check that the provided priors match the number of classes

               if len(priors) != n_classes:

                   raise ValueError('Number of priors must match number of'

                       ' classes.')

               # Check that the sum is 1

               if not np.isclose(priors.sum(), 1.0):

                    raise ValueError('The sum of the priors should be 1.')

                # Check that the priors are non-negative

               if (priors < 0).any():

                   raise ValueError('Priors must be non-negative.')

               self.class_prior_ = priors

           else:

               self.class_prior_ = np.zeros(len(self.classes_),

                   dtype=np.float64) # Initialize the priors to zeros for each class

        else:

            if X.shape[1] != self.theta_.shape[1]:

                msg = "Number of features %d does not match previous data %d."

                raise ValueError(msg % (X.shape[1], self.theta_.shape[1]))

            # Put epsilon back in each time

            self.sigma_[:, :] -= self.epsilon_

       classes = self.classes_

       unique_y = np.unique(y)

       unique_y_in_classes = np.in1d(unique_y, classes)

       if not np.all(unique_y_in_classes):

           raise ValueError("The target label(s) %s in y do not exist in the "

               "initial classes %s" %

               (unique_y[~unique_y_in_classes], classes))

       for y_i in unique_y:

           i = classes.searchsorted(y_i)

            X_i = X[y == y_i, :]

            if sample_weight is not None:

                sw_i = sample_weight[y == y_i]

                N_i = sw_i.sum()

            else:

                sw_i = None

                N_i = X_i.shape[0]

           new_theta, new_sigma = self._update_mean_variance(

                self.class_count_[i], self.theta_[i, :], self.sigma_[i, :],

                X_i, sw_i)

            self.theta_[i, :] = new_theta

            self.sigma_[i, :] = new_sigma

           self.class_count_[i] += N_i

        self.sigma_[:, :] += self.epsilon_

        # Update only if no priors were provided

       if self.priors is None:

           # Empirical prior, with sample_weight taken into account

           self.class_prior_ = self.class_count_ / self.class_count_.sum()

       return self

   def _joint_log_likelihood(self, X):

       joint_log_likelihood = []

       for i in range(np.size(self.classes_)):

           jointi = np.log(self.class_prior_[i])

            n_ij = -0.5 * np.sum(np.log(2. * np.pi * self.sigma_[i, :]))

            n_ij -= 0.5 * np.sum(((X - self.theta_[i, :]) ** 2) /

                (self.sigma_[i, :]), 1)

           joint_log_likelihood.append(jointi + n_ij)

       joint_log_likelihood = np.array(joint_log_likelihood).T

       return joint_log_likelihood
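Tying this back to the news-classification task: GaussianNB requires dense input, so sparse TF-IDF vectors have to be densified (or reduced) first, and partial_fit lets the corpus be processed in chunks so the full dense matrix never has to exist at once. A rough sketch, again assuming the 20 newsgroups data; the chunk size, feature count, and example sentence are arbitrary:

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

train = fetch_20newsgroups(subset='train')
vec = TfidfVectorizer(max_features=2000)
X, y = vec.fit_transform(train.data), train.target
classes = np.unique(y)

clf = GaussianNB()
for start in range(0, X.shape[0], 1000):              # densify one chunk at a time
    chunk = X[start:start + 1000].toarray()
    clf.partial_fit(chunk, y[start:start + 1000], classes=classes)

print(clf.predict(vec.transform(["NASA launched a new space probe"]).toarray()))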

class MultinomialNB Found at: sklearn.naive_bayes

class MultinomialNB(_BaseDiscreteNB):

   """Naive Bayes classifier for multinomial models.

   The multinomial Naive Bayes classifier is suitable for classification with

   discrete features (e.g., word counts for text classification). The

   multinomial distribution normally requires integer feature counts. However,

   in practice, fractional counts such as tf-idf may also work.

   Read more in the :ref:`User Guide <multinomial_naive_bayes>`.

   Parameters

   ----------

   alpha : float, default=1.0

   Additive (Laplace/Lidstone) smoothing parameter

   (0 for no smoothing).

   fit_prior : bool, default=True

   Whether to learn class prior probabilities or not.

   If false, a uniform prior will be used.

   class_prior : array-like of shape (n_classes,), default=None

   Prior probabilities of the classes. If specified the priors are not

   adjusted according to the data.

   Attributes

   ----------

   class_count_ : ndarray of shape (n_classes,)

   Number of samples encountered for each class during fitting. This

   value is weighted by the sample weight when provided.

   class_log_prior_ : ndarray of shape (n_classes, )

   Smoothed empirical log probability for each class.

   classes_ : ndarray of shape (n_classes,)

   Class labels known to the classifier

   coef_ : ndarray of shape (n_classes, n_features)

   Mirrors ``feature_log_prob_`` for interpreting MultinomialNB

   as a linear model.

   feature_count_ : ndarray of shape (n_classes, n_features)

   Number of samples encountered for each (class, feature)

   during fitting. This value is weighted by the sample weight when

   provided.

   feature_log_prob_ : ndarray of shape (n_classes, n_features)

   Empirical log probability of features

   given a class, ``P(x_i|y)``.

   intercept_ : ndarray of shape (n_classes, )

   Mirrors ``class_log_prior_`` for interpreting MultinomialNB

   n_features_ : int

   Number of features of each sample.

   Examples

   --------

   >>> import numpy as np

   >>> rng = np.random.RandomState(1)

   >>> X = rng.randint(5, size=(6, 100))

   >>> y = np.array([1, 2, 3, 4, 5, 6])

   >>> from sklearn.naive_bayes import MultinomialNB

   >>> clf = MultinomialNB()

   >>> clf.fit(X, y)

   MultinomialNB()

   >>> print(clf.predict(X[2:3]))

   [3]

   Notes

   -----

   For the rationale behind the names `coef_` and `intercept_`, i.e.

   naive Bayes as a linear classifier, see J. Rennie et al. (2003),

   Tackling the poor assumptions of naive Bayes text classifiers, ICML.

   References

   ----------

   C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to

   Information Retrieval. Cambridge University Press, pp. 234-265.

https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html

   """

   def __init__(self, *, alpha=1.0, fit_prior=True, class_prior=None):

       self.alpha = alpha

       self.fit_prior = fit_prior

       self.class_prior = class_prior

   def _more_tags(self):

       return {'requires_positive_X':True}

   def _count(self, X, Y):

       """Count and smooth feature occurrences."""

       check_non_negative(X, "MultinomialNB (input X)")

       self.feature_count_ += safe_sparse_dot(Y.T, X)

       self.class_count_ += Y.sum(axis=0)

   def _update_feature_log_prob(self, alpha):

       """Apply smoothing to raw counts and recompute log probabilities"""

       smoothed_fc = self.feature_count_ + alpha

       smoothed_cc = smoothed_fc.sum(axis=1)

        self.feature_log_prob_ = (np.log(smoothed_fc) -
                                  np.log(smoothed_cc.reshape(-1, 1)))

   def _joint_log_likelihood(self, X):

        """Calculate the posterior log probability of the samples X"""

       return safe_sparse_dot(X, self.feature_log_prob_.T) + self.class_log_prior_
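The smoothing step above is plain Laplace/Lidstone smoothing on the per-class word counts: alpha is added to every count before normalizing, so words never seen in a class still get a small non-zero probability. A tiny worked example with made-up counts for a single class over a 4-word vocabulary:

import numpy as np

alpha = 1.0
feature_count = np.array([3., 1., 6., 0.])      # word 3 never occurred in this class

smoothed_fc = feature_count + alpha             # [4, 2, 7, 1]
smoothed_cc = smoothed_fc.sum()                 # 14
feature_log_prob = np.log(smoothed_fc) - np.log(smoothed_cc)

print(np.exp(feature_log_prob))                 # [0.2857 0.1429 0.5 0.0714], sums to 1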

class BernoulliNB Found at: sklearn.naive_bayes

class BernoulliNB(_BaseDiscreteNB):

   """Naive Bayes classifier for multivariate Bernoulli models.

   Like MultinomialNB, this classifier is suitable for discrete data. The

   difference is that while MultinomialNB works with occurrence counts,

   BernoulliNB is designed for binary/boolean features.

   Read more in the :ref:`User Guide <bernoulli_naive_bayes>`.

   Parameters

   ----------

   alpha : float, default=1.0

   Additive (Laplace/Lidstone) smoothing parameter
   (0 for no smoothing).

   binarize : float or None, default=0.0

   Threshold for binarizing (mapping to booleans) of sample features.

   If None, input is presumed to already consist of binary vectors.

   Attributes

   ----------

   class_count_ : ndarray of shape (n_classes)

   Number of samples encountered for each class during fitting. This
   value is weighted by the sample weight when provided.

   class_log_prior_ : ndarray of shape (n_classes)

   Log probability of each class (smoothed).

   feature_log_prob_ : ndarray of shape (n_classes, n_features)

   Empirical log probability of features given a class, P(x_i|y).

   Examples

   --------

   >>> import numpy as np

   >>> rng = np.random.RandomState(1)

   >>> X = rng.randint(5, size=(6, 100))

   >>> Y = np.array([1, 2, 3, 4, 4, 5])

   >>> from sklearn.naive_bayes import BernoulliNB

   >>> clf = BernoulliNB()

   >>> clf.fit(X, Y)

   BernoulliNB()

   >>> print(clf.predict(X[2:3]))

   [3]

   References

   ----------

   C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to
   Information Retrieval. Cambridge University Press, pp. 234-265.
   https://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html

   A. McCallum and K. Nigam (1998). A comparison of event models for naive
   Bayes text classification. Proc. AAAI/ICML-98 Workshop on Learning for
   Text Categorization, pp. 41-48.

   V. Metsis, I. Androutsopoulos and G. Paliouras (2006). Spam filtering with
   naive Bayes -- Which naive Bayes? 3rd Conf. on Email and Anti-Spam (CEAS).

   """

   def __init__(self, *, alpha=1.0, binarize=.0, fit_prior=True,

        class_prior=None):

        self.alpha = alpha

        self.binarize = binarize

        self.fit_prior = fit_prior

        self.class_prior = class_prior

   def _check_X(self, X):

        X = super()._check_X(X)

        if self.binarize is not None:

            X = binarize(X, threshold=self.binarize)

        return X

   def _check_X_y(self, X, y):

       X, y = super()._check_X_y(X, y)

       return X, y

       """Apply smoothing to raw counts and recompute log

        probabilities"""

       smoothed_cc = self.class_count_ + alpha * 2

       self.feature_log_prob_ = np.log(smoothed_fc) - np.log

        (smoothed_cc.reshape(-1, 1))

   def _joint_log_likelihood(self, X):

        """Calculate the posterior log probability of the samples X"""

        n_classes, n_features = self.feature_log_prob_.shape

       n_samples, n_features_X = X.shape

       if n_features_X != n_features:

           raise ValueError(

               "Expected input with %d features, got %d instead" %

                (n_features, n_features_X))

       neg_prob = np.log(1 - np.exp(self.feature_log_prob_))

       # Compute  neg_prob · (1 - X).T  as  ∑neg_prob - X · neg_prob

       jll = safe_sparse_dot(X, (self.feature_log_prob_ - neg_prob).T)

       jll += self.class_log_prior_ + neg_prob.sum(axis=1)

       return jll
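The last two lines use the algebraic shortcut mentioned in the comment: for binary features, sum_i [x_i*log(p_i) + (1-x_i)*log(1-p_i)] equals a single dot product with (log p - log(1-p)) plus the per-class constant sum(log(1-p)), which keeps X sparse. A small check that the vectorized form matches the naive per-feature loop (the probabilities and priors below are made up):

import numpy as np

p = np.array([[0.9, 0.2, 0.5],        # P(x_i = 1 | class 0)
              [0.1, 0.7, 0.4]])       # P(x_i = 1 | class 1)
log_prior = np.log([0.5, 0.5])
X = np.array([[1., 0., 1.],
              [0., 1., 1.]])

# naive per-feature Bernoulli log-likelihood plus log prior
naive = np.array([[lp + np.sum(x * np.log(pc) + (1 - x) * np.log(1 - pc))
                   for pc, lp in zip(p, log_prior)] for x in X])

# vectorized form, as in BernoulliNB._joint_log_likelihood
log_p, log_1mp = np.log(p), np.log(1 - p)
jll = X @ (log_p - log_1mp).T + log_prior + log_1mp.sum(axis=1)

assert np.allclose(naive, jll)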