天天看点

python概率密度函数参数估计_如何估算密度函数并计算其峰值?

python概率密度函数参数估计_如何估算密度函数并计算其峰值?

I have started to use python for analysis. I would like to do the following:

Get the distribution of dataset

Get the peaks in this distribution

I used gaussian_kde from scipy.stats to make estimation for kernel density function. Does guassian_kde make any assumption about the data ?. I am using data that are changed over time. so if data has one distribution (e.g. Gaussian), it could have another distribution later. Does gaussian_kde have any drawbacks in this scenario?. It was suggested in question to try to fit the data in every distribution in order to get the data distribution. So what's the difference between using gaussian_kde and the answer provided in question. I used the code below, I was wondering also to know is gaussian_kde good way to estimate pdf if the data will be changed over time ?. I know one advantage of gaussian_kde is that it calculate bandwidth automatically by a rule of thumb as in here. Also, how can I get its peaks?

import pandas as pd

import numpy as np

import pylab as pl

import scipy.stats

df = pd.read_csv('D:\dataset.csv')

pdf = scipy.stats.kde.gaussian_kde(df)

x = np.linspace((df.min()-1),(df.max()+1), len(df))

y = pdf(x)

pl.plot(x, y, color = 'r')

pl.hist(data_column, normed= True)

pl.show(block=True)

解决方案

I think you need to distinguish non-parametric density (the one implemented in scipy.stats.kde) from parametric density (the one in the StackOverflow question you mention). To illustrate the difference between these two, try the following code.

import pandas as pd

import numpy as np

import scipy.stats as stats

import matplotlib.pyplot as plt

np.random.seed(0)

gaussian1 = -6 + 3 * np.random.randn(1700)

gaussian2 = 4 + 1.5 * np.random.randn(300)

gaussian_mixture = np.hstack([gaussian1, gaussian2])

df = pd.DataFrame(gaussian_mixture, columns=['data'])

# non-parametric pdf

nparam_density = stats.kde.gaussian_kde(df.values.ravel())

x = np.linspace(-20, 10, 200)

nparam_density = nparam_density(x)

# parametric fit: assume normal distribution

loc_param, scale_param = stats.norm.fit(df)

param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)

fig, ax = plt.subplots(figsize=(10, 6))

ax.hist(df.values, bins=30, normed=True)

ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')

ax.plot(x, param_density, 'k--', label='parametric density')

ax.set_ylim([0, 0.15])

ax.legend(loc='best')

python概率密度函数参数估计_如何估算密度函数并计算其峰值?

From the graph, we see that the non-parametric density is nothing but a smoothed version of histogram. In histogram, for a particular observation x=x0, we use a bar to represent it (put all probability mass on that single point x=x0 and zero elsewhere) whereas in non-parametric density estimation, we use a bell-shaped curve (the gaussian kernel) to represent that point (spreads over its neighbourhood). And the result is a smoothed density curve. This internal gaussian kernel has nothing to do with your distributional assumption on the underlying data x. Its sole purpose is for smoothing.

To get the mode of non-parametric density, we need to do an exhaustive search, as the density is not guaranteed to have uni-mode. As shown in the example above, if you quasi-Newton optimization algo starts between [5,10], it is very likely to end up with a local optimal point rather than the global one.

# get mode: exhastive search

x[np.argsort(nparam_density)[-1]]