Python probability density function parameter estimation: how to estimate a density function and compute its peaks?

I have started to use Python for data analysis. I would like to do the following:

Get the distribution of a dataset

Get the peaks of this distribution

I used gaussian_kde from scipy.stats to compute a kernel density estimate. Does gaussian_kde make any assumptions about the data? My data change over time, so if the data follow one distribution now (e.g. Gaussian), they could follow a different distribution later. Does gaussian_kde have any drawbacks in this scenario? In another question it was suggested to try fitting the data to every candidate distribution in order to find the distribution of the data. What is the difference between using gaussian_kde and the approach in that answer? I used the code below. I would also like to know whether gaussian_kde is a good way to estimate the pdf when the data change over time. I know one advantage of gaussian_kde is that it calculates the bandwidth automatically by a rule of thumb. Also, how can I get its peaks?

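import pandas as pd

import numpy as np

import pylab as pl

import scipy.stats

# raw string so the backslash in the Windows path is not read as an escape

df = pd.read_csv(r'D:\dataset.csv')

# gaussian_kde needs a 1-D array here; this assumes the data is in the first column

data_column = df.iloc[:, 0]

pdf = scipy.stats.gaussian_kde(data_column)

x = np.linspace(data_column.min() - 1, data_column.max() + 1, len(data_column))

y = pdf(x)

pl.plot(x, y, color='r')

# density=True replaces the normed=True argument, which matplotlib has removed

pl.hist(data_column, density=True)

pl.show(block=True)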
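
On the last point (getting the peaks), one possible approach is to evaluate the estimated density on a grid and look for local maxima of that curve. The sketch below is only an illustration under assumptions that are not from the original post: it uses scipy.signal.find_peaks, a synthetic bimodal sample in place of the real dataset, and an arbitrary prominence threshold (5% of the highest density value) to ignore tiny bumps in the tails.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

# Synthetic bimodal sample standing in for the real data (an assumption).
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-6, 3, 1700), rng.normal(4, 1.5, 300)])

# Fit the KDE; with no bw_method given, SciPy uses Scott's rule of thumb.
kde = gaussian_kde(sample)

# Evaluate the density on a grid that covers the data.
grid = np.linspace(sample.min() - 1, sample.max() + 1, 1000)
density = kde(grid)

# Local maxima of the evaluated curve approximate the modes.
# The prominence threshold is an illustrative choice, not a recommended value.
peak_idx, _ = find_peaks(density, prominence=0.05 * density.max())
peak_x = grid[peak_idx]
peak_y = density[peak_idx]

for px, py in zip(peak_x, peak_y):
    print(f"peak near x = {px:.2f}, density = {py:.4f}")
```

The peak locations are only as precise as the grid spacing, and the number of peaks found depends on the bandwidth: a larger bandwidth can merge nearby modes, a smaller one can create spurious ones.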

Solution

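I think you need to distinguish non-parametric density estimation (the one implemented by scipy.stats.gaussian_kde) from parametric density estimation (the one in the StackOverflow question you mention). A kernel density estimate makes no assumption about the shape of the distribution, whereas a parametric fit assumes a family (e.g. normal) and only estimates its parameters. To illustrate the difference between the two, try the following code.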

import pandas as pd

import numpy as np

import scipy.stats as stats

import matplotlib.pyplot as plt

np.random.seed(0)

gaussian1 = -6 + 3 * np.random.randn(1700)

gaussian2 = 4 + 1.5 * np.random.randn(300)

gaussian_mixture = np.hstack([gaussian1, gaussian2])

df = pd.DataFrame(gaussian_mixture, columns=['data'])

# non-parametric pdf

nparam_density = stats.gaussian_kde(df.values.ravel())

x = np.linspace(-20, 10, 200)

nparam_density = nparam_density(x)

# parametric fit: assume normal distribution

loc_param, scale_param = stats.norm.fit(df['data'])

param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)

fig, ax = plt.subplots(figsize=(10, 6))

ax.hist(df.values, bins=30, density=True)

ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')