Description
`FigureFactory.create_distplot` is intended, among other things, to compare the histogram of a data set with the KDE estimate of its probability density function. But a Plotly Histogram can be one of 5 types, each selected via the key `histnorm`.
The method `make_hist` has a drawback: it does not choose the right value for the `histnorm` key. Its default value is `histnorm='probability'` (https://github.com/plotly/plotly.py/blob/master/plotly/tools.py#L5084), which contradicts both the theoretical definition of this kind of histogram and that of the probability density function (pdf).
When `histnorm='probability'`, the height of a bar in the histogram equals the probability that data fall within the corresponding bin. Comparing the KDE estimate of the pdf with such a histogram amounts to assuming that the pdf takes values only in [0,1], which is not true in general.
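To make the point about the [0,1] range concrete, here is a minimal sketch that evaluates a Beta pdf at its mode. The parameters `a=2, b=5` and the helper `beta_pdf` are assumptions chosen for illustration; they are not taken from the plots in this issue.

```python
import math

def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at x in (0, 1)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

# Illustrative parameters (an assumption, not the ones used in the issue).
a, b = 2.0, 5.0

# For a, b > 1 the pdf attains its maximum at the mode (a - 1) / (a + b - 2).
mode = (a - 1) / (a + b - 2)
peak = beta_pdf(mode, a, b)

print(mode, peak)
print(peak > 1)  # a density value is not a probability, so it may exceed 1
```

A bar height under `histnorm='probability'` can never exceed 1, so it cannot follow a pdf like this one near its peak.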
In a histogram plotted with `histnorm='probability density'`, the height `h` of a bar is such that `h*bin_size` = probability that data fall in that bin = bar area. The bar area approximates the area under the pdf graph above that bin.
Hence the probability density function (pdf), or its KDE estimate, should be compared to the Plotly Histogram corresponding to `histnorm='probability density'`.
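The two normalizations can be sketched in plain Python as follows. This only mirrors the definitions given above (height = bin probability, versus height * bin_size = bin probability); it is not Plotly's implementation, and the helper `histogram_heights`, the Beta(2, 5) sample, and the 20 equal-width bins are assumptions made for illustration.

```python
import random

def histogram_heights(data, n_bins, lo=0.0, hi=1.0):
    """Return (bin_size, probability heights, density heights) for equal-width bins."""
    bin_size = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # Clamp the index so that x == hi falls into the last bin.
        idx = min(int((x - lo) / bin_size), n_bins - 1)
        counts[idx] += 1
    n = len(data)
    prob = [c / n for c in counts]        # height = probability of the bin
    dens = [p / bin_size for p in prob]   # height * bin_size = probability of the bin
    return bin_size, prob, dens

random.seed(0)
# Hypothetical sample: 10,000 draws from Beta(2, 5).
data = [random.betavariate(2.0, 5.0) for _ in range(10_000)]
bin_size, prob, dens = histogram_heights(data, n_bins=20)

print(sum(prob))                         # heights sum to (approximately) 1
print(sum(h * bin_size for h in dens))   # bar areas sum to (approximately) 1
print(max(prob), max(dens))              # only the density heights can exceed 1
```

Only the density heights live on the same scale as the pdf, which is why they are the ones to overlay with a KDE curve.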
To point out the drawback of the `make_hist` method, I plot below the pdf of a Beta distribution over the two types of histograms.
See also https://plot.ly/~chelsea_lyn/11601/group-1-group-2-group-3-group-4-group-1-group-2-group-3-group-4-group-1-group-2-/, where the last two pdfs lie far from their corresponding histograms.
The updated plot with `histnorm='probability density'` is here: http://nbviewer.ipython.org/0f42b607de8f0d0c50b0ffb0ccfdff08