Fitting 'Old Faithful' Data using mix'EM
========================================

.. code:: python

    import pandas as pd
    import numpy as np

    from bokeh.plotting import figure, show
    from bokeh.io import output_notebook
    output_notebook()

    import mixem
For this example, we'll fit the classic Old Faithful data set of geyser
eruption durations and waiting times between eruptions using a mixture of
two bivariate normal distributions. We have the data prepared in a
CSV-formatted file:

.. code:: python

    data = pd.read_csv("faithful.csv")
    print(data.head())

.. parsed-literal::

       eruptions  waiting
    0      3.600       79
    1      1.800       54
    2      3.333       74
    3      2.283       62
    4      4.533       85

To see that the data falls into two distinct clusters, we create a scatter
plot:

.. code:: python

    fig = figure(title="Old Faithful Data",
                 x_axis_label="Eruption duration (minutes)",
                 y_axis_label="Waiting time (minutes)")
    fig.scatter(x=data.eruptions, y=data.waiting)
    show(fig)
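As a quick additional check (a sketch that is not part of the original
example), we can also look at the one-dimensional histogram of eruption
durations, which is already clearly bimodal. It uses only ``numpy`` and the
bokeh ``figure``/``show`` imports from above; the bin count and the name
``fig_hist`` are arbitrary choices made here.

.. code:: python

    # A minimal sketch: histogram of eruption durations. The two peaks hint at
    # the two mixture components fitted below. 20 bins is an arbitrary choice.
    hist, edges = np.histogram(data.eruptions, bins=20)

    fig_hist = figure(title="Eruption durations",
                      x_axis_label="Eruption duration (minutes)",
                      y_axis_label="Count")
    fig_hist.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:])
    show(fig_hist)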
Now it's time to fit the data using ``mixem.em``: We create a numpy array of
shape :math:`(N, 2)` (where :math:`N` is the number of data points) from the
``pandas.DataFrame`` and specify a model of two
``mixem.distribution.MultivariateNormalDistribution``\ s using rough initial
parameter estimates of :math:`\mu_1 = (2, 50)^T, \sigma_1 = I_2` and
:math:`\mu_2 = (4, 80)^T, \sigma_2 = I_2`:

.. code:: python

    weights, distributions, ll = mixem.em(np.array(data), [
        mixem.distribution.MultivariateNormalDistribution(np.array((2, 50)), np.identity(2)),
        mixem.distribution.MultivariateNormalDistribution(np.array((4, 80)), np.identity(2)),
    ])

.. parsed-literal::

    iteration 0 (log-likelihood=-5.97474e+03): p(x|Φ) = 0.354*MultiNorm[μ=[ 2.064 54.316], σ=[[ 0.121 0.667], [ 0.667 30.598]]] + 0.646*MultiNorm[μ=[ 4.268 79.987], σ=[[ 0.222 1.154], [ 1.154 34.977]]]
    iteration 1 (log-likelihood=-9.64626e+02): p(x|Φ) = 0.357*MultiNorm[μ=[ 2.039 54.497], σ=[[ 0.071 0.458], [ 0.458 33.81 ]]] + 0.643*MultiNorm[μ=[ 4.291 79.985], σ=[[ 0.169 0.922], [ 0.922 35.802]]]
    iteration 2 (log-likelihood=-9.53829e+02): p(x|Φ) = 0.356*MultiNorm[μ=[ 2.037 54.483], σ=[[ 0.07 0.439], [ 0.439 33.725]]] + 0.644*MultiNorm[μ=[ 4.29 79.973], σ=[[ 0.169 0.934], [ 0.934 35.975]]]
    iteration 3 (log-likelihood=-9.53864e+02): p(x|Φ) = 0.356*MultiNorm[μ=[ 2.036 54.48 ], σ=[[ 0.069 0.436], [ 0.436 33.703]]] + 0.644*MultiNorm[μ=[ 4.29 79.969], σ=[[ 0.17 0.939], [ 0.939 36.029]]]
    iteration 4 (log-likelihood=-9.53880e+02): p(x|Φ) = 0.356*MultiNorm[μ=[ 2.036 54.479], σ=[[ 0.069 0.435], [ 0.435 33.699]]] + 0.644*MultiNorm[μ=[ 4.29 79.968], σ=[[ 0.17 0.94 ], [ 0.94 36.042]]]
    iteration 5 (log-likelihood=-9.53884e+02): p(x|Φ) = 0.356*MultiNorm[μ=[ 2.036 54.479], σ=[[ 0.069 0.435], [ 0.435 33.698]]] + 0.644*MultiNorm[μ=[ 4.29 79.968], σ=[[ 0.17 0.941], [ 0.941 36.045]]]
    iteration 6 (log-likelihood=-9.53885e+02): p(x|Φ) = 0.356*MultiNorm[μ=[ 2.036 54.479], σ=[[ 0.069 0.435], [ 0.435 33.697]]] + 0.644*MultiNorm[μ=[ 4.29 79.968], σ=[[ 0.17 0.941], [ 0.941 36.046]]]
    iteration 7 (log-likelihood=-9.53885e+02): p(x|Φ) = 0.356*MultiNorm[μ=[ 2.036 54.479], σ=[[ 0.069 0.435], [ 0.435 33.697]]] + 0.644*MultiNorm[μ=[ 4.29 79.968], σ=[[ 0.17 0.941], [ 0.941 36.046]]]
    iteration 8 (log-likelihood=-9.53885e+02): p(x|Φ) = 0.356*MultiNorm[μ=[ 2.036 54.479], σ=[[ 0.069 0.435], [ 0.435 33.697]]] + 0.644*MultiNorm[μ=[ 4.29 79.968], σ=[[ 0.17 0.941], [ 0.941 36.046]]]
    iteration 9 (log-likelihood=-9.53885e+02): p(x|Φ) = 0.356*MultiNorm[μ=[ 2.036 54.479], σ=[[ 0.069 0.435], [ 0.435 33.697]]] + 0.644*MultiNorm[μ=[ 4.29 79.968], σ=[[ 0.17 0.941], [ 0.941 36.046]]]
    iteration 10 (log-likelihood=-9.53885e+02): p(x|Φ) = 0.356*MultiNorm[μ=[ 2.036 54.479], σ=[[ 0.069 0.435], [ 0.435 33.697]]] + 0.644*MultiNorm[μ=[ 4.29 79.968], σ=[[ 0.17 0.941], [ 0.941 36.046]]]
    iteration 11 (log-likelihood=-9.53885e+02): p(x|Φ) = 0.356*MultiNorm[μ=[ 2.036 54.479], σ=[[ 0.069 0.435], [ 0.435 33.697]]] + 0.644*MultiNorm[μ=[ 4.29 79.968], σ=[[ 0.17 0.941], [ 0.941 36.046]]]
    iteration 12 (log-likelihood=-9.53885e+02): p(x|Φ) = 0.356*MultiNorm[μ=[ 2.036 54.479], σ=[[ 0.069 0.435], [ 0.435 33.697]]] + 0.644*MultiNorm[μ=[ 4.29 79.968], σ=[[ 0.17 0.941], [ 0.941 36.046]]]

mix'EM returns a tuple ``(weights, distributions, log_likelihood)`` containing
the mixing weights, the distributions with their fitted parameters, and the
total log-likelihood of the model, respectively. We can use this model to
compute probabilities for new data points with ``mixem.probability``.
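Before visualizing the full density, it can be helpful to inspect the returned
values directly. The following is a minimal sketch (not part of the original
example): it prints each component's mixing weight and fitted parameters and
evaluates the mixture density at one made-up observation, using only the
``weights``, ``distributions`` and ``ll`` values returned above and the
``mixem.probability`` function used below.

.. code:: python

    # A minimal sketch: inspect the fitted mixture and score one new point.
    # The point (3.5 min eruption, 70 min wait) is a made-up example value.
    for weight, dist in zip(weights, distributions):
        print("weight = {:.3f}, component = {}".format(weight, dist))
    print("log-likelihood:", ll)

    new_point = np.array([[3.5, 70.0]])  # (eruption duration, waiting time)
    print("mixture density at the new point:",
          mixem.probability(new_point, weights, distributions))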
Here, we create a grid of ``N * N`` values and compute probabilities for all
grid points. We then map the result onto a bokeh image:

.. code:: python

    N = 100

    x = np.linspace(np.min(data.eruptions), np.max(data.eruptions), num=N)
    y = np.linspace(np.min(data.waiting), np.max(data.waiting), num=N)

    xx, yy = np.meshgrid(x, y, indexing="ij")

    # Convert meshgrid into a ((N*N), 2) array of coordinates
    xxyy = np.array([xx.flatten(), yy.flatten()]).T

    # Compute model probabilities
    p = mixem.probability(xxyy, weights, distributions).reshape((N, N))

    fig2 = figure(title="Fitted Old Faithful Data",
                  x_axis_label="Eruption duration (minutes)",
                  y_axis_label="Waiting time (minutes)")

    # Plot the grid of model probabilities -- attention: bokeh expects a _transposed_ input matrix!
    fig2.image(image=[p.T],
               x=np.min(data.eruptions), y=np.min(data.waiting),
               dw=np.ptp(data.eruptions), dh=np.ptp(data.waiting),
               palette="Spectral11")

    # Plot data points
    fig2.scatter(x=data.eruptions, y=data.waiting, color="#000000")

    show(fig2)
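As an optional alternative rendering (a sketch that is not part of the
original example and assumes ``matplotlib`` is installed), the same
probability grid can be drawn as filled contours; ``contourf`` takes the
meshgrid arrays directly, so no transposition is needed:

.. code:: python

    # A minimal sketch: render the same density grid with matplotlib instead
    # of bokeh, reusing xx, yy and p from the code above.
    import matplotlib.pyplot as plt

    plt.contourf(xx, yy, p, levels=20, cmap="Spectral")
    plt.scatter(data.eruptions, data.waiting, s=8, c="black")
    plt.xlabel("Eruption duration (minutes)")
    plt.ylabel("Waiting time (minutes)")
    plt.title("Fitted Old Faithful Data")
    plt.show()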