# Embrace labeled multi-dimensional arrays for better exploratory analysis of your Bayesian models

This blog post is an adaptation of the Tagging Basketball Events with HMM in Stan case study. It will not cover any new topics or analysis, and it assumes you have at least skimmed the original case study.

So what is this post about? I will use the cmdstanpy+ArviZ integration to show the potential of using labeled arrays when doing exploratory analysis of Bayesian models. I will use xarray's automatic broadcasting and alignment of arrays, together with the `stats` module of xarray-einstats, for posterior predictive sampling.

Each section maps to an example in the original case study: simple HMM example, tagging drive events, and defensive assignment. All sections follow the same structure. The beginning is as concise as possible to avoid duplication: the data needed for the model is read, and the model is compiled and sampled. If you are interested, you can read the Stan code of each model by clicking on the "Show Output" buttons. We then move on to the target of this blog post: conversion of the CmdStanPy fit to an ArviZ `InferenceData` and postprocessing with xarray and xarray-einstats.

```python
import cmdstanpy
import pandas as pd
import arviz as az
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr

az.style.use("arviz-darkgrid")
```
## Simple HMM example

Link to this same section in the original Stan case study. Click the button below to see the Stan code:

```python
hmm_data = pd.read_csv("data/hmm_example.csv")

with open("stan_codes/hmm_example.stan", "r") as f:
    print(f.read())

model = cmdstanpy.CmdStanModel(stan_file="stan_codes/hmm_example.stan")
stan_data = dict(N=len(hmm_data), K=2, y=hmm_data["y"])
hmm_fit = model.sample(data=stan_data)
```
### Conversion to InferenceData

To convert a CmdStanPy fit to `InferenceData`, only the `CmdStanMCMC` object is needed. However, to make the most out of ArviZ and xarray features, the dimensions of each variable should also be provided. Dimensions are provided as a dictionary whose keys are variable names and whose values are lists with the dimension names. Optionally, you can also give coordinate values to some of the dimensions; the dimensions without coordinate values provided are initialized with integers starting from 0 as their coordinate values. Coordinates are provided as a dictionary whose keys are now dimension names and whose values are coordinate values.

```python
states = [1, 2]
idata = az.from_cmdstanpy(
    hmm_fit,
    dims={"theta": ["origin_state", "end_state"], "mu": ["state"], "z_star": ["time"]},
    coords={"state": states, "origin_state": states, "end_state": states},
)
idata
```
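As a quick sanity check (this snippet is mine, not output from the case study), the posterior variables now carry the dimension names we declared instead of default ones:

```python
# the labeled dims replace default positional names like theta_dim_0
print(idata.posterior["theta"].dims)  # ('chain', 'draw', 'origin_state', 'end_state')
print(idata.posterior["mu"].dims)     # ('chain', 'draw', 'state')
```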
We have now created an `InferenceData` object with two groups: the `posterior` (shown below) contains all the posterior samples, and `sample_stats` contains sampler information like the log probability, which samples are divergent, or the tree depth.

Each group is an `xarray.Dataset`. As you can see, `Dataset`s have dimensions, coordinates, data variables and attributes. When printed (either as text or as an HTML repr), each element has its own section with the relevant information.

- The dimensions section lists all the dimensions and their lengths. There we can quickly see that we have 2 states and have sampled 1000 draws in 4 independent chains.
- The coordinates section lists, in the following order: coordinate name, dimension name, type of the coordinate values, and the coordinate values themselves. Moreover, at the beginning there can be a `*`, which indicates an indexing coordinate. With indexing coordinates, you can use the `.sel` method on either `InferenceData` or `Dataset` to select a subset of the data using coordinate values.
- The data variables section lists: variable name, dimensions, type and values. Each variable, stored as a `DataArray` object, is independent of the others. They can have any of the dimensions of the `Dataset`, in any order.
- The attributes section lists `Dataset`-level attributes. By default, ArviZ adds some attributes to give an idea of how the data was generated.

```python
idata.posterior
```
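For example, a minimal sketch (mine, not from the case study) of what `.sel` enables with the coordinate values we provided:

```python
# select all draws of the transition probabilities out of state 1,
# using coordinate values rather than positional indices
theta_from_1 = idata.posterior["theta"].sel(origin_state=1)

# or a single transition probability (state 1 -> state 2) per draw
theta_12 = idata.posterior["theta"].sel(origin_state=1, end_state=2)
```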
ArviZ functions make use of these labels as well, for example when summarizing the posterior:

```python
az.summary(idata, var_names=["theta", "mu"])
```
We can customize the appearance of the summary with the `labeller` argument. The `arviz.labels` module includes some common labeller classes. The default shows only the variable name and coordinate values. We will now use the `DimCoordLabeller` to also show the dimension names:

```python
az.summary(idata, var_names=["theta", "mu"], labeller=az.labels.DimCoordLabeller())
```
Further guidance on sorting and customizing ArviZ labels can be found in the ArviZ label guide.

### Posterior predictive sampling

Following the case study, we will perform posterior predictive sampling in Python instead of in the `generated_quantities` block of Stan. We will use xarray-einstats to generate the random samples from xarray objects.

```python
from xarray_einstats.stats import XrContinuousRV
from scipy.stats import norm

post = idata.posterior
psi_seq = post["mu"].sel(state=post["z_star"])
# the conversion to a dataset is for display reasons only
psi_seq.to_dataset(name="psi_seq")
```
When we do `.sel(state=DataArray)`, we are telling xarray to use the values in the provided `DataArray` as labels with which to index the `state` dimension. xarray takes care of aligning and broadcasting the dimensions for the indexing to work, and generates the desired 3D array with chain, draw and time dimensions.
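To make that broadcasting concrete, here is a self-contained toy sketch (all names and values are invented for illustration):

```python
import numpy as np
import xarray as xr

# 2 chains, 3 draws, one mean per state
mu_toy = xr.DataArray(
    np.tile([0.0, 5.0], (2, 3, 1)),
    dims=("chain", "draw", "state"),
    coords={"state": [1, 2]},
)
# most likely state at each of 4 time steps, per chain and draw
z_toy = xr.DataArray(
    np.tile([1, 2, 2, 1], (2, 3, 1)),
    dims=("chain", "draw", "time"),
)
# vectorized, label-based indexing: the result has dims (chain, draw, time)
mu_toy.sel(state=z_toy)
```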
With the means that correspond to each posterior predictive sample, we need to generate random draws from a normal with mean `mu` and standard deviation 1. xarray-einstats provides the `XrContinuousRV` class to wrap SciPy distributions and have them take `DataArray`s as inputs. We can then create the distribution and generate the random samples with the `rvs` method, like we would do with SciPy. The `to_dataset` method is called so we can then add the data as a new group to our `InferenceData`.

```python
idata.add_groups(posterior_predictive=XrContinuousRV(norm, psi_seq, 1).rvs().to_dataset(name="y"))
print(idata)
```

```python
idata.posterior_predictive
```
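For comparison, here is a sketch (mine, purely illustrative) of the same draw with raw SciPy, where we would have to track the axis order ourselves:

```python
# without labels, we must fix a positional axis order by hand
psi_seq_np = psi_seq.transpose("chain", "draw", "time").values
y_pp_np = norm(loc=psi_seq_np, scale=1).rvs()  # same shape, but unlabeled
```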
Before plotting, we will use the `extract_dataset` function to get a random subset of 100 samples. Plotting the 4000 samples we have available would be excessive and would not add any information to the plot.

```python
pp_subset = az.extract_dataset(idata, group="posterior_predictive", num_samples=100)

_, ax = plt.subplots()
ax.plot(hmm_data["y"], "k-", zorder=3, label="Observed")
ax.set_title("Observed vs Predicted Output")
ax.set_ylabel("Observation Value")
ax.set_xlabel("Time")
ax.plot(pp_subset["y"], color="#ff668890", alpha=.2)
ax.plot([], [], color="#ff668890", label="Predicted")
ax.legend();
```
## Tagging Drive Events

Link to this same section in the original Stan case study.

```python
df = pd.read_csv("data/evt140_0021500411.csv")

stan_data = dict(
    N=len(df),
    K=2,
    u=np.log(1 / df["lavine_speed_smooth"].values),
    v=np.log(df["lavine_dist"].values),
    alpha=np.array([[4, 2], [2, 4]]),
    tau=0.1,
    rho=0.1,
)

with open("stan_codes/drive_1.stan", "r") as f:
    print(f.read())

model = cmdstanpy.CmdStanModel(stan_file="stan_codes/drive_1.stan")
drive_fit = model.sample(data=stan_data)
```
### Conversion to InferenceData

In this example we also use the `observed_data` argument to add some data to the `observed_data` group. This can be useful to have the observations also as xarray objects and ease postprocessing operations, or to share the model and InferenceData file so that collaborators can reproduce the fit or work with the results directly.

```python
states = [1, 2]
drive_idata = az.from_cmdstanpy(
    drive_fit,
    dims={
        "theta": ["origin_state", "end_state"],
        "alpha": ["origin_state", "end_state"],
        "phi": ["state"],
        "lambda": ["state"],
        "z_star": ["time"],
        "v": ["time"],
        "u": ["time"],
    },
    observed_data={k: v for k, v in stan_data.items() if k in {"u", "v", "alpha"}},
    coords={"state": states, "origin_state": states, "end_state": states},
)
drive_idata.posterior
```
Posterior predictive sampling works like in the previous example: we index the state-dependent parameters with the inferred state sequence and draw from the corresponding normal distributions. Note that here `.sel` is applied to a `Dataset` holding both `phi` and `lambda`, so both variables are indexed with the state sequence at once.

```python
post = drive_idata.posterior
ds_seq = post[["phi", "lambda"]].sel(state=post["z_star"])
phi_hat = XrContinuousRV(norm, ds_seq["phi"], .1).rvs()
lambda_hat = XrContinuousRV(norm, ds_seq["lambda"], .1).rvs()
drive_idata.add_groups(
    posterior_predictive=xr.Dataset({"phi": phi_hat, "lambda": lambda_hat})
)
drive_idata.posterior_predictive
```
We end by reproducing the plot in the original case study, to show that the posterior predictive samples do indeed look the same.

```python
pp_subset = az.extract_dataset(drive_idata, group="posterior_predictive", num_samples=100)
obs_data = drive_idata.observed_data

_, axes = plt.subplots(3, 1)

ax = axes[0]
ax.plot(obs_data["v"], "k-", zorder=3, label="Observed")
ax.set_title("Distance from hoop (log scale)")
ax.set_ylabel("Distance from hoop")
ax.set_xlabel("Time (25Hz)")
ax.plot(pp_subset["lambda"], color="#ff668890")
ax.plot([], [], color="#ff668890", label="Predicted")
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize="medium")

ax = axes[1]
ax.plot(obs_data["u"], "k-", zorder=3, label="Observed")
ax.set_title("Smooth speed")
ax.set_ylabel("Speed (log scale)")
ax.set_xlabel("Time (25Hz)")
ax.plot(pp_subset["phi"], color="#ff668890")
ax.plot([], [], color="#ff668890", label="Predicted")
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize="medium")

ax = axes[2]
ax.plot(post["z_star"].mean(("chain", "draw")), "k-")
ax.set_title("Hidden states")
ax.set_ylabel("State")
ax.set_xlabel("Time (25Hz)")
ax.set_ylim((.5, 2.5))
ax.set_yticks([1, 2], labels=["Drive", "None"], size="medium");
```
## Defensive assignment

Link to this same section in the original Stan case study.

```python
ds = xr.open_dataset("data/defense_example.nc")
print(ds)

# convert the Dataset to a dictionary of scalars and NumPy arrays for CmdStanPy
stan_data = {k: v.item() if v.size == 1 else v.values for k, v in ds.items()}

with open("stan_codes/defense_0a.stan", "r") as f:
    print(f.read())

model = cmdstanpy.CmdStanModel(stan_file="stan_codes/defense_0a.stan")
defense0a_fit = model.sample(data={**stan_data, "lambda": [1/3, 1/3, 1/3]})
```
In this example we already have the data as an xarray object, so we won't use the `observed_data` group. If you still wanted to include it, it might be easier to use `.add_groups` with the already existing `Dataset`, like we did with the posterior predictive samples.

```python
states = [1, 2, 3, 4, 5]
defense0a_idata = az.from_cmdstanpy(
    defense0a_fit,
    dims={
        "theta": ["origin_state", "end_state"],
        "z_star": ["time"],
    },
    coords={"state": states, "origin_state": states, "end_state": states},
)
defense0a_idata.posterior
```
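If we did want the observations stored in the `InferenceData` too, a minimal sketch of that `.add_groups` alternative (not run in the original post) would be:

```python
# attach the existing xarray Dataset directly as the observed_data group
defense0a_idata.add_groups(observed_data=ds)
```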
with open("stan_codes/defense_0b.stan", "r") as f:
print(f.read())
model = cmdstanpy.CmdStanModel(stan_file="stan_codes/defense_0b.stan")
defense0b_fit = model.sample(data = {**stan_data, "alpha": [3., 3., 3.]})
Here I have chosen `param` as the dimension name for `lambda` because each component multiplies a different provided variable, and I have used `o, h, b` as coordinate values to match the variable names in the data block. They could be `offensive player, hoop, ball` as well; there is no need to restrict oneself to one-character coordinate values.

```python
states = [1, 2, 3, 4, 5]
defense0b_idata = az.from_cmdstanpy(
    defense0b_fit,
    dims={
        "theta": ["origin_state", "end_state"],
        "lambda": ["param"],
        "z_star": ["time"],
    },
    coords={"state": states, "origin_state": states, "end_state": states, "param": ["o", "h", "b"]},
)
defense0b_idata.posterior
```
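Coordinate values can also be relabeled after the fact; a small sketch (the long names here are my own illustration):

```python
# map the short coordinate values to more descriptive ones
defense0b_idata.posterior["lambda"].assign_coords(
    param=["offensive player", "hoop", "ball"]
)
```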
```python
lambda0b = defense0b_idata.posterior["lambda"]
# model 0a fixed the weights at 1/3 each; model 0b sampled them
mu0a = ds["o"] / 3 + ds["h"] / 3 + ds["b"] / 3
mu0b = (
    ds["o"] * lambda0b.sel(param="o")
    + ds["h"] * lambda0b.sel(param="h")
    + ds["b"] * lambda0b.sel(param="b")
)
```

Note that `mu0b` is now a 5d array. Thanks to xarray's automatic alignment and broadcasting capabilities, we have calculated its values for all players, all time steps and all samples at once:

```python
mu0b.to_dataset(name="mu0b")
```
Keeping all the draws doesn't make any difference in this case, because we are multiplying the components of `lambda` by quantities that are not random variables, so we will get the same result whether we average `lambda` before operating or average `mu` after operating.
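We can check that equivalence directly; a quick sketch (mine, not from the original post):

```python
# average lambda first, then combine...
mu_mean_first = (
    ds["o"] * lambda0b.sel(param="o").mean(("chain", "draw"))
    + ds["h"] * lambda0b.sel(param="h").mean(("chain", "draw"))
    + ds["b"] * lambda0b.sel(param="b").mean(("chain", "draw"))
)
# ...and compare with averaging the full per-draw mu afterwards
xr.testing.assert_allclose(mu_mean_first, mu0b.mean(("chain", "draw")))
```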
However, in many cases we need to operate with all the draws of each random variable. xarray makes it straightforward to work with all the samples and average only once we have the quantity of interest.

## Further reading

- CmdStanPy and ArviZ integration

Package versions used to generate this post:

Comments are not enabled for this post; to inquire further about the contents of the post, ask on Stan Discourse. Feel free to tag me at @OriolAbril.