Statistics (labtoolbox.stats)#
Statistical tools for LabToolbox.
Provides histogram, residual analysis, outlier rejection, Bayesian inference, and basic statistical estimators.
- labtoolbox.stats.bayes_factor(x, y, y_err, f1, p0_1, f2, p0_2, burn=1000, steps=5000, thin=10, maxfev=5000, prior_bounds1=None, prior_bounds2=None, verbose=True)[source]#
Estimate the Bayes factor between two models using the Bayesian Information Criterion (BIC).
- Parameters:
x (array-like) – Independent variable values of the dataset.
y (array-like) – Dependent variable (observed data).
y_err (array-like) – Uncertainties (standard deviations) on the observed data.
f1 (callable) – First model function to compare.
p0_1 (list or array-like) – Initial guess for the parameters of the first model.
f2 (callable) – Second model function to compare.
p0_2 (list or array-like) – Initial guess for the parameters of the second model.
burn (int, optional) – Number of initial MCMC steps to discard. Default is 1000.
steps (int, optional) – Total number of MCMC steps per walker. Default is 5000.
thin (int, optional) – Thinning factor applied when flattening the MCMC chains. Default is 10.
maxfev (int, optional) – Maximum number of function evaluations for the curve fitting. Default is 5000.
prior_bounds1 (list of tuples, optional) – Bounds for the uniform priors of the first model. Each element must be a (min, max) tuple. If None, unbounded uniform priors are assumed.
prior_bounds2 (list of tuples, optional) – Bounds for the uniform priors of the second model.
verbose (bool, optional) – If True, prints a formatted table of the returns. Default is True.
- Returns:
lnB12 (float) – Estimated natural logarithm of the Bayes factor. A positive value favors model M1; a negative value favors model M1.
BIC1 (float) – Bayesian Information Criterion for the first model.
BIC2 (float) – Bayesian Information Criterion for the second model.
Notes
Interpretation of ln(B12):
ln B12 > 5 : Strong evidence for model 1
ln B12 ∈ [2.5, 5) : Moderate evidence for model 1
ln B12 ∈ [1, 2.5) : Weak evidence for model 1
ln B12 ∈ [-1, 1) : Inconclusive
ln B12 ∈ [-2.5, -1) : Weak evidence for model 2
ln B12 ∈ [-5, -2.5) : Moderate evidence for model 2
ln B12 < -5 : Strong evidence for model 2
The approximation assumes that the prior volume is not too informative and that the maximum a posteriori estimate is close to the maximum likelihood.
- labtoolbox.stats.hist(data, data_err, scale=0, bins='auto', label='', unit='', verbose=True, figsize=(7.68, 5.76))[source]#
Plots the histogram of a dataset and assess its Gaussianity using statistical indicators and a Shapiro-Wilk test.
This function visualizes the empirical distribution of a dataset data, optionally accounting for individual measurement uncertainties data_err. It overlays a Gaussian curve parameterized by the estimated mean and standard deviation, and performs a normality test, reporting skewness, kurtosis, and the p-value from the Shapiro-Wilk test.
- Parameters:
data (array-like) – Numerical data representing the variable of interest.
data_err (scalar or array-like or None) – Array of uncertainties associated with each element of data. If None, uncertainties are not included in the computation of the effective standard deviation.
scale (int, optional) – Scaling exponent for data and data_err (default is 0). For example, scale = -2 rescales the inputs by 1e2.
bins (int or str, optional) – Number of bins or binning strategy passed to matplotlib.pyplot.hist. Defaults to “auto”.
label (str, optional) – Label for the x-axis, typically the name of the variable.
unit (str, optional) – Unit of measurement for the x-axis variable (e.g., “cm”). If provided, it will be displayed in the axis label and summary output.
verbose (bool, optional) – If True, prints a formatted table of the histogram parameters. Defaults to True.
figsize (tuple of float, optional) – Figure size passed to matplotlib. Default is (6.4 * 1.2, 4.8 * 1.2).
- Returns:
mean (float) – Arithmetic mean of the scaled data.
sigma (float) – Effective standard deviation of the distribution, accounting for both the empirical spread and uncertainties (if provided).
skewness (float) – Skewness of the distribution.
kurtosis (float) – Kurtosis of the distribution.
p_value (float) – p-value from the Shapiro-Wilk test for normality.
Notes
The effective standard deviation is computed as np.sqrt(data.std()**2 + np.sum(data_err**2)/len(data_err)) if data_err is provided.
The function rescales both data and data_err by 10**scale for display purposes, but all statistics are computed on the scaled data.
The normal distribution is refferd to as N(mu, sigma**2).
Example
>>> import numpy as np >>> from labtoolbox.stats import hist >>> x = np.random.normal(loc=5, scale=0.2, size=100) >>> x_err = np.full_like(x, 0.05) >>> hist(x, x_err, scale=-2, label="Length", unit="cm")
- labtoolbox.stats.lin_fit(x, y, y_err, x_err=None, fitmodel='wls', xlabel='x [ux]', ylabel='y [uy]', xlim=None, ylim=None, showlegend=True, legendloc=None, log=None, xscale=0, yscale=0, mscale=0, cscale=0, m_units='', c_units='', confidence=2, confidencerange=True, residuals=True, norm=True, verbose=False, summary=True, figsize=(7.68, 5.76))[source]#
Performs a linear fit (Weighted Least Squares or Ordinary Least Squares) and displays experimental data along with the regression line and uncertainty band.
- Parameters:
x (array-like) – Values of the independent variable.
y (array-like) – Values of the dependent variable.
y_err (array-like) – Uncertainties associated with y values.
x_err (array-like, optional) – Uncertainties associated with x values.
fitmodel (str, optional) – Fitting model, either “wls” or “ols”. Default is “wls”.
xlabel (str, optional) – Label for the x-axis, including units in square brackets (e.g., “x [m]”).
ylabel (str, optional) – Label for the y-axis, including units in square brackets (e.g., “y [s]”).
showlegend (bool, optional) – If True, displays a legend with the values of m and c on the plot.
legendloc (str, optional) – Location of the legend on the plot (‘upper right’, ‘lower left’, ‘upper center’, etc.). Default is None.
log (str, optional) – If set to ‘x’ or ‘y’, the corresponding axis is plotted on a logarithmic scale; if ‘xy’, both axes. Default is None.
xscale (int, optional) – Scaling factor for the x-axis (e.g., xscale = -2 corresponds to 1e-2, to convert meters to centimeters).
yscale (int, optional) – Scaling factor for the y-axis.
xlim (tuple, optional) – Limits for the x-axis, in the form (xmin, xmax). The values should already be scaled with respect to xscale. If None or an empty tuple, the default limits will be automatically determined from the data.
ylim (tuple, optional) – Limits for the y-axis, in the form (ymin, ymax). The values should already be scaled with respect to yscale. If None or an empty tuple, the default limits will be automatically determined from the data.
mscale (int, optional) – Scaling factor for the slope m.
cscale (int, optional) – Scaling factor for the intercept c.
m_units (str, optional) – Unit of measurement for m. Default is “”.
c_units (str, optional) – Unit of measurement for c. Default is “”.
confidence (int, optional) – Residual confidence interval to display, i.e., [-confidence, +confidence].
confidencerange (bool, optional) – If True, shows the 1σ uncertainty band around the fit line.
residuals (bool, optional) – If True, adds an upper panel showing fit residuals.
norm (bool, optional) – If True, residuals in the upper panel will be normalized.
verbose (bool, optional) – If True, prints the output of wls_fit (or ols_fit) to the screen. Default is False.
summary (bool, optional) – If True, prints a formatted summary report of the fit, including reduced chi-square, p-value, and percentage of residuals within ±2σ.
figsize (tuple of float, optional) – Figure size passed to matplotlib. Default is (6.4 * 1.2, 4.8 * 1.2).
- Returns:
m (float) – Slope of the regression line.
c (float) – Intercept of the regression line.
sigma_m (float) – Uncertainty on the slope.
sigma_c (float) – Uncertainty on the intercept.
chi2_red (float) – Reduced chi-square value (χ²/dof).
p_value (float) – Fit p-value.
Notes
The values of xscale and yscale affect only the axis scaling in the plot and have no impact on the fitting computation itself. All model parameters are estimated using the original input data as provided.
LaTeX formatting is already embedded in the strings used to display the units of m and c. You do not need to use “$…$”.
If c_scale = 0 (recommended when using c_units), then c_units will represent the suffix corresponding to 10^yscale (+ y_units).
If m_scale = 0 (recommended when using m_units), then m_units will represent the suffix corresponding to 10^(yscale - xscale) [+ y_units/x_units].
- labtoolbox.stats.mean(x, kind='arith')[source]#
Compute the mean of an input scalar or numpy array, with the specified type of mean.
- Parameters:
x (float or array-like) – Input data. Can be a scalar or a numpy array. If a scalar is provided, it is returned as the mean.
kind (str or float, optional) –
- Type of mean to compute. Supported options are:
’max’ : maximum value of x
’min’ : minimum value of x
’arith’ : arithmetic mean (default)
’geom’ : geometric mean
’harmonic’ : harmonic mean
’rms’ : root mean square (quadratic mean)
’cubic’ : cubic mean (third-order)
’agm’ : arithmetic-geometric mean
float (number) : generalized mean of order p
- Returns:
result – The computed mean of x, according to the specified type.
- Return type:
float
Examples
>>> from labtoolbox.stats import mean >>> mean([1, 2, 3, 4], 'arith') 2.5 >>> mean([1, 2, 3, 4], 'rms') 2.7386127875258306 >>> mean([1, 2, 3, 4], 2) 2.7386127875258306
- labtoolbox.stats.model_fit(x, y, f, x_err=None, y_err=None, p0=None, xlabel='x [ux]', ylabel='y [uy]', xlim=None, ylim=None, showlegend=True, legendloc=None, bounds=None, confidencerange=True, log=None, maxfev=5000, xscale=0, yscale=0, confidence=2, residuals=True, norm=True, verbose=True, print_parameters=True, figsize=(7.68, 5.76))[source]#
General-purpose fit of multi-parameter functions.
- Parameters:
x (array-like) – Measured values of the independent variable.
y (array-like) – Measured values of the dependent variable.
f (function) – Function of one variable (first argument of f) with N free parameters.
x_err (array-like, optional) – Uncertainties associated with the independent variable. Default is None.
y_err (array-like, optional) – Uncertainties associated with the dependent variable. Default is None.
p0 (list or array-like of floats, optional) – Initial guess for the model parameters, in the form [a, …, z]. Default is None.
xlabel (str, optional) – Label (and units) for the independent variable.
ylabel (str, optional) – Label (and units) for the dependent variable.
xlim (tuple, optional) – Limits for the x-axis, in the form (xmin, xmax). The values should already be scaled with respect to xscale. If None or an empty tuple, the default limits will be automatically determined from the data.
ylim (tuple, optional) – Limits for the y-axis, in the form (ymin, ymax). The values should already be scaled with respect to yscale. If None or an empty tuple, the default limits will be automatically determined from the data.
showlegend (bool, optional) – If True, displays a legend with the reduced chi-square and p-value in the plot.
legendloc (str, optional) – Location of the legend in the plot (‘upper right’, ‘lower left’, ‘upper center’, etc.). Default is None.
bounds (2-tuple of array-like, optional) – Tuple ([lower_bounds], [upper_bounds]) specifying bounds for the parameters. Default is None.
confidencerange (bool, optional) – If True, displays the 1σ uncertainty band around the best-fit curve.
log (str, optional) – If set to ‘x’ or ‘y’, the corresponding axis is plotted on a logarithmic scale; if ‘xy’, both axes. Default is None.
maxfev (int, optional) – Maximum number of iterations allowed by curve_fit.
xscale (int, optional) – Scaling factor for the x-axis (e.g., xscale = -2 corresponds to 1e-2, to convert meters to centimeters).
yscale (int, optional) – Scaling factor for the y-axis.
confidence (int, optional) – Residual confidence interval to display, i.e., [-confidence, +confidence].
residuals (bool, optional) – If True, adds an upper panel showing fit residuals.
norm (bool, optional) – If True, residuals in the upper panel will be normalized.
verbose (bool, optional) – If True, prints a formatted summary report of the fit, including reduced chi-square, p-value, and percentage of residuals within ±2σ.
print_parameters (bool, optional) – If True, prints the best-fit parameters along with their standard uncertainties.
figsize (tuple of float, optional) – Figure size passed to matplotlib. Default is (6.4 * 1.2, 4.8 * 1.2).
- Returns:
popt (array-like) – Array of optimal parameters estimated from the fit.
perr (array-like) – Uncertainties on the optimal parameters (only if y_err is provided).
chi2_red (float) – Reduced chi-square value (χ²/dof).
p_value (float) – Fit p-value.
Notes
The values of xscale and yscale affect only the axis scaling in the plot and have no impact on the fitting computation itself. All model parameters are estimated using the original input data as provided.
- labtoolbox.stats.posterior(x, y, y_err, f, p0, burn=1000, steps=5000, thin=10, maxfev=5000, verbose=True, names=None, prior_bounds=None, plot_dataset=False, plot_density=True, color='k', figsize=(7.68, 5.76), **kwargs)[source]#
Performs a Bayesian parameter estimation using MCMC for a user-defined model function.
This function fits a given model f to the experimental data (x, y) with associated uncertainties y_err by first performing a frequentist optimization (curve_fit) to obtain initial estimates, and then running a Markov Chain Monte Carlo (MCMC) sampling using the emcee package to derive the full posterior distribution of the model parameters. It returns the flattened MCMC sample chain representing the posterior samples, the maximum likelihood estimate (MLE) of the parameters, and it visualizes the posterior distribution via a corner plot.
- Parameters:
x (array-like,) – Independent variable data points.
y (array-like,) – Dependent variable data points to be fitted.
y_err (array-like) – Uncertainties of the dependent data y. Used for computing chi-squared and log-likelihood in the MCMC sampling. Can be None.
f (callable) – The model function to fit. Must be of the form f(x, *params), where x is the independent variable and params are the free parameters of the model.
p0 (list or array-like) – Initial guess for the M free parameters of the model. Used both for curve_fit and to initialize the MCMC walkers. Can be None.
burn (int, optional) – Number of burn-in steps to discard from the beginning of each walker chain before flattening. These initial steps are typically biased by the starting conditions. Default is 1000.
steps (int, optional) – Total number of MCMC steps for each walker. Default is 5000.
thin (int, optional) – Subsampling factor to reduce autocorrelation between samples. Only every thin-th sample is retained. Default is 10.
maxfev (int, optional) – Maximum number of function evaluations for the curve_fit routine. If exceeded, the fit will fail. Default is 5000.
verbose (bool, optional) – If True, prints a formatted table of … Default is True.
names (list of str, optional) – Parameter names to be used in the corner plot and output. If None, defaults to [‘p0’, ‘p1’, …, ‘pN’].
prior_bounds (list of tuple, optional) – Prior bounds on the parameters, as a list of (min, max) tuples for each parameter. If None, assumes uninformative priors that only reject non-positive values.
plot_dataset (bool, optional) – Draw the individual data points. Default is False.
plot_density (bool, optional) – Draw the density colormap. Default is False.
color (str, optional) – A matplotlib style color for all histograms. Default is k
figsize (tuple of float, optional) – Figure size passed to the corner plot. Default is (6.4 * 1.2, 4.8 * 1.2).
**kwargs – Any remaining keyword arguments are passed to corner.corner.
- Returns:
mle_params (numpy.ndarray, shape (M,)) – The parameter vector corresponding to the sample with the highest log-probability (maximum likelihood estimate) in the posterior chain
flat_samples (numpy.ndarray, shape (n_samples, M)) – Flattened MCMC sample chain (after burn-in and thinning). Each row is a sample of the M parameters. This chain can be used for uncertainty analysis, plotting posteriors, or further statistical inference.
Notes
This implementation assumes a uniform prior over all parameters, constrained to be strictly positive. Parameters less than or equal to zero are automatically rejected (log-prob = -inf).
The params object returned is for reference only; it does not contain MCMC results. To use posterior values in further analysis, refer to flat_samples.
The performance and correctness of the posterior heavily depend on the choice of priors, model structure, and convergence of the sampler. Always check convergence diagnostics in real analyses.
Example
>>> import numpy as np >>> from labtoolbox.stats import posterior >>> def model(x, a, b): ... return a * x + b >>> x = np.linspace(0, 10, 50) >>> y = model(x, 2.5, 1.0) + np.random.normal(0, 0.5, size=x.size) >>> y_err = 0.5 * np.ones_like(y) >>> posterior(x, y, y_err, model, [1, 1])
- labtoolbox.stats.propagate(func, x_val, x_err, params=None, method='Monte_Carlo', MC_sample_size=10000)[source]#
Propagates uncertainty from input arrays to a generic function using the uncertainty_class library.
- Parameters:
func (callable) – The base function, in the form f(x, a) where: - x is a vector of variables. - a is a vector of parameters (optional).
x_val (list of numpy.array) – List containing the input variable arrays [x1, x2, …, xn]. Each entry can be: - a scalar, - a numpy array of values (same length across all xi).
x_err (list or numpy.array) – List of uncertainties for each variable (scalars or arrays), or a full covariance matrix.
params (list or numpy.array, optional) – List or array of constant parameters [a1, a2, …, am].
method (str, optional) – The uncertainty propagation method (‘Delta’ or ‘Monte_Carlo’).
MC_sample_size (int, optional) – Sample size for the Monte Carlo method.
- Returns:
f_values (numpy.ndarray) – Values of the function calculated at each point j.
f_err (numpy.ndarray) – Propagated uncertainties on the output function for each point j.
confidence_bands (tuple of numpy.ndarray) – Lower and upper confidence bands for each point j.
- labtoolbox.stats.residuals(data, expected_data, data_err, scale=0, unit='', bins='auto', confidence=2, norm=False, verbose=True, figsize=(7.68, 5.76))[source]#
Analyzes and visualizes the residuals of the quantity of interes, including histogram, Gaussianity test, and autocorrelation test (Durbin-Watson statistic).
- Parameters:
data (array-like) – Measured data points.
expected_data (array-like) – Expected values to compare with data (e.g., from a model, theoretical prediction, or fit).
data_err (array-like) – Uncertainties associated with each data point in data.
scale (int, optional) – Scaling exponent applied to all quantities (e.g. scale = -2 scales meters to centimeters). Default is 0.
unit (str, optional) – Unit of measurement of the data (e.g., “cm” or “s”). Used for labeling axes. Default is an empty string.
bins (int or str, optional) – Number of bins or binning strategy passed to matplotlib.pyplot.hist. Default is “auto”.
confidence (float, optional) – Confidence factor for visualizing bounds (e.g., confidence = 2 draws ±2σ bounds). Default is 2.
norm (bool, optionale) – If True, residuals will be normalized. Default is False.
verbose (bool, optional) – If True, prints a formatted table of the returns (mean value, standard deviation, skewness, kurtosis, p-value and Durbin–Watson statistic). Default is True.
figsize (tuple of float, optional) – Figure size passed to matplotlib. Default is (6.4 * 1.2, 4.8 * 1.2).
- Returns:
mean (float) – Mean value of the residuals, after applying the specified scale.
sigma (float) – Estimated standard deviation of the residuals, weighted by data_err.
skewness (float) – Skewness (third standardized moment) of the residual distribution.
kurtosis (float) – Excess kurtosis (fourth standardized moment minus 3) of the residual distribution.
p_value (float) – p-value from the Shapiro–Wilk normality test.
dw (float) – Durbin–Watson statistic for testing autocorrelation in the residuals.
Notes
The residuals are computed as resid = data - expected_data, and scaled by 10**scale.
The standard deviation is computed as`np.sqrt(resid.std()**2 + np.sum(data_err**2)/len(data_err))`.
The normal distribution is refferd to as N(mu, sigma**2).
- The Shapiro–Wilk test is used to test for normality of the residuals:
If p_value >= 0.05, residuals are considered consistent with a normal distribution.
- The Durbin–Watson statistic is used to detect first-order autocorrelation:
Values ≈ 2 suggest no autocorrelation.
Values < 1.5 suggest positive autocorrelation.
Values > 2.5 suggest negative autocorrelation.