From mode to model: some (re)developments in clustering and classification in the Dirichlet fa

Speaker

  • Johan Ferreira

Johan Ferreira is a professor in the School of Statistics and Actuarial Science at the University of the Witwatersrand, South Africa, where he also serves as Assistant Focus Area Coordinator for the Statistical Theory and Applied Statistics focus area at the Centre of Excellence in Mathematical and Statistical Science. His research interests include multivariate statistics, the probabilistic modeling of entropy, and directional statistics. He actively participates in and contributes to the field of statistics, regularly publishing in accredited peer-reviewed journals and reviewing manuscripts for international journals.


Johan is an ASLP 4.1/4.2 fellow with Future Africa, a prestigious program supporting African leaders in science and innovation. In 2016, he was recognized by the Mail & Guardian as one of the Top 200 South Africans under the age of 35 in the Education category.

Abstract

Multivariate data exhibit characteristics that can challenge traditional statistical methods, particularly in model-based clustering and classification. The Dirichlet and inverted Dirichlet models are often the initial parametric choice for data on the multivariate simplex or for multivariate data with positive support, respectively; however, their parameterization in terms of shape parameters is not intuitive and limits interpretability. We (re)develop these Dirichlet-based distributions to address these challenges, inspired by the Gaussian philosophy of parameterizing a model in terms of a location and a scale parameter.


Focusing on the multivariate simplex, we first introduce a unimodal Dirichlet (UD) distribution parameterized in terms of its mode and a dispersion parameter; subsequently, we study finite mixtures of UD distributions for clustering and classification. We also propose the contaminated UD distribution, a heavy-tailed generalization that allows for flexible tail behaviour, to handle atypical observations. Parameter estimation is achieved through maximum likelihood or an expectation-maximization algorithm, with analyses conducted on simulated data highlighting the impact of atypical observations on parameter estimation and classification.
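
To make the idea of a mode-based parameterization concrete, here is a minimal Python sketch. The abstract does not state the exact form used in the talk, so everything below is an assumption for illustration: the shape parameters are set to alpha_i = 1 + kappa * mode_i, where kappa > 0 is a hypothetical concentration (the inverse of a dispersion), which places the Dirichlet mode exactly at the chosen point. The contaminated version is sketched as a two-component mixture sharing the same mode, with the second component's concentration divided by an inflation factor eta > 1; the names ud_alpha, ud_logpdf, contaminated_ud_logpdf, kappa, delta, and eta are all illustrative.

```python
import math


def ud_alpha(mode, kappa):
    """Map (mode, concentration) to Dirichlet shape parameters.

    Assumed parameterization: alpha_i = 1 + kappa * mode_i with kappa > 0.
    Then (alpha_i - 1) / (sum(alpha) - d) = mode_i, so the density peaks at `mode`.
    """
    if kappa <= 0:
        raise ValueError("kappa must be positive")
    if any(m <= 0 for m in mode) or abs(sum(mode) - 1.0) > 1e-9:
        raise ValueError("mode must lie in the interior of the simplex")
    return [1.0 + kappa * m for m in mode]


def dirichlet_logpdf(x, alpha):
    """Log-density of a standard Dirichlet(alpha) at a point x on the simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))


def ud_logpdf(x, mode, kappa):
    """Log-density of the mode-parameterized Dirichlet (illustrative form)."""
    return dirichlet_logpdf(x, ud_alpha(mode, kappa))


def contaminated_ud_logpdf(x, mode, kappa, delta, eta):
    """Two-component sketch: weight (1 - delta) on the 'typical' component and
    weight delta on a same-mode component with concentration kappa / eta.

    Assumes 0 < delta < 1 and eta > 1; both are hypothetical parameter names.
    """
    typical = math.log(1.0 - delta) + ud_logpdf(x, mode, kappa)
    atypical = math.log(delta) + ud_logpdf(x, mode, kappa / eta)
    top = max(typical, atypical)  # log-sum-exp for numerical stability
    return top + math.log(math.exp(typical - top) + math.exp(atypical - top))


if __name__ == "__main__":
    mode = [0.2, 0.3, 0.5]
    print(ud_alpha(mode, kappa=20.0))
    print(ud_logpdf(mode, mode, kappa=20.0))
    print(contaminated_ud_logpdf([0.8, 0.1, 0.1], mode, 20.0, delta=0.05, eta=10.0))
```

Under these assumptions, a point far from the mode receives noticeably more density from the contaminated form than from the plain UD form, which is the sense in which the extra component thickens the tails.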


Secondly, for multivariate-yet-positive contexts, we propose a mode-based parameterization of the inverted Dirichlet (IDir) distribution, resulting in the mode-reparametrized IDir (mIDir), which enhances interpretability and applicability in various contexts, including robust and nonparametric statistics. We define finite mixtures of mIDir for clustering and semiparametric density estimation and introduce a smoother based on mIDir kernels to address boundary bias. Furthermore, we propose the contaminated mIDir distribution, a heavy-tailed extension that robustly handles mild outliers. We demonstrate the flexibility of these models through parameter recovery analysis, sensitivity analysis for mild outliers, and real data applications. These contributions provide a robust framework for modeling multivariate data with complex behaviors such as atypical observations and outliers.
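
A similar sketch can illustrate a mode-based form of the inverted Dirichlet. Again, the abstract does not give the authors' parameterization, so this is an assumption: for a standard inverted Dirichlet with shapes alpha_1, ..., alpha_d and alpha_{d+1}, the mode (when every alpha_i > 1) is (alpha_i - 1) / (alpha_{d+1} + d), so setting alpha_i = 1 + mode_i * (beta + d) with beta = alpha_{d+1} > 0 places the mode at the chosen positive vector. The names midir_alpha, midir_logpdf, and beta are hypothetical.

```python
import math


def midir_alpha(mode, beta):
    """Map (mode, beta) to inverted-Dirichlet shapes (alpha_1..alpha_d, alpha_{d+1}).

    Assumed parameterization: alpha_i = 1 + mode_i * (beta + d), alpha_{d+1} = beta.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if any(m <= 0 for m in mode):
        raise ValueError("mode components must be strictly positive")
    d = len(mode)
    return [1.0 + m * (beta + d) for m in mode], beta


def inverted_dirichlet_logpdf(x, alpha, alpha_last):
    """Log-density of the inverted Dirichlet at a point x with positive entries."""
    total = sum(alpha) + alpha_last
    log_norm = (math.lgamma(total)
                - sum(math.lgamma(a) for a in alpha)
                - math.lgamma(alpha_last))
    kernel = sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))
    return log_norm + kernel - total * math.log(1.0 + sum(x))


def midir_logpdf(x, mode, beta):
    """Log-density of the mode-reparametrized form sketched above."""
    alpha, alpha_last = midir_alpha(mode, beta)
    return inverted_dirichlet_logpdf(x, alpha, alpha_last)


if __name__ == "__main__":
    mode = [1.0, 2.0]
    print(midir_alpha(mode, beta=5.0))
    print(midir_logpdf(mode, mode, beta=5.0))
    print(midir_logpdf([1.1, 2.0], mode, beta=5.0))
```

In this sketch, larger values of beta concentrate the density more tightly around the mode, so beta plays a role loosely analogous to an inverse dispersion.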

Meeting platform

This webinar will be hosted on the Google Meet platform.

The link will be emailed to you after registration.

Tickets

MDAG event

Prof. Johan Ferreira | 18 March 12:30

Member price: Complimentary