This is a past South African Statistical Association event. Registration is closed.

Please note that this programme is PROVISIONAL and subject to change.

Bayes Interest Group | Use of informative priors in confirmatory studies, along with a hands-on session in R | Session 1
Rajat Mukherjee (Bayes Workshop Presenter)

The workshop in Bayesian statistics aims to provide industry researchers (statisticians as well as domain experts), academics and students working in medicine and healthcare with an introduction to the topic, along with specific examples and use cases from the pharmaceutical industry. The workshop will focus on translating historical data, for example from previously conducted randomized clinical trials, into informative priors for the parameters of interest, which can then be used in the design and analysis of future trials. This approach of Bayesian borrowing is gaining interest, particularly for investigations in rare diseases and for medical devices. We will discuss the common problem of prior-data conflict in this setting and methodologies to control borrowing from historical data in the presence of a conflict. We will also discuss a recent COVID vaccine trial that was conducted in the Bayesian framework. The workshop will conclude with a hands-on session implementing a Bayesian design using the open-source R software.
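The hands-on session is in R; as a language-agnostic sketch of the borrowing idea (a minimal illustration, not the presenter's material), the following Python example down-weights historical trial data with a power prior in a conjugate beta-binomial model. The discount factor `a0` and the simulated counts are illustrative assumptions: `a0 = 0` ignores the historical trial, `a0 = 1` pools it fully, and intermediate values control borrowing under prior-data conflict.

```python
def power_prior_posterior(x_hist, n_hist, x_new, n_new, a0=0.5, a=1.0, b=1.0):
    """Beta posterior parameters for a response rate when historical data
    are discounted by a power-prior factor a0 in [0, 1]."""
    alpha = a + a0 * x_hist + x_new
    beta = b + a0 * (n_hist - x_hist) + (n_new - x_new)
    return alpha, beta

def beta_mean(alpha, beta):
    """Posterior mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical prior-data conflict: 40/100 historical responders vs 10/50 new.
for a0 in (0.0, 0.5, 1.0):
    al, be = power_prior_posterior(40, 100, 10, 50, a0=a0)
    print(f"a0={a0}: posterior mean = {beta_mean(al, be):.3f}")
```

With this conflict the posterior mean moves from about 0.21 (no borrowing) to about 0.34 (full pooling); the conflict-control methodologies the workshop covers amount to choosing the degree of borrowing adaptively rather than fixing it in advance.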

  • Rajat Mukherjee (VP Advanced Statistics & Data Science at Alira Health)
Morning Tea

Regency Hall Patio and Foyer

Bayes Interest Group | Use of informative priors in confirmatory studies, along with a hands-on session in R | Session 2
Rajat Mukherjee (Bayes Workshop Presenter)

(This session continues the workshop; see the description under Session 1 above.)

  • Rajat Mukherjee (VP Advanced Statistics & Data Science at Alira Health)
Lunch

Fairway Terrace Restaurant

Bayes Interest Group | Use of informative priors in confirmatory studies, along with a hands-on session in R | Session 3
Rajat Mukherjee (Bayes Workshop Presenter)

(This session continues the workshop; see the description under Session 1 above.)

  • Rajat Mukherjee (VP Advanced Statistics & Data Science at Alira Health)
Afternoon Tea

Regency Hall Patio and Foyer

Bayes Interest Group | ISBA President Address
Rajat Mukherjee (Bayes Workshop Presenter) | Sudipto Banerjee (Online Bayes Workshop Presenter)
  • Rajat Mukherjee (VP Advanced Statistics & Data Science at Alira Health)
  • Sudipto Banerjee (President at ISBA)
Models for Health Outcomes using Data from Population Registries and Surveys
Session 1

Ruth Etzioni

This workshop will present methods for analyzing non-normal outcomes in health data studies with a focus on counts and health care costs. The workshop will cover different regression modeling frameworks tailored to the distributional properties of these outcomes with examples drawn from two major data sources in the US – a national cancer registry and a national health survey that includes annual health expenditure information. Additionally, the G-computation method for marginal effect estimation in non-linear regression models, propensity score analysis with inverse probability weighting for causal effect estimation in observational studies, and methods for accommodating complex survey designs will be covered. The properties, strengths and weaknesses of registries and surveys as sources for health outcomes models will also be discussed. All analyses will be programmed in R and code will be provided to all workshop participants. This workshop will draw on material from the text, “Statistics for Health Data Science: An Organic Approach,” co-authored by Dr Etzioni.
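The workshop's analyses are in R; purely as a hedged sketch of one listed method (not the workshop's code, and using simulated data rather than the registry or survey sources named above), the following Python example fits a Poisson regression for a count outcome by Newton-Raphson and applies G-computation: predict every subject's outcome with treatment set to 1 and then to 0, and compare the marginal means.

```python
import numpy as np

def fit_poisson(X, y, iters=25):
    """Poisson regression (log link) by Newton-Raphson; X includes an intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ beta)
        grad = X.T @ (y - mu)                 # score vector
        hess = X.T @ (X * mu[:, None])        # Fisher information
        beta += np.linalg.solve(hess, grad)
    return beta

def g_computation(X, beta, treat_col):
    """Marginal (counterfactual) mean count with treatment set to 1 vs 0."""
    X1, X0 = X.copy(), X.copy()
    X1[:, treat_col], X0[:, treat_col] = 1.0, 0.0
    return np.exp(X1 @ beta).mean(), np.exp(X0 @ beta).mean()

rng = np.random.default_rng(1)
n = 2000
age = rng.normal(0, 1, n)
treat = rng.binomial(1, 1 / (1 + np.exp(-age)))          # confounded treatment
y = rng.poisson(np.exp(0.2 + 0.5 * treat + 0.3 * age))   # true log rate ratio 0.5
X = np.column_stack([np.ones(n), treat, age])
beta = fit_poisson(X, y)
m1, m0 = g_computation(X, beta, treat_col=1)
print(f"marginal rate ratio ~ {m1 / m0:.2f}")            # near exp(0.5), about 1.65
```

Because the outcome model adjusts for the confounder, the G-computation contrast recovers the marginal effect that a naive comparison of treated and untreated means would distort.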

  • Ruth Etzioni (Professor at Fred Hutchinson Cancer Research Center)
Morning Tea

Regency Hall Foyer and Patio

Models for Health Outcomes using Data from Population Registries and Surveys
Session 2

Ruth Etzioni

(This session continues the workshop; see the description under Session 1 above.)

  • Ruth Etzioni (Professor at Fred Hutchinson Cancer Research Center)
Lunch

Fairway Terrace Restaurant

A Practical Introduction To Gaussian Process Regression and Bayesian Optimization
Session 1

Robert Gramacy

Gaussian process regression is ubiquitous in spatial statistics, machine learning, and the surrogate modeling of computer simulation experiments. Fortunately, its prowess as an accurate predictor, along with an appropriate quantification of uncertainty, does not derive from difficult-to-understand methodology or cumbersome implementation. We will cover the basics and provide a practical tool-set ready to be put to work in diverse applications. As one example of an application where Gaussian processes play a fundamental role, we will introduce Bayesian optimization. The presentation will involve accessible slides authored in R Markdown, with reproducible examples spanning bespoke implementation to add-on packages.
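To give a flavour of the basics the workshop covers (a minimal Python sketch under standard assumptions, not the presenter's R Markdown material), here is Gaussian process regression with a squared-exponential kernel: the posterior mean and pointwise variance at new inputs follow from one Cholesky factorisation of the training covariance.

```python
import numpy as np

def sq_exp_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) covariance between 1-D input vectors a, b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise=1e-4, **kw):
    """Posterior mean and pointwise variance of a zero-mean GP at x_test."""
    K = sq_exp_kernel(x_train, x_train, **kw) + noise * np.eye(len(x_train))
    Ks = sq_exp_kernel(x_train, x_test, **kw)
    Kss = sq_exp_kernel(x_test, x_test, **kw)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(Kss) - np.sum(v**2, axis=0)
    return mean, var

x = np.linspace(0, 2 * np.pi, 8)
y = np.sin(x)
mean, var = gp_posterior(x, y, np.array([np.pi / 2]))
print(mean, var)  # mean close to sin(pi/2) = 1, with small posterior variance
```

Bayesian optimization builds directly on this: an acquisition function trades off the posterior mean (exploitation) against the posterior variance (exploration) to choose the next evaluation point.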

  • Robert Gramacy (Professor at Virginia Polytechnic and State University)
Afternoon Tea

Regency Hall Foyer and Patio

A Practical Introduction To Gaussian Process Regression and Bayesian Optimization
Session 2

Robert Gramacy

(This session continues the workshop; see the description under Session 1 above.)

  • Robert Gramacy (Professor at Virginia Polytechnic and State University)
SASA 2022 Opening Ceremony
Statistical models for understanding population cancer trends and informing health policies | Plenary Address
Ruth Etzioni

Chair: W Brettenny

Abstract: Trends in the population burden of cancer can be very revealing about the progress of efforts to control the disease. In the US, the National Cancer Institute produces annual estimates of cancer incidence, mortality and survival. I will present a series of models to learn from patterns of these measures over time about the benefits of cancer control interventions such as new screening and treatment approaches. When a new screening test disseminates into population practice as was the case with the PSA test for prostate cancer, this provides an opportunity to also learn about disease natural history. I will share the story of how we used population trends in prostate cancer incidence, mortality, and survival to learn about prostate cancer natural history and inform national policy guidelines for prostate cancer early detection.

  • Ruth Etzioni (Professor at Fred Hutchinson Cancer Research Center)
Morning Tea

Regency Hall Patio and Foyer

Spatial statistical questions and big spatial datasets | Plenary Session
Edzer Pebesma (Online)

Chair: I Fabris-Rotelli

Abstract: A number of very large spatiotemporal datasets have become openly available, in particular from the domain of Earth Observation. These datasets form the basis for creating all kinds of derived "products", often with global coverage and high resolution, and often using machine learning or deep learning algorithms. A number of problems, like assessing the quality of individual predictions or estimating temporal change in areal averages or areal fractions of certain categories, typically remain unsolved. Spatial statistical concepts such as autocorrelation and change of support are usually ignored. In the talk I will discuss to what extent ignoring these concepts is a missed opportunity, and whether this can be mitigated.

  • Edzer Pebesma (Professor at University of Münster, Institute for Geoinformatics), presenting online
A simple learning agent interacting with an agent-based market model | Matthew Dicks

Stream: Data Science
Chair: Sonali Das

Authors: M. Dicks, T. Gebbie
Abstract: We consider the learning dynamics of a single reinforcement learning optimal execution trading agent when it interacts with an event-driven agent-based financial market model. Trading takes place asynchronously through a matching engine in event time. The optimal execution agent is considered at different initial order sizes and with differently sized state spaces. The resulting impact on the agent-based model and market is assessed using a calibration approach that explores changes in the empirical stylised facts and price impact curves. Convergence, volume trajectory and action trace plots are used to visualise the learning dynamics. The smaller state-space agents converged over the states they visited much faster than the larger state-space agents, and were able to start learning to trade intuitively using the spread and volume states. We find that the moments of the model are robust to the impact of the learning agents, except for the Hurst exponent, which was lowered by the introduction of strategic order-splitting. The introduction of the learning agent preserves the shape of the price impact curves but can reduce the trade-sign autocorrelations when its trading volumes increase.
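The core ingredient, a tabular reinforcement learning agent that learns to split a large order, can be illustrated in miniature (a hypothetical toy in Python, not the authors' agent-based model or matching engine): selling `a` lots in one step costs `a**2` under an assumed quadratic price impact, so the optimal schedule spreads the order evenly, and Q-learning discovers this.

```python
import random

def feasible(t, inv):
    """Actions available with t steps and inv lots remaining."""
    if t <= 0:
        return [0]            # terminal: nothing left to do
    if t == 1:
        return [inv]          # must liquidate everything on the last step
    return list(range(inv + 1))

def q_learning_execution(episodes=5000, horizon=5, inventory=5,
                         alpha=0.1, eps=0.1, seed=0):
    """Tabular Q-learning on the toy execution task; state = (time left,
    inventory left), action = lots sold now, reward = -a**2 impact cost."""
    rng = random.Random(seed)
    Q = {}
    q = lambda s, a: Q.get((s, a), 0.0)
    for _ in range(episodes):
        t, inv = horizon, inventory
        while t > 0:
            s, acts = (t, inv), feasible(t, inv)
            a = (rng.choice(acts) if rng.random() < eps
                 else max(acts, key=lambda x: q(s, x)))   # epsilon-greedy
            reward = -a * a
            t, inv = t - 1, inv - a
            target = reward + max(q((t, inv), b) for b in feasible(t, inv))
            Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
    return Q

Q = q_learning_execution()
best = max(feasible(5, 5), key=lambda a: Q.get(((5, 5), a), 0.0))
print(best)  # greedy first action: splitting 1 lot per step is optimal here
```

The abstract's "strategic order-splitting" is exactly this learned behaviour, embedded in a far richer market simulation.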

Evolutionary support vector regression for monitoring Poisson profiles | Sandile Shongwe

Stream: Data Science
Chair: Sonali Das

Author: S. Shongwe
Abstract: Many researchers have shown interest in profile monitoring; however, most applications in this field are developed under the assumption of a normally distributed response variable. Little attention has been given to profile monitoring with non-normal response variables modelled via generalized linear models (GLMs), which consist of two main categories (i.e., logistic and Poisson profiles). This paper develops a new robust Phase II Poisson profile monitoring tool using support vector regression (SVR), incorporating novel input features and an evolutionary training algorithm. The new method is quicker at detecting out-of-control (OOC) signals than conventional statistical methods. Moreover, the performance of the proposed scheme is further investigated for Poisson profiles with both fixed and variable explanatory variables, as well as for non-parametric profiles. A machine learning-based diagnostic method is also used to identify the parameters of change in the profile.

Are winters becoming shorter and warmer? Insights from a functional data analysis investigation | Sonali Das

Stream: Data Science
Chair: Sonali Das

Author: S. Das
Abstract: A recurrent narrative, in our case primarily from agricultural scientists and ecologists in the Southern hemisphere, is that 'winters are becoming shorter and warmer', and that this is noticeably affecting both physical and biological systems. An ensemble of opportunities within the functional data analysis (FDA) statistical framework is used to explore shifts in annual temperatures by investigating the joint trivariate structure composed of (i) the timing of the onset of winter, (ii) the temperature trough in the winter season, and (iii) the timing of the onset of spring, in each year.

Lunch

Fairway Terrace Restaurant

Deep Gaussian Process Surrogates for Computer Experiments | Plenary Address
Robert Gramacy

Chair: W Brettenny

Abstract: Deep Gaussian processes (DGPs) upgrade ordinary GPs through functional composition, in which intermediate GP layers warp the original inputs, providing flexibility to model non-stationary dynamics. Recent applications in machine learning favor approximate, optimization-based inference for fast predictions, but applications to computer surrogate modeling -- with an eye towards downstream tasks like calibration, Bayesian optimization, and input sensitivity analysis -- demand broader uncertainty quantification (UQ). We prioritize UQ through full posterior integration in a Bayesian scheme, hinging on elliptical slice sampling the latent layers. We demonstrate how our DGP's non-stationary flexibility, combined with appropriate UQ, allows for active learning: a virtuous cycle of data acquisition and model updating that departs from traditional space-filling design and yields more accurate surrogates for fixed simulation effort. But not all simulation campaigns can be developed sequentially, and many existing computer experiments are simply too big for full DGP posterior integration because of cubic scaling bottlenecks. For this case we introduce the Vecchia approximation, popular for ordinary GPs in spatial data settings. We show that Vecchia-induced sparsity of Cholesky factors allows for linear computational scaling without compromising DGP accuracy or UQ. We vet both active learning and Vecchia-approximated DGPs on numerous illustrative examples and a real simulation involving drag on satellites in low-Earth orbit. We showcase implementation in the deepgp package for R on CRAN.
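Elliptical slice sampling, which the abstract names as the engine for integrating over the latent Gaussian layers, is short enough to sketch in full (a generic Python rendering of Murray, Adams and MacKay's 2010 algorithm, not the deepgp package's implementation). The toy check targets a conjugate Gaussian posterior whose exact mean, 1.6, can be verified by hand.

```python
import numpy as np

def elliptical_slice(f, log_lik, prior_draw, rng):
    """One elliptical slice sampling update for a latent vector f with a
    zero-mean Gaussian prior; prior_draw() returns a fresh prior sample."""
    nu = prior_draw()                              # auxiliary prior draw
    log_y = log_lik(f) + np.log(rng.uniform())     # slice height
    theta = rng.uniform(0.0, 2.0 * np.pi)
    lo, hi = theta - 2.0 * np.pi, theta
    while True:                                    # shrink bracket until accepted
        f_prop = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_prop) > log_y:
            return f_prop
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)

# Toy check: N(0, 1) prior with a N(2, 0.5**2) likelihood gives a posterior
# mean of (2 / 0.25) / (1 + 1 / 0.25) = 1.6.
rng = np.random.default_rng(0)
log_lik = lambda f: -0.5 * (f[0] - 2.0) ** 2 / 0.25
f = np.zeros(1)
draws = []
for i in range(6000):
    f = elliptical_slice(f, log_lik, lambda: rng.normal(size=1), rng)
    if i >= 1000:
        draws.append(f[0])
print(round(float(np.mean(draws)), 2))  # close to the exact posterior mean 1.6
```

The appeal for DGPs is that the update is tuning-free and always accepts, so it scales naturally to the high-dimensional latent layers the abstract describes.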

  • Robert Gramacy (Professor at Virginia Polytechnic and State University)
Development of an early career academic supervisor in Statistics in South Africa | Inger Fabris-Rotelli

Stream: Educational Statistics
Chair: Inger Fabris-Rotelli

Authors: I. Fabris-Rotelli, M. von Maltitz, A. Smit, D. Roberts, S. Das, D. Maposa
Abstract: There is an increasing pull on Mathematical Sciences graduates to enter industry rather than pursue further postgraduate studies. Within South Africa, there is an urgent need to address this, as the field of Statistics is particularly affected by the 4IR pull, resulting in a crisis in academic capacity building. Two primary factors prevent this from being corrected. First, academic salaries in the Statistical sciences are not comparable to what industry pays at the same level of qualification (especially in light of the growth of 'Data Science'). Second, South African National Research Foundation (NRF) student funding is not attractive to a student to whom industry is already offering more, as paying fees and living expenses on an NRF bursary is unrealistic for a full-time student. In response to these crises in Statistics, the NRF has, since 2016, provided funding to support postgraduate students who may be trained to enter academia after their PhD. The grant provided larger bursaries than the standard NRF bursaries, as well as funding to bring in expertise to train young staff. In spite of this initiative, the lack of supervisory skills and capacity, especially at the doctoral level, is evident across South African Statistical sciences departments. In 2020, a group of 8 novice and near-novice doctoral supervisors in academic Statistical science in South Africa initiated discussions on the current state of academic Statistics in the country, specifically with regard to the nurturing of early career academic supervisors in Statistics. These discussions have identified a clear need for concrete actions by and for early career academics in Statistics, involving a guiding rubric for the doctoral thesis coupled with a reference guideline for early career supervisors in South Africa.

Discussion Session
Afternoon Tea

Regency Hall Patio and Foyer

Efficiency Analysis of South African Schools: A Parametric Approach | Aviwe Gqwaka

Stream: Educational Statistics
Chair: Paul van Staden

Authors: A. Gqwaka, W. Brettenny, G. Sharp
Abstract: South African learners have ranked low in global assessments of reading skills and mathematics literacy. To remedy this, government has sought to adequately equip schools in their education provision services. Thus, to gauge the state of the South African education sector, understanding the level of performance of schools is apt. To do this, an efficiency analysis is conducted, in which a school's ability to minimise its use of available resources while maximising learner performance is quantified and assessed. This is done using a parametric approach, stochastic frontier analysis (SFA), where observed performances are compared to a theoretical best practice or frontier. Deviations from this frontier are attributed to effects not in the school's control (random shocks) and those that are (inefficiency). This approach allows for the identification of the best and worst performing schools. These findings could assist policy-makers in reviewing their resource allocations so that they can better attend to the needs of schools deemed inefficient.

The Value Proposition for Industry-Academic Collaboration | André Zitzke

Stream: Educational Statistics
Chair: Paul van Staden

Authors: M. de Villiers, A. Zitzke
Abstract:

Analysis of Strike action on students’ Academic Performance in the Inferential Statistics Module at the University of Fort Hare, South Africa | Ruffin Mpiana Mutambayi

Stream: Educational Statistics
Chair: Paul van Staden

Authors: R.M. Mutambayi, A. Azeez, H. Tshepo, A. Odeyemi
Abstract: This study aims to analyse the impact of strike action on students enrolled in the Inferential Statistics module, one of the pre-requisite modules. The study was done at the University of Fort Hare, and 142 students participated.
The data were collected through a questionnaire, and descriptive statistics and inferential statistics, combined with quantile regression analysis, were used to analyse the data.
The results reveal that students' marks were normally distributed (p-value: 0.057) and contained no outliers (p-value: 0.515). It was also found that 'nationality' had an impact on students' performance in statistics at quantiles 0.25 (p-value: 0.0005), 0.5 (p-value: 0.0001) and 0.75 (p-value: 0.0652). Moreover, quantiles 0.50 (p-value: 0.0304) and 0.75 (p-value: 0.0107) for 'appointment of industrial arbitration panels to review, at intervals, measures to eradicate strike actions' were also statistically significant.
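For readers less familiar with the method, quantile regression replaces squared error with the pinball (check) loss. A minimal Python sketch (with simulated marks, not the study's data) shows that minimising this loss over a constant predictor recovers the sample quantiles at the levels the study uses, which is exactly the intercept-only quantile regression.

```python
import numpy as np

def pinball_loss(residuals, tau):
    """Check (pinball) loss underlying quantile regression at level tau."""
    return np.mean(np.maximum(tau * residuals, (tau - 1) * residuals))

def fit_constant_quantile(y, tau):
    """Minimise pinball loss over a constant predictor by searching the
    observed values; the minimiser is a tau-th sample quantile."""
    grid = np.sort(y)
    losses = [pinball_loss(y - c, tau) for c in grid]
    return grid[int(np.argmin(losses))]

rng = np.random.default_rng(42)
marks = rng.normal(60, 12, 500)          # hypothetical module marks
for tau in (0.25, 0.5, 0.75):
    print(tau, round(float(fit_constant_quantile(marks, tau)), 1))
```

Adding covariates such as 'nationality' turns the constant into a linear predictor, so the fitted coefficients describe effects at each quantile of the mark distribution rather than only at its mean.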

Affective Learning: An insight into Mr Lindo’s #OperationFinishTheSemesterStrong | Lindo Magagula

Stream: Educational Statistics
Chair: Paul van Staden

Author: L. Magagula
Abstract: Learning in the affective domain is often neglected in most educational programs. This neglect arises because affective learning is a poorly understood concept among educators; it falls beyond the scope of practice of education practitioners, who would rather invest in cognitive learning. In this presentation, I highlight the preliminary findings of a study conducted on STK120 students. The Department of Statistics presents this course to first-year students majoring in various degrees. In this study, over and above the teaching of content, a small fraction of teaching time (at the beginning of class) was dedicated to students' emotional well-being, after which the impact on students' willingness to learn was evaluated. This was done interactively, using a clicker app and a word-cloud visualization of responses. It is imperative to remain vigilant about the extent to which the proposed measures of affective learning should be implemented.

When should Paul visit Paris? A time series case study from an introductory first-year statistics & data science course | Paul van Staden

Stream: Educational Statistics
Chair: Paul van Staden

Author: P.J. van Staden
Abstract: At many universities time series analysis is only taught at final-year undergraduate or at postgraduate level. But the proliferation of time series data in, for example, financial markets, social media analytics and, more recently, epidemiology (due to the COVID-19 pandemic), necessitates that students already be introduced to time series analysis at a first-year level.

This talk presents a case study in which students from the presenter’s first-year introductory statistics and data science course have to analyze a time series dataset from Paris, France. Basic tools including time plots and time series decomposition are sufficient for these students to learn about data dependency in time series as well as time series patterns and components such as trend, seasonality, cyclical fluctuations and noise. However, the nature of the chosen dataset lends itself to further scrutiny in that students discover, with the assistance of Dr Google and Professor Wikipedia, how spurious conclusions can be made in the absence of statistical intuition.

So when should Paul visit Paris? Maybe in October…

Robust inference in the presence of censoring, skewness, and extreme values | Sean van der Merwe

Stream: Bayesian Statistics
Chair: Allan Clark

Author: S. van der Merwe
Abstract: This presentation discusses how modern statistical modelling software has enabled the fitting of models that are both flexible enough to accommodate data features and still simple enough to answer research questions via intuitive inferences. Inference regarding location is extremely popular in statistics, but often faces difficulties such as adjusting for skewness and extreme observations. It is often desired to do inference for the typical case instead of the raw mean. A t density is naturally robust to occasional extreme observations, while skew variants are particularly robust to a heavy tail on one side. Further, these distributions are not limited in domain. Fitting of these distributions was historically challenging but is currently seeing a surge in use. This work expands the theory of a particular skew-t variant to accommodate censoring and derives a new prior distribution with excellent properties. The implementation is explained and illustrated for a variety of real problems.
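The robustness the abstract describes can be seen in miniature with a Student-t location fit (a plain EM sketch in Python for fixed degrees of freedom, standing in for the full Bayesian skew-t treatment with censoring): extreme observations receive small weights automatically, so the estimate tracks the typical case rather than the raw mean.

```python
import numpy as np

def t_location(x, df=4, iters=100):
    """Location/scale by EM for a Student-t likelihood with fixed df.
    Each E-step down-weights points far from the current centre."""
    mu, s2 = np.median(x), np.var(x)
    for _ in range(iters):
        w = (df + 1) / (df + (x - mu) ** 2 / s2)   # small w for outliers
        mu = np.sum(w * x) / np.sum(w)
        s2 = np.sum(w * (x - mu) ** 2) / len(x)
    return mu, np.sqrt(s2)

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(10, 1, 95), rng.normal(60, 5, 5)])  # 5% extreme
mu, s = t_location(x)
# The sample mean is dragged toward the outliers; the t-based location is not.
print(round(float(np.mean(x)), 1), round(float(mu), 1))
```

The skew variants the talk covers extend this idea by also letting the weighting act asymmetrically, which handles a heavy tail on one side.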

Bayesian meta-regression models for the estimation of population trends in health risk factors | Annibale Cois

Stream: Bayesian Statistics
Chair: Allan Clark

Author: A. Cois
Abstract: The accurate quantification of trends in the distribution of risk factors is critical in public health. Reliable estimates are key for planning prevention activities and treatment services, especially in low-income countries where the optimal allocation of limited resources is a priority.
However, empirical data – usually self-reported in population surveys – are often sparse (available only for selected subpopulations and time points), heterogeneous (collected with inconsistent methods across data sources), and of varying precision and risk of bias.
Bayesian meta-regression offers an alternative to frequentist approaches, making sense of sub-optimal data by integrating, in a principled way, additional sources of information and broad epidemiological and biological evidence.
We present an application of Bayesian meta-regression to estimate age- and sex-specific trends in alcohol consumption – a major risk factor for cardiovascular and other diseases – in the South African adult population.
The model accounts for the censored nature of the consumption data and the ubiquitous under-reporting of alcohol use in surveys. It allows for time and age non-linearity and for complex constraints in the parameter space, derived from biological knowledge and administrative records on alcohol sales. Mild assumptions of smoothness in age and time trends and relationship with auxiliary variables allow the model to make estimates where data are sparse or unreliable.
The Bayesian estimator – implemented using Stan Modelling Language and its default NUTS sampling algorithm – accounts for uncertainty beyond sampling error, and the availability of draws from the posterior distribution makes it straightforward to recover estimates of various linear and non-linear functions of the model parameters.
We show how this approach compares favourably to classical rescaling methods used to recover estimates of population alcohol consumption from downward-biased survey data.

difFUBAR: Scalable Bayesian comparison of selection pressure | Hassan Sadiq

Stream: Bayesian Statistics
Chair: Allan Clark

Author: H. Sadiq
Abstract: While many phylogenetic methods exist to characterise evolutionary pressure at individual codon sites, relatively few allow the direct comparison between different a priori selected sets of branches. Indeed, this was only recently addressed by an approach, developed in the frequentist framework, that proposes a site-wise likelihood ratio test to test such hypotheses.

Previously, we have demonstrated that approximate grid-based Bayesian approaches to characterising site-wise variation in selection parameters can outperform individual site-wise likelihood ratio tests. Such grid-based approaches can exhibit poor computational scaling when the number of site-wise parameters expands, but here we show that a simple sub-tree likelihood caching strategy can ameliorate this.

We propose difFUBAR, implemented in MolecularEvolution.jl --- a new framework for phylogenetic models of molecular evolution developed in the Julia language for scientific computing. difFUBAR allows the demarcation of two branch sets of interest and, optionally, a background set, and estimates joint site-specific posterior distributions over α, ω1, ω2 and ωBG using a Gibbs sampler. Evidence for hypotheses of interest can then be quantified directly from the posterior distribution, and we standardly report P(ω1 > ω2 | Data), P(ω2 > ω1 | Data), P(ω1 > 1 | Data) and P(ω2 > 1 | Data).

We characterise the statistical performance of this approach on previous simulations, comparing it to the site-wise likelihood ratio test approach, and we demonstrate how our subtree-likelihood caching approach improves the speed of the approach, outperforming site-wise likelihood ratio testing. We also showcase difFUBAR on datasets of mammalian immunoglobulin sequences.

Bayesian Tree Growth modelling. An investigation into individual tree competition | Lulama Kepe

Stream: Bayesian Statistics
Chair: Allan Clark

Authors: L. Kepe, K. Little, J. Hugo
Abstract: At some stage after canopy closure, individual trees in a plantation begin to compete for the same resources. To investigate this competition, a Bayesian mixed-effects model is proposed, similar in characteristics to a SIRE model used for estimating breeding values and variance components in mixed linear model settings. Analogously to the inclusion of inbreeding coefficients, it is proposed that published competition indices used in tree growth modelling be included in this Bayesian mixed model. As different competition indices are introduced into the model, posterior probabilities will be observed and compared to what is visually observed on the plot, i.e. whether the tree with the highest posterior probability of being the strongest grower is in fact the largest tree on the plot.

Bayesian Analysis of Historical Functional Linear Models with application to air pollution forecasting | Allan Clark

Stream: Bayesian Statistics
Chair: Allan Clark

Authors: A. Clark, Y. Junglee, B. Erni
Abstract: Historical functional linear models are used to analyse the relationship between a functional response and functional predictors. Here we develop a functional data analysis model that handles multiple functional covariates with measurement error and sparseness that can be used to predict functional response surfaces.

The method uses the connection between non-parametric smoothing and Bayesian methods to reduce sensitivity to the number of basis functions used to model the functional regression coefficients. We investigate two methods of estimation: first, smoothing the predictors independently of the regression model in a two-stage analysis, and second, smoothing them jointly with the regression model. The efficiency of the MCMC algorithms is increased by implementing a Cholesky decomposition to sample from high-dimensional Gaussian distributions and by taking advantage of the orthogonality of the functional principal components used to model the functional covariates.

A simulation study suggests substantial improvements in both the recovery of the functional regression surface and the true underlying functional response with higher coverage probabilities, when compared to a classical model under which measurement error is unaccounted for. We also found that a two-stage analysis outperforms the joint model under certain conditions.

A major challenge with the collection of environmental data is that they are prone to measurement error. Hence, our methodology provides a reliable functional data analytic framework for modelling such data. As an application of our method, we forecast the level of daily atmospheric pollutants at certain locations in the City of Cape Town. The forecasts provided by the Bayesian two-stage model are highly competitive when compared with the functional autoregressive models which are traditionally used for functional time series.
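The Cholesky-based sampling step mentioned in the abstract can be sketched as follows. This is a generic illustration of drawing from a multivariate Gaussian given its precision matrix, not the authors' implementation; the function name and interface are assumptions.

```python
import numpy as np

def sample_gaussian_precision(mu, Q, rng, size):
    """Draw samples from N(mu, Q^{-1}) given the precision matrix Q.
    Using the Cholesky factor Q = L L^T, if z ~ N(0, I) and
    L^T v = z, then v has covariance (L L^T)^{-1} = Q^{-1}."""
    L = np.linalg.cholesky(Q)
    z = rng.standard_normal((size, len(mu)))
    # solve L^T v = z for each sample (L^T is upper triangular)
    v = np.linalg.solve(L.T, z.T).T
    return mu + v
```

In MCMC for basis-expanded functional models, Q is typically a sparse or banded posterior precision, which makes the Cholesky approach much cheaper than inverting the covariance.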

Optimal window size detection in Value-at-Risk forecasting: A case study on conditional generalised hyperbolic models | Chun-Sung Huang

Stream: Bayesian Statistics
Chair: Allan Clark

Authors: C-S. Huang, C-K. Huang, J. Hammujuddy, K. Chinhamu
Abstract: The conventional parametric approach for financial risk measure estimation involves determining an appropriate quantitative model, as well as a suitable historical sample period in which the model can be trained. While a lion’s share of the existing literature entertains the identification of the most appropriate model for different types of financial assets, or across conflicting market conditions, little is known about the optimal choice of a historical sample period size (or window size) to train the model and estimate model parameters. In this paper, we propose a method to identify an optimal window size for model training when estimating risk measures, such as the widely-utilised Value-at-Risk (VaR) or Expected Shortfall (ES), under the generalised hyperbolic subclasses. We show that the accuracy of VaR estimates may increase significantly through our proposed method of optimal window size detection. In particular, our results demonstrate that, by relaxing the usual restriction of a fixed window size over time, superior VaR forecasts may be produced as a result of improved model parameter estimates.
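By way of illustration, a rolling-window VaR forecast and its exception-rate backtest can be sketched as below. This toy uses historical simulation rather than the generalised hyperbolic models of the paper, and the function names are our own.

```python
import numpy as np

def rolling_var(returns, window, alpha=0.99):
    """One-step-ahead historical-simulation VaR forecasts from a
    rolling window of past returns (a stand-in for the parametric
    model fit described in the abstract)."""
    returns = np.asarray(returns, dtype=float)
    var = np.empty(len(returns) - window)
    for t in range(window, len(returns)):
        var[t - window] = -np.quantile(returns[t - window:t], 1 - alpha)
    return var

def exception_rate(returns, var, window):
    """Fraction of days on which the realised loss exceeds the VaR
    forecast; close to 1 - alpha indicates good calibration."""
    realised = np.asarray(returns, dtype=float)[window:]
    return np.mean(-realised > var)
```

Repeating the backtest over a grid of window sizes and choosing the window whose exception rate is closest to the nominal level is one simple way to frame the window-selection problem the paper studies.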

Morning Tea

Regency Hall Patio and Foyer

Emerging technologies, globalization and social challenges are redefining the role of intelligent decisioning | Keynote Address
Mark Nasila

Chair: W. Brettenny

  • Mark Nasila (Chief Data and Analytics Officer at FirstRand Risk)
Towards evidence-based public health management: the role of Statistics and Modelling in South Africa | Keynote Address
Sheetal Silal

Chair: W. Brettenny

Abstract: Be it COVID-19, TB, HIV or routine early childhood immunisation, the public health system is continually needing to adapt and adjust health provision policy to maximise the delivery of health services to the population. Increases in health surveillance, clinical trials and digitised systems have resulted in numerous datasets becoming available to support health planning. Statistics and mathematical modelling are in a unique position to leverage these datasets to provide a scientific evidence base to decision-makers. This talk will present past and current applications of statistical and math modelling in various aspects of public health in South Africa, and reflect on the role we can all play in shaping the future of health provision.

  • Sheetal Silal (Director of Modelling and Simulation Hub, Africa (MASHA))
Examining factors that contribute to under-five mortality rates in South Africa using count models | Kgethego Sharina Makgolane

Stream: Biostatistics
Chair: Sisa Pazi

Author: K.S. Makgolane
Abstract: Under-five mortality remains a major health challenge in most sub-Saharan African countries, including South Africa, despite the significant progress made in child survival and the government's efforts to reduce the under-five mortality rate. South Africa failed to achieve the 4th Millennium Development Goal, which aimed to reduce the under-five mortality rate by two-thirds between 1990 and 2015. Following this failure, the 3rd Sustainable Development Goal, which aims to have no more than 25 deaths per 1 000 live births by the year 2030, was adopted. The aim of this study was to identify factors that contribute to the under-five mortality rate using count models. To identify these factors, the study utilised a secondary data set obtained from the South African Demographic and Health Survey of 2016. Generalised linear models, namely logistic regression, Poisson regression and negative binomial regression models, were employed for the analysis of the under-five mortality rate. The results revealed that a baby postnatal check-up within the first two months, the child's health being checked prior to discharge, childbirth size, toilet facility at home, maternal education, province, type of residence and water source were significantly associated with the risk of experiencing under-five mortality. In conclusion, the study suggests that the Department of Health and the various concerned agencies take these factors into account when planning to reduce the under-five mortality rate and achieve the 3rd Sustainable Development Goal by 2030.
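One of the count models mentioned, Poisson regression with a log link, can be fitted by iteratively reweighted least squares (Fisher scoring). The sketch below is a generic illustration, not the study's actual analysis code.

```python
import numpy as np

def poisson_irls(X, y, n_iter=25):
    """Fit a Poisson log-linear model E[y] = exp(X beta) by
    iteratively reweighted least squares (Fisher scoring)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        W = mu                       # Poisson: variance equals mean
        z = eta + (y - mu) / mu      # working response
        WX = X * W[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (W * z))
    return beta
```

The negative binomial model used in the study follows the same scheme with a different variance function to absorb overdispersion.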

Joint modeling for longitudinal and interval censored survival data | Isaac Singini

Stream: Biostatistics
Chair: Sisa Pazi

Authors: I. Singini, D. Chen
Abstract: Joint models for longitudinal and survival data are a class of models that jointly analyse an outcome repeatedly observed over time, such as a biomarker, and associated event times. These models are useful in two practical applications: first, focusing on the survival outcome whilst accounting for time-varying covariates measured with error, and second, focusing on the longitudinal outcome while controlling for informative censoring. For the survival sub-model, this is done by recording the moments of the event of interest and calculating the time span between the event and some initial onset time. Over the last decade, the joint modelling framework has mainly focused on right censored data in the survival outcome. This has been the case for two-stage joint models, shared parameter joint models and latent class joint models. There have been many theoretical developments over the last five decades focused on censoring mechanisms in order to correctly model time-to-event data, e.g. left or right censoring; however, interval censoring has seldom been implemented in the joint modelling framework. This is because many are unaware of the impact of inappropriately dealing with interval censoring within the joint modelling framework. A further complexity is that the software that handles interval censored data in the joint modelling framework is not readily available. In this work we fill the gap between theory and practice by illustrating our theoretical technique on interval censored data in the joint model, using a cardiology multi-centre clinical trial. We implement our approach with examples using the R statistical software.

Contributions to acute physiology scoring for South African intensive care units | Sisa Pazi

Stream: Biostatistics
Chair: Sisa Pazi

Authors: S. Pazi, G. Sharp, E. van der Merwe
Abstract: This study describes research which was conducted in fulfilment of a Doctor of Philosophy degree. The research comprised four studies, the first of which sought to investigate the epidemiology of acute kidney injury (AKI) at a tertiary hospital in the Eastern Cape. The Simplified Acute Physiology Score III (SAPS III), a severity-of-illness score, was found to be one of the statistically significant risk factors for AKI. As the SAPS III was developed without data from Africa, this opened the opportunity to scrutinise the model, which then led to the second research study, the purpose of which was to assess the SAPS III model in the South African context. The results of the second study provided the motivation to develop a model more suited to the South African context, which led to the third study, the aim of which was to develop a model similar to the SAPS III model but using South African data. The results of that study indicated that the proposed adaptive model was superior to the SAPS III model. Furthermore, a comparative analysis conducted as part of the fourth study indicated that the proposed model was superior to some machine learning models. To broaden the usage of the proposed adaptive model, future research includes collecting data from multiple hospitals in South Africa. The collected data will then be used to externally validate the proposed adaptive model.

Lunch

Fairway Terrace Restaurant

Roundtable | Data Science, Data Literacy and the Future of Statistics
Ruth Etzioni, Robert Gramacy, Mark Nasila, Sheetal Silal, Ashwell Jenneker, Renette Blignaut, Pravesh Debba

Moderators: W. Brettenny, I. Fabris-Rotelli

  • Ruth Etzioni (Professor at Fred Hutchinson Cancer Research Center)
  • Robert Gramacy (Professor at Virginia Polytechnic and State University)
  • Mark Nasila (Chief Data and Analytics Officer at FirstRand Risk)
  • Sheetal Silal (Director of Modelling and Simulation Hub, Africa (MASHA))
  • Ashwell Jenneker (Deputy Director General of StatsSA)
  • Renette Blignaut (2021 SAS® Thought Leader and Professor at University of the Western Cape)
  • Pravesh Debba (2020 SAS® Thought Leader and Manager for Inclusive Smart Settlements and Regions at CSIR)
SASA Annual General Meeting
Two New Auxiliary Models for Estimating Error Variances in Heteroskedastic Linear Regression | Thomas Farrar

Stream: Multivariate Statistics
Chair: Sugnet Lubbe

Authors: T. Farrar, R. Blignaut, R. Luus, S. Steel
Abstract: Two new models are proposed for estimating error variances in heteroskedastic linear regression models. These are, respectively, the Auxiliary Linear Variance Model and the Auxiliary Nonlinear Variance Model, which use the squared Ordinary Least Squares residuals as their response and are built around a correct specification of the conditional mean response. The dimensionality of the parameter vector is reduced by assuming a functional relationship between the error variances and the predictor variables. Several different sub-models emerge depending on how one deals with the heteroskedastic function.

Practical problems in applying the models are discussed, such as parameter estimation, tuning of hyperparameters, and feature selection. Methods of parameter estimation include inequality-constrained least squares and quadratic programming for the linear model and maximum quasi-likelihood estimation for the nonlinear model. Methods of hyperparameter tuning include K-fold cross-validation and quasi-generalised cross-validation. Methods of feature selection include feature-wise heteroskedasticity testing, best subset selection, and LASSO.

The new error variance estimation methods are assessed under a variety of experimental conditions in terms of four distinct mean squared error metrics, and are found to outperform existing methods under some conditions. The nonlinear model is particularly effective if the form of the heteroskedastic function is known; the linear model is more reliable otherwise. The new variance models are found to be competitive methods for handling heteroskedasticity in linear regression.
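The two-step idea behind auxiliary variance models can be illustrated in simplified form: fit OLS, model the squared residuals as a function of the predictors, and refit by weighted least squares. The log-linear auxiliary specification below is our simplification for illustration, not necessarily the exact Auxiliary Linear or Nonlinear Variance Model of the paper.

```python
import numpy as np

def feasible_wls(X, y):
    """Sketch of a two-step heteroskedasticity correction:
    (1) ordinary least squares;
    (2) auxiliary regression of log squared residuals on the
        predictors to estimate the error variances;
    (3) refit the mean model by weighted least squares."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    e2 = (y - X @ beta_ols) ** 2
    # log-linear auxiliary model keeps the fitted variances positive
    gamma, *_ = np.linalg.lstsq(X, np.log(e2 + 1e-12), rcond=None)
    w = 1.0 / np.exp(X @ gamma)          # weights = 1 / fitted variance
    Xw = X * w[:, None]
    beta_wls = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return beta_wls
```

Because WLS is invariant to rescaling all weights by a constant, the known multiplicative bias of the log-residual regression's intercept does not affect the refitted coefficients.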

High-dimensional LDA Biplot through the GSVD | Raeesa Ganey

Stream: Multivariate Statistics
Chair: Sugnet Lubbe

Authors: R. Ganey, S. Lubbe
Abstract: Discriminant analysis is a multivariate technique concerned with separating distinct sets of observations. However, a common limitation of trace optimisation in discriminant analysis is that the within-cluster scatter matrix must be nonsingular, which restricts its use on data sets where the number of variables is larger than the number of observations, p > n. In this presentation, we show that by applying the generalised singular value decomposition (GSVD), we can achieve the goal of discriminant analysis regardless of the size of p. This originates from the work done by Howland, Jeon and Park (2003). Furthermore, we describe an attempt to construct a meaningful biplot from the GSVD approach.

Reference: P. Howland, M. Jeon and H. Park, “Structure preserving dimension reduction for clustered text data based on the generalised singular value decomposition”, Society for Industrial and Applied Mathematics, 2003.

Biplots for individual differences scaling models | Sugnet Lubbe

Stream: Multivariate Statistics
Chair: Sugnet Lubbe

Authors: S. Lubbe, N. le Roux
Abstract: INDSCAL models typically deal with two-mode, three-way data. The typical format is a set of K n × n distance matrices, for instance K judges each rating differences between n items. Parallel to classical scaling, also known as Principal Coordinate Analysis, a set of positive semi-definite symmetric matrices is formed by double centring the squared distance matrices. In general, for any set of K positive semi-definite symmetric matrices, the INDSCAL model finds the best representation, in the least squares sense, of the n objects in r (usually 2) dimensions and an associated set of r weights for each of the K judges. For r = 2, two plots can be made: the subject space, based on the K sets of weights, and a compromise group stimulus space, representing the n objects / items. Assuming that the dissimilarities between the objects were generated by observations on p variables, we want to simultaneously represent the n objects and p variables in a biplot. In this paper we will discuss how to represent the variables with the objects in the group stimulus space. Representing the variables as biplot axes allows for the prediction of the p variable values for any point in the r-dimensional biplot space. We will also discuss how to do the converse: finding the r-dimensional coordinates for p (new) observations on the variables.
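The double-centring step described in the abstract can be sketched as follows; it converts a Euclidean distance matrix into the centred inner-product (Gram) matrix used by classical scaling and INDSCAL.

```python
import numpy as np

def double_centre(D):
    """Classical scaling step: convert an n x n distance matrix D
    into the centred Gram matrix B = -0.5 * J D^2 J, where
    J = I - (1/n) 11^T is the centring matrix. For Euclidean D,
    B equals Xc Xc^T with Xc the centred coordinates."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * J @ (D ** 2) @ J
```

Eigendecomposition of B then yields the principal coordinates; INDSCAL fits a common configuration with judge-specific dimension weights to the K such matrices.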

Census 2022 Journey: Updating the nation's statistical landscape | Keynote Session 3
Ashwell Jenneker

Chair: W Brettenny

Census 2022 revolutionised the collection and processing of statistics. This came off the back of the COVID pandemic, which delayed collection by a few months, as well as numerous challenges in bringing an updated population count to the fore.

The presentation will explore the new digital method of collection, as well as give a broad overview of the current state of society as reflected in our data, which will be further illuminated with the planned release of Census 2022 data in 2023.

  • Ashwell Jenneker (Deputy Director General of StatsSA)
Morning Tea

Regency Hall Patio and Foyer

Setting up of Collaborative Systems between Academic Institutions and Industry, to build a strong Data Skills/Talent Value Chain | Prof Delia North

Chair: C. Clohessy

It is well evidenced that there is an acute shortage of data science/data analytics skills all over the world, and in South Africa in particular. There is further evidence of major unemployment amongst the youth, even for those who have graduated from a higher-education institution in the country. This disconnect between the skills being developed and the skills required by industry leads to substantial unrealised potential amongst the youth of the country.
This talk will focus on how a university and SAS have partnered to define a set of Industry-integrated Skills Development programs, aimed at increasing the flow of “job ready” data analysts into the workplace.

Statistical Science: Enriching our lives | Keynote Address | SAS® Thought Leader 2020
Pravesh Debba

Chair: C. Clohessy

Abstract: We see statistics being applied on a daily basis through, for example, weather reports, financial markets and pharmaceuticals. Each of these applications has benefits for the general public. Fields like big data analytics and data science have seen exponential growth in the job market due to data being created in all forms. The last two years have seen 90% of the world's data being collected.
However, we are sometimes slow and cautious to react to these opportunities and to demonstrate the ability of statistical science to provide real solutions to national and worldwide problems. Yet we have seen the impact of COVID-19 and of loadshedding on our daily lives, even in the extent to which we have adapted in order to operate. Communication through media and social platforms also plays a vital role in the dissemination of credible information by journalists and reporters, who therefore rely on the use of scientific information for their storytelling.
In this talk, a series of case studies will be presented to demonstrate the way in which statistical science can be used to assist decision makers by providing them with supporting evidence for key decisions, and to assist the public with a better understanding and awareness of relevant issues. This helps both the decision maker and the public to better plan for the future and what lies ahead.
Some of the work that will be presented is by the SEPIMOD (Spatial Epidemiological Modelling) group that was formed during the COVID-19 outbreak.

  • Pravesh Debba (2020 SAS® Thought Leader and Manager for Inclusive Smart Settlements and Regions at CSIR)
Can statisticians ignore data science, or should it be embraced? | Keynote Address | SAS® Thought Leader 2021
Renette Blignaut

Chair: C. Clohessy

Abstract: In the 1960s Peter Naur used the word datalogy (datalogi), the science of data processes, instead of the term computer science. In 1961, John Tukey described a field, "data analysis", which might be the closest to the field of "data science" as it is known today. The term "data science" appears in the preface of Naur's 1974 book "Concise Survey of Computer Methods". In 1985, Chien-Fu Jeff Wu used the term "data science" as an alternative name for statistics. In 1997, Wu gave a lecture entitled "Statistics = Data Science?". Wu advocated that statistics be renamed data science and statisticians be called data scientists. Twenty-five years later, are we still grappling with this?

What are you - a statistician or a data scientist? What are the differences and what are the similarities? This presentation will explore the history and evolution of the term and discipline “data science”.

  • Renette Blignaut (2021 SAS® Thought Leader and Professor at University of the Western Cape)
Lunch

Fairway Terrace Restaurant

On testing for the assumptions of mixture cure models in the presence of covariates | James Allison

Stream: Computational Statistics
Chair: Leonard Santana

Authors: J. Allison, J. Visagie, I. Van Keilegom
Abstract: Mixture cure models have become popular models for lifetimes in various fields, including medicine and finance. Although tests for the assumptions underlying these models exist in the absence of covariates, no test can be found in the literature for use in the presence of covariates. We propose a test that can be employed in this setting. The test utilises transformed data involving the Kaplan-Meier estimate of the distribution function of the lifetimes. We present a Monte Carlo study in order to demonstrate the finite sample performance of the proposed test.
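The Kaplan-Meier estimate on which the proposed test is based can be sketched as below. This is the standard estimator of the survival function under right censoring, not the authors' test itself.

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function at each
    distinct observed event time (events: 1 = event, 0 = censored).
    At each event time t: S(t) = S(t-) * (1 - d_t / n_t), where
    d_t is the number of events at t and n_t the number at risk."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    order = np.argsort(times)
    times, events = times[order], events[order]
    uniq = np.unique(times[events == 1])
    surv, s = [], 1.0
    for t in uniq:
        at_risk = np.sum(times >= t)
        d = np.sum((times == t) & (events == 1))
        s *= 1 - d / at_risk
        surv.append(s)
    return uniq, np.array(surv)
```

In a mixture cure model, the plateau of this curve at large times is informative about the cured fraction.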

Comparing distance-based and traditional parameter estimation techniques for the Lomax distribution | Thobeka Nombebe

Stream: Computational Statistics
Chair: Leonard Santana

Authors: T. Nombebe, L. Santana, J. Allison, J. Visagie
Abstract: We investigate the performance of a variety of estimation techniques for the scale and shape parameters of the Lomax distribution. These methods include the L-moment estimator, the probability weighted moments estimator, the maximum likelihood estimator, the maximum likelihood estimator adjusted for bias, the method of moments estimator and three different minimum distance estimators. The comparisons are made by considering the variance and the bias of these estimators. Based on an extensive Monte Carlo study, we found that the so-called minimum distance estimators are the best performers for small sample sizes; however, for large sample sizes the maximum likelihood estimators outperform the minimum distance estimators. We conclude with a practical example applied in the context of duration models.
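As an illustration of the simplest estimator compared, the method of moments for the Lomax distribution has a closed form when the first two moments exist (shape alpha > 2). The derivation and function name below are a sketch, not the paper's implementation.

```python
import numpy as np

def lomax_method_of_moments(x):
    """Method-of-moments estimates for the Lomax(shape = alpha,
    scale = lam) distribution, valid when alpha > 2:
        mean = lam / (alpha - 1)
        var  = lam^2 * alpha / ((alpha - 1)^2 * (alpha - 2))
    so var / mean^2 = alpha / (alpha - 2), which inverts to
        alpha = 2r / (r - 1),  lam = mean * (alpha - 1),
    with r the sample var / mean^2 ratio."""
    m = np.mean(x)
    v = np.var(x, ddof=1)
    r = v / m**2
    alpha = 2 * r / (r - 1)
    lam = m * (alpha - 1)
    return alpha, lam
```

For heavy-tailed fits with alpha near 2 the ratio r blows up, which is one reason the likelihood-based and minimum distance estimators studied in the talk are preferred in practice.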

On estimating the mode of an angular distribution | Jaco Visagie

Stream: Computational Statistics
Chair: Leonard Santana

Authors: J. Visagie, F. Lombard, C. Pretorius
Abstract: We propose estimators for the mode of an angular distribution, each adapted from a corresponding class of estimators defined on the real line. In addition to point estimation, the construction of confidence intervals using the bootstrap is considered. The asymptotic properties of the proposed estimators are outlined and a Monte Carlo study is included in order to compare the finite sample performance of the proposed estimators.
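A simple circular analogue of a kernel-based mode estimator, one plausible member of the classes of estimators the abstract adapts from the real line, can be sketched as below; the von Mises kernel, bandwidth choice and names are illustrative assumptions.

```python
import numpy as np

def circular_mode(angles, kappa=20.0, grid=720):
    """Estimate the mode of an angular sample by maximising a
    von Mises kernel density estimate over a grid on [0, 2*pi).
    The kernel concentration kappa plays the role of an inverse
    bandwidth; the normalising constant is omitted because it
    does not affect the argmax."""
    theta = np.linspace(0.0, 2 * np.pi, grid, endpoint=False)
    dens = np.exp(kappa * np.cos(theta[:, None] - angles[None, :])).sum(axis=1)
    return theta[np.argmax(dens)]
```

Bootstrap confidence intervals of the kind mentioned in the abstract would resample the angles and recompute this estimate.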

Assessing the goodness-of-fit tests for Poisson regression models | Leonard Santana

Stream: Computational Statistics
Chair: Leonard Santana

Authors: L. Santana, S. Meintanis, J. Ngatchou-Wandji, M. Smuts
Abstract: We propose goodness-of-fit tests for models of count responses with covariates. We primarily focus on the null hypothesis that the observed data are from a Poisson regression model, however the proposed method is general enough to allow for the responses to follow any discrete distribution, conditional on covariates. The test criteria are formulated by using the probability generating function. In this talk, Monte Carlo results are presented to motivate the use of this test, and some asymptotic theory is also mentioned. An application on a real-world data set is also reported.
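The probability generating function idea can be illustrated in a covariate-free toy version: compare the empirical PGF of a count sample with the PGF of a Poisson distribution with the same mean. The statistic below is a sketch, not the authors' test criterion.

```python
import numpy as np

def pgf_distance(x, grid=None):
    """Average squared distance over t in [0, 1] between the
    empirical probability generating function g_n(t) = mean(t^X)
    of a count sample and the Poisson PGF exp(lam * (t - 1)) with
    lam set to the sample mean. Small values support the Poisson
    hypothesis; a formal test would calibrate this by bootstrap."""
    x = np.asarray(x)
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    lam = x.mean()
    emp = np.mean(grid[:, None] ** x[None, :], axis=1)   # empirical PGF
    pois = np.exp(lam * (grid - 1.0))                    # Poisson PGF
    return np.mean((emp - pois) ** 2)
```

The regression version described in the talk replaces the single lam with a fitted conditional mean per observation.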

SASA 2022 Closing Ceremony

Chair: I. Fabris-Rotelli