
Agenda

  • Keynote / Stream 1 (Endler)
    08:00 - 09:00 Registration
    Endler Foyer
    09:00 - 10:15 Opening Ceremony
    10:15 - 10:45 Tea
    10:45 - 11:45 Plenary Session
    Speaker: Dr McElory Hoffman & Dr Johan van der Merwe

    Title: Ethical Machine Learning in Managing a Health Pandemic
    11:45 - 12:05 Automated quantification of hydraulic failure in plants using deep learning
    Data Science

    Speaker: Tristan Naidoo (online)

    Abstract: Droughts, exacerbated by anthropogenic climate change, threaten plants through hydraulic failure. This hydraulic failure is caused by the formation of embolisms, which block water flow in a plant's xylem conduits. By tracking these failures over time, vulnerability curves (VCs) can be created. These curves hold physiological value and characterise how vulnerable a plant is to hydraulic failure. However, the creation of these curves is laborious and time-consuming. Automating the creation of VCs will allow the vulnerability of a greater number of plants to be characterised. A standout candidate for automation is the optical vulnerability (OV) method of determining hydraulic failure. To automate this method, embolisms need to be segmented across a sequence of images. This presentation will discuss the automation of the OV method. It will consider three fully convolutional models for the segmentation task, namely U-Net, U-Net with a ResNet34 backbone, and W-Net, a repeated U-Net variant. The dataset used consists of three unique species and four unique leaves, where each leaf has its own sequence of images. Using these leaves, three experiments will be discussed: 1) Can a model generalise across samples from the same leaf? 2) Can a model generalise across different leaves of the same species? 3) Can a model generalise across leaves from different species? The results will be assessed on two levels: firstly, how well a model performs on the segmentation task, and secondly, how well VCs are reconstructed from model predictions.
    12:05 - 12:25 An edge-preserving median filter for images based on level-sets
    Data Science

    Speaker: JP Stander (online)

    Abstract: In this article, we propose an edge-preserving median filter for noise removal in images. The filter uses connected sets of pixels of the same value to determine flexible regions which contour to edges in the image; it decides whether a set is noise or signal and smooths the noise. These regions are flexible since they are created based on their values, that is, they are data-driven, and this provides the mechanism by which the filter preserves edges in the image. Standard median filters do not preserve edges. Using metrics such as Pratt's Figure of Merit and Peak Signal-to-Noise Ratio on example images from the Labeled Faces in the Wild data set, it was concluded that the proposed filter removes noise while preserving the edges in the image.
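
    For context, a minimal sketch of the plain (non-edge-preserving) sliding-window median filter that level-set approaches improve upon; base R, with img an assumed greyscale matrix:

    ```r
    # Plain sliding-window median filter: the non-edge-preserving baseline.
    # img: numeric matrix (greyscale image); k: half-window size.
    median_filter <- function(img, k = 1) {
      nr <- nrow(img); nc <- ncol(img)
      out <- img
      for (i in seq_len(nr)) {
        for (j in seq_len(nc)) {
          win <- img[max(1, i - k):min(nr, i + k),
                     max(1, j - k):min(nc, j + k)]
          out[i, j] <- median(win)   # smooths edges as readily as noise
        }
      }
      out
    }
    ```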
    12:25 - 12:45 Monitoring and mapping the critically endangered Clanwilliam cedar using aerial imagery and deep learning
    Data Science

    Speaker: Stefan Britz (in-person)

    Abstract: The critically endangered Clanwilliam cedar, Widdringtonia wallichii, is an iconic tree species endemic to the Cederberg mountains in the Fynbos Biome. Consistent declines in its populations have been noted across its range, primarily due to the impact of fire and climate change. Mapping the occurrences of this species over its range is key to the monitoring of surviving individuals and is important for the management of biodiversity in the region. Recent efforts have focused on the use of freely available Google Earth™ imagery to manually map the species across its global native distribution. This talk proposes an approach for automating the process of tree detection using deep learning. The approach involves using sets of high-resolution red, green, blue (RGB) imagery to train artificial neural networks for the task of tree-crown detection. Additional models are trained on colour-infrared imagery, since live vegetation has a red tone in the near-infrared (NIR) spectrum. Preliminary results show that using an intersection-over-union threshold of 0.5 yields an average tree-crown recall of 0.59 with a precision of 0.46, and that the addition of the NIR spectral band does not result in improved performance. The viability of using this approach to regularly update maps of the Clanwilliam cedar and monitor its population trends in the Cederberg is discussed.
    12:45 - 13:30 Lunch
    13:30 - 13:50 Correlated gamma frailty models for bivariate survival time data
    Biostatistics

    Speaker: Adelino Martins (online)

    Abstract: Frailty models have been developed to quantify both heterogeneity and association in multivariate time-to-event data. In recent years, numerous shared and correlated frailty models have been proposed in the survival literature, allowing for different association structures and frailty distributions. A bivariate correlated gamma frailty model with an additive decomposition of the frailty variables into a sum of independent gamma components was introduced before. Although this model has a very convenient closed-form representation for the bivariate survival function, the correlation among event- or subject-specific frailties is bounded above, which becomes a severe limitation when the values of the two frailty variances differ substantially. In this paper, we review existing correlated gamma frailty models and propose novel ones based on bivariate gamma frailty distributions. Such models are found to be useful for the analysis of bivariate survival time data regardless of the censoring type involved. The frailty methodology was applied to right-censored and left-truncated Danish twin mortality data and serological survey current status data on varicella-zoster virus and parvovirus B19 infections in Belgium. Our analyses show that fitting more flexible correlated gamma frailty models, in terms of the imposed association and correlation structure, outperforms existing frailty models.
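
    As a rough illustration of the additive decomposition, the following sketch simulates bivariate survival times under a correlated gamma frailty with unit-mean frailties; the shapes k0, k1 and baseline rate lambda are illustrative choices:

    ```r
    set.seed(1)
    n  <- 10000
    k0 <- 2; k1 <- 1                   # shapes: shared and event-specific components
    theta <- k0 + k1                   # rate chosen so each frailty has mean 1
    Y0 <- rgamma(n, k0, rate = theta)  # shared gamma component
    Z1 <- Y0 + rgamma(n, k1, rate = theta)
    Z2 <- Y0 + rgamma(n, k1, rate = theta)
    cor(Z1, Z2)                        # approx k0 / (k0 + k1) = 2/3: bounded correlation

    lambda <- 0.1                      # constant baseline hazard
    T1 <- rexp(n, rate = lambda * Z1)  # event times, conditionally exponential
    T2 <- rexp(n, rate = lambda * Z2)
    ```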
    13:50 - 14:10 Evaluating the Effect of HIV Status Awareness on HIV Risky Sexual Behaviours and Marriage Dissolution using Marginal Structural Models
    Biostatistics

    Speaker: Halima Twabi (online)

    Abstract: Knowledge of HIV status has been shown to impact risky sexual behaviours, such as inconsistent condom use and multiple sexual partners, as well as marriage dissolution. An increase in risky sexual behaviours results in higher HIV transmission. Policy makers would be interested in assessing the magnitude of HIV status awareness and its impact on risky sexual behaviour and marriage outcomes. This paper aimed to use routine longitudinal data to estimate the effect of HIV status awareness on risky sexual behaviours and marriage dissolution. Data were extracted from the Malawi Longitudinal Study for Families and Health (MLSFH), using complete linked individual records that appeared in 8 waves collected bi-annually. A marginal structural model (MSM) with inverse probability of treatment weights (IPTW) was used to estimate the effect of known HIV status on consistent condom use, multiple sexual partners and marriage outcomes. The findings show that HIV status awareness had a beneficial effect on condom use and on having multiple sexual partners. However, there was an increase in marriage dissolution among individuals who were aware of their HIV-positive status. The study may suggest effectiveness of HIV preventive strategies in Malawi. We recommend continuation of interventions that promote HIV testing and counselling to help people become aware of their HIV status.
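
    A minimal sketch of stabilised IPTW for a binary exposure and outcome, under an assumed data frame d with illustrative columns aware (exposure), age and sex (confounders) and condom_use (outcome); in practice robust (sandwich) standard errors would be added:

    ```r
    # Stabilised inverse probability of treatment weights.
    denom <- glm(aware ~ age + sex, family = binomial, data = d)
    num   <- glm(aware ~ 1,         family = binomial, data = d)
    p_d <- ifelse(d$aware == 1, fitted(denom), 1 - fitted(denom))
    p_n <- ifelse(d$aware == 1, fitted(num),   1 - fitted(num))
    d$sw <- p_n / p_d

    # Weighted outcome model approximates the marginal structural model;
    # quasibinomial avoids warnings about non-integer weights.
    msm <- glm(condom_use ~ aware, family = quasibinomial, data = d, weights = sw)
    summary(msm)
    ```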
    14:10 - 14:30 Using joint models to study the association between CD4 count and the risk of death in TB and HIV studies
    Biostatistics

    Speaker: Nobuhle Mchunu (in-person)

    Abstract: Background: Joint modeling is the most appropriate method for studying potential associations between biomarkers and time-to-event outcomes. The association structure linking the two sub-models is of fundamental importance in the joint modeling framework. However, the rationale for selecting this association structure has received little attention in the literature. To this end, we aim to explore five alternative association structures between the CD4 count and the risk of death and ultimately select the best association structure for our data. Methods: We used data from CAPRISA, the Starting Antiretroviral Therapy at Three Points in Tuberculosis (SAPIT) study, an open-label, three-armed, randomised, controlled trial conducted between June 2005 and July 2010 (N=642). We used the Deviance Information Criterion (DIC) to select the final model, with smaller values indicating a better fit of the model to the data. Results: Among the 642 patients enrolled in the SAPIT trial, 214 (33.3%) were in the early integrated arm, 215 (33.5%) in the late integrated arm and 213 (33.2%) in the sequential arm. Patient characteristics were similar across the three study arms. The joint model with random effects was chosen as our best model, where the baseline levels of the underlying square root CD4 count as well as the longitudinal evolution of the CD4 count were found to be strongly related to the hazard of death. Conclusions: The current value association structure may not always be appropriate for expressing the correct association between the outcomes in all settings. Thus, exploring other clinically meaningful association structures linking the two processes expands the usefulness of the joint modeling framework.
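
    A sketch of a joint model with a current-value association structure using the JM package; the data frames long (longitudinal) and surv (one row per patient) and their columns are assumed for illustration:

    ```r
    library(nlme); library(survival); library(JM)

    # Longitudinal sub-model for square-root CD4 count.
    lme_fit <- lme(sqrt(cd4) ~ obstime + arm, random = ~ obstime | id, data = long)

    # Survival sub-model (x = TRUE keeps the design matrix for JM).
    cox_fit <- coxph(Surv(stime, died) ~ arm, data = surv, x = TRUE)

    # Current-value association; other parameterizations ("slope", "both")
    # change how the longitudinal trajectory enters the hazard.
    jfit <- jointModel(lme_fit, cox_fit, timeVar = "obstime",
                       parameterization = "value")
    summary(jfit)
    ```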
    14:30 - 15:00 Tea
    15:00 - 15:20 Data-driven methods for subgroup identification in clinical trials with an application
    Biostatistics

    Speaker: Charl Janse van Rensburg (online)

    Abstract: Randomized clinical trials provide the best evidence on the efficacy of new therapeutic drugs. In many trials, the main treatment effect of interest may not be significant. Post hoc analysis may be conducted to identify subgroups for whom the treatment may have worked. However, simple methods suffer from low power as well as bias. In the last two decades many data-driven approaches have been developed to identify subgroups of patients with a treatment effect in failed trials, or with an enhanced treatment effect compared to the overall effect. Subgroup treatment effect identification can be approached using decision trees, random forests, support vector machines, as well as model-based approaches. We introduce some of these methods, most notably the GUIDE algorithm, and compare the methods in a simulation study. The methods are also applied to a real-life trial data set. Recommendations are made for good practice when doing exploratory subgroup analysis using these methods.
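
    Not the GUIDE algorithm itself, but a sketch of the general tree-based idea (in the spirit of the virtual-twins approach): estimate individual treatment effects from arm-specific models, then partition on covariates; d, y, trt, x1 and x2 are hypothetical:

    ```r
    library(rpart)

    # Arm-specific outcome models (could equally be random forests).
    fit1 <- lm(y ~ x1 + x2, data = subset(d, trt == 1))
    fit0 <- lm(y ~ x1 + x2, data = subset(d, trt == 0))

    # Estimated individual treatment effect for every patient.
    d$effect <- predict(fit1, newdata = d) - predict(fit0, newdata = d)

    # A regression tree on the estimated effects suggests candidate subgroups.
    subgroups <- rpart(effect ~ x1 + x2, data = d)
    print(subgroups)
    ```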
    15:20 - 15:40 Comparing the Effect of HIV Treatment Regimens on Time to Mortality, Virological Failure and Rebound among HIV Positive Patients using Inverse Probability of Treatment Weighting Estimation of Marginal Structural Models
    Biostatistics

    Speaker: Samuel Manda (in-person)

    Abstract: Randomised clinical trials have been used to compare the efficacy of highly active antiretroviral therapy (HAART) regimens on time following their initiation to death or virological failure or virological rebound in HIV positive patients. For prospective cohort studies, comparative effectiveness has been established using standard survival models such as the time-dependent Cox proportional hazards model. In most cases, there could be time-dependent confounders that are affected by previous treatment regimens, which may produce biased estimates of the regimens’ effects. Thus, causal statistical models based on the marginal structural Cox proportional hazards model have been used to provide comparative causal effects of the different HAART regimens in most prospective cohorts of HIV patients. We use the inverse probability-of-treatment weighted estimation of a marginal structural proportional hazards model on a large retrospective cohort of HIV patients in southern Africa to compare the causal effects of different HAART regimens on time to death or virological failure or rebound.
    15:40 - 16:00 Some Statistical Challenges in the Analysis of Single-Cell RNASeq Data
    Biostatistics

    Speaker: Bernard Omolo (online)

    Abstract: In this study, we review some of the statistical challenges that have been encountered in recent analyses of scRNASeq data. We propose an approach for controlling the Type-I error rate when conducting tests on imputed scRNASeq data. For illustration, we apply the proposed approach to colorectal cancer data from a publicly available database.
    16:10 - 17:00 Plenary Session
    Speaker: Prof. Gareth James

    Title: Irrational Exuberance: Correcting Bias in Probability Estimates
    Stream 2 (Jannasch)
    11:45 - 12:05 Age- and Size-related Reference Ranges for lung function measurements of a cohort of South African children
    Biostatistics

    Speaker: Francesca Little (online)

    Abstract: The use of growth percentiles for anthropometric measurements to monitor childhood development is well known. Growth reference standards are derived from a large, representative, multi-country and multi-ethnic cohort of children followed from birth. Childhood growth is then monitored either by comparing actual growth to the percentiles of these growth standards or by calculating a “z-score” that measures the deviation from “normal” growth. The derivation of the centiles and z-scores is based on modelling the moments of the underlying distribution of the growth measurements, namely the mean (or median), standard deviation (or coefficient of variability), skewness and kurtosis, using cubic or basis splines to capture the nonlinear association with age. The most common methodology for doing this is the technique known as generalized additive models for location, scale and shape (GAMLSS), which extends and incorporates the widely used LMS method. Reference ranges are important not only for anthropometric growth but also for other medical measurements, for example laboratory reference ranges and lung function measurements. These measurements often depend not only on age but also on size (for example, the height of children), and hence the construction of reference ranges needs to take size into account. The GAMLSS methodology allows for a relatively easy incorporation of size in the modelling of the moments of the outcome distributions, either as additive or multiplicative factors. We illustrate the derivation of reference ranges for lung function measurements in a cohort of South African children from 6 weeks to 5 years of age.
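
    A sketch of age- and size-dependent centiles with the gamlss package, assuming a data frame d with illustrative columns fev (lung function), age and height; pb() fits penalised B-splines and BCCG is an LMS-type family:

    ```r
    library(gamlss)

    # Model mu, sigma and nu as smooth functions of age and size.
    m <- gamlss(fev ~ pb(age) + pb(height),
                sigma.formula = ~ pb(age),
                nu.formula    = ~ pb(age),
                family = BCCG, data = d)

    # Reference centiles against age; z-scores follow from the fitted moments.
    centiles(m, xvar = d$age, cent = c(3, 10, 25, 50, 75, 90, 97))
    ```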
    12:05 - 12:25 Dynamic prediction of virologic failure in a cohort of HIV infected individuals on antiretroviral therapy in the Western Cape
    Biostatistics

    Speaker: Frissiano Honwana (online)

    Abstract: Background: Personalized medicine is receiving greater attention as health data collection rapidly digitizes and methodological development has seen an increase in statistical models for individual prediction. We consider dynamic prediction models to model the dependency between longitudinal viral load (VL) and virologic failure (VF) in people living with HIV. Methods: We included 91,818 individuals with longitudinal VL measures from routine data between 2008 and 2018. We used a shared random effects model (SREM) to predict virologic failure based on historical VL trajectory and baseline characteristics. The time-dependent area under the curve (AUC) of the receiver operating characteristic (ROC) curve was used to quantify prediction accuracy. Results: The SREM fit was acceptable, with residual diagnostics satisfying the assumptions of the model. The SREM demonstrated good prediction accuracy, with AUCs ranging from 0.69 to 0.76. Conclusion: The SREM effectively incorporates baseline covariates with time-varying viral load and continuously updates virological failure probabilities with every new repeated measurement. The dynamic predictions from the SREM using routinely captured data provide an opportunity to flag individuals who may be at greater risk of negative outcomes. This may be a first step towards individualized care models for people living with HIV.
    12:25 - 12:45 A joint mixed model of adolescent’s reproductive health service knowledge and utilization, and its associated factors in Jimma zone: A prospective longitudinal cohort study
    Biostatistics

    Speaker: Tafere Tilahun Aniley (in-person)

    Abstract: Background: Adolescents, who constitute one-third of the total population in Ethiopia, are often exposed to reproductive health (RH) related problems, because of insufficient access to, or inadequate knowledge of, health services. Thus, the main aim of the current study was to investigate risk factors associated with adolescents’ RH service knowledge and utilization in the Jimma zone, southwest Ethiopia. Method: The data used in the study were taken from the Jimma longitudinal family survey of youth conducted in southwest Ethiopia. The responses measure adolescents’ reproductive health service knowledge and utilization as binary outcomes. We proposed a bivariate logit mixed model to analyze both responses jointly, accounting for the correlation that exists within the data through random effects. Result: The analysis with the bivariate logit mixed model shows that the covariates gender, place of residence, current romantic relationship, and radio listening were significantly associated with both responses. However, adolescent age, society club participation, school attendance, and work status were significantly associated with adolescents’ reproductive health service knowledge only, whereas only current work status was a significant covariate affecting adolescents’ RH service utilization. Conclusion: Reproductive health service knowledge did not improve over the survey waves, while the exposure of adolescents to RH service utilization increased. Based on the results we conclude that there was no clear evidence of early contact with adolescents to improve RH service knowledge. Finally, we recommend implementing various health intervention packages, especially targeting adolescents, to address the gap in knowledge and utilization of RH services.
    12:45 - 13:30 Lunch
    13:30 - 13:50 Machine Learning techniques illustrated using R-Shiny
    Data Science

    Speaker: Lourens Strydom (online)

    Abstract: Innovative and exciting models that strive towards human-level intelligence are at the frontier of research nowadays. This project provides the reader with fundamental concepts used daily in Machine Learning, from well-established algorithms that have been in use for decades to newer and stronger algorithms coming to fruition. It is thus fitting to introduce the reader to a modern web application framework, R-Shiny, which displays striking visualizations and accomplishes powerful analyses using the capabilities of R. Machine Learning is well covered in academic courses, but textbooks alone have never done it justice. By allowing students to interact with some of the algorithms, we hope to inspire them. Until recently, web development was largely out of reach for R users; R-Shiny has changed that. It is now possible to dynamically utilize your code for daily tasks with action buttons, sliders, selection lists, and many more features.
    13:50 - 14:10 Binary Particle Swarm Optimization (BPSO) based feature selection
    Data Science

    Speaker: Michelle Gilfillan (online)

    Abstract: This paper studies feature selection using Binary Particle Swarm Optimization (BPSO) for high-dimensional data sets. Logistic regression and k-nearest neighbour (KNN) classifiers are used. BPSO-based feature selection uses a meta-heuristic search strategy to find near-optimal feature subsets in a small amount of time. These methods are compared with the results of a random forest classifier. Theoretical aspects together with an application are presented.
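
    A compact base-R sketch of BPSO feature selection with a sigmoid transfer function; in-sample logistic-regression AIC stands in for the fitness function here (cross-validated accuracy would be used in practice):

    ```r
    bpso_select <- function(X, y, n_particles = 20, iters = 50,
                            w = 0.7, c1 = 1.5, c2 = 1.5) {
      p <- ncol(X)
      fitness <- function(mask) {           # lower is better
        if (sum(mask) == 0) return(Inf)
        AIC(glm(y ~ ., family = binomial,
                data = data.frame(y = y, X[, mask == 1, drop = FALSE])))
      }
      pos <- matrix(rbinom(n_particles * p, 1, 0.5), n_particles, p)
      vel <- matrix(0, n_particles, p)
      pbest <- pos
      pbest_fit <- apply(pos, 1, fitness)
      gbest <- pbest[which.min(pbest_fit), ]
      for (it in seq_len(iters)) {
        r1 <- matrix(runif(n_particles * p), n_particles, p)
        r2 <- matrix(runif(n_particles * p), n_particles, p)
        # Velocity update pulls each particle towards its own and the global best.
        vel <- w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * sweep(-pos, 2, gbest, "+")
        # Sigmoid transfer: bit j is set with probability plogis(vel_j).
        pos <- (matrix(runif(n_particles * p), n_particles, p) < plogis(vel)) * 1L
        fit <- apply(pos, 1, fitness)
        upd <- fit < pbest_fit
        pbest[upd, ] <- pos[upd, ]; pbest_fit[upd] <- fit[upd]
        gbest <- pbest[which.min(pbest_fit), ]
      }
      gbest                                 # binary inclusion mask over features
    }
    ```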
    14:10 - 14:30 Forward stagewise linear regression for ensemble methods
    Data Science

    Speaker: Danie Uys (in-person)

    Abstract: In supervised learning, the forward stagewise regression algorithm is considered a more constrained version of forward stepwise regression. In turn, the forward stagewise regression algorithm can be refined to produce the incremental forward stagewise regression model. In the latter model, the idea of slow learning is introduced, whereby the residual vector and the appropriate regression coefficient are updated in very small steps at each iteration. Ensemble methods combine a large number of simpler base learners to form a collective model that can be used for prediction. Learning methods such as Bagging, Random Forests and Boosting, amongst others, can all be regarded as ensemble methods. In these methods, the linear model is expressed as a linear combination of the simpler base learners, where the coefficients of the base learners are to be estimated by least squares. Since a large number of base learners is typically involved, the residual sum of squares of the linear combination of base learners has to be penalised by, for example, the lasso penalty. However, the large number of base learners also complicates the minimisation of the penalised residual sum of squares criterion over the coefficients. By using the iterative forward stagewise linear regression algorithm for ensemble methods, which incorporates the idea of slow learning and closely approximates the lasso, estimators of the coefficients of the base learners can be obtained. In the talk, the performance of various ensemble methods is evaluated by applying the forward stagewise linear regression algorithm for ensemble methods to simulated as well as real-life datasets.
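
    A base-R sketch of the incremental forward stagewise algorithm; in the ensemble setting the columns of X would hold the base learners' predictions:

    ```r
    incremental_stagewise <- function(X, y, eps = 0.01, steps = 5000) {
      X <- scale(X)                    # standardise base-learner outputs
      beta <- numeric(ncol(X))
      r <- y - mean(y)                 # work with centred residuals
      for (s in seq_len(steps)) {
        cors <- drop(crossprod(X, r))  # inner products with current residual
        j <- which.max(abs(cors))
        delta <- eps * sign(cors[j])
        beta[j] <- beta[j] + delta     # tiny update: slow learning
        r <- r - delta * X[, j]
      }
      beta                             # coefficient path closely approximates the lasso
    }
    ```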
    14:30 - 15:00 Tea
    15:00 - 15:20 Exploding biplots with density axes in Plotly
    Multivariate Statistics & Stochastic Processes

    Speaker: Carel van der Merwe (in-person)

    Abstract: Biplots are useful for visualizing multivariate data. They can, however, sometimes be challenging to interpret, for example when the axes and points cause overcrowding of the plot. This overcrowding is often due to the presence of many variables, highly correlated variables, or merely data sets with a large number of observations. In this paper improvements to the biplot are made to address these shortcomings. These improvements include: i) the automatic parallel translation, or "explosion", of axes; ii) the use of densities on the axes to improve interpretation and the representation of large data sets; and iii) the introduction of interactive biplots via the Plotly package in R. These improvements yield a better-composed, less crowded and more easily interpretable plot, offer additional information that can get lost in the case of a high volume of data, and allow the user to inspect the biplot element-wise. An accompanying Shiny web-based application was also created and is available at https://carelvdmerwe.shinyapps.io/ExplodingBiplots/.
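
    A minimal interactive PCA biplot in Plotly (R), using base prcomp on the built-in mtcars data; the axis-stretching factor is arbitrary:

    ```r
    library(plotly)

    pca    <- prcomp(mtcars, scale. = TRUE)
    scores <- as.data.frame(pca$x[, 1:2])
    loads  <- as.data.frame(pca$rotation[, 1:2] * 5)  # stretch axes for visibility

    plot_ly() |>
      add_markers(data = scores, x = ~PC1, y = ~PC2, name = "samples") |>
      add_segments(data = loads, x = 0, y = 0, xend = ~PC1, yend = ~PC2,
                   name = "variable axes") |>
      add_text(data = loads, x = ~PC1, y = ~PC2, text = rownames(loads))
    ```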
    15:20 - 15:40 Multivariate prediction with machine learning in digital soil mapping
    Multivariate Statistics & Stochastic Processes

    Speaker: Stephan van der Westhuizen (in-person)

    Abstract: Soil maps produced with digital soil mapping (DSM) contain vital information about the spatial distribution of soil properties, which is used in fields such as agronomy and ecology. DSM uses statistical models to quantify the relationship between a soil property and a selection of soil-forming representative environmental covariates, and then uses this relationship to predict the soil property at locations where it was not observed. Soil maps are usually produced with univariate statistical models, i.e. each map is produced independently of the others without taking into account the underlying correlation structure between the soil properties. This can lead to inconsistent predictions; for example, mapping soil organic C and N concentrations with separate univariate models may lead to unrealistic C:N ratios. Many examples of multivariate mapping exist, and co-kriging is probably the most widely used multivariate technique in DSM. However, co-kriging imposes severe restrictions, such as the linear model of coregionalisation. Machine learning applications in DSM have gained tremendous popularity over the last decade, but the use of machine learning to perform multivariate mapping in DSM is still lacking. In this presentation we compare the multivariate extensions of random forests and projection pursuit regression to (regression) co-kriging when predicting two soil properties, organic C and N, from the Land Use and Coverage Area Survey (LUCAS) data set. Maintaining a well-represented C:N ratio is important for map users, as it provides information on residue decomposition and the nitrogen cycle in soil.
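
    A sketch of one possible multivariate random forest using the randomForestSRC package, which grows a single forest for both responses so splits respect their joint structure; the data frames soil and new_sites are assumed:

    ```r
    library(randomForestSRC)

    # One forest for both correlated responses (organic C and N).
    mv_fit <- rfsrc(Multivar(C, N) ~ ., data = soil)

    # Joint predictions at unobserved locations, one column per response.
    pred <- predict(mv_fit, newdata = new_sites)
    chat <- get.mv.predicted(pred)
    ```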
    Stream 3 (Lecture A214)
    11:45 - 12:05 Some Dirichlet mixtures considered in a Bayesian context by using entropy for prior selection
    Bayesian Statistics

    Speaker: Tanita Botha (in-person)

    12:05 - 12:25 Using Principled Bayesian inference to assess the viability of wind power in South Africa
    Bayesian Statistics

    Speaker: Matthew de Bie (online)

    Abstract: In light of the ongoing load-shedding crisis, South Africa must look towards sources of renewable energy to supplement its ageing, coal-reliant power infrastructure. Sites in the coastal regions of South Africa are locations where the exploitation of wind energy may be feasible. To begin our investigation, we fit a Weibull model to South African wind speed data. The existing literature disagrees about which estimation method is best suited to estimating the Weibull shape parameter. Through simulation, we contrast several estimation methods and identify the Bayesian application of the PC prior as the most appropriate for estimating the shape parameter of our proposed Weibull model. We then apply this Bayesian framework, through several models of increasing complexity, to actual wind speed data by means of the R-INLA package. We aim to gain a better understanding of the functional relationship between recorded wind speeds and the altitude, temporal and spatial conditions under which these measurements were taken. Ultimately, our goal is to construct a statistical framework through which the feasibility of harnessing wind energy in these coastal regions may be evaluated.
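
    Not the R-INLA/PC-prior fit itself, but a baseline maximum-likelihood Weibull fit in base R against which Bayesian estimates can be compared; wind is an assumed vector of positive wind speeds:

    ```r
    # Negative log-likelihood of the Weibull(shape, scale) model.
    nll <- function(par, x) {
      -sum(dweibull(x, shape = par[1], scale = par[2], log = TRUE))
    }

    fit <- optim(par = c(2, mean(wind)), fn = nll, x = wind,
                 method = "L-BFGS-B", lower = c(1e-6, 1e-6))
    fit$par   # MLEs of the shape and scale parameters
    ```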
    12:25 - 12:45 Seasonal and station effects modelling of extreme temperature data in South Africa
    Bayesian Statistics

    Speaker: Legesse Kassa Debusho (in-person)

    12:45 - 13:30 Lunch
    13:30 - 13:50 Clustering time-course data using P-splines and mixed effects mixture models
    General

    Speaker: Deidre Bredenkamp (online)

    Abstract: This paper addresses cluster analysis of time-course data in a mixture model framework. To take into account the time dependency of such time-course data, as well as the degree of error present in many datasets, a mixed effects model with penalized B-splines is presented. In this paper the performance of such a mixed effects model is studied with regard to the clustering of time-course gene expression data in a mixture model system. The EM algorithm is implemented to fit the mixture model in a mixed effects model structure. For each subject the best linear unbiased smooth estimate of its time-course trajectory is calculated, and subjects with similar mean curves are clustered in the same cluster. Model validation statistics such as the model accuracy and the coefficient of determination (R2) indicate that the model can effectively cluster stochastically simulated data into clusters that differ in either the form of the curves or the timing of the curves' peaks. The suggested technique is further demonstrated by clustering time-course gene expression data consisting of microarray samples from lung tissue of mice exposed to different influenza strains at 14 timepoints. The results show a graphic overview of each cluster's genetic outcome, as well as the goodness-of-fit of the model via the 'mean curve' framework along with the respective confidence intervals.
    13:50 - 14:10 Modelling the output from a commercial chemical facility using Cox proportional hazard regression
    General

    Speaker: Roelof Coetzer (online)

    Abstract: The Cox proportional hazard (CPH) regression model has been used with great success for modelling the time until certain events occur, and for studying the dependency of survival time, or time until event, on predictor variables. In this paper, we illustrate that the CPH model can also be used for modelling a response variable which is heterogeneous over time as a function of predictor variables. In addition, we evaluate a number of alternative distributions for the baseline hazard and quantify the accuracy of the CPH model in predicting the response variable using the predicted root mean square error. The accelerated failure time (AFT) model, where the explanatory variables act multiplicatively on time, is considered as an alternative to the CPH model. The CPH and AFT models are used for the prediction of a key process variable in a commercial chemical facility.
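
    A sketch of fitting both model families with the survival package; the data frame plant and its columns are illustrative:

    ```r
    library(survival)

    # Semi-parametric Cox proportional hazards model.
    cph <- coxph(Surv(time, event) ~ temp + pressure + feed_rate, data = plant)

    # Parametric AFT alternative; dist chooses the baseline distribution
    # (e.g. "weibull", "lognormal", "loglogistic").
    aft <- survreg(Surv(time, event) ~ temp + pressure + feed_rate,
                   data = plant, dist = "weibull")
    ```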
    14:10 - 14:30 An Algorithm for Generating Multi-Label Classification Data
    General

    Speaker: Trudie Sandrock (in-person)

    Abstract: Multi-label classification has become an active area of research. When comparing multi-label classification methods, benchmark datasets are generally used. However, these benchmark datasets have significant shortcomings, and ideally methods should be compared using artificially generated data. To date, however, few proposals exist in this regard, and the existing proposals are limited in many respects. A new method for generating multi-label classification data is therefore proposed, which offers considerable control over many properties of the simulated data. Of special interest is the option of specifying locally and globally relevant input variables.
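
    Not the proposed algorithm, but a minimal sketch of generating multi-label data from sparse label-specific coefficients, giving a flavour of controlling which inputs are relevant to which labels:

    ```r
    set.seed(1)
    n <- 500; p <- 10; q <- 4                    # observations, inputs, labels

    X <- matrix(rnorm(n * p), n, p)

    # Sparse p x q coefficient matrix: each label depends on its own input subset.
    B <- matrix(rnorm(p * q) * rbinom(p * q, 1, 0.4), p, q)

    # Independent Bernoulli draws given the label-wise logits.
    Y <- matrix(rbinom(n * q, 1, plogis(X %*% B)), n, q)
    ```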
    14:30 - 15:00 Tea
    15:00 - 15:20 A new look into handling the missing observations problem in data: The random-draw way, focusing on value formats of the data
    General

    Speaker: Nyiko Muhluri Khoza (in-person)

    Abstract: The problem of missing observations in data has been regarded as a problem of the past since the introduction of multiple imputation. However, as with other methods that handle missing observations, the study observes gaps in this development. The study conducts experiments on the current methods for handling missing observations and assesses the reliability of the imputation techniques, considering both techniques that retain the sample size and those that do not. The focus is on correcting the gaps the study identifies in each technique. The study proposes a random-draw methodology: observations are randomly drawn until convergence is reached, focusing on the value formats of the data when simulating the draws. The proposed method also cleans the imputed observations to fix a shortfall of the random-draw methodology. A random-draw model is formulated from the gaps observed when imputing values outside the value formats of the data. Variable importance, assessed using the HPSPLIT procedure, is compared before and after data cleaning. The new methodology is formulated from carry-forward (or backward) imputation, multiple regression imputation, linear interpolation imputation and multiple imputation, paying attention to the value formats of the data.
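
    A minimal sketch of the random-draw idea for a single variable: missing entries are replaced by draws from the observed values, which automatically respects the variable's value format:

    ```r
    impute_random_draw <- function(x) {
      miss <- is.na(x)
      # Draw replacements from the empirical distribution of observed values.
      x[miss] <- sample(x[!miss], size = sum(miss), replace = TRUE)
      x
    }

    x <- c(3, NA, 7, 7, NA, 2)
    impute_random_draw(x)
    ```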
    15:20 - 15:40 Robust Adaptive LASSO and Adaptive E-NET Variable Selection and Regularization in Quantile Regression in the Presence of Collinearity Influential Points
    General

    Speaker: Innocent Mudhombo (in-person)

    Abstract: Collinearity influential observations greatly influence variable selection and parameter estimation in regression analysis. In this presentation, we propose modifications of the adaptive LASSO and adaptive E-NET penalized quantile regression (QR) variable selection procedures to deal with collinearity influential points in variable selection and regularization. Although the penalization problem for variable selection has been dealt with extensively in the literature for the QR scenario, many existing variable selection procedures fail to deal with both variable selection and regularization under the adverse effects of collinearity influential points. Our suggested procedures address these shortcomings of existing variable selection and regularization methods in QR in the presence of collinearity influential observations. The adaptive weights in our proposed procedures are based on the robust weighted quantile regression RIDGE (WQR-RIDGE) estimator. The proposed procedures satisfy oracle properties under regularity conditions. Simulations show that our suggested adaptive QR variable selection procedures deal with collinearity influential observations better than other variable selection procedures, especially in the robust weighted scenarios.
    15:40 - 16:00 LASSO and E-NET Variable Selection and Regularization in Quantile Regression via Minimum Covariance Determinant based Weights
    General

    Speaker: Edmore Ranganai (in-person)

    Abstract: The importance of variable selection and regularization procedures in multiple regression analysis cannot be overemphasized. These procedures are adversely affected by predictor space data aberrations as well as outliers in the response space. To counter the latter, robust statistical procedures such as quantile regression which generalizes the well-known least absolute deviation procedure to all quantile levels have been proposed in the literature. Quantile regression is robust to response variable outliers but very susceptible to outliers in the predictor space (high leverage points) which may alter the eigen-structure of the predictor matrix. High leverage points that alter the eigen-structure of the predictor matrix by creating or hiding collinearity are referred to as collinearity influential points. In this paper, we suggest generalizing the penalized weighted least absolute deviation to all quantile levels, i.e., to penalized weighted quantile regression using the RIDGE, LASSO, and elastic net penalties as a remedy against collinearity influential points and high leverage points in general. To maintain robustness, we make use of very robust weights based on the computationally intensive high breakdown minimum covariance determinant. Simulations and applications to well-known data sets from the literature show an improvement in variable selection and regularization due to the robust weighting formulation.
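
    A sketch of weighted, lasso-penalised quantile regression with the quantreg package; because the check loss is positively homogeneous, row-scaling X and y by positive robustness weights w (assumed given here, e.g. MCD-based) implements the weighting:

    ```r
    library(quantreg)

    # X: n x p design matrix, y: response, w: positive robustness weights.
    Xw <- sweep(cbind(Intercept = 1, X), 1, w, "*")
    yw <- w * y

    # Lasso-penalised quantile regression at the median.
    fit <- rq.fit.lasso(Xw, yw, tau = 0.5, lambda = 5)
    fit$coefficients
    ```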
    Stream 4 (Lecture A221)
    11:45 - 12:05 Sales Forecasting Using Linguistic Fuzzy Logic with Weather Data
    Financial Statistics

    Speaker: Tomas Tichy (in-person)

    Abstract: This text proposes a novel approach to studying financial quantities using various exogenous variables, inspired by fuzzy natural logic. The method is based on modeling the influence of exogenous variables on financial quantities by fuzzy linguistic IF-THEN rules. Reliable estimation of customer demand for products and services constitutes a key aspect of financial planning in every company. For example, when estimating future sales as a proxy for demand, a large selection of (exogenous) variables specific to a given product can be considered in addition to pure economic quantities. The potential impact of weather conditions on sales has been known for a very long time, though research using weather data has mostly focused on the energy sector. As concerns retail, several authors have started to analyze this issue only recently. The proposed methodology is applied to real sales data and compared with a standard approach. The results are promising, especially when frequently collected weather data are considered, even if sales are recorded over longer periods.
    12:05 - 12:25 A Quantitative Analysis of Investor Over-reaction and Under-reaction in the South African Equity Market: A Fuzzy C-Mean Algorithm
    Financial Statistics

    Speaker: Aude Ines Mbonda Tiekwe (in-person)

    Abstract: One of the basic foundations of traditional finance is the theory underlying the efficient market hypothesis (EMH). The EMH states that stocks are fairly and accurately priced, making it impossible for investors to use stock selection, technical analysis, or market timing to outperform the market by earning abnormal returns. Several schools of thought have challenged the EMH by presenting empirical evidence of market anomalies which seems to contradict it. One such school of thought is behavioural finance, which holds that investors over-react and/or under-react over time, driven by their behavioural biases. In this study, a fuzzy c-means model, based on the technique of pattern recognition, is used to investigate investor over-reaction and under-reaction in the South African equity market. The study used quarterly data on 163 shares in the Johannesburg Stock Exchange All Share index, selected from the top 100 shares listed for the period 2006 to 2016 and downloaded from Iress and Bloomberg. Over-reaction and under-reaction were both detected, and differed across sectors. No clear patterns of the two biases were visible over time. The results of the FCM analysis revealed that the resources sector shows the most under-reaction. The results imply that momentum and contrarian investment strategies can lead to over-performance in the South African equity market, but can also generate under-performance in a poorly performing market. Therefore, no trading strategies can be advised based on the results of this study.
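
    A sketch of fuzzy c-means clustering with the e1071 package; the matrix return_features of return-based indicators per share is assumed:

    ```r
    library(e1071)

    # Rows: shares; columns: return-based features (e.g. past and subsequent returns).
    feats <- scale(return_features)

    fcm <- cmeans(feats, centers = 3, m = 2)   # m > 1 controls the fuzziness
    head(fcm$membership)                        # soft cluster memberships per share
    fcm$centers                                 # cluster prototypes
    ```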
    12:25 - 12:45 Forecasting Volatility in Commodity Markets with Long-Memory Models
    Financial Statistics

    Speaker: Mesias Alfeus (in-person)

    Abstract: Commodity markets are among the most volatile, and forecasting their volatility is an issue of paramount importance. We examine the dynamics of commodity market volatility by employing three typical long-memory models: the fractionally integrated generalized autoregressive conditional heteroscedastic (FIGARCH), fractional stochastic volatility (FSV) and heterogeneous autoregressive (HAR) models. Based on a high-frequency futures price dataset of 22 commodities, we confirm that the volatility of commodity markets is rough, and that volatility components over different horizons are economically and statistically significant. Long memory with anti-persistence is evident across all commodities, with weekly volatility dominating in most commodity markets and daily volatility in the oil and gold markets. HAR models display a clear advantage in forecasting performance over the two other models for short horizons, while fractional volatility models yield comparatively better forecasts for longer horizons.
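
    A base-R sketch of the HAR idea: realised volatility regressed on its lagged daily, weekly and monthly components; rv is an assumed vector of daily realised volatilities:

    ```r
    # Lagged daily, weekly (5-day) and monthly (22-day) average RV components.
    har_frame <- function(rv) {
      rv_w <- stats::filter(rv, rep(1 / 5, 5), sides = 1)    # trailing 5-day mean
      rv_m <- stats::filter(rv, rep(1 / 22, 22), sides = 1)  # trailing 22-day mean
      n <- length(rv)
      data.frame(y = rv[23:n],
                 d = rv[22:(n - 1)],
                 w = rv_w[22:(n - 1)],
                 m = rv_m[22:(n - 1)])
    }

    fit <- lm(y ~ d + w + m, data = har_frame(rv))
    summary(fit)   # horizon-specific coefficients measure each component's contribution
    ```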
    12:45 - 13:30 Lunch
    13:30 - 13:50 Extending GPAbin to visualise missing multivariate continuous data
    Multivariate Statistics

    Speaker: Johané Nienkemper-Swanepoel (in-person)

    Abstract: Multiple imputation is a well-established technique for analysing missing data. Multiply imputed data sets are obtained and analysed separately using standard complete-data techniques. The estimates from the separate analyses are then combined for inference. However, the options for exploratory analysis of multiply imputed data sets are limited. Biplots are regarded as generalised scatterplots which provide a simultaneous configuration of both samples and variables. A visualisation for each of the multiply imputed data sets can therefore be constructed and interpreted individually, but in order to formulate an unbiased conclusion, the visualisations have to be appropriately combined for a unified interpretation. The GPAbin technique has been developed to address this problem for multiple correspondence analysis biplots of multiply imputed data sets. Generalised orthogonal Procrustes analysis (GPA) is used to align the biplots before combining them in a mean coordinate matrix. The name GPAbin is derived from the amalgamation of GPA and Rubin's rules, the combining steps used after multiple imputation. Simulation studies have confirmed the usefulness of the GPAbin method for categorical data. This presentation will show the extension of the GPAbin methodology to multivariate continuous data by using principal component analysis biplots.
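
    A base-R sketch of the alignment step: orthogonal Procrustes rotation of each configuration towards a target, followed by coordinate-wise averaging; configs is an assumed list of centred n x 2 coordinate matrices, one per imputation:

    ```r
    # Rotate B to best match the target A (both centred n x 2 matrices).
    procrustes_rotate <- function(A, B) {
      s <- svd(crossprod(B, A))        # B'A = U D V'
      B %*% (s$u %*% t(s$v))           # optimal orthogonal rotation
    }

    target  <- configs[[1]]
    aligned <- lapply(configs, procrustes_rotate, A = target)

    # Rubin-style combination: the mean aligned configuration.
    gpabin <- Reduce(`+`, aligned) / length(aligned)
    ```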
    13:50 - 14:10 A New Approach to Error Variance Estimation in a Heteroskedastic Linear Model
    Multivariate Statistics

    Speaker: Thomas Farrar (online)

    Abstract: Estimation of error variances is an important precursor to both estimation of and inference on regression coefficients in a heteroskedastic linear model. The variance estimates can be used to compute heteroskedasticity-consistent standard errors for coefficient estimates as well as to compute weights for feasible weighted least squares estimation. Most existing heteroskedasticity-consistent covariance matrix estimator (HCCME) methods make element-wise bias corrections to the squared ordinary least squares (OLS) residuals, $e_i^2$. The corrections are however based on the conditional expectation of the $e_i^2$ under homoskedasticity, not under heteroskedasticity. A proposed new approach treats the conditional expectation of the $e_i^2$ under heteroskedasticity as the conditional mean function of an auxiliary regression model with the $e_i^2$ as the responses. Two methods for reducing the dimensionality of the resulting parameter space are considered. The first method, HC8, assumes a functional relationship between the error variance parameters and the design variables from the main model. The second method, HC9, assumes that certain subsets of the observations have equal error variances based on their proximity in the design space. Appropriate subsets may be computed using agglomerative hierarchical clustering. In either case, the auxiliary regression model is nonlinear and is fit using a quasi-likelihood procedure. A simulation experiment is performed to compare the new methods to existing HCCMEs. The problem of feature selection in the auxiliary model is discussed.
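
    For reference, a base-R sketch of a standard element-wise HCCME (HC3-type) of the kind the proposed auxiliary-regression methods are compared against; fit is an assumed lm object:

    ```r
    X <- model.matrix(fit)
    e <- residuals(fit)
    h <- hatvalues(fit)

    # HC3: inflate squared OLS residuals by (1 - h_i)^2, then sandwich.
    omega  <- e^2 / (1 - h)^2
    XtXinv <- solve(crossprod(X))
    V_hc3  <- XtXinv %*% crossprod(X * omega, X) %*% XtXinv
    sqrt(diag(V_hc3))   # heteroskedasticity-consistent standard errors
    ```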
    14:10 - 14:30 An insight into inference on the probabilistic determinant of independent generalized beta entries
    Multivariate Statistics

    Speaker: Johan Ferreira (in-person)

    Abstract: Random determinants play an essential role within multivariate analysis, but their distributions often present theoretical and computational challenges. To circumvent these challenges, this talk proposes a lower bound for the probabilistic analysis of the determinant emanating from a matrix consisting of independent but not necessarily identically distributed generalized beta entries. The 2×2 and 3×3 cases receive particular attention, and a brief simulation study verifies the results.
    14:30 - 15:00 Tea
    15:00 - 15:20 The Practical Considerations of a Flexible Finite Mixture Regression (FMRFLEX) Framework
    Computational Statistics

    Speaker: Riaan Smit, Sollie Millard, Frans Kanfer (in-person)

    Abstract: Finite mixture regression (FMR) models represent a flexible statistical modelling framework which allows the underlying structure of complex heterogeneous datasets to be quantified and, in doing so, offers increased predictive power compared to traditional one-class regression models. In practice, a heavy reliance is placed on linear models as the common input predictors in finite mixture regression models. To improve the overall predictability of these models, Ahonen et al. (2019) proposed a revised formulation of the finite mixture regression methodology (FMRFLEX) which introduces a more flexible structure for the linear predictors included in these models. This flexibility is achieved through the combination of a random forest learner and a lasso-penalized finite mixture regression model. By following this approach, the nonlinearities and interactions inherent in the data are “captured” by the random forest learner in a flexible and data-driven manner, and this is in turn combined with the linear associations derived from the original covariates using standard penalized finite mixture regression methods. Empirical results have shown that the FMRFLEX model proposed by Ahonen et al. (2019) is able to achieve greater predictability than traditional finite mixture regression models when some of the regression components in a specific dataset are nonlinear. The practical implementation of the FMRFLEX methodology will be presented, with special attention given to the predictive performance of the model.
    15:20 - 15:40 Robust mixture regression with unspecified error distributions
    Computational Statistics

    Speaker: Wihan van der Heever (online)

    Abstract: Mixtures of linear regression models have become commonplace when fitting a response variable to one or more feature variables in the presence of latent components in the data. These models frequently employ maximum likelihood estimation (MLE) for the regression parameters and depend on the assumption that the errors of the underlying components are normally distributed. Naturally, when this assumption is breached, the traditional approach to mixture regression is no longer viable for modelling purposes, due to the bias arising from the model not capturing the relevant relationship between variables. This paper considers a semiparametric mixture of linear regressions model, which reduces the bias by introducing a kernel density-based expectation maximisation (KDEEM) algorithm. This algorithm accommodates linear mixture regressions without specifying the component error distributions, thereby allaying the complications arising from the violation of the normality assumption. The paper uses a simulation study to compare the KDEEM approach to standard MLE in cases of normally distributed and non-normally distributed errors. The KDEEM algorithm is also applied to a practical Covid-19 data set.
    15:40 - 16:00 Probabilistic Graphical Models and Belief Propagation: Approximations via Free Energies
    Computational Statistics

    Speaker: Francois Kamper (in-person)

    Abstract: A probabilistic graphical model (PGM) aims to illustrate dependencies between random variables arising from multivariate distributions. These dependencies can encode structure such as causality (Bayesian networks) and conditional independence (Markov networks). Given a PGM, and its associated graph structure, one is typically tasked with performing inference on the underlying multivariate distribution. The inference task typically involves determining multiple marginal distributions or determining the maximum a posteriori (MAP) assignment. A key reason for the popularity of PGMs is the existence of algorithms that can exploit sparsity in the graph structure to perform inference in a computationally efficient way. Inference can be done either exactly or approximately, where the latter favours computational speed over inference accuracy. Perhaps one of the most famous approximate inference algorithms is belief propagation (BP), where a Markov network (say) is converted to another graph type (such as a factor or cluster graph) on which inference is then performed. BP is an example of a message-passing algorithm, where nodes in the graph perform operations in parallel and then communicate the results to neighboring nodes. The purpose of this talk is to introduce belief propagation as a solution to an optimization problem, where a distribution is approximated by a specific type of factorization inspired by tree-structured graph topologies.
  • Keynote / Stream 1 (Endler)
    09:00 - 09:20 Estimation of Complier Average Causal Effect using Proportional Hazards Models in Randomised Trials with Competing Risks
    Biostatistics

    Speaker: Andreas Kryger Jensen (online)

    Abstract: Randomised studies are ideal for learning about causal effects. A common problem in such studies, however, is that not all subjects comply with the treatment they are randomized to receive. It is well known that the as-treated analysis may render incorrect results and that the intention-to-treat analysis does not target the causal effect of the treatment. However, instrumental variable techniques can be used to estimate the causal effect. We consider the situation with a survival outcome using a structural proportional hazards model for the compliers under the active treatment. Subjects under the control treatment do not have access to the active treatment. We present an estimator for the complier average causal effect (CACE) in the proportional hazards setting and give its large sample properties. We also illustrate the extension of this problem to the setting of a competing risks outcome.
    09:20 - 09:40 Mediation analyses with survival outcomes
    Biostatistics

    Speaker: Theis Lange (online)

    Abstract: In this talk I will present how causal inference methods allow us to rigorously decompose the effect of an exposure on a survival outcome. I will also present how natural effect models allow us to estimate this even with a (semi) high-dimensional mediator. Finally, I will provide suggestions for future research.
    09:40 - 10:00 Competing risks joint models using R-INLA
    Biostatistics

    Speaker: Janet van Niekerk (online)

    Abstract: In this talk, we introduce a framework based on R-INLA to apply competing risks joint models in a unifying way, such that non-Gaussian longitudinal data, spatial structures, time-dependent splines and various latent association structures, to mention a few, are all embraced in our approach. Our motivation stems from the SANAD trial, which exhibits non-linear longitudinal trajectories and competing risks for failure of treatment.
    10:00 - 10:20 Continuous time parametric multistate transition models with an application
    Biostatistics

    Speaker: Henry Mwambi (online)

    Abstract: State transition models are an important methodological development in Statistics. The application areas of such models are vast, including health, ecology, agriculture and the environment, to mention a few. In this talk a multistate transition model is discussed, with application to HIV/AIDS disease progression based on CD4 cell count derived stages. The model structure allows for the inclusion of covariate effects on specific state transitions in the process. The model is also used to model viral load suppression and rebound in a multistate structure. In addition, disease stage sojourn times and survival can be estimated, hence allowing interventions to mitigate against transitions to worse stages of the disease. An application is demonstrated using data from a study in KwaZulu-Natal of individuals infected with HIV, allowing for a host of covariates, both clinical and non-clinical in nature, in addition to individual-specific characteristics. In conclusion, we found that multistate transition models are an important tool for managing chronic diseases such as HIV/AIDS and cancer at both an individual and a population level, and that they are useful to current approaches to care such as personalized care.
    10:20 - 10:40 Robust joint modelling of longitudinal data and survival data: detection and downweighting of longitudinal measurements
    Biostatistics

    Speaker: Freedom Gumedze (in-person)

    Abstract: Mixed-effects location scale models allow simultaneous modelling of between-subject and within-subject variability. These models include log-linear models for the between-subject and within-subject variability. The log-linear models could potentially include covariates. The models assume that the residual errors and the random effects are normally distributed. This makes them sensitive to outliers. These models have been extended to joint models of longitudinal data and time-to-event data. We explore Cook-type influence diagnostics for the mixed-effects location scale model, assumed for the longitudinal sub-model, and an approach to down-weight outlying subjects. We illustrate the methods using data from a large cardiology clinical trial.
    10:40 - 11:10 Tea
    11:10 - 12:10 Plenary Session
    Speaker: Prof. Emmanuel Lesaffre

    Title: Incorporation of historical information in the analysis of current data: A review of Bayesian methods with applications in pharmaceutical research
    12:10 - 13:00 Lunch
    13:00 - 13:50 SASA AGM
    14:00 - 14:20 Panel Data Changepoint Estimation via Regularization
    Time series analysis: Instabilities in various data structures

    Speaker: Matúš Maciak (online)

    Abstract: Implied volatility (IV) is used as a general but powerful tool for analyzing financial markets. We propose a novel approach to estimating the overall IV dynamics, represented by an underlying panel data model with changepoints. A robust semi-parametric regression framework and atomic pursuit techniques, in particular lasso-based regularization, are applied to estimate the underlying analytical structure of the IV surface, and a formal statistical test is used to detect significant changepoints. The overall complexity of the model allows for changepoints which may occur over time, in the analytical structure of the IV smiles, or both. Theoretical and practical details are discussed and the main statistical properties are derived. Empirical properties are investigated in a simulation study, and real-life applications are presented to illustrate wide and general applicability.
    14:20 - 14:40 Changepoint in randomly spaced time series
    Time series analysis: Instabilities in various data structures

    Speaker: Michal Pesta (online)

    Abstract: Linear relations containing measurement errors in input and output data are considered. Parameters of these so-called errors-in-variables models can change at some unknown moment. The aim is to test whether such an unknown change has occurred or not. For instance, detecting a change in trend for a randomly spaced time series is a special case of the investigated framework. The designed changepoint tests are shown to be consistent and involve neither nuisance parameters nor tuning constants, which makes the testing procedures effortlessly applicable. A changepoint estimator is also introduced and its consistency is proved. As a theoretical basis for the developed methods, a weak invariance principle for the smallest singular value of the data matrix is provided, assuming weakly dependent and non-stationary errors. The results are presented in a simulation study, which demonstrates the computational efficiency of the techniques. The completely data-driven tests are illustrated through a problem from insurance.
    14:40 - 15:00Monitoring procedures for strict stationarity based on the multivariate characteristic function
    Time series analysis: Instabilities in various data structures

    Speaker: Charl Pretorius (in-person)

    Abstract: We propose model-free monitoring procedures for strict stationarity of a given data generating process. The new criteria are formulated as L2-type statistics incorporating the multivariate empirical characteristic function. The monitoring procedures are shown to be consistent against nonstationary alternatives, and the null distributions of the monitoring statistics are derived under general conditions which allow for many popular time series models, including stationary ARMA and GARCH models. Results from a numerical study are presented which show that the newly proposed procedures have favourable finite-sample performance when compared to existing monitoring procedures. The talk is concluded with an application in which we test for possible stationarity breaks in financial time series data.
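    The flavour of such a statistic can be sketched by contrasting empirical characteristic functions (ECFs) of two data windows. This simplification works with univariate samples on a fixed grid of evaluation points; the paper's multivariate construction, weight function and sequential monitoring limits are not reproduced here.
    ```python
    import numpy as np

    def ecf(x, t):
        """Empirical characteristic function of sample x at points t."""
        return np.exp(1j * np.outer(t, x)).mean(axis=1)

    def l2_ecf_distance(x, y, t):
        """L2-type contrast between two empirical CFs on a fixed grid."""
        return np.sum(np.abs(ecf(x, t) - ecf(y, t)) ** 2)

    rng = np.random.default_rng(1)
    t_grid = np.linspace(-3, 3, 61)

    reference = rng.normal(0, 1, 500)   # training (stationary) window
    stable = rng.normal(0, 1, 500)      # no break
    broken = rng.normal(0, 2, 500)      # variance break

    print("no break:  ", l2_ecf_distance(reference, stable, t_grid))
    print("with break:", l2_ecf_distance(reference, broken, t_grid))
    ```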
    15:00 - 15:30Tea
    15:30 - 16:30Plenary Session
    Speaker: Dr Ali Joglekar

    Title: GEMS: Supporting Data-Driven Agri-Food Innovation from Molecules to Markets
    Stream 2 (Jannasch)
    09:00 - 09:20A comparative study of the stylized facts of South African and Indian Stock markets
    Time Series analysis & General

    Speaker: Lindokuhle Mbhense (online)

    Abstract: The study examines different stylized facts within the South African market; historical data on the JSE Top 40 Index from finance.yahoo.com was employed for the analysis. A closer look at the behaviour of the South African market is of considerable interest since it is one of the most rapidly developing markets, which gives researchers the opportunity to develop reliable models for forecasting returns and hence for price determination. It was found that most of the stocks in the JSE Top 40 Index showed larger upward movements than draw-downs, similar to the Indian stock market, which makes the South African market a promising market to invest in. Some of the stocks revealed the presence of autocorrelation, which can be a tool for predicting future prices and thus favours investors. With regard to stylized facts, the behaviour of the South African market is broadly similar to that of the Indian stock market.
    09:20 - 09:40Robust mixture regression using mean-shift penalisation
    Time series analysis & General

    Speaker: Anika Wessels (online)

    Abstract: Finite mixture regression models the relationship between a response variable and feature variables in the presence of latent groups in the population. The regression model parameters are unique to each latent group, quantifying the different regression structures. Although the classical normal mixture regression model is mostly used since it simplifies the estimation and interpretation, it can be highly sensitive to outliers present in the data. Failing to account for this may distort the results and lead to inappropriate conclusions. We consider a mean-shift robust mixture regression approach. This method uses a sparse, component-specific and scale-dependent mean-shift parameterisation to simultaneously identify the outliers and perform robust parameter estimation. The properties of the technique are demonstrated using a simulation study.
    09:40 - 10:00The influence of different regimes on GARCH volatility parameter estimation
    Time series analysis & General

    Speaker: Lienki Viljoen (in-person)

    Abstract: Volatility is used as a measure of risk within the financial markets. GARCH modelling is an important volatility forecasting methodology and is widely used in finance. It is important to be able to forecast volatility since volatility has an impact on financial portfolios and on the risk hedging methodology followed by financial companies. The parameter estimates and volatility forecasts of three GARCH models, the symmetric GARCH, GJR-GARCH and E-GARCH models, are compared using the JSE All-Share index. The index is divided into two different periods, namely a tranquil financial period and a turbulent financial period. Different factors influence the performance of GARCH models and consequently determine which GARCH model is best suited to certain circumstances. These factors are: the window period, the forecasting horizon, the financial period and the underlying distribution of the log returns.
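    As a sketch of how such a comparison might be set up, the Python arch package (one of several tools that fit these models; the study may well have used different software) can estimate all three specifications. The simulated heavy-tailed returns below are placeholders for the JSE series.
    ```python
    import numpy as np
    from arch import arch_model

    rng = np.random.default_rng(42)
    # Placeholder for daily log returns in percent; in the study these would
    # be JSE All-Share returns split into tranquil and turbulent periods.
    returns = rng.standard_t(df=6, size=1500) * 0.8

    # The three specifications compared in the talk:
    garch = arch_model(returns, vol="GARCH", p=1, q=1).fit(disp="off")
    gjr = arch_model(returns, vol="GARCH", p=1, o=1, q=1).fit(disp="off")
    egarch = arch_model(returns, vol="EGARCH", p=1, o=1, q=1).fit(disp="off")

    for name, res in [("GARCH", garch), ("GJR", gjr), ("EGARCH", egarch)]:
        print(name, "AIC:", round(res.aic, 1))

    # 5-step-ahead variance forecasts from the symmetric model
    print(garch.forecast(horizon=5).variance.iloc[-1])
    ```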
    10:00 - 10:20Some simple statistical ideas and techniques useful for understanding the COVID-19 pandemic
    Time series analysis & General

    Speaker: Paul Fatti (online)

    Abstract: The presentation will discuss the following topics relating to the pandemic: estimating the number of infections; estimating the number of deaths; screening for COVID-19; optimal group testing for COVID-19 by pathology laboratories; and testing a vaccine. Each of these topics will be discussed using generally simple statistical concepts and techniques, giving interesting insights and some surprising results.
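    On the group testing topic, the classical Dorfman pooling argument (one possible formulation; the talk's treatment may differ) gives the expected number of tests per person for pool size k and prevalence p as E(k) = 1/k + 1 - (1 - p)^k, which a few lines of code can minimise.
    ```python
    # Dorfman pooled testing: a pool of k samples is tested once; if positive,
    # all k members are retested individually. Expected tests per person:
    #   E(k) = 1/k + 1 - (1 - p)**k,  where p is the prevalence.
    def expected_tests_per_person(p, k):
        return 1.0 / k + 1.0 - (1.0 - p) ** k

    p = 0.02  # assumed 2% prevalence (illustrative)
    best_k = min(range(2, 51), key=lambda k: expected_tests_per_person(p, k))
    print("Optimal pool size:", best_k)
    print("Expected tests per person:",
          round(expected_tests_per_person(p, best_k), 3))
    # At p = 0.02, pooling needs roughly a quarter of the tests that
    # individual testing would require.
    ```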
    10:20 - 10:40Big data analytics through ridge-type and Liu-type estimators
    Time series analysis & General

    Speaker: Salomi Millard (online)

    Abstract: Big data exceed the capacity of standard analytic tools in terms of volume, velocity, and variety. Although such a wealth of information enables innovation in many disciplines of science, it challenges current statistical and computational methodology, data storage and computational efficiency. We focus on obtaining standard statistical models in scenarios where the capacity of a single computer is surpassed due to the high volume of data. We are particularly interested in linear regression models that address the issues that arise due to multicollinearity. Shrinkage methods are frequently utilized to address the adverse effects of multicollinearity in regression models. Although these methods can easily be applied to small or moderate datasets, they face considerable difficulties in the big data domain. Two of these difficulties are: (a) the size of the data is too large to be loaded into the memory of a computer; and (b) the computational burden is such that the results will not be available in a reasonable time. We propose methods and algorithms for model estimation and validation of closed-form solutions to multiple ridge-type and Liu-type estimators with a general structure that are able to overcome these barriers. Our approach requires minimal access to the entire dataset as it utilizes an array of sufficient statistics that can be computed and updated at row level. The efficiency of our approach is illustrated through an extensive simulation study as well as a real-world application.
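    The row-level sufficient-statistic idea can be sketched for the plain ridge estimator (the paper's estimators have a more general structure): since the solution depends on the data only through X'X and X'y, these can be accumulated block by block without ever holding the full data set in memory.
    ```python
    import numpy as np

    def ridge_from_chunks(chunks, lam):
        """Ridge estimate (X'X + lam*I)^{-1} X'y accumulated over row blocks,
        so the full data set never needs to fit in memory at once."""
        XtX, Xty = None, None
        for X, y in chunks:
            if XtX is None:
                XtX = np.zeros((X.shape[1], X.shape[1]))
                Xty = np.zeros(X.shape[1])
            XtX += X.T @ X          # sufficient statistics updated per block
            Xty += X.T @ y
        return np.linalg.solve(XtX + lam * np.eye(XtX.shape[0]), Xty)

    # Simulate a "large" data set arriving in blocks of 10,000 rows
    rng = np.random.default_rng(7)
    beta = np.array([2.0, -1.0, 0.5])
    blocks = []
    for _ in range(10):
        X = rng.normal(size=(10_000, 3))
        blocks.append((X, X @ beta + rng.normal(size=10_000)))

    print(ridge_from_chunks(blocks, lam=1.0))  # close to [2, -1, 0.5]
    ```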
    10:40 - 11:10Tea
    12:15 - 13:45Poster Session
    Waldo Abrahams
    Classification and Clustering-based Methods for Outlier Detection of Solar Resource Data

    Christopher Erasmus
    Bootstrap-based tolerance intervals for nested two-way random effects models

    Michael Willie
    Multivariate Analysis of Medical Schemes Marketing Expenditure
    14:00 - 14:20A new generalized class of exponential ratio type estimators in ranked set sampling
    Theoretical Statistics

    Speaker: Timothy Ayeleso (in-person)

    Abstract: Ranked set sampling (RSS) is a good alternative to simple random sampling (SRS) in sample surveys. This study presents a generalized version of a class of exponential ratio type estimators in RSS and compares it with an existing generalized version of a class of modified exponential ratio estimators in SRS. The data set used in this paper is the data on enrolment of students (variable of interest) and staff strength (auxiliary variable) in secondary schools in the Egba zone of Ogun State, Nigeria, in 2015. The zone had 89 schools and a 3-cycle ranked set sample of size 27 was selected. The descriptive statistics of the data set were obtained for the estimation of the population ratio. The mean square errors (MSEs) for both the proposed estimators and the modified SRS estimators were determined to obtain the efficiencies of the proposed estimators when g was set at 5, 2, 1, 0.75, 0.5 and 0.25. The population means for student enrolment and staff strength were 1581.1 and 66.44 respectively, which gave a population ratio of 23.80. At g = 0.75, the MSEs of the modified SRS estimators were 168631.4; 164651.8; 155701.8; 165614.5; 167417.7; 163252.1; 155374.8; 151460.1; 165533.6 and 167714.7, while those of the proposed estimators were 136330.3; 132447.5; 123715; 133386.8; 135146.1; 131081.8; 123396; 119576.4; 133307.8 and 135435.9 respectively. The MSEs for the members of the proposed generalized class of estimators were found to be smaller than those of the SRS generalized class of estimators; hence they are more efficient estimators.
    14:20 - 14:40Modelling Aleatory and Epistemic Uncertainty in Natural Hazard Distributions
    Theoretical Statistics

    Speaker: Ansie Smit (in-person)

    Abstract: A versatile method to assess natural hazards that accounts for both epistemic and aleatory uncertainty in natural hazard distributions is presented. The events in observed datasets of natural phenomena can be classified as prehistoric, historic or instrumentally recorded data. Each of these types of datasets exhibits different types of epistemic and/or aleatory uncertainties. The described methodology accounts for incomplete datasets, uncertainty associated with the observed event sizes, the applied distributions, and the validity of occurrence of events in the dataset. These types of uncertainty are addressed using convolution and mixture distributions, and weighted likelihood functions. The different data types are combined using likelihood functions, which allows for maximum likelihood (ML) estimation and Bayesian inference (BI) of the parameters. The methodology is tested on a synthetic natural hazard dataset, with various combinations of uncertainty investigated. Estimates of the parameters yielded markedly different results, with BI providing overall more precise estimates than ML. This in turn can have a large effect on estimates of the return periods of event sizes of natural hazards.
    14:40 - 15:00Shrinkage methods for the estimation of the extreme value index
    Theoretical Statistics

    Speaker: Luca Steyn (in-person)

    Abstract: A fundamental problem in extreme value analysis is the estimation of the extreme value index (EVI), denoted by $\gamma$. The EVI characterises the rate at which the tails of a distribution decay which, in turn, enables the estimation of extreme quantiles or excess probabilities. A common approach to estimate the EVI is the Hill estimator for the $\gamma>0$ case and the generalised Hill estimator for the case of a real-valued EVI. Under a second-order condition of the extreme value theorem, these estimators have been expanded to a regression model that reduces the bias inherent to the Hill-type estimators. It has previously been proposed to use a ridge penalty to reduce the variance of the estimators under this model. We present shrinkage methods for the estimation of the EVI under the second-order condition. Specifically, estimators of the EVI under the non-linear regression model with the L1 and L2 penalties are presented. The asymptotic properties of these estimators are investigated in which expressions for the asymptotic mean squared error of the various estimators are derived. Using these expressions, we provide a framework to select the optimal number of extreme order statistics and regularisation penalty such that the asymptotic mean squared error of the EVI is minimised. These methods are tested on various simulated and real-world data sets to demonstrate the improvement of using a regularised, second-order approach to estimate the EVI.
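    For context, the baseline Hill estimator for the $\gamma>0$ case, which the shrinkage and bias-reduction machinery above refines, is only a few lines; the Pareto sample below has true $\gamma$ = 0.5.
    ```python
    import numpy as np

    def hill_estimator(x, k):
        """Hill estimate of the extreme value index (gamma > 0) using the
        k largest order statistics of the sample x."""
        xs = np.sort(x)
        top = xs[-k:]              # k largest observations
        threshold = xs[-k - 1]     # (k+1)-th largest as the threshold
        return np.mean(np.log(top) - np.log(threshold))

    # Classical Pareto sample with P(X > x) = x**(-1/gamma), gamma = 0.5
    rng = np.random.default_rng(3)
    x = rng.pareto(a=2.0, size=5000) + 1.0   # a = 1/gamma = 2

    for k in (50, 100, 200):
        print(f"k = {k}: gamma_hat = {hill_estimator(x, k):.3f}")
    ```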
    15:00 - 15:30Tea
    Stream 3 (Lecture A214)
    09:00 - 09:20A Spatial SEIR Model for COVID-19 in South Africa
    Spatial Statistics

    Speaker: Inger Fabris-Rotelli (in-person)

    Abstract: The virus SARS-CoV-2 has resulted in numerous modelling approaches arising rapidly to understand the spread of the disease COVID-19 and to plan for future interventions. Herein, we present an SEIR model with a spatial spread component as well as four infectious compartments to account for the variety of symptom levels and transmission rates. The model takes into account the pattern of spatial vulnerability in South Africa through a vulnerability index that is based on socioeconomic and health susceptibility characteristics. Another spatially relevant factor in this context is the level of mobility throughout the region. The thesis of this study is that without the contextual spatial spread modelling, the heterogeneity in COVID-19 prevalence in the South African setting would not be captured. The model is illustrated on South African COVID-19 case counts and hospitalisations.
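    A minimal nonspatial SEIR skeleton, for orientation only: the talk's model adds a spatial spread component, four infectious compartments and a vulnerability index, none of which appear here, and the parameter values below are purely illustrative.
    ```python
    import numpy as np
    from scipy.integrate import solve_ivp

    # Nonspatial SEIR backbone (the talk adds spatial spread and splits I
    # into four symptom-level compartments). Parameters are illustrative.
    beta, sigma, gamma, N = 0.4, 1 / 5.0, 1 / 10.0, 59_000_000

    def seir(t, y):
        S, E, I, R = y
        dS = -beta * S * I / N
        dE = beta * S * I / N - sigma * E
        dI = sigma * E - gamma * I
        dR = gamma * I
        return [dS, dE, dI, dR]

    y0 = [N - 100, 0, 100, 0]
    sol = solve_ivp(seir, (0, 300), y0, t_eval=np.linspace(0, 300, 301))
    print("Peak infectious count:", int(sol.y[2].max()))
    print("Day of peak:", int(sol.t[sol.y[2].argmax()]))
    ```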
    09:20 - 09:40Modelling representative population mobility for COVID-19 spatial transmission in South Africa
    Spatial Statistics

    Speaker: Arminn Potgieter (online)

    Abstract: The COVID-19 pandemic starting in the first half of 2020 has changed the lives of everyone across the world. Reduced mobility was essential, as it offered the largest possible impact against the spread of the little-understood SARS-CoV-2 virus. To understand the spread, a comprehension of human mobility patterns is needed. The use of mobility data in modelling is thus essential to capture the intrinsic spread through the population. It is necessary to determine to what extent mobility data sources convey the same message of mobility within a region. This paper compares different mobility data sources by constructing spatial weight matrices at a variety of spatial resolutions and further compares the results through hierarchical clustering. We consider four methods for determining connectivity matrices representing mobility between spatial units, taking into account the distance between spatial units as well as spatial covariates. This provides insight for the user into which data provides what type of information and in what situations a particular data source is most useful.
    09:40 - 10:00Spatial variation in the basic reproduction number of COVID-19: A systematic review
    Spatial Statistics

    Speaker: Nada Abelatif (online)

    Abstract: The sudden emergence of the COVID-19 pandemic in early 2020 gave rise to an explosion of epidemiological modelling. The basic reproduction number (R0) quantifies the average number of people that one infectious person will infect in a fully susceptible population. This was an important parameter at the start of the pandemic, as it was used to predict the course of the novel disease and guide decision-making. This systematic review aims to determine how early estimates of R0 for COVID-19 varied across countries in the initial months of the pandemic (January - June 2020) and which estimates were available for Africa. We found that estimates of R0 differed considerably between countries, ranging between 0.48 and 7.2 once outliers were excluded. Although developing countries mostly had fewer available estimates, these were generally lower and less variable. Estimates for Africa were sparse and produced mainly by researchers outside the continent. The spatial variability in estimates of R0 at the start of the pandemic indicates that there was no one-size-fits-all model for the initial spread of COVID-19: this demonstrates the need for modelling at a local level. The sparsity of estimates for developing and particularly African countries, and the lack of peer-reviewed papers providing early R0 estimates from Africa-based researchers, is concerning, as research during the early stages of an outbreak is critical for mitigating disease spread.
    10:00 - 10:20A spatially explicit modelling strategy for Covid-19 predictions and 4th wave risk analysis in Gauteng
    Spatial Statistics

    Speaker: Claudia Dresselhaus (online)

    Abstract: The COVID-19 mass vaccination roll-out plan is designed to allocate vaccinations according to a three-phase strategy that prioritises frontline healthcare workers and the elderly, especially those who are most likely to present with comorbidities. Current studies show overwhelming evidence that vaccinations protect people from severe COVID-19 outcomes. Given this context, the public and policymakers may already be concerned about the timing and severity of the fourth wave of the pandemic locally. Given over 18 months of data on infection outcomes, recent data on vaccination rates and vaccination centre locations, as well as more clarity and data on risk and vulnerability factors, there is an imperative to consider nuanced Susceptible-Exposed-Infectious-Removed (SEIR) models to predict the expected number of people with severe outcomes from infection in the fourth wave of the pandemic. In our work, we aim to incorporate vaccinated individuals into a spatial SEIR model to form a SEIRV model. This model will be used to investigate future waves of the COVID-19 pandemic given the vaccination roll-out in South Africa.
    10:20 - 10:40Multiscale decomposition of spatial lattice data for feature detection
    Spatial Statistics

    Speaker: Rene Stander (online)

    Abstract: The Discrete Pulse Transform (DPT) has only been applied to signals in 1D and 2D on regular lattices. The theory leaves scope for the application on irregular spatial lattice data in 2D, also referred to as areal data in spatial literature. In this paper, we extend the DPT theory for irregular lattice data as well as consider its efficient implementation, the Roadmaker's Pavage, and visualisation. The DPT was derived considering all possible connectivities satisfying the morphological definition of connection. Our implementation allows for any connectivity applicable for regular and irregular lattices. We present the implementation of the Roadmaker's Pavage algorithm on spatial images as well as irregular lattice data with a toy example and illustrate this with two applications. The theory is applied to brain imagery (regular lattice) as well as crime counts (irregular lattice data) for feature detection. Using the multiscale Ht-index as a measure of saliency on the extracted DPT pulses, important features from both the regular and irregular lattice data can be detected.
    10:40 - 11:10Tea
    12:15 - 13:45Poster Session
    Katleho Makatjane
    Backtesting One-Step-Ahead Density Predictions of Value-at-Risk

    Neill Smit
    Model comparison for Bayesian ALT models

    Maphoka Qhobela
    Cluster Analysis of HIV Risk Behaviours in Eswatini following Multiple Imputation
    14:00 - 14:20Covariate construction of nonconvex windows for spatial point patterns
    Spatial Statistics

    Speaker: Kabelo Mahloromela (online)

    Abstract: Window selection for spatial point pattern data is complex. Often, the point pattern window is given a priori. Otherwise, the region is chosen using some objective means reflecting that the window is representative of a larger region. Common approaches used are the smallest rectangular bounding window and convex windows. The chosen window should, however, cover the true domain of the process. Choosing too large a window results in estimation and inference in regions where the possibility of observations has not been confirmed. We propose a new algorithm for selecting a point pattern domain based on spatial covariate information, without the restriction of convexity, allowing for a better fit to the true domain. The proposed algorithm is applied in the setting of rural villages in Tanzania. As a spatial covariate, remotely sensed elevation data is used. The algorithm is able to detect and filter out high-relief areas and steep slopes; observed characteristics that make the occurrence of a household in these regions improbable. A modified kernel smoothed intensity estimate using the Euclidean shortest path distance is proposed to estimate the intensity on the resultant nonconvex window, producing a more representative intensity estimate.
    14:20 - 14:40On the use of Voronoi tessellations for detection of spatial inhomogeneity in regular spatial point patterns
    Spatial Statistics

    Speaker: Christine Kraamwinkel (online)

    Abstract: Much spatial analysis requires the division of the spatial window into equal-sized quadrats. Specifically, tests for homogeneity of spatial point patterns use the counts of points in each quadrat to determine the homogeneity. A choice has to be made on the quadrat size, thereby introducing a hyperparameter that must be chosen appropriately. In this paper, we instead partition the spatial window using Voronoi tessellations. A Voronoi tessellation is the partition of the spatial window into convex polygons, called Voronoi cells, each consisting of all locations of the plane closer to its generating point than to any other generating point. We show that shape measures of the polygons can differentiate between a homogeneous and an inhomogeneous regular spatial point pattern. Four measures of elongation for Voronoi cells were investigated, namely circularity, radial radius, aspect ratio and elongation, through a simulation study and a real data application. The results indicate that the future development of a robust hypothesis test for homogeneity is possible.
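    One of the four measures, circularity (4πA/P², equal to 1 for a circle and smaller for elongated cells), is straightforward to compute for the bounded cells of a tessellation; the near-regular and perturbed patterns below are simulated stand-ins, not the paper's data.
    ```python
    import numpy as np
    from scipy.spatial import Voronoi

    def circularity(poly):
        """Circularity 4*pi*A/P^2 of a polygon given as an (n, 2) array
        of ordered vertices (1 for a circle, smaller when elongated)."""
        x, y = poly[:, 0], poly[:, 1]
        area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
        perim = np.sum(np.hypot(np.diff(x, append=x[0]),
                                np.diff(y, append=y[0])))
        return 4 * np.pi * area / perim ** 2

    def cell_circularities(points):
        vor = Voronoi(points)
        vals = []
        for region_idx in vor.point_region:
            region = vor.regions[region_idx]
            if -1 in region or len(region) == 0:
                continue  # skip unbounded boundary cells
            vals.append(circularity(vor.vertices[region]))
        return np.array(vals)

    rng = np.random.default_rng(11)
    grid = np.array([(i, j) for i in range(15) for j in range(15)], float)
    regular = grid + rng.normal(scale=0.05, size=grid.shape)    # near-regular
    perturbed = grid + rng.normal(scale=0.45, size=grid.shape)  # inhomogeneous

    print("mean circularity, regular:  ",
          cell_circularities(regular).mean().round(3))
    print("mean circularity, perturbed:",
          cell_circularities(perturbed).mean().round(3))
    ```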
    14:40 - 15:00Bayesian Spatial Model for Disease Mapping: Application to HIV Distribution in Ethiopia
    Spatial Statistics

    Speaker: Leta Lencha Gemechu (in-person)

    Abstract: In this study, we applied a Bayesian spatial model to investigate the spatial distribution of HIV in Ethiopia, using district-level aggregated HIV cases obtained from the Demographic and Health Survey data (EDHS, 2016). Both informative and non-informative priors were used. The informative priors for the coefficients were formulated from previous EDHS data, whereas the prior for the spatial random effect is defined in the framework of a penalized complexity prior, where the pooled spatial variance obtained from the previous two surveys is used as an upper bound or limit on the sampling variance. Our proposed model was compared to commonly used models using observed and simulated data. Among the models considered, the modified Besag, York and Mollié (BYM2) model with an informative prior fitted the data best; therefore, the investigation of factors affecting the spatial association of HIV prevalence was based on this model. The results show that the likelihood of being infected by HIV varies across clusters and regional location. For instance, high clusters of HIV cases were observed in the Gambella region, Harari, Addis Ababa, and borderline districts located in the Tigray and Amhara regional states. In addition to regional variation, significant differences were seen with respect to place of residence (urban or rural) and gender; individuals in urban areas and women are highly affected by the HIV burden, with ratios of about 7 to 1 and 6 to 1 (compared to rural dwellers and men) respectively. Overall, our study revealed a strong spatial disparity in the HIV distribution at different geographical levels. Furthermore, women living in urban areas are the most affected social group.
    15:00 - 15:30Tea
    Stream 4 (Lecture A221)
    09:00 - 09:20Bayesian quantile regression analysis for stroke predictors in South Africa using MCMC method
    Bayesian Statistics

    Speaker: Lyness Matizirofa (online)

    Abstract: Background: In South Africa (SA), stroke is the second highest cause of mortality and disability. Yet little is known about modelling modifiable and non-modifiable stroke predictors. Bayesian quantile regression (BQR) can be used for this type of modelling. This paper provides a quantile inference approach through the Bayesian modelling framework. Identification of stroke predictors using appropriate statistical methods can help formulate appropriate health programmes and policies aimed at reducing the stroke burden. Analyses of stroke predictors have, in the main, concentrated on mean regression, yet modelling with quantile regression (QR) is more appropriate than using mean regression, because QR provides the flexibility to analyse the stroke predictors corresponding to quantiles of interest. This study aims to identify and quantify stroke predictors through BQR analysis. Methods: Hospital-based data from 35 730 stroke cases were retrieved from selected private and public hospitals between January 2014 and December 2018. The Markov chain Monte Carlo (MCMC) method is used to obtain posterior distributions of the parameters of interest. The Bayesian approach is compared to the classical approach. Results: Of the 35 730 stroke cases, 22 183 were diabetic. The age groups 55-75 and 76-98 years, female gender and black race had a bigger effect on the stroke distribution at the lower than at the upper quantiles. Diabetes, cholesterol, heart problems and hypertension showed a significant impact on the stroke distribution (p < 0.0001). Conclusions: Modelling stroke predictors using BQR can provide information beneficial for addressing the stroke burden in SA.
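    The classical counterpart against which the Bayesian approach is compared can be sketched with statsmodels' QuantReg; the covariates below are hypothetical stand-ins for the study's clinical variables, and the Bayesian version would replace these point fits with MCMC draws from the posterior.
    ```python
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)

    # Hypothetical stand-ins for stroke-risk covariates (age, diabetes flag)
    n = 2000
    age = rng.uniform(30, 95, n)
    diabetes = rng.integers(0, 2, n)
    severity = 0.05 * age + 0.8 * diabetes + rng.normal(0, 1 + age / 100, n)

    X = sm.add_constant(np.column_stack([age, diabetes]))

    # Classical quantile regression at the lower, middle and upper quantiles
    for q in (0.25, 0.5, 0.75):
        res = sm.QuantReg(severity, X).fit(q=q)
        print(f"tau = {q}: coefficients = {res.params.round(3)}")
    ```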
    09:20 - 09:40Hierarchical Bayesian Spatial Small Area Model for Binary Data Under Spatial Misalignment
    Bayesian Statistics

    Speaker: Kindie Fentahun Muchie (online)

    Abstract: Small area models have become a popular method for producing reliable estimates for small areas. Small area modelling may be carried out via model-assisted approaches within the design-based paradigm or via model-based approaches. A model-assisted design-based inference may be reliable in situations where there are large or medium samples in areas, while if data are sparse, a model-based approach may be a necessity. Model-based Bayesian analysis methods are becoming popular for their ability to combine information from several sources as well as to take account of uncertainties in the analysis and spatial prediction of spatial data. However, things become more complex when the geographic boundaries of interest are misaligned. Some authors have addressed the problem of misalignment under the hierarchical Bayesian approach. In this study, we developed and assessed the performance of a non-trivial extension of an existing hierarchical Bayesian model: a spatial hierarchical Bayesian small area model for a binary response variable under spatial misalignment. The developed model is a fusion model, considering both areal-level and unit-level latent processes. The process models generated from the predictors were used to construct the basis so as to alleviate the well-known problem of collinearity between the true predictor variables and the spatial random process. A simulation study demonstrated that the model has good performance.
    09:40 - 10:00Empowering differential networks using Bayesian analysis
    Bayesian Statistics

    Speaker: Jarod Smith (online)

    Abstract: Differential networks (DN) are important tools for modeling the changes in conditional dependencies between multiple samples. A Bayesian approach for estimating DNs, from the classical viewpoint, is introduced with a computationally efficient threshold selection for graphical model determination. The algorithm separately estimates the precision matrices of the DN using the Bayesian adaptive graphical lasso procedure. Synthetic experiments illustrate that the Bayesian DN performs exceptionally well in numerical accuracy and graphical structure determination in comparison to state-of-the-art methods. The proposed method is applied to South African COVID-19 data to investigate the change in DN structure between various phases of the pandemic.
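    A frequentist sketch of the underlying object (not the authors' Bayesian adaptive graphical lasso): estimate a sparse precision matrix for each phase with scikit-learn's graphical lasso and read the differential network off their difference.
    ```python
    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(9)

    def sample(prec, n):
        """Draw n observations from N(0, prec^{-1})."""
        return rng.multivariate_normal(np.zeros(prec.shape[0]),
                                       np.linalg.inv(prec), n)

    # Two phases whose conditional dependence between variables 0 and 1 changes
    prec_a = np.eye(5); prec_a[0, 1] = prec_a[1, 0] = 0.4
    prec_b = np.eye(5)  # edge (0, 1) absent in the second phase

    theta_a = GraphicalLasso(alpha=0.05).fit(sample(prec_a, 2000)).precision_
    theta_b = GraphicalLasso(alpha=0.05).fit(sample(prec_b, 2000)).precision_

    # The differential network: entries that change between the two phases
    print(np.round(theta_a - theta_b, 2))
    ```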
    10:00 - 10:20Computational considerations for asymmetric angular- and real data as direction and distance in the modelling of animal movement
    Bayesian Statistics

    Speaker: Gopika Ramkilawon (online)

    Abstract: Animal movement is a fundamental part of ecology, and aids in the understanding and modelling of phenomena including population and community structure dynamics. Movement of animals is often characterised by direction (measured on the circle) and distance (measured on the real line), but traditionally employed models do not account for potential asymmetric angular movement. This study focuses on the modelling of circular data in this animal movement setting using previously unconsidered circular distributions which allow for a departure from symmetry. In addition, mixtures of often-considered models for distance are considered, and computational aspects of this joint modelling are highlighted. A general hidden state Markov model is used to incorporate both these essential components when estimating via the EM algorithm, and goodness-of-fit measures verify the validity and viable future consideration of these newly proposed theoretical models within this practical and computational animal movement environment.
    10:20 - 10:40Wind direction prediction of South African windfarms via circular modeling
    Bayesian Statistics

    Speaker: Najmeh Nakhaeirad (in-person)

    Abstract: Wind energy production depends not only on wind speed but also on wind direction. Thus, predicting and estimating the wind direction for sites accurately will enhance measurement of the wind energy potential. One of the major challenges is the uncertain nature of wind direction, which can be represented through probability distributions. Bayesian analysis can improve the modelling of wind direction by using prior knowledge to update the empirical evidence. This must align with the nature of the empirical evidence, namely whether the data are skew and/or multimodal. So far, mixtures of von Mises distributions within the directional statistics domain have been used for modelling wind direction to capture the multimodality present in the data. In this paper, due to the skewed and multimodal patterns of wind direction at the different sites of the locations under study, a mixture of multimodal skewed von Mises distributions is proposed for wind direction. Furthermore, a Bayesian analysis is presented to take into account the uncertainty inherent in the proposed wind direction model. A simulation study is conducted to evaluate the performance of the proposed Bayesian model. The proposed model is fitted to datasets of wind direction from Marion Island and two wind farms in South Africa, showing the superiority of the approach. The posterior predictive distribution is applied to forecast the wind direction on a wind farm. It is concluded that the proposed model offers accurate prediction by means of credible intervals.
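    As the simplest point of departure (a single symmetric, unimodal component, far short of the paper's Bayesian mixture of skewed von Mises distributions), scipy can fit a von Mises distribution to simulated directions by maximum likelihood:
    ```python
    import numpy as np
    from scipy.stats import vonmises

    rng = np.random.default_rng(21)

    # Simulated wind directions (radians) concentrated around 45 degrees;
    # a stand-in for site measurements such as those from Marion Island.
    theta = vonmises.rvs(kappa=3.0, loc=np.pi / 4, size=1000, random_state=rng)

    # Fit a single von Mises by maximum likelihood (scale fixed at 1,
    # as is standard for circular data).
    kappa_hat, loc_hat, _ = vonmises.fit(theta, fscale=1)
    print(f"kappa_hat = {kappa_hat:.2f}, "
          f"mean direction = {np.degrees(loc_hat):.1f} deg")
    ```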
    10:40 - 11:10Tea
    12:15 - 13:45Poster Session
    Ricardo Daniel Marques Salgado
    Differential Networks as Association Change Detection Tools

    Edward James Westraadt
    Variational Autoencoders to Enhance Classification Accuracy in Photovoltaic Fault Detection
    14:00 - 14:20Wheezing phenotypes and early-life determinants in a South African birth cohort study
    Biostatistics

    Speaker: Carlyle Mccready (in-person)

    Abstract: Objective: We aimed to identify underlying latent patterns of wheezing, and associated risk factors, in a South African birth cohort. Methods: Wheezing was longitudinally identified from birth to 5 years for infants from the Drakenstein Child Health Study. Using repeated binary indicators denoting the presence/absence of wheeze, we derived a set of multi-dimensional indicators which incorporates a spells approach to describe the temporal characteristics of wheeze. These indicators were clustered using the Wishart distance matrix and partitioning around medoids (PAM) algorithm to identify homogenous phenotypes of wheeze. Multinomial logistic regression models were used to investigate phenotype specific risk factors. The stability and validity of the underlying latent phenotypes were investigated using a repeated sampling approach. Results: Wheezing was common in 455/950 (48%) children. Four phenotypes were identified: never-wheezing (495, 52%), early-transient wheezing (202, 21%), late-onset wheezing after 1 year (104, 11%), and recurrent wheeze (149, 16%). Early-life lower respiratory tract infection (LRTI) was a strong risk factor associated with all wheezing phenotypes, but most strongly with recurrent wheeze, as was the concomitant presence of the respiratory syncytial virus (RSV), rhino and adeno viruses, which are viral pathogens known to cause respiratory infections in children. Other factors associated with recurrent wheeze were maternal smoking, intimate partner violence, higher socioeconomic class, or male child. Conclusion: Childhood wheezing represents a heterogenous airway disease with specific identifiable phenotypes and associated risk factors in African children. Early life LRTI and environmental factors influence wheezing risk.
    14:20 - 14:40A new fixed point characterisation based test for the Pareto distribution in the presence of random censoring
    Biostatistics

    Speaker: Lethani Ndwandwe (in-person)

    Abstract: We propose a new goodness-of-fit test for the Pareto type I lifetime distribution in the presence of random right censoring. The test is based on a fixed point characterisation, which is a generalisation of the well-known Stein method for the approximation of distributions. The finite sample performance of the new test is evaluated and compared to the modified Cramér-von Mises and Kolmogorov-Smirnov tests for different censoring proportions and a variety of alternative lifetime distributions by means of a limited Monte Carlo study. It is found that the new test is competitive against the two traditional tests for the majority of the alternatives considered.
    14:40 - 15:00Shared component modelling of early childhood anaemia and malaria in four sub-Saharan African countries
    Biostatistics

    Speaker: Danielle Roberts (in-person)

    Abstract: Malaria and anaemia contribute substantially to child morbidity and mortality. Using a child-level shared component model, we sought to jointly model the residual spatial variation in the likelihood of these two correlated diseases, while controlling for individual-level, household-level and environmental characteristics. This shared component model allowed the district-level spatial effect to be partitioned into a shared and disease-specific spatial component. The results indicated that the spatial variation in the likelihood of malaria was more prominent compared to that of anaemia, for both the shared and specific spatial components. In addition, multiple districts associated with an increased likelihood of anaemia but a decreased likelihood of malaria were identified. This suggests that there are other drivers of anaemia in children in these districts, which warrants further investigation. The maps of the shared and disease-specific spatial patterns provide a tool to allow for more targeted action in malaria and anaemia control and prevention, as well as for the targeted allocation of limited district health system resources.
    15:00 - 15:30Tea
  • Keynote / Stream 1 (Endler)
    09:00 - 10:00Plenary Session
    Speaker: Prof. Jonathan Crook

    Title: Stress testing behavioural and macroeconomic risks in credit portfolios
    10:00 - 10:20Marginalized Two-part Joint Models for Generalized Gamma Family of Distributions
    Biostatistics

    Speaker: Mohadeseh Shojaei (in-person)

    Abstract: Positive continuous outcomes with a substantial number of zero values and incomplete longitudinal follow-up are quite common in medical cost data. To jointly model semi-continuous longitudinal data and survival data, and to provide marginalized covariate effect estimates, marginalized two-part joint models (MTJM) have been developed for outcomes with lognormal distributions. In this paper, we propose MTJM models for outcomes from the generalized gamma (GG) family of distributions. The GG distribution constitutes an extensive family that contains nearly all of the most commonly used distributions, including the gamma, exponential, Weibull and lognormal. In the proposed MTJM-GG model, the conditional mean from a two-part model with a three-parameter GG distribution is parameterized to provide a marginal interpretation for the regression coefficients. MTJM-gamma and MTJM-Weibull models are developed as special cases of the MTJM-GG. To illustrate the applicability of the MTJM-GG, we applied the model to a set of real electronic health record data recently collected in Iran, and we provide SAS code for the implementation. The simulation results show that when the response distribution is unknown or mis-specified, which is usually the case in real data sets, the MTJM-GG is preferable to other models. The advantage of using the GG family of distributions is that it facilitates estimating a model with improved fit over the standard Weibull or lognormal distributions.
    10:20 - 10:40Identification of Latent Growth Classes in a South African Birth Cohort study
    Biostatistics

    Speaker: Noëlle van Biljon (in-person)

    Abstract: Numerous methods are available to model and analyse longitudinal growth data. Conventionally, such growth modelling methods focus on the analysis of average longitudinal trends or identify those belonging to groups of abnormal growth based on standardised z-scores, in addition to investigating potential predictors of abnormal growth. Latent Class Mixed Modelling (LCMM) allows identification of groups of subjects that follow similar longitudinal trends, be they normal or abnormal, based on a combination of linear mixed-effects, structural equation and multinomial logistic modelling. Here LCMM was used to identify underlying latent profiles of growth for height, weight, head circumference (HC), mid-upper arm circumference (MUAC), triceps skin fold thickness (TRI), body mass index (BMI) and weight for height (WFH) measurements taken from birth until the age of five years for a sample of 1143 children from the Drakenstein Child Health Study (DCHS). Subsequently, three classes of growth within height ($n_1$=42, $n_2$=664, $n_3$=425), weight ($n_1$=606, $n_2$=455, $n_3$=72), HC ($n_1$=684, $n_2$=404, $n_3$=42), MUAC ($n_1$=58, $n_2$=241, $n_3$=710), BMI ($n_1$=673, $n_2$=185, $n_3$=273) and WFH ($n_1$=203, $n_2$=778, $n_3$=93), each with distinct trajectories over childhood, were identified and validated. With the identification of these classes, a better understanding of distinct childhood growth trajectories and their predictors may be gained, informing interventions to promote optimal childhood growth.
    10:40 - 11:00Latent Class Joint Models for Longitudinal and Survival Data: an alternative to influence diagnostics for shared parameter joint models
    Biostatistics

    Speaker: Isaac Singini (in-person)

    Abstract: Joint models for longitudinal and survival data are a class of models that jointly analyse an outcome repeatedly observed over time, such as a biomarker, and associated event times. There are two main classes of these models, namely shared parameter and latent class joint models. The main difference between these two modelling frameworks is that latent class joint models make no assumption about the association between the time-dependent covariate(s) and the risk for an event, while shared parameter joint models do not explicitly handle heterogeneity in the population. These models are useful in two practical applications: firstly, focusing on the survival outcome whilst accounting for time-varying covariates measured with error, and secondly, focusing on the longitudinal outcome while controlling for informative censoring. Interest in the estimation of these joint models has grown in the past two and a half decades, with minimal effort directed towards developing influence diagnostics. In this study we compared Cook's statistics for detecting influential subjects to the classes identified by the latent class joint model, which in effect classifies influential subjects through population heterogeneity. This approach was illustrated using data from a multi-centre clinical trial on TB pericarditis. The data confirmed our hypothesis that latent class joint models can be used as an alternative diagnostic to identify influential subjects in shared parameter joint models for longitudinal and survival data. This is done by classifying heterogeneous classes using a latent variable.
    11:00 - 11:20Tea
    11:20 - 11:40A new double sampling scheme to monitor the process mean of autocorrelated observations using an AR(1) model with a skip sampling strategy
    General

    Speaker: Sandile Shongwe

    Abstract: There is a lot of academic research on statistical process monitoring schemes that assume that sequential observations are independent and identically distributed (iid); however, in industrial processes, sequential data tend to exhibit serial correlation (i.e. autocorrelation). Implementing monitoring schemes designed for iid observations when in fact the data are sampled from an autocorrelated process yields misleading results. In this paper, we propose a side-sensitive double sampling (SSDS) scheme to monitor the mean of autocorrelated observations using a first-order autoregressive model. In order to reduce the negative effect of serial dependence, a sampling strategy that involves sampling non-neighbouring observations (i.e., skipping s observations before sampling) is incorporated into the computation of the probability values of the run-length distribution and the charting limits. The main finding of this study is that the proposed s-skip SSDS scheme yields a run-length distribution that has uniformly better average run-length (ARL) and expected ARL values as compared to the existing non-side-sensitive double sampling scheme and other well-established Shewhart-type schemes (i.e. runs-rules and synthetic schemes) for autocorrelated observations. A real-life example of a yoghurt cup filling process is used to illustrate how the proposed monitoring scheme is implemented.
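    The rationale for skipping is that, for an AR(1) process with parameter phi, observations s+1 steps apart have correlation phi^(s+1), so even a modest skip sharply weakens the dependence. A quick simulation (illustrative values only) confirms this:
    ```python
    import numpy as np

    rng = np.random.default_rng(13)

    # Simulate an AR(1) process X_t = phi * X_{t-1} + e_t
    phi, n = 0.8, 200_000
    x = np.empty(n)
    x[0] = rng.normal()
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()

    def lag1_corr(series):
        return np.corrcoef(series[:-1], series[1:])[0, 1]

    # Correlation between neighbouring sampled points after skipping s
    # observations is phi**(s+1), which decays quickly.
    for s in (0, 1, 2, 4):
        sampled = x[:: s + 1]
        print(f"s = {s}: empirical = {lag1_corr(sampled):.3f}, "
              f"theoretical = {phi ** (s + 1):.3f}")
    ```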
    11:40 - 12:00On an omnibus test for the parametric Cox proportional hazards model
    General

    Speaker: Jaco Visagie (in-person)

    Abstract: We propose an omnibus test of fit for the parametric Cox proportional hazards model in the presence of random right censoring. The proposed test results from a modification of an existing test for the uniform distribution. This test is demonstrated to be able to detect deviations from the hypothesised model in two cases: first, when the baseline distribution is misspecified, and second, when the regression component of the model is misspecified. Two modified classical tests are considered, and a Monte Carlo study shows that the newly proposed test outperforms these tests for the majority of alternatives included. As a result of independent interest, we outline the procedure required to use the newly modified test in the framework of independent and identically distributed random variables.
    12:20 - 13:20Plenary Session
    Speaker: Prof. Saralees Nadarajah

    Title: The drastic Under-representation of African Researchers in Africa-related Research
    13:20 - 13:50Closing Ceremony
    Stream 2 (Jannasch)
    10:00 - 10:20Construction and analysis of experimental designs with three factors each having both fixed and random levels
    Biostatistics, Biometry & Experimental design

    Speaker: Lyson Chaka (online)

    Abstract: In experimental designs involving analysis of variance (ANOVA), knowledge of whether effects are fixed, random or mixed is of paramount importance for the modelling process and the interpretation of results. Classification of a factor as a fixed or random effect depends on how a researcher selects the factor levels and on the desired inference space. Increasing productivity and efficiency in the fourth industrial revolution era calls for the development of new strategies and technologies that either replace or improve the old and existing ones. This leads to a shift in the way fixed and random effects are conceptualised and structured as different factors. In this paper, we present concepts and methods for designing and analysing experiments with three factors, each consisting of both fixed and random levels. Consideration is given to two design structures, namely the completely randomized design (CRD) and the randomized complete block design (RCBD). The proposed approach allows for the drawing of both narrow and broad inference for the same factor in a three-way treatment structure.
    10:20 - 10:40Identifying rare cell types among large and diverse populations of immune cells with precision and robustness
    Biostatistics, Biometry & Experimental design

    Speaker: Miguel Rodo (online)

    Abstract: High-dimensional analysis of immune responses informs drug and vaccine design for combating diseases, such as tuberculosis and COVID-19. We consider the problem of identifying and characterising small cell subsets within a diverse population of immune cells based on increased protein production in response to a pathogen. Classifying cells as responding is a difficult problem, as downstream analyses depend on sensitive detection for statistical power, but classification needs to be highly specific as responding cells are typically very infrequent. This can be compounded by measurement batch effects between experiments, and protein distributions overlapping between responding and non-responding cells. Traditionally, classification has been performed by manually drawing lines, but this is time-consuming, computationally irreproducible and prone to subjectivity and lapses in judgment. Previous automated methods use uninterpretable tuning parameters requiring precise settings, do not allow for overlap between responding and non-responding cells and/or assume no batch effects. We propose a simple Empirical Bayes approach, based on the two-groups model. It achieves strong concordance with manually clustered samples, in terms of cell classification and downstream clinical results, in significantly less time. Simulation shows that it is robust to batch effects, performs equivalently across a range of interpretable tuning parameters, and overcomes overlap in the measurement distribution between responding and non-responding cells. Finally, we apply it to a new high-dimensional immunological dataset and discover a novel cell subset associated with tuberculosis disease development. In conclusion, this approach is faster than manual classification, and out-performs existing automated methods.
    10:40 - 11:00Application of Semiparametric Model in modelling Diabetic Retinopathy Among Type II Diabetic Patients at Black Lion Specialized Hospital Addis Ababa, Ethiopia
    Biostatistics, Biometry & Experimental design

    Speaker: Bezalem Eshetu Yirdaw (in-person)

    Abstract: The proportion of patients with diabetic retinopathy has grown with the increasing number of diabetes mellitus patients in the world. It is among the top risk factors for blindness worldwide, especially for those living in developing countries. The main objective of this study was to identify contributing risk factors for diabetic retinopathy among Type II diabetic patients. A sample of 192 patients was selected using systematic random sampling from the Black Lion Specialized Hospital diabetic unit from 1 March 2021 to 1 April 2021. A multivariate stochastic regression imputation technique was applied to impute the missing values. The response variable, diabetic retinopathy, is a categorical variable with two outcomes. Plots from the univariate analysis showed that duration of diabetes and haemoglobin A1C have a nonlinear relationship with diabetic retinopathy. Therefore, we proposed a semiparametric model, in particular using spline smoothing, to analyse the diabetic retinopathy data efficiently. In the multivariate analysis, the statistical tests indicated that the spline effects of duration of diabetes and haemoglobin A1C are significant, but the spline effect of cholesterol level was nonsignificant. The model was refitted considering a linear cholesterol level effect. The results revealed that the clinical variables of a Type II diabetic patient are strong predictors of diabetic retinopathy. Hence, health care workers should be cautious about the possible effects and complications of diabetes mellitus which can be caused by the clinical variables.
    11:00 - 11:20Tea
    11:20 - 11:40Modelling students' experiences of learning statistics in a threshold concepts-enriched tutorial programme
    Educational Statistics

    Speaker: Anisha Ananth (online)

    Abstract: Scholarship on the factors that affect students' learning in statistics has relied mainly on quantitative methodologies. As such, qualitative nuances as they relate to student learning remain relatively unexplored. To address this lacuna, this study applied a qualitative approach using a case study design to explore students' experiences of their learning in a threshold concepts-enriched statistics tutorial programme. Threshold concepts theory (Meyer & Land, 2003) and statistics pedagogy literature informed the tutorial programme activities. The larger part of the data was generated and initially analysed using Interactive Qualitative Analysis (IQA), which comprises two stages: focus groups and interviews. In the focus group phase, participants generated a view of learning on the programme at group level, and the affinities (themes) identified by the focus group were arranged into a Systems Influence Diagram (SID) depicting the group's conception of their learning. The semi-structured individual interviews added depth to the focus group data as participants elaborated on their personal experiences with regard to each affinity. The findings reflect the duality of the cognitive and affective shifts students experienced on their pedagogical pilgrimage - a metaphor used to describe the experiences and processes of students' learning in the threshold concepts-enriched tutorial programme. These findings are broadly consistent with the threshold concepts framework in highlighting that learning has strongly affective aspects entwined with the cognitive and, once mastered, threshold concepts have transformative effects; and that disciplinary learning has implications for students' worldview and identity. The study has distinct implications for introductory statistics programme design and pedagogy.
    11:40 - 12:00Automatic Generation of Online Statistics Assessments Using R-exams
    Educational Statistics

    Speaker: Thomas Farrar (online)

    Abstract: Online modes of assessment in higher education have been pushed to the forefront by the COVID-19 pandemic. Online tests have certain advantages over "pen-and-paper" tests in areas such as authenticity and "assessment for learning," but also bring challenges in areas such as integrity. e-Learning platforms (LMSs) have considerable functionality for creating online tests but also significant limitations. For instance, there is limited functionality for randomisation of questions (an important bulwark for integrity). Furthermore, it is labour-intensive to add rich content (tables, figures, mathematical expressions) and manually produce memos within LMS interfaces. The exams package in R statistical software harnesses the computing power of R to generate versatile assessments that can then be imported into the LMS for deployment. This presentation will provide an overview of the functionality of the exams package and the implementation workflow. Examples will be provided of different assessment tasks that can be automated, including not only data analysis and visualisation tasks but also mathematical statistics questions where the objective is a mathematical expression or a proof. The presentation will focus primarily on Blackboard, the LMS software used by the author's institution. However, to make the presentation as widely accessible as possible, implementation in all other LMS software used at South African universities (Moodle, Sakai, Canvas, and Brightspace) will also be discussed. Results of a student feedback survey regarding student experience of these assessments will be shared.
    12:00 - 12:20Entry-level statistics supervisor development in South Africa
    Educational Statistics

    Speaker: Inger Fabris-Rotelli and Danielle Roberts (in-person)

    Abstract: In 2020 a group of 8 novice, and near-novice, doctoral supervisors in academic Statistical sciences in South Africa initiated the use of the portfolio developed under this project with their new and current doctoral students. Biannual meetings, as well as virtual meetings every week, have documented the feedback, hurdles and successes of the portfolio. We present our discussions and suggestions for future work in this talk, including publications, research dissemination at conferences and similar, as well as continued mentorship.
    Stream 3 (Lecture A214)
    10:00 - 10:20Portfolios and interviews: Turning our students into lifelong learners
    General, Theoretical & Educational Statistics

    Speaker: Michael von Maltitz (in-person)

    Abstract: Over the past two years I have experimented with a form of teaching and assessment that is considered novel (and possibly risky) in the statistics education field. This system is based on Fink’s taxonomy and completely authentic learning, aiming to turn students into curious, lifelong learners. I use realistic assessments to eliminate the text-to-test (or cram-and-forget) mentality, to encourage deeper learning, and to pass on critical life skills. I will introduce the pedagogical evolution that led me to adopt these methods, the practical implementation of the system, the advantages and disadvantages of these methods in statistical education, and the insight gained from experimenting with this process so far.
    10:20 - 10:40A noncentral Lindley construction illustrated in an INAR(1) environment
    General, Theoretical & Educational Statistics

    Speaker: Ané Van Der Merwe (in-person)

    Abstract: This study proposes a previously unconsidered generalization of the Lindley distribution by allowing for a measure of noncentrality (ncL). Essential structural properties are investigated and derived in explicit and tractable forms, and the estimability of the model is illustrated via real data. This distribution is then used as a candidate for the rate parameter of the Poisson distribution, which allows for departure from the usual equidispersion restriction of the Poisson distribution when modeling count data. This Poisson-noncentral Lindley (PncL) distribution is also systematically investigated and its characteristics are derived. The impact of this model is illustrated in both a simulation study and real data, by implementing the PncL model as the count error model in an integer autoregressive (INAR) model. The effect of the systematically induced noncentrality parameter is illustrated and paves the way for future flexible modeling, not only as a stand-alone contender in Lindley-type scenarios (as the ncL) but also in discrete time series scenarios (as the PncL) when the often-assumed equidispersion assumption does not hold in practical data environments.
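    The INAR(1) skeleton into which the PncL innovation distribution is slotted can be sketched with binomial thinning; plain Poisson innovations are used below as a placeholder, since sampling from the PncL is part of the paper itself.
    ```python
    import numpy as np

    rng = np.random.default_rng(17)

    def simulate_inar1(n, alpha, innov_sampler):
        """INAR(1): X_t = (binomial thinning of X_{t-1} with prob. alpha)
        plus an innovation count drawn from innov_sampler."""
        x = np.empty(n, dtype=int)
        x[0] = innov_sampler()
        for t in range(1, n):
            survivors = rng.binomial(x[t - 1], alpha)
            x[t] = survivors + innov_sampler()
        return x

    # Plain Poisson innovations as a placeholder; the paper replaces this
    # sampler with the Poisson-noncentral Lindley (PncL) to allow
    # overdispersion relative to the Poisson.
    x = simulate_inar1(5000, alpha=0.5, innov_sampler=lambda: rng.poisson(2.0))

    print("mean:", x.mean().round(2), " variance:", x.var().round(2))
    # For Poisson innovations the INAR(1) is equidispersed (mean == variance).
    ```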
    10:40 - 11:00Predictive modelling of Covid-19 new cases and deaths in South Africa
    General, Theoretical & Educational Statistics

    Speaker: Kajingulu Malandala (in-person)

    Abstract: The coronavirus pandemic has already affected more than 8.4 million people on the African continent, leading to an estimated 214 000 deaths. South Africa (SA) is one of the most affected countries on the continent, with the highest number of cases. The literature related to COVID-19 is limited and has focused on modelling and prediction of the disease in the early stages of the outbreak. In the current study, we propose Generalized Additive Models for Location, Scale and Shape (GAMLSS) to predict the number of COVID-19 cases and fatalities in SA. GAMLSS models are extensions of the generalized linear models (GLM) and the generalized additive models (GAM) in which the location, scale and shape parameters are modelled as linear, nonlinear or smooth functions of the covariates. The results suggest that the GAMLSS approach is flexible and allows us to produce reliable estimates of the variance at each point in time and of the distribution of expected values in the future.
    11:00 - 11:20Tea
    11:20 - 11:40A stochastic network model for estimating population mobility between areal units in an irregular lattice
    Spatial Statistics & Big data

    Speaker: Renate Thiede (online)

    Abstract: Modelling population mobility is essential for many applications, including urban planning, contact tracing and access to facilities. In particular, it is relevant to model how people move between discrete spatial units, such as municipalities or provinces, modelled in spatial statistics as irregular lattices. Complex road networks are well suited to model mobility, however, modelling the road network of a region as a whole is computationally expensive. This paper models the movement of people between spatial units, particularly electoral wards, in a representative, computationally feasible manner. Mobility is modelled as a Markov chain, with the wards as states. To simplify the road network, Louvain clustering is used to select representative nodes as entry points into the network in each ward. One-step transition probabilities are obtained by calculating the probability of moving from one of the representative nodes in a ward into any of the representative points in a spatially adjacent ward. For each ward, we obtain a matrix of probabilities of moving from that ward to any other ward in the study area. Only transitions into spatially adjacent neighbours may have a positive probability, while transitions to non-adjacent wards will have a probability of zero, ensuring that the one-step mobility matrices are sparse. To obtain the probabilities of journeys that cross multiple wards, we multiply the relevant sparse one-step transition matrices. This provides a computationally simple approach to model population mobility, resulting in mobility matrices that can be used as input in spatial epidemiological models, accessibility analyses and other spatial models.
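    The computational core can be sketched on a toy lattice: a sparse one-step transition matrix with positive probabilities only between adjacent wards, with multi-step mobility obtained from products of such matrices. The five-wards-in-a-line adjacency below is made up for illustration.
    ```python
    import numpy as np
    from scipy.sparse import csr_matrix

    # Toy example: 5 wards in a line, so ward i is adjacent to i-1 and i+1.
    # One-step transition probabilities are positive only for spatially
    # adjacent wards (plus staying put), keeping the matrix sparse.
    rows, cols, vals = [], [], []
    n = 5
    for i in range(n):
        neighbours = [j for j in (i - 1, i, i + 1) if 0 <= j < n]
        p = 1.0 / len(neighbours)      # uniform over self + neighbours
        for j in neighbours:
            rows.append(i); cols.append(j); vals.append(p)

    P = csr_matrix((vals, (rows, cols)), shape=(n, n))

    # Probabilities of journeys crossing multiple wards come from products
    # of the sparse one-step matrices: three steps here.
    P3 = (P @ P @ P).toarray()
    print(np.round(P3, 3))
    print("rows sum to 1:", np.allclose(P3.sum(axis=1), 1.0))
    ```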
    11:40 - 12:00Penalized feature selection in model-based clustering
    Spatial Statistics & Big Data

    Speaker: Luandrie Potgieter (online)

    Abstract: Cluster analysis is a popular unsupervised statistical method used to group observations into clusters. Clustering helps to identify latent patterns and groupings in data, which aids the understanding of natural phenomena. The data-driven society we live in today has made high-dimensional data quite ubiquitous, and hence noise variables are unavoidable. Making use of all available variables when modeling can lead to over-parameterization. In addition, high-dimensional data opens the door to the curse of dimensionality. Thus, performing variable selection improves the model's fit and eases the interpretation of the clustering results. In this presentation, we perform variable selection through penalized model-based clustering. Specifically, an appropriate penalty is chosen to penalize the log-likelihood, and the EM algorithm is used to maximize the penalized log-likelihood.
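    As a sketch of how a penalty can drive variable selection inside EM-based clustering, the following Python example runs a simplified spherical Gaussian mixture EM with an L1 (soft-thresholding) penalty on the cluster means, in the spirit of penalized model-based clustering. The specific penalty, spherical covariance and update formulas are simplifying assumptions, not necessarily those used in the presentation.

    ```python
    import numpy as np

    def soft_threshold(x, lam):
        return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

    def penalized_gmm_em(X, K, lam, n_iter=100, seed=0):
        # Simplified EM for a spherical Gaussian mixture with an L1 penalty
        # on the cluster means: with standardized data, a mean shrunk to
        # zero in every cluster marks that variable as uninformative.
        rng = np.random.default_rng(seed)
        n, p = X.shape
        mu = X[rng.choice(n, K, replace=False)]
        pi = np.full(K, 1.0 / K)
        sigma2 = 1.0
        for _ in range(n_iter):
            # E-step: posterior responsibilities under current parameters.
            d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
            logw = np.log(pi) - d2 / (2.0 * sigma2)
            logw -= logw.max(axis=1, keepdims=True)
            r = np.exp(logw)
            r /= r.sum(axis=1, keepdims=True)
            # M-step: mixture weights, soft-thresholded means (the penalty
            # enters here) and a pooled spherical variance.
            nk = r.sum(axis=0)
            pi = nk / n
            mu_raw = (r.T @ X) / nk[:, None]
            mu = soft_threshold(mu_raw, lam * sigma2 / nk[:, None])
            sigma2 = (r * d2).sum() / (n * p)
        return pi, mu, r

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 10))
    X[:150, 0] += 3.0  # only variable 0 separates the two groups
    Xs = (X - X.mean(0)) / X.std(0)
    pi, mu, r = penalized_gmm_em(Xs, K=2, lam=30.0)
    print(np.abs(mu).round(2))  # noise-variable means shrink to zero
    ```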
    Stream 4 (Lecture A221)
    10:00 - 10:20Failure rate monitoring in generalized gamma-distributed processes
    Applied & Official Statistics

    Speaker: Niladri Chakraborty (online)

    Abstract: With technological advancements, high-quality process monitoring has gained significant importance in industry. Nowadays, most high-performing manufacturing processes produce a large number of conforming items with only a few nonconforming items. Monitoring of the time between events is a well-known approach for real-time monitoring of these highly efficient processes. Usually, it is assumed that the time between events follows an exponential or gamma distribution. However, the generalized gamma distribution is one of the most popular choices for modeling skewed data. Monitoring of skewed processes poses a challenge in designing an unbiased monitoring scheme, in which the signal probability should exceed the size (the nominal false-alarm rate) for all shifts. In this work, we consider a two-sided monitoring scheme based on the generalized gamma distribution. This provides a one-stop solution to the two-sided monitoring of many skewed distributions. We also propose a generalized analytical solution to the unbiased design of a skewed monitoring scheme. An extensive numerical study shows encouraging performance properties. A couple of practical applications, in connection with monitoring renewable energy and coal mine explosions, are discussed.
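    To make the bias issue concrete, the sketch below computes equal-tailed two-sided probability limits for a generalized gamma time-between-events model with scipy and evaluates the signal probability under scale shifts; with a skewed distribution the equal-tail split can be biased, which is what the unbiased design discussed in the talk corrects. Parameter values are illustrative.

    ```python
    from scipy.stats import gengamma

    # In-control generalized gamma model for the time between events
    # (shape parameters a, c and a scale; values here are illustrative).
    a, c, scale = 2.0, 1.5, 1.0
    alpha = 0.0027  # nominal false-alarm rate, as in a 3-sigma-type chart

    # Equal-tailed two-sided probability limits. For skewed distributions
    # this split is generally biased: the signal probability can dip below
    # alpha for some shifts, which an unbiased design corrects by
    # reallocating alpha between the two tails.
    lcl = gengamma.ppf(alpha / 2, a, c, scale=scale)
    ucl = gengamma.ppf(1 - alpha / 2, a, c, scale=scale)
    print(f"LCL = {lcl:.4f}, UCL = {ucl:.4f}")

    # Signal probability under a scale shift tau (a signal occurs when a
    # time between events falls outside [LCL, UCL]).
    def signal_prob(tau):
        return (gengamma.cdf(lcl, a, c, scale=scale * tau)
                + gengamma.sf(ucl, a, c, scale=scale * tau))

    print(signal_prob(1.0), signal_prob(0.8), signal_prob(1.2))
    ```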
    10:20 - 10:40Monitoring multivariate profiles using the quadruple exponentially weighted moving average scheme with fixed and random explanatory variables
    Applied & Official Statistics

    Speaker: Jean-Claude Malela-Majika (in-person)

    Abstract: When a quality process is characterised by a functional relationship between a dependent variable and one or several explanatory variables, classical monitoring schemes become inappropriate and unresponsive. In this case, profile (or regression) monitoring schemes are recommended. This paper proposes new quadruple EWMA (QEWMA) schemes for monitoring linear profile data with fixed and random explanatory variables. In zero-state, the proposed schemes are found to be more responsive over a large range of shifts in the regression parameters and error variance, while in steady-state, the EWMA scheme is more responsive to different shifts than the other memory-type schemes considered in this study. Real-life data are used to demonstrate the application and implementation of the newly proposed schemes.
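    A minimal sketch of the quadruple-EWMA construction for a univariate sequence: the EWMA recursion applied four times in cascade. In the talk the statistic is built on linear-profile regression estimates rather than raw observations, so this is only meant to show the recursion itself.

    ```python
    import numpy as np

    def qewma(x, lam, z0=0.0):
        """Quadruple EWMA: the EWMA recursion applied four times in
        cascade, each stage smoothing the output of the previous one.
        This is the construction behind double/triple/quadruple EWMA."""
        z = np.asarray(x, dtype=float)
        for _ in range(4):
            out = np.empty_like(z)
            prev = z0
            for t, zt in enumerate(z):
                prev = lam * zt + (1 - lam) * prev
                out[t] = prev
            z = out
        return z

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    x[100:] += 0.5  # small sustained shift after t = 100
    stat = qewma(x, lam=0.1)
    print(stat[95:105].round(3))  # the statistic drifts upward after the shift
    ```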
    10:40 - 11:00Challenges for using Administrative Data for the compilation of Financial Statistics
    Applied & Official Statistics

    Speaker: Sagaren Pillay (in-person)

    Abstract: There is a growing trend in developed countries to use administrative data to produce official statistics. The demand for timeous and immediately available data by users is also starting to grow in emerging economies. This demand creates an opportunity to save costs in statistical production and reduce response burden, but presents numerous technical, capacity and methodological challenges for Statistics South Africa (Stats SA). The use of administrative data for producing statistics has numerous advantages over sample survey data. The relatively low collection costs, lower burden placed on respondents and better coverage make the usage of administrative data very viable. Statistical agencies can, in many instances, obtain administrative data from various sources at virtually no cost. Further, the risks associated with the usage of data obtained from administrative sources are often minimal or manageable. In this study, comparisons are made between data from the Annual Financial Statistics (AFS) survey and two administrative sources. The first part deals with an analysis of the time series on turnover from the AFS survey and VAT turnover data from the South African Revenue Service (SARS). The second part is a multiple case study of data from businesses in the AFS survey linked to data from businesses in the Companies and Intellectual Property Commission (CIPC) database.
    11:00 - 11:20Tea
    11:20 - 11:40Climate change detection and attribution: A Bayesian hierarchical approach
    General

    Speaker: Jason Pillay (online)

    Abstract: While climate change has various ways of presenting itself, the variables of interest typically take the form of temperature change. However, the predictor variables used to detect and attribute climate change include time and/or space, and do not use the available knowledge of climate conditions under a given forcing scenario. In this paper, we show how this knowledge can provide a more practical view of the detection and attribution of climate change on a global scale. We assume a linear regression model with temperature change as the dependent variable and forcing-dependent temperature change as the predictors. We quantify uncertainty in the model parameter estimates using a Bayesian approach, discuss the assumptions of the chosen distributions and use Bayesian inference to obtain a posterior distribution of the model's parameters. We then apply the methodology to global air temperature data and to a controlled sample to verify the method. We discuss and evaluate the results and highlight limitations and points for further exploration.
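    A minimal sketch of the kind of Bayesian regression described above: observed temperature change regressed on assumed forcing-dependent responses, with a conjugate Gaussian prior giving a closed-form posterior for the scaling coefficients. The fingerprints, noise level and prior are all illustrative assumptions.

    ```python
    import numpy as np

    # Hypothetical detection-and-attribution regression: observed temperature
    # change y regressed on model-simulated responses to individual forcings
    # (e.g. greenhouse gases, aerosols). A scaling factor whose posterior
    # concentrates near 1 and excludes 0 supports detection of that forcing.
    rng = np.random.default_rng(0)
    n = 200
    X = np.column_stack([
        np.linspace(0, 1.2, n),            # assumed GHG-forced response
        -0.3 * np.linspace(0, 1, n) ** 2,  # assumed aerosol-forced response
    ])
    y = X @ np.array([1.0, 1.0]) + rng.normal(0, 0.1, n)

    # Conjugate Bayesian linear regression with known noise variance sigma2
    # and a N(0, tau2 I) prior on the scaling factors.
    sigma2, tau2 = 0.1 ** 2, 10.0
    prec_post = X.T @ X / sigma2 + np.eye(2) / tau2
    cov_post = np.linalg.inv(prec_post)
    mean_post = cov_post @ (X.T @ y) / sigma2
    print("posterior means:", mean_post.round(3))
    print("posterior sds:  ", np.sqrt(np.diag(cov_post)).round(3))
    ```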
    11:40 - 12:00Modelling the spread of COVID-19 in South Africa using stratified compartmental models in the period March 2020 - August 2020
    General

    Speaker: Elona Mbayise (online)

    Abstract: The novel coronavirus strain (SARS-CoV-2) first appeared in Wuhan, China in December 2019 and caused the respiratory syndrome COVID-19. A unique feature of COVID-19 is its non-uniform effect on populations. The effects of COVID-19 are more severe amongst the elderly and people with co-morbidities, as seen in the higher mortality, infection and hospitalisation rates observed in these groups. This study models the spread of COVID-19 in South Africa from March to August 2020 using stratified compartmental models to capture the population heterogeneity. An age- and co-morbidity-stratified compartmental model was built with additional compartments to capture the unique dynamics of COVID-19. A sensitivity analysis was performed to determine the model's sensitivity to start date and lockdown level, to determine the optimal start date, and to identify the effects of harsh lockdown restrictions on infections and hospitalisations. A parameter sensitivity analysis was also conducted to determine the parameters that needed to be re-estimated to improve model accuracy and to identify the age groups driving infections, hospitalisations and deaths. These analyses showed that a prolonged harsh lockdown would have reduced infections by approximately 50% and delayed the infection peak by approximately 4 months. The analyses also showed that hospitalisations were driven by the 61-75 age group, while infections and deaths were driven by the 76-90 age group. In addition, the model was most sensitive to infection duration, death rate and the proportion of asymptomatic infections. These parameters were re-estimated to better capture the age- and co-morbidity-dependent dynamics of COVID-19.
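    As a rough sketch of an age-stratified compartmental model of this kind, the Python example below integrates a two-group SEIR system with a contact matrix using scipy; the contact rates, populations and epidemiological parameters are illustrative placeholders, not the study's calibrated values.

    ```python
    import numpy as np
    from scipy.integrate import odeint

    # Minimal two-age-group SEIR sketch of a stratified compartmental model.
    C = np.array([[8.0, 3.0],   # contacts: young->young, young->old
                  [3.0, 4.0]])  # contacts: old->young,  old->old
    N = np.array([4e7, 2e7])    # group populations (illustrative)
    beta, incubation, infectious = 0.03, 5.0, 7.0
    sigma, gamma = 1 / incubation, 1 / infectious

    def deriv(y, t):
        S, E, I, R = y.reshape(4, 2)
        foi = beta * (C @ (I / N))        # force of infection per group
        dS = -foi * S
        dE = foi * S - sigma * E
        dI = sigma * E - gamma * I
        dR = gamma * I
        return np.concatenate([dS, dE, dI, dR])

    I0 = np.array([100.0, 10.0])
    y0 = np.concatenate([N - I0, np.zeros(2), I0, np.zeros(2)])
    t = np.linspace(0, 180, 181)   # roughly a March-August horizon, in days
    sol = odeint(deriv, y0, t)
    peak_day = t[sol[:, 4:6].sum(axis=1).argmax()]
    print(f"infection peak around day {peak_day:.0f}")
    ```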