Absenteeism at Work

CONCEPT

Absenteeism, defined as any failure to report for or remain at work as scheduled, has a pervasive impact on workplace efficiency and financial stability. Each year, employers in the U.S. lose $225.8 billion in productivity to absenteeism, as reported by the Centers for Disease Control and Prevention (CDC). In Canada, absenteeism accounts for 15% to 20% of total direct and indirect payroll expenses.

This project applies data-driven methods to understand and address absenteeism. Linear regression, k-nearest neighbors (KNN), and decision tree modeling (CART) are used to explore patterns in absenteeism and the factors that contribute to it. The aim is not only to predict absenteeism but also to give employers actionable insights for proactive intervention. The work combines exploratory data analysis with model evaluation, using root mean squared error (RMSE) to compare models. The project is included in this portfolio as an example of using data science for practical problem-solving in workforce management, with a focus on improving productivity.

DATA

The dataset contains absenteeism records from a Brazilian courier company, spanning July 2007 to July 2010. Its attributes are of binary and integer types and can be recoded or regrouped to suit the research question. Absence reasons fall into 21 categories attested by the International Classification of Diseases (ICD), ranging from infectious diseases to congenital malformations, plus 7 categories without an ICD code, such as patient follow-up and blood donation. Other key attributes include transportation expense, service time, education level, social habits, and health-related metrics such as body mass index. The dataset may contain missing values, so careful preprocessing is needed before modeling.

APPROACH

The analysis begins with exploratory data analysis (EDA) to uncover patterns and relationships within the dataset. Descriptive statistics summarize each key feature and give an overview of the dataset's structure. Correlation analysis then identifies interdependencies among variables and points to potential predictors of absenteeism.
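As a rough illustration of these two EDA steps, the sketch below computes descriptive statistics for each column and a Pearson correlation between two of them, using only the Python standard library. The column names (`distance_km`, `absent_hours`) and all values are hypothetical toy data, not records from the dataset, and the helper names `describe` and `pearson` are introduced here for illustration only.

```python
import math
import statistics

# Toy values for illustration only; they are NOT taken from the dataset.
records = {
    "distance_km": [5, 12, 20, 26, 36, 51],
    "absent_hours": [2, 3, 4, 8, 8, 16],
}


def describe(values):
    """Return basic descriptive statistics for one numeric column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


summary = {name: describe(col) for name, col in records.items()}
r = pearson(records["distance_km"], records["absent_hours"])

print(summary["absent_hours"])
print(f"correlation(distance_km, absent_hours) = {r:.3f}")
```

A coefficient near +1 or -1 would flag a variable as a candidate predictor. On the real data the same calculation would be repeated for every pair of numeric attributes.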

ALGORITHM

This project pairs two models: Multiple Linear Regression (MLR) and k-Nearest Neighbors (KNN). MLR, a classic regression technique, captures linear relationships between the predictors and the target variable, absenteeism time in hours. KNN is a non-parametric method that predicts from similarity. Because the target here is numeric, KNN is used in its regression form: the prediction for a record is the average of the target values of its k nearest neighbors, rather than a majority class vote.
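For a numeric target such as hours of absence, KNN predicts by averaging the target values of the nearest training records. The sketch below shows that idea together with an RMSE calculation, in plain Python. It is a minimal illustration under stated assumptions, not the project's actual pipeline: the two features (age, distance), every value, and the helper names `knn_predict` and `rmse` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Predict a numeric target as the mean of the k nearest training targets."""
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy data: (age, distance) -> hours of absence.
train_X = [(28, 5), (33, 12), (38, 20), (41, 26), (47, 36), (50, 51)]
train_y = [2, 3, 4, 8, 8, 16]
test_X = [(30, 8), (45, 30)]
test_y = [3, 8]

preds = [knn_predict(train_X, train_y, q, k=3) for q in test_X]
score = rmse(test_y, preds)

print(preds)
print(f"RMSE = {score:.3f}")
```

In practice the features would usually be scaled before computing distances, so that no single attribute dominates, and k would be chosen by comparing RMSE on held-out data.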

For tree-based analysis, the project uses regression trees built with the Classification and Regression Trees (CART) algorithm. CART evaluates candidate splits on each variable and recursively partitions the dataset, so that each leaf predicts absenteeism hours for a group of similar records. Pruning, using the Minimum Error and Best Pruned tree criteria, reduces model complexity to improve predictive accuracy on unseen data. Together, these models balance interpretability and predictive power.
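The core step of a regression tree, choosing one split, can be sketched as follows. For a single predictor, the code tries a threshold between each pair of adjacent values and keeps the one that minimizes the total sum of squared errors (SSE) of the two resulting groups. This is a simplified, one-variable illustration rather than a full CART implementation: there is no recursion and no pruning, and the data and the helper names `sse` and `best_split` are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, cost) minimising the combined SSE of both children."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between identical predictor values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Hypothetical toy data: commuting distance vs. hours of absence.
distance = [5, 12, 20, 26, 36, 51]
hours = [2, 3, 4, 8, 8, 16]

threshold, cost = best_split(distance, hours)
print(f"split at distance <= {threshold}, total SSE = {cost}")
```

A full tree repeats this search over all predictors within each resulting group. Pruning then removes branches that do not reduce error on validation data.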