Massachusetts Bay Transportation Authority (MBTA) Risk Management

In this graduate-level project for the Modeling for Business class, our group, Husky Consulting Group, collected, cleaned, transformed, and analyzed a confidential dataset from the MBTA. Our objective? Minimizing the risk of accidents for employees and bus routes. Due to an NDA, no sensitive information can be shared. Fortunately, a variety of the data science and programming techniques used for the analysis can be disclosed.

ANALYTICS

4/26/2024 · 5 min read

As part of the Modeling for Business class at Northeastern University, we were approached by the Massachusetts Bay Transportation Authority (MBTA) to help them mitigate risk by reducing the number of bus accidents in Boston. We were given 6 confidential datasets with 50,000+ observations for bus accidents, employee profiles, attrition, routes, and vehicle information. This data was mostly unstructured, with significant data quality issues including missing values and extreme outliers.

Here is some additional business context. The MBTA recorded just over 3,000 incidents and over 47,000 harsh events in the previous year. In a city that relies so heavily on public transportation and often has mobility issues, reducing the number of bus accidents is paramount to improving the quality of service. Furthermore, some of these accidents end in injury and potentially death, so reducing them also improves public safety.

To achieve our goal of mitigating the risk of accidents, we needed a model from which we could extract actionable insights. We cleaned and filtered the data that we thought was pertinent for exploratory data analysis. With the business objective as our principal focus, we started exploring the available data and trying to identify trends. We then enriched our data by making reasoned assumptions and by integrating different datasets on their primary keys. Afterwards, we normalized our data to come up with a generalizable risk metric, so that observations could be compared more reliably. This data was fed into logistic regression and decision tree models to make risk predictions. Finally, after cross-validating, we extracted insights based on the most relevant model features.
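The dataset-integration step can be sketched as follows. This is a minimal illustration with pandas, not the project's actual code: the table names, column names, and values are all hypothetical, since the real MBTA schema is confidential.

```python
import pandas as pd

# Hypothetical stand-in tables; the real MBTA data is under NDA.
employees = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "years_experience": [12, 3, 7],
})
incidents = pd.DataFrame({
    "incident_id": [10, 11, 12],
    "employee_id": [1, 1, 3],
})

# Collapse the incident log to one row per employee before joining.
incident_counts = (
    incidents.groupby("employee_id").size().rename("incidents").reset_index()
)

# Left join keeps employees with no incidents; validate guards against
# accidental row duplication from a non-unique key.
enriched = employees.merge(
    incident_counts, on="employee_id", how="left", validate="one_to_one"
)
enriched["incidents"] = enriched["incidents"].fillna(0).astype(int)
print(enriched)
```

Aggregating before the join keeps one row per employee, which matters later when per-employee rates are computed.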

As part of exploratory data analysis, we used k-means clustering, an unsupervised method, to split employees into two clusters described by hours worked and experience. The clusters show notable differences. Group 0 works almost double the hours annually compared to Group 1, so we can infer that these are, for the most part, full-time versus part-time employees. However, the box-and-whisker plot on the right tells us that average experience does not differ greatly between the two groups. We can formulate a hypothesis that full-time and part-time employees do not vary significantly in experience levels although they work different hours.
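A minimal sketch of this kind of clustering with scikit-learn is shown below. The data is synthetic and the two features (annual hours, years of experience) are assumptions chosen to mirror the description above; the scaling step is also our assumption rather than a documented part of the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 50 "full-time-like" and 50 "part-time-like" rows.
rng = np.random.default_rng(0)
full_time = np.column_stack([rng.normal(2000, 100, 50), rng.normal(10, 3, 50)])
part_time = np.column_stack([rng.normal(1000, 100, 50), rng.normal(10, 3, 50)])
X = np.vstack([full_time, part_time])  # columns: annual_hours, years_experience

# Scale features so hours (thousands) do not dominate experience (tens).
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Summarize each cluster in the original units.
for k in range(2):
    members = X[labels == k]
    print(f"cluster {k}: mean hours={members[:, 0].mean():.0f}, "
          f"mean experience={members[:, 1].mean():.1f}")
```

Note that k-means assigns cluster numbers arbitrarily, so which group ends up as 0 or 1 depends on the run.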

Continuing with the clusters, here we have compared incident and suspension rates for both groups. It was critical to express these risk measures as rates. Rates give us a "standardized" measure, so that we can compare apples to apples regardless of the total number of incidents in each group. In this way we can easily extract valuable information: part-time workers (Cluster 1) have significantly higher incident rates and suspension rates. To be clear, an incident is defined as an accident or a near miss. A suspension is a way of penalizing inappropriate behavior like late arrivals and no-shows.
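To illustrate why rates matter, here is a small sketch with made-up numbers. The denominator (events per 1,000 hours worked) is an assumption for illustration; the column names are hypothetical as well.

```python
import pandas as pd

# Hypothetical toy data; not the MBTA schema or real figures.
employees = pd.DataFrame({
    "cluster":     [0, 0, 0, 1, 1],
    "hours":       [2000, 1900, 2100, 1000, 1000],
    "incidents":   [2, 1, 3, 2, 2],
    "suspensions": [1, 0, 2, 1, 1],
})

totals = employees.groupby("cluster")[["hours", "incidents", "suspensions"]].sum()

# Events per 1,000 hours worked (assumed denominator).
rates = totals[["incidents", "suspensions"]].div(totals["hours"], axis=0) * 1000
print(rates)
```

In this toy example Cluster 0 has more incidents in total (6 versus 4), yet Cluster 1 has twice the incident rate, which is exactly the kind of difference raw counts would hide.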

Interesting hypothesis, but how can we extract action items from this? We will come back to this later.

Before validating the first hypothesis, we went on to create a predictive model. We used dozens of features, such as experience and annual hours worked, to predict whether an employee had a high risk of causing an accident, with 78% accuracy. The predicted probabilities across employees are approximately normally distributed, with the average employee having a 65% chance of belonging to the high-risk group.
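A minimal sketch of such a classifier follows, assuming logistic regression (one of the model types mentioned earlier) and only two synthetic features. The label-generating rule is invented purely so the example runs; it does not reflect the real data or the reported accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500
experience = rng.uniform(0, 30, n)        # years (synthetic)
annual_hours = rng.normal(1500, 400, n)   # hours (synthetic)
X = np.column_stack([experience, annual_hours])

# Invented rule for the synthetic label: more hours and less experience
# raise the chance of being high-risk.
logit = 0.002 * (annual_hours - 1500) - 0.1 * (experience - 15)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)          # share of correct test predictions
proba_high_risk = model.predict_proba(X)[:, 1]  # P(high-risk) per employee
print(f"test accuracy: {accuracy:.2f}")
```

The `proba_high_risk` array is the per-employee probability whose distribution is discussed above.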

From this distribution we created 3 bins:

a high-risk group if the employee has over a 77% chance of being classified as high-risk
a medium-risk group if the employee has between a 53% and 77% chance of being classified as high-risk
a low-risk group if the employee has less than a 53% chance of being classified as high-risk

We set the cut points somewhat arbitrarily at one standard deviation below and above the mean (mean - std dev, mean + std dev). With these bins, the bulk of employees are categorized as medium risk, with an almost equal split of 15% and 17% for low- and high-risk employees, respectively.
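The binning rule can be sketched in a few lines of NumPy. The probabilities here are synthetic: a mean of 0.65 and a standard deviation of 0.12 are chosen only because they reproduce the 53% and 77% cut points quoted above.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for the model's predicted probabilities.
proba = np.clip(rng.normal(0.65, 0.12, 1000), 0, 1)

# Cut one standard deviation below and above the mean.
low_cut = proba.mean() - proba.std()
high_cut = proba.mean() + proba.std()

bins = np.where(proba > high_cut, "high",
                np.where(proba < low_cut, "low", "medium"))

labels, counts = np.unique(bins, return_counts=True)
shares = dict(zip(labels, counts / len(proba)))
print(f"cuts: {low_cut:.2f}, {high_cut:.2f}")
print(shares)
```

For a roughly normal distribution this rule places about two thirds of employees in the medium bin and about one sixth in each tail, which is in line with the 15% and 17% shares reported above.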

Moving on to the bus route dataset, here we have a snippet of the random forest model. We split the data into training and testing sets, fit the model, evaluated it on the test data, and tuned hyperparameters to achieve 82% predictive accuracy. Every route was transformed into its own column when we created dummy variables. Therefore, every route column was assigned a coefficient, indicating how strongly a route is associated with a high risk of accidents.

We plotted the calculated coefficients to get the graph above. We can see 3 clearly distinct peaks: one near -0.4, another near 0.3, and the tallest one at 0. A positive coefficient indicates that the route is positively correlated with the high-risk label, thus making it a high-risk route. Conversely, a negative coefficient indicates that the route is negatively correlated with the high-risk label, thus making it a low-risk route. We also treated a coefficient of 0 as low-risk, since it does not indicate a higher chance of having an accident. With the peak at 0 folded into the low-risk side, we interpreted this distribution as bimodal.
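The dummy-variable and coefficient-sign logic can be sketched as below. This sketch assumes a logistic regression, because signed per-feature coefficients come from linear models (scikit-learn exposes them as `coef_`), whereas tree ensembles report non-negative feature importances instead. The routes, labels, and the small tolerance around zero are all hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical observations: route IDs and labels are made up.
trips = pd.DataFrame({
    "route": ["A"] * 10 + ["B"] * 10 + ["C"] * 10,
    "high_risk": [1] * 8 + [0] * 2 + [1] * 5 + [0] * 5 + [1] * 2 + [0] * 8,
})

# One dummy column per route.
X = pd.get_dummies(trips["route"], prefix="route", dtype=int)
y = trips["high_risk"]

model = LogisticRegression().fit(X, y)
coefs = pd.Series(model.coef_[0], index=X.columns)

# Positive coefficient -> high-risk route; zero or negative -> low-risk.
TOL = 1e-3  # assumed tolerance so numerically tiny coefficients count as zero
route_risk = coefs.apply(lambda c: "high" if c > TOL else "low")
print(coefs.round(3))
print(route_risk)
```

Here route A (80% high-risk observations) gets a positive coefficient, route C (20%) a negative one, and route B (50%) lands near zero, mirroring the three peaks described above.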

From the bimodal route risk distribution, we grouped routes as high-risk and low-risk:

45% of routes are classified as high-risk
55% of routes are classified as low-risk

From our extensive analysis, we classified employees (bus operators) into 3 groups: low, medium, and high-risk. We also classified routes into two bins: low and high-risk.

Our conclusion is that, to reduce risk scores and therefore accident rates, we should:

Pair low-risk operators with the highest-risk routes
Pair high-risk operators with the lowest-risk routes
Pair medium-risk operators with the remaining routes
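One simple way to realize this pairing is to rank operators from safest to riskiest, rank routes from riskiest to safest, and match them in order. The sketch below assumes one operator per route and equal counts, which is a simplification; the operator and route names and scores are hypothetical.

```python
# Hypothetical scores: predicted probability of being high-risk per operator,
# and model coefficient per route.
operators = {"op_1": 0.45, "op_2": 0.81, "op_3": 0.66, "op_4": 0.58}
routes = {"route_A": 0.31, "route_B": -0.42, "route_C": 0.02, "route_D": -0.05}

# Safest operators first, riskiest routes first.
ops_safest_first = sorted(operators, key=operators.get)
routes_riskiest_first = sorted(routes, key=routes.get, reverse=True)

# Pairing in rank order gives low-risk operators the highest-risk routes,
# high-risk operators the lowest-risk routes, and everyone else the middle.
assignments = dict(zip(ops_safest_first, routes_riskiest_first))

for op, route in assignments.items():
    print(f"{op} (risk {operators[op]:.2f}) -> {route} (coef {routes[route]:+.2f})")
```

In practice, scheduling constraints such as shifts, depots, and seniority would restrict which pairings are possible, so this ranking is a starting point rather than a full assignment method.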

Grouping in such a way is a feasible alternative that gives the MBTA flexibility over route assignments. Moreover, our predictive model for employee risk could be used to estimate an incoming employee's risk score. This model could be further improved by adding new features, such as employee driving records or credit scores, if they prove to have predictive power over the risk score.