Raw data
The raw crash data used in this study were obtained from the HSIS29 and Google Maps30. Data from the HSIS, sourced from multiple systems, encompasses a variety of formats, including categorical, numerical, and textual. In total, four main datasets were used:
-
Crash data. This dataset captures the essential spatio-temporal and contextual attributes of each crash. It includes crash date, time, day of the week, and month, along with location details such as route number, milepost, and the surrounding area’s classification (e.g., rural or urban). Higher-level planning attributes (e.g., roadway and functional classifications, intersection-related indicators) are also recorded. In addition, it documents the dynamic circumstances leading up to the event, including the number of vehicles and pedestrians involved, vehicle travel directions (increasing or decreasing milepost), and any maneuvers performed (e.g., lane changes, straight-line movement).
-
Infrastructure data. This dataset details the physical and infrastructural features of the crash site. Key elements include the type of road surface (e.g., asphalt or concrete), average annual daily traffic (AADT), posted speed limits, and access control mechanisms. It also encompasses dimensions such as total road width, right and left shoulder widths, and median width (including median barriers if present), as well as road surface conditions (e.g., dry or wet) and ambient lighting at the time of the crash (e.g., daylight or dusk).
-
Vehicle data. This dataset consolidates information on the vehicles involved in each crash, including vehicle type (e.g., passenger car or truck), intended use (e.g., commercial or private), mechanical condition (e.g., defects), and relevant driver actions (e.g., lane changes or stopping). Additional information on airbag deployment and occupant ejection status provides further granularity.
-
Person data. This dataset compiles information about individuals involved in the crash, detailing demographic characteristics such as age, gender, and seating position. It also includes the use of safety equipment (e.g., seat belts or helmets) and any contributing factors, such as driver distraction or impairment.
The satellite images obtained from Google Maps serve as a supplementary data source to complement the HSIS dataset. Overall, we collected 16,188 crash events from Washington State and 42,715 events from Illinois State for further analysis. We also collected and processed 2250 events from Maine, 2250 from Ohio, and 2802 from North Carolina to evaluate the model’s cross-region, training-free generalization.
SafeTraffic Event dataset construction
To adapt the raw data to the LLM fine-tuning process, we employ a feature engineering and textualization pipeline to generate textual inputs (see Fig. 7). The following steps convert each raw data entry into a textual prompt:
-
Data mapping and organization. For each crash, we associated the crash report with the involved vehicles and individuals using the crash ID, thus obtaining descriptions of the crash and the persons involved. The route ID and milepost were used to identify the specific road segment where the crash occurred, allowing us to gather related road and environment information from infrastructure data. The integrated data was then systematically organized into four categories: general information, infrastructure information, event information, and unit information, aligning with the components outlined above.
Fig. 7: The construction procedure of SafeTraffic Event dataset. a Data processing. Four raw datasets from HSIS (crash, infrastructure, vehicle, and person data) are used to construct a prompt through four steps. (1) Data mapping and organization: Link the datasets and organize them into four parts: general, infrastructure, event, and unit. (2) Satellite image textualization: Retrieve satellite images via GPS coordinates using the Google Maps API, then employ GPT-4o to extract text-based information. (3) Dimensionality reduction: Combine targets with similar values using GPT-4o. (4) Prompt generation: Use the processed data from the previous steps to generate a prompt for each part. b AI-expert textualization. An example of the infrastructure information part of an event case in the Washington dataset is shown.
-
Satellite image textualization. The HSIS datasets provide GPS coordinates for crash locations in Washington and Illinois. To address missing information, such as the number of road lanes, high-resolution satellite images (512 × 512 pixels at a zoom level of 19) were retrieved using these GPS coordinates via the Google Maps API. These images supplement the crash dataset with crucial infrastructure and environmental context. Descriptive textual annotations were generated from the satellite images using GPT-4, filling key gaps in the original dataset. These annotations include information such as the number of lanes at the crash site, whether the crash occurred at an intersection, and whether the surrounding area is residential. Image-related information enhances the model’s performance; see Fig. 8 for details.
Fig. 8: Analysis of the impacts of visual-textual information integration in SafeTraffic LLM. a Examples of prompt modifications with image-derived information removed. b Performance comparison for expected crash prediction on Num. of Injury, Severity, and Crash Type prediction tasks on the Illinois and Washington datasets, using T + I (Text + Image) and T (Text-only) input modalities. c Average contribution of image-derived information at the inference stage. d Contribution of the image-derived paragraph at the training stage. The central line represents the median; the box spans from the 25th to 75th percentiles; whiskers extend to 1.5 × IQR. Source data are provided as a Source data file.
-
Dimensionality reduction. Raw data include abundant attributes with rich and varied descriptions. However, some features suffer from insufficient distinction between attribute values due to the original classification’s complexity. To address this, we performed dimensionality reduction on these attributes by combining domain experts’ insights with GPT-4o clustering results. For example, similar classifications like “pedalcyclist struck by vehicle” and “pedalcyclist strikes vehicle” were clustered under a broader category such as “pedalcyclist collisions.” This process generalized the data and reduced redundancy. See Supplementary Table 6 for detailed information.
-
Prompt generation using an AI-expert textualization method. To generate logically coherent and continuous textual data suitable for LLM training, we transform each category of data into text format using GPT-4o12. All data are organized as key-value pairs, yielding four groups of key-value pairs for each event case. GPT-4o is then used to generate a text prompt for each group individually. For each part, we apply a straightforward prompt to GPT-4o, such as “Please translate a python dictionary to paragraph, act as a crash data interpreter.” The text content extracted from GPT-4o’s response for each part consists of approximately 100 words. By concatenating the four parts, we obtain a comprehensive textual description of each crash event case; the detailed process is shown in Fig. 7b, and a minimal sketch appears after this list. For the Maine, Ohio, and North Carolina datasets, the prompts were constructed by filling values into the Illinois template when the features matched; unmatched features were set to None in the Illinois template.
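As an illustration of the textualization step, the following minimal sketch shows how one key-value group could be converted to a paragraph with GPT-4o through the OpenAI API; the dictionary keys and client configuration are illustrative assumptions, not the exact HSIS schema or our production pipeline.

```python
# Minimal sketch of the AI-expert textualization step, assuming the OpenAI
# Python client; the dictionary keys are illustrative, not the exact HSIS schema.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One of the four key-value groups of an event case (hypothetical values).
infrastructure_part = {
    "surface_type": "asphalt",
    "aadt": 24000,
    "speed_limit_mph": 60,
    "road_surface_condition": "dry",
    "lighting": "daylight",
}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Please translate a python dictionary to paragraph, "
                    "act as a crash data interpreter."},
        {"role": "user", "content": str(infrastructure_part)},
    ],
)
# The returned paragraph (~100 words) becomes one part of the event prompt.
paragraph = response.choices[0].message.content
```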
We select three variables as the prediction targets: Injury, Severity, and Type. The three targets are defined as follows:
-
The Injury \({n}_{i}^{{{{\mathcal{D}}}}}\in \{ \, f(l)| l=0,1,2,\cdots \,\}\), where i denotes the ith record in the dataset, \({{{\mathcal{D}}}}\in \{{{{\mathcal{W}}}},{{{\mathcal{I}}}}\}\) denotes the Washington dataset \({{{\mathcal{W}}}}\) or the Illinois dataset \({{{\mathcal{I}}}}\), l represents the number of people injured, and f(l) denotes the label when the number of injured people is l.
-
The Severity \({s}_{i}^{{{{\mathcal{D}}}}}\in \{{S}_{k}| k=1,2,\cdots \,\}\), where Sk is the kth level of crash severity.
-
The Type \({t}_{i}^{{{{\mathcal{D}}}}}\in \{{T}_{k}^{{{{\mathcal{D}}}}}| k=1,2,\cdots \,\}\), where \({T}_{k}^{{{{\mathcal{D}}}}}\) is the kth label of crash type in dataset \({{{\mathcal{D}}}}\).
We utilize these three variables to describe the crash result \({{{{\rm{CR}}}}}_{i}^{{{{\mathcal{D}}}}}\). The crash outcome can be presented in the following format: \({{{{\rm{CR}}}}}_{i}^{{{{\mathcal{D}}}}}=({n}_{i}^{{{{\mathcal{D}}}}},{s}_{i}^{{{{\mathcal{D}}}}},{t}_{i}^{{{{\mathcal{D}}}}})\). For the numerical target, the function f(l) maps the number of people injured in a crash to a label as follows: “zero” if l = 0, “one” if l = 1, “two” if l = 2, and “three and more than three” if l ≥ 3. The values for Sk and \({T}_{k}^{{{{\mathcal{D}}}}}\) are provided in Supplementary Table 4 and Supplementary Table 5.
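For concreteness, the label mapping f(l) described above can be written as a small helper function (a sketch that follows the label strings defined in the text):

```python
def f(l: int) -> str:
    """Map an injury count l to its categorical label, as defined above."""
    if l == 0:
        return "zero"
    if l == 1:
        return "one"
    if l == 2:
        return "two"
    return "three and more than three"  # l >= 3
```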
SafeTraffic LLM
We fine-tune SafeTraffic LLM by adapting LLaMA 3.113 to crash prediction tasks, enhancing the LLM’s capabilities in interpreting crash data, identifying critical factors, and conducting feature-attribution analysis to offer insights for crash prevention. This section details the fine-tuning process.
During the fine-tuning of LLMs, a single input consists of three components: the system prompt, the user prompt, and the target prompt. The system prompt introduces the task, for example: “You are a helpful assistant designed to predict the severity of a traffic crash…”. The user prompt comprises the four content parts detailed in “SafeTraffic Event dataset construction” section for each case. The target prompt represents the expected output. Examples of these prompts are shown in Fig. 3 and Supplementary Section 2.3. We tokenize the text inputs using LLaMA 3.1’s tokenizer.
To adapt the LLM as a crash classifier, additional tokens are incorporated into the tokenizer’s vocabulary; the detailed crash attribute categories are listed in Supplementary Table 4 and Supplementary Table 5. Specifically, for predicting the number of injuries in the Washington and Illinois datasets, we introduce four special tokens, one for each injury label f(l) defined above (zero, one, two, and three and more than three); a minimal sketch of this vocabulary extension is shown below.
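The sketch below shows how such label tokens could be added with the Hugging Face transformers library; the token strings and the checkpoint name are illustrative assumptions rather than the exact tokens used in SafeTraffic LLM.

```python
# Sketch of the vocabulary extension, assuming the Hugging Face transformers API;
# the token strings and model identifier are illustrative, not the exact tokens
# used in SafeTraffic LLM.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

injury_tokens = ["<INJ_ZERO>", "<INJ_ONE>", "<INJ_TWO>", "<INJ_THREE_PLUS>"]
tokenizer.add_tokens(injury_tokens, special_tokens=True)

# The embedding matrix and LM head need rows for the new label tokens.
model.resize_token_embeddings(len(tokenizer))
```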
During the fine-tuning phase, the crash prediction task is framed as a next-token generation task. Given an input prompt xi and its prediction target yi, we construct the full prompt as Ti = concat(xi, yi), where concat( ⋅ ) denotes the concatenation operation that appends the target label yi, represented as a special token, to the input xi. The next-token generation process can be described as:
$${p}_{\theta }({T}_{i})= {\prod}_{j =1}^{| {T}_{i}| }{p}_{\theta }({t}_{j}^{(i)}| {t}_{1}^{(i)},\cdots \,,{t}_{j-1}^{(i)}) ,$$
(1)
where Ti is the ith item in the training data, pθ is the LLM, and \({t}_{j}^{(i)}\) denotes the jth token in Ti. The LLM’s parameters are learned by maximizing the likelihood \({p}_{\theta }(T)={\prod }_{i=1}^{N}{p}_{\theta }({T}_{i})\). Both the system prompt and the user prompt are masked out of the loss computation during training, so the loss is computed only on the target tokens. We also used a uniform data sampling strategy during training to facilitate the convergence of SafeTraffic LLM16. Through this process, the model learns to predict the outcome of a traffic crash.
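A minimal sketch of this construction, assuming the Hugging Face convention that label positions set to -100 are ignored by the cross-entropy loss, is given below; the function name is illustrative.

```python
import torch

def build_training_example(tokenizer, system_prompt, user_prompt, target_token):
    """Concatenate prompt and target, masking prompt positions with -100 so the
    loss is computed only on the target label token."""
    prompt_ids = tokenizer(system_prompt + user_prompt,
                           add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(target_token, add_special_tokens=False)["input_ids"]

    input_ids = torch.tensor(prompt_ids + target_ids)
    labels = torch.tensor([-100] * len(prompt_ids) + target_ids)
    return {"input_ids": input_ids, "labels": labels}
```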
Expected crash prediction confidence score calculation
The confidence score is a critical component that links model predictions to interpretability within the SafeTraffic Copilot. The confidence score quantifies the model’s certainty in its prediction for a given input. Since we incorporate target labels as special tokens in the LLM’s vocabulary and fine-tune the model to generate only these tokens as outputs, we define the confidence score based on the predicted token’s probability. Specifically, given a textual input xi and its corresponding label yi, the confidence score C(xi) is defined as:
$$C({x}_{i})={\max }_{{y}_{i}\in {{{\mathcal{Y}}}}}{p}_{\theta }( \, {y}_{i}| {x}_{i})$$
(2)
where \({{{\mathcal{Y}}}}\) denotes the set of all possible labels (e.g., the severity levels such as fatal and serious injury for crash severity prediction). pθ(yi∣xi) is the softmax probability assigned by the model to class yi, computed by applying the softmax function over the logits of the special tokens representing the labels.
For a given threshold t, let Nt denote the number of samples with confidence scores greater than t. Among these, Rt samples are correctly classified. The accuracy at threshold t is then given by
$${{\mbox{Acc}}}_{t}=\frac{{R}_{t}}{{N}_{t}}.$$
(3)
By computing the accuracy at different thresholds t, we can plot the relationship between accuracy Acct and the threshold t, as shown in Fig. 4e, f.
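The sketch below illustrates how the confidence score of Equation (2) and the thresholded accuracy of Equation (3) could be computed, assuming a causal LM in the Hugging Face style and a list of vocabulary ids for the label tokens; all names are illustrative.

```python
import torch

@torch.no_grad()
def confidence_score(model, input_ids, label_token_ids):
    """C(x) = max softmax probability over the label tokens at the next-token position."""
    logits = model(input_ids.unsqueeze(0)).logits[0, -1]           # next-token logits
    probs = torch.softmax(logits[torch.tensor(label_token_ids)], dim=-1)
    best = int(probs.argmax())
    return probs[best].item(), label_token_ids[best]                # (confidence, predicted token id)

def accuracy_at_threshold(confidences, correct, t):
    """Acc_t = R_t / N_t over the samples whose confidence exceeds t."""
    kept = [c for conf, c in zip(confidences, correct) if conf > t]
    return sum(kept) / len(kept) if kept else float("nan")
```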
Hyperparameter settings
In our experiments, we follow LoRA37 to fine-tune LLaMA 3.1 models. Specifically, only the input and output layers are updated directly, while all remaining layers are kept frozen and adapted through LoRA. We use the AdamW optimizer38 with a learning rate of 3e-4 and an effective batch size of 32 (with gradient accumulation over 8 steps). The models are trained on 8 NVIDIA A100 GPUs (80 GB memory each) using DeepSpeed39 for efficient distributed training.
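A minimal sketch of this setup with the peft library is shown below; the LoRA rank, target modules, and checkpoint name are illustrative assumptions, since the exact values are not specified here.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16  # hypothetical checkpoint
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,            # illustrative values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],       # input/output layers updated directly
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# AdamW with lr = 3e-4, as described; the effective batch size of 32 comes from
# gradient accumulation over 8 steps in the (DeepSpeed) training loop.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```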
Data split
We split the Washington and Illinois datasets into training, validation, and test sets in a 7:1.5:1.5 ratio. Since the Washington dataset contains relatively few crash events per year, we utilized as many reports as possible to ensure sufficient training data. However, the data distribution across classes is highly imbalanced. For example, in the crash severity prediction task on the Washington dataset, the ratio #S1/#S5 is nearly 100:1, where #Sk is the number of records with label Sk. This imbalance poses a great challenge for the model’s training and evaluation. During fine-tuning, we therefore used a uniform sampling strategy to train the model on the unbalanced data. Similarly, to facilitate evaluation, we removed most records with crash severity S1 from the validation and test sets. Specifically, after processing, the Washington dataset consisted of 16,188 records, with 11,332 used for training, 2428 for validation, and 2428 for testing; to balance the validation and test sets, we removed 1428 S1 records from each, keeping 1000 records for validation and 1000 for testing. Compared with Washington, Illinois provides more crash records, so we were able to balance all subsets, including the training, validation, and test sets. Ultimately, the Illinois dataset comprised 42,715 records, with 29,307 used for training, 6704 for validation, and 6704 for testing. See Supplementary Section 1.3 for the detailed distribution of each dataset.
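The split and S1 down-sampling described above could be sketched as follows; the DataFrame column name, random seed, and helper name are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_washington(df: pd.DataFrame, severity_col: str = "severity", keep: int = 1000):
    """7 : 1.5 : 1.5 split, then drop S1 records so validation/test keep `keep` rows each."""
    train_df, heldout = train_test_split(df, test_size=0.30, random_state=0)
    val_df, test_df = train_test_split(heldout, test_size=0.50, random_state=0)

    def trim_s1(part: pd.DataFrame) -> pd.DataFrame:
        excess = len(part) - keep
        s1_idx = part.index[part[severity_col] == "S1"]
        return part.drop(s1_idx[:excess]) if excess > 0 else part

    return train_df, trim_s1(val_df), trim_s1(test_df)
```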
Evaluation metrics
To evaluate model performance on these classification tasks, we employ weighted accuracy, precision, and F1-score as metrics. Using the standard notations True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN), the metrics are defined as follows:
-
Accuracy is one of the most commonly used measures of classification performance; it is defined as the ratio of correctly classified samples to the total number of samples:
$${{{\rm{Accuracy}}}}=\frac{TP+TN}{TP+TN+FP+FN}$$
(4)
-
Precision represents the proportion of correctly classified positive samples among all samples predicted as positive, reflecting the reliability of the positive predictions:
$${{{\rm{Precision}}}}=\frac{TP}{TP+FP}$$
(5)
-
F1-score combines precision and recall; it is their harmonic mean, calculated as:
$${{{\rm{F}}}}1-{{{\rm{score}}}}=\frac{2}{{{{{\rm{Precision}}}}}^{-1}+{{{{\rm{Recall}}}}}^{-1}}=2\cdot \left(\frac{{{{\rm{Precision}}}}\cdot {{{\rm{Recall}}}}}{{{{\rm{Precision}}}}+{{{\rm{Recall}}}}}\right)$$
(6)
where Recall = TP/(TP + FN).
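These weighted metrics can be computed, for instance, with scikit-learn; the toy labels below are illustrative only.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score

y_true = ["S1", "S2", "S1", "S3", "S2"]   # toy ground-truth labels
y_pred = ["S1", "S2", "S2", "S3", "S2"]   # toy predictions

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="weighted", zero_division=0)
f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
```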
Adopted baselines
We follow recent studies40 and adopt machine learning models, including XGBoost41, Random Forest (RF)42, Decision Trees (DT)43, Adaptive Boosting (AdaBoost)44, Logistic Regression (LR)45, and Categorical Boosting (CatBoost)46. We also include deep learning models such as BERT47 and TabNet48. In addition, we consider the National Average49, which predicts crash severity distributions using calibrated Severity Distribution Functions. For these models, the Bayesian optimization method (BayesSearchCV) is used to facilitate the identification of optimal hyperparameters, such as max_depth and learning_rate. To ensure a fair comparison across baseline models, we retained the original architecture and design of each model, modifying only the input data format when necessary. Detailed information on hyperparameter settings and input data preprocessing for all baseline models is provided in Supplementary Section 1.2.
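As an illustration, the Bayesian search for one baseline might be configured with skopt’s BayesSearchCV as sketched below; the search space and scoring choice are assumptions rather than the exact settings given in Supplementary Section 1.2.

```python
from skopt import BayesSearchCV
from xgboost import XGBClassifier

search = BayesSearchCV(
    estimator=XGBClassifier(),
    search_spaces={
        "max_depth": (3, 12),                          # integer range
        "learning_rate": (1e-3, 0.3, "log-uniform"),   # real-valued, log scale
        "n_estimators": (100, 1000),
    },
    n_iter=30,
    cv=5,
    scoring="f1_weighted",
)
# search.fit(X_train, y_train)  # X_train / y_train: tabular crash features and labels
```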
The experiments on the North Carolina, Maine, and Ohio datasets are conducted under a zero-shot setting, where the model is fine-tuned only on the Illinois dataset and has never seen data from North Carolina, Maine, and Ohio during training. Traditional machine learning models perform poorly in this context due to their limited ability to adapt. Therefore, to ensure a fair comparison under the same conditions, we introduce two baseline methods:
-
BERT47. Leveraging its pre-training on large corpora, BERT possesses a certain degree of generalization capability. In our experiments, we fine-tune BERT using prompts from the Illinois dataset and evaluate its zero-shot performance on the North Carolina, Maine, and Ohio datasets.
-
CoT50. Chain-of-thought (CoT) reasoning enables language models to perform multi-step inference by generating intermediate reasoning steps before arriving at a final answer. Zhen et al.23 explored the use of CoT for zero-shot crash severity prediction and reported improved performance over standard LLM prompting. Following their approach, we apply CoT prompting to evaluate zero-shot performance on the North Carolina, Maine, and Ohio datasets.
SafeTraffic Attribution
To identify the contribution of each factor to the prediction results, this paper introduces and adapts the concept of Shapley values31. The Shapley value is a concept from cooperative game theory that has been widely adopted in machine learning to interpret model predictions51. It provides a way to fairly allocate the contribution of each feature to the outcome of a predictive model. In essence, the Shapley value quantifies how much each feature contributes to a prediction by considering all possible combinations of features. Formally, the Shapley value φi of a feature (or player) i in a cooperative game is defined as:
$${\varphi }_{i}=\mathop{\sum}_{S\subseteq N\setminus \{i\}}\frac{| S| !(n-| S| -1)!}{n!}\left[v(S\cup \{i\})-v(S)\right],$$
(7)
where N = {1, 2, …, n} is the index set of n features, S is a subset of N, and v(S) is the utility of the subset S, which represents a measurable value, such as accuracy or prediction score, achieved by the model using only the subset S of features.
The Shapley value is utilized in both the training and inference stages in SafeTraffic Copilot. During the training stage, it quantifies the contributions of four primary categories of information: general information, infrastructure information, event information, and unit information. During the inference stage, the Shapley value is applied to assess the contributions of individual sentences to the prediction outcomes.
Feature contributions at the training stage
The Shapley value is utilized to assess the influence of different components of the training set on the model during training. As outlined in “Developing SafeTraffic LLM for predicting crashes” in the “Results” section, the jth prompt Tj in the dataset P is divided into five parts: c0: system prompt (i.e., “You are a helpful assistant designed to predict the severity of a traffic crash…”), c1: general information, c2: infrastructure information, c3: event information, and c4: unit information. We denote Tj(k) as the ck portion of Tj. Given an index set S, we can construct a variant Tj(S) by concatenating the parts indexed by S. For example, if S = {0, 1, 2}, then Tj(S) contains c0, c1, and c2. Formally,
$${T}_{j}(S)={{\mbox{concat}}}_{k\in S}\,{T}_{j}(k),$$
(8)
where concat denotes concatenation. The resulting dataset based on S is P(S) = {Tj(S)∣ j = 0, 1, …, L}, where L is the dataset size.
Referring to Equation (7), the contribution of part ci at training, \({\varphi }_{i}^{\,{\mbox{train}}\,}\), is
$${\varphi }_{i}^{\,{\mbox{train}}\,}={\sum}_{S\subseteq N\setminus \{i\}}\frac{| S| !(n-| S| -1)!}{n!}\cdot \left[v(P(S\cup \{0,i\}))-v(P(S\cup \{0\}))\right],$$
(9)
where N = {1, 2, 3, 4} indexes the four content parts, and v(P(S)) is a performance metric (e.g., accuracy) obtained after retraining the model only on prompts in P(S).
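For reference, the exact computation in Equation (9) over the four content parts can be sketched as follows, where the value function v is assumed to return the metric of a model retrained on prompts restricted to the given parts (with the system prompt c0 always included):

```python
from itertools import combinations
from math import factorial

def shapley_train(v, n: int = 4):
    """Exact Shapley values over n content parts (Eq. 9).
    `v` maps a frozenset S of part indices (subset of {1, ..., n}) to a performance
    metric of a model retrained on the dataset P(S ∪ {0}), i.e., with the system
    prompt always kept."""
    players = list(range(1, n + 1))
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (v(frozenset(S) | {i}) - v(frozenset(S)))
        phi[i] = total
    return phi
```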
Sentence-level feature contributions at the inference stage
Unlike traditional machine learning models that primarily handle fixed-length feature vectors, LLMs process variable-length text sequences as input52. This characteristic makes commonly used Shapley value approximation methods, such as KernelSHAP53 and DeepSHAP, less applicable to LLMs. Recent approaches like TokenSHAP54 and TransSHAP55 have been proposed to address this by decomposing input text into tokens and computing Shapley values at the token level. However, applying token-level Shapley value computation to SafeTraffic LLM introduces two primary challenges: (1) Computational limitations. The computational complexity of Shapley values is exponential in the number of players. In our SafeTraffic LLM, with an input size of approximately 500 tokens, the large-scale computation of token-level Shapley values for crash data becomes impractical. (2) Limited interpretability. Decomposing the prompt at the token level disregards inter-token dependencies, and the arbitrary masking or replacement of tokens can lead to semantic ambiguity and contextual shifts. These issues hinder a precise understanding of how individual features contribute to predictions. Moreover, paragraph-level analysis is too coarse for detailed attribution, since it can merge distinct features into a single category (e.g., driver and vehicle details under “unit information”).
To overcome these limitations, we propose a sentence-level feature contributions calculation method for inputs of LLMs, which proceeds as follows:
-
Sentence segmentation. The prompts are segmented using delimiters (e.g., commas “,” or periods “.”) to produce sentence-level units.
-
Feature group annotation. GPT-4o is used to group and label these sentences (see Fig. 5 for the groups’ content). Each group is represented as ck, where \(k\in {N}^{{\prime} }=\{1,2,3,\ldots,n\}\). For the Washington dataset, n = 14, while for the Illinois dataset n = 12. Given an index set \({S}^{{\prime} }\subseteq {N}^{{\prime} }\setminus \{i\}\), we can construct the prompt \({T}_{j}({S}^{{\prime} })\) analogously to Equation (8).
-
Feature contribution calculation based on the feature groups. Based on the constructed dataset, the feature contribution of the ith sentence group for the jth item in the dataset, \({\varphi }_{i,j}^{\,{\mbox{inf}}\,}\), can be calculated as:
$${\varphi }_{i,j}^{\,{\mbox{inf}}\,}= {\sum}_{{S}^{{\prime} }\subseteq {N}^{{\prime} }\setminus \{i\}}\frac{| {S}^{{\prime} }| !\,(n-| {S}^{{\prime} }| -1)!}{n!}\cdot \left[{p}_{\theta }\left({y}_{j}| {T}_{j}({S}^{{\prime} }\cup \{0,i\})\right)\right. \\ \left. -{p}_{\theta }\left({y}_{j}| {T}_{j}({S}^{{\prime} }\cup \{0\})\right)\right]$$
(10)
where pθ represents the LLM, which returns the predicted probability of the targets yj given the inputs. A higher \({\varphi }_{i,j}^{\,{\mbox{inf}}\,}\) indicates a greater contribution of the ith sentence group to the model’s confidence for predicting yj. To reduce computational overhead, we adopt a stratified sampling–based Shapley estimation method using complementary contributions36.
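As a simplified illustration, the sketch below estimates these sentence-group contributions with plain permutation (Monte Carlo) sampling rather than the stratified complementary-contribution estimator of ref. 36; prob_fn is an assumed wrapper that rebuilds the prompt from a group subset (plus the system prompt) and returns the model’s probability of the true label.

```python
import random

def sentence_group_shapley(prob_fn, n_groups: int, i: int,
                           num_samples: int = 200, seed: int = 0) -> float:
    """Permutation-sampling estimate of the inference-stage Shapley value of
    sentence group i; prob_fn(S) ≈ p_theta(y_j | T_j(S ∪ {0}))."""
    rng = random.Random(seed)
    others = [g for g in range(1, n_groups + 1) if g != i]
    total = 0.0
    for _ in range(num_samples):
        rng.shuffle(others)
        cut = rng.randint(0, len(others))     # size of the coalition preceding i
        S = frozenset(others[:cut])
        total += prob_fn(S | {i}) - prob_fn(S)
    return total / num_samples
```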