Raw data
The raw crash data used in this study were obtained from the HSIS29 and Google Maps30. Data from the HSIS, sourced from multiple systems, encompasses a variety of formats, including categorical, numerical, and textual. In total, four main datasets were used:
-
Crash data. This dataset captures the essential spatio-temporal and contextual attributes of each crash. It includes crash date, time, day of the week, and month, along with location details such as route number, milepost, and the surrounding area’s classification (e.g., rural or urban). Higher-level planning attributes (e.g., roadway and functional classifications, intersection-related indicators) are also recorded. In addition, it documents the dynamic circumstances leading up to the event, including the number of vehicles and pedestrians involved, vehicle travel directions (increasing or decreasing milepost), and any maneuvers performed (e.g., lane changes, straight-line movement).
-
Infrastructure data. This dataset details the physical and infrastructural features of the crash site. Key elements include the type of road surface (e.g., asphalt or concrete), average annual daily traffic (AADT), posted speed limits, and access control mechanisms. It also encompasses dimensions such as total road width, right and left shoulder widths, and median width (including median barriers if present), as well as road surface conditions (e.g., dry or wet) and ambient lighting at the time of the crash (e.g., daylight or dusk).
-
Vehicle data. This dataset consolidates information on the vehicles involved in each crash, including vehicle type (e.g., passenger car or truck), intended use (e.g., commercial or private), mechanical condition (e.g., defects), and relevant driver actions (e.g., lane changes or stopping). Additional information on airbag deployment and occupant ejection status provides further granularity.
-
Person data. This dataset compiles information about individuals involved in the crash, detailing demographic characteristics such as age, gender, and seating position. It also includes the use of safety equipment (e.g., seat belts or helmets) and any contributing factors, such as driver distraction or impairment.
The satellite images obtained from Google Maps serve as a supplementary data source to complement the HSIS dataset. Overall, we collected 16,188 crash events from Washington State and 42,715 events from Illinois State for further analysis. We also collected and processed 2250 events from Maine, 2250 from Ohio, and 2802 from North Carolina to evaluate the model’s cross-region, training-free generalization.
SafeTraffic Event dataset construction
To adapt the raw data to the LLM fine-tuning process, we employ a feature engineering and textualization pipeline to generate textual inputs (see Fig. 7). The following steps convert each raw data entry into a textual prompt:
-
Data mapping and organization. For each crash, we associated the crash report with the involved vehicles and individuals using the crash ID, thus obtaining descriptions of the crash and the persons involved. The route ID and milepost were used to identify the specific road segment where the crash occurred, allowing us to gather related road and environment information from infrastructure data. The integrated data was then systematically organized into four categories: general information, infrastructure information, event information, and unit information, aligning with the components outlined above.
Fig. 7: The construction procedure of SafeTraffic Event dataset. a Data processing. Four raw datasets from HSIS (crash, infrastructure, vehicle, and person data) are used to construct a prompt through four steps. (1) Data mapping and organization: Link the datasets and organize them into four parts: general, infrastructure, event, and unit. (2) Satellite image textualization: Retrieve satellite images via GPS coordinates using the Google Maps API, then employ GPT-4o to extract text-based information. (3) Dimensionality reduction: Combine targets with similar values using GPT-4o. (4) Prompt generation: Use the processed data from the previous steps to generate a prompt for each part. b AI-expert textualization. An example of the infrastructure information part of an event case in the Washington dataset is shown.
-
Satellite image textualization. The HSIS datasets provide GPS coordinates for crash locations in Washington and Illinois. To address missing information, such as the number of road lanes, high-resolution satellite images (512 × 512 pixels at a zoom level of 19) were retrieved using these GPS coordinates via the Google Maps API. These images supplement the crash dataset with crucial infrastructure and environmental context. Descriptive textual annotations were generated from the satellite images using GPT-4, filling key gaps in the original dataset. These annotations include information such as the number of lanes at the crash site, whether the crash occurred at an intersection, and whether the surrounding area is residential. Image-related information enhances the model’s performance; see Fig. 8 for details.
Fig. 8: Analysis of the impacts of visual-textual information integration in SafeTraffic LLM. a Examples of prompt modifications with image-derived information removed. b Performance comparison for expected crash prediction on Num. of Injury, Severity, and Crash Type prediction tasks on the Illinois and Washington datasets, using T + I (Text + Image) and T (Text-only) input modalities. c Average contribution of image-derived information at the inference stage. d Contribution of the image-derived paragraph at the training stage. The central line represents the median; the box spans from the 25th to 75th percentiles; whiskers extend to 1.5 × IQR. Source data are provided as a Source data file.
-
Dimensionality reduction. Raw data include abundant attributes with rich and varied descriptions. However, some features suffer from insufficient distinction between attribute values due to the original classification’s complexity. To address this, we performed dimensionality reduction on these attributes by combining domain experts’ insights with GPT-4o clustering results. For example, similar classifications like “pedalcyclist struck by vehicle” and “pedalcyclist strikes vehicle” were clustered under a broader category such as “pedalcyclist collisions.” This process generalized the data and reduced redundancy. See Supplementary Table 6 for detailed information.
-
Prompt generation using an AI-expert textualization method. To generate logically coherent and continuous textual data suitable for LLM training, we transform each category of data into text format using GPT-4o12. All data are organized as key-value pairs, yielding four groups of key-value pairs for each event case. GPT-4o is then used to generate a text prompt for each group individually. For each part, we apply a straightforward prompt to GPT-4o, such as “Please translate a python dictionary to paragraph, act as a crash data interpreter.” The text content extracted from GPT-4o’s response for each part consists of approximately 100 words. By concatenating the four parts, we obtain a comprehensive textual description of each crash event case; the detailed process is shown in Fig. 7b, and a minimal sketch appears after this list. For the Maine, Ohio, and North Carolina datasets, the prompts were constructed by filling values into the Illinois template when the features matched; unmatched features were set to None in the Illinois template.
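As an illustration of the textualization step, the following minimal sketch shows how one key-value group could be converted to a paragraph with GPT-4o through the OpenAI API; the dictionary keys and client configuration are illustrative assumptions, not the exact HSIS schema or our production pipeline.

```python
# Minimal sketch of the AI-expert textualization step, assuming the OpenAI
# Python client; the dictionary keys are illustrative, not the exact HSIS schema.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One of the four key-value groups of an event case (hypothetical values).
infrastructure_part = {
    "surface_type": "asphalt",
    "aadt": 24000,
    "speed_limit_mph": 60,
    "road_surface_condition": "dry",
    "lighting": "daylight",
}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Please translate a python dictionary to paragraph, "
                    "act as a crash data interpreter."},
        {"role": "user", "content": str(infrastructure_part)},
    ],
)
# The returned paragraph (~100 words) becomes one part of the event prompt.
paragraph = response.choices[0].message.content
```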
We select three variables as the prediction targets: Injury, Severity, and Type. The three targets are defined as follows:
-
The Injury \({n}_{i}^{{{{\mathcal{D}}}}}\in \{ \, f(l)| l=0,1,2,\cdots \,\}\), where i denotes the ith record in the dataset, \({{{\mathcal{D}}}}\in \{{{{\mathcal{W}}}},{{{\mathcal{I}}}}\}\) denotes the Washington dataset \({{{\mathcal{W}}}}\) or the Illinois dataset \({{{\mathcal{I}}}}\), l represents the number of people injured, and f(l) denotes the label when the number of injured people is l.
-
The Severity \({s}_{i}^{{{{\mathcal{D}}}}}\in \{{S}_{k}| k=1,2,\cdots \,\}\), where Sk is the kth level of crash severity.
-
The Type \({t}_{i}^{{{{\mathcal{D}}}}}\in \{{T}_{k}^{{{{\mathcal{D}}}}}| k=1,2,\cdots \,\}\), where \({T}_{k}^{{{{\mathcal{D}}}}}\) is the kth label of crash type in dataset \({{{\mathcal{D}}}}\).
We utilize these three variables to describe the crash result \({{{{\rm{CR}}}}}_{i}^{{{{\mathcal{D}}}}}\). The crash outcome can be presented in the following format: \({{{{\rm{CR}}}}}_{i}^{{{{\mathcal{D}}}}}=({n}_{i}^{{{{\mathcal{D}}}}},{s}_{i}^{{{{\mathcal{D}}}}},{t}_{i}^{{{{\mathcal{D}}}}})\). For the numerical target, the function f(l) maps the number of people injured in a crash to a label as follows: “zero” if l = 0, “one” if l = 1, “two” if l = 2, and “three and more than three” if l ≥ 3. The values for Sk and \({T}_{k}^{{{{\mathcal{D}}}}}\) are provided in Supplementary Table 4 and Supplementary Table 5.
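For concreteness, the label mapping f(l) described above can be written as a small helper function (a sketch that follows the label strings defined in the text):

```python
def f(l: int) -> str:
    """Map an injury count l to its categorical label, as defined above."""
    if l == 0:
        return "zero"
    if l == 1:
        return "one"
    if l == 2:
        return "two"
    return "three and more than three"  # l >= 3
```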
SafeTraffic LLM
We fine-tune SafeTraffic LLM by adapting LLaMA 3.113 to crash prediction tasks, enhancing the LLM’s capabilities in interpreting crash data, identifying critical factors, and conducting feature-attribution analysis to offer insights for crash prevention. This section details the fine-tuning process.
During the fine-tuning of LLMs, a single input consists of three components: the system prompt, the user prompt, and the target prompt. The system prompt introduces the task, for example: “You are a helpful assistant designed to predict the severity of a traffic crash…”. The user prompt comprises the four content parts detailed in “SafeTraffic Event dataset construction” section for each case. The target prompt represents the expected output. Examples of these prompts are shown in Fig. 3 and Supplementary Section 2.3. We tokenize the text inputs using LLaMA 3.1’s tokenizer.
To adapt the LLM as a crash classifier, additional tokens are incorporated into the tokenizer’s vocabulary; the detailed crash attribute categories are listed in Supplementary Table 4 and Supplementary Table 5. Specifically, for predicting the number of injuries in the Washington and Illinois datasets, we introduce four special tokens, one for each injury label f(l) defined above (zero, one, two, and three and more than three); a minimal sketch of this vocabulary extension is shown below.
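The sketch below shows how such label tokens could be added with the Hugging Face transformers library; the token strings and the checkpoint name are illustrative assumptions rather than the exact tokens used in SafeTraffic LLM.

```python
# Sketch of the vocabulary extension, assuming the Hugging Face transformers API;
# the token strings and model identifier are illustrative, not the exact tokens
# used in SafeTraffic LLM.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

injury_tokens = ["<INJ_ZERO>", "<INJ_ONE>", "<INJ_TWO>", "<INJ_THREE_PLUS>"]
tokenizer.add_tokens(injury_tokens, special_tokens=True)

# The embedding matrix and LM head need rows for the new label tokens.
model.resize_token_embeddings(len(tokenizer))
```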
During the fine-tuning phase, the crash prediction task is framed as a next-token generation task. Given an input prompt xi and its prediction target yi, we construct the full prompt as Ti = concat(xi, yi), where concat( ⋅ ) denotes the concatenation operation that appends the target label yi, represented as a special token, to the input xi. The next-token generation process can be described as:
$${p}_{\theta }({T}_{i})= {\prod}_{j =1}^{| {T}_{i}| }{p}_{\theta }({t}_{j}^{(i)}| {t}_{1}^{(i)},\cdots \,,{t}_{j-1}^{(i)}) ,$$
(1)
where Ti is the ith item in the training data, pθ is the LLM, and \({t}_{j}^{(i)}\) denotes the jth token in Ti. The LLM’s parameters are learned by maximizing the likelihood \({p}_{\theta }(T)={\prod }_{i=1}^{N}{p}_{\theta }({T}_{i})\). Both the system prompt and the user prompt are masked out of the loss computation during training, so the loss is computed only on the target tokens. We also used a uniform data sampling strategy during training to facilitate the convergence of SafeTraffic LLM16. Through this process, the model learns to predict the outcome of a traffic crash.
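A minimal sketch of this construction, assuming the Hugging Face convention that label positions set to -100 are ignored by the cross-entropy loss, is given below; the function name is illustrative.

```python
import torch

def build_training_example(tokenizer, system_prompt, user_prompt, target_token):
    """Concatenate prompt and target, masking prompt positions with -100 so the
    loss is computed only on the target label token."""
    prompt_ids = tokenizer(system_prompt + user_prompt,
                           add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(target_token, add_special_tokens=False)["input_ids"]

    input_ids = torch.tensor(prompt_ids + target_ids)
    labels = torch.tensor([-100] * len(prompt_ids) + target_ids)
    return {"input_ids": input_ids, "labels": labels}
```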
Expected crash prediction confidence score calculation
The confidence score is a critical component that links model predictions to interpretability within the SafeTraffic Copilot. The confidence score quantifies the model’s certainty in its prediction for a given input. Since we incorporate target labels as special tokens in the LLM’s vocabulary and fine-tune the model to generate only these tokens as outputs, we define the confidence score based on the predicted token’s probability. Specifically, given a textual input xi and its corresponding label yi, the confidence score C(xi) is defined as:
$$C({x}_{i})={\max }_{{y}_{i}\in {{{\mathcal{Y}}}}}{p}_{\theta }( \, {y}_{i}| {x}_{i})$$
(2)
where \({{{\mathcal{Y}}}}\) denotes the set of all possible labels (e.g., the severity levels such as fatal and serious injury for crash severity prediction). pθ(yi∣xi) is the softmax probability assigned by the model to class yi, computed by applying the softmax function over the logits of the special tokens representing the labels.
For a given threshold t, let Nt denote the number of samples with confidence scores greater than t. Among these, Rt samples are correctly classified. The accuracy at threshold t is then given by
$${{\mbox{Acc}}}_{t}=\frac{{R}_{t}}{{N}_{t}}.$$
(3)
By computing the accuracy at different thresholds t, we can plot the relationship between accuracy Acct and the threshold t, as shown in Fig. 4e, f.
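The sketch below illustrates how the confidence score of Equation (2) and the thresholded accuracy of Equation (3) could be computed, assuming a causal LM in the Hugging Face style and a list of vocabulary ids for the label tokens; all names are illustrative.

```python
import torch

@torch.no_grad()
def confidence_score(model, input_ids, label_token_ids):
    """C(x) = max softmax probability over the label tokens at the next-token position."""
    logits = model(input_ids.unsqueeze(0)).logits[0, -1]           # next-token logits
    probs = torch.softmax(logits[torch.tensor(label_token_ids)], dim=-1)
    best = int(probs.argmax())
    return probs[best].item(), label_token_ids[best]                # (confidence, predicted token id)

def accuracy_at_threshold(confidences, correct, t):
    """Acc_t = R_t / N_t over the samples whose confidence exceeds t."""
    kept = [c for conf, c in zip(confidences, correct) if conf > t]
    return sum(kept) / len(kept) if kept else float("nan")
```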
Hyperparameter settings
In our experiments, we follow LoRA37 to fine-tune LLaMA 3.1 models. Specifically, only the input and output layers are updated directly, while all remaining layers are kept frozen and adapted through LoRA. We use the AdamW optimizer38 with a learning rate of 3e-4 and an effective batch size of 32 (with gradient accumulation over 8 steps). The models are trained on 8 NVIDIA A100 GPUs (80 GB memory each) using DeepSpeed39 for efficient distributed training.
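A minimal sketch of this setup with the peft library is shown below; the LoRA rank, target modules, and checkpoint name are illustrative assumptions, since the exact values are not specified here.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16  # hypothetical checkpoint
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,            # illustrative values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],       # input/output layers updated directly
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# AdamW with lr = 3e-4, as described; the effective batch size of 32 comes from
# gradient accumulation over 8 steps in the (DeepSpeed) training loop.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```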
Data split
We split the Washington and Illinois datasets into training, validation, and test sets in a 7:1.5:1.5 ratio. Since the Washington dataset contains relatively few crash events per year, we utilized as many reports as possible to ensure sufficient training data. However, the data distribution across classes is highly imbalanced. For example, in the crash severity prediction task on the Washington dataset, the ratio #S1/#S5 is nearly 100:1, where #Sk is the number of records with label Sk. This imbalance poses a great challenge for the model’s training and evaluation. During fine-tuning, we therefore used a uniform sampling strategy to train the model on the unbalanced data. Similarly, to facilitate evaluation, we removed most records with crash severity S1 from the validation and test sets. Specifically, after processing, the Washington dataset consisted of 16,188 records, with 11,332 used for training, 2428 for validation, and 2428 for testing; to balance the validation and test sets, we removed 1428 S1 records from each, keeping 1000 records for validation and 1000 for testing. Compared with Washington, Illinois provides more crash records, so we were able to balance all subsets, including the training, validation, and test sets. Ultimately, the Illinois dataset comprised 42,715 records, with 29,307 used for training, 6704 for validation, and 6704 for testing. See Supplementary Section 1.3 for the detailed distribution of each dataset.
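The split and S1 down-sampling described above could be sketched as follows; the DataFrame column name, random seed, and helper name are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_washington(df: pd.DataFrame, severity_col: str = "severity", keep: int = 1000):
    """7 : 1.5 : 1.5 split, then drop S1 records so validation/test keep `keep` rows each."""
    train_df, heldout = train_test_split(df, test_size=0.30, random_state=0)
    val_df, test_df = train_test_split(heldout, test_size=0.50, random_state=0)

    def trim_s1(part: pd.DataFrame) -> pd.DataFrame:
        excess = len(part) - keep
        s1_idx = part.index[part[severity_col] == "S1"]
        return part.drop(s1_idx[:excess]) if excess > 0 else part

    return train_df, trim_s1(val_df), trim_s1(test_df)
```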
Evaluation metrics
To evaluate model performance on these classification tasks, we employ weighted accuracy, precision, and F1-score as metrics. Using the standard notations True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN), the metrics are defined as follows:
-
Accuracy is one of the most commonly used measures of classification performance; it is defined as the ratio of correctly classified samples to the total number of samples:
$${{{\rm{Accuracy}}}}=\frac{TP+TN}{TP+TN+FP+FN}$$
(4)
-
Precision represents the proportion of correctly classified positive samples among all samples predicted as positive, reflecting the reliability of the positive predictions:
$${{{\rm{Precision}}}}=\frac{TP}{TP+FP}$$
(5)
-
F1-score combines precision and recall; it is their harmonic mean, calculated as:
$${{{\rm{F}}}}1-{{{\rm{score}}}}=\frac{2}{{{{{\rm{Precision}}}}}^{-1}+{{{{\rm{Recall}}}}}^{-1}}=2\cdot \left(\frac{{{{\rm{Precision}}}}\cdot {{{\rm{Recall}}}}}{{{{\rm{Precision}}}}+{{{\rm{Recall}}}}}\right)$$
(6)
where Recall = TP/(TP + FN).
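These weighted metrics can be computed, for instance, with scikit-learn; the toy labels below are illustrative only.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score

y_true = ["S1", "S2", "S1", "S3", "S2"]   # toy ground-truth labels
y_pred = ["S1", "S2", "S2", "S3", "S2"]   # toy predictions

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="weighted", zero_division=0)
f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
```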
Adopted baselines
We follow recent studies40 and adopt machine learning models, including XGBoost41, Random Forest (RF)42, Decision Trees (DT)43, Adaptive Boosting (AdaBoost)44, Logistic Regression (LR)45, and Categorical Boosting (CatBoost)46. We also include deep learning models such as BERT47 and TabNet48. In addition, we consider the National Average49, which predicts crash severity distributions using calibrated Severity Distribution Functions. For these models, the Bayesian optimization method (BayesSearchCV) is used to facilitate the identification of optimal hyperparameters, such as max_depth and learning_rate. To ensure a fair comparison across baseline models, we retained the original architecture and design of each model, modifying only the input data format when necessary. Detailed information on hyperparameter settings and input data preprocessing for all baseline models is provided in Supplementary Section 1.2.
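As an illustration, the Bayesian search for one baseline might be configured with skopt’s BayesSearchCV as sketched below; the search space and scoring choice are assumptions rather than the exact settings given in Supplementary Section 1.2.

```python
from skopt import BayesSearchCV
from xgboost import XGBClassifier

search = BayesSearchCV(
    estimator=XGBClassifier(),
    search_spaces={
        "max_depth": (3, 12),                          # integer range
        "learning_rate": (1e-3, 0.3, "log-uniform"),   # real-valued, log scale
        "n_estimators": (100, 1000),
    },
    n_iter=30,
    cv=5,
    scoring="f1_weighted",
)
# search.fit(X_train, y_train)  # X_train / y_train: tabular crash features and labels
```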
The experiments on the North Carolina, Maine, and Ohio datasets are conducted under a zero-shot setting, where the model is fine-tuned only on the Illinois dataset and has never seen data from North Carolina, Maine, and Ohio during training. Traditional machine learning models perform poorly in this context due to their limited ability to adapt. Therefore, to ensure a fair comparison under the same conditions, we introduce two baseline methods:
-
BERT47. Leveraging its pre-training on large corpora, BERT possesses a certain degree of generalization capability. In our experiments, we fine-tune BERT using prompts from the Illinois dataset and evaluate its zero-shot performance on the North Carolina, Maine, and Ohio datasets.
-
CoT50. Chain-of-thought (CoT) reasoning enables language models to perform multi-step inference by generating intermediate reasoning steps before arriving at a final answer. Zhen et al.23 explored the use of CoT for zero-shot crash severity prediction and reported improved performance over standard LLM prompting. Following their approach, we apply CoT prompting to evaluate zero-shot performance on the North Carolina, Maine, and Ohio datasets.
SafeTraffic Attribution
To identify the contribution of each factor to the prediction results, this paper introduces and adapts the concept of Shapley values31. The Shapley value is a concept from cooperative game theory that has been widely adopted in machine learning to interpret model predictions51. It provides a way to fairly allocate the contribution of each feature to the outcome of a predictive model. In essence, the Shapley value quantifies how much each feature contributes to a prediction by considering all possible combinations of features. Formally, the Shapley value φi of a feature (or player) i in a cooperative game is defined as:
$${\varphi }_{i}=\mathop{\sum}_{S\subseteq N\setminus \{i\}}\frac{| S| !(n-| S| -1)!}{n!}\left[v(S\cup \{i\})-v(S)\right],$$
(7)
where N = {1, 2, …, n} is the index set of n features, S is a subset of N, and v(S) is the utility of the subset S, which represents a measurable value, such as accuracy or prediction score, achieved by the model using only the subset S of features.
The Shapley value is utilized in both the training and inference stages in SafeTraffic Copilot. During the training stage, it quantifies the contributions of four primary categories of information: general information, infrastructure information, event information, and unit information. During the inference stage, the Shapley value is applied to assess the contributions of individual sentences to the prediction outcomes.
Feature contributions at the training stage
The Shapley value is utilized to assess the influence of different components of the training set on the model during training. As outlined in “Developing SafeTraffic LLM for predicting crashes” in the “Results” section, the jth prompt Tj in the dataset P is divided into five parts: c0: system prompt (i.e., “You are a helpful assistant designed to predict the severity of a traffic crash…”), c1: general information, c2: infrastructure information, c3: event information, and c4: unit information. We denote Tj(k) as the ck portion of Tj. Given an index set S, we can construct a variant Tj(S) by concatenating the parts indexed by S. For example, if S = {0, 1, 2}, then Tj(S) contains c0, c1, and c2. Formally,
$${T}_{j}(S)={{\mbox{concat}}}_{k\in S}\,{T}_{j}(k),$$
(8)
where concat denotes concatenation. The resulting dataset based on S is P(S) = {Tj(S)∣ j = 0, 1, …, L}, where L is the dataset size.
Referring to Equation (7), the contribution of part ci at training, \({\varphi }_{i}^{\,{\mbox{train}}\,}\), is
$${\varphi }_{i}^{\,{\mbox{train}}\,}={\sum}_{S\subseteq N\setminus \{i\}}\frac{| S| !(n-| S| -1)!}{n!}\cdot \left[v(P(S\cup \{0,i\}))-v(P(S\cup \{0\}))\right],$$
(9)
where N = {1, 2, 3, 4} indexes the four content parts, and v(P(S)) is a performance metric (e.g., accuracy) obtained after retraining the model only on prompts in P(S).
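For reference, the exact computation in Equation (9) over the four content parts can be sketched as follows, where the value function v is assumed to return the metric of a model retrained on prompts restricted to the given parts (with the system prompt c0 always included):

```python
from itertools import combinations
from math import factorial

def shapley_train(v, n: int = 4):
    """Exact Shapley values over n content parts (Eq. 9).
    `v` maps a frozenset S of part indices (subset of {1, ..., n}) to a performance
    metric of a model retrained on the dataset P(S ∪ {0}), i.e., with the system
    prompt always kept."""
    players = list(range(1, n + 1))
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (v(frozenset(S) | {i}) - v(frozenset(S)))
        phi[i] = total
    return phi
```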
Sentence-level feature contributions at the inference stage
Unlike traditional machine learning models that primarily handle fixed-length feature vectors, LLMs process variable-length text sequences as input52. This characteristic makes commonly used Shapley value approximation methods, such as KernelSHAP53 and DeepSHAP, less applicable to LLMs. Recent approaches like TokenSHAP54 and TransSHAP55 have been proposed to address this by decomposing input text into tokens and computing Shapley values at the token level. However, applying token-level Shapley value computation to SafeTraffic LLM introduces two primary challenges: (1) Computational limitations. The computational complexity of Shapley values is exponential in the number of players. In our SafeTraffic LLM, with an input size of approximately 500 tokens, the large-scale computation of token-level Shapley values for crash data becomes impractical. (2) Limited interpretability. Decomposing the prompt at the token level disregards inter-token dependencies, and the arbitrary masking or replacement of tokens can lead to semantic ambiguity and contextual shifts. These issues hinder a precise understanding of how individual features contribute to predictions. Moreover, paragraph-level analysis is too coarse for detailed attribution, since it can merge distinct features into a single category (e.g., driver and vehicle details under “unit information”).
To overcome these limitations, we propose a sentence-level feature contributions calculation method for inputs of LLMs, which proceeds as follows:
-
Sentence segmentation. The prompts are segmented using delimiters (e.g., commas “,” or periods “.”) to produce sentence-level units.
-
Feature group annotation. GPT-4o is used to group and label these sentences (see Fig. 5 for the groups’ content). Each group is represented as ck, where \(k\in {N}^{{\prime} }=\{1,2,3,\ldots,n\}\). For the Washington dataset, n = 14, while for the Illinois dataset n = 12. Given an index set \({S}^{{\prime} }\subseteq {N}^{{\prime} }\setminus \{i\}\), we can construct the prompt \({T}_{j}({S}^{{\prime} })\) analogously to Equation (8).
-
Feature contribution calculation based on the feature groups. Based on the constructed dataset, the feature contribution of the ith sentence group for the jth item in the dataset, \({\varphi }_{i,j}^{\,{\mbox{inf}}\,}\), can be calculated as:
$${\varphi }_{i,j}^{\,{\mbox{inf}}\,}= {\sum}_{{S}^{{\prime} }\subseteq {N}^{{\prime} }\setminus \{i\}}\frac{| {S}^{{\prime} }| !\,(n-| {S}^{{\prime} }| -1)!}{n!}\cdot \left[{p}_{\theta }\left({y}_{j}| {T}_{j}({S}^{{\prime} }\cup \{0,i\})\right)\right. \\ \left. -{p}_{\theta }\left({y}_{j}| {T}_{j}({S}^{{\prime} }\cup \{0\})\right)\right]$$
(10)
where pθ represents the LLM, which returns the predicted probability of the targets yj given the inputs. A higher \({\varphi }_{i,j}^{\,{\mbox{inf}}\,}\) indicates a greater contribution of the ith sentence group to the model’s confidence for predicting yj. To reduce computational overhead, we adopt a stratified sampling–based Shapley estimation method using complementary contributions36.
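As a simplified illustration, the sketch below estimates these sentence-group contributions with plain permutation (Monte Carlo) sampling rather than the stratified complementary-contribution estimator of ref. 36; prob_fn is an assumed wrapper that rebuilds the prompt from a group subset (plus the system prompt) and returns the model’s probability of the true label.

```python
import random

def sentence_group_shapley(prob_fn, n_groups: int, i: int,
                           num_samples: int = 200, seed: int = 0) -> float:
    """Permutation-sampling estimate of the inference-stage Shapley value of
    sentence group i; prob_fn(S) ≈ p_theta(y_j | T_j(S ∪ {0}))."""
    rng = random.Random(seed)
    others = [g for g in range(1, n_groups + 1) if g != i]
    total = 0.0
    for _ in range(num_samples):
        rng.shuffle(others)
        cut = rng.randint(0, len(others))     # size of the coalition preceding i
        S = frozenset(others[:cut])
        total += prob_fn(S | {i}) - prob_fn(S)
    return total / num_samples
```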