KDE+

What is the KDE+ method?
The KDE+ method performs cluster analysis of traffic crashes (or other point events) within a network (road, railway...). It extends the kernel density estimation (KDE) by statistical significance testing and allows for the ranking of the resulting significant clusters.

What kind of inputs are needed?
The KDE+ method needs only the following data:

XY position or stationing of traffic crashes (or other point events) on the sections. The crashes (point events) which occur in intersections (junctions) should be excluded.
The network, consisting of sections (traffic volume is assumed to be more or less constant in space within a single network section).

Is road segmentation necessary?
Road segmentation is not needed to apply the KDE+ method. In fact, we discourage segmenting a road before applying the KDE+ method, because it can distort the results; for example, it can split a hotspot. The road network only has to be divided into sections at intersections, where traffic volume changes.

Mitigation is only cost-effective at places where traffic crashes cluster due to a local factor. Other traffic crashes on a road segment represent noise with respect to the spatial pattern of traffic crashes, i.e. they result from global causes (Fig. a). The second example (Fig. b) shows a segmentation of a road section in which all three segments contain the same number (4) of traffic crashes. The segmentation system itself can therefore influence the inputs to a model and thus the final results.
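The effect of segmentation can be reproduced with a small numeric sketch. The crash positions, the 300 m section length and the 100 m segment length below are invented for illustration; they are not taken from the figure.

```python
# Hypothetical crash positions in metres from the start of a 300 m section.
# Four crashes (92, 97, 103, 108) form a tight cluster around the 100 m mark.
crashes = [20, 50, 92, 97, 103, 108, 150, 180, 220, 250, 270, 295]


def counts_per_segment(positions, section_length, segment_length):
    """Count crashes in consecutive fixed-length segments."""
    n_segments = -(-section_length // segment_length)  # ceiling division
    counts = [0] * n_segments
    for p in positions:
        idx = min(int(p // segment_length), n_segments - 1)
        counts[idx] += 1
    return counts


def max_in_window(positions, window):
    """Largest number of crashes found in any window of the given length."""
    pts = sorted(positions)
    best = 0
    for i, start in enumerate(pts):
        n = sum(1 for p in pts[i:] if p - start <= window)
        best = max(best, n)
    return best


segment_counts = counts_per_segment(crashes, 300, 100)
densest_20m = max_in_window(crashes, 20)
print(segment_counts, densest_20m)
```

All three 100 m segments report the same count, although four crashes lie within 20 m of each other: the segment boundary at 100 m cuts the cluster in half.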

Input parameters of the KDE+ analysis in KDE+ toolbox for ArcGIS

Tolerance for snapping points to line features
Only points in the vicinity/neighbourhood of lines are considered in the calculations. The “tolerance for snapping” is the maximum distance from a point to a line defining this vicinity.
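As an illustration of what the tolerance does, the following sketch keeps only the points whose distance to a polyline does not exceed the tolerance. The function names and sample coordinates are hypothetical; this is not the toolbox's own code, which relies on GIS geoprocessing.

```python
import math


def distance_to_segment(p, a, b):
    """Shortest planar distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def points_within_tolerance(points, polyline, tolerance):
    """Keep the points lying no farther than `tolerance` from the polyline."""
    kept = []
    for p in points:
        d = min(distance_to_segment(p, a, b)
                for a, b in zip(polyline, polyline[1:]))
        if d <= tolerance:
            kept.append(p)
    return kept


road = [(0, 0), (100, 0)]                 # hypothetical straight section
events = [(50, 5), (50, 30), (130, 0)]    # hypothetical crash locations
snapped = points_within_tolerance(events, road, tolerance=20)
print(snapped)
```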

Bandwidth for KDE+ analysis
A smoothing parameter of the kernel density estimation. For traffic crashes, values from 50 m to 500 m are generally recommended. Lower values should be used for urban networks, while higher values are recommended for highways and motorways because of their higher speed limits. The default value is 100 m; 50 m is recommended for line (road) segments in cities and 150 m for highways or motorways.
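The influence of the bandwidth can be seen in a minimal one-dimensional kernel density sketch. The Epanechnikov kernel used here is an assumption made for illustration (the text above does not name the kernel), and the crash positions are invented.

```python
def epanechnikov(u):
    """Epanechnikov kernel (assumed here; other kernels are possible)."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def kde_at(x, positions, bandwidth):
    """Kernel density estimate at position x along a section."""
    total = sum(epanechnikov((x - p) / bandwidth) for p in positions)
    return total / (len(positions) * bandwidth)


# Hypothetical crash positions (metres along the section).
positions = [480, 500, 520, 900]

narrow = kde_at(500, positions, bandwidth=50)   # urban-style bandwidth
wide = kde_at(500, positions, bandwidth=150)    # motorway-style bandwidth
print(narrow, wide)
```

A smaller bandwidth produces a sharper, higher peak over the three nearby crashes, while a larger bandwidth spreads the same crashes over a longer stretch of road.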

Number of Monte Carlo simulations
A higher number of Monte Carlo simulations leads to more precise estimates of the critical values used when testing for hotspots. As a trade-off, computation takes longer as the number of simulations increases. The recommended value is 800.
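The role of the simulations can be sketched as follows. This is a simplified illustration and not the procedure implemented in the toolbox: it assumes an Epanechnikov kernel, simulates crashes uniformly along the section (the null hypothesis), and takes the 95th percentile of the simulated density maxima as a single critical value. All names and numbers are hypothetical, and only 200 simulations are run to keep the example fast.

```python
import random


def epanechnikov(u):
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def kde_max(positions, bandwidth, length, step):
    """Maximum of the kernel density estimate over a regular grid."""
    grid = [i * step for i in range(int(length // step) + 1)]
    n = len(positions)
    return max(
        sum(epanechnikov((x - p) / bandwidth) for p in positions) / (n * bandwidth)
        for x in grid
    )


def critical_value(n_points, length, bandwidth, n_sim, step=10, level=0.95, seed=0):
    """Monte Carlo estimate of a critical value under uniformly placed points."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sim):
        simulated = [rng.uniform(0, length) for _ in range(n_points)]
        maxima.append(kde_max(simulated, bandwidth, length, step))
    maxima.sort()
    return maxima[min(n_sim - 1, int(level * n_sim))]


# Hypothetical observed crashes: eight of ten lie close to the 500 m mark.
observed = [495, 497, 499, 500, 501, 503, 505, 506, 120, 870]
crit = critical_value(len(observed), length=1000, bandwidth=100, n_sim=200)
obs_max = kde_max(observed, bandwidth=100, length=1000, step=10)
print(obs_max > crit)
```

With more simulations the sorted list of maxima becomes longer, so the percentile is estimated more precisely, at the cost of a proportionally longer run time.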

Step of discretization
The accuracy with which the continuous (theoretical) concept is transferred into its discrete counterpart, which can be implemented on a computer. By default, the step of discretization is 1 m, so the discretization error is negligible. Increasing the step of discretization speeds up the computations, but it may reduce the precision with which hotspots are located.
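The trade-off can be made concrete with a short sketch. It assumes that the density is evaluated on a regular grid starting at the beginning of the section; the function names and the 1000 m section are hypothetical.

```python
def evaluation_grid(section_length, step):
    """Positions (in metres) at which the density would be evaluated."""
    return [i * step for i in range(int(section_length // step) + 1)]


def nearest_grid_error(position, step):
    """Distance from a position to the closest grid point."""
    remainder = position % step
    return min(remainder, step - remainder)


fine = evaluation_grid(1000, 1)      # default step in the ArcGIS toolbox
coarse = evaluation_grid(1000, 10)   # ten times fewer evaluations
print(len(fine), len(coarse))
print(nearest_grid_error(333, 1), nearest_grid_error(333, 10))
```

A step of 10 m needs roughly a tenth of the evaluations, but a location can then be off by up to half a step (5 m).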

Type of origin of input data
Two options:
XY (GPS) coordinates: points have precise XY coordinates (neglecting the error of a GPS device).
Linear stationing data: points do not have XY coordinates (e.g. linear referencing with a rounding error), or the coordinates are not precise and the error is not negligible with respect to the desired precision of the results (i.e. it is necessary to consider it in the calculations).

Minimum interval of linear stationing
Relevant only in the case of “Linear stationing data”. It defines the uncertainty in the data points (e. g. a rounding error of linear referencing or the error in the GPS coordinates which is not negligible with respect to the desired precision of the results).

Input parameters of the KDE+ analysis in KDE+ JAVA software

Bandwidth for KDE+ analysis
A smoothing parameter of the kernel density estimation. For traffic crashes, values from 50 m to 500 m are generally recommended. Lower values should be used for urban networks, while higher values are recommended for highways and motorways because of their higher speed limits. The default value is 100 m; 150 m is recommended for highways or motorways.

Data accuracy
Two options:
Accurate (GPS), XY coordinates: points have precise XY coordinates (neglecting the error of a GPS device).
Stationing (100 units), linear stationing data: points do not have XY coordinates (e.g. linear referencing with a rounding error of 100 units), or the coordinates are not precise and the error is not negligible with respect to the desired precision of the results (i.e. it is necessary to consider it in the calculations).

Other input parameters can be set in the KDE+ JAVA application configuration file:

Step of discretization
The accuracy with which the continuous (theoretical) concept is transferred into its discrete counterpart, which can be implemented on a computer. The step is the distance between two neighbouring positions at which the discretized probability density is calculated, given as a whole number (e.g. 1, 10). By default, the step of discretization is 10 m. Hence, the error of discretization is negligible. Increasing the step of discretization speeds up the computations, but it may reduce the precision with which hotspots are located.

Snapping distance (Tolerance for snapping points to line features)
Only points in the vicinity/neighbourhood of lines are considered in the calculations. The “tolerance for snapping” is the maximum distance from a point to a line defining this vicinity. By default, the snapping distance is 20 m.

Which attributes of resulting clusters are important for me?
ID_clus     - ID of the cluster
ID_line     - ID of the line section on which the cluster is located
NPts_clus     - number of points within the cluster
NPts_line     - number of points on the line section on which the cluster is located
Strength     - a relative number which measures the degree of violation of the null hypothesis (uniform distribution of traffic crashes along the road section); cluster strength is important for individual drivers because it represents the individual risk
Clus_from     - relative position of the cluster start point on the section
Clus_to     - relative position of the cluster end point on the section
Len_clus     - length of the cluster
Len_line     - length of the line section
Dens_Point     - density of points within the cluster per 100 m
Str_Dens2 = Strength*Dens_Point^2 (collective risk) - a measure of the collective importance of a cluster
SStr     - segment strength (called GStr, global strength, in previous versions); keeping only clusters with SStr = 1 is suitable for reducing the false alarm rate, at the cost of possibly missing less important clusters
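The two derived attributes can be recomputed by hand as a check. The sketch below assumes that Len_clus is given in metres and that Dens_Point is simply NPts_clus divided by Len_clus and scaled to 100 m; the function names and sample values are hypothetical.

```python
def dens_point(npts_clus, len_clus_m):
    """Number of points per 100 m of cluster length (assumes metres)."""
    return 100.0 * npts_clus / len_clus_m


def str_dens2(strength, density):
    """Collective risk: Strength * Dens_Point^2."""
    return strength * density ** 2


# Hypothetical cluster: 6 crashes within 150 m, cluster strength 0.5.
density = dens_point(6, 150.0)
collective = str_dens2(0.5, density)
print(density, collective)
```

Because the density enters squared, a cluster with many crashes per 100 m can outrank a stronger cluster that contains only a few crashes.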

Should I use strength or collective risk to order resulting clusters?
It is important to consider both the individual risk (represented by the cluster strength) and the collective risk. Kernel density estimation (the blue curve) highlights the places where a traffic crash is most likely to occur within a road. On the other hand, the number of traffic crashes within a road reflects how dangerous the road is as a whole (how frequently traffic crashes occur) and is related to exposure (the number of opportunities for a traffic crash to occur). Thus, the collective risk of a cluster depends on both the cluster strength and the number of traffic crashes per 100 m (Favilli et al., 2018).

Example

1. Low cluster strength (relatively low number of traffic crashes within the cluster compared to the number of traffic crashes within the whole road segment) + low number of traffic crashes within the road segment.
2. High cluster strength (relatively high number of traffic crashes within the cluster) + low number of traffic crashes within the road segment.
3. Low cluster strength (relatively low number of traffic crashes within the cluster) + high number of traffic crashes within the road segment.
4. High cluster strength (relatively high number of traffic crashes within the cluster) + high number of traffic crashes within the road segment.

Ordering according to the cluster strength: 4, 2, 3, 1 (2 has greater individual risk than 3, because 3 has greater exposure).
Ordering according to the collective risk of a cluster: 4, 3, 2, 1 (3 has greater collective risk than 2, because there are fewer traffic crashes on 2 than on 3).
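The two orderings can be reproduced with invented attribute values for the four example clusters. The numbers below are hypothetical and were chosen only to match the low/high pattern of the example.

```python
# Hypothetical attribute values for the four clusters of the example.
clusters = [
    {"ID_clus": 1, "Strength": 0.2, "Dens_Point": 1.0},  # low strength, few crashes
    {"ID_clus": 2, "Strength": 0.6, "Dens_Point": 1.0},  # high strength, few crashes
    {"ID_clus": 3, "Strength": 0.3, "Dens_Point": 3.0},  # low strength, many crashes
    {"ID_clus": 4, "Strength": 0.7, "Dens_Point": 3.0},  # high strength, many crashes
]


def collective_risk(cluster):
    """Str_Dens2 = Strength * Dens_Point^2."""
    return cluster["Strength"] * cluster["Dens_Point"] ** 2


by_strength = [c["ID_clus"] for c in
               sorted(clusters, key=lambda c: c["Strength"], reverse=True)]
by_collective = [c["ID_clus"] for c in
                 sorted(clusters, key=collective_risk, reverse=True)]

print(by_strength)
print(by_collective)
```

Clusters 2 and 3 swap places between the two rankings, which is why both measures are worth inspecting.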

Recommendations and restrictions
It is recommended to exclude the point events located at intersections when analyzing traffic crashes in general. The reason lies in the fact that intersections are typically dangerous places by definition. If they are not (for example in the case of animal-vehicle collisions), there is no need to exclude events located at intersections.

The original restriction on network sections shorter than 200 meters (mentioned in Bíl et al., 2013) no longer applies as of version 2.0. Short sections can therefore also be analyzed using KDE+.

Segment test (Segment threshold)
- in previous versions (3.1 and lower) called "Global test"
- applied to a road segment as a whole, not at every location on the road
- cannot precisely identify the extent of a hotspot
- suitable for reducing the false alarm rate
- may miss less important clusters
If we want to focus only on identifying the few most dangerous hotspots, we can proceed as follows. First, we identify and localize significant clusters according to the local threshold. Afterwards, we check the results of the segment test. To filter out false alarms from the resulting clusters, keep only those with “SStr” = 1.
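Applied to an exported attribute table, this filtering step might look like the sketch below. It assumes that SStr is stored as a 0/1 value per cluster; the records are invented.

```python
# Hypothetical clusters that passed the local threshold.
clusters = [
    {"ID_clus": 1, "Strength": 0.62, "SStr": 1},
    {"ID_clus": 2, "Strength": 0.35, "SStr": 0},
    {"ID_clus": 3, "Strength": 0.48, "SStr": 1},
]

# Keep only clusters on segments that also pass the segment test,
# then list the strongest clusters first.
kept = [c for c in clusters if c["SStr"] == 1]
kept.sort(key=lambda c: c["Strength"], reverse=True)

kept_ids = [c["ID_clus"] for c in kept]
print(kept_ids)
```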

Border bias

arises when KDE is applied in a bounded domain (e. g. a road section)
its extent depends on the selected bandwidth
a boundary correction can be applied (red curves)
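One common boundary correction is reflection, in which each point is mirrored at both ends of the section. Whether KDE+ uses this particular correction is not stated here, so the sketch below is only an illustration of the idea, with an assumed Epanechnikov kernel and invented numbers.

```python
def epanechnikov(u):
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def kde_plain(x, positions, bandwidth):
    """Uncorrected estimate: kernel mass leaks outside the section."""
    total = sum(epanechnikov((x - p) / bandwidth) for p in positions)
    return total / (len(positions) * bandwidth)


def kde_reflected(x, positions, bandwidth, length):
    """Reflection correction: mirror every point at 0 and at `length`."""
    total = 0.0
    for p in positions:
        total += epanechnikov((x - p) / bandwidth)                 # original point
        total += epanechnikov((x + p) / bandwidth)                 # mirrored at 0
        total += epanechnikov((x - (2 * length - p)) / bandwidth)  # mirrored at length
    return total / (len(positions) * bandwidth)


def mass_on_section(density, length, dx=1.0):
    """Midpoint-rule integral of a density over [0, length]."""
    n = int(length / dx)
    return sum(density((i + 0.5) * dx) for i in range(n)) * dx


# Hypothetical crash 10 m from the start of a 1000 m section, bandwidth 100 m.
pts, h, L = [10.0], 100.0, 1000.0
plain_mass = mass_on_section(lambda x: kde_plain(x, pts, h), L)
reflected_mass = mass_on_section(lambda x: kde_reflected(x, pts, h, L), L)
print(plain_mass, reflected_mass)
```

Without the correction, a large part of the kernel mass of a crash near the section end falls outside the section, so the density there is underestimated; the reflected estimate integrates to approximately 1 again.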

A detailed spatiotemporal analysis

Since hotspots (statistically significant traffic crash clusters) evolve over time, it is meaningful to perform a spatiotemporal analysis.

An approach based on the KDE+ method, capable of evaluating the spatiotemporal behavior of traffic crash hotspots in high detail, was introduced in Bíl, M., Andrášik, R., Sedoník, J., 2019. A detailed spatiotemporal analysis of traffic crash hotspots. Applied Geography 107, 82-90. This approach can be used in research focusing on the spatiotemporal evolution of crash patterns within a road network. Practitioners can use it as a tool for retrospective analysis of the efficiency of safety measures.

A sketch of the three elementary forms of hotspots in relation to their temporal behavior is shown
(hotspot 1: disappearance, hotspot 2: stability, hotspot 3: emergence).
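The three elementary forms can be expressed as a simple labelling rule. The sketch below is deliberately coarse: it compares only two periods, whereas the cited paper evaluates the temporal behavior in much finer detail. The function name and inputs are hypothetical.

```python
def classify_hotspot(significant_early, significant_late):
    """Label a location by whether a hotspot was detected in two periods."""
    if significant_early and significant_late:
        return "stability"
    if significant_early:
        return "disappearance"
    if significant_late:
        return "emergence"
    return "no hotspot"


# Hypothetical detection results (early period, late period) for three hotspots.
hotspots = {1: (True, False), 2: (True, True), 3: (False, True)}
labels = {hid: classify_hotspot(*flags) for hid, flags in hotspots.items()}
print(labels)
```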

For instance, a dangerous curve presents a stable hotspot on the road segment below.

FAQ