Spatial Autocorrelation and Moran's I in SQL and BigQuery

Mathias Schläffer
4 min readMar 24, 2022
Photo by Tim Foster on Unsplash

Spatial lag and autocorrelation is discussed in many spatial or geographic data science resources. One common metric calculated to describe global spatial autocorrelation is Moran's I. There are good Python packages out there to calculate this metric, one of them is PySal which also comes with a great e-book I mentioned before named Geographic Data Science with Python (GDS). The book explains all concepts from spatial lag, rook- and queen adjacency to spatial weights and autocorrelation better than I ever could. I therefore want to skip the theoretical part of Moran's I and instead demonstrate quickly how it can be calculated in SQL.

Since I skip the theoretical part, I will use the same data as the authors of GDS and will use the same variable names whenever possible so that it is easier to follow along and fill in the theory if needed.
After loading the brexit vote data and the local authority districts, the two tables are joined into one, containing both the variable for what percentage of the population voted to leave, and the polygons of the districts.

Red : votes for leave > 50%, White: votes for leave = 50%, Blue: votes for leave < 50%

First we extract the variables needed, then we already assign the neighbours to each area. Since the authors of GDS use an assignment of a fixed number of 8 neighbours based on distance rather than adjacency, I will also show this method, but will show a variation at the end of this article.
To get 8 neighbours closest to each area (based on the distance of the centroids of the districts) we use a cross-join of each area on each other area before ordering pairs on the distance of their centroids and limiting to 8 neighbours.

WITH
lad_votes AS (
SELECT
objectid,
lad16nm,
Pct_Leave,
geometry,
ST_CENTROID(geometry) as centroid
FROM
`project.dataset.local_authority_districts_votes`
),
k_nearest AS (
SELECT
a.objectid,
a.lad16nm,
a.Pct_Leave,
ARRAY_AGG(
STRUCT(
b.objectid as objectid_neighbour,
b.lad16nm as lad16nm_neighbour,
b.Pct_Leave AS Pct_Leave_neighbour)
ORDER BY ST_DISTANCE(a.centroid, b.centroid) ASC
LIMIT 8
) as neighbours
FROM
lad_votes a, lad_votes b
WHERE a.objectid <> b.objectid
GROUP BY
a.objectid,
a.lad16nm,
a.Pct_Leave
),

In the next step, the row weights are calculated (in this case a constant 1/8 as the number of distinct neighbours is fixed by design, but in the later example this weight will vary) and multiplied with the Pct_leave_neighbour. For each area, all row standardized values are summed up into one metric called Pct_leave_lag — the weighted sum of the spatial lag for the Pct_leave metric.

row_weights AS (
SELECT
objectid,
lad16nm,
Pct_Leave,
SUM(Pct_Leave_neighbour)/COUNT(DISTINCT(objectid_neighbour)) AS Pct_leave_lag
FROM
k_nearest, UNNEST(neighbours)
GROUP BY
objectid,
lad16nm,
Pct_Leave
),

The output of this step looks something like this:

We then normalize both the Pct_leave and Pct_leave_lag by subtracting their respective means:

normalize AS (
SELECT
objectid,
lad16nm,
(Pct_Leave - AVG(Pct_Leave) OVER ()) AS Pct_Leave_std,
(Pct_leave_lag - AVG(Pct_leave_lag) OVER ()) AS Pct_leave_lag_std
FROM
row_weights
)

Finally, Moran's I is the slope of the linefit of Moran's plot which is 'just' a scatter plot with Pct_leave_std on the x-axis and Pct_leave_lag_std on the y-axis.

The slope of the linefit can be calculated simply by taking the covariance of Pct_leave_lag_std and Pct_Leave_std divided by the variance of Pct_leave_std

SELECT
COVAR_POP(Pct_leave_lag_std, Pct_Leave_std)/VAR_POP(Pct_Leave_std) AS Morans_i
FROM
normalize

Finally I want to offer an alternative approach to the distance based assignment of a fixed number of neighbours. Instead, it is simple to switch to an adjacency based assignment of neighbouring areas. To do this, the only step that needs to be changed is the spatial join in the subquery k_nearest. Instead of a cross-join and ordering based on distance one can join areas based on the intersection of their borders (In theory ST_TOUCHES should work, but that requires almost perfect polygons with absolutely no overlaps of adjacent polygons, ST_INTERSECTS is somewhat more forgiving). In this case, the number of neighbours is not fixed, but will be determined by the actual number of adjacent areas. Therefore row weights in the following step will vary with the number of neighbours.

k_nearest AS (
SELECT
a.objectid,
a.lad16nm,
a.Pct_Leave,
ARRAY_AGG(
STRUCT(
b.objectid as objectid_neighbour,
b.lad16nm as lad16nm_neighbour,
b.Pct_Leave AS Pct_Leave_neighbour)
) as neighbours
FROM
lad_votes a
JOIN lad_votes b
ON ST_INTERSECTS(a.geometry, b.geometry)
WHERE a.objectid <> b.objectid
GROUP BY
a.objectid,
a.lad16nm,
a.Pct_Leave
),

When I was thinking of writing this article, my intention was also to show how a hexagon grid takes care of the spatial lag so elegantly as all adjacent hexagon neighbours are equidistant from each other (neither rook nor queen adjacency in a rectangular grid have that) and the kring functions available are so easy to use that there is no need for explicit spatial joins. But when preparing some material I found that Carto already provides an out-of-the-box Moran's I function for H3 which lets you set the number of krings to determine the number of neighbours and the distance decay function of the spatial lag for the spatial weights matrix . Saved me a whole lot of writing :)

Sources:

Geographic Data Science with Python: https://geographicdata.science/book/intro.html

local authority districts: https://data.gov.uk/dataset/1261032e-c683-4408-a82d-37281d0e250d/local-authority-districts-december-2016-full-clipped-boundaries-in-great-britain

Carto: https://docs.carto.com/analytics-toolbox-bigquery/overview/getting-started/

--

--

Mathias Schläffer

Economist turned Data Scientist. Creating human mobility insights at Unacast