ESDA in PySal (5): Exploratory analysis of spatial data: spatial autocorrelation

In this notebook, we introduce methods of exploratory spatial data analysis that aim to complement geovisualization with formal univariate and multivariate statistical tests for spatial clustering.

1. Imports

import esda
import pandas as pd
import geopandas as gpd
from geopandas import GeoDataFrame
import libpysal as lps
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import Point
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

Our dataset comes from a scrape of Berlin Airbnb listings taken in April 2018. This dataset was prepared by Levi Wolf and Serge Rey. As part of the workshop, we build a geopandas data frame with one column reporting the median listing price for each neighbourhood in Berlin:

gdf = gpd.read_file('data/berlin-neighbourhoods.geojson')

bl_df = pd.read_csv('data/berlin-listings.csv')
geometry = [Point(xy) for xy in zip(bl_df.longitude, bl_df.latitude)]
crs = {'init': 'epsg:4326'}
bl_gdf = GeoDataFrame(bl_df, crs=crs, geometry=geometry)

bl_gdf['price'] = bl_gdf['price'].astype('float32')
sj_gdf = gpd.sjoin(gdf, bl_gdf, how='inner', op='intersects', lsuffix='left', rsuffix='right')
median_price_gb = sj_gdf['price'].groupby([sj_gdf['neighborhood_group']]).mean()
median_price_gb
neighborhood_group
Charlottenburg-Wilm. 58.556408
Friedrichshain-Kreuzberg 55.492809
Lichtenberg 44.584270
Marzahn - Hellersdorf 54.246754
Mitte 60.387890
Neukölln 45.135948
Pankow 60.282516
Reinickendorf 43.682465
Spandau 48.236561
Steglitz - Zehlendorf 54.445683
Tempelhof - Schöneberg 53.704407
Treptow-Köpenick 51.222004
Name: price, dtype: float32
gdf = gdf.join(median_price_gb, on='neighborhood_group')
gdf.rename(columns={'price': 'median_pri'}, inplace=True)
gdf.head(15)
neighborhood neighbourhood_group geometry median_pri
0 Blankenfelde/Niederschönhausen Pankow MULTIPOLYGON (((13.41191 52.61487, 13.41183 52… 60.282516
1 Helmholtzplatz Pankow MULTIPOLYGON (((13.41405 52.54929, 13.41422 52… 60.282516
2 Wiesbadener Straße Charlottenburg-Wilm. MULTIPOLYGON (((13.30748 52.46788, 13.30743 52… 58.556408
3 Schmöckwitz/Karolinenhof/Rauchfangswerder Treptow – Köpenick MULTIPOLYGON (((13.70973 52.39630, 13.70926 52… 51.222004
4 Müggelheim Treptow – Köpenick MULTIPOLYGON (((13.73762 52.40850, 13.73773 52… 51.222004
5 Biesdorf Marzahn – Hellersdorf MULTIPOLYGON (((13.56643 52.53510, 13.56697 52… 54.246754
6 Nord 1 Reinickendorf MULTIPOLYGON (((13.33669 52.62265, 13.33663 52… 43.682465
7 West 5 Reinickendorf MULTIPOLYGON (((13.28138 52.59958, 13.28158 52… 43.682465
8 Frankfurter Allee Nord Friedrichshain-Kreuzberg MULTIPOLYGON (((13.45320 52.51682, 13.45321 52… 55.492809
9 Buch Pankow MULTIPOLYGON (((13.46449 52.65055, 13.46457 52… 60.282516
10 Kaulsdorf Marzahn – Hellersdorf MULTIPOLYGON (((13.62135 52.52704, 13.62196 52… 54.246754
11 NaN NaN MULTIPOLYGON (((13.61659 52.58154, 13.61458 52… NaN
12 NaN NaN MULTIPOLYGON (((13.61668 52.57868, 13.60703 52… NaN
13 nördliche Luisenstadt Friedrichshain-Kreuzberg MULTIPOLYGON (((13.44430 52.50066, 13.44266 52… 55.492809
14 Nord 2 Reinickendorf MULTIPOLYGON (((13.30680 52.58606, 13.30667 52… 43.682465

We first have to deal with the NaN values:

pd.isnull(gdf['median_pri']).sum()
2
gdf['median_pri'].fillna((gdf['median_pri'].mean()), inplace=True)
gdf.plot(column='median_pri')
<Axes: >

fig, ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'})
gdf.plot(column='median_pri', scheme='Quantiles', k=5, cmap='GnBu', legend=True, ax=ax)
#ax.set_xlim(150000, 160000)
#ax.set_ylim(208000, 215000)
<Axes: >

2. Spatial autocorrelation

Visual inspection of the map pattern of prices allows us to search for spatial structure. If the spatial distribution of prices were random, we should not see any clustering of similar values on the map. However, our visual system is drawn to the darker clusters in the southwest and center, with lighter tones (lower prices) concentrated in the north-central and southeast.

Our brains are very powerful pattern recognition machines. Sometimes, however, they can be too powerful and lead us to detect false positives: patterns where there is no statistical pattern. This is a particular concern when dealing with the visualization of irregular polygons of differing sizes and shapes.

The concept of spatial autocorrelation relates to the combination of two types of similarity: spatial similarity and attribute similarity. While there are many different measures of spatial autocorrelation, they all combine these two types of similarity into one summary measure.

Let’s use PySAL to generate these two types of similarity measures.

2.1 Spatial similarity

We have already encountered spatial weights in previous notebooks. In spatial autocorrelation analysis, spatial weights are used to formalize the notion of spatial similarity. As we saw there, there are many ways to define spatial weights; here we will use queen contiguity:

df = gdf
wq = lps.weights.Queen.from_dataframe(df)
wq.transform = 'r'
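
Before moving on, it can help to peek inside the weights object. The following is a minimal sketch (assuming the wq object built above) that prints the number of observations, the neighbors of the first polygon, and a summary of how many neighbors each polygon has:

# Minimal sketch: inspect the queen contiguity structure built above
print(wq.n)                                     # number of neighborhood polygons
print(wq.neighbors[0])                          # ids of polygons adjacent to polygon 0
print(pd.Series(wq.cardinalities).describe())   # distribution of neighbor counts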

2.2 Attribute similarity

So the spatial weight between neighborhoods i and j indicates whether the two are neighbors (i.e., geographically similar). What we also need is a measure of attribute similarity to pair up with this concept of spatial similarity. The spatial lag is a derived variable that accomplishes this for us. For neighborhood i the spatial lag is defined as:

ylag_i = \sum_j w_{i,j} y_j

y = df['median_pri']
ylag = lps.weights.lag_spatial(wq, y)
ylag
array([56.9625061, 60.28251648, 56.37749926, 51.22200394, 51.22200394,
       50.52180099, 43.6824646, 45.63422012, 52.65491422, 60.28251648,
       53.64180374, 52.73586273, 52.73586273, 56.47182541, 47.83247757,
       58.58870177, 60.33520317, 59.60296903, 60.38788986, 60.02159348,
       51.80624199, 57.94034958, 52.84482813, 53.40314266, 57.90522512,
       60.28251648, 60.28251648, 55.79730334, 56.79401737, 50.81182589,
       59.01427841, 60.29756982, 60.28251648, 50.86356888, 60.3220315,
       60.28251648, 55.48057556, 54.42881557, 60.32466583, 59.50179418,
       54.42846909, 58.55640793, 58.55640793, 57.73426285, 57.47818544,
       57.74774106, 56.13040733, 48.23656082, 48.23656082, 53.74621709,
       55.11957245, 45.95951271, 51.67650986, 54.1985906, 51.45368042,
       52.36880302, 54.44568253, 54.44568253, 50.84825389, 56.50104523,
       53.92108345, 55.9956289, 50.49590378, 49.14499828, 48.61369433,
       49.70049, 49.32550866, 51.22200394, 51.22200394, 47.80509822,
       49.70049, 51.22200394, 45.13594818, 47.57037048, 51.22200394,
       51.22200394, 51.22200394, 51.22200394, 49.60257785, 51.57007762,
       51.42743301, 51.22200394, 51.22200394, 52.43339348, 49.41551208,
       51.58891296, 44.58427048, 51.58891296, 51.42743301, 49.82624902,
       48.947686, 48.40726217, 45.95951271, 47.57037048, 43.6824646,
       47.02354965, 45.95951271, 58.55640793, 56.30865606, 58.09966066,
       47.34497997, 46.40236028, 58.05298805, 59.24321365, 58.55640793,
       47.83247757, 49.49497332, 50.74955784, 48.6149381, 55.97644615,
       57.95624052, 57.87081385, 58.75619634, 60.37283652, 48.23656082,
       49.389711, 54.00091705, 54.26036358, 57.54238828, 55.61980756,
       51.97116137, 48.92101212, 50.97179985, 54.07504463, 47.45824547,
       49.42017746, 45.13594818, 45.13594818, 48.61369433, 49.41551208,
       51.22200394, 50.20766131, 48.72533471, 54.24675369, 54.24675369,
       54.24675369, 53.23850377, 56.1851902, 49.23337746, 43.6824646 ])
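
As a sanity check connecting the formula above to the code, the spatial lag is simply the (row-standardized) weights matrix applied to y. A minimal sketch, assuming the wq, y, and ylag objects already in memory:

# Sketch: the spatial lag equals the dense weights matrix times y
W_dense, ids = wq.full()                       # dense row-standardized weights matrix
print(np.allclose(W_dense @ y.values, ylag))   # expected: True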
import mapclassify as mc
ylagq5 = mc.Quantiles(ylag, k=5)
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=ylagq5.yb).plot(column='cl', categorical=True, \
        k=5, cmap='GnBu', linewidth=0.1, ax=ax, \
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.title("Spatial Lag Median Price (Quintiles)")

plt.show()

The quintile map of the spatial lag tends to enhance the impression of value similarity in space: it is, in effect, a local smoother.

df['lag_median_pri'] = ylag
f,ax = plt.subplots(1,2,figsize=(2.16*4,4))
df.plot(column='median_pri', ax=ax[0], edgecolor='k',
        scheme="quantiles", k=5, cmap='GnBu')
ax[0].axis(df.total_bounds[np.asarray([0,2,1,3])])
ax[0].set_title("Price")
df.plot(column='lag_median_pri', ax=ax[1], edgecolor='k',
        scheme='quantiles', cmap='GnBu', k=5)
ax[1].axis(df.total_bounds[np.asarray([0,2,1,3])])
ax[1].set_title("Spatial Lag Price")
ax[0].axis('off')
ax[1].axis('off')
plt.show()

However, we still face the challenge of visually associating the price value in a neighborhood with the value of the spatial lag for that focal unit, where the latter is a weighted average of the listing prices in the focal unit's neighbors.

To complement the geographical visualization of these associations, we can turn to formal statistical measures of spatial autocorrelation.

3. Global spatial autocorrelation

We begin with a simple case where the variable under consideration is binary. This is useful for unpacking the logic of spatial autocorrelation tests. So even though our attribute is continuously valued, we convert it to a binary case to illustrate the key concepts:

3.1 Binary case

y.median()

53.704407
yb = y > y.median()
sum(yb)
68

We have 68 neighborhoods with listing prices above the median and 70 below the median (recall the issue of ties).

yb = y > y.median()
labels = ["0 Low", "1 High"]
yb = [labels[i] for i in 1*yb]
df['yb'] = yb

The spatial distribution of the binary variable immediately raises questions about the juxtaposition of the "black" and "white" areas.

fig, ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'})
df.plot(column='yb', cmap='binary', edgecolor='grey', legend=True, ax=ax)
<Axes: >

3.2 Join counts

One way to formalize a test for spatial autocorrelation in a binary attribute is to consider the so-called _joins_. A join exists for each neighbor pair of observations, and the joins are reflected in our binary spatial weights object wq.

Each unit can take on one of two values, "Black" or "White", so for a given pair of neighboring locations there are three different types of joins that can arise:

  • Black-Black (BB)
  • White-White (WW)
  • Black-White (or White-Black) (BW)

Given that we have 68 Black polygons on our map, what is the number of Black-Black (BB) joins we could expect if the process placing the Black polygons on the map were completely random? This is the logic of the join count statistic.
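
To make that logic concrete, the sketch below counts the BB joins directly from the adjacency lists (each pair appears twice in a symmetric weights object, hence the division by two). The helper names yb01, wq_b, and bb_manual are ours; the result should match the jc.bb value computed with esda next:

yb01 = 1 * (y > y.median())                      # 1 = Black (above median), 0 = White
wq_b = lps.weights.Queen.from_dataframe(df)      # binary queen contiguity weights
bb_manual = sum(yb01[i] * yb01[j] for i in wq_b.neighbors for j in wq_b.neighbors[i]) / 2
print(bb_manual)                                 # count of Black-Black joins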

We can use the esda package from PySAL to carry out the join count analysis:

import esda
yb = 1 * (y > y.median()) # convert back to binary
wq = lps.weights.Queen.from_dataframe(df)
wq.transform = 'b'
np.random.seed(12345)
jc = esda.join_counts.Join_Counts(yb, wq)

The resulting object stores the observed counts for the different types of joins:

jc.bb
164.0
jc.ww
149.0
jc.bw
73.0

Note that these three scenarios exhaust all possibilities:

jc.bb + jc.ww + jc.bw
386.0

and

wq.s0/2
386.0

which is the number of unique joins in the spatial weights object (s0 sums all weights, and each join is counted twice in the symmetric binary W, hence the division by two).

Our object tells us we observed 164 BB joins:

jc.bb
164.0

The critical question for us is whether this is a departure from what we would expect if the process generating the spatial distribution of the Black polygons were completely random. To answer this, PySAL uses random spatial permutations of the observed attribute values to generate realizations under the null of complete spatial randomness (CSR). This is repeated a large number of times (999 by default) to construct a reference distribution against which the statistical significance of our observed counts can be evaluated.

The average number of BB joins across the synthetic realizations is:

jc.mean_bb
90.70170170170171

This is less than what we observed. The question is whether our observed value is so far away from this expectation that we would reject the null of CSR.

import seaborn as sbn
sbn.kdeplot(jc.sim_bb, shade=True)
plt.vlines(jc.bb, 0, 0.075, color='r')
plt.vlines(jc.mean_bb, 0,0.075)
plt.xlabel('BB Counts')

Text(0.5, 0, 'BB Counts')

The density portrays the distribution of the BB counts, with the black vertical line indicating the mean BB count from the synthetic realizations and the red line the BB count we observed for our prices. Clearly the value we observed is extremely high. A pseudo p-value summarizes this:

jc.p_sim_bb

0.001
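
The pseudo p-value is simply the share of synthetic realizations with a BB count at least as large as the observed one, with the observed value itself included in the denominator. A minimal sketch of that calculation, assuming the jc object from above:

# Sketch: pseudo p-value from the simulated BB counts
p_manual = (1 + (jc.sim_bb >= jc.bb).sum()) / (len(jc.sim_bb) + 1)
print(p_manual)                                  # should agree with jc.p_sim_bb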

Since this is below conventional significance levels, we would reject the null of complete spatial randomness in favor of spatial autocorrelation in market prices.

3.3 Continuous case

The join count analysis is based on a binary attribute, which can cover many interesting empirical applications where one is interested in presence and absence type phenomena. In our case, we artificially created the binary variable, and in the process we threw away a lot of the information in our original continuous attribute. Turning back to the original variable, we can explore other tests of spatial autocorrelation for the continuous case.

First, we transform our weights from their current binary state to row standardization:

wq.transform = 'r'
y = df['median_pri']

Moran’s I is a test of global autocorrelation for a continuous attribute:

np.random.seed(12345)
mi = esda.moran.Moran(y, wq)
mi.I
0.6563069331329718
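
For intuition, Moran's I can be written as I = (n / S0) * (z' W z) / (z' z), where z holds the deviations of y from its mean and S0 is the sum of all weights (equal to n under row standardization). A minimal sketch of that calculation, assuming the wq and y objects from above:

# Sketch: Moran's I computed directly from its definition
W_dense, _ = wq.full()                          # dense row-standardized weights matrix
z = y.values - y.values.mean()                  # deviations from the mean
n, S0 = len(z), W_dense.sum()                   # S0 equals n for row-standardized weights
print((n / S0) * (z @ W_dense @ z) / (z @ z))   # should be close to mi.I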

Again, our value for the statistic needs to be interpreted against a reference distribution under the null of CSR. PySAL uses a similar approach as we saw in the join count analysis: random spatial permutations.

import seaborn as sbn
sbn.kdeplot(mi.sim, shade=True)
plt.vlines(mi.I, 0, 1, color='r')
plt.vlines(mi.EI, 0,1)
plt.xlabel("Moran's I")

Text(0.5, 0, "Moran's I")

Here our observed value is again in the upper tail, although visually it does not look as extreme as in the binary case. Yet it is still statistically significant:

mi.p_sim
0.001

4. Local autocorrelation: hot spots, cold spots and spatial outliers

In addition to global autocorrelation statistics, PySAL has many local autocorrelation statistics. Let's compute a local Moran statistic for the same dataset.

np.random.seed(12345)
import esda
wq.transform = 'r'
lag_price = lps.weights.lag_spatial(wq, df['median_pri'])

price = df['median_pri']
b, a = np.polyfit(price, lag_price, 1)
f, ax = plt.subplots(1, figsize=(9, 9))

plt.plot(price, lag_price, '.', color='firebrick')

 # dashed vert at mean of the price
plt.vlines(price.mean(), lag_price.min(), lag_price.max(), linestyle='--')
 # dashed horizontal at mean of lagged price
plt.hlines(lag_price.mean(), price.min(), price.max(), linestyle='--')

# red line of best fit using global I as slope
plt.plot(price, a + b*price, 'r')
plt.title('Moran Scatterplot')
plt.ylabel('Spatial Lag of Price')
plt.xlabel('Price')
plt.show()
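
With row-standardized weights, the slope of the Moran scatterplot (the regression of the spatial lag on price) equals global Moran's I, which is why the red line can use the global statistic as its slope. A quick check, assuming the b and mi objects from above:

# The fitted slope and global Moran's I should (approximately) coincide
print(b, mi.I)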

Now, instead of a single I statistic, we have an array of local I_i statistics. The statistics are stored in the ".Is" attribute, and the simulated p-values are in "p_sim".

li = esda.moran.Moran_Local(y, wq)
li.q
array([1, 1, 1, 3, 3, 4, 3, 3, 4, 1, 1, 3, 3, 1, 3, 1, 1, 1, 1, 1, 4, 1,
       1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 3, 3, 1, 1, 3, 3, 1, 3, 3, 1, 1, 4, 1, 1, 1, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 3, 3, 1, 1, 1, 3, 3, 4, 3, 1,
       1, 1, 1, 1, 3, 3, 1, 1, 1, 1, 4, 3, 4, 1, 3, 3, 3, 3, 3, 4, 3, 3,
       4, 1, 1, 1, 1, 2, 3, 3])
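
The q attribute codes the quadrant of the Moran scatterplot that each observation falls in: 1 = HH (high price, high lag), 2 = LH, 3 = LL, 4 = HL. A minimal sketch reproducing these codes from the deviations and their spatial lag (the handling of exact zeros here is our assumption and may differ slightly from esda's internals):

# Sketch: quadrant codes from the signs of z and its spatial lag
z = y.values - y.values.mean()
zlag = lps.weights.lag_spatial(wq, z)
q_manual = np.select([(z > 0) & (zlag > 0), (z <= 0) & (zlag > 0), (z <= 0) & (zlag <= 0)],
                     [1, 2, 3], default=4)       # 1=HH, 2=LH, 3=LL, 4=HL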

We can again use permutations to test for local clustering, but here conditional random permutations are used (a different reference distribution is constructed for each focal location):

(li.p_sim < 0.05).sum()
68

We can distinguish the specific type of local spatial association reflected in the four quadrants of the Moran scatterplot above:

sig = li.p_sim < 0.05
hotspot = sig * li.q==1
coldspot = sig * li.q==3
doughnut = sig * li.q==2
diamond = sig * li.q==4
spots = ['n.sig.', 'hot spot']
labels = [spots[i] for i in hotspot*1]
df = df
from matplotlib import colors
hmap = colors.ListedColormap(['red', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True, \
        k=2, cmap=hmap, linewidth=0.1, ax=ax, \
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()

spots = ['n.sig.', 'cold spot']
labels = [spots[i] for i in coldspot*1]
df = df
from matplotlib import colors
hmap = colors.ListedColormap(['blue', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True, \
        k=2, cmap=hmap, linewidth=0.1, ax=ax, \
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()

spots = ['n.sig.', 'doughnut']
labels = [spots[i] for i in doughnut*1]
df = df
from matplotlib import colors
hmap = colors.ListedColormap(['lightblue', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True, \
        k=2, cmap=hmap, linewidth=0.1, ax=ax, \
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()

spots = ['n.sig.', 'diamond']
labels = [spots[i] for i in diamond*1]
df = df
from matplotlib import colors
hmap = colors.ListedColormap(['pink', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True, \
        k=2, cmap=hmap, linewidth=0.1, ax=ax, \
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()

sig = 1 * (li.p_sim < 0.05)
hotspot = 1 * (sig * li.q==1)
coldspot = 3 * (sig * li.q==3)
doughnut = 2 * (sig * li.q==2)
diamond = 4 * (sig * li.q==4)
spots = hotspot + coldspot + doughnut + diamond
spots
array([1, 1, 0, 0, 0, 0, 3, 3, 0, 1, 0, 3, 3, 0, 3, 1, 1, 1, 1, 1, 0, 1,
       0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 3, 3, 0,
       3, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 3, 0,
       0, 3, 0, 3, 3, 3, 3, 3, 3, 1, 0, 1, 3, 3, 1, 1, 1, 3, 0, 0, 3, 0,
       1, 1, 1, 1, 0, 3, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 3, 3, 3, 0, 0, 0,
       4, 0, 0, 0, 0, 0, 3, 3])
spot_labels = [ '0 ns', '1 hot spot', '2 doughnut', '3 cold spot', '4 diamond']
labels = [spot_labels[i] for i in spots]
from matplotlib import colors
hmap = colors.ListedColormap([ 'lightgrey', 'red', 'lightblue', 'blue', 'pink'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True, \
        k=2, cmap=hmap, linewidth=0.1, ax=ax, \
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()