ESDA in PySal (5): Exploratory analysis of spatial data: spatial autocorrelation
In this notebook, we introduce methods of exploratory spatial data analysis that aim to complement the visualization of spatial distributions with formal univariate and multivariate statistical tests for spatial clustering.
1. Imports
import esda
import pandas as pd
import geopandas as gpd
from geopandas import GeoDataFrame
import libpysal as lps
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import Point
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
Our dataset comes from Berlin Airbnb crawls taken in April 2018. This data frame was created by Levi Wolf and Serge Rey. As part of the workshop, we built a geopandas data frame with one column reporting the median unit listing price for each neighborhood in Berlin:
gdf = gpd.read_file('data/berlin-neighbourhoods.geojson')
bl_df = pd.read_csv('data/berlin-listings.csv')
geometry = [Point(xy) for xy in zip(bl_df.longitude, bl_df.latitude)]
crs = {'init': 'epsg:4326'}
bl_gdf = GeoDataFrame(bl_df, crs=crs, geometry=geometry)
bl_gdf['price'] = bl_gdf['price'].astype('float32')
sj_gdf = gpd.sjoin(gdf, bl_gdf, how='inner', op='intersects', lsuffix='left', rsuffix='right')
median_price_gb = sj_gdf['price'].groupby([sj_gdf['neighborhood_group']]).mean()
median_price_gb
neighborhood_group
Charlottenburg-Wilm.        58.556408
Friedrichshain-Kreuzberg    55.492809
Lichtenberg                 44.584270
Marzahn - Hellersdorf       54.246754
Mitte                       60.387890
Neukölln                    45.135948
Pankow                      60.282516
Reinickendorf               43.682465
Spandau                     48.236561
Steglitz - Zehlendorf       54.445683
Tempelhof - Schöneberg      53.704407
Treptow-Köpenick            51.222004
Name: price, dtype: float32
gdf = gdf.join(median_price_gb, on='neighborhood_group')
gdf.rename(columns={'price': 'median_pri'}, inplace=True)
gdf.head(15)
| | neighbourhood | neighbourhood_group | geometry | median_pri |
|---|---|---|---|---|
| 0 | Blankenfelde/Niederschönhausen | Pankow | MULTIPOLYGON (((13.41191 52.61487, 13.41183 52… | 60.282516 |
| 1 | Helmholtzplatz | Pankow | MULTIPOLYGON (((13.41405 52.54929, 13.41422 52… | 60.282516 |
| 2 | Wiesbadener Straße | Charlottenburg-Wilm. | MULTIPOLYGON (((13.30748 52.46788, 13.30743 52… | 58.556408 |
| 3 | Schmöckwitz/Karolinenhof/Rauchfangswerder | Treptow – Köpenick | MULTIPOLYGON (((13.70973 52.39630, 13.70926 52… | 51.222004 |
| 4 | Müggelheim | Treptow – Köpenick | MULTIPOLYGON (((13.73762 52.40850, 13.73773 52… | 51.222004 |
| 5 | Biesdorf | Marzahn – Hellersdorf | MULTIPOLYGON (((13.56643 52.53510, 13.56697 52… | 54.246754 |
| 6 | Nord 1 | Reinickendorf | MULTIPOLYGON (((13.33669 52.62265, 13.33663 52… | 43.682465 |
| 7 | West 5 | Reinickendorf | MULTIPOLYGON (((13.28138 52.59958, 13.28158 52… | 43.682465 |
| 8 | Frankfurter Allee Nord | Friedrichshain-Kreuzberg | MULTIPOLYGON (((13.45320 52.51682, 13.45321 52… | 55.492809 |
| 9 | Buch | Pankow | MULTIPOLYGON (((13.46449 52.65055, 13.46457 52… | 60.282516 |
| 10 | Kaulsdorf | Marzahn – Hellersdorf | MULTIPOLYGON (((13.62135 52.52704, 13.62196 52… | 54.246754 |
| 11 | NaN | NaN | MULTIPOLYGON (((13.61659 52.58154, 13.61458 52… | NaN |
| 12 | NaN | NaN | MULTIPOLYGON (((13.61668 52.57868, 13.60703 52… | NaN |
| 13 | nördliche Luisenstadt | Friedrichshain-Kreuzberg | MULTIPOLYGON (((13.44430 52.50066, 13.44266 52… | 55.492809 |
| 14 | Nord 2 | Reinickendorf | MULTIPOLYGON (((13.30680 52.58606, 13.30667 52… | 43.682465 |
We first have to deal with the NaN values:
pd.isnull(gdf['median_pri']).sum()
2
gdf['median_pri'].fillna((gdf['median_pri'].mean()), inplace=True)
gdf.plot(column='median_pri')
<Axes: >
fig, ax = plt.subplots(figsize=(12, 10), subplot_kw={'aspect': 'equal'})
gdf.plot(column='median_pri', scheme='Quantiles', k=5, cmap='GnBu', legend=True, ax=ax)
#ax.set_xlim(150000, 160000)
#ax.set_ylim(208000, 215000)
<Axes: >
2. Spatial autocorrelation
Visual inspection of the map pattern of prices allows us to search for spatial structure. If the spatial distribution of the prices were random, then we should not see any clustering of similar values on the map. However, our visual system is drawn to the darker clusters in the southwest and center, with the lighter tones (lower prices) concentrated in the north-central and southeast parts of the map.

Our brains are very powerful pattern recognition machines. However, sometimes they can be too powerful and lead us to detect false positives, that is, patterns where there is no statistically significant pattern. This is a particular concern when dealing with visualizations of irregular polygons of differing sizes and shapes.
The concept of spatial autocorrelation involves a combination of two types of similarity: spatial similarity and attribute similarity. While there are many different measures of spatial autocorrelation, they all combine these two types of similarity into a summary measure.
Let’s use PySAL to generate these two types of similarity measures.
2.1 Spatial similarity
We have already encountered spatial weights in previous notebooks. In spatial autocorrelation analysis, the spatial weights are used to formalize the notion of spatial similarity. As we saw there, there are many ways to define spatial weights; here we will use queen contiguity:
df = gdf
wq = lps.weights.Queen.from_dataframe(df)
wq.transform = 'r'
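To build intuition for what queen contiguity means, here is a toy sketch (not the libpysal implementation) that derives queen neighbors on a regular 3x3 grid: two cells are neighbors if their row and column indices each differ by at most 1, so shared edges and shared corners both count.

```python
def queen_neighbors(n_rows, n_cols):
    # map grid positions to integer ids, row-major
    ids = {(r, c): r * n_cols + c for r in range(n_rows) for c in range(n_cols)}
    nbrs = {i: [] for i in ids.values()}
    for (r, c), i in ids.items():
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                # any of the 8 surrounding cells that exists is a queen neighbor
                if (dr, dc) != (0, 0) and (r + dr, c + dc) in ids:
                    nbrs[i].append(ids[(r + dr, c + dc)])
    return nbrs

nbrs = queen_neighbors(3, 3)
print(sorted(nbrs[4]))  # centre cell touches all 8 others: [0, 1, 2, 3, 5, 6, 7, 8]
print(sorted(nbrs[0]))  # corner cell has 3 neighbours: [1, 3, 4]
```

Queen.from_dataframe does the analogous computation from polygon geometries, treating polygons that share an edge or only a corner alike.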
2.2 Attribute similarity
The spatial weight w_{i,j} between neighborhoods i and j indexes whether the two are neighbors (i.e., geographically similar). We also need a measure of attribute similarity to pair up with this concept of spatial similarity. The spatial lag is a derived variable that accomplishes this for us. For neighborhood i, the spatial lag is defined as:

ylag_i = \sum_j w_{i,j} y_j
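Since our weights are row-standardized, the spatial lag of a unit is simply the weighted average of its neighbors' values. A minimal numpy illustration on three hypothetical areas (not the Berlin data), where area 0 neighbors areas 1 and 2:

```python
import numpy as np

W = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)   # binary contiguity
W = W / W.sum(axis=1, keepdims=True)     # row-standardize ('r' transform)
y_toy = np.array([10.0, 20.0, 30.0])

ylag_toy = W @ y_toy                     # each row averages the neighbors' values
print(ylag_toy)  # [25. 10. 10.]
```

lps.weights.lag_spatial(wq, y) performs exactly this weighted-average computation using the sparse weights object.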
y = df['median_pri']
ylag = lps.weights.lag_spatial(wq, y)
ylag
array([56.9625061, 60.28251648, 56.37749926, 51.22200394, 51.22200394, 50.52180099, 43.6824646, 45.63422012, 52.65491422, 60.28251648, 53.64180374, 52.73586273, 52.73586273, 56.47182541, 47.83247757, 58.58870177, 60.33520317, 59.60296903, 60.38788986, 60.02159348, 51.80624199, 57.94034958, 52.84482813, 53.40314266, 57.90522512, 60.28251648, 60.28251648, 55.79730334, 56.79401737, 50.81182589, 59.01427841, 60.29756982, 60.28251648, 50.86356888, 60.3220315, 60.28251648, 55.48057556, 54.42881557, 60.32466583, 59.50179418, 54.42846909, 58.55640793, 58.55640793, 57.73426285, 57.47818544, 57.74774106, 56.13040733, 48.23656082, 48.23656082, 53.74621709, 55.11957245, 45.95951271, 51.67650986, 54.1985906, 51.45368042, 52.36880302, 54.44568253, 54.44568253, 50.84825389, 56.50104523, 53.92108345, 55.9956289, 50.49590378, 49.14499828, 48.61369433, 49.70049, 49.32550866, 51.22200394, 51.22200394, 47.80509822, 49.70049, 51.22200394, 45.13594818, 47.57037048, 51.22200394, 51.22200394, 51.22200394, 51.22200394, 49.60257785, 51.57007762, 51.42743301, 51.22200394, 51.22200394, 52.43339348, 49.41551208, 51.58891296, 44.58427048, 51.58891296, 51.42743301, 49.82624902, 48.947686, 48.40726217, 45.95951271, 47.57037048, 43.6824646, 47.02354965, 45.95951271, 58.55640793, 56.30865606, 58.09966066, 47.34497997, 46.40236028, 58.05298805, 59.24321365, 58.55640793, 47.83247757, 49.49497332, 50.74955784, 48.6149381, 55.97644615, 57.95624052, 57.87081385, 58.75619634, 60.37283652, 48.23656082, 49.389711, 54.00091705, 54.26036358, 57.54238828, 55.61980756, 51.97116137, 48.92101212, 50.97179985, 54.07504463, 47.45824547, 49.42017746, 45.13594818, 45.13594818, 48.61369433, 49.41551208, 51.22200394, 50.20766131, 48.72533471, 54.24675369, 54.24675369, 54.24675369, 53.23850377, 56.1851902, 49.23337746, 43.6824646 ])
import mapclassify as mc
ylagq5 = mc.Quantiles(ylag, k=5)
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=ylagq5.yb).plot(column='cl', categorical=True,
        k=5, cmap='GnBu', linewidth=0.1, ax=ax,
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.title("Spatial Lag Median Price (Quintiles)")
plt.show()
The quintile map of the spatially lagged variable tends to enhance the impression of value similarity in space; it is, in effect, a local smoother.
df['lag_median_pri'] = ylag
f, ax = plt.subplots(1, 2, figsize=(2.16*4, 4))
df.plot(column='median_pri', ax=ax[0], edgecolor='k', scheme="quantiles", k=5, cmap='GnBu')
ax[0].axis(df.total_bounds[np.asarray([0, 2, 1, 3])])
ax[0].set_title("Price")
df.plot(column='lag_median_pri', ax=ax[1], edgecolor='k', scheme='quantiles', cmap='GnBu', k=5)
ax[1].axis(df.total_bounds[np.asarray([0, 2, 1, 3])])
ax[1].set_title("Spatial Lag Price")
ax[0].axis('off')
ax[1].axis('off')
plt.show()
However, we still face the challenge of visually associating the price value of a focal unit with the spatial lag values of its neighboring units; the latter is a weighted average of the listing prices in the neighborhood of each focal unit. To complement the geovisualization of these associations, we can turn to formal statistical measures of spatial autocorrelation.
3. Global spatial autocorrelation
We begin with a simple case where the variable under consideration is binary. This is useful for unpacking the logic of spatial autocorrelation tests. So, even though our attribute is continuously valued, we convert it to a binary case to illustrate the key concepts:
3.1 Binary case
y.median()
53.704407
yb = y > y.median()
sum(yb)
68
We have 68 neighborhoods with listing prices above the median and 70 below it (note the issue of ties).
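The strict inequality is why the split can be uneven: values exactly equal to the median fall in the "low" group. A tiny hypothetical illustration:

```python
import pandas as pd

y_toy = pd.Series([1.0, 2.0, 2.0, 3.0])
print(y_toy.median())                   # 2.0
print((y_toy > y_toy.median()).sum())   # only 1 of the 4 values is strictly above
```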
yb = y > y.median()
labels = ["0 Low", "1 High"]
yb = [labels[i] for i in 1*yb]
df['yb'] = yb
The spatial distribution of the binary variable immediately raises questions about the juxtaposition of the “black” and “white” areas.
fig, ax = plt.subplots(figsize=(12, 10), subplot_kw={'aspect': 'equal'})
df.plot(column='yb', cmap='binary', edgecolor='grey', legend=True, ax=ax)
<Axes: >
3.2 Join counts
One way to formalize a test for spatial autocorrelation in a binary attribute is to consider so-called _joins_. A join exists for each neighbor pair of observations, and the joins are reflected in our binary spatial weights object wq.

Each unit can take on one of two values, “black” or “white”, so for a given pair of neighboring locations there are three different types of joins that can arise:

- Black Black (BB)
- White White (WW)
- Black White (or White Black) (BW)

Given that we have 68 black polygons on our map, what is the number of black-black (BB) joins we could expect if the process generating the spatial distribution of the black polygons were completely random? This is the logic of the join count statistic.

We can use the esda package in PySAL to carry out join count analysis:
import esda
yb = 1 * (y > y.median())  # convert back to binary
wq = lps.weights.Queen.from_dataframe(df)
wq.transform = 'b'
np.random.seed(12345)
jc = esda.join_counts.Join_Counts(yb, wq)
The resulting object stores the observed counts for the different types of joins:
jc.bb
164.0
jc.ww
149.0
jc.bw
73.0
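The counting logic behind these attributes can be sketched by hand. Here is a toy four-area chain (hypothetical, not the Berlin map and not the esda implementation), where each undirected neighbor pair contributes one join:

```python
# a simple chain of four areas: 0-1-2-3
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
yb_toy = [1, 1, 0, 0]   # black = 1, white = 0

# each undirected neighbor pair is one join
pairs = {tuple(sorted((i, j))) for i, js in neighbors.items() for j in js}
bb = sum(yb_toy[i] == 1 and yb_toy[j] == 1 for i, j in pairs)
ww = sum(yb_toy[i] == 0 and yb_toy[j] == 0 for i, j in pairs)
bw = len(pairs) - bb - ww
print(bb, ww, bw)  # 1 1 1: the three join types exhaust the three pairs
```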
Note that these three scenarios exhaust all possibilities:
jc.bb + jc.ww + jc.bw
386.0
and
wq.s0/2
386.0
This is the total number of joins in the spatial weights object.
The object tells us that 164 BB joins were observed:
jc.bb
164.0
The key question for us is whether this count departs from what we would expect if the process generating the spatial distribution of the black polygons were completely random. To answer this, PySAL uses random spatial permutations of the observed attribute values to generate realizations under the null of complete spatial randomness (CSR). This is repeated a large number of times (999 by default) to construct a reference distribution against which the statistical significance of our observed counts is evaluated.
The average number of BB joins in the synthetic realizations is:
jc.mean_bb
90.70170170170171
This is less than our observed count. The question is whether our observed value is so far away from the expectation that we would reject the null of CSR.
import seaborn as sbn
sbn.kdeplot(jc.sim_bb, shade=True)
plt.vlines(jc.bb, 0, 0.075, color='r')
plt.vlines(jc.mean_bb, 0, 0.075)
plt.xlabel('BB Counts')
Text(0.5, 0, 'BB Counts')
The density portrays the distribution of the BB counts, with the black vertical line indicating the mean BB count from the synthetic realizations and the red line the BB count we observed for our prices. Clearly the value we observed lies in the far upper tail. The pseudo p-value summarizes this:
jc.p_sim_bb
0.001
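The pseudo p-value is computed as (the number of synthetic values at least as large as the observed value, plus one) divided by (the number of permutations, plus one). A generic sketch of that logic, using a made-up statistic on a toy chain (an illustration, not the esda implementation):

```python
import numpy as np

def pseudo_p(y, statistic, rng, permutations=999):
    # observed value of the statistic on the actual arrangement
    observed = statistic(y)
    # recompute the statistic on random shufflings of the same values
    sims = np.array([statistic(rng.permutation(y)) for _ in range(permutations)])
    larger = (sims >= observed).sum()
    return (larger + 1) / (permutations + 1)

# toy statistic: number of identical adjacent pairs along a chain
stat = lambda v: sum(int(v[i] == v[i + 1]) for i in range(len(v) - 1))
rng = np.random.default_rng(12345)
p = pseudo_p(np.array([1, 1, 1, 0, 0, 0]), stat, rng)
print(p)  # roughly 0.1: fully blocked arrangements are uncommon but not extreme
```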
Since this is below conventional significance levels, we reject the null of complete spatial randomness in favor of spatial autocorrelation in market prices.
3.3 Continuous case
Join count analysis is based on a binary attribute, which can cover many interesting empirical applications where one is interested in presence and absence type phenomena. In our case, we artificially created the binary variable, and in the process discarded much of the information in our original continuous attribute. Returning to the original variable, we can explore other tests for spatial autocorrelation for the continuous case.
First, we transform our weights from their current binary state to row standardization:
wq.transform = 'r'
y = df['median_pri']
Moran’s I is a test for global autocorrelation for a continuous attribute:
np.random.seed(12345)
mi = esda.moran.Moran(y, wq)
mi.I
0.6563069331329718
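Moran's I can be written as I = (n / S0) * (z' W z) / (z' z), where z holds the deviations of y from its mean and S0 is the sum of all weights. A toy numpy check on a hypothetical four-area chain (not the Berlin data):

```python
import numpy as np

# binary contiguity for a 4-area chain: 0-1-2-3
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y_toy = np.array([2.0, 4.0, 6.0, 8.0])   # trending values, so autocorrelated

z = y_toy - y_toy.mean()
I = (len(y_toy) / W.sum()) * (z @ W @ z) / (z @ z)
print(I)  # 0.333...: positive, since similar values sit next to each other
```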
Again, our observed statistic needs to be interpreted against a reference distribution under the null of CSR. PySAL uses an approach similar to the one we saw in the join count analysis: random spatial permutations.
import seaborn as sbn
sbn.kdeplot(mi.sim, shade=True)
plt.vlines(mi.I, 0, 1, color='r')
plt.vlines(mi.EI, 0, 1)
plt.xlabel("Moran's I")
Text(0.5, 0, "Moran's I")
Here, once again, our observed value lies in the upper tail; although visually it does not look as extreme as in the binary case, it is still statistically significant:
mi.p_sim
0.001
4. Local autocorrelation: hot spots, cold spots and spatial outliers
In addition to the global autocorrelation statistics, PySAL has many local autocorrelation statistics. Let's compute a local Moran statistic for the same data:
np.random.seed(12345)
import esda
wq.transform = 'r'
lag_price = lps.weights.lag_spatial(wq, df['median_pri'])
price = df['median_pri']
b, a = np.polyfit(price, lag_price, 1)
f, ax = plt.subplots(1, figsize=(9, 9))
plt.plot(price, lag_price, '.', color='firebrick')
# dashed vertical line at the mean of price
plt.vlines(price.mean(), lag_price.min(), lag_price.max(), linestyle='--')
# dashed horizontal line at the mean of the lagged price
plt.hlines(lag_price.mean(), price.min(), price.max(), linestyle='--')
# red line of best fit using global I as slope
plt.plot(price, a + b*price, 'r')
plt.title('Moran Scatterplot')
plt.ylabel('Spatial Lag of Price')
plt.xlabel('Price')
plt.show()
Now, instead of a single I statistic, we have an array of local I_i statistics. The statistics are stored in the .Is attribute, and the simulated p-values in p_sim.
li = esda.moran.Moran_Local(y, wq)
li.q
array([1, 1, 1, 3, 3, 4, 3, 3, 4, 1, 1, 3, 3, 1, 3, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 1, 1, 3, 3, 1, 3, 3, 1, 1, 4, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 3, 3, 1, 1, 1, 3, 3, 4, 3, 1, 1, 1, 1, 1, 3, 3, 1, 1, 1, 1, 4, 3, 4, 1, 3, 3, 3, 3, 3, 4, 3, 3, 4, 1, 1, 1, 1, 2, 3, 3])
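The codes in li.q refer to the quadrants of the Moran scatterplot: 1 = high value with high lag (HH), 2 = low with high (LH), 3 = low with low (LL), 4 = high with low (HL). A minimal sketch of that classification rule (an illustration of the convention, not the esda code):

```python
def quadrant(z, zlag):
    """Moran scatterplot quadrant for one observation.

    z, zlag: deviations of the value and its spatial lag from their means.
    """
    if z > 0:
        return 1 if zlag > 0 else 4   # HH hot spot, or HL "diamond"
    return 2 if zlag > 0 else 3       # LH "doughnut", or LL cold spot

print(quadrant(1.0, 1.0))    # 1: high value surrounded by high values
print(quadrant(-1.0, -1.0))  # 3: low value surrounded by low values
```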
We can again use permutations to test for local clustering, but here conditional random permutations are used (each focal location gets its own reference distribution):
(li.p_sim < 0.05).sum()
68
We can distinguish the specific types of local spatial association reflected in the four quadrants of the Moran scatterplot above:
sig = li.p_sim < 0.05
hotspot = sig * li.q == 1
coldspot = sig * li.q == 3
doughnut = sig * li.q == 2
diamond = sig * li.q == 4
spots = ['n.sig.', 'hot spot']
labels = [spots[i] for i in hotspot*1]
from matplotlib import colors
hmap = colors.ListedColormap(['red', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True,
        k=2, cmap=hmap, linewidth=0.1, ax=ax,
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()
spots = ['n.sig.', 'cold spot']
labels = [spots[i] for i in coldspot*1]
from matplotlib import colors
hmap = colors.ListedColormap(['blue', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True,
        k=2, cmap=hmap, linewidth=0.1, ax=ax,
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()
spots = ['n.sig.', 'doughnut']
labels = [spots[i] for i in doughnut*1]
from matplotlib import colors
hmap = colors.ListedColormap(['lightblue', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True,
        k=2, cmap=hmap, linewidth=0.1, ax=ax,
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()
spots = ['n.sig.', 'diamond']
labels = [spots[i] for i in diamond*1]
from matplotlib import colors
hmap = colors.ListedColormap(['pink', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True,
        k=2, cmap=hmap, linewidth=0.1, ax=ax,
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()
sig = 1 * (li.p_sim < 0.05)
hotspot = 1 * (sig * li.q == 1)
coldspot = 3 * (sig * li.q == 3)
doughnut = 2 * (sig * li.q == 2)
diamond = 4 * (sig * li.q == 4)
spots = hotspot + coldspot + doughnut + diamond
spots
array([1, 1, 0, 0, 0, 0, 3, 3, 0, 1, 0, 3, 3, 0, 3, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 3, 3, 0, 3, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 3, 0, 0, 3, 0, 3, 3, 3, 3, 3, 3, 1, 0, 1, 3, 3, 1, 1, 1, 3, 0, 0, 3, 0, 1, 1, 1, 1, 0, 3, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 3, 3, 3, 0, 0, 0, 4, 0, 0, 0, 0, 0, 3, 3])
spot_labels = ['0 ns', '1 hot spot', '2 doughnut', '3 cold spot', '4 diamond']
labels = [spot_labels[i] for i in spots]
from matplotlib import colors
hmap = colors.ListedColormap(['lightgrey', 'red', 'lightblue', 'blue', 'pink'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True,
        k=2, cmap=hmap, linewidth=0.1, ax=ax,
        edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()