Rebeccmeister@lj mentioned this study of cities and the relationship between their economic output and their walkability. I've got some interest in urban planning, and the study seemed to have plausible and well-presented statistics, so I thought I'd dig into it. Which I did, and found that – as expected – the answer isn't as clear-cut as the study's authors would have you believe. So, I was gonna write that up... and it occurred to me that only a handful of my readers could make any sense of what I was writing about, so I wrote a simple introduction to the statistical method the study uses.
The result is kinda long, so I've broken it up into sections.
The study wants to show that making cities easy to walk around in makes them wealthier. They start by reciting what's already known about cities:
Many studies have shown the causal link between increased education of an individual or metropolitan area and increased economic performance on a per capita gross domestic product (GDP) basis, metropolitan GDP, and U.S. GDP.
The Milken Institute, for example, released a study on the GDP performance of 261 U.S. metros in January 2013 that concludes, “the overall explanatory power of the relationship [between higher education and GDP per capita] is strong and robust.” It finds “over 70 percent of the variation in real GDP per capita across the 261 metros from 1990 to 2010 is explained [by higher education attainment].”10 This causal connection underpins the same conclusions in Enrico Moretti’s book, The New Geography of Jobs.
This study also shows significant correlation between higher education, measured by the percentage of population aged 25 years or older with a college degree, and metropolitan GDP per capita.11
[...]

Given the relationship between educational attainment and walkable urbanism, and the relationship between educational attainment and per capita GDP, it is not surprising that walkable urbanism and per capita GDP are also positively correlated.13 [...]
This research does not indicate whether walkable urbanism causes highly educated persons to move or stay in metro areas, or whether metro areas become more walkable urban because of higher-educated persons. Previous research suggests, though, that educated people prefer walkable urban places but does not indicate the causal connection. [...]

Up to this point, I really like this study: not only did they say what they found, they also provided all the figures of merit: not just R² -- how much of the variability the predictor explains -- but also the equation for the predictor. However, the study starts to go downhill here:
Using both educational attainment and walkable urbanism together in a multiple regression analysis explains 66 percent of the variation in per capita GDP among the 30 largest metros. This correlation is only slightly stronger than the correlation between educational attainment and per capita GDP. This finding suggests that walkable urbanism’s positive correlation with per capita GDP may be due to its association with educated people. At the very least, though, these relationships establish that metro areas with wealthy, educated residents tend to be walkable.

Well, that's honest, but could we see the regression equation for both factors together? Saying it explains just a little more of the variation is all well and good, but how much do the two factors contribute to each city's GDP per capita? Do they contribute equally, or is one much more important than the other?
And then the authors start to cherry-pick their data:
Metro Dallas and Houston seem to be outliers in this analysis, with moderate and tentative walkable urbanism but high GDP per capita. As expected, excluding Dallas and Houston from the analysis results in an even stronger correlation between walkable urbanism, higher education, and GDP per capita.15 If they are excluded, the R² value for the correlation between walkable urbanism and educational attainment increases from 0.62 to 0.66. For the correlation between walkable urbanism and per capita GDP, it increases to 0.69.

...and footnote 15 says:
15. These two metros differ from their counterparts in several ways: 1) they have high GDP per capita income ($58,900, compared to the high walkable urban metros of $60,668); 2) their economies, especially Houston, uniquely rely on oil and gas industries; and 3) they rank moderate to tentative on walkable urbanism.
...and they stop talking about the combined effect entirely.
Cherry-picking is an admission of defeat: they're coming out and saying that their model of what makes a productive city is incomplete. In fact, the way they did it is just sad: point 1) essentially admits that their prediction fails for those cities, so they're just gonna ignore them. It also raises the question of what other factors might be at work in determining urban productivity, which, if included, might better explain the effect seen. For example: location (New York, Chicago, and St. Louis sit astride major trade routes), urban geography (Manhattan is an island), climate, historical effects (Ellis Island was once a major immigrant gateway), etc. And of course, DC is the capital, and the US has only one of those.
The last one is what jumped out at me when I looked at the plots: DC is the most walkable and best-educated city. It's also relatively far from the rest of the points, which means it could be having an unusually strong effect on the regression model.
Looking at the data itself (immediately below), DC really sticks out: where the walkability of the other cities listed ranges from 5%-38%, it's at 43%. The range for college degrees is 22%-43%... except DC, which is at 48%. Something is strange about it. It might be because as the nation's capital, it's a showpiece city, and was designed to look good, or maybe it's because those who want to influence the nation's policies tend to hire the best and brightest to do the intellectual heavy lifting they can't.

(and my copy of the data for those who want to play with it themselves.)
Since DC might be having undue influence on the regression models, I should say a bit about what regression is and why DC might be biasing the results. Regression attempts to explain the variability seen in one bunch of data in terms of a linear relationship between that data and one or more other bunches of data. In this case, that means explaining an urban area's GDP per capita in terms of walkability and/or education. The relationship between those bunches of data is only a representation of what's happening, so it's called a "model" of the data.
So, how do we make a regression model? We try to fit a line through the two or more sets of data such that it minimizes the distance between the line and the data points, according to some rules. The rules themselves mostly deal with residuals, which are the y-axis (vertical) distances between data points and the line:

The purple lines are the residuals for the points and model line.
The rules:
1: The line is drawn (calculated) such that it minimizes the sum of the squared residuals.
2: The residuals are assumed to be normally distributed around the line. That is, the set of residuals has a "bell-shaped" distribution.
3: No point has any unusual leverage on the set.
The last one is more easily drawn than defined, as it concerns points far on the x-axis (horizontal distance) from the rest of them:

(Both this figure and the last one are worked up from Denise Ferrari's Introduction to Regression in R Part I : Simple Linear Regression (pdf). It's a good introduction to regression models, but her discussion is mixed with examples and instructions for using the R statistics language for regression, and general readers may find it confusing and/or uninteresting as a result.)
Ferrari shows leverage points as "good" and "bad", but rarely are the cases as clear-cut as her (rigged) examples show. I tend to go all empirical on leverage points: if the regression software complains about one, drop it from the set and see what happens.
The same goes for sets of residuals that are far from normally-distributed, but usually I find it easier just to look at the distribution, see if it's more-or-less peaked in the middle and symmetric, and call that close enough to normal. Finding data sets that have strictly normally-distributed residuals, especially with small sets like the one we're looking at, is hard, and stat geeks usually allow some wiggle room. (There are formal tests for normality for when it matters. Since I'm not publishing this, I'm not gonna worry too much about it.)
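(For readers who'd rather see the mechanics than prose, here's a minimal sketch in Python of what "residual" and "leverage" mean computationally. The numbers are made-up toy data, not the walkability set; the lone point out at x = 12 plays the part of a leverage point.)

```python
import numpy as np

# Toy data: one predictor x and one response y (made-up numbers).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 12.0])   # note the point far to the right
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 13.5])

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])

# Rule 1: pick the line that minimizes the sum of squared residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
residuals = y - fitted            # the vertical distances to the line

# Rule 3: leverage is the diagonal of the "hat" matrix H = X (X'X)^-1 X'.
# A point whose leverage is much larger than everyone else's can drag
# the whole line toward itself.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

print("intercept, slope:", np.round(beta, 3))
print("residuals:", np.round(residuals, 2))
print("leverage:", np.round(leverage, 2))
```

The x = 12 point ends up with far higher leverage than the rest, which is exactly the kind of thing the "unusual observations" warnings in the output below are flagging.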
So, let's start looking at the nitty-gritty of the urban walkability data and what linear regression tells us about it. Since we know that productivity (GDP per capita) already correlates with education (percentage of people with a college degree) we'll start there.
I dropped the data into Minitab, a stats program, simply 'cause I knew it could crank out suitable information for this post. R, or any other statistics tool, will work just as well.
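(If you'd rather follow along in something other than Minitab, here's roughly what the same model looks like in Python with pandas and statsmodels. It's a sketch: the file name and the whitespace-delimited layout are guesses about how you've saved my copy of the data, and the column names are the same short ones I used in Minitab.)

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed file name and layout: a whitespace-delimited table with
# columns gdpc, degree, and walk.
df = pd.read_csv("walking+money+edu.dat", sep=r"\s+")

# GDP per capita explained by college-degree percentage alone.
edu_only = smf.ols("gdpc ~ degree", data=df).fit()
print(edu_only.summary())   # coefficients, P values, R-squared, and so on

# Adding walkability later is a one-word change to the formula:
# smf.ols("gdpc ~ degree + walk", data=df).fit()
```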
First, the text output of the model, then some pretty pictures of the residuals.
Regression Analysis: gdpc versus degree
The regression equation is
gdpc = 7604 + 1282 degree
This section confirms what regression model we're running, and what the model is. In this case, urban GDP per capita has a base of $7,604 and rises by $1,282 for every percentage point of the population that has a college degree. For example, a city where 29 percent of residents have a degree gets a predicted GDP per capita of about 7,604 + 1,282 × 29 ≈ $44,800. (I used short names for all the calculations, so "gdpc" is GDP per capita, "degree" is percentage of people in the area with a degree, and "walk" is percentage of an urban area's offices and stores that are considered walkable. [Which is why I used short names...])
Predictor Coef SE Coef T P
Constant 7604 6388 1.19 0.244
degree 1281.8 191.9 6.68 0.000
S = 5924.95 R-Sq = 61.4% R-Sq(adj) = 60.1%
This part tells us how good the model is. It's a bit obscure, but the key parts are the P value – which kinda sorta tells the probability the result is just due to chance – and R² (R-squared), the fraction of the variability in the data explained by the regression model, along with an adjusted value that accounts for how many variables we're using. That the constant (baseline) term has such a high P value is pretty odd, and I'd have to think about what that means. (I'm not sure I've seen that happen before, and a quick attempt at fitting the model without the constant gave strange results.)
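(For the curious: the "adjusted" value just penalizes R² for the number of predictors, so with 30 cities and one predictor the two numbers Minitab prints are consistent with each other. A quick check:)

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# with n = number of cities and p = number of predictors.
n, p, r2 = 30, 1, 0.614
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2_adj, 3))   # about 0.60, matching Minitab's R-Sq(adj) = 60.1%
```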
Analysis of Variance
Source DF SS MS F P
Regression 1 1565412299 1565412299 44.59 0.000
Residual Error 28 982942368 35105085
Total 29 2548354667
This shows another view of how good the regression is, totaling up how the sum of squares (the SS column) splits between what the regression explains and what's left over in the residuals. (I'm not gonna talk about this section or most of the previous section for any of the other models. Statistically inclined readers can find the whole Minitab session, with all the goodies, here.)
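(A quick sanity check on that table: the regression and residual sums of squares add up to the total, their ratio is the R² reported above, and the S value is just the square root of the residual mean square.)

```python
ss_reg, ss_err, ss_tot = 1565412299, 982942368, 2548354667

print(ss_reg + ss_err == ss_tot)        # True: the split is exact
print(round(ss_reg / ss_tot, 3))        # 0.614, i.e. the R-Sq of 61.4%
print(round((ss_err / 28) ** 0.5, 2))   # 5924.95, i.e. the S value
```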
Unusual Observations
Obs degree gdpc Fit SE Fit Residual St Resid
1 48.0 66400 69130 3112 -2730 -0.54 X
15 29.0 62400 44776 1305 17624 3.05R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.
And finally, a list of weird stuff that might be breaking the assumptions regression relies on. Point ("observation", or “obs”) #1 is Washington, DC, and #15 is Houston (see the data chart above).
As I mentioned in my brief introduction, residuals are what's left after the regression line is subtracted. According to the assumptions of regression, they should be more or less normally distributed, with the familiar “bell-curve” distribution. There also shouldn't be any patterns in the residuals; obvious patterns mean something's wrong with the model. So, let's see what we see here...

The two panels on the left compare the residuals with a normal distribution: in the upper left, normally distributed residuals would lie on the blue line shown (they're not even close) and in the lower left, we'd like to see something that looks like a bell-curve distribution. The two panels on the right show the residuals in two different orders: vs the fitted values (that is, compared to the line the regression model creates) and simply in order of how they appear in the list. Note that one point way up by itself: That's Houston. That's also the point that appears all the way to the right in the left-side plots.
So not only is Houston tugging on the model, but overall, the residuals are way outta whack: we have a small number of cities way above what the model predicts balancing out a bunch just below it. I don't want to say the model is junk, but if I were publishing this data, I'd load it down with caveats.
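(Minitab drew the four-panel plot for me, but it's easy enough to recreate. Here's a rough matplotlib sketch, reusing the edu_only fit from the earlier snippet; the last line is one of those formal normality tests I mentioned above.)

```python
import matplotlib.pyplot as plt
from scipy import stats

resid = edu_only.resid          # residuals from the statsmodels fit above
fitted = edu_only.fittedvalues

fig, ax = plt.subplots(2, 2, figsize=(9, 7))

# Upper left: normal probability plot -- normal residuals hug the line.
stats.probplot(resid, dist="norm", plot=ax[0, 0])

# Lower left: histogram -- we'd like something roughly bell-shaped.
ax[1, 0].hist(resid, bins=10)
ax[1, 0].set_title("Histogram of residuals")

# Upper right: residuals vs fitted values -- no pattern is good news.
ax[0, 1].scatter(fitted, resid)
ax[0, 1].axhline(0)
ax[0, 1].set_title("Residuals vs fits")

# Lower right: residuals in the order the cities appear in the data.
ax[1, 1].plot(range(1, len(resid) + 1), resid, marker="o")
ax[1, 1].axhline(0)
ax[1, 1].set_title("Residuals vs order")

plt.tight_layout()
plt.show()

# A formal normality check, if you want one (small p-value = not very normal):
print(stats.shapiro(resid))
```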
In any case, this isn't the interesting case: we want to see the effect of walkability when it's added to the known effect of education.
Next model:
Regression Analysis: gdpc versus degree, walk
The regression equation is
gdpc = 16487 + 830 degree + 329 walk
S = 5697.58 R-Sq = 65.6% R-Sq(adj) = 63.1%
Unusual Observations
Obs degree gdpc Fit SE Fit Residual St Resid
15 29.0 62400 46162 1470 16238 2.95R
R denotes an observation with a large standardized residual.
There's Houston again, and the residuals are almost as messed up:

I'd still be cautious about publishing this thing, so let's take the authors' suggestion and drop Houston:
Results for: walking+money+edu - houston.dat
Regression Analysis: gdpc versus degree
The regression equation is
gdpc = 4564 + 1356 degree
S = 4931.02 R-Sq = 72.4% R-Sq(adj) = 71.4%
Obs degree gdpc Fit SE Fit Residual St Resid
1 48.0 66400 69635 2593 -3235 -0.77 X
7 34.0 62000 50656 932 11344 2.34R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.
Yup, DC might be causing problems again, and now we have a new problem child: Portland. It's surprisingly productive for its population's education. And the pretty pictures:

Pretty much the same residual problems, though they're somewhat improved.
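(The Minitab runs above and below use separate .dat files with the offending city removed; in pandas the same step is a couple of lines. A sketch, assuming the data frame has a column, here called metro, naming each city:)

```python
# Assumes a "metro" column naming each city in the data frame.
no_houston = df[df["metro"] != "Houston"]

refit = smf.ols("gdpc ~ degree", data=no_houston).fit()
print(refit.params)          # compare with the Minitab coefficients above
print(refit.rsquared_adj)
```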
And adding walkability:
Regression Analysis: gdpc versus degree, walk
The regression equation is
gdpc = 11619 + 1002 degree + 255 walk
S = 4779.85 R-Sq = 75.0% R-Sq(adj) = 73.1%
Unusual Observations
Obs degree gdpc Fit SE Fit Residual St Resid
7 34.0 62000 51287 981 10713 2.29R
24 31.0 55600 44972 1354 10628 2.32R
R denotes an observation with a large standardized residual.

Portland's back, and now Dallas is causing problems. *sigh* My feeling is that there's much more at work than just these two variables, and trying to chase around cities to drop is failing to address the real problem: we're not finding the actual causes (or at least correlates) of urban productivity.
But since DC might be a leverage point, here's just a quick run-through of what we get by dropping it, showing only the combined model of education and walkability together on productivity:
Results for: walking+money+edu - DC.dat
Regression Analysis: gdpc versus degree, walk
The regression equation is
gdpc = 14308 + 889 degree + 354 walk
S = 5726.97 R-Sq = 62.2% R-Sq(adj) = 59.3%
Unusual Observations
Obs degree gdpc Fit SE Fit Residual St Resid
14 29.0 62400 46107 1479 16293 2.94R
R denotes an observation with a large standardized residual.
Houston's still a problem, and the residuals:

...are hardly better.
Out of curiosity, I tried dropping both Houston and DC:
Regression Analysis: gdpc versus degree, walk
The regression equation is
gdpc = 9335 + 1064 degree + 280 walk
S = 4767.94 R-Sq = 72.6% R-Sq(adj) = 70.4%
Unusual Observations
Obs degree gdpc Fit SE Fit Residual St Resid
6 34.0 62000 51661 1039 10339 2.22R
23 31.0 55600 44833 1357 10767 2.36R
R denotes an observation with a large standardized residual.
The residuals here are better and worse:

The distribution is better, but there's still a few outliers messing with the model.
My conclusion is that fiddling with this data any longer isn't gonna help. There are just too many cities that the model fits poorly. We've already dropped two of the thirty cities, and I think we'd have to drop at least two more, maybe three, after looking at the residuals. At that point, we'd have thrown away a sixth of the data, and I think that's an indication we're doing something wrong.
However, to the extent that this model is telling us anything, I don't think the walkability folks are gonna like it. Let's review the four equations we've gotten for what affects productivity:
All the cities: gdpc = 16487 + 830 degree + 329 walk (R-sq(adj) = 63.1%)
Without Houston: gdpc = 11619 + 1002 degree + 255 walk (R-sq(adj) = 73.1%)
Without DC: gdpc = 14308 + 889 degree + 354 walk (R-sq(adj) = 59.3%)
Without both: gdpc = 9335 + 1064 degree + 280 walk (R-sq(adj) = 70.4%)
Both predictors – the share of people with a college degree and the fraction of a city that's walkable – are percentages, so the two are on the same scale, and the coefficients in front of them can be compared directly. That is, increasing the fraction of people in a city with a degree by one percentage point raises productivity by between $830 and $1,064, while increasing walkability by one percentage point raises it by between $255 and $354. At best, then, walkability matters something less than half as much as people having a college education does, and maybe as little as a quarter as much.
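(The back-of-the-envelope version of that comparison, using the four pairs of coefficients listed above:)

```python
# (degree coefficient, walk coefficient) for the four models above.
models = {
    "all cities":      (830, 329),
    "without Houston": (1002, 255),
    "without DC":      (889, 354),
    "without both":    (1064, 280),
}
for name, (deg, walk) in models.items():
    print(f"{name}: walk coefficient is {walk / deg:.0%} of the degree coefficient")
```

The walk coefficient comes out at roughly 25 to 40 percent of the degree coefficient, which is where the "quarter to something less than half" range comes from.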
That low end, oddly, is what we get when we leave out Houston, which the authors seem to imply should go in the other direction: removing the effect of oil income (if it's there) decreases the estimated effect of walkability... or, in other words, if oil matters, it makes walkable cities earn more.¹
But there's a bigger if: we've been predicting productivity on the basis of education and walkability. What if we change the implied causality, and try to figure out what makes cities walkable? I ran only one case – I admit, I was getting tired of playing with this thing – but it looks interesting anyway:
Results for: walking+money+edu - houston+DC.dat
Regression Analysis: walk versus gdpc, degree
The regression equation is
walk = - 25.9 + 0.000410 gdpc + 0.714 degree
S = 5.77545 R-Sq = 60.8% R-Sq(adj) = 57.7%
Unusual Observations
Obs gdpc walk Fit SE Fit Residual St Resid
1 59400 38.00 24.87 1.77 13.13 2.39R
Now, the problem child is New York... and we're explaining only a bit more than half the variability, but there's a curious thing in the residuals:

They're dead on... except now we're obviously missing something, because the residuals vary directly with walkability. What we're seeing, in part, is that walkability can't go below 0%, so as walkability increases, so does the amount of variability in walkability. Fixing that up requires a fancier statistical technique.
So, from a statistical standpoint, trying to explain an urban area's productivity from the education of its citizens and how walkable it is using regression modeling isn't an unreasonable thing to do. Walkability seems to be a relatively small factor compared to education.
However, we've got to be careful about those conclusions, because we're playing fast and loose with the assumptions regression relies on. In a similar way, we have to be careful about implying causal relations among the factors, since it appears we can get OK models by reversing the direction (which makes the models better or worse, depending on what we're looking at). I suspect the real answer to both of these points is to add more predictive variables and switch to more powerful statistical methods.
1: A side note: by putting the four models next to each other, we can see that DC might be a leverage point, but it's not having much effect on the models. However, Houston's presence or absence makes a significant difference in what models we get.