The advent of the big data era has created both opportunities and challenges for the study of urban phenomena. In this brief essay, I formulate some thoughts on the role of big data in stimulating an emerging methodological “interdisciplinarity” that I like to refer to as spatial data science, and on the potential for this spatial data science to enhance our understanding of urban dynamics.
I see spatial data science as a subset of the broader data science (e.g., Schutt and O’Neil 2014), differentiated by dealing explicitly with the role of space (location, spatial arrangement, spatial interaction). In analogy to generic data science, it consists of a combination of the strengths of exploratory spatial data analysis, spatial statistics, and spatial econometrics from a statistical disciplinary perspective, with spatial data mining, spatial database manipulation, and machine learning from a computer science disciplinary perspective. In addition to addressing the need for integration of the purely methodological aspects derived from these disciplines, a major component of data science consists of the dirty work of turning the raw source of data into a form suitable for sophisticated analyses, referred to as data wrangling or data munging. This requires the consideration of appropriate data structures, efficient workflows, the application of clever algorithms, and (typically) high-performance computing. In a spatial context, this also inevitably necessitates leveraging geospatial technologies, such as GIS, GPS, and remote sensing.
Rather than using the term “big data” (which lacks a rigorous definition, see, e.g., Mayer-Schönberger and Cukier 2013, p. 6), I would like to stress the opportunities provided by “new” urban data, some of which technically may not be “big” (enough). I see three major forces that generate important new information for urban analysis:
- The “smart cities” movement, and in particular the ubiquitous presence of sensors that provide location-specific, near-continuous measurement of a range of phenomena (weather conditions, environmental indicators, flow of traffic, etc.)
- The “open data” movement, which provides unprecedented access to a treasure trove of urban administrative data (e.g., crime reports, 311 requests, building permits, energy use), much of which is geo-referenced and available for analysis through efficient application programming interfaces (APIs)
- The “volunteered geographic information” (VGI) movement, which collects unstructured geographic information by means of crowd-sourcing
I consider VGI to be more than the crowd-sourced digitizing efforts through which it is best known (such as OpenStreetMap), and also include under this rubric the indirect geographic information provided by social media messaging services (such as Twitter, Facebook, or LinkedIn), location sharing services (such as Foursquare), and photo sharing sites (such as Flickr), as well as the locational tags in 311 messages, cell phone records, and similar data. Stefanidis et al. (2013, p. 320) refer to this as “ambient geospatial information.”
Besides their sheer size, what makes these new sources of urban information so different from the traditional census and social survey data is the very fine-grained geographical and temporal detail. The data by and large pertain to individuals (or individual actions), are typically in point form (even moving points) rather than aggregated areal units, and are available in near-continuous time.
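The contrast between point-level events and aggregated units can be made concrete with a minimal sketch. It assumes hypothetical events given as (x, y, timestamp) in projected coordinates (meters); the event values, the function names `space_time_bin` and `aggregate`, and the 500 m cell size are illustrative assumptions, not drawn from any particular data source.

```python
from collections import Counter
from datetime import datetime

# Hypothetical point events: (x, y, timestamp). Values are made up for illustration.
events = [
    (120.0, 80.0, datetime(2014, 6, 1, 8, 15)),
    (130.0, 95.0, datetime(2014, 6, 1, 8, 40)),
    (620.0, 410.0, datetime(2014, 6, 1, 8, 55)),
    (125.0, 90.0, datetime(2014, 6, 1, 9, 5)),
]

def space_time_bin(x, y, t, cell_size=500.0):
    """Map a point event to a (column, row, hour) bin on a square grid."""
    hour = t.replace(minute=0, second=0, microsecond=0)
    return (int(x // cell_size), int(y // cell_size), hour)

def aggregate(events, cell_size=500.0):
    """Count events per grid cell and hour."""
    return Counter(space_time_bin(x, y, t, cell_size) for x, y, t in events)

counts = aggregate(events)
for key, n in sorted(counts.items()):
    print(key, n)
```

Because the events are stored as points, the cell size and the time bin are choices made at analysis time and can be changed at will, whereas data released only as areal aggregates fix those choices in advance.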
However, not all of these big data are necessarily useful, and there have by now been several well-argued critiques of their use (and abuse) in social science research (e.g., Ruths and Pfeffer 2014). Social media data in particular have been deemed to fall short in terms of how they represent the underlying population (e.g., the self-selection of demographic groups that participate in social messaging). Furthermore, particular features of a service (e.g., Twitter) can be exploited to manipulate the content of messages for commercial or political purposes. Also, the inclusion (or lack thereof) of administrative information in a city’s open data portal may be subject to local political interference. Similarly, the placement of sensors and decisions on what aspects of the urban environment are sensed are not always without controversy. At a more technical level, the unevenness of precision and uncertainty associated with the locational (and temporal) information derived from GPS units in the various smart devices used in social media and other indirect VGI sources needs to be properly accounted for.
In spite of these deficiencies, I would submit that the new data sources have the potential to help address questions pertaining to the dynamics of urban structure (neighborhood dynamics, movement patterns, etc.) that currently can only be partially tackled by means of the traditional cross-sectional urban census and survey data (e.g., Golder and Macy 2014). In particular, with proper accounting for sample representation, uncertainty, and other potential data deficiencies, they provide new ways to measure, visualize, and analyze the specific spatial and space-time aspects of phenomena ranging from crime, neighborhood decay, shopping patterns, and commuting to the spread through social networks of political sentiment and even “happiness.” Taken together, they begin to provide a way to “endogenize” spatial structure (e.g., neighborhoods) and to allow us to move away from the “container” view provided by administrative spatial units. They also contain the raw materials to start making explicit the complex connections between social and spatial interaction. An example of a creative use of social media data is the Livehoods project at Carnegie Mellon (Crenshaw et al. 2012), which yields neighborhood delineations derived from check-in data for the Foursquare location sharing service. Characteristic of this application is the combination of several techniques, drawing on geographic information, network analysis, and machine learning, rather than the application of a single methodological paradigm. However, so far this work is still mostly descriptive and not connected directly to social science theory.
The emerging spatial data science will provide the overarching methodological framework to allow a closer integration of the data aspect with substantive conceptualization. While “standard” data science does deal with geographic information, it tends to ignore the distinguishing “spatial” characteristics of such data, including spatial dependence and spatial heterogeneity, much like many “mainstream” social science econometric and statistical analyses did in the past (Anselin 1988). While much progress has been made in this respect (Anselin 2010), in order for the new spatial data science to be effective, it will need to be more than a straightforward combination of spatially explicit techniques from statistics, econometrics, and computer science. Three challenges in particular come to mind:
- The issue of scale, in particular the scale through which space-time dynamics are conceptualized and measured (e.g., what is a “distance metric” in space-time?)
- The issue of endogeneity, and specifically how it is reflected in and affects our ability to tease out spatial-social interaction (e.g., resulting in neighborhood dynamics)
- On a more technical level, the need for computational efficiency to deal with large amounts of very fine-grained geographical data in near real-time, which will require a rethinking of many traditional algorithms to scale up to the big data context.
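The question of scale raised in the first challenge can be illustrated with a minimal sketch of one possible space-time distance. It assumes events given as (x, y, t) and a hypothetical `time_scale` parameter (spatial units per time unit) that converts a gap in time into a comparable gap in space. This is only one of many conceivable metrics, and both the function and the parameter are illustrative assumptions.

```python
import math

def space_time_distance(p, q, time_scale=1.0):
    """Euclidean distance between events p and q, each given as (x, y, t).

    time_scale is a hypothetical conversion factor (spatial units per time
    unit) that makes a time gap commensurable with a spatial gap.
    """
    dx = p[0] - q[0]
    dy = p[1] - q[1]
    dt = (p[2] - q[2]) * time_scale
    return math.sqrt(dx * dx + dy * dy + dt * dt)

a = (0.0, 0.0, 0.0)   # reference event
b = (3.0, 4.0, 0.0)   # same time, 5 spatial units away
c = (0.0, 0.0, 12.0)  # same place, 12 time units later

for scale in (1.0, 0.25):
    d_ab = space_time_distance(a, b, time_scale=scale)
    d_ac = space_time_distance(a, c, time_scale=scale)
    nearest = "b" if d_ab < d_ac else "c"
    print(f"time_scale={scale}: d(a,b)={d_ab}, d(a,c)={d_ac}, nearest={nearest}")
```

In this toy example the nearest neighbor of `a` switches from `b` to `c` when `time_scale` drops from 1.0 to 0.25, which shows how a scaling choice that is rarely grounded in theory can drive which events are treated as neighbors.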
Anselin, Luc (1988). Spatial Econometrics: Methods and Models. Dordrecht: Kluwer Academic.
Anselin, Luc (2010). Thirty Years of Spatial Econometrics. Papers in Regional Science 89, 3–25.
Crenshaw, Justin, Raz Schwartz, Jason Hong and Norman Sadeh (2012). The Livehoods project: Utilizing social media to understand the dynamics of a city. Proceedings of the 6th International AAAI Conference on Weblogs and Social Media (ICWSM), Trinity College, Dublin, Ireland, June 4–8, 2012. Menlo Park, CA: AAAI Press.
Golder, Scott A. and Michael W. Macy (2014). Digital footprints: Opportunities and challenges for online social research. Annual Review of Sociology 40, 129–152.
Mayer-Schönberger, Viktor and Kenneth Cukier (2013). Big data: A revolution that will change how we live, work and think. Boston: Eamon Dolan.
Ruths, Derek and Jürgen Pfeffer (2014). Social media for large studies of behavior. Science 346 (6213), 1063–1064.
Schutt, Rachel and Cathy O’Neil (2014). Doing Data Science. Sebastopol: O’Reilly.
Stefanidis, Anthony, Andrew Crooks and Jacek Radzikowski (2013). Harvesting ambient geospatial information from social media feeds. GeoJournal 78, 319–338.