Methodologies for matching datasets on the basis of individual records are well developed.
Analysis of social data, however, often requires combining information from datasets observed at different spatial levels, which is sometimes called the change of support problem. The mismatch can arise for confidentiality reasons, because boundaries change through time, or because jurisdictions differ. Such matching will increasingly be carried out on spatial metrics rather than on individual records, which raises new methodological questions about managing spatial misalignment and the scale of support. Statistical tools, sometimes described as “data analytics”, are fundamental for dealing with a range of issues including data quality checking, linking, statistical modelling of relationships, inference, visualisation and quantification of uncertainty.
Spatial statistics for linked urban data
The statistical analysis of spatial data has advanced considerably over the past twenty years. We explored whether techniques and methodologies being developed in related fields such as climate science and environmental epidemiology (for example downscaling, upscaling, spatio-temporal modelling and uncertainty quantification) can be applied to the specific problems that arise from linking administrative, business and other records in social science research in the big data era. Specifically, we started by considering how count data, such as hospital admissions or the number of crimes, recorded at one spatial level (for example intermediate geographies or districts) can be upscaled or downscaled to data zones or communities. This involved transferring and adapting existing statistical concepts, and also developing new concepts and tools that address the issues raised by the complex data structures created by data linkage. This research component linked to, and supported, many of the specific social science research projects being undertaken by the UBDC.
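As a rough illustration of what downscaling counts between spatial levels can involve, the sketch below uses simple areal weighting, one of the most basic approaches and not necessarily the method used in this research. It assumes events are spread uniformly within each source zone and that the overlap areas between source and target zones are already known. The function name `areal_weighted_counts`, the overlap matrix and the admission counts are all hypothetical, made up for illustration.

```python
import numpy as np

def areal_weighted_counts(source_counts, overlap_area):
    """Reallocate counts from source zones to target zones by areal weighting.

    source_counts: length-S array of counts, one per source zone.
    overlap_area:  S x T array; entry [i, j] is the area shared by source
                   zone i and target zone j.
    Assumes events are uniformly distributed within each source zone and
    that each source zone is fully covered by the target zones.
    """
    source_counts = np.asarray(source_counts, dtype=float)
    overlap_area = np.asarray(overlap_area, dtype=float)
    source_area = overlap_area.sum(axis=1, keepdims=True)
    if np.any(source_area <= 0):
        raise ValueError("every source zone must overlap at least one target zone")
    # Share of each source zone's area that falls in each target zone.
    weights = overlap_area / source_area
    # Each target zone receives its area share of every source count.
    return source_counts @ weights

# Made-up example: 2 districts split across 4 data zones.
overlap = np.array([[2.0, 1.0, 0.0, 0.0],
                    [0.0, 1.0, 2.0, 1.0]])
admissions = np.array([30, 40])  # hypothetical counts per district
data_zone_estimates = areal_weighted_counts(admissions, overlap)
print(data_zone_estimates)        # estimated counts per data zone
print(data_zone_estimates.sum())  # total is preserved by construction
```

Areal weighting preserves the overall total but ignores how population or risk is actually distributed within a zone, which is one reason model-based approaches that also quantify uncertainty are of interest.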
Uncertainty modelling and visualisation
A further key issue is the modelling and representation of uncertainty, which is central to weighing the evidence for effects of interest. Accounting for uncertainty through statistical modelling adds essential value to any scientific result. Innovative methods of display also need to be developed in order to communicate the results of complex models to the wider community of social scientists and the general public, where detailed understanding of quantitative methods and models cannot be assumed. The development of statistical linkage models for spatial and other attribute data that can draw meaningful inferences from large, heterogeneous data sets is essential, but accounting for uncertainty in the linked data sets, in the model parameters, and in the choice of statistical model itself is challenging. As the complexity and diversity of linked data sets and sources grow, new techniques are needed to represent and communicate these complex uncertainties. We have developed interactive online applications that allow users to visualise spatial and spatio-temporal areal data, model relationships between variables, upload their own datasets, and use the apps’ interactive data-visualisation techniques to explore model choices.