Data sources: registered prevalence and primary care supply
Counts of patients registered with general practices as having CHD, stroke or hypertension in April 2007 (i.e. observed prevalence) were obtained from English general practice disease registers, which are produced to reward practices for achieving QOF treatment targets. These patients are clinically confirmed cases of CVD who are receiving regular follow-up for their disease. QOF prevalence rates are based on the total populations registered with practices. Expected prevalence, however, is estimated from contextual data for resident populations, drawn from the Census and other sources (e.g. deprivation and the proportion of the population from ethnic minorities). To compare observed with expected prevalence, we therefore derived residence-based QOF prevalence estimates for LAs using a lookup table, a pooled extract of England practice registers from the National Strategic Tracing Service (NSTS, now the Personal Demographics Service). The table apportioned practice populations to LA areas as at January 2006 by giving the exact number of each practice's registered population resident in each LA [14].
We apportioned counts of CVD patients registered by practices to LAs in proportion to the share of each practice population resident in each LA. This assumes that CVD prevalence is geographically uniform across a practice population (the average practice population is only about 6,400). We divided the aggregated count of CVD patients in each LA by the total mid-2006 LA population estimate to give estimated crude prevalence, as used for QOF prevalence. Where fewer than 50 patients fell into an LA, the numbers were excluded from the look-up process. Three LAs could not be mapped because of discrepancies between the QOF and NSTS datasets. To investigate the effect of healthcare supply on diagnosis levels, we included a measure of general practitioner availability, the number of general practitioners (GPs) per thousand LA population, calculated in the same way [15].
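The apportionment step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name, the data layout and the example figures are hypothetical, and the treatment of small cells (dropping practice-LA cells below the threshold and sharing cases across the remaining cells) is an assumed reading of the exclusion rule.

```python
from collections import defaultdict


def apportion_prevalence(practice_cases, practice_residents, la_population, min_cell=50):
    """Apportion practice disease counts to LAs and return crude LA prevalence.

    practice_cases:     {practice_id: number of registered cases}
    practice_residents: {practice_id: {la_code: registered patients resident in that LA}}
    la_population:      {la_code: mid-year population estimate}
    min_cell:           practice-LA cells with fewer patients are dropped (assumed rule)
    """
    la_cases = defaultdict(float)
    for practice, cases in practice_cases.items():
        # Keep only practice-LA cells that meet the minimum size.
        cells = {la: n for la, n in practice_residents[practice].items() if n >= min_cell}
        total = sum(cells.values())
        if total == 0:
            continue  # no usable residence information for this practice
        for la, n in cells.items():
            # Uniform-prevalence assumption: cases follow the residence shares.
            la_cases[la] += cases * n / total
    return {la: count / la_population[la] for la, count in la_cases.items()}


# Hypothetical example data (not taken from the study).
practice_cases = {"A": 100, "B": 60}
practice_residents = {"A": {"LA1": 3000, "LA2": 1000}, "B": {"LA2": 2000, "LA3": 40}}
la_population = {"LA1": 10000, "LA2": 20000, "LA3": 15000}

prevalence = apportion_prevalence(practice_cases, practice_residents, la_population)
```

In this example, practice A sends three quarters of its cases to LA1 and one quarter to LA2, while the 40 patients of practice B living in LA3 fall below the threshold, so all of practice B's cases are assigned to LA2.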
Data sources: expected/estimated prevalence
Expected prevalence for each LA was obtained from the APHO epidemiologic models, which are based on the socio-demographic and behavioural characteristics of respondents with the respective conditions in the Health Survey for England (HSfE). To produce the models, HSfE data for the years 2003-4 were pooled (sample size 21,233) to increase the number of disease cases. Surveys over this period included boost samples of ethnic minority and older people, and focused on CVD and its risk factors. Of the respondents included, 14,574 (68.6%) were of White ethnicity, 308 (1.5%) were Mixed, 1,991 (9.4%) were Black or Black British, and 3,725 (17.5%) were Asian or Asian British. The outcome variables were patient-reported doctor-diagnosed CHD and stroke and, for hypertension, the normotensive-treated, hypertensive-treated but uncontrolled, and hypertensive-untreated groups, i.e. a combination of patient-reported and objectively measured variables. Patient reports of doctor-diagnosed CHD and stroke have been extensively validated elsewhere [16–20]. For example, in the British Regional Heart Study, 80 per cent of men with a GP record of angina reported their diagnosis, and 70 per cent of men who reported an angina diagnosis had this confirmed by the record review. The prevalence of diagnosed angina in these older men was 10.1 per cent according to self-reported history and 8.9 per cent according to GP record review [16]. At that time (1999) some cases, such as newly registered patients, may not have had their diagnosis clearly recorded in GP records.
Logistic regression models were fitted, and explanatory variables for each disease outcome were identified by backward stepwise selection.
The baseline odds of each disease were obtained directly from the HSfE dataset. The strength of association between each explanatory variable and disease caseness was then used to calculate the relative odds, which were applied to the baseline odds to derive the prevalence estimates for each sub-group of risk factors. The variables that can be included in each local model are limited by the availability of local data for them from the Census and other national sources. The core model variables are ten-year age band, gender (male and female), ethnicity (Asian/Asian British, Black/Black British, White, Mixed and Other including Chinese) and deprivation (based on Index of Multiple Deprivation 2004 scores) [21]. In the case of CHD and stroke, smoking prevalence is also included, and the stroke model does not include ethnicity. The models use 2006 mid-year quinary age-band population estimates by ethnic group from the UK Office for National Statistics (ONS), which were summed to 10-year age bands to match the model [22]. LAs are stratified into deprivation score bands based on cut-offs of Lower Super Output Area quintiles, the ONS categories used in the HSfE. Internal validation included using the models to predict the response for each subject in the source data and calculating the area under the receiver operating characteristic (AUROC) curve. AUROC values were 0.834, 0.844, and 0.807 for the stroke, CHD and hypertension models respectively. External validation showed that prevalence gradients derived from the models, for example with age and smoking status, agree well with published results from population-based studies.
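The step from baseline odds and relative odds to sub-group prevalence can be sketched as below. This is an illustrative sketch only: the function names, the baseline odds and the odds ratios are hypothetical values chosen for easy arithmetic, not coefficients from the APHO models.

```python
def prevalence_from_odds(baseline_odds, relative_odds):
    """Multiply the baseline odds by each relative odds, then convert odds to a proportion."""
    odds = baseline_odds
    for ratio in relative_odds:
        odds *= ratio
    return odds / (1.0 + odds)


def expected_cases(baseline_odds, strata):
    """Sum expected cases over sub-groups.

    strata: list of (population, [relative odds that apply to that sub-group])
    """
    return sum(pop * prevalence_from_odds(baseline_odds, ratios) for pop, ratios in strata)


# Hypothetical LA with two sub-groups (values are illustrative, not model output).
baseline_odds = 0.25                      # odds in the reference sub-group
strata = [
    (1000, []),                           # reference group: no relative odds applied
    (500, [2.0, 2.0]),                    # e.g. an older age band and a more deprived band
]

la_expected = expected_cases(baseline_odds, strata)
la_expected_prevalence = la_expected / sum(pop for pop, _ in strata)
```

Here the reference group has prevalence 0.25 / 1.25 = 0.2 and the second group has odds of 1.0, i.e. prevalence 0.5, giving 450 expected cases in a population of 1,500.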
Smoking status is included in the CHD and stroke models. Local smoking prevalence estimates are not available from the HSfE because of small sample sizes, so the CHD model uses synthetic estimates from the Neighbourhood Statistics website, which are for the period 2003-2005 [23]. The models assume that the proportions of smokers, ex-smokers and never-smokers are uniform across ethnic categories and that the proportion of ex-smokers in each age-sex group is constant across areas. Sensitivity analysis has shown that varying the smoking prevalence has a very small effect on estimated disease prevalence. Further technical details of the models are available in additional files 1 (CHD), 2 (hypertension) and 3 (stroke) in the web appendix, and also on the APHO website [13].
Spatial analyses
Observed:expected (O:E) prevalence ratios for LAs were calculated in Excel 2007 and mapped using the geographic information systems package ArcGIS 9. Two exploratory spatial data analysis methods commonly used in geographical studies were used to investigate patterns in O:E relationships: Local Moran's I (LMI) analysis and geographically weighted regression (GWR). The LMI technique identifies geographic clusters and outliers by testing for randomness in the spatial distribution of a dataset; localities with significance scores (Z scores) of more than two standard deviations in either direction are considered to be either clusters or outliers [24]. A strongly positive Z score indicates statistically significant similar values in close geographic proximity, and hence a cluster; a strongly negative Z score indicates a locality whose value is significantly dissimilar to those of its neighbouring localities, and hence an outlier.
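As an illustration of the LMI statistic itself, the sketch below computes O:E ratios and the local Moran's I value for each area using NumPy. It assumes a row-standardised binary contiguity matrix and the usual form of the statistic; the areas, counts and weights are hypothetical, and the significance step (the Z scores, which need an analytic variance or a permutation test) is deliberately omitted. It is not the ArcGIS procedure used in the study.

```python
import numpy as np


def local_morans_i(values, weights):
    """Local Moran's I for each area.

    values:  length-n sequence, e.g. O:E prevalence ratios
    weights: n x n spatial weights with a zero diagonal; every area needs
             at least one neighbour so that its row sum is non-zero
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum(axis=1, keepdims=True)   # row-standardise the weights
    z = x - x.mean()                       # deviations from the overall mean
    m2 = np.mean(z ** 2)                   # second moment of the deviations
    return z * (w @ z) / m2                # own deviation times neighbours' mean deviation


# Hypothetical example: four areas in a row, each adjacent to the next.
chain = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
observed = np.array([10.0, 10.0, 50.0, 50.0])
expected = np.array([10.0, 10.0, 10.0, 10.0])
oe_ratio = observed / expected

clustered = local_morans_i(oe_ratio, chain)                 # similar neighbours
alternating = local_morans_i([1.0, 5.0, 1.0, 5.0], chain)   # dissimilar neighbours
```

Positive values arise where an area and its neighbours deviate from the mean in the same direction (a cluster), and negative values where they deviate in opposite directions (an outlier), which mirrors the interpretation of the Z scores above.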
GWR is a spatial statistical technique that disaggregates geographic data into spatial blocks using a probability distribution kernel, which moves from location to location across the dataset to test for geographic variation in regression relationships. Where the strength of a regression relationship varies geographically, a phenomenon referred to as spatial non-stationarity, the use of GWR will improve model goodness-of-fit to the data, expressed as the trade-off between statistical predictor bias (linked to R-squared values) and variance (linked to degrees of freedom). In such situations GWR will produce higher correlation coefficients, lower residuals and higher degrees of freedom than traditional ordinary least squares (OLS) regression [25, 26].
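The moving-kernel idea can be sketched as a weighted least-squares fit repeated at every location. The sketch assumes a Gaussian kernel with a fixed bandwidth, which is one common choice and not necessarily the configuration used in GWR 3; the function name and the example data are hypothetical.

```python
import numpy as np


def gwr_local_fit(coords, x, y, bandwidth):
    """Fit a separate weighted linear regression centred on each location.

    coords:    n x 2 array of location coordinates (e.g. LA centroids)
    x:         predictor(s), shape (n,) or (n, k)
    y:         response, shape (n,)
    bandwidth: Gaussian kernel bandwidth, in the units of coords (assumed fixed)
    Returns an n x (k + 1) array of local coefficients, intercept first.
    """
    coords = np.asarray(coords, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])
    betas = np.empty((len(y), X.shape[1]))
    for i in range(len(y)):
        dist = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (dist / bandwidth) ** 2)   # nearer observations weigh more
        sw = np.sqrt(w)
        # Weighted least squares solved as ordinary least squares on scaled rows.
        betas[i] = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return betas


# Hypothetical non-stationary data: the slope is 1 in the west and 3 in the east.
coords = np.array([[float(i), 0.0] for i in range(10)])
x = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], dtype=float)
slope = np.where(coords[:, 0] < 5, 1.0, 3.0)
y = slope * x

local_betas = gwr_local_fit(coords, x, y, bandwidth=1.0)
```

With these data a single global OLS line cannot fit both halves, whereas the local slopes are close to 1 at the western end and close to 3 at the eastern end, which is the pattern referred to as spatial non-stationarity.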
In this paper, we used GWR to assess whether a linear regression relationship existed between observed and expected prevalence and, if so, whether it varied in strength over space. This purpose should be viewed as distinct from that of mapping observed to expected ratios, which aims to measure equality between two variables rather than to assess the predictability of an association. Both OLS and GWR models were run in the software package GWR 3 to test for spatial non-stationarity. The optimal bandwidth for the kernel was estimated using the Akaike Information Criterion [27]. Two rounds of regression were performed: the first was a univariate regression with expected prevalence as the independent variable and observed prevalence as the dependent variable; the second was a bivariate regression that included whole-time equivalent GP supply as an additional independent variable.
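Bandwidth selection by information criterion can be sketched as follows. The sketch assumes a Gaussian kernel and the corrected AIC (AICc) form that is commonly quoted for GWR, based on the trace of the hat matrix; whether GWR 3 was run with exactly this criterion and kernel is an assumption, and the function names and candidate bandwidths are hypothetical.

```python
import numpy as np


def gwr_aicc(coords, X, y, bandwidth):
    """Corrected AIC of a Gaussian-kernel GWR with a fixed bandwidth.

    X must already contain a column of ones for the intercept.
    """
    coords = np.asarray(coords, dtype=float)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    hat = np.empty((n, n))
    for i in range(n):
        dist = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (dist / bandwidth) ** 2)
        xtw = X.T * w                                   # X' W_i
        hat[i] = X[i] @ np.linalg.solve(xtw @ X, xtw)   # row i of the hat matrix
    resid = y - hat @ y
    sigma = np.sqrt(resid @ resid / n)                  # ML estimate of the error s.d.
    trace = np.trace(hat)                               # effective number of parameters
    return 2 * n * np.log(sigma) + n * np.log(2 * np.pi) + n * (n + trace) / (n - 2 - trace)


def select_bandwidth(coords, X, y, candidates):
    """Return the candidate bandwidth with the lowest AICc."""
    return min(candidates, key=lambda b: gwr_aicc(coords, X, y, b))
```

A very large bandwidth gives every observation almost equal weight, so the criterion then approaches that of the global OLS model; comparing it with the value at the selected bandwidth is one way to judge whether allowing spatial variation is worth the extra effective parameters. For the bivariate round, X would simply carry an additional column for GP supply.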
Our research conformed to the Helsinki Declaration (http://www.wma.net/en/30publications/10policies/b3/) and to local legislation. It did not require ethical approval or patient consent because it is a secondary analysis of publicly available data.