7Model preparation: Maize aggregation and data combination
For the final modelling we are only using predicted maize Se concentration to calculate the different level of aggregation. For more information on the reasons, see the section Model testing & calibration.
The file predmaize.df contains the coordinates of Malawi (lat/lon), the predicted maize Se concentration, both log-transformed (Zhat) and back transformed (Zhat_exp), and their covariance (uncertainty associated to each prediction) for each of the locations in Malawi.
The dataset was derived from the maize.df dataset using ordinary kriging and the scripts can be found in 01_maize-model.R and are based on the study of Gashu and colleagues (2022).
Then, in order to calculate the mean/median value for each level of aggregation we will use the boundaries that were generated in the script (00_cleaning-location.R) and described in Section 3.1.
Code
# Loading libraries and functions -----#library(plyr) # weighted data analysislibrary(dplyr) # data wrangling library(stringr) # string data manipulation library(ggplot2) # visualisationlibrary(sf) # spatial data manipulationlibrary(tmap) #spatial data manipulation and visualisation# Data: Shapefiles ----## Admin Boundaries for Malawi ----# Reading the EA shapefile w/ updated districts (See 00_cleaning-boundaries.R)ea_admin <-st_read(here::here( "data", "inter-output", "boundaries", "mwi_admbnda_adm4_nso.shp"))
Reading layer `mwi_admbnda_adm4_nso' from data source
`C:\Users\LuciaSegoviaDeLaRevi\OneDrive - London School of Hygiene and Tropical Medicine\PhD\PhD_r-project\PhD_geospatial-modelling\data\inter-output\boundaries\mwi_admbnda_adm4_nso.shp'
using driver `ESRI Shapefile'
Simple feature collection with 10502 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 32.67162 ymin: -17.12628 xmax: 35.91842 ymax: -9.363662
Geodetic CRS: WGS 84
Code
# Changing variable region class to factorea_admin$region <-as.factor(ea_admin$region)# Districtsdist_bnd <-st_read(here::here( "data","mwi-boundaries","mwi_adm_nso_hotosm_20230329_shp", "mwi_admbnda_adm2_nso_hotosm_20230329.shp"))
Reading layer `mwi_admbnda_adm2_nso_hotosm_20230329' from data source
`C:\Users\LuciaSegoviaDeLaRevi\OneDrive - London School of Hygiene and Tropical Medicine\PhD\PhD_r-project\PhD_geospatial-modelling\data\mwi-boundaries\mwi_adm_nso_hotosm_20230329_shp\mwi_admbnda_adm2_nso_hotosm_20230329.shp'
using driver `ESRI Shapefile'
Simple feature collection with 32 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 32.67164 ymin: -17.12975 xmax: 35.91848 ymax: -9.367346
Geodetic CRS: WGS 84
Code
dist_bnd <-st_make_valid(dist_bnd) # Check this# Cluster's EA infocluster.df <-readRDS(here::here("data", "inter-output", "dhs_se_gps_admin.RDS")) %>%distinct(survey_cluster1, EACODE, ADM2_PCODE, ADM2_EN, urbanity)# Unique EAs where WRA potentially resideEAselected <-unique(cluster.df$EACODE)# Loading predicted maize Se conc.predmaize.df <-read.csv(here::here("data", "OK","2024-05-03Se_raw_OK_expmaize.csv")) %>%rename(predSe ="Zhat_exp")# Buffer distancesbuff.dist <-na.omit(unique(stringr::str_extract(list.files(here::here("data", "inter-output", "boundaries", "buffer")),"[:digit:]{2}")))
7.1 Data aggregation
Here we are proceeding as described in Section 6.2. In summary, to generate the final aggregations we are:
Loading the boundaries for the administrative units (cluster-level and district-level) and the buffers which can be adapted using the function buffer_generator(), and are 10 km, 15 km, 20 km, 25 km, 30 km, 40 km, 50 km, 60 km. More information on that function can be found in Section 6.2.1.
After selecting the aggregations, we need to get the information on the maize Se concentration for each aggregation area. To do so, we need to convert the predicted maize Se dataset (Long/Lat) into a spatial object (point locations).
Merging the data by location. This spatial joint allowed us to obtain the maize Se concentration point data in each area. See ?fig-XX.
Note
Because the dataset of the predicted maize is very large, and we only need the maize aggregation for the areas (e.g., EA group) were data on plasma Se were collected, we are using a shapefile with the cluster-level spatial data only for those EAs/clusters with plasma Se concentration collected.
Code
### Cluster ----## Checking the EAs of the HHs with predicted maizebnd_reduced <- ea_admin %>%filter(EACODE %in% EAselected)geopred_ea <-st_join(geopredmaize.df, bnd_reduced)
Once we know for each maize Se concentration to which EA below, we can aggregate at cluster level (and district) (predmaize).
For the buffers, it is performed equally but using a loop, that ingest the buffer generated and output the dataset with the maize Se mean for each buffer.
Code
# Getting the distances of the buffers(buff.dist <-na.omit(unique(stringr::str_extract(list.files(here::here("data", "inter-output", "boundaries", "buffer")),"[:digit:]{2}"))))#Saving unti a new objectgeodata.df <- geopredmaize.df# Getting the variable to perform the mean (calculations over)Se <-grep("Se", names(geodata.df), value =TRUE)# Looping over the buffersfor(i in1:length(buff.dist)){ buffer <-st_read(here::here("data", "inter-output","boundaries", "buffer",paste0("mwi_buffer", buff.dist[i], ".shp"))) %>%rename(survey_cluster1 ="srvy_c1") maize_buff <-st_join(geodata.df, buffer) if(sum(is.na(maize_buff[,Se]))>0){print("Error in the data merging")}}
Here’s an example of the output from the aggregated dataset.
Code
i =1buffer <-st_read(here::here("data", "inter-output","boundaries", "buffer",paste0("mwi_buffer", buff.dist[i], ".shp"))) %>%rename(survey_cluster1 ="srvy_c1")
Reading layer `mwi_buffer10' from data source
`C:\Users\LuciaSegoviaDeLaRevi\OneDrive - London School of Hygiene and Tropical Medicine\PhD\PhD_r-project\PhD_geospatial-modelling\data\inter-output\boundaries\buffer\mwi_buffer10.shp'
using driver `ESRI Shapefile'
Simple feature collection with 102 features and 1 field
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 32.90714 ymin: -17.17541 xmax: 35.74069 ymax: -9.424538
Geodetic CRS: WGS 84
Here, we are assigning the maize Se concentration, at the different level, to each WRA in each cluster (or district).
For all the datasets (n= 11), except for the district level, the linkage between the women and its corresponding maize Se concentration (i.e., the hypothetical concentration of Se that is likely to be supplied by maize) are based on the cluster id. This is because the buffer were generated based on the cluster centroids.
For ease of use, the datasets linkages are done using a loop (as outlined below), after which the datasets are ready to be used in the R-INLA model.
The final dataset contains the plasma Se concentration for women and all the covariates for the model, except for the distance to main lakes, which is merged in the next step.
Code
# Loop to generate a file with plasma data and maize with each aggregation unit# Run one for every file/aggegationfor(i in1:length(file)){# Load the maize Se conc. aggregated datasetmaize.df <-readRDS(here::here("data", "inter-output", "aggregation", file[i]))# Join (left) plasma and maize datasets#based on common variable (eg cluster id)data.df <-left_join(plasma.df, maize.df) # Check if there are missing values for plasma Se conc.if(sum(is.na(data.df$selenium))>0){stop(paste0("Missing values in plamsa Se in ", file[i]))}# Check if there are missing values for maize Se conc.if(sum(is.na(data.df$Se_mean))>0){stop(paste0("Missing values in maize Se in ", file[i]))}# Save the output into the model foldersaveRDS(data.df, here::here("data", "inter-output", "model", paste0("plasma-", file[i])))}