1 Methods: Datasets & Variables
In this section we describe the data sets and variables that were evaluated for use in the study: “Plasma selenium concentrations are associated with spatial variation in maize selenium concentrations in Malawi; implications for assessing dietary selenium adequacy”. We provide information about the data provenance and the reasons for inclusion/exclusion of data sets in the analysis.
1.1 Plasma Se concentration and other biological data
Geo-referenced biomarker and other information for women (15-45 yo) was obtained from the Malawi DHS - MNS 2015-2016 ([Malawi] and ICF (2017)). Access to the DHS data needs to be requested here.
1.1.0.1 Micronutrient survey data
This data set (MW_WRA) provided information on biomarkers and other biological data for women of reproductive age in Malawi. The recode and all information about the variable descriptions can be found here. The following variables were considered for cleaning and exploration:
- mregion (region) [1-3 = N->S]
- mtype (urban/rural) [U-R = 1-2]
- m04 (gender)
- mcluster (cluster number - PSU)
- mnumber (household number - SSU)
- mweight (Micronutrient survey household weight (6 decimals))
- m01 (line number - individual ID within HH)
- m07 (age) (age in years)
- m435 (weight, in kg)
- m436 (height, in m)
- sel (Plasma Se (ng/ml))
- m415 (Took any kind of vitamin or mineral tablet/syrup/powder)
- m431 (Malaria test result)[^1]
- agp_crp_c1 (ANY_INFLAMMATION, any inflammation (AGP >1 g/L or CRP >5 mg/L))
- agp (AGP, Alpha-1-Acid Glycoprotein (g/L), inflammation biomarker)
- crp (CRP, C-Reactive Protein (mg/L), inflammation biomarker)
[^1] Malaria test recode: Final result of malaria from blood smear test
- 2 = Negative (in recode manual = 0)
- 1 = Positive (1)
- 6 = Test undetermined (4)
- 7 = Sample not found in lab database (6)
- 9 = Missing (m)
- (na) = Not applicable
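The recode in the footnote and the inflammation thresholds listed for agp_crp_c1 can be sketched as below. This is a minimal illustration, not the cleaning script used in the study: the function names are hypothetical, `None` stands in for a missing or not-applicable value, and returning `None` when either biomarker is missing is an assumption.

```python
# Labels for the m431 codes as they appear in the data set (see footnote).
MALARIA_LABELS = {
    2: "Negative",
    1: "Positive",
    6: "Test undetermined",
    7: "Sample not found in lab database",
    9: "Missing",
}


def malaria_label(code):
    """Return the label for an m431 code; None is treated as 'Not applicable'."""
    if code is None:
        return "Not applicable"
    return MALARIA_LABELS.get(code, "Unknown code")


def any_inflammation(agp_g_per_l, crp_mg_per_l):
    """Flag inflammation when AGP > 1 g/L or CRP > 5 mg/L.

    Assumption: if either biomarker is missing, the flag is left undefined.
    """
    if agp_g_per_l is None or crp_mg_per_l is None:
        return None
    return agp_g_per_l > 1 or crp_mg_per_l > 5
```

Keeping the labels in a single lookup table makes it easy to check the recode against the manual before any filtering is applied.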
1.1.0.2 Women questionnaire data
This data set (MWIR7) provided individual-level information for women (15-49yo) on socioeconomic factors for the main DHS survey. We downloaded both the .dat and .dta files because the file with the .dat extension could not be opened in R. The file was not corrupted, as we could open the .MAP file with Notepad++ and see all the recode documentation.
These variables provide information at individual level. The variables of interest are listed below:
- v101/V024 (De facto region of residence)
- sdist (district)
- v022 (survey strata)[^2]
- v005 (Woman’s sample weights)[^3]
- v106 (Highest educational level)
- v155 (Literacy)
- v190 (Wealth index) [1-5 (lowest to highest)]
- v190a (Wealth index for urban/rural)
- v002 (Household number)
- v001 (Cluster number)
- v463a (is_smoker; only covers cigarettes, other smoking variables are available)
- m420 (had_malaria)
- v150 (Relationship to household head)
[^2] Note that v022 and v023 define strata by urban/rural and district, because they were designed for the whole DHS survey. They should not be used for the MNS, which is a subset of the DHS survey that is representative (and stratified) at region and urban/rural level. We could instead create our own strata based on the MNS survey design (strata 1 to 6: 3 regions × 2 urban/rural).
[^3] As above, this variable should not be used for the MNS, as the MNS is a subset and has its own survey weights. For the DHS, v005 needs to be divided by 1000000 before it is applied as a weighting factor.
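The two footnotes can be illustrated with a short sketch. It assumes that strata are numbered 1 to 6 by combining mregion (1-3) and mtype (1-2) in that order; the numbering scheme and the function names are hypothetical and are not taken from the survey documentation.

```python
def mns_stratum(mregion, mtype):
    """Combine region (1-3) and urban/rural (1-2) into a stratum id from 1 to 6.

    Assumed numbering: region 1 urban = 1, region 1 rural = 2, ..., region 3 rural = 6.
    """
    if mregion not in (1, 2, 3) or mtype not in (1, 2):
        raise ValueError("mregion must be 1-3 and mtype must be 1-2")
    return (mregion - 1) * 2 + mtype


def dhs_weight(v005):
    """Scale the DHS woman's sample weight, which is stored multiplied by 1,000,000."""
    return v005 / 1_000_000
```

Any one-to-one numbering of the six region by urban/rural combinations would serve equally well, since the stratum id is only a label for the survey design.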
1.1.0.3 Location data
The data file (MWGE7AFL) provides information on GPS coordinates (displaced) for each cluster (PSU = EA) in the main DHS (2015-2016) survey. Hence, all the households in the same EA/cluster have the same (GPS) location.
The variable used for merging is DHSCLUST (to be merged with V001).
Note that the displaced GPS coordinates remain within the original district. Also, according to the documentation, the cluster “centre” recorded is the “center of the populated place in the cluster” (MWGE7AFL/GPS_Displacement_README.txt, DHS, 2014).
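The merge on cluster number can be sketched as a simple lookup. This is an illustration only: records are represented as plain dictionaries, the helper name is hypothetical, and the coordinate fields are assumed to be named LATNUM and LONGNUM as in standard DHS GPS files.

```python
def attach_cluster_location(women, clusters):
    """Add the displaced cluster coordinates to each woman's record.

    women:    list of dicts with a 'v001' key (cluster number)
    clusters: list of dicts with 'DHSCLUST', 'LATNUM' and 'LONGNUM' keys
    """
    by_cluster = {c["DHSCLUST"]: c for c in clusters}
    merged = []
    for woman in women:
        location = by_cluster.get(woman["v001"])
        row = dict(woman)
        # Clusters without a GPS record are kept, with missing coordinates.
        row["LATNUM"] = location["LATNUM"] if location else None
        row["LONGNUM"] = location["LONGNUM"] if location else None
        merged.append(row)
    return merged
```

Because all households in a cluster share one displaced location, every woman in the same cluster receives identical coordinates after the merge.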
1.2 Geo-referenced crop (maize) composition (Se concentration) data in Malawi
1.2.0.1 Main dataset: GeoNutrition
This data set provides geo-referenced data on the concentration of Se (and other minerals) in maize (and other crops) in Malawi, as well as soil pH (measured in water) for each sampling site. The data were collected from April to June 2018, and the sampling aimed to be representative of the agricultural land in the country. The data set can be obtained from this open repository, and it is further described by Kumssa et al. (2022) and Gashu et al. (2021). We decided to use this data set instead of the one reported in the repository accompanying the study by Gashu and colleagues (2021) (see excluded data sources below) because it provided the raw data (i.e., including values below the detection limit), as well as more metadata. For instance, it provided the limit of detection for each run of the Inductively Coupled Plasma (ICP) analysis. This was particularly relevant for Se, which was analysed separately by ICP - Mass Spectrometry (MS) because of the low concentrations found in Malawi.
The data repository contains a number of data files for Malawi:
- Raw crop mineral dataset (MWI_CropSoilData_Raw), which provides the raw values, as extracted from the analytical device (i.e., data from the ICP without any further processing).
- Processed crop mineral dataset (MWI_CropSoilData_NA), in which values that were not analysed (NM), below the limit of detection (BLOD), or negative have been transformed into NA.
- Limit of detection dataset (MWI_Crop_LOD_ByICPRun), which provides the limit of detection for each ICP run.
The main variable extracted from the main dataset (“MWI_CropSoilData_Raw.csv”) was Se_grain, which contains the selenium concentration (mg kg-1 DM) for all grain types.
File name: data/maize/MWI_CropSoilChemData_CSV/MWI_CropSoilData_Raw.csv
File name: data/maize/MWI_CropSoilChemData_CSV/MWI_Crop_LOD_ByICPRun.csv
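The relationship between the raw file, the processed file and the per-run limits of detection can be sketched as follows. This is an illustration under stated assumptions rather than the processing applied by the data providers: the `icp_run` key and the function names are hypothetical, `None` stands in for NA, and values are treated as below the limit of detection when they are strictly smaller than the LOD of their ICP run.

```python
def to_processed(raw_value, lod):
    """Return None (standing in for NA) for values that would not be kept.

    A value is dropped when it was not measured, is negative, or is below
    the limit of detection (assumed here to mean strictly below).
    """
    if raw_value is None or raw_value == "NM":
        return None
    value = float(raw_value)
    if value < 0 or value < lod:
        return None
    return value


def process_se_grain(rows, lod_by_run):
    """Apply to_processed to Se_grain, using the LOD of each row's ICP run."""
    processed = []
    for row in rows:
        lod = lod_by_run[row["icp_run"]]  # 'icp_run' is a hypothetical column name
        processed.append({**row, "Se_grain": to_processed(row["Se_grain"], lod)})
    return processed
```

Working from the raw file keeps this decision explicit, so alternative treatments of values below the detection limit remain possible.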
1.2.0.2 Complementary dataset: Chilimba
Malawi maize grain and soil Se concentration data collected between 2009 and 2010 in the study of Chilimba et al. (2011). The data used were obtained from one of the co-authors of the manuscript (E.L.A.) and included the pH (in water) of the soil where the samples were collected. However, the analyses outlined here can be performed using the publicly available version here.
This data set was mainly included to supplement the data on maize grain Se concentration in Southern Malawi. Note that no information regarding the limit of detection was provided, and that data collection and data analysis were done at a different time and by a different team.
File name: data/maize/AllanChilimbaFieldData_forLucia_20230615.xlsx
1.2.0.3 List of the excluded sources of data
Here we present a list of data sets containing geo-referenced maize grain Se concentration information in Malawi that were evaluated and excluded from the final analyses. The cleaning script can be found in the excluded folder as 00_cleaning_excluded.R. More details on the reasons for exclusion are found in Section 4.3.
- Malawi maize grain and soil Se concentration dataset (2018). It is publicly available here.
File name: GeoNutrition/Soil_Crop_comparisons/Malawi/Malawi_grain_soil.xlsx
- Malawi crops, including maize, mineral concentration data set (2012-2013) and soil pH. These data were collected by Joy and colleagues (2014) and are publicly available here, in the Supplementary table (tabs 1-10). The tabs explored in this analysis were:
- STable6: crop sample info (maize Se) and coordinates
- STable5: Soil samples (soil pH)
- STable10: Information on each crop and its corresponding soil sample (Sample_number, Soil_sample).
File re-name: data/maize/2012_Joy-mwi-samples.xlsx
1.3 Boundaries of Malawi
Malawi administrative boundaries are, from the largest to smallest area: Region> District> Traditional Authorities (TA)> Enumeration Areas (EA). We are mainly interested in Districts and EAs for our study. Other boundaries in Malawi that were explored included: agro-ecological zones, agricultural land, water bodies, national parks, etc.
We excluded Agro-Ecological Zones (AEZs) boundaries from our study because there are only 4 AEZs in Malawi; hence, the areas are too large to show any meaningful spatial pattern or to serve as a level of aggregation.
Note: The AEZs shape files can be found in the Scoping review (R project, data folder), and further information can be found here.
Information on the boundaries was obtained from different sources and data owners. Below is a list of the boundaries we explored and the reasons for inclusion/exclusion.
1.3.0.1 Main administrative boundaries file
- Malawi NSO from HDX (Humanitarian datasets) (mwi_adm_nso_hotosm_20230329). This source only has boundaries up to Admin2 (District) level.
File: mwi-boundaries/mwi_adm_nso_hotosm_20230329_shp/mwi_admbnda_adm3_nso_hotosm_20230329.shp
- Population over Enumeration Areas NSO, (eas_bnd): this layer shows the distribution of the population over the country of Malawi on the basis of enumeration area boundary. Data producer is the National Statistical Office.
File: mwi-boundaries/EN_NSO/eas_bnd.shp
1.3.0.2 Excluded boundaries files
- GADM maps and data (gadm40_MWI_shp): In the file for the EAs, some of the EAs near the lake have “ownership” over certain areas of the lake, which were included in the boundaries of those EAs in the shapefile (i.e., as a continuum from land to lake). As a result, some data points fell into the lake, so we used a different file for the Malawi boundaries.
- Echo2 prioritization: These files covered boundaries down to EA level (deprecated; updated version: Population over Enumeration Areas NSO).
- GeoBoundaries: It has 3 levels of boundaries: 1-region, 2-district, 3-TA. These files have no EA boundaries (deprecated).
- Other excluded boundaries: Agro-Ecological Zones, Traditional Authorities (more information on data exploration: Chapter 3).
1.4 Covariates
1.4.0.1 Fish intake proxy
Fish intake has been identified as one of the main sources of Se in Malawian diets (REF, REF, REF), and it may be a key determinant of plasma Se status in Malawi. Hence, as a proxy for fish intake, we decided to use a measure of the distance to inland water bodies, as suggested by other studies. We tested three approaches:
- Distance to inland water bodies in Malawi: Following the study by O’Meara et al. (2021), and using the raster layer developed by WorldPop, which can be accessed here.
- Distance to Lake Malawi: Following the study by Phiri et al. (2019). We used the shape files of Malawi (ea_bnd) described in Section 1.3, and calculated the Euclidean distance from the DHS cluster centroids to the lake.
- Distance to main lakes: A mixed approach between the other two, in which we used the shortest distance to one of the three main lakes in Malawi: Lake Chilwa, Lake Malombe and Lake Malawi, as they have been shown to be fishing grounds. For instance, while Lake Malawi provides the largest volume of fish, it is followed by Lake Chilwa, which provides around 20% of the total fish landings in Malawi (Chiwaula et al., 2012; Simmance et al., 2021). We calculated the distance to the closest lake using the same boundary file as described above.
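The "shortest distance to one of the main lakes" idea can be sketched with plain planar geometry. This is a simplified illustration, not the procedure used in the study: it assumes coordinates in a projected system (so Euclidean distance is meaningful), represents each lake by the vertices of a single closed outline, and measures distance to the shoreline, so a point inside an outline is not given a distance of zero. All function names are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Position of the projection of p along the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_outline(p, outline):
    """Shortest distance from p to a closed outline given as a list of vertices."""
    n = len(outline)
    return min(
        point_segment_distance(p, outline[i], outline[(i + 1) % n])
        for i in range(n)
    )


def nearest_lake(p, lakes):
    """Return (lake name, distance) for the lake outline closest to p."""
    distances = {name: distance_to_outline(p, ring) for name, ring in lakes.items()}
    name = min(distances, key=distances.get)
    return name, distances[name]
```

With `lakes` holding the outlines of Lake Malawi, Lake Malombe and Lake Chilwa, calling `nearest_lake` for each DHS cluster centroid would return both the distance and the lake it refers to.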
1.4.0.2 Excluded
There were some covariate data sets that we explored but that were excluded from the study.
Mean Annual Temperature (MAT) was obtained as a raster data set (mwi_CHELSA_bio10_1.tif) derived from the CHELSA data set (CHELSA_bio10_1.tif), a worldwide high-resolution MAT raster layer covering 1979-2013. The original data were downloaded from here in GeoTIFF format. Then, the data set was cropped (clipped) from the global extent to the Malawi country area. The clipping was done in QGIS version 3.28.3 (Firenze) using the enumeration area shapefile (ea_bnd layer) as the reference extent. The geographic coordinate system is referenced to the WGS 84 horizontal datum, with the horizontal coordinates expressed in decimal degrees.
Replicating sequence: Raster > Extraction > Clip Raster by Extent… > Clipping extent > Calculate from a layer > ea_bnd.
This variable (BIO1) was planned to be used as an environmental covariate for the geospatial maize Se model. However, following the method published by Gashu et al (2020), no covariates were used for the maize grain Se predictions.
Distance to water bodies in Malawi
GLOBAL LAKES AND WETLANDS DATABASE (GLWD): Following the study by O’Meara et al. (2021), we obtained the Level 2 data set (GLWD-2), which comprises the shoreline polygons of permanent open water bodies with a surface area ≥ 0.1 km2, excluding the water bodies contained in GLWD-1. The approximately 250,000 polygons of GLWD-2 are attributed as lakes, reservoirs and rivers (World Wildlife Fund, 2019). We excluded this data set because (a) it was dated (2004), and (b) some of the files appeared to be corrupted.
ESACCIL (Lamarche et al. (2017)): This data set contains the boundaries of each water body in Malawi, and was superseded by the raster layer on distance to inland water bodies in Malawi generated by WorldPop. We explored the data and calculated the distance to the DHS cluster centroids; however, the method used in the harmonised WorldPop dataset, which calculates the distance for each pixel, was more robust. The shapefiles with the world inland water bodies were accessed from here, and then cropped to Malawi.
1.4.1 Geo-referenced apparent food consumption in Malawi (Se intake)
This dataset, the Fourth Integrated Household Survey 2016-2017, was only used as supporting information. It can be accessed through the World Bank microdata website. Information about each module and its contents can be found here.
Cleaning script, summary statistics and plots can be found in the repository ihs4, under the MAPS project.
Files of interest and variables:
- Food consumption
- Geographic data (HouseholdGeovariablesIHS4.csv). This dataset has information on the displaced coordinates of each household, distance to some features (road, population centre, agricultural market, etc.), and other environmental variables (elevation, rainfall, etc.). Variables of interest:
  - case_id: Unique Household Identifier
  - HHID: household id (Survey Solutions Unique HH Identifier)
  - lat_modified: GPS Latitude Modified
  - lon_modified: GPS Longitude Modified
  - dist_road: HH Distance in (KMs) to Nearest Road
  - dist_agmrkt: HH Distance in (KMs) to Nearest Agricultural Market
  - ssa_aez09: Agro-ecological Zones
  - srtm_1k: Elevation (m)
- Sales vs stored crops (rainy season) (AG_MOD_I)
- Sales vs stored crops (dry (dimba) season) (AG_MOD_O)
- Sales vs stored trees (12months) (AG_MOD_Q)