5  Methods: Generating predicted maize grain Se concentration for Malawi

Here, we describe all the steps performed to obtain the prediction of maize grain Se concentration in unsampled location of Malawi. The R script (01_maize-model.R) used to run the analysis can be found in the accompanying repository and are an adaptation of the script developed by Prof Murray Lark for the GeoNutrion project which are published here.

Code
# Libraries & functions
library(geosphere)
source(here::here("functions", "kriging-functions.R")) # Matern functions & others
source(here::here("functions", "CEPHaStat_3.R")) #stat functions
CEPHaStat version  3

CEPHaStat is a collection of R functions produced by the UKRI GCRF
CEPHaS project.  The functions relate primarily to statistical analysis of
soil and related data from agricultural and environmental surveys and 
experiments.  The CEPHaStat file can be shared as it stands, and is made 
available to all for use, without warranty or liability.
Code
#  Note that the script contains an adaptation of the scripts develop for 
# the analysis of the Malawi  and the Ethiopia data sets (GeoNutrition)

# Loading the data
data.df  <- readRDS(here::here("data", "inter-output", "mwi-grain-se_raw.RDS")) # cleaned geo-loc maize Se data (2 datasets)

5.1 Exploratory analysis

The data set that we are using is the output generated in the script (00_cleaning-maize.R) which contains the information on crop Se concentration in Malawi. The variables are the GPS sample location (lat/lon), maize grain Se concentration, soil pH and Mean Annual Temperature (MAT) of the two combined datasets (Kumssa et al. (2022), Chilimba et al. (2011)) See Chapter 4 for more information on the data set.

Then, information for maize only is selected, and some housekeeping actions are performed, such as, removing missing values for the variable of interest (Se_raw) which contains the Se concentration including below detection limits, and transforming zeros into the minimum value. This last operation is performed as log-transformation would be required in subsequent steps.

Code
## Choosing only entries with maize Se values
data.df <-  subset(data.df, Crop == "Maize")
# Housekeeping: Check NA and zeros
sum(is.na(data.df$Se_raw))
[1] 7
Code
sum(data.df$Se_raw == 0 & !is.na(data.df$Se_raw))
[1] 23
Code
data.df <- subset(data.df, !is.na(Se_raw) & Se_raw>0) #excluding NA & zeros
Note

Zeros were excluded due to:

  1. Assumed that maize sample did not contain Se
  2. Log-transformation of zero is equal to -Inf.

Now, we are performing some summary statistics and checking the data skewness to decide on the log-tranformation.

Code
# Now create plot name with Se (element of interest)

element<-"Se_raw"
Crop.target <- "maize"
plotname<-expression("Se (raw) /"~mg~kg^{-1})
plotname2<-expression("Se (raw) /"~log~mg~kg^{-1})


# Compute summary statistics of the selected variable

data.df  <- as.data.frame(data.df)

summary<-(summa(data.df[,element]))
summary<-rbind(summary,(summa(log(data.df[,element]))))
row.names(summary)<-c("mg/kg,","log mg/kg")
summaplot((data.df[,element]),plotname)

Due to the markedly right-side skewness, we decided to log-transform the variable (Se_raw). We can see in the figure below, that the log-transformed concentration of Se in maize is very close to normality, hence we will proceed with the log-trasformed variable for the variogram estimation. Variograms are particularly sensitive to right-side outliers (e.g., exceptionally large values) (Webster and Oliver (2007)).

Code
# Checking the data distribution after log-transformation
summaplot(log(data.df[,element]),plotname2)

Code
# Log-transforming the maize Se conc.
data.df[,element]<-log(data.df[, element]) 
plotname<-plotname2

Then, we can see in Figure 5.1 the log-transformed maize Se concentration coloured by their quartile from the smallest (in blue) to the highest (in red) to see if there are some spatial trend or pattern. We also checked for North to South and West to East trends (see code in the R scripts).

Code
# Map of maize Se concentration

quantiles<-as.character(cut(data.df[,element],quantile(data.df[,element],
c(0,0.25,0.5,0.75,1)),include.lowest=T,
label=c("blue","green","yellow","red")))    

# Classified post-plot
options(scipen=5)
plot(data.df[,"Longitude"],data.df[,"Latitude"], # Lon & Lat variables
pch=16,col=quantiles,asp=1,
xlab="Longitude",ylab="Latitude",cex=0.5)
Figure 5.1: Map of Malawi: each point represents the maize Se concentration (log-transformed) at a sampled location, the colours show the quantiles: 0-24.9% (blue), 25-49.9% (green), 50-79.9% (yellow), 75-100% (red)

5.2 Variogram estimation and validation

After the data exploration, the next step is selecting the best estimators to fit the model (ordinary kriging (OK)) for getting the most accurate results.

The method for evaluating and selecting the model was based on Gashu et al. (2021) and best practices suggested by and Webster and Oliver (2007).

5.2.1 Obtaining the variogram estimates

In this section, we are studying the spatial structure of the data by generating an empirical variogram.

To do so, first we need to calculate some information from the data: the distance between all pairs of observations (lag) and the difference in maize Se concentration between two observations (varogram cloud).

Code
#  Exploratory study of empirical variogram

#  Make a variogram cloud

N<-nrow(data.df) # No. of observations

NP<-0.5*N*(N-1)  # No. of pairs

# Extracting variables from the data set
Long<-data.df$Longitude # Lon
Lat<-data.df$Latitude # Lat
z<-data.df[,element]  # maize Se conc.

# Generating vectors to store the data to be calculated in the loop below
lag<-vector("numeric",NP)   # Distance between 2pts (lag)
vclo<-vector("numeric",NP)  # Variogram cloud 
bear<-vector("numeric",NP)  # Shortest distance between 2pts (on an ellipsoid or sphere)

ico=0

for (i in 1:(N-1)){
for (j in (i+1):N){
ico=ico+1
# print(ico/NP)
lag[ico]<-distVincentySphere(c(Long[i],Lat[i]),c(Long[j],Lat[j]))/1000 #
bear[ico]<-finalBearing(c(Long[i],Lat[i]),c(Long[j],Lat[j]), a=6378137, f=1/298.257223563, sphere=TRUE)
zi<-(z[i])
zj<-(z[j])
vclo[ico]<-(zi-zj)
}
}

With that information, we are going to decide on the maximum lag distance that will be included in the variogram, and the lag bins. For more information about the use of bin distance (d) and the importance of its selection, see “Irregular sampling in one dimension” (Webster and Oliver (2007), page 68).

Code
# Setting for the variogram:
# max. distance (e.g., end of spatial correlation) around 100km 
# We have tested 100km, 150km and 200km 
maxdist <- 100

lagbins<-cut(lag,seq(0,maxdist,10),labels=seq(5,maxdist,10))   # 10-km bins for Malawi

lag2<-lag[!is.na(lagbins)]
vclo2<-vclo[!is.na(lagbins)]
lagbins<-factor(lagbins[!is.na(lagbins)])
nlags<-nlevels(lagbins)

Then, calculating the semi-variance we are using three different estimators: isotropic Matheron (Ma), Cressie-Hawkins (CH) and Dowd (Do).

In order to store the data we created three matrices, one semivar which contains the variogram estimates and was structured as follow:

  • column 1 are estimates with Matheron (iso-Ma),
  • columns 2 are for the robust estimator Cressie-Hawkins (iso-CH),
  • columns 3 are for the robust Dowd estimator (iso-Do).

Then, the lag pairs per bin, and the mean lag distance were stored in the objects npair and varlag (isotropic).

Code
# Matrices for storing the results
semiv<-matrix(nrow=nlags,ncol=3)
npair<-matrix(nrow=nlags,ncol=1)
varlags<-matrix(nrow=nlags,ncol=1)

colnames(semiv)<-c("iso-Ma","iso-CH","iso-Do")
colnames(npair)<-c("iso")
colnames(varlags)<-c("iso")

Now the mean distance per lag bin and the number of pairs in each lag (total/mean = n) and the estimators are calculated.

Code
# Adding the mean lag distance per bin 
varlags[,1]<-as.numeric(by(lag2, lagbins, mean))
# Adding the number of pairs in each lag bin 
npair[,1]<-as.numeric(by(lag2, lagbins, sum))/as.numeric(by(lag2, lagbins, mean))

# Calculating the estimator

#  Matheron
semiv[,1]<-0.5*(as.numeric(by((vclo2)^2, lagbins, mean)))

# CH
semiv[,2]<-(as.numeric(by((sqrt(abs(vclo2))), lagbins, mean)))^4
semiv[,2]<-0.5*semiv[,2]/(0.457+0.459/npair[,1]+0.045/npair[,1]^2)

# Do
vclo2d<-c(vclo2,-vclo2)
lagbinsd<-c(lagbins,lagbins)

semiv[,3]<-0.5*(as.numeric(by(vclo2d, lagbinsd, mad)))^2

And then, results are plotted…

Code
#  Now plot the variograms
maxv<-max(semiv[,1:3])
minv<-min(semiv[,1:3])

plot(varlags,semiv[,1],ylim=c(0,maxv),xlab="Distance /km",ylab="Variance",pch=16)
points(varlags,semiv[,2],pch=1)
points(varlags,semiv[,3],pch=17)

locsv<-minv*c(0.6,0.4,0.2)
points(60,locsv[1],pch=16)
points(60,locsv[2],pch=1)
points(60,locsv[3],pch=17)

text(65,locsv[1],"Matheron",pos=4)
text(65,locsv[2],"Cressie-Hawkins",pos=4)
text(65,locsv[3],"Dowd",pos=4)

5.2.2 Fit exponential variogram functions to each set of estimates

First, we need to choose initial estimates for the parameters. These will impact the estimates, so also should they be chosen with care! (See Chapter 6 and page 101 of Webster and Oliver (2007)).

Code
# Reasonable initial estimates of the parameters must be entered below

co<-20000
c1<-30000
a<-50

maxl<-max(varlags)*1.1
par(mfrow=c(2,2))

Then, we are fitting the model using the different estimators using weighted least squares (See the wls() function in the kringing-functions.R script under the functions folder).

Code
# Matheron's estimator (Ma)

h<-varlags[,1] # mean lag distance per bin 
sv<-semiv[,1]  # Ma semivariance
npa<-npair[,1] # No of pairs per bin
kap<-0.5

# Fitting the model 
oo<-optim(c(co,c1,a),wls
,method="L-BFGS-B",
lower=c(0.0,0.0,0),
upper=c(1000000,1000000,5000))

Ahat<-length(h)*log(oo$value)+6  

ooMa<-oo # Storing values 

Visually inspecting the results of the fitted variogram for the Matheron’s estimator.

Code
plot(varlags[,1],semiv[,1],ylim=c(0,maxv),xlim=c(0,maxl),xlab="Distance /km",ylab="Variance",pch=16)
lines(seq(0,maxl,0.1),oo$par[1]+oo$par[2]*(1-exp(-seq(0,maxl,0.1)/oo$par[3])))

mtext("a). Matheron",3,adj=0,line=0.5)

Then, we fitted the varigram using the Cressie’s and Hawkins’s estimators.

Code
# Cressie's and Hawkins's estimator  (CH)

h<-varlags[,1] # mean lag distance per bin 
sv<-semiv[,2]  # CH semivariance
npa<-npair[,1] # No of pairs per bin
kap<-0.5

# Fitting the model 
oo<-optim(c(co,c1,a),wls
,method="L-BFGS-B",
lower=c(0.0,0.0,0),
upper=c(1000000,1000000,5000))

Ahat<-length(h)*log(oo$value)+6  

ooCH<-oo # Storing values 

Again, we proceed to visually inspect the fitted variogram.

Code
# Plotting the data
plot(varlags[,1],semiv[,2],ylim=c(0,maxv),xlim=c(0,maxl),xlab="Distance /km",ylab="Variance",pch=16)
lines(seq(0,maxl,0.1),oo$par[1]+oo$par[2]*(1-exp(-seq(0,maxl,0.1)/oo$par[3])))
mtext("b). Cressie-Hawkins",3,adj=0,line=0.5)

And, finally, we fitted the variogram using the Dowd’s estimator.

Code
# Dowd's estimator Do

h<-varlags[,1] # mean lag distance per bin 
sv<-semiv[,3]  # Do semivariance
npa<-npair[,1] # No of pairs per bin
kap<-0.5

# Fitting the model 
oo<-optim(c(co,c1,a),wls
,method="L-BFGS-B",
lower=c(0.0,0.0,0),
upper=c(1000000,1000000,5000))

Ahat<-length(h)*log(oo$value)+6  

ooDo<-oo # Storing values 

Visualising the results….

Code
plot(varlags[,1],semiv[,3],ylim=c(0,maxv),xlim=c(0,maxl),xlab="Distance /km",ylab="Variance",pch=16)
lines(seq(0,maxl,0.1),oo$par[1]+oo$par[2]*(1-exp(-seq(0,maxl,0.1)/oo$par[3])))

mtext("c). Dowd",3,adj=0,line=0.5)

5.2.3 Cross-validation

Cross-validation consists in estimating the sample points at a time by leaving one observation out. It was performed using a subset of the data, for our case, we used 1,000 observations.

Code
N<-nrow(data.df) # No. of observations

Nv<-500 # vs 1000 check perc.

kvop<-matrix(nrow=Nv,ncol=5)

Nr<-Nv-1

# Make random sample without replacement
sam<-sample(1:N,Nv,replace=F)

# Sub sample data
Long_s<-Long[sam]
Lat_s<-Lat[sam]
z_s<-(z[sam])

After, getting the subset of the sample, we repeat the step 1 just for the sub-sample, i.e, calculating the distance matrix (e.g., pair distances).

Code
# Compute distance matrix for subsample

DM<-matrix(0,nrow=Nv,ncol=Nv)

for (i in 1:(Nv-1)){
for (j in (i+1):Nv){
DM[i,j]<-distVincentySphere(c(Long_s[i],Lat_s[i]),c(Long_s[j],Lat_s[j]))/1000 #WGS84 ellipsoid is used by default
DM[j,i]<-DM[i,j]
}
}

Now, we are performing the cross-validation for the three estimators.

First, we chose the parameters of the model fitted using the Matheron estimator.

Code
#Select parameter set to validate

ooX<-ooMa  # Matheron

# variogram matrix

Gam<-ooX$par[1]+ooX$par[2]*(1-matern(DM,ooX$par[3],0.5))
diag(Gam)<-0

for (i in 1:Nv){

# print(i)

A<-Gam[-i,-i]
A<-cbind(A,rep(1,Nr))
A<-rbind(A,c(rep(1,Nr),0))

z_0<-z_s[i]
z_n<-z_s[-i]

b<-c(Gam[i,-i],1)

lam<-solve(A)%*%b
Zhat<-sum(z_n*lam[1:Nr])
err<-z_0-Zhat
kv<-t(b)%*%lam
theta<-(err^2)/kv

kvop[i,]<-c(z_0,Zhat,err,kv,theta)
}

Then, an exploratory plot of prediction errors is produced for the variogram, and store the validation outputs for later use.

Code
# Visualising the error 
dev.new()
summaplot(kvop[,3],"Kriging error")

# Storing the values
kvopMa<-kvop
colnames(kvopMa)<-c("z_0","Zhat","err","kv","theta")

Then, we repeated the same operation for the model fitted with Cressie & Hawkins estimator.

Code
#Select parameter set to validate

ooX<-ooCH

# variogram matrix

Gam<-ooX$par[1]+ooX$par[2]*(1-matern(DM,ooX$par[3],0.5))
diag(Gam)<-0

for (i in 1:Nv){

# print(i)

A<-Gam[-i,-i]
A<-cbind(A,rep(1,Nr))
A<-rbind(A,c(rep(1,Nr),0))

z_0<-z_s[i]
z_n<-z_s[-i]

b<-c(Gam[i,-i],1)

lam<-solve(A)%*%b
Zhat<-sum(z_n*lam[1:Nr])
err<-z_0-Zhat
kv<-t(b)%*%lam
theta<-(err^2)/kv

kvop[i,]<-c(z_0,Zhat,err,kv,theta)
}

And, visusalise and store the values.

Code
# Visualising the error
dev.new()
summaplot(kvop[,3],"Kriging error")

# Storing the values
kvopCH<-kvop
colnames(kvopCH)<-c("z_0","Zhat","err","kv","theta")

Finally, we repeated the operation for Dowd.

Code
#Select parameter set to validate

ooX<-ooDo

# variogram matrix

Gam<-ooX$par[1]+ooX$par[2]*(1-matern(DM,ooX$par[3],0.5))
diag(Gam)<-0

for (i in 1:Nv){

#print(i)

A<-Gam[-i,-i]
A<-cbind(A,rep(1,Nr))
A<-rbind(A,c(rep(1,Nr),0))

z_0<-z_s[i]
z_n<-z_s[-i]

b<-c(Gam[i,-i],1)

lam<-solve(A)%*%b
Zhat<-sum(z_n*lam[1:Nr])
err<-z_0-Zhat
kv<-t(b)%*%lam
theta<-(err^2)/kv

kvop[i,]<-c(z_0,Zhat,err,kv,theta)
}

And, visualising and storing.

Code
# Visualising the error
dev.new()
summaplot(kvop[,3],"Kriging error")
#summa(kvop[,5])

# Storing the values
kvopDo<-kvop
colnames(kvopDo)<-c("z_0","Zhat","err","kv","theta")

We then checked the mean and median standardised squared prediction error for each variogram.

The median for Matheron was just outside the median standardised squared prediction error interval (thetaCLL, thetaCLU) calculated below. Then, when we compared the other robust estimators, both medians were in range, with Dowd closest to 0.455.

Code
# checking the theta (MSER)
summa(kvopMa[,5])[,c(1,2)]
summa(kvopCH[,5])[,c(1,2)]
summa(kvopDo[,5])[,c(1,2)]

thetaCLU<-0.455+2*sqrt(1/(8*(Nv/2)*(dchisq(0.455,1))^2))
thetaCLL<-0.455-2*sqrt(1/(8*(Nv/2)*(dchisq(0.455,1))^2))

print(c(thetaCLL,thetaCLU))

Hence that was the model parameters chosen for the kriging predictions at unsampled locations.

Code
# For our maize Se data (log-transformed) Do was
# the better performing
ooX<-ooDo 

Validation outputs were saved for the estimators of Matheron (kvopMA), Cressie & Hawkins (kvopCH) and Dowd (kvopDo) respectively (see R script:01_maize-model.R).

5.3 Predicting maize grain Se concentrations in unsampled locations

After choosing the optimal estimator, we used it for predicting the maize Se concentration in unsampled locations in Malawi using ordinary kriging.

First, we need to load the target locations, which are all the unsampled locations in Malawi (n=178,040). It was obtained by dividing the country in even squares at resolution.

Then, we need to calculate the matrix distance (i.e., the distance between each point location in our target sample) and the variogram matrix for the sampled data as performed previously.

Code
# Loading the data with the unsampled locations
targets.df<-read.csv(here::here("data", "maize", "Malawi_targets.csv"),header=T) 

# Sampled location data
urdata.df<-data.frame(cbind(Lat,Long,z))

# Make a distance matrix among all data points 

Long<-urdata.df$Long
Lat<-urdata.df$Lat

DM<-matrix(0,nrow=N,ncol=N)

for (i in 1:(N-1)){
for (j in (i+1):N){

DM[i,j]<-distVincentySphere(c(Long[i],Lat[i]),c(Long[j],Lat[j]))/1000 #WGS84 ellipsoid is used by default
DM[j,i]<-DM[i,j]
}
}

# variogram matrix

Gam<-ooX$par[1]+ooX$par[2]*(1-matern(DM,ooX$par[3],0.5))

A<-Gam
A<-cbind(A,rep(1,N))
A<-rbind(A,c(rep(1,N),0))
Ainv<-solve(A)

We prepared the unsampled data set for making the prediction.

Code
# No of unsampled obs.
Nt<-nrow(targets.df)

# Matrix to store the data from the OK
krop<-matrix(nrow=Nt,ncol=5)

# Binding all the sampled locations
Allpoints<-cbind(Lat,Long)

Finally, we performed the prediction. Note that the outputs are the locations (Long/Lat), the log-tranformed predicted value for each location, the log-transformed kriging variance and the Lagrange multiplier. The later one is important because as reported previously (Webster and Oliver (2007)), the back transformation of the lognormal kriging for ordinary kriging need to be done using the following Equation 5.1:

\[ Z_{OK}(x_0) = exp(Y_{OK}(x_0) + Var_{OK}(x_0)/2 - \lambda(x_0)) \tag{5.1}\]

Where \(Z_{OK}(x_0)\) is the back transformed variable at a given point location \((x_0)\), \(Y_{OK}(x_0)\) and \(Var_{OK}(x_0)\) are the kriged estimated log-transformed of the maize Se concentration and its variance using ordinary kriging at the same location, respectively, and \(\lambda(x_0)\) is the Lagrange multiplier.

Note that the prediction will take some time to run, hence, after finishing, we will save the file, before proceeding to back-transforming the data.

Code
# OK prediction

for (it in 1:Nt){
print(c(it,Nt))

# Extract Lat and Long of target point 

Lat_t<-targets.df$Lat[it]
Long_t<-targets.df$Long[it]

Bd<-matrix(nrow=(N+1),ncol=1)

for (j in 1:N){
Bd[j]<-distVincentySphere(c(Long[j],Lat[j]),c(Long_t,Lat_t))/1000
}

#maxdis<-max(Bd[1:Ne])
#print(c(it,it/Nt,maxdis))

Bd[1:N]<-ooX$par[1]+ooX$par[2]*(1-matern(Bd[1:N],ooX$par[3],0.5))
Bd[N+1]<-1
lam<-Ainv%*%Bd
Zhat<-sum(z*lam[1:N])
kv<-t(Bd)%*%lam
lagr<-lam[N+1]
krop[it,]<-c(Long_t,Lat_t,Zhat,kv,lagr)
}

# Back-transforming lognormal OK
krop$Zhat_exp <- exp(krop$Zhat + krop$kv/2 - krop$lagr)

# Saving exponential OK
fname<-paste("data/OK/",paste(Sys.Date(), element,"_OK_exp",Crop.target,".csv",sep=""))

write.csv(krop,fname,row.names=F)  

Finally, we had to back-transform the data and saved the output in the data folder.