# Benchmarks: definition of the metrics

Metrics are defined to compare *objectively* the different stochastic assimilation methods tested on each benchmark. Here, a "metric" is the combination of a mathematical measure (or score) **and** the objects it evaluates. The scores are essentially the same for every experiment, but the hierarchy of SANGOMA benchmarks imposes a matching hierarchy of specifications, each tailored to the purpose of its benchmark.

## Scores

The quality of the probability distributions produced by the stochastic assimilation systems is evaluated and compared through two probabilistic attributes: statistical consistency, or reliability, and statistical variability, or resolution/information/entropy (e.g. Toth et al. 2003).

The reliability measures the agreement between an estimated and a verified distribution. Many scores can be used to assess this attribute:

- Rank histogram and its multivariate extension (Gneiting et al. 2008). A perfectly reliable system shows a flat rank histogram. The departure from flatness can be quantified with a χ² test. For multivariate verification, a rank histogram can also be built from a minimum spanning tree process (Gombos et al. 2007).
- Reduced centered random variable (RCRV) and its matrix extension (Candille et al. 2007). This diagnostic partitions the reliability into bias and dispersion (at least in the univariate case): a perfectly reliable system has zero bias and unit dispersion.
- Reliability component of the Brier score and of the continuous ranked probability score (CRPS, Candille and Talagrand 2005). These scores are negatively oriented and equal to zero for a perfectly reliable system.
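As an illustration of the first two diagnostics, here is a minimal sketch in plain Python. The function names, the synthetic Gaussian data and the absence of observation error are assumptions made for this example only; they are not part of the benchmark specification.

```python
import random
import statistics


def rank_histogram(ensembles, verifications):
    """Count how often the verification takes each rank within its ensemble."""
    n_members = len(ensembles[0])
    counts = [0] * (n_members + 1)
    for members, truth in zip(ensembles, verifications):
        rank = sum(1 for m in members if m < truth)
        counts[rank] += 1
    return counts


def chi_square_flatness(counts):
    """Pearson chi-square statistic of the histogram against a flat one."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def rcrv_bias_dispersion(ensembles, verifications):
    """Mean (bias) and standard deviation (dispersion) of the RCRV.

    Assumption: no observation error, so the verification is only
    normalised by the ensemble spread.
    """
    values = []
    for members, truth in zip(ensembles, verifications):
        spread = statistics.stdev(members)
        values.append((truth - statistics.fmean(members)) / spread)
    return statistics.fmean(values), statistics.pstdev(values)


# Hypothetical synthetic data: truth and members share the same distribution
# in the "reliable" system; the "biased" system is shifted by one unit.
rng = random.Random(0)
n_cases, n_members = 2000, 20
centers = [rng.gauss(0.0, 1.0) for _ in range(n_cases)]
truths = [rng.gauss(c, 1.0) for c in centers]
reliable = [[rng.gauss(c, 1.0) for _ in range(n_members)] for c in centers]
biased = [[m + 1.0 for m in members] for members in reliable]

chi2_reliable = chi_square_flatness(rank_histogram(reliable, truths))
chi2_biased = chi_square_flatness(rank_histogram(biased, truths))
bias, dispersion = rcrv_bias_dispersion(reliable, truths)
biased_bias, _ = rcrv_bias_dispersion(biased, truths)
print(chi2_reliable, chi2_biased, bias, dispersion, biased_bias)
```

The χ² statistic is to be compared with a χ² distribution whose number of degrees of freedom equals the number of members (number of bins minus one). With a finite ensemble, the dispersion of a reliable system is only approximately one, because the spread is itself estimated from the members.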

The resolution measures the ability of the system to separate relevant situations. For instance, a system that always produces the climatological distribution is perfectly reliable but provides no information beyond the climatology. Many scores can be used to assess the resolution attribute:

- Resolution component of the Brier score (Murphy 1973). The score is negatively oriented and equal to zero for a perfect deterministic system (best value). The resolution is usually compared to the *uncertainty* (the part of the score that depends only on the verification data set). If the resolution is greater than the *uncertainty*, the system is considered useless.
- Resolution component of the CRPS and its multivariate extension, the energy score (Gneiting et al. 2008). The score is negatively oriented and equal to zero for a perfect deterministic system. Also, if the resolution is greater than the *uncertainty*, the system is considered useless.
- Entropy (Gneiting et al. 2008). The score is negatively oriented and equal to zero for a perfect deterministic system. Also, if the resolution is greater than the *uncertainty*, the system is considered useless. This measure is equivalent to the resolution part of the CRPS, but does not provide any diagnosis on the reliability of the system.
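The CRPS and the energy score named above can be estimated directly from an ensemble. The sketch below uses the kernel form of the energy score, E‖X − y‖ − ½ E‖X − X′‖, which reduces to the CRPS in one dimension. It computes the total score only, not its partition into reliability and resolution; the function names and the synthetic data are assumptions made for the illustration.

```python
import math
import random
import statistics


def energy_score(members, truth):
    """Ensemble estimate of E||X - y|| - 0.5 E||X - X'||.

    `members` is a list of state vectors, `truth` a single state vector.
    """
    n = len(members)
    to_truth = sum(math.dist(m, truth) for m in members) / n
    pairwise = sum(math.dist(a, b) for a in members for b in members) / (n * n)
    return to_truth - 0.5 * pairwise


def crps(members, truth):
    """Univariate case: the energy score of scalar members is the CRPS."""
    return energy_score([(m,) for m in members], (truth,))


# Hypothetical data: two reliable systems, one sharp and one climatological.
rng = random.Random(0)
sharp_scores, clim_scores = [], []
for _ in range(300):
    signal = rng.gauss(0.0, 1.0)
    truth = rng.gauss(signal, 0.3)
    sharp = [rng.gauss(signal, 0.3) for _ in range(20)]
    clim = [rng.gauss(0.0, math.sqrt(1.0 + 0.3**2)) for _ in range(20)]
    sharp_scores.append(crps(sharp, truth))
    clim_scores.append(crps(clim, truth))

mean_sharp = statistics.fmean(sharp_scores)
mean_clim = statistics.fmean(clim_scores)
print(mean_sharp, mean_clim)
```

Both synthetic systems are reliable by construction, so the lower mean score of the sharp system reflects its better resolution.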

Since ensemble verification is statistical by nature, confidence intervals can be attached to the scores (for instance with a resampling method such as the bootstrap) in order to obtain *objective* comparisons between the assimilation systems.
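A percentile bootstrap on per-case scores is one possible way to build such an interval. The sketch below assumes that the score is a mean over independent verification cases; the function name, the 90% level and the synthetic scores are illustrative choices, not prescriptions.

```python
import random
import statistics


def bootstrap_interval(case_scores, n_resamples=1000, level=0.90, seed=0):
    """Percentile bootstrap interval for the mean of per-case scores.

    Assumption: the cases are independent, so they can be resampled
    with replacement one by one.
    """
    rng = random.Random(seed)
    n = len(case_scores)
    means = sorted(
        statistics.fmean(rng.choices(case_scores, k=n))
        for _ in range(n_resamples)
    )
    lower = means[round((1 - level) / 2 * n_resamples)]
    upper = means[round((1 + level) / 2 * n_resamples) - 1]
    return lower, upper


# Hypothetical per-case scores for one assimilation system.
data_rng = random.Random(42)
scores = [data_rng.gauss(1.0, 0.5) for _ in range(200)]
low, high = bootstrap_interval(scores)
print(low, statistics.fmean(scores), high)
```

Two systems can then be considered significantly different when the interval on their score difference, computed on the same resampled cases, excludes zero. If the cases are correlated in time, resampling blocks of consecutive cases would be a safer choice.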

## Small case benchmark: Lorenz-96 model

The first benchmark involves a very small dynamical system and a highly idealized assimilation problem (twin experiment, all variables can be observed, no model error). The questions can thus be stated with full mathematical generality, and the metrics can be defined without any kind of approximation. Conversely, no answers on the numerical cost or on the robustness to uncontrolled approximations can be expected at this stage. Nevertheless, even with a state vector of size 40, evaluating the multivariate probability distribution as a whole would be ambitious (though still possible): a very large ensemble (about 100-1000 members) would be needed to describe the whole probability distribution properly. Even at this stage, the verification could therefore be limited to the marginal and the N-variate distributions (N < 40).
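For reference, the Lorenz-96 dynamics on a 40-variable state vector can be sketched as follows. The forcing value of 8, the time step of 0.05 and the fourth-order Runge-Kutta scheme are common choices assumed here for illustration; the benchmark configuration may differ.

```python
def lorenz96_tendency(x, forcing=8.0):
    """dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, with cyclic indices."""
    n = len(x)
    return [
        (x[(i + 1) % n] - x[i - 2]) * x[i - 1] - x[i] + forcing
        for i in range(n)
    ]


def rk4_step(x, dt=0.05, forcing=8.0):
    """Advance the state by one fourth-order Runge-Kutta step."""

    def shifted(state, slope, factor):
        return [s + factor * k for s, k in zip(state, slope)]

    k1 = lorenz96_tendency(x, forcing)
    k2 = lorenz96_tendency(shifted(x, k1, dt / 2), forcing)
    k3 = lorenz96_tendency(shifted(x, k2, dt / 2), forcing)
    k4 = lorenz96_tendency(shifted(x, k3, dt), forcing)
    return [
        xi + dt / 6 * (a + 2 * b + 2 * c + d)
        for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
    ]


# Start from the unstable equilibrium x_i = F with one perturbed variable.
state = [8.0] * 40
state[19] += 0.01
for _ in range(200):
    state = rk4_step(state)
print(min(state), max(state))
```

In a twin experiment, one such trajectory plays the role of the truth, and an ensemble of perturbed trajectories provides the prior distribution to be verified.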

*Questions*

- What is the consistency between the exact prior probability distribution and the one that is simulated or assumed by the assimilation method?
- Is the posterior probability distribution statistically consistent with the real error?
- To what extent is the uncertainty about the system reduced by the assimilation method?

*Metrics*

- Compute every score evaluating both reliability and resolution on the prior distribution. Since all the errors are under control, the emphasis should be put on the reliability attribute in order to estimate the impact of the ensemble size on the stochastic assimilation methods.
- Compute the scores evaluating both reliability and resolution on the posterior distribution.
- Compute the scores evaluating the resolution/entropy. Compare the gain between the prior and posterior distributions. A better resolution/entropy for the posterior distributions is expected.

## Medium case benchmark: double-gyre NEMO configuration

The intermediate benchmark is meant to be a direct transposition of the first benchmark metrics to a mesoscale ocean flow. For this reason, the ocean system is kept as simple as possible (square ocean, simple physics) and the assimilation problem is still an idealized problem (twin experiments, no model error). The additional difficulties come from the much larger size of the system, and from the fact that not all state variables are observed. With this benchmark, the question of the numerical efficiency of the assimilation method starts to become an issue.

*Questions*

- To what extent can the prior probability distribution be described by a moderate size ensemble? What is the best way to combine the ensemble description with additional assumptions about the prior distribution (like adaptive procedures)?
- Are the marginal posterior probability distributions consistent with the real error? Is there a difference between observed and non-observed variables or as a function of depth?
- What is the posterior uncertainty for every single model variable? How does it change in space and time?

*Metrics*

- Produce an estimate of exact marginal probability distributions using a very large ensemble, and explore the variations of the scores with respect to this exact distribution as a function of the ensemble size and/or additional assumptions (which are related to the numerical cost). This can be done for univariate marginal distributions and for several bi- or tri-variate marginal distributions to see if the dependence between variables is correctly reproduced as a function of the distance and/or time. If the distributions are close to Gaussian, this can be reduced to exploring the modifications in the ensemble variance and in the linear correlation structure.
- Compute the scores of every model variable given the corresponding posterior marginal distribution.
- Compute and compare the resolution/entropy scores corresponding to the marginal prior and posterior probability distributions.
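For the near-Gaussian case mentioned in the first metric, the exploration reduces to tracking how the ensemble variance and the linear correlation depart from their reference values as the ensemble size decreases. The sketch below uses a synthetic bivariate Gaussian as a stand-in for two model variables; the reference size of 5000, the correlation of 0.8 and the tested ensemble sizes are assumptions made for the illustration.

```python
import math
import random
import statistics


def correlation(xs, ys):
    """Linear (Pearson) correlation between two samples of equal length."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Very large reference ensemble (hypothetical stand-in for the "exact" one).
rng = random.Random(1)
rho = 0.8
reference = []
for _ in range(5000):
    a = rng.gauss(0.0, 1.0)
    b = rho * a + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0)
    reference.append((a, b))

ref_var = statistics.variance([a for a, _ in reference])
ref_corr = correlation(*zip(*reference))

# Mean absolute error on variance and correlation for smaller ensembles.
errors = {}
for size in (10, 100, 1000):
    var_err, corr_err = [], []
    for _ in range(50):
        xs, ys = zip(*rng.sample(reference, size))
        var_err.append(abs(statistics.variance(xs) - ref_var))
        corr_err.append(abs(correlation(xs, ys) - ref_corr))
    errors[size] = (statistics.fmean(var_err), statistics.fmean(corr_err))

for size, (var_e, corr_e) in errors.items():
    print(size, var_e, corr_e)
```

Plotting these errors against the ensemble size (a proxy for the numerical cost) gives one simple view of how large an ensemble is needed before additional assumptions on the prior distribution become necessary.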

## Large case benchmark: North-Atlantic 1/4° NEMO/LOBSTER configuration

The purpose of the last benchmark is to provide an intercomparison of the assimilation methods using a real-world assimilation problem, which is close to the current MyOcean systems. With respect to the intermediate benchmark, the additional difficulties are: (i) the much larger complexity of the system (larger variety of dynamical processes, about n = 6 × 10^7 state variables), (ii) the use of real-world observations (so that the true state of the system is no longer known), and (iii) the presence of various sources of model errors. For these reasons, the questions must be reformulated and the metrics adapted to provide a similar kind of intercomparison in this more complex situation.

*Questions*

- To what extent is it possible to provide a consistent description of the prior probability distribution? Is it consistent with the available observations? What is the best compromise between exploring the probability distribution using a large size ensemble (e.g. to identify non-Gaussian behaviours) and making prior assumptions about the shape of the distribution?
- Are the marginal posterior probability distributions consistent with the *independent* observations?
- What is the posterior uncertainty? Is the estimation compatible with the available prior knowledge of the dynamics?

*Metrics*

- Compute and compare the scores for the marginal (and maybe bivariate) prior distributions, with emphasis on the reliability scores.
- Compute the scores evaluating both reliability and resolution for the marginal (and bivariate if possible) posterior distributions.
- Define the list of key diagnostics (functions of the model variables) to be evaluated. Estimate and compare the resolution/entropy scores corresponding to the marginal prior and posterior probability distributions for each diagnostic.