Metropolis-Hastings integration: why isn't my strategy working?

Suppose I have a function $g(x)$ that I want to integrate:

$\int_{-\infty}^\infty g(x) dx.$

Of course, I am assuming that $g(x)$ goes to zero at the endpoints, has no blowups, and is otherwise a nice function. One approach I have been playing with is to use the Metropolis-Hastings algorithm to generate a list of samples $x_1, x_2, \dots, x_n$ from the distribution $p(x)$ proportional to $g(x)$, where $g$ lacks the normalization constant $N = \int_{-\infty}^{\infty} g(x)dx$, and then compute some statistic $f(x)$ on these $x$'s:

$\frac{1}{n} \sum_{i=1}^n f(x_i) \approx \int_{-\infty}^\infty f(x)p(x)dx.$
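The sampling step described above could look roughly like the following. This is only a minimal sketch of a random-walk Metropolis sampler, not code from the original post: the function name `metropolis`, the starting point `x0`, and the proposal scale `step` are all assumptions chosen for illustration, and the target $g(x) = e^{-x^2}$ is the test function used later in the question.

```r
# Random-walk Metropolis sampler for a density known only up to a constant.
# `g` is the unnormalised target; `x0` and `step` are illustrative assumptions.
metropolis <- function(g, n, x0 = 0, step = 1) {
  x <- numeric(n)
  cur <- x0
  g_cur <- g(cur)
  for (i in seq_len(n)) {
    prop <- cur + rnorm(1, 0, step)   # symmetric Gaussian proposal
    g_prop <- g(prop)
    # Accept with probability min(1, g(prop) / g(cur)); N cancels in the ratio
    if (runif(1) < g_prop / g_cur) {
      cur <- prop
      g_cur <- g_prop
    }
    x[i] <- cur
  }
  x
}

set.seed(1)
g <- function(x) exp(-x^2)
xs <- metropolis(g, 50000)
# Monte Carlo estimate of E[f(X)] for f(x) = x^2; the exact value under p is 1/2
est <- mean(xs^2)
print(est)
```

Because only the ratio $g(x')/g(x)$ enters the acceptance step, the sampler never needs $N$, which is exactly why $N$ has to be recovered separately.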

Since $p(x) = g(x)/N$, I can substitute $f(x) = U(x)/g(x)$ to cancel $g$ from the integral, resulting in an expression of the form

$\frac{1}{N}\int_{-\infty}^{\infty}\frac{U(x)}{g(x)} g(x) dx = \frac{1}{N}\int_{-\infty}^\infty U(x) dx.$

So, as long as $U(x)$ integrates to $1$ over that region, I should get the result $1/N$, and I could take its reciprocal to get the answer I want. Therefore, I could take the range of my sample (to use the points most effectively), $r = x_\max - x_\min$, and let $U(x) = 1/r$ for each sample I have drawn. That way $U(x)$ evaluates to zero outside the region where my samples lie, but integrates to $1$ in that region. So if I now take the expected value, I should get:

$E\left [\frac{U(x)}{g(x)}\right ] = \frac{1}{N} \approx \frac{1}{n} \sum_{i=1}^n \frac{U(x_i)}{g(x_i)}.$

I tried testing this in R for the sample function $g(x) = e^{-x^2}$. In this case I do not use Metropolis-Hastings to generate the samples but draw them from the actual distribution with rnorm (just to test). I do not quite get the results I am looking for. The full expression of what I am calculating is:

$\frac{1}{n(x_{\max} - x_\min)} \sum_{i=1}^n \frac{1}{ e^{-x_i^2}}.$

In my theory this should evaluate to $1/\sqrt{\pi}$. It gets close, but it certainly does not converge in the expected way. Am I doing something wrong?

ys = rnorm(1000000, 0, 1/sqrt(2))
r = max(ys) - min(ys)
sum(sapply(ys, function(x) 1/( r * exp(-x^2))))/length(ys)
## evaluates to 0.6019741. 1/sqrt(pi) = 0.5641896

Edit for CliffAB

The reason I use the range is just to easily define a function that is non-zero over the region where my points are, but that integrates to $1$ on the range $(-\infty, \infty)$. The full specification of the function is:

$U(x) = \begin{cases} \frac{1}{x_\max - x_\min} & x_\max > x > x_\min \\ 0 & \text{otherwise.} \end{cases}$

I did not have to use this uniform density for $U(x)$. I could have used some other density that integrates to $1$, for example the probability density

$P(x) = \frac{1}{\sqrt{\pi}} e^{-x^2}.$

However, this would have made summing the individual samples trivial, i.e.

$\frac{1}{n} \sum_{i=1}^n \frac{P(x_i)}{g(x_i)} = \frac{1}{n} \sum_{i=1}^n \frac{e^{-x_i^2}/\sqrt{\pi}}{e^{-x_i^2} } = \frac{1}{n} \sum_{i=1}^n \frac{1}{\sqrt{\pi}} = \frac{1}{\sqrt{\pi}}.$

I could try this technique for other distributions that integrate to $1$ . However, I would still like to know why it doesn't work for a uniform distribution.

— Mike Flynn

Only quickly looking over this, so I'm not sure exactly why you decided to use range(x). Conditionally on it being valid, it's extremely inefficient! The range of a sample of that size is just about the most unstable statistic you could take.

— Cliff AB

@CliffAB There's nothing particularly special about me using the range, aside from defining a uniform distribution on the interval where my points lie. See edits.

— Mike Flynn

I'll look at this later in more detail. But something to consider is that if x is a set of uniform RVs, then as $n \rightarrow \infty$, $\text{range}(x) \rightarrow 1$. But if x is a set of non-degenerate normal RVs, then as $n \rightarrow \infty$, $\text{range}(x) \rightarrow \infty$.

— Cliff AB

@CliffAB you might have been right, I think the reason was that the bounds of the integral were not fixed, and so the variance of the estimator will never converge...

— Mike Flynn

This is a most interesting question, which relates to the issue of approximating a normalising constant of a density $g$ based on an MCMC output from the same density $g$. (A side remark is that the correct assumption to make is that $g$ is integrable; going to zero at infinity is not sufficient.)

In my opinion, the most relevant entry on this topic in regard to your suggestion is a paper by Gelfand and Dey (1994, JRSS B), where the authors develop a very similar approach to find $\int_\mathcal{X} g(x) \,\text{d}x$ when generating from $p(x)\propto g(x)$. One result in this paper is that, for any probability density $\alpha(x)$ [this is equivalent to your $U(x)$] such that

$\{x;\alpha(x)>0\}\subset\{x;g(x)>0\}$

the following identity

$\int_\mathcal{X} \dfrac{\alpha(x)}{g(x)}p(x) \,\text{d}x=\int_\mathcal{X} \dfrac{\alpha(x)}{N} \,\text{d}x=\dfrac{1}{N}$

shows that a sample from $p$ can produce an unbiased evaluation of $1/N$ by the importance sampling estimator

$\hat\eta=\frac{1}{n}\sum_{i=1}^n \dfrac{\alpha(x_i)}{g(x_i)}\qquad x_i\stackrel{\text{iid}}{\sim}p(x)$

Obviously, the performance (convergence speed, existence of a variance, etc.) of the estimator $\hat\eta$ does depend on the choice of $\alpha$ [even though its expectation does not]. In a Bayesian framework, a choice advocated by Gelfand and Dey is to take $\alpha=\pi$, the prior density. This leads to

$\dfrac{\alpha(x)}{g(x)} = \dfrac{1}{\ell(x)}$

where $\ell(x)$ is the likelihood function, since $g(x)=\pi(x)\ell(x)$. Unfortunately, the resulting estimator

$\hat{N}=\dfrac{n}{\sum_{i=1}^n1\big/\ell(x_i)}$

is the harmonic mean estimator, also called the worst Monte Carlo estimator ever by Radford Neal, from the University of Toronto. So it does not always work out nicely. Or even hardly ever.
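To make the dependence on $\alpha$ concrete, the second moment of a single term of $\hat\eta$ can be written out directly from the definitions above. This short derivation is an illustrative addition rather than a result quoted from Gelfand and Dey's paper, and it treats $\alpha$ as a fixed density that does not depend on the sample:

$\mathbb{E}_p\left[\left(\dfrac{\alpha(x)}{g(x)}\right)^2\right] = \int_\mathcal{X} \dfrac{\alpha(x)^2}{g(x)^2}\,\dfrac{g(x)}{N} \,\text{d}x = \dfrac{1}{N}\int_\mathcal{X} \dfrac{\alpha(x)^2}{g(x)} \,\text{d}x$

So $\hat\eta$ has a finite variance only when $\int_\mathcal{X} \alpha(x)^2/g(x) \,\text{d}x$ is finite, which requires $\alpha$ to decay fast enough relative to $g$. With $\alpha=\pi$ the integrand becomes $\pi(x)/\ell(x)$, which can easily fail to be integrable.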

Your idea of using the range of your sample $(\min(x_i),\max(x_i))$ and the uniform over that range is connected with the harmonic mean issue: this estimator does not have a variance, if only because of the $\exp\{x^2\}$ appearing in the numerator (I suspect this could always be the case for an unbounded support!), and it thus converges very slowly to the normalising constant. For instance, if you rerun your code several times, you get very different numerical values after $10^6$ iterations. This means you cannot even trust the magnitude of the answer.
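The instability across reruns can be checked with a small experiment along these lines. This is a sketch rather than the code from the question: the helper name `range_estimate`, the seed, and the smaller sample size of $10^5$ (chosen only to keep the run short) are assumptions.

```r
# Range-based estimator of 1/N for g(x) = exp(-x^2), as in the question.
# `range_estimate` is a hypothetical helper name used for illustration.
range_estimate <- function(n) {
  ys <- rnorm(n, 0, 1 / sqrt(2))   # exact draws from p(x) = g(x) / sqrt(pi)
  r <- max(ys) - min(ys)           # width of the uniform density alpha
  mean(1 / (r * exp(-ys^2)))       # (1/n) * sum of alpha(x_i) / g(x_i)
}

set.seed(42)
runs <- replicate(10, range_estimate(1e5))
print(runs)            # ten independent repetitions of the same estimator
print(1 / sqrt(pi))    # target value, about 0.5642
```

Comparing the printed repetitions with the target gives a quick feel for how much the estimate moves from run to run.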

A generic fix to this infinite variance issue is to use for $\alpha$ a more concentrated density, using for instance the quartiles of your sample $(q_{.25}(x_i),q_{.75}(x_i))$ , because $g$ then remains lower-bounded over this interval.

When adapting your code to this new density, the approximation is much closer to $1/\sqrt{\pi}$ :

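ys = rnorm(1e6, 0, 1/sqrt(2))
# width of the interquartile interval, the support of the uniform density alpha
r = quantile(ys,.75) - quantile(ys,.25)
# keep only the draws where alpha is non-zero
yc=ys[(ys>quantile(ys,.25))&(ys<quantile(ys,.75))]
# average over all draws (length(ys)), since alpha(x_i) = 0 outside the interval
sum(sapply(yc, function(x) 1/( r * exp(-x^2))))/length(ys)
## evaluates to 0.5649015. 1/sqrt(pi) = 0.5641896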
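The same idea can be wrapped in a reusable function so that the interval and the unnormalised density are parameters. This is a minimal sketch under stated assumptions: the name `central_estimate` and the arguments `lo` and `hi` are hypothetical, and the defaults merely reproduce the quartile choice above.

```r
# Estimate 1/N with a uniform alpha on the central (lo, hi) quantile interval
# of the sample; `central_estimate`, `lo` and `hi` are illustrative names.
central_estimate <- function(ys, g, lo = 0.25, hi = 0.75) {
  q <- quantile(ys, c(lo, hi), names = FALSE)
  inside <- ys > q[1] & ys < q[2]
  # alpha is zero outside (q[1], q[2]), so those draws add 0 to the sum,
  # but the average is still taken over all length(ys) draws
  sum(1 / ((q[2] - q[1]) * g(ys[inside]))) / length(ys)
}

set.seed(42)
g <- function(x) exp(-x^2)
runs <- replicate(10, central_estimate(rnorm(1e5, 0, 1 / sqrt(2)), g))
print(range(runs))     # spread across ten repetitions
print(1 / sqrt(pi))    # target value, about 0.5642
```

Passing samples from a Metropolis-Hastings run as `ys` instead of `rnorm` draws would use the same function unchanged, although correlated draws would be expected to give a noisier estimate.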

We discuss this method in detail in two papers with Darren Wraith and with Jean-Michel Marin.

— Xi'an