The Many Ways of Sampling

Micro Blog: Importance Sampling

Published

May 11, 2026

You will see this dotted around the blog: TL;ND: (Taking) too long; not doing (it)..

Feel free to read [1, Secs. 1–2.1] to quickly get up to speed on what importance sampling is and why we do it (i.e., variance reduction). Here’s a quick teaser:

[1]

A. Owen and Y. Zhou, “Safe and effective importance sampling,” Journal of the American Statistical Association, vol. 95, no. 449, pp. 135–143, 2000, Accessed: Apr. 13, 2026. [Online]. Available: http://www.jstor.org/stable/2669533

We have a standard diffuse Lambertian model.

\[ L_o=\int_{H^+}L_i\frac{\rho}{\pi}\cos\theta d\omega \]

Most of the time, these models are part of a greater complex model that is intractable. So, we use Monte Carlo sampling ¹ with \(N\) samples.

¹ Note: \(\frac{1}{N}\) is not averaging over the integration. Think of it like Riemann sum (i.e., standard rectangular quadrature rule) but extended probabilistically, which is better known as stochastic quadrature rule. In other words, in terms of 1D interval, think of it like each samples represents random points of evaluation with all having the same average width depending on sample count. The more sample there is, the smaller each samples are weighted, but they’re randomly placed (unlike the aligned rectangular sample grids of Riemann sum).

\[ \langle L_o \rangle = \frac{1}{N}\sum_{k=1}^N\bigg(\frac{L_i\frac{\rho}{\pi}\mathbf{n}\cdot\mathbf{\omega_k}}{\text{pdf}_{H^+}}\bigg) \text{ s.t. } \mathbf{\omega_k}\sim H^+ \]

The notation \(\langle \cdot \rangle\) is confusing and sometimes seems to be inconsistent, but to the best of my knowledge, the notation seems to be borrowed from physics to denote expectation. In computer graphics literature, it is more specifically denoted as the expectation over the partitions of the stochastic quadrature with the sampling technique used by the Monte-Carlo estimation (which is just the Monte-Carlo estimation) with samples limited to \(N\). As you increase the \(N\), the M.C. estimation should converge to (not be) the correct value as if it were analytically integrable. We are still basically summing \(N\) random variables which means the M.C. estimate itself is still a random variable. If we were given \(100\) samples split into \(10\) groups with each having \(10\) samples for a total of \(10\) M.C. evaluations of the same integral, we still have \(10\) R.V. from each of the M.C. evaluations but each of the estimations is a singular value of a valid estimation of the integral that is usable in its own way. So why not just have all \(100\) samples be evaluated into a single M.C. estimation? Well, this now depends on the preference of the application. In graphics, we want each M.C. evaluations have a limited samples so that it can run with high FPS. But since the M.C. evaluates only an estimate (and it’s a R.V.), we want to ensure it is unbiased (accurate) and low variance (precise) by running additional statistical analysis over it separately. And this is where it gets confusing again with finding biasness in the M.C. evaluations \(\mathbb{E}[\langle \cdot \rangle]\) which is mostly hand-waved in [2, Sec. 2.4].

[2]

E. Veach, “Robust monte carlo methods for light transport simulation,” PhD thesis, Stanford University, Stanford, CA, USA, 1998.

Tangent: “Better” proof on \(\mathbb{E}[\langle Y \rangle] = Y\)

Disclaimer: I over-complicated and Veach’s proof remains sensical as is.

Suppose \(Y=\int f(x)dx\) is M.C. estimated with:

\[ \langle Y \rangle = \frac{1}{N}\sum_{i=1}^N \frac{f(X_i)}{p_X(X_i)} \]

where \(X_i \sim p_X\)² (\(X_i\) is a R.V.).

Suppose \(\mathbb{E}[g(z)] = \int g(z)p_X(z)dz\). We get the following:

\[ \mathbb{E}[\langle Y \rangle] = \mathbb{E}\bigg[ \frac{1}{N}\sum_{i=1}^N \frac{f(X_i)}{p_X(X_i)} \bigg] = \frac{1}{N}\sum_{i=1}^N \mathbb{E}\bigg[ \frac{f(X_i)}{p_X(X_i)} \bigg] = \frac{1}{N}\sum_{i=1}^N \int \frac{f(X_i)}{p_X(X_i)} p_X(z)dz \]

If we simplify the expression, we get \(\mathbb{E}_{z\sim p_X}\bigg[ \frac{1}{N}\sum_{i=1}^N \frac{f(X_i)}{p_X(X_i)} \bigg]\). Do you see the problem? I over-complicated the interpretation of Veach’s proof. Of course an expectation over a function (parameterized by \(X_i\)) that does not even take the parameter \(z\) that’s being expected over is just the function itself again. It’s a totally non-sensical statement that’s only made sense if the expectation is modified into \(\mathbb{E}_{X_i\sim p_X}\bigg[ \frac{1}{N}\sum_{i=1}^N \frac{f(X_i)}{p_X(X_i)} \bigg]\). Rest of the proof should follow trivially.

The key idea was Monte-Carlo estimator was sampled “externally” and the expectation brought the external sampling into itself, turning the sampling process into an integration over all possible samples.

² There does seem to be a general ambiguity on how to denote the process of sampling variates or generating random variables from distributions, densities, probability measures, etc. Per §1.2 in [3, p. 10], we can say in our case the random variable \(X_i\) is generated from another random variable \(X\) that induces a push-forward measure \(\mu\) with a density \(p_X(x)\) for every realization/variates of \(x\) from \(X\).

[3]

R. Durrett, Probability: Theory and examples, 5th ed. USA: Cambridge University Press, 2019.

In our case, we are interested in reducing the variance, one of the main workable properties of this enigmatic, dynamic/changing evaluations of \(\langle L_o \rangle\). And a common, simple way to reduce variance is by applying importance sampling. In this case, if we swap out our sampling from a uniform hemispherical sampling \(\mathbf{\omega_k}\sim H^+\) with a PDF of \(\frac{1}{2\pi}\) to a cosine-weighted hemispherical sampling \(\mathbf{\omega_k}\sim \cos_\mathbf{n}\) with a PDF of \(\frac{\mathbf{n}\cdot\omega_k}{\pi}\), we can actually get the following.

\[ \langle L_o \rangle = \frac{1}{N}\sum_k^N\bigg(\frac{L_i(\mathbf{\omega_k})\frac{\rho}{\pi}\mathbf{n}\cdot\mathbf{\omega_k}}{\frac{\mathbf{n}\cdot\omega_k}{\pi}}\bigg) = \frac{1}{N}\sum_k^N\bigg(L_i(\mathbf{\omega_k})\rho\bigg) \text{ s.t. } \mathbf{\omega_k}\sim \cos_\mathbf{n} \]

Isn’t that just satisfying? Just simply replace how you sample and you get a much more neater expression (that also lowers variance). Now, the interesting part is not the simplification algebra, but rather how are we even able to craft a seemingly arbitrary sampling distribution such as cosine-weighted hemispherical distribution over here? It’s also the idea of the existence of the ability to transport uniform samples to a desired probability “density”³ itself and how it is done that seems so captivating. This blog will be a focus on how one can craft these seemingly arbitrary samplers from just uniform distributions (a common, sensible, intuitive utility widely available in libraries from Python standard library, PyTorch, cuRAND, etc.), and show where it is commonly applied in (if it has any.)

³ I’m telling you, probability is just… enigmatic. You either can explain it in words and not visually, or you can show it visually and impossible to explain it in words like here.

Here’s a teaser, from barebones uniform sampler \(U[0,1)\) (shortened to \(U\)) to hemisphere sampler \(p_\text{hemisphere}\) to cosine-weighted hemisphere sampling \(p_\text{cos}\). I won’t show the the derivation for this one because if you are just getting started in graphics you probably have incidentally just memorize the formulas from the constant mentioning of cosine-weighted sampling during implementation (i.e., beating a dead horse.)

The purpose of this visualization shows the deterministic transformation of the samples. This can be thought of as assuming each samples are deterministic to clearly visualize the transformation process of the density and treats the input randomness as something independent. We are given \(U_1, U_2, U_3\).

Introduction

These are some distributions I have come across a lot and seems to have derivation to transform uniform samples into it. I hope this also lets you start seeing the world of transforming uniform samples.

3D Gaussian

We will specifically use 3D version of the standard Gaussian. Suppose a Monte Carlo has Normal distribution factor (parameterized by \(\mathbf{x}=[x,y,z]\)) that we want to remove via \(\mathbf{pdf}_\text{gaussian}=\frac{\exp{(-\frac{1}{2}\mathbf{x}^\top\mathbf{x})}}{\sqrt{(2\pi)^3}}\). How would we design the sampler such that the sampled points \(\mathbf{x}\) naturally follows the Gaussian distribution so that its manufactured density cancels out in the Monte-Carlo expression? Suppose \(U_1, U_2, U_3\).

1st attempt: Inverse CDF

Recall that a CDF \(F\) is an integral of PDF \(f\) that describes the percentile for a given random variate \(x\) with density \(f\). A nice property of CDF is that \(0\leq F(x) \leq 1\) since \(f\) is normalized, allowing us to generate uniform samples that will be transformed by \(F^{-1}\) to get the original density without the cost of evaluating additional terms. Per Theorem 3.1 of [4, Secs. 11–3.2], we can let each of our target random variables \(Y_1, Y_2, Y_3\) be assigned as \(Y_i = F^{-1}_i(U_i)\), and it will still maintain the original density of the 3D gaussian. Additionally, Gaussian distribution is independent with respect to each of its dimension, so we can split them into a product of three marginal Gaussian distribution \(\mathbf{pdf}_\text{gaussian}=\mathcal{N}_1(0,1)\mathcal{N}_2(0,1)\mathcal{N}_3(0,1)\) ⁴ which can be applied to the theorem. However, to obtain \(F\) in the first place, we have to find a closed form solution of \(F(x)=\int_{-\infty}^x \mathcal{N} dx=\int_{-\infty}^x \frac{e^{-\frac{x^2}{2}}}{\sqrt{2\pi}} dx\). This is a well-known problem and has no closed-form solution (strictly speaking). Hence, we go with the Box-Muller technique.

⁴ Where we do denote \(\mathcal{N}\) as the standard marginal Gaussian distribution.

2nd attempt: (Extended) Box-Muller ⁵

Since we want to avoid rejection sampling out of elegancy, we are limited to Box-Muller technique that works in 2 dimensions. In other words, we will have to use \(U_1,U_2,U_3,U_4\) (2x Box-Muller) to generate a 4D Gaussian which we will just extract the first three components (since each axes are independent of each other). You might argue this is even less elegant, but I’m to deep into Box-Muller such that we’ll just go with it.

We will focus on \(d=2\) case for now. In [4, Secs. 5–4.4], we can generate \(d=2\) i.i.d. normal random variates by \(RX\), where \(R\) is the scalar random variable⁶ of radius and \(X\) is the 2-dimensional random variable on the surface of the unit 2D hyper-sphere (\(C_2\)). Per Theorem 4.4 of [4, Secs. 5–4.3], statement 5 states \(X\sim(\cos(2\pi U_1), \sin(2\pi U_1))\) is uniformly distributed on \(C_2\) ⁷, which summarizes the Box-Muller method. Then, \(r\sim R\) is realized on a density of \(d\cdot V_d\cdot r^{d-1}\cdot e^{-\frac{r^2}{2}}\) where \(d\) is the dimension and \(V_d\) is the volume of \(d\)-dimension hypersphere⁸. However, this is different from Devroye’s originally intended density which is \(d \cdot V_d \cdot r^{d-1} \cdot g(r) = d \cdot V_d \cdot r^{d-1} \cdot \frac{e^{-\frac{r^2}{2}}}{(2\pi)^{\frac{d}{2}}}\) from Theorem 4.3 (1st statement) and Theorem 4.2. This is the normalized (hence, correct) version, but Devroye did subtly mention \(R \sim \sqrt{2G}\)⁹ where \(G \sim \text{Gamma}(\frac{d}{2},1)\). In fact, the following note (collapsed by default) will show that \(\sqrt{2G}\) is the correct, normalized version ¹⁰.

⁶ Random variate \(x\) is a particular realized value/outcome from a random variable \(X\). Random variate \(x\) can be exactly treated like traditional variable while random variable \(X\) is intrinsically described by a distribution (e.g., CDF, PDF) that could behave differently than a traditional variable within expressions.

⁷ \(\tan{\pi U}=\frac{\sin X_1}{\cos X_2}\) where \(X_1,X_2\) are on \(C_2\). This is also the main reason why Box-Muller works in 2 dimensions.

⁸ Check [4, Secs. 5–4.1] for more info.

[4]

L. Devroye, Non-uniform random variate generation. New York: Springer-Verlag, 1986. Available: http://cg.scs.carleton.ca/~luc/rnbookindex.html

⁹ If \(R=\sqrt{2G}\), that means every realized values of \(\sqrt{2G}\) is \(R\) at a concrete, realized level. \(R\sim \sqrt{2G}\) just means the underlying realized, concrete values might differ but the distribution (i.e., the density) of these realized values will follow from \(R\) to \(\sqrt{2G}\) (and vice versa).

¹⁰ Conceptually, density loosely does not have to be normalized. It makes sense as long the order is kept (higher density will be scaled to be denser in the same amount as the space with lesser density.) However, by definition, we strictly want it to be normalized out of consistency, and I was confused for a long long time on this specific description.

Ensuring normalization is present in \(R\)

Given the following:

\(G \sim \text{Gamma}(\frac{d}{2},1) = \frac{1}{\Gamma(\frac{d}{2})}x^{(\frac{d}{2}-1)}e^{-x}\) ¹¹
\(V_d=\frac{\pi^\frac{d}{2}}{\Gamma(\frac{d}{2}+1)}\) (the volume of a \(d\)-dimensional hypersphere)
\(f_G(g) = \frac{1}{\Gamma(\frac{d}{2})}g^{(\frac{d}{2}-1)}e^{-g}\) (the density of the source distribution)
\(\Gamma(z+1)=z\Gamma(z)\) where \(z\in \mathbb{C}, \mathfrak{R}(z)>0\)

We want to find the density of \(R\):

\(y_R(r) ={}?\)

We can start with the transformation of random variable. Per Theorem 4.2 in [4, Secs. 1–4.1], suppose we have a transformation \(R=\sqrt{2G}\). Since \(r>0\), then \(G>0\). Hence, an inverse transformation must also exist where \(G=\frac{R^2}{2}\). Finding the Jacobian in 1D is just the absolute diffentiation of \(R\) w.r.t. \(G\).

\[ y_R(r) = f_G\bigg(\frac{r^2}{2}\bigg)\bigg|J\bigg| = f_G\bigg(\frac{r^2}{2}\bigg)\bigg|\frac{d\frac{r^2}{2}}{dr}\bigg| = f_G\bigg(\frac{r^2}{2}\bigg)\bigg|r\bigg| \]

Since \(r\ge 0\), we can simplify further.

\[ \begin{align*} y_R(r) &= f_G\bigg(\frac{r^2}{2}\bigg)r \\ &= \frac{1}{\Gamma(\frac{d}{2})}\bigg(\frac{r^2}{2}\bigg)^{(\frac{d}{2}-1)}e^{-\frac{r^2}{2}}r \\ &= \frac{1}{\frac{2}{d}\Gamma(\frac{d}{2}+1)}\frac{\pi^\frac{d}{2}}{\pi^\frac{d}{2}}\bigg(\frac{r^{d-2}}{2^{(\frac{d}{2}-1)}}\bigg)e^{-\frac{r^2}{2}}r \\ &= d \cdot V_d \cdot r^{d-1} \cdot \frac{e^{-\frac{r^2}{2}}}{(2\pi)^\frac{d}{2}} \end{align*} \]

Despite the first equation shown as un-normalized, the implied sampling distribution \(\sqrt{2G}\) from Devroye is actually normalized and we can safely assume it is correct with the full context.

¹¹ per wikipedia

Now with everything satisfied, we can comfortable know that with \(X\sim(\cos(2\pi U_1), \sin(2\pi U_1))\) and \(R\sim \sqrt{2G}\) we can produce \(RX\) that gives us 2 i.i.d. normal samples. Recall the independence assumption, we can get a 3D gaussian by letting \((Y_1,Y_2) = RX\) and \(Y_3 = RX_1\). In fact, we can let each \(Y_i\) sample be the \(\sin\) component of each of the 3 Box-Muller \(X\) pairs and theoretically it will still work, but that’s a total waste of computation.

We’re not done yet, while we can clearly see how to sample \(X\), what about sampling \(\sqrt{2G}\)?¹² This should be easier since we’re now working 1D. Recall \(G\sim \frac{1}{\Gamma(\frac{d}{2})}r^{(\frac{d}{2}-1)}e^{-r}\). The gamma function \(\Gamma\) (not to be confused with the Gamma distribution) does get wonky with the possiblity of no closed-form solution (i.e., impossible to find a sampling method) on non-integer input, so we’ll assume \(d=2\) from now on (another reason we are restricted in terms of 2, dimensionally speaking). We get a convenient distribution \(G\sim e^{-r}\). Easy win: \(G \sim -\ln{(1-U)}\). (Check the notes below for derivation.)

¹² When we want to sample, we want to work the other way. Previously, we tried to find the density of that. Now, we want to find the sampling method that takes a uniform and transform it into the density.

Yet another derivation on how to sample exponential distribution

Via inverse CDF.

We have the density (PDF) \(f(r)=e^{-r}\) for \(r\geq 0\). We can get its CDF as follows:

\[ \begin{align*} F(r) &= \int_{0}^{r} e^{-t}dt \\ &= -e^{-t} \big|_{0}^{r} \\ &= -e^{-r}+e^{0} = 1-e^{-r} \\ \end{align*} \]

The inverse CDF is \(-\ln{(1-F(r))}=r=F^{-1}(u)\). Since exponential distribution has infinite support (in one direction), the CDF will never reach to 1. This means the \(\ln\) is always defined since \(F(r)<1 \implies 1-F(r)>0\). Because we defined our uniform as \(U[0,1)\), \(1-U\) yields an interval of \((0,1]\), ensuring validity.

Hence, we can sample the distribution as \(-\ln{(1-U)}\).

Now we have \(X\sim(\cos(2\pi U_1), \sin(2\pi U_1))\) and \(R \sim \sqrt{-2\ln{(1-U_2)}}\) all in terms of \(U\), we can finally put it altogether.

The steps required to sample a 3D gaussian is more involved with more steps. Feel free to change the code (i.e., permute the transitions) to your liking or visualize the individual parts like sampling the exponential distribution. In later sections, we will show a better animation of the change in density of 3D gaussian by ensuring point-to-point correspondence. Currently, there does seem to be a tradeoff between closed-form sampling methods versus point-corresponding density transformation, which the later sections will relax upon.

Remark: Re-defining what “elementary” functions are

Closed-form solutions and methods can be thought of as them strictly solved/derived within a self-contained set of functions. Traditionally, perhaps from a pedagogical view, this has been the arbitrarily-defined “elementary” functions. Since we are more focused on the computational, programmatic aspect of probabilities, and math in general (for my blogs), let us redefine the elementary functions that are elementary in the sense that many computational libraries for CPU & GPU provides the function atomically. Usually for good reasons that includes minimal constant-bounded clock cycles if the entire math is viewed from a restricted Turing machine instead of elementary functions or even ZFC. Because at the end of the day, \(\sin\) and \(\cos\) are numerically approximated and \(U\) is provided practically everywhere via computational libraries as aforementioned.

Tangent: Proving single-variable inverse CDF technique from transformation of R.V.

Devorye’s [4] definition of R.V. is quite different from the standard defintion of R.V. Suppose a random variable \(X\), Devroye assumes random variable are random variates generated by an algorithm that always halt and produces a real number \(X\), while Kolmogorov’s definition of R.V. per Durrett [3] is a function where \(X:\Omega \to \mathbb{R}\). Random variable in this sense can be thought of as a variable, but technically we are passing the function object itself (i.e., lambdas in Python) and the inequalities that commonly operate on the R.V. are like the domain restriction on that R.V. Similiarly, \(\sim\) describes how the R.V., as a function, is distributed, in relation to other distribution or R.V. Regardless, we’ll continue Devorye’s assumption of R.V. ¹³ ¹⁴

Suppose we have a distribution function (CDF) \(F(x)\), which always has an inverse \(F^{-1}(y)=\inf\{x:F(x)=y, 0<y<1\}\)¹⁵ by Devorye’s infinum definition¹⁶ of inverse CDF, from the target distribution \(X\). Given a known source distribution \(Y\sim U[0,1)\), how can we obtain the target via transformation of R.V.?

We will assume the CDF is 1D¹⁷.

Since \(Y\) has a density \(f_Y(y)=1\) that is continuous a.e. over its support and our transformation, the quantile function \(x=F^{-1}(y)\), is one-to-one and onto (again, because of its quantile definition). Following Theorem 4.2 [4, p. 14], the inverse (of the inverse CDF) must exist as the CDF itself \(y=F(x)\). Since \(\frac{dy}{dx}\) exists due to CDF’s continuity and the codomain of \(F^{-1}(y)\) is also continuous, by the theorem, \(X\) has a density of \(f_Y(F(x))|\frac{dy}{dx}|\). Since \(f_Y(y)=1\) and \(|\frac{d}{dx}F(x)|=|f_X(x)|\) with the fact that all density must be positively defined, \(X\) has the density \(f_X(x)\) as expected.

\(\Box\)

¹⁷ It was at this point one realizes that the simple inverse CDF only works 1D. Multivariable inverse CDF requires special handling.

¹⁶ a.k.a. quantile function

¹⁵ The reason it is not \(0\le y\le 1\) is we got to find a good definition of inverse CDF that supports various distributions (including those that have infinite support), but also \(y=0,1\) are 0-measure that contributes nothing when removed. Hence, it is safe to exclude them. Devorye does this implicitly, but this is more commonly known as almost everywhere (a.e.) and almost surely (a.s.). A fancy way to say we ignore degenerate cases precisely and hand-wavily (usually to keep the definition consistent).

¹⁴ Bayesian and frequentist are different views/interpretations of Kolmogorov probability and statistics, which is orthogonal to different definitions of probability that was discussed before.

¹³ Devorye’s definition of probability is a subsidiary of Kolmogorov’s / Measure-theoretic definition of probability just like Peano’s axiom to ZFC. Additionally, Kolomogorov probability itself can somehow be further verified by ZFC. Regardless, we’ll only focus on Kolmogorov’s since that’s the most widely used. However, there does exist some fringe definitions of probability such as quantum probability that could be useful later on (though I highly doubt quantum computers is even theoretically useful for anything other than cryptography).

Tangent: Proving variance reduction of importance sampling (single R.V.; scalar)

Suppose we have the following base Monte-Carlo estimation (\(X_i \sim p_X\)) of the integral \(A = \int_\mathcal{D} f(x)g(x)dx\). Assume \(g(x)\) is non-negative for almost all \(x\)¹⁸.

\[ A = \frac{|\mathcal{D}|}{N}\sum_{i=1}^{N}\frac{f(X_i)g(X_i)}{p_X(X_i)} \]

Let us suppose the domain of integration is simple (\(|\mathcal{D}|=1\)) and we naively sample the estimation uniformly \(p_\text{naive}(Y_i)=1\).

\[ \langle A \rangle_\text{naive} = \frac{1}{N}\sum_{i=1}^{N}f(Y_i)g(Y_i) \]

Let us suppose we employ importance sampling (“i.s.”) technique where \(p_\text{i.s.}(Z_i)\propto g(Z_i)\), we get the following. \(c=1/\int_{\mathcal{D}}g(x)dx=1/\mathbb{E}_{X_1\sim p_{\text{naive}}}[g(X_1)]\) ¹⁹ is the normalization factor for \(g\) to ensure its conversion into a sampling process is normalized to \(1\).

\[ \langle A \rangle_\text{i.s.} = \frac{1}{N}\sum_{i=1}^{N}\frac{f(Z_i)g(Z_i)}{cg(Z_i)} = \frac{1}{cN}\sum_{i=1}^{N}f(Z_i) \]

Now that we have both \(\langle A \rangle_\text{naive}\) and \(\langle A \rangle_\text{i.s.}\), we can show if importance sampling truely reduces the variance \(\mathbb{V}\). Remember, the output estimate of the Monte-Carlo is a random variable itself.

\[ \begin{align*} \mathbb{V}[\langle A \rangle_\text{naive}] &= \mathbb{V}_{Y_1,\cdots,Y_N\sim p_\text{naive}}\bigg[ \frac{1}{N}\sum_{i=1}^N f(Y_i)g(Y_i) \bigg] \\ &= \frac{1}{N^2}\sum_{i=1}^N \mathbb{V}_{Y_i\sim p_\text{naive}}[ f(Y_i)g(Y_i)] \\ &= \frac{1}{N} \mathbb{V}_{Y_1\sim p_\text{naive}}[ f(Y_1)g(Y_1)] \end{align*} \]

Since \(Y_1,\cdots,Y_N\) are usually i.i.d. in the context of M.C. estimation, we can use Theorem 2.2.1 from [3, p. 56] to get sum of variance from a variance of sum, since the covariance would theoretically be \(0\) because of the independence (stronger assumption than uncorrelated). And since the interior term are identical in distribution, their variance, a property of theirs, would match regardless how many times it is sampled. Hence, we can reduce the sum of variance into a single variance times \(N\).

We can do the same with \(\langle A \rangle_\text{i.s.}\) and get the following, although with a caveat of different sampling method. \[ \frac{1}{c^2N} \mathbb{V}_{Z_1\sim p_\text{i.s.}}[ f(Z_1)] \]

Now, since we want variance reduction, we can do the following. We hope it is a tautology (i.e., verifies it always reduces variance) but it could just be reduced to a conditional also, which would help with the analysis regardless.

\[ \begin{align*} \frac{1}{c^2N} \mathbb{V}_{Z_1\sim p_\text{i.s.}}[ f(Z_1)] < \frac{1}{N} \mathbb{V}_{Y_1\sim p_\text{naive}}[ f(Y_1)g(Y_1)] \\ \implies \frac{1}{c^2} \mathbb{E}_{Z_1\sim p_\text{i.s.}}[ f(Z_1)^2]- \frac{1}{c^2}\mathbb{E}_{Z_1\sim p_\text{i.s.}}[ f(Z_1)]^2 < \mathbb{E}_{Y_1\sim p_\text{naive}}[ (f(Y_1)g(Y_1))^2] - \mathbb{E}_{Y_1\sim p_\text{naive}}[ f(Y_1)g(Y_1)]^2 \\ \implies \frac{1}{c} \mathbb{E}_{Z_1\sim p_\text{naive}}[ f(Z_1)^2 g(Z_1)]-A^2 < \mathbb{E}_{Y_1\sim p_\text{naive}}[ (f(Y_1)g(Y_1))^2] - A^2 \\ \implies 0 < \mathbb{E}_{Y_1\sim p_\text{naive}}[ (f(Y_1)g(Y_1))^2] - \mathbb{E}_{X_1\sim p_{\text{naive}}a}[g(X_1)] \mathbb{E}_{Z_1\sim p_\text{naive}}[ f(Z_1)^2 g(Z_1)] \\ \end{align*} \]

Although this looks like covariance, there are still some technical differences especially on finite samples. The first term in the final reduced expression needs to have a sampling from a joint distribution between \(f^2g\) vs. \(g\) while having the same individual marginal samples from the joint for the second term after the subtraction. TL;ND, I will leave the justification here without an alternative proof.

\(\Box\)

(The previous iteration of the proof is lowkey bad in hindsight.)

Similiarly, §9.1 of [5] talks about the effectiviness of importance sampling in variance reduction and proves the optimal density to sample from is \(p\propto |fg|\) while we only talk about importance sampling over some of the term (grouped as \(g\)) and finding how much variance is reduced given a combination of \(f\) and \(g\).

[5]

A. B. Owen, Monte carlo theory, methods and examples. https://artowen.su.domains/mc/, 2013.

¹⁹ If \(|\mathcal{D}|\neq1\), this expectation will take care of it.

¹⁸ Negative density (i.e., negative probabilities) creates a lot of problems.

Tangent: Unbiased estimation after differentiation

We are given the following expression that is deceptively simple.

\[ f(\langle I(\theta) \rangle) \]

We have a deterministic quadratic function \(f\) (e.g., objective function) that takes an M.C. estimation of an integral \(I(\theta)=\int h_\theta(x)dx\) (e.g., the rendering equation, expectation, etc.) parameterized by some variable \(\theta\). Remember, M.C. estimation is still an R.V. itself. This is recurring theme from a lot of places from differentiable rendering and reinforcement learning to finance—especially when we want to optimize the parameter \(\theta\).

Currently, optimization is done through differentiation. So, let’s try to differentiate it. Despite the MC estimate is considered a RV, it is linearly composed of \(N\) i.i.d. RV’s evaluated on the integrand \(N\) times. Since differentiation is linear, it goes right pass through (assuming the domain of the integral does not depend on \(\theta\)).

\[ \begin{align*} \frac{d}{d\theta}f(\langle I(\theta) \rangle) &= f'(\langle I(\theta) \rangle) \frac{d}{d\theta}\langle I(\theta) \rangle \\ &= f'(\langle I(\theta) \rangle) \langle I'(\theta) \rangle \\ \end{align*} \]

The chain rule of differentiation created two MC estimate (symbolically) which are basically two RVs joined by arithmetic. In some domain (cough differentiable rendering), it is deceptively intuitive to compute the derivative \(I'\) alongside the original \(I\) (i.e., reusing path samples or just samples). But, is reusing samples allowed?

Key insight: MC estimate is still a RV; hence, the entire expression is a big RV. Let’s call that RV \(L\).

\[ L = f'(\langle I(\theta) \rangle) \langle I'(\theta) \rangle \]

If you think about, many RV has expectation and variance including this. Why not compare the following two equations to check if they’re unbiased? Let us assume \(f'\) is linear (\(f\) is quadratic).

Uncorrelated sampling with \(Y,Z \sim X\) where \(Y\) and \(Z\) are i.i.d.: \[ f'(\langle I(\theta) \rangle_Y) \langle I'(\theta) \rangle_Z \]

Correlated sampling with \(Y\sim X\): \[ f'(\langle I(\theta) \rangle_Y) \langle I'(\theta) \rangle_Y \]

Let us start with uncorrelated sampling:

\[ \begin{align*} \mathbb{E}_{Y,Z\sim X}[f'(\langle I(\theta) \rangle_Y) \langle I'(\theta) \rangle_Z] &= \mathbb{E}_{Y_i,Z_i\sim X}\bigg[f'\bigg(\frac{1}{N}\sum_{i=1}^{N} \frac{h_\theta(Y_i)}{p(Y_i)}\bigg) \frac{1}{M}\sum_{i=1}^{M} \frac{h'_\theta(Z_i)}{p(Z_i)} \bigg] \\ &= f'\bigg( \mathbb{E}_{Y_i\sim X}\bigg[\frac{1}{N}\sum_{i=1}^{N} \frac{h_\theta(Y_i)}{p(Y_i)} \bigg] \bigg) \mathbb{E}_{Z_i\sim X}\bigg[ \frac{1}{M}\sum_{i=1}^{M} \frac{h'_\theta(Z_i)}{p(Z_i)} \bigg] \\ &= f'( I(\theta)) I'(\theta) \\ \end{align*} \]

Uncorrelated sampling is an unbiased estimator of the gradient.

What about correlated sampling? (Assuming \(N=M\).)

\[ \begin{align*} \mathbb{E}_{Y\sim X}[f'(\langle I(\theta) \rangle_Y) \langle I'(\theta) \rangle_Y] &= \mathbb{E}_{Y_i \sim X}\bigg[f'\bigg(\frac{1}{N}\sum_{i=1}^{N} \frac{h_\theta(Y_i)}{p(Y_i)}\bigg) \frac{1}{M}\sum_{i=1}^{M} \frac{h'_\theta(Y_i)}{p(Y_i)} \bigg] \\ &= f'\bigg( \mathbb{E}_{Y_i\sim X}\bigg[\frac{1}{N}\sum_{i=1}^{N} \frac{h_\theta(Y_i)}{p(Y_i)} \bigg] \bigg) \mathbb{E}_{Y_i\sim X}\bigg[ \frac{1}{M}\sum_{i=1}^{M} \frac{h'_\theta(Y_i)}{p(Y_i)} \bigg] + \text{Cov}_{(Y_i, Y_i)\sim X}\bigg(\frac{1}{N}\sum_{i=1}^{N} \frac{h_\theta(Y_i)}{p(Y_i)}, \frac{1}{M}\sum_{i=1}^{M} \frac{h'_\theta(Y_i)}{p(Y_i)}\bigg) \\ &= f'( I(\theta)) I'(\theta) + \text{Cov}(\langle I(\theta) \rangle_Y, \langle I'(\theta) \rangle_Y ) \\ \end{align*} \]

As you can clearly see, correlated sampling has an additional covariance term \(\text{Cov}(\langle I(\theta) \rangle_Y, \langle I'(\theta) \rangle_Y )\) which can potentially make correlated sampling biased. But I said can potentially, not must. Making covariance \(0\) is a rather lax requirment (compared to independence). We just need to find \(I(\theta)\) such that \(I(\theta)\) and \(I'(\theta)\) are non-linear. While this seems achievable (?), in differentiable rendering these expressions can be complicated really quickly. While it is reasonable to think a more complicated an expression is the more likely it’s derivative will be non-linear w.r.t. to itself, it is still biased nonetheless in the strictest sense. Disclaimer: no proof has been done nor will be done on how much bias exists. Will leave that as an excercise for the reader /j.

⁵ Forgot Box-Muller dimensionally works only in 2. The exercises in [4, Secs. 5–4] are interesting and adjacently concerns the issue we’re having with 3D sampling of Gaussian. There is likely an elegant method out there, but due to limited time, we’ll just implement 4D Box-Muller and extract the first 3 dimensions.

Triangle

Suppose we are given \(\mathbf{p},\mathbf{q},\mathbf{r}\in \mathbb{R}^3\) to describe triangle \(\triangle \mathbf{p}\mathbf{q}\mathbf{r}\). Since triangle is 2D, we will assume we are also given \(U_1,U_2\). How do we sample uniformly on \(\triangle\)?

Remember the triangle area formula \(\frac{1}{2}bh\)? It is based off of calculating half of the area of a parallelogram (which can always be made of two congruent triangles). Can we do something with it? (Yes, it is a bit like rejection sampling.)

Suppose \(\overline{\mathbf{p}\mathbf{q}}\) and \(\overline{\mathbf{p}\mathbf{r}}\) constitutes the two edge vectors of the parallelogram. We can let \(\mathbf{X}=U_1\overline{\mathbf{p}\mathbf{q}} + U_2\overline{\mathbf{p}\mathbf{r}}\) to uniformly sample over the parallelogram. We also have a third edge \(\overline{\mathbf{q}\mathbf{r}}\). If both \(\overline{\mathbf{q}\mathbf{r}} \times \overline{\mathbf{q}\mathbf{p}}\) and \(\overline{\mathbf{q}\mathbf{r}} \times \overline{\mathbf{q}\mathbf{X}}\) have the same sign, it is in the correct triangle. If their sign does not match, that means their on the wrong side/triangle. We can flip the random point \(\mathbf{X}\) over via \(\mathbf{X'}=(1-U_1)\overline{\mathbf{p}\mathbf{q}} + (1-U_2)\overline{\mathbf{p}\mathbf{r}}\). Think of the transformation is flipping over \(y=-x\) line and shifting up and right by \(1\) unit putting the points back to the same triangle (w.r.t. to the parallelogram frame). The pdf will theoretically be the area of the \(\triangle\). Notably, the \(h\) can be calculated by projection one side to the base and finding the distance from one end of the projected vector to the other point.

Hold up!

Triangles has this nifty feature called barycentric coordinates. Can we use it to make better animation to preserve point locality to emphasize the concept of transforming density? We start off knowing it can be natively parameterized/described by two parameters: \(\mathbf{X}=U_1\mathbf{p}+U_2(1-U_1)\mathbf{q}+(1-U_1-U_2(1-U_1))\mathbf{r}\)²⁰. The additional arithmetic within the weights ensures they all sum to \(1\). Yet, is \(X\) uniform on the triangle? To check for that we have to find its local distortion via Jacobian determinant²¹.

²⁰ A specific case of convex combinations of points forming the convex hull where we have three points instead of arbitrary amount of points.

²¹ Jacobian for \(\mathbb{R}^2 \to \mathbb{R}^3\) can be found by \(||\partial_{u}\mathbf{X}(u,v) \times \partial_{v}\mathbf{X}(u,v)||\), which is a particular case of Gram determinant that allows determinants to remain meaningful in an assigned lower dimension \(\mathbb{R}^k\) where \(k<n\) embedded within \(\mathbb{R}^n\).

Computing the Jacobian

Suppose \(u:=U_1,v:=U_2\) and \(A\) is the area of the triangle.

\[ \begin{align*} J &= ||\partial_{u}\mathbf{X}(u,v) \times \partial_{v}\mathbf{X}(u,v)|| \\ &= ||(\mathbf{p}-v\mathbf{q}-(1-v)\mathbf{r}) \times ((1-u)\mathbf{q}-(1-u)\mathbf{r})|| \\ &= (1-u)||[(\mathbf{p}-\mathbf{r}+v(\mathbf{r}-\mathbf{q}))\times (\mathbf{q}-\mathbf{r})]|| \\ &= (1-u)||[(\mathbf{p}-\mathbf{r})\times(\mathbf{q}-\mathbf{r})]|| \\ &= (1-u)2A \\ \end{align*} \]

Jacobian is \((1-U_1)2A\), which means not only is it not normalized to the area of the sampling (\(A\)), but it also varies over \(U_1\), which can be seen in the following visualizations. Obtaining \(A\) for Jacobian, we suppose our old \(U_1\) is redefined as \(f(\tilde{U}_1)\) for some arbitrary transformation \(f\). If we re-differentiate the Jacobian by \(\tilde{U}_1\), we get \((1-f(\tilde{U}_1))2Af'(\tilde{U}_1)\), which we want it to be simplified as \(A\)²².

²² This is still different from transformation of RV, which is given a function \(g\), how to properly transform the RV while maintaing conservation of probability mass (maintaining normalization).

Finding \(f\)

More explicitly, \(u:=f(w)\big|_{w=\tilde{U}_1}\) and \(u':=f'(w)\big|_{w=\tilde{U}_1}\).

\[ \begin{align*} (1-f(\tilde{U}_1))2Af'(\tilde{U}_1)=A &\implies (1-u)2u'=1 \\ &\implies (1-u)2du=dw \\ &\implies \int(1-u)2du=\int dw \\ &\implies -u^2+2u-w+C=0 \\ &\implies u = 1\pm\sqrt{1+C-w} \\ \end{align*} \]

Since \(u\in [0,1]\) and \(w\in [0,1)\), then it must be \(u=1-\sqrt{1-w}\) or \(U_1:=1-\sqrt{1-\tilde{U}_1}\).

This can also be directly related to inverse CDF but TL;ND.

Once we find the right transformation \(f\) to get the proper Jacobian, we get the following corrected triangle sampling: \[ \mathbf{X}=(1-\sqrt{1-\tilde{U}_1})\mathbf{p}+U_2(\sqrt{1-\tilde{U}_1})\mathbf{q}+(\sqrt{1-\tilde{U}_1})(1-U_2)\mathbf{r} \]

Notice anything between the transformation of p4 and p2? Despite both being proper uniform sampling of the triangle and identical in distribution, their realized samples are still different. Animation does seem to reveal a lot of the inner mechanism of various methods of sampling concisely.

Trowbridge-Reitz (GGX)

This is a distribution that is commonly found in microfacet model for BRDF where many of these mathematical models are empirically²³ formulated from real lighting data. Also known as normal distribution functions (NDFs)²⁴, these has been integral to modelling PBR in graphics since the beginning—practically. Specifically, they model the distribution of the direction each microfacets are facing; hence, they indirectly model the specularity of a surface (i.e.., how shiny the surface is). This is why NDFs (and microfacets in general) are widely used since the shineness (specularity) is non-negotiable for PBR (they’re like everywhere in real life). Which only emphasizes the importance in importance sampling (haha get it?). While other parts in the microfacet are equally important for efficient sampling overall, for the sake of the blog, we will only focus on sampling NDF.

²³ Since GGX has been a very good approximation, most work in rendering has been orthogonal to finding a better BSDF model and is mostly assumed. Most of the time it is finding better sampling methods or as a basis for inverse rendering.

²⁴ Normal as in perpendicularity in geometry such as the surface of a triangle, not the normal distribution.

[6]

M. Pharr, W. Jakob, and G. Humphreys, Physically based rendering, fourth edition: From theory to implementation. MIT Press, 2023. Available: https://books.google.ad/books?id=kUtwEAAAQBAJ

²⁵ NDF does not tell us where the most of the specular reflection happens absolutely, especially in the context of the half-way/middle direction \(\mathbf{m}\) and microfacet models for a given incident and outgoing direction. NDF just tells us how fast the reflection should change from perfect-mirror to diffuse as the half-way direction moves away from the surface normal.

Throughout the years, there has been many NDF (and proto-NDF) proposals from (chronologically) flat shading, Gouraud shading (think Roblox), Phong shading, Blinn-Phong shading to Beckmann and GGX shading. We will focus on anisotropic GGX paramaterized in spherical coordinates from [6], Equation 9.16, which has the distribution of the following²⁵:

\[ D(\mathbf{m}) = D(\theta_m,\phi_m) = \frac{1}{\pi \alpha_x \alpha_y \cos^4 \theta_m \bigg(1 + \tan^2 \theta_m \bigg(\frac{\cos^2 \phi_m}{\alpha_x^2}+\frac{\sin^2 \phi_m}{\alpha^2_y}\bigg)\bigg)^2} \]

Remember, \(\mathbf{m}\) is the halfway vector between the incident and outgoing direction of radiance. But for our purpose, we can treat sampling \(\mathbf{m}\) as over all of the hemisphere. So, how can we move the NDF in the integrand into the sampling (inducing a PDF weight of the same NDF itself)? So far, inverse CDF has been treating us well, and so we can try it here.

There’s a lot of brute-force trig involved which I defer the derivation to another blog written by “A Graphics Guy’s Note”: blog link. (I tried (╥‸╥) TL;ND)

The final sampling method:

\[ \begin{align*} \phi_m &= \arctan(\frac{\alpha_y}{\alpha_x}\tan(2\pi U_1)) + \begin{cases} 0 & U_1 \in [0,\frac{1}{4}] \\ \pi & U_1 \in (\frac{1}{4},\frac{3}{4}) \\ 2\pi & U_1 \in [\frac{3}{4},1] \\ \end{cases} \\ \theta_m &= \arctan{\sqrt{\frac{U_2}{(1-U_2)(\frac{\cos^2(\phi_m)}{\alpha^2_x}+\frac{\sin^2(\phi_m)}{\alpha^2_y})}}} \\ \end{align*} \]

Ensure that \(\frac{1}{\alpha_x},\frac{1}{\alpha_y} > 0\) where each of them represents the roughness per-axis. Higher roughness means smoother NDFs (i.e., more diffuse scattering). Play around the roughness values!

Future Work

TL;ND

While it started with ambition to find many interesting ways to sample and understand what density in probability means, the initial naivety on the basics of probability bogged down majority of the time. There’s still much to explore from Henyey-Greenstein and Spherical Gaussians to uniform sampling on weird geometric shapes (if it’s even possible), including sampling 3D gaussians from inverse CDF for locality-preserving transformation. There’s also probability integral transform, Rosenblatt transform, and measure transport with connections to generative modelling. I think it’s mentally quite dangerous to continue being gun-ho on something that has lost the initial novelty (especially when it gets repetitive), so I will leave it here for now.

I hope those programmatic animations were insightful and fun!

For future blog:

There’s practically infinite amount of “basics” to write about. Future blog will likely go over necessary basics that are still interesting (for myself) with an anchor topic on something recent and conceptually interesting
If a function can be numerically evaluated in a reasonable constant-bounded time (usually directly exposed as a library function), it is closed to me! (e.g., \(\text{erf}\) is a closed-form solution because it can be approximated by Taylor series in finite amount of time because of limited precision in practically every computers)
Very wordy especially on the side notes out of fear of lack of completeness. Future blog will have significantly reduced words and more emphasis on visuals while keeping the side-notes reasonably small (as intended)