# Representing points in a set in positional-notation fashion (a trick by Bourgain): part II

This is the second and final part of an entry dedicated to a very interesting and inventive trick due to Bourgain. In part I we saw a lemma on maximal Fourier projections due to Bourgain, together with the context it arises from (the study of pointwise ergodic theorems for polynomial sequences); we also saw a baby version of the idea to come, that we used to prove the Rademacher-Menshov theorem (recall that the idea was to represent the indices in the supremum in their binary positional notation form and to rearrange the supremum accordingly). Today we finally get to see Bourgain’s trick.

Before we start, recall the statement of Bourgain’s lemma:

Lemma 1 [Bourgain]: Let $K$ be an integer and let $\Lambda = \{\lambda_1, \ldots, \lambda_K \}$ a set of ${K}$ distinct frequencies. Define the maximal frequency projections

$\displaystyle \mathcal{B}_\Lambda f(x) := \sup_{j} \Big|\sum_{k=1}^{K} (\mathbf{1}_{[\lambda_k - 2^{-j}, \lambda_k + 2^{-j}]} \widehat{f})^{\vee}\Big|,$

where the supremum is restricted to those ${j \geq j_0}$ with $j_0 = j_0(\Lambda)$ being the smallest integer such that $2^{-j_0} \leq \frac{1}{2}\min \{ |\lambda_k - \lambda_{k'}| : 1\leq k\neq k'\leq K \}$.
Then

$\displaystyle \|\mathcal{B}_\Lambda f\|_{L^2} \lesssim (\log \#\Lambda)^2 \|f\|_{L^2}.$

Here we are using the notation $(\mathbf{1}_{[\lambda_k - 2^{-j}, \lambda_k + 2^{-j}]} \widehat{f})^{\vee}$ in the statement in place of the expanded formula $\int_{|\xi - \lambda_k| < 2^{-j}} \widehat{f}(\xi) e^{2\pi i \xi x} d\xi$. Observe that by the definition of $j_0$ we have that the intervals $[\lambda_k - 2^{-j_0}, \lambda_k + 2^{-j_0}]$ are disjoint (and $j_0$ is precisely maximal with respect to this condition).
We will need to do some reductions before we can get to the point where the trick makes its appearance. These reductions are the subject of the next section.

3. Initial reductions

A first important reduction is that we can safely replace the characteristic functions $\mathbf{1}_{[\lambda_k - 2^{-j}, \lambda_k + 2^{-j}]}$ by smooth bump functions with comparable support. Indeed, this is the result of a very standard square-function argument which was already essentially presented in Exercise 22 of the 3rd post on basic Littlewood-Paley theory. Briefly then, let $\varphi$ be a Schwartz function such that $\widehat{\varphi}$ is a smooth bump function compactly supported in the interval $[-1,1]$ and such that $\widehat{\varphi} \equiv 1$ on the interval $[-1/2, 1/2]$. Let $\varphi_j (x) := \frac{1}{2^j} \varphi \Big(\frac{x}{2^j}\Big)$ (so that $\widehat{\varphi_j}(\xi) = \widehat{\varphi}(2^j \xi)$) and let for convenience $\theta_j$ denote the difference $\theta_j := \mathbf{1}_{[-2^{-j}, 2^{-j}]} - \widehat{\varphi_j}$. We have that the difference

$\displaystyle \sup_{j\geq j_0(\Lambda)} \Big|\sum_{k=1}^{K} ((\mathbf{1}_{[\lambda_k - 2^{-j}, \lambda_k + 2^{-j}]} - \widehat{\varphi_j}(\cdot - \lambda_k)) \widehat{f})^{\vee}\Big|$

is an $L^2$ bounded operator with norm $O(1)$ (that is, independent of $K$). Indeed, observe that $\mathbf{1}_{[\lambda_k - 2^{-j}, \lambda_k + 2^{-j}]}(\xi) - \widehat{\varphi_j}(\xi - \lambda_k) = \theta_j (\xi - \lambda_k)$, and bounding the supremum by the $\ell^2$ sum we have that the $L^2$ norm (squared) of the operator above is bounded by

$\displaystyle \sum_{j \geq j_0(\Lambda)} \Big\|\sum_{k=1}^{K} (\theta_j(\cdot - \lambda_k)\widehat{f})^{\vee}\Big\|_{L^2}^2,$

where the summation in ${j}$ is restricted in the same way as the supremum is in the lemma (that is, the intervals $[\lambda_k - 2^{-j}, \lambda_k + 2^{-j}]$ must be pairwise disjoint). By an application of Plancherel we see that the above is equal to

$\displaystyle \sum_{k=1}^{K} \Big\| \widehat{f}(\xi) \Big[\sum_{j \geq j_0} \theta_j(\xi - \lambda_k) \Big]\Big\|_{L^2}^2;$

but notice that the functions $\theta_j$ have supports disjoint in ${j}$, and therefore the multiplier satisfies $\sum_{j\geq j_0} \theta_j(\xi - \lambda_k) \lesssim 1$ in a neighbourhood of $\lambda_k$, and vanishes outside such neighbourhood. A final application of Plancherel allows us to conclude that the above is bounded by $\lesssim \|f\|_{L^2}^2$ by orthogonality (these neighbourhoods being all disjoint as well).
By triangle inequality, we see therefore that in order to prove Lemma 1 it suffices to prove that the operator

$\displaystyle \sup_{j} \Big|\sum_{k=1}^{K} (\widehat{\varphi_j}(\cdot - \lambda_k) \widehat{f})^{\vee}\Big|$

is $L^2$ bounded with norm at most $O((\log \#\Lambda)^2)$.

The next reduction consists of the observation that, since the frequency intervals appearing in the above operator (as supports of the $\widehat{\varphi_j}(\cdot - \lambda_k)$ functions) are all disjoint, it will suffice to prove a vector-valued inequality for the vector-valued functions $(f_k)_{k=1,\ldots, K}$ (taking values in the finite-dimensional vector space $\mathbb{C}^K$). Indeed, observe that by the standard properties of the Fourier transform we have

$\displaystyle (\widehat{\varphi_j}(\cdot - \lambda_k) \widehat{f})^{\vee}(x) = e^{2\pi i \lambda_k x} (\varphi_j \ast (e^{-2\pi i \lambda_k \cdot}f))(x),$

and by the disjointness of the intervals $[\lambda_k - 2^{-j_0}, \lambda_k + 2^{-j_0}]$ we can replace ${f}$ in the above expression by its frequency projection $\tilde{f}_k$ onto the interval $[\lambda_k - 2^{-j_0}, \lambda_k + 2^{-j_0}]$. Since by orthogonality and Plancherel $\Big(\sum_{k=1}^{K} \|\tilde{f}_k\|_{L^2}^2 \Big)^{1/2} = \| \sum_{k=1}^{K} \tilde{f}_k \|_{L^2} \leq \|f\|_{L^2}$, it will suffice to prove1

$\displaystyle \Big\| \sup_{j \geq j_0} \Big|\sum_{k=1}^{K} e^{2\pi i \lambda_k x} (\varphi_j \ast f_k)(x)\Big| \Big\|_{L^2} \lesssim (\log\#\Lambda)^2 \Big(\sum_{k=1}^{K} \|f_k\|_{L^2}^2 \Big)^{1/2}$

for generic $\mathbb{C}^K$-valued functions $(f_k)_{k=1,\ldots, K}$.

At this point we worsen the operator a bit by replacing the discrete supremum with a continuous one; this looks bad in general, but in this case it is harmless and will actually be convenient in the following. So, we let $\tau_0 = \tau_0(\Lambda) = 2^{-j_0(\Lambda)}$, so that $|\lambda_k - \lambda_{k'}| \geq 2\tau_0$, and let $\varphi_t(x) := \frac{1}{t} \varphi\Big(\frac{x}{t}\Big)$ in analogy with the previous notation. If we replace $2^{j}$ by the new continuous parameter ${t}$ we see that it suffices to prove

$\displaystyle \Big\| \sup_{t > \tau_0^{-1}} \Big|\sum_{k=1}^{K} e^{2\pi i \lambda_k x} (\varphi_t \ast f_k)(x)\Big| \Big\|_{L^2} \lesssim (\log\#\Lambda)^2 \Big(\sum_{k=1}^{K} \|f_k\|_{L^2}^2 \Big)^{1/2}. \ \ \ \ \ (1)$

We are now almost ready, we just need one final reduction. This is achieved through another trick that is ubiquitous in Bourgain’s work and which consists of introducing some parameter in the particular inequality we are trying to prove which does not alter the nature of the inequality itself, and then averaging in this parameter, thus “massaging” the expression into a more manageable one – very roughly speaking. This is another one of Bourgain’s crazy tricks from his magical top hat: while it is nothing too fancy, it is always somewhat unexpected (to me). The averaging trick would deserve another post or series of posts by itself, but I don’t think I would be up to the task. We will see the trick again in the future, when I will talk about Bourgain’s work in the optimal asymptotic behaviour of the constants in the Littlewood-Paley inequalities.
Let’s see this final reduction. The starting point is the observation that since we have the restriction $t > \tau_0^{-1}$ we have for any such ${t}$ that $\mathrm{Supp}(\widehat{\varphi_t}) \subset [-\tau_0, \tau_0]$, and for this reason we might as well assume that all the functions $f_k$ are frequency supported in the same $[-\tau_0, \tau_0]$ interval, without loss of generality. This seemingly innocuous assumption has an interesting consequence though: if we let $\sigma_u$ denote the translation operator $\sigma_u f(x) := f(x+u)$, then we can translate any $f_k$ by some amount $\ll \tau_0^{-1}$ while not altering the distribution of the $L^2$ mass too much. More precisely, if $|u| \leq \frac{\tau_0^{-1}}{20}$ we have that

$\displaystyle \|f_k - \sigma_u f_k \|_{L^2} < \frac{1}{2} \|f_k\|_{L^2}; \ \ \ \ \ (2)$

indeed, by Plancherel the LHS is $\|\widehat{f_k}(\xi)(1 - e^{2\pi i u \xi}) \|_{L^2}$ and by the assumption that $\mathrm{Supp}(\widehat{f_k}) \subset [-\tau_0, \tau_0]$ we have that $|1 - e^{2\pi i u \xi}| \leq 2\pi|u||\xi|\leq 2\pi |u||\tau_0| < 1/2$ if $|u| \leq \tau_0^{-1}/20$.
Now we take this observation and combine it with the bootstrapping-inequality trick that we already saw in the previous post, in Bourgain’s proof of the Rademacher-Menshov theorem, to argue that in the inequality that we have to prove we can safely replace $f_k$ with its translate $\sigma_u f_k$: let $B$ denote the best constant for which inequality (1) holds (this constant is certainly finite, since for example $B=O(\#\Lambda)$); by triangle inequality

\displaystyle \begin{aligned} \Big\| \sup_{t > \tau_0^{-1}} \Big|\sum_{k=1}^{K} e^{2\pi i \lambda_k x} (\varphi_t \ast f_k)(x)\Big| \Big\|_{L^2} \leq & \Big\| \sup_{t > \tau_0^{-1}} \Big|\sum_{k=1}^{K} e^{2\pi i \lambda_k x} (\varphi_t \ast \sigma_u f_k)(x)\Big| \Big\|_{L^2} \\ & + \Big\| \sup_{t > \tau_0^{-1}} \Big|\sum_{k=1}^{K} e^{2\pi i \lambda_k x} (\varphi_t \ast (f_k - \sigma_u f_k))(x)\Big| \Big\|_{L^2}, \end{aligned}

and applying (1) with the best constant $B$ to the second term in the RHS we can bound it by $\frac{B}{2} \Big(\sum_{k=1}^{K} \|f_k\|_{L^2}^2 \Big)^{1/2}$, by (2). With this bound, observe that now there is a single remaining parameter ${u}$ in the RHS above, and we can therefore average in it! Since translations and convolutions commute with each other, we have $\varphi_t \ast \sigma_u f_k = \sigma_u(\varphi_t \ast f_k)$, and with a change of variables ($x \to x-u$) we have for the $L^2$ norm

$\displaystyle \Big\| \sup_{t > \tau_0^{-1}} \Big|\sum_{k=1}^{K} e^{2\pi i \lambda_k x} \sigma_u(\varphi_t \ast f_k)(x)\Big| \Big\|_{L^2}^2 = \int \sup_{t > \tau_0^{-1}} \Big|\sum_{k=1}^{K} e^{-2\pi i \lambda_k u} e^{2\pi i \lambda_k x} (\varphi_t \ast f_k)(x)\Big|^2 dx.$

Now we average this expression in ${u}$, over the interval $[-\tau_0^{-1}/20, \tau_0^{-1}/20]$, obtaining by Fubini

$\displaystyle \int_{\mathbb{R}} 20\tau_0 \int_{0}^{\tau_0^{-1}/20}\sup_{t > \tau_0^{-1}} \Big|\sum_{k=1}^{K} e^{-2\pi i \lambda_k u}\big[ e^{2\pi i \lambda_k x} (\varphi_t \ast f_k)(x) \big]\Big|^2 \,du \,dx.$

Remark: Notice that the inner expression, when ${x}$ is fixed, looks almost like a Fourier series in ${u}$ with variable coefficients, and this will be important later. In Lacey’s case, which we mentioned in the previous post, when $\Lambda \subset \delta \mathbb{Z}$ for some $\delta >0$, the expression above is an actual Fourier series, and this helps simplify the treatment of the operator a lot.

By the bootstrapping trick and the above discussion, it will therefore suffice to prove that

\displaystyle \begin{aligned} \Big\| \tau_0^{1/2} \Big\|\sup_{t > 0} \big|\sum_{k=1}^{K} e^{-2\pi i \lambda_k u} & \big[ e^{2\pi i \lambda_k x} (\varphi_t \ast f_k)(x) \big] \big| \Big\|_{L^2([0,\tau_0^{-1}/20],du)} \Big\|_{L^2(\mathbb{R},dx)} \\ & \lesssim (\log\#\Lambda)^2 \Big(\sum_{k=1}^{K} \|f_k\|_{L^2}^2 \Big)^{1/2} \ \ \ \ \ (3) \end{aligned}

to conclude that $B \lesssim (\log\#\Lambda)^2$, as desired. Notice that we have even taken the liberty of extending the supremum to the full $t>0$ range at this point – the restriction is not meaningful anymore now, as the parameter ${\tau_0}$ has been incorporated in the inner $L^2([0,\tau_0^{-1}/20],du)$ norm. We have thus completed the series of preliminary reductions.

4. Bourgain’s trick

That was quite the tour, but we are finally ready to discuss the trick that gives title to these posts. So, we are trying to prove inequality (3) above, whose LHS involves two distinct $L^2$ norms. As remarked above, the object that we are evaluating looks a bit like a Fourier series in the variable ${u}$ with variable coefficients, parametrised in ${x}$ (and ${t}$). In particular, the set of coefficients form a vector in the ${K}$-dimensional space $\mathbb{C}^K$, more precisely (ignoring the modulation) the vector $(\varphi_t \ast f_k(x))_{k=1,\ldots, K}$. As ${t}$ ranges over ${t>0}$ this vector will describe some sort of trajectory in $\mathbb{C}^K$ space; we denote the resulting set of points by $A_x$, namely

$\displaystyle A_x := \{ (\varphi_t \ast f_k(x))_{k=1,\ldots,K} \,:\, t >0 \}.$

Now, when we take the supremum in ${t}$ (or we linearise the supremum, equivalently), roughly speaking, there will be for any ${x}$ a point in $A_x$ that realises the supremum2. We would like to proceed in a manner similar to the one we used in the proof(s) of the Rademacher-Menshov theorem, that is, we would like to represent the indices in which we are taking the supremum in a positional-notation fashion, so that we can partition them and obtain an at most logarithmic loss (in some parameter) in the constant. But here we run into a problem: in the Rademacher-Menshov theorem the indices were integers and we could naturally express them in binary positional-notation, while now the index is a point in the set $A_x$, a set which not only is not discrete but that might also have an arbitrarily irregular structure. It would seem like the positional-notation trick cannot be carried over to this situation – that is, unless you are Bourgain.
Bourgain argues that there is indeed a way to represent points of $A_x$ in that fashion, through the use of particular coverings by balls. More precisely, for any ${j \in \mathbb{Z}}$, there is a covering of $A_x$ by balls of radius $2^j$ and in particular there is a covering with a minimal number of such balls (that a finite number of balls suffices to cover the set is clear: indeed, $\varphi_t \ast f_k (x) \lesssim Mf_k (x)$ and $Mf_k \in L^2$, and therefore it is finite almost everywhere). This minimal number is an important quantity, and is known as the covering number or entropy number of the set $A_x$ for the given radius. More precisely, if ${r}$ is a positive real and ${E}$ a set, we denote by $M_r(E)$ the minimal number of balls of radius ${r}$ that are needed to cover the set ${E}$, and we call $M_r(E)$ the ${r}$-entropy number of ${E}$.

Actually, for later convenience, we make a small modification to the definition of $M_r(E)$: if ${r \geq \mathrm{diam}(E)}$, we stipulate that $M_r(E) = 0$. Indeed, when considering coverings of sets by dyadic balls, we can always stop once we reach a radius so large that one ball is sufficient to cover the whole set.

If we consider all minimal coverings of $A_x$ for radii $2^j, \; j \in \mathbb{Z}$ (minimal in the sense that each consists of $M_{2^j}(A_x)$ balls), then we see that to specify a point in $A_x$ we can equivalently specify a sequence of balls of decreasing (dyadic) radius in which the point is contained! This is one of those cases where a picture really is worth a thousand words, so here it is:

The squiggly line represents the set $A_x$, as indicated. The blue balls have a certain radius ${r}$, the red balls have radius ${r/2}$ and the green ones have radius ${r/4}$. I have stopped there but of course one would continue by covering $A_x$ with balls of radius $r/8, r/16, \ldots$ and so on. You can see that for each colour, the balls of that colour cover the set $A_x$, and that the covering is “optimal” (in the sense that with any smaller number of balls you could not cover the set). On the set $A_x$ you can see that I have highlighted the point ${a}$ – to identify that point, I can specify one blue ball, one red ball,one green ball, and so on. Also, notice that ${a}$ is contained in two green balls: indeed, the sequence of balls that identify any given point is not unique! However, ambiguities of this sort resolve once the decreasing sequence of balls gets sufficiently long, so to speak; and the non-uniqueness is not really an issue: we only care about the fact that we have at least one way to represent the point as the limit of a sequence of balls. You can start to see that, if we have a way to control the entropy numbers, then we can think about pulling the same trick we used to prove the Rademacher-Menshov theorem. Indeed, this is what the phenomenal idea above will allow us to do – we will reduce to study inequalities for the entropy numbers (see next section).

However, we need to be a little more rigorous in order to proceed. The exact claim is the following:

Positional notation for points in $A_x$: There exists a sequence $(B_j)_{j \in \mathbb{Z}}$ of (finite) subsets of the Minkowski difference $A - A := \{a - a' \; : \; a,a' \in A_x \}$ such that

1. for each vector $b \in B_j$ we have for its length $\|b\| \lesssim 2^j$;
2. for each $j \in \mathbb{Z}$, we have $\# B_j \leq M_{2^j}(A_x)$;
3. for every $a \in A_x$, there exists a sequence of vectors $(b^{(j)})_{j \in \mathbb{Z}}$ such that $b^{(j)} \in B_j$ and $a = a_0 + \sum_{j \in \mathbb{Z}} b^{(j)}$, where $a_0$ is a fixed element of $A_x$.

This is not hard to prove. Say that for each $j \in \mathbb{Z}$ we fix an optimal covering $\mathcal{C}_j$ of $A_x$, that is $\# \mathcal{C}_j = M_{2^j}(A_x)$. Now, impose on the collections $\mathcal{C}_j$ a tree structure: that is, each ball $B \in \mathcal{C}_j$ will have a unique parent $\widehat{B} \in \mathcal{C}_{j+1}$, which is a ball with the property that $B \cap \widehat{B} \neq \emptyset$. There are clearly many different tree structures that we can give to this collection, but any will do.
For each ball $B \in \mathcal{C}_j$ there is a point $a_B \in B \cap A_x$ (else the ball wouldn’t be in the collection); this gives rise to a collection of elements of $A_x$ that we might denote by $\mathcal{A}_j := \{a_B \; : \; B \in \mathcal{C}_j\}$. Notice at some point that for any sufficiently large ${j}$ we will have $M_{2^j}(A_x) = 0$ (by our modified definition), and therefore we can actually truncate our sums to the smallest such ${j = j_\ast}$. The union of these collections $\mathcal{A}_j$ clearly inherits the tree structure that we imposed on $\bigcup_j \mathcal{C}_j$. Define the the collections $B_j$ above as

$\displaystyle B_j := \{a_B - a_{\widehat{B}} \; : \; B \in \mathcal{C}_j \},$

with the modification that $B_{j_\ast}$ is given by the elements $a_B - a_0$ for $B \in B_{j_\ast}$ and $a_0$ is an element of $A_x$ that we fix arbitrarily. Then we claim that these collections satisfy the above properties. Indeed, we have that $\# B_j \leq \# \mathcal{C}_j = M_{2^j}(A_x)$; moreover, $\|a_B - a_{\widehat{B}}\| \leq 2(2^j + 2^{j+1}) \lesssim 2^j$. Finally, given a point $a \in A_x$, we can find a sequence of balls $(B^{(j)})_j$ such that $\widehat{B^{(j)}} = B^{(j+1)}$ (that is, the sequence is ordered with respect to the tree structure and maximal) and such that $a \in 10 B^{(j)}$ for any ${j}$ (this is the trickiest point in this argument, but such a collection can be built starting from a sequence of balls that contain ${a}$ without too much effort); it is then clear that $a_{B^{(j)}} \to a$ as $j \to -\infty$, and we can write in telescopic series $a_0 + \sum_{j\geq k}(a_{B^{(j)}} - a_{B^{(j+1)}}) = a_{B^{(k)}}$, which converges to $a$ by construction, and each term in the series belongs to $B_j$ as claimed.

So, now that we have sorted out the claim and we have this beautiful idea of representing points on the sets $A_x$ in a positional-notation fashion using the optimal coverings of the set by dyadic balls, we proceed in a way that is analogous to the one we used in proving the Rademacher-Menshov theorem. Indeed, by the fact that any $a \in A_x$ can be represented as $a = \sum_{j \in \mathbb{Z}} b_j$ with $b_j \in B_j$ (we ignore $a_0$ because it is harmless), we have

$\displaystyle \sup_{t > 0} \big|\sum_{k=1}^{K} e^{-2\pi i \lambda_k u} \big[ e^{2\pi i \lambda_k x} (\varphi_t \ast f_k)(x) \big] \big| \leq \sum_{j \in \mathbb{Z}} \max_{b \in B_j} \big|\sum_{k=1}^{K} e^{-2\pi i \lambda_k u} e^{2\pi i \lambda_k x} b_k \big|,$

where $b_k$ denotes the ${k}$-th component of the vector ${b}$. Then remember that in (3) we have to evaluate the $L^2([0,\tau_0^{-1}/20],du)$ norm of the above expression, and by triangle inequality we reduce therefore to estimate for any ${j}$ the quantity

$\displaystyle \tau_0^{1/2} \Big\| \max_{b \in B_j} \big|\sum_{k=1}^{K} e^{-2\pi i \lambda_k u} e^{2\pi i \lambda_k x} b_k \big| \Big\|_{L^2([0,\tau_0^{-1}/20],du)}.$

Observe that for the inner expression we can give two simple bounds: the first bound is simply the observation that $e^{-2\pi i \lambda_k u} e^{2\pi i \lambda_k x}$ has absolute value equal to 1, and therefore we can bound the expression simply by the length of ${b}$, that is (by Cauchy-Schwarz)

$\displaystyle \max_{b \in B_j} \big|\sum_{k=1}^{K} e^{-2\pi i \lambda_k u} e^{2\pi i \lambda_k x} b_k \big| \lesssim 2^j K^{1/2};$

the second bound is obtained instead by replacing the maximum by the $\ell^2$ sum over $B_j$, that is

$\displaystyle \max_{b \in B_j} \big|\sum_{k=1}^{K} e^{-2\pi i \lambda_k u} e^{2\pi i \lambda_k x} b_k \big| \leq \Big(\sum_{b \in B_j }\big|\sum_{k=1}^{K} e^{-2\pi i \lambda_k u} e^{2\pi i \lambda_k x} b_k \big|^2 \Big)^{1/2}.$

The first bound gives us simply

$\displaystyle \tau_0^{1/2} \Big\| \max_{b \in B_j} \big|\sum_{k=1}^{K} e^{-2\pi i \lambda_k u} e^{2\pi i \lambda_k x} b_k \big| \Big\|_{L^2([0,\tau_0^{-1}/20],du)} \lesssim 2^j K^{1/2},$

because the expression is an $L^2$ average. The second bound gives instead

\displaystyle \begin{aligned} \tau_0^{1/2} \Big\| \max_{b \in B_j} \big|\sum_{k=1}^{K} & e^{-2\pi i \lambda_k u} e^{2\pi i \lambda_k x} b_k \big| \Big\|_{L^2([0,\tau_0^{-1}/20],du)} \\ & \lesssim \Big( \sum_{b \in B_j} \tau_0 \Big\| \sum_{k=1}^{K} e^{-2\pi i \lambda_k u} e^{2\pi i \lambda_k x} b_k \Big\|_{L^2([0,\tau_0^{-1}/20],du)}^2 \Big)^{1/2}; \ \ \ \ \ (4) \end{aligned}

notice that the summand on the RHS looks a lot like the $L^2$ norm of a Fourier series, and therefore we are tempted to use Plancherel to estimate it in terms of $\sum_{k} |b_k|^2$. As observed before, the expression is unfortunately not exactly a Fourier series in general, because the frequencies $\lambda_k$ don’t necessarily belong to a common grid; however, we have assumed some separation between the frequencies, and an inequality of Plancherel-type still holds:

Approximate Plancherel: we have for any coefficients $(a_k)_{k = 1, \ldots, K} \in \ell^2$ that

$\displaystyle \Big\| \sum_{k=1}^{K} e^{-2\pi i \lambda_k u} a_k \Big\|_{L^2([0,\tau_0^{-1}/20],du)} \lesssim \tau_0^{-1/2} \Big(\sum_{k=1}^{K} |a_k|^2\Big)^{1/2}. \ \ \ \ \ (5)$

Indeed, if $\phi$ denotes a smooth bump function supported on a slight enlargement of $[0, \tau_0^{-1}/20]$ and identically 1 on it, we can bound the LHS by

$\displaystyle \Big\| \sum_{k=1}^{K} e^{-2\pi i \lambda_k u} a_k \phi(u) \Big\|_{L^2(\mathbb{R})};$

applying Plancherel and expanding the square, we see that it suffices to bound

$\displaystyle \sum_{k \neq k'} a_k \overline{a_{k'}} \int \widehat{\phi}(x - (\lambda_k - \lambda_{k'})) \overline{\widehat{\phi}(x)} dx$

by (the square of) the RHS of (5). However, the smoothness and the support of $\phi$ imply that $|\widehat{\phi}(x)| \lesssim \tau_0^{-1}(1 + (|x|/\tau_0))^{-100}$, and therefore the integral is bounded by, say, $\tau_0^{-1} (|\lambda_k - \lambda_{k'}| / \tau_0)^{-10}$. Since $|\lambda_k - \lambda_{k'}| > 2\tau_0$, we see that $\sum_{k : k\neq k'} (|\lambda_k - \lambda_{k'}| / \tau_0)^{-10} = O(1)$, and therefore we can conclude by Cauchy-Schwarz (in the difference $k - k'$) that we have

$\displaystyle \sum_{k \neq k'} a_k \overline{a_{k'}} \int \widehat{\phi}(x - (\lambda_k - \lambda_{k'})) \overline{\widehat{\phi}(x)} dx \lesssim \tau_0^{-1} \sum_{k=1}^{K} |a_k|^2,$

as claimed.
Now, using (5) on the RHS of (4) and recalling that $\|b\| \lesssim 2^j$ for $b \in B_j$, we see that the second bound above gives the estimate

$\displaystyle \tau_0^{1/2} \Big\| \max_{b \in B_j} \big|\sum_{k=1}^{K} e^{-2\pi i \lambda_k u} e^{2\pi i \lambda_k x} b_k \big| \Big\|_{L^2([0,\tau_0^{-1}/20],du)} \lesssim 2^j (\# B_j)^{1/2} \leq 2^j M_{2^j}(A_x)^{1/2}.$

Summarising the above discussion, we have shown that the inner expression in (3) is bounded as follows:

$\displaystyle \tau_0^{1/2} \Big\|\sup_{t > 0} \big|\sum_{k=1}^{K} e^{-2\pi i \lambda_k u} \big[ e^{2\pi i \lambda_k x} (\varphi_t \ast f_k)(x) \big] \big| \Big\|_{L^2([0,\tau_0^{-1}/20],du)} \lesssim \sum_{j \in \mathbb{Z}} 2^j (\min\{K, M_{2^j}(A_x)\})^{1/2}.$

(observe that the sum is finite, since for $j$ large enough we will have $M_{2^j}(A_x) =0$, according to our modification of the definition of entropy numbers).
Notice that, by the monotonicity of the entropy numbers, the RHS in the above formula is bounded by

$\displaystyle \int_{0}^{\infty} (\min\{K, M_{s}(A_x)\})^{1/2} ds,$

and therefore we have reduced the whole ordeal to showing that

$\displaystyle \Big\| \int_{0}^{\infty} (\min\{K, M_{s}(A_x)\})^{1/2} ds \Big\|_{L^2(\mathbb{R},dx)} \lesssim (\log K)^2 \Big(\sum_{k=1}^{K} \|f_k\|_{L^2(\mathbb{R})}^2 \Big)^{1/2}. \ \ \ \ \ (6)$

If we prove this inequality, we will have proven (3), which means (by the reductions) that we will also have proven (1), which in turn means that we will have proven Bourgain’s lemma. We have come very far, carried by Bourgain’s tricks and beautiful ideas!

In the next and final section, I will sketch for completeness how this last inequality for entropy numbers is proven; however, the presentation of Bourgain’s trick is to be considered concluded, since it will not appear again. I invite you to appreciate how similar the idea here was to the idea that was used to prove the Rademacher-Menshov theorem (particularly the second more general version of the theorem), and to contemplate how amazing it is that we have been able to replicate it in such a different setting.

5. Sketch of the proof of the entropy numbers inequality

So, we have seen that through a series of standard arguments and some inventive tricks Bourgain has managed to reduce the proof of Lemma 1 to a certain inequality for entropy numbers (inequality (6)). How is this inequality proved in turn? Bourgain deduced it as a consequence of the Lépingle inequality (an inequality for the ${q}$-variation of martingale differences when ${q}$> 2), in an almost straightforward way. Here I will outline the argument but refrain from entering into too much detail since this post is already long enough (and the proof is quite readable in Section 3 of the original paper).

From now on we will write $M_{s}(x)$ in place of $M_{s}(A_x)$ for convenience. In what follows, we forget about the dimensionality of the space where the function $(f_k(x))_k$ takes values: in other words, there is no longer the restriction that the summation in ${k}$ be limited to $k \in \{1, \ldots, K\}$; in general we might even replace the sequence $(f_k)_k$ by a vector-valued function taking values in a (separable) Hilbert space and the proof would be the same. The only dependence of the estimates on ${K}$ will come from the one in $\min\{K, M_s(x)\}$.
In estimating the integral in (6), we need to massage the expression $(\min\{K, M_{s}(x)\})^{1/2}$ into something more manageable. A first thing to notice is that when ${s}$ is sufficiently large then the expression will be identically zero; can we give a quantitative bound for how large ${s}$ needs to be for this to happen? Well, yes: the diameter of $A_x$ is clearly controlled by $\Big( \sum_{k} \sup_{t > 0}|\varphi_t \ast f_k(x)|^2 \Big)^{1/2} =: F(x)$, and therefore we can restrict the integration to the interval $[0, F(x)]$. Observe that $F(x)$ is a nice function, because each summand is bounded by the Hardy-Littlewood maximal function of the corresponding $f_k$, and therefore

$\displaystyle \|F\|_{L^2} \lesssim \Big(\sum_{k} \|f_k\|_{L^2}^2\Big)^{1/2},$

a bound that does not depend on ${K}$ at all (the implicit constant depends on $\varphi$ though). This is good when $s \leq K^{-1/2} F(x)$: indeed, in this case we can bound $\min\{K, M_s(x)\} \leq K$ and therefore

$\displaystyle \int_{0}^{K^{-1/2}F(x)} (\min\{K, M_s(x)\})^{1/2} dx \leq F(x).$

For the remaining part of the integral, we can replace the minimum by a geometric average of the two terms, which is usually a good idea: we have for any ${q > 2}$

\displaystyle \begin{aligned} \int_{K^{-1/2}F(x)}^{F(x)} (\min\{K, M_s(x)\})^{1/2} ds \leq & \int_{K^{-1/2}F(x)}^{F(x)} K^{1/2 - 1/q} M_s(x)^{1/q} ds \\ = & K^{1/2 - 1/q} \int_{K^{-1/2}F(x)}^{F(x)} s M_s(x)^{1/q} \frac{ds}{s} \\ \leq & K^{1/2 - 1/q} \Big[\sup_{s>0} s M_s(x)^{1/q} \Big]\int_{K^{-1/2}F(x)}^{F(x)} \frac{ds}{s} \\ = & \frac{1}{2} K^{1/2 - 1/q} \log K \Big[\sup_{s>0} s M_s(x)^{1/q} \Big]. \end{aligned}

There is an unpleasant dependence on ${K}$ in this expression that we can remove if we choose ${q}$ sufficiently close to 2 that $1/2 - 1/q = 1/\log K$: indeed, this forces $K^{1/2 - 1/q} = 2$. However, this choice shifts the dependence on ${k}$ on the exponent ${q}$ and thus we will need to be careful with the dependence of any estimate in ${q}$. From the above bound for $\|F\|_{L^2}$, we see that to conclude (6) it suffices to show the following bound for the $L^2$ norm of $\sup_{s>0} s M_s(x)^{1/q}$:

Proposition 3: Let $q>2$. Then

$\displaystyle \Big\| \sup_{s>0} s M_s(x)^{1/q} \Big\|_{L^2(\mathbb{R},dx)} \lesssim (q-2)^{-1} \Big(\sum_{k} \|f_k\|_{L^2}^2\Big)^{1/2}.$

Indeed, observe that if $1/2 - 1/q = 1/\log K$ then $(q-2)^{-1} = (\log K - 2)/4 \lesssim \log K$; it is this factor that gives us the second logarithmic factor in (6).

The point is now to connect the inequality to be proven to variation quantities. This is actually not hard, in that the expression $s M_s(x)^{1/q}$ is already very close to a variation: indeed, observe that there must be at least $M_s(x)$ points on $A_x$ that are at a distance of at least ${s}$ from each other (for otherwise we could cover $A_x$ with less than $M_s(x)$ balls of radius ${s}$). Each such point, being an element of $A_x$, is labeled by a value $t_j$ (the parameter in which we take the supremum) and we have by construction that

$\displaystyle \Big(\sum_{k} |\varphi_{t_j} \ast f_k(x) - \varphi_{t_{j-1}} \ast f_k(x)|^2 \Big)^{1/2} \geq s;$

therefore we can write

$\displaystyle s M_s(x)^{1/q} \leq \Big( \sum_{j} \big(\sum_{k} |\varphi_{t_j} \ast f_k(x) - \varphi_{t_{j-1}} \ast f_k(x)|^2 \big)^{q/2} \Big)^{1/q}.$

This last expression at the RHS is precisely the ${q}$-variation of the sequence $(\varphi_t \ast \mathbf{f}(x))_{t>0}$ along the sequence $(t_j)_{j}$. We introduce suitable notation for these quantities:

Definition: Let $r \geq 1$ and let $(x_t)_{t>0}$ be a continuously indexed sequence in a Hilbert space $\mathcal{H}$ with norm $\|\cdot\|_{\mathcal{H}}$. The ${r}$-variation of the sequence $(x_t)_{t>0}$ is

$\displaystyle \|(x_t)_{t>0}\|_{V_r(\mathcal{H})}:= \sup_{N} \sup_{0< t_1 < \ldots < t_N} \Big(\sum_{j=2}^{N} \|x_{t_j} - x_{t_{j-1}}\|_{\mathcal{H}}^r \Big)^{1/r}.$

With this notation, the observation above is that

$\displaystyle s M_s(x)^{1/q} \leq \|(\varphi_t \ast \mathbf{f}(x))_{t>0}\|_{V_q(\ell^2)},$

where $\ell^2$ denotes the space where the vector-valued function $\mathbf{f}(x) = (f_k (x))_{k}$ takes values. So it will suffice to control the ${q}$-variation instead in order to prove Proposition 3. In particular, since ${q}$ > 2, the vector-valued inequality follows from the single-valued one that follows:

Proposition 3′: Let $q>2$ and let $\varphi$ be a smooth function on $\mathbb{R}$ such that $\int |\varphi'(x)||x|dx < \infty$. Then for any $f \in L^2(\mathbb{R})$

$\displaystyle \big\|\|(\varphi_t \ast f(x))_{t>0}\|_{V_q}\big\|_{L^2(\mathbb{R},dx)} \lesssim_\varphi (q - 2)^{-1} \|f\|_{L^2(\mathbb{R})}.$

[The constant implicit in $\lesssim_\varphi$ is $C \int |\varphi'(x)||x|dx$.]

How does one go about proving such an inequality? Well, it is good to start from a simple case, and since $\varphi_t \ast f$ is an average of ${f}$ we might as well replace it with the simplest average we can think of, that is $\frac{1}{t} \mathbf{1}_{[0,t]} \ast f = \frac{1}{t} \int_{0}^{t} f(x - y) dy$. We let $\mathbf{1}_{[0,1]} =: \chi$ for convenience, so that the average is $\chi_t \ast f$. We claim that if we can prove Proposition 3′ with $\varphi$ replaced by $\chi$, then we can prove it for any $\varphi$. Indeed, the reason is very simple: since $\varphi$ certainly vanishes at $\infty$, we can write it as averages of $\chi$, and more precisely

\displaystyle \begin{aligned} \varphi(x) = & -\int_{x}^{\infty} \varphi'(t) dt \\ = & -\int_{0}^{\infty} \mathbf{1}_{[0,t]}(x)\varphi'(t) dt \\ = & -\int_{0}^{\infty} \chi_t(x) \varphi'(t)t \,dt \end{aligned}

and again we can conclude by convexity, that is by the fact that $q$ > 2.

At this point, anyone with a decent knowledge of martingale theory would recognise that the inequality that we are trying to prove is very similar to a well known inequality for the ${q}$-variation of martingale differences, the aforementioned Lépingle inequality:

Lépingle inequality: Let $q$ > 2 and let $(\mathcal{F}_{n})_{n \in \mathbb{N}}$ be an increasing sequence of sigma-algebras on a probability space $(\Omega, \mathbb{P})$ and let $\mathbb{E}_n f := \mathbb{E}[f | \mathcal{F}_n]$. Then we have

$\displaystyle \big\| \| (\mathbb{E}_n f)_{n \in \mathbb{N}}\|_{V_q} \big\|_{L^2(d\mathbb{P})} \lesssim (q-2)^{-1} \|f\|_{L^2(d\mathbb{P})}.$

Indeed, you can see how this inequality is exactly the discrete version of the one we want to prove, and it is therefore reasonable to assume that the inequality we want will follow from the one of Lépingle; this is indeed the case.
We do not prove the Lépingle inequality here (though it is an easy consequence of the Doob oscillation inequality, which itself is an easy consequence of the Burkholder-Davis-Gundy inequality for martingale square functions); maybe in a future post. Now, how does one deduce Proposition 3′ from Lépingle? this is the most technical part of the argument actually, and I will only briefly comment on the proof. First of all, one can argue that if $P_t$ is the Poisson kernel for the upper half-plane ($P_t(x) = t / (t^2 + x^2)$) then the Lépingle inequality implies

$\displaystyle \big\| \| (P_t \ast f)_{t>0}\|_{V_q} \big\|_{L^2} \lesssim (q-2)^{-1} \|f\|_{L^2};$

this is because of the connection between the Poisson semi-group and the Brownian martingale. So we are a bit closer to the inequality for $(\chi_t \ast f)_{t>0}$, but not there yet. The final step is to consider instead the difference, that is $((\chi-P_1)_t \ast f)_{t>0}$, and prove another $L^2$ inequality for the ${q}$-variation with respect to this kernel $K := \chi - P_1$. This kernel satisfies favourable Fourier estimates, in particular one can verify easily that

\displaystyle \begin{aligned} |\widehat{K}(\xi)| \lesssim & \min\{|\xi|, |\xi|^{-1}\}, \\ \big|\frac{d}{d\xi}\widehat{K}(\xi)\big| \lesssim & |\xi|^{-1}. \end{aligned}

The strategy is then to estimate the 2-variation3 of $(K_t \ast f)_{t>0}$ by dividing it in a lacunary part where we have long variation jumps (from $2^k$ to $2^{k+1}$) and a local part where we have short variation jumps (that is, restricted to the dyadic interval $[2^k, 2^{k+1}]$). This is the same strategy that was used in Kovač’s argument for maximal Fourier restriction and is a very typical strategy in dealing with variation estimates (it was introduced and perfected by Oberlin, Seeger, Tao, Wright,…). As is to be expected, the lacunary part of the 2-variation is easily estimated by Plancherel and the enemy is the local part. After a dyadic frequency localisation of $K$, matters are reduced to a trick that is essentially the same we used in proving the boundedness of the spherical maximal function! So that one ends up estimating $L^2$ norms of $\frac{d}{dt} (K_t \ast f)$, and here is where those good decay estimates for $\widehat{K}$ really come in handy. Using essentially the same as the Sobolev-type inequality that we used there, one can optimally balance two estimates and obtain the desired control of the 2-variation (obviously, there is no dependence on ${q}$ in this case). It is an interesting proof for us, because it uses two ideas that we have already seen in this blog; but this post is already long enough, and therefore I refrain from writing out the details.

This concludes our first detour in the brilliant mind of the late Jean Bourgain. In trying to explain the idea behind one of his beautiful tricks, we have come across many others that are equally interesting. Bourgain’s papers are very hard to read in general, but each of them is a formative experience in its own as we have tried to convey; one truly feels enriched after having read one.

Footnotes:
1: the modulation $e^{-2\pi i \lambda_k \cdot}$ has been absorbed into the $f_k$‘s. Notice that we are not making any assumptions on the frequency support of the functions $f_k$.
2: Technically, since the supremum is not necessarily a maximum, it would be best to look for a point that realises at least $1/2$ of the supremum, or equivalently, to linearise the supremum; but we will gloss over such technicalities.
3: Observe that since $\|(a_k)_k \|_{\ell^q} \leq \|(a_k)_k \|_{\ell^p}$ for $q \geq p$, we always have $\|(x_t)_{t>0} \|_{V_q(\mathcal{H})} \leq \|(x_t)_{t>0} \|_{V_2(\mathcal{H})}$ when ${q}$ > 2.