# Representing points in a set in positional-notation fashion (a trick by Bourgain): part I

If you are reading this blog, you have probably heard that Jean Bourgain – one of the greatest analysts of the last century – has unfortunately passed away last December. It is fair to say that the progress of analysis will slow down significantly without him. I am not in any position to give a eulogy to this giant, but I thought it would be nice to commemorate him by talking occasionally on this blog about some of his many profound papers and his crazily inventive tricks. That’s something everybody agrees on: Bourgain was able to come up with a variety of insane tricks in a way that no one else is. The man was a problem solver and an overall magician: the first time you see one of his tricks, you don’t believe what’s happening in front of you. And that’s just the tricks part!

In this two-parts post I am going to talk about a certain trick that loosely speaking, involves representing points on an arbitrary set in a fashion similar to how integers are represented, say, in binary basis. I don’t know if this trick came straight out of Bourgain’s magical top hat or if he learned it from somewhere else; I haven’t seen it used elsewhere except for papers that cite Bourgain himself, so I’m inclined to attribute it to him – but please, correct me if I’m wrong.
Today we introduce the context for the trick (a famous lemma by Bourgain for maximal frequency projections on the real line) and present a toy version of the idea in a proof of the Rademacher-Menshov theorem. In the second part we will finally see the trick.

1. Ergodic averages along arithmetic sequences
First, some context. The trick I am going to talk about can be found in one of Bourgain’s major papers, that were among the ones cited in the motivation for his Fields medal prize. I am talking about the paper on a.e. convergence of ergodic averages along arithmetic sequences. The main result of that paper is stated as follows: let $(X,T,\mu)$ be an ergodic system, that is

1. $\mu$ is a probability on $X$;
2. $T: X \to X$ satisfies $\mu(T^{-1} A) = \mu(A)$ for all $\mu$-measurable sets $A$ (this is the invariance condition);
3. $T^{-1} A = A$ implies $\mu(A) = 0 \text{ or } 1$ (this is the ergodicity condition).

Then the result is

Theorem: [Bourgain, ’89] Let $(X,T,\mu)$ be an ergodic system and let $p(n)$ be a polynomial with integer coefficients. If $f \in L^q(d\mu)$ with ${q}$ > 1, then the averages $A_N f(x) := \frac{1}{N}\sum_{n=1}^{N}f(T^{p(n)} x)$ converge $\mu$-a.e. as $N \to \infty$; moreover, if ${T}$ is weakly mixing1, we have more precisely

$\displaystyle \lim_{N \to \infty} A_N f(x) = \int_X f d\mu$

for $\mu$-a.e. ${x}$.

For comparison, the more classical pointwise ergodic theorem of Birkhoff states the same for the case $p(n) = n$ and $f \in L^1(d\mu)$ (notice this is the largest of the $L^p(X,d\mu)$ spaces because $\mu$ is finite), in which case the theorem is deduced as a consequence of the $L^1 \to L^{1,\infty}$ boundedness of the Hardy-Littlewood maximal function. The dense class to appeal to is roughly speaking $L^2(X,d\mu)$, thanks to the ergodic theorem of Von Neumann, which states $A_N f$ converges in $L^2$ norm for $f \in L^2(X,d\mu)$. However, the details are non-trivial. Heuristically, these ergodic theorems incarnate a quantitative version of the idea that the orbits $\{T^n x\}_{n\in\mathbb{N}}$ fill up the entire space ${X}$ uniformly. I don’t want to enter into details because here I am just providing some context for those interested; there are plenty of introductions to ergodic theory where these results are covered in depth.

Now, Bourgain’s methods show the boundedness of the corresponding maximal average, that is the operator $A_\ast f = \sup_N |A_N f|$. However, I should point out that, unlike Birkhoff’s case, this does not imply the pointwise convergence almost everywhere in the standard way: in general there is no dense subclass of functions for which the result is easy to show when $p(n) \neq n$, so density arguments are ruled out. The real argument to show the a.e. convergence goes through what is called an oscillation inequality, which is quite akin to a bound for the 2-variation norm of the sequence of averages: roughly speaking, something of the form

$\displaystyle \int_{X} \sum_{j} \sup_{N_j \leq N \leq N_{j+1}} \big| A_N f - A_{N_{j+1}}f \big|^2 d\mu \leq C \int_{X} |f|^2 d\mu$

for any choice of an increasing sequence $(N_j)_j$, with constant independent of the specific sequence. Indeed, Bourgain was a pioneer in this approach to convergence. The inequality is in turn deduced from the proof of the boundedness of $A_\ast f$, that is some partial results obtained there are repurposed for the oscillation inequality. There are a number of technicalities that make the overall argument for the a.e. convergence look somewhat convoluted, at least to my lesser brain, but for the moment I will gloss over all this and revisit the point in an appendix to this post (in a previous paper by Bourgain, where he worked out the $L^2$ case of the above, it is explained in a bit more detail). Another more important reduction is the so-called Calderón’s transference principle, which is a little argument that allows us to deduce $L^p(X,d\mu) \to L^p(X,d\mu)$ inequalities for the maximal operator on the dynamical system $(X,T, \mu)$ from the $\ell^p(\mathbb{Z}) \to \ell^p(\mathbb{Z})$ inequalities for the model system called the shift model, that is $(\mathbb{Z}, S)$ where ${S}$ denotes the shift operator $S(n) = n+1$. In this model the maximal operator becomes simply $\mathcal{A}_\ast f = \sup_{N} |\mathcal{A}_N f|$, where

$\displaystyle \mathcal{A}_N f(x) := \frac{1}{N} \sum_{n=1}^{N} f(x + p(n));$

observe however that the shift model is not an ergodic system! There is no probability measure on $\mathbb{Z}$ invariant with respect to ${S}$. Nevertheless, it suffices to bound $A_\ast$. The reasoning is as follows: for a fixed $x \in X$ you define the function $g_x: \mathbb{N} \to \mathbb{C}$ to be $g_x(j):= f(T^{j}x)$. It is easy to see that $A_N f(T^j x) = \mathcal{A}_N g_x (j)$, and therefore $A_\ast f(T^j x) = \mathcal{A}_\ast g_x (j)$ as well. $\mathcal{A}_\ast$ being $\ell^q(\mathbb{Z}) \to \ell^q(\mathbb{Z})$ bounded means

$\displaystyle \sum_{j} |\mathcal{A}_\ast g_x(j)|^q \lesssim \sum_{j} |g_x(j)|^q,$

which upon integration in $d\mu(x)$ becomes therefore

$\displaystyle \sum_{j} \int_X |A_\ast f(T^j x)|^q d\mu(x) \lesssim \sum_{j} \int_X |f(T^j x)|^q d\mu(x).$

Now we recall the nice thing about ${d\mu}$ which is that it is invariant with respect to ${T}$, and therefore actually $\int_X |f(T^j x)|^q d\mu(x) = \int_X |f|^q d\mu$ for every ${j}$ (and similarly for the term on the left-hand side). This is a problem however: we are summing infinitely many copies of the same quantity, so the two sides are both infinite! Of course we have just been silly: instead of defining $g_x(j)$ as above, we truncate it and let it be 0 outside a sufficiently large set, and we similarly truncate the supremum… I leave the boring details to you.

Now that we have the basics down, what is Bourgain’s strategy in proving the above theorem, or more specifically the boundedness of $A_\ast$? What follows is a very rough caricature of the $L^2$ case (you can find a fine exposition in the appendix B to this survey paper by Quas and Wierdl, in which they illustrate in detail the case $p(n)=n^2$; there is also a brief sketch in section 4 of this paper by Cuny and Weber). So, we want to prove that

$\displaystyle \Big\|\sup_N \big|\frac{1}{N} \sum_{n=1}^{N} f(x + p(n))\big| \Big\|_{\ell^2(\mathbb{Z})} \lesssim \|f\|_{\ell^2(\mathbb{Z})};$

we notice that $\frac{1}{N} \sum_{n=1}^{N} f(x + p(n)) = f \ast \Big(\frac{1}{N} \sum_{n=1}^{N} \delta_{p(n)}\Big)(x)$, with $\delta_k$ the Dirac delta supported in ${k}$, and therefore this kernel has multiplier $\frac{1}{N} \sum_{n=1}^{N} e^{2\pi i p(n) \xi}$. This is a Weyl sum, a very well studied object for which we have plenty of results. The typical way to attack these sums is to use the Hardy-Littlewood circle method, which in this case allows us to replace the Weyl sum above by a more explicit one: in the case $p(n)=n^2$ (for simplicity), we have that if ${\xi}$ is not “close” to a rational then the multiplier is extremely small, and viceversa if it is “close” to the rational $a/q$ then we can approximate the multiplier well by

$\displaystyle \Big(\frac{1}{q} \sum_{n=1}^{q} e^{2\pi i a n^2 / q}\Big) \cdot \Big(\frac{1}{N} \int_{0}^{N} e^{2\pi i (\xi - a/q)t^2}dt\Big).$

The value of the latter expression is that the first factor does not depend on ${N}$ any longer, so that the dependence is isolated to the second factor; but this latter factor in turn has a nice expression as an oscillatory integral, which is much easier to treat than a Weyl sum. Indeed, it is so well behaved that we can essentially replace it by the characteristic function of the interval $|\xi - a/q| < 1/N^{\deg(p)}$, for our purposes. Once we do this we can partition the rationals by dyadic size of their denominator, defining

$\displaystyle \mathcal{Q}_s := \Big\{ \frac{a}{q} \in \mathbb{Q}_{+} \text{ such that } a

and reduce to bound the maximal operators

$\displaystyle f \mapsto \sup_{N \gg 2^{2ds}} \Big| \sum_{\theta \in \mathcal{Q}_s} \int_{|\xi - \theta| < N^{-1/d}} \widehat{f}(\xi) e^{2\pi i \xi x} d\xi \Big|.$

(the factors $q^{-1} \sum_{n=1}^{q} \exp(2\pi i a n^2/q)$ can be absorbed into the function ${f}$ and efficiently estimated). We will have to sum in ${s}$, but thanks to some decaying factors that I have conveniently left out it will suffice to prove that these operators are $L^2 \to L^2$ bounded with a norm that is at most $O((\log \#\mathcal{Q}_s)^{O(1)})$ (and notice $\#\mathcal{Q}_s \lesssim 2^{2s}$, so that quantity is $O(s^{O(1)})$). This is finally achieved through a very remarkable lemma, which is as follows:

Lemma 1 [Bourgain]: Let $K$ be an integer and let $\Lambda = \{\lambda_1, \ldots, \lambda_K \}$ a set of ${K}$ frequencies. Define the maximal frequency projections

$\displaystyle \mathcal{B}_\Lambda f(x) := \sup_{j} \Big|\sum_{k=1}^{K} \int_{|\xi - \lambda_k| < 2^{-j}} \widehat{f}(\xi) e^{2\pi i \xi x} d\xi\Big|,$

where the supremum is restricted to those ${j}$ such that $2^{-j} \leq \min \{ |\lambda_k - \lambda_{k'}| : 1\leq k\neq k'\leq K \}$.
Then

$\displaystyle \|\mathcal{B}_\Lambda f\|_{L^2} \lesssim (\log \#\Lambda)^2 \|f\|_{L^2}.$

That the lemma allows one to conclude is evident. There are a few versions of this statement around, and I would like to clarify one thing: we can indeed remove the restriction on the supremum, if we reformulate the operator as follows. Given ${\Lambda}$ a finite set of frequencies, we let $\Lambda^{j}$ denote the $2^{-j}$-neighbourhood of the set $\Lambda$, that is

$\displaystyle \Lambda^j = \mathcal{N}_{2^{-j}}(\Lambda) := \{\xi \in \mathbb{R} : \min_{\lambda \in \Lambda} |\xi - \lambda| \leq 2^{-j}\}.$

Then if we redefine $\mathcal{B}$ as

$\displaystyle \tilde{\mathcal{B}}_\Lambda f(x) := \sup_{j \in \mathbb{Z}} \Big|\int \mathbf{1}_{\Lambda^j}(\xi) \widehat{f}(\xi) e^{2\pi i \xi x} d\xi\Big|,$

the conclusion of the lemma still holds, with the same constant.

I should remark here that actually this lemma is much more powerful than what one needs to run the above argument! Indeed, the rationals in $\mathcal{Q}_s$ lie on a regular lattice $m_s^{-1} \mathbb{Z}$, where ${m_s}$ denotes the m.c.m. of all the integers between $2^s$ and $2^{s+1}$; moreover, replacing the oscillatory integral by a characteristic function was a bit rough, and one can more conveniently replace it by a smooth bump function instead. These two additional assumptions are enough to significantly simplify the proof of the lemma for the corresponding operator, as was noticed by Lacey some time later2. This is another amazing thing that Bourgain typically did: he often proved stronger results than what he needed just because it didn’t occur to him that he could get away with less.

A final important remark about Bourgain’s lemma is that the logarithmic term in the constant is unavoidable. In particular, Bourgain, Kostyukovsky and Olevskiǐ proved that the constant must be at least $\gtrsim (\log \#\Lambda)^{1/4}$ using a beautiful construction based on the Kolmogorov rearrangement theorem. More on this in the future (but don’t hold your breath).

The trick this two-parts post is devoted to makes its appearance in the proof of the lemma above. In the next section we discuss a simpler but related result as a warmup.

2. Rademacher-Menshov theorem for frequency projections

A standard result that might always come in handy when dealing with maximal frequency projections is a modern reformulation of a result by Rademacher and Menshov for orthonormal systems that goes by their names. The formulation we are interested in is the following:

Theorem 2 [Rademacher-Menshov à la Bourgain]: Let $E_1 \subset \cdots \subset E_N$ be a finite sequence of increasing measurable subsets of $\mathbb{R}$. The estimate

$\displaystyle \Big\|\sup_{1\leq n \leq N} \Big| \int \mathbf{1}_{E_n}(\xi) \widehat{f}(\xi) e^{2\pi i \xi x} d\xi \Big|\Big\|_{L^2} \lesssim (\log N) \|f\|_{L^2} \ \ \ \ \ (1)$

holds for any $f \in L^2$.

So, the maximal operator in the statement is a maximal frequency projection as well (notice the sets $\Lambda^j$ above are nested as well), but the constant this time depends on the range over which the supremum is taken and as such it cannot be applied to prove Bourgain’s lemma directly. However, the two results are quite related: indeed, let me remark that the Rademacher-Menshov theorem is actually used in the proof of the lemma for the operator $\tilde{\mathcal{B}}_\Lambda$, that is when the supremum is unrestricted; moreover, the proof of the $L^2$ boundedness of the simplified maximal operator of Lacey2 rests heavily on the Rademacher-Menshov theorem as well.

There is also a slightly more general statement which is proven in a similar way and is useful in a variety of contexts:

Theorem 2′ [Rademacher-Menshov]: Let $f_1, \ldots, f_N$ be a finite sequence of functions that satisfy

$\displaystyle \Big\|\sum_{k=1}^{N} \epsilon_k f_k \Big\|_{L^2} \leq C$

for any arbitrary choice of signs $\epsilon_k \in \{+1,-1\}$. Then for the maximal truncations of $\sum_k f_k$ the estimate

$\displaystyle \Big\|\sup_{1\leq n \leq N} \Big| \sum_{k=1}^{n} f_k \Big|\Big\|_{L^2} \lesssim (\log N) C$

holds.

It is almost immediate to verify that Theorem 2′ implies Theorem 2. We will illustrate both proofs, though they are really similar.

Let’s see how Bourgain proves Theorem 2. The overall strategy is to prove a bootstrap inequality, or self-improving inequality, for the optimal constant for which the inequality holds. This is overly fancy jargon to say a very simple (yet somewhat magical) thing: if you want to prove that $A \lesssim B$, it is enough to prove for example that $A \leq B + \frac{1}{2} A$, or that $A^2 \lesssim AB$, or anything of the form – this latter is perhaps most useful in conjunction with $L^2$ estimates and the use of Cauchy-Schwarz (it is for example how Bessel inequalities in time-frequency analysis are usually proved). Bear in mind that for these inequalities to make sense, you need to know somehow that the quantity $A$ is finite to begin with.

Proof of Theorem 2: We write $\int \mathbf{1}_{E_n}(\xi) \widehat{f}(\xi) e^{2\pi i \xi x} d\xi$ as $(\mathbf{1}_{E_n} \widehat{f})^{\vee}$ for shortness.
We can obviously assume that $N = 2^m$ for convenience. We will prove the inequality dual to inequality (1), namely we will show that for all choices of $(g_k)_{k} \in L^2(\ell^1)$

$\displaystyle \Big\| \sum_{k =1}^{2^m} (\mathbf{1}_{E_k} \widehat{g_k})^{\vee} \Big\|_{L^2} \leq B \Big\| \sum_{k = 1}^{2^m} |g_k| \Big\|_{L^2} \ \ \ \ \ (2)$

with $B \lesssim m$; this then implies (1). Indeed, observe that for any function ${f}$ there is a measurable function $k(x)$ such that $\sup_{1\leq k \leq 2^m} |(\mathbf{1}_{E_k} \widehat{f})^{\vee}(x)| = \big|\sum_{k=1}^{2^m} \mathbf{1}(k(x) = k)(\mathbf{1}_{E_{k}} \widehat{f})^{\vee}(x)\big|$; this is a trick known as the linearisation of the supremum (which I have already talked about in the past in the context of the Carleson operator). We can estimate the $L^2$ norm of this quantity by duality as

$\displaystyle \sup_{\|g\|_{L^2} \leq 1}\int g(x) \sum_{k=1}^{2^m} \mathbf{1}(k(x) = k)(\mathbf{1}_{E_{k}} \widehat{f})^{\vee}(x) dx ;$

if we let $g_k(x) := g(x) \mathbf{1}(k(x) = k)$ then by Plancherel

\displaystyle \begin{aligned} \int g \sum_{k} \mathbf{1}(k(x) = k)(\mathbf{1}_{E_{k}} \widehat{f})^{\vee} dx = & \int \sum_{k} g_k (\mathbf{1}_{E_{k}} \widehat{f})^{\vee} dx \\ = & \int \sum_{k} \widehat{g_k} \mathbf{1}_{E_{k}} \widehat{f} d\xi \\ = & \int f \sum_{k} (\mathbf{1}_{E_{k}} \widehat{g_k})^{\vee} dx \\ \leq & \|f\|_{L^2} \Big\| \sum_{k} (\mathbf{1}_{E_{k}} \widehat{g_k})^{\vee} \Big\|_{L^2}. \end{aligned}

The latter is bounded by (2) by $B \Big\| \sum_{k} |g_k |\Big\|_{L^2} = B \|g\|_{L^2}$, thus proving (1) conditional to (2).
Now we prove the dual inequality. Observe that a trivial use of the triangle inequality shows the constant ${B}$ is at most $2^m$, so it is certainly finite and we want to obtain a good bound for it. For shortness, define $G_k := (\mathbf{1}_{E_k} \widehat{g_k})^{\vee}$ and observe that if ${j we have by Plancherel

\displaystyle \begin{aligned} \langle G_j, G_k \rangle = & \langle \mathbf{1}_{E_j} \widehat{g_j}, \mathbf{1}_{E_k} \widehat{g_k} \rangle \\ = & \langle \mathbf{1}_{E_j \cap E_k} \widehat{g_j}, \widehat{g_k} \rangle \\ = & \langle G_j, g_k \rangle . \end{aligned}

With this information, we proceed to expand the square of the LHS of (2), obtaining

$\displaystyle \sum_{k=1}^{2^m} \|G_k\|_{L^2}^2 + 2 \sum_{1 \leq j < k \leq 2^m} \langle G_j, g_k \rangle.$

The first sum is bounded trivially by $\sum_{k} \|g_k\|_{L^2}^2 = \Big\| \big(\sum_{k=1}^{2^m} |g_k|^2\big)^{1/2} \Big\|_{L^2}^2 \leq \Big\| \sum_{k=1}^{2^m} |g_k|\Big\|_{L^2}^2$. For the second term, we bring the logarithm into the picture by expressing the indices $j, k$ in their binary form. Indeed, given two numbers written in their (binary) positional notation, how do you establish which one is larger? well, assuming that the strings you are given are of the same length, you start scanning from the leftmost digit to see whether they are the same, and as soon as you meet two distinct digits (a 1 and a 0), the string with digit 1 is the largest:

$\displaystyle \begin{matrix} 1 & 0 & 1 & 0 & 1 & 1 & 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\ \checkmark & \checkmark & \checkmark & \checkmark & \text{\sffamily X} & & & & & \end{matrix}$

This suggests a way to reorganise the sum in such a way that the condition ${j < k}$ is automatically satisfied. We shift the indices back by 1 for convenience (that is they range over $\{0,1,\ldots, 2^m - 1\}$ now). Any index ${j}$ corresponds uniquely to a string ${w}$ drawn from the alphabet $\{0,1\}$ through binary positional notation: precisely, ${j}$ corresponds to the string $w = (w_{m-1}, \ldots , w_1, w_0) \in \{0,1\}^{m}$ such that

$\displaystyle j = \sigma(w) := 2^{m-1} w_{m-1} + \ldots + 2 w_1 + w_0.$

We let $\mathcal{S}$ be the set of all (finite) strings made of 0s and 1s and denote by $|w|$ the length of string $w \in \mathcal{S}$. Given strings ${v,w}$ we can define the string $v \wedge w$ to be the string obtained by appending ${w}$ to ${v}$. Then we can define the set of strings of length ${m}$ with a given prefix: for $w \in \mathcal{S}$ such that $|w|\leq m$, let

$\displaystyle \mathcal{P}(w) := \{ v \in \mathcal{S} : |v| = m \text{ and } v = w \wedge s \text{ for some } s \in \mathcal{S} \}.$

With this notation in place, we can rewrite the sum $\sum_{j < k}$ in terms of the length of the prefix ${j}$ and ${k}$ share when written in binary notation:

\displaystyle \begin{aligned} & \sum_{0 \leq \ell < m} \sum_{\substack{w \in \mathcal{S} : \\ |w| = \ell}} \sum_{v \in \mathcal{P}(w \wedge 0)} \sum_{v' \in \mathcal{P}(w \wedge 1)} \langle G_{\sigma(v)}, g_{\sigma(v')} \rangle \\ & = \sum_{0 \leq \ell < m} \sum_{\substack{w \in \mathcal{S} : \\ |w| = \ell}} \Big\langle \sum_{v \in \mathcal{P}(w \wedge 0)} G_{\sigma(v)}, \sum_{v' \in \mathcal{P}(w \wedge 1)} g_{\sigma(v')} \Big\rangle. \end{aligned}

To be clear, ${v}$ and ${v'}$ above have the same prefix ${w}$ but they differ in their $(\ell+1)$-th digit from the left. In particular, ${v}$ has a 0 there whereas ${v'}$ has a 1, and therefore $\sigma(v) < \sigma(v')$ automatically. Now we do Cauchy-Schwarz (twice) and bound the above by

$\displaystyle \sum_{0 \leq \ell < m} \Big\| \sum_{\substack{w \in \mathcal{S} : \\ |w| = \ell}} \sum_{v \in \mathcal{P}(w \wedge 0)} G_{\sigma(v)} \Big\|_{L^2} \Big\| \sum_{\substack{w \in \mathcal{S} : \\ |w| = \ell}} \sum_{v' \in \mathcal{P}(w \wedge 1)} g_{\sigma(v')} \Big\|_{L^2}.$

The second factor is obviously bounded by $\Big\| \sum_{k=1}^{2^m} |g_k|\Big\|_{L^2}$; for the first one, we apply the inequality (2) with the best constant ${B}$ for which it holds3, thus obtaining

\displaystyle \begin{aligned} & \sum_{0 \leq \ell < m} \Big\| \sum_{\substack{w \in \mathcal{S} : \\ |w| = \ell}} \sum_{v \in \mathcal{P}(w \wedge 0)} G_{\sigma(v)} \Big\|_{L^2} \Big\|\sum_{k=1}^{2^m} |g_k| \Big\|_{L^2} \\ & \leq \sum_{0 \leq \ell < m} B \Big\| \sum_{\substack{w \in \mathcal{S} : \\ |w| = \ell}} \sum_{v \in \mathcal{P}(w \wedge 0)} |g_{\sigma(v)}| \Big\|_{L^2} \Big\|\sum_{k=1}^{2^m} |g_k| \Big\|_{L^2} \\ & \leq \sum_{0 \leq \ell < m} B \Big\|\sum_{k=1}^{2^m} |g_k| \Big\|_{L^2}^2 = Bm \Big\|\sum_{k=1}^{2^m} |g_k| \Big\|_{L^2}^2. \\ \end{aligned}

Collecting everything together, if ${B}$ denotes the best constant for which (2) holds, the above has shown that

$\displaystyle B^2 \leq 1 + 2m B,$

which, since ${B}$ is finite, shows that $B \lesssim m = \log N$, as claimed. $\Box$
[Observe that the gain came from the identity $\langle G_j, G_k \rangle = \langle G_j, g_k \rangle$ for ${j, without which we would have had to bound both factors above by (2) thus gaining a factor of ${B^2}$ instead, which would have given us the useless inequality $B^2 \leq 1 + 2m B^2$.]

Now, let’s see the proof of the more general Theorem 2′. At the heart of the proof there is the same idea of representing integers in their binary form; the proof however is now direct, so no dualisation and no bootstrap inequalities for the optimal constant.

Proof of theorem 2′:
Since we have the freedom of choosing the signs $\epsilon_k$ as we want, we have the ability to average over such choices. This is like a baby version of Khintchine’s inequality, except more trivial since we are in the case $p = 2$ and we can just expand squares. We choose to average in dyadic blocks, that is in the following way. Assume of course that $N = 2^m$ and for each $\ell \leq m$ define the partial sums

$\displaystyle F^{\ell}_j = \sum_{k= 2^\ell j }^{2^{\ell}(j + 1) - 1} f_k$

for $j = 0, \ldots, 2^{m -\ell} - 1$. In other words, we are dividing up the range $[0, 2^m - 1]$ into $2^{m-\ell}$ consecutive blocks, each of length $2^{\ell}$:

$\displaystyle [0,2^{m} -1] = [0, 2^{\ell} -1] \cup [2^{\ell}, 2\cdot 2^{\ell} - 1] \cup \cdots \cup [(2^{m-\ell} - 1)\cdot 2^{\ell},2^m - 1].$

For each fixed $\ell$ we take the signs $\epsilon_k$ to be constant on each of these blocks – they can only differ if they belong to distinct blocks of length $2^{\ell}$. Averaging over all these possible choices we get therefore that

$\displaystyle \Big\| \Big(\sum_{j = 0}^{2^{m - \ell} - 1} |F^{\ell}_j|^2 \Big)^{1/2} \Big\|_{L^2} \leq C. \ \ \ \ \ (3)$

Now we claim that we can bound the supremum $\sup_{0\leq n < N} \Big| \sum_{k=0}^{n} f_k \Big|$ pointwise by the sum of the ${m}$ square functions $\Big(\sum_{j = 0}^{2^{m - \ell} - 1} |F^{\ell}_j|^2 \Big)^{1/2}$; this will immediately conclude the proof by triangle inequality. The key to this last claim is again the fact that any integer ${n}$ can be written in binary form, that is as a sum of dyadic blocks. I will explain this formally, but the notation is a bit unwieldy: draw dyadic blocks on a line and the argument will be crystal clear. So, any $n \in \{0, \ldots, 2^m - 1 \}$ can be written uniquely as

$\displaystyle n = 2^{a_1} + 2^{a_2} + \ldots + 2^{a_i}$

for a sequence of integers $a_1 > a_2 > \ldots > a_i \geq 0$. Define the integer quantities $n_j$ as follows (we will need these to write the indices in $F^{\ell}_j$ correctly):

\displaystyle \begin{aligned} n_1 = & 0, \\ n_2 = & 2^{a_1} 2^{-a_2}, \\ n_3 = & (2^{a_1} + 2^{a_2}) 2^{-a_3}, \\ n_4 = & (2^{a_1} + 2^{a_2} + 2^{a_3}) 2^{-a_4}, \\ & \vdots \\ n_i = & (2^{a_1} + 2^{a_2} + \ldots + 2^{a_{i-1}})2^{-a_i}. \end{aligned}

With this notation, you can verify easily that since

$\displaystyle [2^{a_1} + \ldots + 2^{a_{j-1}}, 2^{a_1} + \ldots + 2^{a_{j}}) = 2^{a_{j}}[n_j, n_j + 1)$

we have

\displaystyle \begin{aligned} \sum_{k=0}^{n-1} f_k = & \sum_{k \in [0,2^{a_1})} f_k + \sum_{k\in[2^{a_1}, 2^{a_1} + 2^{a_2})} f_k + \ldots + \sum_{k \in [n - 2^{a_i}, n)} f_k \\ = & F^{a_1}_{n_1} + F^{a_2}_{n_2} + \ldots + F^{a_i}_{n_i}. \end{aligned}

By triangle inequality we have therefore

\displaystyle \begin{aligned} \Big| \sum_{k=0}^{n-1} f_k \Big|\leq & |F^{a_1}_{n_1}| + |F^{a_2}_{n_2}| + \ldots + |F^{a_i}_{n_i}| \\ \leq & \sum_{\ell = 0}^{m-1} \Big(\sum_{j = 0}^{2^{m - \ell} - 1} |F^{\ell}_j|^2 \Big)^{1/2}, \end{aligned}

and since the RHS does not depend on ${n}$ we can take the supremum and conclude the claim. $\Box$

This concludes part I of this two-part post. In the next part we will see how a similar (but crazier) idea to the one above – that of representing indices in positional notation – allows one to prove Lemma 1.

Appendix:
In this appendix we discuss how to obtain pointwise-a.e. convergence results in a dynamical system following Bourgain’s oscillation method. It is all nicely explained in Hands’ master thesis, and I am taking the presentation below from there (all eventual mistakes are mine though).
Suppose that the averages $A_N f$ indeed do not converge everywhere. This means that the set

$\displaystyle \{x \in X \text{ s.t. } \limsup_{N} A_N f(x) - \liminf_N A_N f(x) > \lambda \}$

has positive $\mu$-measure for any sufficiently small $\lambda$. In particular, by definition of $\limsup, \liminf$, for any ${x}$ in this set there are increasing sequences $(N_j)_j, (M_j)_j$ such that for all ${j}$ we have $|A_{N_j} f(x) - A_{M_j}f(x)| > \lambda$ – and in general these sequences depend on ${x}$ itself. We can make this a bit more explicit by saying that there are $\varepsilon, \delta > 0$ such that for the set

$\displaystyle E_\varepsilon := \{ x \in X \text{ s.t. } \forall M \text{ there exist } N, N' > M \text{ s.t. } |A_{N}f(x) - A_{N'}f(x)|>\varepsilon\}$

we have $\mu(E_\varepsilon) > \delta$. It doesn’t matter what these numbers are, as long as they are positive. Notice that for any ${x \in E_\varepsilon}$ we can find an increasing sequence $(M_j)_j$ such that $\sup_{M_j \leq N \leq M_{j+1}} |A_{N}f(x) - A_{M_{j+1}}f(x)|>\varepsilon$ for any ${j}$ (and this is nothing but a reformulation of the condition above). However, this sequence again depends heavily on ${x}$ in general. We claim that, possibly losing a few points, we can choose a fixed sequence for which the above holds uniformly. We construct this sequence element by element as follows. Fix some value for $N_0$ and assume that we have chosen all elements up to $N_{j-1}$ included; we describe how to choose $N_j$. Pick some $x \in E_\varepsilon$ and observe that by definition you can find $P,Q > N_{j-1}$ such that $|A_{P}f(x) - A_{Q}f(x)| > \varepsilon$; we will impose that $N_j > \max\{P,Q\}$. Observe that for any $N_j$ we eventually pick subject to this constraint, there exists an index $R$ in $[N_{j-1}, N_j)$ such that $|A_{R}f(x) - A_{N_{j}}f(x)| > \varepsilon / 2$ (otherwise we would have $|A_{P}f(x) - A_{Q}f(x)| \leq \varepsilon$ by triangle inequality). Now we define the auxiliary sets

$\displaystyle F_{N_{j-1}, M} := \{x \in E_\varepsilon \, : \, \sup_{N \in [N_{j-1}, M)}|A_N f(x) - A_{M}f(x)| > \varepsilon / 2 \},$

which are non-empty if $M > \max\{P,Q\}$ by the previous discussion. We have that $\mu(F_{N_{j-1},M}) \to \mu(E_\varepsilon)$ as $M \to \infty$, and therefore if we choose $N_j$ sufficiently large we will have $\mu(F_{N_{j-1}, N_j}) > \delta$. This is our choice of $N_j$, and we pass to the construction of the next element in the sequence.
At the end of this procedure we have obtained an increasing sequence $(N_j)_{j\in \mathbb{N}}$ with the property that for every $j$

$\displaystyle \mu(\{ x \in X \,:\, \sup_{N_{j} \leq N \leq N_{j+1}} |A_{N} f(x) - A_{N_{j+1}}f(x)| > \varepsilon / 2 \}) > \delta.$

An immediate consequence of this fact is that

$\displaystyle \Big\| \sup_{N_{j} \leq N \leq N_{j+1}} |A_{N} f(x) - A_{N_{j+1}}f(x)| \Big\|_{L^2} \gtrsim \varepsilon \delta^{1/2},$

which in turn shows that for any $J$

$\displaystyle \frac{1}{J} \sum_{j=1}^{J} \Big\| \sup_{N_{j} \leq N \leq N_{j+1}} |A_{N} f(x) - A_{N_{j+1}}f(x)| \Big\|_{L^2} \gtrsim \varepsilon \delta^{1/2} \ \ \ \ \ (4)$

as well. The precise lowerbound is irrelevant: what is important about this fact is that we have shown that there is an increasing sequence for which the latter quantity is bounded away from 0. So if we prove that for any increasing sequence the above must actually tend to 0, we will have reached a contradiction, showing that the sets $E_{\varepsilon}$ have all zero measure and convergence happens a.e.. This is almost what Bourgain proves, but there is a further step (necessary or not, I don’t know, but I suspect it’s important): it suffices to treat the case where the suprema are restricted to a lacunary sequence with small lacunary constant. Precisely, for any $\delta > 0$ (no relation to the previous one!), define the set

$\displaystyle Z_\delta := \{ \lfloor (1 + \delta)^{n} \rfloor : n \in \mathbb{N} \};$

Bourgain shows that

$\displaystyle \sum_{j=1}^{J} \Big\| \sup_{\widetilde{N} \in Z_\delta \cap [N_{j}, N_{j+1})} |A_{\widetilde{N}} f(x) - A_{N_{j+1}}f(x)| \Big\|_{L^2} \leq o(J) \|f\|_{L^2}, \ \ \ \ \ (5)$

and this suffices to disprove (4) above. The reason is that we can always truncate our function and assume that $f\in L^\infty$; and if we normalise $\|f\|_{L^\infty} \leq 1$ we see that if $\widetilde{N} \in Z_\delta$ is such that $\widetilde{N} \leq N \leq (1 + \delta)\widetilde{N}$ we have

\displaystyle \begin{aligned} |A_N f(x) - A_{\tilde{N}} f(x)| \leq & \Big|\Big(\frac{1}{N} - \frac{1}{\widetilde{N}}\Big) \sum_{n=1}^{\widetilde{N}} f(T^{p(n)}x) \Big| + \frac{1}{N} \Big|\sum_{n=\widetilde{N}}^{N} f(T^{p(n)}x)\Big| \\ \leq & \Big|\Big(\frac{1}{N} - \frac{1}{\widetilde{N}}\Big) \Big| N + \frac{N - \widetilde{N}}{N} \lesssim \delta. \end{aligned}

Thus we can replace any ${N}$ by a member of $Z_{\delta}$ for an $O(\delta)$ price. If the constants are chosen properly small, this is not an issue and one can indeed disprove (4) using (5) above.

Footnotes:
1: The map ${T}$ is said to be weakly mixing if $\mu(A \cap T^{-n}B)$ converges to $\mu(A)\mu(B)$ in the Cesàro sense, that is $\lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} |\mu(A \cap T^{-n}B) - \mu(A)\mu(B)| = 0$.
2: Lacey’s maximal operator is as follows: let ${\varphi}$ denote a function such that $\widehat{\varphi}$ is a bump function with a minimum of regularity (more precisely, $|\varphi(x)|\lesssim |x|^{-3}$, $|\widehat{\varphi}(\xi) - 1|\lesssim |\xi|$, $|\widehat{\varphi}(\xi)|\lesssim |\xi|^{-2}$). Let furthermore $\delta > 0$ be fixed and let $\Lambda$ be a finite set of frequencies contained in the lattice $\delta\mathbb{Z}$. Lacey’s maximal operator $B_\Lambda$ is defined as

$\displaystyle B_\Lambda f(x):= \sup_{j} \Big| \sum_{\lambda \in \Lambda} \int \widehat{\varphi}(2^{j}(\xi - \lambda)) \widehat{f}(\xi) e^{2\pi i \xi x} d\xi\Big|,$

where the supremum is restricted to those ${j}$ such that $2^{-j} \leq \min \{ |\lambda - \lambda'| : \lambda,\lambda' \in \Lambda, \lambda \neq \lambda' \}$. This operator actually satisfies an even better estimate, namely $\|B_\Lambda f\|_{L^2} \lesssim \log\log(\delta^{-1} + \#\Lambda) \|f\|_{L^2}$.
3: The first time you see this it looks like we are doing something illegal, using the very inequality we are trying to prove! However, as remarked at the beginning of the proof, the inequality is true with some large constant, and therefore there is an optimal constant for which it holds. We are allowed to use the inequality with the best constant even though we don’t yet know what that constant is!