This is the second and final part of an entry dedicated to a very interesting and inventive trick due to Bourgain. In part I we saw a lemma on maximal Fourier projections due to Bourgain, together with the context it arises from (the study of pointwise ergodic theorems for polynomial sequences); we also saw a baby version of the idea to come, that we used to prove the Rademacher-Menshov theorem (recall that the idea was to represent the indices in the supremum in their binary positional notation form and to rearrange the supremum accordingly). Today we finally get to see Bourgain’s trick.
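For readers who want to see the combinatorial skeleton of that binary rearrangement before we move on, here is a small illustrative sketch (my own, not from the original argument): writing an index N in binary splits the interval [0, N) into at most log₂(N) + 1 dyadic blocks, which is exactly what lets one trade a supremum over N for logarithmically many suprema over dyadic generations.

```python
def dyadic_decomposition(n):
    """Split [0, n) into disjoint dyadic blocks [m * 2^j, (m + 1) * 2^j),
    one block per 1-bit in the binary expansion of n, largest scale first."""
    blocks, start = [], 0
    for j in reversed(range(n.bit_length())):
        if (n >> j) & 1:
            blocks.append((start, start + 2 ** j, j))  # (left, right, scale)
            start += 2 ** j
    return blocks

# 13 = 0b1101, so [0, 13) splits into blocks of lengths 8, 4 and 1:
print(dyadic_decomposition(13))  # → [(0, 8, 3), (8, 12, 2), (12, 13, 0)]
```

The number of blocks is bounded by the number of binary digits of N, which is the source of the logarithmic loss.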
Before we start, recall the statement of Bourgain’s lemma:
where the supremum is restricted to those with being the smallest integer such that .
Here we are using the notation in the statement in place of the expanded formula . Observe that by the definition of we have that the intervals are disjoint (and is precisely maximal with respect to this condition).
We will need to do some reductions before we can get to the point where the trick makes its appearance. These reductions are the subject of the next section.
3. Initial reductions
A first important reduction is that we can safely replace the characteristic functions by smooth bump functions with comparable support. Indeed, this is the result of a very standard square-function argument which was already essentially presented in Exercise 22 of the 3rd post on basic Littlewood-Paley theory. Briefly then, let be a Schwartz function such that is a smooth bump function compactly supported in the interval and such that on the interval . Let (so that ) and let for convenience denote the difference . We have that the difference
is an bounded operator with norm (that is, independent of ). Indeed, observe that , and bounding the supremum by the sum we have that the norm (squared) of the operator above is bounded by
where the summation in is restricted in the same way as the supremum is in the lemma (that is, the intervals must be pairwise disjoint). By an application of Plancherel we see that the above is equal to
but notice that the functions have supports disjoint in , and therefore the multiplier satisfies in a neighbourhood of , and vanishes outside such a neighbourhood. A final application of Plancherel allows us to conclude that the above is bounded by by orthogonality (these neighbourhoods being all disjoint as well).
By triangle inequality, we see therefore that in order to prove Lemma 1 it suffices to prove that the operator
is bounded with norm at most .
The next reduction consists of the observation that, since the frequency intervals appearing in the above operator (as supports of the functions) are all disjoint, it will suffice to prove a vector-valued inequality for the vector-valued functions (taking values in the finite-dimensional vector space ). Indeed, observe that by the standard properties of the Fourier transform we have
and by the disjointness of the intervals we can replace in the above expression by its frequency projection onto the interval . Since by orthogonality and Plancherel , it will suffice to prove [1]
for generic -valued functions .
At this point we worsen the operator a bit by replacing the discrete supremum with a continuous one; this looks bad in general, but in this case it is harmless and will actually be convenient in the following. So, we let , so that , and let in analogy with the previous notation. If we replace by the new continuous parameter we see that it suffices to prove
We are now almost ready; we just need one final reduction. This is achieved through another trick that is ubiquitous in Bourgain’s work, which consists of introducing into the particular inequality we are trying to prove some parameter that does not alter the nature of the inequality itself, and then averaging in this parameter, thus “massaging” the expression into a more manageable one – very roughly speaking. This is another one of Bourgain’s crazy tricks from his magical top hat: while it is nothing too fancy, it is always somewhat unexpected (to me). The averaging trick would deserve another post or series of posts by itself, but I don’t think I would be up to the task. We will see the trick again in the future, when I talk about Bourgain’s work on the optimal asymptotic behaviour of the constants in the Littlewood-Paley inequalities.
Let’s see this final reduction. The starting point is the observation that since we have the restriction we have for any such that , and for this reason we might as well assume that all the functions are frequency supported in the same interval, without loss of generality. This seemingly innocuous assumption has an interesting consequence though: if we let denote the translation operator , then we can translate any by some amount while not altering the distribution of the mass too much. More precisely, if we have that
indeed, by Plancherel the LHS is and by the assumption that we have that if .
Now we take this observation and combine it with the bootstrapping-inequality trick that we already saw in the previous post, in Bourgain’s proof of the Rademacher-Menshov theorem, to argue that in the inequality that we have to prove we can safely replace with its translate : let denote the best constant for which inequality (1) holds (this constant is certainly finite, since for example ); by triangle inequality
and applying (1) with the best constant to the second term in the RHS we can bound it by , by (2). With this bound, observe that now there is a single remaining parameter in the RHS above, and we can therefore average in it! Since translations and convolutions commute with each other, we have , and with a change of variables () we have for the norm
Now we average this expression in , over the interval , obtaining by Fubini
Remark: Notice that the inner expression, when is fixed, looks almost like a Fourier series in with variable coefficients, and this will be important later. In Lacey’s case, which we mentioned in the previous post, when for some , the expression above is an actual Fourier series, and this helps simplify the treatment of the operator a lot.
to conclude that , as desired. Notice that we have even taken the liberty of extending the supremum to the full range at this point – the restriction is not meaningful anymore now, as the parameter has been incorporated in the inner norm. We have thus completed the series of preliminary reductions.
4. Bourgain’s trick
That was quite the tour, but we are finally ready to discuss the trick that gives its title to these posts. So, we are trying to prove inequality (3) above, whose LHS involves two distinct norms. As remarked above, the object that we are evaluating looks a bit like a Fourier series in the variable with variable coefficients, parametrised in (and ). In particular, the set of coefficients forms a vector in the -dimensional space , more precisely (ignoring the modulation) the vector . As ranges over this vector will describe some sort of trajectory in space; we denote the resulting set of points by , namely
Now, when we take the supremum in (or we linearise the supremum, equivalently), roughly speaking, there will be for any a point in that realises the supremum [2]. We would like to proceed in a manner similar to the one we used in the proof(s) of the Rademacher-Menshov theorem, that is, we would like to represent the indices in which we are taking the supremum in a positional-notation fashion, so that we can partition them and obtain an at most logarithmic loss (in some parameter) in the constant. But here we run into a problem: in the Rademacher-Menshov theorem the indices were integers and we could naturally express them in binary positional-notation, while now the index is a point in the set , a set which not only is not discrete but that might also have an arbitrarily irregular structure. It would seem like the positional-notation trick cannot be carried over to this situation – that is, unless you are Bourgain.
Bourgain argues that there is indeed a way to represent points of in that fashion, through the use of particular coverings by balls. More precisely, for any , there is a covering of by balls of radius and in particular there is a covering with a minimal number of such balls (that a finite number of balls suffices to cover the set is clear: indeed, and , and therefore it is finite almost everywhere). This minimal number is an important quantity, and is known as the covering number or entropy number of the set for the given radius. More precisely, if is a positive real and a set, we denote by the minimal number of balls of radius that are needed to cover the set , and we call the -entropy number of .
Actually, for later convenience, we make a small modification to the definition of : if , we stipulate that . Indeed, when considering coverings of sets by dyadic balls, we can always stop once we reach a radius so large that one ball is sufficient to cover the whole set.
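As a concrete toy illustration of covering numbers (a sketch of mine, not part of Bourgain's argument): for a finite set of points one can estimate N(ε, A) greedily. The greedy count is at least the true covering number, and since the chosen centres are pairwise ε-separated, it is at most N(ε/2, A) by the usual packing/covering comparison.

```python
import math

def covering_number_greedy(points, eps):
    """Greedy upper bound for the covering number N(eps, A) of a finite
    point set: repeatedly take an uncovered point as the centre of a ball
    of radius eps, until every point is covered.  The centres produced are
    pairwise more than eps apart, so the count lies between N(eps, A)
    and N(eps/2, A)."""
    remaining = list(points)
    count = 0
    while remaining:
        centre = remaining[0]
        remaining = [p for p in remaining if math.dist(p, centre) > eps]
        count += 1
    return count

# toy set: 101 points along a segment of length 10 in the plane
A = [(0.1 * k, 0.0) for k in range(101)]
for eps in (4.0, 2.0, 1.0, 0.5):
    print(eps, covering_number_greedy(A, eps))
# roughly length/eps balls are needed; with the convention above one
# would simply set N(eps, A) = 1 once eps exceeds the diameter of A
```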
If we consider all minimal coverings of for radii (minimal in the sense that each consists of balls), then we see that to specify a point in we can equivalently specify a sequence of balls of decreasing (dyadic) radius in which the point is contained! This is one of those cases where a picture really is worth a thousand words, so here it is:
The squiggly line represents the set , as indicated. The blue balls have a certain radius , the red balls have radius and the green ones have radius . I have stopped there but of course one would continue by covering with balls of radius and so on. You can see that for each colour, the balls of that colour cover the set , and that the covering is “optimal” (in the sense that with any smaller number of balls you could not cover the set). On the set you can see that I have highlighted the point – to identify that point, I can specify one blue ball, one red ball, one green ball, and so on. Also, notice that is contained in two green balls: indeed, the sequence of balls that identify any given point is not unique! However, ambiguities of this sort resolve once the decreasing sequence of balls gets sufficiently long, so to speak; and the non-uniqueness is not really an issue: we only care about the fact that we have at least one way to represent the point as the limit of a sequence of balls. You can start to see that, if we have a way to control the entropy numbers, then we can think about pulling the same trick we used to prove the Rademacher-Menshov theorem. Indeed, this is what the phenomenal idea above will allow us to do – we will reduce to studying inequalities for the entropy numbers (see the next section).
However, we need to be a little more rigorous in order to proceed. The exact claim is the following:
Positional notation for points in : There exists a sequence of (finite) subsets of the Minkowski difference such that
- for each vector we have for its length ;
- for each , we have ;
- for every , there exists a sequence of vectors such that and , where is a fixed element of .
This is not hard to prove. Say that for each we fix an optimal covering of , that is . Now, impose on the collections a tree structure: that is, each ball will have a unique parent , which is a ball with the property that . There are clearly many different tree structures that we can give to this collection, but any of them will do.
For each ball there is a point (else the ball wouldn’t be in the collection); this gives rise to a collection of elements of that we might denote by . Notice that for any sufficiently large we will have (by our modified definition), and therefore we can actually truncate our sums to the smallest such . The union of these collections clearly inherits the tree structure that we imposed on . Define the collections above as
with the modification that is given by the elements for and is an element of that we fix arbitrarily. Then we claim that these collections satisfy the above properties. Indeed, we have that ; moreover, . Finally, given a point , we can find a sequence of balls such that (that is, the sequence is ordered with respect to the tree structure and maximal) and such that for any (this is the trickiest point in this argument, but such a collection can be built starting from a sequence of balls that contain without too much effort); it is then clear that as , and we can write as a telescoping series , which converges to by construction, and each term in the series belongs to as claimed.
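Here is a small numerical stand-in (one-dimensional, with dyadic grids playing the role of the optimal coverings; the function name and setup are mine, purely for illustration) for the telescoping representation just described:

```python
import math

def telescoping_representation(x, depth):
    """1-D toy model of the positional notation: at scale j the grid of
    spacing 2^-j stands in for the centres of an optimal covering by
    balls of radius 2^-j, and c_j is the centre nearest to x.  Then
    x = c_0 + sum_j (c_{j+1} - c_j) + (a tiny error), with geometrically
    shrinking increments between consecutive centres."""
    radii = [2.0 ** (-j) for j in range(depth)]
    centres = [r * round(x / r) for r in radii]    # |x - c_j| <= r_j / 2
    increments = [centres[j + 1] - centres[j] for j in range(depth - 1)]
    reconstruction = centres[0] + sum(increments)  # telescopes to the finest centre
    return centres, increments, reconstruction

centres, increments, approx = telescoping_representation(math.pi, 20)
print(abs(approx - math.pi))  # at most 2^-20 (half the finest radius)
```

Each increment has size at most r_j/2 + r_{j+1}/2, which is the quantitative content of the second property in the claim above.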
So, now that we have sorted out the claim and we have this beautiful idea of representing points on the sets in a positional-notation fashion using the optimal coverings of the set by dyadic balls, we proceed in a way that is analogous to the one we used in proving the Rademacher-Menshov theorem. Indeed, by the fact that any can be represented as with (we ignore because it is harmless), we have
where denotes the -th component of the vector . Then remember that in (3) we have to evaluate the norm of the above expression, and by triangle inequality we therefore reduce to estimating for any the quantity
Observe that for the inner expression we can give two simple bounds: the first bound is simply the observation that has absolute value equal to 1, and therefore we can bound the expression simply by the length of , that is (by Cauchy-Schwarz)
the second bound is obtained instead by replacing the maximum by the sum over , that is
The first bound gives us simply
notice that the summand on the RHS looks a lot like the norm of a Fourier series, and therefore we are tempted to use Plancherel to estimate it in terms of . As observed before, the expression is unfortunately not exactly a Fourier series in general, because the frequencies don’t necessarily belong to a common grid; however, we have assumed some separation between the frequencies, and an inequality of Plancherel-type still holds:
Indeed, if denotes a smooth bump function supported on a slight enlargement of and identically 1 on it, we can bound the LHS by
applying Plancherel and expanding the square, we see that it suffices to bound
by (the square of) the RHS of (5). However, the smoothness and the support of imply that , and therefore the integral is bounded by, say, . Since , we see that , and therefore we can conclude by Cauchy-Schwarz (in the difference ) that we have
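The almost-orthogonality powering this Plancherel-type inequality can be checked numerically in a toy case. In the sketch below (my own setup, purely illustrative) a Gaussian plays the role of the smooth bump — the actual argument uses a compactly supported bump — and the pairing of two modulated copies decays extremely fast in the frequency separation:

```python
import math, cmath

def smooth_pairing(xi, eta, n=4000, T=6.0):
    """Approximate |\\int e^{2 pi i (xi - eta) x} phi(x) dx| for the
    Gaussian bump phi(x) = exp(-pi x^2), by the trapezoid rule on [-T, T].
    The exact value is exp(-pi (xi - eta)^2), so the pairing is essentially
    zero as soon as the two frequencies separate."""
    h = 2 * T / n
    total = 0j
    for k in range(n + 1):
        x = -T + k * h
        w = h if 0 < k < n else h / 2  # trapezoid-rule weights
        total += w * cmath.exp(2j * math.pi * (xi - eta) * x) * math.exp(-math.pi * x * x)
    return abs(total)

for gap in (0.0, 1.0, 2.0, 3.0):
    print(gap, smooth_pairing(0.0, gap))
# the printed values shrink like exp(-pi * gap^2): well-separated
# frequencies are almost orthogonal against a smooth bump
```

Summing the off-diagonal terms of such a rapidly decaying pairing is exactly what makes the Cauchy-Schwarz step above affordable.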
Summarising the above discussion, we have shown that the inner expression in (3) is bounded as follows:
(observe that the sum is finite, since for large enough we will have , according to our modification of the definition of entropy numbers).
Notice that, by the monotonicity of the entropy numbers, the RHS in the above formula is bounded by
If we prove this inequality, we will have proven (3), which means (by the reductions) that we will also have proven (1), which in turn means that we will have proven Bourgain’s lemma. We have come very far, carried by Bourgain’s tricks and beautiful ideas!
In the next and final section, I will sketch for completeness how this last inequality for entropy numbers is proven; however, the presentation of Bourgain’s trick is to be considered concluded, since it will not appear again. I invite you to appreciate how similar the idea here was to the idea that was used to prove the Rademacher-Menshov theorem (particularly the second more general version of the theorem), and to contemplate how amazing it is that we have been able to replicate it in such a different setting.
5. Sketch of the proof of the entropy numbers inequality
So, we have seen that through a series of standard arguments and some inventive tricks Bourgain has managed to reduce the proof of Lemma 1 to a certain inequality for entropy numbers (inequality (6)). How is this inequality proved in turn? Bourgain deduced it as a consequence of the Lépingle inequality (an inequality for the -variation of martingale differences when > 2), in an almost straightforward way. Here I will outline the argument but refrain from entering into too much detail since this post is already long enough (and the proof is quite readable in Section 3 of the original paper).
From now on we will write in place of for convenience. In what follows, we forget about the dimensionality of the space where the function takes values: in other words, there is no longer the restriction that the summation in be limited to ; in general we might even replace the sequence by a vector-valued function taking values in a (separable) Hilbert space and the proof would be the same. The only dependence of the estimates on will come from the one in .
In estimating the integral in (6), we need to massage the expression into something more manageable. A first thing to notice is that when is sufficiently large then the expression will be identically zero; can we give a quantitative bound for how large needs to be for this to happen? Well, yes: the diameter of is clearly controlled by , and therefore we can restrict the integration to the interval . Observe that is a nice function, because each summand is bounded by the Hardy-Littlewood maximal function of the corresponding , and therefore
a bound that does not depend on at all (the implicit constant depends on though). This is good when : indeed, in this case we can bound and therefore
For the remaining part of the integral, we can replace the minimum by a geometric average of the two terms, which is usually a good idea: we have for any
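The elementary fact behind this geometric-averaging step, written in generic notation (the symbols here are mine, since it is a standard observation), is that for $a, b \ge 0$ and any $\theta \in [0,1]$,

```latex
\min(a,b) \;=\; \min(a,b)^{\theta}\,\min(a,b)^{1-\theta} \;\le\; a^{\theta}\, b^{1-\theta},
```

which follows simply from $\min(a,b) \le a$ and $\min(a,b) \le b$.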
There is an unpleasant dependence on in this expression that we can remove if we choose sufficiently close to 2 that : indeed, this forces . However, this choice shifts the dependence on on the exponent and thus we will need to be careful with the dependence of any estimate in . From the above bound for , we see that to conclude (6) it suffices to show the following bound for the norm of :
Indeed, observe that if then ; it is this factor that gives us the second logarithmic factor in (6).
The point is now to connect the inequality to be proven to variation quantities. This is actually not hard, in that the expression is already very close to a variation: indeed, observe that there must be at least points on that are at a distance of at least from each other (for otherwise we could cover with fewer than balls of radius ). Each such point, being an element of , is labeled by a value (the parameter in which we take the supremum) and we have by construction that
therefore we can write
This last expression on the RHS is precisely the -variation of the sequence along the sequence . We introduce suitable notation for these quantities:
Definition: Let and let be a continuously indexed sequence in a Hilbert space with norm . The -variation of the sequence is
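To make the definition concrete, here is a brute-force computation of the q-variation of a short scalar sequence (my own illustrative sketch; the search is exponential in the length, so it is only usable on tiny examples). Note how, for jumps of equal size, the value can only decrease as q grows:

```python
from itertools import combinations

def q_variation(seq, q):
    """V_q of a finite scalar sequence: the supremum, over all increasing
    choices of sample indices t_0 < t_1 < ... < t_k, of
    (sum_i |a_{t_{i+1}} - a_{t_i}|^q)^(1/q).  Brute force over all index
    subsets, so only suitable for very short sequences."""
    best = 0.0
    for k in range(2, len(seq) + 1):
        for idx in combinations(range(len(seq)), k):
            s = sum(abs(seq[idx[i + 1]] - seq[idx[i]]) ** q
                    for i in range(k - 1))
            best = max(best, s ** (1.0 / q))
    return best

a = [0.0, 1.0, 0.0, 1.0, 0.0]
print(q_variation(a, 2))  # → 2.0, i.e. (4 * 1^2)^(1/2)
print(q_variation(a, 4))  # → 4^(1/4) ≈ 1.414..., smaller: larger q is weaker
```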
With this notation, the observation above is that
where denotes the space where the vector-valued function takes values. So it will suffice to control the -variation instead in order to prove Proposition 3. In particular, since > 2, the vector-valued inequality follows from the scalar-valued one below:
[The constant implicit in is .]
How does one go about proving such an inequality? Well, it is good to start from a simple case, and since is an average of we might as well replace it with the simplest average we can think of, that is . We let for convenience, so that the average is . We claim that if we can prove Proposition 3′ with replaced by , then we can prove it for any . Indeed, the reason is very simple: since certainly vanishes at , we can write it as averages of , and more precisely
and again we can conclude by convexity, that is by the fact that > 2.
At this point, anyone with a decent knowledge of martingale theory would recognise that the inequality that we are trying to prove is very similar to a well known inequality for the -variation of martingale differences, the aforementioned Lépingle inequality:
Lépingle inequality: Let > 2 and let be an increasing sequence of sigma-algebras on a probability space and let . Then we have
Indeed, you can see how this inequality is exactly the discrete version of the one we want to prove, and it is therefore reasonable to assume that the inequality we want will follow from the one of Lépingle; this is indeed the case.
We do not prove the Lépingle inequality here (though it is an easy consequence of the Doob oscillation inequality, which itself is an easy consequence of the Burkholder-Davis-Gundy inequality for martingale square functions); maybe in a future post. Now, how does one deduce Proposition 3′ from Lépingle? This is the most technical part of the argument, actually, and I will only briefly comment on the proof. First of all, one can argue that if is the Poisson kernel for the upper half-plane () then the Lépingle inequality implies
this is because of the connection between the Poisson semi-group and the Brownian martingale. So we are a bit closer to the inequality for , but not there yet. The final step is to consider instead the difference, that is , and prove another inequality for the -variation with respect to this kernel . This kernel satisfies favourable Fourier estimates, in particular one can verify easily that
The strategy is then to estimate the 2-variation [3] of by dividing it into a lacunary part where we have long variation jumps (from to ) and a local part where we have short variation jumps (that is, restricted to the dyadic interval ). This is the same strategy that was used in Kovač’s argument for maximal Fourier restriction and is a very typical strategy in dealing with variation estimates (it was introduced and perfected by Oberlin, Seeger, Tao, Wright,…). As is to be expected, the lacunary part of the 2-variation is easily estimated by Plancherel and the enemy is the local part. After a dyadic frequency localisation of , matters are reduced to a trick that is essentially the same one we used in proving the boundedness of the spherical maximal function! One thus ends up estimating norms of , and here is where those good decay estimates for really come in handy. Using essentially the same Sobolev-type inequality that we used there, one can optimally balance two estimates and obtain the desired control of the 2-variation (obviously, there is no dependence on in this case). It is an interesting proof for us, because it uses two ideas that we have already seen in this blog; but this post is already long enough, and therefore I refrain from writing out the details.
This concludes our first detour into the brilliant mind of the late Jean Bourgain. In trying to explain the idea behind one of his beautiful tricks, we have come across many others that are equally interesting. Bourgain’s papers are very hard to read in general, but each of them is a formative experience in its own right, as we have tried to convey; one truly feels enriched after having read one.
[1]: the modulation has been absorbed into the ‘s. Notice that we are not making any assumptions on the frequency support of the functions .
[2]: Technically, since the supremum is not necessarily a maximum, it would be best to look for a point that realises at least of the supremum, or equivalently, to linearise the supremum; but we will gloss over such technicalities.
[3]: Observe that since for , we always have when > 2.