The Aquila consortium
The Aquila consortium aims at understanding the Universe.
https://www.aquila-consortium.org/
Fri, 27 May 2022 17:22:51 +0300
Field-level inference on galaxy intrinsic alignment
<h1 id="overview">Overview</h1>
<p>A common assumption in weak lensing studies has been that galaxy shapes are on
average uncorrelated. However, during the formation and evolution of galaxies,
anisotropic stress exerted by the large-scale structure on galaxies can affect
their shape. This process results in a coherent alignment of galaxy shapes with
the large-scale tidal field, known as intrinsic alignment. Therefore, accurate
inferences of its amplitude are of paramount importance, in order to avoid
biasing cosmological conclusions drawn from weak lensing analyses. Due to the
mechanism by which the effect arises, inferring the intrinsic alignment
amplitude can constrain the response of galaxy shapes to external structures, in
the context of galaxy formation. Further, it serves as a late-time cosmological
probe, since galaxy shapes ultimately correlate with the large-scale dark matter
density field. In particular for elliptical galaxies, the correlation between
galaxy shapes and the large-scale tidal field can be modeled as a linear
function. In our latest publication,<sup id="fnref:paper" role="doc-noteref"><a href="#fn:paper" class="footnote" rel="footnote">1</a></sup> we constrain the linear alignment
model using galaxy shapes from the LOWZ galaxy sample and tidal fields
constrained with the LOWZ and CMASS samples of the SDSS-III BOSS survey.<sup id="fnref:sdss" role="doc-noteref"><a href="#fn:sdss" class="footnote" rel="footnote">2</a></sup></p>
<h1 id="the-linear-alignment-amplitude">The linear alignment amplitude</h1>
<p>As galaxy shapes are affected by collapsing structures, which carry information
on the initial conditions of the Universe, the intrinsic alignment signal is
expected to be scale-independent and persist up to linear scales. For this
reason, we probe the intrinsic alignment amplitude as a function of scale. To
this end, we filter the original tidal fields with a top-hat filter in Fourier
space. As a result, at any given scale, we remove contributions from smaller
scales to avoid contamination from unmodeled processes, due to the resolution
limit of the tidal fields. In the figure below, we present our results on the
linear alignment amplitude, \(A_I\), as a function of scale \(R\). The yellow
window indicates scales smaller than the original resolution of the tidal
fields.</p>
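<p>The filtering step described above can be sketched as follows. This is a minimal illustration, not our production code: the function name, the periodic cubic box, and the cut-off convention \(k_{\max} = 1/R\) are all assumptions made for the example.</p>

```python
import numpy as np

def tophat_filter_fourier(field, box_size, R):
    """Remove all Fourier modes with wavenumber |k| > 1/R from a periodic 3D field.

    field    : real array of shape (N, N, N)
    box_size : side length of the box (same length unit as R)
    R        : filter scale; the cut-off k_max = 1/R is an assumed convention
    """
    n = field.shape[0]
    # Angular wavenumbers along one axis, in units of 1/length
    k1d = 2.0 * np.pi * np.fft.fftfreq(n, d=box_size / n)
    kx, ky, kz = np.meshgrid(k1d, k1d, k1d, indexing="ij")
    k_mag = np.sqrt(kx**2 + ky**2 + kz**2)

    # Top-hat window in Fourier space: 1 below the cut-off, 0 above it
    window = (k_mag <= 1.0 / R).astype(float)

    field_k = np.fft.fftn(field)
    return np.fft.ifftn(field_k * window).real

# Synthetic white-noise field standing in for one tidal-field component
rng = np.random.default_rng(0)
field = rng.standard_normal((32, 32, 32))
smoothed = tophat_filter_fourier(field, box_size=640.0, R=20.0)
```

<p>Because the window depends only on \(|k|\), the filtered field stays real and its mean is unchanged, while all structure on scales smaller than the chosen \(R\) is discarded.</p>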
<p class="figure wide whitebg"><img src="/assets/posts/ia/upload_7ec577343152126f16cb54e75717587c.png" alt="Intrinsic alignment measurement" />
<em>Linear alignment amplitude as a function of scale. The blue and yellow windows represent one standard deviation and scales smaller than the inference resolution, respectively.</em></p>
<p>Although all scales are consistent with a constant amplitude, the signal clearly
reaches a steady state at \(20h^{-1}\,\mathrm{Mpc} < R < 50 h^{-1}\,\mathrm{Mpc}\). At
\(R=20h^{-1}\,\mathrm{Mpc}\), we find \(A_I=3.19\pm0.80\), a \(4\sigma\) detection.
The uncertainty is dominated by processes other than intrinsic alignment. Those
processes are modeled as independent random samples drawn from a Gaussian distribution.
Since they are not known a priori, we constrain them jointly with the linear
alignment amplitude. In the figure below, we present our constraints on this
random uncertainty component, \(\sigma\), and at \(R=20h^{-1}\,\mathrm{Mpc}\) we
find \(\sigma=0.24\pm0.01\).</p>
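<p>To illustrate what constraining an amplitude jointly with a Gaussian scatter means, here is a minimal maximum-likelihood sketch for a linear model, shape = amplitude × tidal field + noise. It is a simplified stand-in for the field-level inference, and the function name and the synthetic mock data are assumptions made for the example.</p>

```python
import numpy as np

def fit_linear_alignment(tidal, shape):
    """Maximum-likelihood amplitude and Gaussian scatter for shape = A * tidal + noise."""
    tidal = np.asarray(tidal, dtype=float)
    shape = np.asarray(shape, dtype=float)
    # Best-fitting amplitude of the linear relation
    amp = np.sum(tidal * shape) / np.sum(tidal**2)
    # Scatter estimated from the residuals around the linear relation
    resid = shape - amp * tidal
    sigma = np.sqrt(np.mean(resid**2))
    # Uncertainty on the amplitude given that scatter
    amp_err = sigma / np.sqrt(np.sum(tidal**2))
    return amp, amp_err, sigma

# Synthetic mock data (not survey data): a weak linear signal buried in shape noise
rng = np.random.default_rng(1)
true_amp, true_sigma = 3.0, 0.24
tidal = 0.01 * rng.standard_normal(20000)
shape = true_amp * tidal + true_sigma * rng.standard_normal(tidal.size)

amp, amp_err, sigma = fit_linear_alignment(tidal, shape)
significance = amp / amp_err   # detection significance in units of sigma
```

<p>The same ratio applied to the measurement quoted above, \(3.19/0.80 \approx 4\), is what the \(4\sigma\) figure refers to.</p>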
<p class="figure wide whitebg"><img src="/assets/posts/ia/upload_cdaf3e99a9c7e71bd3ad96c6d923cc4d.png" alt="Intrinsic alignment inferred uncertainty" />
<em>Root mean square random galaxy shape noise as a function of scale. The blue and yellow windows represent 1 standard deviation and scales smaller than the inference resolution, respectively.</em></p>
<h1 id="evolution-with-galaxy-properties">Evolution with galaxy properties</h1>
<p>Brighter galaxies have been found to align more strongly, whereas an evolution of the
intrinsic alignment amplitude with redshift may point to different galaxy
formation scenarios. At the same time, potential evolution with galaxy color is
important to analyses considering a wide color range. As a result, we split our
galaxy sample into luminosity, redshift and color bins to study how the linear
alignment amplitude scales with these properties. In the figure below, the three
subplots from top to bottom show the evolution with luminosity, redshift and
color, respectively.</p>
<p class="figure wide whitebg"><img src="/assets/posts/ia/upload_7b904715bba8346acffbdfdd05e0ebaa.png" alt="Linear alignment amplitude for luminosity, redshift and color sub-samples" />
<em>a) The linear alignment amplitude as a function of smoothing scale for the brightest (L1) and faintest (L4) galaxy sub-sample. b) The linear alignment amplitude as a function of smoothing scale for two redshift sub-samples. Z1 covers the range 0.21 < z < 0.29 and Z2 the range 0.29 < z < 0.36. c) The linear alignment amplitude as a function of the smoothing scale for the bluest (C1) and reddest (C5) sub-sample.</em></p>
<p>We observe no significant correlation between luminosity and the linear
alignment amplitude. Given the narrow redshift and color ranges of our galaxy
sample, we also observe no correlation with these properties.</p>
<h1 id="outlook">Outlook</h1>
<p>Field-level approaches like the one we present here will allow improved
modeling of both intrinsic alignment and weak lensing, on the basis of the data likelihood and the physics
model. Ultimately, this work is a first step toward joint field-level inferences
of intrinsic alignment and weak lensing,<sup id="fnref:lensing" role="doc-noteref"><a href="#fn:lensing" class="footnote" rel="footnote">3</a></sup> particularly at high redshifts.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:paper" role="doc-endnote">
<p>Eleni Tsaprazi, Nhat-Minh Nguyen, Jens Jasche, Fabian Schmidt and Guilhem Lavaux, 2021, <a href="https://arxiv.org/abs/2112.04484">arXiv:2112.04484</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:paper" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:sdss" role="doc-endnote">
<p>G. Lavaux, J. Jasche and F. Leclercq, 2019, <em>Systematic-free inference of the cosmic matter density field from SDSS3-BOSS data</em>, <a href="https://arxiv.org/abs/1909.06396">arXiv:1909.06396</a> <a href="#fnref:sdss" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:lensing" role="doc-endnote">
<p>N. Porqueres, A. Heavens, D. Mortlock and G. Lavaux, <a href="https://doi.org/10.1093/mnras/stab204">MNRAS, 502 (2021), 3035–3044</a> <img class="inline-logo svg" src="/assets/images/newspaper-solid.svg" alt="journal" />, <a href="https://arxiv.org/abs/2011.07722">arXiv:2011.07722</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:lensing" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Thu, 09 Dec 2021 00:00:00 +0200
https://www.aquila-consortium.org/method/observations/ia.html
Is the speed of light energy dependent?
<h1 id="overview">Overview</h1>
<p>A fundamental assumption in our current theories of the Universe is that photons
always travel at the same speed, \(c\), independent of their energy. But this need
not be true. If the photon had a non-zero rest mass, then lower energy photons would
travel slower (\(v < c\)). Alternatively, it is expected that quantum fluctuations
of spacetime at high energies in so-called quantum gravity (QG) theories would make
spacetime appear “foamy”, and thus empty space would have an energy-dependent
refractive index. Or perhaps photons of different energy couple to gravity with
different strengths (and thus violate the weak equivalence principle), so that photons
of different energy travel differently through a gravitational field. In any one of these
cases, photons of different energies from a distant source would arrive at different times,
even if they were emitted simultaneously. Since the expected time delay increases with
distance travelled, by studying the energy-dependent arrival times (spectral lag) of photons
from sources at high redshift, we can place tight constraints on the quantum gravity length scale,
\(\ell_{\rm QG}\), the photon mass, \(m_\gamma\), or the different couplings of photons
to gravity at different energy, \(\Delta \gamma\). The high redshifts and short durations
of Gamma Ray Bursts (GRBs) are ideal for this, so this is what we consider here. For the
majority of Gamma Ray Bursts, high energy photons are detected before lower energy photons, which is
qualitatively the same as for a massive photon and some quantum gravity models,
and could thus provide evidence for such theories.</p>
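<p>As a rough illustration of the massive-photon case, the sketch below evaluates the arrival-time lag of a low-energy photon behind a high-energy one. It assumes a static, non-expanding space and \(m_\gamma c^2 \ll E\), so that \(v/c \simeq 1 - (m_\gamma c^2)^2 / (2E^2)\); a real analysis must integrate over the expansion history instead. The function name and the input values are chosen for illustration only.</p>

```python
C = 299_792_458.0                   # speed of light in m/s
MPC_IN_M = 3.0856775814913673e22    # one megaparsec in metres

def massive_photon_delay(mass_ev, e_low_ev, e_high_ev, distance_mpc):
    """Arrival-time lag (s) of the lower-energy photon behind the higher-energy one.

    Assumes a static, non-expanding space and m c^2 << E, so that
    v/c ~ 1 - (m c^2)^2 / (2 E^2). Masses and energies are in eV.
    """
    distance_m = distance_mpc * MPC_IN_M
    return (distance_m / (2.0 * C)) * mass_ev**2 * (1.0 / e_low_ev**2 - 1.0 / e_high_ev**2)

# Illustrative inputs: a photon mass of 4e-5 eV/c^2, photon energies of
# 25 keV and 325 keV, and a source 1000 Mpc away
delay = massive_photon_delay(4.0e-5, 25e3, 325e3, 1000.0)
```

<p>The lag grows linearly with distance and is dominated by the lowest energy, which is why distant sources observed over a wide energy range are the most constraining.</p>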
<h1 id="the-gravitational-time-delay">The gravitational time delay</h1>
<p>To constrain the equivalence principle, we must be able to predict how long it takes a
photon to travel through a gravitational field. The resulting time delay to a distant source
depends on the gravitational potential along the path that it travels and thus depends on the
direction in the sky. If one had knowledge of the true present-day matter field, then one could
create maps of the expected time delay as a function of source position. Indeed, in previous
attempts to constrain equivalence-principle violation, the gravitational potential fluctuation, \(\delta \phi\), was modelled as arising
from one or a few isolated sources near the line of sight. However, the long range of gravity
casts doubt on this approximation of multiple point masses. Instead, we account fully for the
contributions to the time delay from all mass in the non-linear cosmological density field. We
derive the contribution from local structures using constrained density fields generated by the
BORG reconstruction of SDSS-III/BOSS <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, and combine this with an unconstrained contribution from
distant sources to produce a Monte-Carlo based source-by-source forward model for the expected
time delay. The ensemble mean of the resulting time delay fluctuation map is plotted in Figure 1,
and is \(\sim 10^{11} {\rm \, s}\) for a source at \(z=0.1\).</p>
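<p>For reference, the gravitational time delay can be written in the standard parametrised post-Newtonian form. This is a sketch and not necessarily the exact expression used in our analysis; the symbols \(\phi\) (gravitational potential), \(l\) (path length along the line of sight) and \(E_1, E_2\) (two photon energies) are notational assumptions:</p>
<p>\[ \Delta t_{\rm grav} = -\frac{1+\gamma}{c^{3}} \int \phi \, \mathrm{d}l, \qquad \Delta t_{\rm grav}(E_1) - \Delta t_{\rm grav}(E_2) = -\frac{\Delta\gamma}{c^{3}} \int \phi \, \mathrm{d}l, \]</p>
<p>with \(\Delta\gamma = \gamma(E_1) - \gamma(E_2)\). In General Relativity \(\gamma = 1\) at all energies, so the second expression vanishes; a non-zero \(\Delta\gamma\) turns the large total delay into a small but measurable energy-dependent lag.</p>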
<p class="figure"><img src="/assets/posts/quantum_gravity/fluctuation_plot.png" alt="Shapiro time delay map at redshift 0.1" />
<em>Mollweide projection in equatorial coordinates of the ensemble mean of the time delay fluctuations at \(z=0.1\) from wavelengths resolved by the BORG reconstruction.</em></p>
<h1 id="forward-modelling-the-time-delays">Forward modelling the time delays</h1>
<p>We use a catalogue of 668 Gamma Ray Bursts from the BATSE satellite, since these not only have spectral lag
data, but also pseudoredshifts calculated using the spectral peak energy-peak luminosity relation.
Propagating uncertainties on the pseudoredshifts, sky localisation and spectral parameters through
Monte Carlo sampling, we produce source-by-source forward models for the likelihood of a time delay
from quantum gravity, a photon mass or equivalence principle violation. However, these are not the
only types of physics that can lead to spectral lags: these may also be generated through intrinsic
differences in the emission of photons of different wavelength at the source or their propagation
through the medium surrounding the Gamma Ray Burst, or through instrumental effects at the observer. Without a
robust physical model for the time delays these lead to, we model them using a generic functional form
(a sum of Gaussians) with free parameters that we marginalise over in constraining \(m_\gamma\),
\(\ell_{\rm QG}\) and \(\Delta\gamma\). We vary the number of Gaussians used to describe these observational
and astrophysical processes to find the best-fitting model to the data. Importantly, we find that our
results are insensitive to this choice, a vital check that was often neglected in previous work. We
compare our predicted time delays to the observed ones through an MCMC algorithm and thereby constrain
\(m_\gamma\), \(\ell_{\rm QG}\) and \(\Delta\gamma\).</p>
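<p>A generic sum-of-Gaussians noise model of the kind described above can be evaluated as in the following sketch. The function name and the example parameter values are hypothetical; only the functional form (a weighted sum of Gaussians) is taken from the text.</p>

```python
import numpy as np

def gaussian_mixture_logpdf(x, weights, means, sigmas):
    """Log probability density of a weighted sum of Gaussians, evaluated at x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    # Log of each weighted component density, shape (n_points, n_components)
    log_terms = (np.log(weights)
                 - 0.5 * np.log(2.0 * np.pi * sigmas**2)
                 - 0.5 * ((x - means) / sigmas) ** 2)
    # Log-sum-exp over components for numerical stability
    peak = log_terms.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.exp(log_terms - peak).sum(axis=1))

# Hypothetical two-component model for time delays of non-fundamental origin
weights, means, sigmas = [0.7, 0.3], [0.0, 0.5], [0.1, 1.0]
grid = np.linspace(-20.0, 20.0, 40001)
density = np.exp(gaussian_mixture_logpdf(grid, weights, means, sigmas))
total_probability = density.sum() * (grid[1] - grid[0])
```

<p>Changing the number of components only changes the length of the three parameter lists, which makes it straightforward to repeat an inference with different numbers of Gaussians and check that the constraints do not move.</p>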
<h1 id="is-the-speed-of-light-energy-dependent">Is the speed of light energy dependent?</h1>
<p>We find no evidence that the speed of light has an energy dependence. We constrain the photon mass to
be \(m_\gamma < 4.0 \times 10^{-5} \, h \, {\rm eV}/c^2\) and the quantum gravity length scale to be
\(\ell_{\rm QG} < 5.3 \times 10^{-18} \, h \, {\rm \, GeV^{-1}}\) at 95% confidence <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. As shown in Figure 2,
the quantum gravity constraint is the tightest from time delay studies which consider multiple Gamma Ray Bursts, and the constraint
on \(m_\gamma\), although weaker than from using radio data, provides an independent constraint which is less
sensitive to the effects of dispersion by electrons. We also place upper limits on an energy dependence of
\(\gamma\) of \(\Delta \gamma < 2.1 \times 10^{-15}\) at \(1 \sigma\) confidence between photon energies of
\(25 {\rm \, keV}\) and \(325 {\rm \, keV}\) <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. These constraints are 40 times tighter than literature results,
illustrating the benefits of using complete mass distributions when studying non-local relativistic effects
such as time delays.</p>
<p>So what can we say about quantum gravity, the photon mass and the equivalence principle? Through the use of
simulation-based Bayesian statistical forward-modelling techniques and the BORG algorithm, we have produced
some of the tightest constraints on these theories to date, and have demonstrated that the results are robust
to how one models other astrophysical and observational contributions to the observed signal. It is expected
that \(\ell_{\rm QG}\) should be near the Planck length, which is approximately two orders of magnitude smaller
than we are currently sensitive to, so we are yet to probe this. It is expected that detecting Gamma Ray Bursts at
\(>100 {\rm \, GeV}\) should be routine in the future; with more, higher energy measurements one should begin to
probe this energy scale, so there is the tantalising possibility of making the first detection of quantum gravity
as these limits approach the Planck scale in the near future.</p>
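<p>The comparison with the Planck scale can be checked with a few lines of arithmetic. The sketch below works in natural units, where a length is an inverse energy, uses a Planck energy of about \(1.22 \times 10^{19}\) GeV, and sets \(h = 1\) for simplicity (an assumption; the quoted limit scales with \(h\)).</p>

```python
import math

PLANCK_ENERGY_GEV = 1.22e19                       # Planck energy in GeV
planck_length = 1.0 / PLANCK_ENERGY_GEV           # Planck length in GeV^-1 (natural units)

limit = 5.3e-18                                   # upper limit on l_QG in GeV^-1, taking h = 1
ratio = limit / planck_length                     # how far the limit is above the Planck length
orders_of_magnitude = math.log10(ratio)
```

<p>The ratio comes out between 60 and 70, that is, slightly less than two orders of magnitude above the Planck length.</p>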
<p class="figure wide"><img src="/assets/posts/quantum_gravity/qg_constraint_comparison.png" alt="Quantum gravity constraints compared to the literature" />
<em>Lower limits on the quantum gravity energy scale (\(1 / \ell_{\rm QG}\)) from time delay studies which use multiple astrophysical sources. Our work provides the tightest constraint to date. The dashed vertical line is the Planck energy, and it is expected that the quantum gravity energy scale has approximately this value.</em></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>G. Lavaux, J. Jasche & F. Leclercq 2019, “Systematic-free inference of the cosmic matter density field from SDSS3-BOSS data”, <a href="https://arxiv.org/abs/1909.06396">arXiv:1909.06396</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>D.J. Bartlett, H. Desmond, P.G. Ferreira & J. Jasche 2021, “Constraints on quantum gravity and the photon mass from gamma ray bursts”, PRD accepted, <a href="https://arxiv.org/abs/2109.07850">arXiv:2109.07850</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>D.J. Bartlett, D. Bergsdal, H. Desmond, P.G. Ferreira & J. Jasche 2021, “Constraints on equivalence principle violation from gamma ray bursts”, <a href="https://doi.org/10.1103/PhysRevD.104.084025">PRD 104, 084025</a> <img class="inline-logo svg" src="/assets/images/newspaper-solid.svg" alt="journal" />, <a href="https://arxiv.org/abs/2106.15290">arXiv:2106.15290</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Sat, 06 Nov 2021 00:00:00 +0200
https://www.aquila-consortium.org/method/observations/quantum_gravity.html
Testing gravity with the positions of supermassive black holes
<h1 id="overview">Overview</h1>
<p>Testing General Relativity on large scales is largely tantamount to searching
for new fundamental interactions (“fifth forces”) between masses, mediated by
dynamical fields beyond the metric tensor. An important competitor to General Relativity is
<em>galileon gravity</em>, which introduces a new light scalar field with a
Lagrangian that is symmetric under Galilean transformations. Historically the
galileon has been a leading contender for explaining dark energy, but now it is
viewed mainly as an archetype of “Vainshtein-screened” theories where the
fifth force from the scalar field vanishes in high-density regions due to second
derivative terms in the equation of motion. This behaviour arises in many
theories beyond the Standard Model.</p>
<p>A key feature of the galileon is that it couples to nonrelativistic matter but
not to gravitational binding energy, violating the strong equivalence principle.
This means that black holes – the only purely gravitational objects – are
entirely unaffected by the galileon, while the stars, gas and dark matter in
galaxies feel the full fifth force. As illustrated in Figure 1, this causes the
supermassive black holes at the centres of galaxies to lag behind the other
galactic components in the direction of an external galileon field. We have used
this effect in a recent article to place stringent constraints on the strength of a galileon
coupling to matter.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
<p class="figure"><img src="/assets/posts/bh_gravity/BH_cartoon.png" alt="Cartoon illustrating the formation of galaxy--black hole offsets under galileon gravity" />
<em>Cartoon illustrating the formation of galaxy–black hole offsets under galileon gravity. The restoring force on the black hole due to its offset from the galaxy centre compensates for the fact that it doesn’t feel the galileon fifth force.</em></p>
<h1 id="csiborg-mapping-the-large-scale-gravitational-field">CSiBORG: Mapping the large-scale gravitational field</h1>
<p>To make predictions for black hole positions in galileon gravity, we need to
know the fifth-force field on a galaxy-by-galaxy basis. To do this we introduced
<em>CSiBORG</em> (Constrained Simulations in BORG), a suite of ~100 RAMSES <em>N</em>-body
simulations using initial conditions sampled from the posterior of the BORG-PM
algorithm. CSiBORG gives an accurate picture of dark matter structures within
\(\sim 250\) Mpc of the Milky Way with a mass resolution of \(2 \times 10^8 \text{M}_\odot\), including full
propagation of the uncertainties in the initial conditions.</p>
<p>We use CSiBORG to map out the local galileon field in the linear, quasistatic
approximation. Combined with a flexible model for halo structure, this allows us
to calculate the expected galaxy–black hole offsets as a function of the
galileon coupling coefficient and the radius within which the fifth force is
suppressed by the Vainshtein mechanism, \(r_V\). We apply this to \(\sim 2000\) galaxies in
which the offset has been measured by comparing optical images of galaxies to
multi-wavelength observations of Active Galactic Nuclei. Marginalising over an
empirical model describing astrophysical noise, we then use a Bayesian
likelihood framework and MCMC algorithm to constrain the galileon parameters.</p>
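<p>The last step, turning a likelihood into an upper bound with MCMC, can be illustrated with a toy Metropolis sampler. This is a deliberately simplified sketch and not our pipeline: the one-parameter model (observed offset = strength × predicted offset + Gaussian noise), the flat non-negative prior, and all names and mock data are assumptions.</p>

```python
import numpy as np

def metropolis(log_post, start, step, n_steps, rng):
    """Random-walk Metropolis sampler for a one-dimensional posterior."""
    chain = np.empty(n_steps)
    current, current_lp = start, log_post(start)
    for i in range(n_steps):
        proposal = current + step * rng.standard_normal()
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, posterior ratio)
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        chain[i] = current
    return chain

rng = np.random.default_rng(2)
predicted = rng.standard_normal(500)                 # hypothetical predicted offsets per unit strength
noise_sigma = 1.0
observed = noise_sigma * rng.standard_normal(500)    # mock data containing no signal

def log_post(strength):
    if strength < 0.0:                               # flat prior restricted to non-negative strengths
        return -np.inf
    resid = observed - strength * predicted
    return -0.5 * np.sum(resid**2) / noise_sigma**2

chain = metropolis(log_post, start=0.1, step=0.05, n_steps=20000, rng=rng)
upper_68 = np.quantile(chain[2000:], 0.68)           # 1-sigma upper bound after discarding burn-in
```

<p>With no signal in the mock data, the posterior piles up against zero and the 68% quantile of the chain plays the role of a \(1\sigma\) upper limit on the strength parameter.</p>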
<h1 id="constraining-cosmological-galileons">Constraining cosmological galileons</h1>
<p>We find no evidence that black holes are offset from the centres of their hosts
in the direction or with the relative magnitude expected from galileons. This
allows us to place strong constraints on the strength of the galileon fifth
force relative to gravity, \(\Delta G/G_N\). In the left panel of Figure 2 we show this
constraint for four observational datasets as a function of \(r_V\): our final bound,
driven by the largest sample, is \(\Delta G/G_N < 0.16\) at \(1\sigma\) confidence for
\(r_V \lesssim \text{Gpc}\). In the right panel we translate this result to a constraint on
the coupling coefficient \(\alpha\) as a function of the lengthscale that appears in
the galileon action, known as the crossover scale \(r_c\). Figure 2 also shows previous
constraints from Lunar Laser Ranging and the black hole in M87, as well as the
expected relation between \(\alpha\) and \(r_c\) in DGP, a higher-dimensional modified
gravity model that introduces galileons.</p>
<p>Enabled by BORG, ours is the first work to model a large-scale galileon field
point-by-point in space. It is therefore the first to probe crossover scales as
large as the observable universe, and the first to achieve statistically
rigorous constraints. By supplementing our model with numerical solutions of the
galileon equation of motion in the nonlinear regime it will be possible to push
our bound to smaller \(r_c\), superseding the Lunar Laser Ranging result and ruling
out the self-accelerating branch of DGP. More generally, a Monte Carlo-based
forward-modelling approach calibrated against simulations and marginalised over
noise holds great promise for precision tests of fundamental physics with galaxy
survey datasets.</p>
<p class="figure wide"><img src="/assets/posts/bh_gravity/constraints.png" alt="Constraints on galileons" />
<em> <strong>Left:</strong> \(1\sigma\) constraint on \(\Delta G/G_N\) as a function of average Vainshtein radius, \(r_V\), from four observational datasets. \(L_{eq}\) is the length scale at which the matter power spectrum turns over. <strong>Right:</strong> Constraint on the coupling of a cubic galileon to matter, \(\alpha\), as a function of the crossover scale, \(r_c\), from lunar laser ranging (LLR), the black hole at the centre of M87, and our work. Our test probes larger-\(r_c\) galileons than others because it models the full galileon field from large-scale structure.</em></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>D. J. Bartlett, H. Desmond & P. G. Ferreira, 2020, “Constraints on galileons from the positions of supermassive black holes”, Phys Rev D submitted, <a href="https://arxiv.org/pdf/2010.05811">arXiv:2010.05811</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Mon, 26 Oct 2020 00:00:00 +0200
https://www.aquila-consortium.org/method/observations/bh_gravity.html
Simulating the Universe on a mobile phone
<h1 id="overview">Overview</h1>
<p>There are about two trillion galaxies in the observable Universe, and the evolution of each of them is sensitive to the presence of all the others. Can we put this all into a computer, or even a mobile phone, to simulate the evolution of the Universe? In a recent paper, we introduced a perfectly parallel algorithm for cosmological simulations which addresses this question.</p>
<p>Modern cosmology relies on very large data sets to determine the content of our Universe, in particular the amounts of dark matter and dark energy. These large datasets include the positions and electromagnetic spectra of very distant galaxies, up to 20 billion light-years away. In the next decade, the Euclid mission<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> and the Vera Rubin observatory,<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> in particular, will obtain information on several billion galaxies.</p>
<h1 id="physical-challenges">Physical challenges</h1>
<p>Making the link between our knowledge of physics, for example the equations that govern the evolution of dark matter and dark energy, and astronomical observations requires considerable computational resources. Indeed, the most recent observations cover huge volumes, of the order of a cube with a side length of 12 billion light-years. As the typical distance between two galaxies is only a few million light-years, we have to simulate around one trillion galaxies to reproduce the observations.</p>
<p>In addition, in order to follow the physics of the formation of these galaxies, the spatial resolution should be of the order of ten light-years. Ideally, simulations should therefore have a scale ratio (that is, the ratio between the largest and smallest physical lengths of the problem) close to a billion. No computer, existing or even under construction, can achieve such a goal.</p>
<p>In practice, it is therefore necessary to use approximate techniques, consisting in “populating” the large-scale structures of the Universe with fictitious (but realistic) galaxies. This approximation is further justified by the fact that the evolution of galaxies’ components, for example stars and interstellar gas, involves very fast phenomena in comparison to the global evolution of the cosmos. The use of fictitious galaxies still requires simulating the dynamics of the Universe with a scale ratio of around 4,000, which is just possible with today’s supercomputers.</p>
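<p>The scale ratios quoted above translate directly into memory requirements. The following back-of-the-envelope sketch assumes a regular grid storing a single scalar field in single precision (4 bytes per cell); both the grid and the storage choice are assumptions made for illustration.</p>

```python
BOX_SIDE_LY = 12e9       # side length of the simulated cube in light-years, from the text
RESOLUTION_LY = 10.0     # resolution needed to follow galaxy formation, from the text

# Ratio between the largest and smallest lengths: "close to a billion"
ideal_ratio = BOX_SIDE_LY / RESOLUTION_LY

def grid_memory_bytes(scale_ratio, bytes_per_cell=4):
    """Memory for one scalar field on a regular grid with scale_ratio cells per side."""
    return scale_ratio**3 * bytes_per_cell

practical = grid_memory_bytes(4000)        # scale ratio reachable with fictitious galaxies
ideal = grid_memory_bytes(ideal_ratio)     # scale ratio needed to resolve galaxy formation
```

<p>A scale ratio of 4,000 already requires 256 gigabytes for one field, whereas a ratio of a billion would require of the order of \(10^{27}\) bytes, far beyond any existing machine.</p>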
<h1 id="the-problem-of-computational-limits">The problem of computational limits</h1>
<p>Simulating the gravitational dynamics of the Universe is what physicists call an \(N\)-body problem. Although the equations to be solved are analytical, as in most cases in physics, solutions have no simple expressions and require numerical techniques as soon as \(N\) is larger than two. The direct numerical solution consists in explicitly calculating the interactions between all the pairs of bodies, also called “particles”. The computation of forces by direct summation was the favoured technique in cosmology at the beginning of the development of numerical simulations, in the 1970s. At present, it is mainly used for simulations of star clusters and galactic centres. The number of particles used in “direct summation” simulations is represented by green dots in figure 1, where the \(y\)-axis has a logarithmic scale.</p>
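<p>Direct summation is simple enough to write in a few lines. The sketch below computes the Newtonian acceleration on every particle from every other particle; the function name, the units (\(G = 1\)) and the small softening length used to regularise close encounters are assumptions made for the example.</p>

```python
import numpy as np

def direct_summation_accelerations(positions, masses, G=1.0, softening=1e-3):
    """Gravitational acceleration on every particle from every other particle.

    The cost scales as N^2: all N*(N-1) ordered pairs are evaluated explicitly.
    The softening length is an assumed regularisation for close encounters.
    """
    # Separation vectors r_j - r_i for all pairs, shape (N, N, 3)
    delta = positions[None, :, :] - positions[:, None, :]
    dist2 = np.sum(delta**2, axis=-1) + softening**2
    inv_dist3 = dist2 ** -1.5
    np.fill_diagonal(inv_dist3, 0.0)       # no self-interaction
    # a_i = G * sum_j m_j (r_j - r_i) / |r_j - r_i|^3
    return G * np.einsum("ij,ijk->ik", masses[None, :] * inv_dist3, delta)

# Two unit masses one unit of length apart attract each other along the x-axis
positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
masses = np.array([1.0, 1.0])
accelerations = direct_summation_accelerations(positions, masses)
```

<p>The intermediate arrays have \(N \times N\) entries, so doubling the number of particles quadruples both the memory and the computing time.</p>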
<p class="figure wide"><img src="/assets/posts/scola/Moore_law_cosmosims.png" alt="Number of particles in cosmological simulations as a function of time" />
<em>Evolution of the number of particles used in \(N\)-body simulations as a function of year of publication. Different symbols and colours correspond to different methods used to compute gravitational dynamics (direct summation in green, advanced algorithms in orange). For comparison, Moore’s law concerning computer performance is represented by the black dotted line.</em></p>
<p>The direct summation method has a numerical cost which increases like \(N^2\), the number of pairs of particles considered. For this reason, in spite of improvements provided by hardware accelerators such as graphics processing units (GPUs), the number of particles used with this method cannot grow as quickly as in the famous “Moore’s Law”, which predicts a doubling of computer hardware performance every 18 months. Moore’s law was verified for about four decades (1965-2005), but as traditional hardware architectures are reaching their physical limit, the performance of individual compute cores attained a plateau around 2015 (see figure 2). Therefore, cosmological simulations cannot merely rely on processors becoming faster to reduce the computational time.</p>
<p class="figure wide"><img src="/assets/posts/scola/Moore_law_processors.png" alt="Single-threaded floating point performance as a function of time" />
<em>Single-threaded floating point performance of CPUs as a function of time. Different trademarks and models are represented by different colours and symbols as indicated in the legend. This plot is based on adjusted SPECfp® results.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></em></p>
<p>In order to reduce the cost of simulations, most of the work in numerical cosmology since 1980 has consisted in improving algorithms. The aim was to circumvent the explicit calculation of all gravitational interactions between particles, especially for pairs which are the most distant in the volume to be simulated. These algorithmic developments have enabled a huge increase in the number of particles used in cosmological simulations (see the orange triangles in figure 1). In fact, since 1990, the increase in computational capacity in cosmology has been faster than Moore’s Law, with software improvements adding to the increase in computer performance (more details in <a href="http://florent-leclercq.eu/blog.php?page=2">this blog post</a>).</p>
<p>In 2020, with the architectures of modern supercomputers, calculations are no longer limited by the number of operations that processors can perform in a given time, but by the inherent latencies in communications among the different processors involved in so-called “parallel” calculations. In these computational techniques, a large number of processors work together synchronously to perform calculations far too complex to be carried out on a conventional computer. The stagnation of performance due to communication latencies has been theorised in “Amdahl’s law” (see figure 3), named after the computer scientist who formulated it in 1967. It is now the main challenge for cosmological simulations: without improving the “degree of parallelism” of our algorithms, we will soon reach a technological plateau.</p>
<p class="figure wide"><img src="/assets/posts/scola/Amdahl_law.png" alt="Amdahl’s law" />
<em>Amdahl’s law: theoretical speed-up in the execution of a program as a function of the number of processors executing it, for different values of the parallel fraction of the program (different lines). The speed-up is limited by the serial part of the program. For example, if 90% of the program can be parallelised, the theoretical maximum speed-up factor using a large number of processors would be 10.</em></p>
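<p>Amdahl’s law is a one-line formula: if a fraction \(p\) of a program can be parallelised over \(n\) processors, the speed-up is \(1 / \left[(1-p) + p/n\right]\). The short sketch below (the function name is our own choice) reproduces the example given in the caption.</p>

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Theoretical speed-up of a program whose parallelisable fraction is parallel_fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)

# Speed-up of a program that is 90% parallelisable
for n in (1, 10, 100, 10_000):
    print(n, round(amdahl_speedup(0.9, n), 2))
# prints:
# 1 1.0
# 10 5.26
# 100 9.17
# 10000 9.99
```

<p>However many processors are added, the speed-up never exceeds \(1/(1-p)\), which is 10 in this case. Only reducing the serial fraction itself lifts that ceiling.</p>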
<h1 id="the-scola-approach-divide-and-conquer">The sCOLA approach: divide and conquer</h1>
<p>Let us go back to the physical problem to be solved: it is about simulating the gravitational dynamics of the Universe at different scales. At “small” scales, there are many objects that interact with each other: numerical simulations are required. But at “large” spatial scales, that is to say if we look at figure 4 from very far, not much happens during evolution (except for a linear increase of the amplitude of inhomogeneities). Despite this, with traditional simulation algorithms, the gravitational effect of all the particles on each other must be calculated, even if they are very far apart. This is expensive and almost useless, since most of the gravitational evolution is correctly described by simple equations, which can be solved analytically without a computer.</p>
<p class="figure wide"><img src="/assets/posts/scola/scola_comparison.png" alt="Comparison between traditional and sCOLA simulations" />
<em>Comparison between a traditional simulation (left panel) and a simulation using our new algorithm (right panel). In our approach, the volume of the simulation is a mosaic made of “tiles” calculated independently and whose edges are represented by dotted lines.</em></p>
<p>In order to minimise unnecessary numerical calculations, it is possible to use a hybrid simulation algorithm: analytical at large scales and numerical at small scales. The underlying idea, called spatial comoving Lagrangian acceleration (sCOLA<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>), is common in physics: it is a “change of frame of reference”. In this framework, large-scale dynamics is taken into account by the new frame of reference, while small-scale dynamics is solved numerically by the computer, using conventional calculations of the gravity field. Unfortunately, the most naive version of the sCOLA algorithm gives results that are too approximate to be usable. In our latest publication,<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup> we modified sCOLA in order to improve its accuracy.</p>
<p>Furthermore, we have realised that this concept makes it possible to “divide and conquer”. Indeed, given a large volume to be simulated, sCOLA allows sub-volumes of smaller size to be simulated independently, without communication with neighbouring sub-volumes. Our approach therefore makes it possible to represent the Universe as a large mosaic: each of the “tiles” in figure 4 is a small simulation that a modest computer can solve, and the assembly of all the tiles gives the overall picture. This is what is called in computer science a “perfectly parallel” algorithm, unlike all cosmological simulation algorithms so far. Thanks to it, we have been able to obtain cosmological simulations at a satisfactory resolution, while remaining on a relatively modest computing facility (figure 5).</p>
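<p>The “mosaic” idea can be illustrated with a minimal Python sketch. It is not the Simbelmynë implementation: the function names are hypothetical and <code>evolve_tile</code> is only a placeholder for the small-scale solver. The sketch shows just the structure of a perfectly parallel computation, in which each tile is processed using its own data only and the results are assembled at the end.</p>

```python
import numpy as np

def split_into_tiles(field, n_tiles):
    """Cut a cubic grid into n_tiles**3 equal sub-volumes ("tiles")."""
    size = field.shape[0] // n_tiles
    tiles = {}
    for i in range(n_tiles):
        for j in range(n_tiles):
            for k in range(n_tiles):
                tiles[(i, j, k)] = field[i * size:(i + 1) * size,
                                         j * size:(j + 1) * size,
                                         k * size:(k + 1) * size].copy()
    return tiles, size

def evolve_tile(tile):
    """Hypothetical placeholder for the solver run inside one tile.
    It only reads the tile's own data, so tiles never communicate."""
    return 2.0 * tile

def assemble(tiles, n_tiles, size):
    """Put the independently processed tiles back into one volume."""
    full = np.empty((n_tiles * size,) * 3)
    for (i, j, k), tile in tiles.items():
        full[i * size:(i + 1) * size,
             j * size:(j + 1) * size,
             k * size:(k + 1) * size] = tile
    return full

rng = np.random.default_rng(0)
density = rng.normal(size=(8, 8, 8))          # toy input field
tiles, size = split_into_tiles(density, n_tiles=2)

# Each call is independent: it could run on a different machine.
evolved = {key: evolve_tile(tile) for key, tile in tiles.items()}
mosaic = assemble(evolved, n_tiles=2, size=size)
```

<p>Because the loop over tiles has no dependencies between iterations, it could be handed to any job scheduler or process pool without changing the result.</p>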
<p>Our perfectly parallel sCOLA algorithm has been implemented in the publicly available <strong>Simbelmynë</strong> code,<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup> where it is included in version 0.4.0 and later.</p>
<p class="figure"><img src="/assets/posts/scola/horizon_cluster.png" alt="A GPU-based computer" />
<em>A GPU-based computer at the Institut d’Astrophysique de Paris. Its cost represents only a hundredth of that of a supercomputer at national computing facilities.</em></p>
<h1 id="new-hardware-to-simulate-the-universe">New hardware to simulate the Universe</h1>
<p>This new algorithm is not limited to small computing facilities; it also opens up new ways of exploiting computing hardware. Ideally, each of the “tiles” could be small enough to fit in the “cache memory” of our computers, that is, the part of the memory that processors can access in the smallest amount of time. The resulting speed-up in communication would allow us to simulate the entire volume of the Universe extremely quickly, or even at a resolution never achieved so far.</p>
<p>Going further, we can even imagine that each of the simulations corresponding to a “tile” would be small enough that it can be run on a modern mobile phone! This parallelisation technique would be based on a platform such as Cosmology@Home<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>, which is dedicated to distributed collaborative computing. This platform is derived from the efforts initiated by SETI@Home<sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup> for the search for extraterrestrial intelligence.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p><a href="https://www.euclid-ec.org/">https://www.euclid-ec.org/</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p><a href="https://www.lsst.org/">https://www.lsst.org/</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p><a href="http://spec.org/">http://spec.org/</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>S. Tassev, D. J. Eisenstein, B. D. Wandelt, M. Zaldarriaga, <em>sCOLA: The N-body COLA Method Extended to the Spatial Domain</em> (2015), <a href="https://arxiv.org/abs/1502.07751">arXiv:1502.07751</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>F. Leclercq, B. Faure, G. Lavaux, B. D. Wandelt, A. H. Jaffe, A. F. Heavens, W. J. Percival, C. Noûs, <em>Perfectly parallel cosmological simulations using spatial comoving Lagrangian acceleration</em>, A&A, in press (2020), <a href="https://arxiv.org/abs/2003.04925">arXiv:2003.04925</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>The Simbelmynë code: <a href="http://simbelmyne.florent-leclercq.eu">homepage</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p><a href="https://www.cosmologyathome.org/">https://www.cosmologyathome.org/</a> <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:8" role="doc-endnote">
<p><a href="https://setiathome.berkeley.edu/">https://setiathome.berkeley.edu/</a> <a href="#fnref:8" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Mon, 25 May 2020 00:00:00 +0300
https://www.aquila-consortium.org/method/scola.html
method
Why neural networks don’t work and how to use them
<h1 id="neural-networks-as-universal-model-approximators">Neural networks as universal model approximators</h1>
<p>We can think of a neural network, \(\mathbb{NN}(\boldsymbol{w}, \boldsymbol{\alpha}) : {\bf d}\to\boldsymbol{\tau}\), as an approximation of a model, \(\mathcal{M} : {\bf d}\to{\bf t}\), where \({\bf d}\) is some input data to the network and the output of the network is \(\boldsymbol{\tau}\) which is an estimate of some target, \({\bf t}\), associated with the data. The neural network itself is a function of some trainable parameters called weights, \(\boldsymbol{w}\), and some hyperparameters, \(\boldsymbol{\alpha}\), which encompass the architecture of the network, the initial values of the weights, the form of activation functions, the choice of cost function, etc.</p>
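<p>As a concrete illustration, here is a minimal NumPy sketch of such a function. The dictionary <code>alpha</code> and the helper names are hypothetical choices made for this example: the hyperparameters fix the architecture, the activation function and the scale of the initial weights, while <code>w</code> holds the trainable weights.</p>

```python
import numpy as np

def init_weights(alpha, rng):
    """Draw initial weights w; their shapes are dictated by alpha."""
    sizes = alpha["layer_sizes"]
    return [(rng.normal(scale=alpha["init_scale"], size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def network(w, alpha, d):
    """NN(w, alpha): map input data d to an estimate tau of the target."""
    h = d
    for W, b in w[:-1]:
        h = alpha["activation"](h @ W + b)   # hidden layers
    W, b = w[-1]
    return h @ W + b                         # linear output layer

# Hyperparameters (illustrative values only)
alpha = {"layer_sizes": [3, 8, 2], "activation": np.tanh, "init_scale": 0.1}

rng = np.random.default_rng(0)
w = init_weights(alpha, rng)
d = rng.normal(size=(5, 3))      # five data vectors of dimension three
tau = network(w, alpha, d)       # five estimates of a two-dimensional target
```

<p>Changing any entry of <code>alpha</code> defines a different function, even before the weights are trained.</p>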
<h1 id="likelihood-of-obtaining-targets-given-a-network">Likelihood of obtaining targets given a network</h1>
<p>In a traditional sense, the training of a neural network is equivalent to minimising a <em>cost</em> or <em>loss</em> function, \(\Lambda({\bf t}, \boldsymbol{\tau})\), with respect to the weights of the network, \(\boldsymbol{w}\) (and hyperparameters, \(\boldsymbol{\alpha}\)) given a set of pairs of data and targets for training and validation, \(\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\) and \(\{ {\bf d}^\textrm{val}_i, {\bf t}^\textrm{val}_i\vert i\in[1, n_\textrm{val}]\}\). The cost function, \(\Lambda({\bf t}, \boldsymbol{\tau})\), measures how close the outputs of a fixed network, \(\mathbb{NN}(\boldsymbol{w}^*,\boldsymbol{\alpha}^*) : {\bf d}\to\boldsymbol{\tau}\), are to some target, \({\bf t}\), given a data-target pair, \(\{ {\bf d}, {\bf t}\}\), at some fixed network parameters and hyperparameters, \(\boldsymbol{w}=\boldsymbol{w}^*\) and \(\boldsymbol{\alpha}=\boldsymbol{\alpha}^*\). That is, how likely is it that the output of the network provides the true target for the input data given a chosen set of weights and fixed network hyperparameters, i.e. the cost function is equivalent to the (negative logarithm of the) likelihood function</p>
\[\Lambda({\bf t}, \boldsymbol{\tau})\simeq-\textrm{ln}\mathcal{L}({\bf t}\vert {\bf d},\boldsymbol{w}^*,\boldsymbol{\alpha}^*).\]
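<p>A familiar instance of this equivalence is the squared-error cost, which equals the negative logarithm of a Gaussian likelihood with unit variance up to an additive constant. The short sketch below checks this numerically; the numbers are arbitrary and the function names are chosen for this example.</p>

```python
import numpy as np

def squared_error_cost(t, tau):
    """Cost: half the summed squared difference between target and output."""
    return 0.5 * np.sum((t - tau) ** 2)

def gaussian_log_likelihood(t, tau):
    """ln L(t | tau) for independent unit-variance Gaussian errors."""
    n = t.size
    return -0.5 * np.sum((t - tau) ** 2) - 0.5 * n * np.log(2.0 * np.pi)

t = np.array([1.0, 2.0, 3.0])        # targets
tau_a = np.array([1.1, 1.9, 3.2])    # outputs close to the targets
tau_b = np.array([0.0, 0.0, 0.0])    # outputs far from the targets

# The cost and -ln L differ only by this output-independent constant.
constant = 0.5 * t.size * np.log(2.0 * np.pi)
difference_a = -gaussian_log_likelihood(t, tau_a) - squared_error_cost(t, tau_a)
difference_b = -gaussian_log_likelihood(t, tau_b) - squared_error_cost(t, tau_b)
```

<p>Since the constant does not depend on the network output, minimising the cost and maximising the likelihood select the same weights.</p>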
<p class="figure"><img src="/assets/posts/nn/likelihood.svg" alt="Wibbly likelihood surface" />
<em>The likelihood surface, although regular for a given set of network parameters and hyperparameters, is extremely complex, degenerate, and even discrete and non-convex in the directions of the network parameters and hyperparameters.</em></p>
<p>Although the cost function is normally chosen to be convex, i.e. with a global minimum and defined everywhere, at a given value of \(\boldsymbol{w}=\boldsymbol{w}^*\) and \(\boldsymbol{\alpha}=\boldsymbol{\alpha}^*\), the shape of the likelihood is extremely complex, degenerate and bumpy when considering all possible \(\boldsymbol{w}\) and will often be discrete and non-convex in the \(\boldsymbol{\alpha}\) direction.</p>
<h2 id="maximum-likelihood-network-parameter-estimates">Maximum likelihood network parameter estimates</h2>
<p>The normal procedure for using neural networks is to <em>train</em> them. This means finding the maximum likelihood estimates of the weights of a network with a given set of training data-target pairs \(\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\) and fixed hyperparameters, \(\boldsymbol{\alpha}=\boldsymbol{\alpha}^*\), by doing</p>
\[\boldsymbol{w}^\textrm{MLE}=\underset{\boldsymbol{w}}{\textrm{argmax} }\left[\mathcal{L}(\{ {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\vert \{ {\bf d}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}, \boldsymbol{w}, \boldsymbol{\alpha}^*)\right].\]
<p>That is, we find the set of \(\boldsymbol{w}\) for which the likelihood function, evaluated over every member of the training set, is maximal. If the pairs of data and targets, \(\{ {\bf d}, {\bf t}\}\), are independent and identically distributed, we can write the likelihood as</p>
\[\mathcal{L}(\{ {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\vert \{ {\bf d}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}, \boldsymbol{w}, \boldsymbol{\alpha}^*)=\prod_i^{n_\textrm{train} }\mathcal{L}({\bf t}^\textrm{train}_i\vert {\bf d}^\textrm{train}_i, \boldsymbol{w},\boldsymbol{\alpha}^*).\]
<p class="figure"><img src="/assets/posts/nn/mle.gif" alt="Stochastic gradient descent" />
<em>By finding the set of \(\boldsymbol{\tau}\) which are closest (in the sense of the minimum cost function) to the target \({\bf t}\), given a neural network and some input data \({\bf d}\), the weights of the network traverse the negative logarithm of the likelihood surface for the true target, hopefully ending at some minimum (which is a maximum in the likelihood).</em></p>
<p>To find the maximum likelihood of the weights, one would normally consider some sort of stochastic gradient descent. Since most software is more efficient at finding minima rather than maxima, we actually minimise the negative logarithm of the likelihood, i.e. the cost function</p>
\[\begin{align}
\boldsymbol{w}^\textrm{MLE}&=\underset{\boldsymbol{w}}{\textrm{argmin} }\left[\sum_i^{n_\textrm{train} }\Lambda({\bf t}^\textrm{train}_i, \boldsymbol{\tau}^\textrm{train}_i)\right]\\
&=\underset{\boldsymbol{w}}{\textrm{argmin} }\left[-\sum_i^{n_\textrm{train} }\textrm{ln}\mathcal{L}({\bf t}^\textrm{train}_i\vert {\bf d}^\textrm{train}_i, \boldsymbol{w}, \boldsymbol{\alpha}^*)\right].
\end{align}\]
<p>The weights are updated using \(\boldsymbol{w}\to\boldsymbol{w}-\eta\nabla_\boldsymbol{w}\left[-\sum_i^{n_\textrm{train} }\ \textrm{ln}\mathcal{L}({\bf t}^\textrm{train}_i\vert {\bf d}^\textrm{train}_i, \boldsymbol{w}, \boldsymbol{\alpha}^*)\right]\), where \(\eta>0\) is the step size (learning rate). In the ideal case there would be one global maximum in the likelihood (a single minimum of its negative logarithm) so that after training the value of the weights of the neural network would be equal to the maximum likelihood estimates, \(\boldsymbol{w}=\boldsymbol{w}^\textrm{MLE}\). However, since the likelihood surface is, in reality, extremely degenerate and flat in the space of weight values, it is most likely that the weights only achieve a local maximum, i.e. \(\boldsymbol{w}=\boldsymbol{w}^\textrm{local MLE}\). In fact, which local maximum is found will normally depend extremely strongly on the initial \(\boldsymbol{w}=\boldsymbol{w}_\textrm{init}\) which is used for the gradient descent.</p>
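<p>The dependence on the starting point can be seen in a one-dimensional toy problem. In the sketch below, the quartic function is a hypothetical stand-in for the negative logarithm of the likelihood of a single weight (it is not derived from any network); it has two minima of different depth, and plain gradient descent ends in whichever one lies downhill from the initial value.</p>

```python
def neg_log_like(w):
    """Toy non-convex negative log-likelihood with two minima (hypothetical)."""
    return w ** 4 - 2.0 * w ** 2 + 0.3 * w

def grad_neg_log_like(w):
    """Analytic derivative of neg_log_like with respect to w."""
    return 4.0 * w ** 3 - 4.0 * w + 0.3

def gradient_descent(w_init, eta=0.01, n_steps=2000):
    """Repeat the update w -> w - eta * d(-ln L)/dw from a chosen start."""
    w = w_init
    for _ in range(n_steps):
        w = w - eta * grad_neg_log_like(w)
    return w

w_left = gradient_descent(w_init=-1.5)    # ends in the deeper minimum
w_right = gradient_descent(w_init=1.5)    # ends in the shallower minimum

print(w_left, neg_log_like(w_left))
print(w_right, neg_log_like(w_right))
```

<p>Both runs converge to a point where the gradient vanishes, yet only the first one reaches the lower value of the negative log-likelihood. Nothing in the second run signals that a better solution exists.</p>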
<p class="figure"><img src="/assets/posts/nn/w_init.gif" alt="Initialisation dependent gradient descent" />
<em>The initialisation of the weights will be very important in determining which local maximum likelihood estimate is found. This is because the surface of the likelihood is very bumpy. It can also be highly degenerate which leads to whole families of pseudo-maximum likelihood estimates.</em></p>
<p>Once the maximum (or at least local maximum) is found, it is normal to evaluate the accuracy (or some other figure of merit) using some validation set, \(\{ {\bf d}^\textrm{val}_i, {\bf t}^\textrm{val}_i\vert i\in[1, n_\textrm{val}]\}\). This validation set is used to modify the hyperparameters, \(\boldsymbol{\alpha}\), of the network to achieve the best possible fit to both the training and validation sets. These modifications could include changing the initial seeds of the weights, changing the activation functions, or changing the entire architecture, for example. However, networks trained in such a way do not provide a way to obtain scientifically robust estimates of the true targets \({\bf t}\), given observed data \({\bf d}\). To see why, we need to consider the probabilistic interpretation of neural networks.</p>
<h1 id="probabilistic-interpretation-of-neural-networks">Probabilistic interpretation of neural networks</h1>
<p>The posterior predictive density of obtaining a target, \({\bf t}\), given some input data, \({\bf d}\), is</p>
\[\mathcal{P}({\bf t}\vert {\bf d}) = \int d\boldsymbol{w}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w}, \boldsymbol{\alpha})\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}).\]
<p>Here, \(\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w},\boldsymbol{\alpha})\) is the likelihood of obtaining the true value of the target when given some input data \({\bf d}\) and network parameters and hyperparameters \(\boldsymbol{w}\) and \(\boldsymbol{\alpha}\); it is the (unnormalised) exponential of the negative <em>cost</em> function. \(\mathcal{P}(\boldsymbol{w}, \boldsymbol{\alpha})\) is the probability of obtaining the weights and hyperparameters of the neural network. Since the likelihood of obtaining any value of the target, \({\bf t}\), given some input data, \({\bf d}\), is essentially equal for any given neural network, i.e. any combination of \(\boldsymbol{w}\) and \(\boldsymbol{\alpha}\), the likelihood, \(\mathcal{L}({\bf t}\vert {\bf d},\boldsymbol{w},\boldsymbol{\alpha})\), is almost flat. Therefore, the majority of the information about the posterior predictive density, \(\mathcal{P}({\bf t}\vert {\bf d})\), comes from any <em>a priori</em> or <em>a posteriori</em> knowledge of the weights and hyperparameters, \(\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha})\), which therefore has to be chosen or found very carefully.</p>
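<p>The integral over the weights can be estimated by Monte Carlo: draw weights from \(\mathcal{P}(\boldsymbol{w})\) and average the likelihood. The sketch below does this for a deliberately tiny, hypothetical “network” with a single weight, \(\tau = w\,d\), a Gaussian likelihood and a Gaussian distribution for the weight, with the hyperparameters held fixed. These choices are assumptions made so that the result can be compared with a closed-form answer.</p>

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

sigma = 0.5       # width of the likelihood (assumed)
prior_std = 1.0   # width of the distribution of the weight (assumed)
d = 2.0           # input data
t = 1.0           # target value at which the density is evaluated

rng = np.random.default_rng(0)
w_samples = rng.normal(scale=prior_std, size=200_000)   # w ~ P(w)

# P(t|d) = int dw L(t|d,w) P(w)  ~  average of L(t|d,w_i) over the samples
likelihoods = gaussian_pdf(t, w_samples * d, sigma)
predictive_mc = likelihoods.mean()

# For this linear-Gaussian toy the integral is known analytically:
# t | d is Gaussian with variance sigma^2 + d^2 * prior_std^2.
predictive_exact = gaussian_pdf(t, 0.0, np.sqrt(sigma ** 2 + (d * prior_std) ** 2))

print(predictive_mc, predictive_exact)
```

<p>In this toy, the variance of the predictive density is \(0.25 + 4\): most of its width comes from the distribution of the weight rather than from the likelihood, which is the point made above.</p>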
<p class="figure"><img src="/assets/posts/nn/pp.gif" alt="Pointiness of posterior predictive density" />
<em>The form of the posterior predictive density of the targets \({\bf t}\) depends mostly on the probability of the weights and hyperparameters of the network. This means that the prior for the weights and hyperparameters must be chosen carefully or the posterior extremely well characterised via training data.</em></p>
<p>A Bayesian neural network is a network which provides the true posterior predictive density of targets \({\bf t}\) given data \({\bf d}\).</p>
<h2 id="failure-of-traditionally-trained-neural-networks">Failure of traditionally trained neural networks</h2>
<p>As described above, given a set of training pairs, \(\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\), and validation pairs, \(\{ {\bf d}^\textrm{val}_i, {\bf t}^\textrm{val}_i\vert i\in[1, n_\textrm{val}]\}\), we can find the (local) maximum likelihood estimates of the weights, \(\boldsymbol{w}=\boldsymbol{w}^\textrm{local MLE}\), and optimise the hyperparameters to \(\boldsymbol{\alpha}=\boldsymbol{\alpha}^*\) which gives the best fit to both the training and validation data-target pair sets. Since we fix both the parameters and hyperparameters, those values are set in stone and we degenerate the posterior distribution to a Dirac \(\delta\) function, neglecting any information brought by the training data, i.e.</p>
\[\begin{align}
\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}|\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}) &\propto\mathcal{L}(\boldsymbol{w},\boldsymbol{\alpha}\vert \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})p(\boldsymbol{w},\boldsymbol{\alpha})\\
&\to\delta(\boldsymbol{w}-\boldsymbol{w}^\textrm{local MLE}, \boldsymbol{\alpha}-\boldsymbol{\alpha}^*)
\end{align}\]
<p>where \(p(\boldsymbol{w},\boldsymbol{\alpha})\) is a prior distribution over the weights and hyperparameters.
By making such a choice, we erase the entirety of the information about the distribution of data and work only with the best fit model, which may (or may not) be complete.
As such, the predictive probability density of the targets \({\bf t}\) given data \({\bf d}\) is</p>
\[\mathcal{P}({\bf t}\vert {\bf d}) =\delta({\bf t}-\boldsymbol{\tau}({\bf d})),\]
<p>i.e., the probability of obtaining an estimate from the network is zero everywhere apart from at the value of the output of the network, \(\mathbb{NN}(\boldsymbol{w}^\textrm{local MLE}, \boldsymbol{\alpha}^*) : {\bf d}\to\boldsymbol{\tau}\) - the function is completely deterministic. Effectively, this means that the probability of obtaining \({\bf t}\) given the fixed network parameters and hyperparameters and some data \({\bf d}\) is impossibly small.</p>
<p>Consider a third <em>test</em> set, \(\{ {\bf d}^\textrm{test}_i, {\bf t}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}\). One normally determines how well a neural network is trained using this unseen (blind) set. To test the network, all of the test data, \(\{ {\bf d}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}\), are passed through the network to get estimates \(\{\boldsymbol{\tau}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}\) which can be plotted against the known targets, \(\{ {\bf t}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}\) (see the figure below).</p>
<p class="figure"><img src="/assets/posts/nn/nn_w.gif" alt="True vs. Predicted targets" />
<em>For any set of data passed through a trained neural network, with hyperparameters and network parameters fixed at their maximum likelihood values, the probability of obtaining a target is a \(\delta\) function. There is no knowledge of whether the output of the network will be equal to the target, and it is, in fact, extremely unlikely that it will be.</em></p>
<p>A network which produces \(\{\boldsymbol{\tau}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}\) which correlate very strongly with \(\{ {\bf t}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}\) is probably a network that is in a very good local maximum for both the weights and the hyperparameters. However, there is no assurance that the true \({\bf t}\) should be obtained by the network, and due to the complexity of the likelihood \(\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w}, \boldsymbol{\alpha})\), there is also no way of ensuring that \(\boldsymbol{\tau}\) should be similar to \({\bf t}\). Simply put, for complex models, it is not possible to prove that the neural network is equivalent to the model, \(\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha})\equiv\mathcal{M}\), and so there is no guarantee that the network will provide \(\boldsymbol{\tau}={\bf t}\). In fact, because \(\mathcal{P}({\bf t}\vert {\bf d})=\delta({\bf t}-\boldsymbol{\tau}({\bf d}))\), it is extremely unlikely to ever find \(\boldsymbol{\tau}={\bf t}\). For extremely simple architectures it may be possible to prove that, at the global maximum likelihood estimates of the weights, \(\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha})\equiv\mathcal{M}\), but unfortunately, such simple networks are much less likely to contain the exact representation of \(\mathcal{M}\). Therefore, one can only prove \(\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha})\equiv\mathcal{M}\) in the limit of infinite data. This is because, in the limit of infinite training data and infinite validation data, we can assume (but not know) that a network could be found (via optimising the hyperparameters over the space of all possible architectures, activation functions, initial conditions of the weights, etc.) which has the capability to exactly reproduce the model \(\mathcal{M} : {\bf d}\to{\bf t}\) by finding the true global maximum of the weights over the space of all possible weights in all possible architectures.</p>
<p>An interesting point to make, especially for regression to model parameters, is that one attempts to use the neural network to find a mapping from a many-to-one value space since the same \({\bf t}\) could produce a very large number of different \({\bf d}\), i.e. the forward model is stochastic. It is an extremely difficult procedure to undo stochastic processes, which is why the neural network will likely never achieve the target function.</p>
<!--### Using MCDropout
MCDropout is a simple extension to traditionally trained neural networks where a probabilistic binary mask $$\boldsymbol{m}$$ is applied to every weight $$\boldsymbol{w}$$ of the network, $$\boldsymbol{w}\to\boldsymbol{mw}$$. The mask can take a value of 0 or 1 given a binomial distribution where a _keep_ value determines what proportion of weights are set to zero. Training is performed in the traditional way where the weights are _dropped_ randomly, which essentially samples some extremely small subset of the hyperparameter space $$\boldsymbol{\alpha}$$, i.e. the subset whose global maximum .
MCDropout is a technique which is often said to approximate Bayesian neural networks. However, we can see this cannot be true since the weights, $$\boldsymbol{w}$$, are fixed and only a very small prior space of $$\boldsymbol{\alpha}$$ is subsampled. In practice, the weights for each subnetwork in the dropped network will not be in even a local maximum of the likelihood for that subnetwork and, as such, not only are the $$\boldsymbol{\tau}$$ not equal to the true targets $${\bf t}$$ given data $${\bf d}$$ but it is likely that they are very far away. In particular, it is very common to obtain spurious modes of certainty for some set of subnetworks.-->
<h2 id="variational-inference-using-approximate-weight-priors">Variational inference using approximate weight priors</h2>
<p class="figure"><img src="/assets/posts/nn/VB.svg" alt="Variational inference network" />
<em>A neural network can be trained via variational inference where parameters of the network predict the parameters of a variational distribution from which the weights for the forward propagation are drawn.</em></p>
<p>All of the problems with the traditional picture arise due to degenerating the probability of the weights and hyperparameters \(\mathcal{P}(\boldsymbol{w}, \boldsymbol{\alpha})\to\delta(\boldsymbol{w}-\boldsymbol{w}^\textrm{local MLE}, \boldsymbol{\alpha}-\boldsymbol{\alpha}^*)\). We can recover variational inference by assuming the posterior distribution of the weights becomes an approximate variational distribution, \(\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha}, \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})\), which approximates the posterior of \(\boldsymbol{w}\) given a secondary set of network parameters which define the shape of the variational distribution, \(\boldsymbol{v}\), a set of hyperparameters, \(\boldsymbol{\alpha}\), and a set of training data and target pairs, \(\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\). The posterior predictive density for the targets \({\bf t}\) is then written</p>
\[\mathcal{P}({\bf t}\vert {\bf d})=\int d\boldsymbol{w}d\boldsymbol{v}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d},\boldsymbol{w},\boldsymbol{\alpha})\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha}, \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})p(\boldsymbol{v},\boldsymbol{\alpha}).\]
<p>In practice, the parameters controlling the shape of the variational distribution, \(\boldsymbol{v}\), and the hyperparameters, \(\boldsymbol{\alpha}\), are optimised iteratively using a training and validation set, as in the traditional training framework, and as such the posterior predictive density becomes</p>
\[\begin{align}
\mathcal{P}({\bf t}\vert {\bf d})&=\int d\boldsymbol{w}d\boldsymbol{v}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d},\boldsymbol{w},\boldsymbol{\alpha})\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha}, \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})\\
&\phantom{=hello}\times\delta(\boldsymbol{v}-\boldsymbol{v}^\textrm{local MLE}, \boldsymbol{\alpha}-\boldsymbol{\alpha}^*)\\
&=\int d\boldsymbol{w}~\mathcal{L}({\bf t}\vert {\bf d},\boldsymbol{w}, \boldsymbol{\alpha}^*)\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}^\textrm{local MLE}, \boldsymbol{\alpha}^*, \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}).
\end{align}\]
<p class="figure"><img src="/assets/posts/nn/vi_w.gif" alt="True vs. variational targets" />
<em>When the posterior distribution for the weights and hyperparameters of a neural network is approximated using a variational distribution, the posterior predictive density for the targets given some data has a form dictated mostly by the shape of the variational distribution. This shape is not necessarily correct since only simple distributions are usually used for the variational distribution and the distribution of weights can be extremely complex.</em></p>
<p>In principle, if \(\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}^\textrm{local MLE}, \boldsymbol{\alpha}^*, \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})\) well represents the true posterior of the weights and hyperparameters, \(\mathcal{P}(\boldsymbol{w}, \boldsymbol{\alpha})\), then this can be a good approximation. However, this is very dependent on the distributions which \(\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha})\) can represent. \(\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha})\) is normally chosen to be Gaussian, or perhaps a mixture of Gaussians. As discussed already, the likelihood of obtaining any set of weights, \(\boldsymbol{w}\), is actually extremely bumpy and degenerate and, as such, \(\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha})\) must be chosen to be able to properly represent this. If \(\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha})\) is poorly proposed then the posterior predictive density of the targets, \(\mathcal{P}({\bf t}\vert {\bf d})\), will be incorrect.</p>
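<p>A small numerical example shows how a poorly matched \(\mathcal{Q}\) misplaces probability. In the sketch below, the “true” posterior of a single weight is a hypothetical mixture of two well-separated Gaussians, and a single Gaussian is fitted to it by matching its mean and variance. Moment matching is used here only as a simple stand-in for the optimisation of \(\boldsymbol{v}\); it is not the procedure used to train a variational network.</p>

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def true_posterior(w):
    """Hypothetical bimodal posterior for one weight."""
    return 0.5 * gaussian_pdf(w, -2.0, 0.3) + 0.5 * gaussian_pdf(w, 2.0, 0.3)

w_grid = np.linspace(-6.0, 6.0, 4001)
dw = w_grid[1] - w_grid[0]
p = true_posterior(w_grid)

# Single-Gaussian approximation Q with the same mean and variance as p
q_mean = np.sum(w_grid * p) * dw
q_std = np.sqrt(np.sum((w_grid - q_mean) ** 2 * p) * dw)
q = gaussian_pdf(w_grid, q_mean, q_std)

# Probability assigned to the region between the two modes
between = np.abs(w_grid) < 1.0
mass_true = np.sum(p[between]) * dw
mass_q = np.sum(q[between]) * dw

print(mass_true, mass_q)
```

<p>The Gaussian places a large fraction of its probability between the modes, where the true posterior has almost none, so weights drawn from it would mostly be ones that the training data disfavour.</p>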
<p class="figure"><img src="/assets/posts/nn/wrong_variational_w.svg" alt="Poor variational distribution" />
<em>The variational distribution often does not have enough complexity to fully model the intricate nature of the true posterior distribution of weights and hyperparameters. This can cause variational inference to be misleading.</em></p>
<h2 id="bayesian-neural-networks">Bayesian neural networks</h2>
<p class="figure"><img src="/assets/posts/nn/Bayes.svg" alt="Bayesian neural network" />
<em>A Bayesian neural network is similar to a traditional one, except that the distribution of the weights (and hyperparameters) of the network is characterised by the posterior for the weights and hyperparameters given a set of training data.</em></p>
<p>An effective Bayesian neural network can be built if, rather than degenerating it to a Dirac \(\delta\), we keep the true posterior of \(\boldsymbol{w}\) and \(\boldsymbol{\alpha}\) given some training data,</p>
\[\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}\vert \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})\propto\prod_i^{n_\textrm{train} }\mathcal{L}({\bf t}^\textrm{train}_i\vert {\bf d}^\textrm{train}_i,\boldsymbol{w},\boldsymbol{\alpha})p(\boldsymbol{w},\boldsymbol{\alpha}).\]
<p>With this, the predictive probability density of \({\bf t}\) given \({\bf d}\) becomes</p>
\[\begin{align}
\mathcal{P}({\bf t}\vert {\bf d}) =&~\int d\boldsymbol{w}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w}, \boldsymbol{\alpha})\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}\vert \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})\\
\propto&~\int d\boldsymbol{w}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w}, \boldsymbol{\alpha})\prod_i^{n_\textrm{train} }\mathcal{L}({\bf t}^\textrm{train}_i\vert {\bf d}^\textrm{train}_i,\boldsymbol{w},\boldsymbol{\alpha})p(\boldsymbol{w},\boldsymbol{\alpha}).
\end{align}\]
<p>Obviously the Bayesian neural network comes at a much higher computational cost than just finding the maximum likelihood estimate for the weights, but it does provide a more reasoned posterior predictive probability density, \(\mathcal{P}({\bf t}\vert {\bf d})\). Notice that the prior, \(p(\boldsymbol{w},\boldsymbol{\alpha})\), still enters and so we need to make an informed decision on our belief for what the values of \(\boldsymbol{w}\) and \(\boldsymbol{\alpha}\) should be. However, for enough training data-target pairs (and enough time to sample through whatever chosen prior, \(p(\boldsymbol{w},\boldsymbol{\alpha})\)) the posterior \(\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}\vert \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})\) becomes informative enough to obtain useful posterior predictions for the targets.</p>
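<p>For a model with a single weight, the posterior and the posterior predictive density can be evaluated directly on a grid, which makes the two formulas above concrete. The sketch below assumes a hypothetical one-weight “network” \(\tau = w\,d\), a Gaussian likelihood of known width, a Gaussian prior and fixed hyperparameters; none of these choices comes from a real application.</p>

```python
import numpy as np

def log_gaussian(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(1)
sigma, w_true = 0.5, 2.0                       # assumed noise level and weight
d_train = np.linspace(-1.0, 1.0, 20)
t_train = w_true * d_train + rng.normal(scale=sigma, size=d_train.size)

w_grid = np.linspace(-6.0, 6.0, 2401)
dw = w_grid[1] - w_grid[0]

# ln posterior = sum_i ln L(t_i | d_i, w) + ln p(w) + constant
log_like = log_gaussian(t_train[None, :],
                        w_grid[:, None] * d_train[None, :], sigma).sum(axis=1)
log_post = log_like + log_gaussian(w_grid, 0.0, 2.0)
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum() * dw              # normalise on the grid

# P(t | d) = int dw L(t | d, w) P(w | training set), for a new input d
d_new = 1.0
t_grid = np.linspace(-4.0, 8.0, 1201)
dt = t_grid[1] - t_grid[0]
like_new = np.exp(log_gaussian(t_grid[:, None], w_grid[None, :] * d_new, sigma))
predictive = (like_new * posterior[None, :]).sum(axis=1) * dw

pred_mean = np.sum(t_grid * predictive) * dt
pred_std = np.sqrt(np.sum((t_grid - pred_mean) ** 2 * predictive) * dt)
print(pred_mean, pred_std)
```

<p>The predictive density is wider than the likelihood alone because the remaining uncertainty on the weight is carried through the integral. A grid is only feasible for a handful of parameters, which is why sampling is needed for real networks.</p>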
<p class="figure"><img src="/assets/posts/nn/dd.gif" alt="Characterising the posterior" />
<em>For small numbers of data points, the likelihood is poorly characterised and so can lead to biasing in the posterior predictive density. It is therefore important to have enough data to properly know the likelihood - it is not easy to determine how much this is.</em></p>
<p>In effect, to make use of Bayesian neural networks, one has to resort to sampling techniques, such as Markov chain Monte Carlo, to describe \(\mathcal{P}({\bf t}\vert {\bf d})\). Because the number of weights is normally extremely large, techniques such as Metropolis-Hastings cannot be considered. We proposed using a second-order geometrical adaptation of Hamiltonian Monte Carlo (QN-HMC) in Charnock et al. 2019 (<a href="/method/machine%20learning/npe.html">read more</a>). By using such a sampling technique, one could generate samples for the posterior predictive density, \(\mathcal{P}({\bf t}\vert {\bf d})\), whose distribution describes the probability of getting a target \({\bf t}\) from data \({\bf d}\), marginalised over all network parameters \(\boldsymbol{w}\) given fixed hyperparameters, \(\boldsymbol{\alpha}=\boldsymbol{\alpha}^*\)<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. It is difficult to sample \(\boldsymbol{\alpha}\) when using the QN-HMC since gradients of the likelihood need to be computed and the likelihood in the \(\boldsymbol{\alpha}\) direction is often discrete. How to properly sample from \(\boldsymbol{\alpha}\) is still up for debate.</p>
<p>So now let’s say we have enough computational power to build a true Bayesian neural network. Are we guaranteed to obtain a correct posterior predictive density?</p>
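<p>To give an idea of how gradient-based sampling works, here is a minimal sketch of plain Hamiltonian Monte Carlo. It is not the QN-HMC variant mentioned above, and the target is a hypothetical ten-dimensional Gaussian standing in for a posterior over weights, chosen so that the output can be checked. The step size and trajectory length are hand-picked for this toy.</p>

```python
import numpy as np

def neg_log_post(w):
    """Toy target: independent Gaussians with mean 1 and std 0.5 (assumed)."""
    return 0.5 * np.sum(((w - 1.0) / 0.5) ** 2)

def grad_neg_log_post(w):
    return (w - 1.0) / 0.25

def hmc_step(w, rng, step_size=0.1, n_leapfrog=8):
    """One HMC update: draw a momentum, integrate, then accept or reject."""
    p = rng.normal(size=w.shape)
    h_old = neg_log_post(w) + 0.5 * np.sum(p ** 2)
    w_new, p_new = w.copy(), p.copy()
    # Leapfrog integration of Hamilton's equations
    p_new -= 0.5 * step_size * grad_neg_log_post(w_new)
    for i in range(n_leapfrog):
        w_new += step_size * p_new
        if i < n_leapfrog - 1:
            p_new -= step_size * grad_neg_log_post(w_new)
    p_new -= 0.5 * step_size * grad_neg_log_post(w_new)
    h_new = neg_log_post(w_new) + 0.5 * np.sum(p_new ** 2)
    # Metropolis test on the change in total energy
    if rng.uniform() < np.exp(min(0.0, h_old - h_new)):
        return w_new, True
    return w, False

rng = np.random.default_rng(0)
w = np.zeros(10)
samples, n_accepted = [], 0
for _ in range(2000):
    w, accepted = hmc_step(w, rng)
    samples.append(w.copy())
    n_accepted += accepted

samples = np.array(samples[200:])      # discard the first 200 steps as burn-in
acceptance = n_accepted / 2000
print(samples.mean(), samples.std(), acceptance)
```

<p>The gradient steers each proposal along the target distribution, so distant points are accepted with high probability even in many dimensions, whereas a random-walk proposal would need ever smaller steps as the dimension grows.</p>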
<h1 id="source-of-the-problem">Source of the problem</h1>
<h2 id="training-on-data">Training on data</h2>
<p>Notice how all of the techniques mentioned above are dependent on a set of training data and target pairs, \(\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\) (and possibly validation data and targets, \(\{ {\bf d}^\textrm{val}_i, {\bf t}^\textrm{val}_i\vert i\in[1, n_\textrm{val}]\}\)). It is in the posterior (or variational distribution) for the weights that the training data enter</p>
\[\mathcal{P}({\bf t}\vert {\bf d}) = \int d\boldsymbol{w}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w}, \boldsymbol{\alpha})\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}\vert \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})\]
<p>and, as already explained, the last term in the integral contains the informative part about the posterior predictive density. As such, any biasing due to \(\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\) greatly affects \(\mathcal{P}({\bf t}\vert {\bf d})\).</p>
<p>When depending on a training set, \(\mathcal{P}({\bf t}\vert {\bf d})\) is always unknowably biased until the limit of infinite data is reached. So, no method mentioned so far provides us with the correct probability of obtaining the target!</p>
<p>For networks such as emulators (or generative networks, as they are commonly called), the probability distribution of generating targets, \(\mathcal{P}({\bf t}\vert {\bf z})\), with generated data \({\bf t}\) and a latent distribution \({\bf z}\), should approximate the distribution of true data \(\mathcal{P}({\bf d})\). The above argument then means that we cannot find \(\mathcal{P}({\bf d})\) by training a neural network without infinite training data<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>
<h2 id="incorrect-models">Incorrect models</h2>
<p>One interesting use for neural networks is the prediction of physical model parameters, \(\boldsymbol{\theta}\), for a model \(\mathcal{M} : \boldsymbol{\theta}\to{\bf d}\). In this case, even for infinite data, we cannot obtain true posterior distributions for the parameters. Take a network which maps \(\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha}) : {\bf d}\to\hat{\boldsymbol{\theta}}\), where \(\hat{\boldsymbol{\theta}}\) are estimates of the model parameters, \(\boldsymbol{\theta}\), which generate the data. Even if there is infinite training data, \(\{ {\bf d}^\textrm{train}_i, \boldsymbol{\theta}^\textrm{train}_i\vert i\in[1,\infty]\}\), if the original model is incorrect, then the neural network will be conditioned on the wrong map from data, \({\bf d}\), to parameters, \(\boldsymbol{\theta}\). Any observed data, \({\bf d}^\textrm{obs}\), passed through the network will then be passed through an approximation of the incorrect model and provide a poor estimate of the incorrect model parameter values. This means that true posteriors on the model parameters can only be obtained with the exact model which generates the <em>observed</em> data <strong>and</strong> an infinite amount of training data from that model.</p>
<p><strong>This is not realistic!</strong></p>
<h1 id="solutions">Solutions</h1>
<p>We have so far built a description of how to obtain the probability of getting targets, \({\bf t}\), from data, \({\bf d}\), passed through a neural network… and unfortunately, we have learned that this probability is not possible to obtain.</p>
<p>There is still one problem where we can use neural networks safely despite all of the above. This is to do model parameter inference.</p>
<p>So far we have only considered a neural network as an approximation to a model \(\mathcal{M} : {\bf d}\to{\bf t}\). Now let's say we have a physical model, \(\mathcal{Z}(\boldsymbol{\iota}) : \boldsymbol{\theta}\to{\bf d}\), which generates the data, \({\bf d}\), from a set of model parameters, \(\boldsymbol{\theta}\), dependent on a set of initial conditions, \(\boldsymbol{\iota}\). We can then safely use a neural network, \(\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha}) : {\bf d}\to\boldsymbol{\tau}\), to infer the model parameters of some observed data, \({\bf d}^\textrm{obs}\). Note that we cannot use a network to predict model parameters directly \((\mathbb{NN} : {\bf d}\to\boldsymbol{\theta})\) due to all of the arguments above. Instead we need to set up a statistical inference framework which encompasses the neural network.</p>
<p>Charnock et al. 2019 and Charnock, Lavaux and Wandelt 2018 show two different methods to perform physical model parameter inference using neural networks, in a well justified way.</p>
<h2 id="writing-down-the-likelihood">Writing down the likelihood</h2>
<p>I should mention an extremely rare case where the model \(\mathcal{M} : {\bf d}\to{\bf t}\), is simple enough to be parameterised by an extremely simple network with very few parameters, which are non-degenerate and well behaved and for which the hyperparameters, \(\boldsymbol{\alpha}\), can be well designed to avoid needing to sample over this space.</p>
<p>For this case, the likelihood could be written down, and therefore fully established and sampled from, and biases from training data-target pairs could be totally avoided.</p>
<p><strong>It is pretty unlikely that such a network could be found without considering physical principles.</strong></p>
<h2 id="model-extension">Model extension</h2>
<p>In Charnock et al. 2019, the connection between the observed data and the output of the physical model is not known, i.e. the data from a model given initial conditions, \(\boldsymbol{\iota}\), is \(\mathcal{Z}(\boldsymbol{\iota}) : \boldsymbol{\theta}\to{\bf d}\). This \({\bf d}\) does not look like \({\bf d}^\textrm{obs}\), although we know that we want the posterior distribution \(\mathcal{P}(\boldsymbol{\theta}\vert {\bf d}^\textrm{obs})\). In Charnock et al. 2019, we know we can observe the universe and model the underlying dark matter of the universe, but the complex astrophysics which maps the dark matter of the universe to the observable tracers is unknown. We do, however, know some physical properties of this mapping. In this case, we build a neural network with the physically motivated symmetries to take the output of the physical model to the distribution which is as close to the observed data as possible (<a href="/method/machine%20learning/npe.html">read more</a>). In the language used previously, thanks to the problems we deal with in cosmology and astrophysics we can actually choose the hyperparameters of a neural network, \(\boldsymbol{\alpha}\), in a reasoned manner. These physically motivated neural networks therefore massively reduce the volume of the \(\boldsymbol{\alpha}\) domain. With a careful choice of \(\boldsymbol{\alpha}\) we can also build a network whose priors on the network parameters, \(\boldsymbol{w}\), can be (at least reasonably) well informed.</p>
<p>We can write the parameter inference as</p>
\[\mathcal{P}(\boldsymbol{\theta}\vert {\bf d}^\textrm{obs}) \propto \int d\boldsymbol{\iota}d\boldsymbol{w}d\boldsymbol{\alpha}~\mathcal{L}({\bf d}^\textrm{obs}\vert \boldsymbol{\iota},\boldsymbol{w},\boldsymbol{\alpha})\mathcal{P}(\boldsymbol{\iota}\vert \boldsymbol{\theta})p(\boldsymbol{w},\boldsymbol{\alpha})\]
<p>That is, the posterior distribution for the model parameters given some observed data is proportional to the marginal distribution of how likely the observed data is given the initial conditions of the model, \(\boldsymbol{\iota}\), which depend on the model parameters, \(\boldsymbol{\theta}\), which generate the initial conditions and evolve the model forward to the input of the neural network with network parameters and hyperparameters \(\boldsymbol{w}\) and \(\boldsymbol{\alpha}\).</p>
<p>In the case presented here, there is no training data for the network; instead, the data needed to obtain the posterior is part of the statistical framework. Therefore, the network provides non-agnostic posterior parameter inference because we do not learn the posterior distribution, \(\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha})\), using training data. In essence, this defines the procedure to perform zero-shot training.</p>
<p>It should be noted that this procedure is difficult. It necessitates a sampling scheme for the neural network and the physical model. In Charnock et al. 2019, we use an advanced Hamiltonian Monte Carlo sampling technique on a model where we have calculated the adjoint gradient and the neural network whose architecture is well informed but fixed.</p>
<h2 id="likelihood-free-inference">Likelihood-free inference</h2>
<p>The model extension method works well, but still depends on knowing the form of the likelihood of the observed data. In practice, this could be extremely difficult. It also depends on a choice of hyperparameters (or at least a well defined prior based on physical principles). In Charnock, Lavaux and Wandelt 2018, we showed another model extension method which allows us to obtain optimal model parameter inference using neural networks by (semi)-classically training a neural network, \(\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha}) : {\bf d}\to{\bf t}\), where the target distribution is the set of Gaussianly distributed summaries which maximise the Fisher information matrix. Although the network in this work is, in some way, optimal, the main point of this paper is that parameter inference can be done using likelihood-free inference by extending the physical model \(\mathcal{M} : \boldsymbol{\theta}\to{\bf d}\) to \(\mathcal{N} :\boldsymbol{\theta}\to{\bf t}\) where \({\bf t}\) is <em>any</em> set of summaries.</p>
<p>Likelihood-free inference is a framework where, via generating data using the physical model, \(\mathcal{M} : \boldsymbol{\theta}\to{\bf d}\), the joint probability of data and parameters, \(\mathcal{P}({\bf d},\boldsymbol{\theta})\), can be characterised. Once this space is well defined, a slice through the distribution at any \({\bf d}^\textrm{obs}\) gives the posterior distribution \(\mathcal{P}(\boldsymbol{\theta}\vert {\bf d}^\textrm{obs})\). Likewise, the slice through the joint distribution at any parameter \(\boldsymbol{\theta}^*\) gives the likelihood distribution \(\mathcal{L}({\bf d}\vert \boldsymbol{\theta^*})\). This works for any system where we can model the data!</p>
<p>The neural networks become essential as functions which perform data compression (although, it should be noted that any summary of the data will work). Since, in general, the dimensionality of the data is much larger than the number of model parameters, a neural network can be trained to compress the data in some way, \(\mathbb{NN}(\boldsymbol{w}^*,\boldsymbol{\alpha}^*) : {\bf d}\to{\bf t}\). We can train this in any way to give us some absolute summaries, \({\bf t}\), where we, essentially, do not care what the summaries are. Note that \(\boldsymbol{w}\) and \(\boldsymbol{\alpha}\) do not need to be maximum likelihood estimates. By pushing all the generated data from the physical model through this <em>fixed</em> network we can characterise the probability distribution of parameters and compressed summaries, \(\mathcal{P}({\bf t},\boldsymbol{\theta})\), which we can slice at any \(\boldsymbol{\theta}^*\) to give the likelihood of obtaining any summaries, \(\mathcal{L}({\bf t}\vert \boldsymbol{\theta}^*)\), or (more interestingly) slice at any observed data pushed through the network, \(\mathbb{NN}(\boldsymbol{w}^*, \boldsymbol{\alpha}^*) : {\bf d}^\textrm{obs}\to{\bf t}^\textrm{obs}\), to get the posterior,</p>
\[\mathcal{P}(\boldsymbol{\theta}\vert {\bf t}^\textrm{obs})=\mathcal{P}(\boldsymbol{\theta}\vert {\bf d}^\textrm{obs},\boldsymbol{w}^*,\boldsymbol{\alpha}^*).\]
<p>This posterior, whilst conditional on the network parameters and hyperparameters, is unbiased in the sense that when the neural network, \(\mathbb{NN}(\boldsymbol{w}^*,\boldsymbol{\alpha}^*) : {\bf d}\to{\bf t}\) is not optimal, the posterior can only become inflated (and not incorrectly biased).</p>
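<p>The slicing of \(\mathcal{P}({\bf t},\boldsymbol{\theta})\) at \({\bf t}^\textrm{obs}\) can be illustrated with the simplest likelihood-free scheme, rejection sampling. This is only a toy sketch and not the method of Charnock, Lavaux and Wandelt 2018: the simulator, the uniform prior, the tolerance <code>eps</code> and the use of a sample mean as a stand-in for the fixed network are all assumptions made for illustration.</p>

```python
import random

rng = random.Random(1)


def simulate(theta, n=50):
    """Hypothetical physical model M: theta -> d (n noisy draws around theta)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def compress(d):
    """Stand-in for the fixed network NN(w*, alpha*): d -> t (here the sample mean)."""
    return sum(d) / len(d)


def rejection_posterior(t_obs, n_sims=10000, eps=0.05):
    """Keep prior draws whose simulated summary lies within eps of t_obs."""
    accepted = []
    for _ in range(n_sims):
        theta = rng.uniform(0.0, 5.0)      # draw parameters from the prior
        t = compress(simulate(theta))      # push the simulation through the network
        if abs(t - t_obs) < eps:           # thin slice of P(t, theta) at t_obs
            accepted.append(theta)
    return accepted


d_obs = simulate(2.0)                      # pretend observation with theta = 2
t_obs = compress(d_obs)
samples = rejection_posterior(t_obs)
posterior_mean = sum(samples) / len(samples)
```

<p>Replacing <code>compress</code> with a less informative summary would widen the accepted set of parameters rather than shift it, which is the sense in which a non-optimal network inflates the posterior.</p>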
<p>The information maximising neural network, presented in Charnock, Lavaux and Wandelt 2018, provides the optimal summaries<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> for the likelihood-free inference, but any neural network can be used in this inference framework. In particular, any neural network which looks like it provides good estimates of the targets for a model \(\mathcal{M} : {\bf d}\to{\bf t}\) (as discussed throughout) will likely have extremely informative summaries, even if its outputs are extremely unlikely to be equal to the true target values (see traditionally training neural networks)!</p>
<h1 id="conclusions">Conclusions</h1>
<p>Presented here is a thorough statistical diagnostic of neural networks. I have shown that, by design, neural networks cannot provide realistic posterior predictive densities for arbitrary targets. This essentially makes all neural networks unusable in science.</p>
<p>However, I have presented how my previous work gets around this statement for model parameter inference. Since either a statistical interpretation or a fully trained neural network can be appended to a physical model, we can build a statistical framework around both the model and the neural network to allow us to do rigorous, scientific analysis of model parameters, which is one of the essential tasks in science today.</p>
<hr />
<p>Tom Charnock, Guilhem Lavaux, Benjamin D. Wandelt, Supranta Sarma Boruah, Jens Jasche, Michael J. Hudson, <b>2019</b>, submitted to MNRAS, arXiv:1909.06379</p>
<p>Tom Charnock, Guilhem Lavaux, Benjamin D. Wandelt, <b>2018</b>, Physical Review D 97, 083004, arXiv:1802.03537</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>It should be noted that the work in Charnock et al. 2019 was tackling a larger problem and asking a different question than the one stated here for Bayesian neural networks. Bayesian neural networks are a subset of the techniques from that paper, although closely linked. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>We can hope that the generated target distribution gets close to the true data distribution and decide we are not bothered about statistics anymore. Maybe a dangerous situation for science‽ <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Optimal in the sense that the Fisher information is maximised. This has some assumptions such as the unimodality (but not necessarily Gaussianity) of the posterior, and the fact that the neural network being maximised is capable of finding a function which Gaussianises the data. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Sat, 07 Dec 2019 00:00:00 +0200
https://www.aquila-consortium.org/method/machine%20learning/nn.html
Neural physical engines for inferring the halo mass distribution function
<p>To be able to make the most of the wealth of cosmological information available via observations of the large-scale structure of the universe, it is vital to have a strong model of how observable objects such as galaxies trace the underlying dark matter.
In this work we used a neural bias model: a physically motivated neural network from which we can infer the halo mass distribution function.
This function describes the abundance of halos with a certain mass given a dark matter density environment, where the halos are compact dark matter objects in which galaxies are hosted.
As such, the neural bias model gives us a strong, but agnostic, bias model mapping the dark matter density field to (tracers of) the observable universe.
Such a neural bias model can be included in the BORG inference scheme such that the initial conditions of the dark matter density and the parameters of the neural bias model are sampled using Hamiltonian Monte Carlo.</p>
<h1 id="halo-mass-distribution-function">Halo mass distribution function</h1>
<p>The halo mass distribution function describes the number of dark matter halos at a certain mass given a dark matter density environment.
It has been well studied in the past, and as such we know the approximate form of the function, which is described by the Press-Schechter formalism: a power law at small masses with an exponential cut-off at high masses.
There are also less well understood elements, including how the non-local density environment affects the abundance of halos and the form of the stochasticity with which halos are drawn from the halo mass distribution function.
This stochasticity describes how one obtains the actual number of observed halos of a certain mass given that the halo mass distribution function only describes the probability of observing such a halo.
The sampling of halos from the halo mass distribution function is normally assumed to be Poissonian, but this is known to be insufficient.
Whilst we consider a Poissonian likelihood in this work, it should be noted that it is Poisson for a field of summaries provided by a neural physical engine and so includes information from the local surrounding region.</p>
<h1 id="zero-shot-training-bayesian-neural-networks">Zero-shot training, Bayesian neural networks</h1>
<p>The neural network used in this work is not pre-trained and is conditioned on the observed data only, in this case a halo catalogue obtained from a high resolution dark matter simulation.
Zero-shot training describes a method of fitting a function without any training data.
Several components are necessary to be able to achieve such a fitting of the neural bias model introduced here.
These are: basing the design of the architecture of the network on physical principles; using appropriate functions to model the form of the halo mass distribution function; and finding a stable sampling procedure to obtain parameter samples from the posterior.</p>
<h2 id="neural-physical-engines">Neural physical engines</h2>
<p>Neural physical engines are simply neural networks that are built using physical principles.
For example, with a physical model of how some data is distributed according to the parameters of a model, one builds a neural network with the symmetries of such a model built into its architecture.
This is particularly useful for several reasons.
Primarily, such a neural physical engine is massively protected from overfitting.
Overfitting is prevented because only relevant information for the problem at hand is allowed to be fitted, and the network is insensitive to spurious features of the data, such as noise.
An added benefit to these networks is the massive reduction in the number of parameters necessary to fit the required function.
This improves the computational efficiency of the algorithm, decreases training times and increases the interpretability of the network.</p>
<p class="figure wide"><img src="/assets/posts/npe/NPE.svg" alt="Neural physical engine" />
<em>The neural physical engine is a physically motivated neural network which maps a dark matter density distribution, evolved by Lagrangian perturbation theory, to a set of summaries which are informative about the abundance of halos of a certain mass on the grid.</em></p>
<p>When building the neural bias model we construct a neural physical engine which takes a small patch of the gridded dark matter density field evolved from the initial conditions to today using Lagrangian perturbation theory as an input and outputs a single informative summary per voxel about the abundance of halos with a certain mass at that patch of the dark matter density field.
We know that the halo mass distribution function is only sensitive to local information, and at the resolution we are working at, mostly due to the amplitude of the dark matter density field rather than the exact position of structures such as filaments or nodes in the dark matter field.
We also know that the data is distributed evenly across the volume, i.e. there is translational and rotational invariance in the dark matter density field.
This encourages us to use parameterised three-dimensional convolutional kernels with an extent which is only as large as the relevant scales and where the parameters are shared within the kernels according to a radial symmetry.</p>
<p class="figure wide"><img src="/assets/posts/npe/kernels.svg" alt="Multipole expansion of convolutional kernel" />
<em>The convolutional kernels used in neural networks are discrete and gridded, with each element of the array being an independent trainable parameter.
We introduce a method by which we can expand the kernels in terms of multipoles by associating weights at equal distances (and at given rotational angles) from the centre of the kernel.
Take for example a 3x3x3 convolutional kernel.
Normally this would have 27 free parameters.
By looking at the radially symmetric kernel, i.e. ℓ=0, each corner has an associated weight, as does each edge and each face and there is a single weight for the central element, equating to a total of 4 free parameters.
Then in the case of the dipolar kernel, i.e. ℓ=1, there are three independent kernels each with 3 parameters, making a total of 9.
For ℓ=2, there are now 5 independent kernels with 2 parameters each and including ℓ=3 saturates the freedom of the convolutional kernel and so no further multipoles are needed to fully parameterise the general kernel.
We can use this expansion to either reduce the number of parameters necessary by truncating in multipoles, or we can learn more about the informational content of the data in terms of expansion in multipoles.
In the second case, once trained, one can look at the response of the data in independent multipole paths: the larger the response, the more informative that multipole is about the role of the data in the neural network.
The code for producing the multipole kernels can be found at <a href="https://github.com/tomcharnock/multipole_kernels">github:multipole_kernels</a>.</em></p>
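<p>The parameter counting for the ℓ=0 case can be checked with a few lines of plain Python. This is an illustrative sketch and not the API of the <code>multipole_kernels</code> package: the function name and its four arguments are hypothetical. Each of the 27 entries is assigned a weight according to its squared distance from the central element (0 for the centre, 1 for faces, 2 for edges, 3 for corners).</p>

```python
import itertools


def radial_kernel_3x3x3(w_centre, w_face, w_edge, w_corner):
    """Build a 3x3x3 kernel whose entries depend only on distance from the centre."""
    by_sq_distance = {0: w_centre, 1: w_face, 2: w_edge, 3: w_corner}
    kernel = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
    for i, j, k in itertools.product(range(3), repeat=3):
        sq_distance = (i - 1) ** 2 + (j - 1) ** 2 + (k - 1) ** 2
        kernel[i][j][k] = by_sq_distance[sq_distance]
    return kernel


kernel = radial_kernel_3x3x3(1.0, 0.5, 0.25, 0.125)

# Count how many of the 27 entries share each of the 4 free parameters.
flat = [value for plane in kernel for row in plane for value in row]
counts = {value: flat.count(value) for value in set(flat)}
```

<p>The counts come out as 1 centre, 6 faces, 12 edges and 8 corners, so the 27 entries are controlled by only 4 free parameters.</p>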
<p class="figure wide"><img src="/assets/posts/npe/receptive_field.svg" alt="Importance of receptive field" />
<em>The size of the convolutional kernel used is extremely important for a neural physical engine.
The size of the kernel is known as the receptive field, and dictates the size of the correlations which can be learned by the neural network. The receptive field should be chosen based on the data. If it is too small then it is impossible to learn about relevant features in the data, and the network will tend to average out even the small scale features since it cannot distinguish the large scale modes. Likewise, if the receptive field is too large then the kernel will be massively overparameterised, which can lead to overfitting and the fitting of spurious large scale features of the data. Since these large scale features are less common they are therefore less likely to be averaged out during training. This leads to a network which is difficult to train and has a much larger computational cost.
It should be noted that stacking convolutions leads to a larger receptive field throughout the network, but does not protect one from the above problems. The kernel size should be chosen carefully at each layer to make the most of the distribution of information at each layer independently (this can be very tricky to do).</em></p>
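<p>To make the growth of the receptive field under stacking concrete, here is a small sketch. It assumes stride-1 convolutions without dilation, in which case each layer with kernel size \(k\) extends the receptive field along an axis by \(k-1\) voxels. The function name is hypothetical.</p>

```python
def receptive_field(kernel_sizes):
    """Receptive field along one axis of stacked stride-1, undilated convolutions."""
    field = 1
    for k in kernel_sizes:
        field += k - 1   # each layer sees k - 1 additional voxels of the input
    return field


single_layer = receptive_field([3])
three_layers = receptive_field([3, 3, 3])
```

<p>Three stacked 3-voxel kernels therefore cover the same extent as one 7-voxel kernel, while using fewer parameters per axis.</p>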
<h2 id="neural-density-estimators">Neural density estimators</h2>
<p>Since we wish to model the halo mass distribution function we need to consider an architecture whose output is a function (or at least an evaluation of the function).
To do so we use a modified mixture density network which is a type of neural density estimator.
Neural density estimators are neural networks whose outputs are samples from a fitted probability distribution function.
For the halo mass distribution function we use a mixture of two Gaussian distributions where we allow the predicted amplitudes to be free positive parameters but organise the predicted mean parameters in order of magnitude.
This breaks the degeneracy between the two Gaussians and allows us to have a smooth function whose amplitude can accurately approximate the abundance of halos.</p>
<p class="figure wide"><img src="/assets/posts/npe/MDN.svg" alt="Mixture density network" />
<em>A mixture density network is a neural network which maps an input to a set of parameters for a collection of probability distributions. For example, one can predict the means, μ, standard deviations, σ, and amplitudes, α, of several Gaussian distributions and sum these Gaussians together. Provided that the amplitudes sum to 1, the mixture density will remain correctly normalised to be interpreted as a probability distribution. The mixture density network can then be trained by evaluating the value of the distribution at the labels for the input data and minimising the negative logarithm of the distribution.</em></p>
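<p>The training objective described in the caption can be sketched as follows. In a real mixture density network the amplitudes, means and standard deviations would be outputs of the network for each input; here they are fixed numbers chosen for illustration, and all function names are hypothetical.</p>

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_density(x, alphas, mus, sigmas):
    """Sum of Gaussians weighted by their amplitudes."""
    return sum(a * gaussian_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas))


def negative_log_likelihood(labels, alphas, mus, sigmas):
    """Loss to minimise: minus the log of the mixture evaluated at each label."""
    return -sum(math.log(mixture_density(x, alphas, mus, sigmas)) for x in labels)


labels = [-1.1, -0.9, 1.5, 2.0, 2.5]
alphas, sigmas = [0.3, 0.7], [0.5, 1.0]     # amplitudes sum to 1

loss_good = negative_log_likelihood(labels, alphas, [-1.0, 2.0], sigmas)
loss_bad = negative_log_likelihood(labels, alphas, [5.0, 8.0], sigmas)
```

<p>The loss is lower when the means sit on the labels. For the halo mass distribution function the amplitudes are left free rather than constrained to sum to 1, so the mixture then describes an abundance and not a normalised probability.</p>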
<h2 id="likelihood-mathcallboldsymbolthetaboldsymboldelta_textsflpt">Likelihood, \(\mathcal{L}(\boldsymbol{\theta}|\boldsymbol{\delta}_\textsf{LPT})\)</h2>
<p>To fit the halo mass distribution to the halo catalogue used in this work we consider a Poisson likelihood.
If our evolved dark matter density field, \(\boldsymbol{\delta}_\textsf{LPT}\), is passed through the neural physical engine, with parameters \(\boldsymbol{\theta}_\textsf{NPE}\), to get a field of summaries, \(\boldsymbol{\psi}_\textsf{NPE} = \boldsymbol{\psi}_\textsf{NPE}(\boldsymbol{\delta}_\textsf{LPT}, \boldsymbol{\theta}_\textsf{NPE})\), our halo mass distribution function is given by</p>
\[n(M|\boldsymbol{\psi}_\textsf{NPE}, \boldsymbol{\theta}_\textsf{MDN})= \sum_{i=1,2} \alpha(\boldsymbol{\psi}_\textsf{NPE}, \boldsymbol{\theta}_i)\mathcal{N}(M| \mu(\boldsymbol{\psi}_\textsf{NPE}, \boldsymbol{\theta}_i),\sigma(\boldsymbol{\psi}_\textsf{NPE}, \boldsymbol{\theta}_i)),\]
<p>where \(\mathcal{N}(M|\mu, \sigma)\) is the value of a Gaussian with mean \(\mu\) and standard deviation \(\sigma\) evaluated at halo mass \(\textsf{log}(M)\).
The Poisson likelihood can be written as two terms.
The first term evaluates the neural halo mass distribution function for every halo in the catalogue, where the density environment is obtained from the patch of \(\boldsymbol{\delta}_\textsf{LPT}\) around each voxel index corresponding to each halo.
This term therefore fits the abundance scale due to the catalogue.
The second term is the integral over halo mass of the whole function for the entire evolved density field and therefore fits the shape of the function.</p>
<p>Note that by using this likelihood we never have to explicitly make a stochastic sampling of the halos to compare to the catalogue, although we could use the fitted halo mass distribution function to generate halo catalogues by using the value of the evaluated neural bias model as the rate parameter for Poisson sampling.</p>
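<p>The two terms can be sketched numerically. The exact expression used in this work is not written out here, so the block assumes the standard log-likelihood of an inhomogeneous Poisson process: the sum of \(\ln n(M_h|\psi_h)\) over halos minus the integral of \(n\) over mass, summed over voxels. The toy form of <code>hmdf</code>, the mass range and all names are hypothetical.</p>

```python
import math


def hmdf(log_mass, psi):
    """Toy halo mass distribution function n(M | psi); amplitude grows with psi."""
    amplitude = math.exp(psi)
    mu, sigma = 12.5, 0.5
    z = (log_mass - mu) / sigma
    return amplitude * math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(f, lo, hi, n=2000):
    """Trapezoidal rule on n equal intervals."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, n))
    return total * h


def poisson_log_likelihood(halos, psi_field, lo=10.0, hi=15.0):
    """halos: list of (log_mass, voxel_index); psi_field: one summary per voxel."""
    # First term: the function evaluated at every halo in the catalogue.
    first = sum(math.log(hmdf(m, psi_field[v])) for m, v in halos)
    # Second term: integral over halo mass, summed over the whole field.
    second = sum(integrate(lambda m, p=psi: hmdf(m, p), lo, hi) for psi in psi_field)
    return first - second


psi_field = [0.0, 1.0]                      # two voxels, the second one denser
ll_dense = poisson_log_likelihood([(12.4, 1)], psi_field)
ll_sparse = poisson_log_likelihood([(12.4, 0)], psi_field)
```

<p>Placing the same halo in the denser voxel gives the higher likelihood, and no stochastic sampling of halos is needed to evaluate either value.</p>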
<p>We will also include a Gaussian prior, \(\pi(\boldsymbol{\theta})\), on all the parameters of the neural bias model.
We ensure that these weights and biases are centred on zero by rescaling them using prior knowledge of the amplitude of the abundance measured from the halo catalogue and the halo mass threshold.
Since the parameters of the neural bias model are centred on zero, we just need to choose a width for the Gaussian prior which is large enough to allow for parameter exploration, but tight enough to make sampling the parameters feasible.</p>
<h2 id="hmclet">HMCLET</h2>
<p>To be able to sample the weights of the neural bias model we use a modified Hamiltonian Monte Carlo.
Hamiltonian Monte Carlo is a way of efficiently drawing samples from very high-dimensional distributions.
One starts with an initial set of neural bias model parameters, \(\boldsymbol{\theta}_0\), and proposes a new set, \(\boldsymbol{\theta}^*\), given a momentum, \({\bf p}\), drawn from a proposal distribution, \({\bf p} \sim \mathcal{N}({\bf 0}, {\bf M})\).
\({\bf M}\) is a mass matrix which describes the time scale along each parameter direction and the correlation between the parameters.
One then solves Hamilton’s equations, \(d\boldsymbol{\theta}/dt = {\bf M}^{-1}{\bf p}\) and \(d{\bf p}/dt = -\nabla \mathcal{V}(\boldsymbol{\theta})\), where the Hamiltonian is described by \(\mathcal{H}(\boldsymbol{\theta}, {\bf p}) = \mathcal{V}(\boldsymbol{\theta}) + \mathcal{K}({\bf p})\), with \(\mathcal{V}(\boldsymbol{\theta}) = -\ln\mathcal{L}(\boldsymbol{\theta}|\boldsymbol{\delta}_\textsf{LPT}) - \ln\pi(\boldsymbol{\theta})\) as the potential energy formed from the likelihood and the prior and \(\mathcal{K}({\bf p}) = \frac{1}{2}{\bf p}^\textsf{T}{\bf M}^{-1}{\bf p}\) as a kinetic energy.
Proposed parameters are then accepted with a probability given by \(\alpha = \textsf{Min}[\textsf{exp}(-\Delta\mathcal{H}), 1]\), where \(\Delta\mathcal{H}\) is the difference between the energy at the proposed parameter values and the current parameter values.
If energy is conserved exactly, all proposals are accepted.
It is usual to use a symplectic integration scheme, such as the leapfrog algorithm (ϵ-discretisation), to solve these ODEs.</p>
<p class="figure wide"><img src="/assets/posts/npe/leapfrog.svg" alt="Leapfrog algorithm" />
<em>The leapfrog algorithm involves drawing a momentum from a proposal distribution, \({\bf p} \sim \mathcal{N}({\bf 0}, {\bf M})\), and taking a step of size \(\epsilon\) from the initial parameter positions \(\boldsymbol{\theta}_0\) according to \({\bf p} = {\bf p} - \epsilon\nabla \mathcal{V}(\boldsymbol{\theta}_0)/2\) giving \(\boldsymbol{\theta}_\textsf{next} = \boldsymbol{\theta}_0+\epsilon{\bf M}^{-1}{\bf p}\). This makes up the first half step in the leapfrog. The same procedure of updating \({\bf p}\) and \(\boldsymbol{\theta}\) is repeated for N steps, where the rest of the momentum steps are full (\({\bf p} = {\bf p}-\epsilon\nabla \mathcal{V}(\boldsymbol{\theta})\)). The last half step is then taken. The choice of \(\epsilon\) dictates the accuracy of the integration. If \(\epsilon\) is large then Hamilton’s equations are solved less accurately, which can lead to a larger energy error between the initial and proposed parameters and so increases the rejection rate. On the other hand, if \(\epsilon\) is small then more samples are accepted since the energy error is smaller, but this comes at a higher computational cost.</em></p>
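<p>A minimal one-dimensional sketch of Hamiltonian Monte Carlo with the leapfrog integrator is given here. It is not the HMCLET sampler: it assumes a scalar parameter, a scalar mass, and a toy potential \(\mathcal{V}(\theta)=\theta^2/2\) (a unit Gaussian target), and the function names and default step settings are hypothetical.</p>

```python
import math
import random


def leapfrog(theta, p, grad_v, eps, n_steps, mass=1.0):
    """Integrate Hamilton's equations with n_steps leapfrog steps of size eps."""
    p = p - 0.5 * eps * grad_v(theta)        # first half step in momentum
    for step in range(n_steps):
        theta = theta + eps * p / mass       # full step in position
        if step < n_steps - 1:
            p = p - eps * grad_v(theta)      # full step in momentum
    p = p - 0.5 * eps * grad_v(theta)        # last half step in momentum
    return theta, p


def hmc(v, grad_v, theta0, n_samples, eps=0.2, n_steps=10, mass=1.0, seed=0):
    """Sample from exp(-v(theta)); returns the chain and the acceptance rate."""
    rng = random.Random(seed)
    theta, samples, n_accepted = theta0, [], 0
    for _ in range(n_samples):
        p0 = rng.gauss(0.0, math.sqrt(mass))             # p ~ N(0, M)
        theta_new, p_new = leapfrog(theta, p0, grad_v, eps, n_steps, mass)
        h_old = v(theta) + 0.5 * p0 * p0 / mass
        h_new = v(theta_new) + 0.5 * p_new * p_new / mass
        # Accept with probability min[1, exp(-(h_new - h_old))].
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            theta, n_accepted = theta_new, n_accepted + 1
        samples.append(theta)
    return samples, n_accepted / n_samples


samples, acceptance = hmc(lambda t: 0.5 * t * t, lambda t: t,
                          theta0=3.0, n_samples=5000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

<p>Increasing <code>eps</code> in this sketch lowers the acceptance rate, while decreasing it raises the number of gradient evaluations needed to travel the same distance.</p>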
<p>Since neural networks are complex and in general have a large number of highly (or at least somewhat) degenerate parameters, it is very difficult to know the mass matrix <em>a priori</em>.
This means that extremely large steps can be made along the likelihood surface leading to numerical stability issues and improper sampling.
To overcome this, we can consider using the second order geometric information of the likelihood surface by calculating its Hessian using quasi-Newtonian methods.</p>
<p class="figure wide"><img src="/assets/posts/npe/second_order.svg" alt="Flat likelihood and second order geometric information" />
<em>The Hessian (\({\bf B}\)), i.e. the matrix of second derivatives, of the likelihood surface can be calculated using quasi-Newtonian methods. Quasi-Newtonian methods are root-finding algorithms where the Hessian (or Jacobian) is approximated. There are many ways to calculate the approximate Hessian; we use the BFGS method in this work. This method is convenient since it can be calculated for free as part of the leapfrog algorithm. When using the second order geometric information the ODEs become \(d\boldsymbol{\theta}/dt = {\bf B}{\bf M}^{-1}{\bf p}\) and \(d{\bf p}/dt = -{\bf B}\nabla \mathcal{V}(\boldsymbol{\theta})\). This means that, although the mass matrix is still needed to set the time scales along the parameter directions, the momenta get effectively rescaled by the Hessian, breaking parameter degeneracies and allowing for an efficient acceptance ratio.</em></p>
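<p>For reference, the textbook BFGS update of a Hessian approximation can be written in a few lines. Given a step \({\bf s}\) in parameters and the corresponding change \({\bf y}\) in the gradient, the updated matrix satisfies the secant condition \({\bf B}{\bf s}={\bf y}\). This sketch shows only the generic update on a toy quadratic. How it is coupled to the leapfrog steps, and whether the direct or inverse form is used in this work, is not specified here, and the helper names are hypothetical.</p>

```python
def dot(a, b):
    """Inner product of two vectors stored as lists."""
    return sum(x * y for x, y in zip(a, b))


def matvec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in m]


def bfgs_update(b, s, y):
    """BFGS update: B + y y^T / (y^T s) - (B s)(B s)^T / (s^T B s), for symmetric B."""
    bs = matvec(b, s)
    ys = dot(y, s)       # must be positive (curvature condition)
    sbs = dot(s, bs)
    n = len(s)
    return [[b[i][j] + y[i] * y[j] / ys - bs[i] * bs[j] / sbs for j in range(n)]
            for i in range(n)]


# Toy quadratic potential V = theta^T A theta / 2, whose gradient is A theta.
A = [[2.0, 0.5], [0.5, 1.0]]
s = [1.0, -0.5]                  # step taken in parameter space
y = matvec(A, s)                 # change in the gradient along that step
B = bfgs_update([[1.0, 0.0], [0.0, 1.0]], s, y)
```

<p>Starting from the identity, a single update already reproduces the true curvature along the direction of the step, using only gradients that the leapfrog integrator has to compute anyway.</p>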
<h1 id="results">Results</h1>
<p>With a neural bias model formed of a neural physical engine which is sensitive to non-local radial information, a neural density estimator to give us evaluations of sufficiently arbitrary functions, and a sampling scheme which can effectively explore the complex likelihood landscape, we can now infer the halo mass distribution function.</p>
<p class="figure wide"><img src="/assets/posts/npe/NBM_square.svg" alt="Outline of the BORG algorithm" />
<em>The BORG algorithm infers the initial conditions of the dark matter distribution. First the initial conditions are drawn from a prior given a cosmology to generate an initial dark matter density field. In this work, this dark matter density field is then evolved forward using Lagrangian perturbation theory to obtain the dark matter density field today. This is then passed through the neural physical engine to obtain an informative field of summaries about the abundance of dark matter halos on a grid. This can then be compared to the observed halo catalogue via the Poissonian likelihood, using the halo mass distribution function provided by the neural density estimator of the neural bias model. Evaluating this likelihood allows us to obtain posterior samples of all of the initial phases of the dark matter density distribution and all of the parameters of the neural bias model.</em></p>
<p>We use a halo catalogue constructed using Rockstar from a chunk of the VELMASS Ω dark matter simulation, which has a Planck-like cosmology. This catalogue has about 10,000 halos with a mass threshold of 2×10<sup>12</sup> solar masses.</p>
<p>As shown in the figures below, we are able to fit the halo mass distribution function extremely well, with sampling around the observed catalogue. Furthermore, the information used comes from the non-local region around each voxel in the gridded density field, showing that the surrounding area holds information about the abundance of halos.</p>
<p class="figure wide"><img src="/assets/posts/npe/hmdf.svg" alt="Halo mass distribution function" />
<em>The abundance of halos at a certain mass given a density environment from the VELMASS halo catalogue is plotted using the diamonds with dashed lines.
The more dense the environment, the more halos are expected at all masses.
The solid lines are the mean halo mass distribution function values from the neural bias model.
The filled areas are the 1σ intervals either side of the mean obtained by the samples from the Markov chain.
We can see that the fit is very good (even with the very simple model considered here), and that the shape of the function changes with density environment.
This shows that the neural bias model is able to account for the response of the density field.</em></p>
<p class="figure wide"><img src="/assets/posts/npe/3D_projections.svg" alt="3D projections of the field" />
<em>Here we see an example of an initial density field and the same field evolved using Lagrangian perturbation theory on the top row.
The bottom row shows the effect of the neural physical engine, which provides an enhancement in contrast and is a more informative summary of the abundance of halos than the LPT field.
This is because non-local information is gathered from the surrounding voxels by the neural physical engine.
The last box (bottom right) is the true halos from the VELMASS halo catalogue placed onto the same grid.
Note that the NPE field does not look like the halo distribution since a Poisson sampling of the halo mass distribution function is needed to get a stochastic realisation of the halo distribution.</em></p>
<h1 id="future-work">Future work</h1>
<p>The methods presented in this paper represent the state of the art in machine learning, and introduce new methods for dealing with the bias model in BORG and for generating halo catalogues from the neural bias model.
We will continue our work in two main directions.
The first is to look at bypassing the halos completely by learning the form of the likelihood using some form of neural density estimation (or neural flow) which would allow us to be more agnostic about the form of the likelihood.
This would mean that we could, in principle, marginalise out the effect of the ambiguity in the likelihood to provide robust constraints on the initial density phases and cosmology.
The second is to use architecture optimisation schemes to find a better fit to the halo mass distribution function for use in halo catalogue generation.</p>
<h1 id="references">References</h1>
<ul>
<li>Tom Charnock, Guilhem Lavaux, Benjamin D. Wandelt, Supranta Sarma Boruah, Jens Jasche, Michael J. Hudson, 2019, submitted to MNRAS, <a href="https://arxiv.org/abs/1909.06379">arXiv:1909.06379</a></li>
</ul>
Tue, 15 Oct 2019 00:00:00 +0300
https://www.aquila-consortium.org/method/machine%20learning/npe.html
method, machine learning
A fifth-force resolution of the Hubble tension
<h1 id="background">Background</h1>
<p>At least on large scales, the standard cosmological model suffers from just one \(>3\sigma\) inconsistency. This is the Hubble tension: while the present-day expansion rate \(H_0\) inferred from the Cosmic Microwave Background is \(67.4 \pm 0.5\) km s\(^{-1}\) Mpc\(^{-1}\), the value measured locally (by combining distance measurements to objects successively further away in a “cosmic distance ladder”) is \(74.03 \pm 1.42\) km s\(^{-1}\) Mpc\(^{-1}\). This discrepancy is \(4.4\sigma\), and appears to imply some form of new physics that invalidates direct comparison between low and high redshift probes of \(H_0\) within \(\Lambda\)CDM.</p>
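<p>The quoted significance can be checked with a few lines of Python. This sketch assumes that the two measurements are independent and have Gaussian errors, so that their uncertainties add in quadrature; the function name is illustrative.</p>

```python
import math

def tension_in_sigma(value_a, sigma_a, value_b, sigma_b):
    """Difference between two independent Gaussian measurements, in units of sigma."""
    return abs(value_a - value_b) / math.sqrt(sigma_a**2 + sigma_b**2)

# Values quoted in the text, in km/s/Mpc.
h0_cmb, sigma_cmb = 67.4, 0.5
h0_local, sigma_local = 74.03, 1.42

tension = tension_in_sigma(h0_local, sigma_local, h0_cmb, sigma_cmb)
print(f"Hubble tension: {tension:.1f} sigma")
```

<p>The combined uncertainty is dominated by the local measurement, which is why changes to the distance ladder calibration have such leverage on the tension.</p>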
<p>A key assumption in the local measurement of \(H_0\) is that the objects that calibrate the distance ladder – primarily Cepheid stars and Type Ia supernovae – have identical properties between successive rungs. But in a wide variety of beyond-\(\Lambda\)CDM cosmological models which invoke so-called “screened fifth forces”, this is likely not true. Rather, while the Cepheids in the Milky Way and NGC 4258 (whose distance is measured independently by means of a water maser) will be screened by the dense environments of their hosts, those at higher redshift that calibrate supernova absolute magnitudes will be unscreened and hence feel the full fifth force. This induces a bias in the Cepheid period–luminosity relation which causes the conventional analysis to underestimate the distance to extragalactic Cepheid hosts, and hence, at fixed redshift, to overestimate \(H_0\) (Figure 1). Thus, in such models, the expansion rate measured locally would be more in accord with that inferred from recombination.</p>
<p class="figure wide"><img src="/assets/posts/h0/Fig_1.png" alt="Figure 1" />
<em>Left panel: The rungs of the cosmic distance ladder and their typical screening status. Right panel: The Cepheid period–luminosity relation when various parts of a Cepheid are unscreened. Assuming unscreened Cepheids lie on the Newtonian relation underestimates their luminosity and hence their distance.</em></p>
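<p>The link between underestimated distances and an overestimated \(H_0\) follows from the Hubble law: at fixed recession velocity, \(H_0 \propto 1/d\). The Python sketch below illustrates only this scaling. It assumes that all calibrating distances shift by one common factor, which is a simplification made here for illustration; the function names are hypothetical and the analysis described in this post works host by host.</p>

```python
def rescaled_h0(h0_inferred, distance_ratio):
    """H0 after correcting distances, with distance_ratio = d_true / d_inferred.

    At fixed redshift the Hubble law gives H0 proportional to 1/d.
    """
    return h0_inferred / distance_ratio

def required_distance_ratio(h0_inferred, h0_target):
    """Common factor by which distances must grow to move H0 to the target value."""
    return h0_inferred / h0_target

# Central values quoted in the text, in km/s/Mpc.
ratio = required_distance_ratio(74.03, 67.4)
print(f"Distances would need to be larger by {100 * (ratio - 1):.1f}%")
print(f"Corrected H0: {rescaled_h0(74.03, ratio):.2f} km/s/Mpc")
```

<p>Matching the central values exactly is not required to relieve the tension, since both measurements carry uncertainties.</p>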
<h1 id="unscreening-the-cosmic-distance-ladder">Unscreening the cosmic distance ladder</h1>
<p>We have quantified these effects to flesh out this potential resolution to the Hubble tension.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> We began by formulating a set of observational proxies for the screening behaviour of Cepheids, which encompass well-studied screening mechanisms such as chameleon, k-mouflage and Vainshtein, a newly proposed mechanism based on interactions between baryons and dark matter,<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> and others described phenomenologically and not yet associated with an underlying theory. We then utilised the density field reconstruction of <a href="/method/borgpm.html">the BORG-PM model</a>, as encapsulated in the screening maps described in an earlier <a href="/method/observations/fifth_force.html">post</a>, to evaluate these proxies over the Cepheids used in the distance ladder, and hence calculate the change in distance to the Cepheid hosts that the action of a screened fifth force would imply.</p>
<p>The magnitude of the difference depends on the strength of the fifth force. We determined maximum viable values of this in our models by means of consistency tests within the distance ladder data. The most constraining test compares the distances to galaxies measured by both the Cepheid period–luminosity relation and the tip of the red giant branch: these distances are pushed in different directions by a fifth force, so their consistency imposes a limit on the force’s strength. This is shown in Fig. 2, as a function of the fraction of galaxies that are unscreened and separately for the cases in which Cepheid cores (governing luminosity) are unscreened, or only Cepheid envelopes (governing period). This test is the strongest of its kind, and is completely agnostic as to the nature or origin of the modification to gravity.</p>
<p class="figure wide"><img src="/assets/posts/h0/Fig_2.png" alt="Figure 2" />
<em>Constraints on fifth-force strength (relative to gravity) from comparing Cepheid and tip-of-the-red-giant-branch distances, as a function of the fraction of galaxies that are unscreened. Dashed lines indicate typical unscreened fractions in our models.</em></p>
<h1 id="15sigma-consistency-of-local-and-cmb-h_0">\(1.5\sigma\) consistency of local and CMB \(H_0\)</h1>
<p>Setting the screening threshold to ensure that the galaxies that calibrate the period–luminosity relation (N4258 and the MW) are screened, and imposing the bound on fifth-force strength from Fig. 2, we calculated the maximum reduction in the inferred \(H_0\) that each model could afford. Our results are shown in Fig. 3. While models that only unscreen Cepheid envelopes (right panel) can reduce the tension with Planck to \(\gtrsim2\sigma\), those that unscreen cores (among them the baryon–dark matter interaction model, a dark energy model that is otherwise very little constrained) can achieve \(1.5\sigma\) consistency. These results reveal another possible advantage to cosmologies with fifth forces, as well as demonstrating more generally that novel local resolutions of the \(H_0\) problem are possible.</p>
<p class="figure wide"><img src="/assets/posts/h0/Fig_3.png" alt="Figure 3" />
<em>Constraints on local \(H_0\) for each of our screening models. The most successful models reach 1.5\(\sigma\) consistency with the Planck result, a level at which statistical fluctuations may account for the remaining discrepancy.</em></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Harry Desmond, Bhuvnesh Jain, Jeremy Sakstein, 2019, <em>A local resolution of the Hubble tension: The impact of screened fifth forces on the cosmic distance ladder</em>, submitted to Phys. Rev. D., <a href="https://arxiv.org/pdf/1907.03778">arxiv 1907.03778</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>J. Sakstein, H. Desmond, B. Jain, 2019, <em>Screened Fifth Forces Mediated by Dark Matter–Baryon Interactions: Theory and Astrophysical Probes</em>, submitted to Phys. Rev. D., <a href="https://arxiv.org/pdf/1907.03775">arxiv 1907.03775</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Sat, 13 Jul 2019 00:00:00 +0300
https://www.aquila-consortium.org/method/observations/h0.html
method, observations
Algorithms for likelihood-free cosmological data analysis
<h1 id="overview">Overview</h1>
<p>The extraction of physical information from wide and deep astronomical surveys relies on statistical techniques to compare models and observations. A common scenario in cosmology is when we can generate synthetic data through forward simulations, but cannot explicitly formulate the likelihood of the model. The generative process can be extremely general (a noisy non-linear dynamical system involving an unrestricted number of latent variables) and is often computationally expensive. Likelihood-free inference (LFI) provides a framework for performing Bayesian inference in this context, by replacing likelihood calculations with data model evaluations. In its simplest form, LFI takes the form of likelihood-free rejection sampling (LFRS), which tends to be (i) extremely expensive, since many simulated data sets get rejected, and (ii) very limited in the number of parameters that can be treated.</p>
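<p>To make the rejection scheme concrete, here is a minimal Python sketch of likelihood-free rejection sampling on a toy problem. The Gaussian simulator, the uniform prior, the sample mean as data summary and the tolerance <code>epsilon</code> are all assumptions chosen for illustration; they stand in for an expensive cosmological black box.</p>

```python
import random

def simulator(theta, n_data, rng):
    """Toy black box: n_data Gaussian draws with mean theta and unit variance."""
    return [rng.gauss(theta, 1.0) for _ in range(n_data)]

def summary(data):
    """Data compression step: here simply the sample mean."""
    return sum(data) / len(data)

def rejection_sampling(observed_summary, n_draws, epsilon, rng, n_data=20):
    """Keep prior draws whose simulated summary lies within epsilon of the data."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)           # draw from the prior
        simulated = simulator(theta, n_data, rng)
        if abs(summary(simulated) - observed_summary) < epsilon:
            accepted.append(theta)               # approximate posterior sample
    return accepted

rng = random.Random(0)
observed = summary(simulator(1.0, 20, rng))      # mock observation, true theta = 1
samples = rejection_sampling(observed, n_draws=10000, epsilon=0.1, rng=rng)
acceptance_rate = len(samples) / 10000
```

<p>Only a small fraction of the simulations is accepted even in this one-parameter example, and the fraction falls quickly as parameters are added, which is the cost problem described above.</p>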
<p>In two recent articles, we presented methodological advances, aiming at fitting cosmological data with “black-box” numerical models. Each of them addresses one of the shortcomings of LFRS. The first approach, BOLFI, is intended for specific cosmological models (with \(n \lesssim 10\) parameters) and a general exploration of parameter space. It combines Gaussian process regression of the distance between observed and simulated data with Bayesian optimization. As a result, the number of required simulations is reduced by several orders of magnitude with respect to LFRS. The second approach, SELFI, allows the inference of \(n \gtrsim 100\) parameters (as is necessary for a model-independent parametrization of theory) while assuming stronger prior constraints in parameter space. It relies on a Taylor expansion of the simulator to build an effective posterior distribution. The resulting algorithm allows LFI in much higher-dimensional settings than LFRS.</p>
<h1 id="likelihood-free-inference-of-black-box-data-models">Likelihood-free inference of black-box data models</h1>
<p>Simulator-based statistical models are usually given in terms of numerical “black-boxes”. They provide realistic predictions for artificial observations when provided with all necessary input parameters. These consist of target parameters as well as nuisance parameters such as initial phases, noise realization, sample variance, etc. This “latent space” can often be hundred-to-multi-million dimensional. Once all input parameters are fixed, the black-box typically consists of a simulation step and a data compression step. Black-box models can be written in a hierarchical form and conveniently represented graphically (figure 1).</p>
<p class="figure"><img src="/assets/posts/lfi/black-box_bhm.png" alt="Hierarchical representation of a black-box data model" />
<em>Hierarchical representation of a typical black-box data model. The rounded green boxes represent probability distributions and the purple squares represent deterministic functions. For more details, see figure 1 in Leclercq et al. 2019.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">1</a></sup></em></p>
<p>The goal of LFI is to find suitable approximations that allow an estimation of the probability distribution of target parameters conditional on observed data summaries, using only black-box evaluations.</p>
<h1 id="bolfi-bayesian-optimization-for-likelihood-free-inference">BOLFI: Bayesian Optimization for Likelihood-Free Inference</h1>
<p>BOLFI (Bayesian Optimization for Likelihood-Free Inference<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">2</a></sup><sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">3</a></sup>) is a cutting-edge machine learning algorithm for LFI under the constraint of a very limited simulation budget (typically a few thousand), suitable when the problem has a sufficiently small number of target parameters (\(n \lesssim 10\)). Conventional approaches such as LFRS generally require too many simulations, due to their lack of knowledge about how the parameters affect the distance between observed and simulated data. As a response, BOLFI combines Gaussian process regression of this distance to build a surrogate surface with Bayesian Optimization to actively acquire training data (figure 2).</p>
<p class="figure wide"><img src="/assets/posts/lfi/bayesian_optimization.png" alt="Bayesian optimization" />
<em>Illustration of four consecutive steps of Bayesian optimization to learn a test function. For each step, the top panel shows the training data points (red dots) and the Gaussian process regression (blue line and shaded region). The bottom panel shows the acquisition function (solid green line). The next acquisition point, i.e. where to run a simulation to be added to the training set, is shown in orange. For more details, see figure 4 in Leclercq 2018.<sup id="fnref:2:1" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">3</a></sup></em></p>
<p>The target parameter space is explored efficiently and in all generality. We extended the method to use the optimal acquisition function for the purpose of minimizing the expected uncertainty in the approximate posterior density, in the parametric approach to likelihood approximation. As a result, the number of required simulations is typically reduced by two to three orders of magnitude, and the proposed acquisition function produces more accurate posterior approximations, as compared to LFRS.</p>
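<p>The interplay between Gaussian process regression and active acquisition can be sketched in a few lines of Python with NumPy. This is a toy illustration only: the one-dimensional distance function, the fixed kernel hyperparameters and the lower-confidence-bound acquisition rule are assumptions made here, and the acquisition rule in particular is a simple stand-in, not the acquisition function discussed above.</p>

```python
import numpy as np

def rbf(a, b, length=0.5):
    """Squared-exponential kernel with unit prior variance."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_predict(x_train, y_train, x_test, noise=1e-4):
    """Posterior mean and standard deviation of a GP with constant mean."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_train, x_test)
    mean_y = y_train.mean()
    mean = mean_y + Ks.T @ np.linalg.solve(K, y_train - mean_y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

def toy_distance(theta, theta_true=0.3):
    """Stand-in for the distance between simulated and observed data."""
    return (theta - theta_true) ** 2

def bayes_opt(n_init=3, n_iter=10, kappa=2.0):
    """Acquire simulations where the surrogate is low or still uncertain."""
    grid = np.linspace(-2.0, 2.0, 401)
    x = np.linspace(-2.0, 2.0, n_init)           # initial design
    y = toy_distance(x)
    for _ in range(n_iter):
        mean, std = gp_predict(x, y, grid)
        acquisition = mean - kappa * std          # lower confidence bound
        x_next = grid[np.argmin(acquisition)]     # next simulation to run
        x = np.append(x, x_next)
        y = np.append(y, toy_distance(x_next))
    return x, y

x, y = bayes_opt()
best_theta = x[np.argmin(y)]
```

<p>Each iteration costs a single simulator call, so the simulation budget is set directly by <code>n_init + n_iter</code> rather than by an acceptance rate.</p>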
<h1 id="selfi-simulator-expansion-for-likelihood-free-inference">SELFI: Simulator Expansion for Likelihood-Free Inference</h1>
<p>Another limitation of conventional approaches to LFI is their inability to scale with the number of target parameters. In order to address problems of high-dimensional inference from black-box data models, we introduced SELFI (Simulator Expansion for Likelihood-Free Inference<sup id="fnref:3:1" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">1</a></sup>). Our approach builds upon a novel effective likelihood and upon the linearization of the simulator around an expansion point in parameter space. The workload with SELFI consists of evaluating the covariance matrix and the gradient of data summaries at the expansion point (figure 3). Unlike the workload of likelihood-based Markov Chain Monte Carlo (MCMC) techniques and of BOLFI, it is fixed <em>a priori</em> and perfectly parallel.</p>
<p class="figure wide"><img src="/assets/posts/lfi/covariance_gradient.png" alt="Covariance and gradient of the black-box" />
<em>Covariance matrix (left) and gradient (right) of data summaries at the expansion point, evaluated through black-box realizations only. These are the only two ingredients necessary to apply SELFI. For more details, see figures 6 and 7 in Leclercq et al. 2019.<sup id="fnref:3:2" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">1</a></sup></em></p>
<p>The effective posterior of the target parameters is then obtained through simple “filter equations,” the form of which is analogous to a Wiener filter. SELFI allows the solution of inference tasks from black-box data models, in much higher dimension than conventional approaches to LFI.</p>
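<p>The Wiener-filter-like structure can be illustrated with the standard linear-Gaussian posterior, sketched below in Python with NumPy. The sketch assumes a Gaussian prior centred on the expansion point, a Gaussian effective likelihood and a simulator linearized around that point; the function and variable names are illustrative and the exact equations of the paper should be taken from the reference.</p>

```python
import numpy as np

def linear_gaussian_posterior(theta_0, f_0, grad_f0, C_0, S, phi_obs):
    """Posterior mean and covariance for a linearized black box.

    theta_0 : expansion point and prior mean, shape (n_params,)
    f_0     : mean data summaries at theta_0, shape (n_summaries,)
    grad_f0 : gradient of summaries at theta_0, shape (n_summaries, n_params)
    C_0     : covariance of summaries at theta_0
    S       : prior covariance of the parameters
    phi_obs : observed data summaries
    """
    C_inv = np.linalg.inv(C_0)
    S_inv = np.linalg.inv(S)
    # Posterior precision is the sum of data and prior precisions.
    Gamma = np.linalg.inv(grad_f0.T @ C_inv @ grad_f0 + S_inv)
    # The mean shifts from the expansion point towards the data.
    gamma = theta_0 + Gamma @ grad_f0.T @ C_inv @ (phi_obs - f_0)
    return gamma, Gamma

# One-parameter toy case: unit gradient, unit data and prior variances.
gamma, Gamma = linear_gaussian_posterior(
    theta_0=np.zeros(1), f_0=np.zeros(1), grad_f0=np.eye(1),
    C_0=np.eye(1), S=np.eye(1), phi_obs=np.array([2.0]),
)
```

<p>In the toy case the data and the prior are equally informative, so the posterior mean lands halfway between the expansion point and the observation and the variance is halved.</p>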
<h1 id="cosmological-applications-key-results">Cosmological applications: key results</h1>
<p>In respective papers, we presented the first applications of BOLFI and SELFI to cosmological data analysis.</p>
<h2 id="supernova-cosmology-with-bolfi">Supernova cosmology with BOLFI</h2>
<p>We applied BOLFI to the inference of cosmological parameters from the Joint Lightcurve Analysis (JLA) supernovae data. The model contains two cosmological parameters (the matter density of the Universe \(\Omega_m\) and the equation of state of dark energy \(w\)) and four nuisance parameters, which are marginalized over. The posterior contours obtained with MCMC, LFRS, and BOLFI are represented in figure 4.</p>
<p class="figure wide"><img src="/assets/posts/lfi/bolfi_jla.png" alt="Supernova cosmology with BOLFI" />
<em>Prior and posterior distributions for the joint inference of the matter density of the Universe, \(\Omega_m\), and the dark energy equation of state, \(w\), from the JLA supernovae data set. BOLFI (red posterior) reduces the number of necessary simulations by two orders of magnitude with respect to LFRS (green posterior) and three orders of magnitude with respect to MCMC (orange posterior). For more details, see figure 7 in Leclercq 2018.<sup id="fnref:2:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">3</a></sup></em></p>
<p>As can be observed, BOLFI is able to precisely recover the true posterior with as few as 6,000 simulations, which constitutes a reduction by two orders of magnitude with respect to LFRS and three orders of magnitude with respect to MCMC. This reduction in the number of required simulations accelerates the inference massively.</p>
<h2 id="primordial-power-spectrum-and-cosmological-parameters-inference-with-selfi">Primordial power spectrum and cosmological parameters inference with SELFI</h2>
<p>We applied SELFI to a realistic synthetic galaxy survey, with a data model accounting for physical structure formation and incomplete and noisy observations. This data model is provided by the publicly available <strong>Simbelmynë</strong> code, a hierarchical probabilistic simulator of galaxy survey data.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> Through this application, we showed that the use of non-linear numerical models allows the galaxy power spectrum to be fitted up to at least \(k_\mathrm{max} = 0.5~h/\mathrm{Mpc}\), which represents an increase by a factor of \(\sim 5\) in the number of modes used, with respect to traditional techniques. The result is an unbiased inference of the primordial power spectrum (living in \(n = 100\) dimensions) across the entire range of scales considered, including a high-fidelity reconstruction of baryon acoustic oscillations (figure 5).</p>
<p class="figure wide"><img src="/assets/posts/lfi/selfi_power_spectrum.png" alt="Primordial power spectrum reconstruction with SELFI" />
<em>Primordial power spectrum inference with SELFI from a realistic synthetic galaxy survey. In spite of survey complications which limit the information captured, the inference is unbiased and the signature of baryon acoustic oscillations is well reconstructed up to \(k \approx 0.3~h/\mathrm{Mpc}\), with 5 inferred acoustic peaks, a result that could be improved using more volume (this analysis uses \((1~\mathrm{Gpc}/h)^3\)). For more details, see figure 10 in Leclercq et al. 2019.<sup id="fnref:3:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">1</a></sup></em></p>
<p>The primordial power spectrum can be seen as a largely agnostic and model-independent parametrization of theory, relying only on weak assumptions (isotropy and Gaussianity). Using the linearized black-box, it can be easily translated <em>a posteriori</em> to constraints on specific cosmological models without (or with minimal) loss of information. For instance, constraints on the parameters of the standard cosmological model, for two different synthetic data realizations (with different input cosmologies, phase and noise realizations), are shown in figure 6.</p>
<p class="figure wide"><img src="/assets/posts/lfi/selfi_cosmology.png" alt="Inference of cosmological parameters with SELFI" />
<em>Cosmological parameter inference using a linearized black-box model of galaxy surveys. The prior is shown in blue, and the effective posteriors for two different data realizations are shown in red and purple.</em></p>
<p>We therefore obtain an unbiased and robust measurement of cosmological parameters.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:3" role="doc-endnote">
<p>F. Leclercq, W. Enzi, J. Jasche & A. Heavens 2019, <em>Primordial power spectrum and cosmology from black-box galaxy surveys</em>, MNRAS <strong>490</strong>, 4237 (2019), <a href="https://arxiv.org/pdf/1902.10149">arxiv:1902.10149</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:3:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a> <a href="#fnref:3:2" class="reversefootnote" role="doc-backlink">↩<sup>3</sup></a> <a href="#fnref:3:3" class="reversefootnote" role="doc-backlink">↩<sup>4</sup></a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p>M. U. Gutmann & J. Corander 2016, <em>Bayesian Optimization for Likelihood-Free Inference of Simulator-Based Statistical Models</em>, Journal of Machine Learning Research <strong>17</strong>, 1 (2016), <a href="https://arxiv.org/pdf/1501.03291">arxiv:1501.03291</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>F. Leclercq 2018, <em>Bayesian optimisation for likelihood-free cosmological inference</em>, Physical Review D <strong>98</strong>, 063511 (2018), <a href="https://arxiv.org/pdf/1805.07152">arxiv:1805.07152</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:2:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a> <a href="#fnref:2:2" class="reversefootnote" role="doc-backlink">↩<sup>3</sup></a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>The Simbelmynë code: <a href="http://simbelmyne.florent-leclercq.eu">homepage</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Thu, 25 Apr 2019 00:00:00 +0300
https://www.aquila-consortium.org/method/lfi.html
method
Painting halos from 3D dark matter fields
<h1 id="overview">Overview</h1>
<p>Investigating the formation and evolution of dark matter halos, as the key building blocks of cosmic
large-scale structure, is essential for constraining various cosmological models and further
understanding our Universe. The highly non-linear dynamics involved nevertheless renders this a
complex problem, with computationally costly simulations of gravitational structure formation
currently the only tool to compute the non-linear evolution from initial conditions, yielding mock
dark matter halo catalogues as the main output. However, running very large simulations of pure dark
matter to generate fake observations of the full Universe several times is not feasible, requiring a
large amount of memory and disk storage. A way to emulate such simulations, quickly and reliably,
would be of use to a wide community as a new method for data analysis and light cone production for
the next cosmological survey missions such as Euclid and the Large Synoptic Survey Telescope. In this
context, we employ a deep learning approach to construct an emulator to learn the mapping from dark
matter density to halo fields.</p>
<h1 id="halo-painting-network">Halo painting network</h1>
<p>Our physical mapping network is inspired by a recently proposed variant of generative models, known
as generative adversarial networks (GANs). In particular, we will use the key ideas in training WGANs,
i.e. GANs optimized using the Wasserstein distance, to ensure that our network is able to paint halos
well. A schematic of this Wasserstein mapping framework is provided in Fig. 1. Our generator is the
halo painting network whose role is to learn the underlying non-linear relationship between the input
3D density field and the corresponding halo count distribution. Our critic provides as output the
approximately learned Wasserstein distance between the real and predicted halo distributions.
Intuitively, this Wasserstein distance can be interpreted as the amount of work required to transform
a given probability distribution into the desired target distribution. This distance therefore
corresponds to the loss function that must be minimized to train the halo painting network.</p>
<p class="figure wide"><img src="/assets/posts/halo_painting/WGN_schematic.jpg" alt="Schematic representation of Wasserstein halo painting network" />
<em>Schematic representation of Wasserstein halo painting network implemented in this work.
The role of the generator is to learn the underlying non-linear relationship between the
input 3D density field and the corresponding halo count distribution. The difference
between the output of the critic for the real and predicted halo distributions is the
approximately learnt Wasserstein distance and is used as the loss function which must be
minimized to train the generator.</em></p>
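<p>The "amount of work" interpretation of the Wasserstein distance, and its estimate as a difference of critic outputs, can be illustrated with a one-dimensional toy example in Python. The sorting-based formula below is exact only for equal-size one-dimensional samples, and the linear <code>critic</code> is a hypothetical 1-Lipschitz function chosen by hand; the halo painting network instead learns its critic from 3D fields.</p>

```python
def wasserstein_1d(sample_a, sample_b):
    """Wasserstein-1 distance between two equal-size 1D samples.

    In one dimension the optimal transport plan matches sorted values,
    so the distance is the mean absolute shift between them.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must have the same size")
    a, b = sorted(sample_a), sorted(sample_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def critic_estimate(critic, real, predicted):
    """Difference between the mean critic outputs on real and predicted samples."""
    mean_real = sum(critic(x) for x in real) / len(real)
    mean_pred = sum(critic(x) for x in predicted) / len(predicted)
    return mean_real - mean_pred

# Toy halo counts: the prediction is the reference shifted down by one.
real_counts = [3.0, 1.0, 2.0]
predicted_counts = [0.0, 1.0, 2.0]

exact = wasserstein_1d(real_counts, predicted_counts)
estimate = critic_estimate(lambda x: x, real_counts, predicted_counts)
```

<p>For a pure shift the linear critic is optimal, so both numbers agree; in general a critic only provides a lower bound that tightens as it is trained.</p>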
<h1 id="remarkable-performance-of-halo-painting-emulator">Remarkable performance of halo painting emulator</h1>
<p>We showcased the performance of our halo painting model using quantitative diagnostics. As a preliminary
qualitative assessment, we performed a visual comparison. Fig. 2 depicts the reference and predicted
halo distributions. Qualitative agreement is impressive, implying that the halo painting network is
capable of mapping the complex structures of the cosmic web, such as halos, filaments and voids, to
the corresponding distribution of halo counts.</p>
<p class="figure wide"><img src="/assets/posts/halo_painting/visual_comparison_N500.jpg" alt="Visual comparison" />
<em>Prediction of 3D halo field by our halo painting model for a slice of depth \(\sim 100h^{-1}\) Mpc
and side length of \(\sim2000h^{-1}\) Mpc. A blind validation dataset is shown in the top right
panel, with the predicted halo count depicted below it. The corresponding second order Lagrangian Perturbation Theory (2LPT) density field is
displayed in the top left panel, with the difference between the reference and predicted halo
distributions depicted in the lower left panel. A visual comparison of the reference and predicted
halo count distributions indicates qualitatively the efficacy of our halo painting network.</em></p>
<h2 id="power-spectrum">Power spectrum</h2>
<p>For a quantitative assessment, the standard practice in cosmology is to use summary statistics.
These summary statistics provide a reliable metric to evaluate our halo painting network in
terms of its capacity to encode essential information. Assuming the cosmological density field
is approximately a Gaussian random field, as is the case on large scales or at early times,
the power spectrum provides a sufficient description of the field. We therefore demonstrated
the capability of our network in reproducing the power spectrum of the reference halos. The left
panel of Fig. 3 illustrates the extremely close agreement of the 3D power spectra of the reference
and predicted halo fields.</p>
<p>We investigated the influence of the fiducial cosmology adopted for the simulations on the efficacy
of our halo mapping model. In the right panel of Fig. 3, we show the network predictions for two
cosmology variants in terms of their respective transfer functions, defined as the
square root of the ratio of the predicted to reference power spectra. The corresponding transfer
functions show a deviation of about \(10\%\) from the reference power spectra of their respective
real halo distributions on the smallest and largest scales. This shows that our halo painting model
is slightly sensitive to the underlying cosmology at the level of the power spectrum.</p>
<p class="figure wide"><img src="/assets/posts/halo_painting/Pk_cosmo_variation.jpg" alt="3D power spectra of reference and predicted halo fields" />
<em>Left panel: Summary statistics of the 3D power spectra of the reference and predicted halo fields
for one thousand randomly selected patches. The solid lines indicate their respective means, while
the shaded regions indicate their respective \(1\sigma\) confidence regions, i.e. 68% probability
volume. The above diagnostics demonstrate the ability of our halo painting model to reproduce the
characteristic statistics of the reference halo fields and therefore provide substantial
quantitative evidence for the performance of our neural network in mapping 3D density fields to
their corresponding halo distributions. Right panel: The corresponding transfer functions highlight
the consistency between the power spectra reconstructed from the predicted and real halo fields for
the three cosmology variants, with the deviation from their respective reference spectra being below
\(10\%\).</em></p>
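<p>The transfer function used in the right panel is simple to compute once binned power spectra are available. The following Python sketch assumes the two spectra are given as lists over the same \(k\) bins; the function name and the toy values are illustrative.</p>

```python
import math

def transfer_function(p_predicted, p_reference):
    """Square root of the ratio of predicted to reference power, per k bin."""
    return [math.sqrt(pp / pr) for pp, pr in zip(p_predicted, p_reference)]

# Toy binned power spectra (arbitrary units).
p_ref = [100.0, 50.0, 10.0]
p_pred = [100.0, 60.5, 8.1]

t_k = transfer_function(p_pred, p_ref)
deviation = [abs(t - 1.0) for t in t_k]
```

<p>A value of one means that the predicted field has the same power as the reference in that bin, so the deviation from one measures the fractional mismatch in amplitude.</p>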
<h2 id="bispectrum">Bispectrum</h2>
<p>The non-linear dynamics involved in gravitational evolution of cosmic structures contributes to a
certain degree of non-Gaussianity of the cosmic density field on the small scales. Higher-order
statistics are therefore required to characterize this non-Gaussian field. We used the bispectrum
to quantify the spatial distribution of the density and halo fields. The bispectra reconstructed
from the second order Lagrangian Perturbation Theory (2LPT), reference and predicted halo fields are displayed in Fig. 4. In particular, we show
the bispectra for given small- and large-scale configurations. The 2LPT halo field corresponds
to a statistical description of the halo distribution, derived from the 2LPT density field, which
is valid, by construction, at the level of two-point statistics and on large scales. This allows
us to make a fair comparison between the clustering of the respective halo fields. The left panels
of Fig. 4 demonstrate that our halo painting network reproduces the non-linear halo field on
both small and large scales, and is therefore capable of mapping the complex cosmic structures
apparent in the reference halo field. Our network predictions also show a significant improvement
over the corresponding 2LPT halo fields. In the right panels of Fig. 4, we find that our network
depends more strongly on the fiducial cosmology at the level of higher-order statistics.</p>
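<p>For reference, the bispectrum is the Fourier-space three-point function of the overdensity
field. The notation below is introduced here for illustration and is the standard textbook
definition rather than a formula quoted from the paper: \(\delta(\mathbf{k})\) denotes the
Fourier-space overdensity and \(\delta_\mathrm{D}\) the Dirac delta distribution.</p>
\[
\langle \delta(\mathbf{k}_1)\,\delta(\mathbf{k}_2)\,\delta(\mathbf{k}_3) \rangle
= (2\pi)^3\, \delta_\mathrm{D}(\mathbf{k}_1 + \mathbf{k}_2 + \mathbf{k}_3)\, B(k_1, k_2, k_3)
\]
<p>The Dirac delta restricts the three wavevectors to closed triangles, so a configuration amounts
to a choice of triangle. For a Gaussian field with zero mean, \(B\) vanishes identically, which is
why a non-zero bispectrum probes the non-Gaussianity generated by non-linear evolution.</p>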
<p class="figure wide"><img src="/assets/posts/halo_painting/bispectrum_cosmo_variation.jpg" alt="3D bispectra of reference and predicted halo fields" />
<em>Left panels: Summary statistics of the 3D bispectra of the 2LPT, reference and predicted halo
fields for given small- and large-scale configurations, as indicated by their respective titles.
In both cases, there is a close agreement between the bispectra from the reference and predicted
halo distributions. Our network predictions are a significant improvement over the corresponding
2LPT halo fields. Right panels: Deviation from the 3D bispectra of the reference halo distributions
of the corresponding predictions for the two cosmology variants. The above bispectrum diagnostics
show that our network is more sensitive to the fiducial cosmology at the level of the bispectrum than at that of the power spectrum.
The \(1\sigma\) confidence regions for five hundred randomly selected patches are depicted in each panel.</em></p>
<h1 id="key-advantages">Key advantages</h1>
<ul>
<li>Extremely efficient once trained. Our emulator rapidly predicts simulated halo
distributions from a computationally cheap cosmic density field. For instance, the network
prediction for a \(256^3\) simulation requires roughly one second on an NVIDIA Quadro P6000.</li>
<li>Can predict the 3D halo distribution for any arbitrary simulation box size. A large simulation box,
therefore, does not require tiling of smaller sub-elements. More importantly, this implies that our
neural network can be trained on smaller simulations and subsequently used to predict large halo
distributions.</li>
<li>Encodes mass information of halos, such that our method can predict the mass distribution of halos.</li>
<li>Allows us to bypass ad hoc galaxy bias models and work in terms of better understood models.</li>
</ul>
<h1 id="potential-applications">Potential applications</h1>
<ul>
<li>Fast generation of mock halo catalogues and light cone production. This would be useful for the data
analysis of upcoming large galaxy surveys of unprecedented sizes.</li>
<li>Filling in small-scale structure at high resolution from low-resolution large-scale simulations.</li>
<li>As a component in Bayesian forward modelling techniques for large-scale structure inference (cf. BORG)
or cosmological parameter inference (cf. ALTAIR) to accelerate the scientific process, rendering
detailed and high-resolution analyses feasible. This would provide statistically interpretable results,
while maintaining the scientific rigour.</li>
</ul>
<h1 id="references">References</h1>
<ul>
<li>D. Kodi Ramanah, T. Charnock &amp; G. Lavaux, 2019, submitted to PRD, <a href="https://arxiv.org/pdf/1903.10524">arxiv 1903.10524</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /></li>
<li>A notebook tutorial to paint the halos of the article: <a href="https://nbviewer.jupyter.org/github/doogesh/halo_painting/blob/master/wasserstein_halo_mapping_network.ipynb">notebook</a></li>
<li>Source code repository: <a href="https://github.com/doogesh/halo_painting">https://github.com/doogesh/halo_painting</a></li>
</ul>
Sun, 31 Mar 2019 00:00:00 +0200
https://www.aquila-consortium.org/method/halo-painting.html
Bayesian treatment of unknown foregrounds
<h1 id="overview">Overview</h1>
<p>To probe the Universe on cosmological scales, we employ large galaxy redshift
catalogues which encode the spatial distribution of galaxies. However, these galaxy
surveys are contaminated by various effects, such as dust,
stars and the atmosphere, commonly referred to as foregrounds. Conventional methods
for the treatment of such contaminations rely on a sufficiently precise estimate of
the map of expected foreground contaminants to account for them in the statistical
analysis. Such approaches exploit the fact that the sources and mechanisms involved
in the generation of these contaminants are well-known.</p>
<p>But how can we ensure robust cosmological inference from galaxy surveys if we are
facing as yet unknown foreground contaminations? In particular, the next generation
of surveys (e.g. <a href="https://www.euclid-ec.org/">Euclid</a>, <a href="https://www.lsst.org/">LSST</a>)
will not be limited by noise but by such systematic effects. We propose a novel
likelihood<sup id="fnref:K" role="doc-noteref"><a href="#fn:K" class="footnote" rel="footnote">1</a></sup> which accurately accounts for and corrects the effects of unknown foreground
contaminations. Robust likelihood approaches, as presented below, have a potentially
crucial role in optimizing the scientific returns of state-of-the-art surveys.</p>
<h1 id="robust-likelihood">Robust likelihood</h1>
<p>The underlying conceptual framework of our novel likelihood relies on the
marginalization of the unknown large-scale foreground contamination amplitudes. To
this end, we need to label voxels having the same foreground modulation and this is
encoded via a colour indexing scheme that groups the voxels into a collection of
angular patches. This requires the construction of a sky map which is divided into
regions of a given angular scale, with each region denoted by a specific colour, as
illustrated in Fig. 1 (a). The corresponding representation on a 3D grid results in
a 3D distribution of patches, with a given slice of the coloured grid depicted
in Fig. 1 (b). The collection of voxels belonging to a particular patch is employed
in the computation of the robust likelihood.</p>
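<p>The colouring step can be sketched as follows. This is a minimal illustration and not the
scheme used in the paper: it assumes that patches are defined by a regular longitude/latitude
binning of the sky as seen from a chosen observer position, and the function name
<code>colour_voxels</code> and all of its parameters are hypothetical. Every voxel along a given
line of sight receives the colour of the sky patch it projects onto.</p>

```python
import numpy as np


def colour_voxels(n_grid, box_size, observer, n_lon=8, n_lat=4):
    """Colour each voxel of an n_grid^3 box by the sky patch it projects onto.

    Assumption for illustration: patches are equal-angle (not equal-area)
    longitude/latitude bins centred on `observer`.
    """
    # Voxel-centre coordinates relative to the observer.
    centres = (np.arange(n_grid) + 0.5) * box_size / n_grid
    x, y, z = np.meshgrid(*(centres - o for o in observer), indexing="ij")
    r = np.sqrt(x**2 + y**2 + z**2)

    # Angular position of each voxel as seen by the observer.
    lon = np.arctan2(y, x) % (2.0 * np.pi)            # [0, 2*pi)
    safe_r = np.where(r > 0, r, 1.0)                  # avoid 0/0 at the observer
    lat = np.arcsin(np.clip(z / safe_r, -1.0, 1.0))   # [-pi/2, pi/2]

    # Bin the angles; the radial coordinate is ignored, which extrudes the
    # 2D sky colouring along the line of sight.
    i_lon = np.minimum((lon / (2.0 * np.pi) * n_lon).astype(int), n_lon - 1)
    i_lat = np.minimum(((lat + np.pi / 2.0) / np.pi * n_lat).astype(int), n_lat - 1)
    return i_lat * n_lon + i_lon
```

<p>In this sketch the angular scale of the patches is controlled by <code>n_lon</code> and
<code>n_lat</code>. An equal-area sky pixelization would be a natural replacement for the simple
binning used here.</p>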
<p>Our proposed data model is conceptually straightforward and provides a maximally
ignorant approach to deal with unknown systematics, with the colouring scheme being
independent of any prior foreground information. As such, the numerical implementation
of our novel likelihood is generic and does not require any adjustments to the other
components in the forward modelling framework of BORG (Bayesian Origin Reconstruction
from Galaxies) for the inference of non-linear cosmic structures.</p>
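<p>To make the marginalization concrete, the sketch below works through one simple case. It
assumes a Poisson data model in which the expected counts \(\lambda_i\) of all voxels in patch
\(\alpha\) are multiplied by a single unknown amplitude \(A_\alpha\), with a prior proportional to
\(1/A_\alpha\). Both are assumptions made here for illustration; the paper gives the actual
derivation. Under these assumptions, integrating out \(A_\alpha\) leaves
\(\sum_\alpha \sum_{i \in \alpha} N_i \ln(\lambda_i / \sum_{j \in \alpha} \lambda_j)\), up to terms
independent of \(\lambda\). The function name <code>robust_log_likelihood</code> and the array
layout are hypothetical.</p>

```python
import numpy as np


def robust_log_likelihood(counts, expected, colours):
    """Log-likelihood marginalized over one unknown amplitude per colour patch.

    Illustrative assumptions: Poisson counts, one multiplicative amplitude per
    patch with a 1/A prior. Terms independent of `expected` are dropped.
    """
    counts = np.asarray(counts, dtype=float).ravel()
    expected = np.asarray(expected, dtype=float).ravel()
    colours = np.asarray(colours).ravel()

    # Map arbitrary colour labels onto 0 .. n_patches - 1.
    _, patch = np.unique(colours, return_inverse=True)

    # Total expected count per patch: Lambda_alpha = sum of lambda_i in the patch.
    patch_total = np.bincount(patch, weights=expected)

    # Sum of N_i * ln(lambda_i / Lambda_alpha); empty voxels contribute zero.
    occupied = counts > 0
    ratio = expected[occupied] / patch_total[patch[occupied]]
    return float(np.sum(counts[occupied] * np.log(ratio)))
```

<p>Because only the ratio \(\lambda_i / \sum_{j \in \alpha} \lambda_j\) enters, rescaling the
expected counts of any patch by an arbitrary factor leaves the value unchanged. This is the sense
in which such a likelihood is insensitive to a patch-wide foreground modulation.</p>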
<p class="figure wide"><img src="/assets/posts/robust/colours.jpg" alt="Colour indexing scheme on the sphere" />
<em>(a) Schematic to illustrate the colour indexing of the survey elements. Colours are
assigned to patches of a given angular scale. (b) Slice through the 3D coloured box
resulting from the extrusion of the colour indexing scheme on the left panel onto a
3D grid. This collection of coloured patches is subsequently employed in the
computation of the robust likelihood.</em></p>
<h1 id="comparison-with-a-standard-poissonian-likelihood-analysis">Comparison with a standard Poissonian likelihood analysis</h1>
<p>We showcase the application of our robust likelihood to a mock data set with
significant foreground contaminations and evaluate its performance via a comparison
with an analysis employing a standard Poissonian likelihood, as typically used in
modern large-scale structure analyses. The results illustrated below clearly
demonstrate the efficacy of our proposed likelihood in robustly dealing with unknown
foreground contaminations for the inference of non-linearly evolved dark matter
density fields and the underlying cosmological power spectra from deep galaxy
redshift surveys.</p>
<h2 id="inferred-dark-matter-density-fields">Inferred dark matter density fields</h2>
<p>We first study the impact of the large-scale contamination on the inferred non-linearly
evolved density field. We compare the ensemble mean density fields and
corresponding standard deviations for the two Markov chains obtained using BORG with
the Poissonian and novel likelihoods, respectively, illustrated in the top and bottom
panels of Fig. 2, for a particular slice of the 3D density field. As can be deduced from
the top left panel of Fig. 2, the standard Poissonian analysis results in spurious
effects in the density field, particularly close to the boundaries of the survey since
these are the regions that are the most affected by the dust contamination. In contrast,
our novel likelihood analysis yields a homogeneous density distribution throughout the
entire observed domain, with the filamentary nature of the present-day density field
clearly seen. From this visual comparison, it is evident that our novel likelihood is
more robust against unknown large-scale contaminations.</p>
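<p>The ensemble mean and standard deviation referred to above are voxel-wise summaries over the
sampled density fields of a chain. A minimal sketch, assuming the samples are stacked in an array
of shape <code>(n_samples, nx, ny, nz)</code> and that a fixed number of initial samples is
discarded as burn-in (the function name and arguments are hypothetical):</p>

```python
import numpy as np


def ensemble_summary(samples, n_burn_in=0):
    """Voxel-wise mean and standard deviation over post-burn-in chain samples."""
    kept = np.asarray(samples, dtype=float)[n_burn_in:]
    if kept.shape[0] < 2:
        raise ValueError("need at least two post-burn-in samples")
    # ddof=1 gives the unbiased sample variance across the ensemble.
    return kept.mean(axis=0), kept.std(axis=0, ddof=1)
```

<p>Successive samples of a Markov chain are correlated, so the standard deviation computed this
way describes the spread of the sampled fields rather than the error on the mean.</p>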
<p class="figure wide"><img src="/assets/posts/robust/panels_density.png" alt="Inferred density fields" />
<em>Mean and estimated uncertainty of the non-linearly evolved density fields, computed
from the sampled realizations of the respective Markov chains obtained from both the
Poissonian (upper panels) and novel likelihood (lower panels) analyses, with the same
slice through the 3D fields being depicted. Unlike our robust data model, the standard
Poissonian analysis yields some artefacts in the reconstructed density field,
particularly near the survey boundary, where the foreground contamination is stronger.</em></p>
<h2 id="reconstructed-matter-power-spectra">Reconstructed matter power spectra</h2>
<p>From the realizations of our inferred 3D initial density field, we can reconstruct the
corresponding matter power spectra and compare them to the prior cosmological power
spectrum adopted for the mock generation. The top panels of Fig. 3 illustrate the
inferred power spectra for both likelihood analyses, with the bottom panels displaying
the ratio of the a posteriori power spectra to the prior power spectrum. While the
standard Poissonian analysis yields excessive power on the large scales due to the
artefacts in the inferred density field, the analysis with our novel likelihood allows
us to recover an unbiased power spectrum across the full range of Fourier modes.</p>
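<p>A power spectrum can be measured from each sampled field with a standard FFT-based estimator.
The sketch below is a generic version and not the BORG implementation: it assumes a periodic
cubic box, linear binning in \(k\) up to the Nyquist frequency, and no correction for the mass
assignment window. The function name <code>power_spectrum</code> and its parameters are
hypothetical.</p>

```python
import numpy as np


def power_spectrum(delta, box_size, n_bins=8):
    """Spherically averaged power spectrum of a periodic, cubic 3D field."""
    delta = np.asarray(delta, dtype=float)
    n = delta.shape[0]
    volume = box_size**3

    # Continuous-Fourier normalization: delta_k = (V / N^3) * FFT(delta).
    delta_k = np.fft.fftn(delta) * (volume / n**3)
    power = (np.abs(delta_k) ** 2 / volume).ravel()

    # Magnitude of the wavevector of every Fourier mode.
    k_axis = 2.0 * np.pi * np.fft.fftfreq(n, d=box_size / n)
    kx, ky, kz = np.meshgrid(k_axis, k_axis, k_axis, indexing="ij")
    k_mag = np.sqrt(kx**2 + ky**2 + kz**2).ravel()

    # Linear bins from just below the fundamental mode up to the Nyquist
    # frequency; the k = 0 mode falls outside the first bin and is excluded.
    k_fund = 2.0 * np.pi / box_size
    k_nyq = np.pi * n / box_size
    edges = np.linspace(0.5 * k_fund, k_nyq, n_bins + 1)
    which = np.digitize(k_mag, edges) - 1
    valid = (which >= 0) & (which < n_bins)

    # Average the mode power in each bin; empty bins are returned as NaN.
    n_modes = np.bincount(which[valid], minlength=n_bins)
    p_sum = np.bincount(which[valid], weights=power[valid], minlength=n_bins)
    p_mean = np.divide(p_sum, n_modes, out=np.full(n_bins, np.nan), where=n_modes > 0)
    k_centres = 0.5 * (edges[1:] + edges[:-1])
    return k_centres, p_mean
```

<p>The ratio to the prior spectrum is then the element-wise quotient of the measured spectrum of
each sample and the prior power spectrum evaluated at the same bin centres.</p>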
<p class="figure wide"><img src="/assets/posts/robust/Pk.jpg" alt="Reconstructed power spectra from likelihood analysis" />
<em>Reconstructed power spectra from the inferred initial conditions from the BORG analysis
for the robust likelihood (left panel) and the Poissonian likelihood (right panel).
The power spectra of the individual realizations, after the initial burn-in phase, from
the robust likelihood analysis possess the correct power across all scales considered,
demonstrating that the foregrounds have been properly accounted for. In contrast, the
standard Poissonian analysis exhibits spurious power artefacts due to the unknown
foreground contaminations, yielding excessive power on the large scales.</em></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:K" role="doc-endnote">
<p>N. Porqueres, D. Kodi Ramanah, J. Jasche, G. Lavaux, 2018, submitted to A&amp;A, <a href="https://arxiv.org/pdf/1812.05113">arxiv 1812.05113</a> <img class="inline-logo" src="/assets/images/arxiv.png" alt="arxiv" /> <a href="#fnref:K" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Fri, 14 Dec 2018 00:00:00 +0200
https://www.aquila-consortium.org/method/robust.html