The Aquila consortium

Constraining dark matter annihilation and decay in large-scale structures

Sun, 19 Jun 2022 00:00:00 +0300

Background

Dark matter makes up ~25% of the universe’s mass-energy content and is the key component in structure formation. Our knowledge of dark matter currently comes solely from its gravitational influence, but revealing its particle nature will require identifying its other interactions, both with itself and with standard model particles. Fig 1 shows the three processes that such interactions would allow: production of dark matter through collision of standard model particles, scattering between dark matter and standard model particles, and annihilation of dark matter into standard model particles. Each of these has an associated detection method: we could produce dark matter in particle colliders (“collider production”), see recoils of standard model particles due to incident dark matter (“direct detection”), or identify standard model particles produced by dark matter annihilations (“indirect detection”). A final possible interaction, included in “indirect detection” but involving only a single dark matter particle, is the spontaneous decay of dark matter into standard model particles.

Cartoon illustrating the ways in which dark matter may interact with Standard Model particles.

We are concerned here with dark matter annihilation and decay. Such signals are typically sought by identifying astrophysical objects whose kinematics indicate that they are particularly dark matter-rich, including the Milky Way, dwarf spheroidal galaxies (dSphs) in the Local Group and massive clusters further afield. One then targets these objects with telescopes sensitive to gamma-rays, which the standard model particles produced by dark matter would be expected to decay into. These searches have enabled us to rule out a thermal relic origin for WIMP dark matter at low mass, depending on the annihilation channel. Such methods, however, have two important disadvantages:

by focusing on specific objects they necessarily miss gamma-rays coming from the interactions of any dark matter not in the objects,
they are susceptible to systematic errors due to gamma-rays produced in those objects by non-dark matter, i.e. baryonic, means.

To avoid these potential pitfalls, we instead forward-model the gamma-ray flux from annihilation and decay pixel-by-pixel across the full sky using BORG-based models of the large-scale dark matter distribution. This enables a field-level inference of annihilation and decay rates on comparison with similarly all-sky data from the Fermi Large Area Telescope.¹

Predicting annihilation and decay fluxes

Our method leverages CSiBORG (also used in a previous blog post), a suite of 101 high-resolution N-body simulations with initial conditions spanning the posterior of the 2M++ BORG-PM chain. Each box provides a plausible realisation of the dark matter distribution out to ~200 Mpc, including the clumping of dark matter into halos. We use this to predict the annihilation (decay) flux that would be seen along each line of sight on the sky for a given annihilation cross-section (decay rate) and channel, which we then project onto a HEALPix grid to match the resolution of Fermi. This is shown in Fig 2, in terms of the “J-factor” (left) and “D-factor” (right) that describe the astrophysical contributions to the flux (i.e. without the particle physics terms). We assume a Poisson likelihood for the measured flux in each pixel and marginalise over the CSiBORG realisations, a model for substructure within each halo, and a set of templates that describe the contribution from non-dark matter sources. In this way we constrain the parameters of dark matter interactions using all nearby dark matter that is resolved by CSiBORG.

Full-sky Mollweide projection in galactic coordinates of the ensemble mean J and D factors over the CSiBORG realisations. These are proportional to the gamma-ray flux produced by dark matter annihilation and decay respectively.
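To make the per-pixel Poisson likelihood concrete, here is a minimal sketch in Python. It is an illustration only, not the pipeline used in the paper: the function name, the single `amplitude` parameter standing in for the particle physics terms, and the single `background` array standing in for the non-dark matter templates are all assumptions made for this example.

```python
import math

import numpy as np


def poisson_log_likelihood(counts, j_factor, amplitude, background):
    """Sum of per-pixel Poisson log-probabilities for a toy flux model.

    Assumed toy model (hypothetical): expected counts in each pixel are
    amplitude * j_factor + background, where `amplitude` absorbs the
    particle physics terms and `background` the non-dark matter templates.
    """
    counts = np.asarray(counts, dtype=float)
    expected = amplitude * np.asarray(j_factor, dtype=float) + np.asarray(
        background, dtype=float
    )
    # log(n!) for every pixel, computed via the log-gamma function
    log_factorial = np.array([math.lgamma(n + 1.0) for n in counts])
    return float(np.sum(counts * np.log(expected) - expected - log_factorial))


# Toy usage: four pixels whose counts match the model exactly at amplitude = 2
j_map = [1.0, 2.0, 3.0, 4.0]
bkg_map = [5.0, 5.0, 5.0, 5.0]
observed = [7, 9, 11, 13]
for trial_amplitude in (0.0, 2.0, 4.0):
    print(trial_amplitude, poisson_log_likelihood(observed, j_map, trial_amplitude, bkg_map))
```

In a full analysis this quantity would be evaluated for every CSiBORG realisation and energy bin and then marginalised over the nuisance parameters; the sketch only shows the innermost step.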

Bounding DM interactions

We find no evidence for enhanced gamma-ray flux tracing dark matter density squared, as would be expected in an annihilation model. This allows us to set constraints on the cross-section, which we show in Fig 3 (left) as a function of dark matter particle mass for a range of different annihilation channels. The grey dot-dashed line shows the thermal relic cross-section, which is the value needed to explain the current dark matter abundance through thermal freeze-out in the early universe (the standard production mechanism for WIMPs). Locations where bounds are below this indicate that the thermal relic scenario is ruled out. The black dashed line shows a previous constraint from cross-correlation of gamma-ray flux with the positions of low surface brightness galaxies, and the dotted line is from dSphs in the Local Group. The dSph constraints are much stronger than ours because dSphs are very close, which leads to a large predicted flux (they are not included in our analysis because they are below the CSiBORG resolution limit). However, they are sensitive to flux contributions from baryonic processes. In our approach, significant constraining power comes from regions largely devoid of baryons such as the filaments that connect halos.

The right panel of Fig 3 shows our constraint on the flux due to dark matter decays, separately for each of Fermi’s energy bins. That many are centred away from zero indicates that we do detect gamma-rays with a flux distribution across the sky that traces the dark matter density, as expected for decays. However, the spectrum of the signal is much more closely aligned with a power-law than the expectation from decay (pink vs red line), suggesting a more mundane, baryonic origin such as blazars.

In conclusion, we have used BORG to open yet another field – dark matter indirect detection – to full-sky, field-level Bayesian inference. In principle this allows all the information to be extracted from gamma-ray surveys and thus represents the most promising astrophysical method for uncovering the non-gravitational interactions of dark matter.

Left: \(2\sigma\) bounds on the dark matter annihilation cross-section for various channels (coloured lines). The grey dot-dashed line is the cross-section of a thermal relic WIMP, and the black dashed and dotted lines show literature constraints from galaxy cross-correlations and Local Group dwarf spheroidals respectively. Right: Flux contribution with the spatial distribution expected from dark matter decay in each Fermi energy bin. The red line is the spectrum expected from decays to \(b\bar{b}\), while the pink line, preferred by the data, is the best-fit power-law spectrum.

D. J. Bartlett, A. Kostic, H. Desmond, J. Jasche & G. Lavaux, 2022, Constraints on dark matter annihilation and decay from the large-scale structure of the nearby universe, Phys Rev D submitted, arXiv:2205.12916 ↩

Field-level inference on galaxy intrinsic alignment

Thu, 09 Dec 2021 00:00:00 +0200

Overview

A common assumption in weak lensing studies has been that galaxy shapes are on average uncorrelated. However, during the formation and evolution of galaxies, anisotropic stress exerted by the large-scale structure on galaxies can affect their shape. This process results in a coherent alignment of galaxy shapes with the large-scale tidal field, known as intrinsic alignment. Therefore, accurate inferences of its amplitude are of paramount importance, in order to avoid biasing cosmological conclusions drawn from weak lensing analyses. Due to the mechanism by which the effect arises, inferring the intrinsic alignment amplitude can constrain the response of galaxy shapes to external structures, in the context of galaxy formation. Further, it serves as a late-time cosmological probe, since galaxy shapes ultimately correlate with the large-scale dark matter density field. In particular for elliptical galaxies, the correlation between galaxy shapes and the large-scale tidal field can be modeled as a linear function. In our latest publication,¹ we constrain the linear alignment model using galaxy shapes from the LOWZ galaxy sample and tidal fields constrained with the LOWZ and CMASS samples of the SDSS-III BOSS survey.²

The linear alignment amplitude

As galaxy shapes are affected by collapsing structures, which carry information on the initial conditions of the Universe, the intrinsic alignment signal is expected to be scale-independent and persist up to linear scales. For this reason, we probe the intrinsic alignment amplitude as a function of scale. To this end, we filter the original tidal fields with a top-hat filter in Fourier space. As a result, at any given scale, we remove contributions from smaller scales to avoid contamination from unmodeled processes, due to the resolution limit of the tidal fields. In the figure below, we present our results on the linear alignment amplitude, \(A_I\), as a function of scale \(R\). The yellow window indicates scales smaller than the original resolution of the tidal fields.

Linear alignment amplitude as a function of scale. The blue and yellow windows represent one standard deviation and scales smaller than the inference resolution, respectively.
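The Fourier-space top-hat filtering described above can be sketched as follows. This is a minimal illustration on a cubic periodic grid, not the code used for the paper. The function name and the cut-off convention \(k_{\rm cut} = 1/R\) are assumptions; other conventions (for example \(\pi/R\)) are equally plausible.

```python
import numpy as np


def tophat_fourier_filter(field, box_size, smoothing_scale):
    """Remove all Fourier modes with |k| > 1 / smoothing_scale.

    `field` is a cubic 3D array sampled on a periodic box of side
    `box_size`. The cut-off convention k_cut = 1/R is an assumption
    made for this sketch.
    """
    n = field.shape[0]
    # Angular wavenumbers along one axis, in units of 1 / length
    k1d = 2.0 * np.pi * np.fft.fftfreq(n, d=box_size / n)
    kx, ky, kz = np.meshgrid(k1d, k1d, k1d, indexing="ij")
    k_mag = np.sqrt(kx**2 + ky**2 + kz**2)

    field_k = np.fft.fftn(field)
    field_k[k_mag > 1.0 / smoothing_scale] = 0.0  # sharp cut in k-space
    return np.fft.ifftn(field_k).real


# Toy usage: a long-wavelength mode plus a short-wavelength mode
n_grid, box = 16, 160.0
x = np.arange(n_grid) * (box / n_grid)
long_mode = np.cos(2.0 * np.pi * x / box)
short_mode = np.cos(2.0 * np.pi * 4.0 * x / box)
toy_field = (long_mode + short_mode)[:, None, None] * np.ones((n_grid, n_grid, n_grid))
smoothed = tophat_fourier_filter(toy_field, box, smoothing_scale=10.0)
```

With these toy numbers the short-wavelength mode lies above the cut-off and is removed, while the long-wavelength mode passes through unchanged. Applying the same filter to each component of a tidal field at a sequence of scales \(R\) would give the scale-dependent inputs discussed in the text.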

Although all scales are consistent with a constant amplitude, the signal clearly reaches a steady state at \(20h^{-1}\,\mathrm{Mpc}<R<50 h^{-1}\,\mathrm{Mpc}\). At \(R=20h^{-1}\,\mathrm{Mpc}\), we find \(4\sigma\) evidence of \(A_I=3.19\pm0.80\). The uncertainty is dominated by processes other than intrinsic alignment. Those processes are modeled as independent random samples from a Gaussian distribution. Since their amplitude is not known a priori, we constrain it jointly with the linear alignment amplitude. In Figure 2 below, we present our constraints on this random uncertainty component, \(\sigma\), and at \(R=20h^{-1}\,\mathrm{Mpc}\) we find \(\sigma=0.24\pm0.01\).

Root mean square random galaxy shape noise as a function of scale. The blue and yellow windows represent 1 standard deviation and scales smaller than the inference resolution, respectively.
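The joint inference of \(A_I\) and \(\sigma\) can be illustrated with a toy Gaussian likelihood. This is a sketch under simplifying assumptions, not the likelihood of the paper: here each observed shape component is taken to be `amplitude * tidal_shear` plus independent Gaussian noise of width `sigma`, and the sign and normalisation conventions of the real linear alignment model are ignored. All names are hypothetical.

```python
import math


def alignment_log_likelihood(shapes, tidal_shear, amplitude, sigma):
    """Gaussian log-likelihood for a toy linear alignment model.

    Assumed model (hypothetical): shape_i = amplitude * tidal_shear_i + noise_i,
    with noise_i drawn independently from a Gaussian of zero mean and
    standard deviation sigma.
    """
    log_like = 0.0
    for shape, shear in zip(shapes, tidal_shear):
        residual = shape - amplitude * shear
        log_like += -0.5 * (residual / sigma) ** 2
        log_like += -0.5 * math.log(2.0 * math.pi * sigma**2)
    return log_like


# Toy usage: noiseless shapes generated with amplitude = 3
toy_shear = [0.01, -0.02, 0.03]
toy_shapes = [3.0 * t for t in toy_shear]
print(alignment_log_likelihood(toy_shapes, toy_shear, amplitude=3.0, sigma=0.24))
print(alignment_log_likelihood(toy_shapes, toy_shear, amplitude=0.0, sigma=0.24))
```

Because `sigma` enters both the residual term and the normalisation term, sampling `amplitude` and `sigma` together lets the data determine the noise level rather than fixing it beforehand.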

Evolution with galaxy properties

Brighter galaxies have been found to align more strongly, whereas an evolution of the intrinsic alignment amplitude with redshift may point to different galaxy formation scenarios. At the same time, potential evolution with galaxy color is important to analyses considering a wide color range. We therefore split our galaxy sample into luminosity, redshift and color bins to study how the linear alignment amplitude scales with these properties. In the figure below, the three subplots from top to bottom show the evolution with luminosity, redshift and color, respectively.

a) The linear alignment amplitude as a function of smoothing scale for the brightest (L1) and faintest (L4) galaxy sub-sample. b) The linear alignment amplitude as a function of smoothing scale for two redshift sub-samples. Z1 covers the range 0.21 < z < 0.29 and Z2 the range 0.29 < z < 0.36. c) The linear alignment amplitude as a function of the smoothing scale for the bluest (C1) and reddest (C5) sub-sample.

We observe no significant correlation between luminosity and the linear alignment amplitude. Given the narrow redshift and color ranges of our galaxy sample, we further observe no correlation with these properties.

Outlook

Field-level approaches like the one we present here will allow improved modeling of both intrinsic alignment and weak lensing, on the basis of the data likelihood and the physics model. Ultimately, this work is a first step toward joint field-level inferences of intrinsic alignment and weak lensing,³ particularly at high redshifts.

Eleni Tsaprazi, Nhat-Minh Nguyen, Jens Jasche, Fabian Schmidt and Guilhem Lavaux, 2021, arXiv:2112.04484 ↩
G. Lavaux, J. Jasche and F. Leclercq, 2019, Systematic-free inference of the cosmic matter density field from SDSS3-BOSS data, arXiv:1909.06396 ↩
N. Porqueres, A. Heavens, D. Mortlock and G. Lavaux, MNRAS, 502 (2021), 3035–3044, arXiv:2011.07722 ↩

Is the speed of light energy dependent?

Sat, 06 Nov 2021 00:00:00 +0200

Overview

A fundamental assumption in our current theories of the Universe is that photons always travel at the same speed, \(c\), independent of their energy. But this need not be true. If the photon had a non-zero rest mass, then lower energy photons would travel slower (\(v < c\)). Alternatively, it is expected that quantum fluctuations of spacetime at high energies in so-called quantum gravity (QG) theories would make spacetime appear “foamy”, and thus empty space would have an energy-dependent refractive index. Or perhaps photons of different energy couple to gravity with different strengths (and thus violate the weak equivalence principle), so that photons of different energy travel differently through a gravitational field. In any one of these cases, photons of different energies from a distant source would arrive at different times, even if they were emitted simultaneously. Since the expected time delay increases with distance travelled, by studying the energy-dependent arrival times (spectral lag) of photons from sources at high redshift, we can place tight constraints on the quantum gravity length scale, \(\ell_{\rm QG}\), the photon mass, \(m_\gamma\), or the different couplings of photons to gravity at different energy, \(\Delta \gamma\). The high redshifts and short durations of Gamma Ray Bursts (GRBs) are ideal for this, so this is what we consider here. For the majority of Gamma Ray Bursts, high energy photons are detected before lower energy photons, which is qualitatively the same as for a massive photon and some quantum gravity models, and could thus provide evidence for such theories.
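To get a feel for the size of the massive-photon effect, the sketch below computes the arrival-time difference between two photon energies. It assumes a static, non-expanding universe and keeps only the leading term in \((m_\gamma c^2/E)^2\), so that \(\Delta t \approx \frac{D}{2c}\,(m_\gamma c^2)^2\,(E_{\rm low}^{-2} - E_{\rm high}^{-2})\). The real analysis integrates over cosmological expansion, which this toy function ignores. The function name and the example numbers are hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0
METRES_PER_GPC = 3.0857e25


def massive_photon_lag(distance_m, photon_mass_ev, e_low_ev, e_high_ev):
    """Delay (in seconds) of the low-energy photon behind the high-energy one.

    First-order expansion in (m c^2 / E)^2, valid when the photon rest
    energy is tiny compared with both photon energies. Cosmological
    expansion is ignored (an assumption of this sketch).
    """
    travel_time_s = distance_m / SPEED_OF_LIGHT_M_PER_S
    return 0.5 * travel_time_s * photon_mass_ev**2 * (e_low_ev**-2 - e_high_ev**-2)


# Toy usage: a source 1 Gpc away, photon energies of 25 keV and 325 keV,
# and a photon rest energy of 4e-5 eV (illustrative values only)
lag_s = massive_photon_lag(METRES_PER_GPC, 4.0e-5, 25.0e3, 325.0e3)
print(f"low-energy photon arrives {lag_s:.3g} s later")
```

The expansion is used rather than the exact expression \(1/\sqrt{1-(m_\gamma c^2/E)^2}\) because, for masses this small, the exact form is indistinguishable from 1 in double-precision arithmetic.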

The gravitational time delay

To constrain the equivalence principle, we must be able to predict how long it takes a photon to travel through a gravitational field. The resulting time delay to a distant source depends on the gravitational potential along the path that it travels and thus depends on the direction in the sky. If one had knowledge of the true present-day matter field, then one could create maps of the expected time delay as a function of source position. In previous attempts to constrain equivalence-principle violation, the potential fluctuation \(\delta \phi\) was modelled as arising from one or a few isolated sources near the line of sight; however, the long range of gravity casts doubt on this point-mass approximation. Instead, we account fully for the contributions to the time delay from all mass in the non-linear cosmological density field. We derive the contribution from local structures using constrained density fields generated by the BORG reconstruction of SDSS-III/BOSS,¹ and combine this with an unconstrained contribution from distant sources to produce a Monte Carlo-based source-by-source forward model for the expected time delay. The ensemble mean of the resulting time delay fluctuation map is plotted in Figure 1, and is \(\sim 10^{11} {\rm \, s}\) for a source at \(z=0.1\).

Mollweide projection in equatorial coordinates of the ensemble mean of the time delay fluctuations at \(z=0.1\) from wavelengths resolved by the BORG reconstruction.

Forward modelling the time delays

We use a catalogue of 668 Gamma Ray Bursts from BATSE, since these have not only spectral lag data but also pseudoredshifts calculated using the spectral peak energy-peak luminosity relation. Propagating uncertainties on the pseudoredshifts, sky localisation and spectral parameters through Monte Carlo sampling, we produce source-by-source forward models for the likelihood of a time delay from quantum gravity, a photon mass or equivalence principle violation. However, these are not the only types of physics that can lead to spectral lags: lags may also be generated through intrinsic differences in the emission of photons of different wavelength at the source, through their propagation through the medium surrounding the Gamma Ray Burst, or through instrumental effects at the observer. Without a robust physical model for the time delays these lead to, we model them using a generic functional form (a sum of Gaussians) with free parameters that we marginalise over in constraining \(m_\gamma\), \(\ell_{\rm QG}\) and \(\Delta\gamma\). We vary the number of Gaussians used to describe these observational and astrophysical processes to find the best-fitting model to the data. Importantly, we find that our results are insensitive to this choice, a vital check that was often neglected in previous work. We compare our predicted time delays to the observed ones through an MCMC algorithm and thereby constrain \(m_\gamma\), \(\ell_{\rm QG}\) and \(\Delta\gamma\).
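The “sum of Gaussians” noise model can be sketched as a simple mixture density. This is a generic illustration; the parametrisation used in the papers may differ. In particular, the normalisation of the weights to unit sum and all names below are assumptions of this example.

```python
import math


def gaussian_mixture_pdf(lag, weights, means, widths):
    """Probability density of a time lag under a sum of Gaussians.

    The weights are normalised to sum to one (an assumption of this
    sketch), so the result is a proper probability density.
    """
    total_weight = sum(weights)
    density = 0.0
    for weight, mean, width in zip(weights, means, widths):
        norm = 1.0 / (width * math.sqrt(2.0 * math.pi))
        density += (weight / total_weight) * norm * math.exp(-0.5 * ((lag - mean) / width) ** 2)
    return density


def noise_log_likelihood(lags, weights, means, widths):
    """Log-likelihood of independent lags under the mixture noise model."""
    return sum(math.log(gaussian_mixture_pdf(lag, weights, means, widths)) for lag in lags)


# Toy usage: compare a one-component and a two-component description
toy_lags = [-0.4, 0.1, 0.3, 2.2, 2.5]
print(noise_log_likelihood(toy_lags, [1.0], [1.0], [1.5]))
print(noise_log_likelihood(toy_lags, [0.6, 0.4], [0.0, 2.3], [0.5, 0.5]))
```

Changing the lengths of `weights`, `means` and `widths` changes the number of Gaussians, which is the kind of variation used in the text to check that the constraints do not depend on this modelling choice.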

Is the speed of light energy dependent?

We find no evidence that the speed of light has an energy dependence. We constrain the photon mass to be \(m_\gamma < 4.0 \times 10^{-5} \, h \, {\rm eV}/c^2\) and the quantum gravity length scale to be \(\ell_{\rm QG} < 5.3 \times 10^{-18} \, h \, {\rm \, GeV^{-1}}\) at 95% confidence ². As shown in Figure 2, the quantum gravity constraint is the tightest from time delay studies which consider multiple Gamma Ray Bursts, and the constraint on \(m_\gamma\), although weaker than from using radio data, provides an independent constraint which is less sensitive to the effects of dispersion by electrons. We also place upper limits on an energy dependence of \(\gamma\) of \(\Delta \gamma < 2.1 \times 10^{-15}\) at \(1 \sigma\) confidence between photon energies of \(25 {\rm \, keV}\) and \(325 {\rm \, keV}\) ³. These constraints are 40 times tighter than literature results, illustrating the benefits of using complete mass distributions when studying non-local relativistic effects such as time delays.

So what can we say about quantum gravity, the photon mass and the equivalence principle? Through the use of simulation-based, Bayesian statistical forward-modelling techniques and the BORG algorithm, we have produced some of the tightest constraints on these theories to date, and have demonstrated that the results are robust to how one models other astrophysical and observational contributions to the observed signal. The quantum gravity length scale \(\ell_{\rm QG}\) is expected to be near the Planck length, which is approximately two orders of magnitude smaller than we are currently sensitive to, so we are yet to probe this regime. Detections of Gamma Ray Bursts at \(>100 {\rm \, GeV}\) should become routine in the future; with more, higher energy measurements one should begin to probe this scale, so there is the tantalising possibility of making the first detection of quantum gravity as these limits approach the Planck scale.

Lower limits on the quantum gravity energy scale (\(1 / \ell_{\rm QG}\)) from time delay studies which use multiple astrophysical sources. Our work provides the tightest constraint to date. The dashed vertical line is the Planck energy, and it is expected that the quantum gravity energy scale has approximately this value.

G. Lavaux, J. Jasche & F. Leclercq 2019, “Systematic-free inference of the cosmic matter density field from SDSS3-BOSS data”, arXiv:1909.06396 ↩
D.J. Bartlett, H. Desmond, P.G. Ferreira & J. Jasche 2021, “Constraints on quantum gravity and the photon mass from gamma ray bursts”, PRD accepted, arXiv:2109.07850 ↩
D.J. Bartlett, D. Bergsdal, H. Desmond, P.G. Ferreira & J. Jasche 2021, “Constraints on equivalence principle violation from gamma ray bursts”, PRD 104, 084025, arXiv:2106.15290 ↩

Testing gravity with the positions of supermassive black holes

Mon, 26 Oct 2020 00:00:00 +0200

Overview

Testing General Relativity on large scales is largely tantamount to searching for new fundamental interactions (“fifth forces”) between masses, mediated by dynamical fields beyond the metric tensor. An important competitor to General Relativity is galileon gravity, which introduces a new light scalar field with a Lagrangian that is symmetric under Galilean transformations. Historically the galileon has been a leading contender for explaining dark energy, but now it is viewed mainly as an archetype of “Vainshtein-screened” theories where the fifth force from the scalar field vanishes in high-density regions due to second derivative terms in the equation of motion. This behaviour arises in many theories beyond the Standard Model.

A key feature of the galileon is that it couples to nonrelativistic matter but not to gravitational binding energy, violating the strong equivalence principle. This means that black holes – the only purely gravitational objects – are entirely unaffected by the galileon, while the stars, gas and dark matter in galaxies feel the full fifth force. As illustrated in Figure 1, this causes the supermassive black holes at the centres of galaxies to lag behind the other galactic components in the direction of an external galileon field. We have used this effect in a recent article to place stringent constraints on the strength of a galileon coupling to matter.¹

Cartoon illustrating the formation of galaxy–black hole offsets under galileon gravity. The restoring force on the black hole due to its offset from the galaxy centre compensates for the fact that it doesn’t feel the galileon fifth force.

CSiBORG: Mapping the large-scale gravitational field

To make predictions for black hole positions in galileon gravity, we need to know the fifth-force field on a galaxy-by-galaxy basis. To do this we introduced CSiBORG (Constrained Simulations in BORG), a suite of ~100 RAMSES N-body simulations using initial conditions sampled from the posterior of the BORG-PM algorithm. CSiBORG gives an accurate picture of dark matter structures within \(\sim 250\) Mpc of the Milky Way with a mass resolution of \(4.4 \times 10^9 \text{M}_\odot\), including full propagation of the uncertainties in the initial conditions.

We use CSiBORG to map out the local galileon field in the linear, quasistatic approximation. Combined with a flexible model for halo structure, this allows us to calculate the expected galaxy–black hole offsets as a function of the galileon coupling coefficient and the radius within which the fifth force is suppressed by the Vainshtein mechanism, \(r_V\). We apply this to \(\sim 2000\) galaxies in which the offset has been measured by comparing optical images of galaxies to multi-wavelength observations of Active Galactic Nuclei. Marginalising over an empirical model describing astrophysical noise, we then use a Bayesian likelihood framework and MCMC algorithm to constrain the galileon parameters.

Constraining cosmological galileons

We find no evidence that black holes are offset from the centres of their hosts in the direction or with the relative magnitude expected from galileons. This allows us to place strong constraints on the strength of the galileon fifth force relative to gravity, \(\Delta G/G_N\). In the left panel of Figure 2 we show this constraint for four observational datasets as a function of \(r_V\): our final bound, driven by the largest sample, is \(\Delta G/G_N < 0.16\) at \(1\sigma\) confidence for \(r_V \lesssim \text{Gpc}\). In the right panel we translate this result to a constraint on the coupling coefficient \(\alpha\) as a function of the length scale that appears in the galileon action, known as the crossover scale \(r_c\). Figure 2 also shows previous constraints from Lunar Laser Ranging and the black hole in M87, as well as the expected relation between \(\alpha\) and \(r_c\) in DGP, a higher-dimensional modified gravity model that introduces galileons.

Enabled by BORG, ours is the first work to model a large-scale galileon field point-by-point in space. It is therefore the first to probe crossover scales as large as the observable universe, and the first to achieve statistically rigorous constraints. By supplementing our model with numerical solutions of the galileon equation of motion in the nonlinear regime it will be possible to push our bound to smaller \(r_c\), superseding the Lunar Laser Ranging result and ruling out the self-accelerating branch of DGP. More generally, a Monte Carlo-based forward-modelling approach calibrated against simulations and marginalised over noise holds great promise for precision tests of fundamental physics with galaxy survey datasets.

Left: \(1\sigma\) constraint on \(\Delta G/G_N\) as a function of average Vainshtein radius, \(r_V\), from four observational datasets. \(L_{eq}\) is the length scale at which the matter power spectrum turns over. Right: Constraint on the coupling of a cubic galileon to matter, \(\alpha\), as a function of the crossover scale, \(r_c\), from lunar laser ranging (LLR), the black hole at the centre of M87, and our work. Our test probes larger-\(r_c\) galileons than others because it models the full galileon field from large-scale structure.

D. J. Bartlett, H. Desmond & P. G. Ferreira, 2020, “Constraints on galileons from the positions of supermassive black holes”, Phys Rev D submitted, arXiv:2010.05811 ↩

Simulating the Universe on a mobile phone

Mon, 25 May 2020 00:00:00 +0300

Overview

There are about two trillion galaxies in the observable Universe, and the evolution of each of them is sensitive to the presence of all the others. Can we put this all into a computer, or even a mobile phone, to simulate the evolution of the Universe? In a recent paper, we introduced a perfectly parallel algorithm for cosmological simulations which addresses this question.

Modern cosmology relies on very large data sets to determine the content of our Universe, in particular the amounts of dark matter and dark energy. These large datasets include the positions and electromagnetic spectra of very distant galaxies, up to 20 billion light-years away. In the next decade, the Euclid mission¹ and the Vera Rubin observatory,² in particular, will obtain information on several billion galaxies.

Physical challenges

Making the link between our knowledge of physics, for example the equations that govern the evolution of dark matter and dark energy, and astronomical observations requires considerable computational resources. Indeed, the most recent observations cover huge volumes: of the order of that of a cube of 12 billion light-years side length. As the typical distance between two galaxies is only a few million light-years, we have to simulate around one trillion galaxies to reproduce the observations.

In addition, in order to follow the physics of the formation of these galaxies, the spatial resolution should be of the order of ten light-years. Ideally, simulations should therefore have a scale ratio (that is, the ratio between the largest and smallest physical lengths of the problem) close to a billion. No computer, existing or even under construction, can achieve such a goal.

In practice, it is therefore necessary to use approximate techniques, consisting in “populating” the large-scale structures of the Universe with fictitious (but realistic) galaxies. This approximation is further justified by the fact that the evolution of galaxies’ components, for example stars and interstellar gas, involves very fast phenomena in comparison to the global evolution of the cosmos. The use of fictitious galaxies still requires simulating the dynamics of the Universe with a scale ratio of around 4,000, which is just possible with today’s supercomputers.
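The two scale ratios quoted above follow from simple arithmetic, sketched here. The box size and the ten light-year resolution come from the text; the value of 3 million light-years for the typical galaxy separation is an assumption chosen from the quoted “few million light-years”.

```python
# Side length of the simulated cube, in light-years (from the text)
box_side_ly = 12e9

# Resolution needed to follow galaxy formation physics (from the text)
galaxy_formation_resolution_ly = 10.0

# Typical distance between galaxies; 3 million light-years is an
# assumed value within the "few million" quoted in the text
galaxy_separation_ly = 3e6

# Ideal case: resolve the physics of galaxy formation everywhere
ideal_scale_ratio = box_side_ly / galaxy_formation_resolution_ly

# Approximate case: resolve only the scale of individual (fictitious) galaxies
approximate_scale_ratio = box_side_ly / galaxy_separation_ly

print(f"ideal scale ratio:       {ideal_scale_ratio:.1e}")
print(f"approximate scale ratio: {approximate_scale_ratio:.0f}")
```

The first ratio is of order a billion, the second is about 4,000, which is the gap that populating structures with fictitious galaxies is meant to bridge.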

The problem of computational limits

Simulating the gravitational dynamics of the Universe is what physicists call an \(N\)-body problem. Although the equations to be solved are analytical, as in most cases in physics, solutions have no simple expressions and require numerical techniques as soon as \(N\) is larger than two. The direct numerical solution consists in explicitly calculating the interactions between all the pairs of bodies, also called “particles”. The computation of forces by direct summation was the favoured technique in cosmology at the beginning of the development of numerical simulations, in the 1970s. At present, it is mainly used for simulations of star clusters and galactic centres. The number of particles used in “direct summation” simulations is represented by green dots in figure 1, where the \(y\)-axis has a logarithmic scale.

Evolution of the number of particles used in \(N\)-body simulations as a function of year of publication. Different symbols and colours correspond to different methods used to compute gravitational dynamics (direct summation in green, advanced algorithms in orange). For comparison, Moore’s law concerning computer performance is represented by the black dotted line.

The direct summation method has a numerical cost which scales as \(N^2\), the number of pairs of particles considered. For this reason, in spite of improvements provided by hardware accelerators such as graphics processing units (GPUs), the number of particles used with this method cannot grow as quickly as in the famous “Moore’s Law”, which predicts a doubling of computer hardware performance every 18 months. Moore’s law was verified for about four decades (1965-2005), but as traditional hardware architectures are reaching their physical limit, the performance of individual compute cores attained a plateau around 2015 (see figure 2). Therefore, cosmological simulations cannot merely rely on processors becoming faster to reduce the computational time.
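The \(N^2\) cost is easy to see in a minimal direct-summation routine. The sketch below is a textbook illustration, not code from any production simulation; the function name, the softening length and the choice of units with \(G = 1\) are assumptions.

```python
import numpy as np


def direct_summation_accelerations(positions, masses, softening=1e-3, grav_const=1.0):
    """Gravitational acceleration on every particle by explicit pair summation.

    The double loop visits every ordered pair of particles, so the cost
    grows as N^2. `softening` avoids a divergence at zero separation
    (a common numerical device, assumed here).
    """
    positions = np.asarray(positions, dtype=float)
    masses = np.asarray(masses, dtype=float)
    n_particles = len(masses)
    accelerations = np.zeros_like(positions)
    for i in range(n_particles):
        for j in range(n_particles):
            if i == j:
                continue  # a particle exerts no force on itself
            separation = positions[j] - positions[i]
            distance_sq = separation @ separation + softening**2
            accelerations[i] += grav_const * masses[j] * separation / distance_sq**1.5
    return accelerations


# Toy usage: three particles in arbitrary units
toy_positions = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
toy_masses = [1.0, 2.0, 3.0]
print(direct_summation_accelerations(toy_positions, toy_masses))
```

Doubling the number of particles quadruples the number of loop iterations, which is why the more advanced algorithms mentioned in the text avoid computing every pair explicitly.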

Single-threaded floating point performance of CPUs as a function of time. Different brands and models are represented by different colours and symbols as indicated in the legend. This plot is based on adjusted SPECfp® results.³

In order to reduce the cost of simulations, most of the work in numerical cosmology since 1980 has consisted in improving algorithms. The aim was to circumvent the explicit calculation of all gravitational interactions between particles, especially for pairs which are the most distant in the volume to be simulated. These algorithmic developments have enabled a huge increase in the number of particles used in cosmological simulations (see the orange triangles in figure 1). In fact, since 1990, the increase in computational capacity in cosmology has been faster than Moore’s Law, with software improvements adding to the increase in computer performance (more details in this blog post).

In 2020, with the architectures of modern supercomputers, calculations are no longer limited by the number of operations that processors can perform in a given time, but by the inherent latencies in communications among the different processors involved in so-called “parallel” calculations. In these computational techniques, a large number of processors work together synchronously to perform calculations far too complex to be carried out on a conventional computer. The stagnation of performance due to communication latencies has been theorised in “Amdahl’s law” (see figure 3), named after the computer scientist who formulated it in 1967. It is now the main challenge for cosmological simulations: without improving the “degree of parallelism” of our algorithms, we will soon reach a technological plateau.

Amdahl’s law: theoretical speed-up in the execution of a program as a function of the number of processors executing it, for different values of the parallel fraction of the program (different lines). The speed-up is limited by the serial part of the program. For example, if 90% of the program can be parallelised, the theoretical maximum speed-up factor using a large number of processors would be 10.
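Amdahl’s law has a simple closed form: if a fraction \(p\) of a program can be parallelised over \(n\) processors, the speed-up is \(S(n) = 1 / \left[(1 - p) + p/n\right]\). A short sketch (the function name is our own choice) reproduces the example in the caption above:

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Theoretical speed-up from Amdahl's law.

    parallel_fraction: fraction of the run time that can be parallelised (0 to 1)
    n_processors: number of processors sharing the parallel part
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# With 90% of the program parallelised, the speed-up saturates below 10
for n in (1, 10, 100, 10_000, 1_000_000):
    print(n, round(amdahl_speedup(0.9, n), 3))
```

As the number of processors grows, the speed-up approaches \(1/(1-p)\), so only a program with essentially no serial part (a “perfectly parallel” one) keeps benefiting from additional processors.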

The sCOLA approach: divide and conquer

Let us go back to the physical problem to be solved: it is about simulating the gravitational dynamics of the Universe at different scales. At “small” scales, there are many objects that interact with each other: numerical simulations are required. But at “large” spatial scales, that is to say if we look at figure 4 from very far, not much happens during evolution (except for a linear increase of the amplitude of inhomogeneities). Despite this, with traditional simulation algorithms, the gravitational effect of all the particles on each other must be calculated, even if they are very far apart. It is expensive and almost useless, since most of gravitational evolution is correctly described by simple equations, which can be solved analytically without a computer.

Comparison between a traditional simulation (left panel) and a simulation using our new algorithm (right panel). In our approach, the volume of the simulation is a mosaic made of “tiles” calculated independently and whose edges are represented by dotted lines.

In order to minimise unnecessary numerical calculations, it is possible to use a hybrid simulation algorithm: analytical at large scales and numerical at small scales. The underlying idea, called spatial comoving Lagrangian acceleration (sCOLA⁴), is common in physics: it is a “change of frame of reference”. In this framework, large-scale dynamics is taken into account by the new frame of reference, while small-scale dynamics is solved numerically by the computer, using conventional calculations of the gravity field. Unfortunately, the most naive version of the sCOLA algorithm gives results that are too approximate to be usable. In our last publication,⁵ we modified sCOLA in order to improve its accuracy.

Furthermore, we have realised that this concept makes it possible to “divide and conquer”. Indeed, given a large volume to be simulated, sCOLA allows sub-volumes of smaller size to be simulated independently, without communication with neighbouring sub-volumes. Our approach therefore makes it possible to represent the Universe as a large mosaic: each of the “tiles” in figure 4 is a small simulation that a modest computer can solve, and the assembly of all the tiles gives the overall picture. This is what is called in computer science a “perfectly parallel” algorithm, unlike all cosmological simulation algorithms so far. Thanks to it, we have been able to obtain cosmological simulations at a satisfactory resolution, while remaining on a relatively modest computing facility (figure 5).
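The “mosaic” idea can be illustrated with a toy that is deliberately not sCOLA. In the sketch below a simple local averaging stencil stands in for the small simulation run on each tile, and each tile is handed a thin buffer of surrounding cells (an assumption made for this illustration) so that it never needs to talk to its neighbours. All function names are hypothetical.

```python
import numpy as np


def local_average(field):
    """Average each cell with its four nearest neighbours (periodic boundaries)."""
    return (field
            + np.roll(field, 1, axis=0) + np.roll(field, -1, axis=0)
            + np.roll(field, 1, axis=1) + np.roll(field, -1, axis=1)) / 5.0


def process_tile(padded_tile, buffer):
    """Stand-in for one independent tile computation; the buffer is discarded."""
    result = local_average(padded_tile)
    return result[buffer:-buffer, buffer:-buffer]


def tiled_average(field, n_tiles, buffer=1):
    """Compute `local_average` tile by tile, with no exchange between tiles."""
    size = field.shape[0]
    if field.shape != (size, size) or size % n_tiles != 0:
        raise ValueError("field must be square and divisible into n_tiles per side")
    tile = size // n_tiles
    padded = np.pad(field, buffer, mode="wrap")  # periodic copy of the edges
    out = np.empty_like(field)
    for i in range(n_tiles):
        for j in range(n_tiles):
            r, c = i * tile, j * tile
            # Tile plus its buffer, expressed in the padded array's coordinates.
            block = padded[r:r + tile + 2 * buffer, c:c + tile + 2 * buffer]
            out[r:r + tile, c:c + tile] = process_tile(block, buffer)
    return out
```

Each pass through the inner loop depends only on its own block, so the passes could be sent to different machines in any order, which is what “perfectly parallel” means. Gravity is long-range, unlike this stencil; in the approach described above it is the change of frame of reference that absorbs the large-scale part.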

Our perfectly parallel sCOLA algorithm has been implemented in the publicly available Simbelmynë code,⁶ where it is included in version 0.4.0 and later.

A GPU-based computer at the Institut d’Astrophysique de Paris. Its cost represents only a hundredth of that of a supercomputer at national computing facilities.

New hardware to simulate the Universe

This new algorithm is not limited to small computing facilities: it also opens up new ways of exploiting computing hardware. Ideally, each of the “tiles” could be small enough to fit in the “cache memory” of our computers, that is, the part of the memory that processors can access in the smallest amount of time. The resulting speed-up in communication would allow us to simulate the entire volume of the Universe extremely quickly, or even at a resolution never achieved so far.

Going further, we can even imagine that each of the simulations corresponding to a “tile” would be small enough that it can be run on a modern mobile phone! This parallelisation technique would be based on a platform such as Cosmology@Home⁷, which is dedicated to distributed collaborative computing. This platform is derived from the efforts initiated by SETI@Home⁸ for the search for extraterrestrial intelligence.

1. https://www.euclid-ec.org/
2. https://www.lsst.org/
3. http://spec.org/
4. S. Tassev, D. J. Eisenstein, B. D. Wandelt, M. Zaldarriaga, sCOLA: The N-body COLA Method Extended to the Spatial Domain (2015), arXiv:1502.07751
5. F. Leclercq, B. Faure, G. Lavaux, B. D. Wandelt, A. H. Jaffe, A. F. Heavens, W. J. Percival, C. Noûs, Perfectly parallel cosmological simulations using spatial comoving Lagrangian acceleration, A&A, in press (2020), arXiv:2003.04925
6. The Simbelmynë code: homepage
7. https://www.cosmologyathome.org/
8. https://setiathome.berkeley.edu/

Why neural networks don’t work and how to use them

Sat, 07 Dec 2019 00:00:00 +0200

Neural networks as universal model approximators

We can think of a neural network, \(\mathbb{NN}(\boldsymbol{w}, \boldsymbol{\alpha}) : {\bf d}\to\boldsymbol{\tau}\), as an approximation of a model, \(\mathcal{M} : {\bf d}\to{\bf t}\), where \({\bf d}\) is some input data to the network and the output of the network is \(\boldsymbol{\tau}\) which is an estimate of some target, \({\bf t}\), associated with the data. The neural network itself is a function of some trainable parameters called weights, \(\boldsymbol{w}\), and some hyperparameters, \(\boldsymbol{\alpha}\), which encompass the architecture of the network, the initial values of the weights, the form of activation functions, the choice of cost function, etc.

Likelihood of obtaining targets given a network

In a traditional sense, the training of a neural network is equivalent to minimising a cost or loss function, \(\Lambda({\bf t}, \boldsymbol{\tau})\), with respect to the weights of the network, \(\boldsymbol{w}\) (and hyperparameters, \(\boldsymbol{\alpha}\)) given a set of pairs of data and targets for training and validation, \(\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\) and \(\{ {\bf d}^\textrm{val}_i, {\bf t}^\textrm{val}_i\vert i\in[1, n_\textrm{val}]\}\). The cost function, \(\Lambda({\bf t}, \boldsymbol{\tau})\), measures how close the outputs of a fixed network, \(\mathbb{NN}(\boldsymbol{w}^*,\boldsymbol{\alpha}^*) : {\bf d}\to\boldsymbol{\tau}\), are to some target, \({\bf t}\), given a data-target pair, \(\{ {\bf d}, {\bf t}\}\), at some fixed network parameters and hyperparameters, \(\boldsymbol{w}=\boldsymbol{w}^*\) and \(\boldsymbol{\alpha}=\boldsymbol{\alpha}^*\). That is, how likely is it that the output of the network provides the true target for the input data given a chosen set of weights and fixed network hyperparameters, i.e. the cost function is equivalent to the (negative logarithm of the) likelihood function

\[\Lambda({\bf t}, \boldsymbol{\tau})\simeq-\textrm{ln}\mathcal{L}({\bf t}\vert {\bf d},\boldsymbol{w}^*,\boldsymbol{\alpha}^*).\]
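One familiar instance of this equivalence is the squared-error cost, which matches the negative logarithm of a unit-variance Gaussian likelihood centred on the network output, up to a constant that does not depend on the output. The sketch below assumes that particular cost and likelihood purely for illustration; the function names are made up here.

```python
import numpy as np


def squared_error_cost(target, output):
    """Cost of the form 0.5 * |t - tau|^2 for one data-target pair."""
    target = np.asarray(target, dtype=float)
    output = np.asarray(output, dtype=float)
    return 0.5 * np.sum((target - output) ** 2)


def gaussian_log_likelihood(target, output):
    """ln L(t | tau) for a unit-variance Gaussian centred on the network output."""
    target = np.asarray(target, dtype=float)
    output = np.asarray(output, dtype=float)
    normalisation = -0.5 * target.size * np.log(2.0 * np.pi)
    return -0.5 * np.sum((target - output) ** 2) + normalisation
```

The sum `squared_error_cost(t, tau) + gaussian_log_likelihood(t, tau)` is the same for every `tau`, so minimising the cost and maximising this likelihood select the same network output.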

The likelihood surface, although regular for a given set of network parameters and hyperparameters, is extremely complex, degenerate, and even discrete and non-convex in the directions of the network parameters and hyperparameters.

Although the cost function is normally chosen to be convex, i.e. with a global minimum and defined everywhere, at a given value of \(\boldsymbol{w}=\boldsymbol{w}^*\) and \(\boldsymbol{\alpha}=\boldsymbol{\alpha}^*\), the shape of the likelihood is extremely complex, degenerate and bumpy when considering all possible \(\boldsymbol{w}\) and will often be discrete and non-convex in the \(\boldsymbol{\alpha}\) direction.

Maximum likelihood network parameter estimates

The normal procedure for using neural networks is to train them. This means finding the maximum likelihood estimates of the weights of a network with a given set of training data-target pairs \(\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\) and fixed hyperparameters, \(\boldsymbol{\alpha}=\boldsymbol{\alpha}^*\), by doing

\[\boldsymbol{w}^\textrm{MLE}=\underset{\boldsymbol{w}}{\textrm{argmax} }\left[\mathcal{L}(\{ {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\vert \{ {\bf d}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}, \boldsymbol{w}, \boldsymbol{\alpha}^*)\right].\]

That is, find the set of \(\boldsymbol{w}\) for which the likelihood function evaluated at every member in the training set is maximum. In the case that the pairs of data and targets, \(\{ {\bf d}, {\bf t}\}\), are independent and identically distributed, we can write the likelihood as

\[\mathcal{L}(\{ {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\vert \{ {\bf d}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}, \boldsymbol{w}, \boldsymbol{\alpha}^*)=\prod_i^{n_\textrm{train} }\mathcal{L}({\bf t}^\textrm{train}_i\vert {\bf d}^\textrm{train}_i, \boldsymbol{w},\boldsymbol{\alpha}^*).\]

By finding the set of \(\boldsymbol{\tau}\) which are closest (in the sense of the minimum cost function) to the target \({\bf t}\), given a neural network and some input data \({\bf d}\), the weights of the network traverse the negative logarithm of the likelihood surface for the true target, hopefully ending at some minimum (which is a maximum in the likelihood).

To find the maximum likelihood of the weights, one would normally consider some sort of stochastic gradient descent. Since most software is more efficient at finding minima rather than maxima, we actually minimise the negative logarithm of the likelihood, i.e. the cost function

\[\begin{align} \boldsymbol{w}^\textrm{MLE}&=\underset{\boldsymbol{w}}{\textrm{argmin} }\left[\sum_i^{n_\textrm{train} }\Lambda({\bf t}^\textrm{train}_i, \boldsymbol{\tau}^\textrm{train}_i)\right]\\ &=\underset{\boldsymbol{w}}{\textrm{argmin} }\left[-\sum_i^{n_\textrm{train} }\textrm{ln}\mathcal{L}({\bf t}^\textrm{train}_i\vert {\bf d}^\textrm{train}_i, \boldsymbol{w}, \boldsymbol{\alpha}^*)\right]. \end{align}\]

The weights are updated using \(\boldsymbol{w}\to\boldsymbol{w}+\eta\nabla_{\boldsymbol{w}} \sum_i^{n_\textrm{train} }\ \textrm{ln}\mathcal{L}({\bf t}^\textrm{train}_i\vert {\bf d}^\textrm{train}_i, \boldsymbol{w}, \boldsymbol{\alpha}^*)\), where \(\eta\) is a small positive step size (the learning rate), so that each step moves down the gradient of the negative logarithm of the likelihood. In the ideal case there would be one global maximum in the likelihood so that after training the value of the weights of the neural network would be equal to the maximum likelihood estimates, \(\boldsymbol{w}=\boldsymbol{w}^\textrm{MLE}\). However, since the likelihood surface is, in reality, extremely degenerate and flat in the space of weight values, it is most likely that the weights only achieve a local maximum, i.e. \(\boldsymbol{w}=\boldsymbol{w}^\textrm{local MLE}\). In fact, which local maximum is found will normally depend extremely strongly on the initial \(\boldsymbol{w}=\boldsymbol{w}_\textrm{init}\) which is used for the gradient descent.

The initialisation of the weights will be very important in determining which local maximum likelihood estimate is found. This is because the surface of the likelihood is very bumpy. It can also be highly degenerate which leads to whole families of pseudo-maximum likelihood estimates.
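The dependence on the starting point is easy to reproduce with a single weight. The sketch below uses a hand-made, one-dimensional “negative log-likelihood” with two minima; it is not the cost of any real network, and the learning rate and number of steps are arbitrary choices made for this illustration.

```python
def negative_log_likelihood(w):
    """Toy cost with a deep minimum near w = -1 and a shallower one near w = +1."""
    return (w ** 2 - 1.0) ** 2 + 0.3 * w


def gradient(w):
    """Derivative of the toy cost with respect to the weight."""
    return 4.0 * w * (w ** 2 - 1.0) + 0.3


def gradient_descent(w_init, learning_rate=0.05, n_steps=500):
    """Plain (non-stochastic) gradient descent from a chosen initial weight."""
    w = float(w_init)
    for _ in range(n_steps):
        w = w - learning_rate * gradient(w)  # step down the cost surface
    return w


if __name__ == "__main__":
    for w_init in (-2.0, 2.0):
        w_final = gradient_descent(w_init)
        print(w_init, w_final, negative_log_likelihood(w_final))
```

Starting from `w_init = -2.0` the descent settles in the deeper minimum, while `w_init = 2.0` ends in the shallower one, even though the cost function and the training procedure are identical.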

Once the maximum (or at least local maximum) is found, it is normal to evaluate the accuracy (or some other figure of merit) using some validation set, \(\{ {\bf d}^\textrm{val}_i, {\bf t}^\textrm{val}_i\vert i\in[1, n_\textrm{val}]\}\). This validation set is used to modify the hyperparameters, \(\boldsymbol{\alpha}\), of the network to achieve the best possible fit to both the training and validation sets. These modifications could include changing the initial seeds of the weights, changing the activation functions, or changing the entire architecture, for example. However, networks trained in such a way do not provide a way to obtain scientifically robust estimates of the true targets \({\bf t}\), given observed data \({\bf d}\). To see why, we need to consider the probabilistic interpretation of neural networks.

Probabilistic interpretation of neural networks

The posterior predictive density of obtaining a target, \({\bf t}\), given some input data, \({\bf d}\), is

\[\mathcal{P}({\bf t}\vert {\bf d}) = \int d\boldsymbol{w}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w}, \boldsymbol{\alpha})\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}).\]

Here \(\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w},\boldsymbol{\alpha})\) is the likelihood of obtaining the true value of the target, i.e. the (unnormalised) exponential of the negative cost function, given some input data \({\bf d}\) and network parameters and hyperparameters \(\boldsymbol{w}\) and \(\boldsymbol{\alpha}\), and \(\mathcal{P}(\boldsymbol{w}, \boldsymbol{\alpha})\) is the probability of obtaining the weights and hyperparameters of the neural network. Since the likelihood of obtaining any value of the target, \({\bf t}\), given some input data, \({\bf d}\), is essentially equal for any given neural network, i.e. any combination of \(\boldsymbol{w}\) and \(\boldsymbol{\alpha}\), the likelihood, \(\mathcal{L}({\bf t}\vert {\bf d},\boldsymbol{w},\boldsymbol{\alpha})\), is almost flat. Therefore, the majority of the information about the posterior predictive density, \(\mathcal{P}({\bf t}\vert {\bf d})\), comes from any a priori or a posteriori knowledge of the weights and hyperparameters, \(\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha})\), which therefore has to be chosen or found very carefully.

The form of the posterior predictive density of the targets \({\bf t}\) depends mostly on the probability of the weights and hyperparameters of the network. This means that the prior for the weights and hyperparameters must be chosen carefully or the posterior extremely well characterised via training data.
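The integral for \(\mathcal{P}({\bf t}\vert {\bf d})\) can be made concrete with a toy in which everything is assumed for illustration: a “network” with a single weight and output \(\tau = w d\), a unit-variance Gaussian likelihood around \(\tau\), no hyperparameters, and a standard normal \(\mathcal{P}(w)\). The integral is then estimated by averaging the likelihood over draws of the weight.

```python
import numpy as np


def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian."""
    return np.exp(-0.5 * (x - mean) ** 2 / variance) / np.sqrt(2.0 * np.pi * variance)


def predictive_density(t, d, weight_samples):
    """Monte Carlo estimate of P(t|d): the likelihood averaged over draws from P(w)."""
    outputs = weight_samples * d  # network output tau for every sampled weight
    return np.mean(gaussian_pdf(t, outputs, 1.0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.normal(size=200_000)  # draws from the assumed P(w)
    # For this toy the exact answer is a Gaussian in t with variance 1 + d**2.
    print(predictive_density(1.0, 2.0, weights), gaussian_pdf(1.0, 0.0, 5.0))
```

Changing the distribution the weights are drawn from changes the result directly, which is the sense in which the predictive density is dominated by \(\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha})\).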

A Bayesian neural network is a network which provides the true posterior predictive density of targets \({\bf t}\) given data \({\bf d}\).

Failure of traditionally trained neural networks

As described above, given a set of training pairs, \(\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\), and validation pairs, \(\{ {\bf d}^\textrm{val}_i, {\bf t}^\textrm{val}_i\vert i\in[1, n_\textrm{val}]\}\), we can find the (local) maximum likelihood estimates of the weights, \(\boldsymbol{w}=\boldsymbol{w}^\textrm{local MLE}\), and optimise the hyperparameters to \(\boldsymbol{\alpha}=\boldsymbol{\alpha}^*\) which gives the best fit to both the training and validation data-target pair sets. Since we fix both the parameters and hyperparameters, those values are set in stone and we degenerate the posterior distribution to a Dirac \(\delta\) function, neglecting any information brought by the training data, i.e.

\[\begin{align} \mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}|\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}) &\propto\mathcal{L}(\boldsymbol{w},\boldsymbol{\alpha}\vert \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})p(\boldsymbol{w},\boldsymbol{\alpha})\\ &\to\delta(\boldsymbol{w}-\boldsymbol{w}^\textrm{local MLE}, \boldsymbol{\alpha}-\boldsymbol{\alpha}^*) \end{align}\]

where \(p(\boldsymbol{w},\boldsymbol{\alpha})\) is a prior distribution over the weights and hyperparameters. By making such a choice, we erase the entirety of the information about the distribution of data and work only with the best fit model, which may (or may not) be complete. As such, the predictive probability density of the targets \({\bf t}\) given data \({\bf d}\) is

\[\mathcal{P}({\bf t}\vert {\bf d}) =\delta({\bf t}-\boldsymbol{\tau}({\bf d})),\]

i.e., the probability of obtaining an estimate from the network is zero everywhere apart from at the value of the output of the network, \(\mathbb{NN}(\boldsymbol{w}^\textrm{local MLE}, \boldsymbol{\alpha}^*) : {\bf d}\to\boldsymbol{\tau}\) - the function is completely deterministic. Effectively, this means that the probability of obtaining \({\bf t}\) given the fixed network parameters and hyperparameters and some data \({\bf d}\) is impossibly small.

Consider a third test set, \(\{ {\bf d}^\textrm{test}_i, {\bf t}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}\). One normally determines how well a neural network is trained using this unseen (blind) set. To test the network, all of the test data, \(\{ {\bf d}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}\), are passed through the network to get estimates \(\{\boldsymbol{\tau}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}\) which can be plotted against the known targets, \(\{ {\bf t}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}\) (see above figure).

For a trained neural network with hyperparameters and network parameters fixed at their maximum likelihood values, the probability of obtaining a target given any set of data is a \(\delta\) function. There is no knowledge of whether the output of the network will be equal to the target, and it is, in fact, extremely unlikely that they will be.

A network which produces \(\{\boldsymbol{\tau}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}\) which correlate very strongly with \(\{ {\bf t}^\textrm{test}_i\vert i\in[1, n_\textrm{test}]\}\) is probably a network that is in a very good local maximum for both the weights and the hyperparameters. However, there is no assurance that the true \({\bf t}\) should be obtained by the network, and due to the complexity of the likelihood \(\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w}, \boldsymbol{\alpha})\), there is also no way of ensuring that \(\boldsymbol{\tau}\) should be similar to \({\bf t}\). Simply, for complex models, it is not possible to prove that the neural network is equivalent to the model, \(\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha})\equiv\mathcal{M}\), and so there is no trust that the network will provide \(\boldsymbol{\tau}={\bf t}\). In fact, because \(\mathcal{P}({\bf t}\vert {\bf d})=\delta({\bf t}-\boldsymbol{\tau}({\bf d}))\), it is extremely unlikely to ever find \(\boldsymbol{\tau}={\bf t}\). For extremely simple architectures it may be possible to prove that, at the global maximum likelihood estimates of the weights, \(\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha})\equiv\mathcal{M}\), but unfortunately, such simple networks are much less likely to contain the exact representation of \(\mathcal{M}\). Therefore, one can only prove \(\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha})\equiv\mathcal{M}\) in the limit of infinite data. This is because, in the limit of infinite training data and infinite validation data, we can assume (but not know) that a network could be found (via optimising the hyperparameters over the space of all possible architectures, activation functions, initial conditions of the weights, etc.) which has the capability to exactly reproduce the model \(\mathcal{M} : {\bf d}\to{\bf t}\) by finding the true global maximum of the weights over the space of all possible weights in all possible architectures.

An interesting point to make, especially for regression to model parameters, is that one attempts to use the neural network to find a mapping from a many-to-one value space since the same \({\bf t}\) could produce a very large number of different \({\bf d}\), i.e. the forward model is stochastic. It is an extremely difficult procedure to undo stochastic processes, which is why the neural network will likely never achieve the target function.

Variational inference using approximate weight priors

A neural network can be trained via variational inference where parameters of the network predict the parameters of a variational distribution from which the weights for the forward propagation are drawn.

All of the problems with the traditional picture arise due to degenerating the probability of the weights and hyperparameters \(\mathcal{P}(\boldsymbol{w}, \boldsymbol{\alpha})\to\delta(\boldsymbol{w}-\boldsymbol{w}^\textrm{local MLE}, \boldsymbol{\alpha}-\boldsymbol{\alpha}^*)\). We can recover variational inference by assuming the posterior distribution of the weights becomes an approximate variational distribution, \(\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha}, \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})\), which approximates the posterior of \(\boldsymbol{w}\) given a secondary set of network parameters which define the shape of the variational distribution, \(\boldsymbol{v}\), a set of hyperparameters, \(\boldsymbol{\alpha}\), and a set of training data and target pairs, \(\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\). The posterior predictive density for the targets \({\bf t}\) is then written

\[\mathcal{P}({\bf t}\vert {\bf d})=\int d\boldsymbol{w}d\boldsymbol{v}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d},\boldsymbol{w},\boldsymbol{\alpha})\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha}, \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})p(\boldsymbol{v},\boldsymbol{\alpha}).\]

In practice, the parameters controlling the shape of the variational distribution, \(\boldsymbol{v}\), and the hyperparameters, \(\boldsymbol{\alpha}\), are optimised iteratively using a training and validation set, as with the traditional training framework, and as such the posterior predictive density becomes

\[\begin{align} \mathcal{P}({\bf t}\vert {\bf d})&=\int d\boldsymbol{w}d\boldsymbol{v}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d},\boldsymbol{w},\boldsymbol{\alpha})\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha}, \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})\\ &\phantom{=hello}\times\delta(\boldsymbol{v}-\boldsymbol{v}^\textrm{local MLE}, \boldsymbol{\alpha}-\boldsymbol{\alpha}^*)\\ &=\int d\boldsymbol{w}~\mathcal{L}({\bf t}\vert {\bf d},\boldsymbol{w}, \boldsymbol{\alpha}^*)\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}^\textrm{local MLE}, \boldsymbol{\alpha}^*, \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}). \end{align}\]

When the posterior distribution for the weights and hyperparameters of a neural network is approximated using a variational distribution, the posterior predictive density for the targets given some data has a form dictated mostly by the shape of the variational distribution. This shape is not necessarily correct since only simple distributions are usually used for the variational distribution and the distribution of weights can be extremely complex.
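The last integral can be sketched for a Gaussian variational distribution. The toy below assumes a single weight with network output \(\tau = w d\), a unit-variance Gaussian likelihood, and already optimised shape parameters \(\boldsymbol{v}\) consisting of a mean and a log standard deviation. These choices, and the function names, are illustrative only.

```python
import numpy as np


def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian."""
    return np.exp(-0.5 * (x - mean) ** 2 / variance) / np.sqrt(2.0 * np.pi * variance)


def draw_weights(mean, log_std, n_draws, rng):
    """Sample w ~ Q(w|v) by shifting and scaling standard normal noise."""
    noise = rng.normal(size=n_draws)
    return mean + np.exp(log_std) * noise


def variational_predictive(t, d, mean, log_std, n_draws, rng):
    """Monte Carlo estimate of the integral of L(t|d,w) Q(w|v) over w."""
    weights = draw_weights(mean, log_std, n_draws, rng)
    return np.mean(gaussian_pdf(t, weights * d, 1.0))
```

For this toy the predictive density is itself a Gaussian in `t`, centred on `mean * d` with variance `1 + d**2 * exp(2 * log_std)`. Its shape is inherited entirely from the Gaussian chosen for \(\mathcal{Q}\), whatever the true distribution of the weights looks like.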

In principle, if \(\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}^\textrm{local MLE}, \boldsymbol{\alpha}^*, \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})\) well represents the true posterior of the weights and hyperparameters, \(\mathcal{P}(\boldsymbol{w}, \boldsymbol{\alpha})\), then this can be a good approximation. However, this is very dependent on the distributions which \(\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha})\) can represent. \(\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha})\) is normally chosen to be Gaussian, or perhaps a mixture of Gaussians. As discussed already, the likelihood of obtaining any set of weights, \(\boldsymbol{w}\), is actually extremely bumpy and degenerate and, as such, \(\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha})\) must be chosen to be able to properly represent this. If \(\mathcal{Q}(\boldsymbol{w}\vert \boldsymbol{v}, \boldsymbol{\alpha})\) is poorly proposed then the posterior predictive density of the targets, \(\mathcal{P}({\bf t}\vert {\bf d})\), will be incorrect.

The variational distribution often does not have enough complexity to fully model the intricate nature of the true posterior distribution of weights and hyperparameters. This can make variational inference misleading.

Bayesian neural networks

A Bayesian neural network is similar to a traditional one, except that the weights (and hyperparameters) of the network are characterised by their posterior distribution given a set of training data.

An effective Bayesian neural network can be built if, rather than degenerating it to a Dirac \(\delta\), we keep the true posterior of \(\boldsymbol{w}\) and \(\boldsymbol{\alpha}\) given some training data,

\[\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}\vert \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})\propto\prod_i^{n_\textrm{train} }\mathcal{L}({\bf t}^\textrm{train}_i\vert {\bf d}^\textrm{train}_i,\boldsymbol{w},\boldsymbol{\alpha})p(\boldsymbol{w},\boldsymbol{\alpha}).\]

With this, the predictive probability density of \({\bf t}\) given \({\bf d}\) becomes

\[\begin{align} \mathcal{P}({\bf t}\vert {\bf d}) =&~\int d\boldsymbol{w}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w}, \boldsymbol{\alpha})\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}\vert \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})\\ \propto&~\int d\boldsymbol{w}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w}, \boldsymbol{\alpha})\prod_i^{n_\textrm{train} }\mathcal{L}({\bf t}^\textrm{train}_i\vert {\bf d}^\textrm{train}_i,\boldsymbol{w},\boldsymbol{\alpha})p(\boldsymbol{w},\boldsymbol{\alpha}). \end{align}\]

Obviously the Bayesian neural network comes at a much higher computational cost than just finding the maximum likelihood estimate for the weights, but it does provide a more reasoned posterior predictive probability density, \(\mathcal{P}({\bf t}\vert {\bf d})\). Notice that the prior, \(p(\boldsymbol{w},\boldsymbol{\alpha})\), still enters and so we need to make an informed decision on our belief for what the values of \(\boldsymbol{w}\) and \(\boldsymbol{\alpha}\) should be. However, for enough training data-target pairs (and enough time to sample through whatever chosen prior, \(p(\boldsymbol{w},\boldsymbol{\alpha})\)) the posterior \(\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}\vert \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})\) becomes informative enough to obtain useful posterior predictions for the targets.
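With a single weight the whole construction can be evaluated on a grid, which makes the role of each factor visible. The sketch below again assumes a toy network output \(\tau = w d\), unit-variance Gaussian likelihoods, a standard normal prior on the weight and no hyperparameters. None of this is specific to the networks discussed in the text.

```python
import numpy as np


def log_gaussian(x, mean, variance):
    """Logarithm of a one-dimensional Gaussian density."""
    return -0.5 * (x - mean) ** 2 / variance - 0.5 * np.log(2.0 * np.pi * variance)


def posterior_on_grid(w_grid, d_train, t_train):
    """Posterior of the weight: product of training likelihoods times the prior."""
    log_post = log_gaussian(w_grid, 0.0, 1.0)  # assumed prior p(w)
    for d_i, t_i in zip(d_train, t_train):
        log_post = log_post + log_gaussian(t_i, w_grid * d_i, 1.0)
    post = np.exp(log_post - log_post.max())  # subtract the maximum for stability
    dw = w_grid[1] - w_grid[0]
    return post / (post.sum() * dw)  # normalise on the grid


def predictive_density(t, d, w_grid, posterior):
    """P(t|d): the likelihood integrated against the posterior of the weight."""
    dw = w_grid[1] - w_grid[0]
    likelihood = np.exp(log_gaussian(t, w_grid * d, 1.0))
    return np.sum(likelihood * posterior) * dw
```

A grid is only feasible for a handful of parameters. For realistic numbers of weights the same integral has to be approached with sampling.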

For small numbers of data points, the likelihood is poorly characterised and so can lead to biasing in the posterior predictive density. It is therefore important to have enough data to properly know the likelihood - it is not easy to determine how much this is.

In effect, to make use of Bayesian neural networks, one has to resort to sampling techniques, such as Markov chain Monte Carlo, to describe \(\mathcal{P}({\bf t}\vert {\bf d})\). Because of the (normally extremely large) number of weights, techniques such as Metropolis-Hastings cannot be considered. We proposed using a second-order geometrical adaptation of Hamiltonian Monte Carlo (QN-HMC) in Charnock et al. 2019 (read more). By using such a sampling technique, one could generate samples from the posterior predictive density, \(\mathcal{P}({\bf t}\vert {\bf d})\), whose distribution describes the probability of getting a target \({\bf t}\) from data \({\bf d}\) marginalised over all network parameters \(\boldsymbol{w}\) given a hyperparameter, \(\boldsymbol{\alpha}=\boldsymbol{\alpha}^*\)¹. It is difficult to sample \(\boldsymbol{\alpha}\) when using the QN-HMC since gradients of the likelihood need to be computed and the likelihood in the \(\boldsymbol{\alpha}\) direction is often discrete. How to properly sample from \(\boldsymbol{\alpha}\) is still up for debate.
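To give a flavour of gradient-based sampling, here is a minimal sketch of plain Hamiltonian Monte Carlo. It is not the QN-HMC scheme of Charnock et al. 2019. It samples the posterior of a single weight in an assumed toy problem: output \(\tau = w d\), unit-variance Gaussian likelihood, standard normal prior. The training pairs, step size and trajectory length are arbitrary values chosen for illustration.

```python
import numpy as np

# Hypothetical training pairs for a one-weight toy "network" tau = w * d.
D_TRAIN = np.array([0.5, 1.0, 1.5, 2.0])
T_TRAIN = np.array([0.9, 1.4, 2.4, 2.9])


def log_posterior(w):
    """Unnormalised log posterior: Gaussian log-likelihood plus Gaussian log-prior."""
    residual = T_TRAIN - D_TRAIN * w[0]
    return -0.5 * np.sum(residual ** 2) - 0.5 * np.sum(w ** 2)


def grad_log_posterior(w):
    """Gradient of the log posterior with respect to the weight."""
    residual = T_TRAIN - D_TRAIN * w[0]
    return np.array([np.sum(D_TRAIN * residual)]) - w


def hmc_sample(log_prob, grad_log_prob, w0, n_samples, step_size, n_leapfrog, rng):
    """Plain HMC: leapfrog trajectories followed by a Metropolis accept/reject step."""
    w = np.asarray(w0, dtype=float)
    samples = np.empty((n_samples, w.size))
    for k in range(n_samples):
        p = rng.normal(size=w.size)  # fresh momentum for every trajectory
        w_new = w.copy()
        p_new = p + 0.5 * step_size * grad_log_prob(w_new)  # initial half step
        for _ in range(n_leapfrog - 1):
            w_new = w_new + step_size * p_new
            p_new = p_new + step_size * grad_log_prob(w_new)
        w_new = w_new + step_size * p_new
        p_new = p_new + 0.5 * step_size * grad_log_prob(w_new)  # final half step
        h_old = -log_prob(w) + 0.5 * np.dot(p, p)
        h_new = -log_prob(w_new) + 0.5 * np.dot(p_new, p_new)
        if np.log(rng.uniform()) < h_old - h_new:  # accept with prob. min(1, e^-dH)
            w = w_new
        samples[k] = w
    return samples
```

For this toy the posterior is Gaussian with precision `1 + sum(D_TRAIN**2)`, so the sample mean and spread can be compared with the exact values. The need for `grad_log_posterior` is also why discrete hyperparameters are awkward for this family of samplers.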

So now let's say we have enough computational power to build a true Bayesian neural network. Are we guaranteed to obtain a correct posterior predictive density?

Source of the problem

Training on data

Notice how all of the techniques mentioned above are dependent on a set of training data and target pairs, \(\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\) (and possibly validation data and targets, \(\{ {\bf d}^\textrm{val}_i, {\bf t}^\textrm{val}_i\vert i\in[1, n_\textrm{val}]\}\)). It is in the posterior (or variational distribution) for the weights that the training data enters

\[\mathcal{P}({\bf t}\vert {\bf d}) = \int d\boldsymbol{w}d\boldsymbol{\alpha}~\mathcal{L}({\bf t}\vert {\bf d}, \boldsymbol{w}, \boldsymbol{\alpha})\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha}\vert \{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\})\]

and, as already explained, the last term in the integral contains the informative part about the posterior predictive density. As such, any biasing due to \(\{ {\bf d}^\textrm{train}_i, {\bf t}^\textrm{train}_i\vert i\in[1, n_\textrm{train}]\}\) greatly affects \(\mathcal{P}({\bf t}\vert {\bf d})\).

When depending on a training set, \(\mathcal{P}({\bf t}\vert {\bf d})\) is always unknowably biased until the limit of infinite data is reached. So, no method mentioned so far provides us with the correct probability of obtaining the target!

For networks such as emulators (or generative networks, as they are commonly called), where the probability distribution of generating targets, \(\mathcal{P}({\bf t}\vert {\bf z})\), with generated data \({\bf t}\) and a latent distribution \({\bf z}\), should approximate the distribution of true data \(\mathcal{P}({\bf d})\), the above argument means that we cannot find \(\mathcal{P}({\bf d})\) by training a neural network without infinite training data².

Incorrect models

One interesting use for neural networks is the predicting of physical model parameters, \(\boldsymbol{\theta}\), for a model \(\mathcal{M} : \boldsymbol{\theta}\to{\bf d}\). In this case, even for infinite data, we cannot obtain true posterior distributions for the parameters. Take a network which maps \(\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha}) : {\bf d}\to\hat{\boldsymbol{\theta}}\), where \(\hat{\boldsymbol{\theta}}\) are estimates of the model parameters, \(\boldsymbol{\theta}\), which generate the data. Even if there is infinite training data, \(\{ {\bf d}^\textrm{train}_i, \boldsymbol{\theta}^\textrm{train}_i\vert i\in[1,\infty]\}\), if the original model is incorrect, then the neural network will be conditioned on the wrong map from data, \({\bf d}\), to parameters, \(\boldsymbol{\theta}\), and so any observed data, \({\bf d}^\textrm{obs}\), passed through the network will be passed through the incorrect approximation of the model and provide a poor estimate of the incorrect model parameter values. This means that true posteriors on the model parameters can only be obtained with the exact model which generates the observed data and an infinite amount of training data from that model, to be able to correctly provide parameter estimates.

This is not realistic!

Solutions

We have so far built a description of how to obtain the probability to obtain targets, \({\bf t}\), from data, \({\bf d}\), passed through a neural network… and unfortunately, we have learned that it is not possible to obtain.

There is still one problem where we can use neural networks safely despite all of the above. This is to do model parameter inference.

So far we have only considered a neural network as an approximation to a model \(\mathcal{M} : {\bf d}\to{\bf t}\). Now let's say we have a physical model, \(\mathcal{Z}(\boldsymbol{\iota}) : \boldsymbol{\theta}\to{\bf d}\), which generates the data, \({\bf d}\), from a set of model parameters, \(\boldsymbol{\theta}\), dependent on a set of initial conditions \(\boldsymbol{\iota}\). In this case we can safely use a neural network, \(\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha}) : {\bf d}\to\boldsymbol{\tau}\), to infer the model parameters of some observed data, \({\bf d}^\textrm{obs}\). Note that we cannot use a network to predict model parameters directly \((\mathbb{NN} : {\bf d}\to\boldsymbol{\theta})\) due to all of the arguments above. Instead we need to set up a statistical inference framework which encompasses the neural network.

Charnock et al. 2019 and Charnock, Lavaux and Wandelt 2018 show two different methods to perform physical model parameter inference using neural networks, in a well justified way.

Writing down the likelihood

I should mention an extremely rare case where the model \(\mathcal{M} : {\bf d}\to{\bf t}\), is simple enough to be parameterised by an extremely simple network with very few parameters, which are non-degenerate and well behaved and for which the hyperparameters, \(\boldsymbol{\alpha}\), can be well designed to avoid needing to sample over this space.

For this case, the likelihood could be written down, and therefore fully established and sampled from, and biases from training data-target pairs could be totally avoided.

It is pretty unlikely that such a network could be found without considering physical principles.

Model extension

In Charnock et al. 2019, the connection between the observed data and the output of the physical model is not known, i.e. the data from a model given initial conditions, \(\boldsymbol{\iota}\), is \(\mathcal{Z}(\boldsymbol{\iota}) : \boldsymbol{\theta}\to{\bf d}\). This \({\bf d}\) does not look like \({\bf d}^\textrm{obs}\), although we want the posterior distribution \(\mathcal{P}(\boldsymbol{\theta}\vert {\bf d}^\textrm{obs})\). In Charnock et al. 2019, we know we can observe the universe and model the underlying dark matter of the universe, but the complex astrophysics which maps the dark matter of the universe to the observable tracers is unknown. We do, however, know some physical properties of this mapping. In this case, we build a neural network with the physically motivated symmetries to take the output of the physical model to the distribution which is as close to the observed data as possible (read more). In the language used previously, thanks to the problems we deal with in cosmology and astrophysics we can actually choose the hyperparameters of a neural network, \(\boldsymbol{\alpha}\), in a reasoned manner. These physically motivated neural networks therefore massively reduce the volume of the \(\boldsymbol{\alpha}\) domain. With a careful choice of \(\boldsymbol{\alpha}\) we can also build a network whose priors on the network parameters, \(\boldsymbol{w}\), can be (at least reasonably) well informed.

We can write the parameter inference as

\[\mathcal{P}(\boldsymbol{\theta}\vert {\bf d}^\textrm{obs}) \propto \int d\boldsymbol{\iota}d\boldsymbol{w}d\boldsymbol{\alpha}~\mathcal{L}({\bf d}^\textrm{obs}\vert \boldsymbol{\iota},\boldsymbol{w},\boldsymbol{\alpha})\mathcal{P}(\boldsymbol{\iota}\vert \boldsymbol{\theta})p(\boldsymbol{w},\boldsymbol{\alpha})\]

That is, the posterior distribution for the model parameters given some observed data is proportional to the likelihood of the observed data marginalised over the initial conditions of the model, \(\boldsymbol{\iota}\), and over the network parameters and hyperparameters, \(\boldsymbol{w}\) and \(\boldsymbol{\alpha}\). The initial conditions are generated from the model parameters, \(\boldsymbol{\theta}\), and evolved forward by the model to give the input of the neural network.

In the case presented here there is no training data for the network; instead, the data needed to obtain the posterior is part of the statistical framework. Therefore, the network provides non-agnostic posterior parameter inference because we do not learn the posterior distribution, \(\mathcal{P}(\boldsymbol{w},\boldsymbol{\alpha})\), using training data. In essence, this defines the procedure to perform zero-shot training.

It should be noted that this procedure is difficult. It necessitates a sampling scheme for the neural network and the physical model. In Charnock et al. 2019, we use an advanced Hamiltonian Monte Carlo sampling technique on a model where we have calculated the adjoint gradient and the neural network whose architecture is well informed but fixed.

Likelihood-free inference

The model extension method works well, but still depends on knowing the form of the likelihood of the observed data. In practice, this could be extremely difficult. It also depends on a choice of hyperparameters (or at least a well defined prior based on physical principles). In Charnock, Lavaux and Wandelt 2018, we showed another model extension method which allows us to obtain optimal model parameter inference using neural networks by (semi)-classically training a neural network, \(\mathbb{NN}(\boldsymbol{w},\boldsymbol{\alpha}) : {\bf d}\to{\bf t}\), where the target distribution is the set of Gaussianly distributed summaries which maximise the Fisher information matrix. Although the network in this work is, in some way, optimal, the main point of this paper is that parameter inference can be done using likelihood-free inference by extending the physical model \(\mathcal{M} : \boldsymbol{\theta}\to{\bf d}\) to \(\mathcal{N} :\boldsymbol{\theta}\to{\bf t}\) where \({\bf t}\) is any set of summaries.

Likelihood-free inference is a framework where, via generating data using the physical model, \(\mathcal{M} : \boldsymbol{\theta}\to{\bf d}\), the joint probability of data and parameters, \(\mathcal{P}({\bf d},\boldsymbol{\theta})\), can be characterised. Once this space is well defined, a slice through the distribution at any \({\bf d}^\textrm{obs}\) gives the posterior distribution \(\mathcal{P}(\boldsymbol{\theta}\vert {\bf d}^\textrm{obs})\). Likewise, the slice through the joint distribution at any parameter \(\boldsymbol{\theta}^*\) gives the likelihood distribution \(\mathcal{L}({\bf d}\vert \boldsymbol{\theta}^*)\). This works for any system where we can model the data!

The neural networks become essential as functions which perform data compression (although, it should be noted that any summary of the data will work). Since, in general, the dimensionality of the data is much larger than the number of model parameters, a neural network can be trained to compress the data in some way, \(\mathbb{NN}(\boldsymbol{w}^*,\boldsymbol{\alpha}^*) : {\bf d}\to{\bf t}\). We can train this in any way to give us some absolute summaries, \({\bf t}\), where we, essentially, do not care what the summaries are. Note that \(\boldsymbol{w}\) and \(\boldsymbol{\alpha}\) do not need to be maximum likelihood estimates. By pushing all the generated data from the physical model through this fixed network we can characterise the probability distribution of parameters and compressed summaries, \(\mathcal{P}({\bf t},\boldsymbol{\theta})\), which we can slice at any \(\boldsymbol{\theta}^*\) to give the likelihood of obtaining any summaries, \(\mathcal{L}({\bf t}\vert \boldsymbol{\theta}^*)\), or (more interestingly) slice at any observed data pushed through the network, \(\mathbb{NN}(\boldsymbol{w}^*, \boldsymbol{\alpha}^*) : {\bf d}^\textrm{obs}\to{\bf t}^\textrm{obs}\), to get the posterior,

\[\mathcal{P}(\boldsymbol{\theta}\vert {\bf t}^\textrm{obs})=\mathcal{P}(\boldsymbol{\theta}\vert {\bf d}^\textrm{obs},\boldsymbol{w}^*,\boldsymbol{\alpha}^*).\]

This posterior, whilst conditional on the network parameters and hyperparameters, is unbiased in the sense that when the neural network, \(\mathbb{NN}(\boldsymbol{w}^*,\boldsymbol{\alpha}^*) : {\bf d}\to{\bf t}\) is not optimal, the posterior can only become inflated (and not incorrectly biased).
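The claim that a sub-optimal summary inflates the posterior can be illustrated with a toy problem. The sketch below is not taken from the papers: it assumes a Gaussian toy model with unit variance, a flat prior, and simple rejection on the distance between simulated and observed summaries, and the names `simulate`, `rejection_posterior`, `good_summary` and `poor_summary` are hypothetical stand-ins for the physical model and for two fixed "networks".

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n_data=10):
    """Toy physical model M: theta -> d (Gaussian data with unit variance)."""
    return theta[:, None] + rng.standard_normal((theta.size, n_data))

def rejection_posterior(summary, d_obs, n_sims=200_000, eps=0.05):
    """Slice the joint P(t, theta) at t_obs by keeping simulations with |t - t_obs| < eps."""
    theta = rng.uniform(-10.0, 10.0, n_sims)  # draws from a flat prior (assumption)
    t_sim = summary(simulate(theta))          # push every simulation through the fixed summary
    t_obs = summary(d_obs[None, :])[0]        # push the observed data through the same summary
    return theta[np.abs(t_sim - t_obs) < eps]

def good_summary(d):
    return d.mean(axis=1)   # uses all of the data

def poor_summary(d):
    return d[:, 0]          # throws most of the data away

d_obs = simulate(np.array([1.0]))[0]          # mock observation generated at theta = 1
post_good = rejection_posterior(good_summary, d_obs)
post_poor = rejection_posterior(poor_summary, d_obs)
print("posterior width, good summary:", post_good.std())
print("posterior width, poor summary:", post_poor.std())
```

In this toy setting the poor summary gives a noticeably wider set of accepted parameters than the good one, because it retains less of the information in the data.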

The information maximising neural network, presented in Charnock, Lavaux and Wandelt 2018, provides the optimal summaries³ for the likelihood-free inference - but any neural network can be used in this inference framework. In particular, any neural network which looks like it provides good estimates of the targets for a model \(\mathcal{M} : {\bf d}\to{\bf t}\) (as discussed throughout), will likely have extremely informative summaries, even if their outputs are improbably unlikely to be equal to the true target values (see traditionally training neural networks)!

Conclusions

Presented here is a thorough statistical diagnostic of neural networks. I have shown that, by design, neural networks cannot provide realistic posterior predictive densities for arbitrary targets. This essentially makes all neural networks unusable in science.

However, I have presented how my previous works can undermine this statement for model parameter inference. Since either a statistical interpretation or a fully trained neural network can be appended to a physical model, we can build a statistical framework around both the model and the neural network to allow us to do rigorous, scientific analysis of model parameters, which is one of the essential tasks in science today.

Tom Charnock, Guilhem Lavaux, Benjamin D. Wandelt, Supranta Sarma Boruah, Jens Jasche, Michael J. Hudson, 2019, submitted to MNRAS, arXiv:1909.06379

Tom Charnock, Guilhem Lavaux, Benjamin D. Wandelt, 2018, Physical Review D 97, 083004, arXiv:1802.03537

¹ It should be noted that the work in Charnock et al. 2019 was tackling a larger problem and asking a different question than the one stated here for Bayesian neural networks. Bayesian neural networks are a subset of the techniques from that paper, although closely linked.
² We can hope that the generated target distribution gets close to the true data distribution and decide we are not bothered about statistics anymore. Maybe a dangerous situation for science‽
³ Optimal in the sense that the Fisher information is maximised. This has some assumptions such as the unimodality (but not necessarily Gaussianity) of the posterior, and the fact that the neural network being maximised is capable of finding a function which Gaussianises the data.

Neural physical engines for inferring the halo mass distribution function

Tue, 15 Oct 2019 00:00:00 +0300

To be able to make the most of the wealth of cosmological information available via observations of the large scale structure of the universe it is vital to have a strong model of how observable objects such as galaxies trace the underlying dark matter. In this work we used a neural bias model: a physically motivated neural network from which we can infer the halo mass distribution function. This function describes the abundance of halos with a certain mass given a dark matter density environment, where the halos are compact dark matter objects in which galaxies are hosted. As such, the neural bias model gives us a strong, but agnostic, bias model mapping the dark matter density field to (tracers of) the observable universe. Such a neural bias model can be included in the BORG inference scheme such that the initial conditions of the dark matter density and the parameters of the neural bias model are sampled using Hamiltonian Monte Carlo.

Halo mass distribution function

The halo mass distribution function describes the number of dark matter halos at a certain mass given a dark matter density environment. It has been well studied in the past, and as such we know the approximate form of the function, which is described by the Press-Schechter formalism: a power law at small masses with an exponential cut-off at high masses. There are also less well understood elements, including how the non-local density environment affects the abundance of halos and the form of the stochasticity with which halos are drawn from the halo mass distribution function. This stochasticity describes how one obtains the actual number of observed halos of a certain mass given that the halo mass distribution function only describes the probability of observing such a halo. The sampling of halos from the halo mass distribution function is normally assumed to be Poissonian, but this is known to be insufficient. Whilst we consider a Poissonian likelihood in this work, it should be noted that it is Poisson for a field of summaries provided by a neural physical engine and so includes information from the local surrounding region.

Zero-shot training, Bayesian neural networks

The neural network used in this work is not pre-trained and is conditioned on the observed data only, in this case a halo catalogue obtained from a high resolution dark matter simulation. Zero-shot training describes a method of fitting a function without any training data. Several components are necessary to be able to achieve such a fitting of the neural bias model introduced here. These are: basing the design of the architecture of the network on physical principles; using appropriate functions to model the form of the halo mass distribution function; and finding a stable sampling procedure to obtain parameter samples from the posterior.

Neural physical engines

Neural physical engines are simply neural networks that are built using physical principles. For example, with a physical model of how some data is distributed according to the parameters of a model, one builds a neural network with the symmetries of such a model built into its architecture. This is particularly useful for several reasons. Primarily, such a neural physical engine is massively protected from overfitting. Overfitting is prevented because only relevant information for the problem in hand is allowed to be fitted, and the network is insensitive to spurious features of the data, such as noise. An added benefit of these networks is the massive reduction in the number of parameters necessary to fit the required function. This improves the computational efficiency of the algorithm, decreases training times and increases the interpretability of the network.

The neural physical engine is a physically motivated neural network which maps a dark matter density distribution, evolved by Lagrangian perturbation theory, to a set of summaries which are informative about the abundance of halos of a certain mass on the grid.

When building the neural bias model we construct a neural physical engine which takes a small patch of the gridded dark matter density field evolved from the initial conditions to today using Lagrangian perturbation theory as an input and outputs a single informative summary per voxel about the abundance of halos with a certain mass at that patch of the dark matter density field. We know that the halo mass distribution function is only sensitive to local information, and at the resolution we are working at, mostly due to the amplitude of the dark matter density field rather than the exact position of structures such as filaments or nodes in the dark matter field. We also know that the data is distributed evenly across the volume, i.e. there is translational and rotational invariance in the dark matter density field. This encourages us to use parameterised three-dimensional convolutional kernels with an extent which is only as large as the relevant scales and where the parameters are shared within the kernels according to a radial symmetry.

The convolutional kernels used in neural networks are discrete and gridded, with each element of the array being an independent trainable parameter. We introduce a method by which we can expand the kernels in terms of multipoles by associating weights at equal distances (and at given rotational angles) from the centre of the kernel. Take for example a 3x3x3 convolutional kernel. Normally this would have 27 free parameters. By looking at the radially symmetric kernel, i.e. ℓ=0, the corners share one weight, as do the edges and the faces, and there is a single weight for the central element, equating to a total of 4 free parameters. Then in the case of the dipolar kernel, i.e. ℓ=1, there are three independent kernels each with 3 parameters, making a total of 9. For ℓ=2, there are now 5 independent kernels with 2 parameters each, and including ℓ=3 saturates the freedom of the convolutional kernel, so no further multipoles are needed to fully parameterise the general kernel. We can use this expansion either to reduce the number of parameters necessary by truncating in multipoles, or to learn more about the informational content of the data in terms of an expansion in multipoles. In the second case, once trained, one can look at the response of the data in independent multipole paths; the larger the response, the more informative that multipole is about the role of the data in the neural network. The code for producing the multipole kernels can be found at github:multipole_kernels.
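As a minimal sketch of the ℓ=0 case only (this is not the code from github:multipole_kernels, and the function name `radial_kernel` is hypothetical), the weight sharing can be built by grouping kernel elements by their squared distance from the centre. For a 3x3x3 kernel this gives four shells: the centre, the faces, the edges and the corners.

```python
import numpy as np

def radial_kernel(weights, size=3):
    """Build a size^3 kernel whose entries depend only on distance from the centre."""
    centre = size // 2
    offsets = np.indices((size, size, size)) - centre
    r2 = (offsets ** 2).sum(axis=0)       # squared distance of each element from the centre
    shells = np.unique(r2)                # one shared weight per distinct distance
    if len(weights) != len(shells):
        raise ValueError(f"expected {len(shells)} weights, got {len(weights)}")
    kernel = np.zeros((size, size, size))
    for weight, shell in zip(weights, shells):
        kernel[r2 == shell] = weight
    return kernel, r2

# Four free parameters: centre, faces, edges, corners (values chosen arbitrarily)
kernel, r2 = radial_kernel([1.0, 0.5, 0.25, 0.125])
shell_sizes = [int((r2 == shell).sum()) for shell in np.unique(r2)]
print(shell_sizes)  # [1, 6, 12, 8], which sums to the 27 elements of the kernel
```

In a real network the four weights would be the trainable parameters and the full kernel would be rebuilt from them before each convolution.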

The size of the convolutional kernel used is extremely important for a neural physical engine. The size of the kernel is known as the receptive field, and dictates the size of the correlations which can be learned by the neural network. The receptive field should be chosen based on the data. If it is too small then it is impossible to learn about relevant features in the data, and the network will tend to average out even the small scale features since it cannot distinguish the large scale modes. Likewise, if the receptive field is too large then the kernel will be massively overparameterised, which can lead to overfitting and the fitting of spurious large scale features of the data. Since these large scale features are less common they are less likely to be averaged out during training. This leads to a network which is difficult to train and has a much larger computational cost. It should be noted that stacking convolutions leads to a larger receptive field throughout the network, but does not protect one from the above problems. The kernel size should be chosen carefully at each layer to make the most of the distribution of information at each layer independently (this can be very tricky to do).
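To see how stacking convolutions enlarges the receptive field, here is a small sketch that assumes stride-1, undilated convolutions (an assumption made for illustration; the helper `receptive_field` is hypothetical). Each layer with kernel size k extends the field by k - 1 voxels per side.

```python
def receptive_field(kernel_sizes):
    """Receptive field (voxels per side) of stacked stride-1, undilated convolutions."""
    field = 1
    for k in kernel_sizes:
        field += k - 1   # each layer adds (k - 1) voxels to the extent seen by one output
    return field

print(receptive_field([3]))        # 3
print(receptive_field([3, 3, 3]))  # 7
```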

Neural density estimators

Since we wish to model the halo mass distribution function we need to consider an architecture whose output is a function (or at least an evaluation of the function). To do so we use a modified mixture density network which is a type of neural density estimator. Neural density estimators are neural networks whose outputs are samples from a fitted probability distribution function. For the halo mass distribution function we use a mixture of two Gaussian distributions where we allow the predicted amplitudes to be free positive parameters but organise the predicted mean parameters in order of magnitude. This breaks the degeneracy between the two Gaussians and allows us to have a smooth function whose amplitude can accurately approximate the abundance of halos.

A mixture density network is a neural network which maps an input to a set of parameters for a collection of probability distributions. For example, one can predict the means, μ, standard deviations, σ, and amplitudes, α, of several Gaussian distributions and sum these Gaussians together. Provided that the amplitudes sum to 1, the mixture density will remain correctly normalised to be interpreted as a probability distribution. The mixture density network can then be trained by evaluating the value of the distribution at the labels for the input data and minimising the negative logarithm of the distribution.
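The training objective described above can be sketched as follows. This is a minimal illustration of the standard (normalised) mixture density loss, not the modified network used in the paper; the function name `mixture_nll` and the array shapes are assumptions. The means, standard deviations and amplitudes would in practice be outputs of the network.

```python
import numpy as np

def mixture_nll(labels, means, sigmas, amplitudes):
    """Mean negative log-density of a Gaussian mixture evaluated at the labels.

    labels has shape (n,); means, sigmas and amplitudes have shape (n, k)
    for a mixture of k Gaussians per input.
    """
    # Normalise the amplitudes so that the mixture integrates to one
    weights = amplitudes / amplitudes.sum(axis=1, keepdims=True)
    z = (labels[:, None] - means) / sigmas
    components = weights * np.exp(-0.5 * z ** 2) / (sigmas * np.sqrt(2.0 * np.pi))
    return -np.log(components.sum(axis=1)).mean()

# Two identical unit Gaussians centred on the labels reduce to a single N(0, 1)
labels = np.array([0.0, 0.0])
loss = mixture_nll(labels, np.zeros((2, 2)), np.ones((2, 2)), np.full((2, 2), 0.5))
print(loss)  # equals 0.5 * ln(2 * pi) for this degenerate case
```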

Likelihood, \(\mathcal{L}(\boldsymbol{\theta}|\boldsymbol{\delta}_\textsf{LPT})\)

To fit the halo mass distribution to the halo catalogue used in this work we consider a Poisson likelihood. If our evolved dark matter density field, \(\boldsymbol{\delta}_\textsf{LPT}\), is passed through the neural physical engine, with parameters \(\boldsymbol{\theta}_\textsf{NPE}\), to get a field of summaries, \(\boldsymbol{\psi}_\textsf{NPE} = \boldsymbol{\psi}_\textsf{NPE}(\boldsymbol{\delta}_\textsf{LPT}, \boldsymbol{\theta}_\textsf{NPE})\), our halo mass distribution function is given by

\[n(M|\boldsymbol{\psi}_\textsf{NPE}, \boldsymbol{\theta}_\textsf{MDN})= \sum_{i=1,2} \alpha(\boldsymbol{\psi}_\textsf{NPE}, \boldsymbol{\theta}_i)\mathcal{N}(M| \mu(\boldsymbol{\psi}_\textsf{NPE}, \boldsymbol{\theta}_i),\sigma(\boldsymbol{\psi}_\textsf{NPE}, \boldsymbol{\theta}_i)),\]

where \(\mathcal{N}(M|\mu, \sigma)\) is the value of a Gaussian with mean \(\mu\) and standard deviation \(\sigma\) evaluated at halo mass \(\textsf{log}(M)\). The Poisson likelihood can be written as two terms. The first term evaluates the neural halo mass distribution function for every halo in the catalogue, where the density environment is obtained from the patch of \(\boldsymbol{\delta}_\textsf{LPT}\) around each voxel index corresponding to each halo. This term therefore fits the abundance scale due to the catalogue. The second term is the integral over halo mass of the whole function for the entire evolved density field and therefore fits the shape of the function.
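One way to make the two terms concrete, assuming the halos form a Poisson point process whose rate in each voxel is the halo mass distribution function evaluated there, is the following. The index \(h\) over halos, the index \(x\) over voxels, the voxel \(x_h\) containing halo \(h\) and the mass threshold \(M_\textsf{th}\) are notation introduced here for illustration, and any constant volume factor is taken to be absorbed into \(n\):

\[\ln\mathcal{L} = \sum_{h}\ln n(M_h|\boldsymbol{\psi}_{\textsf{NPE},x_h}, \boldsymbol{\theta}_\textsf{MDN}) - \sum_{x}\int_{M_\textsf{th}}^{\infty} dM~n(M|\boldsymbol{\psi}_{\textsf{NPE},x}, \boldsymbol{\theta}_\textsf{MDN}).\]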

Note that by using this likelihood we never have to explicitly make a stochastic sampling of the halos to compare to the catalogue, although we could use the fitted halo mass distribution function to generate halo catalogues by using the value of the evaluated neural bias model as the rate parameter for Poisson sampling.

We will also include a Gaussian prior, \(\pi(\boldsymbol{\theta})\), on all the parameters of the neural bias model. We ensure that these weights and biases are centred on zero by rescaling them using prior knowledge of the amplitude of the abundance measured from the halo catalogue and the halo mass threshold. Since the parameters of the neural bias model are centred on zero, we just need to choose a width for the Gaussian prior which is large enough to allow for parameter exploration, but tight enough to make sampling the parameters feasible.

HMCLET

To be able to sample the weights of the neural bias model we use a modified Hamiltonian Monte Carlo. Hamiltonian Monte Carlo is a way of efficiently drawing samples from extremely high-dimensional distributions. One starts with an initial set of neural bias model parameters, \(\boldsymbol{\theta}_0\), and proposes a new set, \(\boldsymbol{\theta}^*\), given a momentum, \({\bf p}\), drawn from a proposal distribution, \({\bf p} \sim \mathcal{N}({\bf 0}, {\bf M})\). \({\bf M}\) is a mass matrix which describes the time scale along each parameter direction and the correlation between the parameters. One then solves Hamilton’s equations, \(d\boldsymbol{\theta}/dt = {\bf M}^{-1}{\bf p}\) and \(d{\bf p}/dt = -\nabla \mathcal{V}(\boldsymbol{\theta})\), where the Hamiltonian is \(\mathcal{H}(\boldsymbol{\theta}, {\bf p}) = \mathcal{V}(\boldsymbol{\theta}) + \mathcal{K}({\bf p})\), with \(\mathcal{V}(\boldsymbol{\theta}) = -\ln\mathcal{L}(\boldsymbol{\theta}|\boldsymbol{\delta}_\textsf{LPT}) - \ln\pi(\boldsymbol{\theta})\) as the potential energy formed from the negative logarithms of the likelihood and the prior, and \(\mathcal{K}({\bf p}) = \frac{1}{2}{\bf p}^\textsf{T}{\bf M}^{-1}{\bf p}\) as the kinetic energy. Proposed parameters are then accepted with a probability given by \(\alpha = \textsf{Min}[\textsf{exp}(-\Delta\mathcal{H}), 1]\), where \(\Delta\mathcal{H}\) is the difference between the energy at the proposed parameter values and the current parameter values. If energy is conserved exactly, every proposal is accepted. It is usual to use a symplectic integration scheme, such as the leapfrog algorithm (\(\epsilon\)-discretisation), to solve these ODEs.

The leapfrog algorithm involves drawing a momentum from a proposal distribution, \({\bf p} \sim \mathcal{N}({\bf 0}, {\bf M})\), and updating the momentum by half a step of size \(\epsilon\) at the initial parameter positions \(\boldsymbol{\theta}_0\), \({\bf p} = {\bf p} - \epsilon\nabla \mathcal{V}(\boldsymbol{\theta}_0)/2\), followed by a full step in the parameters, \(\boldsymbol{\theta}_\textsf{next} = \boldsymbol{\theta}_0+\epsilon{\bf M}^{-1}{\bf p}\). This makes up the first half step in the leapfrog. The same procedure of updating \({\bf p}\) and \(\boldsymbol{\theta}\) is repeated for N steps, where the rest of the momentum updates are full steps (\({\bf p} = {\bf p}-\epsilon\nabla \mathcal{V}(\boldsymbol{\theta})\)). The last half step is then taken. The choice of \(\epsilon\) dictates the accuracy of the integration. If \(\epsilon\) is large then Hamilton’s equations are solved less accurately, which can lead to a change in energy between the initial and proposed parameters and so increases the rejection rate. On the other hand, if \(\epsilon\) is small then more samples are accepted since the energy error is smaller, but this comes at a higher computational cost.
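The leapfrog scheme and the accept/reject step can be sketched in a few lines. This is a plain Hamiltonian Monte Carlo illustration on a toy Gaussian target, not the modified sampler used in this work. It assumes the standard conventions of a kinetic energy \(\frac{1}{2}{\bf p}^\textsf{T}{\bf M}^{-1}{\bf p}\) and an acceptance probability of Min[exp(-ΔH), 1], and the function names `leapfrog` and `hmc_step` are hypothetical.

```python
import numpy as np

def leapfrog(theta, p, grad_V, eps, n_steps, M_inv):
    """Integrate Hamilton's equations with n_steps leapfrog steps of size eps."""
    theta, p = theta.copy(), p.copy()
    p -= 0.5 * eps * grad_V(theta)       # first half step in momentum
    for _ in range(n_steps - 1):
        theta += eps * (M_inv @ p)       # full step in the parameters
        p -= eps * grad_V(theta)         # full step in momentum
    theta += eps * (M_inv @ p)
    p -= 0.5 * eps * grad_V(theta)       # last half step in momentum
    return theta, p

def hmc_step(theta, V, grad_V, eps, n_steps, M, rng):
    """Propose new parameters and accept with probability min[1, exp(-dH)]."""
    M_inv = np.linalg.inv(M)
    p0 = rng.multivariate_normal(np.zeros(theta.size), M)
    H0 = V(theta) + 0.5 * p0 @ M_inv @ p0
    theta_new, p_new = leapfrog(theta, p0, grad_V, eps, n_steps, M_inv)
    H1 = V(theta_new) + 0.5 * p_new @ M_inv @ p_new
    accepted = np.log(rng.uniform()) < H0 - H1
    return (theta_new if accepted else theta), bool(accepted)

# Toy target (assumption): a two-dimensional unit Gaussian, V = theta.theta / 2
rng = np.random.default_rng(1)
V = lambda th: 0.5 * th @ th
grad_V = lambda th: th
theta, M = np.zeros(2), np.eye(2)
samples, n_accept = [], 0
for _ in range(2000):
    theta, accepted = hmc_step(theta, V, grad_V, eps=0.2, n_steps=10, M=M, rng=rng)
    samples.append(theta)
    n_accept += accepted
samples = np.array(samples)
acceptance_rate = n_accept / len(samples)
print(acceptance_rate, samples.mean(), samples.var())
```

Increasing `eps` in this toy example lowers the acceptance rate, while decreasing it raises the number of gradient evaluations needed to travel the same distance.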

Since neural networks are complex and in general have a large number of (at least partially) degenerate parameters, it is very difficult to know the mass matrix a priori. This means that extremely large steps can be made along the likelihood surface, leading to numerical stability issues and improper sampling. To overcome this, we can consider using the second order geometric information of the likelihood surface by calculating its Hessian using quasi-Newtonian methods.

The Hessian (\({\bf B}\)), i.e. the second order gradient, of the likelihood surface can be calculated using quasi-Newtonian methods. Quasi-Newtonian methods are root-finding algorithms in which the Hessian (or Jacobian) is approximated rather than computed exactly. There are many ways to calculate the approximate Hessian; we use the BFGS method in this work. This method is convenient since it can be calculated for free as part of the leapfrog algorithm. When using the second order geometric information the ODEs become \(d\boldsymbol{\theta}/dt = {\bf B}{\bf M}^{-1}{\bf p}\) and \(d{\bf p}/dt = -{\bf B}\nabla \mathcal{V}(\boldsymbol{\theta})\). This means that, although the mass matrix is still needed to set the time scales along the parameter directions, the momenta get effectively rescaled by the Hessian, breaking parameter degeneracies and allowing for an efficient acceptance ratio.
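For reference, a single textbook BFGS update can be written as below. This sketch uses the inverse-Hessian form of the update, which is an assumption on my part: the text does not specify which form the sampler stores, nor how the update is interleaved with the leapfrog steps. The function name `bfgs_inverse_update` is hypothetical. The inputs are the change in parameters, `s`, and the change in gradient, `y`, between two successive positions, both of which are already available during a leapfrog trajectory.

```python
import numpy as np

def bfgs_inverse_update(B, s, y):
    """One BFGS update of an inverse-Hessian approximation B.

    s is the change in parameters and y the change in gradient between
    two successive positions; the update requires y @ s > 0.
    """
    rho = 1.0 / (y @ s)
    identity = np.eye(len(s))
    left = identity - rho * np.outer(s, y)
    right = identity - rho * np.outer(y, s)
    return left @ B @ right + rho * np.outer(s, s)

# Quadratic toy potential with Hessian A, so that y = A s exactly
A = np.array([[2.0, 0.5], [0.5, 1.0]])
s = np.array([0.3, -0.1])
y = A @ s
B_new = bfgs_inverse_update(np.eye(2), s, y)
print(np.allclose(B_new @ y, s))  # the updated matrix satisfies the secant condition
```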

Results

With a neural bias model formed of a neural physical engine which is sensitive to non-local radial information, a neural density estimator to give us evaluations of sufficiently arbitrary functions and a sampling scheme which can effectively explore the complex likelihood landscape we can now infer the halo mass distribution function.

The BORG algorithm infers the initial conditions of the dark matter distribution. First the initial conditions are drawn from a prior given a cosmology to generate an initial dark matter density field. In this work, this dark matter density field is then evolved forward using Lagrangian perturbation theory to obtain the dark matter density field today. This is then passed through the neural physical engine to obtain an informative field of summaries about the abundance of dark matter halos on a grid. This can then be compared to the observed halo catalogue via the Poissonian likelihood, using the halo mass distribution function provided by the neural density estimator of the neural bias model. Evaluating this likelihood allows us to obtain posterior samples of all of the initial phases of the dark matter density distribution and all of the parameters of the neural bias model.

We use a halo catalogue constructed using Rockstar from a chunk of the VELMASS Ω dark matter simulation, which has a Planck-like cosmology. This catalogue has about 10,000 halos with a mass threshold of 2x10¹² solar masses.

As shown in the figures below, we are able to fit the halo mass distribution function extremely well, with sampling around the observed catalogue. Furthermore, the information used comes from the non-local region around each voxel in the gridded density field, showing that the surrounding area holds information about the abundance of halos.

The abundance of halos at a certain mass given a density environment from the VELMASS halo catalogue is plotted using the diamonds with dashed lines. The more dense the environment, the more halos are expected at all masses. The solid lines are the mean halo mass distribution function values from the neural bias model. The filled areas are the 1σ intervals either side of the mean obtained by the samples from the Markov chain. We can see that the fit is very good (even with the very simple model considered here), and that the shape of the function changes with density environment. This shows that the neural bias model is able to account for the response of the density field.

Here we see an example of an initial density field and the same field evolved using Lagrangian perturbation theory on the top row. The bottom row shows the effect of the neural physical engine which provides an enhancement in contrast, which is a more informative summary of the abundance of halos than the LPT field. This is because non-local information is gathered from the surrounding voxels by the neural physical engine. The last box (bottom right) is the true halos from the VELMASS halo catalogue placed onto the same grid. Note that the NPE field does not look like the halo distribution since a Poisson sampling of the halo mass distribution function is needed to get a stochastic realisation of the halo distribution.

Future work

The methods presented in this paper show a state of the art in terms of machine learning as well as new methods for dealing with the bias model in BORG and for generating halo catalogues from the neural bias model. We will continue our work in two main directions. The first is to look at bypassing the halos completely by learning the form of the likelihood using some form of neural density estimation (or neural flow) which would allow us to be more agnostic about the form of the likelihood. This would mean that we could, in principle, marginalise out the effect of the ambiguity in the likelihood to provide robust constraints on the initial density phases and cosmology. The second is to use architecture optimisation schemes to find a better fit to the halo mass distribution function for use in halo catalogue generation.

References

Tom Charnock, Guilhem Lavaux, Benjamin D. Wandelt, Supranta Sarma Boruah, Jens Jasche, Michael J. Hudson, 2019, submitted to MNRAS, arXiv:1909.06379

A fifth-force resolution of the Hubble tension

Sat, 13 Jul 2019 00:00:00 +0300

Background

At least on large scales, the standard cosmological model suffers from just one \(>3\sigma\) inconsistency. This is the Hubble tension: while the local expansion rate inferred from the Cosmic Microwave Background is \(67.4 \pm 0.5\) km s\(^{-1}\) Mpc\(^{-1}\), \(H_0\) measured locally (by combining distance measurements to objects successively further away in a “cosmic distance ladder”) is \(74.03 \pm 1.42\) km s\(^{-1}\) Mpc\(^{-1}\). This discrepancy is \(4.4\sigma\), and appears to imply some form of new physics that invalidates direct comparison between low and high redshift probes of \(H_0\) within \(\Lambda\)CDM.
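As a quick check of the quoted significance, assuming the two measurements are independent with Gaussian uncertainties that add in quadrature:

\[\frac{74.03 - 67.4}{\sqrt{1.42^2 + 0.5^2}} = \frac{6.63}{1.51} \approx 4.4.\]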

A key assumption in the local measurement of \(H_0\) is that the objects that calibrate the distance ladder – primarily Cepheid stars and Type Ia supernovae – have identical properties between successive rungs. But in a wide variety of beyond-\(\Lambda\)CDM cosmological models which invoke so-called “screened fifth forces”, this is likely not true. Rather, while the Cepheids in the Milky Way and NGC 4258 (whose distance is measured independently by means of a water maser) will be screened by the dense environments of their hosts, those at higher redshift that calibrate supernova absolute magnitudes will be unscreened and hence feel the full fifth force. This induces a bias in the Cepheid period–luminosity relation which causes the conventional analysis to underestimate the distance to extragalactic Cepheid hosts, and hence, at fixed redshift, to overestimate \(H_0\) (Figure 1). Thus, in such models, the expansion rate measured locally would be more in accord with that inferred from recombination.

Left panel: The rungs of the cosmic distance ladder and their typical screening status. Right panel: The Cepheid period–luminosity relation when various parts of a Cepheid are unscreened. Assuming unscreened Cepheids lie on the Newtonian relation underestimates their luminosity and hence their distance.

Unscreening the cosmic distance ladder

We have quantified these effects to flesh out this potential resolution to the Hubble tension¹. We began by formulating a set of observational proxies for the screening behaviour of Cepheids, which encompasses well-studied screening mechanisms such as chameleon, k-mouflage and Vainshtein, a newly-proposed mechanism based on interactions between baryons and dark matter², and others described phenomenologically and not yet associated with an underlying theory. We then utilised the density field reconstruction of the BORG-PM model, as encapsulated in the screening maps described in an earlier post, to evaluate these proxies over the Cepheids used in the distance ladder, and hence calculate the change in distance to the Cepheid hosts that the action of a screened fifth force would imply.

The magnitude of the difference depends on the strength of the fifth force. We determined maximum viable values of this in our models by means of consistency tests within the distance ladder data. The most constraining test compares the distances to galaxies measured by both the Cepheid period–luminosity relation and the tip of the red giant branch: these distances are pushed in different directions by a fifth force, so their consistency imposes a limit on the force’s strength. This is shown in Fig. 2, as a function of the fraction of galaxies that are unscreened and separately for the cases in which Cepheid cores (governing luminosity) are unscreened, or only Cepheid envelopes (governing period). This test is the strongest of its kind, and is completely agnostic as to the nature or origin of the modification to gravity.

Constraints on fifth-force strength (relative to gravity) from comparing Cepheid and tip-of-the-red-giant-branch distances, as a function of the fraction of galaxies that are unscreened. Dashed lines indicate typical unscreened fractions in our models.

\(1.5\sigma\) consistency of local and CMB \(H_0\)

Setting the screening threshold to ensure that the galaxies that calibrate the period–luminosity relation (N4258 and the MW) are screened, and imposing the bound on fifth-force strength from Fig. 2, we calculated the maximum reduction in the inferred \(H_0\) that each model could afford. Our results are shown in Fig. 3. While models that only unscreen Cepheid envelopes (right panel) can reduce the tension with Planck to \(\gtrsim2\sigma\), those that unscreen cores (among them the baryon–dark matter interaction model, a dark energy model that is otherwise very little constrained) can achieve \(1.5\sigma\) consistency. These results reveal another possible advantage to cosmologies with fifth forces, as well as demonstrating more generally that novel local resolutions of the \(H_0\) problem are possible.

Constraints on local \(H_0\) for each of our screening models. The most successful models reach 1.5\(\sigma\) consistency with the Planck result, well below the level at which statistical fluctuations may account for the discrepancy.

Harry Desmond, Bhuvnesh Jain, Jeremy Sakstein, 2019, A local resolution of the Hubble tension: The impact of screened fifth forces on the cosmic distance ladder, submitted to Phys. Rev. D., arxiv 1907.03778 ↩
J. Sakstein, H. Desmond, B. Jain, 2019, Screened Fifth Forces Mediated by Dark Matter–Baryon Interactions: Theory and Astrophysical Probes, submitted to Phys. Rev. D., arxiv 1907.03775 ↩

Algorithms for likelihood-free cosmological data analysis

Thu, 25 Apr 2019 00:00:00 +0300

Overview

The extraction of physical information from wide and deep astronomical surveys relies on statistical techniques to compare models and observations. A common scenario in cosmology is when we can generate synthetic data through forward simulations, but cannot explicitly formulate the likelihood of the model. The generative process can be extremely general (a noisy non-linear dynamical system involving an unrestricted number of latent variables) and is often computationally expensive. Likelihood-free inference (LFI) provides a framework for performing Bayesian inference in this context, by replacing likelihood calculations with data model evaluations. In its simplest form, LFI takes the form of likelihood-free rejection sampling (LFRS), which tends to be (i) extremely expensive, since many simulated data sets get rejected, and (ii) very limited in the number of parameters that can be treated.
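To make the cost of LFRS concrete, the following is a minimal sketch of likelihood-free rejection sampling on a toy problem. The one-parameter `simulator`, the `summary` function, the uniform prior on [-5, 5] and the tolerance `epsilon` are all hypothetical choices made for illustration; they are not taken from the articles.

```python
import random

def simulator(theta, rng, n_data=50):
    """Hypothetical black box: n_data noisy draws with mean theta."""
    return [rng.gauss(theta, 1.0) for _ in range(n_data)]

def summary(data):
    """Data compression step: here simply the sample mean."""
    return sum(data) / len(data)

def lfrs(observed_summary, n_draws, epsilon, rng):
    """Likelihood-free rejection sampling with a uniform prior on [-5, 5]."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)              # draw from the prior
        simulated = summary(simulator(theta, rng))  # one black-box evaluation
        if abs(simulated - observed_summary) < epsilon:
            accepted.append(theta)                  # keep only close matches
    return accepted

rng = random.Random(0)
observed = summary(simulator(1.5, rng))  # mock observation generated at theta = 1.5
n_draws = 5000
samples = lfrs(observed, n_draws=n_draws, epsilon=0.1, rng=rng)
acceptance_rate = len(samples) / n_draws
posterior_mean = sum(samples) / len(samples)
```

Even in one dimension, only a small fraction of the simulations is accepted, and that fraction shrinks quickly as the tolerance is tightened or parameters are added.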

In two recent articles, we presented methodological advances, aiming at fitting cosmological data with “black-box” numerical models. Each of them addresses one of the shortcomings of LFRS. The first approach, BOLFI, is intended for specific cosmological models (with \(n \lesssim 10\) parameters) and a general exploration of parameter space. It combines Gaussian process regression of the distance between observed and simulated data with Bayesian optimization. As a result, the number of required simulations is reduced by several orders of magnitude with respect to LFRS. The second approach, SELFI, allows the inference of \(n \gtrsim 100\) parameters (as is necessary for a model-independent parametrization of theory) while assuming stronger prior constraints in parameter space. It relies on a Taylor expansion of the simulator to build an effective posterior distribution. The resulting algorithm allows LFI in much higher-dimensional settings than LFRS.

Likelihood-free inference of black-box data models

Simulator-based statistical models are usually given in terms of numerical “black-boxes”. They provide realistic predictions for artificial observations when provided with all necessary input parameters. These consist of target parameters as well as nuisance parameters such as initial phases, noise realization, sample variance, etc. This “latent space” can often be hundred-to-multi-million dimensional. Once all input parameters are fixed, the black-box typically consists of a simulation step and a data compression step. Black-box models can be written in a hierarchical form and conveniently represented graphically (figure 1).

Hierarchical representation of a typical black-box data model. The rounded green boxes represent probability distributions and the purple squares represent deterministic functions. For more details, see figure 1 in Leclercq et al. 2019.¹

The goal of LFI is to find suitable approximations that allow an estimation of the probability distribution of target parameters conditional on observed data summaries, using only black-box evaluations.

BOLFI: Bayesian Optimization for Likelihood-Free Inference

BOLFI (Bayesian Optimization for Likelihood-Free Inference²³) is a cutting-edge machine learning algorithm for LFI under the constraint of a very limited simulation budget (typically a few thousand), suitable when the problem has a sufficiently small number of target parameters (\(n \lesssim 10\)). Conventional approaches such as LFRS generally require too many simulations, due to their lack of knowledge about how the parameters affect the distance between observed and simulated data. As a response, BOLFI combines Gaussian process regression of this distance to build a surrogate surface with Bayesian Optimization to actively acquire training data (figure 2).

Illustration of four consecutive steps of Bayesian optimization to learn a test function. For each step, the top panel shows the training data points (red dots) and the Gaussian process regression (blue line and shaded region). The bottom panel shows the acquisition function (solid green line). The next acquisition point, i.e. where to run a simulation to be added to the training set, is shown in orange. For more details, see figure 4 in Leclercq 2018.³

The target parameter space is explored efficiently and in all generality. We extended the method to use the optimal acquisition function for the purpose of minimizing the expected uncertainty in the approximate posterior density, in the parametric approach to likelihood approximation. As a result, the number of required simulations is typically reduced by two to three orders of magnitude, and the proposed acquisition function produces more accurate posterior approximations, as compared to LFRS.
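The loop described above (regress the distance with a Gaussian process, then use an acquisition function to pick the next simulation) can be sketched in one dimension as follows. This is an illustration under stated assumptions, not the BOLFI implementation: the `distance` function is a hypothetical stand-in for a black box, the kernel hyperparameters are fixed by hand, and the acquisition rule is a simple lower confidence bound rather than the optimal acquisition function discussed in the text.

```python
import numpy as np

def rbf_kernel(a, b, length=0.5):
    """Unit-variance squared-exponential covariance between 1D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

def gp_predict(x_train, y_train, x_test, noise=0.05):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_tt = rbf_kernel(x_train, x_train) + noise**2 * np.eye(len(x_train))
    k_st = rbf_kernel(x_test, x_train)
    mean = k_st @ np.linalg.solve(k_tt, y_train)
    var = 1.0 - np.sum(k_st * np.linalg.solve(k_tt, k_st.T).T, axis=1)
    return mean, np.sqrt(np.maximum(var, 0.0))

def standardise(values):
    """Rescale the training distances to zero mean and unit variance."""
    return (values - values.mean()) / values.std()

def distance(theta, rng):
    """Hypothetical black box: noisy discrepancy, smallest at theta = 1."""
    return (theta - 1.0) ** 2 + 0.01 * rng.standard_normal()

rng = np.random.default_rng(0)
grid = np.linspace(-3.0, 3.0, 301)        # candidate parameter values
thetas = np.array([-2.5, 0.0, 2.5])       # initial training set
dists = np.array([distance(t, rng) for t in thetas])

for _ in range(12):
    mean, std = gp_predict(thetas, standardise(dists), grid)
    acquisition = mean - 2.0 * std        # lower confidence bound (assumed rule)
    theta_next = grid[np.argmin(acquisition)]
    thetas = np.append(thetas, theta_next)
    dists = np.append(dists, distance(theta_next, rng))

mean, _ = gp_predict(thetas, standardise(dists), grid)
theta_best = grid[np.argmin(mean)]        # minimum of the surrogate surface
```

Only 15 black-box evaluations are spent in total, and each new one is placed where the surrogate is either promising or uncertain.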

SELFI: Simulator Expansion for Likelihood-Free Inference

Another limitation of conventional approaches to LFI is their inability to scale with the number of target parameters. In order to address problems of high-dimensional inference from black-box data models, we introduced SELFI (Simulator Expansion for Likelihood-Free Inference¹). Our approach builds upon a novel effective likelihood and upon the linearization of the simulator around an expansion point in parameter space. The workload with SELFI consists of evaluating the covariance matrix and the gradient of data summaries at the expansion point (figure 3). Unlike the workload of likelihood-based Markov Chain Monte Carlo (MCMC) techniques and of BOLFI, this workload is fixed a priori and perfectly parallel.

Covariance matrix (left) and gradient (right) of data summaries at the expansion point, evaluated through black-box realizations only. These are the only two ingredients necessary to apply SELFI. For more details, see figures 6 and 7 in Leclercq et al. 2019.¹

The effective posterior of the target parameters is then obtained through simple “filter equations,” the form of which is analogous to a Wiener filter. SELFI allows the solution of inference tasks from black-box data models, in much higher dimension than conventional approaches to LFI.
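As an illustration of what such filter equations look like, the sketch below applies the standard linear-Gaussian update to a simulator linearized around an expansion point. It assumes a Gaussian prior centred on the expansion point; the symbol names, the toy `black_box` and the finite-difference gradient are hypothetical choices, and the exact expressions used by SELFI should be taken from the paper.

```python
import numpy as np

def selfi_posterior(theta_0, prior_cov, f_0, grad_f0, cov_0, phi_obs):
    """Wiener-filter-like update around the expansion point theta_0.

    f_0, cov_0 : mean and covariance of the summaries at theta_0
    grad_f0    : gradient of the summaries, shape (n_summaries, n_params)
    phi_obs    : observed summaries
    """
    cov_inv_grad = np.linalg.solve(cov_0, grad_f0)          # C_0^{-1} grad f_0
    precision = grad_f0.T @ cov_inv_grad + np.linalg.inv(prior_cov)
    post_cov = np.linalg.inv(precision)
    post_mean = theta_0 + post_cov @ cov_inv_grad.T @ (phi_obs - f_0)
    return post_mean, post_cov

n_params, n_summaries = 3, 10
mixing = np.random.default_rng(1).normal(size=(n_summaries, n_params))

def black_box(theta, seed):
    """Hypothetical simulator: linear response plus Gaussian noise."""
    noise = np.random.default_rng(seed).normal(scale=0.05, size=n_summaries)
    return mixing @ theta + noise

theta_0 = np.zeros(n_params)

# Covariance and mean of the summaries from independent realizations.
sims = np.array([black_box(theta_0, seed) for seed in range(200)])
f_0 = sims.mean(axis=0)
cov_0 = np.cov(sims, rowvar=False)

# Finite-difference gradient, reusing one seed so that the noise cancels.
step = 1e-2
grad_f0 = np.column_stack([
    (black_box(theta_0 + step * np.eye(n_params)[i], seed=0)
     - black_box(theta_0, seed=0)) / step
    for i in range(n_params)
])

theta_true = np.array([0.3, -0.2, 0.1])
phi_obs = black_box(theta_true, seed=12345)
post_mean, post_cov = selfi_posterior(
    theta_0, np.eye(n_params), f_0, grad_f0, cov_0, phi_obs
)
```

All simulations entering `cov_0` and `grad_f0` are independent of one another, which is why they can be run in parallel before any inference takes place.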

Cosmological applications: key results

In the respective papers, we presented the first applications of BOLFI and SELFI to cosmological data analysis.

Supernova cosmology with BOLFI

We applied BOLFI to the inference of cosmological parameters from the Joint Lightcurve Analysis (JLA) supernovae data. The model contains two cosmological parameters (the matter density of the Universe \(\Omega_m\) and the equation of state of dark energy \(w\)) and four nuisance parameters, which are marginalized over. The posterior contours obtained with MCMC, LFRS, and BOLFI are represented in figure 4.

Prior and posterior distributions for the joint inference of the matter density of the Universe, \(\Omega_m\), and the dark energy equation of state, \(w\), from the JLA supernovae data set. BOLFI (red posterior) reduces the number of necessary simulations by two orders of magnitude with respect to LFRS (green posterior) and three orders of magnitude with respect to MCMC (orange posterior). For more details, see figure 7 in Leclercq 2018.³

As can be observed, BOLFI is able to precisely recover the true posterior with as few as 6,000 simulations, which constitutes a reduction by two orders of magnitude with respect to LFRS and three orders of magnitude with respect to MCMC. This reduction in the number of required simulations accelerates the inference massively.
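For orientation, the dependence of supernova distances on \(\Omega_m\) and \(w\) can be sketched with the textbook distance modulus. The code below assumes a flat universe, a constant equation of state and a fiducial \(H_0 = 70\) km/s/Mpc, none of which is stated in this post; the four nuisance parameters of the actual analysis are not modelled.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s

def expansion_rate(z, omega_m, w):
    """Dimensionless E(z) for a flat universe with constant w (assumption)."""
    dark_energy = (1.0 - omega_m) * (1.0 + z) ** (3.0 * (1.0 + w))
    return math.sqrt(omega_m * (1.0 + z) ** 3 + dark_energy)

def luminosity_distance(z, omega_m, w, h0=70.0, n_steps=1000):
    """Luminosity distance in Mpc; Simpson's rule, n_steps must be even."""
    dz = z / n_steps
    total = 1.0 / expansion_rate(0.0, omega_m, w) + 1.0 / expansion_rate(z, omega_m, w)
    for i in range(1, n_steps):
        weight = 4.0 if i % 2 else 2.0
        total += weight / expansion_rate(i * dz, omega_m, w)
    comoving = C_KM_S / h0 * total * dz / 3.0
    return (1.0 + z) * comoving

def distance_modulus(z, omega_m, w, h0=70.0):
    """Distance modulus mu = 5 log10(d_L / Mpc) + 25."""
    return 5.0 * math.log10(luminosity_distance(z, omega_m, w, h0)) + 25.0

# Check against the analytic matter-only case, then compare two models.
z = 1.0
analytic = (1.0 + z) * 2.0 * C_KM_S / 70.0 * (1.0 - 1.0 / math.sqrt(1.0 + z))
numeric = luminosity_distance(z, omega_m=1.0, w=-1.0)
mu_lcdm = distance_modulus(0.5, omega_m=0.3, w=-1.0)
mu_matter_only = distance_modulus(0.5, omega_m=1.0, w=-1.0)
```

Supernovae at a given redshift appear fainter (larger distance modulus) in the model with dark energy than in the matter-only model, which is the signal that constrains \(\Omega_m\) and \(w\).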

Primordial power spectrum and cosmological parameters inference with SELFI

We applied SELFI to a realistic synthetic galaxy survey, with a data model accounting for physical structure formation and incomplete and noisy observations. This data model is provided by the publicly-available Simbelmynë code, a hierarchical probabilistic simulator of galaxy survey data.⁴ Through this application, we showed that the use of non-linear numerical models allows the galaxy power spectrum to be fitted up to at least \(k_\mathrm{max} = 0.5~h/\mathrm{Mpc}\), which represents an increase by a factor of \(\sim 5\) in the number of modes used, with respect to traditional techniques. The result is an unbiased inference of the primordial power spectrum (living in \(n = 100\) dimensions) across the entire range of scales considered, including a high-fidelity reconstruction of baryon acoustic oscillations (figure 5).

Primordial power spectrum inference with SELFI from a realistic synthetic galaxy survey. In spite of survey complications which limit the information captured, the inference is unbiased and the signature of baryon acoustic oscillations is well reconstructed up to \(k \approx 0.3~h/\mathrm{Mpc}\), with 5 inferred acoustic peaks, a result which could be improved using more volume (this analysis uses \((1~\mathrm{Gpc}/h)^3\)). For more details, see figure 10 in Leclercq et al. 2019.¹

The primordial power spectrum can be seen as a largely agnostic and model-independent parametrization of theory, relying only on weak assumptions (isotropy and Gaussianity). Using the linearized black-box, it can easily be translated a posteriori into constraints on specific cosmological models without (or with minimal) loss of information. For instance, constraints on the parameters of the standard cosmological model, for two different synthetic data realizations (with different input cosmologies, phase and noise realizations), are shown in figure 6.

Cosmological parameter inference using a linearized black-box model of galaxy surveys. The prior is shown in blue, and the effective posteriors for two different data realizations are shown in red and purple.

We therefore obtain an unbiased and robust measurement of cosmological parameters.

F. Leclercq, W. Enzi, J. Jasche & A. Heavens 2019, Primordial power spectrum and cosmology from black-box galaxy surveys, MNRAS 490, 4237 (2019), arxiv:1902.10149 ↩ ↩² ↩³ ↩⁴
M. U. Gutmann & J. Corander 2016, Bayesian Optimization for Likelihood-Free Inference of Simulator-Based Statistical Models, Journal of Machine Learning Research 17, 1 (2016), arxiv:1501.03291 ↩
F. Leclercq 2018, Bayesian optimisation for likelihood-free cosmological inference, Physical Review D 98, 063511 (2018), arxiv:1805.07152 ↩ ↩² ↩³
The Simbelmynë code: homepage ↩

Painting halos from 3D dark matter fields

Sun, 31 Mar 2019 00:00:00 +0200

Overview

Investigating the formation and evolution of dark matter halos, as the key building blocks of cosmic large-scale structure, is essential for constraining various cosmological models and further understanding our Universe. The highly non-linear dynamics involved nevertheless renders this a complex problem, with computationally costly simulations of gravitational structure formation currently the only tool to compute the non-linear evolution from initial conditions, yielding mock dark matter halo catalogues as the main output. However, running very large simulations of pure dark matter several times to generate mock observations of the full Universe is not feasible, as it requires a large amount of memory and disk storage. A way to emulate such simulations, quickly and reliably, would be of use to a wide community as a new method for data analysis and light cone production for upcoming cosmological survey missions such as Euclid and the Large Synoptic Survey Telescope. In this context, we employ a deep learning approach to construct an emulator to learn the mapping from dark matter density to halo fields.

Halo painting network

Our physical mapping network is inspired by a recently proposed variant of generative models, known as generative adversarial networks (GANs). In particular, we use the key ideas in training WGANs, i.e. GANs optimized using the Wasserstein distance, to ensure that our network is able to paint halos well. A schematic of this Wasserstein mapping framework is provided in Fig. 1. Our generator is the halo painting network whose role is to learn the underlying non-linear relationship between the input 3D density field and the corresponding halo count distribution. Our critic provides as output the approximately learned Wasserstein distance between the real and predicted halo distributions. Intuitively, this Wasserstein distance can be interpreted as the amount of work required to transform a given probability distribution into the desired target distribution. This distance therefore corresponds to the loss function that must be minimized to train the halo painting network.

Schematic representation of Wasserstein halo painting network implemented in this work. The role of the generator is to learn the underlying non-linear relationship between the input 3D density field and the corresponding halo count distribution. The difference between the output of the critic for the real and predicted halo distributions is the approximately learnt Wasserstein distance and is used as the loss function which must be minimized to train the generator.
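The loss described above can be written down in a few lines. The sketch below uses a hypothetical linear `critic` acting on tiny hand-made "halo count" vectors purely to show the bookkeeping; in practice the critic is a neural network whose weights must be constrained (for example by clipping or a gradient penalty) for the estimate to approximate a Wasserstein distance.

```python
def critic(halo_counts, weights):
    """Hypothetical critic: a weighted sum of voxel values."""
    return sum(w * x for w, x in zip(weights, halo_counts))

def mean_score(batch, weights):
    """Average critic output over a batch of halo fields."""
    return sum(critic(sample, weights) for sample in batch) / len(batch)

def wasserstein_estimate(real_batch, predicted_batch, weights):
    """Difference of mean critic outputs for real and predicted halo fields."""
    return mean_score(real_batch, weights) - mean_score(predicted_batch, weights)

# Toy batches of flattened halo count fields (illustrative values only).
real = [[2.0, 0.0, 1.0], [3.0, 1.0, 0.0]]
predicted = [[1.0, 0.0, 1.0], [2.0, 1.0, 1.0]]
weights = [0.5, -0.2, 0.1]

estimate = wasserstein_estimate(real, predicted, weights)
critic_loss = -estimate     # the critic is trained to maximise the estimate
generator_loss = estimate   # the generator is trained to minimise it
```

The two networks therefore share a single quantity and pull it in opposite directions, which is what drives the predicted halo fields towards the reference ones.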

Remarkable performance of halo painting emulator

We showcased the performance of our halo painting model using quantitative diagnostics. As a preliminary qualitative assessment, we performed a visual comparison. Fig. 2 depicts the reference and predicted halo distributions. The qualitative agreement is impressive, implying that the halo painting network is capable of mapping the complex structures of the cosmic web, such as halos, filaments and voids, to the corresponding distribution of halo counts.

Prediction of 3D halo field by our halo painting model for a slice of depth \(\sim 100h^{-1}\) Mpc and side length of \(\sim2000h^{-1}\) Mpc. A blind validation dataset is shown in the top right panel, with the predicted halo count depicted below it. The corresponding second order Lagrangian Perturbation Theory (2LPT) density field is displayed in the top left panel, with the difference between the reference and predicted halo distributions depicted in the lower left panel. A visual comparison of the reference and predicted halo count distributions indicates qualitatively the efficacy of our halo painting network.

Power spectrum

As a quantitative assessment, the standard practice in cosmology is to use summary statistics. These summary statistics provide a reliable metric to evaluate our halo painting network in terms of its capacity to encode essential information. Assuming the cosmological density field is approximately a Gaussian random field, as is the case on large scales or at early times, the power spectrum provides a sufficient description of the field. We therefore demonstrated the capability of our network in reproducing the power spectrum of the reference halos. The left panel of Fig. 3 illustrates the extremely close agreement of the 3D power spectra of the reference and predicted halo fields.

We investigated the influence of the fiducial cosmology adopted for the simulations on the efficacy of our halo mapping model. In the right panel of Fig. 3, we show the network predictions for two cosmology variants in terms of their respective transfer functions, each defined as the square root of the ratio of the predicted to reference power spectra. The corresponding transfer functions show a deviation of about \(10\%\) from the reference power spectra of their respective real halo distributions on the smallest and largest scales. This shows that our halo painting model is slightly sensitive to the underlying cosmology at the level of the power spectrum.

Left panel: Summary statistics of the 3D power spectra of the reference and predicted halo fields for one thousand randomly selected patches. The solid lines indicate their respective means, while the shaded regions indicate their respective \(1\sigma\) confidence regions, i.e. 68% probability volume. The above diagnostics demonstrate the ability of our halo painting model to reproduce the characteristic statistics of the reference halo fields and therefore provide substantial quantitative evidence for the performance of our neural network in mapping 3D density fields to their corresponding halo distributions. Right panel: The corresponding transfer functions highlight the consistency between the power spectra reconstructed from the predicted and real halo fields for the three cosmology variants, with the deviation from their respective reference spectra being below \(10\%\).
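The transfer function used in this comparison, the square root of the ratio of predicted to reference power spectra, can be sketched as follows. The binning scheme, the function names and the random test field are illustrative assumptions, not the diagnostics code used for the figures.

```python
import numpy as np

def power_spectrum(field, box_size, n_bins=8):
    """Spherically averaged power spectrum of a cubic 3D field."""
    n = field.shape[0]
    delta_k = np.fft.fftn(field)
    power = np.abs(delta_k) ** 2 * box_size**3 / n**6
    k_1d = 2.0 * np.pi * np.fft.fftfreq(n, d=box_size / n)
    kx, ky, kz = np.meshgrid(k_1d, k_1d, k_1d, indexing="ij")
    k_mag = np.sqrt(kx**2 + ky**2 + kz**2).ravel()
    # Linear bins from the fundamental mode up to the Nyquist frequency.
    edges = np.linspace(2.0 * np.pi / box_size, np.pi * n / box_size, n_bins + 1)
    idx = np.digitize(k_mag, edges) - 1
    valid = (idx >= 0) & (idx < n_bins)
    counts = np.bincount(idx[valid], minlength=n_bins)
    sums = np.bincount(idx[valid], weights=power.ravel()[valid], minlength=n_bins)
    k_centres = 0.5 * (edges[1:] + edges[:-1])
    return k_centres, sums / np.maximum(counts, 1)

def transfer_function(predicted, reference, box_size, n_bins=8):
    """T(k) = sqrt(P_predicted(k) / P_reference(k))."""
    k, p_pred = power_spectrum(predicted, box_size, n_bins)
    _, p_ref = power_spectrum(reference, box_size, n_bins)
    return k, np.sqrt(p_pred / p_ref)

# Toy check: a prediction that is 10% too low in amplitude on all scales.
rng = np.random.default_rng(0)
reference = rng.normal(size=(16, 16, 16))
predicted = 0.9 * reference
k, transfer = transfer_function(predicted, reference, box_size=100.0)
```

A transfer function equal to one on all scales means that the predicted and reference fields have identical power spectra; the toy prediction here gives a constant value of 0.9.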

Bispectrum

The non-linear dynamics involved in the gravitational evolution of cosmic structures contributes a certain degree of non-Gaussianity to the cosmic density field on small scales. Higher-order statistics are therefore required to characterize this non-Gaussian field. We used the bispectrum to quantify the spatial distribution of the density and halo fields. The bispectra reconstructed from the second order Lagrangian Perturbation Theory (2LPT), reference and predicted halo fields are displayed in Fig. 4. In particular, we show the bispectra for given small- and large-scale configurations. The 2LPT halo field corresponds to a statistical description of the halo distribution, derived from the 2LPT density field, which is valid, by construction, at the level of two-point statistics and on large scales. This allows us to make a fair comparison between the clustering of the respective halo fields. The left panels of Fig. 4 demonstrate that our halo painting network reproduces the non-linear halo field on both small and large scales, and is therefore capable of mapping the complex cosmic structures apparent in the reference halo field. Our network predictions also show a significant improvement over the corresponding 2LPT halo fields. In the right panels of Fig. 4, we find that our network depends more strongly on the fiducial cosmology at the level of higher-order statistics.

Left panels: Summary statistics of the 3D bispectra of the 2LPT, reference and predicted halo fields for given small- and large-scale configurations, as indicated by their respective titles. In both cases, there is a close agreement between the bispectra from the reference and predicted halo distributions. Our network predictions are a significant improvement over the corresponding 2LPT halo fields. Right panels: Deviation from the 3D bispectra of the reference halo distributions of the corresponding predictions for the two cosmology variants. The above bispectrum diagnostics show that our network is more sensitive to the fiducial cosmology at the level of the bispectrum than at the level of the power spectrum. The \(1\sigma\) confidence regions for five hundred randomly selected patches are depicted in each panel.

Key advantages

Extremely efficient once trained. Our emulator is capable of rapidly predicting simulations of halo distribution based on a computationally cheap cosmic density field. For instance, the network prediction for a \(256^3\) simulation size requires roughly one second on the NVIDIA Quadro P6000.
Can predict the 3D halo distribution for any arbitrary simulation box size. A large simulation box, therefore, does not require tiling of smaller sub-elements. More importantly, this implies that our neural network can be trained on smaller simulations and subsequently used to predict large halo distributions.
Encodes mass information of halos, such that our method can predict the mass distribution of halos.
Allows us to bypass ad hoc galaxy bias models and work in terms of better understood models.

Potential applications

Fast generation of mock halo catalogues and light cone production. This would be useful for the data analysis of upcoming large galaxy surveys of unprecedented sizes.
To fill in small-scale structure at a high resolution from low resolution large-scale simulations.
As a component in Bayesian forward modelling techniques for large-scale structure inference (cf. BORG) or cosmological parameter inference (cf. ALTAIR) to accelerate the scientific process, rendering detailed and high-resolution analyses feasible. This would provide statistically interpretable results, while maintaining the scientific rigour.

References

D. Kodi Ramanah, T. Charnock & G. Lavaux, 2019, submitted to PRD, arxiv 1903.10524
A notebook tutorial to paint the halos of the article: notebook
Source code repository: https://github.com/doogesh/halo_painting