《1. Introduction》

1. Introduction

The relationship between human visual experience and the evoked neural activity is central to the field of computational neuroscience [1,2]. Brain encoding and decoding via functional magnetic resonance imaging (fMRI) are important for understanding the visual perception system [3–5]. An encoding model attempts to predict brain responses based on a given visual stimulus [6,7], whereas a decoding model attempts to predict the corresponding visual stimulus by analyzing a given brain response [8–22]. Brain encoding and decoding (Fig. 1) have thus become two significant ways of advancing sensory neuroscience because they provide many insights into brain function.

《Fig. 1》

Fig. 1. Brain encoding and decoding in fMRI. The encoding model attempts to predict brain responses based on the presented visual stimuli, while the decoding model attempts to infer the corresponding visual stimuli by analyzing the observed brain responses. In practice, encoding and decoding models should not be seen as mutually exclusive. Effectively unifying encoding and decoding procedures may permit more accurate predictions and facilitate our understanding of information representation in the human brain.

《1.1. Encoding models》

1.1. Encoding models

In the previous literature, most encoding models have been established based on specific computational rules. Neuroscientists believe that these computational rules may be the mathematical basis for the brain’s response to specific visual stimuli. For example, Kay et al. [1] used pyramid-shaped Gabor wavelet filters to build an encoding model. Based on this encoding model, the authors successfully identified the preferred natural images for given human brain activities. Later, Kay et al. [6] further proposed a two-stage cascade encoding model based on the well-established local oriented filters, divisive normalization, compressive spatial summation, and variance-like nonlinearity. Recently, St-Yves and Naselaris [7] constructed a feature-weighted receptive field model based on the intermediate feature maps of a pre-trained deep neural network (DNN); this model can be used to predict the voxel response and study the shape of the receptive field of each voxel. Furthermore, Zeidman et al. [23] built a Bayesian population receptive field (pRF) model for interpretable brain encoding studies. In recent years, DNNs have achieved great success in computer vision, and researchers have begun to use DNNs to construct more complex brain encoding models [7,20,24]. In addition to encoding models for visual information, researchers have studied how semantic information is expressed in the brain. For example, Huth et al. [25] established the mapping relationship between text semantic vectors and cerebral cortex activities, thereby providing a detailed semantic map of the cerebral cortex.
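
As a generic illustration of the voxel-wise encoding recipe that these models share—extract features from the stimulus, then fit a regularized linear map from features to each voxel's response—the following sketch uses synthetic data; the feature dimensionality, regularization strength, and evaluation are placeholders rather than the actual models of Refs. [1,6,7].

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy stand-ins: 200 stimuli, 4096-dimensional stimulus features, 500 voxels.
# In practice the features would come from, e.g., a Gabor wavelet pyramid or an
# intermediate DNN layer, and the responses from preprocessed fMRI data.
rng = np.random.default_rng(0)
features = rng.standard_normal((200, 4096))   # one row of features per stimulus
responses = rng.standard_normal((200, 500))   # one column per voxel

# Voxel-wise encoding model: a regularized linear map from features to responses.
encoder = Ridge(alpha=10.0)
encoder.fit(features[:150], responses[:150])        # fit on training stimuli
predicted = encoder.predict(features[150:])         # predict held-out responses

# Encoding accuracy is commonly summarized per voxel, e.g., by the correlation
# between predicted and measured responses on held-out stimuli.
corr = [np.corrcoef(predicted[:, v], responses[150:, v])[0, 1] for v in range(500)]
print(f"median held-out correlation: {np.median(corr):.3f}")
```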

《1.2. Decoding models》

1.2. Decoding models

Previous studies have demonstrated the feasibility of decoding the identity of binary contrast patterns [12–14], handwritten characters [15,16], human facial images [17–19], natural picture/video stimuli [2,20], and dreams [12,21] from the corresponding brain activation patterns. For example, Miyawaki et al. [12] constructed a multiscale neural decoding model to reconstruct perceived binary contrast patterns from brain responses. Schoenmakers et al. [15] proposed a linear decoding model to reconstruct handwritten characters from brain responses. Güçlütürk et al. [19] proposed the combination of probabilistic inference with adversarial training for reconstructions of perceived faces from brain responses. Horikawa and Kamitani [2] showed that the hierarchical features of visual stimuli calculated by a computer vision model could be predicted by utilizing the responses of multiple brain regions. These findings indicate that there is a close relationship between the hierarchical visual cortex and the complex visual features obtained by the computer vision model. Furthermore, Wen et al. [20] proposed a dynamic neural decoding method based on deep learning that can reconstruct the dynamic visual scenes perceived by a human and predict their semantic labels. Horikawa and Kamitani [21] even showed that brain activity could be used to predict the objects in humans’ dreams.

Most of the aforementioned decoding studies are based on the multi-voxel pattern analysis (MVPA) method [8]. However, brain connectivity patterns are also a key feature of the brain state and can be used for brain decoding. Previous decoding studies [26–30] have shown that brain connectivity information can be utilized as distinguishing features in decoding procedures. For example, by employing brain connectivity information in brain decoding, Yargholi and Hossein-Zadeh [29] were able to successfully reconstruct two handwritten digits—namely, 6 and 9—from human brain activities. Manning et al. [30] proposed a probabilistic model for extracting dynamic functional connectivity patterns from brain activity. The extracted functional connectivity patterns can be used in brain decoding studies.
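
To make the idea of connectivity-based decoding concrete, the following sketch (not the specific pipelines of Refs. [26–30]; all dimensions and data are synthetic placeholders) correlates regional time courses within each trial, vectorizes the upper triangle of the resulting correlation matrix, and feeds those connectivity features to a simple classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def connectivity_features(ts):
    """ts: (time, regions) time series for one trial -> vectorized upper
    triangle of the region-by-region correlation matrix."""
    corr = np.corrcoef(ts.T)                  # (regions, regions) correlations
    iu = np.triu_indices_from(corr, k=1)      # skip the diagonal
    return corr[iu]

# Toy data: 80 trials, 60 time points, 30 regions, binary stimulus labels.
rng = np.random.default_rng(0)
trials = rng.standard_normal((80, 60, 30))
labels = rng.integers(0, 2, size=80)

X = np.stack([connectivity_features(t) for t in trials])
clf = LogisticRegression(max_iter=1000).fit(X[:60], labels[:60])
print("held-out accuracy:", clf.score(X[60:], labels[60:]))
```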

《1.3. Hybrid encoding–decoding with bidirectional models》

1.3. Hybrid encoding–decoding with bidirectional models

Although recent developments in brain encoding and decoding [3–21,29,31–33] have shown promising results, many challenges remain in constructing an accurate decoding model in order to reconstruct the corresponding visual stimuli from fMRI data. From the Bayesian machine learning perspective, an encoding model can be acquired with a generative model that accounts for the measured brain activity. When this encoding model is combined with prior knowledge about the stimuli, a posterior probability distribution of the stimuli—that is, a predictive distribution for decoding—could be obtained, given a brain activity pattern. Therefore, encoding and decoding models should not be seen as mutually exclusive. Effectively unifying encoding and decoding procedures may permit accurate predictions and facilitate an understanding of information representation in the human brain [13,34]. For example, Fujiwara et al. [13] proposed a ‘‘bidirectional” approach to visual image reconstruction, in which a set of latent variables was assumed to relate image pixels and fMRI voxels; this approach allowed predictions for both encoding and decoding to be generated. These scholars employed the Bayesian canonical correlation analysis (BCCA) framework, which computed multiple correspondences, via latent variables, between image pixels and fMRI voxels. Since the pixel weights for each latent variable can be thought to define an image basis, training the BCCA model using measured data leads to automatic extraction of image bases. Although it is premature to speculate on functional implications of the estimated image bases, this data-driven ‘‘bidirectional” approach could be extended to discover the modular architecture of the brain in representing complex natural stimuli, behavior, and mental experience defined in high-dimensional space.
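
Schematically (in our own notation rather than that of Ref. [13]), the BCCA generative model relates the two views through shared latent variables:

$$\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad \mathbf{x} = \mathbf{W}_x\mathbf{z} + \boldsymbol{\varepsilon}_x, \qquad \mathbf{y} = \mathbf{W}_y\mathbf{z} + \boldsymbol{\varepsilon}_y$$

where $\mathbf{x}$ denotes image pixels, $\mathbf{y}$ denotes fMRI voxels, $\mathbf{W}_x$ and $\mathbf{W}_y$ are view-specific weight matrices (the columns of $\mathbf{W}_x$ act as image bases), and $\boldsymbol{\varepsilon}_x$ and $\boldsymbol{\varepsilon}_y$ are Gaussian noise terms. Encoding corresponds to inferring $\mathbf{z}$ from an image and generating the voxel pattern, while decoding runs the same latent path in the opposite direction.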

《2. Correspondence between DNNs and the human visual system》

2. Correspondence between DNNs and the human visual system

Deep learning [35,36] is a large class of machine learning methods that extract hierarchical representations from input data. The architectures of DNNs were first inspired by the structure and computational principles of the biological nervous system [37]. Recently, DNN-based deep learning methods have achieved great success in image recognition, speech recognition, natural language understanding, and other areas. In terms of architecture, the hierarchical layers of DNNs are very similar to those of the ventral visual system of the human brain [7,35,38] (Fig. 2). In terms of function, existing research on neural encoding and decoding based on deep learning has shown that the shallow representations of DNNs are functionally similar to the primary visual areas, while the deep representations of DNNs are similar to the downstream (higher) areas of the ventral visual system [2,24,39,40].

《Fig. 2》

Fig. 2. The ventral visual system and a deep convolutional neural network (CNN). (a) Forward and backward projections between four visual cortical areas (V1, V2, V4, and IT); (b) an illustration of a simple feedforward deep CNN, whose hierarchical structure is used to simulate the hierarchical representation of the ventral visual system. LGN: lateral geniculate nucleus; IT: inferotemporal cortex. (a) Reproduced from Ref. [38] with permission of Elsevier, © 2014; (b) reproduced from Ref. [7] with permission of Elsevier, © 2017.

Humans can perceive complex objects quickly and accurately through the ventral visual stream, a system of interconnected brain regions that processes increasingly complex features in hierarchical structures [41–43]. However, the automated discovery of early visual concepts from visual images without supervision is a major open challenge in machine perception research. On the one hand, it would be helpful for the representations extracted from the image to perform well in real-world tasks. On the other hand, it would be desirable for these representations to be interpretable and useful for tasks beyond those that are explicit in their initial design. From a traditional standpoint, it is difficult to use a pre-trained DNN model to learn such representations from visual images, because the semantic meaning of each dimension of the representation vector automatically extracted from the input image by that DNN model is unknown. Without disentangled representations, it is difficult to interpret these representations across different tasks. Fortunately, Higgins et al. [44] have shown that specially designed deep generative models are capable of learning disentangled representations.

《3. Brain decoding with deep generative models》

3. Brain decoding with deep generative models

A promising research direction involves the integration of deep learning methods into brain decoding research. Deep generative models such as variational autoencoders (VAEs) [45,46] and generative adversarial networks (GANs) [47] have achieved great success in the field of image generation. An increasing amount of attention has recently been focused on research on visual image reconstruction using deep generative models [19,31–33,48,49].

《3.1. VAE-based methods》

3.1. VAE-based methods

VAEs—which were originally presented in Refs. [45,46]—are a probabilistic extension of the autoencoder model. A VAE has a bottom-up encoding network and a top-down decoding network. These two networks are jointly trained to maximize a lower bound on the data likelihood, thereby reformulating the autoencoder model as a variational inference problem. Recent works have demonstrated that VAE-based models are capable of learning disentangled representations that correspond to distinct factors of variation in the input data [43,50,51]. This is very important for brain encoding and decoding tasks, since some of the visual concepts learned by VAE-based models are also perceived by the human brain. Inspired by this fact, researchers have explored the use of VAE-based models in image reconstruction from brain activities [31,32].
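
In the standard formulation [45,46], the bottom-up encoder $q_\phi(\mathbf{z}\mid\mathbf{x})$ and the top-down decoder $p_\theta(\mathbf{x}\mid\mathbf{z})$ are trained jointly to maximize the evidence lower bound (ELBO) on the log-likelihood:

$$\mathcal{L}(\theta,\phi;\mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z}\mid\mathbf{x})}\!\left[\log p_\theta(\mathbf{x}\mid\mathbf{z})\right] - D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z}\mid\mathbf{x})\,\|\,p(\mathbf{z})\right) \le \log p_\theta(\mathbf{x})$$

where $p(\mathbf{z})$ is typically a standard normal prior. Disentangling variants such as the $\beta$-VAE [44] reweight the Kullback–Leibler (KL) term with a coefficient $\beta > 1$ to encourage each latent dimension to capture a distinct factor of variation.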

For example, Du et al. [31] proposed a deep generative multiview model (DGMM) for reconstructing the perceived images from brain fMRI activities (Fig. 3). The DGMM can be viewed as a nonlinear extension of the linear BCCA. Under the DGMM framework, the encoding and decoding procedures are simultaneously formulated by two distinct generative models:

$$p_\theta\left(\mathbf{X}\mid\mathbf{Z}\right) = \mathcal{N}\left(\mathbf{X}\mid\mu_\theta(\mathbf{Z}),\ \Sigma_\theta(\mathbf{Z})\right) \tag{1}$$

$$p\left(\mathbf{Y}\mid\mathbf{Z}\right) = \mathcal{N}\left(\mathbf{Y}\mid\mathbf{B}\mathbf{Z},\ \Sigma\right) \tag{2}$$

where $\mathcal{N}(\cdot)$ denotes the normal distribution, $\mathbf{X}$ denotes the visual images, $\mathbf{Y}$ denotes the evoked fMRI activities, $p_\theta(\mathbf{X}\mid\mathbf{Z})$ is the likelihood function of the visual images with neural network parameters $\theta$, $p(\mathbf{Y}\mid\mathbf{Z})$ is the likelihood function of the evoked fMRI activities, $\Sigma$ denotes the full covariance matrix, $\mathbf{B}$ denotes the projection weights of the fMRI activities, and $\mathbf{Z}$ denotes the shared latent variables between the visual images and the evoked fMRI activities. The mean $\mu_\theta(\mathbf{Z})$ and covariance $\Sigma_\theta(\mathbf{Z})$ of the normal distribution in Eq. (1) are obtained by different nonlinear transformations with respect to the latent variables. The training set consists of N paired samples, which can be denoted by $\{(\mathbf{x}_i, \mathbf{y}_i)\}$, where $\mathbf{x}_i$ is a visual image and $\mathbf{y}_i$ is the corresponding fMRI pattern, for $i = 1, \ldots, N$. Specifically, the DGMM uses a DNN-based generative process to model the distribution of visual images, while using a sparse linear generative process to model the distribution of brain response data. On the one hand, the DNN used here can effectively capture the hierarchical features of the visual image, which are similar to the structure of the ventral visual system of the human brain [2,24,39,40]. On the other hand, the sparse linear generative model used here not only conforms to the sparse expression principle of the human brain, but also avoids overfitting of brain response data [52]. Note that these two generative processes share the same latent variables. Therefore, in the test phase, the use of these processes makes it possible to infer the corresponding visual image from the brain response through the shared latent variables. In fact, the DGMM framework can capture "bidirectional" mapping relationships between the visual images and the corresponding fMRI activities. Thanks to its autoencoding variational Bayesian architecture, the DGMM can be optimized efficiently by means of mean-field variational inference, which is similar to the classical VAE solution. Compared with non-probabilistic deep multi-view learning methods, the DGMM's Bayesian framework makes it naturally more flexible and adaptive.
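
Schematically (in our notation), the test-phase decoding can be viewed as a marginalization over the shared latent variables:

$$p(\mathbf{x}^{*}\mid\mathbf{y}^{*}) = \int p_\theta(\mathbf{x}^{*}\mid\mathbf{z})\,p(\mathbf{z}\mid\mathbf{y}^{*})\,\mathrm{d}\mathbf{z}$$

where $\mathbf{y}^{*}$ is a new fMRI pattern, $p(\mathbf{z}\mid\mathbf{y}^{*})$ is the posterior over the shared latent variables given that pattern (obtained from the linear fMRI view in Eq. (2)), and $p_\theta(\mathbf{x}^{*}\mid\mathbf{z})$ is the DNN-based image view in Eq. (1); in practice the integral is approximated, for example by sampling $\mathbf{z}$ from the posterior or by using its mean.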

《Fig. 3》

Fig. 3. Illustration of the deep generative multi-view framework for neural decoding. (a) Model training: view-specific generative models are used for data generation; specifically, a DNN is adopted to model visual images, while a linear regression model is used to model brain activities. (b) Image reconstruction: brain activities that are independent of those used for training are decoded into visual images.

《3.2. GAN-based methods》

3.2. GAN-based methods

GANs were first proposed in Ref. [47]. The basic GAN is an unsupervised model that generates images from a noise vector. The idea of adversarial training comes from game theory, in which two competitors compete in order to make progress together. The typical configuration of a GAN includes a generator and a discriminator. The task of the generator is to synthesize images from noise in order to deceive the discriminator into believing that the synthesized images are real-world scenes. Meanwhile, the discriminator attempts to distinguish between the synthesized data and real data. When the Nash equilibrium is reached, the generator has learned the distribution of real-world images, and the discriminator is sensitive to the difference between real and fake data. GANs have been widely used in various applications, including image generation [53], image-to-image translation [54], and text-to-image synthesis [55,56].
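
The following minimal sketch illustrates this adversarial training loop with toy fully connected networks; the architectures, dimensions, and hyperparameters are placeholders rather than any particular published model.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator; real models would typically be convolutional.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):                       # real: (batch, 784) images in [-1, 1]
    batch = real.size(0)
    fake = G(torch.randn(batch, 64))        # synthesize images from noise

    # Discriminator step: label real images 1 and synthesized images 0.
    d_loss = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the discriminator label fakes as real.
    g_loss = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```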

Unlike a VAE, a GAN is a likelihood-free model—that is, it does not make any prior assumptions regarding the data distribution; instead, the data distribution is learned entirely through adversarial training. This is a favorable characteristic in neural encoding and decoding tasks. A GAN often requires exact semantic information flow in its generator and discriminator. However, the useful semantic information in the blood-oxygen-level-dependent (BOLD) signal is buried deep in noise, which poses a great challenge for model training. Recent brain decoding research [19] has proposed the combination of probabilistic inference with adversarial training for the reconstruction of perceived faces from brain activations (Fig. 4). Assume that $\mathbf{x}$ is the visual image, $\mathbf{z}$ is its latent features, $\mathbf{y}$ is the corresponding brain response, and $\phi$ is a latent feature model such that $\mathbf{z} = \phi(\mathbf{x})$ and $\mathbf{x} = \phi^{-1}(\mathbf{z})$. Then, the perceived visual images can be reconstructed from brain responses by means of the following equation:

$$\hat{\mathbf{x}} = \phi^{-1}\left(\arg\max_{\mathbf{z}}\ p(\mathbf{z}\mid\mathbf{y})\right) \tag{3}$$

where $p(\mathbf{z}\mid\mathbf{y})$ is the posterior distribution of the latent variables. Eq. (3) can be reformulated through Bayes' theorem:

$$\hat{\mathbf{x}} = \phi^{-1}\left(\arg\max_{\mathbf{z}}\ p(\mathbf{y}\mid\mathbf{z})\,p(\mathbf{z})\right) \tag{4}$$

where $p(\mathbf{y}\mid\mathbf{z})$ is the likelihood function and $p(\mathbf{z})$ is the prior distribution of the latent variables. The authors first decode the observed brain responses into the latent features with maximum a posteriori estimation. Next, they generate the perceived images from the decoded latent features using adversarial learning. This two-step brain decoding method can accurately reconstruct perceived faces from brain responses. More recently, researchers have attempted to reconstruct natural images from measured fMRI signals [33,48,49] by utilizing GANs that have been pre-trained on large-scale image datasets.
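
As a simplified sketch of the two-step method of Ref. [19] described above (not the exact model), assume a linear-Gaussian likelihood $p(\mathbf{y}\mid\mathbf{z})$ and a standard normal prior $p(\mathbf{z})$, in which case the maximum a posteriori estimate of the latent features has a closed ridge-regression-like form; the decoded features are then passed to a pre-trained generator. All names and dimensions below are hypothetical.

```python
import numpy as np

# Step 1: MAP estimate of the latent features z from a brain response y, assuming
# y = W z + noise with isotropic Gaussian noise (variance sigma2) and a standard
# normal prior on z, so that z_map = (W^T W + sigma2 * I)^{-1} W^T y.
def map_decode(W, y, sigma2=1.0):
    k = W.shape[1]
    return np.linalg.solve(W.T @ W + sigma2 * np.eye(k), W.T @ y)

# Toy dimensions: 2000 voxels, 128-dimensional latent feature space.
rng = np.random.default_rng(0)
W = rng.standard_normal((2000, 128)) / np.sqrt(128)   # hypothetical learned weights
y = rng.standard_normal(2000)                         # one observed fMRI pattern

z_map = map_decode(W, y)

# Step 2: push the decoded latent features through a pre-trained (adversarially
# trained) image generator to obtain the reconstruction; `generator` is a
# placeholder and is not defined here.
# reconstruction = generator(z_map)
```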

《Fig. 4》

Fig. 4. Illustration of deep adversarial neural decoding. By combining probabilistic inference with adversarial learning, this method can clearly reconstruct the corresponding image of a face from brain activity. PCA: principal component analysis. Reproduced from Ref. [19] with permission of Neural Information Processing Systems Foundation, Inc., © 2017.

《4. Improving brain encoding and decoding with dual learning》

4. Improving brain encoding and decoding with dual learning

Data-driven brain encoding and decoding methods often require the acquisition of a large number of paired (stimulus–response) data instances in order to train a model that is customized to an individual subject. In many encoding and decoding studies, however, it is possible to gather a few thousand noisy paired data instances—at most—from a single subject. To improve the generalization ability of the encoding and decoding models, it is therefore necessary to make good use of large-scale unpaired data instances (e.g., visual images).

Inspired by recently proposed dual learning for machine translation [57,58], we suggest that it is possible to train encoding and decoding models simultaneously by minimizing the reconstruction loss resulting from the bidirectional mapping model. The encoding and decoding models represent a primal–dual pair and form a closed loop, allowing the application of dual learning (Fig. 5). Specifically, the reconstruction loss measured over unpaired data (e.g., visual images) would generate informative feedback to train the bidirectional mapping model. Under this dual learning framework, it is possible to leverage large-scale unpaired visual images to improve the generalization ability of the encoding and decoding models. In fact, dual learning is a general framework for learning the bidirectional mappings from one data domain $M_d$ to another data domain $N_d$ [59,60]. For $M_d \rightarrow N_d$, the goal is to learn an encoder mapping $E$ such that the distribution $E(M_d)$ is indistinguishable from the distribution $N_d$ under an adversarial loss. Similarly, for $N_d \rightarrow M_d$, the goal is to learn a decoder mapping $D$ such that the distribution $D(N_d)$ is indistinguishable from the distribution $M_d$ under another adversarial loss. In particular, for the unpaired data, it is possible to combine these two adversarial losses with the cycle consistency losses (dual losses) that push $D[E(M_d)] \approx M_d$ and $E[D(N_d)] \approx N_d$.
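
The following sketch illustrates the cycle-consistency (dual) losses on unpaired data, with toy linear maps standing in for the encoding and decoding models and with the adversarial terms omitted for brevity; all dimensions are placeholders.

```python
import torch
import torch.nn as nn

# Toy encoding model E (image -> predicted fMRI pattern) and
# decoding model D (fMRI pattern -> reconstructed image).
E = nn.Linear(784, 500)
D = nn.Linear(500, 784)
opt = torch.optim.Adam(list(E.parameters()) + list(D.parameters()), lr=1e-3)
l1 = nn.L1Loss()

def dual_step(unpaired_images, unpaired_responses):
    # Image cycle: image -> predicted response -> reconstructed image.
    image_cycle = l1(D(E(unpaired_images)), unpaired_images)
    # Response cycle: response -> reconstructed image -> predicted response.
    response_cycle = l1(E(D(unpaired_responses)), unpaired_responses)
    loss = image_cycle + response_cycle   # adversarial losses would be added here
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```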

《Fig. 5》

Fig. 5. Improving brain encoding and decoding with dual learning. Dual loss measured over unpaired data (either visual images or brain responses) generates informative feedback to train the bidirectional mapping model. Under this dual learning framework, it is possible to leverage large-scale unpaired data to improve the models’ generalization ability.

《5. Conclusions》

5. Conclusions

In conclusion, brain encoding and decoding are central to the field of computational neuroscience and have the potential to create better brain-machine interfaces. The architecture and computational rules of DNNs share some similarity with human visual streams. The use of deep generative models (e.g., VAEs and GANs) in brain encoding and decoding studies holds promise for providing deeper insight into relationships between human visual experience and the evoked neural activity. By leveraging large-scale unpaired data, dual learning is expected to play an important role in developing neural encoding and decoding models.

《Acknowledgements》

Acknowledgements

This work was supported by the National Key Research and Development Program of China (2018YFC2001302), National Natural Science Foundation of China (91520202), Chinese Academy of Sciences Scientific Equipment Development Project (YJKYYQ20170050), Beijing Municipal Science and Technology Commission (Z181100008918010), Youth Innovation Promotion Association of Chinese Academy of Sciences, and Strategic Priority Research Program of Chinese Academy of Sciences (XDB32040200).

《Compliance with ethics guidelines》

Compliance with ethics guidelines

Changde Du, Jinpeng Li, Lijie Huang, and Huiguang He declare that they have no conflict of interest or financial conflicts to disclose.