aDepartment of Automation, Tsinghua University, Beijing 100084, China
bResearch Center for Industries of the Future (RCIF) & School of Engineering, Westlake University, Hangzhou 310030, China
cKey Laboratory of 3D Micro/Nano Fabrication and Characterization of Zhejiang Province, School of Engineering, Westlake University, Hangzhou 310024, China
dShanghai Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Shanghai 201800, China
eWyant College of Optical Sciences, University of Arizona, Tucson, AZ 85721, USA
fShanghai Artificial Intelligence Laboratory, Shanghai 200232, China
It has been over a decade since the first coded aperture video compressive sensing (CS) system was reported. The underlying principle of this technology is to employ a high-frequency modulator in the optical path to modulate a recorded high-speed scene within one integration time. The superimposed image captured in this manner is modulated and compressed, since multiple modulation patterns are imposed. Following this, reconstruction algorithms are utilized to recover the desired high-speed scene. One leading advantage of video CS is that a single captured measurement can be used to reconstruct a multi-frame video, thereby enabling a low-speed camera to capture high-speed scenes. Inspired by this, a number of variants of video CS systems have been built, mainly using different modulation devices. Meanwhile, in order to obtain high-quality reconstruction videos, many algorithms have been developed, from optimization-based iterative algorithms to deep-learning-based ones. Recently, emerging deep learning methods have been dominant due to their high-speed inference and high-quality reconstruction, highlighting the possibility of deploying video CS in practical applications. Toward this end, this paper reviews the progress that has been achieved in video CS during the past decade. We further analyze the efforts that need to be made—in terms of both hardware and algorithms—to enable real applications. Research gaps are put forward and future directions are summarized to help researchers and engineers working on this topic.
Zhihong Zhang, Siming Zheng, Min Qiu, Guohai Situ, David J. Brady, Qionghai Dai, Jinli Suo, Xin Yuan.
A Decade Review of Video Compressive Sensing: A Roadmap to Practical Applications.
Engineering, 2025, 46(3): 183-197. DOI: 10.1016/j.eng.2024.08.013
Vision is the most important channel by which humans sense the world and then perceive the environment. Since the invention of the digital camera in the 1960s [1], the Internet and smartphones have made it common for individuals to record their daily lives with videos and share them through social media platforms such as Twitter and TikTok, regardless of time and location. Following this emerging trend, the acquisition and display of ultrahigh definition (e.g., 4K and 8K) and high-speed videos have become a dominant requirement. In industry, novel techniques such as surveillance, autonomous driving, unmanned aerial vehicles (UAVs), and robotics also heavily rely on real-time, high-quality video capture and large-scale video data processing to accomplish their respective tasks.
During the past several decades, a classical imaging paradigm has been constructed using charge-coupled device (CCD) or complementary metal-oxide-semiconductor (CMOS) sensors and image signal processors (ISPs). Specifically, the light reflected by a recorded scene is first integrated by the sensor to form a raw image, which is then processed by the ISP to generate the corresponding output. However, in today’s large-scale video era, this pipeline requires a large on-chip memory to cache the acquired raw images and perform ISP computations, as well as a wide bandwidth for image transmission. The increasing demands for video acquisition and processing in various scenarios have put tremendous pressure on the existing imaging framework. Moreover, in the post-Moore era, increasing on-chip memory capacity has become challenging and costly. Scenarios such as Internet of Things (IoT) platforms also present new challenges in terms of massive data transmission. As a result, there is a pressing need for a novel imaging paradigm that can offer high throughput while operating within low bandwidth limitations.
Under these circumstances, computational imaging has garnered increasing attention and is regarded as a promising direction for the future development of imaging [2]. Its objective is to enhance imaging quality [3], [4], [5], capture high-dimensional information [6], [7], [8], or optimize the performance of imaging systems [9], [10], [11]. Over the last decade, significant advances in optical engineering, integrated circuits, and deep learning have paved the way for the practical applications of novel computational imaging techniques. Among these techniques, video compressive sensing (CS) [7], [12] has emerged as a representative approach in the field of high-throughput low-bandwidth imaging. This approach involves encoding scene information into compressive measurements during the imaging process, which can then be reconstructed into the original/desired video frames using post-reconstruction algorithms. By employing this method, massive amounts of video data can be compressively captured, stored, and transmitted without placing significant burdens on on-chip memory and transmission bandwidth.
The underlying mathematical principle of video CS lies in CS theory [13], [14], [15]. Although CS has been proposed for decades, limited progress has been made in its practical applications in the imaging field due to hardware limitations and insufficient algorithm performance. Fortunately, recent advancements in optical modulation devices and learning-based reconstruction algorithms have eased these issues to a great extent.
On the hardware side, novel techniques in the design and production of spatial light modulators, such as digital micromirror devices (DMDs) and liquid crystal on silicon (LCoS), have significantly enhanced refresh speeds and spatial resolutions. This enables video CS systems to meet the increasing demand for high-speed, high-fidelity video acquisition. Additionally, advancements in semiconductor chips and integrated circuits have led to application-specific sensors that provide pixel-wise exposure control [16], [17]. These sensors eliminate the need for external optical modulation devices, making it possible for video CS [7] to be supported directly by compact cameras.
On the algorithm side, traditional iteration-based optimization algorithms suffer from low reconstruction quality and high time costs. About five years ago, the reconstruction of a 1024 × 1024 × 10 video sequence from a coded snapshot could take hours [18], [19], limiting the practical applications of video CS, especially in real-time tasks. In the last decade, deep learning has rapidly surpassed traditional algorithms in various low-level and high-level computer vision (CV) tasks, including denoising [20], deblurring [21], classification [22], detection [23], and tracking [24]. In regard to reconstruction algorithms for CS problems, learning-based methods and joint learning and optimization frameworks, such as plug-and-play (PnP) [25], [26], [27] and deep unfolding [28], [29], [30], have been developed and have significantly improved reconstruction performance in terms of quality, speed, and flexibility. Overall, recent advances in both optical hardware and reconstruction algorithms have opened a promising avenue toward applications of video CS in practical scenarios [31].
As a novel and promising imaging paradigm, video CS not only facilitates the acquisition of visual information to fit human vision but also provides new insights for designing more efficient end-to-end machine vision frameworks involving information acquisition, storage, transmission, processing, and analysis. The imaging mechanism of video CS endows this technology with significant advantages over traditional imaging paradigms in terms of high information capacity and low bandwidth occupation. In addition, the compressed data format relieves the computational burden, thus increasing the processing speed for high-level CV tasks such as object detection and route planning. These characteristics are particularly advantageous for load-limited platforms such as UAVs and edge sensors in the IoT, as they greatly reduce power consumption. Moreover, video CS could help intelligent systems such as autonomous cars, UAVs, and robots to perform real-time sensing and decision-making tasks. Some initial attempts have been made to validate the effectiveness of this framework in practical applications [32], [33], [34], [35].
At this milestone of the development of video CS, this paper aims to provide a comprehensive review of the history of video CS. The remaining sections of this paper are organized as follows. In Section 2, we provide an overview of the fundamental theory and imaging schematic of video CS. Sections 3 and 4 then present detailed reviews of the hardware design and the reconstruction algorithms of video CS, respectively. Section 5 discusses existing challenges and promising opportunities, and provides a prospective roadmap for the further development of video CS. Subsequently, Section 6 summarizes representative applications of video CS. Finally, Section 7 concludes the paper and provides our outlook on the future of video CS.
2. Fundamental theory and imaging schematic
From the perspective of information theory, conventional imaging systems obey the Shannon–Nyquist sampling theorem [36], while video CS is based on the compressive sampling or CS theorem [13], [14]. The Shannon–Nyquist sampling theorem states that it is possible to perfectly recover the original signal from its samples if the sampling rate is at least twice the signal’s highest frequency. This rule imposes a great challenge on improving imaging systems’ throughput just through hardware upgrading, especially in the post-Moore’s-law period. However, the proposal of the CS theorem provides new insights into circumventing this problem. According to this theorem, a signal can be reconstructed from far fewer samples than required by the Shannon–Nyquist sampling theorem, with almost no information loss, provided that two conditions are satisfied. The first of these conditions is that the original signal should be sparse in some domain. This condition is generally satisfied for most natural signals such as audio, images, and videos. The second condition is that the sampling matrix should be incoherent with the basis matrix of the aforementioned sparse domain. In practice, random sampling matrices following a Gaussian or Bernoulli distribution can be employed to meet this condition.
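To make the CS claim above concrete, the following minimal NumPy sketch recovers a sparse signal from far fewer random Gaussian samples than its length. Orthogonal matching pursuit is used here only as a simple stand-in for the recovery algorithms reviewed later; all sizes and variable names are illustrative.

```python
import numpy as np

def omp(Phi, y, sparsity):
    """Orthogonal matching pursuit: greedily recover a sparse x from y = Phi @ x."""
    residual = y.copy()
    support = []
    x_hat = np.zeros(Phi.shape[1])
    for _ in range(sparsity):
        # Pick the column most correlated with the current residual.
        idx = int(np.argmax(np.abs(Phi.T @ residual)))
        if idx not in support:
            support.append(idx)
        # Least-squares fit on the current support.
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x_hat[support] = coef
    return x_hat

rng = np.random.default_rng(0)
n, m, k = 256, 64, 5                                          # signal length, samples (m << n), sparsity
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)   # k-sparse signal
Phi = rng.standard_normal((m, n)) / np.sqrt(m)                # random Gaussian sampling matrix
y = Phi @ x                                                   # only m measurements
x_hat = omp(Phi, y, k)
print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```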
Although the CS theorem makes it possible to improve an imaging system’s throughput under the constraint of existing sensor bandwidth, it was not extensively used in the imaging field for decades, owing to the lack of effective sampling hardware and efficient reconstruction algorithms. In recent years, with advances in optical engineering, computational imaging, and deep learning, various video CS systems and reconstruction algorithms have been proposed that significantly push CS toward practical application in daily life. In Ref. [37], a theory was developed specifically for video CS, which is further reviewed in detail in Ref. [7]. In this section, we give a brief introduction to the general imaging schematic and mathematical formulation of video CS. A detailed review of specific video CS systems and reconstruction algorithms will be presented in the following sections.
2.1. Imaging schematic
A basic schematic of video CS is illustrated in Fig. 1. For simplicity, we represent a continuous dynamic scenario with discrete high-speed frames, which are coincident with the sensor’s outputs. As demonstrated by the schematic, during acquisition, original frames from the scene space are first modulated by random encoding masks and then integrated by the sensor in one exposure to form a compressive coded measurement. According to the CS theorem, the original high-speed frames can be recovered from the measurement with the aid of CS reconstruction algorithms. In this manner, the video CS system can efficiently achieve the goal of improving data throughput while keeping a low bandwidth.
As described above, the hardware system of video CS is mainly composed of three parts: imaging, modulation, and recording. Among these, the imaging and recording parts are similar to those in conventional cameras. Modulation plays an essential role in video CS, as it performs compressive sampling to reduce the data volume and release the sensor’s bandwidth pressure in the process of converting massive video frames into fewer coded measurements. The refresh rate of the modulation device directly determines the upper bound of the compressive ratio of the video CS system and thus the speed limit of the recovered video. The design of the encoding masks also has a significant influence on the final reconstruction performance. Thanks to developments in optical and mechanical engineering, spatial light modulators with a refresh rate of 10 kHz or higher are already available as commercial products. In practice, random binary masks with entries in {0, 1}, sampled from a Bernoulli distribution, are generally leveraged in the modulation for the sake of low implementation complexity and high contrast. After the coded measurements are acquired, various CS reconstruction algorithms can be employed to recover the original high-speed video frames from these compressive measurements. A detailed review of the reconstruction algorithms will be presented in Section 4.
2.2. Mathematical formulation
According to the aforementioned schematic and principle, we can formulate the mathematical model of video CS using a linear equation, as follows:

$$\mathbf{Y} = \sum_{k=1}^{B} \mathbf{C}_k \odot \mathbf{X}_k + \mathbf{G} \tag{1}$$

where the real-number matrices $\mathbf{X}_k \in \mathbb{R}^{n_x \times n_y}$ and $\mathbf{C}_k \in \mathbb{R}^{n_x \times n_y}$ are the $k$th ($k = 1, 2, \ldots, B$) high-speed video frame and corresponding encoding mask with $n_x \times n_y$ pixels, respectively. $B$ is the number of high-speed video frames integrated by the sensor during a single exposure. $\mathbf{G} \in \mathbb{R}^{n_x \times n_y}$ is the additive noise, and $\mathbf{Y} \in \mathbb{R}^{n_x \times n_y}$ denotes the final coded measurement. $\odot$ represents the Hadamard (entry-wise) product.
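As a quick illustration, the forward model in Eq. (1) can be simulated in a few lines of NumPy. The variable names mirror Eq. (1); the frame size, number of coded frames, and noise level below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
nx, ny, B = 64, 64, 8                       # frame size and number of coded frames

X = rng.random((B, nx, ny))                 # B high-speed frames (stand-ins for a scene)
C = rng.binomial(1, 0.5, (B, nx, ny))       # binary {0, 1} masks, Bernoulli(0.5)
G = 0.01 * rng.standard_normal((nx, ny))    # additive sensor noise

# Eq. (1): each frame is modulated by its mask, then all are summed on the sensor.
Y = (C * X).sum(axis=0) + G                 # single coded measurement, shape (nx, ny)
```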
2.2.1. Forward model
Mathematically, Eq. (1) can be further derived to the following form through vectorization:

$$\mathbf{y} = \boldsymbol{\Phi}\mathbf{x} + \mathbf{g} \tag{2}$$

where $\mathbf{y} = \mathrm{vec}(\mathbf{Y}) \in \mathbb{R}^{n}$ and $\mathbf{g} = \mathrm{vec}(\mathbf{G}) \in \mathbb{R}^{n}$ with $n = n_x n_y$. Accordingly, the high-speed video signal $\mathbf{x} \in \mathbb{R}^{nB}$ and the sampling matrix $\boldsymbol{\Phi} \in \mathbb{R}^{n \times nB}$ are given by the following expressions:

$$\mathbf{x} = \left[\mathrm{vec}(\mathbf{X}_1)^{\mathrm{T}}, \ldots, \mathrm{vec}(\mathbf{X}_B)^{\mathrm{T}}\right]^{\mathrm{T}} \tag{3}$$

$$\boldsymbol{\Phi} = \left[\mathbf{D}_1, \ldots, \mathbf{D}_B\right], \quad \mathbf{D}_k = \mathrm{diag}\left(\mathrm{vec}(\mathbf{C}_k)\right) \tag{4}$$
From Eq. (4), we can find that the sampling matrix $\boldsymbol{\Phi}$ has a special sparse structure, which is formed by the concatenation of $B$ diagonal matrices. Hence, the compressive sampling ratio here is equal to $1/B$. Prior works [37] have proved that the reconstruction error of video CS is bounded even when $B > 1$ on the condition that the signal is structured enough. This special structure makes it possible to reduce the computational complexity in some optimization-based video CS reconstruction algorithms [18], [19], [38].
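The sparse structure of $\boldsymbol{\Phi}$ is easy to verify numerically. The sketch below (reusing X, C, and Y from the previous snippet; noise omitted for the check) builds $\boldsymbol{\Phi}$ as a concatenation of diagonal matrices per Eq. (4) and confirms that $\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm{T}}$ is diagonal, which is exactly the property that cheap projections in optimization-based algorithms exploit.

```python
import numpy as np
from scipy.sparse import diags, hstack

# Reusing X and C from the previous sketch.
n = nx * ny
x = np.concatenate([X[k].ravel() for k in range(B)])                  # Eq. (3)
Phi = hstack([diags(C[k].ravel().astype(float)) for k in range(B)])   # Eq. (4)

# The vectorized model reproduces the entry-wise model of Eq. (1).
assert np.allclose(Phi @ x, (C * X).sum(axis=0).ravel())

# Phi @ Phi.T is diagonal: its diagonal is the per-pixel sum of squared masks,
# so projecting onto the measurement constraint costs O(n), not a dense inverse.
R = (Phi @ Phi.T).diagonal()
assert np.allclose(R, (C**2).sum(axis=0).ravel())
```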
In the following, we summarize the theoretical results of Ref. [37] designed for video CS. Firstly, the forward models of video CS and the classical CS (e.g., single-pixel imaging (SPI)) are different; thus, the theoretical guarantee derived for CS does not fit video CS. The main difference between video CS and SPI [39] lies in the following two aspects:
(1) In SPI, the sensing matrix is dense. Each row of the sensing matrix corresponds to one pattern of the modulator imposed on the scene (a two-dimensional (2D) still image), and the single-pixel detector captures one measurement (one element in the measurement).
(2) In video CS, the sensing matrix in Eq. (4) is a sparse matrix, which is a concatenation of diagonal matrices. Each element of the measurement is a weighted sum of the corresponding elements in the video frames modulated by the masks.
2.2.2. Theoretical guarantees
The theoretical derivation of video CS [37] is based on signal compression results applied to compressive sensing [40]. A compressible signal pursuit (CSP)-type optimization was proposed in Ref. [37] as a compression-based recovery algorithm for video CS. Consider the compact set $\mathcal{Q} \subset \mathbb{R}^{nB}$ equipped with a compression code whose compression rate is equal to $r$, and assume that the compact set can be described by the mappings $(f, g)$, where $f$ denotes the encoding mapping function and $g$ denotes the decoding mapping function.
Consider that $\mathbf{x} \in \mathcal{Q}$, and the reconstructed signal $\hat{\mathbf{x}}$ is obtained by solving the following optimization:

$$\hat{\mathbf{x}} = \underset{\mathbf{c} \in \mathcal{C}}{\arg\min} \; \left\| \mathbf{y} - \boldsymbol{\Phi}\mathbf{c} \right\|_2^2 \tag{5}$$

where $\mathcal{C} = \{g(f(\mathbf{x})) : \mathbf{x} \in \mathcal{Q}\}$ denotes the codebook of the codec defined by $(f, g)$. In other words, given a measurement vector $\mathbf{y}$, this optimization, among all compressible signals (i.e., signals in the codebook), picks the one that is closest to the observed measurement when sampled according to $\boldsymbol{\Phi}$.
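A toy numerical reading of CSP: with a small, explicitly enumerable codebook, the optimization in Eq. (5) is literally a search for the codeword whose compressed version best matches the measurement. The random codebook below is purely illustrative; a real rate-$r$ code would come from a learned or engineered codec.

```python
import numpy as np

rng = np.random.default_rng(2)
n, B = 16, 4                                 # tiny sizes so the codebook stays enumerable
x_dim = n * B

# Toy "codebook": a small finite set of candidate signals, standing in for
# the set of compressible signals produced by a real codec.
codebook = rng.random((64, x_dim))
x_true = codebook[17]                        # the scene happens to be a codeword

D = [np.diag(rng.binomial(1, 0.5, n)) for _ in range(B)]
Phi = np.hstack(D)                           # sparse video CS sensing matrix, Eq. (4)
y = Phi @ x_true

# CSP, Eq. (5): among all codewords, pick the one whose compressed version fits y best.
errs = [np.linalg.norm(y - Phi @ c) for c in codebook]
x_hat = codebook[int(np.argmin(errs))]
print("recovered the right codeword:", np.allclose(x_hat, x_true))
```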
The following theorem in Ref. [37] characterizes the performance of video CS recovery using CSP-type optimization by connecting the parameters of the (compression/decompression) code, namely its compression rate $r$ and the corresponding distortion $\delta$, to the number of frames $B$, which determines the compressive sampling ratio and the resulting overall reconstruction quality.
Theorem 1. Assume that $\mathbf{x} \in \mathcal{Q}$, where $\mathcal{Q} \subset [0, \rho/2]^{nB}$; $\rho$ is a positive number with an upper bound. Further assume that the rate-$r$ code achieves distortion $\delta$ on $\mathcal{Q}$. Moreover, assume that each element in $\{\mathbf{D}_k\}_{k=1}^{B}$ is drawn from the standard Gaussian distribution—that is, $\mathbf{D}_k(i, i) \sim \mathcal{N}(0, 1)$, independent and identically distributed. Let $\hat{\mathbf{x}}$ denote the solution of the compressible signal pursuit optimization in Eq. (5), and let $\epsilon > 0$ be a free parameter bounded as specified in Ref. [37]. Then, the normalized reconstruction error $\frac{1}{\sqrt{n}}\|\mathbf{x} - \hat{\mathbf{x}}\|_2$ is upper-bounded by a term on the order of the distortion $\delta$ plus $\epsilon$, with a probability larger than one minus a term that decays exponentially in $n$; the explicit bound and failure probability are given in Ref. [37].
Note that we assume that the signal is bounded—that is, $\mathbf{x} \in [0, \rho/2]^{nB}$. This is necessary for the proof of the theorem; moreover, image and video pixel values are usually bounded after being captured by a camera (through the dynamic range of the sensor). Theorem 1 indicates that, given a compressible signal parameterized by the compression rate $r$ and distortion $\delta$, if this signal is compressively captured by a video CS system, then the reconstruction error (i.e., the error between the estimated and ground truth signals) is bounded with high probability by the distortion.
The result from Theorem 1 can be applied to any video CS system with the same forward model. The key assumption is that the desired high-dimensional data is highly compressible, which is generally true for the designed video CS systems. Essentially, as mentioned before, since the compressive sampling ratio of video CS is equal to $1/B$, the reconstruction error is bounded by the inherent compressibility of the signal. Here, we want to point out that the bound derived in Theorem 1 is not tight, and more efforts are expected in this research direction. Further analysis and the research gap to be filled can be found in Refs. [7], [28].
3. Video CS hardware
With recent advances in optics, electronics, and mechanics, novel optical encoding schemes and various video CS systems have been proposed and have achieved significant performance improvement in terms of spatial resolution, frame rate, and imaging quality, among others. In this section, we list existing video CS system designs and present a detailed summary of their principles, performance, and applications. Some of the representative video CS systems are also presented in Table 1 [9], [12], [16], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54] and Fig. 2 [9], [12], [16], [41], [42], [43], [44], [46], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58].
3.1. Spatial light modulation
The most frequently used approach in video CS systems is to introduce an external spatial light modulator and corresponding relay optics into the optical path to encode the incident scene light. Llull et al. [12] first utilized mechanical translation of a lithographically patterned mask driven by a piezoelectric stage to generate temporally varying patterns for optical encoding. They implemented a video CS system that could reconstruct more than ten greyscale frames (sometimes more than 100) from a single coded snapshot; later, they extended it to capture color videos [41]. Koller et al. [42] then improved this work in terms of spatial dimensions and reconstruction quality by optimizing the mask pattern design and hardware implementation. Their prototype realized two-megapixel resolution and high-speed video recording at 743 frames per second (FPS). These works demonstrate that the mechanical modulation scheme has the significant advantages of high spatial resolution, low system cost, and flexible scalability, although it can result in a bulky system and bring about some degree of instability in practical applications. Furthermore, the limited translation speed and frequency response of the piezoelectric stage are barriers to increasing the compressive ratio.
Another line of works employs off-the-shelf programmable spatial light modulators, including DMDs [46], [56], [59], [60], [61], [62] and LCoS [43], [44], [45], to conduct optical encoding. Compared with mechanical translation, these schemes have the merits of compactness, stability, speed, and flexibility, but they are more expensive, and their spatial resolution is limited to several million pixels.
A DMD is a type of micro-opto-electro-mechanical system that consists of a rectangular array of several million microscopic mirrors. Each mirror is mounted on a yoke, and the yoke is connected to two posts by compliant torsion hinges. These hinges allow the mirror to rotate by ±12°, thus realizing binary modulation of the incident light by changing its reflection direction. The maximum refresh rate of DMDs can exceed 10 kHz, which makes it possible to achieve high compressive ratios in video CS systems. The drawback of DMDs lies in the possible diffraction effect caused by the periodic micro-mirror arrangement; the diffraction might create artifacts in the resulting images, thereby degrading the imaging quality.
LCoS is another type of widely used spatial light modulator, which takes advantage of the polarization characteristics of liquid crystal to realize light modulation. More specifically, each pixel in an LCoS is composed of a liquid crystal layer, a reflective layer, and a silicon substrate. During operation, the silicon substrate regulates the voltage applied to the liquid crystal to alter the orientation of the liquid crystal molecules. Due to the unique optical anisotropy of the liquid crystal material, the refractive index of the liquid crystal and the phase of light passing through the liquid crystal undergo corresponding changes. Consequently, manipulation of the output voltage of the silicon substrate permits modulation of the polarization state of the incident light. The reflective layer is used to reflect the modulated light toward the optical system. In general, in video CS systems, a pair of orthogonal polarizers will be placed at the entrance and exit of the LCoS, converting the polarization modulation to the required binary amplitude modulation. The frequently used ferroelectric liquid crystal on silicon (FLCoS) in video CS systems has a refresh rate of up to 4.5 kHz. Although this is lower than that of a DMD, it is still higher than the mechanical translation of lithography masks and can meet the requirements for typical applications. The main issue with LCoS-based video CS systems is their lower light throughput, as half of the natural light is filtered out by the polarizer placed in front of the LCoS. However, their strength lies in their superior imaging quality in comparison with mechanical translation or DMD-based schemes.
While the aforementioned systems usually need global shutter cameras synchronized with the modulation device, rolling shutter cameras have recently been investigated for video CS [47], [48]. Although it is challenging to directly use a rolling shutter in a video CS system, a shuffled solution was proposed in Ref. [48] and verified using a DMD. There is still a long way to go before this solution is implemented in hardware, but it is indeed promising. Another solution is to use multiple rolling shutter cameras with different rotations to capture the scene [47]. However, this poses a challenge in hardware design and raises the complexity of the optics.
Typically, 2D sensors are employed to capture the modulated images generated using the aforementioned approaches. However, in certain specialized applications such as terahertz and infrared imaging, array detectors may be inaccessible or prohibitively expensive. In such cases, single-pixel cameras offer a viable alternative. SPI has primarily been utilized for compressive image capture, since it requires a large number of measurements and takes a long time to reconstruct a single frame; however, recent advancements in high-speed modulation strategies [63], [64] and reconstruction algorithms [57], [58], [65] have made it possible to employ SPI in video CS. For example, Hahamovich et al. [63] proposed a rapid SPI system employing a spinning mask for high-speed spatial modulation. The system demonstrated a modulation rate of 2.4 MHz and a spatial resolution of 101 × 103 pixels, enabling the real-time capture of dynamic scenes at 72 FPS. Kilcullen et al. [64] accelerated SPI by using swept aggregate patterns implemented with a DMD and laser scanning hardware, achieving a modulation speed of 14.1 MHz. They also developed a lightweight algorithm that supports parallel computing, enabling real-time video reconstruction of 101 × 103 pixels at 100 FPS. Other researchers, such as Higham et al. [58] and Mur et al. [65], have introduced convolutional auto-encoder networks and recurrent neural networks, respectively, into the SPI framework. These techniques have significantly improved reconstruction quality and speed, particularly at high compressive ratios.
In summary, the spatial light modulation scheme is commonly used for implementing video CS systems with existing commercial sensors. Yet, it comes with certain drawbacks. On the one hand, introducing spatial light modulators increases the system power consumption and cost; it also sacrifices light throughput, thus resulting in a lower signal-to-noise ratio (SNR) in the captured measurements. On the other hand, due to inevitable optical aberration and system disturbance, these systems require tedious but meticulous calibration before each data acquisition, and the reconstruction quality is highly sensitive to calibration errors. Moreover, the calibration process may fail to produce accurate results under challenging conditions such as poor illumination or unstable platforms.
3.2. Active illumination
The main idea of video CS is to modulate a high-speed scene before it is captured by the camera. While the spatial light modulation approaches described above implement this principle in a passive manner (i.e., natural light is used, and the modulation happens within the imaging system), another way to implement this principle is to impose structured light on the target scene in order to modulate the outgoing radiance. Active illumination is a novel way to achieve structured light, especially in an indoor environment.
Toward this end, Sun et al. [49] used a projector to achieve active illumination for video CS. They also obtained high-speed three-dimensional (3D) imaging using the relationship between pattern scale and object depth. In this active system, depth-map videos at 1000 FPS could be reconstructed from measurements captured at 200 FPS. One drawback of this setup is its short working distance and high sensitivity to disturbance from ambient light, since visible light is used for illumination. Recently, Guzmán et al. [50] addressed this issue by using infrared (IR)-pulsed illumination to modulate the scene, where the use of IR illumination enables the separation of the spatial (visible) and temporal image channels. Their setup achieved 210-FPS reconstructed video from compressed measurements captured at 15 FPS. It is worth mentioning that two additional measurements are used to improve the reconstruction in their system. In addition to these systems, active illumination using light-emitting diodes (LEDs) and DMDs has been used for joint video and spectral compressive sensing [66].
Mathematically, active illumination is the same as the other programmable spatial light modulators (e.g., using DMD or LCoS), but it is much more practical to implement in the near term due to its high flexibility and low cost.
3.3. Pixel-wise coded exposure (PCE) sensors
With recent advancements in semiconductor and integrated circuit technology, novel CMOS sensors have been designed to incorporate pixel-wise integration control features. These sensors enable the implementation of video CS without the need for additional external components, opening up possibilities for this technology’s volume production and extensive application in the future.
Zhang et al. [16] designed an all-CMOS chip with PCE capacity and demonstrated its application in video CS. They built a prototype image sensor with a resolution of 127 × 90 pixels, which could reconstruct 100-FPS videos from coded measurements captured at 5 FPS. Later, Martel et al. [67] demonstrated a video CS system implemented with SCAMP-5 [68], a programmable sensor–processor with 256 × 256 pixels. They co-designed the per-pixel shutter functions and the reconstruction algorithm in their framework and achieved superior reconstruction quality under a compressive ratio of 16. In contrast, Sarhangnejad et al. [69] and Luo et al. [51], [70] separately designed a novel kind of dual-tap coded-exposure-pixel sensor that could output two complementary coded images during a single exposure. This kind of sensor is also known as a coded two-bucket camera [17]. It integrates two charge-collection buckets and a writable memory that controls which bucket is active in each pixel. By assigning programmable binary patterns to control the active buckets, the sensor can realize pixel-wise exposure coding and output two complementary coded snapshots per video frame. Compared with the one-bucket paradigm proposed by Zhang et al. [16], the coded two-bucket sensor makes adequate use of all incident light and thus features higher light efficiency, which facilitates reconstruction algorithms of video CS [71]. In recent years, a series of works [72], [73] concerning coded two-bucket sensors have persistently been conducted to improve the sensor’s performance in terms of spatial resolution, fill factor, modulation speed, dynamic range, power consumption, and so forth. Other pixel-wise exposure control paradigms have also been proposed [52].
Compared with external modulation schemes, sensors with pixel-wise exposure control capacity enable direct video CS without the need for relay optics and can be mass-produced. This feature significantly reduces video CS systems’ volume, lowers their power consumption, and improves their stability, thereby broadening the applications of video CS in different fields. Nevertheless, it is worth noting that there is still a significant gap between existing pixel-coded-exposure sensors and mature conventional global-exposure image sensors concerning fill factor, spatial dimensions (resolutions), imaging quality, and more. Exploring these areas is the future direction of this field.
3.4. Sophisticated encoding schemes
Apart from direct spatial modulation schemes implemented with external modulators or PCE sensors, there are a variety of elaborately designed video CS systems that exhibit superior performance or possess special functions. Compressed ultrafast photography (CUP) is one of the representative high-speed imaging techniques based on video CS [55], [56]. In CUP, a DMD is first employed to spatially encode the input images with a static pseudo-random binary pattern. Then, a streak camera is used to temporally disperse the encoded images along a spatial axis before integration on the sensor to form a coded snapshot measurement. In essence, the combination of spatial encoding using the DMD and temporal shearing with the streak camera enables the required spatial–temporal modulation for video CS, which shares the same general idea with coded aperture snapshot spectral imaging [74]. Due to the high switching speed of the streak camera, CUP is capable of capturing transient events at a maximum rate of $10^{11}$ FPS, enabling the observation of physical phenomena such as the reflection and refraction of laser pulses. In recent years, the imaging speed and reconstruction quality of CUP have been significantly enhanced [75]. Moreover, CUP has been expanded to capture multidimensional visual information such as spectra [59].
The primary limitation of CUP is its restricted spatial resolution, which is determined by the width of the open entrance slit of the streak camera. It is also burdened by a high hardware cost, which restricts its applications primarily to scientific research. Another area of research focuses on achieving better spatial and temporal resolutions simultaneously for more general applications in industry and daily life by designing more advanced modulation schemes. Deng et al. [54] introduced frequency-domain modulation, implemented with sinusoidal sampling, into video CS in order to pursue stronger temporal compression while maintaining the spatial resolution. The underlying principle of their method involved mapping multiple sets of spatial–temporal encoded measurements to distinct positions in the Fourier domain. This improved the information density of the captured coded snapshot, leading to an overall compressive sampling ratio below 0.01. Zhang et al. [9] focused instead on overcoming the spatial resolution limitation imposed by current modulation devices. They developed a novel hybrid coded aperture snapshot compressive imaging (HCA-SCI) scheme for spatial–temporal modulation in video CS and implemented a ten-megapixel prototype using a combination of a static high-resolution lithography mask and a dynamic low-resolution LCoS. Compared with the mask-translation scheme, which also has the potential to achieve higher spatial resolution, HCA-SCI exhibits significant advantages in terms of power consumption, volume factor, and system stability.
In addition to the aforementioned methods, some creative works have extended video CS in different dimensions. For example, Tsai et al. [76] extended the mechanical-translation-based video CS system to capture multi-spectral, high-speed scenes in a compressive way by introducing spectral dispersion. Their prototype camera could reconstruct fifteen spectral channels and ten temporal frames from a single coded snapshot. Sun et al. [77] introduced asymmetric stereo techniques into video CS to achieve passive high-speed 3D imaging, which allowed depth-map videos at 800 FPS to be reconstructed from measurements acquired at 80 FPS. Qiao et al. [53] designed a dual-view video CS system that concurrently encoded incident light from two distinct perspectives and integrated it via the same sensor to form a single coded measurement. Two video sequences from the corresponding views could then be reconstructed from the coded snapshot [60]. The fundamental principle of this system is to modulate each view onto orthogonally polarized light and shift the coded images to introduce a lateral displacement between them. The researchers constructed a prototype to validate the design and achieved dual-view video CS with a resolution of 650 × 650 pixels and a temporal compressive ratio of 20. More recently, Dou et al. [78] extended the spatial modulation of video CS to digital holography in order to increase the sampling speed, and Luo et al. [79] introduced the principle of video CS to structured-illumination super-resolution imaging. A similar idea was used in snapshot ptychography to capture scenes with a large field of view and high spatial resolution simultaneously in a single shot [80]. Different sampling techniques for snapshot compressive imaging are compared in Ref. [81].
4. Video CS algorithms
In video CS, recovering the original high-speed video from the coded measurements obtained through compressive sampling is an ill-posed problem: the number of equations is far smaller than the number of unknowns, so there is no unique solution. Traditional optimization algorithms explore various types of prior knowledge, such as sparsity, to reduce this ill-posedness. A great deal of theoretical research and technology in the field of image and video processing provides strong support for the structural analysis of natural scenes. The results show that natural images exhibit features such as local smoothness, non-local self-similarity, and sparsity. These priors provide effective constraints for the computational reconstruction of single-exposure compressed imaging, thereby reducing the difficulty of solving the ill-posed problem. Recently, deep-learning-based reconstruction algorithms have emerged as the dominant approach.
At present, reconstruction algorithms can be categorized into four classes, as shown in Fig. 3. The first class comprises iterative optimization algorithms employing different priors. Deep-learning-based algorithms make up the other three classes. Aiming to integrate iterative algorithms and deep denoising algorithms, PnP algorithms employing pretrained denoising networks were proposed to reconstruct large-scale videos. Later, even without training data, self-supervised reconstruction algorithms employing untrained neural networks for denoising were used for reconstruction, albeit with limited results [82]. For faster inference, end-to-end reconstruction algorithms using neural networks were proposed. So far, the best performance has been achieved using deep unfolding networks composed of several stages, with each stage including projection and a neural network to learn priors [28]. These classes are summarized in Fig. 3, and a roadmap with representative algorithms is shown in Fig. 4 [12], [18], [19], [29], [30], [38], [43], [44], [83], [84], [85], [86], [87], [88], [89], [90], [91], [92], [93].
4.1. Traditional optimization algorithms
The sparse priors in the traditional optimization framework include total variation (TV), discrete cosine transform (DCT), wavelet transform, the low-rank prior, over-complete dictionaries, and the Gaussian mixture model (GMM). Among them, the TV constraint [38] builds on the observation that natural image gradients follow a Laplacian distribution. It offers relatively good noise robustness and texture-detail retention, and it is fast, emphasizing local smoothness. DCT and wavelet transform are global sparse priors, but they often cannot provide the required sparsity in some specific scenes. Compared with the first three global priors, the low-rank prior is a non-local prior, which assumes that natural images have non-local self-similarity [94], [95]. Similar textures exist at different locations in natural images, and many textures in natural images are themselves regular. This indicates that the information in natural images is redundant, and this redundancy can be exploited to reconstruct and recover images or videos. The DeSCI algorithm [18] further expands the application of the non-local self-similarity model in video CS, constraining similar information in different areas of high-dimensional, high-resolution video sequences with low rank and thereby improving the reconstruction accuracy.
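The GAP-TV flavor of these solvers [38] is compact enough to sketch. The snippet below alternates the cheap GAP projection (which exploits the diagonal $\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm{T}}$ noted in Section 2.2.1) with an off-the-shelf TV denoiser from scikit-image; the iteration count and TV weight are illustrative defaults, not tuned values.

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

def gap_tv(Y, C, iters=50, tv_weight=0.1):
    """Minimal GAP-TV sketch for video CS: alternate a cheap Euclidean
    projection toward {v : sum_k C_k * v_k = Y} with TV denoising."""
    B = C.shape[0]
    R = (C**2).sum(axis=0) + 1e-8            # diagonal of Phi @ Phi.T, per pixel
    v = np.repeat((Y / R)[None], B, axis=0)  # simple initialization
    for _ in range(iters):
        residual = (Y - (C * v).sum(axis=0)) / R
        x = v + C * residual[None]           # GAP projection (uses the diagonal structure)
        v = denoise_tv_chambolle(x, weight=tv_weight)  # 3D (spatio-temporal) TV prior
    return v

# Usage with the measurement simulated in Section 2:
# X_hat = gap_tv(Y, C.astype(float))
```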
Over-complete dictionaries and the GMM both belong to local sparse priors. The authors of Refs. [44], [45] reconstructed dynamic scenes using over-complete-dictionary-based algorithms. These algorithms learn an over-complete dictionary from a large number of videos and represent any given video as a sparse linear combination of elements in the dictionary. Since the dictionary is learned from the video data itself, it captures common video features and achieves satisfactory results. The GMM models local image blocks as a mixture of Gaussian distributions: image blocks are assumed to be independent and identically distributed, each following a Gaussian distribution with a specific mean and variance. Therefore, as long as the parameters of these Gaussian distributions and the index of each image block are known, the image can be reconstructed. The local sparse prior obtained from an image database can extract more information and thus yields higher reconstruction quality [83]. Its disadvantage is that it can only capture the low-dimensional eigenstructure of specific training scenes; for different dynamic scenes, it needs to be retrained, which is computationally expensive [96].
Traditional optimization algorithms are highly flexible and can quickly adapt to new hardware systems in different application scenarios. Nevertheless, they generally require a large amount of computation, such that the reconstruction process is slow and cannot meet real-time requirements, and they suffer from limited reconstruction quality.
4.2. Deep-learning-based algorithms
Traditional optimization algorithms often rely on different optimization frameworks and various regularization priors, solving problems through iterative procedures. With advances in deep learning, researchers have recognized that deep denoising networks can serve as effective image priors, achieving superior performance while retaining the advantage of flexibility in traditional optimization methods. As a result, the concept of deep PnP methods has emerged.
4.2.1. Deep PnP algorithms
In contrast to conventional optimization algorithms, deep PnP reconstruction methods replace the traditional prior terms with deep neural networks, thus achieving higher inference speed and better reconstruction quality. The PnP framework was initially introduced in 2013 by Venkatakrishnan et al. [97] for image reconstruction, although deep neural networks were not employed as priors at that time. In 2020, a PnP method [19] was proposed for large-scale videos—particularly for 4K resolution videos—and was extensively applied across various video scales. Appropriate pretrained video and image denoising networks were integrated as deep priors into the generalized alternating projection (GAP) framework [38], achieving the best results among unsupervised methods at the time. In addition, the utilization of graphics processing units (GPUs) led to enhanced computational efficiency compared with traditional convex optimization methods.
The aforementioned methods utilize priors pretrained on other datasets, necessitating the prior training of denoising models for specific scenarios. The pretrained parameters in one network may not match different scenarios in real applications. In 2022, Wu et al. [84] developed an adaptive PnP method by automatically updating the parameters in the network of the deep denoising prior according to the specific dynamic scene, thereby bridging the gap between the pretrained network and real applications.
However, another problem remains in the above PnP methods: the lack of training data. To achieve a high-quality denoising network, the plugged model needs to be trained on a large amount of data. However, in some special application scenarios, it could be challenging to generate the training data. Untrained denoising networks do not rely on specific training data, so they are more universally applicable to various types of noise and signals. This enhances their robustness when dealing with unknown or different types of data. Based on these advantages, Qiao et al. [82] proposed an untrained denoising network based PnP algorithm for snapshot temporal compressive microscopy.
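In code, the PnP idea amounts to swapping the denoising step of the GAP-TV sketch in Section 4.1 for an arbitrary callable; everything else stays the same. The `denoiser` argument below is a placeholder for, e.g., a pretrained video denoising network wrapped to take and return NumPy arrays.

```python
import numpy as np

def pnp_gap(Y, C, denoiser, iters=50):
    """PnP-GAP sketch: the prior is whatever denoiser is plugged in."""
    B = C.shape[0]
    R = (C**2).sum(axis=0) + 1e-8
    v = np.repeat((Y / R)[None], B, axis=0)
    for _ in range(iters):
        x = v + C * ((Y - (C * v).sum(axis=0)) / R)[None]  # same GAP projection as before
        v = denoiser(x)  # pretrained, adaptively updated, or even untrained network
    return v
```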
It is well known that PnP algorithms exhibit strong generalization capability and can be well-suited for a range of systems. However, a significant challenge of PnP algorithms is their inability to guarantee global convergence. Introducing denoising algorithms or regularization terms, which are often nonlinear, can lead to convergence difficulties or unstable reconstruction results in certain scenarios, and such algorithms require meticulous hyperparameter tuning in practice. To address this challenge, numerous studies have proposed improved PnP algorithms by incorporating stability conditions or utilizing adaptive step sizes, among other techniques. Nonetheless, the problem of reconstruction stability remains an open challenge.
4.2.2. End-to-end deep learning algorithms
To address the issue of low inference speed in iterative methods, end-to-end networks for solving video CS reconstruction tasks have been developed. Among these, the end-to-end convolutional neural network (E2E-CNN) algorithm proposed by Qiao et al. [46] in 2020 employs the U-Net architecture and achieves significantly improved video CS reconstruction quality compared with traditional optimization methods. This algorithm also exhibits real-time reconstruction capability, benefiting from the high inference speed of neural networks. However, it still has limitations: It underutilizes inter-frame correlation information in video reconstruction, and its performance declines as the data compression ratio increases. In response to these challenges, the BIRNAT algorithm, based on bidirectional recurrent neural networks (RNNs), significantly enhances reconstruction performance while maximizing inter-frame information utilization [85]. Moreover, the extensible nature of RNN structures allows for scalability as the data compression ratio increases. As model complexity increases, reconstruction performance improves, but training convergence slows significantly and GPU resource requirements grow. The introduction of the RevSCI algorithm in 2021 effectively overcame these issues [86]. RevSCI employs a reversible neural network structure and was the first to introduce 3D convolutional neural networks (3D-CNNs) into video CS reconstruction. The use of reversible neural networks enables the activation values at each network layer to be recomputed during training from subsequent layer values, eliminating the need to store activations and drastically reducing GPU resource demands during training. The incorporation of 3D-CNNs allows the network to consider both spatial features within individual frames and inter-frame correlation across neighboring frames. RevSCI improves video CS reconstruction outcomes and facilitates data reconstruction under high compression ratios.
However, another mismatch issue emerges, as the training mask set might be different from the ones being used in the real system. This results in performance degradation because deep neural networks become coupled with a substantial amount of mask-related information during the learning process in order to enhance reconstruction quality, which necessitates time-consuming retraining to attain previous reconstruction quality. Thus, in scenarios where the mask changes, especially in practical applications, the ability to rapidly adapt to entirely new systems becomes critical. MetaSCI, introduced by Wang et al. [87] in 2021, addresses this challenge by constructing a shared backbone for different systems with lightweight meta-modulation parameters, allowing for swift adaptation to new masks and easy scalability to large-scale data.
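A minimal PyTorch sketch of the end-to-end idea follows. The tiny three-layer CNN, the mask-conditioned input, and the crude initial estimate are all illustrative stand-ins (E2E-CNN uses a full U-Net), but the input/output contract is the same: one coded measurement plus its masks in, B frames out, in a single forward pass.

```python
import torch
import torch.nn as nn

class TinyE2ENet(nn.Module):
    """Illustrative end-to-end reconstruction net: maps a coded measurement
    and its masks directly to B video frames."""
    def __init__(self, B, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(B + 1, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, B, 3, padding=1),
        )

    def forward(self, Y, C):                            # Y: (N,1,H,W), C: (N,B,H,W)
        x0 = Y / (C.sum(dim=1, keepdim=True) + 1e-8)    # crude initial estimate
        return self.net(torch.cat([Y, x0 * C], dim=1))  # condition on the masks

# Training sketch: minimize ||model(Y, C) - X_gt|| over simulated (Y, C, X_gt) pairs.
```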
4.2.3. Deep unfolding algorithms
Inspired by optimization algorithms such as alternating direction method of multipliers (ADMM) [98] and GAP [99], many deep unfolding algorithms [29], [30], [88], [89] have been proposed to tackle inverse problems in video CS. These methods consist of a number of structurally similar modules, each representing an iterative step in traditional optimization algorithms. Despite the successful assimilation of the advantages of iterative optimization algorithms and the enabling of end-to-end training, the number of network modules in deep unfolding must be kept relatively small for two reasons: ① These networks should remain concise in order to achieve real-time reconstruction speed; and ② training deep unfolding networks with numerous stages is challenging due to memory limitations. To address this, Zhao et al. [90] proposed a deep equilibrium model (DEQ)-based algorithm for video CS reconstruction in 2023. This algorithm effectively combines data-driven regularization methods with stable iterative convergence, achieving low memory consumption and stable reconstruction. DEQ employs the same transformation at each iteration layer, akin to training a network of arbitrary depth with fixed memory occupancy. This corresponds to both the PnP architecture and the deep unfolding network architecture, effectively simulating an infinite number of iterative steps.
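The structure described above (a small, fixed number of stages, each pairing a physics-based projection with a learned prior) can be sketched as follows. The stage count and the tiny prior network are illustrative stand-ins for the much larger backbones used in practice, e.g., in GAP-net or DUN-3DUnet.

```python
import torch
import torch.nn as nn

class UnfoldingStage(nn.Module):
    """One unfolded iteration: GAP-style projection + a learnable prior step."""
    def __init__(self, B, width=32):
        super().__init__()
        self.prior = nn.Sequential(
            nn.Conv2d(B, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, B, 3, padding=1),
        )

    def forward(self, v, Y, C, R):
        # Data-fidelity projection, identical in form to the iterative solvers.
        x = v + C * (Y - (C * v).sum(dim=1, keepdim=True)) / R
        return x + self.prior(x)              # residual "denoising" learned from data

class UnfoldingNet(nn.Module):
    def __init__(self, B, stages=5):
        super().__init__()
        self.stages = nn.ModuleList(UnfoldingStage(B) for _ in range(stages))

    def forward(self, Y, C):                  # Y: (N,1,H,W), C: (N,B,H,W)
        R = (C**2).sum(dim=1, keepdim=True) + 1e-8
        v = (Y / R).repeat(1, C.shape[1], 1, 1)
        for stage in self.stages:             # trained end-to-end through all stages
            v = stage(v, Y, C, R)
        return v
```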
4.2.4. A short summary
End-to-end deep neural networks rely heavily on abundant training data, with distribution alignment between training and test data being essential for optimal performance. Despite significant advancements in reconstruction quality, these algorithms lack interpretability due to the “black-box” nature of deep neural networks.
Deep unfolding algorithms represent a fusion of traditional optimization methods with neural networks. These algorithms involve unfolding each iteration of the optimization process into network layers. In traditional optimization, manually selected sparse priors—such as TV in the gradient domain, and DCT or wavelet transform in the signal domain—are used to impose constraints on latent videos during reconstruction. However, different sparse priors impose varying constraints on signals, and choosing the ideal constraints is often challenging. In deep unfolding algorithms, the process of manually selecting sparse priors is replaced by embedding deep neural networks in each layer to adaptively learn corresponding constraint conditions. While traditional optimization algorithms typically require hundreds or thousands of iterations, deep unfolding networks, leveraging the learning capabilities of deep neural networks, can drastically reduce the number of iterations (by approximately two orders of magnitude) to around tens. GAP-net was the first deep-unfolding algorithm for video CS reconstruction [88]. DUN-3DUnet significantly improves reconstruction performance [29]; it employs the 3D-Unet as the backbone and uses dense feature map fusion to overcome limitations in information transmission within the network. Li et al. [100] introduced Anderson acceleration to enhance model convergence speed.
Deep unfolding algorithms amalgamate the strengths of traditional optimization techniques and deep learning, thereby enhancing model interpretability. By incorporating traditional optimization methods, the network can focus more on learning the intrinsic features of images and videos, decoupling the network from mask information to a greater extent. This flexibility aims to ensure model adaptability when facing different systems (e.g., changes in modulation masks). However, existing algorithms still lack flexibility, as the depth of deep unfolding algorithms directly determines the number of optimization iterations. Achieving superior reconstruction outcomes often requires more stages, accompanied by higher GPU memory requirements, which poses a substantial challenge for training with large-scale, highly compressed data.
TwoStage-VCS [101] is a structurally streamlined two-stage deep unfolding network designed for the task of video CS reconstruction. This deep unfolding model not only achieves desirable results for video CS reconstruction problems but also exhibits high adaptability to different modulation masks. It is capable of producing satisfactory results for various masks and scales, demonstrating stability in reconstruction quality. Models trained on low-resolution datasets can be directly transferred and applied to large-scale datasets, thereby significantly alleviating the computational demands of training resources. Ensemble learning priors have been employed to address the scalability challenge in video CS [30]. Since then, the challenge of training models for large-scale data has been effectively addressed, and the primary framework of reconstruction algorithms has transitioned from CNNs to Transformers.
STFormer [102] exploits the correlation in both the spatial and temporal domains by means of the combination of a spatial self-attention branch and a temporal self-attention branch in each sub-block. CTM-SCI [91], which is composed of 3D CNN and 3D scalable blocked dense and dilated sparse attention, can well capture local and global spatial–temporal interactions. Moreover, it introduces the estimation of uncertainty for increased attention on areas with high reconstruction variance. It is well known that algorithms based on the Transformer architecture often involve substantial computational cost. To address this issue, EfficientSCI [92], [103] builds hierarchical dense connections within a single residual block to reduce model computational complexity.
4.3. Evolution of reconstruction network backbones
In addition to the different structures of deep-learning-based algorithms for video CS, it is interesting to investigate the evolution of the network backbones being used, as shown in Fig. 3(d). The first deep learning network for video CS [93] was based on a deep fully connected network. After this, although various CNNs were used in other image-restoration problems, to the best of our knowledge, no significant results were reported for video CS reconstruction. Soon after U-Net [104] became widely used in other CV tasks, it began to be used as a backbone for video CS [46].
One of the main reasons for the success of video CS is the redundancy among video frames, which makes it intuitive to exploit the recurrent information between adjacent frames. Motivated by this, RNNs were introduced for video CS in BIRNAT [85] and further extended to other video CS systems in Ref. [105]. Notably, BIRNAT was the first deep-learning-based algorithm to surpass the performance of the state-of-the-art optimization-based algorithm DeSCI [18] in terms of accuracy. BIRNAT held the performance record for only a short while, however; it was soon surpassed by the emerging Transformer-based networks developed in Refs. [91], [102].
From this evolution, it can be seen that whenever a new network architecture emerges, it can be adopted for video CS and usually improves the results. However, this trend mainly closes the accuracy gap of deep learning algorithms. Another reason for employing deep learning in video CS is inference speed.
4.4. From accuracy to speed
Accuracy was the first obstacle preventing video CS from reaching practical applications. Since BIRNAT was proposed, this issue has been solved to some extent. As mentioned earlier, optimization-based algorithms are usually slow, although some (e.g., DeSCI) can provide high accuracy. On the other hand, after training (which can take days or weeks), deep-learning-based algorithms exhibit fast inference. Ref. [46] provides hope for real-time (30-FPS) reconstruction using U-Net.
As researchers aim for higher accuracy, model sizes keep growing, as shown in Fig. 5. This necessitates huge GPU memory in addition to long training times. To address this issue, RevSCI was proposed [86] to reduce the memory footprint of reconstruction.
4.5. Big model or efficient model
Inspired by the success of large language models (LLMs) for natural language processing tasks, it is desirable, on the one hand, to train a large model for video CS in various scenarios, such as under different spatial sizes and different compression ratios, or even under different light conditions; on the other hand, if video CS is to be pushed forward for practical applications such as use on mobile devices, it is necessary to make the model small and shorten the inference time during testing. This dilemma leads to a tradeoff between the model’s generalization ability and its deployment/operating efficiency. Although we believe that large models are the trend for diverse vision tasks, efficient models are also desired for mobile applications.
Toward this end, EfficientSCI [92] was proposed for the use of small models for video CS reconstruction and has led to state-of-the-art results. EfficientSCI has also been used in other related fields, such as digital holographic microscopy [78].
We are working on lightweight techniques such as network quantization to move one step further toward real applications, and it is encouraging that a binary network has been developed for spectral compressive imaging [106].
5. More gaps to fill
While considerable efforts have been made and significant advances have been achieved in both the hardware systems and reconstruction algorithms of video CS, there are still some gaps that need to be filled in order to enhance its availability and performance in practical applications.
5.1. Hardware and systems
Regarding the hardware aspect, video CS systems work by modulating and integrating M consecutive frames into a single snapshot, which reduces the dynamic range of the reconstructed frames by a factor of M. Visually, the limited dynamic range restricts brightness and contrast levels, hindering the distinction of image details [62]. One promising way to mitigate this issue is to employ sensors with a larger full-well capacity; in addition, post-hoc dynamic-range enhancement algorithms can improve the visual quality of the reconstructed videos.
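To make the dynamic-range argument concrete, the following minimal NumPy sketch (toy sizes and random binary masks are our assumptions) simulates the standard video CS forward model, in which M modulated frames are integrated into one snapshot. Each snapshot pixel accumulates roughly M/2 frames' worth of signal, so the per-frame dynamic range available after reconstruction shrinks by a factor on the order of M.

```python
import numpy as np

rng = np.random.default_rng(0)

M, H, W = 8, 64, 64                         # compression ratio and frame size (toy values)
video = rng.random((M, H, W))               # high-speed frames x_1..x_M, each in [0, 1]
masks = rng.integers(0, 2, size=(M, H, W))  # random binary modulation patterns C_1..C_M

# Coded snapshot: y = sum_m C_m * x_m (element-wise modulation, then integration).
measurement = (masks * video).sum(axis=0)

# Each pixel accumulates ~M/2 modulated frames, so the snapshot's values span
# roughly [0, M/2] rather than [0, 1]; after reconstruction, each frame keeps
# only ~1/M of the sensor's dynamic range.
print(measurement.mean(), measurement.max())  # ~M/4 on average, well above 1
```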
Compared with conventional cameras, video CS systems must also consider user-experience factors such as system compactness, robustness, power consumption, and focusing and zooming capability [107]. Currently, most video CS prototypes are built on laboratory platforms and demonstrated on toy examples. They are bulky and fragile because they rely on discrete optical components, an extra control board for modulation and synchronization, and a computer for data storage and reconstruction, which hinders practical deployment in real outdoor scenarios. Furthermore, these systems generally have fixed focal lengths and focal planes, making it difficult to change the imaging distance and field of view.
To overcome or mitigate these drawbacks and enable commercial production, the systems must be optimized with optical engineering techniques. Moreover, future advances in semiconductors and integrated circuits could facilitate the integration of the entire system, making "on-chip" video CS and reconstruction possible with mature pixel-wise coded exposure (PCE) image sensors, ISPs, and network processing units (NPUs).
5.2. Reconstruction algorithms
Like the hardware and systems, the algorithms require further effort before video CS systems can be deployed to end users. As these systems gain traction in applications ranging from surveillance to medical imaging, it is crucial to address the algorithmic challenges that can hinder their successful deployment.
The robustness of reconstruction algorithms is fundamental to the performance of video CS systems: the algorithms must handle different types of noise and varying illumination conditions to ensure high-quality video reconstruction. Video data transmitted over networks are often subject to various types of noise, such as Gaussian, salt-and-pepper, and quantization noise, which can significantly degrade the quality of the reconstructed videos; reconstruction algorithms must therefore incorporate advanced noise-reduction techniques that enhance the clarity of the output. In addition, videos captured under different illumination conditions often exhibit variations in brightness, contrast, and color balance. To ensure consistent quality, reconstruction algorithms must be equipped with illumination-normalization and color-correction abilities that adjust the brightness and contrast of the video frames and correct color imbalances, thereby improving the overall video quality.
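One concrete way to pursue such robustness is to inject the expected degradations during training. The snippet below synthesizes Gaussian plus salt-and-pepper corruption of a measurement as a data-augmentation step; the noise levels are illustrative assumptions, not recommended settings.

```python
import numpy as np

def corrupt(measurement, sigma=0.01, sp_ratio=0.001, rng=None):
    """Augment a coded measurement with Gaussian and salt-and-pepper noise."""
    rng = rng or np.random.default_rng()
    out = measurement + rng.normal(0.0, sigma, measurement.shape)
    flips = rng.random(measurement.shape) < sp_ratio   # pixels to saturate
    out[flips] = rng.choice([measurement.min(), measurement.max()], size=flips.sum())
    return out

y = np.random.rand(64, 64)                 # stand-in measurement
y_noisy = corrupt(y, rng=np.random.default_rng(0))
print(np.abs(y_noisy - y).mean())
```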
Optimizing low-power reconstruction networks is crucial for the cost-effective deployment of video CS systems, particularly in mobile and remote applications. Field-programmable gate arrays (FPGAs) and specialized chips offer a practical way to reduce power consumption: by implementing low-power reconstruction networks on these platforms, the overall energy consumption of the system can be reduced significantly, making it more suitable for resource-limited environments [108]. Energy efficiency is thus a critical aspect of low-power reconstruction networks, and algorithms need to be optimized to minimize energy consumption while maintaining high-quality video reconstruction. Techniques such as pruning, quantization, transfer learning, and low-rank approximation can reduce the computational complexity of the algorithms and thereby lower their energy consumption.
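Among these techniques, low-rank approximation has a particularly compact form: a dense layer's weight matrix is replaced by the product of two thin factors obtained from a truncated SVD, trading a small accuracy loss for fewer multiply-accumulate operations. The sketch below uses illustrative sizes of our own choosing.

```python
import numpy as np

def low_rank_factors(w, rank):
    """Truncated SVD: W (m x n) ~= A (m x r) @ B (r x n), with r << min(m, n)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # absorb the leading singular values into A
    b = vt[:rank, :]
    return a, b

w = np.random.randn(512, 512)    # stand-in for a dense layer's weights
a, b = low_rank_factors(w, rank=64)

# MACs per input vector drop from 512*512 = 262144 to 512*64 + 64*512 = 65536.
# A random Gaussian matrix is nearly full-rank, so the error below is large;
# trained weight matrices are typically far more compressible.
print(np.linalg.norm(w - a @ b) / np.linalg.norm(w))
```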
Real-time adaptive sensing is essential for video CS systems to handle dynamic scenes and varying content effectively [109]. Such systems must adapt to changes in scene dynamics in real time, which requires adaptive sensing algorithms that adjust the sampling rate and compression ratio according to the scene's complexity and motion content; by doing so, the system ensures that important video content is adequately captured and reconstructed. The algorithms must also be flexible enough to handle varying video content (e.g., rigid motion and fluid motion), transmission conditions, and user requirements, which calls for algorithms that can accommodate different acquisition conditions. Such flexibility is key to the system's usability and effectiveness in diverse applications. A toy version of this feedback loop is sketched below.
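The following sketch is a deliberately simplified feedback rule with made-up thresholds, intended only to convey the control-loop flavor; Ref. [109] formulates the full problem with reinforcement learning.

```python
import numpy as np

def next_compression_ratio(frames, cr, cr_min=8, cr_max=48):
    """Toy feedback rule using mean absolute frame difference as a motion proxy:
    fast scenes get a lower compression ratio (easier to reconstruct),
    quasi-static scenes a higher one (more bandwidth savings)."""
    motion = np.mean(np.abs(np.diff(frames, axis=0)))
    if motion > 0.05:          # illustrative thresholds, not tuned values
        cr = max(cr_min, cr // 2)
    elif motion < 0.01:
        cr = min(cr_max, cr * 2)
    return cr

frames = np.random.rand(8, 64, 64) * 0.02      # nearly static toy clip
print(next_compression_ratio(frames, cr=16))   # -> 32: static scene, compress more
```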
6. Leading to real applications
Video CS is a novel and promising imaging paradigm that offers several advantages, including high throughput, low bandwidth, and reduced power, memory, and computation requirements. In Fig. 6, a free fall is captured by a real video CS system. The spatial resolution of this scene is 800 pixel × 800 pixel, while the frame rate of the camera is only 50 FPS; with a compression ratio of 40, a video with an equivalent frame rate of 50 × 40 = 2000 FPS can be readily obtained. Although certain off-the-shelf high-speed cameras support such frame rates, they often rely on expensive high-end sensors or struggle with limited recording duration. The reconstructed video corresponding to Fig. 6 and more data with higher frame rates (8000–18 000 FPS) can be found in Appendix A, Videos S1 and S2, respectively.
In conjunction with emerging techniques such as LLMs and large vision models (LVMs), video CS has opened up new possibilities in various application fields, including human–machine interaction, autonomous driving, UAVs, and robotics. These advances allow the development of more efficient end-to-end vision-based frameworks for perception, planning, decision-making, and control.
In recent years, early attempts have been made to integrate front-end video CS with back-end semantic CV tasks to enhance overall efficiency. These efforts involve cascading specific CV tasks, such as action recognition [33] and object detection [34], [110], after video CS. Such approaches eliminate the need for video reconstruction and enable end-to-end optimization of both the encoding strategy and vision algorithms. They have shown significant performance improvements compared with traditional approaches based on conventional sensors and separate pipeline design, especially in high-speed scenarios where conventional sensors suffer from severe motion blur.
The drawbacks of these works arise from their limited application scenarios and insufficient integration of front-end video acquisition and back-end CV tasks. Bearing these drawbacks in mind, Lu et al. [108] took a step forward in the real application of video CS to connected and autonomous vehicles (CAVs) by developing a novel vehicle–edge server–cloud closed-loop framework named EdgeCompression. This framework takes into account the existing challenges of CAVs in imaging systems, video analysis, and edge computing platforms. By introducing video CS to lower power, memory, and computation consumption, the framework achieves detection accuracy on par with reconstruction-based methods while significantly improving processing speed.
Similarly, Zhang et al. [35] proposed a more efficient vision-based semantic retrieval framework by combining video CS with well-established CV neural networks. Their framework incorporates two compressive-domain network backbones to directly extract descriptive and discriminative features from the coded measurements. By retraining or fine-tuning existing CV networks on these features, the video CS paradigm becomes compatible with existing CV algorithms, facilitating their joint development. Furthermore, the framework emphasizes task-specific or adaptive video CS capabilities with the assistance of feedback from the CV task side. A generic flavor of such compressive-domain processing is sketched below.
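As a rough illustration of the idea (not the exact designs of Refs. [35] or [108]), the coded measurement can be normalized by the accumulated mask so that its scale resembles an ordinary image and then fed to a standard vision backbone that is subsequently fine-tuned on coded data. The toy sizes, the normalization, and the choice of ResNet-18 below are our own assumptions.

```python
import torch
from torchvision.models import resnet18

# Coded measurement y = sum_m C_m * x_m and the mask sum used to normalize it.
y = torch.rand(1, 1, 224, 224) * 4.0          # toy snapshot, CR = 8, binary masks
mask_sum = torch.full((1, 1, 224, 224), 4.0)  # sum of 8 random binary masks ~ 4

x_in = (y / mask_sum).repeat(1, 3, 1, 1)      # normalize and lift to 3 channels

backbone = resnet18(num_classes=10)           # fine-tune this on coded measurements
logits = backbone(x_in)
print(logits.shape)                           # torch.Size([1, 10])
```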
In summary, video CS presents new opportunities for more efficient visual information acquisition and processing. However, the widespread application of video CS in practical scenarios still presents significant challenges due to the existing infrastructure of hardware, algorithms, and workflows, which are primarily intended for conventional imaging paradigms. Nevertheless, with the maturation of supporting techniques and platforms, we anticipate the acceleration of progress in the coming years, leading to the revolutionary impact of video CS in the field of vision.
7. Conclusions
Techniques are developed either to extend human sensing capability or to enhance the capability of existing systems, and video CS serves both purposes: it aims to capture ultrafast scenes, mostly for the purpose of discovering new phenomena, while simultaneously enhancing the sensing capability of existing cameras, primarily to improve their efficacy. The adoption of this technology depends on the degree of efficacy video CS can deliver and how it compares with competing techniques, such as other methods of capturing high-speed scenes (e.g., the event camera) [111]. Bearing these considerations in mind, this paper reviewed a decade's progress in video CS, covering both hardware systems and reconstruction algorithms. Research gaps were also pointed out in order to shed light on future research topics.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61931012, 62171258, 62088102, and 62271414), the Zhejiang Provincial Outstanding Youth Science Foundation (LR23F010001), and the Key Project of Westlake Institute for Optoelectronics (2023GD007).
Compliance with ethics guidelines
Zhihong Zhang, Siming Zheng, Min Qiu, Guohai Situ, David J. Brady, Qionghai Dai, Jinli Suo, and Xin Yuan declare that they have no conflict of interest or financial conflicts to disclose.
References

[3] Peng YE, Veeraraghavan A, Heidrich W, Wetzstein G. Deep optics: joint design of optics and image recovery algorithms for domain specific cameras. In: Proceedings of the ACM SIGGRAPH 2020 Courses; 2020 Aug 17–28; online. New York City: Association for Computing Machinery; 2020. p. 1–133.
[4] Zhang B, Yuan X, Deng C, Zhang Z, Suo J, Dai Q. End-to-end snapshot compressed super-resolution imaging with deep optics. Optica 2022;9(4):451–454.
[5] Zhang Z, Dong K, Suo J, Dai Q. Deep coded exposure: end-to-end co-optimization of flutter shutter and deblurring processing for general motion blur removal. Photon Res 2023;11(10):1678–1686.
[6] Baek SH, Ikoma H, Jeon DS, Li Y, Heidrich W, Wetzstein G, et al. Single-shot hyperspectral-depth imaging with learned diffractive optics. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2021 Oct 11–17; online. New York City: IEEE; 2021. p. 2651–60.
[7] Yuan X, Brady DJ, Katsaggelos AK. Snapshot compressive imaging: theory, algorithms, and applications. IEEE Signal Process Mag 2021;38(2):65–88.
[8] Tang H, Men T, Liu X, Hu Y, Su J, Zuo Y, et al. Single-shot compressed optical field topography. Light Sci Appl 2022;11:244.
[9] Zhang Z, Deng C, Liu Y, Yuan X, Suo J, Dai Q. Ten-mega-pixel snapshot compressive imaging with a hybrid coded aperture. Photon Res 2021;9(11):2277–2287.
[10] Luo Y, Zhao Y, Li J, Rivenson Y, Jarrahi M, et al. Computational imaging without a computer: seeing through random diffusers at the speed of light. eLight 2022;2:4.
[11] Sinha A, Lee J, Li S, Barbastathis G. Lensless computational imaging through deep learning. Optica 2017;4(9):1117–1125.
[13] Candes EJ, Tao T. Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Trans Inf Theory 2006;52(12):5406–5425.
[14] Donoho D. Compressed sensing. IEEE Trans Inf Theory 2006;52(4):1289–1306.
[15] Yao H, Dai F, Zhang S, Zhang Y, Tian Q, Xu C. DR2-Net: deep residual reconstruction network for image compressive sensing. Neurocomputing 2019;359:483–493.
[16] Zhang J, Xiong T, Tran T, Chin S, Etienne-Cummings R. Compact all-CMOS spatiotemporal compressive sensing video camera with pixel-wise coded exposure. Opt Express 2016;24(8):9013–9024.
[17] Wei M, Sarhangnejad N, Xia Z, Gusev N, Katic N, Genov R, et al. Coded two-bucket cameras for computer vision. In: Proceedings of the Computer Vision–ECCV 2018; 2018 Sep 8–14; Munich, Germany. Berlin: Springer; 2018. p. 54–71.
[18] Liu Y, Yuan X, Suo J, Brady DJ, Dai Q. Rank minimization for snapshot compressive imaging. IEEE Trans Pattern Anal Mach Intell 2019;41(12):2990–3006.
[19] Yuan X, Liu Y, Suo J, Dai Q. Plug-and-play algorithms for large-scale snapshot compressive imaging. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020 Jun 13–19; Seattle, WA, USA. New York City: IEEE; 2020. p. 1444–54.
[20] Izadi S, Sutton D, Hamarneh G. Image denoising in the deep learning era. Artif Intell Rev 2023;56(7):5929–5974.
[21] Zhang K, Ren W, Luo W, Lai WS, Stenger B, Yang MH, et al. Deep image deblurring: a survey. Int J Comput Vis 2022;130(9):2103–2130.
[22] Rawat W, Wang Z. Deep convolutional neural networks for image classification: a comprehensive review. Neural Comput 2017;29(9):2352–2449.
[23] Zhu H, Wei H, Li B, Yuan X, Kehtarnavaz N. A review of video object detection: datasets, metrics and methods. Appl Sci 2020;10(21):7834.
[24] Jiao L, Wang D, Bai Y, Chen P, Liu F. Deep learning in visual tracking: a review. IEEE Trans Neural Netw Learn Syst 2021;34(9):5497–5516.
[25] Yuan X. Various plug-and-play algorithms with diverse total variation methods for video snapshot compressive imaging. In: Proceedings of the Artificial Intelligence: First CAAI International Conference; 2021 Jun 5–6; Hangzhou, China. Berlin: Springer; 2021. p. 335–46.
[26] Yuan X, Liu Y, Suo J, Durand F, Dai Q. Plug-and-play algorithms for video snapshot compressive imaging. IEEE Trans Pattern Anal Mach Intell 2021;44(10):7093–7111.
[27] Chen Y, Gui X, Zeng J, Zhao XL, He W. Combining low-rank and deep plug-and-play priors for snapshot compressive imaging. IEEE Trans Neural Netw Learn Syst. In press.
Wu Z, Zhang J, Mou C. Dense deep unfolding network with 3D-CNN prior for snapshot compressive imaging. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2021 Oct 11–17; Montreal, QC, Canada. New York City: IEEE; 2021. p. 4892–901.
[30] Yang C, Zhang S, Yuan X. Ensemble learning priors driven deep unfolding for scalable video snapshot compressive imaging. In: Proceedings of the Computer Vision–ECCV 2022; 2022 Oct 23–27; Tel Aviv, Israel. Berlin: Springer; 2022. p. 600–18.
[31] Suo J, Zhang W, Gong J, Yuan X, Brady DJ, Dai Q, et al. Computational imaging and artificial intelligence: the next revolution of mobile vision. Proc IEEE 2023;111(12):1607–1639.
[32] Kwan C, Chou B, Yang J, Rangamani A, Tran T, Zhang J, et al. Target tracking and classification using compressive measurements of MWIR and LWIR coded aperture cameras. JSIP 2019;10(3):73–95.
[33] Okawara T, Yoshida M, Nagahara H, Yagi Y. Action recognition from a single coded image. In: Proceedings of the 2020 IEEE International Conference on Computational Photography (ICCP); 2020 Apr 24–26; Saint Louis, MO, USA. New York City: IEEE; 2020. p. 1–11.
[34] Hu C, Huang H, Chen M, Yang S, Chen H. Video object detection from one single image through opto-electronic neural network. APL Photonics 2021;6(4):046104.
[35] Zhang Z, Zhang B, Yuan X, Zheng S, Su X, Suo J, et al. From compressive sampling to compressive tasking: retrieving semantics in compressed domain with low bandwidth. PhotoniX 2022;3:19.
[36] Shannon C. Communication in the presence of noise. Proc IRE 1949;37(1):10–21.
[37] Jalali S, Yuan X. Snapshot compressed sensing: performance bounds and algorithms. IEEE Trans Inf Theory 2019;65(12):8005–8024.
[38] Yuan X. Generalized alternating projection based total variation minimization for compressive sensing. In: Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP); 2016 Sep 25–28; Phoenix, AZ, USA. New York City: IEEE; 2016. p. 2539–43.
[39] Duarte MF, Davenport MA, Takhar D, Laska JN, Sun T, Kelly KF, et al. Single-pixel imaging via compressive sampling. IEEE Signal Process Mag 2008;25(2):83–91.
[40] Jalali S, Maleki A. From compression to compressed sensing. Appl Comput Harmon Anal 2016;40(2):352–385.
[41] Yuan X, Llull P, Liao X, Yang J, Brady DJ, Sapiro G, et al. Low-cost compressive sensing for color video and depth. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2014 Jun 23–28; Columbus, OH, USA. New York City: IEEE; 2014. p. 3318–25.
[42] Koller R, Schmid L, Matsuda N, Niederberger T, Spinoulas L, Cossairt O, et al. High spatio–temporal resolution video with compressed sensing. Opt Express 2015;23(12):15992–16007.
[43] Reddy D, Veeraraghavan A, Chellappa R. P2C2: programmable pixel compressive camera for high speed imaging. In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2011 Jun 21–23; Colorado Springs, CO, USA. New York City: IEEE; 2011. p. 329–36.
[44] Hitomi Y, Gu J, Gupta M, Mitsunaga T, Nayar SK. Video from a single coded exposure photograph using a learned over-complete dictionary. In: Proceedings of the 2011 International Conference on Computer Vision (ICCV); 2011 Nov 6–13; Barcelona, Spain. New York City: IEEE; 2011. p. 287–94.
[45] Liu D, Gu J, Hitomi Y, Gupta M, Mitsunaga T, Nayar S. Efficient space–time sampling with pixel-wise coded exposure for high-speed imaging. IEEE Trans Pattern Anal Mach Intell 2014;36(2):248–260.
[46] Qiao M, Meng Z, Ma J, Yuan X. Deep learning for video compressive sensing. APL Photonics 2020;5(3):030801.
[47] Guzmán F, Meza P, Vera E. Compressive temporal imaging using a rolling shutter camera array. Opt Express 2021;29(9):12787–12800.
[48] Vera E, Guzmán F, Díaz N. Shuffled rolling shutter for snapshot temporal imaging. Opt Express 2022;30(2):887–901.
[49] Sun Y, Yuan X, Pang S. High-speed compressive range imaging based on active illumination. Opt Express 2016;24(20):22836–22846.
[50] Guzmán F, Skowronek J, Vera E, Brady DJ. Compressive video via IR-pulsed illumination. Opt Express 2023;31(23):39201–39212.
[51] Luo Y, Jiang J, Cai M, Mirabbasi S. CMOS computational camera with a two-tap coded exposure image sensor for single-shot spatial–temporal compressive sensing. Opt Express 2019;27(22):31475–31489.
[52] Yoshida M, Sonoda T, Nagahara H, Endo K, Sugiyama Y, Taniguchi R. High-speed imaging using CMOS image sensor with quasi pixel-wise exposure. IEEE Trans Comput Imaging 2020;6:463–476.
Deng C, Zhang Y, Mao Y, Fan J, Suo J, Zhang Z, et al. Sinusoidal sampling enhanced compressive camera for high speed imaging. IEEE Trans Pattern Anal Mach Intell 2021;43(4):1380–1393.
Mur AL, Peyrin F, Ducros N. Recurrent neural networks for compressive video reconstruction. In: Proceedings of the 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI); 2020 Apr 3–7; Iowa City, IA, USA. New York City: IEEE; 2020. p. 1651–4.
[66] Ma X, Yuan X, Arce GR. High resolution LED-based snapshot compressive spectral video imaging with deep neural networks. IEEE Trans Comput Imaging 2023;9:869–880.
[67] Martel JNP, Muller LK, Carey SJ, Dudek P, Wetzstein G. Neural sensors: learning pixel exposures for HDR imaging and video compressive sensing with programmable sensors. IEEE Trans Pattern Anal Mach Intell 2020;42(7):1642–1653.
[68] Carey SJ, Lopich A, Barr DRW, Wang B, Dudek P. A 100,000 fps vision sensor with embedded 535 GOPS/W 256×256 SIMD processor array. In: Proceedings of the 2013 Symposium on VLSI Circuits; 2013 Jun 12–14; Kyoto, Japan. New York City: IEEE; 2013. p. C182–3.
[69] Sarhangnejad N, Katic N, Xia Z, Wei M, Gusev N, Dutta G, et al. 5.5 Dual-tap pipelined-code-memory coded-exposure-pixel CMOS image sensor for multi-exposure single-frame computational imaging. In: Proceedings of the 2019 IEEE International Solid-State Circuits Conference (ISSCC); 2019 Feb 17–21; San Francisco, CA, USA. New York City: IEEE; 2019. p. 102–4.
[70] Luo Y, Ho D, Mirabbasi S. Exposure-programmable CMOS pixel with selective charge storage and code memory for computational imaging. IEEE Trans Circuits Syst 2018;65(5):1555–1566.
[71] Shedligeri P, Anupama S, Mitra K. A unified framework for compressive video recovery from coded exposure techniques. In: Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV); 2021 Jan 3–8; Waikoloa, HI, USA. New York City: IEEE; 2021. p. 1599–608.
[72] Gulve R, Sarhangnejad N, Dutta G, Sakr M, Nguyen D, Rangel R, et al. A 39,000 subexposures/s CMOS image sensor with dual-tap coded-exposure data-memory pixel for adaptive single-shot computational imaging. In: Proceedings of the 2022 IEEE Symposium on VLSI Technology and Circuits; 2022 Jun 12–17; Honolulu, HI, USA. New York City: IEEE; 2022. p. 78–9.
[73] Gulve R, Rangel R, Barman A, Nguyen D, Wei M, Skar MA, et al. Dual-port CMOS image sensor with regression-based HDR flux-to-digital conversion and 80 ns rapid-update pixel-wise exposure coding. In: Proceedings of the 2023 IEEE International Solid-State Circuits Conference (ISSCC); 2023 Feb 19–23; San Francisco, CA, USA. New York City: IEEE; 2023. p. 104–6.
Qiao M, Liu X, Yuan X. Snapshot temporal compressive microscopy using an iterative algorithm with untrained neural networks. Opt Lett 2021;46(8):1888–1891.
[83] Yang J, Yuan X, Liao X, Llull P, Brady DJ, Sapiro G, et al. Video compressive sensing using Gaussian mixture models. IEEE Trans Image Process 2014;23(11):4863–4878.
[84] Wu Z, Yang C, Su X, Yuan X. Adaptive deep PnP algorithm for video snapshot compressive imaging. Int J Comput Vis 2023;131(7):1662–1679.
[85] Cheng Z, Lu R, Wang Z, Zhang H, Chen B, Meng Z, et al. BIRNAT: bidirectional recurrent neural networks with adversarial training for video snapshot compressive imaging. In: Proceedings of the Computer Vision–ECCV 2020; 2020 Aug 23–28; Glasgow, UK. Berlin: Springer; 2020. p. 258–75.
[86] Cheng Z, Chen B, Liu G, Zhang H, Lu R, Wang Z. Memory-efficient network for large-scale video compressive sensing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021 Jun 20–25; Nashville, TN, USA. New York City: IEEE; 2021. p. 16241–50.
[87] Wang Z, Zhang H, Cheng Z, Chen B, Yuan X. MetaSCI: scalable and adaptive reconstruction for video compressive sensing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021 Jun 20–25; Nashville, TN, USA. New York City: IEEE; 2021. p. 2083–92.
[88] Meng Z, Jalali S, Yuan X. GAP-Net for snapshot compressive imaging. 2020. arXiv:2012.08364.
[89] Ma J, Liu XY, Shou Z, Yuan X. Deep tensor ADMM-net for snapshot compressive imaging. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2019 Oct 27–Nov 2; Seoul, Republic of Korea. New York City: IEEE; 2019. p. 10222–31.
[90] Zhao Y, Zheng S, Yuan X. Deep equilibrium models for snapshot compressive imaging. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI); 2023 Feb 7–14; Washington, DC, USA. Palo Alto: Association for the Advancement of Artificial Intelligence; 2023. p. 3642–50.
[91] Zheng S, Yuan X. Unfolding framework with prior of convolution-transformer mixture and uncertainty estimation for video snapshot compressive imaging. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2023 Oct 2–6; Paris, France. New York City: IEEE; 2023. p. 12738–49.
[92] Wang L, Cao M, Yuan X. EfficientSCI: densely connected network with space–time factorization for large-scale video snapshot compressive imaging. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023 Jun 17–24; Vancouver, BC, Canada. New York City: IEEE; 2023. p. 18477–86.
[93] Iliadis M, Spinoulas L, Katsaggelos AK. Deep fully-connected networks for video compressive sensing. Digit Signal Process 2018;72:9–18.
[94] Dong W, Shi G, Li X, Ma Y, Huang F. Compressive sensing via nonlocal low-rank regularization. IEEE Trans Image Process 2014;23(8):3618–3632.
[95] Maggioni M, Boracchi G, Foi A, Egiazarian K. Video denoising, deblocking, and enhancement through separable 4D nonlocal spatiotemporal transforms. IEEE Trans Image Process 2012;21(9):3952–3966.
[96] Yang J, Liao X, Yuan X, Llull P, Brady DJ, Sapiro G, et al. Compressive sensing by learning a Gaussian mixture model from measurements. IEEE Trans Image Process 2015;24(1):106–119.
[97] Venkatakrishnan SV, Bouman CA, Wohlberg B. Plug-and-play priors for model based reconstruction. In: Proceedings of the 2013 IEEE Global Conference on Signal and Information Processing; 2013 Dec 3–5; Austin, TX, USA. New York City: IEEE; 2013. p. 945–8.
[98] Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn 2011;3(1):1–122.
[99] Liao X, Li H, Carin L. Generalized alternating projection for weighted-ℓ2,1 minimization with applications to model-based compressive sensing. SIAM J Imaging Sci 2014;7(2):797–823.
[100] Li Y, Qi M, Wei M, Genov R, Kutulakos KN, Heidrich W, et al. End-to-end video compressive sensing using Anderson-accelerated unrolled networks. In: Proceedings of the 2020 IEEE International Conference on Computational Photography (ICCP); 2020 Apr 24–26; Saint Louis, MO, USA. New York City: IEEE; 2020. p. 1–12.
[101] Zheng S, Yang X, Yuan X. Two-stage is enough: a concise deep unfolding reconstruction network for flexible video compressive sensing. 2022. arXiv:2201.05810.
[102] Wang L, Cao M, Zhong Y, Yuan X. Spatial–temporal transformer for video snapshot compressive imaging. IEEE Trans Pattern Anal Mach Intell 2022;45(7):9072–9089.
[103] Cao M, Wang L, Zhu M, Yuan X. Hybrid CNN-transformer architecture for efficient large-scale video snapshot compressive imaging. Int J Comput Vis 2024;132:4521–4540.
[104] Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015; 2015 Oct 5–9; Munich, Germany. Berlin: Springer; 2015. p. 234–41.
[105] Cheng Z, Chen B, Lu R, Wang Z, Zhang H, Meng Z, et al. Recurrent neural networks for snapshot compressive imaging. IEEE Trans Pattern Anal Mach Intell 2023;45(2):2264–2281.
[106] Cai Y, Zheng Y, Lin J, Yuan X, Zhang Y, Wang H. Binarized spectral compressive imaging. In: Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems (NeurIPS 2023); 2023 Dec 10; New Orleans, LA, USA. San Diego: NeurIPS Proceedings; 2023. p. 1–9.
[107] Wang P, Wang L, Yuan X. Deep optics for video snapshot compressive imaging. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2023 Oct 2–6; Paris, France. New York City: IEEE; 2023. p. 10646–56.
[108] Lu S, Yuan X, Shi W. Edge compression: an integrated framework for compressive imaging processing on CAVs. In: Proceedings of the 2020 IEEE/ACM Symposium on Edge Computing (SEC); 2020 Nov 11–13; San Jose, CA, USA. New York City: IEEE; 2020. p. 125–38.
[109] Lu S, Yuan X, Katsaggelos AK, Shi W. Reinforcement learning for adaptive video compressive sensing. ACM Trans Intell Syst Technol 2023;14(5):1–21.
[110] Bethi YRT, Narayanan S, Rangan V, Chakraborty A, Thakur CS. Real-time object detection and localization in compressive sensed video. In: Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP); 2021 Sep 19–22; Anchorage, AK, USA. New York City: IEEE; 2021. p. 1489–93.
[111] Gallego G, Delbruck T, Orchard G, Bartolozzi C, Taba B, Censi A, et al. Event-based vision: a survey. IEEE Trans Pattern Anal Mach Intell 2022;44(1):154–180.