1. Introduction
The rapid expansion of satellite constellations in recent years has resulted in the generation of massive amounts of data. This surge in data, coupled with diverse application scenarios, underscores the escalating demand for high-performance computing over space. Computing over space entails the deployment of computational resources on platforms such as satellites to process large-scale data under constraints such as high radiation exposure, restricted power consumption, and minimized weight.
For instance, computing over space can be applied to satellite remote sensing. As the ground resolution of remote-sensing images has improved from 10.0 to 0.3 m, the data volume for the same swath has increased by approximately 1000 times. However, the bandwidth of satellite–ground communications and the duration of the link while a satellite passes over the ground station are limited [
1], and onboard computing may save several days of transmission and processing time [
2], which is too long for specialized tasks such as emergency response. Computing over space can extract high-value information from huge amounts of data and significantly reduce the required transmission bandwidth and service time. Based on this concept, Prof. Deren Li from Wuhan University has proposed the Oriental Smart Eye (OSE) satellite constellation to provide real-time intelligent services for satellite remote-sensing information [
3].
In addition to remote sensing, satellite communication is a critical application of space computing. Onboard deployment of the core network is crucial for realizing satellite networks, and the performance of space computing has always been a key limiting factor for tasks such as signal processing, multiplexing, traffic management, and resource allocation [4,5,6]. Therefore, Prof. Shangguang Wang from Beijing University of Posts and Telecommunications has proposed the Tiansuan Constellation and conducted experiments on the onboard deployment of the core network [
7]. Communication satellites have numerous users, making network planning and optimization important. Prof. Ping Zhang has proposed semantic communication [
8], which can improve communication efficiency by extracting and utilizing semantic information; however, semantic communication will place even greater demands on onboard computing capability.
Furthermore, tasks such as autonomous collision avoidance for spacecraft [
9] and robotic space exploration [
10] require high-performance computing to complete complex algorithms such as artificial intelligence algorithms. To address this issue, Lumen Orbit has proposed the concept of the space data center [
11]. By leveraging advantages such as solar power resources and low-temperature cooling in space, a space data center can reduce the overall cost to about one-twentieth of that on the ground [11].
While the demand for computing over space is continuously growing, the performance of computing over space is typically low. For example, the RAD5500 processors that are commonly used in space have a performance of only 0.9 giga floating-point operations per second (GFlops). In contrast, the NVIDIA A100, a commercial off-the-shelf (COTS) chip commonly used on the ground, already achieves a performance of 156 tera floating-point operations per second (TFlops). As shown in
Fig. 1, the performance of the chips commonly used in satellites and the COTS chips on the ground consistently differ by three to four orders of magnitude, largely because of space radiation. The electronic components used in space usually require radiation hardening or radiation-resistant treatment to withstand the cumulative effects of radiation [
12].
The onboard use of COTS devices, along with system-level hardening measures to alleviate the reliability deficiencies of such devices, is an important technical approach to meet the increasingly higher demand for computing over space. In 2003, Behr et al. [
13] experimented with the use of COTS devices for computing over space. In 2021, Hewlett Packard Enterprise (HPE) and the National Aeronautics and Space Administration (NASA) collaborated to send the HPE Spaceborne Computer-2 to the International Space Station. This computer carried an NVIDIA T4 graphics processing unit (GPU), providing 65 tera operations per second (TOPS) of computing performance. Our team developed the Jiguang 1000 space intelligent computer, which is equipped with Cambricon neural processing unit (NPU) chips and achieves 32 TOPS of computing performance; the computer was launched aboard the Jilin-1 01A01 satellite in 2022. In 2024, Jiguang 1000-OSE was launched on the OSE 01 satellite. On this computing platform, we have implemented the inference of a visual large language model (VLLM). Beyond these examples, many other space computing systems employ similar COTS devices, including but not limited to those outlined in Refs. [14–18].
Although the use of COTS devices has improved the performance of computing over space, there is still a significant gap in performance between current space computing systems and the most advanced systems on the ground. To further develop computing over space, it is necessary to address the following key issues: first, to design a computing architecture and fault-tolerance measures to ensure reliability; second, to design effective thermal control systems for high-heat-flux-density COTS devices in the vacuum space environment; and third, to develop intelligent applications to meet diverse scenario requirements. The next section discusses these challenges and provides possible solutions.
2. Key technologies for computing over space
2.1. The computing architecture
The architecture of space computing systems plays a critical role in determining both performance and reliability. Over the past decades, guided by the pursuit of higher reliability, enhanced performance, reduced power consumption, and minimized costs, the development of computing architecture over space can be summarized into four phases, as shown in
Fig. 2:
•Phase 1: distributed embedded systems (DES), where the satellite system consists of independent embedded systems;
•Phase 2: integrated electronic systems (IES), which centralize multiple systems on a single platform;
•Phase 3: external intelligent systems (EIS), in which additional high-performance computing devices are added to IES, enabling the execution of complex algorithms such as artificial intelligence algorithms;
•Phase 4: integrated intelligent systems (IIS), where EIS and IES are unified into an integrated intelligent system, with lower power consumption, smaller volume, and higher performance.
The widespread adoption of EIS and IIS relies heavily on the increased reliability of space computing systems. In the extreme radiation environment of space, failure risks are high. Since COTS components have inherent reliability limitations, it is essential to analyze potential failure modes and system reliability and to carry out targeted fault-tolerance hardening at the system level. However, the complexity of the computing architecture increases the difficulty of reliability analysis. As shown in
Fig. 3, a space computing system may include multiple boards, such as main control boards, exchange boards, storage boards, and computing boards. Along with cold/hot redundant backup strategies, these boards form a complex interdependent system. To address the reliability analysis and hardening of such a complex system, a hierarchical fault-tolerance model is necessary. Modeling methods including Monte Carlo simulation [
19], vulnerability analysis [
20], state transition models [
21], reliability block diagrams [
22], and fault tree analysis [
23], in combination with simulation and testing methods including fault injection [
24], irradiation experiments, and system testing experiments, can quantify complex uncertainty factors into probability curves, providing a theoretical basis for fault-tolerant mechanism design.
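As an illustration of how Monte Carlo simulation can quantify complex uncertainty into reliability figures, the sketch below estimates the mission reliability of a hypothetical board configuration: a main control board with one backup, plus a 2-out-of-4 set of computing boards. All failure rates, the mission duration, and the simple parallel-redundancy model (which ignores the lower dormant failure rate of a true cold spare) are assumptions for illustration, not measured values or the authors' actual model.

```python
import math
import random

def board_survives(failure_rate_per_year, mission_years):
    # One Bernoulli draw under an exponential lifetime model:
    # survival probability is exp(-lambda * t).
    return random.random() < math.exp(-failure_rate_per_year * mission_years)

def system_survives(mission_years):
    # Main control board with one backup (hypothetical rate of
    # 0.10 failures per year), modeled simply as parallel redundancy:
    # the system survives if either copy survives.
    main_ok = (board_survives(0.10, mission_years)
               or board_survives(0.10, mission_years))
    # Four computing boards at an assumed 0.20 failures per year,
    # with at least two required (a k-out-of-n structure).
    computing_ok = sum(board_survives(0.20, mission_years)
                       for _ in range(4)) >= 2
    return main_ok and computing_ok

def monte_carlo_reliability(mission_years=3.0, trials=100_000):
    random.seed(42)  # reproducible estimate
    survived = sum(system_survives(mission_years) for _ in range(trials))
    return survived / trials

print(f"Estimated 3-year mission reliability: {monte_carlo_reliability():.3f}")
```

Such probability estimates, refined with measured failure rates from irradiation experiments and fault injection, provide the quantitative basis for choosing among redundancy strategies.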
To incorporate fault-tolerance hardening methods, a collaborative fault-tolerance scheme should be established across multiple levels: the component, system-architecture, operating-system, and algorithm levels. For example, at the component level, methods such as instruction-level time redundancy [
25] and multi-device redundancy [
26] can be used to improve error-correction capability at the cost of significant performance loss. At the system-architecture level, methods such as key-module redundancy, cold/hot backup, and watchdog timers [
27,
28] can be used to increase tolerance to critical component failures. At the software level, technologies such as cloud native [
29] or microkernel [
30] can improve the availability of the operating system. Finally, at the algorithm level, redundancy can be applied to data [
31] or neural network models [
32] to reduce silent data errors.
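The redundancy idea at the algorithm level can be sketched as a triple-modular-redundancy (TMR) majority vote over replicated computations. This is an illustrative toy, not the hardening scheme of any particular system described above; in a real deployment, the replicas would run on separate cores or devices so that a radiation-induced error corrupts at most one of them.

```python
from collections import Counter

def tmr_vote(replicas):
    # Majority vote over redundant results; masks a single silent
    # data error among three replicas.
    value, count = Counter(replicas).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one replica corrupted")
    return value

def run_with_tmr(compute, x):
    # Run the same computation three times and vote on the results.
    return tmr_vote([compute(x) for _ in range(3)])

print(run_with_tmr(lambda v: v * v, 7))  # three agreeing replicas
print(tmr_vote([49, 49, 53]))            # one corrupted replica is outvoted
```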
2.2. The thermal control system
In the vacuum space environment, with its huge temperature differences, the heat-dissipation capacity of the thermal control system is crucial to the performance of computing systems. Electronic components have operational temperature ranges, with industrial-grade devices typically rated from –40 to 85 °C. Exceeding these limits can lead to performance degradation or even component failure [
33]. Heat flux density measures the thermal power generated per unit area of a device; the higher the heat flux density is, the more challenging the heat dissipation becomes. Since computing performance, power consumption, and heat flux density are usually positively correlated, high-performance COTS devices exhibit high heat flux density. Moreover, compared with the environment of ground-based computing systems, the working environment of space computing is extremely harsh. The side of the satellite exposed to direct sunlight can reach temperatures of over 100 °C, while the temperature of the shaded side can drop to as low as –100 to –200 °C.
In space, common convective thermal control designs such as fins [34] are ineffective. Common heat-transfer methods for space computing devices include solid conduction, heat pipes, and fluid loops. Among these, solid conduction is the most commonly used method; it relies on the natural properties of materials and structural design to conduct heat, and supports a maximum heat flux density of around 20 W·cm⁻². Heat pipes consist of sealed pipes and internal working fluids that conduct heat through evaporation and condensation, circulating via capillary action or gravity. Fluid loops achieve heat transfer through fluid convection; they can support a higher heat flux density but are constrained in terms of weight, layout, micro-vibrations, and so forth [
35]. As a result, fluid loops are usually used in large spacecraft such as the International Space Station [
36] and the Shenzhou spacecraft [
37]. Radiation-hardened (rad-hard) devices typically have low power consumption; for instance, the RAD750 consumes only 5 W, for which solid conduction is sufficient. However, solid conduction cannot support high-performance GPU devices. For example, the NVIDIA A100 has a power consumption of around 300 W and a chip area of approximately 8.26 cm², giving a heat flux density of about 36.3 W·cm⁻². Thus, fluid loops may become an important solution for future computing over space.
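The heat-flux-density comparison is simple arithmetic, sketched here for concreteness using the approximate power and area figures quoted in the text:

```python
def heat_flux_density(power_w, area_cm2):
    # Thermal power generated per unit chip area, in W per cm^2.
    return power_w / area_cm2

SOLID_CONDUCTION_LIMIT = 20.0  # approximate maximum supported, W per cm^2

a100 = heat_flux_density(300.0, 8.26)  # NVIDIA A100: ~300 W over ~8.26 cm^2
print(f"A100 heat flux density: {a100:.1f} W/cm^2")
print("Exceeds solid-conduction limit:", a100 > SOLID_CONDUCTION_LIMIT)
```

The A100's roughly 36 W·cm⁻² is nearly double what solid conduction can handle, which is why higher-capacity methods such as fluid loops come into play.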
At present, due to considerations of reliability, weight, power consumption, cost, and so forth, few space computing systems utilize fluid loops. However, fluid loops have become an almost essential approach to achieving the highest possible computing capability in space, and substantial research gaps remain regarding their optimization. To reduce the impact of fluid loops on reliability, we propose a hybrid passive–active cooling (HPAC) method, as shown in
Fig. 4. The active cooling part is responsible for cooling high-power chips, while the passive cooling part is responsible for cooling low-power chips. The HPAC system ensures basic functionality even in case of fluid loop failures.
2.3. Applications
The use of high-performance COTS devices in space unlocks advanced data processing and analysis capabilities, making complex intelligent applications possible. For example, Jiguang 1000-OSE has implemented algorithms including object recognition, cloud image discrimination, and image compression, significantly increasing onboard data-utilization efficiency.
Future advancements in satellite application algorithms will improve data-fusion capabilities and operational efficiency, which are critical for time-sensitive applications such as disaster monitoring. Integrating large language models (LLMs) into satellite systems offers a promising solution for achieving intelligent information fusion and the natural language interpretation of human instructions. As illustrated in
Fig. 5(a), the proposed multimodal VLLM architecture demonstrates the advantages of bidirectional natural language communication with ground operators and automated analysis of remote-sensing imagery.
Based on the VLLM architecture, we constructed Jiguang VLLM and conducted text question-answering experiments on the Jiguang 1000-OSE platform. To address the relatively weak computing performance and limited bandwidth in the satellite–ground communication link, we have implemented optimization techniques, including model distillation and parameter quantization [
38]. As shown in
Fig. 5(b), the experiment successfully validated the feasibility of the VLLM's onboard inference.
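As a minimal illustration of parameter quantization, the sketch below applies symmetric per-tensor int8 quantization to a toy weight vector. The scheme and values are illustrative only and do not represent Jiguang VLLM's actual optimization pipeline, which is described in Ref. [38].

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: w ~= scale * q, where q is an
    # integer in [-127, 127]. Storing int8 instead of float32 cuts the
    # weight volume by 4x, easing both onboard memory and any
    # satellite-ground model transfer.
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights for inference.
    return [scale * v for v in q]

weights = [0.813, -1.27, 0.049, 0.44, -0.91]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
print(f"quantized: {q}, scale={scale:.5f}, max error={max_err:.4f}")
```

The rounding error per weight is bounded by half the scale, which is why quantization preserves model quality well when the weight distribution is not dominated by outliers.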
Further advancements and refinements are anticipated in the future. For example, the integration of federated learning within satellite computing networks offers a pathway for satellites to collaboratively leverage distributed data resources while maintaining data privacy and security [
39]. Moreover, the incorporation of emerging modules alongside customized, task-specific modules into configurable VLLM architectures has the potential to significantly increase the model’s adaptability and capacity for continuous evolution in dynamic and heterogeneous environments [
40]. Such advancements are expected to pave the way for more robust, scalable, and adaptive intelligent systems, thereby expanding the potential applications of VLLM in complex and dynamic spatial computing scenarios.
3. Conclusions
In conclusion, high-performance computing over space represents a transformative capability, enabling real-time data processing and analysis across diverse fields such as remote sensing and communications. This paper highlighted the pivotal role of COTS devices in advancing high-performance space computing; addressed critical technical challenges, including system reliability, thermal control, and applications; and proposed potential solutions. In regard to computing architecture, the evolution of EIS and IIS relies heavily on the increased reliability of COTS-based space computing systems. By implementing reliability analyses and fault-tolerant methods at multiple levels, the overall system reliability can be enhanced. In regard to thermal control systems, the challenges of computing over space are significant due to the extreme temperature differentials in space and the lack of air convection in a vacuum. The HPAC method holds promise as an important solution to heat-dissipation issues. On the application front, by converting massive image data into high-value natural language text, the VLLM opens up new possibilities for rapid information services. As these technologies mature, computing over space is poised to revolutionize fields such as autonomous exploration and space-based data centers.
CRediT authorship contribution statement
Yaoqi Liu: Writing – original draft. Yinhe Han: Writing – original draft. Hongxin Li: Writing – original draft. Shuhao Gu: Software. Jibing Qiu: Writing – original draft. Ting Li: Resources.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the National Key Research and Development Program of China (2022YFB3902802), in part by the Beijing Natural Science Foundation (L241013), and in part by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA000000).