This article presents views on the future development of data science, with a particular focus on its importance to artificial intelligence (AI). After discussing the challenges of data science, it elucidates a possible approach to tackle these challenges by clarifying the logic and principles of data related to the multi-level complexity of the world. Finally, urgently required actions are briefly outlined.
1. Challenges in scientific data systems
Scientific data systems are becoming increasingly important in research and development (R&D) and are receiving widespread attention from both academia and industry. Data has become a core driving force behind the rapid advancement of AI in recent years, playing a crucial role throughout the development, training, evaluation, and optimization of AI models. Data quality is therefore essential for building efficient, reliable, and applicable AI systems. Consequently, there is a growing expectation that data will play a foundational role in the future, especially in accurately and completely presenting a human understanding of the complex world.
In fact, scientific data is primarily accumulated through long-term research on multi-level, complex spatiotemporal dynamic processes, and the current understanding of these complex spatiotemporal structures remains incomplete. As a result, numerous challenging issues in the accumulation, modeling, and application of data have not been adequately addressed. Resolving these issues is crucial for the healthy and sustainable development of data science and will impose new demands on related scientific research in all disciplines and fields. We must pay sufficient attention to this issue and seriously address it.
Taking the current application of image recognition as an example, image data exhibits a hierarchical structure whose layers, from the bottom up, include pixels, edges, textures, parts, and objects. Each layer contains feature information at a different scale. This inherent hierarchical structure provides a natural framework for building image-recognition AI models. For example, convolutional neural networks (CNNs) [1] conduct image recognition in this order, processing images layer by layer from the bottom up [2].
It is thus evident that the collection and organization of scientific data should not ignore the data’s intrinsic logic. Furthermore, if the logic and architecture of a scientific data system can reflect the internal characteristics, structure, behavior, and functional relationships of the research subject, it will benefit the construction of AI models with higher accuracy, robustness, and interpretability. On the other hand, if the logic and architecture of the models, software, and corresponding hardware resources used for processing scientific data do not align with those of the data itself, it will likely result in significant model prediction errors, poor generalization ability, difficulties in mining causal relationships, increased modeling computational costs, larger training data requirements, and weakened model interpretability. There is an urgent need to address this challenging issue in current AI.
This challenge pertains not only to the long-term development of AI and data science; it is also an important, yet often overlooked, aspect of scientific research. For example, the data obtained by different researchers on the same phenomenon often differ, likely because levels were misclassified or omitted. More significantly, there is a tendency to apply averaging techniques to complex spatiotemporal structures, which neglects the most critical substantive content: for example, what are the relationships between the system, levels, and scales? This data issue has become a substantial obstacle to shifting research paradigms, addressing major challenges, and filling gaps in knowledge systems [3].
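The point about averaging can be illustrated with a small numerical sketch. The regime labels and values below are invented purely for illustration; they assume a quantity measured in two structurally different states of the same system, and they show that a single overall average can describe neither state.

```python
from statistics import mean

# Hypothetical measurements of one quantity, each tagged with the regime
# (structural state) in which it was taken. All values are made up.
samples = [
    ("regime_a", 0.05), ("regime_a", 0.07), ("regime_a", 0.06),
    ("regime_b", 0.55), ("regime_b", 0.60), ("regime_b", 0.58),
]


def average_by_regime(records):
    """Group values by regime label and average within each group."""
    groups = {}
    for regime, value in records:
        groups.setdefault(regime, []).append(value)
    return {regime: mean(values) for regime, values in groups.items()}


# Averaging everything together discards the regime information.
overall = mean(value for _, value in samples)
per_regime = average_by_regime(samples)

print(f"overall mean: {overall:.3f}")
for regime, avg in per_regime.items():
    print(f"{regime} mean: {avg:.3f}")
```

In this toy data set the overall mean falls between the two regime means and is far from every individual measurement, so a record that stores only the average loses the structure that the text identifies as the critical content.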
2. Scientific data collecting and processing should follow certain principles
In recent years, some progress has been made in understanding the principles of complexity, especially in exploring the common principle underlying complexity and the diversity of complexity. This has led to the concept of mesoscience [4], which has been applied in various systems. We believe that the complexity of a system is usually manifested as a multi-level complex structure, with each level comprising multiple scales (i.e., the element scale, mesoscale, and system scale). At each level, complexity likely emerges in the meso-regime at the mesoscale, between the element scale and the system scale. Complex systems are likely to be governed by at least two dominant mechanisms, and the compromise in competition (CIC) between these mechanisms is the origin of system complexity.
To address the data issues discussed above, considering the multi-level characteristics of complex systems and the fact that each level constitutes a multi-scale subsystem with attributes that interact with adjacent levels while simultaneously being relatively independent [4], future data collecting and processing should adhere to the following principles, in addition to meeting the requirements of existing conventional data specifications:
• When collecting data, it is necessary to clarify its possible multi-level characteristics and to identify and define the specific level of the collected data accurately, in order to prevent confusion and the misplacement of different levels of data.
• It is necessary to clarify the spatiotemporal structural characteristics of each level of data and to identify the key variables of intra-level interaction and adjacent-inter-level influence in order to ensure the integrity and reliability of data.
• At a specific level, considering the changes in its boundaries and operating conditions (including interaction between levels), there might be multiple operational regimes. Therefore, it is necessary to provide a clear expression of the critical conditions for transitions or abrupt changes between these different regimes.
• For dynamic structural data at specified levels that is currently unobtainable due to technological limitations, detailed annotations should be provided, space for improvement should be reserved, and users should be reminded to pay continuous attention to the missing information.
It should be noted that these points provide only a rough framework and do not include the details of the data system. A guide for practical implementation is needed that considers both the commonalities and the diversity of different disciplines and fields.
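As one possible illustration of how the four principles might be recorded in practice, the following minimal sketch defines a hypothetical record type and a simple completeness check. All class, field, and function names (`LevelRecord`, `RegimeTransition`, `check_record`), as well as the example values, are assumptions made for this sketch and do not come from any existing data specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The three scales named in the text for each level.
VALID_SCALES = {"element", "mesoscale", "system"}


@dataclass
class RegimeTransition:
    """Critical condition separating two operating regimes (third principle)."""
    from_regime: str
    to_regime: str
    critical_condition: str  # free-text or formula; format is an assumption


@dataclass
class LevelRecord:
    """One data set tied explicitly to a single level of the system."""
    level: str                                   # first principle: identify the level
    scale: str                                   # one of VALID_SCALES
    values: Dict[str, float]                     # measured quantities
    intra_level_variables: List[str] = field(default_factory=list)  # second principle
    inter_level_variables: List[str] = field(default_factory=list)  # second principle
    transitions: List[RegimeTransition] = field(default_factory=list)
    unobtainable: Dict[str, str] = field(default_factory=dict)  # fourth: name -> note


def check_record(record):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not record.level.strip():
        problems.append("level is not identified")
    if record.scale not in VALID_SCALES:
        problems.append(f"unknown scale: {record.scale!r}")
    if not record.intra_level_variables:
        problems.append("no intra-level key variables listed")
    if not record.inter_level_variables:
        problems.append("no inter-level key variables listed")
    for name, note in record.unobtainable.items():
        if not note.strip():
            problems.append(f"missing annotation for unobtainable quantity: {name}")
    return problems


complete = LevelRecord(
    level="level_2",
    scale="mesoscale",
    values={"quantity_x": 1.5},
    intra_level_variables=["quantity_x"],
    inter_level_variables=["boundary_flux"],
    transitions=[RegimeTransition("regime_a", "regime_b", "quantity_x > 2.0")],
    unobtainable={"local_structure": "not measurable with current sensors"},
)
incomplete = LevelRecord(level="", scale="average", values={"quantity_x": 1.5})

print(check_record(complete))
for problem in check_record(incomplete):
    print(problem)
```

The check only verifies that the required annotations are present; deciding whether a level has been identified correctly still requires domain knowledge, which is why a discipline-specific implementation guide remains necessary.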
With such a framework, the logic of AI models should be rearranged into a multi-level architecture. For example, large language models (LLMs) [5], [6] currently employ the transformer architecture [7], which processes text as a sequence of tokens, primarily focusing on the attention among these tokens. However, human-comprehensible text data typically has its own inherent logic and structure. Starting with words as the most basic elements, sentences, paragraphs, sections, and ultimately the entire document are constructed in a bottom-up manner. The structure and narrative logic of text clearly exhibit multi-level characteristics: semantic temporal relationships exist among elements at the same level, lower-level elements serve as the building blocks for upper levels, and semantic connections exist among different levels. If these structures and logic could be integrated into the construction of an LLM, the model would be able to capture richer and deeper semantic information from text, as well as the text's inherent logic, more effectively. This would be beneficial in enhancing the LLM's capabilities of text comprehension, sentence generation, and logical reasoning.
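The contrast between a flat token sequence and the multi-level structure of text can be made concrete with a small sketch. It only organizes data; it is not a model architecture, and the splitting rules (blank lines between paragraphs, sentence-ending punctuation, whitespace between words) and function names are simplifying assumptions made for illustration.

```python
import re


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


def build_hierarchy(document):
    """Return the document as nested lists: paragraphs -> sentences -> words."""
    paragraphs = [p for p in re.split(r"\n\s*\n", document) if p.strip()]
    return [
        [sentence.split() for sentence in split_sentences(paragraph)]
        for paragraph in paragraphs
    ]


def tokens_with_levels(tree):
    """Flatten the tree while keeping (paragraph, sentence, position) indices."""
    return [
        (word, p_idx, s_idx, w_idx)
        for p_idx, paragraph in enumerate(tree)
        for s_idx, sentence in enumerate(paragraph)
        for w_idx, word in enumerate(sentence)
    ]


text = "Data has levels. Levels have scales.\n\nModels should reflect both."
tree = build_hierarchy(text)
tokens = tokens_with_levels(tree)

print(len(tree), "paragraphs")
for token in tokens:
    print(token)
```

Each word keeps the indices of the sentence and paragraph that contain it, so a downstream model could, in principle, be given level information in addition to the position of a token in the flat sequence. How best to use that information inside an LLM is left open here, as it is in the text.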
3. The importance of studying the logic and architecture of data systems should be fully recognized
At present, the principles listed above are often ignored or inadequately addressed in data collecting and processing, which restricts the continuous development of data systems and even of AI. In fact, successful applications of AI have been achieved in fields where data architectures are defined relatively clearly, whereas AI performance is often unsatisfactory in fields with unclear data levels and structures, especially in engineering fields involving multiple levels of processes (e.g., industrial process systems). This observation highlights, from another perspective, the significance of the logic and architecture of data systems.
Therefore, researchers and practitioners in all sectors should pay full attention to the logic and architecture of data systems. In the future, continuous innovation and exploration in this area will be imperative. A standard protocol framework followed by an operation guide for global hierarchical structured data must be established in order to thoroughly address this issue. It is only in this way that we can set clear requirements for scientific research that generates and collects data, ensuring the gradual formation of a high-quality data ecosystem and thereby promoting the healthy development and efficient application of AI.
Furthermore, it is promising for both data science and AI to apply, throughout data collecting, analysis, and modeling, the principle that mesoscale complexity originates from multi-level, multi-scale CIC between dominant mechanisms [8].
In summary, in scientific research activities under the new paradigm, special attention should be paid to the multi-level structures of the complex systems being studied when collecting data, organizing data, and conducting AI analysis. It is essential to adhere strictly to the principle that the behavior and functional relationships of the data must follow the same logic and framework as the research object, a principle that imposes higher demands on interdisciplinary research. We should not be restrained by the inertia of long-established disciplinary divisions but should actively integrate the process of paradigm shifting into scientific research. In particular, the common requirements of data systems (that is, the common logic and landscape of knowledge systems) should permeate the entire research process, as well as data products in all disciplines and fields, to meet the new challenges of the AI era. At this time, attention should be paid to preventing the inclusion of illogical data in scientific data systems in all scientific fields. We consider this to be one of the most urgent issues the global scientific community must address.