A Hierarchical Task Graph Parallel Computing Framework for Chemical Process Simulation

Shifeng Qu, Shaoyi Yang, Wenli Du, Zhaoyang Duan, Feng Qian, Meihong Wang

Engineering, 2025, Vol. 51, Issue (8): 244–255.

DOI: 10.1016/j.eng.2024.06.019
Research article

Abstract

Sequential-modular-based process flowsheeting software remains an indispensable tool for process design, control, and optimization. Yet, as the process industry advances toward intelligent operation and maintenance, conventional sequential-modular-based process-simulation techniques involve computationally intensive calculations and substantial central processing unit (CPU) time, particularly in large-scale design and optimization tasks. To address these challenges, this paper proposes a novel process-simulation parallel computing framework (PSPCF). This framework achieves layered parallelism at the unit operation level for processes with recycles. Notably, PSPCF introduces the concept of formulating simulation problems as task graphs and uses Taskflow, an advanced task graph computing system, for the hierarchical parallel scheduling and execution of unit operation tasks. PSPCF also integrates an advanced work-stealing scheme that automatically balances thread resources against the workload of unit operation tasks. For evaluation, a simpler parallel column process and a more complex cracked gas separation process were simulated on a flowsheeting platform using PSPCF. The framework achieved significant time savings: over 60% reduction in processing time for the simpler process and a 35%–40% speed-up for the more complex separation process.


Keywords

Parallel computing / Process simulation / Task graph parallelism / Sequential modular approach


1. Introduction

As the environmental impact of the process industry gains increasing attention from the international community, smart and optimal manufacturing has been applied in the process industry to help factories reduce their energy consumption, resource consumption, and environmental pollution [1], in line with the demand for green production [2,3]. Central to efficient and smart manufacturing is the digitalization of industrial processes, which includes building digital factories for petrochemical enterprises via modeling technology [4] and achieving virtual manufacturing by combining process mechanisms [5]. Process simulation, an indispensable digitalization technology, is used extensively for designing and optimizing chemical processes. It is a comprehensive computation system based on multidisciplinary theories from chemical engineering, systems engineering, applied mathematical statistics, computational methods, and computer technology. As the process industry evolves toward more intelligent operation and maintenance, many commercial process flowsheeting simulators are increasingly challenged by long computation times and high computational costs, particularly for large-scale design and optimization problems [6]. Conventional process-simulation frameworks struggle to deliver timely results for large-scale design tasks or for the real-time optimization of chemical plants.

As single-processor performance approaches its limits, multi-core processors have been widely adopted to increase computing capacity, and significant research has been directed toward using parallel computing for challenging industrial tasks. The sequential-modular (SM) approach and the equation-oriented (EO) approach are the two predominant approaches in process simulation systems [7]. In EO-based systems, an overall plant model is formulated by assembling the mathematical models of the unit operations in the flowsheet, along with their connectivity relations, into a large system of equations [8]. Newton’s method is typically used to solve these nonlinear systems, enabling simultaneous optimization and convergence of the flowsheet [9]. In such systems, parallelism is often achieved through the parallel computation of matrices. In the 1980s, Ortega and Voigt [10] provided an extensive review of numerical methods for solving partial differential equations on vector and parallel computers. Modular integration methods have been proposed as viable alternatives to integrating a single massive set of equations when simulating large systems, offering the possibility of simulating the dynamics of very large-scale process systems on multiple processors [11]. Recent studies have further developed these methods. Ma et al. [12] developed a parallel computation method for large-scale EO models that divides the large-scale nonlinear equations into several groups and uses multiple threads for simultaneous function evaluations. Weng et al. [13] presented a multi-thread parallel computation method to compute the molecular weight distribution of multisite free-radical polymerization. Carrasco and Lima [14] proposed bi-level and parallel programming-based approaches and applied them to nonlinear, high-dimensional modular systems to reduce the computational time associated with system dimensionality. Lu et al. [15] developed a coarse-grained discrete particle method for the simulation of gas–solid flows that takes full advantage of central processing unit (CPU)–graphics processing unit (GPU) hybrid supercomputing. Most studies exploit the sparsity of the high-order matrices arising in large chemical process simulations to reorder and partition them, assigning each sub-matrix to a separate computer or thread for independent computation. Parallelism has also been adopted to accelerate flowsheet optimization and synthesis. To improve the efficiency of stochastic optimization for distillation processes, Lyu et al. [16] used a pool model to distribute the population into groups on different threads, thus making full use of a multi-core CPU. However, most of these parallel solutions are based on the EO approach and offer limited flexibility in problem formulation. Parallel computation for generic SM-based process simulation remains relatively unexplored in the literature.

Unlike the EO approach, SM-based process simulation involves a more extensive iterative procedure for large-scale design and optimization problems, which can make computation times prohibitive. Nevertheless, the SM approach remains the preferred simulation approach in most commercially available process flowsheeting software, such as Aspen Plus (Aspen Technology, Inc., USA), Aspen HYSYS (Aspen Technology, Inc.), AVEVA PRO/II (Schneider Electric, USA), UniSim (Honeywell International, Inc., USA), and Petro-SIM (KBC Advanced Technologies plc., USA) [17,18]. The reasons for this preference are multifaceted:

•In SM-based process simulation, the data flow across unit operations aligns with the material flow in real chemical processes, matching the intuitive experience of engineers.

•Programming, maintenance, and expansion of an SM-based simulation system are usually convenient.

•Compared with the EO approach, it is easier to identify the causes of simulation failure in the SM framework.

•Simulation computing under the SM approach requires less computer memory than the EO-based framework.

•The convergence stability of the SM-based framework is less dependent on initial guesses compared with the EO approach [19].

As highlighted above, the SM approach has a distinct advantage over the EO approach in solving systems of nonlinear equations from poor initial guesses [20,21]. This is particularly important in the early stages of modeling, when model variable values are largely unknown and initial guesses may be far from the solution. For robustness, several EO-based process flowsheeting packages perform an initial SM simulation to provide better starting values for the variables in large systems of nonlinear equations [22]. In summary, SM-based flowsheeting systems remain indispensable for process-simulation tasks in today’s chemical and petrochemical industries.

To address the time-consuming nature of SM-based computation and accelerate simulations, SM-based flowsheeting systems need a parallel computing framework that exploits the multi-core processors now in widespread use. Under the SM approach, the data flow of a flowsheet’s solving process is consistent with the material flow between unit operations. Thus, the computation of a flowsheet can be naturally represented as a task graph of independent stages that communicate explicitly over data channels [23]. A well-configured task graph can be executed in parallel through task graph computing systems (TGCSs). TGCSs, which play a crucial role in large-scale scientific computing, encapsulate function calls and their dependencies in a top-down task graph, enabling irregular parallel decomposition strategies that scale to many processors, including multi-core CPUs. TGCSs have attracted significant research interest in recent years, with systems such as Fastflow [23], StarPU [24], task parallel library (TPL) [25], Legion [26], Kokkos directed acyclic graph (DAG) [27], PaRSEC [28], HPX [29], and Taskflow [30] achieving success in a variety of scientific computing applications, including machine learning and simulation.

In this paper, we propose an efficient process-simulation parallel computing framework (PSPCF) based on a TGCS. To the best of our knowledge, it is the first generic computing framework that enables an SM-based system to support parallel computing using a TGCS, and it can be easily integrated into existing process modeling systems to achieve parallelism. PSPCF parallelizes SM-based chemical process simulations at the unit operation level and significantly speeds up computation. It offers coarser-grained parallelism than most published EO-based parallel computing strategies, providing higher scalability and generalizability. Through a hierarchical task graph generation system, the proposed framework also efficiently parallelizes processes with recycles. Work-stealing schemes are integrated to automatically balance thread resources against task workload.

The rest of this paper is organized as follows. Section 2 discusses converting an SM-based simulation problem into a task graph supporting parallel computing. Section 3 covers the details of Taskflow, an advanced TGCS integrated into the proposed framework. Section 4 introduces the main systems and workflow of PSPCF. In Section 5, PSPCF is applied to two process flowsheets to evaluate computation performance. Finally, a conclusion is drawn in Section 6.

2. Converting a process flowsheet to a task graph: A new perspective to solve SM simulation

For process simulation under the SM approach, each chemical process can be represented by linking a set of standard unit operation models in arbitrary and flexible networks with material and energy streams [31]. A dedicated subprogram is prepared for each type of unit operation, such as flash units, distillation columns, absorbers, various reactors, compressors, pumps, and valves. This subprogram encompasses the corresponding mathematical model equations and model-solving programs, as illustrated in Fig. 1. The simulation system then solves the overall process model by calling each unit operation in sequence according to their interconnections [32]. Once supplied with the feed stream’s physical properties, unit structure parameters, and operating parameters, each unit calculates its outputs, which are then passed as inputs to the succeeding units. Notably, the computation in such an SM-based simulation is a natural fit for stream parallelism, a programming paradigm that supports the parallel execution of a sequence of tasks using a set of sequential or parallel stages. A process flowsheet can be naturally represented as a task graph of independent tasks that communicate explicitly over data channels, and parallelism is achieved by executing independent stages of the graph simultaneously. By converting a chemical process flowsheet into a task graph, its calculation becomes a series of transformations on the data stream.

The granularity of parallelism significantly influences computing speed [33]. One fine-grained option for process simulation is to parallelize at the level of physical property algorithms, which can make effective use of multi-core computing resources. However, the interactions between physical property calculations and unit operation objects are complex, which complicates the parallel design. Implementing parallel computing at this level requires discarding traditional object-oriented structures and code and reorganizing numerous dependencies between algorithms. The same limitation applies to most parallel strategies in EO-based systems, which therefore fail to offer scalable and generalized solutions.

To avoid such arduous reconfiguration, parallelism at the unit operation level is more feasible. To achieve this, the entire process simulation is divided into a series of subtasks that can be computed independently. With this decomposition, a process-simulation program can be represented by a task graph with nodes (unit operations) and edges (dependencies). If the simulation task can be converted into a task graph problem, it becomes feasible to execute it using an advanced TGCS for parallelism.
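The conversion described above can be sketched in C++, the language Taskflow is written in, using only the standard library. In this minimal sketch, the `Flowsheet` type, its fields, and the unit names are hypothetical illustrations, not the PSPCF data model. Each unit becomes a node and each stream becomes a dependency edge. Kahn's topological sort then either yields a valid sequential execution order or signals that the graph contains a recycle.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical flowsheet representation (not from the paper): unit names plus
// directed stream connections from an upstream unit to a downstream unit.
struct Flowsheet {
    std::vector<std::string> units;
    std::vector<std::pair<std::string, std::string>> streams;
};

// Kahn's algorithm: returns an order in which every unit runs after all of its
// upstream units. An empty result for a non-empty flowsheet means the graph
// contains a cycle (a recycle) that needs separate treatment.
std::vector<std::string> executionOrder(const Flowsheet& fs) {
    std::map<std::string, int> indegree;
    std::map<std::string, std::vector<std::string>> successors;
    for (const auto& u : fs.units) indegree[u] = 0;
    for (const auto& s : fs.streams) {
        assert(indegree.count(s.first) && indegree.count(s.second));  // known units only
        successors[s.first].push_back(s.second);
        ++indegree[s.second];
    }
    std::queue<std::string> ready;
    for (const auto& u : fs.units)
        if (indegree[u] == 0) ready.push(u);  // units with no upstream dependency

    std::vector<std::string> order;
    while (!ready.empty()) {
        std::string u = ready.front();
        ready.pop();
        order.push_back(u);
        for (const auto& v : successors[u])
            if (--indegree[v] == 0) ready.push(v);  // all inputs now available
    }
    if (order.size() != fs.units.size()) return {};  // cycle detected
    return order;
}
```

Returning an empty order for a cyclic flowsheet is a deliberate simplification. A complete system would instead pass the cyclic part to partitioning and tearing.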

Consider the simple chemical process flowsheet in Fig. 2, where each unit operation takes the outputs of its preceding units as inputs. For instance, the simulation task of unit operation Heater_1 can only commence after the computations of Comp_1 and Flash_1 are complete. In contrast, for parallel unit operations such as Heater_1 and Valve_1, the execution of one unit is independent of the other. Thus, three unit operations can be offloaded to different threads and executed simultaneously. Although independent unit operations offer potential for parallel computation, allocating the numerous tasks of a large-scale process simulation to multiple threads while respecting dependencies and achieving load balancing remains a challenge.
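The sketch below illustrates how independent units such as Heater_1 and Valve_1 can run at the same time. It groups tasks into dependency levels and launches each level with `std::async`. This is a simplification under stated assumptions: the `UnitTask` type is hypothetical, and the topology in the test is illustrative rather than that of Fig. 2. The barrier between levels also wastes some parallelism. A TGCS avoids that waste by releasing each task as soon as its own dependencies are met and by balancing load across threads.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <future>
#include <string>
#include <vector>

// One unit-operation task: a callable standing in for the unit's simulation
// subprogram (hypothetical) plus the indices of the tasks it depends on.
struct UnitTask {
    std::string name;
    std::function<void()> simulate;
    std::vector<int> deps;
};

// Groups tasks into levels: a task's level is one more than the highest level
// among its dependencies, so tasks in the same level are mutually independent.
// Assumes tasks are listed in topological order (every dependency comes first).
std::vector<std::vector<int>> levels(const std::vector<UnitTask>& tasks) {
    std::vector<int> lvl(tasks.size(), 0);
    std::vector<std::vector<int>> out;
    for (int i = 0; i < static_cast<int>(tasks.size()); ++i) {
        for (int d : tasks[i].deps) {
            assert(d < i);
            lvl[i] = std::max(lvl[i], lvl[d] + 1);
        }
        if (static_cast<int>(out.size()) <= lvl[i]) out.resize(lvl[i] + 1);
        out[lvl[i]].push_back(i);
    }
    return out;
}

// Launches every task of a level on its own thread via std::async and waits
// for the whole level before starting the next one.
void runLevelByLevel(const std::vector<UnitTask>& tasks) {
    for (const auto& level : levels(tasks)) {
        std::vector<std::future<void>> running;
        for (int i : level)
            running.push_back(std::async(std::launch::async, tasks[i].simulate));
        for (auto& f : running) f.get();  // rethrows any exception from a unit
    }
}
```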

Parallelism in process simulation becomes even more complex in the presence of cyclical structures, which are common in chemical processes. Simulating a cyclical structure requires the process computing system to handle all unit operations within a recycle together [34]. For this purpose, recycle blocks (indicated by the red box in Fig. 2) must first be identified through the partitioning technique [35], which separates maximal cyclic nets based on the flowsheet topology. After partitioning, the calculation sequence within each recycle is determined using tearing algorithms [36]. The iterative process, in which tear stream variables are assumed and recalculated until satisfactory agreement is reached, is critical in simulating processes with recycles [37]. Likewise, units that depend on the results of a recycle can only be simulated after the iterative computation of that preceding recycle is complete. Therefore, all unit operations within a recycle should be treated as a single entity and executed alongside other units outside recycles at the same level.
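Partitioning into maximal cyclic nets corresponds to finding the strongly connected components (SCCs) of the unit-connectivity graph. The sketch below uses Tarjan's algorithm, one standard way to compute SCCs, and keeps only components that contain a cycle as recycle blocks. The text does not state which partitioning algorithm PSPCF uses. The graph encoding (adjacency lists over integer unit ids) and the function names are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Tarjan's algorithm: returns the strongly connected components of a directed
// graph given as adjacency lists (each component sorted by node id).
std::vector<std::vector<int>> stronglyConnected(const std::vector<std::vector<int>>& adj) {
    int n = static_cast<int>(adj.size()), counter = 0;
    std::vector<int> index(n, -1), low(n, 0), stack;
    std::vector<bool> onStack(n, false);
    std::vector<std::vector<int>> comps;

    std::function<void(int)> visit = [&](int v) {
        index[v] = low[v] = counter++;
        stack.push_back(v);
        onStack[v] = true;
        for (int w : adj[v]) {
            if (index[w] == -1) {
                visit(w);
                low[v] = std::min(low[v], low[w]);
            } else if (onStack[w]) {
                low[v] = std::min(low[v], index[w]);
            }
        }
        if (low[v] == index[v]) {  // v is the root of a component
            std::vector<int> comp;
            int w;
            do {
                w = stack.back();
                stack.pop_back();
                onStack[w] = false;
                comp.push_back(w);
            } while (w != v);
            std::sort(comp.begin(), comp.end());
            comps.push_back(comp);
        }
    };
    for (int v = 0; v < n; ++v)
        if (index[v] == -1) visit(v);
    return comps;
}

// Recycle blocks: components with more than one unit, or a unit that feeds itself.
std::vector<std::vector<int>> recycleBlocks(const std::vector<std::vector<int>>& adj) {
    std::vector<std::vector<int>> blocks;
    for (const auto& c : stronglyConnected(adj)) {
        bool selfLoop = c.size() == 1 &&
            std::find(adj[c[0]].begin(), adj[c[0]].end(), c[0]) != adj[c[0]].end();
        if (c.size() > 1 || selfLoop) blocks.push_back(c);
    }
    return blocks;
}
```

All units that are not in a returned block can be scheduled as ordinary acyclic tasks. Each block becomes a single entity at the same level.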

A process without cyclical structures can be mapped to a directed acyclic task-dependency graph comprising the simulation subroutines of the unit operations in the flowsheet. By transforming the flowsheet into a task-dependency graph, chemical process-simulation problems can be executed efficiently with existing TGCSs, making simulation parallelism readily accessible. Tasks that depend on each other must be executed sequentially according to their dependencies. Most TGCSs are equipped with advanced work-stealing algorithms to schedule tasks between threads, thereby addressing load imbalance [38]. It is therefore feasible to develop a simulation parallel computing system based on a TGCS that allocates thread resources to tasks according to their dependencies and computes mutually independent units in parallel. Nevertheless, for processes with recycles, the cyclic execution of iterative procedures exceeds the capability of the traditional DAG models commonly used in TGCSs. Taskflow, a recent TGCS, addresses this limitation: it not only incorporates an advanced work-stealing scheme but also provides tasking models that support the iterative execution of recycles and the construction of hierarchical task-dependency graphs. The next section details Taskflow and its integration into the proposed framework.

3. An advanced TGCS: Taskflow

In recent years, extensive research on TGCSs has achieved notable success in various scientific computing applications, including machine learning, data analysis, and simulation. Taskflow, developed by Huang et al., is a TGCS that supports various types of tasking models [30]. Two of its models, composable tasking and conditional tasking, make it well suited to the iterative execution of recycle processes. For this reason, Taskflow was selected as the simulation runtime environment for our proposed framework. Taskflow is designed with two levels of scheduling, task-level and worker-level, which are elaborated upon in this section.

3.1. Task-level scheduling

Task-level scheduling involves devising a feasible execution plan for in-graph control flow and efficiently dispatching tasks in a task graph to workers based on their dependencies. Among the various tasking models in Taskflow, three fundamental tasking models are integrated into our framework: static tasking, conditional tasking, and composable tasking.

Static tasking, the most basic task type in Taskflow, runs a callable with no arguments. Conditional tasking, a distinctive feature of Taskflow, addresses the inability of other TGCSs to express cyclic control flow beyond DAGs. With cyclic control flow, condition tasks support the iterative execution of recycles in simulated process flowsheets. The framework proposed in this paper benefits from this ability to make cyclic control flow decisions without partitioning or unrolling the control flow into a flat DAG. Fig. 3 illustrates an if-else control flow in Taskflow, where a condition task precedes two static tasks, B and D, and directs execution to one of them based on its return value. Composable tasking allows developers to define task hierarchies and construct large task graphs from modular blocks. With composable tasking, users can break down a heavy parallel workload into small module tasks that carry their own task-dependency graphs, which greatly facilitates modularity in parallel task programming. This model is particularly suitable for modularizing recycle blocks in chemical process flowsheets, as shown in Fig. 4, where the top-level task graph (i.e., the main graph) defines one static task A that runs before a module task composing a subgraph of three dependent tasks mA, mB, and mC.
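The following minimal, sequential sketch makes the composable tasking idea concrete. A task is either a plain callable or a whole nested graph, mirroring the structure of Fig. 4. The `Task` and `Graph` types are hypothetical and are not Taskflow's API; in Taskflow itself, a module task is created from another taskflow. Tasks run in list order here, so the sketch shows only the hierarchy, not parallel scheduling.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <vector>

struct Graph;  // forward declaration so a task can refer to a subgraph

// A task is either a static task (a plain callable) or a module task that
// wraps a whole subgraph. Hypothetical types, not Taskflow's API.
struct Task {
    std::function<void()> work;     // body of a static task
    std::shared_ptr<Graph> module;  // non-null => module task
};

// A graph whose tasks are listed in dependency order and run sequentially.
struct Graph {
    std::vector<Task> tasks;
    void run() const {
        for (const auto& t : tasks) {
            if (t.module) t.module->run();  // descend one level into the subgraph
            else if (t.work) t.work();
        }
    }
};
```

In the test, a top-level graph runs static task A and then a module task whose subgraph holds mA, mB, and mC. A recycle block can be wrapped as one node in the same way.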

For task-level scheduling, static and composable tasks serve as the basic nodes of a task-dependency graph, while conditional tasking is the key to implementing control flow. The execution logic between condition tasks and other tasks is differentiated using two dependency notations: weak dependency (dashed arrows in Fig. 3) and strong dependency (solid arrows in Fig. 3). The task scheduler first executes tasks with zero dependencies (those not reliant on any other task’s results). It then runs each task as soon as all of its remaining strong dependencies are met or, in the case of a weak dependency, jumps directly to the successor indexed by the return value of a condition task.
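The scheduling rule above can be modelled with a small single-threaded worklist. This sketch uses a hypothetical `CNode` type and `schedule` function and is written from the semantics described here, not from Taskflow's source. The rules are:

- a node's join counter counts only its strong predecessors;
- sources are nodes with no incoming edge of any kind;
- a static task decrements its successors' counters;
- a condition task returns an index and jumps straight to that successor.

Re-arming a node's counter each time it runs is what allows a loop body to execute repeatedly.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical node type: a static task (work) or a condition task (condition).
struct CNode {
    std::function<void()> work;      // static task body (may be empty)
    std::function<int()> condition;  // set => condition task
    std::vector<int> successors;
};

void schedule(const std::vector<CNode>& g) {
    int n = static_cast<int>(g.size());
    std::vector<int> strongIn(n, 0), anyIn(n, 0);
    for (const auto& node : g)
        for (int s : node.successors) {
            ++anyIn[s];
            if (!node.condition) ++strongIn[s];  // edges out of a condition task are weak
        }
    std::vector<int> join = strongIn;
    std::queue<int> ready;
    for (int v = 0; v < n; ++v)
        if (anyIn[v] == 0) ready.push(v);  // zero-dependency sources

    while (!ready.empty()) {
        int v = ready.front();
        ready.pop();
        join[v] = strongIn[v];  // re-arm so v can run again in a later iteration
        if (g[v].condition) {
            int k = g[v].condition();
            if (k >= 0 && k < static_cast<int>(g[v].successors.size()))
                ready.push(g[v].successors[k]);  // weak dependency: jump directly
        } else {
            if (g[v].work) g[v].work();
            for (int s : g[v].successors)
                if (--join[s] == 0) ready.push(s);  // strong dependency satisfied
        }
    }
}
```

In the test, node 0 is a source, node 1 is a loop body, and node 2 is a condition task that returns 0 (back to the body) until three iterations have run and then returns 1 (exit to node 3).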

3.2. Worker-level scheduling

At the worker level, Taskflow leverages work-stealing to execute task graphs with dynamic load balancing. Work-stealing is a scheduling strategy for balancing thread resources against task workload. It addresses the problem of executing a dynamically multithreaded computation, one that can “spawn” new threads of execution, on a statically multithreaded computer with a fixed number of cores (or workers). Work-stealing scheduling has received extensive research interest over the past few decades [39–41]. However, most work-stealing approaches struggle with solution generality and performance scalability. With a novel worker-management scheme, the adaptive work-stealing scheduler in Taskflow adapts the number of active worker threads to the dynamically generated task parallelism, thus overcoming the limitations of complex data structures. This scheme reduces competition for tasks between threads during the execution of task-dependency graphs and assigns executable tasks to appropriate threads with higher probability.

The scheme is illustrated in Fig. 5. The task-dependency graph (left), taken from a scientific computing process, shows tasks (nodes) and dependencies (edges), and the work-stealing scheduler architecture in Taskflow (right) shows how threads are managed. Each worker has a private deque of ready tasks. A worker can push and pop tasks at one end of its deque, while other workers can steal tasks from the opposite end. Thieves, created by the scheduler, are responsible for this stealing. During initialization, the scheduler creates CPU workers ready for tasks. The workers then start executing the task graph, with a lock protecting the main queue from the concurrent submission of task graphs. Generally, a worker thread takes tasks from the head of its own deque. When it completes a task, the scheduler automatically decrements the remaining dependency counts of the succeeding tasks and pushes those whose dependencies are satisfied onto this worker’s deque. If the deque is empty, tasks are taken from other busy workers’ deques or from the main queue. Based on the task load, the number of thieves is adjusted to reduce resource waste or to increase active workers, up to a maximum.
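The deque discipline described above (the owner works at one end, thieves take from the other) can be sketched with a mutex-protected `std::deque` per worker. This simplified illustration rests on several assumptions:

- jobs are independent and all start on worker 0;
- the `WorkDeque` and `runWithStealing` names are hypothetical;
- the number of workers is fixed rather than adaptive.

Production work-stealing schedulers typically use lock-free deques (for example, the Chase-Lev design) instead of a lock.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <thread>
#include <utility>
#include <vector>

using Job = std::function<void()>;

class WorkDeque {
public:
    // Owner side: push and pop at the back (most recently pushed job first).
    void push(Job j) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push_back(std::move(j));
    }
    std::optional<Job> pop() {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return std::nullopt;
        Job j = std::move(q_.back());
        q_.pop_back();
        return j;
    }
    // Thief side: steal from the front (oldest job first).
    std::optional<Job> steal() {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return std::nullopt;
        Job j = std::move(q_.front());
        q_.pop_front();
        return j;
    }
private:
    std::mutex m_;
    std::deque<Job> q_;
};

// Runs all jobs with `workers` threads; idle workers steal from the others.
void runWithStealing(std::vector<Job> jobs, unsigned workers) {
    assert(workers > 0);
    std::vector<WorkDeque> deques(workers);
    std::atomic<std::size_t> remaining{jobs.size()};
    for (auto& j : jobs) deques[0].push(std::move(j));  // all work starts on worker 0

    auto loop = [&](unsigned self) {
        while (remaining.load() > 0) {
            std::optional<Job> job = deques[self].pop();
            for (unsigned k = 1; !job && k < workers; ++k)
                job = deques[(self + k) % workers].steal();  // try each victim once
            if (job) {
                (*job)();
                remaining.fetch_sub(1);
            } else {
                std::this_thread::yield();  // nothing available right now
            }
        }
    };
    std::vector<std::thread> threads;
    for (unsigned w = 0; w < workers; ++w) threads.emplace_back(loop, w);
    for (auto& t : threads) t.join();
}
```

Stealing from the opposite end keeps the owner and the thieves working on different parts of the deque, which reduces contention between them.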

The executor, which carries out worker-level scheduling, is an object that manages worker threads and executes task graphs through an efficient work-stealing algorithm. The number of workers can be set through a call that takes an unsigned integer as a parameter. Notably, the executor treats module tasks spawned by composable tasking differently from other types of tasks. When a module task is assigned to a worker thread, its execution demands exclusive use of that thread, and the static tasks inside the module task must be performed by a separate worker thread. Thus, the executor always spawns an extra worker thread when executing a module task. Even when the number of workers is set to 1, a second thread is still spawned if a module task is encountered.

In our proposed computing framework, this adaptive worker-level scheduling scheme accelerates the execution of the process-simulation task-dependency graph.

4. Process-simulation parallel computing framework

In this section, we introduce PSPCF, which comprises a main graph setting system (MGSS) and a recycle subgraph generation system (RSGS). PSPCF parallelizes process-simulation computation at the unit operation level and accelerates parallel execution through a work-stealing mechanism. We have implemented an iterative procedure for recycle computing by applying Taskflow’s in-graph control flow to our framework. By setting the control flow node parameters, users can specify each recycle’s convergence acceleration method, convergence criteria, and maximum number of iterations. The proposed parallel computing framework is hierarchical, enabling layered parallelism in process-simulation calculation.

4.1. Main graph setting system

The MGSS constructs a hierarchical executable task graph from a simulated process flowsheet, as shown in Fig. 6. The process-simulation flowsheet (top) is converted to the main graph (bottom) comprising all the simulation tasks of the unit operations in the flowsheet (i.e., simulation task nodes). These simulation task nodes, supported by Taskflow’s static tasking mechanism, contain subprograms for unit operations, including mathematical model equations and solving algorithms. The subprograms are executed when the corresponding unit operations are simulated. The MGSS operates differently depending on whether the simulated process flowsheet includes a recycle loop.

4.1.1. Process flowsheet without recycles

When MGSS receives a configured process flowsheet without recycles (top in Fig. 6), it creates simulation task nodes of unit operations and material streams, linking them to their corresponding simulation subprograms prepared during preprocessing [31]. Dependencies between simulation task nodes are established according to the global process topology, with the main graph outputting all simulation task nodes and their dependencies.

4.1.2. Process flowsheet with recycles

For flowsheets containing recycle loops, the unit operations inside the loops require iterative solving. These loops, known as recycle blocks (indicated by the red box in Fig. 7), necessitate a two-level hierarchical computing structure. At the higher level, out-of-block tasks are executed together with a new type of node, the subgraph task node, which contains all the tasks inside a specific recycle block (i.e., in-block tasks). The in-block tasks are performed at the lower level within their corresponding subgraph task nodes. Out-of-block task nodes that depend on in-block task nodes are delayed until the iterative execution of the whole preceding recycle block is complete and the converged results are available. Subgraph task nodes are implemented with Taskflow’s composable tasking mechanism.

MGSS first identifies the recycle blocks using the system partitioning technique based on the topology of the flowsheet. After that, an optimum set of tear streams for each recycle block is inferred with the tearing algorithm. Alternatively, users can manually assign the set of tear streams to adjust the convergence of the simulation. After the tear streams of each recycle block are specified, corresponding recycle subgraph task nodes are created. The details of subgraph construction are elaborated in the following section. These recycle subgraph task nodes are then added to the main graph at the same level as other out-of-block simulation task nodes and are connected with out-of-block nodes according to the global process topology.
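Criteria for selecting tear streams vary. One simple criterion is to tear as few streams as possible. The sketch below finds such a minimum tear set by exhaustive search over subsets of streams in a small recycle block, testing acyclicity with Kahn's algorithm. It is only an illustration under that assumption: blocks must have fewer than 32 streams, and the function names are hypothetical. The text does not specify which tearing algorithm MGSS uses, and practical tearing methods avoid exhaustive enumeration.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;  // stream from unit .first to unit .second

// True if the n-unit block has no directed cycle once the streams whose bits
// are set in `removed` are torn (Kahn's algorithm on the remaining streams).
bool acyclicWithout(int n, const std::vector<Edge>& edges, unsigned removed) {
    std::vector<int> indeg(n, 0);
    std::vector<std::vector<int>> adj(n);
    for (std::size_t e = 0; e < edges.size(); ++e) {
        if ((removed >> e) & 1u) continue;  // torn stream
        adj[edges[e].first].push_back(edges[e].second);
        ++indeg[edges[e].second];
    }
    std::vector<int> ready;
    for (int v = 0; v < n; ++v)
        if (indeg[v] == 0) ready.push_back(v);
    int visited = 0;
    while (!ready.empty()) {
        int v = ready.back();
        ready.pop_back();
        ++visited;
        for (int w : adj[v])
            if (--indeg[w] == 0) ready.push_back(w);
    }
    return visited == n;
}

// Exhaustive search for a minimum-size tear set (indices into `edges`).
std::vector<int> minimumTearSet(int n, const std::vector<Edge>& edges) {
    assert(edges.size() < 32);
    const unsigned total = 1u << edges.size();
    unsigned best = total - 1;  // tearing every stream always breaks all cycles
    std::size_t bestCount = edges.size();
    for (unsigned mask = 0; mask < total; ++mask) {
        std::size_t count = std::bitset<32>(mask).count();
        if (count < bestCount && acyclicWithout(n, edges, mask)) {
            best = mask;
            bestCount = count;
        }
    }
    std::vector<int> tears;
    for (std::size_t e = 0; e < edges.size(); ++e)
        if ((best >> e) & 1u) tears.push_back(static_cast<int>(e));
    return tears;
}
```

In the test, two loops share stream 0 (from unit 0 to unit 1), so tearing that single stream breaks both loops.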

Fig. 8 provides a two-level hierarchical main graph corresponding to the process flowsheet in Fig. 7. The out-of-block task nodes, together with the recycle subgraph task node, form the higher level of the main graph. The lower level is built from the task nodes inside the recycle subgraph—that is, the in-block tasks. With this two-level hierarchical computing structure, the iterative computation of recycle blocks and the computation of out-of-block unit operations are on the same level. It is worth noting that, for simulation task nodes with an immediate connection to the recycle subgraph on the main graph, the dependency relations of these nodes are established with the lumped recycle subgraph task nodes, rather than the in-block simulation task node of the directly connected unit operation in the flowsheet. For example, in Fig. 8, the dependency relation for simulation task node S18 is set upon the recycle subgraph task node, rather than the simulation task node Column.
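The lifting rule in the S18/Column example can be expressed as a small graph transformation:

1. Map every in-block unit to its block's subgraph node.
2. Drop edges that become internal to a block.
3. Merge duplicate edges.

The function name, the `blockOf` map, and the test topology (including the Feed and Pump_1 connections) are hypothetical illustrations and are not taken from Fig. 8.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Stream = std::pair<std::string, std::string>;  // (upstream node, downstream node)

// Builds the higher-level edges of the main graph. Units listed in `blockOf`
// are replaced by their recycle subgraph node; all other nodes stay as-is.
std::set<Stream> mainGraphEdges(const std::vector<Stream>& flowsheetEdges,
                                const std::map<std::string, std::string>& blockOf) {
    auto lift = [&blockOf](const std::string& unit) -> std::string {
        auto it = blockOf.find(unit);
        return it == blockOf.end() ? unit : it->second;
    };
    std::set<Stream> edges;  // std::set merges duplicate edges
    for (const auto& e : flowsheetEdges) {
        std::string from = lift(e.first);
        std::string to = lift(e.second);
        if (from != to) edges.insert({from, to});  // internal edges stay in the subgraph
    }
    return edges;
}
```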

4.2. Recycle subgraph generation system

When MGSS constructs the hierarchical executable main graph of a process flowsheet with recycles, an RSGS is used to handle the iterative recycle procedure.

In a common routine for calculating recycle blocks in chemical process simulation, initial guess values are assigned to the tear streams, and all the unit operations inside the recycle block are then simulated sequentially. After each full pass through the block, the guess values for the tear streams are updated according to a chosen rule. In this way, the simulation programs of the tear streams are performed twice during each iteration, and the iterative procedure is repeated until the difference between the guessed and calculated tear stream values is no greater than the specified tolerance. The in-graph control flow of a TGCS is a potential tool for implementing this calculation procedure, although applying it to RSGS remains challenging. Specifically, it is difficult for a conventional task scheduler to schedule cyclic simulations of a recycle subgraph with convergence checks and to manage the execution of in-block tasks with different types of dependencies. For this reason, two types of task nodes are created inside the recycle subgraph: a recycle positioning task node and a conditional task node.
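The routine above is direct (successive) substitution. This minimal sketch assumes that the whole recycle block can be wrapped as one function `loopOnce`, a hypothetical stand-in for simulating every in-block unit, which maps guessed tear-stream values to calculated ones. Convergence is measured as the largest absolute change. The `TearResult` and `solveTear` names are also hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

struct TearResult {
    std::vector<double> x;  // last calculated tear-stream values
    int iterations;         // number of passes through the block
    bool converged;
};

using BlockPass = std::function<std::vector<double>(const std::vector<double>&)>;

TearResult solveTear(const BlockPass& loopOnce, std::vector<double> guess,
                     double tol, int maxIter) {
    for (int it = 1; it <= maxIter; ++it) {
        std::vector<double> calc = loopOnce(guess);  // simulate the block once
        assert(calc.size() == guess.size());
        double err = 0.0;  // largest absolute change between guess and result
        for (std::size_t i = 0; i < guess.size(); ++i)
            err = std::max(err, std::fabs(calc[i] - guess[i]));
        guess = calc;  // plain substitution update
        if (err <= tol) return {guess, it, true};
    }
    return {guess, maxIter, false};
}
```

In the test, the toy block computes x = 0.5 (100 + x), a recycle that returns half of the combined feed and recycle flow. It converges to 100. A convergence acceleration method such as Wegstein's would replace the plain `guess = calc` update.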

Recycle positioning task node. A recycle positioning task node, supported by Taskflow’s static tasking model, is set up to tell the task scheduler which task node the recycle subgraph starts its calculation from. In each recycle subgraph, two recycle positioning task nodes are added before the first simulation task nodes and serve as the starting point, which helps the scheduler identify the calculation sequence of in-block task nodes. These positioning nodes also prevent out-of-sync errors that occur when some branches compute faster and enter the next iteration earlier than slower branches in the same subgraph. In this way, the positioning task node ensures that each iteration is synchronized by holding all parallel branches of the recycle subgraph within the same iteration.

Conditional task node. In order to implement the iterative computation of a recycle subgraph, a conditional task node is defined as a logic unit placed at the end of each iteration to check whether the current error satisfies the convergence tolerance given by the user. Conditional task nodes exist only in recycle subgraphs and are supported by the conditional tasking model of Taskflow.

Conditional task nodes carry programs for multiple convergence acceleration algorithms and regulate the in-graph control flow of recycle subgraphs. As discussed in Section 3.1, depending on whether the execution of a task relies on a conditional task node or on other task nodes, the dependency relations in the iterative procedure are classified as weak dependencies (edges leaving a conditional task node) or strong dependencies (all other edges). Task nodes with no incoming arrows are called zero-dependency tasks. The calculation procedure always starts by executing zero-dependency tasks. The task scheduler then continues executing tasks with strong dependencies until it reaches a weak dependency. There, the conditional task node evaluates whether convergence has been reached, and the scheduler jumps directly to the task indexed by that node’s return value. Notably, a zero-dependency task is mandatory for a sequence of task nodes to be executed. Otherwise, an endless loop is formed, as in Fig. 9: the task scheduler cannot choose where to start, and all the tasks in the subgraph are blocked from execution within the main graph.

Thus far, all node types in RSGS have been introduced. Fig. 10 shows the output of RSGS: a task subgraph corresponding to a recycle block. After MGSS identifies the recycle blocks and infers the tear stream set, RSGS is activated to generate the corresponding recycle subgraphs. Tear streams are the first task nodes that RSGS places in the subgraph (S7 and S9 in Fig. 10). The other in-block simulation task nodes are then added sequentially, with their dependencies established according to the flowsheet topology. Note that, when the last in-block task nodes are placed (i.e., the unit operation task nodes that directly precede the tear streams, Pump_1 and Flash_3 in Fig. 10), they must not yet be connected to their succeeding tear streams. Otherwise, the recycle subgraph would contain no zero-dependency task and would become an endless loop. To avoid this problem, and because the tear stream simulation programs are executed twice in each iteration, copies of the tear streams’ instantiated objects (S7′ and S9′) are created in memory after all in-block simulation task nodes have been added to the subgraph. Connecting the last in-block task nodes Pump_1 and Flash_3 to the copies S7′ and S9′, rather than to the original tear streams, yields a one-way subgraph with tear streams at both its head and its tail.

Next, the recycle positioning task node start point is placed in front of the head tear streams to mark the start of each iteration. A conditional task node is added after the tail tear streams to evaluate whether the recycle has converged by analyzing the stored calculation results of the tear streams. Connecting the conditional node to the start point node with a weak dependency closes the recycle loop in the subgraph. Finally, for the task scheduling algorithm to work properly, a second positioning task node, init, is placed in front of the start point to provide a zero-dependency source, so that the scheduler recognizes where the recycle subgraph starts.
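The construction steps can also be checked mechanically. The Python sketch below is built on assumptions rather than on PSPCF's source: it assembles the node and edge lists described above (init, start point, head tear streams, in-block units, copied tail tear streams, and a conditional node with a weak edge back to the start point) and verifies two properties, namely that `init` is the only zero-dependency task and that the strong edges alone form an acyclic graph. Only S7, S9, Pump_1, and Flash_3 come from Fig. 10; the unit `Flash_2` and the in-block edges are hypothetical.

```python
from collections import defaultdict, deque


def build_recycle_subgraph(tear_streams, inblock_edges, tail_links):
    """Assemble a recycle subgraph following the construction steps in the text.

    Returns (strong_edges, weak_edges) as lists of (from, to) pairs.
    tail_links maps each last in-block unit to the tear stream it feeds.
    """
    strong = [("init", "start")]                    # zero-dependency source
    strong += [("start", s) for s in tear_streams]  # head tear streams
    strong += list(inblock_edges)                   # flowsheet topology
    for unit, tear in tail_links.items():
        copy = tear + "'"                           # copied tear stream, e.g. S7'
        strong.append((unit, copy))                 # link to the copy, not the original
        strong.append((copy, "cond"))
    weak = [("cond", "start")]                      # conditional node closes the loop
    return strong, weak


def zero_dependency_tasks(strong, weak):
    """Tasks with no incoming edge of either kind."""
    nodes = {n for edge in strong + weak for n in edge}
    return sorted(nodes - {b for _, b in strong + weak})


def strong_edges_acyclic(strong):
    """Kahn's algorithm: True if the strong edges alone contain no cycle."""
    succ, indeg = defaultdict(list), defaultdict(int)
    nodes = {n for edge in strong for n in edge}
    for a, b in strong:
        succ[a].append(b)
        indeg[b] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    visited = 0
    while queue:
        n = queue.popleft()
        visited += 1
        for s in succ[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                queue.append(s)
    return visited == len(nodes)


# Hypothetical in-block topology; only S7, S9, Pump_1, and Flash_3 come from Fig. 10.
INBLOCK = [("S7", "Flash_2"), ("S9", "Flash_2"), ("Flash_2", "Flash_3"), ("Flash_3", "Pump_1")]
TAIL = {"Pump_1": "S7", "Flash_3": "S9"}

strong, weak = build_recycle_subgraph(["S7", "S9"], INBLOCK, TAIL)
print(zero_dependency_tasks(strong, weak), strong_edges_acyclic(strong))

# Wiring Pump_1 back to the original S7 creates a strong cycle that can never start.
bad_strong = strong + [("Pump_1", "S7")]
print(strong_edges_acyclic(bad_strong))
```

Appending the edge `Pump_1 -> S7`, that is, wiring the last unit back to the original tear stream instead of its copy, creates a cycle of strong edges whose join counters can never reach zero, which is the endless loop that the copied tear streams avoid.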

4.3. Architecture of PSPCF

Fig. 11 illustrates the architecture of PSPCF. Users drag unit operations onto the canvas, connect them to configure a flowsheet, and set its structural and operating parameters. Upon receiving a process flowsheet, MGSS generates a hierarchical executable main task graph at the higher level of the computing structure. This level primarily executes out-of-block simulation units and recycle blocks. If MGSS detects cyclic structures in the flowsheet topology, it activates RSGS to construct an executable task subgraph with in-graph control flow at the lower level of the computing structure, enabling the iterative execution of in-block tasks.

As the output from MGSS, the main graph of a process flowsheet is essentially a set of unit operation simulation task nodes with dependencies. These tasks are allocated to multiple threads through a task scheduler, which is responsible for managing a set of worker threads to schedule and execute dependent tasks. When the task scheduler receives a main graph from MGSS, it dispatches the tasks in the graph to workers based on the dependencies. To accelerate the execution of the process-simulation task graph, PSPCF also incorporates a worker-level scheduler, which dynamically balances the worker count with task parallelism based on a work-stealing scheme, as detailed in Section 3.2.
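Work stealing itself is a generic scheduling technique [38]. As a rough illustration only, and not Taskflow's executor or PSPCF's worker-level scheduler, the Python sketch below gives each worker its own deque: the owner pops tasks from one end, and an idle worker steals from the opposite end of a randomly chosen victim. The `WorkStealingPool` class and the `Column_i` tasks are hypothetical, and because CPython threads share a global interpreter lock, the sketch shows the protocol rather than any real speed-up.

```python
import random
import threading
from collections import deque


class WorkStealingPool:
    """Minimal work-stealing sketch: owners pop LIFO from their own deque,
    idle workers steal FIFO from a randomly chosen victim's deque."""

    def __init__(self, n_workers, seed=0):
        self.queues = [deque() for _ in range(n_workers)]
        self.locks = [threading.Lock() for _ in range(n_workers)]
        self.rngs = [random.Random(seed + w) for w in range(n_workers)]  # one RNG per worker
        self.log, self.results = [], {}
        self.log_lock = threading.Lock()

    def submit(self, worker, name, fn):
        with self.locks[worker]:
            self.queues[worker].append((name, fn))

    def _pop_own(self, w):
        with self.locks[w]:
            return self.queues[w].pop() if self.queues[w] else None  # owner end (LIFO)

    def _steal(self, w):
        victims = [v for v in range(len(self.queues)) if v != w]
        self.rngs[w].shuffle(victims)
        for v in victims:
            with self.locks[v]:
                if self.queues[v]:
                    return self.queues[v].popleft()  # opposite end (FIFO)
        return None

    def _worker(self, w):
        while True:
            item = self._pop_own(w) or self._steal(w)
            if item is None:
                return  # every deque is empty; tasks here never spawn new tasks
            name, fn = item
            value = fn()
            with self.log_lock:
                self.results[name] = value
                self.log.append((w, name))

    def run(self):
        threads = [threading.Thread(target=self._worker, args=(w,))
                   for w in range(len(self.queues))]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return self.results


pool = WorkStealingPool(n_workers=4)
for i in range(8):
    # All hypothetical column tasks start on worker 0; other workers can only steal them.
    pool.submit(0, f"Column_{i}", lambda i=i: i * i)
results = pool.run()
print(sorted(results.items()))
print(pool.log)  # which worker ran which task varies from run to run
```

Because all eight tasks are submitted to worker 0, the other workers can obtain work only by stealing; the assignment recorded in `pool.log` differs between runs, but every task executes exactly once. Terminating when all deques are empty is valid here only because these tasks never spawn new tasks; a scheduler for dependent tasks needs a more careful termination rule.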

A key feature of PSPCF is its ability to parallelize in-block tasks. As depicted in Fig. 11, each identified recycle subgraph is assigned to one of the worker threads, while its in-block tasks are dispatched to other threads for parallel execution.

5. Results

To evaluate the performance of PSPCF, we studied two cases, a simple chemical process with eight parallel columns and a real-world process for separating cracked gas in ethylene production, focusing on runtime, load balancing, and the hierarchy mechanism. The experiments were conducted on a Windows 11 x86 64-bit machine with an 11th Gen Intel® Core™ i7-1165G7 2.80 GHz CPU and 16 GB of memory. PSPCF was compiled using Ninja with C++19 standards, and all reported data are averages of 10 runs.

5.1. Columns in parallel

In the first case study, we used a flowsheet with eight parallel distillation columns, whose task graph is illustrated in Appendix A Fig. S1. We manually configured the number of worker threads in PSPCF to test the acceleration performance under different conditions.

An identical process flowsheet was set up in Aspen Plus to assess the accuracy of the PSPCF calculations. Appendix A Fig. S2 compares the liquid-phase concentration profiles of the heavy key component (ethane) and the light key component (ethylene) computed by PSPCF and Aspen Plus, showing satisfactory agreement. Fig. 12 plots the total execution time of the flowsheet against the number of workers. Compared with a single worker, four or more workers reduced the runtime by 55%–65%.

When the number of workers exceeds five, the downward trend in total execution time reverses, and the time increases slightly as more workers are added. At eight workers, the total execution time decreases again. Two factors explain this behavior. First, as shown in Appendix A Fig. S3, a distillation column takes far longer to execute than the other unit operations, which dominates the load balancing between workers. As the number of workers increases from one to four, the number of batches needed to execute the columns decreases, reaching the best load balance at four workers. Beyond that, the columns still require two batches, so no further time is saved and some threads sit idle during the second batch; only at eight workers can all eight columns run in a single batch. Second, when multiple workers execute complex distillation tasks simultaneously, calls to the equilibrium flash algorithms and shared-memory data access cause waiting losses, which further increase the total execution time.
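The batch argument can be made concrete with a back-of-the-envelope Python sketch. It assumes, purely for illustration, that the eight columns take equal time and that everything else, including the shorter unit operations and memory contention, is negligible.

```python
import math


def column_batches(n_columns, n_workers):
    """Rounds needed if equal-length column tasks run n_workers at a time."""
    return math.ceil(n_columns / n_workers)


def idle_slots(n_columns, n_workers):
    """Worker slots left empty in the last round."""
    return n_workers * column_batches(n_columns, n_workers) - n_columns


for w in range(1, 9):
    print(f"workers={w}  batches={column_batches(8, w)}  idle slots={idle_slots(8, w)}")
```

Under these assumptions, one to eight workers need 8, 4, 3, 2, 2, 2, 2, and 1 batches, respectively. Five to seven workers leave 2, 4, and 6 slots idle in the second batch without eliminating it, and only eight workers collapse the work into a single batch. The model cannot reproduce the slight slowdown beyond five workers, which the text attributes to waiting losses.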

5.2. Cracked gas separation process in ethylene production

We further evaluated PSPCF’s performance on a complex separation process in ethylene production that is vital for isolating high-purity ethylene and recovering valuable byproducts such as propylene and butadiene. This process, illustrated in Appendix A Fig. S4, involves multiple steps and enhances both the efficiency and the yield of ethylene production, while ensuring environmental compliance and operational safety.

The separation flowsheet contains four recycle blocks, in which in-graph control loops supported by condition tasks implement the iterative computation of in-block tasks. PSPCF’s task graph for the separation flowsheet is shown in Appendix A Fig. S5, and the Aspen Plus process flow diagram for this flowsheet, together with a detailed description of the process, is provided in Appendix A Fig. S6. Fig. 13 shows the execution timelines with five specified workers and gives detailed insight into the in-graph control flow, particularly in Recycle 4. Table 1 lists the molar flows and product concentrations of the main product streams computed by PSPCF and Aspen Plus, demonstrating PSPCF’s high calculation accuracy for complex processes.

Following the scheme introduced in Section 4.2, the timeline of each recycle subgraph task starts with the recycle positioning task. The span of the simulation tasks inside the loop repeats in each recycle timeline until the in-block tear streams meet the convergence condition. As mentioned in Section 3.2, the executor spawns an extra worker thread for module tasks. In Fig. 13, these extra workers are labeled “L1” to distinguish them from the threads executing recycle subgraphs. For example, Recycle 4’s subgraph is assigned to W2.L0, and an extra thread, W2.L1, is created to execute its in-block tasks synchronously. With five specified workers, three workers execute recycle subgraphs (Recycle 1 on W1.L0, Recycle 2 on W4.L0, and Recycles 3 and 4 on W2.L0), so three extra threads are spawned (W1.L1, W4.L1, and W2.L1), giving a total of eight worker threads.

Distillation columns remain the most time-consuming tasks in the separation process. Because the CPU time of different unit operations varies significantly, multiple processors cannot be used efficiently in this case [42]. The unit operations inside Recycle 4, including the distillation columns, are iterated 10 times in total. Under the task-scheduling mechanism for condition tasks, a new recycle iteration cannot start until the previous one ends, so tasks from different iterations cannot run in parallel even when idle workers are available. As shown in Fig. 13, the column tasks executed 10 times on W2.L1 and W4.L0 correspond to RigorousColumn_4 and RigorousColumn_5, and these executions from different iterations can only proceed sequentially rather than in parallel. This situation arises not only within recycle subgraphs but also when short out-of-block tasks must wait for the time-consuming recycle subgraphs or distillation columns that precede them to finish.
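Why extra workers cannot overlap iterations can be quantified with a critical-path bound. In the Python sketch below, the per-iteration task graph and its durations (arbitrary time units) are hypothetical; only the count of 10 iterations is taken from Recycle 4. Because iterations run strictly one after another, the recycle can never finish faster than the number of iterations times the longest duration-weighted path of one iteration, regardless of how many workers are available.

```python
from functools import lru_cache


def critical_path(durations, edges):
    """Longest duration-weighted path through a DAG: a lower bound on its makespan."""
    succ = {t: [] for t in durations}
    for a, b in edges:
        succ[a].append(b)

    @lru_cache(maxsize=None)
    def longest_from(task):
        return durations[task] + max((longest_from(s) for s in succ[task]), default=0)

    return max(longest_from(t) for t in durations)


# Hypothetical single iteration of a recycle block (arbitrary time units).
durations = {"tear": 0, "Column_A": 5, "Flash_A": 1, "Column_B": 4, "check": 0}
edges = [("tear", "Column_A"), ("tear", "Flash_A"),
         ("Column_A", "Column_B"), ("Flash_A", "Column_B"), ("Column_B", "check")]

iterations = 10                     # Recycle 4 is iterated 10 times
work = sum(durations.values())      # serial time per iteration
span = critical_path(durations, edges)
# Iterations run one after another, so no worker count can beat iterations * span.
print(iterations * work, iterations * span, work / span)
```

With these made-up numbers, the serial time is 100 units and the bound is 90 units, so parallelism inside the loop can shorten this recycle by at most 10%, a speed-up of no more than 10/9 ≈ 1.11.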

The total execution runtime, plotted in Fig. 14, shows a 35%–40% speed-up with PSPCF compared with single-thread computation. Tables 2 and 3 compare the total execution runtime and the load rate of each worker thread. They reveal that the work-scheduling algorithm reaches its capacity limit when seven workers are specified, leaving workers idle and causing significant load imbalances because of the dependency structure of the process flowsheet.

The average load rate of the workers decreases as the number of spawned workers increases, indicating that adding worker threads without considering the dependency structure of the process may not improve computation speed as expected. There is, in fact, an optimal number of spawned workers that maximizes parallel efficiency, and it depends on the specific dependency structure of the process and on the computer configuration. In this particular case, the optimal number of workers was three.
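The interplay between worker count and load rate can be explored with a simple greedy list-scheduling simulation in Python. This is a stand-in, not Taskflow's work-stealing scheduler, and the flowsheet graph and its durations are hypothetical. The load rate of a worker is taken here to be its busy time divided by the makespan, which is an assumption about how Tables 2 and 3 define it.

```python
import heapq


def list_schedule(durations, edges, n_workers):
    """Greedy list scheduling of a task DAG; returns (makespan, per-worker load rates)."""
    succ = {t: [] for t in durations}
    indeg = {t: 0 for t in durations}
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    ready = sorted(t for t in durations if indeg[t] == 0)
    free = list(range(n_workers))      # idle worker ids, lowest first
    busy = [0.0] * n_workers           # accumulated busy time per worker
    running = []                       # heap of (finish time, task, worker)
    now = 0.0
    while ready or running:
        # Start as many ready tasks as there are idle workers.
        while ready and free:
            task, w = ready.pop(0), free.pop(0)
            heapq.heappush(running, (now + durations[task], task, w))
            busy[w] += durations[task]
        # Advance time to the next task completion and release its successors.
        now, task, w = heapq.heappop(running)
        free.append(w)
        free.sort()
        for s in succ[task]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return now, [b / now for b in busy]


# Hypothetical flowsheet DAG: a feed unit, three parallel columns, a mixer, a final column.
durations = {"Feed": 2, "Col_1": 6, "Col_2": 6, "Col_3": 3, "Mixer": 1, "Col_4": 6}
edges = [("Feed", "Col_1"), ("Feed", "Col_2"), ("Feed", "Col_3"),
         ("Col_1", "Mixer"), ("Col_2", "Mixer"), ("Col_3", "Mixer"), ("Mixer", "Col_4")]

for w in range(1, 5):
    makespan, rates = list_schedule(durations, edges, w)
    print(w, makespan, round(sum(rates) / w, 3))
```

For this toy graph the makespan is 24, 18, 15, and 15 units for one to four workers, while the average load rate falls from 1.0 to about 0.67, 0.53, and 0.40. Three workers already reach the critical path (Feed → Col_1 → Mixer → Col_4, 15 units), so a fourth worker only adds idle time. The numbers are illustrative and are not the paper's measurements.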

Note that the process flowsheets used for validation were modeled in our self-developed process-simulation software. Although PSPCF is a versatile framework intended to enable parallelism in SM-based software, differences in simulation times between our software and commercial platforms preclude a direct comparison of PSPCF’s parallel efficacy across platforms. We are actively pursuing the validation of PSPCF on well-known commercial platforms such as Aspen Plus; however, adapting their closed-source code for validation purposes is not feasible.

6. Conclusions

This paper introduces a hierarchical task graph parallel computing framework, PSPCF, for chemical process simulation. Incorporating the advanced TGCS Taskflow, PSPCF parallelizes simulation calculations at the unit operation level and enhances computing speed through a work-stealing scheme. It effectively addresses the parallelism of processes with recycles using conditional and composable tasking models. The performance of PSPCF is evaluated on two chemical processes considering runtime, load balancing, and hierarchy mechanism. The results demonstrate significant speed improvements in process calculations. More specifically, for the complex cracked gas process, PSPCF achieves a 35%–40% reduction in runtime. The results also highlight that flowsheet dependency structures notably influence PSPCF’s performance, with the optimal number of worker threads for efficient parallel computation varying based on process dependencies and computer configurations.

PSPCF stands out as the first generic computing framework for SM-based systems that supports parallel computing using a TGCS and can simulate flowsheets of arbitrary topology. This research provides novel insights into chemical process simulation by transforming simulation problems into hierarchical task graphs that an existing TGCS can execute efficiently. The runtime framework of existing process modeling tools can be replaced with PSPCF at little cost, enabling these tools to generate more efficient initial guesses for EO-based simulations through SM-based computation. Owing to the flexibility of graph construction, PSPCF can also accelerate higher-level process computations, such as flowsheet optimization using heuristic algorithms; for such tasks, PSPCF can replicate the parts of the process involved in the optimization and construct parallel optimization task graphs.

A key future direction involves integrating an AI model into PSPCF to ascertain the effects of hardware configuration and the degree of task parallelization on the optimal number of threads. This integration would enable the work scheduler to automatically determine the best number of worker threads for parallel process execution. In addition, to maximize computing resource utilization, future research will focus on developing advanced work-stealing scheduling algorithms that scale to both GPUs and CPUs, enhancing the computing speed of process simulations on large server platforms.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key Research and Development Program of China (2022YFB3305900), the National Natural Science Foundation of China (Key Program) (62136003), the National Natural Science Foundation of China (62394345), the Major Science and Technology Projects of Longmen Laboratory (LMZDXM202206), and the Fundamental Research Funds for the Central Universities.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.eng.2024.06.019.

References

[1] Yang T, Yi X, Lu S, Johansson KH, Chai T. Intelligent manufacturing for the process industry driven by industrial artificial intelligence. Engineering 2021;7(9):1224-1230.

[2] Mao S, Wang B, Tang Y, Qian F. Opportunities and challenges of artificial intelligence for green manufacturing in the process industry. Engineering 2019;5(6):995-1002.

[3] Ge W, Guo L, Li J. Toward greener and smarter process industries. Engineering 2017;3(2):152-153.

[4] Zhou J, Li P, Zhou Y, Wang B, Zang J, Meng L. Toward new generation intelligent manufacturing. Engineering 2018;4(1):11-20.

[5] Qian F, Zhong W, Du W. Fundamental theories and key technologies for smart and optimal manufacturing in the process industry. Engineering 2017;3(2):154-160.

[6] Dowling AW, Biegler LT. A framework for efficient large scale equation-oriented flowsheet optimization. Comput Chem Eng 2015;72:3-20.

[7] Shacham M, Macchietto S, Stutzman LF, Babcock P. Equation oriented approach to process flowsheeting. Comput Chem Eng 1982;6(2):79-95.

[8] Barton PI, Pantelides CC. Modeling of combined discrete/continuous processes. AIChE J 1994;40(6):966-979.

[9] Biegler LT, Grossmann IE, Westerberg AW. Systematic methods for chemical process design. Old Tappan: Prentice Hall; 1997.

[10] Ortega JM, Voigt RG. Solution of partial differential equations on vector and parallel computers. SIAM Rev 1985;27(2):149-240.

[11] Liu YC, Brosilow CB. Simulation of large scale dynamic systems—I. Modular integration methods. Comput Chem Eng 1987;11(3):241-253.

[12] Ma Y, Shao Z, Chen X, Biegler LT. A parallel function evaluation approach for solution to large-scale equation-oriented models. Comput Chem Eng 2016;93:309-322.

[13] Weng J, Chen X, Biegler LT. A multi-thread parallel computation method for dynamic simulation of molecular weight distribution of multisite polymerization. Comput Chem Eng 2015;82:55-67.

[14] Carrasco JC, Lima FV. Bilevel and parallel programing-based operability approaches for process intensification and modularity. AIChE J 2018;64(8):3042-3054.

[15] Lu L, Xu J, Ge W, Yue Y, Liu X, Li J. EMMS-based discrete particle method (EMMS–DPM) for simulation of gas–solid flows. Chem Eng Sci 2014;120:67-87.

[16] Lyu H, Cui C, Zhang X, Sun J. Population-distributed stochastic optimization for distillation processes: implementation and distribution strategy. Chem Eng Res Des 2021;168:357-368.

[17] Bimakr F, Baniadam M, Fathikalajahi J. Evaluation of performance of an industrial gas sweetening plant by application of sequential modular and simultaneous modular methods. Chem Biochem Eng Q 2008;22(4):411-420.

[18] Haydary J. Chemical process design and simulation: Aspen Plus and Aspen Hysys applications. Hoboken: John Wiley & Sons, Inc.; 2019.

[19] Pantelides CC, Nauta M, Matzopoulos M, Grove H. Equation-oriented process modelling technology: recent advances & current perspectives. In: Proceedings of the 5th Annual TRC-Idemitsu Work; 2015 Feb 11–12; Abu Dhabi, UAE; 2015.

[20] Pantelides CC, Renfro JG. The online use of first-principles models in process operations: review, current status and future needs. Comput Chem Eng 2013;51:136-148.

[21] Biegler LT. Nonlinear programming: concepts, algorithms, and applications to chemical processes. Philadelphia: Society for Industrial and Applied Mathematics; 2010.

[22] Pattison RC, Baldea M. Equation-oriented flowsheet simulation and optimization using pseudo-transient models. AIChE J 2014;60(12):4104-4123.

[23] Aldinucci M, Danelutto M, Kilpatrick P, Torquati M. FastFlow: high-level and efficient streaming on multi-core. In: Pllana S, Xhafa F, editors. Programming multi-core and many-core computing systems. Hoboken: John Wiley & Sons, Inc.; 2017. p. 261–80.

[24] Augonnet C, Thibault S, Namyst R, Wacrenier PA. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. In: Sips H, Epema D, Lin HX, editors. Euro-Par 2009—parallel processing. Berlin: Springer; 2009. p. 863–74.

[25] Leijen D, Schulte W, Burckhardt S. The design of a task parallel library. ACM SIGPLAN Not 2009;44(10):227-242.

[26] Bauer M, Treichler S, Slaughter E, Aiken A. Legion: expressing locality and independence with logical regions. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis; 2012 Nov 10–16; Salt Lake City, UT, USA. Piscataway: IEEE; 2012.

[27] Carter Edwards H, Trott CR, Sunderland D. Kokkos: enabling manycore performance portability through polymorphic memory access patterns. J Parallel Distrib Comput 2014;74(12):3202-3216.

[28] Bosilca G, Bouteiller A, Danalis A, Faverge M, Herault T, Dongarra JJ. PaRSEC: exploiting heterogeneity to enhance scalability. Comput Sci Eng 2013;15(6):36-45.

[29] Kaiser H, Heller T, Adelstein-Lelbach B, Serio A, Fey D. HPX: a task based programming model in a global address space. In: Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models; 2014 Oct 6–10; Eugene, OR, USA. New York City: Association for Computing Machinery; 2014. p. 6.

[30] Huang TW, Lin DL, Lin CX, Lin Y. Taskflow: a lightweight parallel and heterogeneous task graph computing system. IEEE Trans Parallel Distrib Syst 2022;33(6):1303-1320.

[31] Motard RL, Shacham M, Rosen EM. Steady state chemical process simulation. AIChE J 1975;21(3):417-436.

[32] Westerberg AW, Piela PC. Equational-based process modeling. 1994.

[33] Li H, Gao B, Chen Z, Zhao Y, Huang P, Ye H, et al. A learnable parallel processing architecture towards unity of memory and computing. Sci Rep 2015;5(1):13330.

[34] Pierucci SJ, Ranzi EM, Biardi GE. Solution of recycle problems in a sequential modular approach. AIChE J 1982;28(5):820-827.

[35] Cho KW, Yeo YK, Kim MK. Application of partitioning and tearing techniques to sulfolane extraction plant. Korean J Chem Eng 1999;16(4):462-469.

[36] Varma GV, Lau KH, Ulrichson DL. A new tearing algorithm for process flowsheeting. Comput Chem Eng 1993;17(4):355-360.

[37] Kisala TP, Trevino-Lozano RA, Boston JF, Britt HI, Evans LB. Sequential modular and simultaneous modular strategies for process flowsheet optimization. Comput Chem Eng 1987;11(6):567-579.

[38] Blumofe RD, Leiserson CE. Scheduling multithreaded computations by work stealing. J Assoc Comput Mach 1999;46(5):720-748.

[39] Tardieu O, Wang H, Lin H. A work-stealing scheduler for X10’s task parallelism with suspension. ACM SIGPLAN Not 2012;47(8):267-276.

[40] Acar UA, Charguéraud A, Rainey M. Scheduling parallel programs by work stealing with private deques. In: Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming; 2013 Feb 23–27; Shenzhen, China. New York City: Association for Computing Machinery; 2013. p. 219–28.

[41] Muller SK, Acar UA. Latency-hiding work stealing: scheduling interacting parallel computations with work stealing. In: Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures; 2016 Jul 11–13; Pacific Grove, CA, USA. New York City: Association for Computing Machinery; 2016. p. 71–82.

[42] Cera GD. Parallel dynamic process simulation of a distillation column on the BBN butterfly parallel processor computer. Comput Chem Eng 1989;13(7):737-752.
