1. Introduction

In recent years, information technology and mobile communication technology have become closely integrated and have developed rapidly. The software and hardware of smart devices are also continuously upgraded. These technologies have promoted the development of the internet, the mobile internet, cloud computing, big data, and the Internet of Things. At the same time, a variety of new service models have greatly improved the quality of life; these include the e-commerce services represented by Amazon and Taobao, the social network services represented by Facebook and WeChat, and the vehicle services represented by Uber and Didi.

However, the emergence and rapid development of new technologies and new service models mean that a massive amount of users’ personal information now flows across information systems, digital ecosystems, and even national network boundaries. In each step of the information lifecycle, such as collection, storage, processing, release (including exchange), and destruction, users’ personal information is inevitably retained in various information systems. This leads to the separation of the ownership, management, and utilization rights of information, which seriously threatens users’ rights to consent, to be erased/forgotten, and to extend authorization. Furthermore, the lack of effective monitoring technology makes the tracing and forensics of privacy invasion difficult.

Most existing privacy-preserving schemes focus on relatively isolated application scenarios and technical points, and propose solutions to specific problems within a given application scenario. While a privacy-preserving scheme based on access control technology is suitable for a single information system, the problem of privacy preservation in metadata storage and publishing remains unsolved. Similarly, a privacy-preserving scheme based on cryptography is only applicable to a single information system. Although key management implemented with the help of trusted third parties can realize the exchange of private information between multiple information systems, users’ deletion rights and extended authorization after the exchange remain unsolved. A privacy-preserving scheme based on generalization, confusion, and anonymity technologies distorts the data so that it cannot be restored, and can therefore be applied to many scenarios, such as anonymizing data with one or more operations to obtain an increased level of privacy preservation. However, this kind of privacy-preserving scheme reduces the utility of the data, which leads to the adoption of weaker privacy-preserving schemes in actual information systems, or to the simultaneous storage of the original data. At present, no description method or computing model is available that can integrate private information with the demand for privacy preservation, and we lack a computing architecture to protect privacy on demand in complex application scenarios, such as private information exchange across information systems, private information sharing with multi-service requirements, and dynamic anonymizing of private information.

In brief, existing privacy-preserving technologies cannot meet the privacy preservation requirements in complex information systems, which leads to unsolved privacy-preserving problems in typical application scenarios such as e-commerce and social networking. For this reason, we put forward a privacy computing theory and a key technology system for privacy preservation. The main technical contributions are as follows:

• For the first time, we propose a privacy computing theory and a key technology system for privacy preservation. This is done from the perspective of whole life-cycle preservation of private information, in order to answer the demand for systematic privacy preservation in complex application scenarios.

• We provide a general framework for privacy computing, including a concept and formal definition of privacy computing, four principles of the privacy computing framework, algorithm design criteria, evaluation of the privacy-preserving effect, and a privacy computing language.

• We introduce four cases to verify the effectiveness of our proposed framework and to demonstrate how the framework implements privacy preservation and traces evidence when a privacy invasion occurs.

The remainder of this paper is organized as follows: Section 2 describes related work, while the concept and key technologies of our privacy computing are introduced in Section 3. We utilize four scenarios to describe the ubiquitous application of our privacy computing framework in Section 4, and look forward to future research directions in privacy computing and unsolved problems in Section 5. We conclude our paper in Section 6.

2. Related work

Existing research on privacy preservation mainly focuses on the privacy-preserving techniques of data processing, and on privacy measurement and evaluation.

2.1. Privacy-preserving techniques of data processing

Research on privacy preservation has been conducted on all stages of the information lifecycle, including information collection, storage, processing, release, and destruction. In addition, based on access control, information confusion, and cryptography technologies, numerous privacy-preserving schemes have been proposed for typical scenarios such as social networking, location-based services, and cloud computing.

Access control technology protects private information by creating access policies to ensure that only authorized subjects can access the data resource. In recent years, multiple privacy-preserving techniques based on access control technology have been presented. Scherzer et al. [1] proposed a high-assurance smart card privacy-preserving scheme with mandatory access controls (MACs) [2,3], and Slamanig [4] proposed a privacy-preserving framework for outsourced data storage based on discretionary access control (DAC) [5,6]. In order to improve the effectiveness of authority management, Sandhu et al. [7] presented role-based access control (RBAC). In RBAC, a user is mapped to a specific role in order to obtain the corresponding access authority, which greatly simplifies authority management in complicated scenarios. Dafa-Alla et al. [8] designed a privacy-preserving data-mining scheme with RBAC for multiple scenarios. In 2018, Li et al. [9] proposed a novel cyberspace-oriented access control (CoAC) model, which can effectively prevent security problems caused by the separation of data ownership and management rights and by secondary/multiple forwarding of information, by comprehensively considering vital factors, such as the access requesting entity, general time and state, access point, device, networks, resources, network interactive graph, and chain of resource transmission. Based on this model, they proposed a scenario-based access control method called HideMe [10] for privacy-aware users in photo-sharing applications. A scenario is carefully defined based on a combination of factors such as temporal, spatial, and sharing behavior factors. In addition, attribute-based encryption [11,12] transforms the identity of the user into a series of attributes, and the attribute information is embedded in the encryption and decryption process so that the public key cryptosystem is capable of fine-grained access control. Shao et al. [13] achieved fine-grained access control with attribute-based encryption, and protected the user’s location privacy in location-based services.

Information confusion technology protects the original data with generalization, anonymity, or confusion, which prevents attackers from obtaining useful information from the modified data. Anonymity technologies, such as k-anonymity [14–17], l-diversity [18,19], and t-closeness [20,21], achieve privacy preservation by masking the original data within a cloaking region. Differential privacy [22,23] is widely adopted as a privacy-preserving technology because it does not require assumptions about the background knowledge of the attackers. To address the issue of similarity attacks, Dewri [24] proposed an anonymization algorithm that applies differential privacy technology to location-related data; this method is able to maximize the effectiveness of differential privacy. However, differential privacy must add a great deal of randomization to query results, and its utility decreases drastically as privacy preservation requirements increase [25].

Cryptography technology protects users’ private information through encryption techniques and trapdoor functions. In order to protect private data in cloud computing, the concept of homomorphic encryption was first proposed by Rivest et al. [26]. With homomorphic encryption, Zhu et al. [27] proposed a privacy-preserving spatial query framework for location-based services. In 1999, based on composite residuosity, Paillier [28] designed an additive homomorphic encryption algorithm, which is widely used in multiple scenarios. For smart grids, Lu et al. [29] proposed a privacy-preserving data aggregation scheme with the Paillier cryptosystem, which can protect users’ sensitive information and resist various attacks. In 2009, Gentry [30] successfully constructed the fully homomorphic encryption (FHE) algorithm based on an ideal lattice [31]; this method achieves additive and multiplicative homomorphic encryption simultaneously. However, the efficiency of FHE is far from practical in the real world, even though many modified schemes [32–34] have been proposed in recent years. In order to improve the efficiency, Zhu et al. [35] proposed an efficient and privacy-preserving point of interest (POI) query [36] scheme with a lightweight cosine similarity computing protocol for location-based services. The proposed scheme is highly efficient and can protect users’ query and location information simultaneously. Other cryptography-based solutions [37,38] have also been proposed to enhance the privacy of the data owner in cloud computing scenarios.

The above-mentioned privacy-preserving schemes are concrete algorithms that mainly focus on a partial dataset of specific scenarios. As a result, they lack the algorithm framework for the dynamic dataset of specific scenarios, and further lack the universal algorithm framework for the dynamic dataset of multiple scenarios. Moreover, for multimedia data, it is necessary to combine multiple algorithms to achieve privacy preservation. The mature schemes in this area are insufficient. Finally, further research is needed on superimposing different privacy-preserving algorithms on each other in order to obtain better preservation quality.

2.2. Privacy measurement and evaluation

Research on privacy measurement and evaluation draws mainly on information theory and on specific application fields. Oya et al. [39] proposed a scheme using conditional entropy and mutual information as complementary privacy metrics. Ma and Yau [40] proposed a privacy metric for time-series data to quantify the amount of released information obtained by attackers. Cuff and Yu [41] used mutual information to describe the information obtained by attackers from observing data, and measured the decrease of uncertainty of the original data. Jorgensen et al. [42] combined the controllable character of the privacy budget with differential privacy, and generated noise calibrated to $\mathrm{lap}(\Delta f/\varepsilon)$ based on the privacy demands of the user, where $\mathrm{lap}(\cdot)$ is the Laplace distribution function, $\Delta f$ is the sensitivity of the data, and $\varepsilon$ is the privacy budget. When $\varepsilon$ decreases, the added noise increases, and the intensity of the privacy protection is higher. Asoodeh et al. [43] depicted the risk of privacy leakage with mutual information. They calculated the decrease of the uncertainty of private information in the original data during the release of the data. Zhao and Wagner [44] used four novel criteria to evaluate the strength of 41 privacy metrics for vehicular networks. Their results show that no metric performs well across all criteria and traffic conditions. Furthermore, research on application fields mainly focuses on social networking, location-based services, cloud computing, and so forth.
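The budget-calibrated noise mentioned above can be illustrated with a minimal sketch. It assumes the standard Laplace mechanism, in which noise is drawn from a Laplace distribution whose scale is the sensitivity divided by the privacy budget; the function names and the sampling approach are illustrative assumptions and are not taken from Jorgensen et al. [42].

```python
import random


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Scale of the Laplace noise: sensitivity divided by the privacy budget."""
    if sensitivity < 0 or epsilon <= 0:
        raise ValueError("sensitivity must be >= 0 and epsilon must be > 0")
    return sensitivity / epsilon


def laplace_noise(sensitivity: float, epsilon: float, rng: random.Random) -> float:
    """Draw one Laplace(0, scale) sample."""
    scale = laplace_scale(sensitivity, epsilon)
    if scale == 0:
        return 0.0
    # The difference of two independent exponential variables with mean
    # `scale` follows a Laplace distribution with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def mean_abs_noise(sensitivity: float, epsilon: float,
                   samples: int = 5000, seed: int = 0) -> float:
    """Empirical average magnitude of the added noise."""
    rng = random.Random(seed)
    total = sum(abs(laplace_noise(sensitivity, epsilon, rng)) for _ in range(samples))
    return total / samples


if __name__ == "__main__":
    for eps in (1.0, 0.5, 0.1):
        print(eps, laplace_scale(1.0, eps), round(mean_abs_noise(1.0, eps), 2))
```

A smaller budget gives a larger scale, so the average noise magnitude grows as the privacy requirement tightens, which is the utility cost noted for differential privacy in Section 2.1.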

In the field of social networking, with a focus on webpage searching, Gervais et al. [45] proposed a privacy-preserving scheme based on the obfuscating technique, and quantified the users’ privacy. Considering the different searching behaviors of users with various intentions, they designed a commonly used tool to measure their privacy-preserving scheme based on the obfuscating technique. Aiming at spatiotemporal connection, Cao et al. [46] used calculation to analyze the data and quantified the potential risks under a differential privacy technique through a formal description of privacy. With a focus on mobile crowd sensing, Luo et al. [47] proposed using the Salus algorithm, which preserves differential privacy, to protect private data against data-reconstruction attacks. They also quantified privacy risks, and provided accurate utility predictions for crowd-sensing applications containing Salus. For a scenario of social recommendation, Yang et al. [48] proposed PrivRank, a framework that protects users from inference attacks and guarantees personalized ranking-based recommendations. They utilized Kendall’s rank distance to measure data distortion, and minimized privacy leakage by means of optimal data obfuscation learning.

In the field of location-based services, with the goal of identifying the attacking model and the adversaries’ background knowledge, Shokri et al. [49] used information entropy to describe the precision, certainty, and validity for measuring the effectiveness of privacy preservation. In the Bayesian Stackelberg model of game theory [50], the user acts as the leader and the attacker acts as the follower. Kiekintveld et al. [51] proposed a framework to find the optimal privacy mechanism that is able to resist the strongest inference attack. Recently, Zhao et al. [52] proposed a privacy-preserving paradigm-driven framework for indoor localization (P3-LOC). This framework utilizes specially designed k-anonymity and differential privacy techniques to protect the data transmitted in their indoor localization system, which guarantees both the users’ location privacy and the location server’s data privacy. Zhang et al. [53] proposed a location privacy-preserving approach using power allocation strategies to prevent eavesdropping. Based on their highly accurate approximate algorithms, different power-allocation strategies were able to achieve a better tradeoff between localization accuracy and privacy.

In the field of cloud computing, as a service-oriented privacy-preserving framework, a privacy-preserving method called SAFE [54] implemented secure coordination for cross-neighbor interaction between the protocol and itself in cloud computing. Based on game theory and differential privacy, Wu et al. [55] quantified the game elements related to users at multiple levels. They also implemented users’ privacy measurement by analyzing a single dataset. Zhang et al. [56] used a definition of differentiation to quantify the level of privacy of participating users, and then to implement an accurate incentive mechanism. To preserve the data owner’s privacy in the cloud, Chaudhari and Das [57] presented a single-keyword-based searchable encryption scheme for applications in which multiple data owners upload their data and multiple users access the data.

Most of the above-mentioned schemes lack a unified definition of the concept of privacy. Moreover, the privacy metric varies dynamically with the subject receiving the information, the quantity of data, and the scenario, yet a dynamic privacy metric method is currently lacking. Finally, information is disseminated across information systems, but the above schemes lack consistency among different information systems and also lack a formalized description method for dynamic privacy quantification. Therefore, they are far from satisfying the dynamic requirements of privacy preservation in cross-platform private information exchange, extended authorization, and so on.

In summary, existing privacy-preserving technologies and privacy measurement methods are fragmented, and lack a formalized description method for the auditing of private information and constraint conditions. A scheme that integrates privacy preservation with the tracking and forensics of privacy infringement has not yet been considered. In addition, it is difficult to construct a uniform information system that covers all the stages of information collection, storage, processing, release, destruction, and so on.

3. Definition and framework of privacy computing

3.1. Concepts of privacy and privacy computing

3.1.1. Privacy right and private information

The legal definition of privacy emphasizes protecting an individual’s rights according to the law, and includes the requirement that personal information, activities, and spaces cannot be published, interfered with, or intruded upon illegally. It emphasizes the independence of privacy from public interests and group interests, including personal information that a person does not want others to know, personal affairs that a person does not want others to touch, and a personal area that a person does not want others to invade. The essence of the legal definition is in fact privacy rights.

This paper focuses on the full-life-cycle preservation of private information. More specifically, private information includes personal information that a person does not want others to know or that is inappropriate for others to know, as well as personal information that a person wants to be disseminated within an approved circle of personnel in a way he/she agrees with. Private information can be used to deduce a user’s profile, which may impact his/her daily life and normal work.

Academically speaking, private information is closely related to the spatiotemporal scenario and the cognitive ability of the subject. It shows dynamic perceptual results. Unlike the definition of privacy in the law, we mainly define and describe private information technically, in order to support research on various technical aspects such as the semantic understanding of privacy, privacy information extraction, the design of privacy-preserving algorithms, the evaluation of privacy-preserving effectiveness, and so forth.

3.1.2. Privacy computing

In general, privacy computing refers to a computing theory and methodology that can support the describing, measuring, evaluating, and integrating operations performed on private information during the processing of video, audio, image, graph, numerical value, and behavior information flow in a pervasive network. It comprises a set of symbolized and formulized privacy computing theories, algorithms, and application technologies with quantitative assessment standards and support for the integration of multiple information systems.

Privacy computing includes all computing operations by information owners, collectors, publishers, and users during the entire life-cycle of private information, from data generation, sensing, publishing, and dissemination, to data storage, processing, usage, and destruction. It supports privacy preservation for a massive number of users, with high concurrency and high efficiency. In a ubiquitous network, privacy computing provides an important theoretical foundation for privacy preservation.

From the perspective of full-life-cycle privacy preservation, we have constructed a framework of privacy computing, which is shown in Fig. 1. With the input of any format of plaintext information M, our framework first separates the whole process into a set of elements, as follows: semantic extraction, scenario extraction, private information transformation, integration of private information, privacy operation selection, privacy-preserving scheme selection/design, evaluation of privacy-preserving effect, scenario description, and feedback mechanism. We further implement the privacy computing framework by carefully organizing these elements into five steps, listed below.

Fig. 1. Privacy computing framework. $F$: privacy computation operation set; $A$: privacy attribute vector; $\Gamma$: location information set; $\Omega$: audit control information set; $\Theta$: constraint condition set; $\Psi$: dissemination control operation set; $\bar{X}$: normalized private information; $f$: privacy computing operation; $f(\bar{X})$: operated normalized private information.

Step 1: Extract private information. According to the format and semantics of plaintext information M, we first extract private information X and obtain the private information vector I; the details can be found in Section 3.2.

Step 2: Abstract the scenario. According to the type and semantics of each private information element in I, we then define and abstract the application scenario. Also, the extracted private information should be re-organized by the transformation and integration.

Step 3: Select the privacy operation. According to the privacy operations supported by each private information component $i_k$, we select and generate a dissemination control operation set.

Step 4: Select or design the privacy-preserving scheme. According to the application requirements, we select or design an appropriate privacy-preserving scheme. If capable schemes are available, they can be selected directly. Otherwise, we have to design new schemes.

Step 5: Evaluate the privacy-preserving effectiveness. According to relevant assessment criteria, we assess the privacy-preservation effectiveness of the selected privacy-preserving scheme, by employing measurements such as an entropy-based or distortion-based privacy metric. Details on assessing privacy-preservation effectiveness can be found in Section 3.5.

If the evaluation result of the privacy preservation does not meet the expected requirements, the feedback mechanism is executed. This mechanism consists of three situations: ① If the application scenario is mis-abstracted, it should be re-abstracted iteratively; ② if the application scenario is abstracted properly but the privacy operation is selected improperly, the privacy operation should be re-organized; or ③ if the application scenario and privacy operation are selected correctly, the privacy-preserving scheme should be adjusted or improved to eventually achieve a satisfactory effectiveness of privacy preservation.

It is notable that these elements and steps can be combined freely according to the specific scenario; the process is depicted in Fig. 1.
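The five steps and the feedback mechanism can be sketched as a small control loop. This is a toy illustration under strong assumptions rather than the framework's implementation: private information elements are whitespace-separated tokens, the scenario is reduced to a caller-supplied set of sensitive tokens, the evaluation metric is the fraction of sensitive tokens still readable, and only feedback situation ③ (adjusting the scheme) is modeled. All names (`protect`, `evaluate_leak`, the scheme list) are hypothetical.

```python
from typing import Callable, List, Sequence, Set, Tuple

Scheme = Callable[[str], str]


def evaluate_leak(protected: Sequence[str], sensitive: Set[str]) -> float:
    """Step 5 (toy metric): fraction of sensitive elements still readable."""
    if not sensitive:
        return 0.0
    exposed = {token for token in protected if token in sensitive}
    return len(exposed) / len(sensitive)


def protect(message: str, sensitive: Set[str],
            schemes: List[Tuple[str, Scheme]],
            max_leak: float) -> Tuple[str, List[str], float]:
    elements = message.split()                 # Step 1: extract elements
    present = sensitive & set(elements)        # Step 2: abstract the scenario
    for name, scheme in schemes:               # Steps 3-4: pick an operation/scheme
        protected = [scheme(e) if e in present else e for e in elements]
        leak = evaluate_leak(protected, present)   # Step 5: evaluate
        if leak <= max_leak:
            return name, protected, leak
        # Feedback: the requirement is not met, so try the next scheme.
    raise ValueError("no candidate scheme met the requirement")


if __name__ == "__main__":
    candidates = [("identity", lambda e: e), ("mask", lambda e: "*" * len(e))]
    print(protect("U1 and U2 went to Loc", {"U1", "U2", "Loc"}, candidates, 0.0))
```

In this sketch the identity scheme fails the evaluation, so the loop falls through to masking, which mirrors the adjust-and-re-evaluate behavior of the feedback mechanism.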

3.2. Formalization of privacy computing

In this section, we first define private information X and its six basic components, along with related axioms, theorems, and assumptions; these provide a foundation to describe the other part of privacy computing. It is noted that extraction methods for the private information vector of any information M are outside the scope of this paper, as they are subject to domain-specific extraction conditions. The quantification of private information contained in content is also outside the scope of this paper, as it is the task of the programmer or modeler of the information system.

Definition 1: Private information X consists of six components, $X = (I, A, \Gamma, \Omega, \Theta, \Psi)$: namely, the private information vector, privacy attribute vector, location information set, audit control information set, constraint condition set, and dissemination control operation set, respectively.

Definition 2: The private information vector $I = (\mathrm{IID}, i_1, i_2, \ldots, i_k, \ldots, i_n)$, where $i_k$ ($1 \le k \le n$) is a private information element. Each $i_k$ represents semantically informational and indivisible atomic information. Information types include text, audio, video, images, and so forth, and/or any combination of these types. Semantic characteristics include words, phrases, tone of voice, pitch of tone, phoneme, sound, frame, pixels, color, and so forth, and/or their combination. These are used to represent atomic information that is semantically informative, indivisible, and mutually disjoint within information M. IID is the unique identifier of the private information vector, and is independent of the private information elements. For example, in the text “U1 and U2 went to Loc to drink beer,” the private information vector is $I$ = (IID, U1, and, U2, went to, Loc, to drink, beer). In this case, n = 7. Note that certain special pieces of information, such as proverbs, can be effectively divided by natural language processing-based solutions.
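As a concrete reading of Definition 2, the following sketch stores a private information vector as an identifier plus a tuple of atomic elements, using the sentence from the example. The class name and fields are hypothetical, and the segmentation into elements is supplied by hand, since extraction methods are outside the scope of this paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PrivateInformationVector:
    """Toy container: a unique IID plus n atomic private information elements."""
    iid: str
    elements: Tuple[str, ...]

    def __post_init__(self) -> None:
        # Elements are required to be non-empty and mutually distinct here,
        # as a simple stand-in for "indivisible and mutually disjoint".
        if any(not e for e in self.elements):
            raise ValueError("elements must be non-empty")
        if len(set(self.elements)) != len(self.elements):
            raise ValueError("elements must be mutually distinct")

    @property
    def n(self) -> int:
        """Number of private information elements (IID is not counted)."""
        return len(self.elements)


vector = PrivateInformationVector(
    iid="IID",
    elements=("U1", "and", "U2", "went to", "Loc", "to drink", "beer"),
)

if __name__ == "__main__":
    print(vector.n)
```

Keeping IID separate from the elements reflects the statement that the identifier is independent of the private information elements.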

Axiom 1: Within a natural language and its grammar rules, and within the granularity of words, phrases, and slang, the number of elements of private information vector I is bounded.

Property 1: The private information vector conforms to the first normal form (1NF) and the second normal form (2NF).

The private information component $i_k$ is defined as the smallest granularity that cannot be divided further, which is called the atomic property. The 1NF is a property of a relation in a relational database. A relation R is in 1NF if and only if the domain of each attribute contains only atomic values, and each attribute can only have a single value from that domain. Under this definition, $I$ conforms to 1NF. Meanwhile, the private information vector I has the unique identification IID as a primary key. Other non-primary attributes are all dependent on this primary key. A relation R is in 2NF if it is in 1NF and every non-primary attribute of the relation is dependent on the unique primary key. Therefore, $I$ conforms to 2NF.

Definition 3: The constraint condition set is denoted by $\Theta = \{\theta_1, \theta_2, \ldots, \theta_k, \ldots, \theta_n\}$, where $\theta_k$ is a constraint condition vector corresponding to the private information component $i_k$. $\theta_k$ describes the permissions necessary for an entity to access $i_k$ in different scenarios (such as who, at what time, using what devices, and by what means of access), and uses the privacy attribute component and the duration of usage of the private information vector. Only entities that satisfy the constraint condition vector $\theta_k$ can access the private information component $i_k$. An entity can be an owner, a receiver, or a publisher of the information.

Definition 4: The privacy attribute vector is $A = (a_1, a_2, \ldots, a_k, \ldots, a_n, a_{n+1}, \ldots, a_m)$, where $a_k$ denotes the privacy attribute component and is used to measure the degree of private information preservation. In practical applications, different private information components are able to form weighted dynamic combinations in different scenarios. These combinations will produce new private information. However, based on the atomicity of the private information components, we represent the private information preservation degree of different combinations of $i_k$ with the privacy attribute component. When $1 \le k \le n$, there is a one-to-one correspondence between $a_k$ and $i_k$; when $n < k \le m$, $a_k$ represents the private information preservation degree of the combination of two or more private information components.

We set $a_k \in [0, 1]$; $a_k = 0$ indicates that the private information component $i_k$ has the highest degree of preservation. Under this condition, information M is not shared; that is, there is no possibility of any leakage, because the information is protected to the highest degree. In that case, the mutual information between the protected private information and the original private information is 0. For example, in cryptography-based privacy-preserving methods, $a_k = 0$ means that the secret key has been lost and the information cannot be reversed; in cases in which noise injection, anonymization, or other irreversible techniques have been applied, $a_k = 0$ indicates that the degree of distortion of the data has led to a complete irrelevancy between the processed information and the initial information. $a_k = 1$ indicates that $i_k$ is not protected and can be published freely without limit. The other values between 0 and 1 represent different degrees of private information preservation. The lower the value is, the higher the degree of private information preservation is.

The privacy-preserving quantitative operation function is denoted by $\sigma$; this can be a manually labelled function, a weighting function, and so forth. Since different types of private information correspond to different kinds of operation functions, the resulting privacy attribute components are also different, and are expressed by $a_k = \sigma(i_k, \theta_k)$, where $1 \le k \le n$. For any combination of private information components $i_{k_1}, i_{k_2}, \ldots, i_{k_s}$, we denote it as $i_{n+j} = i_{k_1} \vee i_{k_2} \vee \cdots \vee i_{k_s}$, where $\vee$ stands for the combination operation of the private information components. Given the privacy-preserving quantitative operation function $\sigma$ and the privacy attribute component $a_{n+j}$, we have $a_{n+j} = \sigma(i_{n+j}, \theta_{n+j})$, where $1 \le j \le m - n$ and $\theta_{n+j}$ denotes the constraint conditions that apply to the combination. The privacy attribute vector $A$ is generated by the private information components $i_k$ and their combination vectors $i_{n+j}$. The relationship between the private information vector and the privacy attribute vector can be denoted by $A = \sigma(I, \Theta)$. As quantitative operation and constraints go hand in hand, the results of a quantitative operation will vary with different scenarios and entities.
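A manually labelled quantification function of the kind mentioned above can be sketched as a lookup table. The table contents, the function name `sigma`, and the default value are illustrative assumptions; scenario- and entity-dependent constraints are deliberately left out to keep the example short.

```python
from typing import Dict, FrozenSet, Iterable

# Hypothetical manual labels: a lower value means a higher degree of preservation.
LABELS: Dict[FrozenSet[str], float] = {
    frozenset({"U1"}): 0.6,
    frozenset({"Loc"}): 0.8,
    # A combination can be more sensitive than either component alone.
    frozenset({"U1", "Loc"}): 0.2,
}


def sigma(components: Iterable[str],
          labels: Dict[FrozenSet[str], float] = LABELS,
          default: float = 1.0) -> float:
    """Return the privacy attribute value for a component or a combination.

    Unlabelled inputs fall back to `default` (1.0 means "not protected").
    """
    value = labels.get(frozenset(components), default)
    if not 0.0 <= value <= 1.0:
        raise ValueError("privacy attribute values must lie in [0, 1]")
    return value


if __name__ == "__main__":
    print(sigma(["U1"]), sigma(["Loc"]), sigma(["U1", "Loc"]), sigma(["and"]))
```

The combination of a person and a location receives its own value rather than one derived from its parts, which is why the privacy attribute vector can have more entries than the private information vector.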

Theorem 1: For a specific private information vector $I$, if the number of its components is bounded, the dimensionality of its corresponding privacy attribute vector is bounded. When each binary or multiple combination of the components of I corresponds to only one privacy attribute component, the number of privacy attribute components m satisfies $m \le 2^n - 1$.

Proof: According to Definition 1 and Axiom 1, the dimension of private information vector I is limited and denoted by n. According to the definition of a privacy attribute vector, its privacy attribute components correspond to private information components and their combination vectors; thus, the size of a privacy attribute vector is limited. When each combination of private information components corresponds with one privacy attribute, the maximum size of the privacy attribute vector is the number of single components plus the number of all the combinations of the private information components, including 2 to n-size combinations, denoted as $C_n^1 + C_n^2 + \cdots + C_n^n = 2^n - 1$; hence, the inequality $m \le 2^n - 1$ holds.
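The counting argument in the proof can be checked numerically. The sketch below assumes the bound reads $m \le 2^n - 1$, that is, n single components plus every combination of two or more components; the function names are illustrative.

```python
from itertools import combinations
from math import comb
from typing import List, Sequence, Tuple


def max_privacy_attributes(n: int) -> int:
    """Upper bound on m: n singles plus all combinations of size 2..n."""
    singles = n
    combos = sum(comb(n, size) for size in range(2, n + 1))
    return singles + combos


def attribute_slots(elements: Sequence[str]) -> List[Tuple[str, ...]]:
    """Enumerate every single component and every combination explicitly."""
    return [c for size in range(1, len(elements) + 1)
            for c in combinations(elements, size)]


if __name__ == "__main__":
    for n in range(1, 8):
        assert max_privacy_attributes(n) == 2 ** n - 1
    print(len(attribute_slots(["U1", "U2", "Loc"])))
```

The bound grows exponentially in n, so a practical system would label only the combinations that matter in its scenario rather than all of them.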

Definition 5: The location information set is $\Gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_k, \ldots, \gamma_n\}$, where $\gamma_k$ denotes the location information vector, which stands for the location information and attribute information of $i_k$ within information M. Using $\gamma_k$, the private information component $i_k$ can be quickly positioned. The location information describes the specific location of $i_k$ in M, such as its page number, chapter, paragraph, serial number, coordinates, frame numbers, time period, audio track, layer, pixels, and so forth. In a text file, location information mainly includes the page number, section, paragraph, serial number, and so forth, while the attribute information mainly includes the font, font size, thickness, italics, underline, strikeout, superscript, subscript, style, line spacing, and so forth. Attribute information in an audio or video file includes font, size, font weight, line spacing, pixels, color, brightness, tone, intonation, and so forth.

Definition 6: The audit control information set is $\Omega = \{\omega_1, \omega_2, \ldots, \omega_k, \ldots, \omega_n\}$, where $\omega_k$ denotes a specific audit control information vector during the propagation process of $i_k$. This records subjective and objective information, such as the information owner, information sender, information receiver, information sending device, information receiving device, information transmission pattern, and information transmission channel, as well as the operations performed on them during the transfer process. These operations include copy, paste, cut, forward, modify, delete, and so forth. If the private information is revealed, the source of the leakage point can be tracked.

Definition 7: The dissemination control operation set is $\Psi = \{\psi_1, \psi_2, \ldots, \psi_k, \ldots, \psi_n\}$, where $\psi_k$ denotes the dissemination control operation vector. This describes the operations that can be performed on $i_k$ and their combinations, such as copy, paste, cut, forward, modify, delete, and so forth. These operations will not break the atomicity of I. We set $\psi_k = \mathrm{judg}(a_k, \theta_k)$, where the constraint condition vector $\theta_k \in \Theta$, and judg is the operation judgment function, including artificial markers, weighting function, and so forth.
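One possible shape for an operation judgment function is sketched below. It assumes that judg takes a privacy attribute value and a constraint condition vector and returns the set of permitted operations; the `Constraint` fields, the extra `entity` and `device` arguments, and the 0.5 threshold are all hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet

ALL_OPERATIONS: FrozenSet[str] = frozenset(
    {"copy", "paste", "cut", "forward", "modify", "delete"}
)


@dataclass(frozen=True)
class Constraint:
    """Toy constraint condition vector: who may access, and on which devices."""
    entities: FrozenSet[str]
    devices: FrozenSet[str]

    def satisfied_by(self, entity: str, device: str) -> bool:
        return entity in self.entities and device in self.devices


def judg(a_k: float, theta_k: Constraint, entity: str, device: str) -> FrozenSet[str]:
    """Return the operations permitted on one private information component."""
    if not 0.0 <= a_k <= 1.0:
        raise ValueError("a_k must lie in [0, 1]")
    if not theta_k.satisfied_by(entity, device):
        return frozenset()              # constraint not met: no access at all
    if a_k == 0.0:
        return frozenset()              # highest preservation: not shared
    if a_k == 1.0:
        return ALL_OPERATIONS           # unprotected: every operation allowed
    if a_k >= 0.5:                      # hypothetical threshold
        return frozenset({"copy", "forward"})
    return frozenset({"copy"})


if __name__ == "__main__":
    theta = Constraint(frozenset({"alice"}), frozenset({"phone"}))
    print(sorted(judg(0.6, theta, "alice", "phone")))
```

A lower privacy attribute value yields a smaller operation set, which is consistent with the convention that lower values indicate a higher degree of preservation.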

Axiom 2: During the cross-information-system exchange process, if both of the information control sides that extend authorization cannot perform the exchange completely and effectively, there must be leakage of private information.

Assumption 1: Privacy computing can be defined as a set of finite atomic operations. The other operations are combinations of these finite atomic operations.

Assumption 2: Privacy computing is established under the condition that the number of private information components is finite.

《3.3. Four principles for privacy computing》

3.3. Four principles for privacy computing

The four principles of privacy computing are as follows:

Principle 1: Atomicity. Private information components are independent of each other; they can be divided to a minimum granularity and cannot be divided further.

Principle 2: Consistency. For the same private data, various privacy-preserving algorithms all aim to make all the components of privacy attribute vector A approach 0. Even though they have different privacy-preserving degrees, they have similar aims.

Principle 3: Sequence. In a privacy-preserving algorithm, different orders of some operations may lead to different levels of preservation effectiveness.

Principle 4: Reversibility. Some privacy-preserving operations are reversible: for example, the output of an encryption-based algorithm can be restored by decryption. Others process private information irreversibly, so the original cannot be recovered.
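Principles 3 and 4 can be illustrated with a small sketch. The Python below is not part of the framework; the helper names (`generalize`, `perturb`, `xor_mask`), the bucket width, the key, and the fixed offset that stands in for noise are all assumptions chosen to keep the example deterministic.

```python
def generalize(value: int, width: int = 10) -> int:
    """Irreversible: map a value to the lower bound of its bucket."""
    return (value // width) * width


def perturb(value: int, offset: int = 3) -> int:
    """Stand-in for noise addition; a fixed offset keeps the example deterministic."""
    return value + offset


def xor_mask(value: int, key: int) -> int:
    """Reversible: applying the same key a second time restores the input."""
    return value ^ key


age = 47

# Principle 3 (sequence): the same two operations applied in a different
# order produce different protected values.
generalize_then_perturb = perturb(generalize(age))  # 40 + 3
perturb_then_generalize = generalize(perturb(age))  # bucket of 50

# Principle 4 (reversibility): masking can be undone with the key, whereas
# generalization maps several inputs to one output and cannot be undone.
key = 0b101101
restored = xor_mask(xor_mask(age, key), key)
buckets = {generalize(v) for v in (41, 47, 49)}
```

Here the two orderings yield 43 and 50, the masked value is restored to 47, and the three distinct ages collapse into the single bucket 40.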

《3.4. Characterization elements of privacy computing》

3.4. Characterization elements of privacy computing

Definition 8: Privacy computing covers four factors (X, F, C, Q), where X is the private information (see Definition 1 for more detail), F is the privacy computation operation set, C is the cost of privacy preservation, and Q is the effectiveness of privacy preservation.

Definition 9: The privacy computation operation set F is the set of atomic operations on X, such as modular addition, modular multiplication, modular exponentiation, insert, delete, and so forth. A privacy-preserving algorithm is composed of multiple elements of this set, and each element can be used multiple times.

Privacy perception, privacy preservation, privacy analysis, the exchange of private information and secondary transmission, the integration of private information, updating private information, and so forth, are defined as specific operations. These operations consist of the combination of several atomic operations.

Axiom 3: After privacy operations have been performed on information M, the private information vector changes from I to I', which in turn changes the privacy attribute vector from A to A'. The number and values of the components may also change. In brief, when a privacy operation f from F is applied to I, the result is I' = f(I), and A' is the privacy attribute vector of I'.

Definition 10: The cost of privacy preservation C quantifies the resources required to achieve a certain level of privacy preservation on information M; these may include computation, storage, network transmission, and computational complexity. Each private information component ik has a corresponding cost of privacy preservation Ck, which is a function of ik, its constraint condition vector, and the privacy computing operation vector applied to it.

Each ik may comprise a different type of information. For example, a Word file contains characters and images, and may even contain audio clips. Hence, the cost function for ik takes different forms depending on the type of information. The cost C can be described by the vector {Ck}.

Definition 11: The effectiveness of privacy preservation Q represents the level of privacy preservation on information M, namely, the difference in the privacy metric before and after privacy preservation. In general, we need to consider the private information vectors of information M, the information access entities (including information owners, information receivers, information publishers, and participants in the information creation and transfer process), constraint conditions, privacy computing operations, and other elements. In previous sections, we introduced the privacy metric, namely, the privacy attribute component, whose defining function already reflects the impact of privacy computing operations. The definition of constraint conditions also covers the information access entities. Therefore, the effectiveness of privacy preservation corresponding to each private information component can be expressed as follows:

where  is the privacy metric function before privacy preservation and is the privacy metric function after privacy preservation.

Definition 12: The profit and loss ratio of privacy disclosure L = {Lk} represents the ratio between profit and loss after privacy disclosure. The relationship between L, the cost of privacy preservation C, and the effectiveness of privacy preservation Q can be described as follows:

The core idea of the privacy computing model is to describe the relationships among the four factors of privacy computing and the profit and loss ratio of privacy disclosure, L.

《3.5. Evaluation method of privacy preservation》

3.5. Evaluation method of privacy preservation

Definition 13: A privacy-preserving algorithm or scheme is a combination of operations drawn from the privacy computing operation set F. After the combined operation has been performed on private information vector I, each component of the corresponding privacy attribute vector A approaches 0.

In brief, an algorithm is called a privacy-preserving algorithm if, after it has been applied to I, a chosen measure of the resulting privacy attribute vector A approaches 0. The measure can be, for example, the L2 norm, that is, the square root of the sum of the squares of the vector's components.

Definition 14: The evaluation of the privacy-preserving effect refers to the evaluation of the privacy attribute vector of the new private information vector I' after different privacy-preserving operations have been performed on I. In brief, the closer the measure of this privacy attribute vector is to 0, the better the effectiveness of the privacy-preserving algorithm.
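As a minimal sketch of this evaluation, the Python below compares two candidate schemes by the L2 norm of the privacy attribute vector they leave behind. The vectors, the scheme names, and the helper `better_scheme` are hypothetical; the only assumption taken from the text is that components closer to 0 mean stronger preservation.

```python
import math


def l2_norm(vector):
    """Square root of the sum of the squared components."""
    return math.sqrt(sum(a * a for a in vector))


def better_scheme(candidates):
    """Return the name of the candidate whose attribute vector has the smallest norm."""
    return min(candidates, key=lambda name: l2_norm(candidates[name]))


# Hypothetical privacy attribute vectors after two different schemes.
candidates = {
    "scheme_1": [0.3, 0.4, 0.0],
    "scheme_2": [0.6, 0.0, 0.8],
}

norms = {name: l2_norm(vec) for name, vec in candidates.items()}
preferred = better_scheme(candidates)
```

With these illustrative values, the norm for `scheme_1` is about 0.5 and the norm for `scheme_2` is about 1.0, so `scheme_1` is preferred under this measure. A different norm could rank the candidates differently, which is why the measure has to be fixed before schemes are compared.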

Theorem 2: For specific private information content and a relative privacy-preserving algorithm, the effectiveness of privacy preservation Q is measurable.

Proof: According to Definition 2, Axiom 1, and Definition 4, any specific information can be described by the private information vector I, which can be further divided into a limited number of private information elements . Here, we assume that . Since each of the private information elements and combinations can be measured by the privacy attribute vector A =, where ,  and  is a small amount relative to , this vector represents the deviation during the computation of . In our framework, we set  [0; 1] , where = 0 indicates that private information component has the highest degree of preservation, and = 1 means that is not protected and can be published freely without any limit. That is to say, we are able to calculate a specific value for each , with an acceptable deviation in the worst case. Based on Definition 11, , where  is a type of operation. For simplicity, we use "+” here directly. Since , we set . Therefore, the effectiveness of privacy preservation is measurable.

The effectiveness evaluation mainly covers the utility of private information after preservation, the irreversibility of privacy preservation, and the reversibility of privacy preservation in controlled environments. The utility of private information refers to the impact that the new information has on the function or performance of the information system after the privacy-preserving algorithm has been executed. The irreversibility of privacy preservation means that no third party or attacker can deduce the original private information from the privacy-preserving algorithm and the information they obtain. In a controlled environment, the reversibility of privacy preservation means that third parties can restore all the information based on partially known information. As such, this paper generalizes the evaluation of the privacy-preserving effect into five indicators: reversibility, extended controllability, deviation, complexity, and information loss.

3.5.1. Reversibility

Reversibility refers to the ability of private information to be restored after the execution of the privacy-preserving algorithm. Specifically, it is the ability of an attacker or third party to deduce the original private information component from the observed (processed) component. If the original component can be inferred accurately, the processing is reversible; otherwise, it is not.

For example, when data needs to be published, we first assess the attack-resistance ability of the selected privacy-preserving scheme under different attacks. Then, based on the data after the execution of the privacy-preserving algorithm, we compute the privacy attribute vector. Furthermore, we determine the degree of possible restoration of unauthorized information and authorized information under different attacks.

Conjecture 1: If privacy-preserving policies do not match each other, then a reversible privacy-preserving algorithm may lead to privacy leakage after the private information is disseminated across different trustable domains.

3.5.2. Extended controllability

Extended controllability refers to the degree of matching between the receiver's effectiveness of privacy preservation and the sender's requirements for privacy preservation during cross-information-system exchange. More specifically, when private information X is transferred from information system Sys1 to information system Sys2, it refers to the dissimilarity between the privacy attribute component in Sys1 and the corresponding privacy attribute component in Sys2. In brief, if every privacy attribute component keeps the same value in both information systems, the extended control is well maintained; otherwise, the extended authorization has deviated. For example, users Alice, Bob, and Charles are friends. Alice publishes private information in WeChat and sets up a sharing list that allows Bob to access this information but prohibits Charles from doing so. However, Bob forwards this information to Weibo without any access restrictions. In this situation, Charles can see the information, so the access privileges that Alice set in WeChat are no longer enforced in Weibo.
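The Alice, Bob, and Charles example can be sketched as a comparison of access policies in the two systems. This is only an illustration: the policy representation (a set of permitted readers, with `None` meaning no restriction) and the function names are assumptions, not part of the framework.

```python
def allowed_readers(policy, users):
    """Users who may read under a policy; None means no restriction at all."""
    return set(users) if policy is None else set(policy) & set(users)


def extended_control_violations(sender_policy, receiver_policy, users):
    """Users who gain access at the receiver although the sender excluded them."""
    return allowed_readers(receiver_policy, users) - allowed_readers(sender_policy, users)


users = {"Alice", "Bob", "Charles"}
wechat_policy = {"Alice", "Bob"}  # Alice's sharing list in WeChat
weibo_policy = None               # Bob reposts to Weibo without restrictions

leaked_to = extended_control_violations(wechat_policy, weibo_policy, users)
preserved = extended_control_violations(wechat_policy, wechat_policy, users)
```

An empty result means the extended control is maintained; here `leaked_to` contains Charles, which matches the deviation described in the example.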

3.5.3. Deviation

Deviation refers to the dissimilarity between the original private information component and the private information component observed by attackers or third parties after the execution of a privacy-preserving algorithm. For example, in location privacy, the physical distance between a mobile user's real location (m, n) and the processed location (m', n') produced by a location-based privacy-preserving scheme can be calculated as $\sqrt{(m - m')^2 + (n - n')^2}$.
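A minimal Python sketch of the location example, assuming planar coordinates so that the deviation is the Euclidean distance between the real and the processed location; the function name and the sample points are illustrative.

```python
import math


def deviation(real, processed):
    """Euclidean distance between the real location and the processed location."""
    (m, n), (m_p, n_p) = real, processed
    return math.hypot(m - m_p, n - n_p)


d = deviation((3.0, 4.0), (0.0, 0.0))
same = deviation((1.5, 2.5), (1.5, 2.5))
```

For geographic coordinates (latitude and longitude), a great-circle distance would be the more appropriate measure; that variant is not shown here.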

3.5.4. Complexity

Complexity refers to the cost of executing a privacy-preserving algorithm, which is similar to the cost of privacy preservation, C. For example, if a user uses a handheld terminal to execute 2048-bit Rivest–Shamir–Adleman (RSA) encryption, the computational cost is greater than that of a single execution of the AES algorithm.

3.5.5. Information loss

Information loss refers to the loss of information utility after information is processed by an irreversible privacy-preserving algorithm, such as information confusion or information obfuscation.

For location privacy, if mobile users submit their real location to the server without a k-anonymity process, they can receive accurate service information. If they employ k-anonymity to process locations, they will receive coarse-grained service information, and the proportion of unavailable results will increase. This results in a certain loss of information availability.
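The trade-off between k and service accuracy can be made concrete with a simplified spatial-cloaking sketch. It is a stand-in for a real k-anonymity scheme, not the method of any particular system: the function `cloak`, the square cloaking region, the step size, and the sample coordinates are all assumptions.

```python
def cloak(user, others, k, step=1.0):
    """Grow a square around `user` until it covers at least k users (including `user`).

    Returns the half-width of the square, used here as a rough proxy for
    information loss: the larger the region, the coarser the service result.
    """
    if k > len(others) + 1:
        raise ValueError("not enough users to reach k-anonymity")
    half = 0.0
    while True:
        inside = 1 + sum(
            abs(x - user[0]) <= half and abs(y - user[1]) <= half
            for x, y in others
        )
        if inside >= k:
            return half
        half += step


user = (0.0, 0.0)
others = [(1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (5.0, 1.0)]

# Half-width of the cloaking region for increasing k.
loss_by_k = {k: cloak(user, others, k) for k in (1, 2, 3, 5)}
```

With k = 1 the exact location is submitted and the region has zero width; as k grows, the region widens and the location reported to the server becomes coarser.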

《3.6. Design principles for privacy-preserving algorithms》

3.6. Design principles for privacy-preserving algorithms

Although the requirements of privacy preservation for different scenarios and information categories vary greatly, some common criteria exist in the design of privacy-preserving algorithms. According to the concept of privacy computing, we summarize five basic criteria for the design of privacy-preserving algorithms.

Criterion 1: Pre-processing. First, we need to preprocess the private information X to determine the data distribution character, its value range, the privacy-preserving sensitivity, the expected value of privacy-preserving operations, empiric value, and so forth. For example, the expected value of privacy-preserving operations can be denoted as time =.

Criterion 2: Algorithmic framework. Based on the scenarios and information categories, the mathematical foundation of the privacy-preserving algorithm can be determined, including the procedures and their combination relationships, and the relationship between the privacy attribute vector and the private information vector. For example, in a scenario in which irreversible operations for privacy preservation are allowed, techniques based on generalization, obfuscation, anonymity, differential privacy, and so forth can be employed. Taking differential privacy as an example, the specific mechanism of noise addition should be determined by following the guidance of Criterion 1 and considering elements including .

Criterion 3: Design of the algorithm parameter. According to the requirements of the privacy preservation effect and usability, the relevant parameters of the privacy-preserving algorithm can be determined by considering Criteria 1 and 2. For example, the expected times of privacy operations should be determined based on the requirements of privacy preservation for differential privacy mechanisms. Furthermore, the sensitivity and empiric value of the privacy operation results should also be determined based on the query function. We can then determine the specific distribution of noise by combining under the guidance of Criterion 2.

Criterion 4: Algorithm combination. To improve the security and the performance of the algorithm, we combine different procedures within a particular algorithm or between similar algorithms based on application scenarios and information characteristics. Taking differential privacy as an example, we can achieve a flexible combination of different procedures within one algorithm by considering factors such as , along with some of the composition properties of differential privacy, including postprocessing, sequential composition, and parallel composition properties. In the case of complex requirements regarding privacy preservation—such as a scenario in which data is published while statistical characteristics and anonymity are simultaneously emphasized—we need to consider the characteristics of different algorithms with similar mathematical mechanisms, and organically integrate such algorithms to satisfy the requirements of privacy preservation during the processing of private information. In this way, the security and performance of the algorithm can be greatly improved.

Criterion 5: Analysis of algorithm complexity and efficiency. In order to evaluate whether the selected algorithm adapts to the corresponding scenario, we need to comprehensively analyze and evaluate the implementation cost of the privacy-preserving algorithm while considering factors such as the number of private information components that need to be protected, the value range of the security parameters, time and space complexity, and the expectation of effectiveness of privacy preservation.

In the following discussion, we explain the applicability of the above-mentioned criteria to a differential privacy mechanism.

(1) Pre-processing: In a differential privacy algorithm, denote the dataset as X. With X, the constraint condition set , propagation control operation set , and private information vector set can be generated. By analyzing the distribution characteristic of , we can determine the value range of I or the value set Ran. Then, based on the statistical query function , which is defined over I, we can determine the expected value of query numbers and the empiric value of query results . We can obtain the noise value space or value set , and compute the sensitivity of the query function . For a statistical function , which is defined on the subset D of I, the sensitivity can be described as follows:

where D1, D2 I are two arbitrary datasets. When the difference between D1 and D2 is up to one element, we call them neighboring sets. Moreover, in Lp norm, , and .
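For reference, the sensitivity of a query function under the Lp norm is conventionally written as below. This is the textbook form, given as a sketch rather than as the exact expression used in this framework; the symbol $\Delta g$ and the relation $D_1 \sim D_2$ for neighboring sets are notational assumptions.

```latex
\Delta g \;=\; \max_{\substack{D_1, D_2 \subseteq I \\ D_1 \sim D_2}}
  \bigl\| g(D_1) - g(D_2) \bigr\|_p ,
\qquad
\| v \|_p \;=\; \Bigl( \sum_{j} |v_j|^{p} \Bigr)^{1/p}
```

For a counting or summation query over records that each contribute at most 1, adding or removing one element changes the result by at most 1, so the sensitivity under this definition is 1 for p = 1.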

(2) Algorithmic framework: Based on the preprocessing result, the mathematical definition of the differential privacy mechanism can be represented as follows, while fully considering the cost of privacy preservation C, the effectiveness of privacy preservation Q, and so on:

 represents the extended privacy estimate, where   is a constant number that is related to the noise distribution,  is related to the expected value of query numbers, and is related to the empiric value of query results. In addition,  is the correction parameter, which is used to soften conditions in order to make algorithms satisfy the definition of differential privacy. Furthermore, D1 and D2 are a couple of neighboring sets, and Alg is a randomized algorithm.

Then, the framework of differential privacy can be described as follows:


Do Alg

where Noise (·) is the noise function set, which generates noise satisfying the – DP condition (DP is differential privacy);  is the expected value of generated noise;   is the scale parameter function used in controlling the range of operating distribution; and  is the utility function that controls the probability expectation of a certain result being generated with the noise-processed data. In practice, the distribution of noise and the parameters of the algorithm should be selected according to application scenarios and information categories.
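One common concrete choice for the noise function is the Laplace mechanism, sketched below in Python. The framework leaves the noise distribution open, so this is an illustration under the assumption of pure ε-differential privacy with Laplace noise of scale sensitivity/ε; the function names and the sample query value are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_answer: float, sensitivity: float, epsilon: float,
                      rng: random.Random) -> float:
    """Return the true answer plus Laplace noise with scale sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return true_answer + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_answer = 1000.0
noisy = [laplace_mechanism(true_answer, sensitivity=1.0, epsilon=0.5, rng=rng)
         for _ in range(20000)]

mean_noisy = sum(noisy) / len(noisy)
mean_abs_error = sum(abs(v - true_answer) for v in noisy) / len(noisy)
```

The noise has mean 0, so `mean_noisy` stays close to the true answer, while the expected absolute error equals the scale (here 1.0 / 0.5 = 2). A smaller ε gives stronger preservation at the price of a larger error.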

(3) Design of algorithm parameters: Based on the users’ requirements for privacy-preserving strength and usability, and in combination with the value range Ran of the private information vector I, the expected value of query numbers , and so forth, we can determine the specific parameters of noise distribution. To be specific,  is related to the mean demand of outputs; since is related to , the sensitivity of dataset , the value space or value set S of noise addition, and so forth, we can infer that . Moreover,  is related to the empiric value of the querying result from S; therefore, .

(4) Algorithm combination: The differential privacy mechanism has the following features:

• Post-processing property. If Alg1(·) satisfies ε-DP, then for an arbitrary algorithm Alg2(·), including a randomized one, the composed algorithm Alg2(Alg1(·)) also satisfies ε-DP.

• Sequential composition. If Alg1(·) satisfies ε1-DP and, for arbitrary s, Alg2(s, ·) satisfies ε2-DP, then the combined algorithm Alg(D) = Alg2(Alg1(D), D) satisfies (ε1 + ε2)-DP.

• Parallel composition. If Alg1(·), Alg2(·), …, Algk(·) are algorithms that satisfy ε1-DP, ε2-DP, …, εk-DP, respectively, and D1, D2, …, Dk are k disjoint datasets, then (Alg1(D1), Alg2(D2), …, Algk(Dk)) satisfies max(ε1, ε2, …, εk)-DP.

Based on the above-mentioned three features, different steps can be combined to construct a differential privacy-preserving algorithm that supports different datasets and multiple query statistics.
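The three properties translate directly into simple budget accounting. The sketch below assumes pure ε-differential privacy; the function names and the query plan are hypothetical and serve only to show how the rules combine.

```python
def sequential_budget(epsilons):
    """Mechanisms applied to the same data: the budgets add up."""
    return sum(epsilons)


def parallel_budget(epsilons):
    """Mechanisms applied to disjoint datasets: the largest budget dominates."""
    return max(epsilons)


def post_processing_budget(epsilon):
    """Post-processing the output of an epsilon-DP mechanism costs nothing extra."""
    return epsilon


# Hypothetical plan: two queries on the full dataset, then one query on each
# of three disjoint partitions, then rounding of the released values.
full_data_queries = [0.25, 0.5]
per_partition_queries = [0.125, 0.25, 0.125]

partition_cost = parallel_budget(per_partition_queries)
total = sequential_budget(full_data_queries + [partition_cost])
total = post_processing_budget(total)
```

In this plan the partitioned queries cost 0.25 in total rather than 0.5, so the overall budget is 1.0. This is why splitting data into disjoint subsets, where the application allows it, preserves budget.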

(5) Analysis of algorithm complexity and efficiency: Since the main idea of a differential privacy-preserving algorithm is the addition of noise to private information, the complexity depends on the noise generation, and the effectiveness of privacy preservation also relies on the size of the noise. These are related to the noise-generating parameters such as the characteristics of dataset and sensitivity calculations of the datasets. As a result, the complexity and effectiveness of privacy preservation can be depicted as described below.

The complexity of algorithm Alg can be denoted as follows:

The privacy-preserving quality of algorithm Alg can be denoted as follows:

《3.7. Privacy computing language》

3.7. Privacy computing language

We propose a privacy computing language (PCL) that can automatically implement formal description, dissemination control, computation, and transaction processing in the life-cycle of private information. The PCL consists of three parts: a privacy defining language, a privacy operating language, and a privacy controlling language.

(1) Privacy defining language: The privacy defining language aims to describe the data type and the data format of the six privacy computation factors of information M, as well as the relevant integrity constraints. The data types mainly include the bit string type, integer type, floating-point type, character type, logical type, table data, metadata, web data, text data, image data, audio data, and video data. In addition, the privacy defining language is used to describe computing steps performed on text, image, audio, and video, including private information extraction, scenario abstraction, privacy operation selection, privacy-preserving scheme selection and design, and evaluation of the privacy-preserving effect.

(2) Privacy operating language: The privacy operating language is used to describe the behaviors of operating information M, such as modular addition, modular multiplication, modular exponentiation, exclusive or, replacement, disturbance, query, selection, deletion, modification, copy, paste, cut, and forward.

(3) Privacy controlling language: The privacy controlling language is used to describe the access control authorization, identification, and revocation of information M. The access control permission consists of the selection, copy, paste, forward, cut, modification, deletion, query, and so on.

《3.8. Tracing evidence for privacy invasion》

3.8. Tracing evidence for privacy invasion

In the framework of privacy computing, privacy invasion and the obtaining of evidence exist within each step. Tracing evidence of privacy invasion mainly includes four parts: defining private information, determining privacy violations, obtaining evidence of privacy invasion, and tracing the origin of privacy invasion.

Based on the privacy computing framework, we abstract the characteristics and processes of privacy invasion, and integrate them with each step of the privacy computing framework. The framework of evidence tracing for privacy-invasion behavior is depicted below (Fig. 2):

《Fig. 2》

Fig. 2. Framework for tracing evidence of privacy invasion behavior.

(1) Private information extraction: When information M is generated, we deploy a scenario logic computing analysis to extract information or label private information, so that we can obtain private information vector I, location information set , audit control information set Ω, and privacy attribute vector A. This phase is mainly for identifying and defining private information.

(2) Scenario description: By abstracting the information scenario, we can obtain constraint condition set and dissemination control operation set . This phase provides criteria for judging privacy invasion. If the above conditions are not satisfied, we judge that privacy invasion has occurred.

(3) Privacy operation: According to the limitations of the scenario, we assign executable operations to each private information component. In turn, we form the privacy computing operation set . Furthermore, we construct the dissemination control operation set W. We record all privacy operations that the information subject executes on the information, and then generate or update the audit control information set Ω. Operations beyond the above two sets are judged to be privacy invasions.

(4) Solution selection or design: In this phase, we analyze the operations of the selected or designed scheme to check whether they fall within the privacy computing operation set, and whether their behavior, objects, and results stay within the constraint condition set. This check helps to avoid privacy invasion and also serves as a criterion for judging whether a privacy invasion has occurred.

(5) Evaluation of privacy-preserving effect: In this phase, we analyze and compute the cost of privacy preservation C, the effectiveness of privacy preservation Q, and the profit and loss ratio of privacy disclosure L. If the above indicators do not meet the expected goals, privacy invasion behaviors may have occurred; hence, we need to review the whole life-cycle of private information preservation.

Evidence tracing: When a privacy invasion occurs, we analyze the first four phases above in order to trace the entity responsible for the invasion. Based on the six-tuple of private information and on third-party monitoring or trusteeship, we identify and define the private information and judge the privacy-invasion behaviors. Then, by correlating the steps of the privacy computing framework, we can obtain evidence of abnormal behavior and discover the source of the invasion, thereby realizing evidence tracing. Technical details for a specific example can be found in Step 4 in Section 4.1.

《4. Case study for privacy computing》

4. Case study for privacy computing

Information interaction across information systems, digital ecosystems, and national network boundaries has become increasingly pervasive. It is common in practice for private information to be retained without authorization in different information systems. This practice has resulted in a significant risk of privacy leakage. The following four cases illustrate the application of our proposed privacy computing framework, to demonstrate how our proposed framework implements privacy preservation and how it traces evidence when a privacy invasion occurs.

《4.1. Information interaction across different domains in an information system》

4.1. Information interaction across different domains in an information system

Case 1: Let us take different social network applications as an example. Suppose the set of users in Social Network 1 is denoted as U = {u1, u2, …}. Each user has multiple friend circles, denoted as M = {m1, m2, …}, and a user can share information files via his or her friend circles. A friend circle consists of multiple users, so each mi is a subset of U, that is, mi ∈ 2^U. We define the friend circle function hasCircle as follows:

This represents all friend circles that a user has, while denotes the jth friend circle of user . Then we have:

As shown in Fig. 3, user u1 posts her multimedia file in her friend circle , where D is the file set. Her friend gains that information, and forwards it to user , who is in ’s friend circle .

Step 1: The user’s privacy requirements and scenario specification information need to be preset. Next, we generate user ’s privacy tag via the privacy tag generation function prTag. After that, is appended to a multimedia file operated by via the marking function TagAppend. We can then obtain and upload tagged information . The above-mentioned privacy requirements need to be set by the user; they include the effectiveness of privacy preservation, the scope of file dissemination, allowed access entities, allowed operations, and so forth. The user’s privacy requirements set is denoted as PR = {pr1, pr2 ,…}. Next, we define the privacy requirement setting function as follows:

《Fig. 3》

Fig. 3. Information interaction between different domains within the system.

This represents the privacy requirements of all users. The privacy requirements of user are denoted as follows:

The above-mentioned scenario specification information SS ={ss1, ss2,…} is obtained by analyzing the domain system. This information includes the file-generation time, file producer, operations on the file, and so forth. The scenario information generation function is defined as follows:

This represents the specification information of the scenario where the user is located. The scenario specification information of user is then as follows:

We define the file operation function as:

This means that a new file will be generated after the operation. The new file generated after the operation on old file d by user is denoted as follows:

The privacy tag generation function prTag generates a privacy tag each time the file is transferred via a user; it is denoted as follows:

Let us use  to represent the generated privacy tag, where X is the six tuples of private information and F is the set of privacy operations.  is the privacy tag generated by user . We then have:

The marking function tagAppend is denoted as:

This refers to appending to the file the tags generated by every user during file dissemination. We then have:

Step 2: The information system first checks the tagged information of the multimedia file to find out whether user satisfies ’s constraint condition set and dissemination control operation set . If it satisfies these criteria, can carry out allowed operations such as downloading, editing, and so forth. As shown in Fig. 3, as the multimedia file is allowed to be downloaded by friends within the same friend circle, user is allowed to download and obtain .

Step 3: After performing operations such as modify, add, delete, or other allowed operations on the multimedia file, user obtains a new file = openFile (u2, ). In this equation, the notation means that the file was first operated by u1 and then operated by . The file is ready for forwarding to user or to the friend circle. After this, Social Network 1 appends the privacy tag Tagu2 of , where = prTag[ , ; setPR(),  genSS(d; )] , to the multimedia file from . Hence, we have:

Step 4: The information system checks the tagged information  to determine whether the privacy requirements for each tag, , are satisfied. If they are satisfied, user  will be allowed to see the multimedia file posted by  , and will be able to perform allowed operations such as downloading.

During the above-mentioned information transfer process, a privacy invasion occurs whenever a user's operations or other behaviors go beyond the constraint condition set or the dissemination control operation set. To trace the evidence for a privacy invasion, we analyze the privacy tags appended to the multimedia file and draw on the audit control information set Ω and other information to reconstruct the invasion scene and determine where, and through which operation, the invasion occurred. In this way, we can effectively monitor the flow of private information over its whole life-cycle and realize evidence tracing of privacy-invasion behavior.
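The tagging and checking steps of Case 1 can be sketched in Python as follows. This is a simplified illustration, not the paper's construction: `pr_tag` and `tag_append` are hypothetical analogues of prTag and tagAppend, a tag is reduced to a list of permitted users and operations, and the audit log is a plain list standing in for the audit control information set.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tag:
    user: str                 # user who generated the tag
    allowed_users: frozenset  # access entities permitted by this user's requirements
    allowed_ops: frozenset    # operations permitted by this user's requirements
    scenario: str             # scenario specification, e.g. the originating system


@dataclass
class TaggedFile:
    content: str
    tags: list = field(default_factory=list)   # one tag per user on the path
    audit: list = field(default_factory=list)  # (user, operation, allowed) records


def pr_tag(user, allowed_users, allowed_ops, scenario):
    """Generate a privacy tag from a user's requirements and scenario."""
    return Tag(user, frozenset(allowed_users), frozenset(allowed_ops), scenario)


def tag_append(file, tag):
    """Append a tag to the file; earlier tags are kept."""
    file.tags.append(tag)
    return file


def check(file, user, op):
    """Allow an operation only if every tag on the file permits it; log the attempt."""
    allowed = all(user in t.allowed_users and op in t.allowed_ops for t in file.tags)
    file.audit.append((user, op, allowed))
    return allowed


def trace_invasions(file):
    """Return the logged attempts that violated at least one tag."""
    return [(user, op) for user, op, allowed in file.audit if not allowed]


# u1 posts a file; u2 downloads it, adds a stricter tag, and forwards it.
d = tag_append(TaggedFile("photo"),
               pr_tag("u1", {"u2", "u3"}, {"download", "forward"}, "Social Network 1"))
u2_download = check(d, "u2", "download")
d = tag_append(d, pr_tag("u2", {"u3"}, {"download"}, "Social Network 1"))
u3_download = check(d, "u3", "download")
u4_download = check(d, "u4", "download")  # u4 is on no sharing list
u3_forward = check(d, "u3", "forward")    # forwarding is not allowed by u2's tag
violations = trace_invasions(d)
```

Because every tag on the dissemination path must be satisfied, a later user can only narrow the permissions, never widen them. The audit records then identify which user attempted which disallowed operation.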

《4.2. Autonomous interaction across closed information systems》

4.2. Autonomous interaction across closed information systems

Case 2: In this case, information interaction occurs among different closed information systems within the same enterprise ecosystem. As shown in Fig. 4, user in Social Network 1 appends tags to her generated multimedia file d according to Eq. (14) and obtains the tagged file .  She then publishes it to her friend circle . The server obtains  and forwards the file to Social Network 2 in the same ecosystem, under the constraints of the privacy requirements of .

《Fig. 4》

Fig. 4. Autonomous interactions across closed information systems within an enterprise ecosystem.

In this case, the information can be disseminated across different information systems without the need for user , so Step 3 in Case 1 is omitted and Step 4 is executed directly after Step 2. Social Network 2 then publishes the file obtained from Social Network 1 and provides the download/read operation for user .

A similar case of a user’s autonomous information interaction across different closed information systems within the same enterprise ecosystem can easily be illustrated when there is a common user in both Social Network 1 and Social Network 2.

《4.3. Information interaction between open information systems》

4.3. Information interaction between open information systems

Case 3: Fig. 5 shows the information interaction that occurs between two open information systems. User of the open bulletin board system (BBS) Z appends tags to her generated multimedia file d according to Eq. (14), and obtains tagged information file before publishing . User of Z can obtain and perform allowed operations on d under the constraints of . Next, generates a new file with her new tag according to Eq. (15), and publishes it on another open information system BBS T; she then reposts it to under the constraints of and . As a result, all users of T can access the forwarded file .

《Fig. 5》

Fig. 5. Information interaction involving open information systems. BBS: bulletin board system.

When information interaction occurs between an open information system and a closed information system, the main difference occurs at Step (4). To be specific, when logs into a closed information system and publishes the file, users that are within ’s friend circle and within the satisfied related constraint conditions can access , while others cannot.
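The difference between reposting to an open system and to a closed system can be sketched as follows. The sketch assumes that a reposted file carries the original tag plus the reposter's new tag, and that access requires every tag in the chain to be satisfied. Representing a tag as a dictionary with the keys `owner`, `users`, and `ops`, and using `users=None` to mean "any user of an open system", are illustrative assumptions rather than the paper's formalism.

```python
def repost(tags, new_tag):
    """Return the tag chain of the reposted file: the existing tags plus the reposter's tag."""
    return tags + [new_tag]


def may_access(tags, user, op):
    """Grant `op` to `user` only if every tag in the chain permits it."""
    return all(
        op in t["ops"] and (t["users"] is None or user in t["users"])
        for t in tags
    )


# Original tag from the first open system: anyone may read or repost.
original = {"owner": "u1", "users": None, "ops": {"read", "repost"}}

# Repost to a second open system: all of its users can read the file.
open_chain = repost([original], {"owner": "u2", "users": None, "ops": {"read"}})

# Repost to a closed system: only the reposter's friend circle can read it.
closed_chain = repost([original], {"owner": "u2", "users": {"f1"}, "ops": {"read"}})
```

With these chains, `may_access(open_chain, "anyone", "read")` holds, whereas `may_access(closed_chain, "x", "read")` does not because `"x"` is outside the friend circle.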

《4.4. Differential privacy computing in Baidu》

4.4. Differential privacy computing in Baidu

Case 4: Using a dataset that contains users’ access records for all DuerOS applications in Baidu, we take a differentially private query of the total number of page views (PV) as an example to describe how to achieve differential privacy under the guidance of the privacy-preserving algorithm design criteria (Fig. 6).

《Fig. 6》

Fig. 6. Statistical data PV publishing that supports privacy preservation.

(1) In the preprocessing phase, based on the application scenario, both and are empty sets. We treat the private information vector I as one-dimensional data obtained from the dataset. The query function g is the sum of the users’ query numbers. By analyzing the distribution of PV, we can obtain the empirical value of PV. While computing the sensitivity, we set p = 1 to obtain the following:

where and are two adjacent datasets. In the specific scenario of Baidu, is the maximum value of accessing numbers for a certain application in one day among all users.
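As an illustration of the sensitivity described above, the following sketch computes the largest number of accesses that any single user made to a given application on a given day. The record layout `(user_id, app_id, day)` and the function name `pv_sensitivity` are assumptions made for this example only.

```python
from collections import Counter


def pv_sensitivity(records, app_id, day):
    """Largest per-user access count for `app_id` on `day`.

    `records` is an iterable of (user_id, app_id, day) tuples, one per access.
    Returns 0 when no record matches.
    """
    per_user = Counter(u for (u, a, d) in records if a == app_id and d == day)
    return max(per_user.values(), default=0)


records = (
    [("alice", "music", 1)] * 3
    + [("bob", "music", 1)]
    + [("alice", "weather", 1)] * 2
)
```

Removing one user's records changes the PV sum for that application and day by at most this value, which is why it can serve as the sensitivity of the sum query when p = 1.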

(2) In the algorithm framework phase, since PV is numerical data, we adopt the Laplace mechanism. More specifically, we choose , where e is the base of the natural logarithm. Without considering parameter ,  and  =0; this aligns with the following formula:

The query statistics function = PV, and the randomized Laplace algorithm Alg is as follows: 

where Lap(·) is the Laplace probability distribution function satisfying the parameters , and is described as follows:

If the value of Alg(·) exceeds , the noise should be regenerated until the condition is satisfied.
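A minimal sketch of the Laplace mechanism with noise regeneration is given below. It assumes zero-centered Laplace noise with scale `sensitivity / epsilon`; the parameter `upper_bound` is a hypothetical stand-in for the threshold referred to above, and `max_tries` is an added safeguard against endless resampling. Note that rejecting and redrawing samples changes the output distribution, so the analysis of the plain Laplace mechanism does not carry over unchanged.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw one sample from a zero-centered Laplace distribution.

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_pv(true_pv, sensitivity, epsilon, upper_bound, rng=random, max_tries=1000):
    """Return true_pv plus Laplace noise, redrawing while the result exceeds upper_bound."""
    scale = sensitivity / epsilon  # assumed calibration of the noise scale
    for _ in range(max_tries):
        out = true_pv + laplace_noise(scale, rng)
        if out <= upper_bound:
            return out
    raise RuntimeError("no admissible sample within max_tries")
```

For example, `noisy_pv(1000, 5, 1.0, 1010, random.Random(0))` returns a perturbed count that never exceeds 1010.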

(3) In the algorithm design phase, the Laplace parameters should satisfy in order to ensure that the above mechanism satisfies the following differential privacy definition:

In this case, we disregard the utility function while setting the noise parameters. Based on the privacy-preserving requirements such as the expected value of the users’ query numbers, the empirical value of the output, and so on, we adjust parameter to control the noise range and obtain the optimal noise expectation.
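For reference, the quantities named in phases (1) to (3) have the following standard textbook forms. The symbols D, D′, b, ε, and S are assumptions of this sketch, and the zero-centered form of the Laplace density is assumed as well; the notation used elsewhere in this paper may differ.

```latex
\begin{align}
\Delta g &= \max_{D, D'} \lVert g(D) - g(D') \rVert_1
  && \text{(sensitivity with } p = 1\text{, } D \text{ and } D' \text{ adjacent)} \\
\mathrm{Alg}(D) &= g(D) + \eta, \quad \eta \sim \mathrm{Lap}(b)
  && \text{(Laplace mechanism)} \\
\mathrm{Lap}(x \mid b) &= \frac{1}{2b} \exp\!\left(-\frac{|x|}{b}\right)
  && \text{(zero-centered Laplace density)} \\
b &\ge \frac{\Delta g}{\varepsilon}
  && \text{(condition on the noise scale)} \\
\Pr[\mathrm{Alg}(D) \in S] &\le e^{\varepsilon} \, \Pr[\mathrm{Alg}(D') \in S]
  && \text{(}\varepsilon\text{-differential privacy)}
\end{align}
```

A larger b gives stronger privacy at the cost of noisier output, which is the trade-off controlled when the noise parameter is adjusted.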

(4) In the algorithm combination phase, the weekly growth rate of PV is based on the accumulation of each day’s PV data throughout the week; therefore, with the post-processing property of differential privacy, the accumulated algorithm still satisfies ε-DP. Moreover, when the PV of each company satisfies ε-DP, based on the parallel composition property, the uniform data also satisfies ε-DP.
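The two properties used in this phase can be sketched as follows, under the assumption that the daily PV values have already been perturbed. Because the weekly figures are computed only from released noisy values, they constitute post-processing. The function names are hypothetical, and `parallel_epsilon` assumes that the mechanisms run on disjoint subsets of the data.

```python
def weekly_pv(noisy_daily_pv):
    """Accumulate already-perturbed daily PV values into a weekly total."""
    return sum(noisy_daily_pv)


def weekly_growth_rate(this_week, last_week):
    """Relative change of the weekly total, computed from noisy daily values only."""
    previous = weekly_pv(last_week)
    return (weekly_pv(this_week) - previous) / previous


def parallel_epsilon(epsilons):
    """Privacy parameter of mechanisms applied to disjoint datasets (parallel composition)."""
    return max(epsilons)
```

No raw data is touched in `weekly_pv` or `weekly_growth_rate`, so neither consumes additional privacy budget in this sketch.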

(5) By studying each Baidu department’s requirements for preserving users’ privacy and for the utility of the PV data, we can analyze the noise-added data statement. Furthermore, the complexity of the differential privacy mechanism, the effectiveness of privacy preservation, the utility of the data, the cost, and so forth can be evaluated.

《5. Further research trends》

5. Further research trends

《5.1. Dynamic privacy metric》

5.1. Dynamic privacy metric

The data controlled by large-scale internet companies flows across information systems, national network boundaries, and digital ecosystems. Due to the existence of a variety of data types and diverse application scenarios, future research on privacy metrics may focus on three aspects: measurement methods of private information that are suitable for multimedia scenarios, the dynamic adjustment mechanism of a privacy metric, and the automatic mapping of a privacy metric to constraint conditions and policies. Solving the core problem of a dynamic privacy metric for huge datasets can support scenario-adaptive privacy control, especially in the case of big data flowing unpredictably through random paths.

《5.2. The fundamental theory of privacy-preserving algorithms》

5.2. The fundamental theory of privacy-preserving algorithms

Focusing on atomic operations for the privacy preservation of different types of information and on different privacy-preserving requirements, we need to study the fundamental theory of privacy-preserving primitives. For encryption-based reversible privacy-preserving primitives, the main focus is to develop highly efficient ciphertext computation theories such as fully homomorphic encryption, partially homomorphic encryption, ciphertext searching, and ciphertext statistics. For perturbation-based irreversible privacy-preserving primitives, the main focus is to improve differential privacy models and develop new theoretical methods in information theory.

《5.3. Evaluation of the privacy-preserving effect and performance》

5.3. Evaluation of the privacy-preserving effect and performance

To evaluate the privacy-preserving effect and performance, we need to further investigate how to establish a scientific and reasonable quantification system under which we can propose quantitative methods to evaluate indicators for privacy-preserving primitives and primitive combinations. These indicators include privacy leakage, data utility, primitive complexity, and so forth. In this way, we can provide guidelines for the design, comparison, and improvement of privacy-preserving schemes.

《5.4. Privacy computing language》

5.4. Privacy computing language

The grammatical system of the privacy computing language, including statement definitions, programming interfaces, and the fusion of privacy-preserving primitives, should be studied in order to provide a convenient and platform-independent programming tool for the implementation of complex privacy-preserving schemes, and thereby support the deployment of privacy-preserving mechanisms in complex interconnected information systems.

《5.5. Decision criteria and forensics of privacy violations》

5.5. Decision criteria and forensics of privacy violations

Based on the description of private information by the privacy computing framework, we can combine scenario perception, private information operation determination, private information constraint condition matching, and so on, in order to study multi-factor joint decision-making criteria for privacy violations, and thus determine the quantitative threshold for decision-making. In order to solve the key problem of spatiotemporal scenario reconstruction for privacy violations, we should design practical forensic schemes based on the forensic information embedded in private information descriptions, third-party monitoring, and cross-over multi-element big data analysis.

《6. Conclusions》

6. Conclusions

With the rapid development of technologies such as the internet, mobile internet, and Internet of Things, huge amounts of data are being aggregated into big data through cloud services. Typical characteristics of big data include being massive and diverse. Big data provides the public with personalized services, which have profoundly changed the way we work and live. However, information services are facing serious privacy-leakage problems during the life-cycle of information flow, which includes information collection, storage, processing, publishing, and destruction.

Existing privacy-preserving solutions mainly focus on one scenario and provide a certain degree of privacy preservation for particular aspects. They have not yet been formed into a theoretical system. Therefore, we have proposed the concept of privacy computing and its framework with the aim of setting up complete life-cycle preservation of private information. The proposed method includes a privacy computing framework, a formal definition of privacy computing, four principles of privacy computing, algorithm design criteria, evaluation of privacy preservation, and a privacy computing language. The privacy computing framework can support private information exchange, extended authorization of private information circulation, and the forensics tracking of privacy invasion in a cross-platform scenario. The aim of the PCL is to satisfy description unambiguity, platform irrelevance, and computational consistency, properties that support the layered cross-information-system implementation of privacy preservation. Based on our proposed privacy computing framework, we have implemented the differential privacy-preserving mechanism in Baidu DuerOS. At the end of this paper, we outlined future research trends in privacy computing. We expect privacy computing to guide practical research on privacy-preserving technologies, and to guide the exploitation of privacy-preserving subsystems in large-scale information systems. We also expect privacy computing to provide theoretical support for enacting the criteria and evaluating the ability of privacy preservation.

《Acknowledgements》

Acknowledgements



Special thanks to Dr. Ye Wu from Baidu for his support and efforts in putting the privacy computing framework into practical use at Baidu, Inc. This work is supported by the National Key Research and Development Program of China (2017YFB0802203), the National Natural Science Foundation of China (61672515 and 61872441), and the Youth Innovation Promotion Association, Chinese Academy of Sciences (2018196).

《Compliance with ethics guidelines》

Compliance with ethics guidelines

Fenghua Li, Hui Li, Ben Niu, and Jinjun Chen declare that they have no conflict of interest or financial conflicts to disclose.