
1. Introduction

In the past decade, digital images have become more and more popular in our daily life. Compared with traditional text content, images are more intuitive and can convey much more information. Despite these benefits, the easy accessibility of digital images has resulted in significant security problems—that is, how to examine the authenticity of digital images and how to detect malicious modification. With advanced image processing software, it is becoming easier and easier to manipulate images without leaving visually noticeable traces, making the aforementioned problems more challenging [1].

To authenticate image contents and detect image forgeries accurately and robustly, researchers have proposed various approaches in digital image forensics. Generally speaking, there are two major categories of digital image forensics approaches: active approaches and passive approaches [2]. Active forensics approaches usually involve designing various kinds of watermarks or fingerprints of the image content and embedding them into the digital image. In the authentication stage, the former embedded watermarks or fingerprints are extracted and examined to determine whether the original image has been tampered with and, if so, where the tampered location is [3]. These active approaches can detect digital image tampering accurately; however, they have not been widely used because it is not possible to require all the digital images on the Internet to be watermarked before distribution. Hence, passive forensics approaches have become a more popular choice. These approaches detect digital image forgeries by analyzing specific inherent clues or patterns that occur during the creation/modification stage of digital images [4]. Compared with active forensics approaches, passive approaches do not rely on any prior or preset information; thus, they can have a broader application in image forensics.

For passive digital image forensics, various kinds of traces can be exploited to differentiate tampered images from natural ones [5]. In this paper, we have categorized these traces into three groups: traces left in image acquisition, traces left in image storage, and traces left in image editing. For each kind of trace, we will briefly review the corresponding passive digital image forensics approaches by clarifying the following issues:

• What are these traces and how do they form?

• What are the relevant recent and state-of-the-art approaches in image forensics?

• Why can these approaches detect these particular traces?

The paper is organized as follows. In Sections 2–4, we introduce passive digital image forensics approaches that detect the traces left in image acquisition, image storage, and image editing, respectively. Section 5 concludes with a discussion of the major limitations of current techniques and presents some possible future directions.


2. Traces left in image acquisition

When a digital image is captured, several processing steps are performed before storage (Fig. 1). Before entering the imaging device, the natural light usually goes through a series of lenses. The imaging device then conveys it to the color filter array (CFA), where a specific color mosaic is performed that only permits a certain component of the light to pass through a specific area. In most digital cameras, only one primary color (red, green, or blue; RGB) is recorded for each pixel. After CFA filtering, the light reaches the imaging sensor, which is the crucial part of the digital camera. At present, there are two widely used sensors: the charge-coupled device (CCD) and the complementary metal-oxide semiconductor (CMOS). The imaging sensor contains a number of photodetectors, and the output of each corresponds to one pixel of the image. In each detector, the filtered light is transformed into a corresponding voltage; thus, the output of the sensor is a mosaic of RGB pixels with various intensity values. In order to obtain the integrated color information for every pixel in the image, a demosaicing procedure is adopted: an interpolation process is performed on all the color channels so that the missing color components can be estimated.


Fig. 1. Flowchart of the digital image acquisition procedure.

Each of the above stages will introduce specific imperfections or patterns into the final image; these can be adopted as useful clues in image source identification and tampering detection. The following subsections present a brief introduction to the traces left by the lens, sensor, and CFA interpolation; typical image forensics approaches using these traces; and the state-of-the-art forensics approaches used in this case.


2.1. Traces left by acquisition artifacts

Owing to the design and manufacturing process, distortions or aberrations will be introduced by the lens into the captured images. There are two kinds of common distortions in digital cameras: chromatic aberration (CA) and spherical aberration. The former distortion will appear when the lenses cannot focus colors with different wavelengths to the same convergence position on the sensor. The latter aberration will appear when light rays passing through the lens do not converge at the focal point.

Sensor noise is another important acquisition characteristic. Among the various kinds of sensor noises, noise caused by the photo response nonuniformity (PRNU) is of the greatest importance. Various kinds of digital image forensics approaches have been proposed based on the PRNU noise; these approaches cover the applications of source identification, processing history recovery, image forgery detection, and more. The PRNU is a distinguishing characteristic that is related to each imaging sensor. Hence, by analyzing the PRNU in each captured image, the traces of the capturing device can be obtained.

Another aspect that cannot be ignored is the demosaicing artifacts related to the CFA. In order to estimate the corresponding values of each pixel in the image, an interpolation procedure is applied to the three color channels. The interpolation process inevitably introduces certain correlations among the pixels, and such correlations can be adopted as an intrinsic ‘‘fingerprint” of the capturing device. By analyzing the demosaicing patterns, traces of the capturing devices can be obtained.


2.2. Image forensics using acquisition artifacts

2.2.1. Image forensics based on lens aberrations and sensor noise

Using the aberrations produced by the camera lens, we can relate an image to a specific device or examine whether an image has been tampered with. For example, radial distortion is often observed in an image, causing the straight lines of an object to appear curved. In order to deal with radial distortion, digital camera manufacturers usually adopt various methods to compensate for the distortion; these methods produce different artifacts accordingly. Hence, by analyzing such artifacts, the camera manufacturer and even the camera model can be identified.

Sensor pattern noise is commonly used in image forensics. For an image I, the pattern noise can be formulated as follows:

R = I - F(I) = IP + φ    (1)

where R is the overall residual, which can be obtained from the original image I by deducting the counterpart after passing the image through a de-noising filter F; P is the PRNU factor; and φ is the summation of all the other kinds of noises in the image.

Assuming that we have N images captured by the same camera—that is, images I1 to IN—the corresponding residual Rk (k = 1, ..., N) can be calculated using Eq. (1). The PRNU factor P can be estimated following the maximum likelihood criterion and is formulated as follows:

P = (Σk RkIk) / (Σk Ik²)

where the summations run over k = 1 to N and all the operations are performed pixel-wise. In camera identification, supposing that there are M devices, the PRNU factor should be calculated M times, with a specific value Pm being recorded for each device m. In the test stage, the residual term is first calculated using Eq. (1) as Rt for the test image It. Then the correlation ρm between the PRNU factor Pm and this residual is computed as follows:

ρm = corr(PmIt, Rt),  m = 1, 2, ..., M

where corr(·, ·) denotes normalized correlation. The capturing device is identified as the one with the maximum ρm.
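To make the above procedure concrete, the following Python sketch strings together Eq. (1), the maximum likelihood estimate of P, and the normalized correlation. It is a minimal illustration rather than the implementation of any particular reference above; the Gaussian de-noising filter, the function names, and all parameter values are assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter

def residual(img, sigma=1.0):
    """Eq. (1): R = I - F(I), with a Gaussian filter standing in for the de-noising filter F."""
    img = np.asarray(img, dtype=np.float64)
    return img - gaussian_filter(img, sigma)

def estimate_prnu(images):
    """Maximum likelihood estimate of the PRNU factor P from N images
    I_1..I_N captured by the same camera (all operations are pixel-wise)."""
    num = np.zeros(np.asarray(images[0]).shape)
    den = np.zeros_like(num)
    for img in images:
        img = np.asarray(img, dtype=np.float64)
        num += residual(img) * img
        den += img ** 2
    return num / (den + 1e-12)

def identify(test_img, prnu_factors):
    """Return the device index whose PRNU factor, multiplied by the test image,
    has the highest normalized correlation with the test residual."""
    test_img = np.asarray(test_img, dtype=np.float64)
    r = residual(test_img).ravel()
    best, best_rho = None, -np.inf
    for m, p in enumerate(prnu_factors):
        rho = np.corrcoef((p * test_img).ravel(), r)[0, 1]  # normalized correlation rho_m
        if rho > best_rho:
            best, best_rho = m, rho
    return best, best_rho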

2.2.2. Image forensics based on CFA patterns

The underlying rationale of the forensics approaches that are based on CFA artifacts is that the original image will have specific CFA artifacts, whereas the tampered region will probably have a different kind of artifact. Hence, the CFA artifacts are computed for each block in the image that is being investigated, with different CFA patterns indicating the existence and location of forgery. In addition, since a different demosaicing algorithm introduces different correlations between neighboring pixels in a specific color channel, there are two major directions in image forensics based on CFA patterns. The first direction aims to predict the interpolation parameters and identify the type of capturing device, while the second aims to examine the demosaicing traces so as to locate the region that has potentially been tampered with.
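As a rough illustration of the second direction, the following sketch (not any specific published method) checks whether the green channel of an image carries demosaicing traces under an assumed Bayer layout: bilinear prediction errors should be much smaller at interpolated positions than at acquired ones. The layout, the predictor, and the use of a simple variance ratio are all assumptions.

import numpy as np

def cfa_variance_ratio(green, acquired_parity=0):
    """Ratio of mean squared bilinear prediction errors at assumed acquired
    vs. interpolated green positions; values well above 1 suggest demosaicing
    traces consistent with the assumed layout."""
    g = np.asarray(green, dtype=np.float64)
    pred = 0.25 * (np.roll(g, 1, axis=0) + np.roll(g, -1, axis=0) +
                   np.roll(g, 1, axis=1) + np.roll(g, -1, axis=1))
    err = (g - pred) ** 2
    rows, cols = np.indices(g.shape)
    acquired = ((rows + cols) % 2) == acquired_parity  # assumed Bayer checkerboard
    return float(err[acquired].mean() / (err[~acquired].mean() + 1e-12))

Computing this ratio block by block, rather than globally, is what allows a differently demosaiced region to be localized.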


2.3. State-of-the-art acquisition artifacts detection approaches

2.3.1. Source identification based on lens aberrations

As discussed in Section 2.1, different cameras have different kinds of lens aberrations, so lens aberrations can be adopted as a unique ‘‘fingerprint” in source identification. Choi et al. [6] proposed a pioneering work in this area: Considering the characteristics of the radial distortion that forces straight lines to become curved, they proposed two kinds of features, based on the pixel intensities and the distortion measurements. Choi et al. [6] then performed source identification as a classification procedure and achieved a 4% detection accuracy gain compared with approaches that only use image intensities.

2.3.2. Tampering detection based on lens artifacts

The basic idea behind using lens artifacts to detect image-splicing forgery is that the original image and the inserted image patch are probably captured using different devices; thus, detecting and locating image regions with different lens artifacts may identify image forgery. Yerushalmy and Hel-Or [7] proposed a new type of lens artifact, which is referred to as the “purple fringing aberration” (PFA); they also presented a corresponding extraction method. The PFA directions are adopted as unique ‘‘fingerprints” to determine whether there is any inconsistency in the image being tested. The algorithm achieved good performance in both image forgery detection and tampering detection.

2.3.3. Source identification based on sensor pattern noise

Owing to deficiencies in the sensor manufacturing process, different pixels will have dissimilar light sensitivities; therefore, the sensor pattern noise, and especially the PRNU, can be used to differentiate various kinds of sensors and camera types. Lukas et al. [8] proposed a source identification approach based on the PRNU that can identify nine camera models. Kulkarni and Mane [9] noticed that an estimation of the sensor noise may not be very accurate in the edge region, and proposed a preprocessing step before feature extraction. Two edge detectors, the Canny and Laplace operators, were adopted to detect the edge regions; these regions are then removed for further processing. After calculating the sensor noise by thresholding, a number of statistical features were extracted from the discrete wavelet transform (DWT) domain in the form of the gray-level co-occurrence matrix (GLCM). Finally, the k-nearest neighbor (k-NN) is adopted as the classifier. Considering the applications for camera identification in mobile phones, Sandoval Orozco et al. [10] proposed a specific approach based on sensor imperfections. Their features were also extracted from the wavelet domain, and this approach achieved good results in digital camera identification in mobile phones.

2.3.4. Tampering detection based on sensor fingerprint inconsistencies

Similar to the lens artifacts, the sensor noise can be used to detect image forgery. The underlying idea is that there will be inconsistencies in the sensor noise/fingerprints in the tampered region, allowing a suspicious tampered region to be detected. Fridrich [11] proposed an image-tampering detection approach based on the PRNU information. A statistical model was built to describe the PRNU factor. The PRNU factor of the image being tested was extracted and used to identify the capturing device. The experimental results showed that this approach can achieve almost 100% accuracy for 100 different types of cameras.

2.3.5. Source identification based on CFA artifacts

As mentioned earlier, the CFA and the demosaicing process vary among different cameras; Gao et al. [12] proposed a source identification method based on this information. A 69-dimensional feature was designed to describe the abovementioned artifacts. The experiments were carried out on the Dresden Image Database, and a very high detection accuracy of 99.88% was achieved in differentiating seven camera models.

2.3.6. Tampering detection based on CFA artifacts

The basic idea of using CFA artifacts in tampering detection is straightforward: The tampered region will exhibit different CFA/demosaicing artifacts than those of the original image. Prasad [13] proposed a feature to describe the demosaicing artifacts in images. If an abnormal region is present (i.e., one without the original artifacts or having different artifacts), it is regarded as a tampered region. Katre and Chandel [14] proposed an approach that can both detect image forgeries and locate the tampered region. The artifacts caused by demosaicing were modeled, which helped to reveal the image forgery. Their approach achieved good performance for uncompressed images; however, the question of how to deal with Joint Photographic Experts Group (JPEG) compression remained a challenging one.


3. Traces left in image storage

JPEG is the most widely used format for image transmission and storage. Since it is a lossy compression standard, JPEG will inevitably introduce certain compression patterns during each image storage. By analyzing these patterns, it is possible to deduce important forensic cues, such as how many times the image has been compressed and whether all the regions in the image have been compressed the same number of times. The following subsections present a brief introduction to the patterns left by JPEG compression, typical scenarios in JPEG compressed image forensics, and state-of-the-art forensics approaches in this area.


3.1. Patterns left by JPEG compression

A standard JPEG compression procedure for grayscale images runs as follows (note that this procedure can be extended to color images by performing a similar approach in each channel of the YCbCr color space): A nonoverlapping 8×8 block division is performed on an original image. For each block, a two-dimensional discrete cosine transform (2D-DCT) is applied on the grayscale values to transform all the pixels to the frequency domain. Next, the amplitudes of the frequency components are quantized by a preset quantization table. Fig. 2 shows a typical quantization table with a quality factor of 50, where larger quality factors represent a higher image quality and lower compression ratio. Finally, an entropy coding technique (e.g., Huffman coding) is adopted to turn the quantized frequency amplitude into a binary sequence.


Fig. 2. A quantization table with a quality factor of 50.

The JPEG compression will introduce three kinds of bias to the original image: the quantization error, truncation error, and rounding error. The quantization error is caused by the quantization process in the frequency domain. After quantization, the original value of a specific DCT component will be represented by the closest integer multiple of the corresponding quantization step. For example, given an original direct current (DC) value of 86 and a quantization step of 16 (as shown in Fig. 2), the DC value will be changed to 80 after quantization and the difference is denoted as the quantization error (i.e., 86 – 80 = 6). The truncation error and rounding error are introduced in the inverse discrete cosine transform (IDCT). Since the grayscale value should be an integer ranging from 0 to 255, any value greater than 255 or less than 0 will be truncated to 255 or 0, respectively, which leads to the truncation error. On the other hand, most values after IDCT are not integers, so a rounding process must be performed, which leads to rounding errors. Generally speaking, the quantization error is much greater than the other two errors, especially when the quality factor is medium or low (< 75). By analyzing the quantization error, some clues about JPEG compression can be derived.
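The following short numerical sketch reproduces the three error sources for a single 8×8 block. The flat toy quantization table, the level shift of 128, and the random test block are assumptions for illustration; the DC example (86 with a step of 16) matches the text above.

import numpy as np
from scipy.fftpack import dct, idct

def dct2(block):
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(coeffs):
    return idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho')

q_step = 16.0                            # DC quantization step from Fig. 2
dc = 86.0                                # original DC coefficient
dc_q = np.round(dc / q_step) * q_step    # quantized to 80
print('quantization error:', dc - dc_q)  # 6.0

block = np.random.randint(0, 256, (8, 8)).astype(np.float64)
coeffs = dct2(block - 128)               # level shift, as in baseline JPEG
q_table = np.full((8, 8), q_step)        # toy table; real tables vary per frequency band
rec = idct2(np.round(coeffs / q_table) * q_table) + 128
stored = np.clip(np.round(rec), 0, 255)  # rounding + truncation in the spatial domain
print('max rounding/truncation error:', np.abs(rec - stored).max())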


3.2. Image-tampering detection using clues from JPEG compression

Consider a simple image-tampering scenario. We crop the small Patch I from Image B and insert it into Image A to generate a composite Image C (Fig. 3). If all three images (A, B, and C) are stored using JPEG compression, the following effects can be observed.


Fig. 3. An illustration of copy-paste image tampering.

3.2.1. Aligned double JPEG compression

In Image C, all the regions except Patch I are compressed twice (during the storing of Image A and the storing of Image C), whereas Patch I is compressed once. Note that although Image B has also been JPEG compressed, the 8×8 block division structure of Image B is very likely (with a probability of 63/64) to be different from that of Image A with regard to Patch I. Thus, Patch I in Image C is compressed only once according to the 8×8 block division structure of Image C (which is the same as that of Image A). Hence, the inserted Patch I can be located by examining the aligned double JPEG compression effects in all the regions. Here, the term "aligned" refers to the first and second JPEG compressions using the same 8×8 block division structure.

3.2.2. Shifted (nonaligned) double JPEG (SDJPEG) compression

This part focuses on Patch I. The image in Patch I is first compressed when storing Image B, using a specific block division structure; it is then compressed again during the storing of Image C, using another structure. Double JPEG compression with different block division structures is referred to as a ‘‘shifted” or ‘‘nonaligned” case. By examining all the regions in Image C, the inserted Patch I can be located based on the SDJPEG compression effects.


3.3. State-of-the-art double JPEG compression detection approaches

For most tampered images, at least two JPEG compressions have been applied to the original image. Hence, double JPEG compression detection is a crucial part in digital image forensics. The representative aligned double JPEG compression detection approaches can be found in Refs. [15–21]. One of the earliest works was proposed by Lukas and Fridrich [15], who noticed that double JPEG compression would generate two peaks in the DCT histogram; this phenomenon was adopted as a clue to detect double JPEG compressed regions. Fu et al. [16] proposed a double JPEG compression detection approach based on the generalized Benford’s law. They assumed that the first digits of the DCT coefficients after JPEG compression form a distribution that follows the generalized Benford’s law; thus, any image region violating this assumption is determined to be a double JPEG compressed region. Pevny and Fridrich [17] extracted a number of low-frequency DCT coefficient histograms as distinguishing features and used the support vector machine (SVM) as the classifier. Farid [18] stated that if the image was double JPEG compressed, recompression with the same quality factor would cause the minimum reconstruction error. He referred to such a local minimum as a ‘‘JPEG ghost,” and tried to detect double compression by searching for JPEG ghosts. Lin et al. [19] proposed a tampered-region-locating algorithm based on double JPEG compression detection. The distribution of DCT coefficients after single JPEG compression was assumed to follow the Laplace distribution; the distribution after double JPEG compression can then be derived accordingly. By using an expectation maximization (EM) optimization algorithm, each region was assigned a probability of being doubly compressed. A graph cut algorithm was then adopted to avoid false alarms. This algorithm usually works well for double compression with different quality factors. However, when two compressions use the same quantization table, most of the existing approaches cannot achieve a high detection accuracy. In view of this, Huang et al. [20] designed an algorithm only for detecting double JPEG compression with the same quality factor. A random perturbation strategy was adopted based on the fact that the differences between the first and second compressions are much greater than those between the second and third compressions. In a recent work, Yang et al. [21] extended the idea in Ref. [20] and comprehensively analyzed the error blocks in JPEG compression. The rounding errors and truncation errors were analyzed, and a set of features describing the differences between single and double JPEG compressions was extracted. By using the SVM as the classifier, their algorithm [21] can accurately detect double compression with the same quality factor when the quality factor is relatively high.
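As an illustration of one of the simpler ideas reviewed above, the sketch below approximates the ‘‘JPEG ghost” search of Ref. [18]: the image is recompressed over a range of quality factors, the reconstruction error is recorded, and a dip in the resulting curve hints at an earlier compression at that quality. It computes a single global curve for brevity, whereas the published method maps the difference block by block to localize the tampered region; the quality range and the grayscale conversion are assumptions.

import io
import numpy as np
from PIL import Image

def jpeg_ghost_curve(path, qualities=range(30, 96, 5)):
    """Return (qualities, mean squared recompression errors) for a grayscale image."""
    img = np.asarray(Image.open(path).convert('L'), dtype=np.float64)
    errors = []
    for q in qualities:
        buf = io.BytesIO()
        Image.fromarray(img.astype(np.uint8)).save(buf, 'JPEG', quality=q)
        buf.seek(0)
        rec = np.asarray(Image.open(buf).convert('L'), dtype=np.float64)
        errors.append(float(((img - rec) ** 2).mean()))
    return list(qualities), errors   # a local minimum at q0 suggests prior compression at q0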

When the original image (i.e., Image A in Fig. 3) is uncompressed, aligned double JPEG compression effects will not exist in the composite image (i.e., in Image C in Fig. 3). In order to detect image forgeries in such a case, the researchers try to investigate the SDJPEG compression effects [22–27]. Luo et al. [22] detected SDJPEG compression effects by introducing a specific feature called the blocking artifact characteristics matrix (BACM). For single JPEG compressed images, it is observed that the corresponding BACMs are symmetric, whereas the BACMs for SDJPEG compressed images are no longer symmetric. However, the BACM feature is related to the image content to some extent; thus, different image contents may lead to different kinds of BACM features, which may degrade the double JPEG compression detection performance. In order to solve the content-related problem, an extended BACM feature was proposed by Chen and Hsu [23] by considering the inter-block correlation. Qu et al. [24] proposed a convolutive mixing model for SDJPEG compression, and solved it using blind signal separation. They analyzed the independent value map (IVM) to examine whether the image had undergone SDJPEG compression, with the assumption that SDJPEG compression will break the symmetry of the IVM. Bianchi and Piva [25] tried to disclose SDJPEG compressed traces from the DCT coefficient histograms. A complete search was performed to examine each region with a specific size in order to determine whether it was SDJPEG compressed or not. When the searching region matches the block division of the first JPEG compression, an integer periodicity pattern can be observed in its DCT coefficient histogram. Bianchi and Piva [26] also proposed a statistical model for the DCT coefficient distributions caused by SDJPEG compression. They simulated the SDJPEG compression by adding a zero-mean Gaussian noise to each DCT coefficient; they also provided a variance estimation method to approximate the noise variance. Recently, Wang et al. [27] extended the idea in Ref. [26] and provided a complete theoretical proof on how the SDJPEG compression affects the DCT coefficient distributions.

The major limitations of the current double JPEG compression detection algorithm are threefold:

(1) Most techniques are based on statistical features or models. When the tampered region is small enough (i.e., smaller than 64 × 64), most techniques cannot provide accurate results, since the data for constructing the statistical features (models) are quite limited.

(2) Most techniques can achieve good results when the first quantization table is known or can be accurately estimated. However, such an assumption cannot easily be satisfied in practice. Due to error propagation, the overall detection accuracy will decrease to some extent when the first quantization table has to be estimated by exhaustive search.

(3) For SDJPEG compression detection, when the quality factor of the second quantization is much less than that of the first quantization, most state-of-the-art detection approaches show a performance that is similar to random guessing.


4. Traces left in image editing


4.1. Inconsistency in lighting

Copy-paste image tampering is the most common method of creating an image forgery. These forgeries are usually deceptive to the human eye; however, they may have some inconsistencies in lighting, shadows, perspective, and so forth, which can be detected by proper analysis. Fig. 4 [28] shows a well-known case of lighting inconsistencies. When a picture is captured, the objects in the scene are illuminated by a light source coming from a certain direction (Fig. 5 [28]). If two objects originally come from different images, it is unlikely that their light sources will be similar in direction and distance. The following subsections present a brief introduction to image forgery detection by analyzing lighting consistencies.


Fig. 4. The light source directions of the two people in this image are different, which may be evidence of forgery [28].


Fig. 5. In different images, the light source directions are different [28].

4.1.1. Lighting traces

When an image is captured, the objects captured in the scene are illuminated by a light source. This light source will leave traces on the objects, such as intensity differences on the surface, and shadows. Hence, the light direction can be inferred from these traces. If the objects in an image are illuminated by different light sources, these objects are very unlikely to belong to the same original scene—which indicates forgery.

4.1.2. Image-tampering detection using a lighting pattern

The lighting environment in the real world is complex. Light sources are three dimensional (3D), and sometimes there are multiple light sources. Hence, assumptions are made to simplify the problem: the Lambertian assumption of the surface of interest, the constant assumption of the object’s reflectance, and the assumption that the light source is located infinitely far away. With these assumptions, Johnson and Farid [28] described the image intensity, I(x, y), as follows:

I(x, y) = ρ[N(x, y)·L] + C

where ρ is the object’s constant reflectance value, L is the light source direction, N(x, y) is the surface normal direction at the coordinate (x, y), and C is the constant ambient light. Moreover, ρ can be regarded as a unit value when only the light direction is required for estimation. Given knowledge of 3D surface normals from at least p distinct points (p ≥ 4) on a surface with the same reflectance, the least-squared estimation is adopted to calculate the light direction, which is formulated as follows:

E(v) = ||Mv - b||²

where ||·|| represents the vector norm; v = (Lx, Ly, Lz, C)ᵀ, in which L is the light source direction containing three components (Lx, Ly, and Lz); b is the vector of the observed intensities at the p points; and

M = [Nx(x1, y1)  Ny(x1, y1)  Nz(x1, y1)  1; ... ; Nx(xp, yp)  Ny(xp, yp)  Nz(xp, yp)  1]  (one row per point)

where Nx, Ny, and Nz are the three components of the surface normal direction. Setting the partial derivative to zero, the least-squared estimate can be calculated as follows:

v = (MᵀM)⁻¹Mᵀb

However, with only a single image and without knowing the geometry of the objects in the scene, it is unlikely that the object surface normal N(x,y) can be obtained. A good solution is to simplify a 3D scene to a 2D one, which means that we only need to estimate two components, (Lx, Ly), of the light source direction from a digital image.
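A minimal sketch of this least-squares estimation in the 2D simplification is given below; it assumes that 2D surface normals along occluding contours are already available and solves for (Lx, Ly) and the ambient term C. The function name and the use of an ordinary least-squares solver are illustrative choices, not the authors' code.

import numpy as np

def estimate_light_2d(normals, intensities):
    """normals: (p, 2) array of 2D surface normals; intensities: (p,) observed
    intensities at the same points. Returns (Lx, Ly, C) under the Lambertian and
    distant-light assumptions; at least three points are needed in this 2D form."""
    N = np.asarray(normals, dtype=np.float64)
    b = np.asarray(intensities, dtype=np.float64)
    M = np.hstack([N, np.ones((N.shape[0], 1))])   # rows: [Nx, Ny, 1]
    v, *_ = np.linalg.lstsq(M, b, rcond=None)      # v = (M^T M)^-1 M^T b
    return v

Consistency between the directions estimated for different objects (or different patches of one object) is then what is checked in forgery detection.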

In some scenes, the simplifying assumptions are too strict, such that forgeries cannot be detected. In these cases, we can relax some simplifying assumptions. For example, we can relax the constant reflectance assumption; different light source directions on each patch of the surface must then be estimated. Moreover, when the light source is local, the assumption that the light source is infinitely far away is invalid. In that case, one more assumption must be added: that the light source direction is constant for a local patch.

4.1.3. State-of-the-art lighting inconsistency detection approaches

Lighting traces are quite complex in the real world. For indoor scenes, there may be multiple local light sources. For outdoor scenes, there is very likely a single point light source located infinitely far away. In some scenes, we cannot calculate the 3D components of the light direction due to unknown geometry. Most of the detection approaches are based on particular scenes or have some defects. Johnson and Farid [28] proposed one of the earliest attempts at detection using lighting inconsistencies. Although this method had some difficulties in calculating the object surface normal in many cases, it was observed that for images containing human faces, the light source in the scene can be estimated from the highlights on human eyes. Johnson and Farid [29] then estimated the light direction from this highlight based on a 3D model of a human eye, and proposed a lower-dimensional model in Ref. [30] to deal with a complex lighting environment. Kee and Farid [31] further built a 3D head model to improve the estimation performance for a complex lighting environment. Nillius and Eklundh [32] presented an automatic light source direction estimation algorithm from a single image. The algorithm in Ref. [32] required at least one occluding object contour with isotropic surface reflectance in the image—a requirement that can be easily satisfied in many images with various contents. In the algorithm, the occluding contours were first extracted based on the color and edge information. A shading model was then adopted to estimate the light source direction for each contour, and a Bayesian network was designed to fuse all the estimation results and output the most likely estimation. It is worth noting that in Ref. [32], the shadows of the objects could also be used to detect forgery. Koenderink et al. [33] modeled the illumination process as a parallel light beam from random directions projecting on random Gaussian surfaces, and proposed an illumination direction estimation method accordingly. Zhang et al. [34] adopted the planar homology and shadow matte to describe the color distribution/characteristics and relationship of the shadows in the image. A framework based on photometry and geometry of the image shadow was proposed to detect image forgeries. Fan et al. [35] designed a straightforward counter-forensics strategy to defeat the 2D lighting consistency-based forensics approaches; this strategy exposed the shortcomings of the existing approaches in this area and provided new challenges for future research. Moreover, it was shown that lighting traces are more effective for outdoor scenes due to the simple lighting environment.


4.2. Local filtering traces

4.2.1. Median filtering detection

Median filtering (MF) is a common image postprocessing method to filter out image noise. Owing to its nonlinear property, it can also be used in image forgeries to remove specific modification traces. The MF output at a specific pixel is the median value of a small window surrounding that pixel. Generally speaking, the square window size is set to odd values, such as 3×3 or 5×5. Since the MF takes the median value as the filtered result, it usually generates many constant or nearly constant patches in the filtered image; these patches become intrinsic footprints and make the MF procedure traceable and detectable.

The general form of a 2D median filter is given as follows:

y(i, j) = median{x(i + s, j + t) | s, t ∈ [-(z - 1)/2, (z - 1)/2]}

where y(i, j) is the output of the median filter on a pixel (i, j) of the input image x, with a square window of the size z × z.

The nonlinear nature of MF makes it difficult to build up an explicit expression describing the relationship between its input and output. However, Bovik et al. [36] demonstrated that MF has a good property of edge preservation. In addition, constant or nearly constant image patches are usually found in an image after MF; this is referred to as the streaking artifact [37], as shown in Fig. 6. Kirchner and Fridrich [38] first tried to exploit the streaking artifact of MF images and proposed a pair of MF detectors using this artifact. The histogram of the first-order difference image was analyzed to examine the streaking artifacts, and the approach showed good results for uncompressed images. In order to detect MF in JPEG compressed images, the subtractive pixel adjacency matrix (SPAM) features [38], which show good results in steganalysis, were adopted for MF detection. However, since the SPAM is a complex statistical model, the detection performance will degrade when the number of pixels in the image region is reduced to a relatively small value (e.g., 64 or 128). Similarly, Cao et al. [39] proposed an MF detection algorithm based on the streaking artifacts. After computing the difference images of the texture region in both the horizontal and vertical directions, the probabilities of zero values were recorded. Moreover, the algorithm developed by Cao et al. [39] can differentiate MF from other local image processes such as rescaling, Gaussian low-pass filtering, and averaging filtering. Yuan [40] observed that MF results have a local dependency artifact because the local filtering windows of adjacent pixels are overlapped with each other during MF, and a number of pixels are shared between adjacent local windows. Based on this observation, 44 features measuring the intrinsic relationships among the adjacent pixels were proposed for MF detection; these are referred to as the median filtering feature (MFF). The MFF can detect MF very accurately for uncompressed images and can achieve similar results when compared with the SPAM feature in JPEG compressed images. Moreover, for JPEG compressed images with a low quality factor, the MFF usually outperformed the SPAM feature.
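A much simplified streaking-artifact measure in the spirit of the detectors above can be sketched as follows; the horizontal first-difference statistic is an assumption, and the published detectors rely on more elaborate histogram ratios and statistical models.

import numpy as np
from scipy.ndimage import median_filter

def zero_diff_ratio(img):
    """Fraction of zero horizontal first-order differences (a streaking measure)."""
    d = np.diff(np.asarray(img, dtype=np.int32), axis=1)
    return float(np.mean(d == 0))

# Rough usage on a random test image: the ratio jumps after 3x3 median filtering.
test = np.random.randint(0, 256, (128, 128))
print(zero_diff_ratio(test), zero_diff_ratio(median_filter(test, size=3)))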


Fig. 6. Pixel distribution (a) before and (b) after MF.

The major drawback of using streaking artifacts for MF detection is that these approaches are not robust against certain types of image postprocessing steps, such as JPEG compression. Hence, other kinds of MF traces should be exploited. Chen and Ni [41] observed that images after MF show different image characteristics in the edge region, compared with the original images. Such characteristics are represented in the specific correlation between neighboring pixels and the relationship between image noise and edges [41]. Taking the above characteristics into consideration, the edge-based prediction matrix (EBPM) feature was proposed for MF detection; this feature is extracted from various edge regions in the image, and the SVM is adopted for classification. Compared with previous methods, the EBPM showed better performance in distinguishing MF from other filtering processes such as Gaussian low-pass filtering and averaging filtering. Kang et al. [42] adopted the median filter residual (MFR) as evidence to detect MF. The residual was defined as the difference between the image and its median filtered output. The authors pointed out that when an image is median filtered once more, the MFR will be reduced; they then proposed an MF detection approach based on this observation. An autoregressive (AR) model was employed to construct an MF detection feature set based on the MFR. Compared with the SPAM feature and the MFF, the MFR feature has a lower dimension (it is 10-dimensional) and achieves comparable detection results, even when the quality factor of the JPEG compressed image is low (e.g., 30). Chen et al. [43] exploited both the global information and local information for MF detection. In the global sense, a number of cumulative distributions of various orders of the difference images were adopted as global features. In the local sense, the correlations between various neighboring pixel pairs were adopted as local features. The final feature was constructed by concatenating the global and local features, and had a dimensionality of 56. Their approach is successful at detecting MF in low-resolution and JPEG compressed images with low quality factors.
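The median filter residual itself is straightforward to compute, as the hedged sketch below shows; the AR-model feature extraction of Ref. [42] is omitted, and only the observation that re-filtering shrinks the residual is illustrated.

import numpy as np
from scipy.ndimage import median_filter

def mfr_energy(img, size=3):
    """Mean squared median filter residual: median_filter(image) - image."""
    img = np.asarray(img, dtype=np.float64)
    r = median_filter(img, size=size) - img
    return float(np.mean(r ** 2))

# For an image that has already been median filtered, mfr_energy is typically
# much smaller than for its never-filtered counterpart:
# mfr_energy(median_filter(img, 3)) < mfr_energy(img)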

Some modern techniques, including the local texture descriptor and deep learning, have recently been adopted for MF detection. Zhang et al. [44] proposed a local texture descriptor, referred to as the second-order local ternary pattern (LTP), for MF detection. The proposed feature combined the merits of the LTP feature and those of the local derivative pattern feature; thus, it can better describe the local characteristics caused by MF. Moreover, kernel principal component analysis (KPCA) was employed for feature dimension reduction and discriminative feature extraction. Using the approaches in Ref. [44], median filtered images can be detected accurately and efficiently. Chen et al. [45] aimed to detect MF artifacts in a challenging small-sized image patch scenario. A convolutional neural network (CNN) was designed for both feature extraction and classification. The image patch is adopted as the input of the network, and the detection result is the network output. In Ref. [45], the features were learned automatically from training samples, and no manual feature extraction procedure was required. This approach showed significant performance improvements in MF detection, especially when the image patch is small.

4.2.2. Unsharp masking sharpening detection

Unsharp masking (USM) sharpening is a technique that is widely used in daily life: It enhances edge contrast in order to improve image quality. However, in many image forgeries, USM sharpening can also be employed to cover the traces of image manipulation to some extent. Hence, USM sharpening detection has become a hot research topic in digital image forensics. The USM sharpening process usually contains the following two steps:

Step 1: Perform Gaussian high-pass filtering.

H(x, y) = Gσ(x, y) * Ii(x, y)

where H is the high-pass filtering result (the mask); Ii is the original image; (x, y) represents the horizontal and vertical coordinates; * denotes convolution; Gσ stands for the Gaussian high-pass filter; and σ is the standard deviation (SD) of G, which controls the sharpening range.

Step 2: Add the unsharp mask to the original image.

O(x, y) = Ii(x, y) + λH(x, y)

where O is the final image after sharpening and λ is the scale coefficient that can control the strength of the sharpening.
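A minimal sketch of the two steps is given below, realizing the high-pass mask as the difference between the image and its Gaussian-blurred version (a common practical way to implement a Gaussian high-pass filter); the values of σ and λ are illustrative only.

import numpy as np
from scipy.ndimage import gaussian_filter

def usm_sharpen(img, sigma=1.0, lam=0.8):
    """Unsharp masking: O = I + lambda * H, with H = I - Gaussian_blur(I)."""
    img = np.asarray(img, dtype=np.float64)
    h = img - gaussian_filter(img, sigma)   # Step 1: high-pass mask H
    out = img + lam * h                     # Step 2: add the scaled mask
    return np.clip(out, 0, 255)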

Fig. 7 [46] shows the change in grayscale values after USM sharpening. The original edge is a side-plain edge. It is observed in Fig. 7 that there are two jumps enlarging the edge effects in the edge region. This phenomenon is defined as the overshoot artifact, which is a significant clue for the USM sharpening procedure. It is caused by the superposition of high-frequency signals.


Fig. 7. Overshoot artifacts of USM sharpening [46].

In recent years, many researchers have proposed several approaches to detect USM sharpening [46–49]. The edge-modeling-based [47,48] and local-texture-based [46,49] approaches are two widely used techniques in USM sharpening detection. Cao et al. [47] first proposed the idea of USM sharpening detection. They discovered the histogram aberration caused by sharpening, and built edge models to measure it. However, in their later work [48], the method in Ref. [47] was proved to be effective only when the image has a wide pixel value histogram. Cao et al. [48] proposed another method for USM sharpening detection. First, the edge position of the original image was detected. Next, the overshoot strength around the edge was measured, and the average overshoot strength over the whole image was calculated. Finally, a threshold was set for overshoot artifacts in order to determine whether an image has been USM sharpened. Although this approach in Ref. [48] solves some of the shortcomings of the method presented in Ref. [47], it is very sensitive to image noises.

Local-texture-based approaches apply different local texture descriptors to edge regions. Local binary pattern (LBP) and edge perpendicular binary coding (EPBC) [46] textures are two widely used local textures. Ding et al. [49] adopted LBP texture. They first located the edge region using a Canny operator. They then applied LBP texture around the edge pixel. The overall histogram of the whole image was calculated and an SVM classifier was utilized to differentiate the sharpened image from the original images. Experimental results showed that the LBP-based approach outperforms the edge modeling approaches. Later, Ding et al. [46] discovered that overshoot artifacts mainly appeared in the perpendicular direction of the edge. Therefore, they proposed a novel texture—EPBC texture—for USM sharpening detection. Compared with LBP texture, the EPBC texture feature uses a 1 × N rectangular window centered on the edge pixel and aligned with the edge normal. A new binary coding strategy was applied to the rectangular window. Compared with other approaches, the EPBC-based approach achieves better detection accuracy and efficiency. This approach is also robust against JPEG compression and other noise, to some extent.


4.3. Detection of copy-move attacks

The aim of a copy-move attack is to deliberately forge or remove one or more objects in the source image. For object creation, the new object is generated by copying a source object and moving it to another position in the same image as shown in Fig. 8. For object removal, the object region being removed is replaced by some background patches in the same image. In order to deceive the human eye, fundamental affine transforms such as scaling, rotation, and so forth are usually applied before pasting.


Fig. 8. A typical example of copy-move attack. (a) Original image; (b) tampered image.

Considering the mechanism of the copy-move attack, it can be concluded that in the tampered image, there is at least one pair of regions in the object or background region with extremely similar color, shape, and texture. By analyzing these similar image patch pairs, a copy-move attack can be detected. Accuracy and efficiency are two key issues in copy-move detection. Refs. [50,51] are the pioneering works in copy-move detection. Fridrich et al. [50] discussed several major requirements of the copy-move detection algorithm, including allowing for an approximate match of small image segments, producing few false alarms, and requiring an acceptable processing time or complexity. They then proposed a detection algorithm based on block matching. The image investigated was first divided into a series of overlapping small blocks, and specific features were extracted from each block. By comparing the features, similar small blocks were detected; the test image was regarded as an image forgery when there were more than a preset number of block pairs gathering together in the spatial domain. Popescu and Farid [51] adopted principal component analysis (PCA) for feature extraction, and the eigenvalues of the small block were adopted as its corresponding feature. The major computational complexity of Refs. [50,51] is dominated by the lexicographic sorting, which is O(F × N log N) [51], where F is the feature dimension and N is the number of pixels in the image. However, the above approaches may not achieve good results when the duplicated region is scaled or rotated greatly.
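The block-matching pipeline of Refs. [50,51] can be sketched as follows (a simplified illustration, not the original implementations): overlapping blocks are described by coarsely quantized low-order DCT coefficients, the features are sorted lexicographically, and block pairs that share a frequent spatial offset are reported. The block size, feature dimension, quantization step, and thresholds are all assumptions, and a zigzag scan plus morphological postprocessing would be used in practice.

import numpy as np
from collections import Counter
from scipy.fftpack import dct

def _block_feature(block, n_coef=9, q=4.0):
    c = dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')
    return tuple(np.round(c.ravel()[:n_coef] / q))   # coarsely quantized DCT coefficients

def copy_move_offsets(img, block=8, min_dist=16, min_count=20):
    """Return spatial offsets shared by many matching block pairs (copy-move candidates)."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    feats, pos = [], []
    for i in range(h - block + 1):
        for j in range(w - block + 1):
            feats.append(_block_feature(img[i:i + block, j:j + block]))
            pos.append((i, j))
    order = sorted(range(len(feats)), key=lambda k: feats[k])   # lexicographic sorting
    offsets = Counter()
    for a, b in zip(order[:-1], order[1:]):
        if feats[a] == feats[b]:
            (i1, j1), (i2, j2) = pos[a], pos[b]
            if abs(i1 - i2) + abs(j1 - j2) >= min_dist:          # ignore trivial neighbours
                offsets[(i2 - i1, j2 - j1)] += 1
    return [off for off, cnt in offsets.items() if cnt >= min_count]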

In order to deal with the problems caused by scaling and rotation, Bayram et al. [52] exploited the properties of the Fourier-Mellin transform, which is robust against translation, rotation, and scaling. Furthermore, they adopted counting Bloom filters in order to reduce the computational complexity caused by lexicographic sorting. The experimental results showed that their features can resist a rotation of up to 10° and a scaling of up to 10%. Li and Yu [53] extended the idea in Ref. [52] and proposed the vector erosion filter to solve the problem of vector counters. Their algorithm is shown to be able to detect region duplication with a large rotation angle. Recently, Zandi et al. [54] proposed an adaptive copy-move forgery detection (CMFD) approach. Different thresholds were adopted for various image contents. The adaptive threshold of a specific block was determined by its SD; thus, the corresponding CMFD can detect duplications in both smooth and textured regions. Christlein et al. [55] performed a comprehensive evaluation of various kinds of CMFD approaches. An image database containing 48 base images was adopted, and the copy-move forgeries were carefully produced without leaving visually noticeable traces. The results in Ref. [55] demonstrated that although keypoint-based approaches are very efficient, the detection results will be affected by low-contrast regions and repetitive objects [55]. On the other hand, block-based methods provide high detection accuracies at the cost of higher computational complexity. Of all the features, Zernike features are the recommended choice.


4.4. Resampling detection

Generally speaking, in many image-splicing scenarios, the spliced image region usually undergoes a resizing and/or a rotation procedure to cover the forgery traces and to make the final image forgery appear more realistic. Moreover, in the resizing and rotation procedure, the pixels in the modified image patch must be resampled to fit the new sample lattice. Hence, resampling detection can help to detect possible image forgeries and locate suspicious splicing regions.

In 2005, Popescu and Farid [56] proposed their pioneering work in resampling detection. They observed that for resampled signals, there are strong correlations between neighboring/adjacent samples. An EM algorithm was employed to estimate the underlying correlation parameters. With the above parameters, a probability map can be generated to describe the probability that a specific pixel is correlated with its neighboring pixels. This approach can provide high detection accuracy for uncompressed images, although the detection performance will be degraded for JPEG compressed images because the block discrete cosine transform (BDCT) in JPEG compressed images introduces periodic artifacts that resemble resampling traces and may confuse the detector.

Gallagher [57] proposed a simple algorithm to detect two widely used interpolation operators: the linear interpolation and the cubic interpolation. He observed that after linear/cubic interpolation, images would have a periodic pattern in the second derivative signal, and that the period of such a pattern can indicate the interpolation factor. The experiments showed that the approach in Ref. [57] can detect the zoom factor from 1.1 to 3.0 in increments of 0.1. Mahdian and Saic [58] extended this idea and showed that interpolation will introduce a specific periodicity when considering the covariance of the signal’s derivatives. However, the detection performance of the above approaches will be degraded with medium or strong JPEG compressions.
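The periodicity observed by Gallagher [57] can be visualized with the following sketch, which averages the magnitude of the horizontal second derivative over rows and inspects its spectrum; pronounced non-DC peaks suggest resampling. The exact statistic in Ref. [57] is variance-based, so this is an approximation.

import numpy as np

def second_derivative_spectrum(img):
    """Magnitude spectrum of the row-averaged |horizontal second derivative|."""
    img = np.asarray(img, dtype=np.float64)
    d2 = img[:, 2:] - 2.0 * img[:, 1:-1] + img[:, :-2]   # horizontal second derivative
    signal = np.mean(np.abs(d2), axis=0)                 # average over rows
    signal = signal - signal.mean()                      # remove the DC component
    return np.abs(np.fft.rfft(signal))                   # peaks indicate periodic correlations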

In order to deal with JPEG compressed images, Kirchner and Gloe [59] modified the algorithm in Ref. [56] by removing the artifacts caused by JPEG compression to some extent. Their results demonstrated that the detection performance degrades greatly when the post-compression is stronger than the pre-compression. Moreover, upscaling is usually more detectable than downscaling.

Vázquez-Padín et al. [60] recently proposed a simple approach to detect upsampled images or image patches. The singular value decomposition (SVD) approach was adopted to characterize the linear dependencies of a resampled image. A specific measure describing the degrees of saturated pixels was then proposed in order to differentiate upsampled images from genuine images. This approach showed good performance in detecting small resampled patches.


4.5. Blind image-splicing detection

This kind of method attempts to detect various image-tampering and manipulation procedures based on the fact that such procedures will inevitably introduce certain artifacts that do not exist in the original images. Generally speaking, these approaches lead to a binary decision regarding whether a test image is original or not. However, even if the image is examined and determined to not be original, it is not possible to tell what kind of tampering has been performed, or what region has been tampered with. Nonetheless, blind image-tampering detection algorithms can be used as a preprocessing step for many image forensics systems in order to detect suspicious images without determining the editing/manipulation tools that were used.

One of the first blind image-tampering detection methods was proposed by Avcibas et al. [61], who aimed to detect image manipulations caused by affine transformations (scaling and rotation), brightness and contrast adjustments, and so forth. Various image quality metrics were adopted as image features, and linear regression was employed as the classifier. The forgery detection was formulated as a two-class classification problem, a formulation that has been adopted by most of the subsequent blind detection approaches. The detection accuracy of this method was reported as about 70%. Shi et al. [62] proposed an image-splicing detection method based on a DCT coefficient model. They observed that the distribution and correlation between neighboring BDCT coefficients are useful to differentiate natural images from spliced images. The Markov features extracted from the BDCT 2D arrays were adopted as the discriminative features, and the SVM was employed as the classifier. The experimental results showed that this approach can achieve over 90% detection accuracy on the Columbia Image Splicing Detection Evaluation Dataset [63]. Wang et al. [64] stated that the chromatic information is more discriminative than the grayscale information. They adopted the GLCM of the edge map in a chromatic channel as the discriminative feature, and employed SVM as the classifier. Similar to the Markov features in Ref. [62], the GLCM features represent the second-order statistics of the data and usually provide comparable results. He et al. [65] extended the idea from Ref. [62] and extracted Markov features in both the DCT and DWT domains. An SVM with recursive feature elimination (SVM-RFE) was employed to extract highly discriminative features and provide higher detection accuracy. This approach also achieved high detection accuracy in the Columbia Image Splicing Detection Evaluation Dataset [63]. Recently, Zhao et al. [66] proposed a new blind image-splicing detection method. In contrast to the traditional causal Markov features [62,64,65], they employed a 2D noncausal Markov model to describe the underlying characteristics of the natural and spliced images in both the DCT and DWT domains. The proposed noncausal model provided more information and better modeled the 2D image. The model parameters were regarded as the features, and SVM was adopted as the classifier. The experimental results revealed their detection accuracies to be above 90% in the Columbia Image Splicing Detection Evaluation Dataset, which demonstrated the feasibility of the blind detection approaches in image-splicing detection.
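A simplified version of the causal Markov features used in Refs. [62,65] can be sketched as follows; the rounding of the block-DCT array, the threshold T = 4, and the restriction to horizontal differences follow the common setup but are assumptions here, and the full feature sets also use other difference directions and, in Ref. [65], the DWT domain.

import numpy as np
from scipy.fftpack import dct

def bdct(img, block=8):
    """Rounded block-DCT (8x8) array of a grayscale image."""
    img = np.asarray(img, dtype=np.float64)
    h = img.shape[0] - img.shape[0] % block
    w = img.shape[1] - img.shape[1] % block
    out = np.zeros((h, w))
    for i in range(0, h, block):
        for j in range(0, w, block):
            b = img[i:i + block, j:j + block]
            out[i:i + block, j:j + block] = dct(dct(b, axis=0, norm='ortho'),
                                                axis=1, norm='ortho')
    return np.round(out)

def markov_features(img, T=4):
    """(2T+1)^2 transition probabilities of thresholded horizontal differences."""
    c = bdct(img)
    d = np.clip(c[:, :-1] - c[:, 1:], -T, T).astype(int)
    trans = np.zeros((2 * T + 1, 2 * T + 1))
    src, dst = d[:, :-1] + T, d[:, 1:] + T
    np.add.at(trans, (src.ravel(), dst.ravel()), 1.0)
    rows = trans.sum(axis=1, keepdims=True)
    return (trans / np.maximum(rows, 1.0)).ravel()

These feature vectors would then be fed to a two-class SVM trained on natural versus spliced samples.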


5. Conclusion and future work

Owing to various kinds of traces in image acquisition, image storage, and image editing stages, digital image forgeries can be detected without any prior information or knowledge. Among the various kinds of passive digital image forensics approaches, machine-learning-based techniques play an important role, in which forgery detection is formulated as a two-class (i.e., natural image or composite image) classification problem. Various features that can effectively disclose specific underlying traces have been proposed in digital image forensics.

The most critical issue and challenging problem in digital image forensics stem from the discriminative power and generalization ability of the extracted features. On the one hand, the extracted features should be sensitive to specific forgery operations and be robust against variations caused by different image contents. In most cases, these two requirements conflict with each other, leaving the question of how to design a highly discriminative, content-adaptive image forensics approach as an unsolved problem. On the other hand, most of the current features are handcrafted or hand-designed features that are usually based on certain assumptions or on a specific simplified model of the forgery procedure. However, due to the complexity in real images that are caused by variations in image content, image manipulation techniques, and so forth, handcrafted features have difficulty handling most forgery cases effectively and comprehensively.

With the development of sophisticated artificial intelligence techniques, deep learning offers a promising solution for digital image forensics. In many deep learning structures, such as the CNN and the deep residual network, features are automatically learned from training samples rather than being manually designed. With sufficient training samples, deep-learning-based approaches can provide a more comprehensive description for specific forgery operations, compared with traditional approaches. Recent deep-learning-based approaches have shown great performance improvement in digital image forensics applications such as MF detection. However, thus far, deep-learning-based approaches have not shown as much performance gain in digital image forensics as they have in image recognition and understanding. This is because the current network structures are largely borrowed from those adopted in image recognition; these structures are more or less related to image content, which leads to performance degradation in many digital image forensics applications. As a result, although deep-learning-based approaches are promising, they are not yet mature in digital image forensics; a considerable amount of work remains to be done in this area [67–71].


Acknowledgements

The work described in this paper was supported by National Key Research and Development Program of China (2016QY01W0104) and National Natural Science Foundation of China (61771310).


Compliance with ethics guidelines

Xiang Lin, Jian-Hua Li, Shi-Lin Wang, Alan Wee-Chung Liew, Feng Cheng, and Xiao-Sa Huang declare that they have no conflict of interest or financial conflicts to disclose.