《1. Introduction》

1. Introduction

Accurate kinematics of the knee joint is critical in many orthopedic applications for understanding aspects such as the normal function of the joint [1], the development of knee osteoarthritis [2], the mechanisms of knee injuries [3], the optimization of prosthesis design [4], preoperative planning, and postoperative rehabilitation [5]. The measurement of knee kinematics is also essential for biomechanical studies of the musculoskeletal system. Given the significant demand for joint kinematics in the clinical field, an efficient and reliable method to measure the dynamic motion of the joint is needed.

Various measurement tools are now available for researchers to quantify three-dimensional (3D) knee kinematics, but only a few of them provide millimeter-scale accuracy and high tracking speed. Skin-marker-based optical tracking systems are widely used in the analysis of human motion, but their accuracy is affected by marker-associated soft-tissue artifacts, which can cause displacements of up to 40 mm [6]. Although several researchers have attempted to reduce the effects of soft-tissue artifacts by building mathematical models [7–9], the issue remains unsolved for any skin-marker-based motion-capture technique [10]. With the development of medical imaging, some techniques, such as real-time magnetic resonance (MR) tomography and computed tomography (CT), can measure dynamic joint kinematics directly [11,12]. However, the clinical adoption of these techniques has been limited by low temporal resolution, restricted range of motion (ROM), the need to control motion speed, low image quality, and nonnegligible amounts of radiation [13,14]. In the past decade, the dual-fluoroscopic imaging system (DFIS) has become widely used and well accepted for accurate in-vivo joint motion analysis because of its high accuracy [15], accessibility, sufficient ROM [16], and low radiation levels compared with traditional radiography (Fig. 1).

《Fig. 1》

Fig. 1. Virtual DFIS for measuring the dynamic motion of knee joints.

To find the pose of the object (i.e., the native knee joint) in DFIS, two-dimensional (2D) to 3D registration, which aligns the volume data (e.g., CT) with fluoroscopy (continuous X-ray images), is applied in the measurement procedure. The 3D position of the CT volume is adjusted iteratively, and digitally reconstructed radiographs (DRRs) are generated at each step until the DRR is most similar to the X-ray image [17]. With the increasing use of DFIS in clinical applications, researchers have developed various automatic registration methods to accelerate the matching procedure. Optimization-based registration, which is composed of an optimizer and a similarity metric between images, has been investigated extensively [18,19]. Although the accuracy of optimization-based registration is high [20–22], several drawbacks, such as the strictly required registration initialization and the high computational cost of calculating DRRs and iterating during optimization, limit the widespread use of DFIS [23].
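To make the cost of this iterative procedure concrete, the following minimal Python sketch outlines a generic optimization-based registration loop. It is illustrative only: `generate_drr` and `similarity` are hypothetical placeholders for a DRR renderer and an image-similarity metric (e.g., normalized cross-correlation), not functions of any specific library, and the Powell optimizer is just one possible choice.

```python
import numpy as np
from scipy.optimize import minimize

def register_by_optimization(ct_volume, fluoro_images, pose_init,
                             generate_drr, similarity):
    """Iteratively adjust the 6DOF pose until the DRRs best match the X-rays.

    generate_drr(volume, pose, view) and similarity(img_a, img_b) are
    hypothetical placeholders for a ray-casting renderer and an image
    similarity metric; pose_init must already be close to the true pose.
    """
    def cost(pose):
        # Negative summed similarity over all fluoroscopic views,
        # so that minimization corresponds to the best match.
        return -sum(similarity(generate_drr(ct_volume, pose, v), img)
                    for v, img in enumerate(fluoro_images))

    result = minimize(cost, np.asarray(pose_init, dtype=float), method="Powell")
    return result.x  # estimated 6DOF pose (3 translations + 3 rotations)
```

Every evaluation of the cost function requires rendering one DRR per view, which is what makes this strategy computationally expensive and sensitive to initialization.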

With the rapid development of machine learning [24,25] in recent years, several learning-based methods have been developed to measure joint kinematics, offering computational efficiency and an enhanced capture range compared with optimization-based methods [21,26–28]. However, these methods are usually trained with synthetic X-ray images (i.e., DRRs) because training such models with a large amount of authentic labeled data is impractical. Even so, a considerable number of authentic images is still necessary to ensure the robustness of registration [22,27]. Another consideration is the discrepancy between DRRs and X-ray images. Compared with DRRs, fluoroscopic images show blurred edges, geometric distortion, and nonuniform intensity [29,30]; therefore, networks trained on DRRs do not generalize well to fluoroscopic images [22]. Previous studies have established various physical models to generate more realistic DRRs through additional measurements of X-ray quality [31,32]. Recently, a survey by Haskins et al. [24] highlighted the potential of transfer learning for such cross-modal registration, which may save the effort of building complicated DRR models or collecting authentic clinical images.

In our work, we developed a pseudo-Siamese multi-view point-based registration framework to address the problem of the limited number of real fluoroscopic images. The proposed method is a combination of a pseudo-Siamese point-tracking network and a feature-transfer network. The pose of the knee joint was estimated by tracking selected points on the knee joint with the multi-view point-based registration network, paired DRRs, and fluoroscopy. A feature extractor was trained by the feature-learning network with pairs of DRRs and fluoroscopic images. To overcome the limited number of authentic fluoroscopic images, we trained the multi-view point-based registration network with DRRs and pre-trained the feature-learning network on ImageNet.

The remainder of this paper is organized as follows. Section 2 reviews deep-learning-based 2D–3D registration and domain adaptation. Section 3 presents the proposed learning-based 2D–3D registration method. Section 4 presents the experiments and results, and Section 5 concludes the paper.

《2. Related work》

2. Related work

《2.1. Learning-based strategy》

2.1. Learning-based strategy

To avoid the large computational costs of optimization-based registration, researchers have recently developed learning-based registration [24]. Given the success of convolutional neural networks (CNNs), CNN-based feature extraction from both DRRs and fluoroscopic images has been proposed, with the pose of the rigid object then estimated by a hierarchical regressor [33]. The CNN model improves the robustness of registration, but it is limited to objects with strong features, such as medical implants, and cannot perform the registration of native anatomic structures. Miao et al. [28] proposed a reinforcement-learning network to register X-ray and CT images of the spine with a Markov decision process. Although they improved the method with a multi-agent system, it may still fail when the search does not converge. Recently, several attempts have been made to register rigid objects with point-correspondence networks [27,34,35], which showed good results in both efficiency and accuracy on anatomic structures. These methods avoid costly and unreliable iterative pose searching and correct out-of-plane errors with multiple views.

《2.2. Domain adaptation》

2.2. Domain adaptation

The discrepancy between synthetic data (i.e., DRRs) and authentic data (i.e., fluoroscopic images), also known as domain shift, is another challenge for learning-based registration strategies, in which the training data and future data must lie in the same feature space and have the same distribution [36]. Compared with building complicated models for DRR generation, domain adaptation has emerged as a promising and relatively effortless strategy to account for the domain difference between different image sources [37], and it has been applied in many medical applications, such as X-ray segmentation [38] and multi-modal image registration [21,22,39]. For 2D–3D registration, Zheng et al. [21] proposed integrating a pairwise domain adaptation module into a pre-trained CNN that performs rigid registration using a limited amount of training data. The network was trained on DRRs and performed well on synthetic data; the authentic features were then transferred close to the synthetic features through domain adaptation. However, existing methods are still inappropriate for natural joints, such as knees and hips. Therefore, a registration approach designed for natural joints that does not require numerous clinical X-ray images for training is needed.

《3. Methods》

3. Methods

The aim of 2D–3D registration is to estimate the six-degree-of-freedom (6DOF) pose of 3D volume data from pairs of 2D multi-view fluoroscopic images. In the following section, we begin with an overview of the tracking system and multi-view point-based 2D–3D registration (Section 3.1). Then, details of the two main components of our work are given in Section 3.2 and Section 3.3.

《3.1. Multi-view point-based registration》

3.1. Multi-view point-based registration

3.1.1. 2D–3D rigid registration with 6DOF

We consider the registration of each bone in the knee joint as a separate 2D–3D registration procedure. Pose reproduction of each bone is denoted as the 3D alignment of the CT volume data V through a transformation matrix $T_{4\times4}$, which is parameterized by six elements of translation and rotation using Euler angles [40]. The transformation matrix $T_{4\times4}$ can be represented as a homogeneous 4 × 4 matrix, and the pose P can be derived as follows:

$$T_{4\times4} = \begin{bmatrix} R_{3\times3} & t \\ \mathbf{0} & 1 \end{bmatrix}, \qquad P = \left(t_x, t_y, t_z, \theta_x, \theta_y, \theta_z\right)$$

where $R_{3\times3}$ is the rotation matrix about the three axes (parameterized by the Euler angles $\theta_x$, $\theta_y$, and $\theta_z$), and $t = (t_x, t_y, t_z)^{\mathrm{T}}$ is the translation vector along the three axes.
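For illustration, the following NumPy sketch builds the homogeneous matrix from the six pose parameters. The X-Y-Z rotation order is an assumption of this example; the paper does not state which Euler convention was used.

```python
import numpy as np

def pose_to_matrix(tx, ty, tz, rx, ry, rz):
    """Build the homogeneous transform T (4x4) from a 6DOF pose.

    Rotations (in radians) are assumed to be applied in X-Y-Z order;
    the actual Euler convention used in the study may differ.
    """
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    R = Rz @ Ry @ Rx                  # rotation about the three axes
    T = np.eye(4)
    T[:3, :3] = R                     # R_{3x3}
    T[:3, 3] = [tx, ty, tz]           # translation vector t
    return T
```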

3.1.2. 3D projection geometry of X-ray imaging

In the virtual DFIS, the four corners of each imaging plane and the location of the X-ray sources were used to establish the optical pinhole model during DRR generation (Fig. 1). After a polynomial-based distortion correction and spatial calibration of the two-view fluoroscopy, DRRs were generated by the ray-casting algorithm [41] with segmented CT volume data using Amira software (Thermo Fisher Scientific, USA). Combining the transformation matrix $T_{4\times4}$, the final DRR $I_{\mathrm{DRR}}$ can be computed as follows:

$$I_{\mathrm{DRR}}(s) = \int_{p \in s} \mu\left(T_{4\times4}\, p\right)\, \mathrm{d}p$$

where $s$ is the ray connecting the X-ray source and the image plane in the X-ray imaging model, $p$ is a point on the ray, and $\mu(\cdot)$ represents the attenuation coefficient at that point in the volume data.
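As an illustration of the line integral above, a deliberately simplified NumPy sketch of a single ray is given below; it uses nearest-neighbour sampling and a fixed number of samples, whereas the actual DRRs in this study were produced by the ray-casting renderer in Amira.

```python
import numpy as np

def drr_ray_value(volume, spacing, T_world_to_volume, source, pixel,
                  n_samples=256):
    """Approximate line integral of attenuation along one ray.

    volume: 3D array of attenuation coefficients from the segmented CT;
    spacing: voxel size in mm (scalar or length-3); T_world_to_volume:
    4x4 matrix mapping world points into the volume frame (the inverse of
    the bone pose); source, pixel: 3D positions of the X-ray source and
    one detector pixel. Nearest-neighbour sampling is used for brevity.
    """
    source = np.asarray(source, float)
    pixel = np.asarray(pixel, float)
    ts = np.linspace(0.0, 1.0, n_samples)
    # Points p along the ray s from the source to the detector pixel.
    points = source[None, :] + ts[:, None] * (pixel - source)[None, :]
    homog = np.c_[points, np.ones(n_samples)]
    voxels = (T_world_to_volume @ homog.T).T[:, :3] / spacing
    idx = np.round(voxels).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(volume.shape)), axis=1)
    mu = volume[idx[inside, 0], idx[inside, 1], idx[inside, 2]]
    step = np.linalg.norm(pixel - source) / n_samples     # dp
    return mu.sum() * step   # approximate integral of mu along the ray
```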

3.1.3. Registration by tracking multi-view points

Previous literature has reported single-view 2D–3D registration to be an ill-posed problem; therefore, two-view fluoroscopic images were used for registration to avoid out-of-plane errors [42]. Considering the excellent performance of point-based registration methods on anatomic structures [27,34,35], we measured the dynamic motion of the knee joint by tracking a set of selected points on the surface model in DFIS (Fig. 2), and we denoted the selected points as $P_{\mathrm{bone}} = [p_1, p_2, p_3, \ldots, p_N]$. The 2D projections of the selected points were tracked with a pseudo-Siamese multi-view point-based registration network (Section 3.2). After tracking the selected points from all the provided views, we reproduced the 3D locations of the set of points $P_{\mathrm{E}} = [p_1^{\mathrm{estimated}}, p_2^{\mathrm{estimated}}, p_3^{\mathrm{estimated}}, \ldots, p_N^{\mathrm{estimated}}]$ using triangulation [43]. To determine the final transformation matrix T, a Procrustes analysis [44] was used as follows:

$$T = \arg\min_{T} \sum_{i=1}^{N} \left\| T\, p_i - p_i^{\mathrm{estimated}} \right\|_2^2$$

The final pose of each bone was reproduced with the transformation matrix T.
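A compact sketch of the geometric steps just described is shown below: direct linear transform (DLT) triangulation of each tracked point from the two calibrated views, followed by a rigid Procrustes (Kabsch) fit. The 3 × 4 projection matrices P1 and P2 are assumed to come from the calibrated virtual DFIS; the implementations used in the study may differ in detail.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """DLT triangulation of one 3D point from two calibrated views.

    P1, P2: 3x4 projection matrices of the two fluoroscopic views;
    x1, x2: corresponding tracked 2D points (pixels) in each view.
    """
    A = np.stack([x1[0] * P1[2] - P1[0],
                  x1[1] * P1[2] - P1[1],
                  x2[0] * P2[2] - P2[0],
                  x2[1] * P2[2] - P2[1]])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def procrustes_rigid(p_bone, p_estimated):
    """Rigid Procrustes (Kabsch) fit: find T mapping p_bone onto p_estimated."""
    mu_b, mu_e = p_bone.mean(axis=0), p_estimated.mean(axis=0)
    H = (p_bone - mu_b).T @ (p_estimated - mu_e)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T               # rotation with reflection removed
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = mu_e - R @ mu_b       # translation
    return T
```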

《Fig. 2》

Fig. 2. The workflow of the multi-view point-based registration method. A set of points was selected on the bone surface, and their 2D projections were tracked from each view in the virtual DFIS to reconstruct their 3D positions. The final transformation matrix was determined by the reconstructed points using Procrustes analysis [44].

《3.2. Pseudo-Siamese point tracking network》

3.2. Pseudo-Siamese point tracking network

In the proposed method, we used a pseudo-Siamese network to track points from each view. The pseudo-Siamese network has two branches: one is a visual geometry group (VGG) network [45] for extracting features from DRRs, and the other is a feature-transfer network, which transfers authentic features to synthetic features (Section 3.3). The overall workflow is shown in Fig. 3. The input of the network was unpaired DRRs and fluoroscopic images, and the output was the tracked points on the fluoroscopic images. In the upper branch of the network (Fig. 3), the extracted features $F_{\mathrm{DRR}}$ around each selected point have a size of M × N × C, where M and N are the width and height of the DRR, respectively, and C is the number of feature channels. In the lower branch of the network, the features of the fluoroscopic images, $F_{\mathrm{fluoro}}$, were extracted by the feature-transfer network without weight sharing. Given the extracted features $F_{\mathrm{DRR}}$ and $F_{\mathrm{fluoro}}$, a convolutional layer was applied to quantify the similarity between the two feature maps [27]. The similarity is denoted as

where W is a learned weighting factor that finds a better similarity for each selected point. The objective function to be minimized during training is the Euclidean loss (i.e., the registration loss), defined as

$$L_{\mathrm{reg}} = \left\| p_{\mathrm{fluoro}} - p_{\mathrm{DRR}} \right\|_2^2$$

where $p_{\mathrm{fluoro}}$ denotes the tracked 2D points and $p_{\mathrm{DRR}}$ denotes the projected 2D points with known locations in the DRR. With the tracked 2D points from different views, the 3D points were reconstructed using triangulation [43].
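For concreteness, a minimal PyTorch sketch of this patch-correlation tracking and the registration loss is given below. It is a simplification under stated assumptions: a per-channel weight stands in for the learned convolutional similarity layer W, the patch size is arbitrary, points are assumed to lie away from the image border, and a hard argmax is used where a differentiable soft-argmax would be needed for end-to-end training.

```python
import torch
import torch.nn.functional as F

def track_points(feat_drr, feat_fluoro, points_drr, weight, patch=7):
    """Track selected points by correlating DRR feature patches with the
    fluoroscopic feature map.

    feat_drr, feat_fluoro: feature maps of shape (C, H, W) from the two
    branches; points_drr: (K, 2) integer tensor of (x, y) projections of
    the selected points in the DRR; weight: learned per-channel weighting
    (C,), a simplified stand-in for the similarity layer W.
    """
    C = feat_drr.shape[0]
    _, Hf, Wf = feat_fluoro.shape
    r = patch // 2
    tracked = []
    for x, y in points_drr.tolist():
        # Feature patch around the selected point in the DRR branch.
        template = feat_drr[:, y - r:y + r + 1, x - r:x + r + 1]
        template = template * weight.view(C, 1, 1)
        # Correlate the weighted template with the fluoroscopic features.
        score = F.conv2d(feat_fluoro.unsqueeze(0), template.unsqueeze(0),
                         padding=r).squeeze()
        # Hard argmax for illustration only.
        idx = torch.argmax(score)
        tracked.append(torch.stack((idx % Wf, idx // Wf)))
    return torch.stack(tracked).float()

def registration_loss(p_tracked, p_true):
    """Euclidean registration loss between tracked and known 2D points."""
    return ((p_tracked - p_true) ** 2).sum(dim=1).mean()
```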

《Fig. 3》

Fig. 3. The framework of the point-tracking network. Pairs of DRRs and fluoroscopic images were imported to the network, and their features were extracted by a VGG and a feature-transfer network, respectively. The selected points were tracked on fluoroscopic images by searching the most similar feature patch around the selected points in DRRs. Conv: convolution layers.

《3.3. Feature transfer using domain adaptation》

3.3. Feature transfer using domain adaptation

For feature extraction of fluoroscopic images, we proposed a transfer-learning-based method to reduce the domain difference between synthetic images (i.e., the DRRs) and authentic X-ray images (i.e., the fluoroscopic images) (Fig. 4).

《Fig. 4》

Fig. 4. Feature-transfer network with the paired synthetic image and authentic image. Synthetic images (i.e., DRRs) were generated at the pose after manual registration.

To close the gap between the two domains, we used a domain-adaptation method. That is, an additional coupled VGG network with a cosine-similarity loss was added during feature extraction of the fluoroscopic images (Fig. 5). Pairs of DRRs and fluoroscopic images, which share the same locations of the volume data through a model-based manual registration method [9], were used for training. We used cosine similarity as the cost function to measure the gap between the two domains. For the tracking problem, the cosine similarity can be stated as

$$S_{\cos} = \frac{F_{\mathrm{DRR}} \cdot F_{\mathrm{fluoro}}}{\left\| F_{\mathrm{DRR}} \right\|_2 \left\| F_{\mathrm{fluoro}} \right\|_2}$$

where $\|\cdot\|_2$ denotes the L2-norm, $\cdot$ denotes the dot product, and $F_{\mathrm{DRR}}$ and $F_{\mathrm{fluoro}}$ are the feature maps of the DRR and the fluoroscopic image, respectively. To improve the performance of feature transfer, we optimized the proposed method with weights pre-trained on ImageNet.
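A minimal PyTorch sketch of such a cosine-similarity objective on a pose-matched DRR/fluoroscopy pair is shown below. Whether the feature maps are flattened globally or compared per location, and whether the training objective is 1 − cos or −cos, are assumptions of this example rather than details reported here.

```python
import torch
import torch.nn.functional as F

def feature_transfer_loss(feat_synthetic, feat_authentic):
    """Cosine-similarity loss between paired DRR and fluoroscopic features.

    feat_synthetic, feat_authentic: feature maps (B, C, H, W) from the DRR
    branch and the trainable feature-transfer branch for a pose-matched
    DRR/fluoroscopy pair. Minimizing (1 - cosine similarity) pulls the
    authentic features toward the synthetic ones.
    """
    a = feat_synthetic.flatten(start_dim=1)
    b = feat_authentic.flatten(start_dim=1)
    cos = F.cosine_similarity(a, b, dim=1)   # dot product / (L2 * L2)
    return (1.0 - cos).mean()
```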

《Fig. 5》

Fig. 5. The architecture of synthetic X-ray image feature extraction.

《4. Experiments and results》

4. Experiments and results

《4.1. Dataset》

4.1. Dataset

In this institutional-review-board-approved study, we collected CT images of three subjects’ knees, and all subjects performed two or three motions that were captured by a bi-plane fluoroscopy system (BV Pulsera, Philips, the Netherlands) at a frame rate of 30 frames per second. CT scans (SOMATOM Definition AS; Siemens, Germany) of each knee, ranging from approximately 30 cm proximal to 30 cm distal to the knee joint line (slice thickness, 0.6 mm; resolution, 512 × 512 pixels), were obtained. The size of the fluoroscopic images was 1024 × 1024 pixels with a pixel spacing of 0.28 mm. Geometric parameters of the bi-plane fluoroscopy imaging model, such as the polynomial distortion-correction parameters [46] and the locations of the X-ray source and detector plane, were used to establish a virtual DFIS, in which the pose of each bone was reproduced manually [47]. In this study, 143 pairs of matched fluoroscopic images were used (Fig. 6), of which 91 pairs were used for training the feature-transfer network of the fluoroscopic images and the point-tracking network, and the remaining images were used as the testing set. Additionally, a three-fold validation was performed in the study. To evaluate the 2D–3D registration algorithm, a widely used 3D error measurement, the target registration error (TRE), was applied [48]. We computed the mean TRE (mTRE) to determine the 3D error, defined as the average distance between the selected points at the ground-truth pose and the estimated points:

$$\mathrm{mTRE} = \frac{1}{N} \sum_{i=1}^{N} \left\| p_i^{\mathrm{bone}} - p_i^{\mathrm{estimated}} \right\|_2$$

where $P_{\mathrm{bone}}$ denotes the selected points and $P_{\mathrm{E}}$ denotes the estimated points. The success rate was defined as the percentage of all test cases with an mTRE of less than 10 mm.
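Both evaluation quantities can be computed directly from the point sets, as in the following short NumPy sketch.

```python
import numpy as np

def mean_tre(p_bone, p_estimated):
    """Mean target registration error: average 3D distance between the
    selected points at the ground-truth pose and the estimated points."""
    return np.linalg.norm(p_bone - p_estimated, axis=1).mean()

def success_rate(mtre_per_case, threshold=10.0):
    """Fraction of test cases with an mTRE below the threshold (10 mm here)."""
    mtre_per_case = np.asarray(mtre_per_case)
    return float((mtre_per_case < threshold).mean())
```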

《Fig. 6》

Fig. 6. Paired raw fluoroscopic images and the corresponding images after manual matching. The raw fluoroscopic images are (a) and (b), in which additional noise (wearable electromyography sensors) can be found on the surface of the lower limb. As described in the previous study [6], manual registration was performed until the projections of the surface bone model matched the outlines of the fluoroscopic images, and the matched results are shown in (c) and (d). Reproduced from Ref. [6] with permission of Elsevier Ltd., © 2011.

《4.2. Loss selection in cross-domain feature extraction analysis》

4.2. Loss selection in cross-domain feature extraction analysis

We used cosine similarity as the loss function for feature extraction on the authentic X-ray images. To identify a better loss function, we also evaluated the mean squared error as the loss function [22]. The position of the loss function may also affect the final performance of the feature-extraction layers. Thus, we first compared the effects of the loss functions placed at different convolution layers. To obtain the best cross-domain features from the real fluoroscopic images, we placed the defined loss function between the pairs of conv2 layers, conv3 layers, conv4 layers, and conv5 layers. Based on our data (Fig. 7), we preferred cosine similarity as the loss function because it yielded better performance on the final registration of the entire knee joint. Cosine similarity showed the best performance between the conv5 layers (see details in Appendix A, Table S1).

《Fig. 7》

Fig. 7. The success rate using cosine similarity and mean squared error (MSE) at different convolutional layers.

《4.3. With or without transfer training network analysis》

4.3. With or without transfer training network analysis

To test the effects of the proposed feature-based transfer learning method, we compared it with the Siamese registration network (i.e., the POINT2 network) [27]. Moreover, fine-tuning, a widely used transfer learning tool, was also compared in the current study to find a better way to reduce the differences between the fluoroscopic images and DRRs. The weights of the proposed method were pre-trained on the ImageNet database. The average performance over 10 tests for each method was used as the final performance. The mTRE results are reported in terms of the 10th, 25th, 50th, 75th, and 95th percentiles to demonstrate the robustness of the compared methods. The proposed feature-based transfer learning method performed significantly better than the Siamese registration network (Fig. 8), and it also performed better than fine-tuning, whose success rate was almost zero (Table S2 in Appendix A).

《Fig. 8》

Fig. 8. Mean target registration error with different registration networks.

《4.4. Three-fold cross-validation》

4.4. Three-fold cross-validation

We used three-fold cross-validation in this study and compared the proposed pseudo-Siamese registration network with and without transfer learning. Two of the three subjects were used for training the system, and the remaining subject was used to validate the system. This procedure was iterated ten times by shifting the test subject randomly. The performance (mTRE) was evaluated in each iteration, and the performances recorded in all ten iterations were averaged to obtain the final mTRE. The mTRE results are reported in terms of the 10th, 25th, 50th, 75th, and 95th percentiles (Table 1). The final three-fold cross-validation showed that the proposed method also performed better with feature transfer.

《Table 1》

Table 1 Three-fold cross-validation with and without transfer learning.

All values are in millimeters.

a Joint means the final registration result of the whole joint.

《5. Conclusions》

5. Conclusions

To overcome the limited number of real fluoroscopic images in learning-based 2D–3D rigid registration via DRRs, we proposed a pseudo-Siamese multi-view point-based registration framework. The proposed method decreases the demand for real X-ray images. With the ability to transfer authentic features to synthetic features, the proposed method performs better than the fine-tuned pseudo-Siamese network. This study also evaluated the POINT2 network with and without transfer learning. The results showed that the proposed pseudo-Siamese network has a better success rate and accuracy than the Siamese point-tracking network. With a small amount of training data, the proposed method can work as an initialization step for optimization-based registration to improve accuracy. However, there are several limitations to the current work. First, because our method is designed for at least two fluoroscopic views, multi-view data were required to reconstruct the knee poses; otherwise, the out-of-plane translation and rotation errors would be large because of the physical imaging model. Second, the proposed method cannot reach sub-millimeter accuracy compared with an optimization-based strategy. Like other learning-based strategies, our proposed method does not match the accuracy of the optimization-based method but is much faster, because no iterative step is needed during matching. In clinical orthopedic practice, accurate joint kinematics is essential for determining a rehabilitation scheme [5], surgical planning [1], and functional evaluation [47]. The proposed method alone is therefore inappropriate for in-vivo joint kinematics measurement, and a combination of our method with an optimization-based strategy would be a viable solution.

《Acknowledgements》

Acknowledgements

This project was sponsored by the National Natural Science Foundation of China (31771017, 31972924, and 81873997), the Science and Technology Commission of Shanghai Municipality (16441908700), the Innovation Research Plan supported by Shanghai Municipal Education Commission (ZXWF082101), the National Key R&D Program of China (2017YFC0110700, 2018YFF0300504 and 2019YFC0120600), the Natural Science Foundation of Shanghai (18ZR1428600), and the Interdisciplinary Program of Shanghai Jiao Tong University (ZH2018QNA06 and YG2017MS09).

《Compliance with ethics guidelines》

Compliance with ethics guidelines

Cong Wang, Shuaining Xie, Kang Li, Chongyang Wang, Xudong Liu, Liang Zhao, and Tsung-Yuan Tsai declare that they have no conflict of interest or financial conflicts to disclose.

《Appendix A. Supplementary data》

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.eng.2020.03.016.