Understanding Transfer Learning for Medical Imaging

  1. ImageNet images have a global subject, while diseases (e.g., malignant tissue) manifest as local texture variations in medical images, as shown in Fig. 1.
  2. The number of images in ImageNet is significantly larger than in any medical imaging dataset.
  3. ImageNet has a large number of classes (categories), while medical imaging datasets have significantly fewer: the number of possible diseases is small, e.g., five classes for diabetic retinopathy diagnosis.
Figure 1: Example images from the ImageNet, retinal fundus photograph, and CheXpert datasets, respectively. The fundus photographs and chest X-rays have much higher resolution (more pixels) than the ImageNet images, and are classified by looking for small local variations in tissue.
Table 1: Transfer learning and random initialization perform comparably, in terms of AUC, across both large standard architectures and lightweight CBR models. Both sets of models achieve similar AUCs despite significant differences in size and complexity. Performance on diabetic retinopathy (DR) diagnosis is also not closely correlated with ImageNet performance: the small models perform poorly on ImageNet yet comparably on the medical task.
Table 2: Transfer learning provides mixed performance gains on chest x-rays. Again, transfer learning (trans) does not help significantly, and much smaller models perform comparably.
Table 3: The benefits of transfer learning in the small data regime are largely due to architecture size. AUCs when training on the Retina task with only 5,000 data points. There is a bigger gap between random initialization and transfer learning for ResNet (a large model), but not for the smaller CBR models.
  1. Does transfer learning result in any representational differences compared to training from random initialization? Or are the effects of the initialization lost?
  2. Does feature reuse take place, and if so, where exactly?
Figure 2: Pretrained weights give rise to different hidden representations than training from random initialization for large models. We compute CCA similarity scores between representations learned using pretrained weights and those learned from random initialization. We do this for the top two layers (or stages, for ResNet and Inception) and average the scores, plotting the results in orange. In blue is a baseline similarity score, for representations trained from different random initializations. We see that representations learned from random initialization are more similar to each other than to those learned from pretrained weights for larger models, with less of a distinction for smaller models.
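The CCA similarity used here can be sketched roughly as follows. This is a minimal (S)VCCA-style computation in NumPy under my own assumptions (function name and details are mine, not the authors' released code); each activation matrix stacks one row per datapoint and one column per neuron:

```python
import numpy as np

def cca_similarity(X, Y):
    """Mean CCA correlation between two sets of layer activations.

    X, Y: (n_datapoints, n_neurons) activation matrices recorded from
    two networks on the same inputs. Returns a score in [0, 1].
    """
    # Center each neuron's activations over the datapoints.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Orthonormal bases for the two activation subspaces (whitening step).
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    # Singular values of Qx^T Qy are the canonical correlations.
    sigma = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return sigma.mean()
```

Because canonical correlations are invariant to invertible linear transforms of the neurons, the score compares what the two layers represent rather than how individual neurons happen to be scaled or rotated — which is why it is a reasonable tool for comparing networks trained from different initializations.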
Figure 3: Larger models move less after training than smaller networks.
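"Moving less" is again measured via representational similarity in the paper; a cruder but intuitive proxy is the relative change of the weights themselves. A minimal sketch (the metric and function name are mine, for illustration only):

```python
import numpy as np

def relative_weight_change(w_init, w_final):
    """Relative L2 distance the weights moved during training:
    ||w_final - w_init|| / ||w_init||. Smaller values mean the
    network ended up closer to its initialization."""
    w_init = np.asarray(w_init, dtype=float)
    w_final = np.asarray(w_final, dtype=float)
    return np.linalg.norm(w_final - w_init) / np.linalg.norm(w_init)
```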
Figure 4: Per-layer CCA similarities before and after training on a medical task. For all models, we see that the lowest layers are most similar to their initializations, and this is especially evident for ResNet50 (a large model). Feature reuse is mostly restricted to the bottom two layers (stages for ResNet) — the only place where the similarity with initialization is significantly higher for pre-trained weights (grey dotted lines show the difference in similarity scores between pre-trained and random initialization).
Figure 5: Using only the scaling of the pretrained weights (Mean Var Init) helps with convergence speed. The figures compare the standard transfer learning and the Mean Var initialization methods to training from scratch. On both the Retina data (a-b) and the CheXpert data, we see convergence speedups.
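The Mean Var Init can be sketched as: keep only the first two moments of each pretrained weight tensor and re-sample its entries i.i.d. from a matching Gaussian. A minimal NumPy version (the function name and dict-of-arrays layout are my own assumptions, not the paper's code):

```python
import numpy as np

def mean_var_init(pretrained_weights, rng=None):
    """Re-sample each weight tensor i.i.d. from a Gaussian whose mean and
    variance match the corresponding pretrained tensor: the per-layer
    scaling is kept, the learned features are discarded."""
    rng = np.random.default_rng() if rng is None else rng
    init = {}
    for name, w in pretrained_weights.items():
        init[name] = rng.normal(loc=w.mean(), scale=w.std(), size=w.shape)
    return init
```

Because only the per-layer scaling survives this re-sampling, any convergence speedup it delivers cannot come from feature reuse — which is exactly the distinction the figure is drawing.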



Ahmed Taha
I write reviews on computer vision papers. Writing tips are welcome.