Semantic Input Sampling for Explanation (SISE) - A Technical Description


* We propose a state-of-the-art, post-hoc, CNN-specific visual XAI algorithm - SISE.
* Input      :  A test image; the trained model
* Output     :  A visual 2D heatmap
* Properties :  Noise-free, high-resolution, class-discriminative, and correlated
                with the model's prediction.

Please find an updated article about our XAI algorithm (SISE) written by LG AI Research here (English) or here (Korean).

Need for XAI

Deep neural models based on Convolutional Neural Networks (CNNs) have achieved inspiring breakthroughs in a wide variety of computer vision tasks. However, their lack of interpretability hinders the understanding of the decisions these models make. This diminishes consumers' trust in CNNs and limits interaction between users and systems built on such models.

Fig. 1: The need for Explainable AI (XAI).

Explainable AI (XAI) attempts to interpret such cumbersome models. This interpretation ability has put XAI at the center of interest in various fields, especially where a single false prediction can cause severe consequences (e.g. healthcare) or where regulations require automated decision-making systems to provide users with explanations (e.g. criminal justice) [1].

Types of XAI

The two main classifications of XAI methods are,

  1. Ante-hoc (ad-hoc): explaining how the model learns, in the training phase.
  2. Post-hoc: explaining how the model makes its decisions, in the evaluation phase.

Our work particularly addresses the problem of visual explanation for images, a branch of post-hoc XAI. This field aims to visualize the behavior of models trained for image recognition tasks [2]. The outcome of the methods in this sub-field is a heatmap of the same size as the input image, termed an “explanation map”, that represents the evidence leading the model to its decision.


We introduce a novel XAI algorithm that we presented at the 35th AAAI Conference on Artificial Intelligence (AAAI-21), which offers both high spatial resolution and explanation completeness in its output explanation map by

  1. using multiple layers from the “intermediate blocks” of the target CNN,
  2. selecting crucial feature maps from the outputs of the layers,
  3. employing an attribution-based technique for input sampling to visualize the perspective of each layer, and
  4. applying a feature aggregation step to reach refined explanation maps.

Our proposed algorithm is motivated by ‘perturbation-based’ XAI methods that attempt to interpret the model’s prediction by employing input sampling techniques. These methods have shown great faithfulness in rationally inferring the predictions of models. However, they are unstable, as their output depends on random sampling (RISE) [3] or on random initialization when optimizing a perturbation mask (Extremal Perturbation) [4]. They therefore produce different explanation maps on each run, and incur excessive runtime when attempting to obtain generalized results.

To address these limitations while building on their strengths, we propose a CNN-specific algorithm that improves their fidelity and plausibility (from a reasoning standpoint) with a runtime adaptive enough for practical usage. We term our algorithm Semantic Input Sampling for Explanation (SISE). To achieve this, we replace the randomized input sampling technique in RISE with a sampling technique that relies on the feature maps derived from various layers of the model. We call this procedure attribution-based input sampling and show that it provides a high-resolution perspective of the model at multiple semantic levels, at the cost of restricting SISE's applicability to CNNs.


Fig. 2: Global overview of the Proposed framework

As sketched in Fig. 2, SISE consists of four phases. In the first phase, multiple layers of the model are selected and a set of corresponding output feature maps are extracted. In the second phase, for each set of feature maps, a subset containing the most important feature maps is sampled with a backward pass.

Fig. 3: Schematic of SISE's layer visualization framework (first three phases). The procedure in this framework is applied to multiple layers and is followed by the fusion framework (as in Fig. 6)

The selected feature maps are then post-processed to create sets of perturbation masks, termed attribution masks, to be utilized in the third phase for attribution-based input sampling. The procedures in the first three phases are applied to the last layer of each convolutional block in the network, and their output is a 2-dimensional saliency map named a visualization map. The details of this method are depicted in the schematic figure (Fig. 3), and our intuition for the layer selection policy is discussed more analytically below. The visualization maps so obtained are aggregated in the last phase to reach the final explanation map.
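For intuition, the four phases can be sketched on toy NumPy data. This is an illustration of the procedure, not the authors' implementation: `score_fn`, `grad_sums`, and the mean-based threshold in `aggregate` are our simplifying stand-ins for the model's confidence score, the backward-pass gradient scores, and Otsu's threshold, respectively.

```python
import numpy as np

def upsample(m, H, W):
    # Nearest-neighbour upsampling of a coarse map to the input resolution.
    return np.repeat(np.repeat(m, H // m.shape[0], axis=0),
                     W // m.shape[1], axis=1)

def layer_visualization(feature_maps, grad_sums, score_fn, H, W):
    # Phases 1-3 for one layer: keep positive-gradient feature maps, turn
    # them into attribution masks, and weight each mask by the model's
    # score on the correspondingly masked input (score_fn).
    sal = np.zeros((H, W))
    for fmap, g in zip(feature_maps, grad_sums):
        if g <= 0:                      # phase 2: drop negative-gradient maps
            continue
        mask = upsample(fmap / (fmap.max() + 1e-8), H, W)
        sal += score_fn(mask) * mask / (mask.sum() + 1e-8)   # phase 3
    return sal

def aggregate(vis_maps):
    # Phase 4 (simplified): cascaded addition, masked by the regions
    # activated in the deeper map (mean threshold in place of Otsu's).
    out = vis_maps[0]
    for v in vis_maps[1:]:
        out = (out + v) * (v > v.mean())
    return out
```

Calling `layer_visualization` once per selected layer and `aggregate` on the resulting list mirrors the end-to-end flow of Fig. 2.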

Attribution-Based Input Sampling

Let $\Psi:\mathcal{I}\rightarrow\mathbb{R}$ be a trained model that outputs a confidence score for a given input image, where $\mathcal{I}$ is the space of RGB images $\mathcal{I}=\{I \,|\, I:\Lambda\rightarrow\mathbb{R}^{3}\}$, and $\Lambda=\{1,\dots,H\} \times \{1,\dots,W\}$ is the set of locations (pixels) in the image.

Given any model and image, the goal of an explanation algorithm is to reach a unified explanation map $S_{I,\Psi}(\lambda)$ that assigns an “importance value” to each location in the image $(\lambda \in \Lambda)$. The explanation maps are represented as $$S_{I,\Psi}(\lambda)=\mathbb{E}_{M} [\Psi(I\odot m)\cdot C_{m}(\lambda)]$$

where the term $C_{m}(\lambda)$ indicates the contribution amount of each pixel in the masked image and it is defined as $$C_{m}(\lambda)=\frac{m(\lambda)}{\sum_{\lambda\in\Lambda}m(\lambda)}$$

Note that we normalize the contribution by the size of the perturbation mask, which reduces the reward assigned to background pixels when a high score is reached for a mask with too many activated pixels.
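The estimate above can be written as a short sketch (toy NumPy code with our naming; `model_score` stands in for $\Psi$, and the expectation is taken empirically over the given masks):

```python
import numpy as np

def explanation_map(image, masks, model_score):
    # S(lambda) = E_M[ Psi(I * m) * C_m(lambda) ],
    # with C_m(lambda) = m(lambda) / sum(m).
    H, W = image.shape[:2]
    S = np.zeros((H, W))
    for m in masks:
        score = model_score(image * m[..., None])  # Psi on the masked image
        S += score * m / m.sum()                   # score times C_m(lambda)
    return S / len(masks)                          # empirical expectation
```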

Feature Map Selection

Let $l$ be a selected layer containing $N$ feature maps, which are 2-dimensional matrices denoted $A_{k}^{(l)}$ $(k\in\{1,\dots,N\})$.

To identify and reject the class-indiscriminative feature maps, we partially backpropagate the signal to the selected layer to score the gradient of the model’s confidence score with respect to each feature map. These gradient scores are represented as follows:

$$\alpha_k^{(l)} = \sum_{\lambda^{(l)} \in \Lambda^{(l)}} \frac{\partial \Psi(I)}{\partial A_{k}^{(l)} (\lambda^{(l)})}$$

$$\beta^{(l)} = \max_{k\in \{1,\dots,N\}} \alpha_k^{(l)}$$

Feature maps with non-positive gradient scores ($\alpha_k^{(l)}$) tend to contain features related to classes other than the class of interest. Terming such feature maps negative-gradient, we define the set of attribution masks obtained from the positive-gradient feature maps, $M_d^{(l)}$, as:

$$M_d^{(l)}=\{ \Omega(A_{k}^{(l)}) \,|\, k\in \{1,\dots,N\},\ \alpha_k^{(l)} > \mu \times \beta^{(l)} \}$$

where $\mu$ is a parameter that is 0 by default, so that negative-gradient feature maps are discarded and only positive-gradient ones are retained.
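A minimal sketch of this selection rule, assuming the per-map gradients have already been computed by the partial backward pass (the function and variable names here are ours):

```python
import numpy as np

def select_feature_maps(feature_maps, grads, mu=0.0):
    # alpha_k: summed gradient of the confidence score w.r.t. each map.
    alphas = np.array([g.sum() for g in grads])
    beta = alphas.max()
    # Keep maps with alpha_k > mu * beta; mu = 0 discards exactly the
    # non-positive (negative-gradient) feature maps.
    keep = [k for k, a in enumerate(alphas) if a > mu * beta]
    return [feature_maps[k] for k in keep], keep
```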

A visual comparison in Fig. 4 of the attribution masks created in our approach with the random masks used in RISE highlights these advantages.

Fig. 4: Qualitative comparison of (a) attribution masks derived from different blocks of a VGG16 network as in SISE, with (b) random masks employed in RISE.

Layer Selection

As SISE extracts the feature maps from multiple layers in its first phase, we here define the most crucial layers for explicating the model’s decisions. The intention is to reach a complete understanding of the model by visualizing the minimum number of layers.

We revisit a simulation experiment from Veit et al. [5], in which test errors are reported after removing individual layers from a residual network. As shown in Fig. 5, removing convolutional layers individually barely affects the network, while a significant degradation in test performance is recorded only when pooling layers are removed.

Fig. 5: Simulation results from [5], showing the importance of the downsampling (pooling) layers.

Based on this observation, most of the information in each model can be collected by probing the pooling layers. By visualizing these layers, it is possible to track how features propagate through the convolutional blocks. Therefore, for any given CNN, we select the inputs of the pooling layers to be visualized in the first three phases of SISE, and pass their corresponding visualization maps to the fusion block for block-wise feature aggregation.
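As a toy illustration of this layer-selection policy on a VGG16-like layer list (the layer names are shorthand for illustration, not actual framework identifiers):

```python
# VGG16-like layer list; names are shorthand for illustration.
vgg16_layers = [
    "conv1_1", "conv1_2", "pool1",
    "conv2_1", "conv2_2", "pool2",
    "conv3_1", "conv3_2", "conv3_3", "pool3",
    "conv4_1", "conv4_2", "conv4_3", "pool4",
    "conv5_1", "conv5_2", "conv5_3", "pool5",
]

def pooling_inputs(layers):
    # Select the layer feeding each pooling layer, i.e. the last
    # convolutional layer of every block.
    return [layers[i - 1] for i, name in enumerate(layers)
            if name.startswith("pool")]

print(pooling_inputs(vgg16_layers))
# -> ['conv1_2', 'conv2_2', 'conv3_3', 'conv4_3', 'conv5_3']
```

This yields five layers for VGG16, matching the five visualization maps fed to the fusion framework of Fig. 6.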

Fusion Block

Fig. 6: The proposed fusion framework for a CNN with 5 convolutional blocks.

The proposed fusion module is designed with cascaded fusion blocks. In each fusion block, the feature information from the visualization maps representing explanations for two consecutive blocks is collected using an “addition” block. Then, the features that are absent in the latter visualization map are removed from the collective information by masking the output of the addition block with a binary mask indicating the activated regions in the latter visualization map. To reach the binary mask, we apply an adaptive threshold to the latter visualization map, determined by Otsu’s method [6]. By cascading fusion blocks as in Fig. 6, the features determining the model’s prediction are represented in a more fine-grained manner while the inexplicit features are discarded.
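One fusion block can be sketched in NumPy, with Otsu's threshold implemented from its between-class-variance definition (a simplified sketch; the bin count and function names are our choices, not from the paper):

```python
import numpy as np

def otsu_threshold(x, nbins=64):
    # Otsu's method: pick the threshold maximizing between-class variance.
    hist, edges = np.histogram(x, bins=nbins)
    hist = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = centers[0], -1.0
    for i in range(1, nbins):
        w0, w1 = hist[:i].sum(), hist[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (hist[:i] * centers[:i]).sum() / w0
        m1 = (hist[i:] * centers[i:]).sum() / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, centers[i]
    return best_t

def fusion_block(earlier, later):
    # "Addition" block, then mask by the regions active in the later
    # (deeper) visualization map.
    combined = earlier + later
    binary = later > otsu_threshold(later)
    return combined * binary
```

Cascading `fusion_block` over consecutive visualization maps reproduces the structure of Fig. 6.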


We verify the performance of our method on shallow and deep CNNs, including VGG16 and ResNet-50 architectures. To conduct the corresponding experiments, we employed the PASCAL VOC 2007 [7] and MS COCO 2014 [8] datasets; the experiments are evaluated on their test sets with pre-trained models from the TorchRay library [9].

Quantitative Results

Quantitative analysis includes evaluation results categorized into ground truth-based and model truth-based metrics. The former assess the extent to which the algorithm satisfies users by providing visually superior explanations, while the latter analyze model behavior by assessing the faithfulness of the algorithm and its correctness in capturing attributions in line with the model’s prediction procedure. Refer to our preprint paper for more about the metrics employed.

The evaluation results, shown in Tables 1 and 2 below, indicate the superior ability of SISE to provide satisfying, high-resolution, and complete explanation maps that give a precise visual analysis of the model’s predictions and perspective. For each metric, the best result is shown in bold. Except for Drop%, higher is better for all metrics.

(Methods compared, per model and metric: Grad-CAM, Grad-CAM++, Extremal Perturbation, RISE, Score-CAM, Integrated Gradient, FullGrad, SISE.)

Table 1: Results of ground truth-based and model truth-based metrics for state-of-the-art XAI methods along with SISE (proposed) on two networks (VGG16 and ResNet-50) trained on PASCAL VOC 2007 dataset.

(Methods compared, per model and metric: Grad-CAM, Grad-CAM++, Extremal Perturbation, RISE, Score-CAM, Integrated Gradient, FullGrad, SISE.)

Table 2: Results of the state-of-the-art XAI methods compared with SISE on two networks (VGG16 and ResNet-50) trained on MS COCO 2014 dataset.

Qualitative Results

Based on explanation quality, we have compared SISE with other state-of-the-art methods on sample images from the Pascal dataset in Fig. 7 and MS COCO dataset in Fig. 8. Images with both normal-sized and small object instances are shown along with their corresponding confidence scores.

Fig. 7: Qualitative comparison of SISE with other state-of-the-art XAI methods with a ResNet-50 model on the Pascal VOC 2007 dataset.

Fig. 8: Explanations of SISE along with other conventional methods from a VGG16 model on the MS COCO 2014 dataset.


  • In this work we proposed SISE, a novel visual XAI method that performs well on both shallow and deep CNNs.
  • By taking efficient perturbation techniques into account, we improved the preciseness of the obtained explanation maps.
  • By employing backpropagation techniques, we reduced the complexity of our method while retaining its performance.
  • The proposed block-wise feature aggregation, though simple, enhances the visual resolution of our method, so that low-level and mid-level features in the model are represented as well as high-level ones.
  • Ground truth-based metrics are utilized to demonstrate the superior visual quality of our method.
  • Model truth-based metrics are also used, along with sanity checks, to verify that SISE investigates the behavior of models accurately.

More resources like Poster, Slides, Video and PDF are available here.

All figures and content published here are owned by the authors at the Multimedia Laboratory at the University of Toronto and the LG AI Research.


Consider citing our work as below, if you find it useful in your research:

@article{sise_aaai21,
  title={Explaining Convolutional Neural Networks through Attribution-Based Input Sampling and Block-Wise Feature Aggregation},
  journal={Proceedings of the AAAI Conference on Artificial Intelligence},
  author={Sattarzadeh, Sam and Sudhakar, Mahesh and Lem, Anthony and Mehryar, Shervin and Plataniotis, Konstantinos N and Jang, Jongseong and Kim, Hyunwoo and Jeong, Yeonjeong and Lee, Sangmin and Bae, Kyunghoon},
  year={2021}
}


[1]. Lipton, Z. C. 2016. The Mythos of Model Interpretability. CoRR abs/1606.03490.

[2]. Barredo Arrieta, A.; Diaz Rodriguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado Gonzalez, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, V. R.; Chatila, R.; and Herrera, F. 2019. Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI. Information Fusion doi:10.1016/j.inffus.2019.12.012.

[3]. Petsiuk, V., Das, A. and Saenko, K., 2018. Rise: Randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421.

[4]. Fong, R., Patrick, M. and Vedaldi, A., 2019. Understanding deep networks via extremal perturbations and smooth masks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2950-2958).

[5]. Veit, A., Wilber, M.J. and Belongie, S., 2016. Residual networks behave like ensembles of relatively shallow networks. In Advances in neural information processing systems (pp. 550-558).

[6]. N. Otsu, “A Threshold Selection Method from Gray-Level Histograms,” in IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62-66, Jan. 1979, doi: 10.1109/TSMC.1979.4310076.

[7]. Everingham, M.; Van Gool, L.; Williams, C. K. I.; Winn, J.; and Zisserman, A. 2007. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results.

[8]. Lin, T.; Maire, M.; Belongie, S. J.; Bourdev, L. D.; Girshick, R. B.; Hays, J.; Perona, P.; Ramanan, D.; Dollar, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312.

[9]. Fong, R.; Patrick, M.; and Vedaldi, A. 2019. Understanding deep networks via extremal perturbations and smooth masks. In Proceedings of the IEEE International Conference on Computer Vision, 2950–2958.

Mahesh Sudhakar
Computer Vision Research Engineer

Computer Vision | Robotics | Machine Learning