Deep neural networks achieve high performance in many signal processing tasks. However, they are often considered black-box models. To address this issue, there is growing interest in developing interpretable models, along with methods to explain model decisions. These techniques have been successful in text and image processing tasks, but their application to video remains limited.
In this project, we propose a joint framework for interpretable and explainable deep learning for video. First, we exploit the low-complexity data structure inherent to video, in both the image and temporal domains, to learn video representations. Our approach is based on the deep unfolding framework, which yields interpretability by design. Targeted applications include inverse problems, foreground separation, and frame prediction. Second, we propose a novel recurrent model based on state-space model assumptions and deep Kalman filtering in order to learn interpretable video dynamics and semantics, with applications to inverse problems and object tracking. Third, we propose trustworthy post-hoc analysis methods to explain model decisions: input saliency techniques will be used to generate disentangled visual and temporal explanations. We leverage the structures and dynamics captured by our networks to bridge interpretability and explainability, by relating the output and input spaces to the interpretable latent space. For the proposed applications, we expect to reach state-of-the-art performance.
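To make the deep unfolding idea concrete, the following minimal sketch (in NumPy, with fixed rather than learned parameters) shows how the iterations of ISTA for sparse coding can be "unrolled" into the layers of a feed-forward network, LISTA-style. Each layer applies a linear map followed by a soft-thresholding nonlinearity, so every weight retains the meaning it has in the original optimization algorithm; the matrices `W_e`, `S`, the threshold `theta`, and the toy dictionary below are illustrative assumptions, not part of the proposed framework itself.

```python
import numpy as np

def soft_threshold(x, theta):
    # Proximal operator of the l1 norm: shrinks entries toward zero.
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def unfolded_ista(y, W_e, S, theta, n_layers=5):
    """Forward pass of a deep-unfolded ISTA network (LISTA-style sketch).

    Each "layer" is one ISTA iteration; in a trained network W_e, S and
    theta would be learnable per layer, which is what makes the model
    interpretable by design (every weight has an algorithmic meaning).
    """
    z = soft_threshold(W_e @ y, theta)
    for _ in range(n_layers - 1):
        z = soft_threshold(W_e @ y + S @ z, theta)
    return z

# Toy usage: recover a sparse code from a linear measurement y = D z.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50)) / np.sqrt(20)   # random dictionary (assumed)
z_true = np.zeros(50)
z_true[[3, 17, 42]] = [1.0, -0.5, 2.0]            # 3-sparse ground truth
y = D @ z_true
L = np.linalg.norm(D, 2) ** 2                     # Lipschitz constant of the gradient
W_e = D.T / L                                     # classical ISTA initialization
S = np.eye(50) - (D.T @ D) / L
z_hat = unfolded_ista(y, W_e, S, theta=0.1 / L, n_layers=50)
```

In a trained unfolded network, `W_e`, `S`, and `theta` become layer-wise learnable parameters optimized end-to-end, while the layer structure still mirrors the iterations of the underlying algorithm.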