Description

Datasets containing the predictions, training data, and validation data for the D2Deep cancer driver mutation predictor.

Abstract

Background: The mutations driving cancer are being increasingly exposed through tumor-specific genomic data. However, differentiating between cancer-causing driver mutations and random passenger mutations remains challenging. State-of-the-art predictors contain built-in biases and are often ill-suited to the intricacies of cancer biology. Most of them fail to offer result interpretation, creating a barrier to their effective utilization in the clinical setting.

Results: The AI-based D2Deep method we introduce here addresses these challenges by combining two powerful elements: i) a non-specialized protein language model that captures the makeup of all protein sequences and ii) protein-specific evolutionary information that encompasses functional requirements for a particular protein. D2Deep relies exclusively on sequence information, outperforms state-of-the-art predictors and captures intricate epistatic changes throughout the protein caused by mutations. These epistatic changes correlate with known mutations in the clinical setting and can be used for the interpretation of results. The model is trained on a balanced, somatic training set and so effectively mitigates biases related to hotspot mutations compared to state-of-the-art techniques. The versatility of D2Deep is illustrated by its performance on non-cancer mutation prediction, where most variants still lack known consequences. D2Deep predictions and confidence scores are available via https://tumorscope.be/d2deep to help with clinical interpretation and mutation prioritization.

Conclusions: A purely protein sequence-based predictor to distinguish driver from passenger mutations in cancer outperforms predictors that utilize a plethora of contextual features, highlighting the necessity for somatic-specific, unbiased predictive models in the field of cancer research.
Date made available: 4 Aug 2023
Publisher: Zenodo

Keywords

  • Protein Large Language Models
  • Evolutionary Information
  • Machine Learning

Format

  • csv
