STRUCTURAL PROPERTIES OF PROTEIN SEGMENTS ENCODED BY SYNONYMOUS CONSTRAINT ELEMENTS AND ACCELERATED REGIONS OF THE HUMAN GENOME

Student thesis: Master's Thesis

Abstract

The catalogues of primate- and human-specific accelerated regions (PARs and HARs), and of multi-functional genomic sites detected as synonymous constraint elements (SCEs) of the human genome, have recently been expanded considerably by the publication of the results of the 29 Mammals Project. Our aim is to study the structural preferences of the protein segments encoded by these functionally or evolutionarily distinguished DNA regions. We believe that, in the case of multiple evolutionary constraints, the protein structure has to cope with the consequences of the second function fulfilled by the coding DNA segment. These restrictions probably do not allow free exploration of the amino acid space in the affected segment of the protein, which could hinder the formation of proper secondary structure elements and the adoption of a well-defined fold. For evolutionarily accelerated regions that overlap protein-coding genes, the question is how protein structure and function can survive the increased rate of mutations in the given genomic region. A wide range of bioinformatics methods was applied to analyze, from the protein structural aspect, 577 PARs and 563 HARs, as well as 11882, 10757 and 8933 SCEs (identified with window sizes of 9, 15 and 30 codons, respectively). Unfortunately, the majority of accelerated regions (ARs) were found to reside in non-protein-coding regions, so we could identify only a small number of corresponding protein segments (3.81% and 4.62% of the total entries for PARs and HARs, respectively). As a result, the AR-derived protein segment datasets had low statistical power, which, together with contradictory results between HARs and PARs, prevented us from drawing any reliable conclusion on their structural properties.
In the case of the SCEs, the large number of entries in each dataset provides higher statistical power. Compared with their equivalent reference sets, the SCE-encoded segments showed a highly significant enrichment in predicted structural disorder and a bias towards low sequence complexity, together with a highly significant depletion in regular secondary structure elements and a tendency to reside outside annotated domain regions. These tendencies became stronger with decreasing detection window size (increasing resolution), demonstrating the selective and localized effect of multi-functionality on the implicated protein segments. Overall, our results show that multi-functional DNA sequences tend to overlap with intrinsically disordered protein regions rather than with globular domains, suggesting that there might exist a strategic distribution of the encoded complexity within the coding regions of genomes.
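
To make the low statistical power of the AR-derived datasets concrete, the percentages quoted in the abstract can be converted back into approximate entry counts. The short Python sketch below does only that arithmetic. It assumes the percentages are taken relative to the 577 PARs and 563 HARs stated above, which the abstract suggests but does not spell out; the variable and function names are illustrative and do not come from the thesis.

```python
# Totals and percentages exactly as quoted in the abstract.
TOTAL_ENTRIES = {"PAR": 577, "HAR": 563}
CODING_PERCENT = {"PAR": 3.81, "HAR": 4.62}


def implied_entry_count(total: int, percent: float) -> int:
    """Return the whole number of entries closest to `percent` of `total`.

    Assumption: the quoted percentage is relative to `total`.
    """
    return round(total * percent / 100)


implied = {
    name: implied_entry_count(TOTAL_ENTRIES[name], CODING_PERCENT[name])
    for name in TOTAL_ENTRIES
}

for name, count in implied.items():
    # Convert the rounded count back to a percentage as a consistency check.
    back_percent = 100 * count / TOTAL_ENTRIES[name]
    print(f"{name}: about {count} of {TOTAL_ENTRIES[name]} entries "
          f"({back_percent:.2f}%)")
```

Under that assumption the quoted figures correspond to roughly 22 PAR and 26 HAR entries with an identifiable protein segment, compared with thousands of entries in each SCE dataset.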
Date of Award: 13 Sept 2013
Original language: English
Supervisor: Peter Tompa (Promotor), Rita Pancsa (Co-promotor), Remy Loris (Jury) & Tom Lenaerts (Jury)

Keywords

  • SYNONYMOUS CONSTRAINT ELEMENTS
  • ACCELERATED REGIONS
  • PROTEIN STRUCTURAL DISORDER
  • HUMAN GENOME
  • INTRINSICALLY DISORDERED PROTEINS
  • BIOINFORMATICS
