In this research proposal, we are interested in distributed algorithms for query evaluation that run with formal optimality guarantees. With the emergence of big data and the increased popularity of using large cluster settings for big data analysis, there is a huge interest in understanding the precise complexity of query operators in these settings. While query optimization is a classical database problem with a long history in research and engineering, new scalable
systems use radically different architectures and require algorithms optimized for different cost-models than traditional database systems. For example, in these new systems data is partitioned across independently operating compute nodes that typically keep the entire partition in main memory. The dominating cost of algorithms in these systems is defined by the communication cost:
the amount of data exchange and the number of synchronization rounds between nodes. This is in stark contrast to the classical external memory architecture consisting of a single machine and with the number of interactions with the disk as dominating cost. A particularly challenging operator in distributed settings is the relational join, whose execution often requires shipping a large amount of data over the network. In this research proposal, we will study the precise complexity of join queries for cluster settings and
develop algorithms to compute joins with formal optimality guarantees on their communication cost.