The primary problem that we will investigate in this project pertains to advanced
document processing. In particular, we would like to build human-like Artificial
Intelligence (AI) agents that can process documents presented in a non-informative format (e.g. scans). To achieve this objective, advanced natural language processing (NLP), computer vision (CV) and joint language-visual processing skills have to be imparted to the AI agents. CV skills are required for AI agents to recognize the content of documents (e.g. by processing their images). This remains a challenging problem, especially when document layouts are complex. Conventionally, expert-system models were used, in which image processing is guided by rule-based heuristics. Recently, thanks to the release of large-scale ground-truth document image datasets, deep learning models (e.g. the vision transformer, or ViT) have also been investigated. Once the content (mainly text) is recognized, NLP skills are required for the AI agents to parse the meaning of the document and to perform high-level tasks (e.g. summarizing the content of documents). It is important to note that for advanced services (e.g. visual question answering, or VQA), visual and language processing may have to be integrated in mutually complementary modelling. Such integration is essentially a cross-modality data processing problem, which remains an open challenge for the research community and which we will also investigate in this project.
Period: 10 Jul 2023 → 31 Mar 2024