Topics in Database Systems

CMPS290H: Large-scale Data Integration

Instructor: Wang-Chiew Tan

Where: Jack Baskin Engineering 156

When: TTh 12 noon - 1:45pm

Office Hours:

  • Where: E2 343B 
  • When: TBD

Course Description: Data integration is the process of combining data from multiple sources into a unified format. It is a fundamental step in many applications and is performed across data of diverse types and domains (e.g., enterprise data, bioinformatics repositories, health records, social media data). This course provides an overview of the underlying research challenges and the major research efforts on both the foundations and the systems aspects of achieving end-to-end data integration at scale. Specifically, it covers working systems and research papers on topics including information extraction from unstructured data, integration of structured data, provenance of integrated data, and high-level languages for performing integration at scale. The course will also feature a few presentations from leading researchers, including a mini tutorial on SystemT, a commercial information extraction system developed at IBM Research - Almaden. Students are expected to critically analyze papers, give in-class presentations, and complete a project relevant to the topic of this course.

The official prerequisite for this course is CMPS180 (or equivalent), CMPS277, or CMPS278. If you are interested in taking this course but do not have the prerequisite, you are encouraged to speak to the instructor.


Please log on to for further information about this course. 

Instructors and Assistants