SYSC 431/531: Data Mining with Information Theory

Summary:

DMIT is a project-based course that offers you an opportunity to use information-theoretic methods to analyze data. These methods are implemented in a software package named OCCAM, developed at PSU, which will be the main analytical tool used in the course. The theory underlying these methods is taught in SySc 551/651 Discrete Multivariate Modeling (DMM), but this course (DMIT) is stand-alone and does not have DMM as a prerequisite. Only the theory needed to understand the inputs and outputs of OCCAM will be presented; OCCAM itself will be treated as a black box, so the algorithms that it implements will not be discussed. The point is to make it possible for you to do exploratory modeling on data of interest to you without having to master the underlying theory first. If you want to understand this theory, you can take DMM later, but this is not required.

Click here for a recent Course Flyer or Syllabus.

Instructor: Martin Zwick

Information about these methods and their use, including links to OCCAM and the OCCAM User's Manual, can be found on the instructor's Discrete Multivariate Modeling website.

Texts:

Required readings include:

  1. the OCCAM User Manual
  2. the tutorial ("Overview") paper
  3. one or more research papers selected by the instructor to be used as guidance for the student's research project

A recommended textbook is:
Krippendorff, Klaus. (1986). Information theory: Structural models for qualitative data. Series: Quantitative Applications in the Social Sciences. Paper #62, Sage Publications, Beverly Hills, California. (ISBN: 0-8039-2132-2).

Prerequisites:

All students are advised to have a background in basic probability and statistics or machine learning (e.g., Math 105, Stat 243, or equivalent) and access to data that they know something about and want to analyze. (The instructor can provide data to students who do not have their own, but bringing your own data is preferable.)

Undergraduate students (in the 431 section) must have upper-division standing and have completed one of the SYSC3xxU cluster courses, or have permission of the instructor.

Assignments:

Graduate students will submit a substantial research paper at the end of the course (80%) and will give a class presentation of their results (20%). 

Undergraduate students will submit a shorter paper (100%), focusing on data, methods, and results.

Course Outline:

  • Week 1: Introduction to DMIT
  • Week 2: Introduction to the OCCAM software
  • Week 3: Presentation of a prototype DMM research paper by the instructor
  • Week 4: Class members describe their projects
  • Week 5: Students work in computer lab on their projects with assistance from instructor
  • Week 6: Mid-quarter update (short research reports)
  • Week 7: Continued work on research projects
  • Week 8: Draft papers due for instructor comments and guidance
  • Week 9: Project work continues
  • Week 10: Class presentations, final reports due
  • Week 11: General discussion