News...

Dec 1, 2006:
Pamphlet for Dec 7 presentations now available.

Nov 17, 2006:
Project 4 presentations will be on Dec 7 from 3:30 until we finish. Dinner (pizza) will be provided. Location: ESB G39. Visitors will be invited, so please (i) come dressed in business casual; (ii) thoroughly rehearse your talk: make sure you finish in 15 minutes and leave time for 5 minutes of questions; (iii) present only in HTML (this saves time switching between talks).

Nov 16, 2006:
Lecture 8 is now available.

Oct 3, 2006:
Lecture 7 is now available.

This subject is about finding the diamonds in the dust; that is, the small gems within mountains of data.

Students in this subject will learn the core concepts of data mining and how to use those concepts to build theories from data.

For administrative details about this subject, see the Syllabus. For some of the motivations of this subject, read on.

The core ideas of this subject are:

Wholes and holes

  • Humans and data miners are a natural partnership. Humans are good at the whole story while data miners are good at filling in the holes (the details that humans don't have time to tell us, or just don't know).

Much of "mining" is really "data pre-processing."

  • So, as well as exploring data mining methods and tools (e.g. WEKA), students also need to learn the scripting skills required for pre- and post-processing.
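
As a rough illustration of the kind of pre-processing script meant here, the sketch below tidies a small block of messy CSV text before it would be handed to a learner. The column names, the sample data, and the `clean_rows` helper are all hypothetical; the only convention borrowed from a real tool is the use of "?" for missing values, as in WEKA's ARFF files.

```python
import csv
import io

MISSING = "?"  # WEKA's ARFF format marks missing values with "?"


def clean_rows(raw_text):
    """Parse CSV text, trim whitespace, lower-case values, mark blanks as missing."""
    reader = csv.reader(io.StringIO(raw_text))
    header = [name.strip().lower() for name in next(reader)]
    cleaned = []
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        values = [cell.strip().lower() or MISSING for cell in row]
        cleaned.append(dict(zip(header, values)))
    return cleaned


# Hypothetical raw data with stray spaces, mixed case, a blank line and a gap.
raw = "Outlook, Windy ,Play\n Sunny,no,NO\n\novercast,,yes\n"

for record in clean_rows(raw):
    print(record)
```

Real data usually needs more than this (type conversion, de-duplication, discretization), but most of that work has the same flavor: small scripts that run before and after the learner.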

Bias makes us blind, bias lets us see

  • The output of a data miner is always biased by the data selected for learning, the learning method applied, and so on. Data miners must be biased since, otherwise, there would be no way to decide which bits are most important and which bits can be ignored.
    Paradoxically, bias blinds us to some things while letting us see (predict) the future.
  • So all theories are biased (but only some admit it). But we should always be aware of the domain-specific nature of the conclusions drawn from a learner.

Algorithms need audiences

  • Data miners build theories that some{one|thing} will use. People like reading things, and some things are easier to read than others.
  • Hence, this subject does not spend too much time on mostly mathematical methods (e.g. regression, neural nets, Bayes classifiers). Instead, we'll focus on methods that generate human-readable theories (e.g. decision trees, rule-based learners, treatment learners).
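
To give a feel for what a "human-readable theory" can look like, here is a minimal sketch of a one-attribute rule learner (in the spirit of the well-known 1R method). It is not taken from the subject's materials: the `one_r` function and the toy weather-style data are assumptions made purely for illustration. For each attribute it maps every value to its majority class, then keeps the attribute whose rules make the fewest mistakes on the training data.

```python
from collections import Counter, defaultdict


def one_r(rows, target):
    """Return (attribute, rules, errors) for the single best attribute.

    rules maps each value of that attribute to its majority class;
    errors counts the training rows those rules get wrong.
    """
    best = None
    attributes = [name for name in rows[0] if name != target]
    for attr in attributes:
        # Count how often each class appears with each attribute value.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for row in rows if rules[row[attr]] != row[target])
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best


# Hypothetical toy data set, invented for this example.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

attribute, rules, errors = one_r(rows, target="play")
for value, label in rules.items():
    print(f"if {attribute} = {value} then play = {label}")
print(f"training errors: {errors} of {len(rows)}")
```

The resulting theory is a handful of "if ... then ..." lines that a person can read and argue with, which is the point being made above about audiences.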

How do dumb apes get by?

  • Here's a puzzle. People aren't real bright (just look at how badly they write software). Yet, somehow, people have built the most amazing things, like the international domestic airline network and the Internet. How?
  • Maybe the real world is not as complex as our egos imagine. And seemingly naive probes tell us most of what can be found using supposedly more sophisticated methods.

You are responsible

  • Very successful data miners can be surprisingly simple. This raises the question: "why aren't they used more often, so that we can better control the world around us?"
  • The answer is that sometimes the world is very, very complicated and no single simple solution will suffice. But often the world is a surprisingly simple place (otherwise, dumb apes would not get by), which means, in turn, that we should be able to predict, control, and select the future that we want.
  • So the curse of data mining is that once you learn how to do it, you become responsible for the future of the human race. Are you ready for that?