Data Mining
Data mining is the extraction of useful knowledge from large bodies of
data. It is fast becoming a key technology for thousands of companies.
Our research directions in this field include:
- Developing knowledge discovery algorithms that run in linear or
near-linear time, and so scale up to large databases.
- Using subsampling techniques to scale up pre-existing approaches.
- Developing platforms for large-scale collaborative information processing.
- Understanding the probabilistic properties and foundations of data
mining algorithms.
- Developing techniques for mining multi-relational databases and
semi-structured data sources (e.g., text, the Web).
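As a rough illustration of the subsampling direction above, the sketch below uses the Hoeffding bound to estimate how many examples must be sampled before a sample mean is within a chosen tolerance of the true mean. It is a generic textbook calculation, not a description of any particular system listed on this page; the function names and parameter choices are hypothetical.

```python
import math


def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """Error bound on a sample mean after n independent draws.

    With probability at least 1 - delta, the true mean exceeds the
    sample mean by no more than the returned epsilon (one-sided bound),
    for a variable whose values span value_range.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def samples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which the Hoeffding bound is at most epsilon."""
    return math.ceil(
        value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2)
    )


if __name__ == "__main__":
    # Illustrative values only: a quantity in [0, 1], 95% confidence,
    # tolerance of 0.01.
    n = samples_needed(value_range=1.0, delta=0.05, epsilon=0.01)
    print(n, hoeffding_epsilon(1.0, 0.05, n))
```

The required sample size depends only on the value range, the confidence level, and the tolerance, not on the size of the database, which is why bounds of this kind are attractive when scaling an existing algorithm to very large data.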
Projects
People
Software
- Alchemy: Statistical relational AI.
- BVD: Bias-variance decomposition for zero-one loss.
- NBE: Bayesian learner with very fast inference.
- RISE: Unified rule- and instance-based learner.
- VFML: Toolkit for mining massive data sources.
Publications
Selected Book Chapters
-
What's Missing in AI: The Interface Layer. In P. Cohen (ed.), Artificial
Intelligence: The First Hundred Years, 2006. Menlo Park, CA: AAAI Press.
To appear.
-
Combining Link and Content Information in Web Search, with Matt
Richardson. In M. Levene and A. Poulovassilis (eds.), Web Dynamics
(pp. 179-193), 2004. New York: Springer.
-
Ontology Matching: A Machine Learning Approach, with AnHai Doan,
Jayant Madhavan and Alon Halevy. In S. Staab and R. Studer (eds.),
Handbook on Ontologies in Information Systems (pp. 385-403), 2004.
New York: Springer.
-
Machine Learning. In W. Klosgen and J. Zytkow (eds.), Handbook of
Data Mining and Knowledge Discovery (pp. 660-670), 2002. New York:
Oxford University Press.
-
Learning Repetitive Text-Editing Procedures with SMARTedit, with
Tessa Lau, Steve Wolfman and Dan Weld. In H. Lieberman (ed.), Your Wish
Is My Command: Giving Users the Power to Instruct their Software
(pp. 209-225), 2001. San Francisco, CA: Morgan Kaufmann.
Selected Journal Papers
-
Markov Logic Networks, with Matt Richardson. Machine Learning, 62,
107-136, 2006.
-
Mining Social Networks for Viral Marketing (short paper). IEEE Intelligent
Systems, 20(1), 80-82, 2005.
-
Learning to Match Ontologies on the Semantic Web, with AnHai Doan,
Jayant Madhavan, Robin Dhamankar and Alon Halevy. VLDB Journal 12(4),
303-319, 2003.
-
Programming by Demonstration Using Version Space Algebra, with Tessa Lau,
Steve Wolfman and Dan Weld. Machine Learning, 53, 111-156, 2003.
-
Tree Induction for Probability-Based Ranking, with Foster Provost.
Machine Learning, 52, 199-216, 2003.
-
Learning to Match the Schemas of Data Sources: A Multistrategy
Approach, with AnHai Doan and Alon Halevy. Machine Learning, 50,
279-301, 2003.
-
A General Framework for Mining Massive Data Streams, with Geoff Hulten
(short paper). Journal of Computational and Graphical Statistics, 12, 2003.
-
Prospects and Challenges for Multi-Relational Data Mining (position
paper). SIGKDD Explorations, 5, 80-83, 2003.
-
The Role of Occam's Razor in Knowledge Discovery.
Data Mining and Knowledge Discovery, 3, 409-425, 1999.
-
Knowledge Discovery Via Multiple Models.
Intelligent Data Analysis, 2, 187-202, 1998.
-
On the Optimality of the Simple Bayesian Classifier under Zero-One Loss,
with Michael Pazzani. Machine Learning, 29, 103-130, 1997.
-
Context-Sensitive Feature Selection for Lazy Learners.
Artificial Intelligence Review, 11, 227-253, 1997.
-
Unifying Instance-Based and Rule-Based Induction.
Machine Learning, 24, 141-168, 1996.
-
Two-Way Induction.
International Journal on Artificial Intelligence Tools, 5, 113-125, 1996.
Selected Conference Papers
-
Recursive Random Fields, with Daniel Lowd. Proceedings of the Twentieth
International Joint Conference on Artificial Intelligence, 2007.
Hyderabad, India: AAAI Press. To appear.
-
Unifying Logical and Statistical AI, with various coauthors. Proceedings
of the Twenty-First National Conference on Artificial Intelligence (pp. 2-7),
2006. Boston, MA: AAAI Press.
-
Sound and Efficient Inference with Probabilistic and Deterministic
Dependencies, with Hoifung Poon. Proceedings of the Twenty-First
National Conference on Artificial Intelligence (pp. 458-463), 2006.
Boston, MA: AAAI Press.
-
Memory-Efficient Inference in Relational Domains, with Parag Singla.
Proceedings of the Twenty-First National Conference on Artificial
Intelligence (pp. 488-493), 2006. Boston, MA: AAAI Press.
-
Entity Resolution with Markov Logic, with Parag Singla. Proceedings of
the Sixth IEEE International Conference on Data Mining, 2006. Hong Kong:
IEEE Computer Society Press. To appear.
-
Learning the Structure of Markov Logic Networks, with Stanley Kok.
Proceedings of the Twenty-Second International Conference on Machine
Learning (pp. 441-448), 2005. Bonn, Germany: ACM Press.
-
Naive Bayes Models for Probability Estimation, with Daniel Lowd.
Proceedings of the Twenty-Second International Conference on Machine
Learning (pp. 529-536), 2005. Bonn, Germany: ACM Press.
-
Discriminative Training of Markov Logic Networks, with Parag Singla.
Proceedings of the Twentieth National Conference on Artificial Intelligence
(pp. 868-873), 2005. Pittsburgh, PA: AAAI Press.
-
Object Identification with Attribute-Mediated Dependences, with Parag
Singla. Proceedings of the Ninth European Conference on Principles and
Practice of Knowledge Discovery in Databases (pp. 297-308), 2005. Porto,
Portugal: Springer. Winner of the Best Paper Award.
-
Markov Logic: A Unifying Framework for Statistical Relational Learning,
with Matt Richardson. Proceedings of the ICML-2004 Workshop on Statistical
Relational Learning and its Connections to Other Fields (pp. 49-54), 2004.
Banff, Canada: IMLS.
-
Multi-Relational Record Linkage, with Parag Singla. Proceedings of the
KDD-2004 Workshop on Multi-Relational Data Mining (pp. 31-48), 2004.
Seattle, WA: ACM Press.
-
Adversarial Classification, with Nilesh Dalvi, Mausam, Sumit Sanghai
and Deepak Verma. Proceedings of the Tenth International Conference on
Knowledge Discovery and Data Mining (pp. 99-108), 2004. Seattle, WA: ACM Press.
-
Learning Bayesian Network Classifiers by Maximizing Conditional Likelihood,
with Dan Grossman. Proceedings of the Twenty-First International Conference on
Machine Learning (pp. 361-368), 2004. Banff, Canada: ACM Press.
-
iMAP: Discovering Complex Semantic Matches between Database Schemas,
with Robin Dhamankar, Yoonkyong Lee, AnHai Doan and Alon Halevy. Proceedings
of the 2004 ACM SIGMOD International Conference on Management of Data
(pp. 383-394), 2004. Paris, France: ACM Press.
-
Building Large Knowledge Bases by Mass Collaboration, with Matt
Richardson. Proceedings of the Second International Conference on
Knowledge Capture (pp. 129-137), 2003. Sanibel Island, FL: ACM Press.
-
Learning Programs from Traces Using Version Space Algebra, with Tessa Lau
and Dan Weld. Proceedings of the Second International Conference on
Knowledge Capture (pp. 36-43), 2003. Sanibel Island, FL: ACM Press.
-
Trust Management for the Semantic Web, with Matt Richardson and Rakesh
Agrawal. Proceedings of the Second International Semantic Web Conference
(pp. 351-368), 2003. Sanibel Island, FL: Springer.
-
Learning with Knowledge from Multiple Experts, with Matt Richardson.
Proceedings of the Twentieth International Conference on Machine Learning
(pp. 624-631), 2003. Washington, DC: Morgan Kaufmann.
-
Mining Massive Relational Databases, with Geoff Hulten and Yeuhi Abe.
Proceedings of the IJCAI-2003 Workshop on Learning Statistical Models from
Relational Data (pp. 53-60), 2003. Acapulco, Mexico: IJCAII.
-
Research on Statistical Relational Learning at the University of
Washington, with various coauthors. Proceedings of the IJCAI-2003
Workshop on Learning Statistical Models from Relational Data (pp. 43-47),
2003. Acapulco, Mexico: IJCAII.
-
Relational Markov Models and their Application to Adaptive Web
Navigation, with Corin Anderson and Dan Weld. Proceedings of the
Eighth International Conference on Knowledge Discovery and Data Mining
(pp. 143-152), 2002. Edmonton, Canada: ACM Press.
-
Mining Knowledge-Sharing Sites for Viral Marketing, with Matt
Richardson. Proceedings of the Eighth International Conference on
Knowledge Discovery and Data Mining (pp. 61-70), 2002. Edmonton,
Canada: ACM Press.
-
Mining Complex Models from Arbitrarily Large Databases in Constant
Time, with Geoff Hulten. Proceedings of the Eighth International
Conference on Knowledge Discovery and Data Mining (pp. 525-531),
2002. Edmonton, Canada: ACM Press.
-
Representing and Reasoning about Mappings between Domain Models,
with Jayant Madhavan, Phil Bernstein and Alon Halevy. Proceedings of the
Eighteenth National Conference on Artificial Intelligence (pp. 80-86), 2002.
Edmonton, Canada: AAAI Press.
-
Learning to Map between Ontologies on the Semantic Web, with AnHai Doan,
Jayant Madhavan and Alon Halevy. Proceedings of the Eleventh International
World Wide Web Conference (pp. 662-673), 2002. Honolulu, HI: ACM Press.
-
Learning from Infinite Data in Finite Time, with Geoff Hulten. Advances
in Neural Information Processing Systems 14 (pp. 673-680), 2002. Cambridge,
MA: MIT Press.
-
The Intelligent Surfer: Probabilistic Combination of Link and Content
Information in PageRank, with Matt Richardson. Advances in Neural
Information Processing Systems 14 (pp. 1441-1448), 2002. Cambridge, MA:
MIT Press.
-
Mining the Network Value of Customers, with Matt Richardson. Proceedings
of the Seventh International Conference on Knowledge Discovery and Data
Mining (pp. 57-66), 2001. San Francisco, CA: ACM Press.
-
Mining Time-Changing Data Streams, with Geoff Hulten and Laurie Spencer.
Proceedings of the Seventh International Conference on Knowledge Discovery
and Data Mining (pp. 97-106), 2001. San Francisco, CA: ACM Press.
-
Adaptive Web Navigation for Wireless Devices, with Corin Anderson and Dan
Weld. Proceedings of the Seventeenth International Joint Conference on
Artificial Intelligence (pp. 879-884), 2001. Seattle, WA: Morgan Kaufmann.
-
A General Method for Scaling Up Machine Learning Algorithms and its
Application to Clustering, with Geoff Hulten. Proceedings of the
Eighteenth International Conference on Machine Learning (pp. 106-113), 2001.
Williamstown, MA: Morgan Kaufmann.
-
Reconciling Schemas of Disparate Data Sources: A Machine-Learning
Approach, with AnHai Doan and Alon Halevy. Proceedings of the 2001 ACM
SIGMOD International Conference on Management of Data (pp. 509-520), 2001.
Santa Barbara, CA: ACM Press.
-
Personalizing Web Sites for Mobile Users, with Corin Anderson and
Dan Weld. Proceedings of the Tenth International World Wide Web Conference
(pp. 565-575), 2001. Hong Kong: ACM Press.
-
Mixed Initiative Interfaces for Learning Tasks: SMARTedit Talks Back,
with Steve Wolfman, Tessa Lau and Dan Weld. Proceedings of the 2001
Conference on Intelligent User Interfaces (pp. 167-174), 2001. Santa Fe,
NM: ACM Press.
-
Mining High-Speed Data Streams, with Geoff Hulten. Proceedings of the
Sixth International Conference on Knowledge Discovery and Data Mining
(pp. 71-80), 2000. Boston, MA: ACM Press.
-
A Unified Bias-Variance Decomposition for Zero-One and Squared Loss.
Proceedings of the Seventeenth National Conference on Artificial Intelligence
(pp. 564-569), 2000. Austin, TX: AAAI Press.
-
Version Space Algebra and its Application to Programming by Demonstration,
with Tessa Lau and Dan Weld. Proceedings of the Seventeenth International
Conference on Machine Learning (pp. 527-534), 2000. Stanford, CA: Morgan
Kaufmann.
-
A Unified Bias-Variance Decomposition and its Applications.
Proceedings of the Seventeenth International Conference on
Machine Learning (pp. 231-238), 2000. Stanford, CA: Morgan Kaufmann.
-
Bayesian Averaging of Classifiers and the Overfitting Problem.
Proceedings of the Seventeenth International Conference on
Machine Learning (pp. 223-230), 2000. Stanford, CA: Morgan Kaufmann.
-
Learning Source Descriptions for Data Integration, with AnHai Doan and
Alon Levy. Proceedings of the Third International Workshop on the Web and
Databases (pp. 81-86), 2000. Dallas, TX: ACM SIGMOD.
-
MetaCost: A General Method for Making Classifiers Cost-Sensitive.
Proceedings of the Fifth International Conference on Knowledge
Discovery and Data Mining (pp. 155-164), 1999. San Diego, CA: ACM
Press. Winner of the Best Paper Award for Fundamental Research.
-
Process-Oriented Estimation of Generalization Error.
Proceedings of the Sixteenth International Joint Conference on Artificial
Intelligence (pp. 714-719), 1999. Stockholm, Sweden: Morgan Kaufmann.
-
Occam's Two Razors: The Sharp and the Blunt. Proceedings of the
Fourth International Conference on Knowledge Discovery and Data Mining
(pp. 37-43), 1998. New York, NY: AAAI Press. Winner of the Best Paper
Award for Fundamental Research.
-
A Process-Oriented Heuristic for Model Selection.
Proceedings of the Fifteenth International Conference on
Machine Learning (pp. 127-135), 1998. Madison, WI: Morgan Kaufmann.
-
How to Get a Free Lunch: A Simple Cost Model for Machine Learning
Applications. Proceedings of the AAAI-98/ICML-98 Workshop on the
Methodology of Applying Machine Learning (pp. 1-7), 1998. Madison,
WI: AAAI Press.
-
Knowledge Acquisition from Examples Via Multiple Models.
Proceedings of the Fourteenth International Conference on
Machine Learning (pp. 98-106), 1997. Nashville, TN: Morgan Kaufmann.
-
Why Does Bagging Work? A Bayesian Account and its Implications.
Proceedings of the Third International Conference on Knowledge Discovery
and Data Mining (pp. 155-158), 1997. Newport Beach, CA: AAAI Press.
-
Bayesian Model Averaging in Rule Induction. Preliminary Papers of
the Sixth International Workshop on Artificial Intelligence and
Statistics (pp. 157-164), 1997. Ft. Lauderdale, FL: Society for
Artificial Intelligence and Statistics.
-
Linear-Time Rule Induction. Proceedings of the Second
International Conference on Knowledge Discovery and Data Mining
(pp. 96-101), 1996. Portland, OR: AAAI Press.
-
Using Partitioning to Speed Up Specific-to-General Rule Induction.
Proceedings of the AAAI-96 Workshop on Integrating Multiple Learned
Models (pp. 29-34), 1996. Portland, OR: AAAI Press.
-
Beyond Independence: Conditions for the Optimality of the Simple
Bayesian Classifier, with Michael Pazzani. Proceedings of the
Thirteenth International Conference on Machine Learning (pp. 105-112),
1996. Bari, Italy: Morgan Kaufmann.
-
From Instances to Rules: A Comparison of Biases. Proceedings of
the Third International Workshop on Multistrategy Learning
(pp. 147-154), 1996. Harpers Ferry, WV: AAAI Press.
-
Two-Way Induction. Proceedings of the Seventh IEEE International
Conference on Tools with Artificial Intelligence (pp. 182-189), 1995.
Herndon, VA: IEEE Computer Society Press.
-
Rule Induction and Instance-Based Learning: A Unified Approach.
Proceedings of the Fourteenth International Joint Conference on
Artificial Intelligence (pp. 1226-1232), 1995. Montreal, Canada:
Morgan Kaufmann.
-
The RISE System: Conquering Without Separating. Proceedings of
the Sixth IEEE International Conference on Tools with Artificial
Intelligence (pp. 704-707), 1994. New Orleans, LA: IEEE Computer
Society Press.