## Data Mining

*11 Jun 2014 13:59*

I've taught a course on this, so I ought to be able to describe it, oughtn't
I? Data mining, more stuffily "knowledge discovery in databases", is the art
of finding and extracting useful patterns in very large collections of data.
It's not quite the same as machine
learning, because, while it certainly uses ML techniques, the aim is to
directly guide action (*praxis*!), rather than to develop a technology
and theory of induction. In some ways, in fact, it's closer to
what statistics calls "exploratory data
analysis", though with certain advantages and limitations that come from having
really big data to explore.

Kernel methods probably deserve their own notebook.

See also: Clinical and Actuarial Compared; Clustering; Statistics for Structured Data; Text Mining

- Recommended, big picture:
- Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science **16**(2001): 199--231 [very much including the discussion by others and the reply by Breiman]
- Pedro Domingos, "A Few Useful Things to Know about Machine Learning"
- David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining [The textbook I teach from; also a book I learned a lot from.]
- Bernard E. Harcourt, Against Prediction: Profiling, Policing, and Punishing in an Actuarial Age [Precis as a 43 pp. PDF working paper]
- Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction [Website, with full text free in PDF]
- Sholom M. Weiss and Nitin Indurkhya, Predictive Data Mining: A Practical Guide [Pedestrian, but it *is* practical, and adapted to the meanest, i.e. the managerial, understanding]

- Recommended, close-ups:
- Gavin Brown, Adam Pocock, Ming-Jie Zhao, Mikel Luján, "Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection", Journal of Machine Learning Research **13**(2012): 27--66
- Sharad Goel, Jake M. Hofman, Sébastien Lahaie, David M. Pennock, and Duncan J. Watts, "Predicting consumer behavior with Web search", Proceedings of the National Academy of Sciences (USA) **107**(2010): 17486--17490 [A case study in using data mining, while recognizing limitations]
- Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing Attribute Interactions", cs.AI/0308002
- Jon Kleinberg, Christos Papadimitriou and Prabhakar Raghavan, "A Microeconomic View of Data Mining", Data Mining and Knowledge Discovery **2**(1998) [PDF]
- Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, Michael I. Jordan, "A Scalable Bootstrap for Massive Data", arxiv:1112.5016
- Kling, Scherson and Allen, "Parallel Computing and Information Capitalism," in Metropolis and Rota (eds.), A New Era in Computation (1992) [A batch of UC Irvine comp. sci. professors who write like sociologists. "'Information capitalism' refers to forms of organization in which data-intensive techniques and computerization are key strategic resources for corporate production."]
- Erik Larson, The Naked Consumer: How Our Private Lives Become Public Commodities
- R. Dean Malmgren, Jake M. Hofman, Luis A. N. Amaral, Duncan J. Watts, "Characterizing Individual Communication Patterns", arxiv:0905.0106
- Ryan J. Tibshirani, "Degrees of Freedom and Model Search", arxiv:1402.1920
- Yong Wang, Ilze Ziedins, Mark Holmes, Neal Challands, "Tree Models for Difference and Change Detection in a Complex Environment", Annals of Applied Statistics **6**(2012): 1162--1184, arxiv:1202.1561 [In an ordinary classification tree, we are interested in the distribution of the class labels \( Y \) given the predictors \( X \), i.e., \( \Pr(Y|X) \), and make splits on \( X \) so that (in essence) the conditional entropy \( H[Y|X] \) becomes small. This is of course equivalent to making splits so that the divergence of \( \Pr(Y|X) \) from \( \Pr(Y) \) is maximized. What they are interested in is not classification but *describing* how the different classes are distinct, so the relevant distribution is \( \Pr(X|Y) \), and they want a big divergence between \( \Pr(X) \) and \( \Pr(X|Y) \).]
- Jianming Ye, "On Measuring and Correcting the Effects of Data Mining and Model Selection", Journal of the American Statistical Association **93**(1998): 120--131
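
The equivalence mentioned in the note on Wang et al. above can be checked numerically. What follows is only a sketch, under two assumptions of mine: that "divergence" means the Kullback-Leibler divergence averaged over \( X \), and that a made-up 2x2 joint distribution (the `joint` table, which is hypothetical and not from any of the papers listed) stands in for one candidate split. With those assumptions, the drop in entropy \( H[Y] - H[Y|X] \) and the expected divergence of \( \Pr(Y|X) \) from \( \Pr(Y) \) should be the same number.

```python
import math


def entropy(p):
    """Shannon entropy, in bits, of a probability vector."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def kl(p, q):
    """Kullback-Leibler divergence D(p || q), in bits."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


# Hypothetical joint distribution: joint[x][y] = Pr(X = x, Y = y),
# where x indexes the two sides of a candidate split and y the class label.
joint = [[0.30, 0.10],
         [0.15, 0.45]]

# Marginals Pr(X) and Pr(Y), and the conditionals Pr(Y | X = x).
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]
p_y_given_x = [[v / px for v in row] for row, px in zip(joint, p_x)]

# Conditional entropy H[Y|X], averaged over the sides of the split.
h_y_given_x = sum(px * entropy(cond) for px, cond in zip(p_x, p_y_given_x))

# Reduction in entropy achieved by the split.
information_gain = entropy(p_y) - h_y_given_x

# Divergence of Pr(Y|X) from Pr(Y), averaged over X.
expected_kl = sum(px * kl(cond, p_y) for px, cond in zip(p_x, p_y_given_x))

print(information_gain, expected_kl)
```

The two printed values agree up to floating-point error. Since \( H[Y] \) does not depend on the split, choosing the split that makes \( H[Y|X] \) smallest is the same as choosing the one that makes the average divergence largest.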

- Modesty forbids me to recommend:
- My lecture notes for my data mining class [However, many of them are based on lecture notes originally written by Tom Minka, and modesty does not forbid me from recommending his work.]

- To read:
- Arvind Agarwal, Jeff M. Phillips, Suresh Venkatasubramanian, "A Unified Algorithmic Framework for Multi-Dimensional Scaling", arxiv:1003.0529
- Dhoha Almazro, Ghadeer Shahatah, Lamia Albdulkarim, Mona Kherees, Romy Martinez, William Nzoukou, "A Survey Paper on Recommender Systems", arxiv:1006.5278
- Ian Ayres, Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart [Despite the *painful* title, Ayres has done cool applied work in social statistics]
- David L. Banks and Yasmin H. Said, "Data Mining in Electronic Commerce", Statistical Science **21**(2006): 234--246, math.ST/0609204
- Gérard Biau, Benoît Cadre, Laurent Rouvière, "Statistical analysis of \( k \)-nearest neighbor collaborative recommendation", Annals of Statistics **38**(2010): 1568--1592, arxiv:1010.0499
- Kerstin Bunte, Michael Biehl and Barbara Hammer, "A General Framework for Dimensionality-Reducing Data Visualization Mapping", Neural Computation **24**(2012): 771--804
- Burnham, Rise of the Computer State
- Bertrand Clarke, "Desiderata for a Predictive Theory of Statistics", Bayesian Analysis **5**(2010): 1--36
- Bertrand Clarke, Ernest Fokoue and Hao Helen Zhang, Principles and Theory for Data Mining and Machine Learning
- Jesse Davis and Mark Goadrich, "The Relationship Between Precision-Recall and ROC Curves" [PDF preprint]
- Pavel Dmitriev and Carl Lagoze, "Mining Generalized Graph Patterns based on User Examples", cs.DS/0609153
- Usama Fayyad, Georges G. Grinstein and Andreas Wierse (eds.), Information Visualization in Data Mining and Knowledge Discovery
- Peter Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data
- Hillol Kargupta and Philip Chan (eds.), Advances in Distributed and Parallel Knowledge Discovery
- Robert L. Grossman and Richard G. Larson, "State Space Realization Theorems for Data Mining", arxiv:0901.2735
- Hillol Kargupta, Anupam Joshi, Krishnamoorthy Sivakumar and Yelena Yesha, Data Mining: Next Generation Challenges and Future Directions
- Nicholas M. Kiefer and C. Erik Larson, "Specification and Informational Issues in Credit Scoring", SSRN/956628
- Ann B. Lee, Diana Luca and Kathryn Roeder, "A spectral graph approach to discovering genetic ancestry", Annals of Applied Statistics **4**(2010): 179--202
- Colleen McCue, Data Mining and Predictive Analysis: Intelligence Gathering and Crime Analysis [To be shot after a fair trial]
- Michalski, Bratko and Kubat (eds.), Machine Learning and Data Mining: Methods and Applications
- Rada Mihalcea and Dragomir Radev, Graph-Based Natural Language Processing and Information Retrieval
- Petra Kralj Novak, Nada Lavrac and Geoffrey I. Webb, "Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining", Journal of Machine Learning Research **10**(2009): 377--403
- Anand Rajaraman, Jure Leskovec and Jeffrey David Ullman, Mining of Massive Datasets
- Naren Ramakrishnan and Chris Bailey-Kellogg, "Sampling Strategies for Mining in Data-Scarce Domains," cs.CE/0204047
- Daniel J. Solove, "Data Mining and the Security-Liberty Debate", SSRN/990030
- Joseph Turow, Niche Envy: Marketing Discrimination in the Digital Age
- Christian H. Weiss, "Rule generation for categorical time series with Markov assumptions", Statistics and Computing **21**(2011): 1--16 [Variable-length Markov models]
- Johannes Wollbold, "Attribute Exploration of Discrete Temporal Transitions", q-bio/0701009
- Jun-Ming Xu, Aniruddha Bhargava, Robert Nowak, and Xiaojin Zhu, "Socioscope: Spatio-Temporal Signal Recovery from Social Media" [PDF]
- Mohammed Javeed Zaki
- Scalable Data Mining for Rules [Ph.D. thesis, U. of Rochester, 1998; on-line through NCSTRL]
- "SPADE: An Efficient Algorithm for Mining Frequent Sequences," Machine Learning **42**(2001): 31--60