Testing for the existence of clusters

Claudio Fuentes; George Casella

Authors: Claudio Fuentes, George Casella
Published in: Sort: Statistics and Operations Research Transactions, ISSN 1696-2281, Vol. 33, No. 2, 2009, pp. 115-146
Language: English
Abstract
- Detecting and characterizing the clusters present in a sample has long been an important concern for researchers in many fields. In particular, experimenters have repeatedly asked whether the clusters found are statistically significant. Recently, this question arose again in a study in maize genetics, where determining the significance of clusters is a crucial first step in identifying a genome-wide collection of mutants that may affect kernel composition. Although several efforts have been made in this direction, little has been done to develop an actual hypothesis test for assessing the significance of clusters. In this paper, we propose a new methodology for testing H0 : K = 1 vs. H1 : K = k, where K denotes the number of clusters present in a population. Our procedure, based on Bayesian tools, yields closed-form expressions for the posterior probabilities corresponding to the null hypothesis. We then calibrate our results by estimating the frequentist null distribution of the posterior probabilities, which gives the p-values associated with the observed posterior probabilities. In most cases, the actual evaluation of the posterior probabilities is computationally intensive, and several algorithms have been discussed in the literature. Here, we propose a simple estimation procedure, based on MCMC techniques, that permits an efficient and easily implementable evaluation of the test. Finally, we present simulation studies that support our conclusions, and we apply our method to the analysis of NIR spectroscopy data from the genetic study that motivated this work.
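The calibration idea in the abstract (turning an observed posterior probability of H0 into a p-value by simulating its null distribution) can be illustrated with a small sketch. This is not the authors' procedure: the toy model below (H0: one N(0, 1) cluster; H1: an equal mixture of N(-shift, 1) and N(+shift, 1); prior probability 1/2 on each), the function names, and the `shift`, `n_null`, and `seed` parameters are all assumptions chosen only to keep the example self-contained.

```python
import math
import random


def log_normal_pdf(x, mean):
    """Log density of N(mean, 1) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - mean) ** 2


def posterior_prob_null(sample, shift=2.0):
    """Toy P(H0 | sample), assuming prior probability 1/2 on each hypothesis.

    H0 (K = 1): every point is drawn from N(0, 1).
    H1 (K = 2): every point is drawn from 0.5*N(-shift, 1) + 0.5*N(+shift, 1).
    """
    log_l0 = 0.0
    log_l1 = 0.0
    for x in sample:
        log_l0 += log_normal_pdf(x, 0.0)
        a = log_normal_pdf(x, -shift)
        b = log_normal_pdf(x, shift)
        m = max(a, b)  # log-sum-exp keeps the mixture density stable
        log_l1 += m + math.log(0.5 * math.exp(a - m) + 0.5 * math.exp(b - m))
    diff = log_l1 - log_l0
    if diff > 700.0:  # exp would overflow; the posterior is numerically zero
        return 0.0
    # P(H0 | y) = L0 / (L0 + L1) = 1 / (1 + exp(log L1 - log L0))
    return 1.0 / (1.0 + math.exp(diff))


def calibrated_p_value(sample, n_null=2000, seed=0):
    """Monte Carlo p-value for the observed posterior probability of H0.

    Small posterior probabilities count as evidence against H0, so the
    p-value is the estimated chance, under H0, of a value at least as small.
    """
    rng = random.Random(seed)
    n = len(sample)
    observed = posterior_prob_null(sample)
    as_small = 0
    for _ in range(n_null):
        null_sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if posterior_prob_null(null_sample) <= observed:
            as_small += 1
    # The +1 terms count the observed value itself and avoid a p-value of 0.
    return observed, (1 + as_small) / (n_null + 1)


if __name__ == "__main__":
    data_rng = random.Random(1)
    two_clusters = ([data_rng.gauss(-2.0, 1.0) for _ in range(15)]
                    + [data_rng.gauss(2.0, 1.0) for _ in range(15)])
    post, p = calibrated_p_value(two_clusters, n_null=500)
    print("posterior P(H0 | data):", post)
    print("calibrated p-value:", p)
```

In this toy setting the posterior probability is available in closed form, so no MCMC step is needed; the sketch only shows how a frequentist null distribution can be estimated by simulation and used to calibrate a Bayesian quantity.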
References
- Andrews, G. (1976). The Theory of Partitions. Addison-Wesley, Reading MA.
- Auffermann, W. F., Ngan, S. C. and Hu, X. (2002). Cluster significance testing using the bootstrap. NeuroImage, 17, 583-591.
- Bayarri, M. J. and Berger, J. (1998). Quantifying surprise in the data and model verification (with discussion). Bayesian Statistics 6, J....
- Bolshakova, N., Azuaje, F. and Cunningham, P. (2005). An integrated tool for microarray data clustering and cluster validity assessment. Bioinformatics,...
- Bona, M. (2004). Combinatorics of Permutations. Chapman & Hall/CRC, London.
- Booth, J. G., Casella, G. and Hobert, J. P. (2008). Clustering using objective functions and stochastic search. Journal of Royal Statistical...
- Casella, G. and Robert, C. (1998). Post-processing accept-reject samples: recycling and rescaling. Journal of the Computational and Graphical...
- Easton, G. S. and Ronchetti, E. (1986). General saddlepoint approximations with applications to L statistics. Journal of the American Statistical...
- Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical...
- Fuentes, C. (2008). Testing for the Existence of Clusters with Applications to NIR Spectroscopy Data. Master Thesis, University of Florida,...
- Ghosh, J. K., Delampady, M. and Samanta, T. (2006). An Introduction to Bayesian Analysis: Theory and Methods. Springer, New York.
- Girón, F. J., Martínez, M. L., Moreno, E. and Torres, F. (2006). Objective testing procedures in linear models: calibration of the p-values....
- Gould, H. W. (1960). Stirling number representation problems. Proceedings of the American Mathematical Society, 11, 447-451.
- Glaser, R. E. (1980). A characterization of Bartlett’s statistic involving incomplete beta functions. Biometrika, 67, 53-58.
- Hartigan, J. A. (1975). Clustering Algorithms. Wiley, New York.
- Jeffreys, H. (1961). Theory of Probability. Third Edition. Oxford University Press, Oxford.
- Kendall, M., Stuart, A., Ord, J. K. and Arnold, S. (1999). Kendall’s Advanced Theory of Statistics, Volume 2A: Classical Inference and the...
- Kerr, M. K. and Churchill, G. A. (2001). Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments....
- Lavine, M. and Schervish, M. (1999). Bayes factors: what they are and what they are not. American Statistician, 53, 119-122.
- McCullagh, P. and Yang, J. (2006). How many clusters? Technical Report, Department of Statistics, University of Chicago, Chicago.
- Pitman, J. (1996). Some developments of the Blackwell-MacQueen urn scheme. Statistics, Probability and Game Theory. IMS Lecture Notes Monograph...
- Pritchard, J. K., Stephens, M. and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155, 945-959.
- Quintana, F. A. (2004). A predictive view of Bayesian clustering. Journal of Statistical Planning and Inference, 136, 2407-2429.
- Robert, C. P. (2001). The Bayesian Choice. Second Edition. Springer-Verlag, New York.
- Sen, P. K. and Singer, J. M. (1993). Large Sample Methods in Statistics: An Introduction with Applications. Chapman & Hall, New York-London.
- Steele, R., Raftery, A. E. and Emond, M. J. (2003). Computing normalizing constants for finite mixture models via incremental mixture importance...
- Sugar, C. and James, G. (2003). Finding the number of clusters in a data set: an information theoretic approach. Journal of the American Statistical...
- Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal...
- Van Dijk, H. and Kloek, T. (1984). Experiments with some alternatives for simple importance sampling in Monte Carlo integration. Bayesian...
- Ventura, V. (2002). Non-parametric bootstrap recycling. Statistics and Computing, 12, 261-273.