[Ml-stat-talks] Fwd: Wilks Statistics Seminar: Florentina Bunea, Tomorrow, April 15 @ 12:30pm, Sherrerd Hall 101

Barbara Engelhardt bee at princeton.edu
Thu Apr 14 15:37:30 EDT 2016


Talk of interest.

---------- Forwarded message ----------


***   Wilks Statistics Seminar   ***

DATE:  Tomorrow, April 15, 2016

TIME:   12:30pm

LOCATION:   Sherrerd Hall 101

SPEAKER: Florentina Bunea, Cornell University

TITLE:  Minimax Optimal Variable Clustering in G-Models

ABSTRACT:  The goal of variable clustering is to partition a random vector
X in R^p into sub-groups of similar probabilistic behavior. Popular methods
such as hierarchical clustering or K-means are algorithmic procedures
applied to observations on X, while no population level target is defined
prior to estimation. We take a different view in this talk, where we discuss
model based variable clustering. We consider three models, of increasing
level of complexity, termed generically G-models, with G standing for the
partition to be estimated. Motivated by the potential lack of
identifiability of the G-latent models, which are currently used in problems
involving variable clustering, we introduce two new classes of models, the
G-exchangeable and the G-block covariance models. We show that both classes
are identifiable, for any distribution of X, thereby providing well defined
targets for estimation. Our focus is on clusters that are invariant with
respect to unknown monotone transformations of the data, and that can be
estimated in a computationally feasible manner. Both desiderata can be met
if the clusters correspond to blocks in the copula correlation matrix of X,
assumed to have a Gaussian copula distribution. This motivates the
introduction of a new similarity metric for cluster membership, CORD, and
of a homonymous method for cluster estimation. Central to our work is the
derivation of the minimax rate of the CORD cluster separation for exact
partition recovery. We obtain the surprising result that the CORD rate is
of order sqrt(log(p)/n), irrespective of the number of clusters, or of the size
of the smallest cluster. Our new procedure, CORD, available on CRAN,
achieves this bound and has computational complexity that is polynomial in
p. The CORD distance between two clusters is larger than the classical
"within-between" correlation gap between clusters, and can be employed even
when the latter is negative. However, in the particular case of a positive
correlation GAP, the GAP minimax rate for exact recovery is sqrt(log(p)/(mn)),
where m is the minimum cluster size. We show that while methods such as
spectral clustering cannot, in general, recover the partition exactly at
the minimax GAP separation level, convex algorithms can be near minimax
optimal. Our results are further supported by extensive numerical studies
and data examples.