Thanks Laszlo - that looks very interesting.
Cheers
Jo
-----Original Message-----
From: laszlo.seress(a)gmail.com [mailto:laszlo.seress@gmail.com] On Behalf
Of Laszlo Seress
Sent: Friday, 15 June, 2012 08:36
To: Johannes Hachmann; Alan Aspuru-Guzik; Roberto Olivares; Aidan Daly;
Sule Atahan; Suleyman Er; Shrestha, Supriya
Cc: A-G Group
Subject: Cheminformatics, data-mining, and finding correlation between
variables in large data sets
Hi Everyone,
My friend just sent me a paper on a technique for data-mining large,
high-dimensional data sets and finding correlations between variables (kind of
what we're trying to do with the cheminformatics). The link to the paper is
here:
http://www.sciencemag.org/content/334/6062/1518
Abstract: Identifying interesting relationships between pairs of variables
in large data sets is increasingly important. Here, we present a measure of
dependence for two-variable relationships: the maximal information
coefficient (MIC). MIC captures a wide range of associations both functional
and not, and for functional relationships provides a score that roughly
equals the coefficient of determination (R^2) of the data relative to the
regression function. MIC belongs to a larger class of maximal
information-based nonparametric exploration (MINE) statistics for
identifying and classifying relationships. We apply MIC and MINE to data
sets in global health, gene expression, major-league baseball, and the human
gut microbiota and identify known and novel relationships.
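To give a rough feel for the idea in the abstract, here is a simplified
sketch in Python. It is not the paper's algorithm: it only tries equal-width
grids, whereas MIC as described searches over where the grid lines are
placed. The function names (grid_mutual_information, approx_mic) are made up
for illustration, and the grid-size cap of n^0.6 and the normalization by
log2(min(nx, ny)) are my reading of the paper's defaults, so treat both as
assumptions.

```python
import numpy as np

def grid_mutual_information(x, y, nx, ny):
    """Mutual information (bits) of x and y binned on an nx-by-ny equal-width grid."""
    counts, _, _ = np.histogram2d(x, y, bins=[nx, ny])
    pxy = counts / counts.sum()            # joint cell probabilities
    px = pxy.sum(axis=1, keepdims=True)    # marginal over x bins, shape (nx, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal over y bins, shape (1, ny)
    mask = pxy > 0                         # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])))

def approx_mic(x, y, alpha=0.6):
    """Crude MIC-like score in [0, 1]: best normalized grid MI over small grids.

    Assumption: grids are limited to nx * ny <= n**alpha cells, and only
    equal-width bins are tried (the real MIC also optimizes bin boundaries).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    max_cells = max(4, int(len(x) ** alpha))
    best = 0.0
    for nx in range(2, max_cells // 2 + 1):
        for ny in range(2, max_cells // nx + 1):
            mi = grid_mutual_information(x, y, nx, ny)
            # Normalize so a perfect relationship on this grid scores 1.
            best = max(best, mi / np.log2(min(nx, ny)))
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(-1.0, 1.0, 500)
    print("linear   :", approx_mic(x, x))
    print("parabola :", approx_mic(x, x ** 2))
    print("noise    :", approx_mic(rng.random(500), rng.random(500)))
```

The point of the parabola case is that a plain Pearson correlation would be
near zero there, while a grid-based score like this one should still come
out high. For real work it would be better to use the authors' own MINE
implementation than this sketch.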
I'm not sure if it's exactly what we need, but I thought I'd pass it
along.
Best,
Laszlo
P.S. cc'd Aspuru-Group in case anyone else is interested.
--
Laszlo Ryan Seress