def get_fake_data(nsubjects=20, noise_level=0.2, nbogus_classes=0):
    orig_ds = mean_group_sample(['targets'])(testing_datasets['uni3large'])
    # and creating an additional target which is a composition of the other two, so
    # it should be closer to them than to the left out L2
    classes_data = [
        orig_ds.samples,
        orig_ds[0].samples + orig_ds[1].samples,
        orig_ds[1].samples + 4 * orig_ds[2].samples
    ]
    classes_targets = list(orig_ds.T) + ['L0+1', 'L1+4*2']
    if nbogus_classes:
        classes_data.append(
            np.zeros((nbogus_classes, classes_data[0].shape[1]), dtype=float))
        classes_targets += ['B%02d' % i for i in xrange(nbogus_classes)]
    proto_ds = dataset_wizard(np.vstack(classes_data), targets=classes_targets)
    ntargets = len(proto_ds.UT)
    dss = []
    for i in xrange(nsubjects):
        R = get_random_rotation(proto_ds.nfeatures)
        ds = dataset_wizard(np.dot(proto_ds.samples, R), targets=proto_ds.T)
        #ds = dataset_wizard(proto_ds.samples, targets=proto_ds.T)
        ds.sa['subjects'] = [i]
        # And select a varying number of features
        ds = ds[:, :np.random.randint(10, ds.nfeatures)]
        # Add some noise
        ds.samples += np.random.normal(size=ds.shape) * noise_level
        dss.append(ds)
    return dss
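# A hedged usage sketch for the generator above (it assumes the same PyMVPA
# testing environment, e.g. `testing_datasets` and `get_random_rotation`, is
# importable): create a few noisy per-subject datasets and inspect them.
dss = get_fake_data(nsubjects=5, noise_level=0.1)
for ds in dss:
    # each subject gets its own random rotation and feature count
    print(ds.shape)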
def to_data_set(labels, samples):
    ds = dataset_wizard(samples, targets=labels)
    # assign a 'runtype' sample attribute: cycle through 5 run indices and
    # pad with 1s if the number of samples is not a multiple of 5
    len_ = len(labels)
    a = len_ / 5
    runtype = a * range(5)
    if len(runtype) < len_:
        runtype += [1] * (len_ - len(runtype))
    ds.sa['runtype'] = runtype
    return ds
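# Minimal usage sketch for to_data_set(); the labels and samples below are
# made-up illustration data, not part of the original example.
import numpy as np
labels = ['face', 'house'] * 6          # 12 samples
samples = np.random.randn(12, 50)       # 12 samples x 50 features
ds = to_data_set(labels, samples)
print(ds.shape)                         # (12, 50)
print(ds.sa.runtype)                    # one run-type entry per sample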
def get_dissimilarities(dss_subjects_rois, roi_labels=None):
    dss = []
    for dss_rois in dss_subjects_rois:
        dissimilarities_rois = np.array(
            [dist.pdist(ds, 'correlation') for ds in dss_rois])
        # and those would compose our 'dss'
        if roi_labels is None:
            roi_labels = ['ROI%d' % i
                          for i in xrange(len(dissimilarities_rois))]
        dss.append(dataset_wizard(dissimilarities_rois, targets=roi_labels))
    return dss
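# Hypothetical input for get_dissimilarities(): two subjects, each with three
# ROI response matrices (conditions x voxels); `dist` is assumed to be
# scipy.spatial.distance, matching its use in the function above.
import numpy as np
from scipy.spatial import distance as dist
dss_subjects_rois = [[np.random.randn(6, 20) for _ in range(3)]
                     for _ in range(2)]
dss = get_dissimilarities(dss_subjects_rois)
print(len(dss))        # one dataset per subject
print(dss[0].shape)    # (3 ROIs, 15 pairwise condition dissimilarities)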
def to_mvpa_dataset(stimset, samples):
    ds_data = []
    targets = []
    for stim_key, samps in samples.iteritems():
        sym = stimset.md5_to_symbol[stim_key]
        for samp in samps:
            targets.append(sym)
            ds_data.append(samp)
    ds_data = np.array(ds_data)

    train_len = int(0.75 * ds_data.shape[0])
    ds_indx = range(ds_data.shape[0])
    np.random.shuffle(ds_indx)
    train_index = ds_indx[:train_len]
    valid_index = ds_indx[train_len:]

    ds_train = dataset_wizard(ds_data[train_index, :],
                              targets=np.array(targets)[train_index])
    ds_valid = dataset_wizard(ds_data[valid_index, :],
                              targets=np.array(targets)[valid_index])
    return (ds_train, ds_valid)
def mds_withprocrust(a, t, **kwargs):
    # data should already be in the needed scale -- we just care about
    # rotation, shift, reflection
    pm = ProcrusteanMapper(reflection=True, scaling=False,
                           reduction=False, oblique=False)

    a_ = mdsf(a, **kwargs)
    ds = dataset_wizard(a_, targets=t)
    pm.train(ds)
    return pm.forward(a_)
def _call(self, ds_):
    """Extract weights from GPR

    .. note:
      Input dataset is not actually used. New dataset is
      constructed from what is known to the classifier
    """
    clf = self.clf
    # normalize data:
    clf._train_labels = (clf._train_labels - clf._train_labels.mean()) \
                        / clf._train_labels.std()
    # clf._train_fv = (clf._train_fv-clf._train_fv.mean(0)) \
    #                 /clf._train_fv.std(0)
    ds = dataset_wizard(samples=clf._train_fv, targets=clf._train_labels)
    clf.ca.enable("log_marginal_likelihood")
    ms = ModelSelector(clf, ds)
    # Note that some kernels do not have a gradient yet!
    # XXX Make it initialize to clf's current hyperparameter values
    #     or may be add ability to specify starting points in the constructor
    sigma_noise_initial = 1.0e-5
    sigma_f_initial = 1.0
    length_scale_initial = np.ones(ds.nfeatures) * 1.0e4
    # length_scale_initial = np.random.rand(ds.nfeatures)*1.0e4
    hyp_initial_guess = np.hstack(
        [sigma_noise_initial, sigma_f_initial, length_scale_initial])
    fixedHypers = array([0] * hyp_initial_guess.size, dtype=bool)
    fixedHypers = None
    problem = ms.max_log_marginal_likelihood(
        hyp_initial_guess=hyp_initial_guess,
        optimization_algorithm="scipy_lbfgsb",
        ftol=1.0e-3, fixedHypers=fixedHypers,
        use_gradient=True, logscale=True)
    if __debug__ and 'GPR_WEIGHTS' in debug.active:
        problem.iprint = 1
    lml = ms.solve()
    weights = 1.0 / ms.hyperparameters_best[2:]  # weight = 1/length_scale
    if __debug__:
        debug("GPR",
              "%s, train: shape %s, labels %s, min:max %g:%g, "
              "sigma_noise %g, sigma_f %g" %
              (clf, clf._train_fv.shape, np.unique(clf._train_labels),
               clf._train_fv.min(), clf._train_fv.max(),
               ms.hyperparameters_best[0], ms.hyperparameters_best[1]))
    return weights
def get_dissim_roi(subnr):
    ds = h5load(fns.betafn(subnr))
    ds = ds[:, mask_]
    ds = ds[ds.sa.condition != 'self']
    zscore(ds, chunks_attr='chunks')
    ds = mean_group_sample(['condition'])(ds)

    names = []
    dissims = []
    for roi, (center, ids) in rois.iteritems():
        names.append(roi)
        sample_roi = ds.samples[:, ids]
        dissim_roi = pdist(sample_roi, 'correlation')
        dissims.append(dissim_roi)
    dss = dataset_wizard(dissims, targets=names)
    return dss
def test_polydetrend():
    samples_forwhole = np.array([[1.0, 2, 3, 4, 5, 6],
                                 [-2.0, -4, -6, -8, -10, -12]], ndmin=2).T
    samples_forchunks = np.array([[1.0, 2, 3, 3, 2, 1],
                                  [-2.0, -4, -6, -6, -4, -2]], ndmin=2).T
    chunks = [0, 0, 0, 1, 1, 1]
    chunks_bad = [0, 0, 1, 1, 1, 0]
    target_whole = np.array([[-3.0, -2, -1, 1, 2, 3],
                             [-6, -4, -2, 2, 4, 6]], ndmin=2).T
    target_chunked = np.array([[-1.0, 0, 1, 1, 0, -1],
                               [2, 0, -2, -2, 0, 2]], ndmin=2).T

    ds = Dataset(samples_forwhole)

    # this one will auto-train the mapper on first use
    dm = PolyDetrendMapper(polyord=1, space='police')
    mds = dm.forward(ds)
    # features are linear trends, so detrending should remove all
    assert_array_almost_equal(mds.samples, np.zeros(mds.shape))
    # we get the information where each sample is assumed to be in the
    # space spanned by the polynomials
    assert_array_equal(mds.sa.police, np.arange(len(ds)))

    # hackish way to get the previous regressors into a dataset
    ds.sa['opt_reg_const'] = dm._regs[:, 0]
    ds.sa['opt_reg_lin'] = dm._regs[:, 1]
    # using these precomputed regressors, we should get the same result as
    # before even if we do not generate a regressor for linear
    dm_optreg = PolyDetrendMapper(polyord=0,
                                  opt_regs=['opt_reg_const', 'opt_reg_lin'])
    mds_optreg = dm_optreg.forward(ds)
    assert_array_almost_equal(mds_optreg, np.zeros(mds.shape))

    ds = Dataset(samples_forchunks)
    # 'constant' detrending removes the mean
    mds = PolyDetrendMapper(polyord=0).forward(ds)
    assert_array_almost_equal(
        mds.samples,
        samples_forchunks - np.mean(samples_forchunks, axis=0))
    # if there is no GLOBAL linear trend it should be identical to mean removal
    # even if trying to remove linear
    mds2 = PolyDetrendMapper(polyord=1).forward(ds)
    assert_array_almost_equal(mds, mds2)

    # chunk-wise detrending
    ds = dataset_wizard(samples_forchunks, chunks=chunks)
    dm = PolyDetrendMapper(chunks_attr='chunks', polyord=1, space='police')
    mds = dm.forward(ds)
    # features are chunk-wise linear trends, so detrending should remove all
    assert_array_almost_equal(mds.samples, np.zeros(mds.shape))
    # we get the information where each sample is assumed to be in the
    # space spanned by the polynomials, which is the identical linspace in
    # both chunks
    assert_array_equal(mds.sa.police, range(3) * 2)
    # non-matching number of samples cannot be mapped
    assert_raises(ValueError, dm.forward, ds[:-1])
    # however, if the dataset knows about the space it is possible
    ds.sa['police'] = mds.sa.police
    # XXX this should be
    #mds2 = dm(ds[1:-1])
    #assert_array_equal(mds[1:-1], mds2)
    # XXX but right now is
    assert_raises(NotImplementedError, dm.forward, ds[1:-1])

    # Detrend must preserve the size of dataset
    assert_equal(mds.shape, ds.shape)

    # small additional test for break points
    # although they are no longer there
    ds = dataset_wizard(np.array([[1.0, 2, 3, 1, 2, 3]], ndmin=2).T,
                        targets=chunks, chunks=chunks)
    mds = PolyDetrendMapper(chunks_attr='chunks', polyord=1).forward(ds)
    assert_array_almost_equal(mds.samples, np.zeros(mds.shape))

    # test of different polyord on each chunk
    target_mixed = np.array([[-1.0, 0, 1, 0, 0, 0],
                             [2.0, 0, -2, 0, 0, 0]], ndmin=2).T
    ds = dataset_wizard(samples_forchunks.copy(),
                        targets=chunks, chunks=chunks)
    mds = PolyDetrendMapper(chunks_attr='chunks', polyord=[0, 1]).forward(ds)
    assert_array_almost_equal(mds, target_mixed)

    # test irregular spacing of samples, but with corrective time info
    samples_forwhole = np.array([[1.0, 4, 6, 8, 2, 9],
                                 [-2.0, -8, -12, -16, -4, -18]], ndmin=2).T
    ds = Dataset(samples_forwhole, sa={'time': samples_forwhole[:, 0]})
    # linear detrending that makes use of temporal info from dataset
    dm = PolyDetrendMapper(polyord=1, space='time')
    mds = dm.forward(ds)
    assert_array_almost_equal(mds.samples, np.zeros(mds.shape))

    # and now the same stuff, but with chunking and ordered by time
    samples_forchunks = np.array([[1.0, 3, 3, 2, 2, 1],
                                  [-2.0, -6, -6, -4, -4, -2]], ndmin=2).T
    chunks = [0, 1, 0, 1, 0, 1]
    time = [4, 4, 12, 8, 8, 12]
    ds = Dataset(samples_forchunks.copy(),
                 sa={'chunks': chunks, 'time': time})
    mds = PolyDetrendMapper(chunks_attr='chunks', polyord=1,
                            space='time').forward(ds)
    # the whole thing must not affect the source data
    assert_array_equal(ds, samples_forchunks)
    # but if done inplace that is no longer true
    poly_detrend(ds, chunks_attr='chunks', polyord=1, space='time')
    assert_array_equal(ds, mds)
@sweepargs(ds=iter(tdatasets.values()))
def test_BinaryFxFeatureMeasure(ds):
    if not isinstance(ds.samples, np.ndarray):
        return
    # some simple function
    f = lambda x, y: np.sum((x.T * y).T, axis=0)
    fx = BinaryFxFeaturewiseMeasure(f, uni=False, numeric=True)
    fx_uni = BinaryFxFeaturewiseMeasure(f, uni=True, numeric=True)
    out = fx(ds)
    out_uni = fx_uni(ds)
    assert(len(out) == 1)
    assert_array_almost_equal(out.samples, out_uni)
    assert_equal(out.fa, out_uni.fa)
    ok_(str(fx).startswith("<BinaryFxFeaturewiseMeasure: lambda x, y:"))

_nonlin_tests = [
    (dataset_wizard([0, 1 - 0.01, 0, 1],
                    targets=['a', 'b', 'a', 'b']),
     ([0.99], [1])),
    (dataset_wizard([0, 1 - 0.01, 2, 0, 1, 2],
                    targets=['a', 'b', 'c', 'a', 'b', 'c']),
     ([0.99], [1])),
    # verify that order of 'labels' doesn't matter to get the same
    # correspondence
    (dataset_wizard([1 - 0.01, 0, 1, 0],
                    targets=['a', 'b', 'a', 'b']),
     ([0.99], [1])),
    # unfortunately with both normal kde based MI and dcorr
    # we are not getting "ideal" results in case of "non-linear"
    # but strict dependencies
    (dataset_wizard([0, 1 - 0.01, 2, 0, 1, 2],
                    targets=['a', 'c', 'b', 'a', 'c', 'b']),
     ([0.8], [1])),
    # 2nd feature should have no information above the targets
def test_group_clusterthreshold_simple(n_proc):
    if n_proc > 1:
        skip_if_no_external('joblib')
    feature_thresh_prob = 0.005
    nsubj = 10
    # make a nice 1D blob and a speck
    blob = np.array([0, 0, .5, 3, 5, 3, 3, 0, 2, 0])
    blob = Dataset([blob])
    # and some nice random permutations
    nperms = 100 * nsubj
    perm_samples = np.random.randn(nperms, blob.nfeatures)
    perms = Dataset(perm_samples,
                    sa=dict(chunks=np.repeat(range(nsubj),
                                             len(perm_samples) / nsubj)),
                    fa=dict(fid=range(perm_samples.shape[1])))
    # the algorithm instance
    # scale number of bootstraps to match desired probability
    # plus a safety margin to minimize bad luck in sampling
    clthr = gct.GroupClusterThreshold(
        n_bootstrap=int(3. / feature_thresh_prob),
        feature_thresh_prob=feature_thresh_prob,
        fwe_rate=0.01, n_blocks=3, n_proc=n_proc)
    clthr.train(perms)
    # get the FE thresholds
    thr = clthr._thrmap
    # perms are normally distributed, hence the CDF should be close, std of
    # the distribution will scale 1/sqrt(nsubj)
    assert_true(
        np.abs(feature_thresh_prob -
               (1 - norm.cdf(thr.mean(),
                             loc=0,
                             scale=1. / np.sqrt(nsubj)))) < 0.01)

    clstr_sizes = clthr._null_cluster_sizes
    # getting anything but a lonely one feature cluster is very unlikely
    assert_true(max([c[0] for c in clstr_sizes.keys()]) <= 1)
    # threshold orig map
    res = clthr(blob)
    #
    # check output
    #
    # samples unchanged
    assert_array_equal(blob.samples, res.samples)
    # need to find the big cluster
    assert_true(len(res.a.clusterstats) > 0)
    assert_equal(len(res.a.clusterstats),
                 res.fa.clusters_featurewise_thresh.max())
    # probs need to decrease with size, clusters are sorted by size (decreasing)
    assert_true(res.a.clusterstats['prob_raw'][0]
                <= res.a.clusterstats['prob_raw'][1])
    # corrected probs for every uncorrected cluster
    assert_true('prob_corrected' in res.a.clusterstats.dtype.names)
    # fwe correction always increases the p-values (if anything)
    assert_true(np.all(res.a.clusterstats['prob_raw']
                       <= res.a.clusterstats['prob_corrected']))
    # check expected cluster sizes, ordered large -> small
    assert_array_equal(res.a.clusterstats['size'], [4, 1])
    # check max position
    assert_array_equal(res.a.clusterlocations['max'], [[4], [8]])
    # center of mass: eyeballed
    assert_array_almost_equal(res.a.clusterlocations['center_of_mass'],
                              [[4.429], [8]], 3)
    # other simple stats
    # [0, 0, .5, 3, 5, 3, 3, 0, 2, 0]
    assert_array_equal(res.a.clusterstats['mean'], [3.5, 2])
    assert_array_equal(res.a.clusterstats['min'], [3, 2])
    assert_array_equal(res.a.clusterstats['max'], [5, 2])
    assert_array_equal(res.a.clusterstats['median'], [3, 2])
    assert_array_almost_equal(res.a.clusterstats['std'], [0.866, 0], 3)
    # fwe thresholding only ever removes clusters
    assert_true(np.all(np.abs(res.fa.clusters_featurewise_thresh
                              - res.fa.clusters_fwe_thresh) >= 0))
    # FWE should kill the small one
    assert_greater(res.fa.clusters_featurewise_thresh.max(),
                   res.fa.clusters_fwe_thresh.max())

    # check that the cluster results aren't depending on the actual location
    # of the clusters
    shifted_blob = Dataset([[.5, 3, 5, 3, 3, 0, 0, 0, 2, 0]])
    shifted_res = clthr(shifted_blob)
    assert_array_equal(res.a.clusterstats, shifted_res.a.clusterstats)

    # check that it averages multi-sample datasets
    # also checks that scenarios work where all features are part of one big
    # cluster
    multisamp = Dataset(np.arange(30).reshape(3, 10) + 100)
    avgres = clthr(multisamp)
    assert_equal(len(avgres), 1)
    assert_array_equal(avgres.samples[0], np.mean(multisamp.samples, axis=0))

    # retrain, this time with data from only a single subject
    perms = Dataset(perm_samples,
                    sa=dict(chunks=np.repeat(1, len(perm_samples))),
                    fa=dict(fid=range(perms.shape[1])))
    clthr.train(perms)
    # same blob -- 1st this should work without issues
    sglres = clthr(blob)
    # NULL estimation does no averaging
    # -> more noise -> fewer clusters -> higher p
    assert_greater_equal(len(res.a.clusterstats), len(sglres.a.clusterstats))
    assert_greater_equal(np.round(sglres.a.clusterstats[0]['prob_raw'], 4),
                         np.round(res.a.clusterstats[0]['prob_raw'], 4))

    # now again for real scientists: no FWE correction
    superclthr = gct.GroupClusterThreshold(
        n_bootstrap=int(3. / feature_thresh_prob),
        feature_thresh_prob=feature_thresh_prob,
        multicomp_correction=None, n_blocks=3, n_proc=n_proc)
    superclthr.train(perms)
    superres = superclthr(blob)
    assert_true('prob_corrected' in res.a.clusterstats.dtype.names)
    assert_true('clusters_fwe_thresh' in res.fa)
    assert_false('prob_corrected' in superres.a.clusterstats.dtype.names)
    assert_false('clusters_fwe_thresh' in superres.fa)

    # check validity test
    assert_raises(ValueError, gct.GroupClusterThreshold, n_bootstrap=10,
                  feature_thresh_prob=.09, n_proc=n_proc)
    # check mapped datasets
    blob = np.array([[0, 0, .5, 3, 5, 3, 3, 0, 2, 0],
                     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
    blob = dataset_wizard([blob])
    # and some nice random permutations
    nperms = 100 * nsubj
    perm_samples = np.random.randn(*((nperms,) + blob.shape))
    perms = dataset_wizard(perm_samples,
                           chunks=np.repeat(range(nsubj),
                                            len(perm_samples) / nsubj))
    clthr.train(perms)
    twodres = clthr(blob)
    # finds two clusters of the same size
    assert_array_equal(twodres.a.clusterstats['size'],
                       res.a.clusterstats['size'])
def give_data():
    # 100x10, 10 chunks, 4 targets
    return dataset_wizard(np.random.normal(size=(100, 10)),
                          targets=[i % 4 for i in range(100)],
                          chunks=[i // 10 for i in range(100)])
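# Sketch of how such a synthetic dataset might be exercised in a
# cross-validation; the classifier and partitioner choices here are
# illustrative additions, not part of the original snippet.
from mvpa2.clfs.smlr import SMLR
from mvpa2.generators.partition import NFoldPartitioner
from mvpa2.measures.base import CrossValidation

ds = give_data()
cv = CrossValidation(SMLR(), NFoldPartitioner(attr='chunks'))
errors = cv(ds)
# with random data and 4 targets, mean error should hover around chance (0.75)
print(errors.samples.mean())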
def test_erdataset():
    # 3 chunks, 5 targets, blocks of 5 samples each
    nchunks = 3
    ntargets = 5
    blocklength = 5
    nfeatures = 10
    targets = np.tile(np.repeat(range(ntargets), blocklength), nchunks)
    chunks = np.repeat(np.arange(nchunks), ntargets * blocklength)
    samples = np.repeat(
        np.arange(nchunks * ntargets * blocklength),
        nfeatures).reshape(-1, nfeatures)
    ds = dataset_wizard(samples, targets=targets, chunks=chunks)
    # check if events are determined properly
    evs = find_events(targets=ds.sa.targets, chunks=ds.sa.chunks)
    for ev in evs:
        assert_equal(ev['duration'], blocklength)
    assert_equal(ntargets * nchunks, len(evs))
    for t in range(ntargets):
        assert_equal(len([ev for ev in evs if ev['targets'] == t]), nchunks)
    # now turn `ds` into an event-related dataset
    erds = eventrelated_dataset(ds, evs)
    # the only unprefixed sample attributes are
    assert_equal(sorted([a for a in ds.sa if not a.startswith('event')]),
                 ['chunks', 'targets'])
    # samples as expected?
    assert_array_equal(erds.samples[0],
                       np.repeat(np.arange(blocklength), nfeatures))
    # that should also be the temporal feature offset
    assert_array_equal(erds.samples[0], erds.fa.event_offsetidx)
    assert_array_equal(erds.sa.event_onsetidx, np.arange(0, 71, 5))
    # finally we should see two mappers
    assert_equal(len(erds.a.mapper), 2)
    assert_true(isinstance(erds.a.mapper[0], BoxcarMapper))
    assert_true(isinstance(erds.a.mapper[1], FlattenMapper))
    # check alternative event mapper
    # this one does temporal compression by averaging
    erds_compress = eventrelated_dataset(
        ds, evs, event_mapper=FxMapper('features', np.mean))
    assert_equal(len(erds), len(erds_compress))
    assert_array_equal(erds_compress.samples[:, 0], np.arange(2, 73, 5))
    #
    # now check the same dataset with event discretization
    tr = 2.5
    ds.sa['time'] = np.arange(nchunks * ntargets * blocklength) * tr
    evs = [{'onset': 4.9, 'duration': 6.2}]
    # doesn't work without conversion
    assert_raises(ValueError, eventrelated_dataset, ds, evs)
    erds = eventrelated_dataset(ds, evs, time_attr='time')
    assert_equal(len(erds), 1)
    assert_array_equal(erds.samples[0],
                       np.repeat(np.arange(1, 5), nfeatures))
    assert_array_equal(erds.sa.orig_onset, [evs[0]['onset']])
    assert_array_equal(erds.sa.orig_duration, [evs[0]['duration']])
    assert_array_almost_equal(erds.sa.orig_offset, [2.4])
    assert_array_equal(erds.sa.time, [np.arange(2.5, 11, 2.5)])
    # now with closest match
    erds = eventrelated_dataset(ds, evs, time_attr='time', match='closest')
    expected_nsamples = 3
    assert_equal(len(erds), 1)
    assert_array_equal(erds.samples[0],
                       np.repeat(np.arange(2, 2 + expected_nsamples),
                                 nfeatures))
    assert_array_equal(erds.sa.orig_onset, [evs[0]['onset']])
    assert_array_equal(erds.sa.orig_duration, [evs[0]['duration']])
    assert_array_almost_equal(erds.sa.orig_offset, [-0.1])
    assert_array_equal(erds.sa.time, [np.arange(5.0, 11, 2.5)])
    # now test the way back
    results = np.arange(erds.nfeatures)
    assert_array_equal(erds.a.mapper.reverse1(results),
                       results.reshape(expected_nsamples, nfeatures))
    # what about multiple results?
    nresults = 5
    results = dataset_wizard([results] * nresults)
    # and let's have an attribute to make it more difficult
    results.sa['myattr'] = np.arange(5)
    rds = erds.a.mapper.reverse(results)
    assert_array_equal(
        rds,
        results.samples.reshape(nresults * expected_nsamples, nfeatures))
    assert_array_equal(rds.sa.myattr,
                       np.repeat(results.sa.myattr, expected_nsamples))
def test_cluster_count():
    skip_if_no_external('scipy', min_version='0.10')
    # we get a ZERO cluster count of one if there are no clusters at all
    # this is needed to keep track of the number of bootstrap samples that
    # yield no cluster at all (high threshold) in order to compute p-values
    # when there is no actual cluster size histogram
    assert_equal(gct._get_map_cluster_sizes([0, 0, 0, 0]), [0])
    # if there is at least one cluster: no ZERO count
    assert_equal(gct._get_map_cluster_sizes([0, 0, 1, 0]), [1])
    for i in range(2):  # rerun tests for bool type of test_M
        test_M = np.array([[1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
                           [0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1],
                           [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1],
                           [0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
                           [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0],
                           [0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0],
                           [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0],
                           [0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0],
                           [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0]])
        expected_result = [5, 4, 3, 3, 2, 0, 2]  # 5 clusters of size 1,
                                                 # 4 clusters of size 2 ...
        test_ds = Dataset([test_M])
        if i == 1:
            test_M = test_M.astype(bool)
        test_M_3d = np.hstack((test_M.flatten(),
                               test_M.flatten())).reshape(2, 9, 16)
        test_ds_3d = Dataset([test_M_3d])
        # expected_result^2
        expected_result_3d = np.array([0, 5, 0, 4, 0, 3, 0, 3,
                                       0, 2, 0, 0, 0, 2])
        size = 10000  # how many times bigger than test_M_3d
        test_M_3d_big = np.hstack((test_M_3d.flatten(), np.zeros(144)))
        test_M_3d_big = np.hstack((test_M_3d_big for i in range(size))
                                  ).reshape(3 * size, 9, 16)
        test_ds_3d_big = Dataset([test_M_3d_big])
        expected_result_3d_big = expected_result_3d * size

        # check basic cluster size determination for plain arrays and
        # datasets with a single sample
        for t, e in ((test_M, expected_result),
                     (test_ds, expected_result),
                     (test_M_3d, expected_result_3d),
                     (test_ds_3d, expected_result_3d),
                     (test_M_3d_big, expected_result_3d_big),
                     (test_ds_3d_big, expected_result_3d_big)):
            assert_array_equal(
                np.bincount(gct._get_map_cluster_sizes(t))[1:], e)
        # old
        M = np.vstack([test_M_3d.flatten()] * 10)
        # new
        ds = dataset_wizard([test_M_3d] * 10)
        assert_array_equal(M, ds)
        expected_result = Counter(np.hstack(
            [gct._get_map_cluster_sizes(test_M_3d)] * 10))
        assert_array_equal(expected_result, gct.get_cluster_sizes(ds))

        # test the same with some arbitrary per-feature threshold
        thr = 4
        labels, num = measurements.label(test_M_3d)
        area = measurements.sum(test_M_3d, labels,
                                index=np.arange(labels.max() + 1))
        cluster_sizes_map = area[labels]  # .astype(int)
        thresholded_cluster_sizes_map = cluster_sizes_map > thr
        # old
        M = np.vstack([cluster_sizes_map.flatten()] * 10)
        # new
        ds = dataset_wizard([cluster_sizes_map] * 10)
        assert_array_equal(M, ds)
        expected_result = Counter(np.hstack(
            [gct._get_map_cluster_sizes(thresholded_cluster_sizes_map)] * 10))
        th_map = np.ones(cluster_sizes_map.flatten().shape) * thr
        # threshold dataset by hand
        ds.samples = ds.samples > th_map
        assert_array_equal(expected_result, gct.get_cluster_sizes(ds))
def get_dsm_roi_secondorder_xval2(ds, rois, zscore_ds=True,
                                  part=OddEvenPartitioner(),
                                  cond_chunk='condition'):
    """Obtain second-order dissimilarities between ROIs.

    This version cross-validates at the second level, thus the resulting
    dsms are not symmetrical.

    Arguments
    ---------
    ds: dataset
    rois: dict
        each item in the dictionary must be a tuple where the 0th element is
        the center of the roi, and the 1st element is a list of ids
    zscore_ds: bool
        is the dset already zscored?
    part: partitioner
    cond_chunk: str
        across which sample attribute to perform mean group sample

    Returns
    -------
    dataset containing second level dsm
    """
    #ds = h5load(fns.betafn(subnr))
    #ds = ds[:, mask_]
    #ds = ds[ds.sa.condition != 'self']
    if zscore_ds:
        zscore(ds, chunks_attr='chunks')
    # set up oddeven partition
    #part = OddEvenPartitioner()
    rdms = []
    mgs = mean_group_sample([cond_chunk])
    dissims_folds = []
    for ds_ in part.generate(ds):
        ds_1 = ds_[ds_.sa.partitions == 1]
        ds_2 = ds_[ds_.sa.partitions == 2]
        ds_1 = mgs(ds_1)
        ds_2 = mgs(ds_2)
        assert(ds_1.samples.shape == ds_2.samples.shape)

        # first generate first-order rdms for each fold
        names = []
        centers = []
        dissims_1 = []
        dissims_2 = []
        for roi, (center, ids) in rois.iteritems():
            names.append(roi)
            centers.append(center)
            sample1_roi = ds_1.samples[:, ids]
            sample2_roi = ds_2.samples[:, ids]
            dissim1_roi = pdist(sample1_roi, 'correlation')
            dissim2_roi = pdist(sample2_roi, 'correlation')
            dissims_1.append(dissim1_roi)
            dissims_2.append(dissim2_roi)
        dss1 = np.array(dissims_1)
        dss2 = np.array(dissims_2)

        # now compute second-order rdm correlating across folds
        dissim_2ndorder = 1. - corrcoefxy(dss1.T, dss2.T)
        dissim_2ndorder = dataset_wizard(dissim_2ndorder, targets=names)
        dissim_2ndorder.sa['centers'] = centers
        # also add fa information about roi
        dissim_2ndorder.fa['roi'] = names
        dissims_folds.append(dissim_2ndorder)

    # average
    dissims = dissims_folds[0]
    for d in dissims_folds[1:]:
        dissims.samples += d.samples
    dissims.samples /= len(dissims_folds)
    return dissims
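# Hypothetical call of the function above: `rois` maps ROI names to
# (center, feature-id list) tuples, and `ds` is assumed to be an existing
# dataset carrying 'chunks' and 'condition' sample attributes. The ROI names
# and coordinates below are illustrative only.
rois = {'V1': ((10, 20, 30), [0, 1, 2, 3]),
        'FFA': ((40, 50, 60), [4, 5, 6, 7])}
second_order_dsm = get_dsm_roi_secondorder_xval2(ds, rois)
print(second_order_dsm.shape)   # (nrois, nrois): one row/column per ROI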
def __call__(self, datasets):
    """Estimate mappers for each dataset

    Parameters
    ----------
    datasets : list or tuple of datasets

    Returns
    -------
    A list of trained Mappers of the same length as datasets
    """
    params = self.params            # for quicker access ;)
    ca = self.ca
    ndatasets = len(datasets)
    nfeatures = [ds.nfeatures for ds in datasets]

    residuals = None
    if ca['residual_errors'].enabled:
        residuals = np.zeros((2 + params.level2_niter, ndatasets))
        ca.residual_errors = Dataset(
            samples=residuals,
            sa={'levels': ['1'] +
                          ['2:%i' % i for i in xrange(params.level2_niter)] +
                          ['3']})

    if __debug__:
        debug('HPAL', "Hyperalignment %s for %i datasets"
              % (self, ndatasets))

    if params.ref_ds is None:
        ref_ds = np.argmax(nfeatures)
    else:
        ref_ds = params.ref_ds
        if ref_ds < 0 or ref_ds >= ndatasets:
            raise ValueError, "Requested reference dataset %i is out of " \
                  "bounds. We have only %i datasets provided" \
                  % (ref_ds, ndatasets)
    ca.choosen_ref_ds = ref_ds
    # might prefer some other way to initialize... later
    mappers = [deepcopy(params.alignment) for ds in datasets]
    # zscore all data sets
    # ds = [ zscore(ds, chunks_attr=None) for ds in datasets]

    # Level 1 (first)
    commonspace = np.asanyarray(datasets[ref_ds])
    if params.zscore_common:
        zscore(commonspace, chunks_attr=None)
    data_mapped = [np.asanyarray(ds) for ds in datasets]
    for i, (m, data) in enumerate(zip(mappers, data_mapped)):
        if __debug__:
            debug('HPAL_', "Level 1: ds #%i" % i)
        if i == ref_ds:
            continue
        #ZSC zscore(data, chunks_attr=None)
        ds = dataset_wizard(samples=data, targets=commonspace)
        #ZSC zscore(ds, chunks_attr=None)
        m.train(ds)
        data_temp = m.forward(data)
        #ZSC zscore(data_temp, chunks_attr=None)
        data_mapped[i] = data_temp

        if residuals is not None:
            residuals[0, i] = np.linalg.norm(data_temp - commonspace)

        ## if ds_mapped == []:
        ##     ds_mapped = [zscore(m.forward(d), chunks_attr=None)]
        ## else:
        ##     ds_mapped += [zscore(m.forward(d), chunks_attr=None)]

        # zscore before adding
        # TODO: make just a function so we dont' waste space
        commonspace = params.combiner1(data_mapped[i], commonspace)
        if params.zscore_common:
            zscore(commonspace, chunks_attr=None)

    # update commonspace to mean of ds_mapped
    commonspace = params.combiner2(data_mapped)
    if params.zscore_common:
        zscore(commonspace, chunks_attr=None)

    # Level 2 -- might iterate multiple times
    for loop in xrange(params.level2_niter):
        for i, (m, ds) in enumerate(zip(mappers, datasets)):
            if __debug__:
                debug('HPAL_', "Level 2 (%i-th iteration): ds #%i"
                      % (loop, i))

            ## ds_temp = zscore( (commonspace*ndatasets - ds_mapped[i])
            ##                   /(ndatasets-1), chunks_attr=None )
            ds_new = ds.copy()
            #ZSC zscore(ds_new, chunks_attr=None)
            #PRJ ds_temp = (commonspace*ndatasets - ds_mapped[i])/(ndatasets-1)
            #ZSC zscore(ds_temp, chunks_attr=None)
            ds_new.targets = commonspace  #PRJ ds_temp
            m.train(ds_new)  # ds_temp)
            data_mapped[i] = m.forward(np.asanyarray(ds))
            if residuals is not None:
                residuals[1 + loop, i] = np.linalg.norm(
                    data_mapped[i] - commonspace)

            #ds_mapped[i] = zscore( m.forward(ds_temp), chunks_attr=None)

        commonspace = params.combiner2(data_mapped)
        if params.zscore_common:
            zscore(commonspace, chunks_attr=None)

    # Level 3 (last) to params.levels
    for i, (m, ds) in enumerate(zip(mappers, datasets)):
        if __debug__:
            debug('HPAL_', "Level 3: ds #%i" % i)

        ## ds_temp = zscore( (commonspace*ndatasets - ds_mapped[i])
        ##                   /(ndatasets-1), chunks_attr=None )
        ds_new = ds.copy()      # shallow copy so we could assign new labels
        #ZSC zscore(ds_new, chunks_attr=None)
        #PRJ ds_temp = (commonspace*ndatasets - ds_mapped[i])/(ndatasets-1)
        #ZSC zscore(ds_temp, chunks_attr=None)
        ds_new.targets = commonspace  #PRJ ds_temp#
        m.train(ds_new)  #ds_temp)

        if residuals is not None:
            data_mapped = m.forward(ds_new)
            residuals[-1, i] = np.linalg.norm(data_mapped - commonspace)

    return mappers
def get_dsm_roi_xval1(ds, rois, zscore_ds=True, part=OddEvenPartitioner(),
                      cond_chunk='condition'):
    """Obtain second-order dissimilarities between ROIs.

    This version cross-validates at the first level, thus the resulting
    dsms are symmetrical.

    Arguments
    ---------
    ds: dataset
    rois: dict
        each item in the dictionary must be a tuple where the 0th element is
        the center of the roi, and the 1st element is a list of ids
    zscore_ds: bool
        is the dset already zscored?
    part: partitioner
    cond_chunk: str
        across which sample attribute to perform mean group sample

    Returns
    -------
    dataset containing second level dsm
    """
    #ds = h5load(fns.betafn(subnr))
    #ds = ds[:, mask_]
    #ds = ds[ds.sa.condition != 'self']
    if zscore_ds:
        zscore(ds, chunks_attr='chunks')
    # set up oddeven partition
    #part = OddEvenPartitioner()
    rdms = []
    mgs = mean_group_sample([cond_chunk])
    dissims_folds = []
    for ds_ in part.generate(ds):
        ds_1 = ds_[ds_.sa.partitions == 1]
        ds_2 = ds_[ds_.sa.partitions == 2]
        ds_1 = mgs(ds_1)
        ds_2 = mgs(ds_2)
        assert(ds_1.samples.shape == ds_2.samples.shape)

        # first generate first-order rdms cross-validated across folds
        names = []
        centers = []
        dissims = []
        for roi, (center, ids) in rois.iteritems():
            names.append(roi)
            centers.append(center)
            sample1_roi = ds_1.samples[:, ids]
            sample2_roi = ds_2.samples[:, ids]
            dissim_roi = 1. - corrcoefxy(sample1_roi.T, sample2_roi.T)
            nsamples = ds_1.nsamples
            assert(dissim_roi.shape == (nsamples, nsamples))
            # now the RDM is not symmetrical anymore
            dissims.append(dissim_roi.flatten())
        dissims_folds.append(np.array(dissims))

    # average across folds
    dissims_folds = np.array(dissims_folds).mean(axis=0)
    assert(dissims_folds.shape == (len(names), nsamples ** 2))

    # now compute second level (distances)
    distance_roi = dist.pdist(dissims_folds, metric='correlation')

    dissims_folds = dataset_wizard(dist.squareform(distance_roi),
                                   targets=names)
    dissims_folds.fa['roi'] = names
    dissims_folds.sa['centers'] = centers

    return dissims_folds
def get_dsm_roi_xval1_firstlev(ds, rois, zscore_ds=True,
                               part=OddEvenPartitioner(),
                               cond_chunk='condition', fisher=False):
    """Obtain second-order dissimilarities between ROIs.

    This version cross-validates at the first level and returns only the
    first level, without distances between ROIs.

    Arguments
    ---------
    ds: dataset
    rois: dict
        each item in the dictionary must be a tuple where the 0th element is
        the center of the roi, and the 1st element is a list of ids
    zscore_ds: bool
        is the dset already zscored?
    part: partitioner
    cond_chunk: str
        across which sample attribute to perform mean group sample
    fisher: bool
        whether to fisher-transform the correlations before averaging across
        folds

    Returns
    -------
    dataset containing first level dsm of shape (nrois, ncond**2)
    """
    #ds = h5load(fns.betafn(subnr))
    #ds = ds[:, mask_]
    #ds = ds[ds.sa.condition != 'self']
    # set up oddeven partition
    #part = OddEvenPartitioner()
    mgs = mean_group_sample([cond_chunk])
    dissims_folds = []
    folds = 1
    for ds_ in part.generate(ds):
        print("Running fold {0}".format(folds))
        ds_1 = ds_[ds_.sa.partitions == 1]
        ds_2 = ds_[ds_.sa.partitions == 2]
        ds_1 = mgs(ds_1)
        ds_2 = mgs(ds_2)
        if ds_1.nsamples >= 4 and zscore_ds:
            zscore(ds_1, chunks_attr='chunks')
            zscore(ds_2, chunks_attr='chunks')
        assert(ds_1.samples.shape == ds_2.samples.shape)

        # first generate first-order rdms cross-validated across folds
        names = []
        centers = []
        dissims = []
        for roi, (center, ids) in rois.iteritems():
            names.append(roi)
            centers.append(center)
            sample1_roi = ds_1.samples[:, ids]
            sample2_roi = ds_2.samples[:, ids]
            dissim_roi = corrcoefxy(sample1_roi.T, sample2_roi.T,
                                    fisher=fisher)
            nsamples = ds_1.nsamples
            assert(dissim_roi.shape == (nsamples, nsamples))
            # now the RDM is not symmetrical anymore
            dissims.append(dissim_roi.flatten())
        dissims_folds.append(np.array(dissims))
        folds += 1

    # average across folds
    dissims_folds = np.array(dissims_folds).mean(axis=0)
    assert(dissims_folds.shape == (len(names), nsamples ** 2))

    if fisher:
        dissims_folds = np.tanh(dissims_folds)

    dissims_folds = dataset_wizard(dissims_folds, targets=names)
    dissims_folds.sa['centers'] = centers

    return dissims_folds
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

"""
So far the code has been identical. The first difference is the import of the
adaptor class. We also use a convenient way to convert the data into a proper
:class:`~mvpa2.datasets.base.Dataset`.
"""

# this first import is only required to run the example as part of the test suite
from mvpa2 import cfg
from mvpa2.clfs.skl.base import SKLLearnerAdapter
from mvpa2.datasets import dataset_wizard

ds_train = dataset_wizard(samples=X, targets=y)

"""
The following lines are an example of the only significant modification with
respect to a pure scikit-learn implementation: the regression is wrapped into
the adaptor. The result is a PyMVPA learner, hence it can be called with a
dataset that contains both samples and targets.
"""

clf_1 = SKLLearnerAdapter(DecisionTreeRegressor(max_depth=2))
clf_2 = SKLLearnerAdapter(DecisionTreeRegressor(max_depth=5))

clf_1.train(ds_train)
clf_2.train(ds_train)

X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_1 = clf_1.predict(X_test)
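"""
A possible continuation, mirroring the scikit-learn decision tree example:
obtain predictions from the second, deeper tree as well and compare both fits
against the noise-free sine. This comparison is an illustrative addition, not
part of the original snippet.
"""

y_2 = clf_2.predict(X_test)

# mean squared error of each depth setting against the underlying sine
mse_1 = np.mean((np.asarray(y_1) - np.sin(X_test).ravel()) ** 2)
mse_2 = np.mean((np.asarray(y_2) - np.sin(X_test).ravel()) ** 2)
print("MSE depth=2: {0:.3f}, depth=5: {1:.3f}".format(mse_1, mse_2))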