def test_experiment_with_dataframes(assess_dataframe):
    # Exercise `sst.experiment` with a train dataframe and the
    # `assess_dataframes` keyword.
    def fit_maxent(X, y):
        mod = LogisticRegression(solver='liblinear', multi_class='auto')
        mod.fit(X, y)
        return mod

    sst.experiment(sst.train_reader(sst_home, include_subtrees=False),
                   phi=lambda x: {"$UNK": 1},
                   train_func=fit_maxent,
                   assess_dataframes=assess_dataframe,
                   random_state=42)


def test_experiment_with_reader(assess_reader):
    # Exercise `sst.experiment` with reader functions via the
    # `train_reader` and `assess_reader` keywords.
    def fit_maxent(X, y):
        mod = LogisticRegression(solver='liblinear', multi_class='auto')
        mod.fit(X, y)
        return mod

    sst.experiment(sst_home,
                   train_reader=sst.train_reader,
                   phi=lambda x: {"$UNK": 1},
                   train_func=fit_maxent,
                   assess_reader=assess_reader,
                   random_state=42)


# In[8]:


def glove_leaves_phi(tree, np_func=np.sum):
    return vsm_leaves_phi(tree, glove_lookup, np_func=np_func)
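
# `glove_leaves_phi` relies on a `vsm_leaves_phi` helper that is not shown in
# this excerpt. A minimal sketch of what it presumably does, assuming it looks
# up each leaf token in the vector table and combines the hits with `np_func`
# (the zero-vector fallback for all-OOV trees is an assumption):

def vsm_leaves_phi(tree, lookup, np_func=np.sum):
    allvecs = np.array([lookup[w] for w in tree.leaves() if w in lookup])
    if len(allvecs) == 0:
        dim = len(next(iter(lookup.values())))
        feats = np.zeros(dim)
    else:
        feats = np_func(allvecs, axis=0)
    return feats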


# In[9]:

_ = sst.experiment(
    SST_HOME,
    glove_leaves_phi,
    fit_maxent_classifier,
    class_func=sst.ternary_class_func,
    vectorize=False
)  # Tell `experiment` that we already have our feature vectors.
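
# Note on `vectorize`: with the default `vectorize=True`, `phi` returns a dict
# of feature counts and `experiment` turns those dicts into vectors internally;
# with `vectorize=False`, `phi` must return the finished feature vector itself,
# as `glove_leaves_phi` does here.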

# ### IMDB representations
#
# Our IMDB VSMs seem pretty well-attuned to the Stanford Sentiment Treebank, so we might expect them to do even better than the general-purpose GloVe inputs. Here are two quick assessments of that idea:

# In[10]:

imdb20 = pd.read_csv(os.path.join(VSMDATA_HOME, 'imdb_window20-flat.csv.gz'),
                     index_col=0)
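
# One way the first of those assessments might look, following the same pattern
# as the GloVe run above (a sketch: `imdb20_lookup` and `imdb20_phi` are
# illustrative names, not from the original notebook):

imdb20_lookup = dict(zip(imdb20.index, imdb20.values))


def imdb20_phi(tree, np_func=np.sum):
    return vsm_leaves_phi(tree, imdb20_lookup, np_func=np_func)


_ = sst.experiment(
    SST_HOME,
    imdb20_phi,
    fit_maxent_classifier,
    class_func=sst.ternary_class_func,
    vectorize=False)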

# In[11]:


def content_similarity_matrix(lines):
    # Build a square numpy array of pairwise content similarities between the
    # input lines. `computeContentSimilarity` is assumed to be defined elsewhere;
    # the function name here is a placeholder, since the original `def` line is
    # missing from this excerpt.
    n = len(lines)
    X = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                X[i][j] = 100000  # discounted cosine similarity value
            else:
                X[i][j] = computeContentSimilarity(lines[i], lines[j])
    print(X)
    return X


def fit_maxent_classifier(X, y):
    mod = LogisticRegression(fit_intercept=True)
    mod.fit(X, y)
    return mod


_ = sst.experiment(SST_HOME,
                   unigrams_phi,
                   fit_maxent_classifier,
                   train_reader=sst.train_reader,
                   assess_reader=sst.dev_reader,
                   class_func=sst.binary_class_func)

# ### TfRNNClassifier wrapper


def rnn_phi(tree):
    return tree.leaves()


def fit_tf_rnn_classifier(X, y):
    vocab = sst.get_vocab(X, n_words=3000)
    mod = TfRNNClassifier(vocab,
                          eta=0.05)
    mod.fit(X, y)
    return mod
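
# A sketch of how this wrapper might be plugged into `sst.experiment`, following
# the pattern of the other calls in this notebook (the exact call is not shown
# in this excerpt). `vectorize=False` because `rnn_phi` returns a list of tokens
# rather than a dict of features.

_ = sst.experiment(
    SST_HOME,
    rnn_phi,
    fit_tf_rnn_classifier,
    vectorize=False)
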
# ## Experiments
#
# We now have all the pieces needed to run experiments. And __we're going to want to run a lot of experiments__, trying out different feature functions, taking different perspectives on the data and labels, and using different models.
#
# To make that process efficient and regimented, `sst` contains a function `experiment`. All it does is pull together these pieces and use them for training and assessment. It's complicated, but the flexibility will turn out to be an asset.

# ### Experiment with default values

# In[15]:

_ = sst.experiment(SST_HOME,
                   unigrams_phi,
                   fit_softmax_classifier,
                   train_reader=sst.train_reader,
                   assess_reader=None,
                   train_size=0.7,
                   class_func=sst.ternary_class_func,
                   score_func=utils.safe_macro_f1,
                   verbose=True)

# A few notes on this function call:
#
# * Since `assess_reader=None`, the function reports performance on a random train–test split. Give `sst.dev_reader` as the argument to assess against the `dev` set.
#
# * `unigrams_phi` is the function we defined above. By changing/expanding this function, you can start to improve on the above baseline, perhaps periodically seeing how you do on the dev set.
#
# * `fit_softmax_classifier` is the wrapper we defined above. To assess new models, simply define more functions like this one. Such functions just need to consume a dataset `(X, y)` and return a trained model; a sketch of another such wrapper follows below.
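
# For example, a wrapper around a different sklearn model follows the same
# pattern (a sketch; the model choice and default settings here are illustrative):

from sklearn.svm import LinearSVC


def fit_linear_svc(X, y):
    mod = LinearSVC()
    mod.fit(X, y)
    return mod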

# ### A dev set run