# y     &= \textbf{softmax}(h_{n}W_{hy} + b)
# \end{align*}$$
#
# where $1 \leqslant t \leqslant n$. As indicated in the above diagram, the sequence of hidden states is padded with an initial state $h_{0}$. In our implementations, this is always an all-$0$ vector, but it can be initialized in more sophisticated ways (some of which we will explore in our unit on natural language inference).
#
# This is a potential gain over our sum-the-word-vectors baseline, in that it processes each word individually and in the context of those that came before it. Thus, not only is this model sensitive to word order, but its hidden representations give us the potential to encode how the preceding context for a word affects its interpretation.
#
# The downside of this, of course, is that this model is much more difficult to set up and optimize. Let's dive into those details.
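# Before we do, it can help to see the equations as code. The following is a minimal NumPy sketch of the forward pass they define, nothing more; the function name is ours, the weights are assumed to be given, and the real implementations below handle batching, training, and the rest:

import numpy as np

def rnn_classifier_forward(X, W_xh, W_hh, W_hy, b):
    """`X` is the sequence of input vectors x_1 ... x_n; the weight
    matrices and bias are assumed to be already trained."""
    # h_0: the all-0 initial hidden state discussed above.
    h = np.zeros(W_hh.shape[0])
    # h_t = tanh(x_t W_xh + h_{t-1} W_hh) for 1 <= t <= n, so each
    # hidden state mixes the current word with the preceding context.
    for x in X:
        h = np.tanh(x.dot(W_xh) + h.dot(W_hh))
    # y = softmax(h_n W_hy + b): read the prediction off the final state.
    scores = h.dot(W_hy) + b
    exps = np.exp(scores - scores.max())
    return exps / exps.sum()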

# ### RNN dataset preparation
#
# SST contains trees, but the RNN processes just the sequence of leaf nodes. The function `sst.build_rnn_dataset` creates datasets in this format:

# In[16]:

X_rnn_train, y_rnn_train = sst.build_rnn_dataset(
    SST_HOME, sst.train_reader, class_func=sst.ternary_class_func)

# Each member of `X_rnn_train` is a list of words, so `X_rnn_train` itself is a list of lists. Here's a look at the start of the first example:

# In[17]:

X_rnn_train[0][:6]
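# These are the leaf nodes of the corresponding SST tree, in order. As an illustration of that relationship (a sketch with a made-up sentence, assuming NLTK's `Tree` class, which the `sst` readers are built on):

from nltk.tree import Tree

toy_tree = Tree.fromstring("(4 (2 NLU) (4 (2 is) (4 enlightening)))")

# The RNN model sees only the terminal symbols, left to right:
toy_tree.leaves()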

# Because this is a classification task, `y_rnn_train` is just a list of labels, one per example:

# In[18]:

y_rnn_train[0]
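# If you want to sanity-check the full label set and its balance, a quick sketch using only the standard library:

from collections import Counter

Counter(y_rnn_train)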

# For experiments, let's build a `dev` dataset as well:
# In[19]:

X_rnn_dev, y_rnn_dev = sst.build_rnn_dataset(
    SST_HOME, sst.dev_reader, class_func=sst.ternary_class_func)
# The course repository's unit tests check that these splits have the expected sizes: 1,101 trees in the full dev set, and 6,920 train trees once the binary labeling filters out neutral examples. Lightly cleaned up (distinct names, this notebook's calling convention), they look like this:

def test_build_rnn_dataset_dev():
    X, y = sst.build_rnn_dataset(
        SST_HOME, sst.dev_reader, class_func=sst.ternary_class_func)
    assert len(X) == 1101
    assert len(y) == 1101

def test_build_rnn_dataset_train():
    X, y = sst.build_rnn_dataset(
        SST_HOME, sst.train_reader, class_func=sst.binary_class_func)
    assert len(X) == 6920
    assert len(y) == 6920
# For a more end-to-end usage example, here is an excerpt from `train_bert.py` in the bpm72/cs224u project, which uses `build_rnn_dataset` to turn the SST trees back into plain sentence strings with 0/1 labels for BERT fine-tuning. The excerpt is lightly cleaned up, with the imports it plausibly needs added (we assume Hugging Face's `transformers` for `BertTokenizer`); `get_logger`, `set_seed`, `TRAIN_FILE`, and `DEV_FILE` are defined elsewhere in that script and left as-is:
import os

import pandas as pd
from transformers import BertTokenizer

import sst
import utils

if __name__ == '__main__':

    logger = get_logger()

    set_seed(3)

    SST_HOME = os.path.join('', 'trees')

    train_df = pd.read_csv(TRAIN_FILE, encoding='utf-8', sep='\t')
    test_df = pd.read_csv(DEV_FILE, encoding='utf-8', sep='\t')
    bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # Rebuild the train split from the SST trees: join each leaf
    # sequence back into a single sentence string.
    X_train_lst, y_train_txt = sst.build_rnn_dataset(
        SST_HOME, sst.train_reader, class_func=sst.binary_class_func)
    X_train = [' '.join(words) for words in X_train_lst]

    # Map the string labels to integers: 'positive' -> 1, 'negative' -> 0.
    y_train = []
    for label in y_train_txt:
        if label == 'positive':
            y_train.append(1)
        else:
            y_train.append(0)

    # Note: this overwrites the `train_df` read from TRAIN_FILE above.
    train_df = pd.DataFrame({'sentence': X_train, 'label': y_train})

    # Same preparation for the test split:
    X_test_lst, y_test_txt = sst.build_rnn_dataset(
        SST_HOME, sst.test_reader, class_func=sst.binary_class_func)
    X_test = [' '.join(words) for words in X_test_lst]
    y_test = []