Try to guess the numerical rating that corresponds with the text review.

This one doesn't do so well; I haven't experimented much with the
configuration, and many changes could likely improve on its current
<10% prediction rate. An LSTM is maybe not so helpful for this
categorization problem.

GPU command:
    THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python imdb_lstm.py
'''
max_features = 20000
maxlen = 100  # cut texts after this number of words (among top max_features most common words)
batch_size = 16

print("Loading data...")
(X_train, y_train), (X_test, y_test), w = load_imdb_data(
    binary=False, seed=113, maxlen=maxlen, max_features=max_features)

# for categories, convert label lists to binary class arrays
nb_classes = np.max(y_train) + 1
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 256))
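The `np_utils.to_categorical` calls above one-hot encode the integer rating labels into binary class arrays. A minimal standalone sketch of that idea (the function name mirrors the Keras helper, but this version is just for illustration):

```python
import numpy as np

def to_categorical(y, nb_classes):
    # One-hot encode integer labels: label i becomes a row with a 1 in column i.
    y = np.asarray(y, dtype=int)
    Y = np.zeros((len(y), nb_classes))
    Y[np.arange(len(y)), y] = 1.0
    return Y

# e.g. four review-rating classes
labels = [0, 2, 1, 3]
print(to_categorical(labels, 4))
```

Each row sums to 1, with the hot column matching the original label; `nb_classes = np.max(y_train) + 1` sizes the matrix from the largest label seen.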
Modified version of the Keras LSTM example: trains an LSTM network on the
IMDB sentiment analysis data set. In addition to predicting on test data,
it also stores the model's weights and intermediate activation values for
the training and test data.

GPU command:
    THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python imdb_lstm.py
'''
max_features = 20000
maxlen = 100  # cut texts after this number of words (among top max_features most common words)
batch_size = 16

# had some luck with seed 111
print("Loading data...")
(X_train, y_train), (X_test, y_test), w = load_imdb_data(
    binary=True, max_features=max_features, maxlen=maxlen, seed=37)

print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 256))
model.add(LSTM(256, 128))  # try using a GRU instead, for fun
model.add(Dropout(0.5))
model.add(Dense(128, 1))
model.add(Activation('sigmoid'))
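The `maxlen = 100` cut above is the usual Keras-style sequence truncation and padding step, so every review becomes a fixed-length row. A minimal pure-Python sketch of that behavior (mirroring the default 'pre' padding/truncation of Keras's `pad_sequences`; this standalone version is an illustration, not the library call):

```python
def pad_sequences(seqs, maxlen, value=0):
    # Truncate each sequence to its last `maxlen` tokens and left-pad
    # shorter ones with `value`, so every row has length `maxlen`.
    out = []
    for s in seqs:
        s = list(s)[-maxlen:]
        out.append([value] * (maxlen - len(s)) + s)
    return out

print(pad_sequences([[1, 2, 3, 4, 5], [7, 8]], maxlen=4))
# → [[2, 3, 4, 5], [0, 0, 7, 8]]
```

Fixed-length rows are what let the batched `X_train` be a regular 2-D array feeding the `Embedding` layer.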