""" ### Applying the hashing trick to an integer categorical feature If you have a categorical feature that can take many different values (on the order of 10e3 or higher), where each value only appears a few times in the data, it becomes impractical and ineffective to index and one-hot encode the feature values. Instead, it can be a good idea to apply the "hashing trick": hash the values to a vector of fixed size. This keeps the size of the feature space manageable, and removes the need for explicit indexing. """ # Sample data: 10,000 random integers with values between 0 and 100,000 data = np.random.randint(0, 100000, size=(10000, 1)) # Use the Hashing layer to hash the values to the range [0, 64] hasher = preprocessing.Hashing(num_bins=64, salt=1337) # Use the CategoryEncoding layer to one-hot encode the hashed values encoder = preprocessing.CategoryEncoding(max_tokens=64, output_mode="binary") encoded_data = encoder(hasher(data)) print(encoded_data.shape) """ ### Encoding text as a sequence of token indices This is how you should preprocess text to be passed to an `Embedding` layer. """ # Define some text data to adapt the layer data = tf.constant([ "The Brain is wider than the Sky", "For put them side by side",