
txt2numpy

A series of scripts to convert a set of txt files into a set of numpy arrays stored in HDF5 files (via h5py)

About

If you have a set of txt files, as for OpenWebText or PG19, the first thing you may want to do is to convert them into a set of numpy arrays tokenized in some way. Here, I provide a set of scripts to perform this efficiently; they allow my laptop to process the entirety of OpenWebText within an hour. You can modify them to suit your needs. The result consists of two numpy arrays: (1) a one-dimensional array containing all tokens, which looks like <SEP>(DOC1)<SEP>(DOC2)<SEP>..., and (2) an array recording the locations of the separators. This format is suitable for long-range language modeling, e.g. with Reformer.
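
For concreteness, here is a minimal sketch of how the two arrays relate; the file name and the dataset names (tokens and sep_locations) are my placeholders, so check tokenize.py for the actual names it uses.

# A minimal sketch of how the two arrays relate; the file and dataset
# names below are placeholders, not necessarily the ones tokenize.py uses.
import h5py

with h5py.File("train.h5", "r") as f:
    tokens = f["tokens"][:]           # 1D array: <SEP>(DOC1)<SEP>(DOC2)<SEP>...
    sep_locs = f["sep_locations"][:]  # indices of the separator token in `tokens`

# The tokens of document i lie between consecutive separators.
i = 0
start, end = sep_locs[i] + 1, sep_locs[i + 1]
doc_i = tokens[start:end]
print(len(doc_i), "tokens in document", i)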

If you instead want a 2D numpy array (e.g. for a shorter-range language model such as BERT), each row of which is a document padded or truncated to a fixed length, you can segment the first array into documents using the second array, as sketched below. While this process is simple, I don't provide a script for it, as the main target of this repo is long-range language modeling.
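
The following sketch shows one way to do that segmentation. MAX_LEN, PAD_ID, and the truncate-then-pad policy are illustrative choices of mine, not something this repo prescribes.

# Sketch: split the 1D token array at the separator locations and
# pad/truncate each document to a fixed length. MAX_LEN, PAD_ID and the
# truncation policy are illustrative choices, not taken from this repo.
import numpy as np

MAX_LEN = 512
PAD_ID = 0

def to_2d(tokens, sep_locs):
    rows = []
    bounds = list(sep_locs) + [len(tokens)]
    for start, end in zip(bounds[:-1], bounds[1:]):
        doc = tokens[start + 1:end]            # drop the leading separator
        doc = doc[:MAX_LEN]                    # truncate long documents
        row = np.full(MAX_LEN, PAD_ID, dtype=tokens.dtype)
        row[:len(doc)] = doc                   # pad short documents
        rows.append(row)
    return np.stack(rows)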

I also provide a script that converts the first array into a form suitable for adaptive input/softmax. For example, one can combine GPT-2 BPE with adaptive input/softmax, which I found to be fast and to perform well.

Requirements

At minimum, the scripts need numpy and h5py; if you want to try a custom vocabulary, you also need huggingface/tokenizers (see below).

Concatenating txt files

For this, all you need is to run a bash command as follows (this is an example for PG19):

echo ./train/*.txt | xargs awk 'FNR==1{print "<|endoftext|>"}1' > ./train.txt

This will produce a txt file that looks like <|endoftext|>(DOC1)<|endoftext|>(DOC2)<|endoftext|>.... Contrary to intuition, <|endoftext|> marks the beginning of a document rather than the end (this follows the convention of GPT-2 BPE), so you may want to replace it with something like <|startoftext|> if you prefer, though this doesn't matter. If you do so, please make the same replacement in the remaining scripts.
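
If you are not on a Unix-like shell, a Python sketch equivalent to the command above (assuming the same ./train/*.txt layout) is:

# Python equivalent of the shell one-liner above: prepend <|endoftext|>
# to each file and concatenate everything into ./train.txt.
import glob

with open("./train.txt", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("./train/*.txt")):
        out.write("<|endoftext|>\n")
        with open(path, encoding="utf-8") as f:
            out.write(f.read())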

Tokenization

For this, it suffices to call python3 tokenize.py. It produces an h5py file containing the two types of numpy arrays described above. If you want to use the default GPT-2 BPE, please store vocab.json and gpt2-merges.txt in the same directory. If you want to try a custom vocabulary, please refer to huggingface/tokenizers for instructions and modify my tokenize.py accordingly. In particular, you need to change the dtype if your custom vocabulary size exceeds the limit of the default dtype. If it raises a MemoryError, you can decrease the default values of MAX_TOKEN_LENGTH and FREQ.
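
As a rough illustration of what this step does (not a copy of tokenize.py: the output file name, the dataset names, and processing the whole file in one go are my simplifications):

# Rough illustration of the tokenization step: encode the concatenated file
# with GPT-2 BPE and store the two arrays with h5py. A real script would
# process the file in chunks rather than all at once.
import h5py
import numpy as np
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer("vocab.json", "gpt2-merges.txt")
tokenizer.add_special_tokens(["<|endoftext|>"])  # keep the separator as a single token

with open("./train.txt", encoding="utf-8") as f:
    ids = tokenizer.encode(f.read()).ids

# uint16 suffices for the ~50k GPT-2 vocabulary; use a wider dtype if your
# custom vocabulary exceeds 65535 tokens.
tokens = np.asarray(ids, dtype=np.uint16)
sep_id = tokenizer.token_to_id("<|endoftext|>")
sep_locs = np.flatnonzero(tokens == sep_id)

with h5py.File("train.h5", "w") as out:
    out.create_dataset("tokens", data=tokens)
    out.create_dataset("sep_locations", data=sep_locs)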

For adaptive input/softmax

For this, it suffices to call python3 adaptive.py. It converts the array of the first type into a form compatible with adaptive input/softmax, while the array containing the document location information is kept intact. It also outputs how the mapping was performed, which can be used to convert the tokens back into the original vocabulary (e.g. GPT-2 BPE).
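
The core idea, sketched below with my own file and variable names (adaptive.py may differ in the details), is to renumber tokens by descending frequency so that frequent tokens get small ids, which is what adaptive input/softmax expects:

# Sketch of frequency-based renumbering for adaptive input/softmax;
# the file and dataset names are assumptions, and adaptive.py may differ.
import h5py
import numpy as np

with h5py.File("train.h5", "r") as f:
    tokens = f["tokens"][:]

vocab_size = int(tokens.max()) + 1
counts = np.bincount(tokens, minlength=vocab_size)
old_ids_by_freq = np.argsort(-counts)       # most frequent old id first

# mapping[old_id] = new_id, where new_id 0 is the most frequent token
mapping = np.empty(vocab_size, dtype=np.int64)
mapping[old_ids_by_freq] = np.arange(vocab_size)

remapped = mapping[tokens].astype(tokens.dtype)
np.save("id_mapping.npy", old_ids_by_freq)  # row new_id holds the old id, for converting back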

Caveats

Though I use PyTorch, I do not use its DataLoader class, so I'm not sure how well an h5py file works with DataLoader. If it does not work well and DataLoader is crucial to your work, you may want to convert the output into another format.
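
If you do want to try DataLoader, one common pattern is to open the h5py file lazily inside the Dataset so that each worker process gets its own file handle. A sketch under my earlier naming assumptions and an arbitrary fixed sequence length:

# Sketch of a PyTorch Dataset over the h5py file that opens the file lazily,
# so each DataLoader worker gets its own handle. SEQ_LEN and the dataset
# name "tokens" are assumptions, not values from this repo.
import h5py
import torch
from torch.utils.data import Dataset

SEQ_LEN = 1024

class TokenDataset(Dataset):
    def __init__(self, path):
        self.path = path
        self.file = None
        with h5py.File(path, "r") as f:
            self.length = len(f["tokens"]) // SEQ_LEN

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.file is None:  # opened once per worker process
            self.file = h5py.File(self.path, "r")
        chunk = self.file["tokens"][idx * SEQ_LEN:(idx + 1) * SEQ_LEN]
        return torch.from_numpy(chunk.astype("int64"))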
