llama2 - Inference Llama 2 in LLGo
=====

<p align="center">
  <img src="assets/llama_cute.jpg" width="300" height="300" alt="Cute Llama">
</p>

Have you ever wanted to inference a baby [Llama 2](https://ai.meta.com/llama/) model in Go? No? Well, now you can!

This is based on [llama2.c](https://github.com/karpathy/llama2.c); we didn't port anything! So it's very different from these Go implementations:
* https://github.com/nikolaydubina/llama2.go
* https://github.com/tmc/go-llama2

llgo plays a great role as a bridge, allowing the C ecosystem to be seamlessly connected to Go.

You might think that you need many-billion-parameter LLMs to do anything useful, but in fact very small LLMs can have surprisingly strong performance if you make the domain narrow enough (ref: the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) paper). This repo is a "fullstack" train + inference solution for the Llama 2 LLM, with a focus on minimalism and simplicity.

As the architecture is identical, you can also load and inference Meta's Llama 2 models. However, the current code only inferences models in fp32, so you will most likely not be able to productively load models larger than 7B.

## feel the magic

How to run this example? The simplest way is to run it without any arguments:

```bash
llgo run .
```

This uses the default model checkpoint file (`stories15M.bin`) and the default prompt (`Once upon a time`).

Before you can run it, you need to download the model checkpoint file: this 15M-parameter model trained on the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset (~60MB download):

```bash
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
```

If you want to specify a prompt (e.g. `Long long ago`):

```bash
llgo run . 'Long long ago'
```

We can also try a slightly bigger 42M-parameter model (i.e. `stories42M.bin`):

```bash
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
llgo run . -m stories42M.bin 'Long long ago'
```

There is also an even better 110M-parameter model available; see [models](#models).

Quick note on sampling: the recommendation for ~best results is to sample with `-t 1.0 -p 0.9`, i.e. temperature 1.0 (default) but also top-p sampling at 0.9 (default). Intuitively, top-p ensures that tokens with tiny probabilities do not get sampled, so we can't get "unlucky" during sampling and are less likely to go "off the rails" afterwards. More generally, to control the diversity of samples use either the temperature (i.e. vary `-t` between 0 and 1 and keep top-p off with `-p 0`) or the top-p value (i.e. vary `-p` between 0 and 1 and keep `-t 1`), but not both. Nice explainers on LLM sampling strategies include [this](https://peterchng.com/blog/2023/05/02/token-selection-strategies-top-k-top-p-and-temperature/), [this](https://docs.cohere.com/docs/controlling-generation-with-top-k-top-p) or [this](https://huggingface.co/blog/how-to-generate).
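For example, to reduce sample diversity you could lower the temperature and turn top-p off. This is a sketch that assumes the demo accepts the `-t` and `-p` flags exactly as described above, alongside `-m` and the prompt argument:

```bash
# lower temperature with top-p disabled: less diverse, more conservative samples
llgo run . -m stories15M.bin -t 0.8 -p 0 'Once upon a time'
```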
## Meta's Llama 2 models

As the neural net architecture is identical, we can also inference the Llama 2 models released by Meta. Sadly there is a bit of friction here due to licensing (I can't directly upload the checkpoints, I think). So Step 1, get the Llama 2 checkpoints by following the [Meta instructions](https://github.com/facebookresearch/llama). Once we have those checkpoints, we have to convert them into the llama2.c format. For this we need to install the Python dependencies (`pip install -r requirements.txt`) and then use the `export.py` file, e.g. for the 7B model:

```bash
python export.py llama2_7b.bin --meta-llama path/to/llama/model/7B
```

The export will take ~10 minutes or so and generate a 26GB file (the weights of the 7B model in float32) called `llama2_7b.bin` in the current directory. I would not attempt to run anything above 7B right now for two reasons: first, it has been [reported](https://github.com/karpathy/llama2.c/pull/85) that, despite efforts, 13B+ currently doesn't work because of integer overflow in pointer arithmetic, which is yet to be fixed; and second, even if it were fixed, this repo is doing float32 inference right now, so it would be fairly unusably slow. Once the export is done, we can run it:

```bash
./run llama2_7b.bin
```

This ran at about 4 tokens/s compiled with OpenMP on 96 threads on my CPU Linux box in the cloud. (On my MacBook Air M1, it's currently closer to 30 seconds per token if you just build with `make runfast`.) Example output:

> The purpose of this document is to highlight the state-of-the-art of CoO generation technologies, both recent developments and those in commercial use. The focus is on the technologies with the highest merit to become the dominating processes of the future and therefore to be technologies of interest to S&T ... R&D. As such, CoO generation technologies developed in Russia, Japan and Europe are described in some depth. The document starts with an introduction to cobalt oxides as complex products and a short view on cobalt as an essential material. The document continues with the discussion of the available CoO generation processes with respect to energy and capital consumption as well as to environmental damage.

base models... ¯\\_(ツ)_/¯. Since we can inference the base model, it should be possible to also inference the chat model quite easily and have a conversation with it. And if we can find a way to run 7B more efficiently, we can start adding LoRA to our training script and go wild with finetunes, all within the repo!

You can also try Meta's Code Llama models, even though support for them is incomplete. In particular, some hyperparameters changed (e.g. the constant in the RoPE layer), so the inference is not exactly correct and a bit buggy right now. Looking into fixes. Make sure to build the tokenizer for the plain and instruct variants and pass it when doing inference.

```bash
python export.py codellama2_7b.bin --meta-llama /path/to/CodeLlama-7b
python tokenizer.py --tokenizer-model=/path/to/CodeLlama-7b/tokenizer.model
./run codellama2_7b.bin -z /path/to/CodeLlama-7b/tokenizer.bin
```

Chat with Code Llama Instruct:

```bash
python export.py codellama2_7b_instruct.bin --meta-llama /path/to/CodeLlama-7b-Instruct
python tokenizer.py --tokenizer-model=/path/to/CodeLlama-7b-Instruct/tokenizer.model
./run codellama2_7b_instruct.bin -m chat -z /path/to/CodeLlama-7b-Instruct/tokenizer.bin
```

## huggingface models

We can load any huggingface models that use the Llama 2 architecture. See the script [export.py](export.py) and the `--hf` flag to export the model .bin file.
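As a rough sketch, an export might look like the following. The output filename follows the earlier `export.py` examples, while the path passed to `--hf` is only a placeholder, so check `export.py` for the exact argument it expects:

```bash
# Hypothetical example: export a HuggingFace-format Llama 2 checkpoint to the llama2.c .bin format.
python export.py llama2_7b_hf.bin --hf /path/to/huggingface/llama-2-7b-hf
# the resulting .bin can then be passed to the demo with -m
llgo run . -m llama2_7b_hf.bin 'Once upon a time'
```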
## models

For the sake of examples of smaller, from-scratch models, I trained a small model series on TinyStories. All of these were trained in a few hours on my training setup (4X A100 40GB GPUs); the 110M took around 24 hours. I am hosting them on the huggingface hub [tinyllamas](https://huggingface.co/karpathy/tinyllamas), both in the original PyTorch .pt format and in the llama2.c .bin format:

| model | dim | n_layers | n_heads | n_kv_heads | max context length | parameters | val loss | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 260K | 64 | 5 | 8 | 4 | 512 | 260K | 1.297 | [stories260K](https://huggingface.co/karpathy/tinyllamas/tree/main/stories260K) |
| OG | 288 | 6 | 6 | 6 | 256 | 15M | 1.072 | [stories15M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin) |
| 42M | 512 | 8 | 8 | 8 | 1024 | 42M | 0.847 | [stories42M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin) |
| 110M | 768 | 12 | 12 | 12 | 1024 | 110M | 0.760 | [stories110M.bin](https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin) |

You'll notice that the 110M model is equivalent to GPT-1 in size. Alternatively, it is also the smallest model in the GPT-2 series (`GPT-2 small`), except the max context length is only 1024 instead of 2048. The only notable changes from the GPT-1/2 architecture are that Llama uses RoPE relative positional embeddings instead of absolute/learned positional embeddings, a slightly fancier SwiGLU non-linearity in the MLP, RMSNorm instead of LayerNorm, bias=False on all Linear layers, and is optionally multiquery.


## training

Let's see how we can train a baby Llama 2 from scratch using the code in this repo. First let's download and pretokenize a source dataset, e.g. I like [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories), so this is the only example currently available in this repo. But it should be very easy to add datasets; see the code.

```bash
python tinystories.py download
python tinystories.py pretokenize
```

Then train our model:

```bash
python train.py
```

**brief training guide**. See the train.py script for more exotic launches and hyperparameter overrides. Here is a brief guide to how to set the parameters. Look at the table at the very end of the [Chinchilla paper](https://arxiv.org/abs/2203.15556) to get a sense of how the Transformer parameters (dim, n_layers, n_heads) grow or shrink together. Extrapolate/interpolate this pattern to get bigger or smaller transformers. Set the max context length however you wish, depending on the problem: this should be the max number of tokens that matter to predict the next token. E.g. Llama 2 uses 2048. Next, you want the _total_ batch size per update (printed by the script as "tokens per iteration will be:") to be somewhere around 100K tokens for medium-sized applications. For tiny applications it could be lower; for large training (e.g. GPTs/Llamas) it is usually ~0.5M, or even more. You get there by first maxing out the batch_size to whatever your system allows (e.g. mine was 16 in a recent run because beyond that my GPU runs out of memory), and then increasing gradient_accumulation_steps as high as necessary to reach a total batch size of ~100K. Finally, you want to tune your learning_rate (LR). You want this to be as high as your training allows. Very small networks can get away with a large LR (e.g. 1e-3 or even higher); large networks need lower LRs. 3e-4 is a safe choice in most medium-sized applications, but can be too low for small networks, so try to increase it! Finally, max_iters is the length of training. Play with different settings. I mostly only ever tune these parameters and leave most of the others unchanged.

Here is an example of how I trained the 110M model, which I don't think is anywhere near optimal, but looked sensible to me: dim 768, n_layers 12, n_heads 12 (so the size of each head is 768 / 12 = 64 channels), seq len of 1024, batch size 16 (this is the most that fit my A100 40GB GPU), and gradient_accumulation_steps = 8, which was needed to get the total batch size to 16 batch size * 1024 tokens in sequence * 8 grad_accum = 131,072 tokens per update. Good. Learning rate 4e-4 (probably a little too low). max_iters 200K (probably a bit too high). Dropout 0.1, as that usually helps a bit at medium size. That was it. I ran using Distributed Data Parallel (DDP) on 4 GPUs on my cloud machine; training took about a day.
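If train.py accepts `--key=value` hyperparameter overrides (an assumption; check the script for the exact flag names), the 110M run above could be launched roughly like this:

```bash
# Hypothetical sketch of the 110M training run described above;
# flag names are assumed to mirror the hyperparameters in train.py.
python train.py \
  --dim=768 --n_layers=12 --n_heads=12 \
  --max_seq_len=1024 --batch_size=16 \
  --gradient_accumulation_steps=8 \
  --learning_rate=4e-4 --max_iters=200000 --dropout=0.1
```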
Totally understandable if you want to skip model training; for a simple demo, just download one of the pretrained models (see the [models](#models) section), e.g.:

```bash
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
```

Once we have the model .bin file, we can inference it in C through llgo:

```bash
llgo run . -m stories15M.bin
```
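The same command should work with any checkpoint from the [models](#models) table, for example the larger 110M model (a bigger download and slower to sample from):

```bash
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin
llgo run . -m stories110M.bin 'Once upon a time'
```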