About 1-2 GB less. Very interesting data, and to me in line with Apple Silicon. Now, I sadly do not know enough about the 7900 XTX to compare. A 13B 16K model uses 18 GB of VRAM, so the 4080 will have issues if you need the context. This is the repository for the 13B pretrained model, converted for the Hugging Face Transformers format. The RTX 4090's training throughput per watt is close to the RTX 3090's, despite its high 450 W power draw.

Jul 21, 2023 · In this tutorial, we will walk you through the process of fine-tuning LLaMA 2 models, providing step-by-step instructions.

Jul 20, 2023 · Several Llama-2 derivatives fine-tuned for long 16K-32K contexts have started to be released. Specify the file path of the mount where the Llama 2 model will live on your host machine; e.g., if your downloaded Llama 2 model directory resides in your home path, enter /home/[user]. Specify the Hugging Face username and API key secrets. On an A6000, TinyChat achieves roughly 3.3x-3.7x speedups for LLaMA-2, Vicuna, MPT, and Falcon models. Access Llama 3 from Meta Llama 3 on Hugging Face or my Hugging Face repos (Xiongjie Dai).

6 days ago · The current approach to fine-tuning the Llama-2 chat model with two RTX 4090 GPUs involves experimenting with various techniques and configurations to achieve consistent results. This may include adjusting the learning rate, batch size, and other hyperparameters, as well as exploring different data augmentation and regularization methods. 30-series and later NVIDIA GPUs should be well supported, but anything Pascal or older with poor FP16 support isn't going to perform well.

Aug 31, 2023 · I was stoked to check out Code Llama, but it was pretty intimidating to get everything up and running. The specs: 2x RTX 4090 Founders Edition. All the code related to this article is available in our dedicated GitHub repository. A new exciting announcement from Answers.ai: they demonstrated a way to train a larger model, such as Llama 2 70B, on 48 GB of GPU RAM. The llama-65b-4bit should run on a dual 3090/4090 rig. Furthermore, it can run on multiple GPUs, so it is possible to train a model on a 2x 4090 instance! 4x 4090 will draw 1.8 kW if left unchecked. We evaluate Sequoia with LLMs of various sizes (including Llama2-70B-chat, Vicuna-33B, Llama2-22B, InternLM-20B, and Llama2-13B-chat) on a 4090 and a 2080 Ti, prompted by MT-Bench. This will launch the respective model within a Docker container, allowing you to interact with it through a command-line interface. In any situation where you compare them 1v1, a 4090 wins over a 3090.

Train a 70B language model on 2x RTX 4090 with QLoRA and FSDP — overview. Llama-2-7b-Chat-GPTQ can run on a single GPU with 6 GB of VRAM. The goal of this repository is to provide examples to quickly get started with fine-tuning for domain adaptation and how to run inference for the fine-tuned models. Parameters start at 7 billion, but models up to the quite capable 70-billion-parameter version have been released not only to academia but also for commercial use by companies, drawing a lot of attention. But you can run Llama 2 70B 4-bit GPTQ on 2x 24 GB, and many people are doing this. While training, it can be up to 2x faster. Use `llama2-wrapper` as your local Llama 2 backend for generative agents and apps. It's doable with blower-style consumer cards, but still less than ideal - you will want to throttle the power usage.

Apr 18, 2024 · The Llama 3 release introduces 4 new open LLM models by Meta based on the Llama 2 architecture.
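The 2x 4090 QLoRA-plus-FSDP recipe mentioned above is only described at a high level, so here is a minimal, hedged sketch of the usual starting point: loading a Llama 2 checkpoint in 4-bit NF4 with bitsandbytes and attaching a LoRA adapter with PEFT. The model ID, target modules, and hyperparameters are illustrative assumptions rather than values taken from the original posts, and a real multi-GPU FSDP run needs additional launcher and sharding configuration on top of this.

```python
# Hypothetical QLoRA-style setup: 4-bit NF4 base weights + trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"  # assumed; the posts discuss 13B/70B as well

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,          # illustrative values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```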
Nov 8, 2023 · This blog post explores methods for enhancing the inference speed of the Llama 2 series of models with PyTorch's built-in enhancements, including direct high-speed kernels, torch.compile's transformation capabilities, and tensor parallelization for distributed computation.

If you're using the GPTQ version, you'll want a strong GPU with at least 10 GB of VRAM. You need 2x 80 GB GPUs, 4x 48 GB GPUs, or 6x 24 GB GPUs to run fp16. An AMD 6900 XT, RTX 2060 12GB, RTX 3060 12GB, or RTX 3080 would do the trick. In tasks that can utilize 2 cards, dual 3090 wins. ROCm is also theoretically supported (via HIP), though I currently have no AMD hardware to test. Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters.

Jul 28, 2023 · Running: torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4. Make sure you have downloaded the 4-bit model from Llama-2-7b-Chat-GPTQ and set the MODEL_PATH and arguments in .env, like the example .env.7b_gptq_example. I am developing on an RTX 4090 and an RTX 3090 Ti. Subreddit to discuss about Llama, the large language model created by Meta AI.

Aug 24, 2023 · Code Llama launch post: https://about.fb.com/news/2023/08/code-llama-ai-for-coding/ — Code Llama technical paper: https://ai.meta.com/research/publications/co… Navigate to the main llama.cpp folder using the cd command. On the consumer-level 4090, we achieve similar speedups: up to 3.2x for LLaMA-2-chat models, up to 3.3x for Vicuna models, and 2.4x for MPT and Falcon models. In addition, we implement a CUDA version, where the transformer is implemented as a number of CUDA kernels. Meta-Llama-3-8b: base 8B model. Speed-wise, I don't think either can get 40 t/s. They come in two sizes, 8B and 70B parameters, each with base (pretrained) and instruct-tuned versions. CUDA kernel errors might be asynchronously reported at some other API call, so the stack trace below might be incorrect. Moreover, it's possible to apply multiple quantization levels to each linear layer, producing something akin to sparse quantization wherein more important weights (columns) are quantized with more bits. Speed was: `Output generated in 424.31 seconds (3.54 tokens/s, 1504 tokens, context 33, seed 1719700952)`.

Taking the RTX 6000 Ada, RTX A6000, Tesla A100 80G, Mac Studio 192G, and RTX 4090 24G as examples; related material: https://tw.leaderg.com/article/index?sn=11937 (instructor: 李明達老師). Also, grouped-query attention (GQA) has now been added to Llama 3 8B as well. There are different methods that you can follow. Method 1: Clone this repository and build locally; see how to build. All the variants can be run on various types of consumer hardware and have a context length of 8K tokens. AutoGPTQ or GPTQ-for-LLaMa are better options at the moment for older GPUs. With these long-context models, you should be able to reference longer contexts more accurately than with the original Llama-2. Our benchmarks show the tokenizer offers improved token efficiency, yielding up to 15% fewer tokens compared to Llama 2. Thanks to shawwn for the LLaMA model weights (7B, 13B, 30B, 65B): llama-dl.

Jul 20, 2023 · We've shown how easy it is to spin up a low-cost ($0.60 per hour) GPU machine to fine-tune the Llama 2 7B models. Try out the -chat version, or any of the plethora of fine-tunes (Guanaco, Wizard, Vicuna, etc.).
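Several snippets above assume the pre-quantized Llama-2-7b-Chat-GPTQ checkpoint. As a rough illustration (not the exact setup from any of the quoted projects), recent Transformers versions can load such GPTQ weights directly when AutoGPTQ/Optimum are installed; the repository name, prompt, and generation settings below are assumptions.

```python
# Sketch: load a 4-bit GPTQ checkpoint and generate a short reply.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "TheBloke/Llama-2-7b-Chat-GPTQ"  # or a local MODEL_PATH from your .env (assumed name)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")  # weights stay 4-bit

prompt = "[INST] Explain what GPTQ quantization does. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```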
For instance, one can use an RTX 3090, an ExLlamaV2 model loader, and a 4-bit quantized LLaMA or Llama-2 30B model, achieving approximately 30 to 40 tokens per second, which is huge. This was a major drawback, as the next-level graphics cards, the RTX 4080 and 4090 with 16 GB and 24 GB, cost around $1.6K and $2K for the card alone, which is a significant jump in price and a higher investment. In the paper presenting the model, Llama 2 demonstrates impressive capabilities on public benchmarks for various natural language generation and coding tasks. Sequoia can speed up LLM inference for a variety of model sizes and types of hardware. Spinning up the machine and setting up the environment takes only a few minutes, and downloading the model weights takes ~2 minutes at the beginning of training. On many tasks, fine-tuned Llama can outperform GPT-3.5-turbo or even GPT-4, but with a naive approach to serving (like Hugging Face + FastAPI) you will have a hard time beating… The 'llama-recipes' repository is a companion to the Llama 2 model. In particular, it drew a lot of attention by signing a preferential deal with Microsoft.

Feb 13, 2024 · Now, these groundbreaking tools are coming to Windows PCs powered by NVIDIA RTX for local, fast, custom generative AI. Besides, TinyLlama is compact, with only 1.1B parameters. First, navigate to the Llama 2 directory… Firstly, you need to get the binary. In general, it can achieve the best performance, but it is also the most resource-intensive and time-consuming: it requires the most GPU resources and takes the longest. It loads entirely! Remember to pull the latest ExLlama version for compatibility :D. Method 3: Use a Docker image; see the documentation for Docker. The hardware platforms have different GPUs, CPUs…

Aug 4, 2023 · Llama 2 is a state-of-the-art large language model (LLM) released by Meta. I get a memory error: RuntimeError: CUDA error: out of memory.

Sep 13, 2023 · Why this is interesting: to my knowledge, this is the first time you are able to run the largest Llama, at competitive speed, on a consumer GPU (or something like an A40). Just like its predecessor, Llama-2-Ko operates within the broad range of generative text models that stretch from 7 billion to 70 billion parameters. This means you can start fine-tuning within 5 minutes using really simple… The 4080 is not a good choice. The fine-tuning data includes publicly available instruction datasets, as well as over one million new human-annotated examples. torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 23.65 GiB total capacity; 22.68 GiB already allocated; 41.31 MiB free; 23.14 GiB reserved in total by PyTorch). If reserved memory is >> allocated memory, try setting max_split_size_mb to avoid fragmentation. Once the model download is complete, you can start running the Llama 3 models locally using Ollama. (Data drive) 4x VROC RAID 0 Micron 9300 Max (12.8 TB each / 51.2 TB array), VROC Premium key. Navigate to the code/llama-2-[XX]b directory of the project.

Oct 3, 2023 · We adopted exactly the same architecture and tokenizer as Llama 2. The 7900 XTX I am not sure about, as that uses ROCm. Stanford Alpaca 7B was obtained by fully fine-tuning LLaMA.

Dec 6, 2023 · Download the specific Llama-2 model (Llama-2-7B-Chat-GGML) you want to use and place it inside the "models" folder. If you add a GPU FP32 TFLOPS column (pure GPU numbers are not comparable across architectures), PP F16 scales with TFLOPS (FP16 with FP32 accumulate = 165.2 TFLOPS for the 4090), while TG F16 scales with memory bandwidth (1008 GB/s for the 4090).
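The GGML/GGUF route described above (dropping a quantized chat model into a "models" folder) is typically driven through llama.cpp or its Python binding. Here is a small sketch using llama-cpp-python; the exact model filename is an assumption, and newer llama.cpp builds expect GGUF rather than the older GGML format.

```python
# Sketch: run a quantized Llama 2 chat model from the local "models" folder via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b-chat.Q4_K_M.gguf",  # assumed filename
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU if VRAM allows
)

out = llm(
    "Q: Roughly how much VRAM does a 4-bit 13B model need? A:",
    max_tokens=128,
    stop=["Q:"],
)
print(out["choices"][0]["text"])
```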
This compactness allows it to cater to a multitude of applications demanding a restricted computation and memory footprint. Between a 4090 and two 3060s, you would be at 48 GB of VRAM without shattering the piggy bank and grinding it into dust. Chat with RTX, now free to download, is a tech demo that lets users personalize a chatbot with their own content, accelerated by a local NVIDIA GeForce RTX 30-series GPU or higher with at least 8 GB of video random access memory.

Jul 23, 2023 · After setting up the environment and downloading the Llama 2 model, you are ready to use the model for inference. Meta Llama 2 should be next in the pipe. Llama-2-Ko serves as an advanced iteration of Llama 2, benefiting from an expanded vocabulary and the inclusion of a Korean corpus in its further pretraining. Meta also released Chat versions of Llama 2.

Oct 31, 2022 · 24 GB of memory, priced at $1,599. The cheapest Studio with 64 GB of RAM is $2,399.00 (USD). Then, open your fine-tuning notebook… The RTX 3090 is a little (1-3%) faster than the RTX A6000, assuming what you're doing fits in 24 GB of VRAM. Method 2: If you are using macOS or Linux, you can install llama.cpp via brew, flox, or nix. The difference is pretty big. Full-parameter fine-tuning is a method that fine-tunes all the parameters of all the layers of the pre-trained model. Prompt eval rate comes in at 192 tokens/s. And then I enabled it and gathered other results. I can get up to 1500 tokens returned before it OOMs on 2x 4090. llama-recipes provides an interface that lets users define their own training-data input format; the get_preprocessed_arithmetic function in dataset.py shows how to read custom data and convert it into inputs for the Llama 2 model. Once the data-processing function is ready, users can select it via the --dataset and --custom_dataset.file arguments (see the sketch below). Similar on the 4090 vs A6000 Ada case.

Llama 2 Uncensored is a 7B-parameter model that is about 3.8 GB on disk. These chat models can be used as chatbots. Supported hardware platform(s): RTX 4090; supported operating system(s): Windows. This example demonstrates how to achieve faster inference with the Llama 2 models by using the open-source project vLLM. Training & fine-tuning — dataset: Llama 2 was pretrained on 2 trillion tokens of data from publicly available sources. I think htop shows ~56 GB of system RAM used, as well as about 18-20 GB of VRAM for offloaded layers. With max_batch_size set to 3, the model starts at 19 GB and expands to 22 GB during use. Llama 2 is an open-source LLM family from Meta. Pretty much, if you don't think you'll be able to get NVIDIA P2P working, and your tasks can't be parallelized between GPUs, go with a…

Mar 19, 2023 · As an example, the 4090 (and other 24 GB cards) can all run the LLaMA-30B 4-bit model, whereas the 10-12 GB cards are at their limit with the 13B model.

Dec 19, 2023 · In fact, a minimum of 16 GB is required to run a 7B model, which is a basic Llama 2 model provided by Meta. Also, just FYI, the Llama-2-13b-hf model is a base model, so you won't really get a chat or instruct experience out of it. However, to run the larger 65B model, a dual-GPU setup is necessary. I run llama2-70b-guanaco-qlora-ggml at q6_K on my setup (R9 7950X, 4090 24 GB, 96 GB RAM) and get about ~1 t/s with some variance, usually a touch slower. The project is currently intended for research use.
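As a rough template for the custom-dataset interface described above (in the spirit of get_preprocessed_arithmetic from llama-recipes' dataset.py), the sketch below builds tokenized prompt/answer pairs and masks the prompt tokens out of the loss. The function name, record fields, and exact signature expected by llama-recipes are assumptions.

```python
# Hypothetical custom-dataset preprocessing function for instruction-style fine-tuning.
from datasets import Dataset

def get_custom_dataset(dataset_config, tokenizer, split):
    raw = [{"question": "12 + 7 =", "answer": "19"}]  # replace with your own records

    def to_features(sample):
        prompt = f"Solve the problem.\n{sample['question']}\n"
        prompt_ids = tokenizer.encode(tokenizer.bos_token + prompt, add_special_tokens=False)
        answer_ids = tokenizer.encode(sample["answer"] + tokenizer.eos_token, add_special_tokens=False)
        return {
            "input_ids": prompt_ids + answer_ids,
            "attention_mask": [1] * (len(prompt_ids) + len(answer_ids)),
            # mask the prompt tokens so loss is only computed on the answer
            "labels": [-100] * len(prompt_ids) + answer_ids,
        }

    return Dataset.from_list(raw).map(to_features, remove_columns=["question", "answer"])
```

Masking the prompt with -100 labels keeps the loss focused on the answer tokens, which is the usual convention for instruction-style fine-tuning.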
2x Xeon 8280L (56/112 cores), ASUS C621 Sage dual-socket motherboard, DDR4 ECC LRDIMMs. For Llama 3 70B: ollama run llama3-70b. This is the repository for the 70B pretrained model, converted for the Hugging Face Transformers format. For ease of use, the examples use Hugging Face converted versions of the models. Also, image generation currently cannot use more than one GPU. This is a pure Java implementation of standalone Llama 2 inference, without any dependencies. PEFT, or Parameter-Efficient Fine-Tuning, allows a model to be adapted by training only a small fraction of its parameters.

Mar 6, 2023 · Tested on the 7B version: OK. 13B: CUDA out of memory. For Llama 3 8B: ollama run llama3-8b.

Nov 23, 2023 · In this video we run Llama models using the new M3 Max with 128 GB and compare it with an M1 Pro and an RTX 4090 to see the real-world performance of this chip for AI. The size of Llama 2 70B fp16 is around 130 GB, so no, you can't run Llama 2 70B fp16 with 2x 24 GB. Fine-tuning the LLaMA 2 model on an RTX 4090 (#llama2 #RTX4090 #LLM): in this video I'll share how you can use large language models like Llama 2 on your local machine without GPU acceleration, which means you can run the Llama…

Feb 2, 2024 · This GPU, with its 24 GB of memory, suffices for running a Llama model. Model details — note: use of this model is governed by the Meta license. You can enable 8-bit loading either in the settings or with "--load-in-8bit" on the command line when you start the server. Running it locally via Ollama with the command: % ollama run llama2-uncensored — Llama 2 Uncensored, M3 Max performance. For the CPU inference (GGML/GGUF) format, having… We benchmarked Llama 2 7B and 13B with 4-bit quantization on an NVIDIA GeForce RTX 4090 using profile_generation.py. This article describes how to fine-tune Meta AI's open-source Llama 2 large language model on a single GPU, sharing the experimental process and results as well as some of the problems encountered and their solutions.

Apr 18, 2024 · Llama 3 will soon be available on all major platforms, including cloud providers, model API providers, and much more. The RTX 4090's training throughput and training throughput per dollar are significantly higher than the RTX 3090's across the deep learning models we tested, including use cases in vision, language, speech, and recommendation systems. The topmost GPU will overheat and throttle massively. LLaMA-2 was released on July 18, 2023. Worst case, use a PCIe riser (make sure it is a reputable Gen4 one). Links to other models can be found in the index at the bottom. I happened to do this yesterday, testing the Dromedary 65B 4-bit GPTQ I'd just uploaded to HF. You can get 90% of the performance at 300 W, so 1.2 kW is maybe more fair. Now, the RTX 4090, when doing inference, is 50-70% faster than the RTX 3090. If you can effectively make use of 2x 3090 with NVLink, they will beat out the single 4090. Llama 3 will be everywhere.

Mar 22, 2023 · Training LLaMA-13B-4bit on a single RTX 4090 with finetune.py (using the PyTorch 2 beta to support the requisite CUDA 11.8 for compute capability 8.9) finishes 3 epochs in only a minute.

Sep 14, 2023 · Before submitting, the following items must be checked: make sure you are using the latest code from the repository (git pull), as some issues have already been resolved and fixed; confirm that you have read the project documentation and the FAQ section and searched the existing issues without finding a similar problem or solution. For third-party plugin issues (e.g. llama.cpp, LangChain, text-generation-webui), it is also recommended to look for solutions in the corresponding projects.
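The "--load-in-8bit" remark above corresponds to bitsandbytes 8-bit loading, which roughly halves the VRAM needed compared with fp16. A minimal sketch, assuming a Hugging Face-format chat checkpoint (the model name is illustrative):

```python
# Sketch: load a Llama 2 chat model in 8-bit with bitsandbytes and check its memory footprint.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-chat-hf"  # assumed

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # ~half the VRAM of fp16
    device_map="auto",
)
print(f"Footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
```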
That card is actually power-limited because of the 8 nm process, so a 4-card setup like some people recommended might power-spike at over 3 kW. 1600 W digital power supply. This means TinyLlama can be plugged and played in many open-source projects built upon Llama. It might also theoretically allow us to run LLaMA-65B on an 80 GB A100, but I haven't tried this. With max_batch_size set to 1, the model starts at 15 GB. That said, here is a tutorial on what worked for me… It gives you better processing and less heat, and you can augment it with a 3060 12GB to expand the memory pool. Open the Windows Command Prompt by pressing the Windows key + R, typing "cmd", and pressing Enter.

Aug 19, 2023 · If you have a bigger GPU, for instance with 24 GB of VRAM (RTX 3080/3090 or 4080/4090), it may also work with the 13B version of Llama 2. But why does running example.py return out-of-memory on a 24 GB VRAM card? Any help will be appreciated! Thanks!

Apr 8, 2023 · Chinese LLaMA & Alpaca large language models with local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs) - a single 3090 or 4090 can run local inference of LLaMA-13B plus a Chinese LoRA in 4-bit.

Jul 19, 2023 · Chinese LLaMA-2 & Alpaca-2 phase-two project, plus 64K ultra-long-context models (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models) - ymcui/Chinese-LLaMA-Alpaca-2. Depends on what you want for speed, I suppose.

Dec 15, 2023 · The 4090, for example, had roughly a 4.4x lead at 768x768 images. Run any Llama 2 locally with a Gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Access LLaMA 2 from Meta AI.

Jul 28, 2023 · I have a 4090 with 24 GB and it barely works with llama-2-7b-chat. This guide will run the chat version of the models, and… A 4-bit 70B model should take about 36-40 GB of RAM, so a 64 GB Mac Studio might still be price-competitive with a dual 4090 or a 4090/3090 split setup. The larger models like LLaMA-13B and LLaMA-30B run quite well at 4-bit on a 24 GB GPU. But to do anything useful, you're going to want a powerful GPU (RTX 3090, RTX 4090, or A6000) with as much VRAM as possible. All the results are measured for single-batch inference. Since the opportunity arose, I had been meaning to try out Llama 2.

Jul 18, 2023 · Meta Llama: https://ai.meta.com/llama/ — demo (试用版): https://www.llama2.ai — Replicate: https://replicate.com/a16z-infra/llama13b-v2-chat (this video does not constitute any investment advice). 165B models also exist, which would… Since I had it disabled, I made some tests first. Unlike diffusion models, LLMs are very memory-intensive, even at 4-bit GPTQ. The RTX 4090 also has several other advantages over the RTX 3090, such as a higher core count, higher memory bandwidth, and a higher power limit. This guide shows how to accelerate Llama 2 inference using the vLLM library for the 7B and 13B models, and multi-GPU vLLM for the 70B.
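Since the last snippet points to vLLM for accelerating Llama 2 inference, here is a minimal sketch of the library's offline API; the model name and sampling settings are assumptions, and for a 70B model split across two 24 GB cards the relevant knob would be tensor_parallel_size=2.

```python
# Sketch: batched offline inference with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # add tensor_parallel_size=2 for multi-GPU
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(["Explain why batched inference improves GPU utilization."], params)
print(outputs[0].outputs[0].text)
```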
Hey, I am seeing very slow inference using the exact same code I run on two other otherwise-identical computers (but those are Ryzen 7900X, versus this one being an Intel 13900K). I run it through llama.cpp and llama-cpp-python, and BLAS is correctly set up, so it is indeed utilizing… Just plug it into the second PCIe slot; if you have a 13900K, there is no way you don't have a second GPU slot. My 4090 gets about 50 tokens/s; a 4090 is 60% bigger than a 4080. This is a fork of the LLaMA code that runs LLaMA-13B comfortably within 24 GiB of RAM. It relies almost entirely on the bitsandbytes and LLM.int8() work of Tim Dettmers. And we measure the token generation throughput (tokens/s) by setting a single prompt token and generating 512 tokens. These factors make the RTX 4090 a superior GPU that can run the LLaMA-2 70B model for inference using ExLlama with more context length and faster speed than the RTX 3090. You can reproduce all the experiments with OVHcloud AI Notebooks. All speedup numbers are benchmarked against the FP16 baseline. But enabling this setting makes a huge improvement when more than one GPU is working at the same time, and sometimes on a single GPU as well. The format allows for mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight.

Mar 6, 2023 · I have seen someone in this issue's message area say that the 7B model just needs 8.5 GB of VRAM. If you want to run a 4-bit Llama 2 model like Llama-2-7b-Chat-GPTQ, you can set your BACKEND_TYPE to gptq in the .env file. The eval rate of the response comes in at 64 tokens/s. By leveraging a 4-bit quantization technique, LLaMA Factory's QLoRA further improves GPU memory efficiency. Here is how you can proceed… You really don't want these push-pull-style coolers stacked right against each other. Need help finding the cause of slow inference (RTX 4090), only utilizing some 40% of the GPU. Llama 2 70B GPTQ at full context on 2x 3090: settings used are split 14,20; max_seq_len 16384; alpha_value 4. Compared to ChatGLM's P-Tuning, LLaMA Factory's LoRA tuning offers up to 3.7x faster training speed with a better ROUGE score on the advertising-text generation task. For beefier models like Llama-2-13B-German-Assistant-v4-GPTQ, you'll need more powerful hardware.
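The throughput methodology described above (a single prompt token, 512 generated tokens) can be approximated with plain Transformers as in the hedged sketch below; the model ID is an assumption, and the profile_generation.py script mentioned in the snippets will differ in its details.

```python
# Sketch: measure token-generation throughput by generating 512 tokens from a single prompt token.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumed; the quoted benchmark covers 7B and 13B
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

input_ids = torch.tensor([[tokenizer.bos_token_id]], device=model.device)  # single prompt token

torch.cuda.synchronize()
start = time.time()
model.generate(input_ids, max_new_tokens=512, min_new_tokens=512, do_sample=False)
torch.cuda.synchronize()

print(f"{512 / (time.time() - start):.1f} tokens/s")
```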