
...

  1. Create an account / log in at HuggingFace and accept the terms and conditions for using the Aurora model.

  2. Create a Weights and Biases account.

  3. Create an account / log in at RunPod and add an SSH key to allow terminal access to cloud instances:


    Generate a key locally, if needed:

    Code Block
    languagetext
    $ ssh-keygen -t ed25519 -C "your_email@example.com"
    $ cat ~/.ssh/id_ed25519.pub

    Paste the contents of the public key (the .pub file) into RunPod at Account -> Settings -> Public Keys


...

  1. Start from the axolotl-runpod template and, if appropriate, select the organization account from the top-right profile drop-down list.

  2. Scroll down to the "Previous Generation" section and click Deploy on the 1x A100 80GB GPU. Leave all settings at their defaults and click the Continue and Deploy buttons on the subsequent screens. (To make training faster, the number of GPUs can be increased by clicking "1x A100 80GB" on the first screen and selecting a larger quantity.)

  3. Once the axolotl-runpod instance is ready, expand it in the Pods list and click the Connect button to get the command for "Basic SSH Terminal" access. Run this command in your local terminal to establish the connection; a representative command format is shown below.
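
    The exact command comes from the Connect dialog; purely as an illustration (the pod ID and key filename below are placeholders, not values to copy), the "Basic SSH Terminal" command generally takes this form:

    Code Block
    languagetext
    $ ssh [pod-id]@ssh.runpod.io -i ~/.ssh/id_ed25519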

...

Info

The SSH session uses TMUX. This allows for resuming the session if it drops due to a connection issue, but scrolling in TMUX does not function like a standard terminal. See the linked documentation and the TMUX Cheat Sheet for details.
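
For reference, these are the stock TMUX bindings for the two cases mentioned above, assuming the default Ctrl-b prefix has not been remapped in this image:

Code Block
languagetext
$ tmux attach     # reattach to the previous session after reconnecting, if this does not happen automatically
Ctrl-b [          # enter copy mode, then scroll with the arrow keys or PgUp/PgDn
q                 # leave copy mode and return to the live output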

Prepare the Training Data and Settings

...

Code Block
languagetext
$ huggingface-cli login
[ provide read/write token from https://huggingface.co/settings/tokens ]
$ wandb login
[ provide API key from https://wandb.ai/authorize ]
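
If preferred, both logins can also be done non-interactively by passing the credentials on the command line (the bracketed values are placeholders; note that secrets passed this way can end up in shell history):

Code Block
languagetext
$ huggingface-cli login --token [hf-token]
$ wandb login [wandb-api-key]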

...

Alternatively, scp can be used to upload data directly from a local file. Get the IP address and port from the "SSH over exposed TCP" details for the axolotl-runpod instance, available by clicking the Connect button for the instance. Then use those values in the scp command, along with the same SSH key already used for terminal access to the instance:

Code Block
languagetext
localhost:~$ scp -i ~/.ssh/[ssh-key] -P [port] [local-file] root@[axolotl-runpod-ip-addr]:/workspace/axolotl-mdel
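
For example, with purely hypothetical values of 203.0.113.10 for the IP address, 40022 for the port, id_ed25519 for the key, and train.jsonl for the local file, the command would look like this:

Code Block
languagetext
localhost:~$ scp -i ~/.ssh/id_ed25519 -P 40022 train.jsonl root@203.0.113.10:/workspace/axolotl-mdel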

...

Lines in the .yml file that have comments can be edited on subsequent training runs to improve effectiveness and performance. For the first run, be sure to set the "path" under datasets to point to the training data copied to the instance, and set a name for wandb_project, which identifies the collection of training runs in your Weights and Biases account.
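
As an illustration only, the relevant lines typically look something like the excerpt below; the dataset path, type, and project name here are placeholders, and the keys that actually appear in the provided .yml file take precedence over this sketch:

Code Block
languageyaml
datasets:
  - path: /workspace/axolotl-mdel/[training-data-file]  # point at the uploaded training data
    type: completion                                     # keep whatever type the provided .yml already uses
wandb_project: [your-wandb-project-name]                 # groups runs together in Weights and Biases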

Launch the Training

Training is launched on the compute instance with a single command:

...
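
As a general point of reference, and not necessarily the exact command used in this setup, axolotl training runs are normally started through the project's accelerate entry point, with the configuration file as the argument:

Code Block
languagetext
$ accelerate launch -m axolotl.cli.train [config].yml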

Both the terminal output and WandB dashboard will indicate when the training run is complete, at which point inference can be run.

Note

If the following error appears when launching axolotl.cli.train: Error while finding module specification for 'axolotl.cli.train' (ModuleNotFoundError: No module named 'axolotl')

Run the following additional commands within the RunPod instance:
$ pip3 install -e '.[deepspeed]'
$ pip uninstall flash_attn

This is a workaround for a temporary problem; the underlying issue should be resolved shortly, if it has not been already.


Run Inference Testing

Simple testing can be performed on the new expert model by launching an inference server with the .yml configuration file of the training run:

...
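
For orientation only (the exact flags for this setup may differ), axolotl also provides an inference entry point that can be pointed at the same configuration file; the --gradio flag additionally serves a simple web UI:

Code Block
languagetext
$ accelerate launch -m axolotl.cli.inference [config].yml --gradio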