The GridRepublic LLM Service is powered by a distributed network of LLM servers. 

The core server, available as a Docker application, provides a local interactive inference service via a simple command-line interface. A range of models is supported.

System requirements

  • Docker Engine must be installed and running on the target system (a quick check is shown after this list)
  • Network connectivity is needed so that the Docker image and any specified models can be downloaded automatically
  • Disk space and RAM:
    • The application container itself requires 600 MB of disk space and will use 400 MB of system memory to run.
    • In addition to the above, the model selected will have its own resource requirements. See the instructions on Choosing a model for further details.
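
To confirm that Docker Engine is installed and running before launching the server, the following standard Docker commands can be used (these checks are generic to Docker, not specific to the LLM server):

$ docker --version
$ docker info

If "docker info" reports an error such as "Cannot connect to the Docker daemon", the Docker service is not running and should be started before proceeding.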

Installation and basic usage

For testing purposes, the server application can be launched with a single command:

$ docker run -it --rm gridrepublic/llm-server

By default, the server will download the Zephyr 7B model upon launch. Once the model has been pulled and initialized, a prompt will appear and wait for input:

>>>

Choosing a model

The default Zephyr 7B model requires a bit over 4 GB of disk space and at least 4.5 GB of RAM to run (in addition to the application container requirements outlined above). Depending on the resources available, smaller, lighter-weight models or larger, more performant models can be used instead.

A variety of supported models can be found at: https://ollama.com/library 

As a general rule, a system should have more bytes of disk space and memory available than the chosen model has parameters; e.g. to run a 7-billion-parameter model, use a system with at least 8 GB of disk space and 8 GB of RAM.
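
To check the resources available on the host before selecting a model, standard operating system tools are sufficient; for example, on a typical Linux host, "df -h ~" reports free disk space on the file system holding the home directory, and "free -h" reports total and available memory:

$ df -h ~
$ free -h

The equivalent checks on macOS or Windows differ, but any tool that reports free disk space and RAM will do.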

The model that the llm-server container uses is determined by the MODEL environment variable set when the container is launched via "docker run". To select another model from the library, add the following option to the docker command:

-e MODEL="[name]"

For example, to use the lightweight Gemma 2B model:

$ docker run -it --rm -e MODEL="gemma:2b" gridrepublic/llm-server
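
Larger models from the library can be selected in the same way, provided the host meets their resource requirements. For example, assuming sufficient disk space and RAM, a 13-billion-parameter Llama 2 model could be requested with:

$ docker run -it --rm -e MODEL="llama2:13b" gridrepublic/llm-server

The value of MODEL must match a model name (and optional tag) listed in the library linked above; llama2:13b is used here purely as an illustration.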

Caching models locally

To avoid re-downloading the same models on subsequent runs of the LLM server, it is worthwhile to establish a cache directory on the local system for storing models. To do this, create an empty directory, e.g. ~/llm-server/cache, and mount it as a volume at /srv/gr when running the Docker image:

$ docker run -it --rm -v ~/llm-server/cache:/srv/gr gridrepublic/llm-server

The LLM server will then store all pulled models in that cache directory, and future instances launched via docker run will find them there (as long as the same volume mount option is provided).
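
The size of the cache can be inspected with ordinary file system tools; the on-disk layout is managed by the server and should be treated as opaque. For example:

$ du -sh ~/llm-server/cache

Deleting the cache directory frees the space; any model required on a later run will simply be pulled again.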

Examples

Using the default Zephyr model

To launch the application container for inference using the default Zephyr model, with local caching of the model:

$ docker run -it --rm -v ~/llm-server/cache:/srv/gr gridrepublic/llm-server

>>> What is the wavelength of blue light in nanometers?
The wavelength of blue light can vary, but it typically falls 
within the range of 450 to 495 nanometers (nm) in vacuum. In a 
medium like air or water, the wavelength is slightly longer due 
to refraction. This range of wavelengths corresponds to the color 
that we perceive as blue in visible light.

>>> /show info
Model details:
Family              llama
Parameter Size      7B
Quantization Level  Q4_0

>>> /bye

Using the lightweight Gemma 2B model

To use the Gemma 2B model:

$ docker run -it --rm -v ~/llm-server/cache:/srv/gr -e MODEL="gemma:2b" gridrepublic/llm-server
pulling manifest 
pulling c1864a5eb193...  26% |||||            | 432 MB/1.7 GB   10 MB/s   1m57s
...
success
>>> What is the wavelength of blue light in nanometers?
The wavelength of blue light is approximately 400-500 nanometers.

>>> /show info
Model details:
Family              gemma
Parameter Size      3B
Quantization Level  Q4_0

>>> /bye