Serving Gemma 2B for free using Google Colab

The free tier of the Google Colab runtime, powered only by a CPU, is enough to run Google’s Gemma 2B-parameter model and prompt it from the Colab UI.
In addition, it’s possible to set up the Colab notebook to serve the model, so it can be consumed from anywhere via normal REST calls.

A Colab notebook with all the instructions is here.

Install Ollama in Colab notebook

!curl -fsSL https://ollama.com/install.sh | sh

This command installs Ollama in the notebook’s runtime.

Expose Ollama via a Cloudflare tunnel

In order to “expose” the Ollama instance installed in the Colab notebook to the external world, a Cloudflare Tunnel is created using the official client, cloudflared. The following lines download and install the required package:

!wget https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
!dpkg -i cloudflared-linux-amd64.deb

Create the tunnel and capture the TryCloudflare subdomain

Instead of adding a subdomain to a registered Cloudflare account, a random subdomain is generated by TryCloudflare. No registration required.

The following code serves two purposes: it starts the Cloudflare tunnel as soon as Ollama is ready to serve, and it prints the random subdomain created by TryCloudflare.

import os
# Set OLLAMA_HOST to specify bind address
# https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-linux
os.environ.update({'OLLAMA_HOST': '0.0.0.0'})

import subprocess
import threading
import time
import socket

def iframe_thread(port):
    # Poll until the Ollama server accepts connections on the given port
    while True:
        time.sleep(0.5)
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        result = sock.connect_ex(('127.0.0.1', port))
        sock.close()
        if result == 0:
            break

    # Start the Cloudflare tunnel pointing at the local Ollama instance;
    # cloudflared prints the random trycloudflare.com URL on stderr
    p = subprocess.Popen(["cloudflared", "tunnel", "--url", f"http://127.0.0.1:{port}"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    for line in p.stderr:
        l = line.decode()
        if "trycloudflare.com " in l:
            print("\n\n\n\n\n")
            print("running ollama server\n\n", l[l.find("http"):], end='')
            print("\n\n\n\n\n")

# 11434 is Ollama's default port
threading.Thread(target=iframe_thread, daemon=True, args=(11434,)).start()

After setting an environment variable, an iframe_thread function is defined. Inside the function, a while True loop waits until the Ollama server is up and running. Once this happens, subprocess.Popen creates the Cloudflare tunnel pointing to the local Ollama installation, using the cloudflared command, and the randomly generated xxxx.trycloudflare.com subdomain is printed.

The last line of code launches iframe_thread as a background thread, which starts waiting for the Ollama server to come up.

Launch the Ollama server

At this point, everything is ready to launch the Ollama server:

!ollama serve

Colab will start the Ollama server, and the previously created thread, which was waiting for this to happen, exits the while loop, creates the tunnel, and prints the subdomain. The output of this Colab cell will look something like this:

Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is: 

ssh-ed25519 blablablabla

2024/08/15 21:29:34 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
[...]


running ollama server

 https://sand-commerce-fields-danger.trycloudflare.com                                     

Bingo! The address https://sand-commerce-fields-danger.trycloudflare.com is the URL to use to reach our Ollama-on-Colab instance, the <your_cloudflare_subdomain> in the following snippets.

Set up Ollama using API calls

Last but not least, Ollama needs to download the Gemma 2B model. From this point on, while the Colab notebook is busy running the Ollama server, the rest of the interaction happens via the Ollama API, available through the newly created Cloudflare tunnel.
From any available command line shell (on a local computer, a mobile device, etc.) these two commands need to be launched (only the first one is really mandatory; the second one is useful to improve performance):

curl <your_cloudflare_subdomain>/api/pull -d '{ "name": "gemma:2b" }'

This call instructs Ollama to download the Gemma 2B model via the pull API endpoint.
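
As an aside, the same call can be scripted from Python instead of curl. A minimal sketch, assuming the requests package is installed and using BASE_URL as a hypothetical placeholder for your TryCloudflare URL; since the pull endpoint streams one JSON status object per line, the download progress can be followed:

import json
import requests

# Hypothetical placeholder: replace with your trycloudflare.com URL
BASE_URL = "https://<your_cloudflare_subdomain>"

# /api/pull streams download progress as one JSON object per line
with requests.post(f"{BASE_URL}/api/pull", json={"name": "gemma:2b"}, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(json.loads(line).get("status"))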

curl <your_cloudflare_subdomain>/api/generate -d '{"model": "gemma:2b", "keep_alive": -1}'

This call instructs Ollama to keep the gemma:2b model loaded in memory, instead of discarding it after 5 minutes of inactivity (the default behaviour).
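
To double-check that the model really stays resident, recent Ollama builds expose a ps endpoint that lists the models currently loaded in memory. A quick sketch, with the same hypothetical BASE_URL assumption as above:

import requests

BASE_URL = "https://<your_cloudflare_subdomain>"

# /api/ps lists the models currently loaded in memory
for model in requests.get(f"{BASE_URL}/api/ps").json().get("models", []):
    # with keep_alive set to -1, the expiry is pushed far into the future
    print(model["name"], "expires at", model.get("expires_at"))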

Ask a question to Gemma 2B

It’s time to ask the first question to Gemma:

curl <your_cloudflare_subdomain>/api/generate -d '{"model": "gemma:2b", "stream":false, "prompt": "Create a 10 line poem about love, with rhyming couplets"}'

The generate API endpoint generates a response for a given prompt with a provided model.
"stream": false waits for the model to produce the full answer and returns it all at once, instead of as a stream of tokens.

To generate the reply, Gemma took approximately 60 to 90 seconds. Not the quickest in the world, but all CPU-powered! ;)
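
The same request can be issued from Python. A minimal sketch, again assuming the requests package and the hypothetical BASE_URL placeholder introduced earlier:

import requests

BASE_URL = "https://<your_cloudflare_subdomain>"

payload = {
    "model": "gemma:2b",
    "stream": False,  # wait for the complete answer instead of a token stream
    "prompt": "Create a 10 line poem about love, with rhyming couplets",
}
# generation on CPU is slow, so allow a generous timeout
data = requests.post(f"{BASE_URL}/api/generate", json=payload, timeout=300).json()
print(data["response"])  # the model's reply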

The model’s reply is in the JSON message payload. To focus on it and filter everything else out, pipe the previous command to jq:

curl <your_cloudflare_subdomain>/api/generate -d '{"model": "gemma:2b", "stream":false, "prompt": "Create a 10 line poem about love, with rhyming couplets"}' | jq ".response"

Does it work forever?

Well, no. The Google Colab FAQ says a free notebook can run for at most 12 hours.
But no fear: once the notebook has been shut down, it’s just a matter of launching another “Run all“, waiting for the new random Cloudflare subdomain, and restarting the fun.
