The free tier of Google Colab runtime, powered only by a CPU, is enough to successfully run Google’s Gemma 2B parameters model and prompt it using the Colab UI.
In addition, it’s possible to set up the Colab to serve the model, so it can be consumed from anywhere via a normal REST call.
Colab with all the instructions is here.
Install Ollama in Colab notebook
!curl -fsSL https://ollama.com/install.sh | sh
This command installs Ollama on the notebook.
Expose Ollama via a Cloudflare tunnel
In order to “expose” the Ollama instance installed in the Colab notebook to the external world, a Cloudflare Tunnel is created, using the official client. The following lines install the required packages:
!wget https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
!dpkg -i cloudflared-linux-amd64.deb
Create the tunnel and capture the TryCloudflare subdomain
Instead of adding a subdomain to a registered Cloudflare’s account, a random subdomain is generated by TryCloudflare. No registration required.
The following code exists for two purposes: start the Cloudflare tunnel as soon as Ollama is ready to serve, and return the random subdomain created by TryCloudflare.
import os
# Set OLLAMA_HOST to specify bind address
# https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-linux
os.environ.update({'OLLAMA_HOST': '0.0.0.0'})
import subprocess
import threading
import time
import socket
def iframe_thread(port):
while True:
time.sleep(0.5)
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
result = sock.connect_ex(('127.0.0.1', port))
if result == 0:
break
sock.close()
p = subprocess.Popen(["cloudflared", "tunnel", "--url", f"http://127.0.0.1:{port}"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
for line in p.stderr:
l = line.decode()
if "trycloudflare.com " in l:
print("\n\n\n\n\n")
print("running ollama server\n\n", l[l.find("http"):], end='')
print("\n\n\n\n\n")
threading.Thread(target=iframe_thread, daemon=True, args=(11434,)).start()
After setting some enviromental variables, a iframe_thread
function is defined. In the function, a while True
loop waits till the Ollama server is up and running. Once this happen, the subprocess.Popen
creates the Cloudflare tunnel pointing to the local Ollama installation, using the command cloudflared
, and prints the xxxx.trycloudflare.com randomly generated subdomain.
The last line of code lauches the iframe_thread
as a background Thread
. The wait for being connected with the Ollama server starts.
Launch the Ollama server
At this point, everything is ready to launch the Ollama server
!ollama serve
Colab will start the Ollama server, and the previously created thread, which was waiting for this to happen, quits from the while
loop, creates the tunnel and print the subdomain. Looking at the output of this Colab block something similar will appear:
Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is:
ssh-ed25519 blablablabla
2024/08/15 21:29:34 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
[...]
running ollama server
https://sand-commerce-fields-danger.trycloudflare.com
Bingo! The address https://sand-commerce-fields-danger.trycloudflare.com is the url to use to reach our Ollama-on-Colab instance, the <your_cloudflare_subdomain> in the following snippets.
Set up Ollama using API calls
Last, but not the least, Ollama need to download the Gemma 2B model. From this point ongoing, while the Colab notebook is busy keeps running the Ollama server, the rest of the interaction happen via Ollama API, available via the newly created Cloudflare tunnel.
From any command line shell available (a local computer, a mobile device, etc) these two commands need to be launched (only the first one is really mandatory, the second one is useful to improve performances):
curl <your_cloudflare_subdomain>/api/pull -d '{ "name": "gemma:2b" }'
This call instructs Ollama to download the Gemma 2B model via the pull
API endpoint.
curl <your_cloudflare_subdomain>/api/generate -d '{"model": "gemma:2b", "keep_alive": -1}'
This call instructs Ollama to keep the gemma:2b model loaded in memory, instead of discarding it after 5 minutes of non usage (the default behaviour).
Ask question to Gemma 2B
It’s time to ask the first question to Gemma:
curl <your_cloudflare_subdomain>/api/generate -d '{"model": "gemma:2b", "stream":false, "prompt": "Create a 10 line poem about love, with rhyming couplets"}'
generate
API endpoint generates a response for a given prompt with a provided model."stream":false
waits for the model to elaborate the answer and returns it all at once, instead of a stream of tokens.
To generate the reply, Gemma took approx 60 to 90 seconds. Not the quickest in the world, but all CPU powered! ;)
The model’s reply is in the json message payload. To focus on it, and filter all the rest out, pipe the previous command to jq
:
curl <your_cloudflare_subdomain>/api/generate -d '{"model": "gemma:2b", "stream":false, "prompt": "Create a 10 line poem about love, with rhyming couplets"}' | jq ".response"
Does it works forever?
Well, no. Google Colab FAQ says the free notebook can run for at most 12 hours.
But no fear, once the notebook has been shut down, it’s just a matter to launch another “Run all“, wait to the new Cloudflare random subdomain, and restart the fun.