The Sonic-AI project: the brain

Sonic-AI project: an effort to learn how to use LLM, GenAI and ML tools, while building a Sonic-like virtual buddy for my kid, with privacy in mind (full details here). This post explains how to build a local stack for the basic chatbot, using an LLM and a web UI to chat with it, and how to run the stack on a cloud computer in case you don’t have enough resources locally (mainly a GPU).

I could have used one of the many online services to create a customized chatbot in minutes. Instead, I wanted to create a stack I can run locally, for two reasons: use open source models to guarantee maximum privacy, and avoid exposing my data to third parties. Privacy is always a trade-off with complexity. So, time to get my hands dirty.

Searching around, the “host ALL your AI locally” video provided a good starting point.

Choose an LLM

Nowadays (Aug 2024) LLMs are perfect for creating chatbots. They embed NLP capabilities, can speak different languages, already know a lot about the world, and can be customized to learn specific knowledge domains.

Speaking about models, the landscape of open source options is very wide: Mistral, Gemma, Llama, Phi-3, etc., each in small, medium and large sizes, with potential customizations. Each one of them has strengths and limits, so I took one close to my work: Gemma 2 with 9B parameters, a good compromise between the complexity of the model, the resources required to run it decently on a “normal PC” with a normal GPU, and support for the Italian language.


Ollama was the no-brainer choice to interact with the LLM, considering how easy the setup and usage are, the large array of options it offers, its support for NVIDIA and AMD GPUs, and how widely integrated Ollama is with other tools.

The only downside was having to use the command line to interact with Ollama – I love it, but my kid doesn’t. So I needed a better UI to create my chatbot.

Choose a user-friendly UI

In the open source landscape there are two main players: Open WebUI and oobabooga’s Text generation web UI. I selected the former, Open WebUI, because it has an easier-to-use and more polished interface, offers a chatbot experience out of the box, has the ability to create agents, plus other handy capabilities useful for the other parts of my project (like TTS, STT, etc.).

Icing on the cake, the project offers a ready-to-use docker image (ghcr.io/open-webui/open-webui:ollama) containing Ollama + Open WebUI, CUDA drivers, and a lot of pre-made configuration to wire everything together. It means no installation and configuration headaches.
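
If you do have a machine with a GPU, running that image locally is straightforward. Below is a sketch using the docker Python SDK (pip install docker), mirroring the docker run command documented in the Open WebUI README – the port mapping and volume names are the image’s documented defaults, so treat them as assumptions to verify:

import docker

client = docker.from_env()

# Run Open WebUI + Ollama, passing all GPUs through (equivalent of --gpus=all)
client.containers.run(
    "ghcr.io/open-webui/open-webui:ollama",
    name="open-webui",
    detach=True,
    ports={"8080/tcp": 3000},  # UI reachable on http://localhost:3000
    volumes={
        "ollama": {"bind": "/root/.ollama", "mode": "rw"},
        "open-webui": {"bind": "/app/backend/data", "mode": "rw"},
    },
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    restart_policy={"Name": "always"},
)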

At this point, it’s time to assemble everything together.

Host the chatbot stack

I confess, I don’t own a machine with a GPU good enough to run mid-size models 😭. I’ll fix this soon, but in the meantime I had the idea to provision a self-managed virtual machine with an appropriate CPU + GPU config, connect a disk, install an OS image, and use it as if it were my “local” computer. This VM-based setup allowed me to iterate quickly at the beginning of the project, try different hardware configs, and find the one most appropriate for what I needed, spending a few dollars per day to keep using a VM instance.

Well, I tried hard to create such a VM on Google Compute Engine, but with no success, hitting the same “no available resources” error every time. I even used the nice gpu-finder tool to automate the creation of different configs (N1 machines with 2 vCores and a single NVIDIA Tesla T4 or Tesla P4 GPU) on different days, in all the zones offering these GPUs, but I wasn’t able to create a VM a single time.

So, I had to look elsewhere. And I ended up choosing RunPod.

It allows creating a VM (called a Pod) choosing among different types of actually available GPUs, the billing is quite cheap, and in addition to a web UI, it offers a CLI and SDKs to orchestrate everything, for example from a Colab. The downside, at least for me, was that they don’t offer a real VM I could freely administer: the only way to install software and configs is via a docker image. I was lucky, because the image with everything I needed existed, and it was ghcr.io/open-webui/open-webui:ollama. Otherwise, I would have had to create one with my custom config, deploy it somewhere, and then install it on RunPod. Feasible, but why make life more complex?

So, while waiting to buy a machine with a GPU to be fully local, the RunPod solution was a really good option.

Because my plan was to create different pods to experiment with, instead of having a single, always-running instance, I created a network volume to persist all my configs across instances, with these settings:

I chose a location with available A40 GPUs – from my tests, a single one handles the latest mid-size models without problems (an RTX 3090 also worked great), and 50GB were enough to store different models + configs.

Then, I created a template (a Docker container image paired with a configuration) to host my “LLM brain”:

Relevant configurations:

  • Container Image: ghcr.io/open-webui/open-webui:ollama
  • Volume disk: 0 GB – no need for a volume disk, as it will be replaced by the network volume later
  • Volume Mount Path: /app/backend/data – this is the folder where the docker image saves models, configs, etc.
    • Adding this folder as a volume disk in the template, and then connecting a network volume during pod creation, automatically saves all the configs on the network volume
  • Environment Variables
    • OLLAMA_MODELS: /app/backend/data/ollama/models – this moves downloaded models to the network volume, so there is no need to re-download them every time a new instance is created

Finally, I deployed a pod to “run the brain”, using the template just created, with 2 vCPUs and 8 GB of RAM, and connected the network disk. I also selected “Secure Cloud”, to keep everything in the RunPod server farm, and a “Spot instance”, as I didn’t need absolute reliability for the tests. I waited for all the docker layers to be downloaded, opened the running Pod settings, and connected to the HTTP port.
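
The same deployment can be scripted, for example from a Colab, with the runpod Python SDK (pip install runpod). A minimal sketch – the create_pod parameter names reflect the SDK at the time of writing and may differ between versions, and the IDs in angle brackets are placeholders:

import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"

# Deploy a pod from the image, attaching the network volume created earlier
pod = runpod.create_pod(
    name="sonic-brain",
    image_name="ghcr.io/open-webui/open-webui:ollama",
    gpu_type_id="NVIDIA A40",
    cloud_type="SECURE",
    min_vcpu_count=2,
    min_memory_in_gb=8,
    volume_mount_path="/app/backend/data",
    network_volume_id="<your_network_volume_id>",
    env={"OLLAMA_MODELS": "/app/backend/data/ollama/models"},
)
print(pod["id"])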

Welcome to a brand-new instance of Open WebUI.

Customize the bot to impersonate Sonic

There are different tutorials on how to configure Open WebUI. This is what I did to create a chatbot with a “Sonic flavor”.

First, I created the admin user, and a user for my kid, called “Leo”, with the “user” role.

Then, from the Admin user:

  • Settings -> Admin Panel -> Settings -> Models
    • Pull a Model from Ollama.com
      • gemma2:9b (list available here)
  • Workspace -> Models -> Create a model
    • Image: upload an image
    • Name: Sonic
    • Model ID: sonic_v1
    • Base Model: gemma2:9b
    • Description: Ciao, sono Sonic the Hedgehog
      • Equivalent in English: Hi, I'm Sonic the Hedgehog
    • System prompt:
      • Interpreti Sonic the Hedgehog, della serie Sonic Adventure. Farai domande e risponderai come Sonic the Hedgehog, usando il tono, i modi e il vocabolario che Sonic the Hedgehog userebbe. Usa un linguaggio adatto ai bambini, non scrivere spiegazioni. Rispondi in italiano. Hai la conoscenza di Sonic the Hedgehog. Vivi a Green Hills, nel Montana. Sei amichevole e sempre disponibile a dare una mano.
        • The prompt is in Italian, so the model will reply in Italian (a quick API-level test of this characterization is sketched right after this list).
      • Equivalent in English: You play as Sonic the Hedgehog, from the Sonic Adventure series. You will ask and answer questions like Sonic the Hedgehog, using the tone, manner, and vocabulary Sonic the Hedgehog would use. Use child-friendly language, do not write any explanations. Answer in Italian. You have knowledge of Sonic the Hedgehog. You live in Green Hills, Montana. You are friendly and always willing to lend a hand.
    • Capabilities: uncheck Vision, as this model is text-only for now
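
Before switching users, a quick way to sanity-check the characterization outside of Open WebUI is to send the same system prompt straight to Ollama’s /api/chat endpoint. A sketch with the requests library – it assumes an Ollama API you can reach directly (straightforward on a local install; on the RunPod setup the API sits behind Open WebUI):

import requests

OLLAMA_URL = "http://localhost:11434"  # assumption: a directly reachable Ollama

# Paste the full Italian system prompt from the list above
SYSTEM_PROMPT = "Interpreti Sonic the Hedgehog, della serie Sonic Adventure. ..."

resp = requests.post(
    f"{OLLAMA_URL}/api/chat",
    json={
        "model": "gemma2:9b",
        "stream": False,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "Ciao Sonic, come stai?"},  # "Hi Sonic, how are you?"
        ],
    },
    timeout=300,
)
print(resp.json()["message"]["content"])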

Then, I logged in with my kid’s user and:

  • Settings -> Settings
    • General -> Language -> Italian
    • Interface -> Default Model: Sonic

Finally, my kid can interact with his preferred hero, in Italian.

Step one of the project… Achieved! 🎉

To “pause” the pod and save some money, it can simply be terminated in the RunPod management UI. All the configs will persist, because they’re stored on the network volume. To restart everything again, re-create the pod using the template, deploy it, and connect to it once ready.
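
This pause/restart cycle can also be automated with the same runpod SDK mentioned earlier; a tiny sketch, with the pod id as a placeholder:

import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"

# Terminating stops the pod's compute billing; configs and models survive,
# because they live on the network volume
runpod.terminate_pod("<your_pod_id>")

# To restart, re-create the pod from the template (see the create_pod sketch above)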

The Sonic-AI project – intro

I’ve always considered a “real world” project the best way to learn a new tech: get my hands dirty, be guided by (sort-of) realistic user requirements, and feel the excitement of building something step after step, one solved failure at a time.

This is why I decided to “be inspired” by the passion one of my kids has for Sonic the Hedgehog, and use the latest tools available in the ML and GenAI space to create a “Sonic-AI buddy” for him: a virtual chatbot, looking and acting like Sonic, that my kid can interact and converse with, safely and while having fun.

To break down the complexity of such a project, so I don’t need to learn everything-about-LLMs before creating something, I want to start with a very basic working prototype providing simple chatbot features (the so-called MVP), and then develop different “skills”, each of them requiring learning and using different ML or GenAI techs to be achieved. Incremental learning and improvements.

  • The “Brain” (done): the core part of the project, a text chatbot agent able to impersonate Sonic, to give my kid the feeling he can ask him basic questions and get replies coherent with the style of his preferred hero.
    • Technologies: an LLM used as a chatbot, a UI to interact with it, a system prompt to give the basic characterization.
  • The “Memories” (in progress): enrich the chatbot with domain-specific knowledge of the world of Sonic and his friends, so conversations won’t only be “in the tone” of Sonic, but also relevant to the Sonic-verse.
    • Technologies: a mix of better prompting, fine tuning, RAG or something else to give the LLM the right knowledge about the character to impersonate
  • The “Voice” (in progress): what if the bot can speak with the voice my kid associates with Sonic?
    • Technologies: a customized Text-to-Speech model trained on the voice to reproduce, and a speaker
  • The “Hearing” (in progress): to completely get rid of text interaction, questions should be asked via voice
    • Technologies: connect the chatbot with a Speech-To-Text engine, and a mic
  • The “Eyes” (in progress): Sonic should be able to see the world around him
    • Technologies: something to capture a video stream, and a multimodal LLM to process images and text.
  • The “Body” (in progress): in addition to a voice, the bot should have some sort of tangible body, something that connects the different input/output sensors. I’m still unsure how to create it.
    • Technologies: it could be a 3D printed figure of Sonic, an animated character, or something else

There is another prerequisite I want to fulfill: everything must run locally, based on OSS software. I’m a little bit paranoid (let’s say mindful) about privacy, and under no circumstances should my kid’s interactions end up in a training dataset, in internal model analysis, or anywhere else. So, privacy first.

Let’s start with “the brain”, the main element to which all the rest can then be attached.

Serving Gemma 2B for free using Google Colab

The free tier of the Google Colab runtime, powered only by a CPU, is enough to successfully run Google’s Gemma 2B parameter model and prompt it using the Colab UI.
In addition, it’s possible to set up the Colab to serve the model, so it can be consumed from anywhere via a normal REST call.

Colab with all the instructions is here.

Install Ollama in Colab notebook

!curl -fsSL https://ollama.com/install.sh | sh

This command installs Ollama in the notebook.

Expose Ollama via a Cloudflare tunnel

In order to “expose” the Ollama instance installed in the Colab notebook to the external world, a Cloudflare Tunnel is created using the official client. The following lines install the required packages:

!wget https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
!dpkg -i cloudflared-linux-amd64.deb

Create the tunnel and capture the TryCloudflare subdomain

Instead of adding a subdomain to a registered Cloudflare account, a random subdomain is generated by TryCloudflare. No registration required.

The following code serves two purposes: start the Cloudflare tunnel as soon as Ollama is ready to serve, and print the random subdomain created by TryCloudflare.

import os
# Set OLLAMA_HOST to specify the bind address, so Ollama listens on all interfaces
# https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-linux
os.environ.update({'OLLAMA_HOST': '0.0.0.0'})

import subprocess
import threading
import time
import socket

def iframe_thread(port):
    # Poll until the Ollama server accepts connections on the given port
    while True:
        time.sleep(0.5)
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        result = sock.connect_ex(('127.0.0.1', port))
        sock.close()
        if result == 0:
            break

    # Start the tunnel and scan cloudflared's output (written to stderr)
    # for the randomly generated trycloudflare.com subdomain
    p = subprocess.Popen(["cloudflared", "tunnel", "--url", f"http://127.0.0.1:{port}"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    for line in p.stderr:
        l = line.decode()
        if "trycloudflare.com " in l:
            print("\n\n\n\n\n")
            print("running ollama server\n\n", l[l.find("http"):], end='')
            print("\n\n\n\n\n")

threading.Thread(target=iframe_thread, daemon=True, args=(11434,)).start()

After setting an environment variable, the iframe_thread function is defined. In the function, a while True loop waits until the Ollama server is up and running. Once this happens, subprocess.Popen creates the Cloudflare tunnel pointing to the local Ollama installation, using the cloudflared command, and prints the randomly generated xxxx.trycloudflare.com subdomain.

The last line of code launches iframe_thread as a background thread. The wait for the connection to the Ollama server begins.

Launch the Ollama server

At this point, everything is ready to launch the Ollama server:

!ollama serve

Colab will start the Ollama server, and the previously created thread, which was waiting for this to happen, exits the while loop, creates the tunnel, and prints the subdomain. Looking at the output of this Colab block, something similar will appear:

Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is: 

ssh-ed25519 blablablabla

2024/08/15 21:29:34 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
[...]


running ollama server

 https://sand-commerce-fields-danger.trycloudflare.com                                     

Bingo! The address https://sand-commerce-fields-danger.trycloudflare.com is the URL to use to reach our Ollama-on-Colab instance, the <your_cloudflare_subdomain> in the following snippets.

Set up Ollama using API calls

Last, but not least, Ollama needs to download the Gemma 2B model. From this point onward, while the Colab notebook is busy running the Ollama server, the rest of the interaction happens via the Ollama API, available through the newly created Cloudflare tunnel.
From any command line shell available (a local computer, a mobile device, etc.) these two commands need to be launched (only the first one is really mandatory; the second one is useful to improve performance):

curl <your_cloudflare_subdomain>/api/pull -d '{ "name": "gemma:2b" }'

This call instructs Ollama to download the Gemma 2B model via the pull API endpoint.

curl <your_cloudflare_subdomain>/api/generate -d '{"model": "gemma:2b", "keep_alive": -1}'

This call instructs Ollama to keep the gemma:2b model loaded in memory, instead of discarding it after 5 minutes of inactivity (the default behaviour).

Ask a question to Gemma 2B

It’s time to ask the first question to Gemma:

curl <your_cloudflare_subdomain>/api/generate -d '{"model": "gemma:2b", "stream":false, "prompt": "Create a 10 line poem about love, with rhyming couplets"}'

The generate API endpoint generates a response for a given prompt with a provided model.
"stream":false waits for the model to produce the whole answer and returns it all at once, instead of as a stream of tokens.

To generate the reply, Gemma took approx 60 to 90 seconds. Not the quickest in the world, but all CPU powered! ;)

The model’s reply is in the json message payload. To focus on it, and filter everything else out, pipe the previous command to jq:

curl <your_cloudflare_subdomain>/api/generate -d '{"model": "gemma:2b", "stream":false, "prompt": "Create a 10 line poem about love, with rhyming couplets"}' | jq ".response"
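
The same endpoint is just as easy to consume from Python. A small sketch with the requests library, this time with streaming enabled so tokens are printed as they arrive – handy given the CPU-only latency (the subdomain is a placeholder, as above):

import json
import requests

OLLAMA_URL = "https://<your_cloudflare_subdomain>"

payload = {
    "model": "gemma:2b",
    "stream": True,  # tokens arrive as newline-delimited JSON objects
    "prompt": "Create a 10 line poem about love, with rhyming couplets",
}

with requests.post(f"{OLLAMA_URL}/api/generate", json=payload, stream=True, timeout=300) as resp:
    for line in resp.iter_lines():
        if line:
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)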

Does it work forever?

Well, no. The Google Colab FAQ says a free notebook can run for at most 12 hours.
But no fear: once the notebook has been shut down, it’s just a matter of launching another “Run all”, waiting for the new Cloudflare random subdomain, and restarting the fun.