The Sonic-AI project: the brain

Sonic-AI project: an effort to learn how to use LLM, GenAI and ML tools, while building a Sonic-like virtual buddy for my kid, with privacy in mind (full details here). This post explains how to build a local stack for the basic chatbot, using an LLM and a web UI to chat with it, and how to run the same stack on a cloud computer in case you don’t have enough resources locally (mainly a GPU).

I could have used one of the many online services to create a customized chatbot in minutes. Instead, I wanted a stack I can run locally, for two reasons: use open source models to guarantee maximum privacy, and avoid exposing my data to third parties. Privacy always comes with a cost in complexity. So, time to get my hands dirty.

Searching around, the “host ALL your AI locally” video provided a good starting point.

Choose an LLM

Nowadays (Aug 2024) LLMs are perfect for creating chatbots. They embed NLP capabilities, can speak different languages, already know a lot about the world and can be customized to learn specific knowledge domains.

Speaking about models, the landscape of open source options is very wide: Mistral, Gemma, Llama, Phi-3, etc., each in small, medium and large sizes, with room for customization. Each one has strengths and limits, so I picked one close to my work: Gemma 2 with 9B parameters, a good compromise between model complexity, the resources required to run it decently on a “normal PC” with a normal GPU, and support for the Italian language.


Ollama was the no-brainer choice to interact with the LLM, considering how easy it is to set up and use, the large array of options it offers, its support for NVIDIA and AMD GPUs, and how widely it is integrated with other tools.

The only downside was having to use the command line to interact with Ollama – I love it, but my kid doesn’t. So I needed a better UI to create my chatbot.
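For reference, the command-line interaction my kid wouldn’t enjoy looks roughly like this (a minimal sketch, assuming Ollama is already installed and the model tag matches the one on ollama.com):

# download the model once (a few GB for the 9B version)
ollama pull gemma2:9b

# start an interactive chat session in the terminal
ollama run gemma2:9b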

Choose a user-friendly UI

In the open source landscape there are two main players: Open WebUI and oobabooga’s Text generation web UI. I selected the former, Open WebUI, because it has an easier-to-use and more polished interface, offers a chatbot experience out of the box, can create agents, and has other handy capabilities useful for the next parts of my project (like TTS, STT, etc.).

Icing on the cake, the project offers a ready-to-use docker image (https://ghcr.io/open-webui/open-webui:ollama) containing Ollama + Open WebUI, CUDA drivers, and a lot of pre-made configuration to wire everything together. That means no installation and configuration headaches.
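For reference, running that same image on a local machine with an NVIDIA GPU should look roughly like this (a sketch based on the Open WebUI README; double-check the project docs for the current flags):

# Ollama + Open WebUI in a single container, with GPU support
# and two named volumes for models and app data
docker run -d --gpus=all \
  -p 3000:8080 \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:ollama

Open WebUI should then answer on http://localhost:3000.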

At this point, it’s time to assemble everything together.

Host the chatbot stack

I confess, I don’t own a machine with a GPU good enough to run mid-size models 😭. I’ll fix that soon, but in the meantime the idea was to provision a self-managed virtual machine with an appropriate CPU + GPU config, connect a disk, install an OS image, and use it as if it were my “local” computer. This VM-based setup allowed me to iterate quickly at the beginning of the project, try different hardware configs and find the most appropriate one for my needs, spending a few dollars per day to keep a VM instance running.

Well, I tried hard to create such a VM on Google Compute Engine, but with no success, failing every time with the same “no available resources” error. I even used the nice gpu-finder tool to automate the creation of different configs (N1 machines with 2 vCores and a single NVIDIA Tesla T4 or Tesla P4 GPU) on different days, across all the zones offering these GPUs, but I wasn’t able to create a VM a single time.

So, I had to look elsewhere. And I ended up choosing RunPod.

It allows you to create a VM (called a Pod) selecting among different types of actually available GPUs, the billing is quite cheap, and in addition to a web UI it offers a CLI and SDKs to orchestrate everything, for example from a Colab. The downside, at least for me, was that they don’t offer a real VM I could freely administer: the only way to install software and configs is via a docker image. I was lucky, because the image with everything I needed already existed and was https://ghcr.io/open-webui/open-webui:ollama. Otherwise, I would have had to create one with my custom config, deploy it somewhere, and then install it on RunPod. Feasible, but why make life more complex?

So, while waiting to buy a machine with a GPU and go fully local, the RunPod solution was a really good option.

Because my plan was to create different pods to experiment with, instead of having a single, always-running instance, I created a network volume to store all my configs across instances, set up as follows:

I chose a location with available A40 GPUs – from my tests, a single one handles the latest mid-size models without problems (an RTX 3090 worked great too), and 50GB were enough to store different models + configs.

Then, I created a template (a Docker container image paired with a configuration) to host my “LLM brain”:

Relevant configurations:

  • Container Image: https://ghcr.io/open-webui/open-webui:ollama
  • Volume disk: 0 GB – no need for a volume disk, as it will be replaced by the network volume later
  • Volume Mount Path: /app/backend/data – this is the folder where the docker image saves models, configs, etc.
    • Adding the folder with all the configs as a volume disk in the template, and then connecting a network volume during pod creation, automatically saves all the configs on the network volume
  • Environment Variables
    • OLLAMA_MODELS: /app/backend/data/ollama/models – this moves downloaded models to the network volume, so there is no need to re-download them every time a new instance is created (a quick sanity check is sketched right after this list)
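A quick way to double-check the wiring, from the pod’s web terminal, is to verify that the network volume actually backs the data folder and that the env var is visible (a minimal sanity check, using the paths configured above):

# the network volume should show up as the filesystem backing the data folder
df -h /app/backend/data

# after pulling a model from Open WebUI, it should land here
ls -lh /app/backend/data/ollama/models

# the container should see the override
echo $OLLAMA_MODELS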

Finally, I deployed a pod to “run the brain”, using the template just created, with 2 vCPUs and 8 GB of RAM, and connected the network disk. I also selected “Secure Cloud”, to keep everything inside the RunPod server farm, and a “Spot instance“, as I didn’t need absolute reliability for the tests. I waited for all the docker layers to be downloaded, opened the running Pod settings and connected to the HTTP port.

Welcome to a brand-new instance of Open WebUI.

Customize the bot to impersonate Sonic

There are different tutorials on how to configure Open WebUI. This is what I did to create a chatbot with a “Sonic flavor”.

First, I created the admin user, and then the user for my kid, called “Leo”, with a user role.

Then, from the Admin user:

  • Settings -> Admin Panel -> Settings -> Models
    • Pull a Model from Ollama.com
      • gemma2:9b (list available here)
  • Workspace -> Models -> Create a model
    • Image: upload an image
    • Name: Sonic
    • Model ID: sonic_v1
    • Base Model: gemma2:9b
    • Description: Ciao, sono Sonic the Hedgehog
      • Equivalent in English: Hi, I'm Sonic the Hedgehog
    • System prompt (a CLI-only alternative is sketched right after this list):
      • Interpreti Sonic the Hedgehog, della serie Sonic Adventure. Farai domande e risponderai come Sonic the Hedgehog, usando il tono, i modi e il vocabolario che Sonic the Hedgehog userebbe. Usa un linguaggio adatto ai bambini, non scrivere spiegazioni. Rispondi in italiano. Hai la conoscenza di Sonic the Hedgehog. Vivi a Green Hills, nel Montana. Sei amichevole e sempre disponibile a dare una mano.
        • The prompt is in Italian, so the model will speak in Italian.
      • Equivalent in English: You play as Sonic the Hedgehog, from the Sonic Adventure series. You will ask and answer questions like Sonic the Hedgehog, using the tone, manner, and vocabulary Sonic the Hedgehog would use. Use child-friendly language, do not write any explanations. Answer in Italian. You have knowledge of Sonic the Hedgehog. You live in Green Hills, Montana. You are friendly and always willing to lend a hand.
    • Capabilities: uncheck Vision, as this model is text-only for now
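As a side note, the same characterization can also be baked directly into an Ollama model with a Modelfile, in case you prefer the CLI over the Workspace UI (a sketch; the file name and the sonic_v1 tag simply mirror the values above):

# create a Modelfile with the same base model and system prompt
cat > Modelfile <<'EOF'
FROM gemma2:9b
SYSTEM """Interpreti Sonic the Hedgehog, della serie Sonic Adventure. Farai domande e risponderai come Sonic the Hedgehog, usando il tono, i modi e il vocabolario che Sonic the Hedgehog userebbe. Usa un linguaggio adatto ai bambini, non scrivere spiegazioni. Rispondi in italiano. Hai la conoscenza di Sonic the Hedgehog. Vivi a Green Hills, nel Montana. Sei amichevole e sempre disponibile a dare una mano."""
EOF

# build the custom model and try it
ollama create sonic_v1 -f Modelfile
ollama run sonic_v1

The resulting model should then show up in Open WebUI’s model list like any other Ollama model.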

Then, I logged in with my kid’s user and:

  • Settings -> Settings
    • General -> Language -> Italian
    • Interface -> Default Model: Sonic

Finally, my kid can interact with his favorite hero, in Italian.

Step one of the project… Achieved! 🎉

To “pause” the pod and save some money, it can simply be terminated in the RunPod management UI. All the configs will persist because they’re stored on the network volume. To restart everything, re-create the pod using the template, deploy it and connect once it’s ready.

The Sonic-AI project – intro

I’ve always considered a “real world” project the best way to learn a new tech: get your hands dirty, be guided by (sort-of) realistic user requirements, and enjoy the excitement of building something step after step, one solved failure at a time.

This is why I decided to “be inspired” by the passion one of my kids has for Sonic the Hedgehog, and use the latest tools available in the ML and GenAI space to create a “Sonic-AI buddy” for him: a virtual chatbot, looking and acting like Sonic, that my kid can interact and converse with, safely and having fun.

To break down the complexity of such a project, so I don’t need to learn everything about LLMs before creating something, I want to start with a very basic working prototype providing simple chatbot features (the so-called MVP), and then develop different “skills”, each of them requiring me to learn and use different ML or GenAI techs. Incremental learning and improvements.

  • The “Brain” (done): the core part of the project, a text chatbot agent able to impersonate Sonic, to give my kid the feeling he can ask him basic questions and get replies coherent with the style of his favorite hero.
    • Technologies: an LLM used as a chatbot, a UI to interact with it, a system prompt to give the basic characterization.
  • The “Memories” (in progress): enrich the chatbot with domain-specific knowledge of the world of Sonic and his friends, so conversations won’t only be “in the tone” of Sonic, but also relevant to the Sonic-verse.
    • Technologies: a mix of better prompting, fine tuning, RAG or something else to give the LLM the right knowledge about the character to impersonate
  • The “Voice” (in progress): what if the bot can speak with the voice my kid associates with Sonic?
    • Technologies: a customized Text-to-Speech model trained on the voice to reproduce, and a speaker
  • The “Hearing” (in progress): to completely get rid of text interaction, questions should be asked via voice
    • Technologies: connect the chatbot with a Speech-To-Text engine, and a mic
  • The “Eyes” (in progress): Sonic should be able to see the world around him
    • Technologies: something to capture a video stream, and a multimodal LLM to process images and text.
  • The “Body” (in progress): something to connect the different input/output sensors. In addition to a voice, the bot should have some sort of tangible body. I’m still unsure how to create it.
    • Technologies: it could be a 3D printed figure of Sonic, an animated character or something else

There is another prerequisite I want to fulfill: everything must run locally and be based on open source software. I’m a little bit paranoid (let’s say mindful) about privacy, and under no circumstances should my kid’s interactions end up in a training dataset, in internal model analysis, or anywhere else. So, privacy first.

Let’s start with “the brain“, the main element to which all the rest can then be attached.

Android notifications in GNOME Linux with GSConnect

macOS’s Continuity shows the level of integration possible between an iPhone and a Mac when a single company controls the OS stack of both devices: synced notifications, making and picking up phone calls from the desktop, file sharing, etc. Less powerful, but along the same lines, is the Android – ChromeOS integration offered by Google.

I’ve always missed something similar for showing my Android phone notifications on my Linux desktop, without using cloud services that are questionable in terms of privacy. But today I almost completely filled that gap, thanks to KDE Connect.

KDE Connect allows devices to securely share content like notifications and files, and offers features like SMS messaging, clipboard sync and remote control. The complete list of features supported by KDE Connect is impressive:

  • Receive phone notifications on the desktop computer and reply to messages
  • Control music playing on the desktop from the phone
  • Use the phone as a remote control for the desktop
  • Run predefined commands on the desktop PC from connected devices
  • Check the phone’s battery level from the desktop
  • Ring the phone to help find it
  • Share files and links between devices
  • Browse the phone from the desktop
  • Control the desktop’s volume using the phone
  • Send SMS from the desktop

And there are two more pieces of good news: everything runs locally on the Linux desktop computer (no cloud needed), and there is a port for GNOME, called GSConnect, which integrates with GNOME Shell, Nautilus, Chrome and Firefox.

The KDE Connect wiki has detailed instructions on how to install and configure the software. Below, I report what I did to install and configure GSConnect on my Linux box with Ubuntu 20.04.

GSConnect installation

Despite being available as the gnome-shell-extension-gsconnect package since Focal (20.04), it seems the recommended way to install GSConnect is via its GNOME Shell extension. So, I opened the extension’s page on the GNOME Shell Extensions website on my Linux desktop computer, and agreed to install the extension.

Alternatively, there is a manual installation procedure, consisting of downloading the latest release of the extension from GitHub and then installing it with the following command:

gnome-extensions install --force gsconnect@andyholmes.github.io.zip

A restart of GNOME Shell is then required, and there are different ways of doing it, depending on whether Wayland or X11/Xorg is in use (how to know the server in use).
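In practice, this boils down to checking which display server the session uses and then restarting GNOME Shell accordingly (a small sketch; the extension UUID matches the zip above):

# X11 or Wayland?
echo $XDG_SESSION_TYPE

# on X11: press Alt+F2, type "r" and hit Enter to restart GNOME Shell
# on Wayland: log out and back in

# then make sure the extension is enabled
gnome-extensions enable gsconnect@andyholmes.github.io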

Android installation

In parallel, the KDE Connect app needs to be installed on the phone from Google Play or F-Droid.

Then, the setup can continue following the standard instructions, both on the Android phone and on the desktop computer. In a couple of minutes, the mobile device should be paired with the desktop.

The GSConnect extension menu, after pairing with the mobile device

Several app features will work only after granting the respective Android permissions to KDE Connect. The app shows a list of the ones that are disabled, and tapping on one of them opens the corresponding system setting, where the permission can be granted.

For example, to share notifications with the desktop computer, open the app, tap on “Notification sync”, and then tap again to open the corresponding system setting and grant the permission.

To enable integration with Nautilus, in order to send files from the desktop computer to the mobile device, the python-nautilus package has to be installed:

sudo apt install python-nautilus
Easily send files from desktop to the configured destination folder on the mobile device
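If the new GSConnect entry doesn’t appear right away in Nautilus’ context menu, restarting Nautilus usually helps (a hedged note; nautilus -q simply quits the running instance so it reloads with the extension on the next launch):

# quit Nautilus so it picks up the newly installed extension
nautilus -q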

In case of problems, the Help section of the wiki contains troubleshooting instructions to follow.

Conclusions

The GSConnect desktop extension and the KDE Connect mobile app offer a long list of features that will take some time to explore and master. But together they almost let you forget you have a mobile phone while using a desktop computer.

As a final note, there are early releases of KDE Connect for Windows and macOS.

Smart speakers, even smarter, with Home Assistant

Wouldn’t it be nice if the Google Home in the living room warned us that a new live stream is on Twitch and, at the same time, started it on the TV? Or to have a Nest Mini that invites us to switch off some appliance right away, before ENEL cuts the power because we’re consuming too much?
All of this is possible, thanks to Home Assistant, a few yaml files, IoT devices placed here and there, and a handful of logic to connect them together.
In this session we’ll see how to configure a home automation hub, Home Assistant, maybe on a RasPi, connect it to a Google Nest smart speaker and create routines to carry out the most varied tasks.

(DevFest Italia 2020)

Raspberry PI, Logitech c920 and WebRTC videocall: #fail

I had the idea of creating a no-brainer, enjoyable setup for my parents to video call with me. I know I can use Skype, Hangouts or lots of similar apps on any PC / tablet, but for me no-brainer and enjoyable means sitting down on the living room couch, pressing a key somewhere, looking at the big TV screen and starting to talk with me.

There are solutions like Chromebox for Meetings (hardware here and here) or the GT Mini 3330, but the price is too high for a whim like mine. After several iterations, I’ve figured out a solution, although the final result is still not working for some reason. This post describes what I’ve done so far.

General architecture

TL;DR: a RasPI 2B connected to the TV, a Logitech c920 webcam, and a WebRTC video call in a fullscreen Chromium tab.

An easy way to do bidirectional video calls, nowadays, is to use dedicated apps like the aforementioned Skype, Hangouts, Ekiga and many others. But you often need a PC connected to the TV, you cannot automate the connection process, and most solutions use proprietary protocols. The alternative is to rely on the promises of the WebRTC project and create a multi-party video chat: nothing more than a browser and a camera is needed. Sounds good!

A Raspberry Pi 2 Model B is powerful and cheap enough to run such a setup and, connected to a TV screen via the HDMI port, can use the TV for video and audio output.

As for the camera, I went for a Logitech c920. Why? Because it’s supported out of the box on Raspbian, works with fswebcam, doesn’t need a powered USB hub, has a wide-angle lens very well suited for video calls, has a good directional microphone and can produce a hardware-encoded Full HD H.264 video stream (this can be checked with v4l2-ctl --list-formats), with no additional computational load on the RPi side. And it costs around 80 euros. Video quality is excellent and the recorded audio is loud and clean even if the speaker is 3 meters away from the camera: more than enough to cover the average distance between a couch and a TV :)
You may see a rainbow square at the top right of the display while using the camera: it’s the RPi under-voltage warning, so connect the RPi to a more powerful USB port or an external USB power source or, as a last resort, connect the camera to a powered USB hub.
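To double-check what the camera can deliver, v4l2-ctl from the v4l-utils package lists the supported formats and resolutions (a quick sketch; /dev/video0 assumes the c920 is the only camera attached):

sudo apt-get install v4l-utils

# list pixel formats (H.264 should appear among them)
v4l2-ctl --device=/dev/video0 --list-formats

# more detail: formats, frame sizes and frame rates
v4l2-ctl --device=/dev/video0 --list-formats-ext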

Regarding the software, nothing more than a WebRTC-compatible browser running on the RPi is required: IceWeasel and Chromium both have WebRTC support. To host the WebRTC call, there are plenty of solutions: AppRTC, Jitsi, OpenTokRTC, just to mention a few, are online services that can be used to start a WebRTC video call. Source code is even available for OpenTok and AppRTC, to run a personal server.
Once the service is selected, a script can open a fullscreen browser window pointing to the desired chat room URL as soon as the RPi graphic environment starts, so the only action required is to power up the Raspberry and wait to be connected. Bingo :)
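As a sketch of that script, on Raspbian with LXDE an autostart entry can launch Chromium directly into the room at boot; the room URL is just a placeholder, the exact autostart path depends on the LXDE session name, and the --use-fake-ui-for-media-stream flag (which skips the camera/mic permission prompt) is worth verifying on the Chromium version in use:

# ~/.config/lxsession/LXDE-pi/autostart
# "@" tells lxsession to restart the command if it crashes
@chromium-browser --kiosk --use-fake-ui-for-media-stream https://meet.jit.si/my-family-room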

Step-by-step config

Basic system
Downloaded NOOBS offline, formatted the SD card as a single FAT32 partition using GParted, and unzipped the NOOBS archive onto the SD card. Inserted the SD card into the RPi and booted. Selected Raspbian Jessie and waited for the installation to finish. Alternative instructions for Linux here.

Ran the usual triad of commands: sudo apt-get update, sudo apt-get dist-upgrade, sudo rpi-update

Browser
Regarding the browser to use… well, it’s complicated.
As a first try, I followed this guide, which suggested using Iceweasel. It was as simple as sudo apt-get install iceweasel, but unfortunately vLine is not active anymore, IceWeasel crashes with appRTC, and Jitsi.org doesn’t support Iceweasel < 40 (Jessie has v38 in the repo).
So I installed Chromium following this guide, using the precompiled armhf packages from Ubuntu Ports. This other guide works too, using the packages in the Launchpad Librarian.

Why Chromium v45? Because v47 has an annoying bug that prevents correct rendering of YouTube, appRTC and Jitsi, among others.

Time to start my first WebRTC video call: I opened a Jitsi link to launch a test call between the RasPi and my PC, authorised the website to use my mic and camera, and had both sides connected, with audio and video working as expected. Wow moment, I’m a happy man!!!!

Unfortunately, after 30 to 60 seconds, audio from the RasPi side stops working, and only the video is transmitted: I can see the video from the RasPi side of the call, but I cannot hear its audio. On the TV screen, instead, it’s possible to see and hear my stream. After tons of tests, I still don’t know why this is happening or how to solve it. A post on the Raspberry Pi forum isn’t helping either. This is the biggest blocker I have right now :(

Converge Hackathon: developers + designers + diversity. Is it even possible?

One of the cool aspects of my current job is the freedom I have to experiment with what I think is valuable and important for the developer ecosystem. This time I tried to tackle two aspects, both under the diversity umbrella: expertise mix and gender gap.

In collaboration with frog design (thanks Laura and Alex for the help), we envisioned a platform to experiment and iterate around these topics, so we created the “Converge Hackathon” format. Let’s analyse the main idea and its first implementation, held at the Google HQ in Milan on March 7th.

http://youtu.be/BSQb2oGXJDM

First, why a hackathon?

We all know what a hackathon is: a fixed amount of time to experiment with new things, get in touch with smart people and have fun with our passions. In addition, the “Converge Hackathon” aims to improve the collaboration between designers and developers during the whole process of thinking, refining and realizing an idea. Hence the name. And because I viscerally love the hackathon format ;)

Don’t be shy and… present!

How did the collaboration between developers and designers go?

Pretty well, I would say. This collaboration was one of the most acknowledged strengths of the event. Here are some of the attendees’ comments:

  • “Was challenging to work with stranger but at the same time interesting and funny. The best part was the division of the work”
  • “The collaboration was really good. It was my first time working with developers and I enjoyed a lot. Otherwise, I think it was needed a bit more of integration regarding with how the design and the coding could be merge”
  • “I’ve meet a lot of interesting people and different points of view on even the simplest thing”
  • “Good organization, very nice the initiative of mixing designers with developers and give an opportunity to work together”

Although it was challenging:

  • “I’m a designer. Speaking with Developer is very difficult because they only think in their square area.”
  • “At the beginning was difficult to know new people and get in touch with the developers”

To summarise: no pain, no gain when you start this kind of collaboration :) But the feedback showed that the audience gained a lot, despite some small pain.
We balanced the attendees with roughly 2/3 developers and 1/3 designers, and frog carefully selected the latter by reviewing their portfolios, profiles and activities. They wanted to be sure the right profiles were part of the crowd. As for developers, I let them in without any particular control. I trust in natural selection ;)
Another learning point was about team creation: such a diverse crowd requires focused pre-work to mix people properly, something that goes beyond the quick ice-breakers we did in the morning, which generally work well in a standard hackathon. Dedicating the right attention to this aspect is crucial.
One final consideration is about timing: a one-day event makes it hard to create something meaningful, and the ideation phase, which is generally very short during a normal hackathon because attendees are eager to “get their hands dirty with code”, was this time fostered, and mostly led, by the designers. The result was that the final hacks were more elaborate than the average I’ve generally seen, but with the drawback of prototypes less “working” than usual. As a note for us organisers: next time we need to keep the ideation process inside a given timeframe, otherwise the risk is that, once the first half of the event is gone, teams are still thinking about what to build.

Diversity? Really not an issue for this team
