Fably and the State of Edge AI
In which we discuss my recent hobby of building AI-powered assistants at the edge and try to imagine a future in which we can run a smart home automation system even without an internet connection.
Several weeks ago, I resigned from my role as backend architect at Singularity 6. I have a new job lined up, but I managed to carve out several weeks for recharging and a staycation.
A weird thing happened though: all of the computers I was using for work were either the property of my former employer or accessible through cloud accounts they paid for.
Overnight, I found myself “compute poor” for the first time in literally decades.
Fat Clients or Thin Clients?
A staycation without a computer felt absurd, so I had a choice to make:
go “thin client” and embrace “local compute poverty”: a minimum viable local machine plus rented compute in the cloud (say, a Raspberry Pi 5 for $80 flat, plus something like vast.ai, which today prices an RTX 4090 at $300/month).
go “fat client” and increase my local compute wealth with something I personally owned (and could build exactly the way I wanted).
There were additional factors I quickly realized I had to consider:
I want to be able to play video games with my kids, and many of these only run well on Windows.
Some of these require me to use a VR headset.
I find ROI evaluations of exploratory endeavors extremely difficult and cognitively taxing.
The first point alone effectively killed the “thin client” path: while we do have all three major gaming consoles in the house, since I built a gaming PC for my son, we have entered the Steam path and that is a very sticky path indeed (both in terms of ergonomics and availability of options).
But the last point turned out to be even more important for me: building and tinkering are, for me, two sides of the same coin. I feel that a lot of what I do simply cannot work well with a “taxi meter” constantly reminding me that I’m paying by the hour. “Was this a good use of time?” somehow feels different to me than “Was this a good use of money/resources?”; the latter feels significantly harder to answer.
Fortunately, there is a way to maximize both points with the exact same purchase: top-of-the-line graphics cards are both a very powerful compute platform for gaming (so much so that any game runs with far more performance than necessary, and will for many years to come) and very good edge AI accelerators, reasonably priced for that purpose too.
“Wait”, I hear you say, “how is paying thousands of dollars for a graphics card reasonably priced?!”
In absolute dollars and on the surface, it does indeed feel insane to spend as much on the graphics card as on all of the other components of a high-end gaming PC combined. Ultimately, I ended up spending around $3500 on my gaming PC, exactly half of which ($1750) went to the graphics card. That is extremely different from what I did decades ago, when I would allocate 10% of the budget to the graphics card, not 50%.
The options I looked into were: the Nvidia RTX 4090 (24GB at $1750), the Nvidia RTX 4080 Super (16GB at $1000) and the AMD Radeon RX 7900 XTX (24GB at $900). What’s interesting is to look at them in terms of cost of compute:
4090 → $1750/82.58 Tflops = 21 $/Tflop
4080 Super → $1000/48.74 Tflops = 20.5 $/Tflop
7900 XTX → $900/61.39 Tflops = 14.6 $/Tflop
The 4080 Super has roughly the same $/op value but less memory, which disqualifies it: running out of dedicated memory is something I have run into before when training models. I could buy two and get 32GB, but that also means more money, a bigger motherboard, a bigger case, a bigger power supply, and hoping the two cards can pass data efficiently between each other (and these cards are NOT designed for that). The cheapest option was, on paper and in theory, the AMD 7900 XTX, but then we have to factor in two other things:
the state of the ML software ecosystem on AMD (ROCm instead of CUDA) is nowhere near as functional or as well supported.
it’s very much unclear whether those theoretical flops actually translate into real-life performance, and I couldn’t find a measure of int8 TOPs for this card anywhere. Real-life inference benchmarks show a very different picture (for example, see this stable diffusion benchmark) in which a 4090 is many times faster than a 7900. This might very well be related to the previous point: the hardware is there, but the software stack is not, and the software is failing to utilize the hardware as well as it could. Or it could be that the 7900 is optimized for floats and its int8 performance is not as good; I honestly don’t know, and the fact that I couldn’t find any resource on the internet to explain this is a concern in and of itself.
So is saving $850 worth the hassle of fighting against the grain of the entire momentum of the ML industry? As an opportunity for compute arbitrage, maybe.
But my goal is different: enjoy my staycation and gain knowledge in the edge AI space. I needed a tool, not a hobby, so Nvidia got my money.
Dollars per tera-operation
The one thing this exercise left me with is the itch to understand the cost of compute at the edge (for things like home automation, for example). What would it take for me to run Home Assistant with a local LLM fine-tuned for my own needs, and with text-to-speech (TTS) that didn’t suck or cost an arm and a leg in cloud fees (ElevenLabs, I’m looking at you)?
So my high-end gaming PC cost me $3500 to get stable, flat-fee edge access to roughly 80 Tflops. That’s about 44 $/Tflop, but that price also includes a top-of-the-line CPU with 16 cores and tons of L3 cache. It’s great for gaming, compiling, or running lots of apps smoothly in parallel, but it’s not really needed for our home assistant. Could we go cheaper? How does this compare with other things out there?
Let’s start with the cheapest computers I know of that can run a familiar Linux stack, the Raspberry Pis:
Raspberry Pi 4 (4GB) can run 12 Gflops for $45, or 3,750 $/Tflop
Raspberry Pi 5 (8GB) can run 30 Gflops for $80, or 2,666 $/Tflop
Yeah, no, that doesn’t work. Raspberry Pis were not designed for heavy computation or machine learning, so it’s not surprising that their $/op is way off. But that’s also true for high-end CPUs: they are optimized for very different types of workloads. What if we add an accelerator card?
Google sells edge TPUs as accelerators (branded Coral). These are very hard to find these days1, but if you can find one the math looks like this:
Raspberry Pi 5 (8GB, $80) + M.2 HAT ($10) + Coral M.2 accelerator ($25) = $115 with 4 TOPs (int8), which gives us about $29/TOP
NOTE: Google sells a Coral board with two TPUs on it, which provides even better $/TOP, but it requires 2 PCIe lanes and the RPi 5 only exposes one to the board (even though the SoC has 4). I wouldn’t be surprised if an RPi 6 fixed that, but that’s not the present.
What about boards that already ship with NPUs (neural processing units)? The Rockchip RK3588S SoC seems like an interesting option here since it comes with a 6 TOPs AI accelerator built in, and we can buy boards powered by it directly on Amazon:
Orange Pi 5 (16GB) at $127 with 6 TOPs, or $21.2/TOP
Now, it’s important to realize that TOPs and Tflops are NOT the same thing. TOPs are generally int8 operations while Tflops are f32 operations. Operating on integers in silicon is much easier than operating on floats, which means we can use a lot fewer transistors and less power to perform one operation.
For training, floating point precision matters a lot, but for inference that doesn’t seem to be the case2: things like Ollama and llama.cpp routinely run very large models quantized at 4-bit integers per parameter with surprisingly effective results, and they squeeze even more speed out of GPUs because there is half the memory to move around (and large models tend to be I/O bound, not compute bound). There is even talk of 2-bit quantization being feasible, which would increase speeds even more.
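To put rough numbers on what quantization buys us, here is a back-of-the-envelope sketch of the weight footprint of the 7-8B models I mention later, at different precisions (weights only, ignoring activations and the KV cache; the parameter counts are the ones quoted in this post):

def weights_gb(params, bits_per_param):
    # bytes needed to hold the weights alone at a given precision
    return params * bits_per_param / 8 / 1e9

for name, params in [("Llama3 8B", 8.0e9), ("Mistral 7B", 7.2e9)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits:>2}-bit: {weights_gb(params, bits):5.1f} GB")

# Llama3 8B: 16 GB at fp16, 8 GB at int8, 4 GB at 4-bit — every halving of
# precision halves the bytes the GPU has to move for each token.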
Note that if we are talking about int8 ops, then we need to re-evaluate the 4090 as well, because we were talking about f32 flops before. For int8 ops, the 4090 packs a mighty punch with a theoretical 1321 TOPs (at least according to Nvidia’s official documentation). If that figure is real (I still have to benchmark it on my machine), that gives us a cost per teraop of $1.32/TOP, which is nearly 4x better value than even the $5/TOP of the most powerful Coral accelerator.
But what about software support? Nvidia actually sells CUDA products in this space: the Jetson developer kits. They call it “robotics edge” but it’s exactly what we would need to run compute-intensive workloads at the edge.
The Jetson dev kits come in two flavors: expensive and eye-watering expensive.
Jetson Orin Nano at $500 with 40 TOPs, or $12.5/TOP
Jetson AGX Orin at $2000 with 275 TOPs, or about $7.3/TOP
Note how the Jetson AGX Orin is the cheapest of these edge options in terms of $/TOP, and it’s still more than 5x more expensive than a 4090.
But hey, how does this stack up against ultra-high-end AI accelerators? Here we have several competitors, although the actual prices are unknown and extremely handwavy; this is what I have been able to piece together from the internet:
Nvidia H100 - 3,026 TOPs at ~$30,000 gives us ~$10/TOP
Nvidia H200 - 3,958 TOPs at ~$40,000 gives us also ~$10/TOP
AMD MI300X - 2,600 TOPs at ~$15,000 gives us ~$5.7/TOP
This shows how the 4090 is actually a very reasonably priced accelerator for edge AI inference3.
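For reference, here is the whole back-of-the-envelope comparison in one place, using the street prices and the vendor-claimed peak int8 TOPs quoted above (nothing here is measured, so take it with the same grain of salt):

# (device, street price in USD, claimed peak int8 TOPs) — numbers from above
devices = [
    ("RPi 5 + Coral M.2",     115,    4),
    ("Orange Pi 5 (RK3588S)", 127,    6),
    ("Jetson Orin Nano",      500,   40),
    ("Jetson AGX Orin",      2000,  275),
    ("RTX 4090",             1750, 1321),
    ("AMD MI300X",          15000, 2600),
    ("Nvidia H100",         30000, 3026),
    ("Nvidia H200",         40000, 3958),
]

for name, price, tops in sorted(devices, key=lambda d: d[1] / d[2]):
    print(f"{name:<24} ${price / tops:6.2f}/TOP")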
But how many TOPs do we actually need?!
Ok, so, let’s say we want to create a Jasper-like assistant for our house that actually contains the smarts of the current breed of LLMs… how much compute do we need for that? Do we really need to go all the way to a 4090? Could we do it on a Jetson Orin Nano?
The simplest solution is: none, just use cloud APIs and let others pay for the hardware, cooling, power, and GPU sourcing, integrate the stack, and squeeze every little bit of performance out of it.
There are several problems with that approach:
those APIs cost cents per operation, but over months that can add up to be expensive (TTS, surprisingly, is the most expensive part of the stack these days). Also, I really don’t like having to think “would this request to my house genie be worth 2 cents?!”; again, I hate making these ROI evaluations.
the house doesn’t work when the internet is down (this is already the case today with stuff like Google or Alexa, but I really don’t like it)
your system is locked into these services
I’ve been working with this a lot lately for Fably, the storytelling companion that I built for my daughter. I was able to make a fully functional STT-LLM-TTS pipeline fit into a $50 device to tell stories to kids. It works extremely well, judge for yourself.
The above is using OpenAI APIs, and it’s actually amazing that the current AI tech stack has matured enough that I could achieve something like this in just a couple of weeks. Truly standing on the shoulders of giants here, both for ML and for embedded low-cost hardware and open source software stacks.
The problem is that each story ends up costing about $0.15.
I’m totally fine paying this money for my daughter to entertain herself when she wants (and she gets a kick out of a device that her dad built specifically for her), but I’m not really ok with the notion that each “hey Jasper, set a timer for 10 minutes” will cost me cents. We use “hey Google” a lot in my house for all sorts of things that manage the home, and I want to better understand what it would take to run this all locally (and also, as I fully expect Google to start charging me for this service at some point, to understand what a reasonable price point for it would be).
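Just to make the “it adds up” point concrete, here is a toy estimate; both the per-request cost (the couple of cents discussed above) and the household volume are illustrative assumptions, not measurements:

# Hypothetical usage: how “cents per request” compounds over time.
requests_per_day = 50      # assumed household volume, purely illustrative
cost_per_request = 0.02    # ~2 cents per request, as discussed above
monthly = requests_per_day * cost_per_request * 30
print(f"~${monthly:.0f}/month, ~${monthly * 12:.0f}/year in API fees")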
After a lot of experimentation, I was able to build an STT/LLM/TTS pipeline that works reasonably well as an emulation of OpenAI’s but runs efficiently on my local 4090 (a minimal sketch of how the pieces chain together follows the list):
for STT, I’m using Whisper. This is the exact same model that OpenAI is using, so the quality should be equivalent. I also realized that, at least in quiet conditions, Whisper-tiny works reasonably well even though it’s just a 39M parameter model.
for the LLM, I’m using Ollama, and generally I’ve had good experiences with both Llama3 8B (8000M parameters) and Mistral 7B (7200M parameters), both at 4-bit quantization.
for TTS, I’m using WhisperSpeech, which is still very much a work in progress but has amazingly good prosody even as a tiny model (80M parameters).
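Here is roughly what that looks like glued together. This is a minimal, blocking sketch: the file names, the prompt, and the Ollama endpoint are just illustrative defaults, and all error handling, streaming, and wake-word logic is omitted.

import requests
import whisper
from whisperspeech.pipeline import Pipeline

# STT: transcribe the recorded question with Whisper-tiny (39M parameters).
stt = whisper.load_model("tiny")
question = stt.transcribe("question.wav")["text"]

# LLM: ask a locally running Ollama server (Llama3 8B at 4-bit quantization).
answer = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": question, "stream": False},
).json()["response"]

# TTS: speak the answer back with the tiny WhisperSpeech model.
tts = Pipeline(s2a_ref="whisperspeech/whisperspeech:s2a-q4-tiny-en+pl.model")
tts.generate_to_file("answer.wav", answer)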
In fact, the 4090 turns out to be overkill for this:
STT is nearly instantaneous for short voice queries with a handful of words.
the LLM achieves something like 130 tokens/second which is MUCH faster than even OpenAI cloud APIs (granted, the quality of the storytelling for my local stack is still TBD)
TTS is able to generate 20 seconds of speech in about 1.5 seconds, so it’s roughly 13x realtime. This means that even something an order of magnitude slower would still be good enough for my needs.
So what is the minimal amount of edge compute power that we need to run a reasonable STT/LLM/TTS pipeline? That’s what I’m going to try to find out next.4
The software elephant in the room
ML models are an even softer kind of software. Squishware, if you wish. The current state of the open source ML software stack is abysmal compared to anything else that I have experienced as a software engineer.
Let me show you what I mean with an example: text-to-speech.
I have text and I want the computer to speak the text to me. Easy enough, let’s use espeak:
sudo apt install espeak-ng-espeak
espeak "Hi, this is a computer talking to you"
which sounds like this
This might sound endearing if you’re a fan of Stephen Hawking or Dr. Sbaitso but I know nobody in my family would trust a system that speaks like that to be intelligent or useful.
HuggingFace’s TTSArena shows that StyleTTS2 and XTTS are the best open source TTS models out there, so let’s try those. StyleTTS2 doesn’t come with an official python package, while XTTS does, so let’s start with that one first:
conda create -n xtts python
conda activate xtts
pip install tts
which… fails with this fantastically uninformative error:
ERROR: Ignored the following versions that require a different python version: 0.0.10.2 Requires-Python >=3.6.0, <3.9; 0.0.10.3 Requires-Python >=3.6.0, <3.9; 0.0.11 Requires-Python >=3.6.0, <3.9; 0.0.12 Requires-Python >=3.6.0, <3.9; 0.0.13.1 Requires-Python >=3.6.0, <3.9; 0.0.13.2 Requires-Python >=3.6.0, <3.9; 0.0.14.1 Requires-Python >=3.6.0, <3.9; 0.0.15 Requires-Python >=3.6.0, <3.9; 0.0.15.1 Requires-Python >=3.6.0, <3.9; 0.0.9 Requires-Python >=3.6.0, <3.9; 0.0.9.1 Requires-Python >=3.6.0, <3.9; 0.0.9.2 Requires-Python >=3.6.0, <3.9; 0.0.9a10 Requires-Python >=3.6.0, <3.9; 0.0.9a9 Requires-Python >=3.6.0, <3.9; 0.1.0 Requires-Python >=3.6.0, <3.10; 0.1.1 Requires-Python >=3.6.0, <3.10; 0.1.2 Requires-Python >=3.6.0, <3.10; 0.1.3 Requires-Python >=3.6.0, <3.10; 0.10.0 Requires-Python >=3.7.0, <3.11; 0.10.1 Requires-Python >=3.7.0, <3.11; 0.10.2 Requires-Python >=3.7.0, <3.11; 0.11.0 Requires-Python >=3.7.0, <3.11; 0.11.1 Requires-Python >=3.7.0, <3.11; 0.12.0 Requires-Python >=3.7.0, <3.11; 0.13.0 Requires-Python >=3.7.0, <3.11; 0.13.1 Requires-Python >=3.7.0, <3.11; 0.13.2 Requires-Python >=3.7.0, <3.11; 0.13.3 Requires-Python >=3.7.0, <3.11; 0.14.0 Requires-Python >=3.7.0, <3.11; 0.14.2 Requires-Python >=3.7.0, <3.11; 0.14.3 Requires-Python >=3.7.0, <3.11; 0.15.0 Requires-Python >=3.9.0, <3.12; 0.15.1 Requires-Python >=3.9.0, <3.12; 0.15.2 Requires-Python >=3.9.0, <3.12; 0.15.4 Requires-Python >=3.9.0, <3.12; 0.15.5 Requires-Python >=3.9.0, <3.12; 0.15.6 Requires-Python >=3.9.0, <3.12; 0.16.0 Requires-Python >=3.9.0, <3.12; 0.16.1 Requires-Python >=3.9.0, <3.12; 0.16.3 Requires-Python >=3.9.0, <3.12; 0.16.4 Requires-Python >=3.9.0, <3.12; 0.16.5 Requires-Python >=3.9.0, <3.12; 0.16.6 Requires-Python >=3.9.0, <3.12; 0.17.0 Requires-Python >=3.9.0, <3.12; 0.17.1 Requires-Python >=3.9.0, <3.12; 0.17.2 Requires-Python >=3.9.0, <3.12; 0.17.4 Requires-Python >=3.9.0, <3.12; 0.17.5 Requires-Python >=3.9.0, <3.12; 0.17.6 Requires-Python >=3.9.0, <3.12; 0.17.7 Requires-Python >=3.9.0, <3.12; 0.17.8 Requires-Python >=3.9.0, <3.12; 0.17.9 Requires-Python >=3.9.0, <3.12; 0.18.0 Requires-Python >=3.9.0, <3.12; 0.18.1 Requires-Python >=3.9.0, <3.12; 0.18.2 Requires-Python >=3.9.0, <3.12; 0.19.0 Requires-Python >=3.9.0, <3.12; 0.19.1 Requires-Python >=3.9.0, <3.12; 0.2.0 Requires-Python >=3.6.0, <3.10; 0.2.1 Requires-Python >=3.6.0, <3.10; 0.2.2 Requires-Python >=3.6.0, <3.10; 0.20.0 Requires-Python >=3.9.0, <3.12; 0.20.1 Requires-Python >=3.9.0, <3.12; 0.20.2 Requires-Python >=3.9.0, <3.12; 0.20.3 Requires-Python >=3.9.0, <3.12; 0.20.4 Requires-Python >=3.9.0, <3.12; 0.20.5 Requires-Python >=3.9.0, <3.12; 0.20.6 Requires-Python >=3.9.0, <3.12; 0.21.0 Requires-Python >=3.9.0, <3.12; 0.21.1 Requires-Python >=3.9.0, <3.12; 0.21.2 Requires-Python >=3.9.0, <3.12; 0.21.3 Requires-Python >=3.9.0, <3.12; 0.22.0 Requires-Python >=3.9.0, <3.12; 0.3.0 Requires-Python >=3.6.0, <3.10; 0.3.1 Requires-Python >=3.6.0, <3.10; 0.4.0 Requires-Python >=3.6.0, <3.10; 0.4.1 Requires-Python >=3.6.0, <3.10; 0.4.2 Requires-Python >=3.6.0, <3.10; 0.5.0 Requires-Python >=3.6.0, <3.10; 0.6.0 Requires-Python >=3.6.0, <3.10; 0.6.1 Requires-Python >=3.6.0, <3.10; 0.6.2 Requires-Python >=3.6.0, <3.10; 0.7.0 Requires-Python >=3.7.0, <3.11; 0.7.1 Requires-Python >=3.7.0, <3.11; 0.8.0 Requires-Python >=3.7.0, <3.11; 0.9.0 Requires-Python >=3.7.0, <3.11
ERROR: Could not find a version that satisfies the requirement tts (from versions: none)
ERROR: No matching distribution found for tts
It turns out (and yes, it took me a long time to find out why) that my fresh conda environment had picked up Python 3.12, and Dynamo (the PyTorch compiler) is not compatible with Python 3.12; it only works with 3.11 or below, which is why no published version of the tts package will install there.
So what we need to do instead is:
conda create -n xtts python=3.11
conda activate xtts
pip install tts
which results in a successful installation. And note how this is why we use conda and not just venvs: Python venvs can’t change the version of Python they’re running!
After which (several GB of software downloaded and installed later!) we can write a very simple python program:
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# generate speech by cloning a voice using default settings
tts.tts_to_file(
    text="Hi, this is a computer talking to you.",
    file_path="output.wav",
    speaker_wav="speaker.wav",
    language="en",
)
and ends up with something like this
which… I mean, yeah, it’s a lot better than espeak, but it’s not really that good either. And yes, in theory this can do all languages and even voice cloning, but in practice I don’t really care (and the voice cloning quality is bad). I just want something that speaks like it understands what it’s saying.
Ok, so let’s use StyleTTS2 instead. This one doesn’t have an official python package so we need to look at the code ourselves. A few hours later, I have something that works
import phonemizer as pho
from styletts2 import tts
from nltk import tokenize

def phonemize(phonemizer, text):
    text = text.strip()
    text = text.replace('"', '')
    phonemes = phonemizer.phonemize([text])
    tokens = tokenize.word_tokenize(phonemes[0])
    return ' '.join(tokens)

text = "Hi, this is a computer talking to you."

my_tts = tts.StyleTTS2()
phonemizer = pho.backend.EspeakBackend(
    language="en-us", preserve_punctuation=True, with_stress=True
)
text = phonemize(phonemizer, text)
my_tts.inference(text, phonemize=False, output_wav_file="output.wav")
which ends up sounding like this
which is better, admittedly. It turns out there is a Python package for this, but it’s not official. This matters because I just want to add the package as a pip dependency and move on.
But StyleTTS2 needs a phonemizer: a program that translates text into phonemes and a prosodic description; StyleTTS2 only generates the audio from that.
The “phonemizer” Python package uses espeak under the hood, but espeak is not MIT licensed, so the only phonemizer compatible with the MIT license is gruut, which is far worse in terms of prosodic quality. So if we use that, or didn’t know we needed to worry about any of this, we end up with something that sounds like this:
which, ok, seems similar enough but how would it work with longer sentences? This is the text I asked the model to speak
Once upon a time in a quaint village nestled between rolling hills and lush forests, lived a little girl named Anna. Anna was an imaginative and curious child with sparkling blue eyes and a heart full of wonder. She spent her days exploring the woods, reading books, and dreaming of magical adventures.
the model said this
“nestled beat… ween”? “anna anna”? <facepalm> This does not sound like a machine understanding what it is saying.
Ok, so I spent several more days going down the list of open source TTS models, despairing as they got worse and worse… until I found a relatively new project that was not even listed there: WhisperSpeech.
The crucial part of the WhisperSpeech project is that it tries to learn prosody directly from speech in an auto-encoding way. This seemed obvious to me, but I was surprised that nobody else was doing it (in open source models). The result is surprisingly excellent:
And there is a python package!
Ok, so let’s go install it
conda create -n whisperspeech python
conda activate whisperspeech
pip install whisperspeech
and then write a minimal program to test it
import time
from whisperspeech.pipeline import Pipeline

tts_pipe = Pipeline(s2a_ref='whisperspeech/whisperspeech:s2a-q4-tiny-en+pl.model')

start = time.time()
tts_pipe.generate_to_file("output.wav", "Hi, this is a computer talking to you")
print(f'Took: {time.time() - start:.2f}s')
which generates this
and took 2.8 seconds to generate 2 seconds of audio. This is way too slow for us. But that’s weird: why is it so slow? This is an 80M parameter model quantized at 4 bits per parameter, and my graphics card can do more than a petaop (1,000+ TOPs) of int8 operations.
Can we make it go faster? It turns out that, by default, PyTorch runs the model in interpreted (eager) mode, but if we pass torch_compile=True, the model gets compiled into optimized CUDA kernels and runs optimized on our graphics card. So let’s try that… but now it fails with a different error:
RuntimeError: Dynamo is not supported on Python 3.12+
which, yeah, is the same problem we had before with XTTS. So let’s trash this conda environment, create another one with the right Python, and now it works fine: it took 0.53s, which is more than 5x faster! Not bad for a single flag change!
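For completeness, the only change to the earlier timing script is that one flag (in the WhisperSpeech version I used it goes to the Pipeline constructor), plus the Python ≤3.11 environment so that Dynamo can actually run:

import time
from whisperspeech.pipeline import Pipeline

# Same tiny model as before, but compiled into optimized kernels by Dynamo.
tts_pipe = Pipeline(
    s2a_ref='whisperspeech/whisperspeech:s2a-q4-tiny-en+pl.model',
    torch_compile=True,
)

start = time.time()
tts_pipe.generate_to_file("output.wav", "Hi, this is a computer talking to you")
print(f'Took: {time.time() - start:.2f}s')  # ~0.5s on my 4090 once the kernels are warm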
But let’s get this straight: this is just for TTS and for models that were already trained, finetuned and quantized for edge use… and using the most used language for ML (python), the most used package manager (pip), the most common ML framework (pytorch) and the most common computation environment (CUDA).
On one hand, it’s amazing that I was able to come up with a reasonable clone of OpenAI’s TTS API in a few days on my own and make it run efficiently and effectively on my machine.
On the other, this was by far the path of least resistance. Could I make this work on cheaper hardware, given that this is just an 80M parameter model? At this point, it feels very, very hard to me.
What would it take for this to run on Coral? I would probably have to rewrite the model in TensorFlow and hope that all of the operations used by the model are supported in TensorFlow Lite.
Could I make it work on an Orange Pi? PyTorch has ~2,000 ops; the probability of somebody rewriting them all for the RK3588S feels exactly zero. Is there even a single ML framework that supports the RK3588S? I couldn’t find one.
There are several projects out there that try to fix this problem (Mojo and Tinygrad come to mind) but it’s unclear whether either of those approaches will make progress fast enough against Nvidia’s dominance of the ML stack.
So, there we go: edge ML is being turbocharged by open source ML models, and it holds a lot of promise even just gauging by how much I could do on my own in just a few weeks, but the state of the software stack remains turbulent and uncertain, which gives me some pause.
1. Google, unlike Nvidia, does not seem to understand how important it is to create a smooth onboarding ramp into its cloud offerings. It is baffling to me that Google can simultaneously exhibit such engineering excellence in creating things like cloud TPUs (which reach 4,300 int8 TOPs per chip, and you can have thousands of them in a high-speed fabric!), especially for very large scale ML training, and completely miss the boat on “onboarding” and “onramping”, which is why Nvidia is the center of everyone’s attention these days (both financially and operationally) despite being objectively years behind in terms of large scale deployments. Google should be selling 4090-class PCIe NPUs at cost and incorporating support for them in TensorFlow AND PyTorch, just as a gateway drug for their cloud TPU services. Why don’t they? Despite the fact that I used to work in the group that came up with Coral and did all of the edge ML research at Google, I still have no idea.
2. Yes, I know about QLoRA, and it’s entirely possible that one can use edge accelerators for fine-tuning without floating point, but I don’t know enough about this to judge.
3. All right, so why don’t people just buy a ton of 4090s instead of H100s for cloud hyperscaling? Because $/TOP is not the only metric that matters: I/O, memory capacity, bandwidth, and the ability to compose larger compute fabrics matter too. A collection of 4090s with 24GB each is significantly less performant than a collection of H100s with 80GB each connected via NVLink instead of PCIe, so the whole $/TOP equation changes rather dramatically when we need petaflops or exaflops.
4. Another, much bigger, open question is how an end-to-end multi-modal model like GPT-4o could change this equation. AFAIK, there aren’t any open source multi-modal models that incorporate STT and TTS, although I would be very surprised if they didn’t show up within the next 12 months. Still, considering that the STT and TTS parts add up to just ~120M parameters compared to the 8B of the LLM, it is reasonable to suspect that the cost of running the LLM alone is a good proxy for the cost of running the entire audio pipeline end-to-end.