One of the nice properties of delivering software that users install on their own machines is that once the software has been built and shipped, the cost of running it is handed over to the user. The compute cost per installation and per user is minimal for the company developing the software. Of course there are costs for things like support, but that is a different matter. However, having the customer provide the compute resources is not necessarily that easy when it comes to AI-based setups.
We have been playing around with Retrieval-Augmented Generation (RAG), which led me to wonder how easy it would be to ship such a beast as a product feature. In short, not very easy at all.
RAG Time
The scenario is that we have a product with extensive documentation that ships to users. A classic help system consists of the documents plus a reader and search system. The help system can be part of the product user interface or provided as HTML pages viewed in a browser. Either way, the product is delivered with a set of files, and the user runs the software and reads the documents on their own machine at their own compute cost.
Radically simplified, this is how a RAG works. It uses an LLM to format, collate, and summarize results from searches in a set of documents. The documents are first ingested into the system by passing them through an embedding model into a database. A user query is also turned into an embedding, and the most similar documents (or rather chunks of documents) are pulled out of the database. The LLM is then given the retrieved chunks and produces the final answer that the user sees.
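To make that flow concrete, here is a minimal Python sketch of the pipeline just described. The `embed()` and `generate()` functions are placeholders for whatever embedding model and LLM would actually be wired in, and the "database" is just an in-memory list; a real system would use a proper vector store and chunking strategy.

```python
# Minimal RAG sketch. embed() and generate() stand in for whichever embedding
# model and LLM the product actually ships with or calls out to.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: turn text into a fixed-size vector."""
    raise NotImplementedError("plug in an embedding model or API here")

def generate(prompt: str) -> str:
    """Placeholder: ask the LLM for a completion."""
    raise NotImplementedError("plug in an LLM or inference API here")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Ingestion: embed every documentation chunk and keep (vector, chunk) pairs.
def build_index(chunks: list[str]) -> list[tuple[np.ndarray, str]]:
    return [(embed(chunk), chunk) for chunk in chunks]

# Query: embed the question, pull the most similar chunks, let the LLM answer.
def answer(question: str, index: list[tuple[np.ndarray, str]], k: int = 3) -> str:
    q = embed(question)
    top = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)[:k]
    context = "\n\n".join(chunk for _, chunk in top)
    prompt = (
        "Answer the question using only the documentation below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```

Everything in this sketch except the `generate()` call is ordinary, shippable software; the rest of this post is essentially about that one call.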
As illustrated above, the RAG system consists of some software components, some data, and then the LLM. It is easy to see how the documentation and software components can be shipped as software to a user. The database could be built locally by the user, or shipped pre-built. All of that works as classic deliverable software. It is a bit larger and more complex than your average help reader, but it is still just software. If we ignore the LLM, the RAG system is essentially a search engine.
With an LLM and some good prompting, it can be something much more user-friendly, as it can stitch together a set of results into a single reply. It can provide answers made out of code to text queries, or combine distributed sets of facts into a whole.
The key is how users can be provided with access to that magic engine.
Structurally, LLMs are similar to other engines that are used as backends in software, like a database, a web server, or a Python interpreter. The problem is that they are materially different in their requirements.
The Problem with the LLM
LLMs are large. Very large. The above illustration does not really show the scale of the difference. Today, you cannot run a generally capable LLM locally on a laptop, and definitely not on your average developer VM running on a server. Yes, we have “AI PCs” and such, but the kinds of models that do a good job with RAG systems are more on the side of ChatGPT-4 than a locally running small model.
This makes deploying LLMs as part of a bigger solution a lot more complicated than regular old software.
If you want to run the LLM locally as a user, you will need a dedicated server full of AI acceleration engines or GPUs. This means standing up a server inside your organization and keeping it updated as new software releases are delivered and put into production, which is a lot more involved than just running software from your own account.
Delivering a cutting-edge LLM is also interesting. The huge amount of data and compute needed to train a model means that it is not something you create yourself unless you are a dedicated AI company. Instead, a likely delivery model is to ask users to download a model from a repository like HuggingFace and connect it to the application on their own.
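As a sketch of what that looks like from the application's side, the snippet below loads an open model through the HuggingFace transformers library. The model id is a placeholder, and a model big enough to do RAG well will typically not fit on an ordinary laptop.

```python
# Hypothetical local-model setup: the user has downloaded an open model from
# HuggingFace and the application drives it through the transformers library.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="some-org/some-open-llm",  # placeholder model id on HuggingFace
    device_map="auto",               # spread the model over whatever accelerators exist
)

out = generator("Summarize the installation chapter of the manual.",
                max_new_tokens=200)
print(out[0]["generated_text"])
```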
Alternatively, the LLM can be accessed as a service over an API. This appears to be a common model for company-internal LLM-based services today, and currently it is the only way to access the most capable models. Using LLM-as-a-service might be easier than setting up your own server, but it does mean that the RAG user has to enter into a commercial agreement with the LLM provider.
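In terms of the earlier sketch, the `generate()` placeholder then becomes an HTTP call rather than a local model. A minimal version, assuming an OpenAI-compatible chat completions endpoint, where the base URL, API key, and model name are all placeholders for whichever provider the customer signs up with:

```python
# LLM-as-a-service: the RAG system sends prompts to a remote inference API.
import os
import requests

BASE_URL = "https://llm.example.com/v1"   # placeholder provider endpoint
API_KEY = os.environ["LLM_API_KEY"]       # credential from the commercial agreement

def generate(prompt: str) -> str:
    response = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "provider-model-name",  # placeholder model id
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```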
Philosophically, it is also notable that a large LLM is way more complex in terms of its information content than both the software we are shipping and its documentation. The LLM contains billions of weights plus structure and has ingested insane amounts of data during its training. Using that engine to make access to some documents a bit easier feels a bit wasteful.
Fun with Stacks
Compared to classic processor-based compute, AI is a pain to deploy. Right now, you have to install libraries and toolchains that depend both on the frameworks you are using (PyTorch, Tensorflow, OpenVINO, …) and the particular hardware. It is like the early days of graphics accelerators for PCs, where each and every card came with its own custom drivers, and every game had to be adapted for every vendor API.
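To give a flavor of what that fragmentation looks like even within a single framework, here is a small PyTorch-only sketch of the device probing an application ends up doing. Each branch assumes a different driver and library stack is installed on the user's machine, and other frameworks need their own equivalent of this dance.

```python
# Even with one framework (PyTorch), the application has to probe for whatever
# accelerator and driver stack the user happens to have, and fall back to CPU.
import torch

if torch.cuda.is_available():             # NVIDIA GPUs via the CUDA stack
    device = torch.device("cuda")
elif torch.backends.mps.is_available():   # Apple Silicon via Metal
    device = torch.device("mps")
else:
    device = torch.device("cpu")          # no accelerator found

print(f"Running inference on: {device}")
```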
Today, games have none of these issues. Games for Windows simply use DirectX, and things work (or the equivalent for other platforms). I guess AI will get there in a few years' time, especially for client-side, smaller-scale AI deployments.
Provide the RAG as a Service?
A common solution to documentation today is to put it all on the web, and direct users to the servers of the software providers in order to access the documentation. Serving up a set of web pages plus documentation search is really cheap today, and nobody would consider that a problem.
Adding a RAG to help access online documentation is a very attractive user proposition – and it is technically easier than shipping a RAG system for local installation. However, the drawback is that you get saddled with the cost of running the system and in particular the cost of either hosting an LLM on your local servers or renting LLM inference from a provider. This means that the cost per query goes up many times compared to a plain searchable website.
If you want to provide some software for free – say to universities – that means each user acquired incurs some amount of cost. That has to be factored into any product plans, whereas in the past, handing a user a whole software package meant any runtime costs were on the user.
Adding user accounts to control access and possibly bill for usage adds a whole slew of complications compared to providing a generally accessible web site.
Another issue with docs on the web is that it assumes that all users can and are allowed to access the web (not a given when selling to certain security-conscious industries). Docs on the web also miss anything that is custom to a certain user or developed internally in a customer company. There are many cases where doing things locally just makes more sense.
Future?
It feels like LLMs are moving towards becoming a new type of fundamental infrastructure. It is early chaotic days still, but if we draw analogies to history there are some possibilities that leap out to me:
- Downloading and running LLMs should become easier as the underlying APIs they rely on to get computation done mature and become standardized, just as happened with graphics.
- LLMs might become a standard service available from your operating system or some middleware stack. I.e., no matter what computer you are using, there is an LLM service that is similar to the file system or graphical user interface – something that just has to be there. Different for each vendor, but doing “the same thing” in all contexts.
- Or maybe an LLM is more like “the Internet” – i.e., it is pointless to talk about it as something local. Instead, the local computer provides standard ways to reach out. Just like you always have a web browser or a library like curl around that lets you access the Internet.
In any case, I would argue that LLMs are a new type of software beast. They are a fundamental building block, but a building block that is more like a city block than a Lego brick in size. Not easy to just integrate and take along.