The Swedish research institute RI.SE hosted an “Artificial Intelligence and Computer Science day” (AI and CS day) last week. RI.SE has a long tradition of hosting interesting open houses, both as RI.SE and in their previous guise as SICS. The day was a mix of organized talks in the morning and an open house where RI.SE researchers showed off their work in the afternoon. Most of the AI discussions were related to large language models (LLMs), but not all. I got some new insights about LLMs in general and using LLMs for coding in particular.
Not-LLM for Coding
Sahar Asadi, Director of the AI labs at King, presented their work on using AI agents to increase programmer and designer productivity. King is the company behind casual games like Candy Crush Saga. They have been looking at AI agents since as far back as 2015, and have built up some interesting game-playing agents over the years. These agents are not based on LLMs, thankfully.
The problem she described was the creation of new levels for games like Candy Crush. This happens in three steps, as shown below (sketched from my notes following a slide she showed):
The create step is fun. The balance step is key to getting out a good level that keeps players engaged and having fun. However, the balance step is a slog for humans, as it means playing the same level over and over again, with tweaks, trying to find any bugs or issues. Levels are continuously added to their games. For example, Candy Crush Saga currently has something like 14,000 levels available! And there are players who are at the end of that sequence waiting for more.
What King AI has done is to create an AI agent that can help with the balance part, by playing the game in a manner similar to a human. This is a brilliant application of AI, in that it helps humans and makes them more productive. Creating such an agent has not been easy. Sahar went through several aspects of the process and how their architecture has changed over time.
At its core, they use a deep learning approach with convolutional neural networks (CNNs). The job of the CNN is to take a board state (how to encode the board state is a huge topic in itself) and provide a prediction of the move a “typical” human player would make. The goal is not to create a perfect player, but one that behaves like a human (a perfect player would be conceptually easier to build: just a search problem over a large but bounded search space).
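To make the description concrete, here is a minimal sketch of what such a policy network might look like in PyTorch. The board encoding, layer sizes, and move enumeration are all my own placeholder assumptions, not King’s actual architecture.

```python
# Minimal sketch (my own assumptions, not King's architecture): a CNN that
# takes an encoded match-3 board and predicts which move a "typical" human
# player would make next.
import torch
import torch.nn as nn

class MovePredictor(nn.Module):
    def __init__(self, channels=16, board_size=9, num_moves=200):
        # channels: one plane per candy type/feature in the board encoding
        # num_moves: size of some fixed enumeration of possible swaps
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * board_size * board_size, num_moves),
        )

    def forward(self, board):
        # board: (batch, channels, board_size, board_size) tensor
        logits = self.head(self.features(board))
        # Softmax over moves: a distribution over what a human would likely
        # do, rather than an argmax "perfect" move.
        return torch.softmax(logits, dim=-1)

# Example: predict move probabilities for one random board encoding.
model = MovePredictor()
board = torch.randn(1, 16, 9, 9)
print(model(board).shape)  # torch.Size([1, 200])
```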
They started out with supervised learning but have moved on to reinforcement learning to scale up the training process. Creating a proper reward function is another engineering/tweaking problem. Right now, they are looking at building multi-agent models to let them capture tuning parameters like how skilled a certain human is. She talked about a “skill library” model where each agent has a certain type of skill and covers a different aspect of the game play. Combining the moves proposed by the different “skills” provides a way to model differently skilled human players.
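My loose interpretation of the skill-library idea, in code: each skill agent scores the candidate moves, and a player profile weights how much each skill contributes to the final choice. The skill names and weights are invented for illustration.

```python
# Sketch of the "skill library" idea as I understood it (names and weights
# are my own invention): each skill agent scores the possible moves, and a
# player profile mixes those scores to imitate players of different skill.
import numpy as np

def combine_skills(move_probs_per_skill, skill_weights):
    """Mix per-skill move distributions into one 'human-like' distribution."""
    mixed = np.zeros_like(next(iter(move_probs_per_skill.values())))
    for skill, probs in move_probs_per_skill.items():
        mixed += skill_weights.get(skill, 0.0) * probs
    return mixed / mixed.sum()

# Hypothetical skills, each returning a probability per candidate move.
num_moves = 4
move_probs_per_skill = {
    "clear_blockers": np.array([0.7, 0.1, 0.1, 0.1]),
    "build_combos":   np.array([0.1, 0.6, 0.2, 0.1]),
    "random_play":    np.full(num_moves, 1.0 / num_moves),
}

# A "casual" player profile leans on random play; an "expert" leans on combos.
casual = combine_skills(move_probs_per_skill,
                        {"clear_blockers": 0.3, "build_combos": 0.1, "random_play": 0.6})
expert = combine_skills(move_probs_per_skill,
                        {"clear_blockers": 0.4, "build_combos": 0.6, "random_play": 0.0})
print(casual.round(2), expert.round(2))
```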
For the future, King was indeed looking at generative AI technology to support the level creation process.
The key take-away is that AI agents can be applied to speed up routine work like playtesting. But also that building good agents takes years, iterations, persistence, and the willingness to constantly try new techniques and change the existing system. I found this example quite interesting in that it did not directly address “writing code”, but rather “building value”.
Google Assistant and Bard
It was quite interesting to see the different views on LLMs and generative AI presented by two speakers: Edward Chi from Google, and Henry Tirri from SiloGen. Edward Chi is part of the team building Google’s chatbot search, Bard.
Edward Chi was very gung-ho about the prospects and speed of development of LLMs. He spent a lot of time on the use of generative AI to create personal assistants and better search engines. For him, the revolution is that with LLMs we get a single tool that can do many different tasks. Another important aspect is that LLMs can use context to produce better results. Classic assistants (Apple Siri, Google Assistant, Amazon Alexa, …) are all transactional. With an LLM, you can imagine an assistant that uses a long history of interactions and other knowledge to produce more relevant results, with less work on the part of the user to inform the agent.
In his view, the web search revolution of Google and similar services made us smarter. LLMs will take that technology enhancement of our abilities as humans to the next level.
He likes to separate LLMs from machine learning – machine learning is mechanical and limited, while LLMs are AI, where you teach the model like you teach a child. He really is very close to declaring that LLMs are intelligent. In line with that, Edward made the claim that LLMs have common sense and can reason. It is not just statistical text completion; something more is going on in the models. He brought up his own recent experiments in chain-of-thought prompting, where you provide an LLM with (input, reasoning, output) examples and then ask it to produce results in the same vein for a different input. To him, this proves the system can reason.
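Chain-of-thought prompting, as I understand it, boils down to putting worked (input, reasoning, output) examples in the prompt before the new question. A small sketch of building such a prompt (the example content is mine, not from the talk):

```python
# Sketch of chain-of-thought prompting: show the model worked examples with
# explicit reasoning, then ask it to answer a new input in the same style.
examples = [
    {
        "input": "Roger has 5 tennis balls. He buys 2 cans of 3 balls. How many balls does he have?",
        "reasoning": "He starts with 5. Two cans of 3 is 6 more. 5 + 6 = 11.",
        "output": "11",
    },
]

def build_cot_prompt(examples, new_input):
    parts = []
    for ex in examples:
        parts.append(f"Q: {ex['input']}\nReasoning: {ex['reasoning']}\nA: {ex['output']}")
    parts.append(f"Q: {new_input}\nReasoning:")
    return "\n\n".join(parts)

prompt = build_cot_prompt(
    examples,
    "The cafeteria had 23 apples. They used 20 and bought 6 more. How many are left?",
)
print(prompt)  # send this to whatever LLM API you are using
```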
Edward also pointed out some issues with AI in the guise of LLMs:
- Responsibility and safety
- Struggling with factuality, grounding in truth, and attribution
- The feedback loop where AI is trained on AI-generated content – indeed this is going to be a good way to generate meaningless noise that looks like real text.
- Personalization and user memory, how to maintain a user-focused context (for the assistant use case).
Philosophy and Theory of LLMs
Henry Tirri had some more philosophical comments about LLMs. He used to do AI research back in the 1990s if I understood correctly, and has worked on a number of commercial applications for various types of AI over the years.
For him, LLMs are a scaling exercise. Statistical language models are an old idea (going back at least to the early 1990s). But adding the sheer volume of data that we have in place thanks to the Internet makes for a real breakthrough.
The real surprise here is just how much the models have achieved, which probably tells us something about the nature of language. Humans use language in a more regular way than we perhaps thought. We also use language to explicitly reason about the external world, and given enough language volume, the models somehow manage to capture at least a semblance of reasoning. You can also look at LLMs, and especially the current chatbots, as a user interface breakthrough.
A very profound point was that learning (in general) is compression, and compression can probably be thought of as learning. LLMs can be thought of as lossy compression with a random seed for the reconstruction of data – if the compression were lossless, the model would just perfectly reproduce its input with no variation or novelty. The training data must be bigger than the representational power of the model, and a model with more parameters will require more input data to avoid over-training and getting stuck.
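A toy way to see the lossy-compression-with-a-seed point: a tiny bigram character model “compresses” a text into transition counts, and sampling from those counts with a seed “reconstructs” text that resembles the input without reproducing it. This is my own illustration, vastly simpler than any real LLM.

```python
# Toy illustration of "lossy compression with a random seed": a bigram
# character model stores only transition counts, and sampling from it
# regenerates text that resembles, but does not reproduce, the input.
import random
from collections import defaultdict

text = "the model learns the language but not the exact text of the input"

# "Compress": count which character follows which.
transitions = defaultdict(list)
for a, b in zip(text, text[1:]):
    transitions[a].append(b)

# "Decompress": regenerate text by sampling from the counts, seeded.
def reconstruct(seed, length=60, start="t"):
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        out.append(rng.choice(transitions.get(out[-1], [" "])))
    return "".join(out)

print(reconstruct(seed=1))
print(reconstruct(seed=2))  # different seed, different plausible-looking output
```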
As I have noted before, there is no actual knowledge in an LLM. For an LLM, there is no difference between a correct and an incorrect answer. The model inherently has no way to tell the difference. Hallucinations are a necessary and unavoidable part of the system. The only way to get rid of bad answers is to prune them in a post-processing stage. This is a fundamental limit on what you can do with LLMs.
Any use of an LLM without a fact-checker at the end should be considered fundamentally invalid.
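In code terms, that fact-checker is a mandatory wrapper around the model: generate, then verify, and prune anything that cannot be grounded. A sketch where every helper function is a hypothetical placeholder, not any real library’s API:

```python
# Sketch of "LLM plus mandatory fact-checker": generate_answer,
# extract_claims, and check_against_sources are hypothetical placeholders.
def answer_with_check(question, generate_answer, extract_claims, check_against_sources):
    draft = generate_answer(question)          # raw LLM output, may hallucinate
    claims = extract_claims(draft)             # factual statements to verify
    unverified = [c for c in claims if not check_against_sources(c)]
    if unverified:
        # Prune rather than present: the model itself cannot tell good from bad.
        return None, unverified
    return draft, []
```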
One application of LLMs that he looked forward to is real-time translation between different human languages. Not quite there yet, but coming. This is a good application for LLMs since small mistakes likely will not matter and humans are inherently in the loop (do not use these LLMs to negotiate war and peace, please).
He also noted that standard big-data machine learning (not generative AI/LLMs) is going into next-generation telecom standards. The 6G protocols being proposed somehow rely on ML to function (details not clear from the presentation).
When it comes to risks and regulation of AI, Henry is not a huge fan of regulation in computer science – in his view, such regulation has so far all gone horribly wrong. Ethics are not technology, and what makes more sense is to regulate the use of technology rather than the technology itself. I am not entirely convinced by this argument – in other areas of regulation, it makes sense to outlaw things like whole classes of chemicals. Outlawing or restricting the use of LLM technology in certain fields seems like a sane idea.
Finally, Henry talked about the business of the company he is working for:
This is an instantiation of a general pattern that quite a few companies are applying today. You start with a general model (like GPT-4), train it with data specific to a domain to create a domain-specific but still fairly general model, and finally, crucially, fine-tune the model using data private to a certain business. Speaking of the base model, apparently SiloGen has been involved in creating a European foundation model using what must be the Lumi supercomputer in Finland.
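As a rough sketch of that two-stage pattern, here is what the continue-training steps could look like using the Hugging Face libraries. Model names, corpus files, and hyperparameters are placeholders of my own, not SiloGen’s actual setup.

```python
# Sketch of the general-model -> domain-model -> customer-model pattern
# (my own illustration; model names, files, and hyperparameters are
# placeholders, not SiloGen's actual setup).
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

def specialize(base_model, corpus_files, output_dir):
    """Continue causal-LM training of a base model on a more specific corpus."""
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(base_model)

    raw = load_dataset("text", data_files=corpus_files)["train"]
    dataset = raw.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"],
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                               per_device_train_batch_size=4),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir

# Step 1: general base model -> domain model (public domain-specific data).
domain_model = specialize("some-open-base-model", ["domain_corpus.txt"], "domain-model")
# Step 2: domain model -> customer model (a business's private data).
customer_model = specialize(domain_model, ["private_corpus.txt"], "customer-model")
```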
LLMs for Coding
The topic of using LLMs to improve programmer productivity came up in conversations during the day. Many people report significant success when using LLMs to generate code; figures like 40-50% more code being produced are a common estimate. However, this is typically achieved in “simple coding” – where the alternative is to learn a new API, or more likely to Google around for answers and read StackOverflow. LLMs are supposedly an excellent tool for getting started with some new framework or API.
However, this only works for cases where the APIs are part of the training data. This means that for company-internal APIs, or APIs where there is little code out in the world, the results will not be good. The take-away is really that the SiloGen business described above is key to enterprise use of LLMs: the models must be fine-tuned using your own code, code examples, and documentation so that the tools learn what you do in your specific environment.
I would claim that LLMs are fairly useless for precision coding or novel algorithm development. They are there for bulk coding that has been done many times before, but not for the interesting (real) programming, i.e., solving new problems in a new way. Or solving problems with precision and thought.
Another problem with coding LLMs is that they seem to be mostly trained on code. This is not necessarily a good idea, as it amounts to learning how systems operate purely by example. I would rather train with a heavy emphasis on documentation, application notes, and best-practice examples. That is what a good human programmer would do – not just google for some random code, but also read the docs and check what the API or language is REALLY supposed to do.
Having an LLM help find authoritative best practices, or even mandated practices, for how to solve common tasks within a framework or API would be fantastic. But that is not what is being offered today, it seems. It is not even clear to me that an LLM would get the connection between API descriptions in text and the generation of corresponding code. It sounds like good examples are going to be even more important for training machines.
Another potential risk is that LLMs harm the innovation and evolution of systems. Given that they are trained on existing code, they essentially amount to the automation of copy-paste coding from old code, which is known to fossilize and reproduce bad practice. In theory, this can probably be solved by weighting the training data based on the age of the code or known language versions. For example, it is better to learn from examples in C++20 than in C++11, and from Python 3 rather than Python 2.
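As a sketch of what such weighting could look like: assign each training sample a weight based on its detected language version, so that newer idioms dominate. The detection step and the weights are invented for illustration.

```python
# Sketch of weighting code training samples by language version, so that
# newer idioms dominate over fossilized old ones. The weights and the
# version detection are my own invention.
def sample_weight(detected_version):
    weights = {
        "python2": 0.1,   # de-emphasize: don't teach the model legacy idioms
        "python3": 1.0,
        "c++11":   0.3,
        "c++20":   1.0,
    }
    return weights.get(detected_version, 0.5)  # unknown version: middling weight

training_set = [
    ("old_script.py", "python2"),
    ("service.py", "python3"),
    ("legacy.cpp", "c++11"),
    ("modern.cpp", "c++20"),
]
weighted = [(path, sample_weight(ver)) for path, ver in training_set]
print(weighted)
```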
Even better, a coding LLM should be able to be told to only generate code for a certain version of a language, as in “give me code in Python 3.11”. However, I fear that is never going to be possible in any strict sense since, as noted above, LLMs have no idea what they are doing.