In the previous three blog posts (1, 2, 3) about ChatGPT in particular and large language models in general, I touched on what they can do, what they cannot do, what they seem not to do, how they fall down in funny ways, and why I think they are fundamentally flawed for many applications. There is one more aspect left to consider – the legal and licensing side. I am not a lawyer and not an expert, but it seems obvious that there is a huge problem. There are also clear questions about business morals and what the right thing to do would be. I also doubt the business viability of LLMs in the way they are currently trained.
Copyrights
A key problem with LLMs and other generative models is the very murky situation around copyright. It seems to be standard practice to train models on content scraped off the Internet (note that at the launch of GPT-4, OpenAI notably did not document how it was trained). It is clear that models like these would not exist without the huge amount of raw training material “freely” available on the Internet.
This creates a potentially tricky situation with respect to copyright.
I have a personal take on this. I have put a fair amount of text out onto the Internet. There is this blog, I have edited and added to Wikipedia pages, I have written product descriptions and technical marketing materials, and I have published many articles in various outlets over the years. Given that I could ask ChatGPT about Simics, I am certain that a small part of the “training base” of ChatGPT comes from text that I have produced. Basically, ChatGPT builds on my work.
My work was produced for human consumption. To inform, entertain, educate, and sometimes to sell. There is an implicit acceptance that other people can read my texts and use them to grow their knowledge and skills. I put stuff out there for the joy of writing and as part of my job, not to directly monetize it or in expectation of any kind of substantive return.
However, the situation becomes totally different when an LLM is trained on the information. In the end, a commercial entity must make money off of their product – i.e., the model. This is a different situation compared to humans using the same materials. I do not think it is fair or right for LLMs to be trained like this without explicit agreement with the authors. The LLM trainers are using things that I have written for their own commercial purposes. That does not feel right and seems legally questionable.
Under copyright law (as I understand it), copyright is attached to all my text. That copyright might be personal, it might belong to my employer, or it might have been transferred to a publication, a conference, or someone else. But there is always a copyright holder. Copyright law lays out the rules for how such materials can be reused: you need to have an agreement with and compensate the copyright holder if you reproduce their work beyond limited fair use. Training an LLM appears to be similar to including my texts in a book or reusing my slides in a presentation, where I would expect at least to be asked to agree to it. [I am not a lawyer, this is a simplification, and there are certainly nuances here I do not touch on.]
Just as has been seen with AI image generators, ChatGPT can be prompted to produce verbatim output matching known existing works. For example, one article made it produce the beginning of the book “Catch-22”. This tweet shows it reproducing specific source code. And I had it produce a part of the script from the movie “Monty Python and the Holy Grail”. Clearly, the text that it has been trained on exists inside the model in some form and can be reproduced by it. This does not sound like something that can fall back on fair use. Even more egregious examples of how generative models reproduce their input have been provided in the Getty Images lawsuit against Stable Diffusion. Here, the model can be made to generate images that contain recognizable Getty watermarks. Not a good look.
Imagine a generative AI trained on the music library available in a service like Spotify, which you could then prompt to build songs – say, about Yellow Brick Road. If it happened to produce large chunks of the famous Elton John song, do you think the music industry would be OK with that? I don’t, and the same logic has to be applied to training text. There is already at least one lawsuit about this aspect for image generators. Update: Ars Technica and the Financial Times report that Universal Music Group has asked Spotify and Apple to block AI services from scraping music – in order to stop AI-generated music that imitates popular artists (and eats into their income).
Regardless of what I think, this will be debated and litigated and hopefully legislated over the next few years… but I think that morally the situation is clear. Any time an AI model is trained for commercial use, all inputs should be properly licensed and agreed to by the copyright holders. For non-profit research work, you could have a non-profit agreement or some kind of explicit open-source license for materials that you willingly provide as training materials. Another analogy is how the book business has fought the digital lending of e-books. Why don’t publishers sue the big AI players?
As someone said, it feels like a Napster moment. Who will be the Apple iTunes or Spotify to straighten this up and come out with properly licensed AI models?
Code Licenses
When it comes to software code, we have the additional issue of software licenses; in particular, open-source licenses with their requirements on how derivative code can be used and distributed.
First of all, if an LLM is trained on open-source code (such as code scraped off GitHub or simply downloaded and read in), and it produces code snippets – what are the licenses of the generated code? Does the LLM give you the license information for the code it used? The GitHub Copilot tool has already prompted a major lawsuit from the open-source world.
It is even crazier that the ChatGPT FAQ claims that you can use whatever it generates:
Can I use output from ChatGPT for commercial uses?
Subject to the Content Policy and Terms, you own the output you create with ChatGPT, including the right to reprint, sell, and merchandise – regardless of whether output was generated through a free or paid plan.
https://help.openai.com/en/articles/6783457-chatgpt-general-faq
In a way, the position of ChatGPT is akin to money laundering – by passing the code through a system like this you clean off all the licenses and get something you can use by agreement with ChatGPT. Once again, I am not a lawyer, but this seems very iffy if you have been trained to be sensitive to following the letter and spirit of open-source licenses.
But can you really trust that? If you use an LLM to generate some source code that goes into a shipping product, and your legal compliance department asks you whether your code includes any open-source code or other licensed source code, what can you say? Do you know? Can you know? Any professional software development organization with a good system to track software bill-of-materials and licenses will have a very hard time classifying the output of an LLM.
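To make the compliance problem concrete, here is a minimal sketch in Python – using made-up component names and not modeled on any real SBOM tool – of the classification question a bill-of-materials review has to answer: every component needs a license and a provenance, and an LLM-generated snippet has neither in any verifiable form.

# A minimal, hypothetical sketch of the SBOM classification problem.
# Every component needs a license and a provenance; LLM output has
# no verifiable answer for either field.

from dataclasses import dataclass

@dataclass
class SbomEntry:
    component: str
    license: str      # e.g. "MIT", "GPL-3.0", or "unknown"
    provenance: str   # e.g. "vendored from upstream", "written in-house", "LLM output"

def compliance_issues(sbom: list[SbomEntry]) -> list[str]:
    """Flag entries that a license-compliance review cannot sign off on."""
    issues = []
    for entry in sbom:
        if entry.license == "unknown" or entry.provenance == "LLM output":
            issues.append(f"{entry.component}: license '{entry.license}', "
                          f"provenance '{entry.provenance}' cannot be verified")
    return issues

if __name__ == "__main__":
    sbom = [
        SbomEntry("zlib wrapper", "Zlib", "vendored from upstream"),
        SbomEntry("parser helper", "unknown", "LLM output"),  # the problem case
    ]
    for issue in compliance_issues(sbom):
        print(issue)

The point of the sketch is simply that the second entry can never be resolved: there is no way to go from the generated snippet back to its original licenses and authors.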
Until this is cleaned up, I really hope that all professional developers and all serious enterprises refrain from using these tools. It is simply much safer not to: the resulting risk of legal battles down the line is just too high. There is also the moral aspect – credit is not properly given, and open-source licenses are not followed. There is every reason to argue for caution.
Information Leakage
Yet another problem with ChatGPT is that it uses user input to improve the overall model. This means that information can leak between users, if one user uploads information that fits the prompting from another user. This is very problematic. Entering something into ChatGPT is basically the information-management equivalent of posting it to a public forum on the Internet and making sure your favorite search engines index it for others to find.
I don’t quite understand how that works, since ChatGPT also claims to have a fixed training cut-off date, but this has been shown to happen.
EETimes reports that Samsung has found several leaks as a result of engineers using ChatGPT:
The Digitimes report mentions three specific cases of leaks caused by engineers sharing information with ChatGPT. In one case an engineer uploaded faulty code and asked ChatGPT to find the fault and optimize the software. But as a result the source code became part of ChatGPT’s database and learning materials.
Another case was where ChatGPT was asked to take the minutes of a meeting. By default the discussion and exactly who attended the meeting – both confidential – were stored on the ChatGPT database and thus ChatGPT was able to divulge the material to anyone who asked.
The risk of leaks like this is why Amazon forbade its employees to use ChatGPT at work. Many other companies have followed suit.
Using ChatGPT to help at work – be it reviewing a piece of software code, writing the starting point for an internal report, or some other creative use of it to transform information into text – is tantamount to breaking confidentiality and failing to protect company intellectual property.
Making it Legal, Licensed, and Safe
All the issues cited in this post can be solved. They are not really a problem with the technology per se, just with how current solutions are being deployed and sold.
For enterprise use cases in particular, you must run your own model, entirely separated from those of other companies. Run the model on-site (which is likely a lot cheaper than renting the compute, given the scale needed), or at the very least use a closed-off model that is not cross-fertilized with the models used by other customers of a cloud provider.
Licensing and legal issues are solved by training your own enterprise model from scratch using known-provenance data. Indeed, such customized training is likely to produce better and more on-brand results than the current wild-west model where you have no idea what the training set contains.
Consumer products face an interesting challenge in not violating privacy laws. If a user uses a service like ChatGPT to summarize or rewrite highly private information, the same concerns apply as to leaking company secrets. Every user’s own information must be kept private and separate. If I upload an image to train an image generator on my own face, that training data must not be used to produce images for other people.
The potential for community training sets is clear – you help make a model better by providing content you have uniquely created yourself and can license to the pool, and in return you get a better product. But that has to start with every single piece of training data being opted in and explicitly acknowledged as such.
There are people voicing the opinion that this means only large companies can play in the LLM space, and maybe that is the case. But is that really a problem? Many real-world products are not something a small company can hope to produce with a few people and some computers. Airplanes, pharmaceuticals, nuclear reactors, space launch rockets, and similar products still have enough alternatives in place for decent competition.