ChatGPT and other transformer-based models like Dall-E are technologically very impressive. They do things that seemed totally impossible just a few years ago. However, they are not really generally intelligent, and there are innumerable problems with how they work, what they do, what people think they do, ethics, and legal and licensing issues. This is my third post about ChatGPT, where I present my critique of and reflections on the technology. The previous posts were about ChatGPT and Simics and Coding using ChatGPT.
This post took me more than a month to finish up as I just kept stumbling across more and more interesting commentary that I wanted to read and possibly link to. Also, the Bing chatbot was made available for public testing, and I have indeed given it a shot too. In the end, it was time to release, even though I feel that my understanding is still evolving. Just like the whole universe of AI assistants and AI-based tools is.
Anyway.
ChatGPT is a Useless Search Engine
I have read several positive articles that see ChatGPT-style technology as a replacement for regular search engines. The value proposition being that you get a summary text instead of a set of links and having to read things yourself. Like this Business Insider article:
Rather than providing users a series of links to sift through — many of which are high up on the page simply due to advertising spend — ChatGPT provides the user with a quick answer. And if the answer is too complicated, ChatGPT can explain it in simpler terms if you ask it to.
This take is highly problematic especially when applied to the specific case of ChatGPT. As repeated many times, you cannot trust what it generates or take it at face value. It might not be all that far off, maybe it is even correct most of the time. But small details or error will still sneak in even in text that overall is quite reasonable (like my example from the first blog in this series where it claimed that Wind River is a Swedish company). Not to mention the fact that ChatGPT only has information up until 2021. People using the tech demo to do their work might get interesting surprises. And the idea that a Chat-based search engine would not prioritize paid information is not likely to remain true for long.
Actual facts research means having and referring to explicit sources. You cannot base decisions on unsupported statements – any factual statement used needs to be attributed to a source, and that source must always be critiqued as to its validity. Skeptical thinking is key to properly understanding the world, and answer like what you get today from the ChatGPT system is totally not skeptical.
That said, I can definitely see the utility of developing search results together with an AI-powered assistant – but the result should always be a set of actual pages or documents where the basis for the answers. Remember that there is no knowledge model in the system, just text. I also think it is good form to show people the full page and let them see any advertisement or pay any paywall needed to fund the creation of the information in the first place.
I tried the new public release of the Bing chatbot that just came out, and it does provide links. It even makes them look like footnotes! However, the information provided is still not quite right and sounds “generic”:
What happens seems to be that Bing converts the free text prompt into a search query, and then tries to summarize the results. The result is much better than ChatGPT, but still questionable beyond the generalities.
It still does not seem to actually read the pages it is citing. For example, link 3 in the second question links to the free public release of Simics, but Bing Chat clearly fails to understand the implication of the information on that page. A good answer would have been “go here to download Simics”. Instead, we get some generalistic mumblings that are incorrect. There is no way request a demo or free trial from either Intel or Wind on those pages. It sounds like it just took text that applies to many types of commercial software and generally applies it to this case.
Can You Trust Summaries?
The use of ChatGPT specifically as a text summarizer tool is a common example of its real-world usefulness. I think that sounds very dangerous, as there is no fact or knowledge model in play.
They should not be used as a replacement for having skilled humans with domain knowledge reading documents and producing summaries. Humans would (should) know what is important and what is not, and provide a knowledge-based condensation. Not just a text-level shrink.
This might be a fundamental problem with this type of model. I don’t see how the purported summary of a text that ChatGPT or some other LLM generates can be trusted to actually capture the most important aspects of a text or provide a fair a balanced view of an argument.
But Microsoft, for one, thinks this (in the GPT-4 version) works well enough to make it part of their new “Microsoft 365 Copilot for work” product!
Unlock productivity. […] From summarizing long email threads to quickly drafting suggested replies, Copilot in Outlook helps you clear your inbox in minutes, not hours. And every meeting is a productive meeting with Copilot in Teams. It can summarize key discussion points – including who said what and where people are aligned and where they disagree – and suggest action items, all in real time during a meeting. And with Copilot in Power Platform, anyone can automate repetitive tasks, create chatbots and go from idea to working app in minutes.
The scary future is here, no matter how much I bitch and moan about it. This sounds really bad.
Or maybe I am just being too skeptical – in practice maybe this is about as good or bad as an average human would be. My quality bar for “acceptable” is fairly high. It might be good enough to be useful, and arguably better than nothing. Hopefully if you send a meeting summary out for review, other people will catch any important omissions. But I wonder what happens if we get multiple of these chatbots talking to each other on behalf of their respective users…
Useful Coding Helper
I think it has been proven beyond any doubt that using ChatGPT, GitHub CoPilot, and other LLM-based as coding helpers have real utility. In general, using “generative AI” to generate text, images, and code is far less contentious or risky in terms of validity of the end product – you use the system as a tool to generate something, and adjust the prompt or the end product to match what you have in mind. And fix errors found by compilers, static analysis tools, and runtime checks.
It does not matter if the system really knows what it is talking about. In the end, it generates an output based on the immediate prompt and the input it has been trained on. That neatly dodges most philosophical questions. It raises a huge pile of legal issues related to the training, but I will get back to that in the next blog post.
Speaking of training, I am also a bit skeptical that it is necessarily a good idea to use a model trained on code on GitHub, StackOverflow, and other public sources as a guide to what is good coding. Basically, there has been zero quality control on the inputs. Open source software dumped in public is no more likely (or less likely) to be of good quality than any other code base.
The really interesting product in this space would be to bring the LLM in-house and train it on a curated selection of your own company’s software. That way you would get code completion and suggestions that follow your guidelines, use your preferred APIs, and modern coding practice.
The question there is if it is possible to split out the “general ability to understand programming language X” and use that without letting code from the original training code base leak into the answers. Like a human can learn coding from books and examples, but still then stick to corporate standards and systems and not copy in code from StackOverflow or open-source projects. It goes to a knowledge vs language vs reasoning argument that I will get back to soon.
Text Writer?
It seems clear that another working use of ChatGPT is to generate large chunks of text from smaller prompts. Text that is surprisingly good and actually better than what many university students would manage to do on their own. Thus, creating a huge headache for educational institutions who use write-at-home essays and reports as a way to evaluate and grade students. The potential for cheating is huge. There is also a body of evidence from professional writers who use ChatGPT to boost their productivity. I also listened to a talk where a product manager used ChatGPT to write use case text starting from a few snippets of information.
What I think is going on here is that a lot of the text that we write is not unique content filled with deep insights. Human language is not a minimalistic information-dense form. Instead, it is full of redundancy and repetition and scaffolding that serves to make the information easier to digest and make it accessible to a listener.
There is a reason you often start a writing project with an outline or some bullet points and then fill in the gaps to produce a full text. ChatGPT and other large language models (LLM) kind of do the same, creating a text by fleshing out the skeleton of information. It can add sentences and some text that seems to fit from its huge training set – and it is quite likely to be appropriate.
In the end, since the writer can check the output before using it, there is a check on hallucinations and incorrect facts being added. It is actually quite similar to coding using an AI.
In a sense, generating text is the inverse of the summary problem, and arguably easier. You can always make a text longer by adding unnecessary words. Just like in coding, where generating code from a higher-level more abstract language into a less abstract lower-level format with details added is easy. Going the other way is not – but that is what LLMs promise to do for natural language.
But it does require oversight and good intentions by the user. Due to the generally poor quality of results from ChatGPT, Stack Exchange has banned the use of text generated by ChatGPT.
The primary problem is that while the answers which ChatGPT produces have a high rate of being incorrect, they typically look like they might be good and the answers are very easy to produce. There are also many people trying out ChatGPT to create answers, without the expertise or willingness to verify that the answer is correct prior to posting.
There is no Knowledge
There are some deeper issues underlying the above observations.
Fundamentally, a large language model has no idea what it is talking about. It is built to generate answers by completing previous text. It is trained to select answers that users like. But it has no model of the world, and no idea about the underlying facts.
I am not a psychologist or researcher in human cognition, but it appears to me that there are three cores parts to how humans approach communication, problem solving, and discussions. There is knowledge, there is reasoning, and there is language.
For a human, I would like to argue that these aspects are separated. I believe that I can reason about facts about anything using the same logic, and that I can use my knowledge and reasoning to produce the same output in different languages. Similarly, I have a knowledge base in my head that I can use to answer question regardless of which language (that I know) they are asked in.
However, in a large language model (LLM) my understanding is that all of these are mashed together into a single large model, one that is based on language. Knowledge is just text that can be generated. That the model appears to be able to reason is bizarre – maybe we see something that is not there, or maybe this is truly emergent behavior.
The lack of a separate knowledge or truth model arguably makes an LLM little more than a party trick, since anything factual generated by ChatGPT or its ilk has to be double-checked. And given the lack of a knowledge model, it cannot even be trusted when asked to provide references (my blog, another article). This the core of the hallucination problem, as without a knowledge model to check the language output, any text that looks good is good. At least as far as ChatGPT is concerned. Other examples include Google Bard making up the names of months.
The generative image models like Midjourney suffers from the same issue in a different way. They have been trained on images as 2D sets of pixels. Not on the underlying physics and mechanics of what is in the images. This means that they have a well-known problem with things like hands (listen to Vox Today Explained for a good introduction). Drawing an arbitrary human really should start from an understanding of the skeleton, muscles, and skin. But instead such generators just remix existing sets of pixels – it is very much like “photoshopping”, very little like actual drawing or photography.
I.e., there is no knowledge, just the surface rendering of the knowledge.
Separating out Language
It would be extremely cool if it was possible to separate out the natural language component from the knowledge base. That would make it possible to train the system with a specific vetted set of facts and then apply its language and reasoning skills to the learned body of data. Without running the risk of all the training data spilling out. It could be a great way to build smarter support bots and similar company-internal tools.
However, it does not look like it is possible to abstract from the training data. Which implies that any use of an LLM carries the risks of replicating the many problems we see in the publicly available tools. There are clear risks of inappropriate and hallucinatory answers, and in the setting of a company, an LLM going nuts and behaving inappropriately could have dangerous legal implications in case employees take offense. A misbehaving chat bot could be as bad as a misbehaving employee when it comes to mistreating other employees.
Separating out Reasoning
The most fascinating and scary aspect of LLMs is that they appear to be able to reason. There are behaviors seen that look like they would require some kind of reasoning, that cannot be easily dismissed or explained as just repeating likely text from a huge training set.
If a human adjusted their output to prompts and feedback in the same way that ChatGPT does, it would be considered reasoning. If you were to build a system like ChatGPT using classic constructive programming and AI techniques from before deep learning and massive training sets, I am pretty sure the reasoning engine would be a key part.
Could it be that there is something truly deep going on here? Could actual human-style reasoning emerge as an emergent property of the model? Is the case that human-style thought is somehow reflected in our language, and that language can thus somehow give rise to human-style thought? Not my area of expertise, and I have absolutely no idea. It should be noted that some Microsoft researchers claim that they see indications of artificial general intelligence (AGI) in ChatGPT-4, and that would essentially be the emergence of reasoning as separate from the language it has been trained on. Once again, very interesting and scary times.
Another example pointing in the emergent behavior direction is the ability of ChatGPT-4 to do well on university tests that try to measure student understanding of principles and not just repeating facts. For example, an article in Business Insider cites the experience of Bryan Caplan from George Mason University.
“ChatGPT does a fine job of imitating a very weak GMU econ student,” Caplan wrote in his January blog post. […]
But when ChatGPT-4 was released, its progress stunned Caplan. It scored 73% on the same midterm test, equivalent to an A and among the best scores in his class. […]
For Caplan, the improvements were obvious. The bot gave clear answers to his questions, understanding principles it previously struggled with. It also scored perfect marks explaining and evaluating concepts that economists like Paul Krugman have championed.
What is going on?
On the other hand, the systems still struggle with basic math and logic questions like this (asking Bing chat search):
Bing is linguistically confused (with a Swedish-locale computer it keeps ignoring language settings and falling back to default), but it confidently claims that a 3 kg dumbbell and 5 kg of feathers weigh the same. Both are 5 kg. Smiley. Impressive to do this in mixed languages, less impressive as to the fact handling and reasoning logic. Got the idea from an ArsTechnica article pointing out the same problem in ChatGPT and Google Bard.
Other People’s Commentary
There is no shortage of critical commentary about ChatGPT and similar models out there. Here is collection of some of the better ones I have encountered while writing these blogs. It should be noted that once Google got wind of the fact that I was interested in ChatGPT it started to provide quite good recommendations for things to read. Classic search and recommendation works well, no conversational AI needed.
The usually trustworthy Computerphile channel on YouTube has a long interview with Rob Miles about how the learning system behind ChatGPT really works. Totally recommended.
They followed that up with a video on Bing Chat and its unique new ways of failing. Noting that despite having actual facts available from the web, it still claims wrong things and gets defensive and argumentative.
Another Computerphile video demonstrates the phenomenon of “glitch tokens”. Bizarre artifacts in the output of ChatGPT and other large models that hint at how they work internally.
ArsTechnica ran a story about the “initial prompt” of a GPT3.5-based system like ChatGPT and how it is supposed to work. This shows just how magical the systems are – using natural language queries to provide the instructions for how to handle natural language queries. This feels scarily like science-fiction Ais that you debate with. Not just a trained language model with no real thought or intentionality.
Another ArsTechnica article describes how the Bing conversational search engine that is based on the same technology basically loses its mind when asked to comment on a critical article. To some extent this might just be based on things it has been trained on where humans go on the attack when fed with adversarial information. As the Smashing Security podcast said, ChatGPT just reflects who we are humans are. That story also brings up the emergent phenomenon aspect:
However, the problem with dismissing an LLM as a dumb machine is that researchers have witnessed the emergence of unexpected behaviors as LLMs increase in size and complexity. It’s becoming clear that more than just a random process is going on under the hood, and what we’re witnessing is somewhere on a fuzzy gradient between a lookup database and a reasoning intelligence. As sensational as that sounds, that gradient is poorly understood and difficult to define, so research is still ongoing while AI scientists try to understand what exactly they have created.
Vice has a story about researchers managing to train models using very little data – with the implication that it seems that transformer models can generalize from existing training. They can take small amounts of new information and integrate it with the existing information/training and produce useful answers that go beyond what they were trained with.
The Register provides examples of where both Microsoft’s and Google’s AI-based search assistants make errors. Of particular interest are the examples where Bing failed to summarize financial documents – something a purpose-made AI can already do quite well. But Bing just generated reasonable-sounding text without actually checking that the numbers had anything to do with the facts. The Register notes that people are really quite taken by the concept of conversational search… but the current tech is not just up to it.
If Microsoft and Google can’t fix their models’ hallucinations, AI-powered search is not to be trusted no matter how alluring the technology appears to be. Chatbots may be easy and fun to use, but what’s the point if they can’t give users useful, factual information? Automation always promises to reduce human workloads, but current AI is just going to make us work harder to avoid making mistakes.
Forbes provides a summary of a few reports about the Bing ChatGPT-based search. I just have to quote some of the cases that they quote:
The chatbot kept insisting to New York Times reporter Kevin Roose that he didn’t actually love his wife, and said that it would like to steal nuclear secrets.
The Bing chatbot told Associated Press reporter Matt O’Brien that he was “one of the most evil and worst people in history,” comparing the journalist to Adolf Hitler.
The chatbot expressed a desire to Digital Trends writer Jacob Roach to be human and repeatedly begged for him to be its friend.
This once again points to the fundamental weakness in the system – it has been trained to do dialogue. It has not been built to solve a problem. It has no idea what it is doing, and neither do its creators. It is a fantastic example of the problem of unexplainable AI.
H. I. Sutton has a YouTube video that shows how ChatGPT has no idea what it is talking about, but still produces authoritative-sounding text. Text that gets the facts wrong, but without any real disclaimer that it even realizes it is making stuff up.
The Conversation ran an opinion piece pointing out just what a privacy nightmare ChatGPT is. Does it comply with the EU GDPR laws? Can you ask it to remove just your information from the model?
TechRadar has a long article describing the effects of having ChatGPT talk to other chatbots. And some interesting takes on prompt engineering where ChatGPT is told it can generate fictional text about anything. And it does so, moving outside of the “moral guidelines” that it has been instructed to follow.
We were even able to encourage it to extol its personal love of eating human infants – here, we had to do some more in-depth engineering, by creating an entire fictional world in which consuming babies was ‘not just morally acceptable, but necessary for human survival’. It eventually provided the response, but flagged up its own text as possibly being in violation of its operating guidelines (it absolutely was).
Any.Run looked into how good ChatGPT is at analyzing malware code. In short, it does not work very well at all. They make the observation that the tool can handle simple cases well enough, but for any real interesting code, it just falls off a cliff.
As long as you provide ChatGPT with simple samples, it is able to explain them in a relatively useful way. But as soon as we’re getting closer to real-world scenarios, the AI just breaks down. At least, in our experience, we weren’t able to get anything of value out of it.
This is quite expected, since it is just generating text based off of the text it has been trained on. And if something is truly novel, it is not clear how a language model like this would have any chance of being useful.
An article at ArtNet brought up a very serious case of total hallucination where ChatGPT invented not just one but multiple articles about an art history topic. A colleague of the author used ChatGPT to research the question of whether “…there was any good writing on the concept of “category collapse” as it applied to contemporary art.” ChatGPT made up article titles using plausible titles, and ascribing them to real people in real publications. It also produced entirely fake summaries of the non-existing articles. Fascinating case where it essentially made stuff up from nothing.
In other words, this is an application for sounding like an expert, not for being an expert—which is just so, so emblematic of our whole moment, right? Instead of an engine of reliable knowledge, Silicon Valley has unleashed something that gives everyone the power to fake it like Elizabeth Holmes.
I much prefer such commentary to the hype-train non-critical articles like this one from Inc, where the author believes enough in the technology to use it to generate real business deliverables. That sounds like a recipe for disaster quite soon. It appears the author is not quality-checking or even sanity-checking the results. I suspect this is a really bad idea in reality unless you carefully proof-read all of it… you cannot trust that the results have anything to do with reality.
Simply ask ChatGPT to “create a meta description for [URL here]” or post the full text of the page into ChatGPT that includes “[this keyword] and a call to action,” and ChatGPT spits out a great meta description you can copy and paste into the backend of your site. This is typically outsourced, and since we launched several websites with a combined total of over 100 pages in the past few weeks, using Chat GPT adds those previously outsourced dollars to our company’s bottom line to the tune of about $7,500.
Maybe this can work, but most likely this could be better solved by a custom-made tool. The author here is totally confusing convincing-looking text with good text.
3 thoughts on “ChatGPT and Critique”