In my previous blog post about ChatGPT and Simics, I tested its knowledge of and abilities with a fairly niche subject. Unsurprisingly, it did not do all that well. However, one area where ChatGPT appears to work really well is dealing with program code. This also seems more practically useful, especially as a generator of starting points and boilerplate code. It can also sometimes do a decent job explaining code, even if bizarre mistakes and errors are quite common. Update: Part 3, a critique of ChatGPT, has been published.
It is obvious that ChatGPT is better at common mainstream languages than at niche languages like DML. There is simply more information to scrape off the Internet and more repositories on GitHub to mine. Therefore, I used old boring C and Python in my experiments.
Generating Code
One use of ChatGPT that is very commonly cited as practical, useful, and a real time-saver is asking it to generate code for you, based on a prompt. With a few tweaks, it has been reported to generate everything from simple web GUIs to malware.
I got a comment on my first blog post about ChatGPT that demonstrates this use case clearly:
I am using ChatGPT as a software consultant. Maybe I asked it 10 questions related to two pieces of code that now runs on our web site. That’s one dollar of consulting, if every question is ten cents. That’s cheap. For both issues I tried other means first. ChatGPT immediately understood what I was after and supplied code.
I would call the code “snippets that I could not write, but I can see they’re right”.
https://jakob.engbloms.se/archives/3755/comment-page-1#comment-806309
That makes perfect sense. Instead of Googling around and collecting snippets of code from Stack Overflow or some project on GitHub, you converse with a system that has internalized all that information and get it to do the job for you. This has huge potential for programming, at least for simple cases (assuming the legal aspects can be worked out; I will get back to that in a future post).
I have talked to several people who say that ChatGPT is a real time-saver when it comes to generating common code skeletons and boilerplate code, compared to constructing the same code by hand from documentation and examples. The “mash existing snippets together” nature of what it does is really quite a good match for many basic programming tasks. It will not write truly innovative things, but it can get basic skeleton coding done much faster.
To test code generation, I asked it to build a Mandelbrot program (as a typical example of a well-known task that should be easy to do):
I love the “Copy Code” box and the dedicated syntax-highlighted box. ChatGPT really does treat code as a very important domain. The code is basically usable as-is. All I needed was to install the required libraries in my local Python installation.
Definitely faster than building something like this from scattered examples. A Mandelbrot program is a dumb example in a way, since there are many ready-to-use examples out there. Still, the model generated a program consisting of multiple functions, based on an algorithm where the Mandelbrot set is first computed and then plotted (unlike other solutions where you plot as you go). It solved the problem. It is very easy to see how this is attractive.
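The actual output only appears in the screenshot, but as a rough illustration of the compute-then-plot structure it used, here is a sketch of my own (using numpy and matplotlib; the function names and parameters are mine, not ChatGPT's exact output):

import numpy as np
import matplotlib.pyplot as plt

def compute_mandelbrot(width=800, height=600, max_iter=100):
    # Grid of complex starting points covering the classic view of the set
    re = np.linspace(-2.0, 1.0, width)
    im = np.linspace(-1.2, 1.2, height)
    c = re[np.newaxis, :] + 1j * im[:, np.newaxis]
    z = np.zeros_like(c)
    counts = np.zeros(c.shape, dtype=int)
    for n in range(max_iter):
        mask = np.abs(z) <= 2.0          # points that have not escaped yet
        z[mask] = z[mask] ** 2 + c[mask]
        counts[mask] = n
    return counts

def plot_mandelbrot(counts):
    # Plot the precomputed iteration counts as an image
    plt.imshow(counts, cmap="hot", extent=(-2.0, 1.0, -1.2, 1.2))
    plt.title("Mandelbrot set")
    plt.show()

if __name__ == "__main__":
    plot_mandelbrot(compute_mandelbrot())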
ChatGPT's implementation uses numpy, which is efficient but hides some aspects of the Mandelbrot computation compared to how it would be written in C. Following up, I asked ChatGPT to provide an implementation without using numpy, and it complied with a new implementation that used the built-in complex datatype instead. I asked it to remove that as well, and it gave me an even more primitive version of the code.
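The post does not reproduce those variants, but the core difference sits in the inner loop. A rough sketch of what the two steps might look like (my reconstruction, not ChatGPT's output), first with the built-in complex type and then with plain floats:

def mandelbrot_complex(c, max_iter=100):
    # Escape-time iteration using Python's built-in complex type
    z = 0 + 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter

def mandelbrot_floats(cr, ci, max_iter=100):
    # The same iteration with the complex arithmetic spelled out on plain floats
    zr, zi = 0.0, 0.0
    for n in range(max_iter):
        if zr * zr + zi * zi > 4.0:      # |z| > 2, avoiding the square root
            return n
        zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
    return max_iter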
This kind of exploration of alternatives is potentially very powerful when helping programmers and non-programmers code. However, I suspect it only works well when treading well-worn paths where there is plenty of training data to fall back on.
There is a random element to ChatGPT. When I asked it for a Mandelbrot program a second time, it provided a different solution, with a different function structure and a different way of doing the underlying math. It also did not work, since it used deprecated functionality in numpy.
The version of numpy that did the deprecation was released in January of 2021, so it was within the horizon of the training set for ChatGPT. Unlike a human, ChatGPT does not (and cannot) check the code it generates by testing it. But the fact that it can generate syntactically correct code at this volume is amazing.
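The post does not identify which numpy API the broken version used, but as one plausible illustration: numpy 1.20 (released in January 2021) deprecated the old scalar-type aliases, which have since been removed, so generated code along these lines fails outright on a current installation:

import numpy as np

# Hypothetical example: np.float was deprecated in numpy 1.20 and removed in 1.24,
# so this line raises AttributeError on a recent numpy instead of just warning.
image = np.zeros((600, 800), dtype=np.float)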
Does it Parse?
The way that ChatGPT handles code is a bit fuzzy. It has no formal grammar for programming languages (beyond whatever its basic text tokenizer does). The same goes for semantics – it is generating meaning from examples, not from a formal specification. Which is honestly quite like how a human would go about the same task: read a lot of code, read about programming languages, and build up a set of patterns to reuse when new code is encountered.
Why do I say it does not parse? I asked it!
Q: Does chatgpt parse programming language code using a parser?
A: No, OpenAI’s GPT-3 model, including ChatGPT, does not use a traditional parser to understand programming language code. Instead, it uses its deep neural network and large amount of training data to generate text based on the input it receives. While it can generate code snippets or perform code transformations to a certain extent, it does not have the ability to parse and understand the underlying structure of code in the same way a parser would.
Of course, this might be wrong… but given that it could produce something that kind of looked like DML, it is a reasonable answer. It is also hard to see how you would combine a formal parser with the language model (or maybe that is not a problem; I do not know enough. Note the non-ChatGPT style of this disclaimer: the bot would just have stated this as a fact).
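For contrast, this is what a formal parser does. Python's standard ast module (my example, not something from the conversation) turns source text into an explicit syntax tree rather than matching statistical patterns:

import ast

# Parse a small function and print its syntax tree: FunctionDef, arguments, Return, BinOp...
tree = ast.parse("def f(i):\n    return i * i\n")
print(ast.dump(tree.body[0], indent=2))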
Making Sense of Code
ChatGPT can also be used to explain what a program does, and maybe even to find problems in code. It is clearly good at recognizing the language of a snippet of code and somehow connecting that back to what it “knows”. Small toy examples work well, even if things sometimes go wrong there too…
Q: What does this code do?
int f(void) {
unsigned long int v = 0x1010FE;
return v << 5;
}
A: This code defines a function named f that takes no arguments (indicated by void) and returns an int value. The function declares an unsigned long int variable named v and initializes it with the value 0x1010FE (which is a hexadecimal representation of the number 1068470).
The function then performs a left shift operation on v, shifting its value to the left by 5 bits. In other words, each bit in the value of v is shifted 5 positions to the left, effectively multiplying the value of v by 2 raised to the power of 5.
Finally, the function returns the result of the shift operation as the return value of the function.
It’s worth noting that the left shift operation can cause overflow if the result of the shift is larger than the maximum representable value for the type of v. In this case, unsigned long int is typically 32 or 64 bits, depending on the target platform, and the shift operation can cause overflow if the result of the shift is larger than the maximum representable value for unsigned long int on the target platform.
Nice teacher work. I cannot fault the above explanation. It shows that the model has built up a decent idea of what operations in C do and of the limits of data types, which is mind-blowing considering how it is supposedly trained. The fact that it spots the potential for overflow and flags it as a possible issue is very impressive.
It also demonstrates that ChatGPT lacks a formal evaluation system. Any compiler would be able to evaluate the above code and turn it into a single value. If code was generated for the function, it would consist of returning a constant value computed at compile-time. But ChatGPT does not have built-in facilities for evaluation of mathematical expressions.
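To illustrate the constant-folding point, here is what evaluating the expression directly gives (a quick check at a Python prompt, not part of ChatGPT's answer):

print(hex(0x1010FE << 5))   # 0x2021fc0
print(0x1010FE << 5)        # 33693632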
Computing with ChatGPT
ChatGPT does not have a calculator built in. The system knows this, and it seems like a straightforward addition. But then again, doing something as mundane as calculations is not what ChatGPT is about as a demo.
As a result, the output can be bizarre, since it generates mathematical results as text. For example, I gave it a simple snippet of code to evaluate:
The tendency to be chatty is strong. Thus, ChatGPT does not simply answer with a number, but instead provides a long-winded explanation of how it would go about computing the answer. The description of C precedence (not “precedences”, a word it seems ChatGPT made up) is correct. The detour into binary is quite pedagogical. This piece is impressive.
But the actual numbers are atrocious. The conversion of 0x1000 to binary is correct. But then, when shifting, it inserts four additional zeroes (the binary shown corresponds to 0x80000, not 0x8000). Still, 0x8000 is the correct answer. But 0x8000 is not 4096 in decimal, it is 32768. Wow. Do not use ChatGPT for your math homework!
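The exact snippet only appears in the screenshot, but assuming a shift by 3 (the amount that would produce the expected 0x8000), the correct values are easy to verify:

print(hex(0x1000 << 3))   # 0x8000
print(0x1000 << 3)        # 32768, not 4096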
Spotting Bad Code
It might look like ChatGPT is good at explaining/analyzing code for errors, but it really is not. A compiler or static analysis tool will do a much better job of finding bugs and vulnerabilities in code since they have actual precise semantics. On the other hand, ChatGPT can deal with fuzzier aspects of code. Giving it something code-like produces useful answers, where other tools would just tell you there are syntax errors.
Compared to other tools, this is a convenient summary. Still, any language-aware editor would have spotted most of the syntax errors prior to compilation, and the compiler would have caught the rest. Using inference in a hundreds-of-billions-of-parameters AI model to spot a simple compilation error seems extremely inefficient – even if the result is extremely impressive.
Side note: ChatGPT appears to have an output style guide: it adds spaces between variables and operators.
Incorrect on Purpose
Given how good ChatGPT is at explaining code, I wondered if I could make it provide an intentionally incorrect description. And it did!
Q: Can you give me an incorrect description of this code?
int f(int i) {
return i*i;
}
A: An incorrect description of this code is that it defines a function named f that takes an integer argument i and returns the square root of i. This description is incorrect because the code actually returns the square of i, not the square root.
For some reason, the answer makes me think of Sheldon from the Big Bang Theory chuckling at how clever he is. That ChatGPT can handle the idea of incorrect descriptions is insane. Somehow it goes from a squaring operation to a square root operation when asked to be incorrect.
Bizarre Code
Corner cases are the best cases. Can ChatGPT make sense of intentionally obscure code, like this IOCCC 2020 winner?
Note how ChatGPT behaves in a rather human way here. When confronted with a wall of bizarre code like this, it just says “this looks very complicated” and does not even attempt to evaluate it. Which is exactly what a human would do (or a human would just compile it and figure out from experiments what it does). A static code analyzer could maybe make more sense of it.
I fed it another example from the IOCCC, and it correctly deduced that the code came from that contest! I asked it how it knew, and it answered:
Q: How can you tell it is from the IOCCC?
A: The code contains a comment indicating that it is from the IOCCC, which stands for the International Obfuscated C Code Contest. The comment reads “IOCCC’20” and is located near the top of the code.
Amazing.
Finding Errors in More Complex Code
In our Simics trainings, we have a small C program that interprets a character string to generate image data. This program has an intentionally engineered-in interpreter error, used to demonstrate debugging with Simics. I will not show the code here, in order not to spoil the training (but it is available in the public release of Simics). I gave the code to ChatGPT, and it gave me some feedback:
There are several potential issues with this code:
Memory safety: The code does not check for buffer overflows and it may write beyond the end of the buffer, potentially causing undefined behavior.
Here, it should be noted that there are multiple buffers in the code. So “the buffer” might be any of them… most likely it refers to a fixed-length character array used with atoi.
Error handling: The code does not handle errors gracefully in cases where atoi encounters a non-numeric character in the input string. The function returns 0 in this case, which may not be what is desired.
True.
Magic numbers: The color code 0x80808080 is a magic number and its purpose is not immediately clear. It would be better to use a named constant or enumeration value instead.
This is just me using a non-zero value to initialize a local variable that is later overwritten, in order to visually spot cases in which it never gets set properly. No need for a constant.
Return value: The function does not return a value indicating success or failure. This can make it difficult to determine if the function has completed successfully.
Failure is not an option for this code. Or rather, there is no meaningful error state. It is demo code.
Naming: The variable names are not very descriptive and could be improved for better readability.
Maybe a good point. Naming is a point of eternal discussion between programmers.
But it did not spot the interesting error in the code, the one that was designed to make the code crash. It seems I managed to hide it sufficiently well that this AI would not see it.
It Can do Recursion!
Since ChatGPT seems to be able to evaluate things, I tried something more interesting: a well-known function with very high computational complexity.
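The function is not named in the text, but the A(m, n) notation and the too-deep recursion later on strongly suggest the Ackermann function. For reference, a standard definition (my own code, not ChatGPT's):

def A(m, n):
    # Ackermann function: a tiny definition with explosive growth
    if m == 0:
        return n + 1
    if n == 0:
        return A(m - 1, 1)
    return A(m - 1, A(m, n - 1))

print(A(1, 2))         # 4
print(A(0, A(1, 2)))   # 5
print(A(2, 3))         # 9
# Something like A(4, 1) already exceeds Python's default recursion limit.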
Wow! It actually starts to compute the results step by step. I did not see that one coming. And at least for a few small arguments, the results are correct. I am not sure how to square this with the miserable failures for simple C math shown above, as this is arguably a whole level more difficult for a human.
End of the computation, with the mandatory verbose explanation:
Asking it to evaluate a more complex input (one that crashes on my machine due to too-deep recursion) just stops after a while. I guess there is a built-in limit on the amount of compute it is willing to dedicate to each answer. Sensible. However, it is not willing to admit to that.
When it stopped in the middle of a line and got back to the prompt, I challenged it:
And at this point it stopped again. Not entirely sure how the answer is supposed to be read.
Since it seems to be able to do crazy things like emulating Linux prompts, I asked it to draw the computation as a tree.
This is just bizarre. The nodes make little sense, and the bottom part is… I have no idea. A(0, A(1,2)) should be 5 not 3. Once again, the system is shown to have entirely random limitations that seem impossible to predict.
Summary
Despite all the funny failures I managed to coax out of it, ChatGPT has huge potential as a code helper. Coding assistance seems like low-hanging fruit for AI language models, and I guess that is why the product already exists, in the form of the much-debated GitHub Copilot and other tools. It is a logical extension to smart code completion. When generating code, one can hope that the programmer, compiler, static analysis tools, and testing will catch the inevitable errors.
Using tools like this to understand code seems quite limited and subject to random limitations and hallucinations. After all, ChatGPT is just simulating text that explains code. It has no actual knowledge or formal semantics to fall back on, just a huge set of overlapping patterns learned from examples.
That it can do what it does, given how it works, is once again mind-boggling. This whole system is. But it is not the end of programmers as we know them, and it is definitely not a replacement for real analysis systems.
Next post will be more of a critique and address criticisms of the system.