No, current law holds that I learned something. However, if I write a patent application about my new glue, and in there I literally quote half a page from Mr. Bayer's textbook of organic glues, then I have violated copyright.
See organic chemistry example above. The model is lossy and compressed, yet complete and well connected, just like my memory of organic chemistry. As long as the model doesn't literally regurgitate whole sections, it has ingested knowledge.
But what license can you write that says: You can use this code, but you can't read it, and if you read it you must forget it, and retain nothing more than the general concepts? Because that is sort of what a student (or the AI) does.
> BTW, why does everybody and their grandma equate LLMs to human brains?

Nobody's equating LLMs to human brains. LLMs are the equivalent of textbooks that students have to read to learn anything. CPUs are the brains. It's kind of like first using an English Language textbook to learn the language, then using that knowledge to read classified documents (which are written in English if you are in a country where English is the official language).
> Models of this depth are prone to overfitting and will reproduce some parts of the training data literally on the right input, fully deterministic. Which means it stores an encoded copy of those parts.

I doubt that the kind of LLM that GitHub's AI autocomplete uses has long enough sequences of tokens in the model to be able to reproduce any more than a few lines at a time. And remember that copyright allows for fair use of small snippets. How small exactly those are is an interesting question; for music and literature that has been litigated; for source code not so much as far as I know.
> Chances that these LLMs store at least some parts of GPL'd code verbatim are 100%.
> That's what licenses are there for: Terms of usage. In GPL this is not just about copying, but also translation and modification, which both happen when training LLMs.

The model does not contain a modified copy of the GPLed source code. It's not like the model contains "The Linux Kernel with the SCSI driver replaced by a SATA driver" (dumb example, but an example). It contains very short snippets of the source code, turned into token sequences and their frequencies (highly simplifying).
The GPL talks a lot about turning source code into object code, but a model doesn't do that either: the LLM is not linkable or executable. I think today's licenses just don't get any traction on what model training does: It reads the source code (that is explicitly legal), it tokenizes it (again, there is no prohibition against that), and then it records the frequencies of short token sequences.
I think to prevent AI training, a new license would have to be created. And creating that one is not easy, since it has to prohibit "reading the source code with intent to store the fine-grain structure", which is exactly what a human reader does, for example when trying to learn coding style.
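To make the "frequencies of short token sequences" simplification concrete, here is a toy sketch in Python. It uses a literal n-gram table, which a real transformer is not, and the training snippet and all numbers are invented for illustration. It also shows why the overfitting worry above isn't crazy: if a context appeared only once in training, its most likely continuation is the original text, verbatim.

```python
# Toy stand-in for "record the frequencies of short token sequences".
# Real LLMs learn continuous weights, not literal n-gram tables, but the
# memorization failure mode is the same: a context seen only once gets
# completed with the original training text.
from collections import Counter, defaultdict

def train_ngram(tokens, n=3):
    """Count how often each (n-1)-token context is followed by each token."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context, nxt = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
        model[context][nxt] += 1
    return model

def complete(model, context, length=8):
    """Greedily emit the most frequent continuation for each context."""
    out = list(context)
    for _ in range(length):
        ctx = tuple(out[-len(context):])
        if ctx not in model:
            break
        out.append(model[ctx].most_common(1)[0][0])
    return out

corpus = "int main ( void ) { return 0 ; }".split()  # invented training data
model = train_ngram(corpus, n=3)
print(" ".join(complete(model, ("int", "main"))))
# -> int main ( void ) { return 0 ; }   (the training line, regurgitated)
```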
> CPU has no resemblance to what a brain does

why not? I'd think it's pretty common knowledge that a CPU is the "brain" of the computer...
> Why would you think that? Today's models boast 1000s of tokens context, and produce complete structure of multiple source code files on request. That context has to come from somewhere. To feed them only small portions of training data would completely defy the purpose and make them far more limited than they are today.

Re-read the very snippet that ralphbsz wrote (and you are responding to). The exact size of those tokens of context does matter, because they just might contain code that somebody doesn't want to share.
> And I really don't get what should be difficult about legally differentiating between a human reader and a computer.

Ever hear of the Turing Test? Do you know what it's about? That test doesn't test the computer, it tests humans, to see if they can be fooled into thinking that there's a human on the other end of the wire. And it's becoming easier and easier to fool the human.
Nope, that's being pretty ignorant about what a brain does. If your brain were a CPU:
- it wouldn't remember things permanently
- it would do exact logic and calculations very fast but be inefficient at fuzzy tasks
- it wouldn't work all in parallel at the same time
- it wouldn't reorganize its logic and memory while using it.

Bad analogy, in both directions. IIRC, parts of the frontal cortex could be comparable in function, in case you want to go really far-stretched.
> There's no "size of those tokens", tokens roughly correspond to words.

What if a whole text file were treated as a token? or a .jpg image? For an example of image-based AI training, you may recall the hoopla over the 'Black Hitler' AI-generated imagery. Yes, exact content of training materials for AIs does matter, for many reasons. Not just copyright violations, not just historical/factual inaccuracies.
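For what it's worth, it's easy to check what a "token" actually is with a real tokenizer. A sketch using OpenAI's tiktoken library purely as a stand-in; whatever GitHub's model actually uses may tokenize differently:

```python
# pip install tiktoken -- used here only as an example of a BPE tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
snippet = 'if (err != 0) { return -EINVAL; }'
ids = enc.encode(snippet)
print(len(ids), "tokens")
print([enc.decode([i]) for i in ids])
# The pieces are words, operators, and sub-word fragments. Neither a whole
# text file nor a .jpg is ever one token; both become long sequences of
# small pieces, so "size of a token" and "size of the context" are
# different questions.
```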
Y'know, it does take a ton of processing power to train an LLM. So, it's a bit of a tradeoff between investing in a machine powerful enough to process the terabytes of contextual data (and learn adequately from it) and the return on such an investment. Given this I have no reason to believe what ralphbsz assumes, namely that LLMs are trained with so few tokens at a time that the represented text would escape copyright claims, for being only short text parts.
> How is that relevant here? It's easy to define the difference between humans and computers in legislation, and law enforcement can just look at the physical appearance if needed.

Yeah, if ChatGPT can generate a patch to work around a Wayland-related bug in FreeBSD, can you still claim that it's easy to define the difference between humans and computers? An AI can easily run a marriage scam out of Nigeria at this point, y'know. Try getting a cop to take a look at the criminal AI and say with a straight face that they can tell the AI from a real human.
> Given this I have no reason to believe what ralphbsz assumes, namely that LLMs are trained with so few tokens at a time that the represented text would escape copyright claims, for being only short text parts.

That's not at all what I assume. On the contrary, I think good coding / autocomplete LLMs are trained on a very large corpus, for example all of Github, or in large software companies all of the internal code base (which is typically 10^9 LOC). And they are trained on large token chains, as they have to be to give reasonably accurate predictions.
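A rough sketch of what "trained on large token chains" usually means in practice: the tokenized corpus is cut into long, overlapping windows, and the model learns to predict every next token inside each window. The window and stride sizes below are made-up numbers for illustration:

```python
# Cut a tokenized corpus into long, overlapping next-token training windows.
def training_windows(token_ids, window=2048, stride=1024):
    for start in range(0, max(len(token_ids) - window, 1), stride):
        chunk = token_ids[start:start + window]
        # The model is typically trained to predict chunk[1:] from
        # chunk[:-1], so every run of `window` consecutive tokens --
        # possibly an entire source file -- passes through the weights.
        yield chunk[:-1], chunk[1:]

corpus_ids = list(range(10_000))   # stand-in for a tokenized code base
pairs = list(training_windows(corpus_ids))
print(len(pairs), "training examples of", len(pairs[0][0]), "tokens each")
```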
> My (uneducated) guess remains that an LLM that helps with creating source code will not reproduce enough of a copyrighted work (such as the Linux or FreeBSD kernel) to violate copyright law. And when I ask it to auto-complete whatever project I'm working on right now (which happens to be: how to download lots of files from Microsoft OneDrive using Python), it will probably not steal enough content from any existing copyrighted code base to trigger a license. But I admit that this is an open question, which ultimately legislatures and courts will have to decide.

If you have enough monkeys with enough typewriters, you'll eventually get a reproduction of the 'War and Peace' novel.
> If you have enough monkeys with enough typewriters, you'll eventually get a reproduction of the 'War and Peace' novel.

Absolutely. You will also get Hamlet and Ulysses. And if you publish everything the monkeys have produced, you will violate the copyright on War and Peace, Hamlet, and Ulysses. In the old days, that was completely impractical, since there wasn't enough paper in the universe to publish everything the monkeys wrote, so it was also impossible to locate War and Peace in their output.
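The "not enough paper in the universe" point survives a back-of-envelope check; a small sketch, assuming an idealized 27-key typewriter (26 letters plus space):

```python
# Chance of the monkeys randomly typing one short famous line, where one
# attempt is one run of 18 keystrokes on a 27-key typewriter.
line = "to be or not to be"
p = (1 / 27) ** len(line)
print(f"p = {p:.1e}")                     # ~1.7e-26 per attempt
print(f"expected attempts: {1 / p:.1e}")  # ~5.8e+25
# Scale that to a whole novel and no amount of paper suffices. An LLM that
# emits training text verbatim is the opposite situation: the training
# data has already tilted the odds toward exactly that output.
```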
> By the same token, if you ask an AI to create enough code (just assuming it's creating .cpp files), you'll eventually see something that violates somebody's copyright in some form or shape.

But just reading all that code, and storing it (even in some modified version) does not violate anything. Just like I am allowed to memorize Hamlet and perform it in my living room with only the Christmas tree listening, I can internalize all of the Linux kernel in my brain, or in the storage of my AI computer. It's when I publish results based on it that the trouble starts.
> ralphbsz: I understand your train of thought. But long token chains of GPL'ed code are definitely a derivative work, don't you agree? And blending these with other token chains is clearly a modification, which makes the whole LLM a GPL'ed work, technically. Moreover, the weights around some tokens heavily used in GPL code will be almost exclusively shaped by GPL code.

You are making a good argument, which is not based on normal copyright, but on the "no modification" clause of the license. Taking that clause literally means that I can not modify the code at all. Even if I never show it to anyone! Technically, if I stand in my living room and tell the Christmas tree that I have changed the line of code that prints the OS version (in the uname() call) from "Linux" to "Ralphux", I have already violated the license. This is clearly ridiculous, but it's what GPL V3 says.
> The other argument is about reproducing copyrighted material, and my point is that it already happened and will happen again. More by accident than "criminal intent", sure, but some weights in these models will represent some copyrighted material more closely than desired. That's just a consequence of the training procedure, AI companies would certainly avoid that if they could.

And this is where a balance with "fair use" comes in. I can write a new play (for example a romantic comedy involving Marilyn Monroe playing saxophone in an all-girl band), and use the line "to be or not to be" in there as a joke, and I have not violated Shakespeare's copyright, because the quote is so short. Similarly, I can look at how the Linux kernel handles a particularly bizarre USB device, and put that into a driver I'm writing for Ralphux; again that's fair use and allowed, since I'm not using the work itself, only a small bit of knowledge gleaned from reading the kernel, not copying or modifying or using parts of it.
> I'm not a "copyrights first" guy, by no means. But people conveniently gloss over these issues with some fuzzy reasoning, like nobody knows what's exactly encoded in those models. These bad excuses rub me the wrong way.

I agree with you that this is an uncomfortable stretching of existing copyright law. But copyright law has been obsolete and insane since the invention of radio, TV, photocopies, and computer networks. It is intellectually designed for the time of books and magazines, and theater productions. If the AI wave causes all of copyright law to be thrown on the heap of bad ideas, and replaced with something sane, that would be wonderful. Alas, that's unlikely to happen.