GitHub Copilot now available for free

No, current law holds that I learned something. However, if I write a patent application about my new glue, and in there I literally quote half a page from Mr. Bayer's textbook of organic glues, then I have violated copyright.
See the organic chemistry example above. The model is lossy and compressed, yet complete and well connected, just like my memory of organic chemistry. As long as the model doesn't literally regurgitate whole sections, it has ingested knowledge.

The student learning example is irrelevant here. The law differentiates between memorizing something in a brain and making physical (including digital) copies. An LLM is not a brain; it's data stored in a deterministic and durable way. Models of this depth are prone to overfitting and will reproduce some parts of the training data literally, given the right input, fully deterministically. Which means the model stores an encoded copy of those parts.

Chances that these LLMs store at least some parts of GPL'd code verbatim are 100%.

But what license can you write that says: You can use this code, but you can't read it, and if you read it you must forget it, and retain nothing more than the general concepts? Because that is sort of what a student (or the AI) does.

That's what licenses are there for: Terms of usage. In the GPL this is not just about copying, but also translation and modification, which both happen when training LLMs.

BTW, why does everybody and their grandma equate LLMs to human brains? Each is stupid in its own, completely different way.
 
BTW, why does everybody and their grandma equate LLMs to human brains?
Nobody's equating LLMs to human brains. LLMs are the equivalent of textbooks that students have to read to learn anything. CPUs are the brains. It's kind of like first using an English Language textbook to learn the language, then using that knowledge to read classified documents (which are written in English if you are in a country where English is the official language).
 
Models of this depth are prone to overfitting and will reproduce some parts of the training data literally, given the right input, fully deterministically. Which means the model stores an encoded copy of those parts.

Chances that these LLMs store at least some parts of GPL'd code verbatim are 100%.
I doubt that the kind of LLM that GitHub's AI autocomplete uses has long enough sequences of tokens in the model to be able to reproduce any more than a few lines at a time. And remember that copyright allows for fair use of small snippets. How small exactly those are is an interesting question; for music and literature that has been litigated; for source code not so much, as far as I know.

That's what licenses are there for: Terms of usage. In the GPL this is not just about copying, but also translation and modification, which both happen when training LLMs.
The model does not contain a modified copy of the GPLed source code. It's not like the model contains "The Linux Kernel with the SCSI driver replaced by a SATA driver" (dumb example, but an example). It contains very short snippets of the source code, turned into token sequences and their frequencies (highly simplifying).

The GPL talks a lot about turning source code into object code, but a model doesn't do that either: the LLM is not linkable or executable. I think today's licenses just don't get any traction on what model training does: It reads the source code (that is explicitly legal), it tokenizes it (again, there is no prohibition against that), and then it records the frequencies of short token sequences. And it is usually impossible to reconstruct the model input from the frequencies, except for short sequences, or really improbable ones. I think to prevent AI training, a new license would have to be created. And creating that one is not easy, since it has to prohibit "reading the source code with intent to store the fine-grain structure", which is exactly what a human reader does, for example when trying to learn coding style.
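
To make that highly simplified picture concrete, here is a toy sketch in Python. It is my own illustration, with a made-up regex tokenizer and plain counting, not a description of how Copilot's model actually works: it splits source text into word-like tokens and counts how often short token sequences (n-grams) occur.

```python
# Toy sketch only: a crude regex tokenizer and n-gram counter. Real models use
# learned subword tokenizers and neural networks, not plain frequency tables.
import re
from collections import Counter

def tokenize(source: str) -> list[str]:
    # Split into identifiers/words and single punctuation characters.
    return re.findall(r"\w+|[^\w\s]", source)

def ngram_counts(tokens: list[str], n: int = 3) -> Counter:
    # Count every run of n consecutive tokens ("frequencies of short token sequences").
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

code = "for (int i = 0; i < n; i++) { total += a[i]; }"
print(ngram_counts(tokenize(code), n=3).most_common(3))
```

From counts like these alone you generally cannot reconstruct a long input; only short or very unusual sequences survive recognizably, which is the point being made above.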

It's not an easy or obvious issue.
 
The GPL talks about derived works, and that may be a new source file. The point is also not really whether the LLM is allowed to do anything. The LLM gives you a code fragment that may or may not be based on GPL code. The problem starts when you copy that into your source files.

What they did here is not breaking the law, but more like pulling the trigger and handing you the gun while the hammer is falling. It's you who has to point it, and you have no idea whether it's empty, or whether blanks or lead are coming out. Or a pool noodle.
 
Nobody's equating LLMs to human brains. LLMs are the equivalent of textbooks that students have to read to learn anything. CPUs are the brains. It's kind of like first using an English Language textbook to learn the language, then using that knowledge to read classified documents (which are written in English if you are in a country where English is the official language).

Sorry, that makes no sense to me. An LLM is not a corpus, a CPU has no resemblance to what a brain does, and ralphbsz is not nobody...

I doubt that the kind of LLM that GitHub's AI autocomplete uses has long enough sequences of tokens in the model to be able to reproduce any more than a few lines at a time. And remember that copyright allows for fair use of small snippets. How small exactly those are is an interesting question; for music and literature that has been litigated; for source code not so much, as far as I know.

Why would you think that? Today's models boast thousands of tokens of context, and produce the complete structure of multiple source code files on request. That context has to come from somewhere. To feed them only small portions of training data would completely defy the purpose and make them far more limited than they are today.

The model does not contain a modified copy of the GPLed source code. It's not like the model contains "The Linux Kernel with the SCSI driver replaced by a SATA driver" (dumb example, but an example). It contains very short snippets of the source code, turned into token sequences and their frequencies (highly simplifying).

The model contains weights that will reproduce some parts of GPL'd code literally; that's a known problem of these models. They have been caught reproducing whole paragraphs of text verbatim, even with the obfuscation that is built into the interfaces. And that's only the copyright issue; you're still neglecting the terms of usage.

The GPL talks a lot about turning source code into object code, but a model doesn't do that either: the LLM is not linkable or executable. I think today's licenses just don't get any traction on what model training does: It reads the source code (that is explicitly legal), it tokenizes it (again, there is no prohibition against that), and then it records the frequencies of short token sequences.

That's far too narrow: anything that translates and modifies the code creates a derivative work, to which the GPL applies. Tokenizing is a 1:1 translation of the code, while the goal of the whole procedure is to extract linguistic structure (translation again) and mix it with other linguistic structure (modification). Technically that makes an LLM a derivative work, even if it wouldn't reproduce large enough portions to trigger copyright infringement.
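
As a toy illustration of the "1:1 translation" point, here is a small Python sketch. It is my own example, with a made-up whitespace-preserving tokenizer (real models use learned subword vocabularies), showing that mapping code to token ids is lossless: the tokenized form still encodes the original text exactly.

```python
# Toy sketch: a reversible tokenizer built from the input itself, just to show
# that tokenization loses nothing. Not the tokenizer of any real model.
import re

TOKEN_RE = re.compile(r"\s+|\w+|[^\w\s]")

def build_vocab(source: str) -> dict[str, int]:
    tokens = TOKEN_RE.findall(source)
    return {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}

def encode(source: str, vocab: dict[str, int]) -> list[int]:
    return [vocab[tok] for tok in TOKEN_RE.findall(source)]

def decode(ids: list[int], vocab: dict[str, int]) -> str:
    inverse = {i: tok for tok, i in vocab.items()}
    return "".join(inverse[i] for i in ids)

snippet = 'if (copy_to_user(buf, data, len)) return -EFAULT;'
vocab = build_vocab(snippet)
assert decode(encode(snippet, vocab), vocab) == snippet  # round trip is exact
```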

I think to prevent AI training, a new license would have to be created. And creating that one is not easy, since it has to prohibit "reading the source code with intent to store the fine-grain structure", which is exactly what a human reader does, for example when trying to learn coding style.

The GPL does not prevent LLM training; it's just that its terms of usage apply to the LLM created. And I really don't get what should be difficult about legally differentiating between a human reader and a computer. We have that distinction in a lot of copyright law already.
 
CPU has no resemblance to what a brain does
why not? I'd think it's pretty common knowledge that a CPU is the "brain" of the computer...
Why would you think that? Today's models boast thousands of tokens of context, and produce the complete structure of multiple source code files on request. That context has to come from somewhere. To feed them only small portions of training data would completely defy the purpose and make them far more limited than they are today.
Re-read the very snippet that ralphbsz wrote (and you are responding to). The exact size of those tokens of context does matter, because they just might contain code that somebody doesn't want to share.

And I really don't get what should be difficult about legally differentiating between a human reader and a computer.
Ever hear of the Turing Test? Do you know what it's about? That test doesn't test the computer; it tests humans, to see if they can be fooled into thinking that there's a human on the other end of the wire. And it's becoming easier and easier to fool the human.
 
Since I have a 32 GB GPU, I run local LLMs and don't use Copilot. But I do see an issue: if you put your code out as f2u under all these licenses it can be dangerous. And there is another catch: how can you prove that my LLM/Copilot gave me this code, rather than that I stole it from somewhere? What happens if the LLM/Copilot gives me code which is under a license and no one knows? Or if I re-write this code myself using my brain, but it turns out someone else already wrote this code and put a license on it?
I'm new to all of this, so I'm kind of interested in all these "what if..." outcomes.
P.S. Does M$ snitch with its "spiders" through the web, looking for code on other websites and taking it?
 
why not? I'd think it's pretty common knowledge that a CPU is the "brain" of the computer...

Nope, that's being pretty ignorant about what a brain does. If your brain were a CPU
  • it wouldn't remember things permanently
  • it would do exact logic and calculations very fast but be inefficient at fuzzy tasks
  • it wouldn't work all in parallel at the same time
  • it wouldn't reorganize its logic and memory while using them.
Bad analogy, in both directions. IIRC, parts of the frontal cortex could be comparable in function, if you really want to stretch the comparison.

Re-read the very snippet that ralphbsz wrote (and you are responding to). The exact size of those tokens of context does matter, because they just might contain code that somebody doesn't want to share.

There's no "size of those tokens", tokens roughly correspond to words. If an LLM is only fed single, non-overlapping sentences for training, it won't be able to produce whole paragraphs or documents around a context. At least not in a way that is linguistically sound (factual correctness is a different story). And yet they do.

Given this, I have no reason to believe what ralphbsz assumes, namely that LLMs are trained with so few tokens at a time that the represented text would escape copyright claims for consisting only of short fragments.

Ever hear of the Turing Test? Do you know what it's about? That test doesn't test the computer; it tests humans, to see if they can be fooled into thinking that there's a human on the other end of the wire. And it's becoming easier and easier to fool the human.

How is that relevant here? It's easy to define the difference between humans and computers in legislation, and law enforcement can just look at the physical appearance if needed.
 
Nope, that's being pretty ignorant about what a brain does. If your brain were a CPU
  • it wouldn't remember things permanently
  • it would do exact logic and calculations very fast but be inefficient at fuzzy tasks
  • it wouldn't work all in parallel at the same time
  • it wouldn't reorganize its logic and memory while using them.
Bad analogy, in both directions. IIRC, parts of the frontal cortex could be comparable in function, if you really want to stretch the comparison.
  • It's amazing what humans can remember permanently and what they can't. That varies from one person to another. As for remembering things permanently - that's what external devices are for - be it Wikipedia or a local library of physical books.
  • Humans rely on computers to do exact logic and calculations. And yes, they are surprisingly inefficient at fuzzy tasks. Just ask any politician about their progress towards reducing the impact of economic activity on climate change. Nobody has yet reached the baselines established by the Kyoto Protocol... and the 2030 deadline is looming, y'know 😏
  • Ever hear of multitasking? Like drinking coffee while trying to pet your cat while trying to write C++ code? ;) Ever hear of the MULTICS operating system? Any modern CPU is capable of serving up a few TTYs to a few users at the same time, y'know.
  • How do you think newer CPUs are designed? With the help of existing CPUs. And it's perfectly possible to reorganize the contents of RAM while using it. Brains are like that too - just try following a recipe and cooking in the kitchen.
There's no "size of those tokens", tokens roughly correspond to words.
What if a whole text file were treated as a token? Or a .jpg image? For an example of image-based AI training, you may recall the hoopla over the 'Black Hitler' AI-generated imagery. Yes, the exact content of training materials for AIs does matter, for many reasons. Not just copyright violations, not just historical/factual inaccuracies.
Given this, I have no reason to believe what ralphbsz assumes, namely that LLMs are trained with so few tokens at a time that the represented text would escape copyright claims for consisting only of short fragments.
Y'know, it does take a ton of processing power to train an LLM. So, it's a bit of a tradeoff between investing in a machine powerful enough to process the terabytes of contextual data (and learn adequately from it) and the return on such an investment.

How is that relevant here? It's easy to define the difference between humans and computers in legislation, and law enforcement can just look at the physical appearance if needed.
Yeah, if ChatGPT can generate a patch to work around a Wayland-related bug in FreeBSD, can you still claim that it's easy to define the difference between humans and computers? An AI can easily run a marriage scam out of Nigeria at this point, y'know. Try getting a cop to take a look at the criminal AI and say with a straight face that they can tell the AI from a real human.
 
Given this, I have no reason to believe what ralphbsz assumes, namely that LLMs are trained with so few tokens at a time that the represented text would escape copyright claims for consisting only of short fragments.
That's not at all what I assume. On the contrary, I think good coding / autocomplete LLMs are trained on a very large corpus, for example all of GitHub, or in large software companies all of the internal code base (which is typically 10^9 LOC). And they are trained on large token chains, as they have to be to give reasonably accurate predictions.

But they don't store the complete token chains; they only store states and state transition probabilities. And because there are so many transition probabilities, it is hardly possible for the model to regurgitate a completion that is textually exact. Let me go back to the example of literature and novels: If I ask an LLM to auto-complete the novel that starts with "Call me Ishmael", it will probably not get it right. Now, a good modern conversational AI (like ChatGPT) will immediately detect what the novel is, tell me that it is Moby Dick, and point me at a Project Gutenberg page where the whole thing is online and readable, but that's not regurgitating the whole novel; that is a bypass around the LLM.
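
To illustrate the "states and state transition probabilities" idea, here is a deliberately tiny bigram (Markov-chain-style) sketch in Python. It is my own simplification, not the architecture any real coding assistant uses: it stores only next-token counts, not the training text itself.

```python
# Toy bigram model: records only how often token B follows token A, then samples
# completions from those counts. A simplification for illustration only.
import random
from collections import defaultdict, Counter

def train(corpus: list[str]) -> dict[str, Counter]:
    transitions: dict[str, Counter] = defaultdict(Counter)
    for text in corpus:
        tokens = text.split()
        for a, b in zip(tokens, tokens[1:]):
            transitions[a][b] += 1  # state -> next-token frequency
    return transitions

def complete(transitions: dict[str, Counter], start: str, length: int = 10) -> str:
    out = [start]
    for _ in range(length):
        nxt = transitions.get(out[-1])
        if not nxt:
            break
        tokens, weights = zip(*nxt.items())
        out.append(random.choices(tokens, weights=weights)[0])
    return " ".join(out)

model = train(["call me ishmael some years ago never mind how long precisely"])
print(complete(model, "call"))
```

On a tiny corpus like this the toy model replays its input verbatim, which is the overfitting concern raised earlier in the thread; on a huge mixed corpus the counts blend, and an exact long reproduction becomes much less likely, which is the point being made here.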

My (uneducated) guess remains that an LLM that helps with creating source code will not reproduce enough of a copyrighted work (such as the Linux or FreeBSD kernel) to violate copyright law. And when I ask it to auto-complete whatever project I'm working on right now (which happens to be: how to download lots of files from Microsoft OneDrive using Python), it will probably not steal enough content from any existing copyrighted code base to trigger a license. But I admit that this is an open question, which ultimately legislatures and courts will have to decide.
 
My (uneducated) guess remains that an LLM that helps with creating source code will not reproduce enough of a copyrighted work (such as the Linux or FreeBSD kernel) to violate copyright law. And when I ask it to auto-complete whatever project I'm working on right now (which happens to be: how to download lots of files from Microsoft OneDrive using Python), it will probably not steal enough content from any existing copyrighted code base to trigger a license. But I admit that this is an open question, which ultimately legislatures and courts will have to decide.
If you have enough monkeys with enough typewriters, you'll eventually get a reproduction of the 'War and Peace' novel.

By the same token, if you ask an AI to create enough code (just assuming it's creating .cpp files), you'll eventually see something that violates somebody's copyright in some form or shape.

Can an AI feel any shame about that? How can anyone tell THAT, especially with a straight face?
 
astyle: Very creative reasoning; unfortunately, that makes it quite hard to stay on point and follow through with an argument. Maybe we can compromise by saying that in your case, your brain acts a little bit more like a CPU than the brains of other people? ;)

ralphbsz: I understand your train of thought. But long token chains of GPL'ed code are definitely a derivative work, don't you agree? And blending these with other token chains is clearly a modification, which makes the whole LLM a GPL'ed work, technically. Moreover, the weights around some tokens heavily used in GPL code will be almost exclusively shaped by GPL code.

The other argument is about reproducing copyrighted material, and my point is that it has already happened and will happen again. More by accident than "criminal intent", sure, but some weights in these models will represent some copyrighted material more closely than desired. That's just a consequence of the training procedure; AI companies would certainly avoid that if they could.

I'm not a "copyrights first" guy, by no means. But people conveniently gloss over these issues with some fuzzy reasoning, like nobody knows what's exactly encoded in those models. These bad excuses rub me the wrong way.
 
If you have enough monkeys with enough typewriters, you'll eventually get a reproduction of the 'War and Peace' novel.
Absolutely. You will also get Hamlet and Ulysses. And if you publish everything the monkeys have produced, you will violate the copyright on War and Peace, Hamlet, and Ulysses. In the old days, that was completely impractical, since there wasn't enough paper in the universe to publish everything the monkeys wrote, so it was also impossible to locate War and Peace in their output.

Note 1: If you have an intelligent human who reads all the monkeys' works, and then intentionally selects War and Peace from them and publishes it because it seems interesting, that intelligent human might be the one who violates copyright, because they have knowledge (probably based on reading the great classics) of what constitutes a good novel.

Note 2: If the monkeys just type, and you throw their output away instead of publishing it, then no copyright is violated, even if they have typed War and Peace (and Hamlet and Ulysses). Copyright is about the right to make money by publishing copies, not about the internal product. So I can learn Hamlet by heart, and sit in my living room and say "to be or not to be" when nobody is listening, and that does not violate copyright, nor the implied license that comes from books. What I can not do is to stand on a stage and say that line, nor sell copies of the book in the lobby of the theater.

By the same token, if you ask an AI to create enough code (just assuming it's creating .cpp files), you'll eventually see something that violates somebody's copyright in some form or shape.
But just reading all that code, and storing it (even in some modified version) does not violate anything. Just like I am allowed to memorize Hamlet and perform it in my living room with only the Christmas tree listening, I can internalize all of the Linux kernel in my brain, or in the storage of my AI computer. It's when I publish results based on it that the trouble starts.

ralphbsz: I understand your train of thought. But long token chains of GPL'ed code are definitely a derivative work, don't you agree? And blending these with other token chains is clearly a modification, which makes the whole LLM a GPL'ed work, technically. Moreover, the weights around some tokens heavily used in GPL code will be almost exclusively shaped by GPL code.
You are making a good argument, which is not based on normal copyright, but on the "no modification" clause of the license. Taking that clause literally means that I can not modify the code at all. Even if I never show it to anyone! Technically, if I stand in my living room and tell the Christmas tree that I have changed the line of code that prints the OS version (in the uname() call) from "Linux" to "Ralphux", I have already violated the license. This is clearly ridiculous, but it's what GPL V3 says.

The other argument is about reproducing copyrighted material, and my point is that it has already happened and will happen again. More by accident than "criminal intent", sure, but some weights in these models will represent some copyrighted material more closely than desired. That's just a consequence of the training procedure; AI companies would certainly avoid that if they could.
And this is where a balance with "fair use" comes in. I can write a new play (for example a romantic comedy involving Marilyn Monroe playing saxophone in an all-girl band), and use the line "to be or not to be" in there as a joke, and I have not violated Shakespeare's copyright, because the quote is so short. Similarly, I can look at how the Linux kernel handles a particularly bizarre USB device, and put that into a driver I'm writing for Ralphux; again, that's fair use and allowed, since I'm not using the work itself, only a small bit of knowledge gleaned from reading the kernel, not copying or modifying or using parts of it.

Obviously fair use and knowledge are a gray zone. For example, the Linux people could claim that my code is written exclusively using ASCII characters 0x20 through 0x7E, and they use the same ones, ergo my code copies their work at the level of 8-bit entities. No judge or jury would ever let them get away with that, but crazier things have been tried in lawsuits.

So when the AI model uses lots of snippets from different places, is that fair use? I don't know.

I'm not a "copyrights first" guy, by no means. But people conveniently gloss over these issues with some fuzzy reasoning, like nobody knows what's exactly encoded in those models. These bad excuses rub me the wrong way.
I agree with you that this is an uncomfortable stretching of existing copyright law. But copyright law has been obsolete and insane since the invention of radio, TV, photocopies, and computer networks. It is intellectually designed for the time of books and magazines, and theater productions. If the AI wave causes all of copyright law to be thrown on the heap of bad ideas, and replaced with something sane, that would be wonderful. Alas, that's unlikely to happen.

And now I have to end this long post, since I have to go deal with Napoleon's invading army while drinking lots of vodka. See you later, Yorick.
 