Who owns the code Claude wrote?

A program is an extension of a person. Clearly, I cannot sue a program in court. But I can sue the person who wrote the program, or who sold it, or who uses it. In all those cases, though, the basic causal relationship - the thing that "made the crime happen" - is the program.


Nonsense. Example: I write a program that analyzes the pressure in my home water system and how often the pumps have to run, correlating them with the outside temperature, the water level in the well, and the phase of Aquarius. I will find certain results (irrigation uses a lot of water; it takes longer to pump water further uphill). I control the input data, which I have carefully collected over the last few years. To calculate the positions of Jupiter and Aquarius, I use well-known public domain formulas from an astronomy textbook.

Where exactly is the theft here? So your assertion that all programs that process data commit theft is balderdash.

Next: The works of Shakespeare are in the public domain (he's been dead for a while). I use an LLM to write the perfect love sonnet, and use it in an attempt to seduce my wife into cooking a romantic dinner (we've been married for over 30 years). Am I stealing from Shakespeare? Note that my wife would probably slap me, and if I tried it three times in a row, she'd even use a heavy and large object to hit me. Anyway, other than the very serious crimes of literary stupidity and domestic violence, where is the theft here?

People publish things all the time. Shakespeare published his works. Einstein published E=mc². Linus T publishes the Linux kernels. They all have some copyright. But I'm allowed to read all of these published works, make myself a smarter human, and then use that smartness for my own purposes (like getting a nice dinner). We have copyright laws to make sure the balance between the people writing and publishing things and the people reading and using things remains fair and equitable. Those laws work (pretty badly), but they are the best we have, and they keep things somewhat in order.

LLMs don't change the concept behind that. They read all these works that are published and sometimes copyrighted, and they "make themselves smarter". The difference is that the economics of the system are completely changed: LLMs learn faster and more broadly, and create results more cheaply (alas, also usually less reliably). For that reason, the existing "social contract" between author and user may no longer be appropriate. But right now we don't have a new set of laws to govern this, and applying the existing ones to the new situation is difficult and may give counter-intuitive results.
Yeah, I get it: it's a calculator, but it's running on other people's data. I don't see any ethical dilemma with public domain material, and I don't think that a public-domain-licensed work is in question regarding its use. I'm speaking plainly when I say it's theft. Using a lot of words is unnecessary on this topic.
 
Like I said, lots of people make the mistake of mixing up the law on theft and the law on intellectual property infringement. It's very common, and some people even conflate them deliberately to make the latter sound as scary or clearcut as the former. It's a useful rhetorical device, as the "piracy is theft" campaign showed - and more recently with content creators complaining that AI firms have "stolen" their property.

I only said "absolutely everybody with any relevant expertise agrees" - say, somebody who is familiar with Dowling v. United States or their jurisdiction's equivalent. The area of law that experts are worried about here is not the law of theft or conversion - i.e., whether AI firms and/or their end users could be prosecuted or sued for stealing - but the law surrounding intellectual property. To anybody who does have expertise, people who insist that the legal situation is "theft" immediately look unserious. If it's important to you that decision-makers pay attention to your views, then you'd do better to read up on the law and stop writing nonsense, as that would make it much harder for them to dismiss what you're saying. There are very powerful arguments that can be made against AI which don't completely mangle the legal situation. If you just want to vent your views and don't mind whether people take them seriously, then feel free - but don't expect to go uncorrected if you write legal misinformation.

I don't think the point of Dowling v. United States is all that hard to understand and it is an interesting area of law that impinges on a lot of people's professional lives, so personally I think it's worth a little reading up. But YMMV.
I hear ya. But in the simplest of terms it's still theft. Clearly, easily demonstrated theft. You can choose any word or series of words to convey the idea of theft however you like.
 
I propose a reasonable compromise here. For every output, a citation list - including licenses - of the source data sets used in each calculation by the "AI".

From the terabytes of citation and license information in the output, a conclusion about "ownership" can be extrapolated.
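
Purely as an illustration of what that could look like (every name, field and number here is invented), such a per-output manifest might read:

output: answer-0042
sources consulted:
github.com/example/netlib (BSD-2-Clause, 0.4% contribution)
github.com/example/ratelimitd (GPL-2.0, 0.1% contribution)
... and thousands more entries ...

The catch, of course, is that current models do not retain this kind of provenance at all.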
 
I can instead read all the standard textbooks in this field (I have several meters of them in my bookshelf), understand them, and write the same ideas using my own words.
Small nitpick here - LLMs do not "understand" and do not have "own words". At least I've seen no indication of that, but there may be some somewhere.

What we need is a new word, I think. How would someone describe the process of what LLMs do? Stealing it isn't - the thing is still there. This is going to take some time.
 
Small nitpick here - LLMs do not "understand" and do not have "own words". At least I've seen no indication of that, but there may be some somewhere.

What we need is a new word, I think. How would someone describe the process of what LLMs do? Stealing it isn't - the thing is still there. This is going to take some time.
I think creativecommons.org has decided to call it "mass scraping".

"In the age of AI and mass scraping of openly licensed content, we haven’t seen the kind of reciprocity that’s necessary to sustain the infrastructure of the commons beyond our licenses. We’re currently exploring frameworks to provide licensors with more granular control over machine usage of data."
 
It might be a really dumb idea to try, since it might destroy their business model. If the output of an LLM is copyrighted by the AI company, most people cannot use that output.
Indeed, generally AI firms want their end users to be able to utilize their model output with as few restrictions as possible, since that makes their product more valuable. AI firms have tried unilaterally to release their model output from such restrictions, but it's not totally clear how effective that is and it's conceivable this will require some kind of legislation to standardize. But there are plenty of other tensions between groups that could produce court cases, in fact there have been already: https://en.wikipedia.org/wiki/Artificial_intelligence_and_copyright

End users want reassurance that they have their own intellectual property stake in the output (see the OP for how that might be achieved, e.g. through their own application of creativity in the AI prompting process), or at least that nobody else does (note that Theo de Raadt explained this would be insufficient for the OpenBSD Project: for them it's vital that the dev has enough rights over the AI-assisted code to be able to license it, and that it isn't just public domain). AI firms, for their part, want their end users to feel confident they have this protection (for the same reason, they generally don't want their own IP rights over the model output).

Those with IP rights over the training data can have a host of objections to both the AI firms and their end users. They may object to their data being used for training at all, or at least without (possibly statutory) compensation, or to their data being used for certain purposes (e.g. AI models that can be licensed for military purposes), or to models that have the ability to mimic their work (e.g. they may want to prevent "write/draw in the style of ..." prompts), and they certainly want their IP rights to be enforceable against those end users whose model output closely resembles their original work.

Aside from court cases, this also means intense lobbying is underway for legislative changes. AI firms and some of their biggest end users in the USA argue that winning the "AI race" is an economic and national security priority, and aim to minimize the rights of the training data providers. The likes of entertainment firms, artists, and publishers press in the opposite direction: the right to opt out of being used for training, recognition of their own IP in the end model output when there's an identifiable claim like regurgitation, and a legal framework for training royalties and/or licensing fees. Outside the USA and China, local cultural industries are particularly critical of the damage done to them by American (and, to a lesser extent, Chinese) AI firms.

Where this all ends up is speculative, but the basic dynamics are already visible. It took a long time to sort out an international convention on copyright and we might need another one, since there's already some divergence in case law between countries. While there have been technological pressures to change the rules in the past (e.g. anti-piracy), that was mostly driven by increased ease of transmission and reproduction. AI models raise complex and more fundamental issues about creativity that go right to the heart of IP law. I agree with Crivens that we need a new language here, or at least new understandings of old terminology. The structure of LLMs is not quite the same as human brains and they do not "think" in the same way, but they are close enough that it's unhelpful to view them in the same way as simpler processing algorithms - even ones that a few years ago were seen as cutting-edge, like using machine learning to colorize black-and-white movies. That kind of case was straightforward to analyse with traditional IP law.

What LLMs can do with their training data seems far more transformative. Some cases appear more akin to how a human can write a novel, paint a picture, or direct a film that is obviously in the style of someone else, but even such clear influence and mimicry does not necessarily give the original source any IP rights over the new work. Plenty of pitfalls can though, and not just directly copying sections of the original work - reuse of a copyright character or fantasy setting is an infringement, as is copying the "total concept and feel" (see https://en.wikipedia.org/wiki/Roth_Greeting_Cards_v._United_Card_Co.). Note that LLMs are prone to such violations, e.g. they have no problem producing fan-fiction stories or artwork of heavily IP-protected media franchises.

Unfortunately the technology moves faster than the legal and legislative processes. There is strong expert consensus on a few things, not just that this isn't "stealing" like I mentioned before, but for example also that the AI model itself is not eligible to hold any IP rights on its output (you might remember that weird legal case where PETA claimed monkeys should be assigned the copyright on selfie photos they'd taken - https://en.wikipedia.org/wiki/Naruto_v._Slater - which silly as it sounds is relevant here as it reiterated a human authorship requirement). Outside those areas it gets murky fast. I think it's impossible to have real certainty even if you have a team of expensive lawyers at your disposal. Which means incorporating AI-generated code or other content into a project carries risks at the moment.

I think creativecommons.org has decided to call it "mass scraping".
This is not the answer to the question posed by Crivens. Mass scraping is simply the (dodgy and disruptive) process by which AI firms have acquired their training data. But the multi-billion dollar question isn't how the data is acquired, but the legal status of model outputs for an AI model which has been trained on the data. That depends on what you think the model is doing with that training data, which is what Crivens asked. Note that it's possible that publishers who hold rights over the training data may be able to sue AI firms for damages due to the scraping (some of these challenges have been settled out of court - see link in the first paragraph of this post) without them necessarily being able to enforce IP rights against the end users for the way they use their model output.

What you're saying isn't totally irrelevant - perhaps it will turn out to make a difference to the status of the model output whether the training data was acquired non-consensually or via some form of licensing agreement. But even licensing agreements are not a legal cure-all. IP owners won't want model outputs to be free of all restrictions if it falls into one of the pitfalls that also affect human-created work I mentioned above, so they're not going to sign away all rights over the model output (in fact even if they did, it's not totally clear what legal effect that would have - some IP rights are inalienable). But provided output doesn't fall into one of those pitfalls, end users wouldn't want the model output to be burdened with any additional licensing conditions from the owners of the training data. Clarifying this relationship between IP owners, AI firms and end users is one of the reasons I feel like a new legal framework will be needed.
 
Some cases appear more akin to how a human can write a novel, paint a picture, or direct a film that is obviously in the style of someone else, but even such clear influence and mimicry does not necessarily give the original source any IP rights over the new work.
My favorite example is from about 4 years ago, when Google had an AI image generator that allowed drawing pictures from text descriptions. Our group (which was inside Google) used it to make a picture of Snoopy, lying on his doghouse, in a field of sunflowers, in the style of Van Gogh. It was wonderful: clearly the painting technique of Van Gogh, but then also clearly the white and black beagle in his usual pose on his doghouse. My ethical judgement is that the copyright on that image should be held by BOTH Mr. Vincent Van Gogh and Mr. Charles Schulz. And by the people in our group, who spent half an hour messing with the tool to make that charming drawing. And by the engineering group who had created the AI tool. But clearly copyright law doesn't allow four separate people/groups to hold a copyright, even though all of them contributed (some long after their death).

This whole thing is a dilemma, just as you describe. And my fear is, just as you describe, that the legislation that resolves the dilemma will not be created by level-headed ethical concerns, nor sensible economic reasoning, but by who has the largest lobbying group with politicians. And that the answer from Brussels, Washington and Beijing will be in conflict.
 
My favorite example is from about 4 years ago, when Google had an AI image generator that allowed drawing pictures from text descriptions. Our group (which was inside Google) used it to make a picture of Snoopy, lying on his doghouse, in a field of sunflowers, in the style of Van Gogh. It was wonderful: clearly the painting technique of Van Gogh, but then also clearly the white and black beagle in his usual pose on his doghouse. My ethical judgement is that the copyright on that image should be held by BOTH Mr. Vincent Van Gogh and Mr. Charles Schulz. And by the people in our group, who spent half an hour messing with the tool to make that charming drawing. And by the engineering group who had created the AI tool. But clearly copyright law doesn't allow four separate people/groups to hold a copyright, even though all of them contributed (some long after their death).
This is an interesting example - there's no problem with multiple groups holding the copyright actually, in principle even a large number of them. But a legal situation people are keen to avoid with AI is something suggested earlier in the thread, in which AI-produced content is regarded as simply remixing the training data into a derivative work, in such a way that all the holders of IP in the training data end up with a claim on the output. This would be unwieldy and impractical, and substantially reduce the value of the technology.

Aside from the fact Van Gogh has been dead for over a century, beyond even Mexico's unusually lengthy life-plus-100 years copyright term, closely copying his style or technique wouldn't in itself be sufficient for him (or his estate) to have any claim over the work. That's because copyright protects specific expressions of an idea, but not the underlying idea, concept, process, method, technique etc. See e.g. https://en.wikipedia.org/wiki/Idea–expression_distinction, https://www.wipo.int/en/web/copyright/protection, https://www.copyright.gov/circs/circ33.pdf

However, if Van Gogh hadn't been dead for so many years, then a blatant rip-off of one of his "Sunflowers" series could be problematic. This is where tests of "substantial similarity" come into play, like the "total concept and feel" test I linked earlier: see https://en.wikipedia.org/wiki/Substantial_similarity

Since this thread is about copyright in code, I'll mention https://en.wikipedia.org/wiki/Abstraction-Filtration-Comparison_test which is why avoiding copying code verbatim is not enough to avoid licensing contamination but some similarity can actually be okay:

The AFC test is a three-step process for determining substantial similarity of the non-literal elements of a computer program. The process requires the court to first identify the increasing levels of abstraction of the program. Then, at each level of abstraction, material that is not protectable by copyright is identified and filtered out from further examination. The final step is to compare the defendant's program to the plaintiff's, looking only at the copyright-protected material as identified in the previous two steps, and determine whether the plaintiff's work was copied. In addition, the court will assess the relative significance of any copied material with respect to the entire program.

There's no doubt that the IP rights to Snoopy belong to Peanuts Holdings LLC (majority owned by Sony with a smaller stake held by the Schulz family) so if the beagle in your picture was obviously Snoopy then their rights apply.

There's a legally interesting point about the role of your team versus the team who worked on the tool at a higher level, that was somewhat explored in the article in the OP. Your team's work had more authorial intent, since it was trying to get the picture just right. Chinese case law has recognized copyright in an AI-produced image on the basis of the prompting process requiring creativity. Lawyers encourage the logging of prompts for this kind of reason. The team who worked on the tool at a more abstract level are quite far removed from this end use. Their work would have produced IP rights (belonging to them and/or their employer) over the code for the tool itself, but probably not over the output of the tool. When you use a digital camera, the manufacturer doesn't get rights over your work just because they invented the kit and wrote the fancy image processing algorithms that produced your finished photos.
 
Did the movie 2001: A Space Odyssey or the book have a name for what HAL did?
Yes, IIRC, it came up in 2010: HAL stands for Heuristically programmed ALgorithmic computer, and definitely not for being one letter before IBM. So, presumably, what it was doing was more a matter of identifying what sort of problem it was dealing with and then using a set of algorithms more tuned to that problem, rather than using one "global" intelligence for it.

But, then again, HAL went nuts because they were stupid enough to tell it what the mission was and then order it to lie to the crew about the purpose, even though it had been specifically programmed to never lie - rather than just putting that information on a disk that required HAL to decode, and ordering it not to do so until they reached a certain point. So, clearly not the brightest CS people ever.
 
AI doesn't steal code. It can't steal anything, because it's not a legal entity (yet). Just like a dog that steals a sausage from your shopping bag isn't a thief in the eyes of the law. AI is a tool that helps you get the job done. It's you who is responsible for using that tool correctly. If you take a chainsaw and kill someone, you're the killer, not the saw. Nor is the AI, should you ever use an AI-powered chainsaw.
If you don't like someone learning from your code and writing their own - with or without AI - just don't share your source code.
 
homeadm Your argument is rather poor. AI (Claude) is owned by Anthropic, which is a legal entity and might, one day, be sued over its AI stealing code.
Of course, when one publishes their code online, they leave themselves open to theft, but presumably they license it in some way that makes particular uses legal or not.
Whether AI ever abides by that is another story.
 
There are three distinct things here:

1) the contract between employer and employee regarding IP
2) the legitimacy of the source code that was used to train the model
3) the legitimacy of any source code dynamically fetched by an LLM agent system as part of a workflow

Point 1 is pretty clear. There is absolutely no need to distinguish the tooling if the contract says all code produced for the tasks of the company, during company time, is owned by the company.

Point 2 is less clear, since models do not persist references to their training sources.

Point 3 is also clear. If the LLM happens to scrape some online source code whose licence doesn't align with the company's licencing, it is an illicit act by the company.

Practically, what happens if you do agentic coding for the company under company policy? The code used to train the model is almost certainly ill-licensed for this purpose, but proving that is nearly impossible right now. The company may also shift the blame to the LLM vendor, because the vendor never mentioned any cross-licensing agreements or the like. And when the agent goes onto the internet, pulls in a code example from a GPL project, and generates source from it, that is not a rewrite.

A rewrite vs. a refactor is legally well-trodden ground; there are many prior rulings that can be referred to. Just take a look at the genesis of FreeBSD.
LLM code generation is automatic refactoring. It cannot be said, with any scientific rigor, that the LLM understood what the prohibited source base does and then implemented the same mechanisms. Everyone knows the LLM just syntactically merged the two codebases. The whole point of that litigation was to not allow automatic tools to change source code to the point where it isn't that source code in ownership terms. The point was: OK, we cannot make you forget what you saw - as a human, you see, you understand, you cannot drop tables from your brain or delete caches. Everything you experience influences and changes you for good, irreversibly. This is not the case with machines.

So, in my view, the issue is that most developers today have never worked in a nominal software engineering environment. You don't own the code, even if the company doesn't own the means used to generate it. You have no rights and no obligations. If you act on company policy, you cannot be held accountable for anything. If there is an IP infringement triggered by your engaging in some LLM activity the company allowed, it's either the company or the LLM vendor that will reap the consequences.
 
Your argument is rather poor. AI (Claude) is owned by Anthropic, which is a legal entity and might, one day, be sued over its AI stealing code.

You could therefore argue that Anthropic is stealing code, not the AI itself. But Ralph explained why relying on the general body of available knowledge isn't theft. I believe the same is true for code. And again, anything created by AI can be reproduced without it. AI is merely a tool that facilitates and shortens the path to the goal. Just my 2 cents.
 
Rereading this thread, and then thinking about the thread about vscode and git commit messages, makes it even more interesting.
Then toss in this: what if you are being paid for SW development, all the source code says "copyright my company name", but some portions of the code are generated by AI - how does that affect the copyright notice in the code?
Now you are using vscode at work to generate code and committing to git with the co-authored-by message.
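
For what it's worth, that trailer is just a standard git "Co-authored-by" line at the end of the commit message. A hand-rolled equivalent would look something like this (the name and email are placeholders, not what any particular tool actually inserts):

git commit -m "Add rate limiting module" -m "Co-authored-by: AI Assistant <ai@example.com>"

The trailer records attribution by convention; it doesn't by itself create or assign any copyright, which is exactly the open question here.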
 
JohnK thanks for the link. Good stuff. Lifted this from that page, I think it describes things clearly, especially the bolded part.

Given this, in order for code produced with AI to be accepted, it must either be trivial enough to be not copyrightable (basic refactoring, one line bugfixes), or there must be a public statement available from the AI publisher showing they do not assert copyright over the work.
 
Btw. very good article, worth spreading around.
In essence, it makes all AI output derivative work, and the AI user is now in the position of having to prove that his human input outweighs the machine output in order to be assigned ownership.

This should kill off all vibe coding instantly. For example: "add rate limiting module", followed by "module works but rate is wrong", etc., should also be a no-go.
 
JohnK thanks for the link. Good stuff. Lifted this from that page, I think it describes things clearly, especially the bolded part.

Given this, in order for code produced with AI to be accepted, it must either be trivial enough to be not copyrightable (basic refactoring, one line bugfixes), or there must be a public statement available from the AI publisher showing they do not assert copyright over the work.
:]
I found it very interesting also because of `tmux` being part of OpenBSD and all.

Eventually, I want to get around to adding that to my READMEs as well. I'm neither a professional dev nor all that used to GitHub (for example), but I've noticed people like to push these monster PRs (which makes me suspicious of them being "generated"). About a month ago I accepted one and it broke my entire project, and I only found out later that it was AI-produced; after about a month, I finally unscrewed the code (I reverted the code base, of course, but then had to manually refactor the code to re-implement the features). What a mess (especially because: I be stupid)!
 
JohnK git reset --hard <commit-id>
branching is your friend and savior.
Someone sends a pull request/patch in a big PR, apply it on a branch.
If it does not build, reject it. Regardless of how useful it may be. You can say "applying breaks build, fix it and resubmit".
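
Something along these lines, assuming the submission arrives as a patch file called pr.diff (the names here are only illustrative):

git switch -c review/big-pr    # throwaway branch, main stays untouched
git apply --stat pr.diff       # preview which files the patch touches
git apply pr.diff && make      # apply it and see if it even builds
git switch main && git branch -D review/big-pr    # build broke? nuke the branch

If it builds and passes whatever tests you have, then you can think about merging.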
 
JohnK git reset --hard <commit-id>
branching is your friend and savior.
Someone sends a pull request/patch in a big PR, apply it on a branch.
If it does not build, reject it. Regardless of how useful it may be. You can say "applying breaks build, fix it and resubmit".
Wait! Saying that wouldn't be considered "rude"? ...I think I've been doing things all wrong.
We may need to take this off-thread but, I can say things like that and not take an hour to type a response like: "I think doing this may not work in all cases because..."?! In other words, you're telling me professionals have super thick skin. I need to learn from you guys!
 
Wait! Saying that wouldn't be considered "rude"? ...I think I've been doing things all wrong.
We may need to take this off-thread but, I can say things like that and not take an hour to type a response like: "I think doing this may not work in all cases because..."?! In other words, you're telling me professionals have super thick skin. I need to learn from you guys!
Ha.

No, I'm saying "feelings don't override broken code. Fix your code, buy me tequila to fix my feelings and I'll think about the merge request".
Besides, AI doesn't have feelings, does it? (Yeah, I'm one of those "equal opportunity" types. I'm an a**hole to everyone, equally)
 
Check this very important part

What is unsettled is whether AI-generated output that reproduces training data patterns counts as verbatim copying. The working assumption among lawyers advising companies through M&A is that it probably does, and that assumption is now showing up as a standard condition in acquisition due diligence.

Use a part of permissively licensed code and you have to insert their license verbatim, which includes the author's name and such.
Check out the About/Licence pages on Android or an iPhone - what a huge list it is.
LLMs cannot tell you the sources of generated output from training data.

What this means: if they train their model on 1000 BSD/LGPL projects, every word of code spat out by that model will have to come with 1000 BSD/LGPL licenses verbatim.
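
For reference, the clause in the BSD 2-Clause license that forces this is short and absolute: "Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer." Multiply that by every project in the training set and the attribution list becomes absurd.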
 