It might be a really dumb idea to try, since it could destroy their business model. If the output of an LLM were copyrighted by the AI company, most people could not use that output.
Indeed, AI firms generally want their end users to be able to use their model output with as few restrictions as possible, since that makes their product more valuable. AI firms have tried to release their model output from such restrictions unilaterally, but it's not clear how effective that is, and it's conceivable this will require some kind of legislation to standardize. But there are plenty of other tensions between groups that could produce court cases; in fact there have been some already:
https://en.wikipedia.org/wiki/Artificial_intelligence_and_copyright
End users want reassurance that they have their own intellectual property stake in the output (see the OP for how that might be achieved, e.g. through their own application of creativity in the AI prompting process), or at least that nobody else does (note that Theo de Raadt explained this would be insufficient for the OpenBSD Project: for them it's vital that the developer has enough rights over the AI-assisted code to be able to license it, rather than it simply being public domain). AI firms want their end users to feel confident they have this protection, for the same reason they generally don't want their own IP rights over the model output.
Those with IP rights over the training data can have a host of objections to both the AI firms and their end users. They may object to their data being used for training at all, or at least without (possibly statutory) compensation; to their data being used for certain purposes (e.g. AI models that can be licensed for military use); or to models that have the ability to mimic their work (e.g. they may want to prevent "write/draw in the style of ..." prompts). And they certainly want their IP rights to be enforceable against end users whose model output closely resembles their original work.
Aside from court cases, this also means intense lobbying is underway for legislative changes. AI firms and some of their biggest end users in the USA argue that winning the "AI race" is an economic and national security priority, and aim to minimize the rights of the training data providers. Entertainment firms, artists, publishers, etc. press in the opposite direction: the right to opt their work out of training, recognition of their own IP in the model output when there's an identifiable claim like regurgitation, and a legal framework for training royalties and/or licensing fees. Outside the USA and China, local cultural industries are particularly critical of the damage done to them by American (and to a lesser extent, Chinese) AI firms.
Where this all ends up is speculative, but the basic dynamics are already visible. It took a long time to sort out an international convention on copyright, and we might need another one, since there's already some divergence in case law between countries. While there have been technological pressures to change the rules in the past (e.g. anti-piracy measures), those were mostly driven by the increased ease of transmission and reproduction. AI models raise more complex and more fundamental questions about creativity that go right to the heart of IP law. I agree with Crivens that we need a new language here, or at least new understandings of old terminology. The structure of LLMs is not quite the same as that of human brains, and they do not "think" in the same way, but they are close enough that it's unhelpful to treat them like simpler processing algorithms - even ones that were seen as cutting-edge a few years ago, like using machine learning to colorize black-and-white movies. That kind of case was straightforward to analyse with traditional IP law.
What LLMs can do with their training data seems far more transformative. Some cases appear more akin to how a human can write a novel, paint a picture, or direct a film that is obviously in the style of someone else; even such clear influence and mimicry does not necessarily give the original source any IP rights over the new work. There are plenty of pitfalls that do, though, beyond directly copying sections of the original work: reuse of a copyrighted character or fantasy setting is an infringement, as is copying the "total concept and feel" (see https://en.wikipedia.org/wiki/Roth_Greeting_Cards_v._United_Card_Co.). Note that LLMs are prone to such violations; for example, they have no problem producing fan-fiction stories or artwork of heavily IP-protected media franchises.
Unfortunately the technology moves faster than the legal and legislative processes. There is strong expert consensus on a few things: not just that this isn't "stealing", as I mentioned before, but also, for example, that the AI model itself is not eligible to hold any IP rights over its output. (You might remember that weird legal case where PETA claimed a monkey should be assigned the copyright on selfie photos it had taken - https://en.wikipedia.org/wiki/Naruto_v._Slater - which, silly as it sounds, is relevant here because it reiterated the human authorship requirement.) Outside those areas it gets murky fast. I think it's impossible to have real certainty even with a team of expensive lawyers at your disposal, which means that incorporating AI-generated code or other content into a project carries risks at the moment.
"I think creativecommons.org has decided to call it 'mass scraping'."
This is not the answer to the question posed by Crivens. Mass scraping is simply the (dodgy and disruptive) process by which AI firms have acquired their training data. But the multi-billion-dollar question isn't how the data is acquired; it is the legal status of the output of a model that has been trained on that data. That depends on what you think the model is doing with its training data, which is what Crivens asked. Note that publishers who hold rights over the training data may be able to sue AI firms for damages due to the scraping (some of these challenges have been settled out of court - see the link in the first paragraph of this post) without necessarily being able to enforce IP rights against the end users for the way they use the model output.
What you're saying isn't totally irrelevant: perhaps it will turn out to make a difference to the status of the model output whether the training data was acquired non-consensually or via some form of licensing agreement. But even licensing agreements are not a legal cure-all. IP owners won't want model outputs to be free of all restrictions if those outputs fall into one of the pitfalls that also affect human-created work, mentioned above, so they're not going to sign away all rights over the model output (and even if they did, it's not totally clear what legal effect that would have - some IP rights are inalienable). But provided the output avoids those pitfalls, end users won't want it burdened with any additional licensing conditions from the owners of the training data. Clarifying this relationship between IP owners, AI firms, and end users is one of the reasons I think a new legal framework will be needed.