Community Topics in Chatbot

I hope I don't break any/too many rules here.

I've been playing with an augmented LLM for a bit. The idea is to make this community's knowledge more accessible. What I've done:

  1. preprocessed about 3 years of topics into structured MD files
  2. generated a vector database from them
  3. integrated it with an Anthropic LLM
  4. connected it to a Chainlit chatbot.
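
For anyone curious what steps 1, 2 and the retrieval part look like in principle, here is a minimal sketch. It is not the actual code behind the chatbot: the `Topic` structure, the `VectorIndex` class and the toy word-count `embed` function are all assumptions made for illustration, and a real setup would use a proper embedding model and vector store. The LLM call itself is left out; `build_prompt` only shows what would be sent to it.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Topic:
    """One preprocessed forum topic (hypothetical structure)."""
    title: str
    url: str
    body_md: str


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorIndex:
    """In-memory stand-in for the vector database."""

    def __init__(self):
        self._rows = []  # list of (vector, Topic) pairs

    def add(self, topic: Topic) -> None:
        self._rows.append((embed(topic.title + " " + topic.body_md), topic))

    def search(self, query: str, k: int = 2) -> list:
        """Return the k topics most similar to the query."""
        q = embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [topic for _, topic in ranked[:k]]


def build_prompt(query: str, hits: list) -> str:
    """Assemble the retrieved topics into a prompt for the LLM."""
    context = "\n\n".join(f"## {t.title}\n{t.body_md}" for t in hits)
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"
```

The point of the design is that the model only sees the handful of topics returned by `search`, so the topics behind each answer are known and can be listed.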

It's on a relatively small EC2 instance, so each query takes a couple of seconds.

I wonder whether it could be useful, so I'd appreciate it if someone could try it: https://axelspire.com/chatbot/chat/

Speaking only for myself here:

Per the Let's Encrypt Community Forum Terms of Service, all user-created content on this forum is licensed under CC BY-NC-SA 3.0. That means that it can be used and remixed by projects like this if and only if those projects provide attribution and are licensed under the same terms.

I don't see a CC BY-NC-SA license on your chatbot, but even more importantly, LLMs are structurally incapable of providing attribution to the posts and users from which they are drawing their generated text.

So while this project is interesting, I believe it is in violation of this community's terms of service and I'd like to ask you to take it down and delete the model.

Hi, that's a good shout! Actually, each answer lists the topics that were used for it. It uses a RAG control flow and seems to be very accurate in that respect. Not sure if you gave it a try, @aarongable.
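
To illustrate what "lists topics used" means, here is a rough sketch of attaching sources to an answer. It assumes the retrieval step hands back the topics it used; the `Source` fields (including `author`) and the `with_sources` helper are hypothetical names for this example, not the chatbot's real code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """A retrieved topic that contributed to an answer (hypothetical fields)."""
    title: str
    author: str
    url: str


def with_sources(answer: str, sources: list[Source]) -> str:
    """Append a de-duplicated list of the topics used to the answer text."""
    seen = set()
    lines = []
    for s in sources:
        if s.url in seen:  # the same topic may be retrieved more than once
            continue
        seen.add(s.url)
        lines.append(f'- "{s.title}" by {s.author}: {s.url}')
    if not lines:
        return answer
    return answer.rstrip() + "\n\nTopics used for this answer:\n" + "\n".join(lines)
```

This only attributes at the topic level. It says which threads were retrieved, not which sentences of the generated text came from which post.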

I will review and either improve, if possible, or bin it.

They all get it a bit wrong in general. In particular, when dealing with ACME clients, you need to ingest their docs pages as well, because community threads are rarely specific or accurate enough.

You should add a footer with the generic license attribution, and because the license is non-commercial, you could not safely use it for commercial gain.
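
Something along these lines would do for the footer. This is only a sketch: the wording of the notice, the helper names and the choice of the generic CC BY-NC-SA 3.0 deed URL are my assumptions, so check them against the forum's actual terms.

```python
# Assumed license details; verify against the forum's Terms of Service.
LICENSE_NAME = "CC BY-NC-SA 3.0"
LICENSE_URL = "https://creativecommons.org/licenses/by-nc-sa/3.0/"


def license_footer(source_name: str) -> str:
    """Build a one-line attribution and license notice (hypothetical wording)."""
    return (
        f"Content derived from {source_name} is licensed under "
        f"{LICENSE_NAME} ({LICENSE_URL}). Non-commercial use only."
    )


def render_page(body: str, source_name: str) -> str:
    """Append the license footer below the page or answer body."""
    return f"{body.rstrip()}\n\n---\n{license_footer(source_name)}"
```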

Personally, I don't care about AI using the info, as general models like ChatGPT have already scraped it many thousands of times. I use AI every day, all the time, and some others probably do too. Reading the docs is now a luxury activity for the time-rich; I accept that.

However, models that try to be the documentation need to try extra hard to get it right and should favor authoritative answers.

This is a fatal flaw for this licensing model.

I am not a lawyer, but I can read court decisions, and it was adjudicated in Bartz v. Anthropic that retaining data for training an LLM is a violation of copyright. (Ref: Susman Godfrey Secures $1.5 Billion Settlement in Landmark AI Piracy Case | Susman Godfrey L.L.P.) (Good background from Lawful Masses: https://www.youtube.com/watch?v=XWY8QmLD5H4. Leonard French, Esq., is a practicing copyright attorney.)

The CC license requires that you give attribution, do not use the work commercially, and license the derivative material under the same license. Attribution is difficult, but it is a solvable problem if the portions of each work that are used are attributed to their author. The non-commercial requirement has been held to exclude feeding these materials to a commercial AI that may then ingest them for use in resold AI products, but you could still use something like a local open source instance (such as a local Stable Diffusion instance for images). However, the requirement to license the result under the same terms as the CC license is probably fatal. According to the US Congress, generative AI content is not copyrightable and cannot be licensed: https://www.congress.gov/crs-product/LSB10922

I welcome someone with actual legal experience to contradict me, but that's my reading.

Interesting. Based on a chat with a lawyer (off the top of his head), the safe way in this particular case is to 'be inspired', 'don't keep texts' and 'don't attribute'. Basically, strip out the core technical facts, find public sources, and build your own knowledge base (with AI). What seems strange to me is that under CC BY-NC-SA you can't protect technical facts, only the way they are described.

Feels like something's not quite right.

Here in Canada, under the Copyright Act, R.S.C., 1985, c. C-42, all works, including technical works such as architectural plans or, in this case, software documentation and the discussions thereof, are copyrighted. The operative terms are:

5 (1) Subject to this Act, copyright shall subsist in Canada, for the term hereinafter mentioned, in every original literary, dramatic, musical and artistic work if any one of the following conditions is met:

(a) in the case of any work, whether published or unpublished, including a cinematographic work, the author was, at the date of the making of the work, a citizen or subject of, or a person ordinarily resident in, a treaty country;

(b) in the case of a cinematographic work, whether published or unpublished, the maker, at the date of the making of the cinematographic work,

(i) if a corporation, had its headquarters in a treaty country, or

(ii) if a natural person, was a citizen or subject of, or a person ordinarily resident in, a treaty country; or

(c) in the case of a published work, including a cinematographic work,

(i) in relation to subparagraph 2.2(1)(a)(i), the first publication in such a quantity as to satisfy the reasonable demands of the public, having regard to the nature of the work, occurred in a treaty country, or

(ii) in relation to subparagraph 2.2(1)(a)(ii) or (iii), the first publication occurred in a treaty country.

The US Congress's position on this was linked above and reflects their own legislation.

An update on the chatbot RAG model. Based on the feedback here (and elsewhere), I have rebuilt it from scratch. It doesn't use any expressions of ideas or quotations from this website, i.e., anything subject to CC BY-NC-SA. The model generation used the content from here only to build a set of technical facts, and I added other sources as well. Once that set was cleaned up, I generated full-text topics and guides from authoritative sources, using the facts purely as "headings". These new AI/synthetic datasets are used in the chatbot model.
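
Roughly, the "facts as headings" idea can be sketched like this. It is a simplified illustration and not the production pipeline: the `Fact` and `Passage` structures and the plain keyword matching are assumptions made for the example, and the generation step that writes the final guide text is left out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A short technical fact restated in new words; no forum text is stored."""
    heading: str
    keywords: frozenset


@dataclass(frozen=True)
class Passage:
    """A passage taken from an authoritative source (hypothetical structure)."""
    source_url: str
    text: str


def words(text: str) -> set:
    """Lowercase tokens, keeping hyphenated terms such as 'dns-01' together."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))


def build_guides(facts: list, passages: list) -> dict:
    """Group authoritative passages under each fact heading by keyword match."""
    guides = {}
    for fact in facts:
        guides[fact.heading] = [p for p in passages if fact.keywords <= words(p.text)]
    return guides
```

Only the headings come from the fact set; everything placed under a heading comes from the other sources.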

@aarongable

For your review.