[Hidden-tech] Question about ChatGPT and machine learning

R. David Murray rdmurray at bitdance.com
Fri Mar 10 15:03:54 UTC 2023


From what I understand (admittedly from only a *basic* understanding of
machine learning), it is not so much that ChatGPT is "making errors",
but rather that it is "making stuff up", and does not admit that it is
making stuff up.

I'm going to brain dump what I think here, but I'm not an expert in this
by any stretch, so don't take me as an authority.  Perhaps this can help
you reason about ChatGPT until you find a better expert to consult ;)

One thing to understand is that this is a *trained* model.  That means
that it was given a set of questions and answers and told "these are
good, these are bad", probably with a rating of *how* good or bad.  Then
it was given a lot of other data (and how exactly this gets turned into
questions and answers is *way* beyond my knowledge level).  Then a team
of model trainers started asking questions.  The trainers would look at
the answers it came up with and rate them, thus adding to the "trained"
data set.  When you tell ChatGPT that its answer was good or bad, you
are also potentially adding to that training data, by the way.
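To make that feedback loop concrete, here is a toy Python sketch of what I imagine "adding to the training data" looks like. The names and structure are entirely my invention, not anything OpenAI has published:

```python
# Hypothetical sketch of collecting human ratings as training data.
# Everything here (names, rating scale) is my own invention.
training_examples = []

def record_feedback(prompt, response, rating):
    """Store a rated (prompt, response) pair for a later training run."""
    training_examples.append({
        "prompt": prompt,
        "response": response,
        "rating": rating,  # e.g. -1 (bad) .. +1 (good)
    })

# A trainer rating answers -- or you clicking thumbs-up/thumbs-down --
# adds to the same pool:
record_feedback("What causes tides?", "The Moon's gravity...", +1)
record_feedback("Cite a source on X.", "Smith (2019), 'On X'", -1)
```

The point is just that the "good/bad" signal is data like any other, which is why your thumbs-up may end up shaping a later model.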

I'm guessing that the way the system works there is actually no way for
it to "know" that it has made something up.  The output that it produces
is generated based on what you can think of as a very advanced version
of statistical language modelling:  given a certain input, what are the
most likely kinds of things that would follow as a response?  And like
any statistical model, when you get enough standard deviations out,
things get weird.  At no point in the model output are things tagged as
"made up" or "not made up":  it is *ALL* made up.
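The "what is likely to follow?" idea is easiest to see in the simplest possible version of a statistical language model, a bigram model. This is a drastically simplified toy, not how ChatGPT actually works, but the principle is the same:

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the training data.
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat ate the fish .").split()

# Count which word follows which: a bigram model.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def most_likely_next(word):
    """Return the statistically most likely continuation."""
    return following[word].most_common(1)[0][0]

print(most_likely_next("the"))  # "cat" -- it follows "the" most often
```

Notice that the model never checks whether "the cat" is *true*; it only knows that "cat" is a common continuation. Everything it emits is "made up" in exactly that sense.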

In the middle of the bell curve the made up things are *much* more
likely to be "correct" than out at the edges of the bell curve.  But
oh those edges...

It is of course more sophisticated than a statistical model, but the
same principle applies:  if there are few examples of *exactly* the kind
of data your input contains, then it is going to draw from stuff that is
a lot less closely related to your input for its response.  But, and
here is the important part, it is going to make up *something* to answer
with.  If a source is mentioned multiple times in the context of your
input, it will use it.  If there are no sources mentioned in the context
of your input, it will generate an output that looks like the *kind of
thing* that would be a response to that *kind of input*.  In this case
that included a list of articles.  It generated at least one of them
from an author whose name was probably mentioned in the context of your
input, but never with an actual article name attached.  Or maybe that
author was mentioned in the context of conversations containing a
subset of the *words* in your input (rather than logically formed
sentences), depending on just how fuzzy the match was.  Then it
effectively made up a plausible sounding article name to go with the
author name, because that's what responses to other similar questions in
its training data looked like (not similar in content, but similar in
*form*).
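A citation produced this way might come out of something like the following toy sketch, where the model has learned the *shape* of a citation and fills the shape with plausible pieces. All the names below are invented by me for illustration:

```python
import random

# Hypothetical illustration: a model that has learned the *form* of a
# citation can fill it with plausible content even when no such article
# exists. Every author, title word, and pairing here is made up.
authors = ["J. Smith", "A. Chen", "M. Garcia"]
title_words = ["Rethinking", "Introversion", "Creativity", "Workplace"]
venues = ["Psychology Today", "Harvard Business Review"]

def plausible_citation():
    """Assemble something citation-shaped from likely-looking parts."""
    title = " ".join(random.sample(title_words, 3))
    return f'{random.choice(authors)}, "{title}", {random.choice(venues)}'

print(plausible_citation())  # looks like a real citation; it isn't one
```

A real author name and a real publication name can both be high-probability fillers, which is exactly why the fabricated result looks so convincing.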

So while I agree that making up all the sources seems like an extreme
example of this, ChatGPT is what Science Fiction calls an "Artificial
Stupid" (something that can't actually *reason*), and thus I think my
explanation is plausible.  It just depends on how fuzzy the match was
that it made on the input.  If the match was very fuzzy, then it would
have come back with material from its data that generally followed at
least some of your input, and then since responses the trainers
considered "good" to questions like that usually included some sources,
it made some up based on how the answers to other, less related,
questions looked.

Anyone want to bet that four sources was the average number that was
accepted as "a good answer" by the people who did the training?  I know
I've seen "four things" in a couple of ChatGPT answers, and I haven't
asked it very many questions :)

Given all this, there are only two things you can do, one of which is
exactly what you did: ask it for the sources.  Given *that* input, it
should be able to come up with the most likely response being the actual
source.  If it can't, then it has probably made up the source (note: I
have not tested this technique myself, but it follows logically from how
I think the system works).

The second thing you can do (which you probably also already did) is to
rephrase your input, giving it different amounts and kinds of context,
and see how the output changes.  If your altered input results in a less
fuzzy match, you will get better answers.

The big takeaway, which you clearly already know, is to never trust
anything ChatGPT produces.  Use it as a rough draft, but verify all the
facts.

My fear is that there are going to be a lot of people who aren't as
diligent, and we'll end up with a lot of made up information out on the
web adding to all of the maliciously bad information that is already out
there.  I have read that the ChatGPT researchers are worried about how
to avoid using ChatGPT's output as input to a later ChatGPT model, and I
have no idea how they are going to achieve that!

And keep in mind that that maliciously bad information *is part of
ChatGPT's data set*.  The people who did the training will have caught
some of it, but I'm willing to bet they missed a lot, either because
*they* didn't know it was bad or because it never came up during training.

--David

On Fri, 10 Mar 2023 03:14:21 +0000, Marcia Yudkin via Hidden-discuss <hidden-discuss at lists.hidden-tech.net> wrote:
> Yes, I know that people have been pointing out "ridiculous factual errors" from ChatGPT.   However, to make up sources that sound completely plausible but are fake seems like it belongs in a whole other category.
> 
> On Thursday, March 9, 2023 at 04:10:43 PM HST, Alan Frank <alan at 8wheels.org> wrote: 
> 
> ChatGPT is a conversation engine, not a search engine.  It is designed 
> to provide plausible responses based on similarity of questions and 
> answers to existing material on the internet, without attempting to 
> correlate its responses with actual facts.  Pretty much every social 
> media space I follow has had multiple posts from people pointing out 
> ridiculous factual errors from ChatGPT.
> 
> --Alan
> 
> 
> -------- Original Message --------
> Subject: [Hidden-tech] Question about ChatGPT and machine learning
> Date: 2023-03-09 15:29
> From: Marcia Yudkin via Hidden-discuss 
> <hidden-discuss at lists.hidden-tech.net>
> To: "Hidden-discuss at lists.hidden-tech.net" 
> <Hidden-discuss at lists.hidden-tech.net>
> 
> This question is for anyone who understands how the machine learning in 
> ChatGPT works.
> 
> I've been finding ChatGPT useful for summarizing information that is 
> widely dispersed around the web, such as questions like "what are the 
> most popular objections to X?"  However, the other day for a blog post I 
> was writing I asked it "What are some sources on the relationship of X 
> to Y?"  It gave me four sources of information, including the article 
> title, where it was published and who wrote it.  
> 
> This looked great, especially since I recognized two of the author names 
> as authorities on X.  However, when I then did a Google search, I could 
> not track down any of the four articles, either by title, author or 
> place of publication.  I tried both in Google and in Bing.  Zilch!
> 
> Could ChatGPT have totally made up these sources?  If so, how does that 
> work?
> 
> I am baffled about the explanation of this.  One of the publications 
> involved was Psychology Today, so we are not talking about obscure 
> corners of the Internet or sites that would have disappeared recently.
> 
> Thanks for any insights.
> 
> Marcia Yudkin
> Introvert UpThink
> Introvert UpThink | Marcia Yudkin | Substack
> 
> _______________________________________________
> Hidden-discuss mailing list - home page: http://www.hidden-tech.net
> Hidden-discuss at lists.hidden-tech.net
> 
> You are receiving this because you are on the Hidden-Tech Discussion 
> list.
> If you would like to change your list preferences, Go to the Members
> page on the Hidden Tech Web site.
> http://www.hidden-tech.net/members

