ChatGPT generates fake data set to support scientific hypothesis

Lee Duna · 1 year ago

@NevermindNoMind · 1 year ago

For those who haven’t read the article: this is not about hallucinations, it is about how AI can be used maliciously. Researchers used GPT-4 to create a fake data set from a fake human trial, and the result was convincing. Only advanced techniques showed that the data was faked, such as spotting more patient ages ending in 7 or 8 than would be likely in a real sample. The article points out that most peer review does not go that deep into the data to try to spot fakes. The issue here is that a malicious researcher could use AI to generate fake data supporting whatever theory they want and theoretically get it published in a peer-reviewed journal.

I don’t have the expertise to assess how much of a problem this is. If someone was that determined, couldn’t they already fake data by hand? Does this just make it easier to do, or is AI better at it thereby increasing the risk? I don’t know, but it’s an interesting data point as we as a society think about what AI is capable of and how it could be used maliciously.

AnonStoleMyPants · 1 year ago

I don’t think this is a problem at all. Are they saying fake data is hard to do? I don’t get it. Why would it be hard to fake data? Get real results and shift the values by some number. That’s it. I mean, obviously if you shift too much then you will have problems, but enough to be credible? Easy.

Sure, it is slightly harder to make the data from scratch, but let’s be real here, a TON of the data is just a random CSV file a machine pops out. Why on earth would it be hard to fake?

Now, human trials are a bit different from plain measurement data, but I fail to see why this would be hard, assuming you are an expert in the field.

A much more prominent problem in science is cherry-picking data. It is very common for someone to make 50 new devices, measure them all, and conveniently leave out half of the measurements. Happens alllll the time.

appel · 1 year ago

From what I remember, there are some statistical tests and methods that can quite easily spot fake data (the name has escaped me, sorry), e.g. checking whether it came from an RNG, or whether it is too positive given the sample. But you are right that it is often enough to fool the review board and get something published. Often the data is only scrutinized thoroughly with these methods after it has been published.
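
To make that concrete, here is a minimal Python sketch of one such check: a chi-square test on the last digits of a column of ages, along the lines of the "too many ages ending in 7 or 8" finding mentioned above. This is not the method the article's reviewers used, just an illustration. The function names and the sample ages are made up, and it assumes last digits should be roughly uniform, which real data can violate for innocent reasons (for instance, self-reported ages rounded to 0 or 5).

```python
from collections import Counter

# Critical value of the chi-square distribution with 9 degrees of freedom
# at the 5% significance level (10 possible last digits minus 1).
CHI2_CRIT_DF9_05 = 16.919


def last_digit_chi2(values):
    """Chi-square statistic of the last digits against a uniform distribution."""
    counts = Counter(abs(int(v)) % 10 for v in values)
    expected = len(values) / 10  # each digit expected equally often
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))


def looks_suspicious(values):
    """True if the last digits deviate from uniform at the 5% level."""
    return last_digit_chi2(values) > CHI2_CRIT_DF9_05


# Hypothetical example data, invented for this sketch.
uniform_ages = list(range(20, 80))  # every last digit appears 6 times
skewed_ages = [27, 38, 47, 58, 67, 28, 37, 48, 57, 68] * 6  # only 7s and 8s

print(looks_suspicious(uniform_ages))  # False
print(looks_suspicious(skewed_ages))   # True
```

A flag from a test like this is only a reason to look closer, not proof of fabrication, and it needs a reasonably large sample to mean anything.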

@AbouBenAdhem · 1 year ago

There was an infamous case a few years ago, which was caught because the researcher forgot to delete the fake-data-generating formula from the Excel file.