Redefining plagiarism in the age of AI

Adapting the ethics of scientific publication to address large language models

10 December 2022

ethics llm ai publishing science

The Association for Computing Machinery (ACM) has a good, functional definition of plagiarism:

ACM defines plagiarism as the misrepresentation of another's writings or other creative work (including unpublished and published documents, data, research proposals, computer code, or other forms of creative expression, including electronic versions) as one's own. Plagiarism is a clear violation of ACM Publications Policy and a potential violation of the ACM Code of Ethics. Plagiarism may also represent copyright infringement. Plagiarism manifests itself in a variety of forms, including:

• verbatim copying, near-verbatim copying, or intentionally paraphrasing portions of another's work;
• copying elements of another's work, such as equations, tables, charts, illustrations, presentation, or photographs that are not common knowledge, or copying or intentionally paraphrasing sentences without proper or complete source citation;
• verbatim copying of portions of another's work with incorrect source citation

Note that whether a prior Work has been formally published is not a factor in determining plagiarism; a Work not formally published may be plagiarized. This includes content provided online in preprints, tutorials, manuals, and essays, as well as offline content in any form. The representation of any other person's material as one's own Work is plagiarism.

https://www.acm.org/publications/policies/plagiarism-overview

This policy was, of course, written before an AI could write something approximating human text. Consequently, the policy refers to copying the work of other humans (note the phrases "another's" and "any other person's" in the text above). Now that machines produce human-quality text, we need to rethink this definition.

Why do we prohibit plagiarism in scientific publications? One might posit that it is because it does harm to the person who is plagiarized. That is, the creator of the material fails to get credit for their work or ideas. If this were the primary reason to ban plagiarism, one could argue the machine is not harmed unless the machine has copyrighted its work.

While harm to the creator is important, there are other harms that are critically important and which do not involve the creator. These are harms to science itself.

Why do we not let people copy the work of others without attribution? It is because science depends on attribution in multiple ways.

1. As scientists, we need to understand the thread of ideas over time to place results in context. This is why we carefully write sections describing previous work.

2. We need to know the foundations upon which scientific ideas rest. If a paper assumes something is true based on the literature, we need to be able to verify the source. This is why we cite prior work.

3. We use attribution for evaluation and promotion of scientists. Careers in science depend on this. We reward people who are creative and productive. As long as we are giving tenure to humans, we have to evaluate the abilities of those humans. We need to know their unique contributions relative to prior work.

Plagiarism is a problem because it breaks attribution in each of these cases. Note that the ACM wisely says that copied work does not need to be published to be plagiarized. It also wisely says that paraphrasing without citation is plagiarism. So, I argue for a simple change to the definition of plagiarism to extend it to cover the writings or creative works of another human or machine. It’s that simple. All our current policies and processes remain the same.

AIs are here to stay and will only get better. So how do you work with an AI in science ethically? Again, I think it’s quite simple. Think of it as a human and cite it accordingly. Here are several common cases that apply to human contributions today that are easily extended to machine contributions:

1. Acknowledgment. If someone helps you with your paper, you acknowledge this help. This often includes “helpful discussions”, “proofreading”, “help with figures”, “code”, “data”, etc. If you have a chat with ChatGPT that helps you with your paper, this is a “helpful discussion”. If it produces code for you, then this is no different from a human doing so.

2. Personal communication. This is a bit out of fashion but, in older papers, one often sees citations to “personal communication”. If you had a chat with Einstein and he gave you some insight, you’d credit the idea with “Einstein, personal communication”. You can do the same with an AI. If you got an idea from ChatGPT, then I would cite this as “ChatGPT, <date>.” The date is important because AIs will change over time and knowing the version of the AI establishes which one created the idea.

3. Web content. Not all references are to scientific articles. You might get data or text from the web. In these cases, it is customary to cite the URL and the date that you accessed the text or data. For text or data created by an AI, you can cite this the same way as personal communication.

4. Verbatim copying. Copying text from another source is not plagiarism if it is quoted and cited. If you like the way your AI phrased something, then put it in quotes and cite it as above.

5. Paraphrasing. If you take text from a source and rewrite it, while keeping the meaning, this is paraphrasing. When doing this you might write “As noted by Einstein…<paraphrasing> <citation>.” Just do the same for an AI.

6. Proofreading and copy editing. People often compare Grammarly with ChatGPT. This is a false equivalence because the former focuses on syntax and the latter on semantics. Getting help with grammar/spelling corrections is accepted practice. You can acknowledge the person/machine, but this is not typically required.

7. Ghost writing. A ghost writer is someone who writes your text for you. The ideas are yours but the text is written by someone else. Using a ghost writer to generate your scientific papers is widely considered unethical. There are only two ways to ethically have a ghost writer: (1) make them an author, or (2) declare that the text was generated by a ghost writer and name them. It doesn’t matter whether the ghost writer is a human or machine. There is currently no accepted way to make an AI an author and the responsibilities of authorship are hard to extend to machines. Consequently, I see no clean way to have a machine write your paper for you.

8. Translation. Can you write in a language you know well and then have a machine translate the text? Translation is already well understood in publishing. The key principle is that there is always a source text in the original language that people can refer to. The person who translates it into a new language is then credited. This is critical because there is always interpretation in translation, so readers need to know who created it. So how can this be extended to scientific publication? Let’s say you write the paper in Chinese, translate it to English using an AI, and want to submit the translation to a journal or conference. You can’t publish the same thing twice in science, even if it is in different languages. If you want the English translation to be the main publication, then I would put the Chinese original on a public server like arXiv. I would then generate the translation, citing the original source and the method of translation (name of the AI and date). This feels like an ethical solution but is probably not allowed by journals today.

In summary, we already have many of the tools to deal with the ethics of AI-generated content in scientific publishing. We need to quickly adapt existing rules regarding human-generated content to include machine-generated content. The guiding principle in using AI-generated text/images/code should be to treat the content as though it came from a human. Such an approach is unlikely to violate current ethical standards.

