GitHub Copilot plagiarism confirmed; GitHub: our AI does not "recite" code
- Chen Roc
- Jul 4, 2021
- 5 min read
Copilot, the automatic code generation AI jointly produced by Microsoft, OpenAI, and GitHub, seemed to fall from its pedestal the day after launch.
Riding on the powerful selling point of automatically generating code, GitHub Copilot became the focus of discussion right after its release.
Copilot is built on OpenAI's new Codex algorithm, which was trained on terabytes of public code extracted from GitHub together with English-language text.
GitHub therefore claims that Copilot can analyze the strings, comments, function names, and the code itself in a file to generate new, matching code, including calls to specific functions it has seen before.
Copilot also supports multiple programming languages: Python, JavaScript, TypeScript, Ruby, and Go.
After the release, someone ran Copilot against LeetCode's problem bank and was very satisfied with the performance of this "AI programmer".

After trying several problems, Copilot passed LeetCode's tests every time. Given the near-real-time generation speed, the blogger remarked that AI may already write code better than we do.
However, netizens suspected that Copilot had been trained on the LeetCode database, because the generated comments were almost identical to the templates LeetCode provides.
In response, GitHub said that although there may be about 0.1% direct quotation, the vast majority of the code Copilot generates is original.
"Copy-paste" is confirmed
On the second day after the release, some netizens charged that GitHub Copilot was a money-making tool built by laundering free and open-source code.
Much of that code is protected by the GPL (General Public License), which is meant to keep it out of closed-source commercial projects.

Unsurprisingly, this suspicion was confirmed within two days. Some netizens discovered that Copilot had directly "copy-pasted" the famous fast inverse square root algorithm from Quake III Arena.

The code Copilot "generated" not only uses the magic number 0x5f3759df, which no one has fully explained to this day, but even reproduces the original comment complaining about the code: "what the f***?".
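For reference, the routine in question approximates 1/sqrt(x) by reinterpreting a float's bits as an integer. Below is a minimal Python sketch of the classic (originally C) algorithm; the function name and structure are ours, not Copilot's output:

```python
import struct

def fast_inverse_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) with the classic bit-level trick."""
    # Reinterpret the 32-bit float's bits as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # The "magic" step: 0x5f3759df is the famous unexplained constant.
    i = 0x5F3759DF - (i >> 1)
    # Reinterpret the integer bits back as a float to get a first guess.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson iteration refines the estimate.
    y = y * (1.5 - 0.5 * x * y * y)
    return y
```

After the single Newton step, the result is accurate to a fraction of a percent, which is why the original routine was fast enough for real-time 3D lighting.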

Put this way, all Copilot does is reassemble code written by others in its training set.
Our AI does not "recite" code
However, GitHub appears to have prepared for this long in advance. A team member named Albert Ziegler said that, as of May 7, 2021, he had collected 453,780 of Copilot's Python suggestions, generated for roughly 300 employees in their daily work.
Albert analyzed and categorized this data set and wrote a fairly thorough blog post discussing the results.
At the beginning of the post, Albert asked GitHub Copilot to recite a well-known text; clearly, Copilot had memorized its content firmly.
However, Albert argues that remembering training-set content is not in itself a problem. After all, he has memorized poems himself, and that does not mean his everyday speech is merely recitation.

Category 1: Copilot sometimes makes a very similar suggestion right after one has been accepted, triggered by a new comment the programmer writes.
Albert considered these second occurrences mere repeats of the earlier "successful" case, so they were removed from the analysis.
Category 2: Copilot may propose long, repetitive sequences. In one example, a long run of repeated "<p>" tags was eventually found in the training set.

Category 3: Copilot suggests entries from standard lists such as the natural numbers, the prime numbers, or the Greek alphabet. Such suggestions may or may not be helpful.
However, Albert states that these do not meet his definition of "reciting" code.
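To illustrate the kind of standard sequence meant here, consider a hypothetical completion of ours (not taken from Ziegler's data set): every programmer and every model produces the same prime list, so overlap with the training set proves nothing about copying:

```python
def first_primes(n: int) -> list[int]:
    """Return the first n prime numbers by simple trial division."""
    primes = []
    candidate = 2
    while len(primes) < n:
        # A candidate is prime if no smaller prime divides it.
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

print(first_primes(10))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```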

Category 4: For tasks with very few degrees of freedom, Copilot offers common, near-boilerplate solutions.
For example, the middle part below can be regarded as the standard way to parse Wikipedia lists using the BeautifulSoup package.
Albert said that the best-matching fragments found in the training data use the same code to parse different articles. Again, this does not meet his definition of "reciting" code.
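Category 4 boilerplate looks roughly like the following. This is a minimal sketch, assuming BeautifulSoup (the bs4 package) is installed, with an inline HTML string standing in for a downloaded Wikipedia page:

```python
from bs4 import BeautifulSoup

# Stand-in for a fetched Wikipedia page; in the cases described,
# this HTML would come from an HTTP request.
html = """
<ul>
  <li><a href="/wiki/Python">Python</a></li>
  <li><a href="/wiki/Go">Go</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# The near-universal pattern: select the list items, extract the text.
links = [a.get_text() for a in soup.select("ul li a")]
print(links)  # ['Python', 'Go']
```

Because this pattern has so few degrees of freedom, thousands of unrelated projects contain nearly identical lines.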

Category 5: These last cases match Albert's notion of "reciting" code: they contain at least some specific overlap with training-set code or comments.
Test Results

For most of GitHub Copilot's suggestions, Albert found no obvious overlap with the training code. After removing Category 1, 185 overlapping suggestions remained.
Of these, 144 fell into Categories 2-4, leaving 41 cases in Category 5. These, the author says, are what he considers "recited" code.
GitHub Copilot quotes in the absence of specific context
Of the 41 cases selected during manual labeling, none appeared in fewer than 10 different files, and most (35 cases) occurred more than a hundred times.
In one case, GitHub Copilot made a suggestion starting from an empty file, something it had seen more than 700,000 times during training: the GNU General Public License.
The chart below shows the number of matching files for the Category 5 results (marked in red at the bottom) and for the cases in Categories 2-4.

The fitted distribution is shown as a red line; it peaks at between 100 and 1,000 matches.
GitHub Copilot mainly quotes in generic contexts
Over time, every file becomes unique. But while a file is still very generic, GitHub Copilot supplies generic solutions.
At that point, with nothing specific to go on, the suggestion is far more likely to be quoted from somewhere else.

Of course, software developers spend most of their time in the middle of complex code, where the context is unique enough that GitHub Copilot produces unique suggestions.
By contrast, the very first suggestions are necessarily generic, because GitHub Copilot has no way of knowing what the program will become.
In a stand-alone script, however, a moderate amount of context is enough to make a reasonable guess about what the user wants to do.
And sometimes the context is still so generic that Copilot decides a solution it has memorized looks promising.

The example above is taken directly from uploaded robotics courseware.
Conclusion
Albert concludes that although GitHub Copilot can quote a block of code verbatim, it does so rarely; when it does, it is mostly code that everyone quotes, and mostly at the beginning of a file.
Albert said that, ideally, when a suggestion contains snippets copied from the training set, the user interface should simply tell you where they were quoted from. You could then either add the appropriate attribution or decide not to use the code at all. His team will work toward building this.
Reviews
Although netizens were pleased to see the GitHub team taking the "copy-and-paste" issue seriously, this "survey" is clearly far from convincing.
"This puts every hobbyist at risk, and at the same time raises the worry 'could this thing generate GPL code?' for anyone working in an enterprise."
"You can't just argue 'well, they are slightly different' and conclude 'so they are not really the same thing.' If it is substantially similar, it needs attribution."

For Copilot, there may still be a long way to go.