On a mission to measure the productivity of AI-assisted developers, researchers at GitHub recently conducted an experiment comparing coding speeds for a group using its Copilot code-completion tool versus a group coding unaided.
GitHub Copilot is an AI pair-programming service that publicly launched earlier this year for $10 per user per month or $100 per user per year. Since its launch, researchers have been interested to see if these AI tools really translate into increased developer productivity. The catch is that it is not easy to define the right metrics to measure performance changes.
Copilot is used as an extension for code editors, such as Microsoft VS Code. It creates code suggestions in multiple programming languages that users can accept, reject, or edit. Suggestions are provided by OpenAI’s Codex, a system that translates natural language into code and is based on OpenAI’s GPT-3 language model.
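To give a sense of the workflow, here is an illustrative (not actual) example of the kind of single-function completion a tool like Copilot might suggest after a developer types only the signature and docstring; the function name and body below are hypothetical.

```python
# A developer types the signature and docstring; a completion tool
# might then suggest a body like this, which the developer can
# accept, reject, or edit.
def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards,
    ignoring case and non-alphanumeric characters."""
    normalized = "".join(ch.lower() for ch in s if ch.isalnum())
    return normalized == normalized[::-1]

print(is_palindrome("Racecar"))       # True
print(is_palindrome("Hello, world"))  # False
```

The key point is that the suggestion arrives inline in the editor, rather than requiring the developer to search for a snippet elsewhere.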
Google Research and the Google Brain team concluded in July, after studying the impact of AI code suggestions on the productivity of more than 10,000 of their developers, that the relative speed of performance remains an “open question.” That is despite their finding that large language models behind code-completion tools, such as Codex/Copilot, “can be used to dramatically improve developer productivity with better code completion.”
But how do you measure productivity? Other researchers earlier this year, using a small sample of 24 developers, found that Copilot did not necessarily improve task completion time or success rate. However, they found that Copilot saved developers the effort of searching online for code snippets to solve certain problems. This is an important indicator of how much an AI tool like Copilot can reduce context switching, in which developers jump out of the editor to solve a problem.
GitHub’s own survey included more than 2,600 developers, asking questions like, “Do people feel GitHub Copilot makes them more productive?” Its researchers also benefited from unique access to extensive telemetry data, and published their research in June. Among other things, they found that 60% to 75% of users feel more satisfied with their work when using Copilot, feel less frustration when programming, and are able to focus on more satisfying work.
“In our research, we’ve seen that GitHub Copilot supports faster completion times, conserves developers’ mental energy, helps them focus on more satisfying work, and ultimately helps them find more fun in the coding they do,” GitHub said.
GitHub researcher Dr. Eirini Kalliamvakou explained the approach: “We conducted multiple rounds of research including qualitative (perceptual) and quantitative (observational) data to piece together the full picture. We wanted to verify: (a) do actual user experiences confirm what we are inferring from telemetry? and (b) do our qualitative observations generalize to our large user base?”
Kalliamvakou, who was involved in the original study, has now built on it with an experiment involving 95 developers that focused on coding speed with and without Copilot.
This research found that the group using Copilot (45 developers) completed the task in an average of 1 hour and 11 minutes, while the group that did not use Copilot (50 developers) took an average of 2 hours and 41 minutes. The group with Copilot was therefore 55% faster than the group without it.
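A quick arithmetic sketch shows where the 55% figure comes from, assuming “55% faster” refers to the fraction of wall-clock time saved (the reading consistent with the reported times):

```python
# Illustrative arithmetic using the times reported in the article.
copilot_minutes = 1 * 60 + 11    # 71 minutes with Copilot
baseline_minutes = 2 * 60 + 41   # 161 minutes without Copilot

# Fraction of time saved by the Copilot group relative to the baseline.
time_saved = 1 - copilot_minutes / baseline_minutes
print(f"{time_saved:.1%}")  # 55.9%
```

Rounded down, that 55.9% reduction in time matches the reported “55% faster” claim.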
Kalliamvakou also found that a higher percentage of the group with Copilot completed the task: 78% of the Copilot group versus 70% of the group without it.
The experiment did not consider factors that contribute to productivity such as context switching. However, a previous GitHub study found that 73% of developers reported that Copilot helped them stay in flow.
In an email, Kalliamvakou explained to ZDNET what this number means in terms of context switching and developer productivity.
“Reporting ‘staying in the flow’ definitely means less context switching, and we have additional evidence,” she wrote, noting that 77% of those surveyed stated that, when using GitHub Copilot, they spend less time searching.
“This measures a context switch familiar to developers, such as searching for documentation, or visiting Q&A sites like Stack Overflow to find answers or ask questions. With GitHub Copilot bringing information into the editor, developers don’t need to exit the IDE as often,” she explained.
But context switching alone cannot show the full picture of how AI code suggestions improve productivity. There is also “good” and “bad” context switching, which makes its effect difficult to gauge.
Kalliamvakou explained that during a typical task, developers switch frequently between different activities, tools, and sources of information.
She pointed to a study published in 2014, which found that developers spend an average of 1.6 minutes on an activity before switching, switching an average of 47 times per hour.
“It’s just the nature of their work and the multiplicity of tools they use, so that context switching is ‘good.’ In contrast, there’s ‘bad’ context switching due to delays or interruptions,” she said.
“We found in our previous research that this hurts productivity a lot, as well as the developers’ sense of progress. Context switching is hard to measure, because we don’t have a good way to automatically distinguish between ‘good’ and ‘bad’ cases – or when switching is part of completing a task versus disrupting developer flow and productivity. However, there are ways to measure context switching through the self-reports and observations we use in our research.”
As for Copilot’s performance with other programming languages, Kalliamvakou says she is interested in running experiments in the future.
“It was definitely a fun experiment to run. These controlled experiments are time-consuming, especially as we try to make them bigger or more inclusive, but I’d like to explore testing other languages in the future,” she said.
Kalliamvakou has posted other key findings from an extensive GitHub survey in a blog post, where she details her quest to find the most appropriate metrics for measuring developer productivity.