Code duplication detection – terminology

Code duplication is one of the most popular forms of software reuse among developers. It is fast and barely (if at all) intellectually taxing, and consequently the preferred approach to software development under time pressure. However, code duplication can build up a huge technical debt, which may render a project unmaintainable. A cross-industry survey of code duplication shows that some industries (e.g., the enterprise and financial software segments) run a high risk in this respect and should actively monitor and reduce the amount of duplicate code. Fortunately, both tasks can be supported by specialized tools, such as the Source Code Duplication Detector (SolidSDD), which dramatically decrease the required effort. Such tools, combined with the Pareto distribution of duplicate code observed in practice, make the prospect of reducing the related technical debt highly appealing.
Compared to other tools for software analysis, such as coding standards checkers, code duplication detection tools require very little training and configuration. Once the basic terminology is understood, developers can start analyzing code productively after only a short learning curve.
This article gives an overview of the basic terminology used in code duplication detection, to help developers jump-start analysis and monitoring tasks.
Duplicate code terminology
Clone An ordered set of statements that is repeated in a number of places in the source code.
Clone instance A (minimal) piece of code that includes the ordered set of statements associated with a code clone.
Clone set The set of all instances of a clone.
Cloning relation An ordered pair of clone instances belonging to the same clone set. The first clone instance of the pair is called the reference; the second clone instance is called the cloning partner.
Clone fan-out The number of files containing instances of a given clone.
Local gap A number of neighboring statements in a clone instance that are not part of the ordered statement set of the associated clone. Local gaps are the result of code insertions, deletions, and changes that typically take place after duplicating code via copy-paste operations.
Identifier renaming The process of changing the name of identifiers (e.g., variable/function/type name) after duplicating code via copy-paste operations. Code resulting from copy-paste operations is rarely of immediate use during development. Such code has to be adapted first to meet the specification that it implements. Together with code insertion/deletion, identifier renaming is a common step performed in combination with code cloning.
Cumulated gap The sum of all local gaps from the beginning of a clone instance up to a given statement, corrected with the gap decay. This metric is clone instance and statement specific.
Gap decay A decrease applied to the cumulated gap for each statement of a clone instance that is part of the ordered statement set of the associated clone. Like the cumulated gap, this metric is clone instance and statement specific.
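The terminology above can be made concrete with a small data model. The sketch below uses hypothetical names (it is not SolidSDD's API) and one possible reading of the gap definitions: local gaps are recorded per statement position, and the cumulated gap is reduced by a fixed decay at every statement that belongs to the clone's ordered set.

```python
from dataclasses import dataclass, field

@dataclass
class CloneInstance:
    file: str
    statements: list[str]                               # statements in this instance
    gaps: dict[int, int] = field(default_factory=dict)  # position -> local gap size

@dataclass
class Clone:
    statements: list[str]                               # the repeated, ordered statement set
    instances: list[CloneInstance] = field(default_factory=list)  # the clone set

    def fan_out(self) -> int:
        # Clone fan-out: number of distinct files containing instances of this clone.
        return len({inst.file for inst in self.instances})

def cumulated_gap(instance: CloneInstance, up_to: int, decay: int = 1) -> int:
    # Sum the local gaps up to a given statement, applying the gap decay
    # at each statement that is part of the clone's ordered set.
    total = 0
    for i in range(up_to + 1):
        total += instance.gaps.get(i, 0)
        if i not in instance.gaps:          # statement belongs to the clone's set
            total = max(0, total - decay)   # gap decay
    return total
```

For example, an instance with a local gap of three statements at position 2 has, with a decay of 1, a cumulated gap of 1 by position 4: the gap of 3 is eroded by the two matching statements that follow it.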

Does source code duplication matter?

A cross industry investigation of Open Source Software

Bigger is not always better, and that is particularly true when it comes to software. More software almost always means more trouble, especially during the maintenance phase. The main factor influencing the required maintenance effort, and subsequently the predicted quality, of a software system is its size. Ask any software manager whether she would like to reduce the size of her software and you will get immediate attention. This subject, however, triggers several questions: is it possible to reduce the code size without losing functionality? What is the cost of such an undertaking? Does it pay off? And, if all the previous questions are answered favorably, how should one do it?

One obvious approach to reducing the size of software is removing the code that is not used (also known as “dead code”) and the so-called “gold-plating code”: code that is used but is not actually necessary for implementing the software requirements. For example, code that estimates the value of Pi with a precision of 100 decimals might not be required in a simple drawing application. The main trouble with removing dead and gold-plating code is that such code is difficult to identify correctly. One always runs the risk of removing code that might be required under special circumstances, which can have serious consequences. A much safer alternative for reducing the size of software is removing duplicated code.

What is code duplication?

Duplicated code is code that has been produced by copying and then adapting existing code. Also known as “copy-and-paste development”, this strategy of producing code is frequently employed as a way of reusing software. There are many levels at which software can be reused: small code fragments at the level of a function or set of related functions; libraries; or entire subsystems. In the following, the focus lies on the first level mentioned above.

Under the pressure of time, many developers try to make use of already implemented and tested software by cloning it and then adapting it to meet the needed functionality. When one cannot reuse an entire library (because of various constraints, such as the time required to understand or integrate such a library, or product size, platform, or copyright constraints), and when the elements to reuse are individually relatively small, copy-and-paste reuse is the favorite technique. It is simple, fast, and profitable – or at least so it seems at first glance.

Given the relatively small size of the reused fragments, it is not surprising that the adaptations performed during copy-and-paste reuse are usually limited to renaming variables, changing the values of constants, or inserting/deleting small code fragments. What these modifications have in common is that they do not modify the detailed design of the software. Consequently, “copy-and-paste” development puts limited intellectual strain on the developer, keeps modifications local, and can speed up development in the short term.
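As an illustration, the two fragments below (invented for this article) are clone instances of each other: the second was produced from the first by renaming identifiers, inserting one statement (a local gap), and leaving the detailed design untouched.

```python
# Original fragment.
def average_latency(samples):
    total = 0
    for s in samples:
        total += s
    return total / len(samples) if samples else 0

# Copy-pasted and adapted: identifiers renamed, one statement inserted
# (a local gap) -- the detailed design is unchanged.
def average_throughput(readings):
    count = 0
    for r in readings:
        count += r
    count *= 8  # inserted statement: convert bytes to bits
    return count / len(readings) if readings else 0
```

A duplication detector that tolerates identifier renaming and small local gaps would report these two functions as one clone with two instances.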

Why is code duplication bad?

At first sight, code duplication seems to be a desirable approach to development, as it is associated with reuse, implementation speed-up, and developer comfort. However, in the long term, the implications of code duplication can be very negative. Let us have a look at several problems it creates.

First of all, code duplication causes an increase in software size. Every supplementary line of code will, sooner or later, enter the maintenance process, and thus cost time and money.

Secondly, duplication has a direct influence on the difficulty of maintaining already developed code. A classical example in this respect is modifying a piece of duplicated code. Such a modification, let us say a bug fix, is in most cases relevant for all instances of the duplicated code. That is, if one decides to do it, it has to be done everywhere, in all duplicates. Yet, very often the modification is performed in only one instance at a time, as developers are not aware of the existence of the other instances – they have no standard mechanism for checking whether a code fragment is duplicated, and where all the duplicates are. Consequently, the full resolution of the bug fix may happen only after several costly implementation-test iterations. If the software is released in the meantime, the bug-fixing cost can increase hundreds of times. Additionally, code duplication can also be a sign of poor design, indicating that generic functionality has not been properly abstracted. This anti-agile approach eventually leads to longer development times, canceling the benefit of the initial speed-up. Consequently, in the long term, and especially during the maintenance phase, code duplication is to be avoided.
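The bug-fix hazard described above is usually countered by factoring the shared logic into a single place, so a fix is made once and reaches every caller. A minimal sketch with invented function names:

```python
# Before: the bounds check is duplicated in every reader; fixing a bug
# in it means hunting down and editing every copy.
def read_sensor(buf, i):
    if 0 <= i < len(buf):   # duplicated check, repeated in many functions
        return buf[i]
    return None

# After: the previously duplicated logic lives in one helper, so a single
# fix propagates to all callers.
def in_bounds(buf, i):
    return 0 <= i < len(buf)

def read_sensor_v2(buf, i):
    return buf[i] if in_bounds(buf, i) else None
```

This is exactly the kind of refactoring that clone detection results enable: once all instances of a clone are known, the common part can be extracted and the copies replaced by calls.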

How to get rid of code duplication?

The main advantage of removing duplicated code instead of “dead” and “gold-plating” code is that it can be done cheaply and quickly. Recent advances in static code analysis and hardware performance have made tool-based localization of code duplication possible in industrial contexts. However, while many of the existing tools offer reliable and accurate results, the usefulness of the detection results can be worlds apart. The effectiveness of a given approach or tool depends very much on the context in which it is used. There are a number of aspects that always have to be considered when choosing a code duplication detection tool:

  • Can the user specify the minimal and maximal size of a code fragment that constitutes a clone? Size is typically measured in contiguous lines of code that form a duplicate.
  • Can the tool cope with code formatting and decoration issues, like white spaces, annotation, and comments, when searching for duplicates?
  • Does the tool recognize identifier renaming, e.g., changing the names of variables?
  • Can the tool cope with insertion/deletion/modification of code? That is, does it detect only exact duplicates, or duplicates in which small amounts of code have been altered?
  • Does the tool follow statements that span multiple lines?
  • Are the results presented in a usable way?
  • Is the tool scalable enough?

All these issues are important for the effectiveness of a duplication detection tool, as follows. First, the tool should allow easy setting of the size of a code fragment considered to be a duplicate. Fixed (preset) sizes are not an option: sizes that are too small will discover too many useless duplicates, while sizes that are too large will yield too few. Second, code is in practice rarely duplicated ‘verbatim’. Small-scale changes, such as formatting, indentation, comments, variable renaming, and insertion or removal of small code fragments, occur when duplicating code. Hence, a duplication detection tool should be able to detect duplicates in the presence of such changes, or its usability will be limited. On the other hand, not too many changes should be allowed, otherwise the whole notion of a duplicate is lost. Third, the detected duplicates must be presented in a way that makes their analysis, searching, and understanding fast and effective, otherwise this information cannot be used for software improvement. Finally, duplication tools must handle large code bases of millions of lines of code fast enough that the detection cost does not offset its advantages. A careful balance must be struck between all these aspects to obtain a truly useful duplication tool.
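Several of these requirements can be illustrated with a toy line-based detector: it normalizes whitespace, strips comments, crudely abstracts identifiers (to tolerate renaming), and takes a configurable minimum duplicate size. This is only a sketch of the idea; it handles no local gaps or overlapping matches, and real tools such as SolidSDD use far more sophisticated matching.

```python
import re
from collections import defaultdict

# Tiny keyword set for the sketch; a real tool would use the language grammar.
KEYWORDS = {"if", "else", "for", "while", "in", "return", "def"}

def normalize(line: str) -> str:
    line = re.sub(r"#.*", "", line)             # strip comments
    line = re.sub(r"\b(?!\d)(\w+)\b",           # abstract identifiers to "ID",
                  lambda m: m.group(1) if m.group(1) in KEYWORDS else "ID",
                  line)                         # keeping keywords and numbers
    return re.sub(r"\s+", " ", line).strip()    # collapse whitespace

def find_clones(lines: list[str], min_size: int = 3) -> dict[tuple, list[int]]:
    """Map each normalized window of min_size lines to the start lines where it occurs."""
    windows = defaultdict(list)
    norm = [normalize(l) for l in lines]
    for i in range(len(norm) - min_size + 1):
        key = tuple(norm[i:i + min_size])
        if all(key):                            # skip windows with blank-only lines
            windows[key].append(i)
    return {k: v for k, v in windows.items() if len(v) > 1}
```

Running it on the two summation loops from the earlier example reports one clone with instances at both locations, despite the renamed variables and the trailing comment.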

Improvement potential assessment

In many cases, an acceptable solution for localizing code clones can be identified. The only question that remains is: “What is the actual benefit of removing the duplication?” In a recent study, SolidSource investigated the potential of code duplication removal for software size reduction across a number of industries. The trigger of this study was the large amount of code duplication observed by the company during its consultancy practice in the embedded industry. With code duplication percentages in the range of 15-25%, the embedded industry is one of the principal candidates for improvement. To get a good representation of more industries, the study was performed on a representative set of applications from the Open Source Software (OSS) arena. Six industry segments were identified, chosen for their similarity with the commercial world. In each segment, three representative applications were selected for analysis based on popularity, as reflected by the number of downloads and rank in Google searches. The application development language had to be one of C, C++, C# or Java. The considered segments and the corresponding applications are given in Table 1:

Table 1: Investigated OSS segments and applications

For each considered application, a number of measurements have been performed using the Software Duplication Detector (SolidSDD) tool to estimate the amount of code duplication and the potential size decrease. In each measurement, different settings for the minimum size of the considered duplicates were used. The amount of code duplication was investigated for high (> 35 statements), medium (> 25 statements) and low (> 15 statements) duplicate sizes. The higher the level, the less duplication is found, since lower-level results include those obtained for higher levels. The detailed measurements are presented in Table 2, and aggregated results are depicted in Figure 1.
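In essence, the three size levels amount to filtering the detected clone instances by a minimum-statement threshold before summing their contribution. A simplified sketch of that aggregation (with invented figures; it also ignores overlap between instances, which real measurements must account for):

```python
def duplication_pct(instance_sizes: list[int], total_statements: int,
                    threshold: int) -> float:
    """Percentage of code covered by clone instances larger than a size threshold.

    instance_sizes: size of each detected clone instance, in statements.
    total_statements: total size of the code base, in statements.
    """
    duplicated = sum(s for s in instance_sizes if s > threshold)
    return 100.0 * duplicated / total_statements
```

By construction, raising the threshold can only lower the reported percentage, which is why the low-size level always reports at least as much duplication as the medium and high levels.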

Table 2: Results of code duplication measurements

Figure 1: a) Duplicated code across industries in the OSS arena; b) potential of duplication reduction when refactoring the top 5% clones in a software stack

Figure 1a shows that the amount of duplication is relatively low in four out of the six considered industry segments: Communication, Software Development, Office and Databases. In these industry segments the average amount of duplicated code is below 5% at medium and high duplicate size levels.

Nevertheless, two industries exhibit significant deviations from the rest: Networking/Embedded (>10% duplication) and Enterprise/Financial (>25% duplication). These OSS industry segments could benefit the most from removing code duplication. Understanding why these segments exhibit higher duplication than the other considered segments is an interesting challenge to be studied next. Potential answers are related to the programming language and development style, age of the code stack, or background of the programmers involved.

Figure 1b depicts the potential code duplication reduction upon refactoring the top 5% of the largest clones. By refactoring, one should understand here the removal of all but one copy of a given duplicate. This gives an indication of how easy it is to achieve improvement when removing duplication. One can see that in most industries, the top 5% of the largest clones account for more than 25% of all duplication in the code. In the Networking/Embedded segment, the improvement can reach as much as 40% of all achievable benefit.
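The kind of aggregation behind Figure 1b takes only a few lines: sort the clones by size, take the largest 5%, and compare their duplicated-line total against that of all clones. The sketch below is illustrative only and does not use the study's data:

```python
def top_clone_share(clone_sizes: list[int], fraction: float = 0.05) -> float:
    """Fraction of all duplicated code accounted for by the largest clones.

    clone_sizes: duplicated lines attributed to each clone.
    fraction: share of clones (by count) considered "top", e.g. 0.05 for 5%.
    """
    sizes = sorted(clone_sizes, reverse=True)
    top_n = max(1, round(len(sizes) * fraction))
    return sum(sizes[:top_n]) / sum(sizes)
```

With a Pareto-like distribution, e.g. one clone of 100 duplicated lines among nineteen clones of 10 lines each, the single top-5% clone already accounts for over a third of all duplication, which mirrors the leverage reported in the study.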

Consequently, the conclusion of the study was that much of the theoretical benefit of removing duplication can be achieved in any industry segment with relatively little effort. Yet, the segments where it really pays off are Networking/Embedded and Enterprise/Financial.

Implications on the commercial industry segments

Can the results of this study be projected from the OSS arena onto the commercial one? In contrast to commercial software, OSS applications are aimed at developing a technology rather than at making it available in as many business contexts (e.g., product families and versions) as possible. Therefore, the expected amount of duplication in the OSS area is arguably lower than in the commercial one. The Networking/Embedded segment could actually be used as a calibration factor when making correlations. This segment contains applications that are highly hardware-aware, and is therefore the closest to the Embedded segment of the commercial zone. By correlating these two segments, one can derive a bias factor for estimating the amount of duplication in the various commercial industry segments. Based on previous observations from its consultancy practice in the commercial embedded world, SolidSource estimated the bias factor to be between 5 and 15%.

Using this bias, one can deduce that the amount of duplication in the commercial area for Enterprise/Financial applications might be very large (i.e., 30-45%), therefore far overtaking the embedded segment. Consequently, investigating the potential of removing code duplication should be a main concern for this industry as well.


Conclusions

Code duplication is an important factor that drives up maintenance costs. No software industry branch is immune to this problem; it is present in all areas where large software projects are developed, including the Open Source Software arena. Recent advances in code duplication detection techniques and tools make the detection and measurement of duplication possible with little effort. Such tools and techniques can bring large cost reductions at a small price, if used in the right way. As the industry evolves, code duplication detection applications will become part of the standard arsenal of software developers, on an equal footing with other well-established tools such as compilers, debuggers, design tools, and software configuration management systems.