Does source code duplication matter?

A cross-industry investigation of Open Source Software

Bigger is not always better, and that is particularly true when it comes to software. More software always means more trouble, especially during the maintenance phase. The main factor that influences the required maintenance effort, and consequently the predicted quality, of a software system is its size. Ask any software manager whether she would like to reduce the size of her software and you will get immediate attention. This subject, however, triggers several questions: is it possible to reduce the code size without losing functionality? What is the cost of such an undertaking? Does it pay off? And, if all the previous questions are answered favorably, how should one go about it?

One obvious approach to reducing the size of software is to remove code that is not used (also known as “dead code”) and so-called “gold-plating” code: code that is executed but is not actually necessary for implementing the software requirements. For example, code that computes the value of Pi to a precision of 100 decimals might not be required in a simple drawing application. The main trouble with removing dead and gold-plating code is that such code is difficult to identify correctly. One always runs the risk of removing code that might be required under special circumstances, which can have serious consequences. A much safer way to reduce the size of software is to remove duplicated code.
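A small, hypothetical sketch of that risk (the request format and function names are invented for illustration): a branch that looks dead in everyday traffic may still be required by a single legacy client, so removing it would have consequences for exactly that client.

    # Hypothetical example: the legacy branch looks dead in normal
    # operation, but one old client still depends on it.
    def parse_legacy(request):
        return {"action": request["cmd"], "args": request.get("params", [])}

    def parse_current(request):
        return {"action": request["action"], "args": request.get("args", [])}

    def handle_request(request):
        if "cmd" in request:              # old wire format, rarely seen
            return parse_legacy(request)  # removing this "dead" branch
        return parse_current(request)     # would break the legacy client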

What is code duplication?

Duplicated code is code that has been produced by copying and then adapting existing code. Also known as “copy-and-paste development”, this strategy of producing code is frequently employed as a way of reusing software. There are many levels at which software can be reused: small code fragments at the level of a function or a set of related functions; libraries; or entire subsystems. In the following, the focus lies on the first of these levels.

Under time pressure, many developers try to make use of already implemented and tested software by cloning it and then adapting it to provide the needed functionality. When one cannot reuse an entire library (because of constraints such as the time required to understand or integrate such a library, or product size, platform, or copyright constraints), and when the elements to reuse are individually relatively small, copy-and-paste reuse is the favored technique. It is simple, fast, and profitable – or at least, so it seems at first sight.

Given the relatively small size of reused fragments, it is not surprising that the adaptations performed during copy-and-paste reuse are usually limited to renaming variables, changing the values of constants, or inserting and deleting small code fragments. What these modifications have in common is that they do not modify the detailed design of the software. Consequently, “copy-and-paste” development puts limited intellectual strain on the developer, keeps modifications local, and can speed up development in the short term.
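To make this concrete, here is a hypothetical illustration (the functions and data are invented for this article): a fragment is copied and adapted only by renaming identifiers and changing one constant, while its detailed design remains untouched.

    # Original fragment: average of the valid sensor readings.
    def average_temperature(readings):
        valid = [v for v in readings if v > -40.0]   # drop error values
        return sum(valid) / len(valid) if valid else 0.0

    # Copy-and-paste duplicate: only the identifiers and the constant
    # changed; the control flow and detailed design are identical.
    def average_pressure(samples):
        valid = [p for p in samples if p > 0.0]      # drop error values
        return sum(valid) / len(valid) if valid else 0.0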

Why is code duplication bad?

At first sight, code duplication seems to be a desirable approach to development, as it is associated with reuse, implementation speed-up, and limited developer effort. In the long term, however, the implications of code duplication can be very negative. Let us have a look at several problems it creates.

First of all, code duplication causes an increase in software size. Every supplementary line of code will, sooner or later, enter the maintenance process, and thus cost time and money.

Secondly, duplication has a direct influence on the difficulty of maintaining already developed code. A classical example in this respect is modifying a piece of duplicated code. Such a modification, say a bug fix, is in most cases relevant for all instances of the duplicated code. That is, if one decides to make it, it has to be made everywhere, in all duplicates. Yet, very often the modification is performed in only one instance at a time, as developers are not aware of the existence of the other instances – they have no standard mechanism for checking whether a code fragment is duplicated, and where all the duplicates are. Consequently, the full resolution of the bug fix may come only after several costly implementation-test iterations. If the software is released in the meantime, the bug-fixing cost can increase hundreds of times. Additionally, code duplication can be a sign of poor design, indicating that generic functionality has not been properly abstracted. This anti-agile approach eventually leads to longer development times, canceling the benefit of the initial speed-up. Consequently, in the long term, and especially during the maintenance phase, code duplication is to be avoided.
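A hypothetical sketch of this failure mode (again with invented functions): the same off-by-one bug lives in two copies, and the fix reaches only the copy where the bug was reported.

    def last_reading(readings):
        # Bug fix applied here: the code originally indexed
        # readings[len(readings)], which raises IndexError.
        return readings[len(readings) - 1]

    def last_sample(samples):
        # Copy-and-paste duplicate of the fragment above. The developer
        # who fixed the bug did not know this copy existed, so the same
        # off-by-one error survives until it is reported separately.
        return samples[len(samples)]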

How to get rid of code duplication?

The main advantage of removing duplicated code instead of “dead” and “gold-plating” code is that it can be done cheaply and quickly. Recent advances in static code analysis and hardware performance have made tool-based localization of code duplication possible in industrial contexts. However, while many of the existing tools offer reliable and accurate results, the usefulness of those results can be worlds apart. The effectiveness of a given approach or tool depends very much on the context in which it is used. There are a number of aspects that always have to be considered when choosing a code duplication detection tool:

  • Does the tool allow specifying the minimal and maximal size of a code fragment that constitutes a clone? Size is typically measured in contiguous lines of code that form a duplicate.
  • Can the tool cope with code formatting and decoration issues, such as white space, annotations, and comments, when searching for duplicates?
  • Does the tool recognize identifier renaming, e.g., changing the names of variables?
  • Can the tool cope with insertion, deletion, or modification of code? That is, does it detect only exact duplicates, or also duplicates in which small amounts of code have been altered?
  • Does the tool follow statements that span multiple lines?
  • Are the results presented in a usable way?
  • Is the tool scalable enough?

All these issues are important for the effectiveness of a duplication detection tool, as follows. First, the tool should allow easy setting of the size of a code fragment considered to be a duplicate. Fixed (preset) sizes are not an option: too small a size will report a flood of tiny duplicates, which are useless; too large a size will yield too few. Second, code is in practice rarely duplicated ‘verbatim’. Small-scale changes such as formatting, indentation, comments, variable renaming, and insertion or removal of small code fragments occur when duplicating code. Hence, a duplication detection tool should be able to detect duplicates in the presence of such changes, or its usability will be limited. On the other hand, not too many changes should be allowed, otherwise the whole notion of a duplicate is lost. Third, the detected duplicates must be presented in a way that makes their analysis, searching, and understanding fast and effective, otherwise this information cannot be used for software improvement. Finally, duplication tools must handle large code bases of millions of lines of code fast enough that the detection cost does not offset its advantages. A careful balance must be struck between all these aspects to obtain a truly useful duplication tool.
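To see how these requirements interact, here is a minimal sketch of a line-based detector (written for this article; it is not the algorithm of any particular tool). It normalizes formatting and identifier names, then hashes windows of a configurable minimum size. Everything beyond that – tolerating inserted or deleted lines, merging overlapping windows into maximal clones, presenting the results – is exactly what separates this toy from a usable tool.

    import re
    from collections import defaultdict

    # Keywords are kept as-is so that duplicates are matched on code
    # structure rather than on arbitrary identifier names. (A real tool
    # would tokenize properly and treat string literals separately.)
    KEYWORDS = {"if", "else", "elif", "for", "while", "return", "def", "class"}

    def normalize(line):
        """Strip comments and whitespace, and canonicalize identifiers,
        so that copies differing only in formatting or naming still match."""
        line = re.sub(r"#.*", "", line)  # drop line comments
        line = re.sub(r"\b[A-Za-z_]\w*\b",
                      lambda m: m.group(0) if m.group(0) in KEYWORDS else "ID",
                      line)
        return " ".join(line.split())    # collapse all whitespace

    def find_duplicates(files, min_lines=15):
        """files maps file name -> source text. Returns groups of
        (file, line) locations whose next `min_lines` non-empty,
        normalized lines are identical."""
        windows = defaultdict(list)
        for name, text in files.items():
            lines = [(i + 1, normalize(raw))
                     for i, raw in enumerate(text.splitlines())]
            lines = [(no, l) for no, l in lines if l]   # skip blank lines
            for start in range(len(lines) - min_lines + 1):
                key = tuple(l for _, l in lines[start:start + min_lines])
                windows[key].append((name, lines[start][0]))
        return [group for group in windows.values() if len(group) > 1]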

Improvement potential assessment

In many cases, an acceptable solution for localizing code clones can be identified. The only question that remains is: what is the actual benefit of removing the duplication? In a recent study, SolidSource investigated the potential of code duplication removal for software size reduction across a number of industries. The trigger of this study was the large amount of code duplication observed by the company during its consultancy practice in the embedded industry. With code duplication percentages in the range of 15-25%, the embedded industry is one of the principal candidates for improvement. To get a good representation of more industries, the study was performed on a representative set of applications from the Open Source Software (OSS) arena. Six industry segments were identified, chosen for their similarity with segments in the commercial world. In each segment, three representative applications were selected for analysis, based on popularity as reflected by the number of downloads on SourceForge.net and the rank in Google. The application development language had to be one of C, C++, C# or Java. The considered segments and the corresponding applications are given in Table 1:

Table 1: Investigated OSS segments and applications


For each considered application, a number of measurements were performed using the Software Duplication Detector (SolidSDD) tool to estimate the amount of code duplication and the potential size decrease. In each measurement, a different setting for the minimum size of the considered duplicates was used: the amount of code duplication was investigated for high (>35 statements), medium (>25 statements), and low (>15 statements) minimum duplicate sizes. The higher the level, the less duplication is found, as lower-level results include those obtained for the higher levels. The detailed measurements are presented in Table 2, and aggregated results are depicted in Figure 1.

Table 2: Results of code duplication measurements

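For illustration only (the clone groups below are invented, not taken from Table 2), the duplication percentage at each minimum-size level can be computed from detected clone groups as follows. Note how the low-level figure includes everything found at the higher levels, as stated above.

    def duplication_percentage(clones, total_statements, min_size):
        """clones: (size_in_statements, number_of_copies) pairs, one per
        detected clone group. Counts every statement belonging to a
        group whose duplicate size exceeds the threshold."""
        duplicated = sum(size * copies
                         for size, copies in clones if size > min_size)
        return 100.0 * duplicated / total_statements

    # Hypothetical system of 100,000 statements with five clone groups.
    clones = [(60, 3), (40, 2), (28, 5), (18, 10), (16, 4)]
    for level, threshold in [("high", 35), ("medium", 25), ("low", 15)]:
        pct = duplication_percentage(clones, 100_000, threshold)
        print(f"{level:>6} (>{threshold} statements): {pct:.2f}%")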

Figure 1: a) Duplicated code across industries in the OSS arena; b) potential of duplication reduction when refactoring the top 5% of clones in a software stack


Figure 1a shows that the amount of duplication is relatively low in four out of the six considered industry segments: Communication, Software Development, Office and Databases. In these industry segments the average amount of duplicated code is below 5% at medium and high duplicate size levels.

However, two segments deviate significantly from the rest: Networking/Embedded (>10% duplication) and Enterprise/Financial (>25% duplication). These OSS industry segments could benefit the most from removing code duplication. Understanding why these segments exhibit more duplication than the others is an interesting challenge to be studied next. Potential answers are related to the programming language and development style, the age of the code stack, or the background of the programmers involved.

Figure 1b depicts the potential code duplication reduction upon refactoring the top 5% of the largest clones. By refactoring, one should understand here the removal of all but one copy of a given duplicate. This gives an indication of how easy it is to achieve improvement when removing duplication. One can see that in most industries, the top 5% of the largest clones account for more than 25% of all duplication in the code. In the Networking/Embedded segment, the improvement can reach as much as 40% of all achievable benefit.
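In terms of the hypothetical fragments shown earlier, such a refactoring replaces every copy with a call to a single parameterized function, keeping exactly one copy of the shared logic:

    def average_above(values, threshold):
        """The single remaining copy of the duplicated logic."""
        valid = [v for v in values if v > threshold]
        return sum(valid) / len(valid) if valid else 0.0

    def average_temperature(readings):
        return average_above(readings, -40.0)   # formerly a full duplicate

    def average_pressure(samples):
        return average_above(samples, 0.0)      # formerly a full duplicate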

Consequently, the conclusion of the study was that much of the theoretical benefit of removing duplication can be achieved in any industry segment with relatively little effort. Yet, the segments where it really pays off are Networking/Embedded and Enterprise/Financial.

Implications on the commercial industry segments

Can the results of this study be projected from the OSS arena onto the commercial one? In contrast to commercial software, OSS applications are aimed at developing a technology rather than at making it available in as many business contexts (e.g. product families and versions) as possible. Therefore, the expected amount of duplication in the OSS area is arguably lower than in the commercial one. The Networking/Embedded segment can actually be used as a calibration factor when making such correlations: it contains applications that are highly hardware-aware, and is therefore the closest to the Embedded segment of the commercial world. By correlating these two segments, one can derive a bias factor for estimating the amount of duplication in the various industry segments of the commercial world. Based on previous observations from its consultancy practice in the commercial embedded world, SolidSource estimated this bias factor to be between 5 and 15%.

Using this bias, one can deduce that the amount of duplication in the commercial area for Enterprise/Financial applications might be very large (i.e., 30-45%), thereby overtaking the embedded segment by far. Consequently, investigating the potential of removing code duplication should be a main concern for this industry as well.

Conclusions

Code duplication is an important factor that drives up maintenance costs, and no branch of the software industry is immune to it. The problem is present in all areas where large software projects are developed, including the Open Source Software arena. Recent advances in code duplication detection techniques and tools make the detection and measurement of duplication possible with little effort. Such tools and techniques can bring large cost reductions at a small price, if used in the right way. As the industry evolves, code duplication detection applications will become part of the standard, accepted arsenal of software developers, on an equal footing with other well-established tools such as compilers, debuggers, design tools, and software configuration management systems.
