Code duplication detection – terminology

Code duplication is one of the most popular forms of software reuse among developers. It is fast and demands little (read: no) intellectual effort, which makes it the preferred approach to software development under time pressure. However, code duplication can build up a huge technical debt, which may render a project unmaintainable. A cross-industry survey of code duplication shows that some industries (e.g., the enterprise and financial software segments) run a high risk in this respect and should actively monitor and reduce the amount of duplicate code. Fortunately, these two tasks can be supported by specialized tools, such as the Source Code Duplication Detector (SolidSDD), which dramatically decrease the required effort. These tools, combined with the Pareto distribution of duplicate code observed in practice, make the prospect of reducing the related technical debt highly appealing.
Compared to other tools for software analysis, such as coding standards checkers, code duplication detection tools require very little training and configuration. Once the basic terminology is understood, developers can start analyzing code right away and stay productive while they learn the tool.
This article gives an overview of the basic terminology used in code duplication detection, to help developers jump-start the analysis and monitoring tasks.
Duplicate code terminology
Clone An ordered set of statements that is repeated in a number of places in the source code.
Clone instance A (minimal) piece of code that includes the ordered set of statements associated with a code clone.
Clone set The set of all instances of a clone.
Cloning relation An ordered pair of clone instances belonging to the same clone set. The first clone instance of the pair is called the reference; the second clone instance is called the cloning partner.
Clone fan-out The number of files containing instances of a given clone.
Local gap A number of neighboring statements in a clone instance that are not part of the ordered statement set of the associated clone. Local gaps are the result of code insertion/deletion/change that typically takes place after duplicating code via copy-paste operations.
Identifier renaming The process of changing the names of identifiers (e.g., variable, function, or type names) after duplicating code via copy-paste operations. Code resulting from copy-paste operations is rarely of immediate use during development; it first has to be adapted to meet the specification that it implements. Together with code insertion/deletion, identifier renaming is a common step performed in combination with code cloning.
Cumulated gap The sum of all local gaps from the beginning of a clone instance up to a given statement, corrected with the gap decay. This metric is specific to a clone instance and to a statement within it.
Gap decay The amount by which the cumulated gap is decreased for each statement of a clone instance that is part of the ordered set of statements of the associated clone. This metric is specific to a clone instance and to a location within it, and is evaluated over the same statements as the cumulated gap.
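
To make the terms above more concrete, here is a minimal Python sketch of how they could relate to each other. It is an illustration only and not how SolidSDD or any other tool implements them. All names (CloneInstance, clone_fan_out, cloning_relations, local_gaps, cumulated_gap) and the file names are hypothetical. Two modelling choices are assumptions, not part of the definitions: each gap statement adds 1 to the cumulated gap, and each statement that belongs to the clone subtracts a fixed decay without letting the value drop below zero. The sketch also assumes that the reference and the cloning partner of a cloning relation are distinct instances.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import List, Tuple


@dataclass(frozen=True)
class CloneInstance:
    """One occurrence of a clone (hypothetical model for illustration)."""
    file: str
    start_line: int
    # One flag per statement of the instance: True if the statement belongs
    # to the clone's ordered statement set, False if it is part of a gap.
    matches: Tuple[bool, ...]


def clone_fan_out(clone_set: List[CloneInstance]) -> int:
    """Number of distinct files that contain instances of the clone."""
    return len({inst.file for inst in clone_set})


def cloning_relations(clone_set: List[CloneInstance]):
    """All ordered (reference, cloning partner) pairs of distinct instances."""
    return list(permutations(clone_set, 2))


def local_gaps(instance: CloneInstance) -> List[int]:
    """Lengths of the runs of neighboring statements outside the clone."""
    gaps, run = [], 0
    for is_match in instance.matches:
        if is_match:
            if run:
                gaps.append(run)
            run = 0
        else:
            run += 1
    if run:
        gaps.append(run)
    return gaps


def cumulated_gap(instance: CloneInstance, decay: float = 0.5) -> List[float]:
    """Cumulated gap after each statement, under the assumed decay model."""
    total, values = 0.0, []
    for is_match in instance.matches:
        if is_match:
            # Assumption: a matching statement reduces the gap by `decay`.
            total = max(0.0, total - decay)
        else:
            # Assumption: every gap statement counts as 1.
            total += 1.0
        values.append(total)
    return values


# A clone set with three instances spread over two (made-up) files.
a = CloneInstance("billing.py", 10, (True, True, False, False, True, True, True))
b = CloneInstance("billing.py", 80, (True, True, True, True, True))
c = CloneInstance("orders.py", 5, (True, False, True, True, True, True))
clone_set = [a, b, c]

print(clone_fan_out(clone_set))           # 2
print(len(cloning_relations(clone_set)))  # 6
print(local_gaps(a))                      # [2]
print(cumulated_gap(a))                   # [0.0, 0.0, 1.0, 2.0, 1.5, 1.0, 0.5]
```

In this model, instance a has one local gap of two statements. Its cumulated gap rises to 2.0 at the end of that gap and then decays by 0.5 with every following statement that belongs to the clone, which shows why the metric depends on both the clone instance and the statement at which it is read.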
