Checking for duplicate code

06-01-2012

I have a short one-liner that does a very rudimentary check for duplicate code:

Code:

# count identical lines, then drop lines seen only once and lines containing "}"
sort myfile.cpp | uniq -c | grep -v '^ *1 ' | grep -v '}'

It sorts the file, counts the occurrences of each line, drops lines that occur only once and drops lines containing the ubiquitous closing brace. The example is C++, but the approach extends easily to other programming languages.
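
To see the count filter in isolation, the sketch below feeds a few made-up lines through the same steps. The function name `count_dups` and the sample text are illustrative only. The pattern is anchored on optional leading blanks followed by a lone 1, so a count such as 11, or a source line that happens to contain "1 ", is not filtered out by mistake.

```shell
#!/bin/sh
# count_dups: hypothetical helper, not part of the original post.
# Reads source lines on stdin and prints only the lines that occur
# more than once, each prefixed with its count.
count_dups() {
    sort | uniq -c | grep -v '^ *1 ' | grep -v '}'
}

# Sample input: "x = 1 ;" occurs twice, "y = 2;" once, "}" twice.
printf '%s\n' 'x = 1 ;' 'y = 2;' 'x = 1 ;' '}' '}' | count_dups
```

Only the "x = 1 ;" line should remain, with a count of 2; the exact width of the count column depends on the uniq implementation.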

I would like to make this a bit more advanced. A few examples:

1- Ignore leading whitespace (indentation), so that the following lines of output are considered identical:

Code:

   2     for (i = 0; i < N; i++) {
   2        for (i = 0; i < N; i++) {

2- Ignore differences in spacing within the code, so that the following lines of output are considered identical:

Code:

   2     for (i = 0; i < N; i++) {
   2     for ( i = 0; i < N; i++ ) {

If there are easy ways to do this, I would like to hear about them.

I am deliberately not excluding comment lines, such as those containing "/*", "*/" or "//", as excluding them would weaken the case for telling developers to document their code better.

Any other one-liner ideas to check for duplicate code are also welcome.

figaro

06-02-2012

What is your idea of duplicate code? I'm sure you cannot script out duplicate code and still keep a functioning, logically ordered execution path.

jim mcnamara

06-03-2012

I want to be able to spot code that is a candidate for refactoring. There is no intention to script out lines of code.

figaro

06-03-2012

I'd start by removing blanks before doing anything else. In C/C++, blanks mostly serve to make code easier to read (indentation) or appear in output (like "printf( " \n");"). They also separate tokens (as in "int i"), but that does not matter here because the stripped text is only compared, never compiled. In the following, replace "<spc>" and "<tab>" with literal space and tab characters.

Code:

sed 's/[<spc><tab>]*//g'

This removes any space or tab character from the source, including indentation.
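
Putting the blank removal in front of the original pipeline could look like the sketch below. It assumes a POSIX shell and a sed that understands the `[[:blank:]]` class, which stands in for the literal space/tab pair. The function name `dup_lines` is made up for illustration, and filtering the counts with awk is my own substitution for the grep on the count column.

```shell
#!/bin/sh
# dup_lines FILE: hypothetical helper, not part of the original thread.
# Strips all blanks, then lists the lines that occur more than once.
dup_lines() {
    sed 's/[[:blank:]]//g' "$1" |   # remove every space and tab
        grep -v '^$' |              # skip lines that are now empty
        grep -vx '[{}]' |           # skip lines holding only a brace
        sort | uniq -c |            # count identical lines
        awk '$1 > 1'                # keep lines seen at least twice
}
```

Called as `dup_lines myfile.cpp`, this treats differently indented or differently spaced copies of a line as one. Note that blanks inside string literals are removed as well, so two lines that differ only in the spacing of their output strings would be counted together.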

Another idea you might want to pursue is to concatenate lines that do not end in a brace or a semicolon. Consider the following two versions of the same statement:

Code:

a=b+c;

a =
b + c;

They are equal to the compiler, but your procedure would count them as different.

You can do this concatenation with sed, but it involves a little hold space / pattern space gymnastics:

Code:

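sed -n 's/[<spc><tab>]*//g
     # last line: join it with the collected text, print, quit
     $ { x
         G
         s/\n//g
         p
         q
       }
     # line ends a statement or block: join, print, then empty the hold space
     /[;{}]$/ {
            x
            G
            s/\n//g
            p
            s/.*//
            x
            d
          }
     # any other line: collect it in the hold space and read on
     /[;{}]$/! {
            H
            d
           }' /path/to/input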

What it does (I suggest you get a sed reference if you are not familiar with this): first, all spaces and tabs are deleted from the current line. Then there are three types of lines to handle:

The last line of the file is covered first, in the block "$ {..". The contents of the hold space and the pattern space are exchanged, then the hold space (which now holds the current line) is appended to the pattern space: we concatenate the line with the previously read lines. Next, all line feeds are deleted (s/\n//g), the line is printed, and we quit.

The next type of line ends with a ";", a "{" or a "}" (braces end expressions too). We do practically the same as for the last line, but after printing we clear the pattern space and exchange it with the hold space, which empties the hold space and "flushes the buffers". Otherwise portions of the text would be duplicated.

The last type of line does not end in a brace or a semicolon. We append its content to the hold space, delete the pattern space and start over with the next line.

So, in principle, we collect text in the hold space and flush it out on specific occasions (whenever we feel a "program line" has been read completely).
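
The same collect-and-flush idea can also be sketched in awk, for readers who find the hold space hard to follow. This is not a line-for-line port of the sed script above: the function name `join_statements` and the variable `buf` are made up, and leftover text at end of input is printed only if the buffer is not empty.

```shell
#!/bin/sh
# join_statements FILE: hypothetical helper illustrating collect-and-flush in awk.
join_statements() {
    awk '
        { gsub(/[ \t]/, "") }               # drop all spaces and tabs from the line
        { buf = buf $0 }                    # collect the line in the buffer
        /[;{}]$/ { print buf; buf = "" }    # statement complete: flush the buffer
        END { if (buf != "") print buf }    # flush whatever is left at end of input
    ' "$1"
}
```

Piping the result onward, for example `join_statements myfile.cpp | sort | uniq -c`, then counts a statement split over several lines together with its single-line form.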

I hope this helps.

bakunin

Last edited by bakunin; 06-03-2012 at 01:53 PM..
