Highlighting duplicate string on a line

09-11-2014


Hi all

I have a grep written to pull out values; below (in the code snippet) is an example of the output.
What I'm struggling to do, and looking for assistance on, is identifying the lines that consist of a duplicated string.
For example, 74859915K74859915K below is 74859915K repeated twice, but 32575310100014 is not a whole repeating value, so I don't want to see it.

In my head (and what I'm unable to do) I want to do something like count its length, split it in half and confirm the first half matches the second half... I'm open to suggestions as there may be a better way to do it.

Background - these values are in multiple files within an xml tag <foo></foo>. My grep is extracting them and removing the xml tags with sed leaving just the below output... it's the next step where I want to only have the true dupes.

Many thanks in advance.

Code:

74859915K74859915K
0B153858340B15385834
MUNS0-0000000001MUNS0-0000000001
10594556C10594556C
0B982730630B98273063
Q1818002FQ1818002F
78883385D78883385D
44871376D44871376D
B14513386B14513386
016797265C016797265C
0A120861950A12086195
025691290Z025691290Z
31262294G31262294G
B57312068B57312068
16803742B16803742B
723029268723029268
A50470772A50470772
B64841927B64841927
32575310100014
50836566B50836566B
499984

brighty


09-11-2014


A clunky way in a shell script would be to:-

Read the file and loop for each line
- Calculate half the line length with ((half=${#line}/2))
- Build up a string of question marks for the given length (each one represents a single character)
- Use variable substitution to split the line
- Compare the original with the half you have (twice)
- If there is a match, take action one way, if not, the other way
Repeat for the remainder of the lines.

Does this seem a sensible logic to you? If so, we can help you code where you are stuck.

What do you think?

Robin

rbatte1


09-11-2014


Thanks for the reply Robin; you're on the same page as me.

Not being one to sit back and expect it to be written for me, I've had a go with the pointer you gave me, but I'm getting some strange results from it.
I thought I'd start simple and build up. I was half expecting (excuse the pun) that the code below would work out half the length of the string and store it in $half, and that combining that with an awk substr would let me output just the first half of the string. The idea was to store that in a variable, do the same for the second half by getting the awk substr to start at $half for $half characters, and then compare the two. If they match, output the line; if they don't, bin it.

Code:

while read line; do
((half=${#line}/2))
echo $line | awk '{print substr($0,1,$half)}'
done < $TEMP_1

That doesn't give me the output I was expecting. $TEMP_1 is a file name which contains the values one per line as per my previous post.
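
A likely explanation for the surprise: inside single quotes the shell never expands $half, so awk sees its own unset variable half (value 0), reads $half as $0, and ends up using whatever number the line happens to start with as the substring length. A minimal sketch of one way round it, assuming a POSIX awk, is to hand the shell value over with -v. The function name first_half is made up for this illustration.

```shell
# first_half is a hypothetical helper name used only for this sketch.
first_half() {
    line=$1
    half=$(( ${#line} / 2 ))
    # -v copies the shell value into an awk variable. Inside awk it is
    # written as plain "half"; "$half" would mean "field number half".
    printf '%s\n' "$line" | awk -v half="$half" '{ print substr($0, 1, half) }'
}

first_half 74859915K74859915K
```

For the 18-character sample line this prints the first nine characters, 74859915K.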

brighty


09-11-2014


While you are on the right track, it is best not to call an external program inside a loop because that will make it very slow. You could do it all in shell inside the loop, or use a single utility instead of a shell loop.
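
As a rough sketch of the "single utility" route, assuming a POSIX awk, the half-and-half comparison can run in one awk process over all the input. The function name even_dupes is invented for this example.

```shell
# even_dupes is a hypothetical name; it reads lines on stdin and prints only
# those whose first half is identical to their second half.
even_dupes() {
    awk '{
        n = length($0)
        h = n / 2
        # Only non-empty, even-length lines can be an exact doubling.
        if (n > 0 && n % 2 == 0 && substr($0, 1, h) == substr($0, h + 1))
            print
    }'
}

printf '%s\n' 74859915K74859915K 32575310100014 499984 | even_dupes
```

With those three sample values only 74859915K74859915K is printed.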

--
Another option would be to use a back reference in a regex:

Code:

grep '^\(.*\)\1$' file

The anchors ^ and $ make sure the two identical patterns glued together form the whole line...
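
To see the pieces at work, here is a small sketch that feeds three of the sample values through the same pattern. The wrapper name dupes_by_backref is made up for the illustration; the pattern itself is the one above.

```shell
# dupes_by_backref is a hypothetical wrapper around the grep shown above.
dupes_by_backref() {
    # \(.*\) captures any run of characters, \1 must then repeat exactly
    # that captured text, and ^ and $ tie the pair to the whole line.
    grep '^\(.*\)\1$'
}

printf '%s\n' 74859915K74859915K 32575310100014 723029268723029268 |
    dupes_by_backref
```

This prints the first and third values and drops 32575310100014. One caveat: an empty line also matches, because .* is allowed to capture nothing; writing the group as \(..*\) would require at least one character.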


Scrutinizer


09-11-2014


Well good for you. We all learn better by trying, rather than being spoon-fed. With a nice pun like that, are you British?

You might need $1 in your awk rather than $0

It should still work though. This will give you the first half of each line, so you'd need to catch and compare that to the original, something like:-

Code:

while read -r line; do
   ((half=${#line}/2))
   # Pass the shell value in with -v; $half inside single quotes is never expanded by the shell
   halfline=`echo "$line" | awk -v half="$half" '{print substr($0,1,half)}'`
   if [ "${halfline}${halfline}" = "${line}" ]
   then
      echo "${line} is a duplicated entry"
   else
      echo "${line} is not repeated"
   fi
done < "$TEMP_1" > logfile

Personally, I'd replace the awk with a substitution, so you are not calling awk over and again, something like this:-

Code:

while read line; do
   ((half=${#line}/2))
   h=1                                           # Set a counter
   mask=                                         # Null the variable
   until [ $h -gt $half ]                        # Loop until counter is right
   do
      mask="${mask}?"                            # Add a ? (single character wildcard)
      ((h=$h+1))
   done
   halfline="${line#${mask}}"                    # Strip the first half, keeping the second half
   if [ "${halfline}${halfline}" = "${line}" ]   # Match twice the split line with the original
   then
      echo "${line} is a duplicated entry"
   else
      echo "${line} is not repeated"
   fi
done < $TEMP_1 > logfile

Does that suit? Does it even work?

Robin


rbatte1


09-11-2014


Spot on, Scrutinizer, that does exactly what I need it to do, both as a pipe on the end of my original grep and in the loop whilst reading each line.

Code:

grep 'somestuff' | sed 's/afew bits/g' | grep '^\(.*\)\1$' |

Code:

while read line; do
echo $line | grep '^\(.*\)\1$'
done < $TEMP_1

I'm not going to pretend I know what it's doing. Can you recommend some reading on this? Is it known as back referencing within normal regex?

Robin - Thank you. Whilst Scrutinizer has answered it I'm still going to read and digest your reply so that I understand how what I was trying to achieve should work. All good learning.

Thank you both.


brighty


09-11-2014


May I present my approach:

Code:

#!/bin/bash

while read -r value; do
 len=${#value}
 center=$((len / 2))              # shell arithmetic, no external expr call
 firsthalf=${value:0:center}
 secondhalf=${value:center}       # from center to the end of the string
  if [ "$firsthalf" = "$secondhalf" ]; then
   echo "$value"
  fi
done <values >truedupes

junior-helper
