Deleting duplicated chunks in a file using awk/sed


 
# 1  
Old 05-25-2016

Hi all,

I always appreciate the help I get from this site.

I would like to delete duplicated chunks within the same group of lines.

Each chunk is one column spanning four lines:
path name
starting point
ending point
voltage number

A chunk should be deleted if its "ending point" duplicates that of an earlier chunk in the same group. For example, in the first group the ending points of the first and second chunks are the same, so only the first chunk is kept and the second is removed.

In the second group, the ending points of the first and third chunks are the same, so the first chunk is kept and the third is removed.

input.txt:
Code:
path_sparc_ffu_dp_out_1885  path_sparc_ffu_dp_out_2759  path_sparc_ffu_dp_out_3115
R_1545/Q    R_1541/Q    R_1545/Q
dp_ctl_synd_out_low[6]  dp_ctl_synd_out_low[6]  dp_ctl_synd_out_low[2]
0.926208    0.910592    0.905082
path_sparc_ffu_dp_out_699   path_sparc_ffu_dp_out_712   path_sparc_ffu_dp_out_819
R_1053/Q    R_1053/Q    R_1053/Q
dp_ctl_synd_out_low[2]  dp_ctl_synd_out_low[6]  dp_ctl_synd_out_low[2]
0.945436    0.945436    0.9435
path_sparc_ffu_dp_in_686
frf_dp_data[42]
dp_ctl_synd_out_high[6]
0.812538


Expected_output.txt:
Code:
path_sparc_ffu_dp_out_1885  path_sparc_ffu_dp_out_3115
R_1545/Q        R_1545/Q
dp_ctl_synd_out_low[6]      dp_ctl_synd_out_low[2]
0.926208        0.905082
path_sparc_ffu_dp_out_699   path_sparc_ffu_dp_out_712   
R_1053/Q    R_1053/Q    
dp_ctl_synd_out_low[2]  dp_ctl_synd_out_low[6]  
0.945436    0.945436 
path_sparc_ffu_dp_in_686
frf_dp_data[42]
dp_ctl_synd_out_high[6]
0.812538

The number of columns can be up to 20 in a file.
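To restate the rule: a chunk is kept only if its third-line value (the ending point) has not appeared earlier in the same four-line group. Just that keep/drop decision can be sketched on one made-up ending-point line (the values below are hypothetical, not from the real data):

```shell
# Toy sketch: print the column numbers that survive, i.e. those whose
# ending-point value has not been seen earlier on the line.
printf '%s\n' 'low[6] low[6] low[2]' | awk '
{
	keep = ""
	for (i = 1; i <= NF; i++)
		if (!($i in seen)) { seen[$i]; keep = keep (keep ? " " : "") i }
	print keep
}'
# prints: 1 3  (column 2 duplicates the ending point of column 1)
```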

Actually, I posted the same question on another website and received some replies, but none of them worked correctly. Any help is appreciated.

Best,

Jaeyoung
# 2  
Old 05-26-2016
Instead of trying to get multiple websites to act as your unpaid programming staff, why don't you show us how you have tried to solve this problem on your own? If you can show us what you have tried, maybe we can help you fix it.

We have helped you with 8 other awk scripts in the last six months. Can't you use the examples provided by those scripts to get a good start on what you need here?
# 3  
Old 05-26-2016
First of all, I am sorry about my bad attitude.

I first tried 'uniq', but it only works on whole lines, and I don't know how to make it compare chunks. Then I tried sed:
Code:
sed -r 's/(dp_ctl_synd_out_low\[[0-9]\])(.+)(\1)/\1 \2 -/g' input.txt

So now I can find the duplicated ending point and replace it with "-", but I failed to remove the whole chunk.
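That matches sed's limitation here: it works on one line at a time, so it can mark the duplicated ending point on its own line but cannot drop the matching column from the other three lines of the group. A quick demonstration on toy data (GNU sed assumed, since -r is used):

```shell
# Only the ending-point line is rewritten; the path, start and voltage
# lines keep all three columns, so the chunk is not really removed.
printf '%s\n' \
	'p1 p2 p3' \
	's1 s2 s3' \
	'dp_ctl_synd_out_low[6]  dp_ctl_synd_out_low[6]  dp_ctl_synd_out_low[2]' \
	'0.1 0.2 0.3' |
sed -r 's/(dp_ctl_synd_out_low\[[0-9]\])(.+)(\1)/\1 \2 -/g'
```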

If my post is not appropriate, I will remove it soon.

Best,

Jaeyoung
Moderator's Comments:
Mod Comment Please use CODE tags when displaying sample input, sample output, and code segments.

Last edited by Don Cragun; 05-26-2016 at 02:01 AM.. Reason: Add CODE tags.
# 4  
Old 05-26-2016
Save as chunks.pl
Run as perl chunks.pl chunks.data

Code:
#!/usr/bin/perl
#
use strict;
use warnings;

my @chunks;    # the four parsed lines of the current group
my $lines = 0; # how many lines of the group have been read so far

while (<>) {
    my @parts = split;
    push @{$chunks[$lines++]}, @parts;
    if ($lines == 4) {
        # $chunks[2] is the "ending point" line: record the column
        # indexes whose ending point has not been seen before.
        my %seen;
        my $count = 0;
        my @keep;
        for my $i (@{$chunks[2]}) {
            !$seen{$i}++ and push @keep, $count;
            ++$count;
        }

        # Print each of the four lines restricted to the kept columns.
        for my $i (@chunks) {
            my @returns;
            for my $j (@keep) {
                push @returns, @{$i}[$j];
            }
            print "@returns\n";
        }

        clean();
    }
}

# Reset state for the next four-line group.
sub clean {
    @chunks = ();
    $lines = 0;
}


Output:

Code:
path_sparc_ffu_dp_out_1885 path_sparc_ffu_dp_out_3115
R_1545/Q R_1545/Q
dp_ctl_synd_out_low[6] dp_ctl_synd_out_low[2]
0.926208 0.905082
path_sparc_ffu_dp_out_699 path_sparc_ffu_dp_out_712
R_1053/Q R_1053/Q
dp_ctl_synd_out_low[2] dp_ctl_synd_out_low[6]
0.945436 0.945436
path_sparc_ffu_dp_in_686
frf_dp_data[42]
dp_ctl_synd_out_high[6]
0.812538

# 5  
Old 05-26-2016
Maybe something more like:
Code:
awk '
{	for(i = 1; i <= NF; i++) {
		f[NR % 4, i] = $i
	}
}
!(NR % 4) {
	ocnt = 0
	for(i = 1; i <= NF; i++)
		if(!(f[3, i] in of)) {
			of[f[3, i]]
			spot[++ocnt] = i
		}
	for(i = 1; i <= ocnt; i++)
		for(j = 1; j <= 4; j++) {
			ol[j] = ol[j] f[j % 4, spot[i]] ((i == ocnt) ? "" : "\t")
		}
	for(i = 1; i <= 4; i++) {
		print ol[i]
		delete ol[i]
	}
	for(i in of)
		delete of[i]
}' input.txt

would work better for you. This uses a single tab character as the output field separator instead of a seemingly random number of spaces (but you can easily change it to a fixed number of spaces if you want to).

From the sed command you're using, I assume that you're not running this on a Solaris system, but if someone else wants to try the above code on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
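For instance, joining the kept columns with four spaces instead of a tab only changes the separator string in the inner loop. A trimmed sketch of the same logic on made-up data (not the real paths):

```shell
# Same column-dedup idea, but the kept fields are joined with four
# spaces ("    ") instead of "\t".
printf '%s\n' 'a b a' 'p q p' 'e1 e1 e2' '1 2 3' | awk '
{	for (i = 1; i <= NF; i++) f[NR % 4, i] = $i; nf = NF }
!(NR % 4) {
	ocnt = 0
	for (i = 1; i <= nf; i++)
		if (!(f[3, i] in of)) { of[f[3, i]]; spot[++ocnt] = i }
	for (i = 1; i <= ocnt; i++)
		for (j = 1; j <= 4; j++)
			ol[j] = ol[j] f[j % 4, spot[i]] ((i == ocnt) ? "" : "    ")
	for (i = 1; i <= 4; i++) { print ol[i]; delete ol[i] }
	for (i in of) delete of[i]
}'
# "e1" is duplicated, so the middle column is dropped from all four lines
```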
# 6  
Old 05-26-2016
Thank you, Don.

Your code works perfectly for me. I will be more careful the next time I post.

Jaeyoung