Scanning columns for duplicates and printing in one line


 
# 1  
Old 02-23-2010

Description of data:
Code:
NC_002737.1  4  F1VI4M001A3IAU F1VI4M001A3IAU F1VI4M001A3IAU F1VI4M001A3IAU
NC_006372.1  5  F1VI4M001BH0HY F1VI4M001BH0HY F1VI4M001C0ZC5 F1VI4M001DOF2X F1VI4M001AYNTS

Every field in every record is tab-separated.
There can be "n" columns.

Problem:
What I want to achieve is the following:
Code:
NC_002737.1 4 F1VI4M001A3IAU 4
NC_006372.1 5 F1VI4M001BH0HY 2 F1VI4M001C0ZC5 1 F1VI4M001DOF2X 1 F1VI4M001AYNTS 1

So far this is what is happening:
Code:
awk 'BEGIN{OFS="\t";cnt=0}{if (NF>3) {for (i=3;i<=NF;i++) {if ($(i)==$(i+1)) {cnt = cnt+1} print $1,$(i),cnt}}cnt=1}'

Output:
Code:
NC_002737.1     F1VI4M001A3IAU  2
NC_002737.1     F1VI4M001A3IAU  3
NC_002737.1     F1VI4M001A3IAU  4
NC_002737.1     F1VI4M001A3IAU  4
NC_006372.1     F1VI4M001BH0HY  2
NC_006372.1     F1VI4M001C0ZC5  1
NC_006372.1     F1VI4M001DOF2X  1
NC_006372.1     F1VI4M001AYNTS  1

I know the problem: the for loop walks the fields one by one and prints a line for each of them. I have not been able to come up with a way to get the desired output (on one line) as shown above.
Can anyone suggest a way? I could write another awk script to post-process this, but I am wondering if there is a way to get it all fixed in one go.

Thanks

# 2  
Old 02-23-2010
Quote:
I know the problem: the for loop walks the fields one by one and prints a line for each of them. I have not been able to come up with a way to get the desired output (on one line) as shown above.
Use printf instead of print: printf does not append the output record separator, so you can keep appending to the current line and print the newline yourself once the record is done.
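A minimal sketch of that idea, assuming the duplicate IDs are always adjacent (as in your sample data) and that every record has at least three fields; infile is a placeholder for the input file:

Code:
awk 'BEGIN { FS = OFS = "\t" }
{
  printf "%s%s%s", $1, OFS, $2      # start the line with the two key fields
  prev = ""; cnt = 0
  for (i = 3; i <= NF; i++)
    if ($i == prev) cnt++           # same ID as the previous field: bump its count
    else {
      if (cnt) printf "%s%s%s%s", OFS, prev, OFS, cnt  # flush the finished ID and its count
      prev = $i; cnt = 1
    }
  printf "%s%s%s%s%s", OFS, prev, OFS, cnt, RS         # flush the last ID, then end the line
}' infile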
# 3  
Old 02-23-2010
With awk[1]:


Code:
awk '{
  split(x, c); j = 0                # empty the count array c (with GNU awk: delete c)
  for (i = 0; ++i <= NF;)
    if (i < 3) printf "%s\t", $i    # pass the two key fields through
    else c[$i]++ || n[++j] = $i     # count every ID; remember each new one in first-seen order
  for (i = 0; ++i <= j;)
    printf "%s", (n[i] OFS c[n[i]] (i < j ? OFS : RS))   # ID/count pairs; newline after the last
  }' OFS='\t' infile

Perl:

Code:
perl -lane'
  %_ = (); $_{$F[$_]}++ for 2..$#F;
  print join "\t", @F[0..1], map {$_, $_{$_}} keys %_;
  ' infile

[1]. Use GNU awk (gawk), Bell Labs New awk (nawk) or X/Open awk (/usr/xpg4/bin/awk) on Solaris.
[2]. I'm assuming that every record has at least three fields. If you need something different, let me know.
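For reference, the awk version lists the IDs in the order they first appear, so on the sample data it reproduces the requested (tab-separated) output exactly:

Code:
NC_002737.1 4 F1VI4M001A3IAU 4
NC_006372.1 5 F1VI4M001BH0HY 2 F1VI4M001C0ZC5 1 F1VI4M001DOF2X 1 F1VI4M001AYNTS 1

The Perl one-liner counts the same way, but keys %_ returns the IDs in arbitrary hash order, so its ID/count pairs may come out in a different order.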

# 4  
Old 02-23-2010
Code:
while (<DATA>) {
	my @tmp = split;                     # split the record on whitespace
	my %hash;
	print $tmp[0], " ", $tmp[1], " ";    # echo the two key fields
	$hash{$_}++ for @tmp[2 .. $#tmp];    # count each ID
	foreach my $key (keys %hash) {       # note: keys returns the IDs in arbitrary order
		print $key, " ", $hash{$key}, " ";
	}
	print "\n";
}
__DATA__
NC_002737.1  4  F1VI4M001A3IAU F1VI4M001A3IAU F1VI4M001A3IAU F1VI4M001A3IAU
NC_006372.1  5   F1VI4M001BH0HY F1VI4M001BH0HY F1VI4M001C0ZC5 F1VI4M001DOF2X F1VI4M001AYNTS
test	6 a a a b b c c c d a b c d
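On the __DATA__ above this prints something like the following (the ID order within each line depends on Perl's hash ordering, so it can vary between runs):

Code:
NC_002737.1 4 F1VI4M001A3IAU 4
NC_006372.1 5 F1VI4M001BH0HY 2 F1VI4M001C0ZC5 1 F1VI4M001DOF2X 1 F1VI4M001AYNTS 1
test 6 a 4 b 3 c 4 d 2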

# 5  
Old 02-24-2010
Thanks everyone for your suggestions. I really wanted to make this work in awk, but Perl code is welcome too, and many thanks for it. I have not tried the solutions yet, but will do so now.
Cheers