Count number of unique values in each column of array

01-14-2018

Registered User

46, 3

Join Date: Jan 2018

Last Activity: 11 February 2019, 11:45 PM EST

Posts: 46

Thanks Given: 44

Thanked 3 Times in 3 Posts

What is an efficient way to count how many times each distinct value occurs in each column of a 400-column by 1000-row array and to output the counts per column, assuming the distinct values in the array are:

A, B, C, D

In other words, the output should look like:

Code:

Value    COL1    COL2    COL3
A        50      51      52
B        95      23      12
C        51      95      85
D        32      60      20

Thanks in advance

Last edited by Scrutinizer; 01-14-2018 at 02:03 PM.. Reason: added example; mod: code tags

Geneanalyst

01-14-2018

Moderator

3,689, 1,352

Join Date: Jan 2012

Last Activity: 22 August 2020, 11:29 PM EDT

Location: Galactic Empire

Posts: 3,689

Thanks Given: 268

Thanked 1,352 Times in 1,258 Posts

Here is an awk approach:

Code:

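awk '
        BEGIN {
                # The values to report, in output order
                n = split ( "A B C D", T )
        }
        {
                # Count each value per column; the key is "column FS value"
                for ( i = 1; i <= NF; i++ )
                        R[i FS $i] += 1
        }
        END {
                # Header row; NF still holds the field count of the last record
                printf "VAL\t"
                for ( i = 1; i <= NF; i++ )
                        printf "COL%d\t", i
                printf "\n"

                # One row per value; %s (not %c) so multi-character values print in full
                for ( j = 1; j <= n; j++ )
                {
                        printf "%s\t", T[j]
                        for ( i = 1; i <= NF; i++ )
                                printf "%d\t", R[i FS T[j]]
                        printf "\n"
                }
        }
' file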
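
For anyone who wants to try this approach on a small input first, here is a minimal sketch of the same idea, not the code as posted. It assumes a made-up three-row sample, uses awk's built-in comma subscript (`R[i, $i]`, which joins the parts with SUBSEP) instead of joining with FS, and tracks the widest row rather than relying on NF in END. The function name `count_known` is hypothetical.

```shell
# Sketch only: count a fixed list of values per column.
# "count_known" and the sample data are assumptions made for illustration.
count_known() {
    awk -v values="A B C D" '
    BEGIN { n = split(values, T, " ") }   # values to report, in order
    {
        if (NF > nf) nf = NF              # remember the widest row
        for (i = 1; i <= NF; i++)
            R[i, $i]++                    # key is: column SUBSEP value
    }
    END {
        printf "VAL"
        for (i = 1; i <= nf; i++) printf "\tCOL%d", i
        printf "\n"
        for (j = 1; j <= n; j++) {
            printf "%s", T[j]
            for (i = 1; i <= nf; i++) printf "\t%d", R[i, T[j]]
            printf "\n"
        }
    }' "$@"
}

# Three rows, three columns
printf 'A B C\nA D C\nB B C\n' | count_known
```

With this sample, the A row reads 2, 0, 0 and the C row reads 0, 0, 3. Passing a file name instead of piping works the same way, since the arguments are handed straight to awk.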

Yoda

01-14-2018

Moderator

3,791, 1,452

Join Date: Oct 2010

Last Activity: 1 August 2020, 1:38 AM EDT

Posts: 3,791

Thanks Given: 183

Thanked 1,452 Times in 1,302 Posts

If you are looking for a report on all values that appear in the data (not just A, B, C, or D), you could also try the following:

Code:

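awk '
{
   # Track the widest row so the report covers every column
   mc = NF > mc ? NF : mc
   for(i=NF; i; i--) {
      T[$i]          # remember each distinct value
      C[i FS $i]++   # count of this value in column i
   }
}
END {
  printf "Value"
  for(i=1; i<=mc;i++) printf "\tCOL%d",i

  # for (v in T) visits the values in an unspecified order
  for(v in T) {
     printf "\n%s", v
     for(i=1; i<=mc;i++) printf "\t%d",C[i FS v]
  }
  printf "\n"
}' infile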
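
One thing to keep in mind with a `for (v in T)` loop is that awk does not promise any particular order, so the value rows can come out differently from one awk to another. If a stable row order matters, a variation along these lines might help. It is only a sketch under assumptions: the function name `count_all_sorted` and the sample rows are made up, and the small insertion sort is swapped in here; it is not part of the script above.

```shell
# Sketch only: report every value found, with rows in sorted order.
# "count_all_sorted" and the sample data are assumptions made for illustration.
count_all_sorted() {
    # LC_ALL=C keeps string comparison in plain byte order
    LC_ALL=C awk '
    {
        if (NF > mc) mc = NF
        for (i = 1; i <= NF; i++) {
            if (!($i in seen)) { seen[$i]; vals[++nv] = $i }   # first sighting
            C[i, $i]++
        }
    }
    END {
        # insertion sort of the distinct values (string comparison)
        for (a = 2; a <= nv; a++) {
            v = vals[a]
            for (b = a - 1; b >= 1 && (vals[b] "") > (v ""); b--)
                vals[b + 1] = vals[b]
            vals[b + 1] = v
        }
        printf "Value"
        for (i = 1; i <= mc; i++) printf "\tCOL%d", i
        printf "\n"
        for (a = 1; a <= nv; a++) {
            printf "%s", vals[a]
            for (i = 1; i <= mc; i++) printf "\t%d", C[i, vals[a]]
            printf "\n"
        }
    }' "$@"
}

printf '0/1 0/0\n./. 0/0\n0/1 1/1\n' | count_all_sorted
```

For 400 columns and a handful of distinct values the sort cost is negligible; with many distinct values, piping the value rows through sort(1) would probably be the simpler choice.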

Chubler_XL

01-14-2018

Quote:

Originally Posted by Chubler_XL

Thank you, it works except for certain columns: starting at column 13, and every 13th column thereafter, it gives an incorrect value.

Here is the output for the first 26 columns. Columns 13 and 26 have an incorrect count of the 1/1 values. The rest looks good.

Code:

Value    1/1    0/0    0/1
COL1    4    61    18
COL2    6    63    14
COL3    2    59    22
COL4    3    64    16
COL5    2    60    21
COL6    2    61    20
COL7    2    64    17
COL8    0    60    23
COL9    2    56    25
COL10    2    66    15
COL11    2    63    18
COL12    1    62    20
COL13    53    63    15
COL14    2    54    26
COL15    1    63    18
COL16    2    66    15
COL17    4    65    16
COL18    2    63    16
COL19    6    59    20
COL20    2    55    22
COL21    0    63    18
COL22    6    67    16
COL23    4    60    17
COL24    3    57    22
COL25    3    55    25
COL26    53    62    18
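
A quick way to sanity-check a table laid out like this one is to add up the counts in each row, since the counts for one column of the input cannot add up to more than the number of input lines. The sketch below does that for three rows copied from the table above. The function name `row_totals` is hypothetical, and the layout (a label followed by the counts) is assumed to match what was posted.

```shell
# Sketch only: total the counts on each row of a "label count count ..." table.
# "row_totals" is a name made up for this illustration.
row_totals() {
    awk 'NR > 1 {                          # skip the header line
        total = 0
        for (i = 2; i <= NF; i++) total += $i
        printf "%s\t%d\n", $1, total
    }' "$@"
}

# Header plus three rows taken from the posted output
row_totals <<'EOF'
Value    1/1    0/0    0/1
COL12    1    62    20
COL13    53    63    15
COL14    2    54    26
EOF
```

For these three rows the totals are 83, 131, and 82, so COL13 stands well apart from its neighbours; a total like that is the kind of discrepancy worth tracing back to the input.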

---------- Post updated at 10:34 PM ---------- Previous update was at 10:32 PM ----------

Quote:

Originally Posted by Yoda

Thanks Yoda, it works, except that it has the same problem as Chubler_XL's script.

Last edited by Scrutinizer; 01-15-2018 at 12:47 AM.. Reason: code tags

Geneanalyst

01-15-2018

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Given the code suggested by Chubler_XL and Yoda, it is hard to imagine that anything is different in the way counts are accumulated for column numbers that are multiples of 13.

Can you provide us with sample data that demonstrates the inaccurate counts that you have reported?

What operating system (including release number) are you using?

Which version of awk are you using?

Don Cragun

01-15-2018

Quote:

Originally Posted by Don Cragun

Sure. I am using GNU Awk 4.1.3 on Ubuntu 16.04.3 LTS, and I have attached the actual text file I used. Also, I stand corrected: it is not just the multiples of 13; many more columns, seemingly at random, are off.

a2.txt (98.6 KB)

Last edited by Geneanalyst; 01-15-2018 at 05:38 AM.. Reason: clarification

Geneanalyst

01-15-2018

When I slightly modify the code suggested by Chubler_XL to be:

Code:

awk '
{ 
   mc = NF > mc ? NF : mc
   for(i=NF; i; i--) {
      T[$i]
      C[i FS $i]++
   }
}
END {
  printf "Value"
  for(i=1; i<=mc;i++) printf "\tCOL%d",i

  for(v in T) {
     printf "\n%s", v
     for(i=1; i<=mc;i++) printf "\t%d",C[i FS v]
  }
  printf "\n"
}' a2.txt

and store this in a file named Chubler_XL, make it executable and run the command:

Code:

./Chubler_XL > Chubler_XL.out

and I slightly modify the code suggested by Yoda to be:

Code:

awk '
        BEGIN {
                n = split ( "./. 0/0 0/1 1/1", T )
        }
        {
                for ( i = 1; i <= NF; i++ )
                        R[i FS $i] += 1
        }
        END {
                printf "VAL\t"
                for ( i = 1; i <= NF; i++ )
                        printf "COL%d\t", i
                printf "\n"

                for ( j = 1; j <= n; j++ )
                {
                        printf "%s\t", T[j]
                        for ( i = 1; i <= NF; i++ )
                                printf "%d\t", R[i FS T[j]]
                        printf "\n"
                }
        }
' a2.txt

and store this in a file named Yoda, make it executable and run the command:

Code:

./Yoda > Yoda.out

and I write the code:

Code:

awk -v line_count="$(wc -l < a2.txt)" '
function check() {
	printf("Checking fields 2 through %d in file: %s\n", NF, f)
	for(i = 2; i <= NF; i++)
		if(c[i] != line_count)
			printf("file %s: field %d count %d\n", f, i, c[i])
	split("", c)
}
FNR == 1 {
	line_count += 0
	if(f == "")
		printf("Evaluating output produced from %d lines in a2.txt\n",
		    line_count)
	else
		check()
	f = FILENAME
	next
}
{	for(i = 2; i <= NF; i++)
		c[i] += $i
}
END {	check()
}' *.out

and store that in a file named counter, make it executable, and run it, I get the output:

Code:

Evaluating output produced from 83 lines in a2.txt
Checking fields 2 through 305 in file: Chubler_XL.out
Checking fields 2 through 305 in file: Yoda.out

which shows that the sum of the values for each of the 304 fields does indeed equal the number of lines found in the file you attached in post #6.
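
The same check can be tried on a single small table. The following is a simplified sketch of the idea, not the counter script above: it reads one count table (a header row, then one row per value), sums each count column, and compares the sum with an expected line count given as an argument. The function name `check_table` and the sample table are assumptions made for illustration.

```shell
# Sketch only: verify that every count column sums to the expected line count.
# "check_table" is a hypothetical name; usage: check_table EXPECTED_LINES < table
check_table() {
    awk -v expected="$1" '
    NR == 1 { next }                       # skip the header row
    {
        if (NF > nf) nf = NF
        for (i = 2; i <= NF; i++) sum[i] += $i
    }
    END {
        for (i = 2; i <= nf; i++)
            if (sum[i] != expected + 0) {
                printf "COL%d sums to %d, expected %d\n", i - 1, sum[i], expected
                bad++
            }
        if (!bad) print "all columns sum to " expected
    }'
}

# A made-up table for a three-line input
check_table 3 <<'EOF'
Value COL1 COL2
./.   1    0
0/0   0    2
0/1   2    0
1/1   0    1
EOF
```

Field 1 is the value label, so count column N lives in field N+1, which is why the message prints `i - 1`.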

I see no indication that either of these suggestions produces incorrect results, although neither of them produces output that is at all close to the output you showed us in post #4. I do note that the output you showed us in post #4 only covers the three values "0/0", "0/1", and "1/1", but the data in a2.txt also includes some entries with the value "./.". That value is included in the output produced by the code Chubler_XL suggested and in the output produced by the code Yoda suggested (after changing it to look for those four values instead of the values "A", "B", "C", and "D" that you listed in post #1).
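
When it is not certain which values a file really contains, a quick inventory before counting per column can help. This is a minimal sketch under assumptions: the function name `list_values` and the sample rows are made up, and the output is piped through sort with `LC_ALL=C` only so that the order is predictable.

```shell
# Sketch only: list every distinct field value with its overall count.
# "list_values" and the sample data are assumptions made for illustration.
list_values() {
    awk '{ for (i = 1; i <= NF; i++) seen[$i]++ }
         END { for (v in seen) printf "%s\t%d\n", v, seen[v] }' "$@" |
    LC_ALL=C sort
}

printf '0/0 0/1\n./. 0/0\n1/1 0/0\n' | list_values
```

On this sample it lists four values, including "./.", which is the sort of entry that is easy to miss when only the expected values are counted.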

If you'd like to show us the code you used to produce the output for the 1st 26 columns you showed us in post #4, maybe we can help you explain why that code failed to correctly interpret the output produced by Chubler_XL's code or Yoda's code.

Don Cragun
