Count number of unique values in each column of array


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Count number of unique values in each column of array
# 1  
Old 01-14-2018
Question Count number of unique values in each column of array

What is an efficient way of counting the number of unique values in a 400 column by 1000 row array and outputting the counts per column, assuming the unique values in the array are:

A, B, C, D

In other words the output should look like:
Code:
     Value    COL1    COL2    COL3
A    50    51    52
B    95    23    12
C    51    95    85
D    32    60    20

Thanks in advance

Last edited by Scrutinizer; 01-14-2018 at 03:03 PM.. Reason: added example; mod: code tags
# 2  
Old 01-14-2018
Here is an awk approach:-
Code:
awk '
        BEGIN {
                n = split ( "A B C D", T )
        }
        {
                for ( i = 1; i <= NF; i++ )
                        R[i FS $i] += 1
        }
        END {
                printf "VAL\t"
                for ( i = 1; i <= NF; i++ )
                        printf "COL%d\t", i
                printf "\n"

                for ( j = 1; j <= n; j++ )
                {
                        printf "%c\t", T[j]
                        for ( i = 1; i <= NF; i++ )
                                printf "%d\t", R[i FS T[j]]
                        printf "\n"
                }
        }
' file

This User Gave Thanks to Yoda For This Post:
# 3  
Old 01-14-2018
If you are looking for a report on all values that appear in data (not just A B C or D) you could also try the following:


Code:
awk '
{ 
   mc = NF > mc ? NF : mc
   for(i=NF; i; i--) {
      T[$i]
      C[i FS $i]++
   }
}
END {
  printf "Value"
  for(i=1; i<=mc;i++) printf "\tCOL%d",i

  for(v in T) {
     printf "\n%s", v
     for(i=1; i<=mc;i++) printf "\t%d",C[i FS v]
  }
  printf "\n"
}' infile

This User Gave Thanks to Chubler_XL For This Post:
# 4  
Old 01-15-2018
Quote:
Originally Posted by Chubler_XL
If you are looking for a report on all values that appear in data (not just A B C or D) you could also try the following:


Code:
awk '
{ 
   mc = NF > mc ? NF : mc
   for(i=NF; i; i--) {
      T[$i]
      C[i FS $i]++
   }
}
END {
  printf "Value"
  for(i=1; i<=mc;i++) printf "\tCOL%d",i

  for(v in T) {
     printf "\n%s", v
     for(i=1; i<=mc;i++) printf "\t%d",C[i FS v]
  }
  printf "\n"
}' infile

Thank you it works except for certain columns. So starting at column 13 and every 13 columns thereafter it gives an incorrect value.

Here is the output for the 1st 26 columns. Columns 13 and 26 have an incorrect count of the 1/1 values. The rest looks good.

Code:
Value    1/1    0/0    0/1
COL1    4    61    18
COL2    6    63    14
COL3    2    59    22
COL4    3    64    16
COL5    2    60    21
COL6    2    61    20
COL7    2    64    17
COL8    0    60    23
COL9    2    56    25
COL10    2    66    15
COL11    2    63    18
COL12    1    62    20
COL13    53    63    15
COL14    2    54    26
COL15    1    63    18
COL16    2    66    15
COL17    4    65    16
COL18    2    63    16
COL19    6    59    20
COL20    2    55    22
COL21    0    63    18
COL22    6    67    16
COL23    4    60    17
COL24    3    57    22
COL25    3    55    25
COL26    53    62    18

---------- Post updated at 10:34 PM ---------- Previous update was at 10:32 PM ----------

Quote:
Originally Posted by Yoda
Here is an awk approach:-
Code:
awk '
        BEGIN {
                n = split ( "A B C D", T )
        }
        {
                for ( i = 1; i <= NF; i++ )
                        R[i FS $i] += 1
        }
        END {
                printf "VAL\t"
                for ( i = 1; i <= NF; i++ )
                        printf "COL%d\t", i
                printf "\n"

                for ( j = 1; j <= n; j++ )
                {
                        printf "%c\t", T[j]
                        for ( i = 1; i <= NF; i++ )
                                printf "%d\t", R[i FS T[j]]
                        printf "\n"
                }
        }
' file

Thanks Yoda, works except with the same problems as Chubler_XL's script below.

Last edited by Scrutinizer; 01-15-2018 at 01:47 AM.. Reason: code tags
# 5  
Old 01-15-2018
Given the code suggested by Chubler_XL and Yoda, it is hard to imagine that anything is different in the way counts are accumulated for column numbers that are multiples of 13.

Can you provide us with sample data that demonstrates the inaccurate counts that you have reported?

What operating system (including release number) are you using?

Which version of awk are you using?
This User Gave Thanks to Don Cragun For This Post:
# 6  
Old 01-15-2018
Quote:
Originally Posted by Don Cragun
Given the code suggested by Chubler_XL and Yoda, it is hard to imagine that anything is different in the way counts are accumulated for column numbers that are multiples of 13.

Can you provide us with sample data that demonstrates the inaccurate counts that you have reported?

What operating system (including release number) are you using?

Which version of awk are you using?

Sure, for Awk I have GNU Awk 4.1.3, and for OS I have Ubuntu 16.04.3 LTS, and I have attached the actual text file I used. Also, I stand corrected it seems that it is not multiples of 13 but many more columns at random that are off.

Last edited by Geneanalyst; 01-15-2018 at 06:38 AM.. Reason: clarification
# 7  
Old 01-15-2018
When I slightly modify the code suggested by Chubler_XL to be:
Code:
awk '
{ 
   mc = NF > mc ? NF : mc
   for(i=NF; i; i--) {
      T[$i]
      C[i FS $i]++
   }
}
END {
  printf "Value"
  for(i=1; i<=mc;i++) printf "\tCOL%d",i

  for(v in T) {
     printf "\n%s", v
     for(i=1; i<=mc;i++) printf "\t%d",C[i FS v]
  }
  printf "\n"
}' a2.txt

and store this in a file named Chubler_XL, make it executable and run the command:
Code:
./Chubler_XL > Chubler_XL.out

and I slightly modify the code suggested by Yoda to be:
Code:
awk '
        BEGIN {
                n = split ( "./. 0/0 0/1 1/1", T )
        }
        {
                for ( i = 1; i <= NF; i++ )
                        R[i FS $i] += 1
        }
        END {
                printf "VAL\t"
                for ( i = 1; i <= NF; i++ )
                        printf "COL%d\t", i
                printf "\n"

                for ( j = 1; j <= n; j++ )
                {
                        printf "%s\t", T[j]
                        for ( i = 1; i <= NF; i++ )
                                printf "%d\t", R[i FS T[j]]
                        printf "\n"
                }
        }
' a2.txt

and store this in a file named Yoda, make it executable and run the command:
Code:
./Yoda > Yoda.out

and I write the code:
Code:
awk -v line_count="$(wc -l < a2.txt)" '
function check() {
	printf("Checking fields 2 through %d in file: %s\n", NF, f)
	for(i = 2; i <= NF; i++)
		if(c[i] != line_count)
			printf("file %s: field %d count %d\n", f, i, c[i])
	split("", c)
}
FNR == 1 {
	line_count += 0
	if(f == "")
		printf("Evaluating output produced from %d lines in a2.txt\n",
		    line_count)
	else
		check()
	f = FILENAME
	next
}
{	for(i = 2; i <= NF; i++)
		c[i] += $i
}
END {	check()
}' *.out

and store that in a file named counter, make it executable, and run it, I get the output:
Code:
Evaluating output produced from 83 lines in a2.txt
Checking fields 2 through 305 in file: Chubler_XL.out
Checking fields 2 through 305 in file: Yoda.out

which shows that the sums of the values for each of the 304 fields does indeed equal the number of lines found in the file you attached in post #6.

I see no indication that either of these suggestions is producing results that are incorrect although neither of them produce output that is at all close to the output you showed us in post #4. I do note that the output you showed us in post #4 only shows output for the three values "0/0", "0/1", and "1/1"; but the data in a2.txt also includes some entries with the value "./." which is included in the output produced by the code Chubler_XL suggested and in the output produced by the code Yoda suggested (after changing it to look for those four values instead of the values, "A", "B", "C", and "D" that you said were included as values in your statements in post #1.

If you'd like to show us the code you used to produce the output for the 1st 26 columns you showed us in post #4, maybe we can help you explain why that code failed to correctly interpret the output produced by Chubler_XL's code or Yoda's code.
This User Gave Thanks to Don Cragun For This Post:
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Count unique column

Hello, I am trying to count unique rows in my file based on 4 columns (2-5) and to output its frequency in a sixth column. My file is tab delimited My input file looks like this: Colum1 Colum2 Colum3 Colum4 Coulmn5 1.1 100 100 a b 1.1 100 100 a c 1.2 200 205 a d 1.3 300 301 a y 1.3 300... (6 Replies)
Discussion started by: nans
6 Replies

2. UNIX for Beginners Questions & Answers

Awk: count unique element of array

Hi, tab-separated input: blabla_1 A,B,C,C blabla_2 A,E,G blabla_3 R,Q,A,B,C,R,Q output: blabla_1 3 blabla_2 3 blabla_3 5 After splitting $2 in an array, I am trying to store the number of unique elements in a variable, but have some difficulties resetting the variable to 0 before... (6 Replies)
Discussion started by: beca123456
6 Replies

3. Shell Programming and Scripting

Print count of unique values

Hello experts, I am converting a number into its binary output as : read n echo "obase=2;$n" | bc I wish to count the maximum continuous occurrences of the digit 1. Example : 1. The binary equivalent of 5 = 101. Hence the output must be 1. 2. The binary... (3 Replies)
Discussion started by: H squared
3 Replies

4. Shell Programming and Scripting

Count occurrence of column one unique value having unique second column value

Hello Team, I need your help on the following: My input file a.txt is as below: 3330690|373846|108471 3330690|373846|108471 0640829|459725|100001 0640829|459725|100001 3330690|373847|108471 Here row 1 and row 2 of column 1 are identical but corresponding column 2 value are... (4 Replies)
Discussion started by: angshuman
4 Replies

5. Shell Programming and Scripting

Count frequency of unique values in specific column

Hi, I have tab-deliminated data similar to the following: dot is-big 2 dot is-round 3 dot is-gray 4 cat is-big 3 hot in-summer 5 I want to count the frequency of each individual "unique" value in the 1st column. Thus, the desired output would be as follows: dot 3 cat 1 hot 1 is... (5 Replies)
Discussion started by: owwow14
5 Replies

6. UNIX for Dummies Questions & Answers

count number of distinct values in each column with awk

Hi ! input: A|B|C|D A|F|C|E A|B|I|C A|T|I|B As the title of the thread says, I would need to get: 1|3|2|4 I tried different variants of this command, but I don't manage to obtain what I need: gawk 'BEGIN{FS=OFS="|"}{for(i=1; i<=NF; i++) a++} END {for (b in a) print b}' input ... (2 Replies)
Discussion started by: beca123456
2 Replies

7. Shell Programming and Scripting

How to count Unique Values from a file.

Hi I have the following info in a file - <Cell id="25D"/> <Cell id="26A"/> <Cell id="26B"/> <Cell id="26C"/> <Cell id="27A"/> <Cell id="27B"/> <Cell id="27C"/> <Cell id="28A"/> I would like to know how would you go about counting all... (4 Replies)
Discussion started by: Prega
4 Replies

8. UNIX for Dummies Questions & Answers

count number of rows based on other column values

Could anybody help with this? I have input below ..... david,39 david,39 emelie,40 clarissa,22 bob,42 bob,42 tim,32 bob,39 david,38 emelie,47 what i want to do is count how many names there are with different ages, so output would be like this .... david,2 emelie,2 clarissa,1... (3 Replies)
Discussion started by: itsme999
3 Replies

9. Shell Programming and Scripting

print unique values of a column and sum up the corresponding values in next column

Hi All, I have a file which is having 3 columns as (string string integer) a b 1 x y 2 p k 5 y y 4 ..... ..... Question: I want get the unique value of column 2 in a sorted way(on column 2) and the sum of the 3rd column of the corresponding rows. e.g the above file should return the... (6 Replies)
Discussion started by: amigarus
6 Replies

10. Shell Programming and Scripting

Not able to read unique values in array

Hi Friends, I am having some trouble reading into an array. Basically, I am trying to grep for a pattern and extract it's value and store the same into an array. For eg., if my input is: <L:RECORD>name=faisel farooq,age=21, company=TCS,project=BT</L:RECORD> <L:RECORD>name=abc... (1 Reply)
Discussion started by: faiz1985
1 Replies
Login or Register to Ask a Question

Featured Tech Videos