Count number of unique values in each column of array

01-16-2018


You have given us a 57.9KB a.txt and a 575.5KB a1.txt (which is created by your script from a.txt and should, therefore, be smaller than a.txt but is instead almost 10 times larger). Your script also creates a2.txt from a1.vcf. But, you haven't shown us what the contents of a1.vcf look like.

Please show us:

the a1.txt that should be created from the sample a.txt you provided in post #14,
a sample a.vcf file and a description of its contents (explaining what the field separator is in this file, what fields are used from which lines), and
the exact output you hope to produce from those sample a.txt and a.vcf files.

Don Cragun

01-16-2018


Quote:

Originally Posted by Don Cragun

the a1.txt that should be created from the sample a.txt you provided in post #14,
a sample a.vcf file and a description of its contents (explaining what the field separator is in this file, what fields are used from which lines), and
the exact output you hope to produce from those sample a.txt and a.vcf files.

Hi Don, a1.vcf is a typo. It should be a1.txt. For some reason, I don't have an edit button for post 14.

I clipped off the bottom of a.txt because the full file is large (86M). The bottom part is not necessary because it is just more rows of 0/0, 0/1, 1/1, and ./. values like the preceding rows. I just wanted to show the header plus some of the data (0/0, 0/1, 1/1, ./.).

The last portion of post 14 shows the desired output: a column for the sample names (row 28 in a.txt), the counts of each value (0/0, 0/1, 1/1, ./.) for each sample (from a2.txt), and a column for the sum of the 0/0 and 0/1 counts. The output should be sorted from high to low on the column containing the sum of 0/0 and 0/1.

Geneanalyst


01-22-2018


Quote:

Originally Posted by Geneanalyst

Thanks Don. Here is my whole code:

Code:

# Identify target derived alleles to the exclusion of outgroups
awk -F "\t" '(NR>28) { if(($313 == "0/0") && ($314 != "0/0") && ($315 != "0/0") && ($316 != "0/0") && ($317 != "0/0") && ($318 != "0/0") && ($319 != "0/0")) {print $0} }' a.txt > a1.txt
#
# Strip columns 1-9 and write to a2.txt
# (a1.vcf below is a typo for a1.txt, per the poster's correction)
awk '{for(i=10;i<=NF;i++){printf "%s ", $i}; printf "\n"}' a1.vcf > a2.txt
# PRINT HEADER
awk 'FNR==28 {for(i=9;i<=NF;i++){printf "%s ", $i}; printf "\n"}' a.txt
# Print count of 0/0 0/1 1/1 ./. for each sample in the run
awk '
{ 
   mc = NF > mc ? NF : mc
   for(i=NF; i; i--) {
      T[$i]
      C[i FS $i]++
   }
}
END {
  for(v in T) {
     printf "\n%s", v
     for(i=1; i<=mc;i++) printf "\t%d",C[i FS v]
  }
  printf "\n"
}' a2.txt

In the first part, the input data a.txt is queried. (In the attached file, I only copied the first 100 rows, which include the header with the sample names.) Columns 1-9 do not contain relevant information. Columns 313-319 contain the target samples against which all the test samples (columns 10-312) are compared.

The rows that survive the comparison operation are written to a1.txt (columns 1-9 don't contain relevant information).
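The comparison in the first awk command can also be written as a loop over the outgroup columns, which is easier to adjust if the column numbers change. This is only a sketch of the same test (column 313 equal to 0/0, none of columns 314-319 equal to 0/0, header rows 1-28 skipped). The function name `filter_derived` and the `TARGET`, `FIRST`, `LAST`, and `SKIP` variables are made up for illustration and are not from the original script.

```shell
# Sketch: keep data rows where the target column is 0/0 and no outgroup column is 0/0.
# filter_derived, TARGET, FIRST, LAST, and SKIP are hypothetical names.
filter_derived() {
    awk -F '\t' -v target="${TARGET:-313}" -v first="${FIRST:-314}" \
        -v last="${LAST:-319}" -v skip="${SKIP:-28}" '
    NR > skip && $target == "0/0" {
        for (i = first; i <= last; i++)
            if ($i == "0/0") next    # an outgroup also has 0/0: drop this row
        print                        # no outgroup matched: keep the whole row
    }' "$@"
}
```

With the defaults this would be run as `filter_derived a.txt > a1.txt`.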

Next the header containing the sample names is extracted from a.txt, and your code is executed for counting the number of unique values.

Next I manually add the values in the 0/0 and 0/1 columns to create a totals column. I then sort on the totals column from high to low. The sample with the highest total is the most similar to the target sample (column 313).

I like the way you transposed the result and would like to also transpose the header with sample names. So instead of COL1, COL2, ..., I would like the sample names from row 28 (columns 10-319), as shown below.

Code:

FORMAT    1/1    0/0    0/1    TOTAL 0/0 & 0/1
.Kurd_C3_ID001    78    183    201    384
Balochi_HGDP00052    86    175    201    376
Balochi_HGDP00054    71    166    225    391
Balochi_HGDP00056    71    158    233    391
Balochi_HGDP00058    90    168    204    372
Balochi_HGDP00062    91    148    223    371
Balochi_HGDP00064    85    183    194    377
Balochi_HGDP00066    79    185    198    383
Balochi_HGDP00068    95    163    202    365
Balochi_HGDP00072    75    168    217    385
Balochi_HGDP00074    80    198    183    381
Balochi_HGDP00078    89    171    199    370
Balochi_HGDP00080    88    149    222    371
Balochi_HGDP00082    85    179    195    374
Balochi_HGDP00086    102    162    198    360
Balochi_HGDP00088    89    175    194    369
Balochi_HGDP00090    87    177    197    374
Balochi_HGDP00092    87    191    184    375
Balochi_HGDP00096    87    166    207    373
Balochi_HGDP00098    95    190    175    365
GujaratiD_NA20847    74    168    220    388
GujaratiD_NA20899    86    183    193    376
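One possible way to get a table like the one above in a single pass, without the intermediate a2.txt and without adding and sorting by hand, is sketched here. It rests on several assumptions: both files are tab-separated, the sample names are on row 28 of a.txt, every column from 10 onward is counted, and ./. counts are left out as in the sample output. The function name `count_by_sample`, the `HDR` and `FIRST` variables, and the `SAMPLE`/`TOTAL` header labels are made up for illustration.

```shell
# Sketch: one row per sample with counts of 1/1, 0/0, 0/1 and the 0/0 + 0/1 total,
# sorted from high to low on the total.
# count_by_sample, HDR, and FIRST are hypothetical names.
count_by_sample() {
    # $1 = file holding the header row (a.txt), $2 = filtered data rows (a1.txt)
    printf 'SAMPLE\t1/1\t0/0\t0/1\tTOTAL\n'
    awk -F '\t' -v hdr="${HDR:-28}" -v first="${FIRST:-10}" '
    NR == FNR {                      # first file: only remember the sample names
        if (FNR == hdr)
            for (i = first; i <= NF; i++) name[i] = $i
        next
    }
    {                                # second file: count each value per column
        if (NF > nf) nf = NF
        for (i = first; i <= NF; i++) C[i, $i]++
    }
    END {
        for (i = first; i <= nf; i++)
            printf "%s\t%d\t%d\t%d\t%d\n", name[i], C[i, "1/1"], C[i, "0/0"],
                   C[i, "0/1"], C[i, "0/0"] + C[i, "0/1"]
    }' "$1" "$2" | sort -t "$(printf '\t')" -k5,5nr
}
```

With the defaults this would be run as `count_by_sample a.txt a1.txt`. The first file is read in full just to pick up row 28, which is simple but not the fastest choice for an 86M input.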

Moderator's Comments:

Please use CODE tags around sample input and output as well as around code segments.

I note that this output doesn't include any output for the fields that have the value ./.. Do you only want to display data in your output for the 1/1, 0/0, and 0/1 value counts?

Does the output order matter for the middle three columns?

You also said that your output should be sorted in decreasing order on the values in the last column, but your sample output appears to be unsorted???

Are columns 313-319 supposed to be counted and printed along with the test samples, or are just columns 10-312 supposed to be counted and printed?

Don Cragun
