HELP - uniq values per column


 
# 1  
Old 10-28-2016
HELP - uniq values per column

Hi All,

I am trying to output the unique values per column. See the file below. Can you please assist? Thank you in advance.



Code:
cat names
joe allen ibm
joe smith ibm
joe allen google
joe smith google
rachel allen google


desired output is:
Code:
joe allen google
rachel smith ibm

# 2  
Old 10-29-2016
If you're willing to accept output with values in each column presented in the same order as they were encountered in the input, you could try something like:
Code:
awk '
function AddFieldData(field) {
	if((field, $field) in data)
		return
	data[field, FieldValue[field, ++count[field]] = $field]
	if(count[field] > maxr)
		maxr = count[field]
	if(field > maxc)
		maxc = field
}
{	for(i = 1; i <= NF; i++)
		AddFieldData(i)
}
END {	for(i = 1; i <= maxr; i++)
		for(j = 1; j <= maxc; j++)
			printf("%s%s", FieldValue[j, i], (j == maxc) ? ORS : OFS)
}' names

which, with the sample data you provided, produces the output:
Code:
joe allen ibm
rachel smith google

If you need the last column to be in a different order, you need to more clearly explain your requirements. Hopefully, you can use the above code as a base to get something that will do what you're trying to do.
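For instance, if what you actually want is each column in sorted order (rather than first-seen order), here is a sketch of that variation. This is my addition, not part of the original post: it accumulates the same arrays as the script above, then runs a small insertion sort over each column before printing, using POSIX awk only.

```shell
# Sketch: same accumulation as above, but each column's unique values are
# insertion-sorted before printing. POSIX awk only; sample data recreated
# here so the block is self-contained.
printf '%s\n' 'joe allen ibm' 'joe smith ibm' 'joe allen google' \
    'joe smith google' 'rachel allen google' > names
out=$(awk '
{   for (i = 1; i <= NF; i++)
        if (!((i, $i) in seen)) {
            seen[i, $i]                     # membership marker only
            val[i, ++cnt[i]] = $i           # store the unique value
            if (cnt[i] > maxr) maxr = cnt[i]
        }
    if (NF > maxc) maxc = NF
}
END {
    for (j = 1; j <= maxc; j++)             # insertion-sort column j
        for (i = 2; i <= cnt[j]; i++) {
            v = val[j, i]
            for (k = i - 1; k >= 1 && val[j, k] > v; k--)
                val[j, k + 1] = val[j, k]
            val[j, k + 1] = v
        }
    for (i = 1; i <= maxr; i++)
        for (j = 1; j <= maxc; j++)
            printf("%s%s", val[j, i], (j == maxc) ? ORS : OFS)
}' names)
printf '%s\n' "$out"
```

With the sample data, sorting happens to reproduce the output the OP asked for (google sorts before ibm in column 3).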

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
# 3  
Old 10-29-2016
Thank you for this. It works perfectly.

If it's not too much trouble: the solution you have provided is pretty advanced, and I would much appreciate it if you could explain each line.
# 4  
Old 10-29-2016
Jumping in as Don Cragun seems to be logged out at the moment:
Code:
awk '
function AddFieldData(field)    {if ( (field, $field) in data) return                           # if field No. and field contents occurred (and registered) before, quit function
                                                                                                # data is an array whose elements'' contents don''t count, only its indices

                                 data[ field, FieldValue [field, ++count[field] ] = $field]     # this is quite complex (array elements are defined on first reference):
                                                                                                # register field No. and field contents as index into data array, and
                                                                                                #   at the same time, create FieldValue array element holding field contents and indexed
                                                                                                #   by field No. and an incremented counter for the field No. 
                                 if (count[field] > maxr) maxr = count[field]                   # keep max count across ALL fields (i.e. the No. of lines to be printed
                                                                                                #   e.g. for three unique values in field2 we need to print three lines)
                                 if (field > maxc) maxc = field                                 # keep max of NF across all lines
                                }

        {for(i = 1; i <= NF; i++) AddFieldData(i)                                               # main: call above function for ALL fields in ALL lines filling the necessary arrays
        }

END     {for (i = 1; i <= maxr; i++)                                                            # print maxr lines with
           for (j = 1; j <= maxc; j++)                                                          # maxc fields, each
                printf("%s%s", FieldValue[j, i], (j == maxc) ? ORS : OFS)                       # a sort of random distribution of unique field values ; after last (maxc)
                                                                                                # field print line feed (ORS) else field separator (OFS, space)
        }
'  file

# 5  
Old 10-29-2016
Quote:
Originally Posted by RudiC
Jumping in as Don Cragun seems to be logged out at the moment:

Hi RudiC,
Thank you for filling in while I was sleeping. I do have a couple of comments that may help explain what I was thinking when I wrote this code...

First, I'm not sure that I would say that:
Code:
	data[field, FieldValue[field, ++count[field]] = $field]

is complex code, but it certainly is dense. I just started by figuring out what I want to print after I had accumulated all of the data. Although the sample data provided happened to have the same number of unique values in each input column, there doesn't seem to be any reason to assume that that will be true with real input data. So, I need:
  1. a count of the number of unique values that have been found in each column (which is handled by the array count[column], with ++count[column] incrementing the count each time a new unique value is seen in that column),
  2. the unique values to be displayed in each output row and column (which is handled by the array FieldValue[column, output_row] = field_value),
  3. and a quick way to determine whether or not we have seen a given value in a given field before (which is handled by the array data[column, field_value], and as RudiC said we don't need any value to be assigned to elements of this array; we just need to know whether a given column and field_value pair have been entered into this array).
And, second, I wouldn't say that the output is in random order. The 1st row of the output will contain the 1st unique value found in each input column. The 2nd row of the output will contain the 2nd unique value found in each input column. Etc.

Note that if an output row does not have a value for a given column, that field will be an empty string (printed as just a field separator). This works because referencing an array element that has not been assigned a value will return an empty string as its value.
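That distinction between the two array idioms in the script is worth a tiny demo (my sketch, not from the post): `(k) in arr` tests membership without creating the element, while a bare reference like arr[k] creates the element with an empty-string value.

```shell
# Demo: `in` tests membership without side effects; a bare reference
# creates the element (value "") as a side effect.
out=$(awk 'BEGIN {
    if (("x") in a) print "present before reference"
    v = a["x"]                      # bare reference creates a["x"] = ""
    if (("x") in a) print "present after reference"
    print "value is [" v "]"
}')
printf '%s\n' "$out"
```

Only the second membership test succeeds, and the retrieved value is the empty string, which is exactly why the missing fields above print as bare field separators.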

Note also that since I used the awk variable OFS (instead of an explicit <space> as a field separator), you can change the character that appears in the output to separate fields by setting a different value for OFS before naming the file to be processed on the last line of the script.

Note also that there is no requirement that every input row contain the same number of fields, but the output will have the same number of fields in every row.

For example, if we had a file named numbers containing:
Code:
11 12 13 14
21
31 32 33 34 35
41 42 43

and we wanted the output field separator to be a <comma> instead of a <space>, we could change the last line of the script from:
Code:
}' names

to
Code:
}' OFS=, numbers

and get the output:
Code:
11,12,13,14,35
21,32,33,34,
31,42,43,,
41,,,,

# 6  
Old 10-30-2016
Hi.

I like modular code as well as generalized code, so as an experiment I wrote a Perl script that extracts the fields simultaneously to (effectively) separate files. Here is a simple example, and then the solution with the user data:
Code:
#!/usr/bin/env bash

# @(#) s3       Demonstrate display unique values from fields.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C cutofi

pl " Sample easy dataset data2:"
cat data2

pl " Results, field 2 extracted to file f2:"
cutofi data2
cat f2

pl " Results, field 2 piped into \"sort|uniq\" to file f2:"
cutofi -c='sort|uniq' data2
cat f2

pe
pe " Results, all files f? pasted together:"
paste f?

# Solve the user problem.
FILE=${1-data1}
E=expected-output.txt

pl " Input data file $FILE:"
cat $FILE

pl " Expected output:"
cat $E

pl " Results:"
cutofi -c='sort -u' -p=xx $FILE
paste xx? |
tee f1

pl " Verify results if possible:"
C=$HOME/bin/pass-fail
[ -f $C ] && $C || ( pe; pe " Results cannot be verified." ) >&2

exit 0

producing:
Code:
$ ./s3

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.6 (jessie) 
bash GNU bash 4.3.30
cutofi (local) 1.4

-----
 Sample easy dataset data2:
1 1 1 
1 1 1 
1 1 1 
2 3 4

-----
 Results, field 2 extracted to file f2:
1
1
1
3

-----
 Results, field 2 piped into "sort|uniq" to file f2:
1
3

 Results, all files f? pasted together:
1       1       1
2       3       4

-----
 Input data file data1:
joe allen ibm
joe smith ibm
joe allen google
joe smith google
rachel allen google

-----
 Expected output:
joe allen google
rachel smith ibm

-----
 Results:
joe     allen   google
rachel  smith   ibm

-----
 Verify results if possible:

-----
 Comparison of 2 created lines with 2 lines of desired results:
f1 expected-output.txt differ: char 4, line 1
 Failed -- files f1 and expected-output.txt not identical -- detailed comparison follows.
 Succeeded by ignoring whitespace differences.

There may be a small amount of parallelism because the separated fields are actually written to pipes; the pipeline command for each field is supplied by the --command (-c) option, resulting in child processes.

The final files may then be combined to produce output similar to that desired by the OP.

Of course, this can also be done with separate awk processes as well.
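Since cutofi is a local, experimental script that readers won't have, here is a sketch of the same idea using only standard tools: extract each field with awk, pipe it through sort -u into its own file, then paste the per-column files back together.

```shell
# Sketch of the cutofi idea with standard tools only: one file per column
# of unique values, then paste. Sample data recreated so this runs as-is.
printf '%s\n' 'joe allen ibm' 'joe smith ibm' 'joe allen google' \
    'joe smith google' 'rachel allen google' > data1
maxf=$(awk '{ if (NF > m) m = NF } END { print m }' data1)
i=1
while [ "$i" -le "$maxf" ]; do
    awk -v f="$i" '{ print $f }' data1 | sort -u > "xx$i"
    i=$((i + 1))
done
out=$(paste xx?)        # the xx? glob handles up to nine columns
printf '%s\n' "$out"
```

No parallelism here, and the per-field extraction makes one awk pass per column rather than one pass total, but for modest files the simplicity may be worth it.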

The documentation for the experimental code:
Code:
$ ./cutofi -h

SYNOPSIS
cutofi - cut (split) fields from text file to separate filters, files.

PURPOSE
Separate fields in text files by streaming them to a pipeline. This
allows some parallel processing, albeit at the slight risk of having
too many processes active.

This allows operations on the field as a separate dataset so
that, for example, each field could be sorted without regard to
the other fields.

        Maximum number of processes for a user in a specific distribution:
        OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
        Distribution        : Debian 8.6 (jessie) 

        $ ulimit -u
        78528

See also the Missing Text Routine:
  colwise            Cut a column, run command, paste results.
  http://www1.cuni.cz/~obo/textutils/ (2016.08)

USAGE
cutofi [options] -- [files]

options (read "|" as "or"):

-h | --help
  print this message and quit.

-d | --debug
  turn on debugging output.  Default is off.

-p | --prefix=str
  set the output file base string as "str".  Default is "f".

-f | --fields=list
  set specific fields to be split in comma-separated list: f1,f2, ... fn.
  Default is all strings.

-s | --separator=string
  set the separator to "string". Default is \s+, whitespace.

-c | --command=string
  set the output filter as "string", e.g. "sort|uniq".  Default
  is (essentially) "cat > prefix<i>" where i is the field number.

Best wishes ... cheers, drl

Last edited by drl; 10-31-2016 at 08:50 AM.. Reason: Correct minor typos.