Uniq count second column


 
# 1  
03-23-2015

Hello

How can I count the number of occurrences in this file?

Code:
ERR315389.1000156       CTTGAAGAAGAATTGAAAACTGTGACGAACAACTTGAAGTCACTGGAGGCTCAGGCTGAGAAGTACTCGCAGAAGGAAGACAGATATGAGGAAGAG
ERR315389.1000281       GCGTCTGGCAACAGCTTTGCAGAAGCTGGAGGAAGCTGAGAAGGCAGCAGATGAGAGTGAGAGAGGCATGAAAGTCATTGAGAGTCGAGCCCAAAA
ERR315389.1000504       GGTCATCATTGAGAGCGACCTGGAACGTGCAGAGGAGCGGGCTGAGCTCTCAGAAGGCAAATGTGCCGAGCTTGAAGAAGAATTGAAAACTGTGAC
ERR315389.1000637       GCTGGTGTCACTGCAAAAGAAACTCAAGGGCACCGAAGATGAACTGGACAAATACTCTGAGGCTCTCAAAGATGCCCAGGAGAAGCTGGAGCTGGC
ERR315389.1000647       CGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAAGGAGAACGCCTT
ERR315389.1000762       AAAGCATTGATGACTTAGAAGACGAGCTGTACGCTCAGAAACTGAAGTACAAAGCCATCAGCGAGGAGCTGGACCACGCTCTCAACGATATGACTT
ERR315389.1000854       AGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCATTGAGAGCGACC
ERR315389.1001141       AAAAAGGCCACCGATGCTGAAGCCGACGTAGCTTCTCTGAACAGACGCATCCAGCTGGTTGAGGAAGAGTTGGATCGTGCCCAGGAGCGTCTGGCA
ERR315389.1001145       GCAGAAGCTGGAGGAAGCTGAGAAGGCAGCAGATGAGAGTGAGAGAGGCATGAAAGTCATTGAGAGTCGAGCCCAAAAAGATGAAGAAAAAATGGA
ERR315389.1001393       CAGCTTTGCAGAAGCTGGAGGAAGCTGAGAAGGCAGCAGATGAGAGTGAGAGAGGCATGAAAGTCATTGAGAGTCGAGCCCAAAAAGATGAAGAAA

I tried cat file1 | uniq -cf1 > file2 to count the occurrences of the second column, but it ends up giving a count for the 1st column:

Code:
     11 ERR315389.1502254       CTCCGCCCGACCGCGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGA
     12 ERR315389.6544981       NTCCGCCCGACCGCGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGA
     24 ERR315389.4012310       CCGACCGCGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCG
     24 ERR315389.5696434       CGACCGCGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGA
     36 ERR315389.456083        CCGCGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAA
     12 ERR315389.894063        CGCGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAAG
     12 ERR315389.1554704       CTCGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAAG
     24 ERR315389.5277557       CGCGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAAG
     60 ERR315389.2681352       GCGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAAGG
    144 ERR315389.452044        CGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAAGGA

How can I get the count based on the second column, ignoring which name from the first column it comes with? The first column is just an arbitrary name for the second column.

For instance, this raw file:

Code:
ERR315389.1451218       CGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAAGGA
ERR315389.1640056       CGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAAGGA
ERR315389.3946553       CGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAAGGA
ERR315389.4137809       CGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAAGGA
ERR315389.452044        CGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAAGGA
ERR315389.4597314       CGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAAGGA
ERR315389.4896643       CGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAAGGA
ERR315389.5450210       CGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAAGGA
ERR315389.6159786       CGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAAGGA
ERR315389.7443074       CGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAAGGA

and the desired file, which counts the number of occurrences in the 2nd column, is here:

Code:
     10 ERR315389.1451218       CGCGCTCGCCCCGCCGCTCCTGCTGCAGCCCCAGGGCCCCTCGCCGCCGCCACCATGGACGCCATCAAGAAGAAGATGCAGATGCTGAAGCTCGACAAGGA


Thank you

Last edited by Wan Fahmi; 03-23-2015 at 08:57 AM..
# 2  
03-23-2015
uniq needs sorted input. man uniq:
Quote:
Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'.
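The adjacency rule is easy to check on a tiny input. This is only a sketch with made-up single-letter lines, not the thread's data; the variable names are placeholders:

```shell
# Made-up input: "a" occurs twice, but the two copies are not adjacent.
# Without sort, uniq reports three groups; with sort, only two.
unsorted=$(printf '%s\n' a b a | uniq -c | wc -l)
sorted=$(printf '%s\n' a b a | sort | uniq -c | wc -l)
echo "groups without sort: $unsorted, with sort: $sorted"
```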
Unfortunately, your sample has only unique second fields so it can't be tested against. Adding a few non-unique lines, this may lead to the desired result:
Code:
sort -k2 file | uniq -cf1
      1 ERR315389.1000637       GCTGGTGTCACTGCAAAAGAAACTCAAGGGCACCGAAGATGAACTGG. . .
      7 ERR315389.1000504       GGTCATCATTGAGAGCGACCTGGAACGTGCAGAGGAGCGGGCTGAGC. . .
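For a case small enough to check by eye, here is a sketch of the same pipeline on made-up two-column data (the read names and four-letter sequences are placeholders, not taken from the thread):

```shell
# Hypothetical sample: three reads share the sequence AAAA, one has CCCC.
# sort -k2 makes identical second fields adjacent; uniq -f1 skips the
# first field when comparing and -c prefixes each group with its count.
counts=$(printf '%s\t%s\n' \
    read.1 AAAA \
    read.2 CCCC \
    read.3 AAAA \
    read.4 AAAA | sort -k2 | uniq -c -f1)
printf '%s\n' "$counts"
```

Each group is printed with the first read name that sort happened to place at its head, which matches the request to treat the first column as an arbitrary label.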

This User Gave Thanks to RudiC For This Post:
# 3  
03-23-2015
Thanks! As you said, uniq only counts redundant lines when they are adjacent, so it has to work together with sort. I combined sort and uniq to get the desired output. Here is my code:

Code:
cat file | sort -k1 -u | sort -k2 | uniq -cf1 | sort -rn

The output is here:

Code:
    633 ERR315389.1008500       GAAGAATTGAAAACTGTGACGAACAACTTGAAGTCACTGGAGGCTCAGGCTGAGAAGTACTCGCAGAAGGAAGACAGATATGAGGAAGAGATCAAGGTCCT
    519 ERR315389.1012317       CGAAGATGAACTGGACAAATACTCTGAGGCTCTCAAAGATGCCCAGGAGAAGCTGGAGCTGGCAGAGAAAAAGGCCACCGATGCTGAAGCCGACGTAGCTT
    500 ERR315389.1004436       CTTGGATCGAGCTGAGCAGGCGGAGGCCGACAAGAAGGCGGCGGAAGACAGGAGCAAGCAGCTGGAAGATGAGCTGGTGTCACTGCAAAAGAAACTCAAGG
    481 ERR315389.1029324       GTTGGATCGTGCCCAGGAGCGTCTGGCAACAGCTTTGCAGAAGCTGGAGGAAGCTGAGAAGGCAGCAGATGAGAGTGAGAGAGGCATGAAAGTCATTGAGA
    464 ERR315389.10163 CTTGAAGTCACTGGAGGCTCAGGCTGAGAAGTACTCGCAGAAGGAAGACAGATATGAGGAAGAGATCAAGGTCCTTTCCGACAAGCTGAAGGAGGCTGAGA
    369 ERR315389.1010914       CCGAGCTTGAAGAAGAATTGAAAACTGTGACGAACAACTTGAAGTCACTGGAGGCTCAGGCTGAGAAGTACTCGCAGAAGGAAGACAGATATGAGGAAGAG
    365 ERR315389.1010286       CTGAGCTCTCAGAAGGCAAATGTGCCGAGCTTGAAGAAGAATTGAAAACTGTGACGAACAACTTGAAGTCACTGGAGGCTCAGGCTGAGAAGTACTCGCAG
    342 ERR315389.1005391       CTCGGGCTGAGTTTGCGGAGAGGTCAGTAACTAAATTGGAGAAAAGCATTGATGACTTAGAAGACGAGCTGTACGCTCAGAAACTGAAGTACAAAGCCATC
    296 ERR315389.1005033       AAAAAATGGAAATTCAGGAGATCCAACTGAAAGAGGCAAAGCACATTGCTGAAGATGCCGACCGCAAATATGAAGAGGTGGCCCGTAAGCTGGTCATCATT
    289 ERR315389.1001141       AAAAAGGCCACCGATGCTGAAGCCGACGTAGCTTCTCTGAACAGACGCATCCAGCTGGTTGAGGAAGAGTTGGATCGTGCCCAGGAGCGTCTGGCAACAGC



Thanks again!
# 4  
03-23-2015
Don't use cat, it's a waste of resources.
Are you sure you want sort -u? It may spoil your count results...
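A sketch of what the pipeline might look like with both remarks applied, assuming every input line should be counted. The temporary file and its contents are made up for illustration:

```shell
# Hypothetical input file standing in for the thread's "file".
file=$(mktemp)
printf '%s\t%s\n' read.1 AAAA read.2 CCCC read.3 AAAA > "$file"

# sort opens the file itself, so cat is not needed. With no -u stage,
# every input line reaches uniq and contributes to a count.
ranked=$(sort -k2 "$file" | uniq -c -f1 | sort -rn)
printf '%s\n' "$ranked"
rm -f "$file"
```

The final sort -rn only orders the groups from most to least frequent; it does not change the counts.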