Find most and second most abundant value


 
# 1  
Old 04-10-2015

I would like to convert the most frequent and second most frequent duplet in each row to 1 and -1 respectively, and everything else to 0. Please assist.

A duplet is only AA, CC, GG or TT.


Code:
- C1 C2 C3 C4 C5
R1 AA AA - - CC
R2 AC AA AA CC CC
R3 AT AT TT TT TT
R5 AT TT AA AA AA

Desired result


Code:
- C1 C2 C3 C4 C5
R1 1 1 0 0 -1
R2 0 1 1 -1 -1
R3 0 0 1 1 1
R5 0 -1 1 1 1

My attempt

Code:
awk 'NR>1{ for (i=2;i<=NF;i++) { if ( substr($i,1,1)==substr($i,2) ) {x[i]++ ; for (c in x ) { if ( c > max ) c=max ; else if ( c < max ) max2=c } else $i="0"} END {print max, max2}' file

# 2  
Old 04-10-2015
I don't quite understand your output given your description and the input.
My understanding given your sample input would be - counting '-' as 0-s:
Code:
- C1 C2 C3 C4 C5
R1 1 1 0 0 -1
R2 -1 1 1 1 1
R3 -1 -1 1 1 1
R5 -1 -1 1 1 1

What happens if you have TWO strings with the SAME frequency (as in line R2)?
What happens if you have TWO strings which are both the highest and the next-to-highest frequencies (as in line R3)?
# 3  
Old 04-10-2015
If two strings have the same frequency, then either one can be 1 and the other -1.

I don't want to consider any values other than the duplets AA, TT, GG and CC. Values like AT and - should all be converted to 0 (and not counted at all).

In R2, since AA and CC have the same frequency, it could be

Code:
R2 AC AA AA CC CC

becomes

Code:
R2 0 1 1 -1 -1

or

Code:
R2 0 -1 -1 1 1

In R3 there is only one duplet, TT, hence that is the most frequent. AT is not a duplet and becomes 0.

Code:
R3 AT AT TT TT TT

becomes

Code:
R3 0 0 1 1 1
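
The counting rule described above (only AA, CC, GG and TT are tallied, everything else is ignored) can be sketched as a small awk filter. This is only an illustration, not code from the thread: the function name duplet_counts and the KEY=count output format are made up for the example.

```shell
# duplet_counts (hypothetical helper): reads the table on stdin, skips the
# header line, and prints each row label followed by KEY=count for every
# duplet that occurs in that row.
duplet_counts() {
  awk '
    BEGIN { n = split("AA CC GG TT", keys, " ") }   # the only valid duplets
    NR > 1 {
      for (k = 1; k <= n; k++) cnt[keys[k]] = 0     # reset counts per row
      for (i = 2; i <= NF; i++)
        if ($i in cnt) cnt[$i]++                    # AT, AC, "-" are skipped
      line = $1
      for (k = 1; k <= n; k++)                      # fixed key order
        if (cnt[keys[k]] > 0) line = line " " keys[k] "=" cnt[keys[k]]
      print line
    }'
}

printf '%s\n' '- C1 C2 C3 C4 C5' 'R1 AA AA - - CC' 'R2 AC AA AA CC CC' \
              'R3 AT AT TT TT TT' 'R5 AT TT AA AA AA' | duplet_counts
# R1 AA=2 CC=1
# R2 AA=2 CC=2
# R3 TT=3
# R5 AA=3 TT=1
```

R2 shows the tie (AA and CC both occur twice), and R3 shows a row with a single duplet, so nothing there can receive -1.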

# 4  
Old 04-10-2015
A bit verbose - it can probably be done without sorting, but...
Run it as awk -f rita.awk myFile, where rita.awk is:
Code:
# quicksort --- sort data[left..right] in ascending order of count[data[i]]
# (uses the global array count; i and last are local variables)
function quicksort(data, left, right,    i, last)
{
    if (left >= right)  # do nothing if array contains fewer
        return          # than two elements

    quicksort_swap(data, left, int((left + right) / 2))
    last = left
    for (i = left + 1; i <= right; i++)
        if (count[data[i]] < count[data[left]])
            quicksort_swap(data, ++last, i)
    quicksort_swap(data, left, last)
    quicksort(data, left, last - 1)
    quicksort(data, last + 1, right)
}

# quicksort_swap --- helper function for quicksort, should really be inline

function quicksort_swap(data, i, j,    temp)
{
    temp = data[i]
    data[i] = data[j]
    data[j] = temp
}

BEGIN {
   # goodA holds the only values that count as duplets
   split("AA,TT,GG,CC", tA, ",")
   for (i = 1; i in tA; i++)
     goodA[tA[i]]
}

FNR==1 {print; next}   # pass the header line through unchanged
{
  split("", arr)       # arr: distinct duplets seen in this row
  split("", count)     # count: occurrences of each duplet in this row
  tally = 0
  for (i = 2; i <= NF; i++) {
    if (!($i in goodA)) continue
    if (!($i in count)) arr[++tally] = $i
    count[$i]++
  }
  quicksort(arr, 1, tally)
  # after sorting, arr[tally] is the most frequent duplet and
  # arr[tally-1] (if there is one) the second most frequent
  printf "%s", $1
  for (i = 2; i <= NF; i++) {
    if ($i == arr[tally])
      $i = 1
    else if (tally > 1 && $i == arr[tally-1])
      $i = -1
    else
      $i = 0
    printf("%s%d", OFS, $i)
  }
  printf "%s", ORS
}

This produces the following based on your sample input:
Code:
- C1 C2 C3 C4 C5
R1 1 1 0 0 -1
R2 0 -1 -1 1 1
R3 0 0 1 1 1
R5 0 -1 1 1 1


Last edited by vgersh99; 04-10-2015 at 08:01 PM..
# 5  
Old 04-11-2015
Alternative without sorting:

Code:
awk '
  BEGIN {
    A["AA"]; A["CC"]; A["GG"]; A["TT"]
  } 
  NR>1 { 
    minkey=""; max=0
    for(i=2; i<=NF; i++) if($i in A) {
      A[$i]++
      if(max<A[$i]) {
        max=A[$i]
        maxkey=$i
      }
    }
    for(i in A) {
      if(i!=maxkey && !minkey && A[i]>0) minkey=i
      A[i]=""
    }
    for(i=2; i<=NF; i++) $i=($i==minkey)?-1:($i==maxkey)?1:0
  }
  1
' file

Output:
Code:
- C1 C2 C3 C4 C5
R1 1 1 0 0 -1
R2 0 1 1 -1 -1
R3 0 0 1 1 1
R5 0 -1 1 1 1

# 6  
Old 04-11-2015
@Scrutinizer: Nice, but that has a problem when a duplet with a larger count comes later in the line. For a line like
Code:
R1 AA AA AA AA - - CC CC TT TT TT

it yields
Code:
R1 1 1 1 1 0 0 -1 -1 0 0 0

Small modification
Code:
awk '
BEGIN   {A["AA"]; A["CC"]; A["GG"]; A["TT"] }

NR>1    {maxkey=mx2key=""; max=max2=0   # reset per row so a stale maxkey cannot end up in mx2key
         for(i=2; i<=NF; i++) if ($i in A) { A[$i]++ }
         for (i in A)   {if(max<A[i])   {max2=max
                                         mx2key=maxkey
                                         max=A[i]
                                         maxkey=i
                                        }
                         else if (max2<A[i])    {max2=A[i]; mx2key=i}
                         A[i]=""
                        }
         for (i=2; i<=NF; i++) $i=($i==mx2key)?-1:($i==maxkey)?1:0
        }
 
1
' file

would yield the correct
Code:
R1 1 1 1 1 0 0 0 0 -1 -1 -1
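
Since awk does not define the order in which for (i in A) visits keys, the result for tied counts (as in R2) can differ between awk implementations. A possible variant, sketched here under the assumption that ties should be broken in the fixed order AA, CC, GG, TT (the thread only says either choice is acceptable), walks an indexed key list instead. The function name rank_duplets is made up for this example.

```shell
# rank_duplets (hypothetical helper): reads the table on stdin, passes the
# header through, and rewrites every other row as 1 / -1 / 0.
rank_duplets() {
  awk '
    BEGIN { n = split("AA CC GG TT", keys, " ") }   # fixed tie-break order
    NR == 1 { print; next }                         # header line unchanged
    {
      for (k = 1; k <= n; k++) cnt[keys[k]] = 0     # reset counts per row
      for (i = 2; i <= NF; i++) if ($i in cnt) cnt[$i]++
      best = second = ""; bestn = secondn = 0
      for (k = 1; k <= n; k++) {
        c = cnt[keys[k]]
        if (c > bestn) {                            # new maximum: old one drops to second
          second = best; secondn = bestn
          best = keys[k]; bestn = c
        } else if (c > secondn) {                   # new runner-up (also catches ties)
          second = keys[k]; secondn = c
        }
      }
      for (i = 2; i <= NF; i++)
        $i = ($i == best) ? 1 : (($i == second) ? -1 : 0)
      print
    }'
}

printf '%s\n' '- C1 C2 C3 C4 C5' 'R1 AA AA - - CC' 'R2 AC AA AA CC CC' \
              'R3 AT AT TT TT TT' 'R5 AT TT AA AA AA' | rank_duplets
# - C1 C2 C3 C4 C5
# R1 1 1 0 0 -1
# R2 0 1 1 -1 -1
# R3 0 0 1 1 1
# R5 0 -1 1 1 1
```

With this key order the tie in R2 always resolves to AA=1 and CC=-1, which happens to match the desired result in the first post.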

# 7  
Old 04-11-2015
Hi RudiC, but I think the former output is correct, no? CC has the lowest frequency on the line (lower than TT), so it should get -1...
--
OK, I see: the OP asked for the second most frequent duplet. I must have misread.

Last edited by Scrutinizer; 04-11-2015 at 09:18 AM..
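
The two readings discussed above ("least frequent" versus "second most frequent") only differ when a row contains three or more distinct duplets. A minimal sketch that prints both picks for a row, again with a fixed key order as an assumption and a made-up function name compare_picks:

```shell
# compare_picks (hypothetical helper): for each input row, report the most
# frequent, second most frequent and least frequent duplet.
compare_picks() {
  awk '
    BEGIN { n = split("AA CC GG TT", keys, " ") }
    {
      for (k = 1; k <= n; k++) cnt[keys[k]] = 0
      for (i = 2; i <= NF; i++) if ($i in cnt) cnt[$i]++
      top = runner = least = ""; topn = runnern = leastn = 0
      for (k = 1; k <= n; k++) {
        key = keys[k]; c = cnt[key]
        if (c == 0) continue                        # duplet absent from this row
        if (least == "" || c < leastn) { least = key; leastn = c }
        if (c > topn) { runner = top; runnern = topn; top = key; topn = c }
        else if (c > runnern) { runner = key; runnern = c }
      }
      print "most=" top, "second=" runner, "least=" least
    }'
}

echo 'R1 AA AA AA AA - - CC CC TT TT TT' | compare_picks
# most=AA second=TT least=CC
```

For this row TT (3 occurrences) is the second most frequent duplet and CC (2 occurrences) the least frequent, which is why the two outputs in the thread differ.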