Find most and second most abundant value

04-10-2015


I would like to convert the most frequent duplet in each row to 1, the second most frequent duplet to -1, and everything else to 0. Please assist.

A duplet is one of AA, CC, GG, or TT only.

Code:

- C1 C2 C3 C4 C5
R1 AA AA - - CC
R2 AC AA AA CC CC
R3 AT AT TT TT TT
R5 AT TT AA AA AA

Desired result

Code:

- C1 C2 C3 C4 C5
R1 1 1 0 0 -1
R2 0 1 1 -1 -1
R3 0 0 1 1 1
R5 0 -1 1 1 1

My attempt

Code:

awk 'NR>1{ for (i=2;i<=NF;i++) { if ( substr($i,1,1)==substr($i,2) ) {x[i]++ ; for (c in x ) { if ( c > max ) c=max ; else if ( c < max ) max2=c } else $i="0"} END {print max, max2}' file

ritakadm


04-10-2015


I don't quite understand your output given your description and the input.
My understanding, given your sample input and counting '-' as 0, would be:

Code:

- C1 C2 C3 C4 C5
R1 1 1 0 0 -1
R2 -1 1 1 1 1
R3 -1 -1 1 1 1
R5 -1 -1 1 1 1

What happens if you have TWO strings with the SAME frequency (as in line R2)?
What happens if you have TWO strings that account for both the highest and the next-highest frequencies (as in line R3)?


vgersh99


04-10-2015


If two duplets have the same frequency, either one can be 1 and the other -1.

I don't want to consider any values other than the duplets AA, TT, GG, and CC. Values like AT and - should all be converted to 0 (and not counted at all).

In R2, since AA and CC have the same frequency, it could be

Code:

R2 AC AA AA CC CC

becomes

Code:

R2 0 1 1 -1 -1

or

Code:

R2 0 -1 -1 1 1

In R3 the only duplet is TT, so it is the most frequent. AT is not a duplet and becomes 0.

Code:

R3 AT AT TT TT TT

becomes

Code:

R3 0 0 1 1 1
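
To make the counting rule above concrete (only AA, CC, GG and TT are tallied; AT, AC, - and the like are ignored), here is a minimal sketch that prints the per-row counts. The function name `tally_duplets` and the `NAME=count` output format are made up for illustration and are not from any post in this thread.

```shell
# Hypothetical helper: print how often each duplet occurs in each data row.
tally_duplets() {
  awk '
    BEGIN { n = split("AA CC GG TT", names, " ") }  # fixed order for stable output
    NR > 1 {
      split("", cnt)                                 # clear the counts for this row
      for (i = 2; i <= NF; i++)
        if ($i == "AA" || $i == "CC" || $i == "GG" || $i == "TT")
          cnt[$i]++                                  # everything else is not counted
      line = $1
      for (k = 1; k <= n; k++)
        line = line " " names[k] "=" (cnt[names[k]] + 0)
      print line
    }
  ' "$@"
}

printf '%s\n' '- C1 C2 C3 C4 C5' 'R1 AA AA - - CC' 'R3 AT AT TT TT TT' | tally_duplets
# R1 AA=2 CC=1 GG=0 TT=0
# R3 AA=0 CC=0 GG=0 TT=3
```

With these tallies, R3 has a most frequent duplet (TT) but no second one, which is why nothing in that row becomes -1.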

ritakadm


04-10-2015


A bit verbose, and it can probably be done without sorting, but here goes.
Run it as awk -f rita.awk myFile, where rita.awk is:

Code:

function quicksort(data, left, right,    i, last)
{
    if (left >= right)  # do nothing if array contains fewer
        return          # than two elements

    quicksort_swap(data, left, int((left + right) / 2))
    last = left
    for (i = left + 1; i <= right; i++)
        if (count[data[i]]<count[data[left]])
            quicksort_swap(data, ++last, i)
    quicksort_swap(data, left, last)
    quicksort(data, left, last - 1)
    quicksort(data, last + 1, right)
}

# quicksort_swap --- helper function for quicksort, should really be inline

function quicksort_swap(data, i, j, temp)
{
    temp = data[i]
    data[i] = data[j]
    data[j] = temp
}
BEGIN {
   split("AA,TT,GG,CC", tA,",")
   for(i=1;i in tA;i++)
     goodA[tA[i]]
}

FNR==1 {print;next}
{
  split("",arr)
  split("",count)
  tally=0
  for(i=2;i<=NF;i++) {
    if (!($i in goodA)) continue
    if (!($i in count)) arr[++tally]=$i
    count[$i]++
  }
  quicksort(arr,1,tally)
  printf "%s", $1
  for(i=2;i<=NF;i++) {
    if ($i == arr[tally])
      $i=1
    else if ($i == arr[tally-1])
           $i=-1
         else $i=0
    printf("%s%d%s", OFS, $i, (i==NF)?ORS:"")
  }
}

This produces the following based on your sample input:

Code:

- C1 C2 C3 C4 C5
R1 1 1 0 0 -1
R2 0 -1 -1 1 1
R3 0 0 1 1 1
R5 0 -1 1 1 1


vgersh99


04-11-2015


Alternative without sorting:

Code:

awk '
  BEGIN {
    A["AA"]; A["CC"]; A["GG"]; A["TT"]
  } 
  NR>1 { 
    minkey=""; max=0
    for(i=2; i<=NF; i++) if($i in A) {
      A[$i]++
      if(max<A[$i]) {
        max=A[$i]
        maxkey=$i
      }
    }
    for(i in A) {
      if(i!=maxkey && !minkey && A[i]>0) minkey=i
      A[i]=""
    }
    for(i=2; i<=NF; i++) $i=($i==minkey)?-1:($i==maxkey)?1:0
  }
  1
' file

Output:

Code:

- C1 C2 C3 C4 C5
R1 1 1 0 0 -1
R2 0 1 1 -1 -1
R3 0 0 1 1 1
R5 0 -1 1 1 1


Scrutinizer


04-11-2015


@Scrutinizer: Nice, but it has a problem when a duplet with a larger count appears later in the line. For a line like

Code:

R1 AA AA AA AA - - CC CC TT TT TT

it yields

Code:

R1 1 1 1 1 0 0 -1 -1 0 0 0

Small modification

Code:

awk '
BEGIN   {A["AA"]; A["CC"]; A["GG"]; A["TT"] }

NR>1    {maxkey=mx2key=""; max=max2=0
         for(i=2; i<=NF; i++) if ($i in A) { A[$i]++ }
         for (i in A)   {if(max<A[i])   {max2=max
                                         mx2key=maxkey
                                         max=A[i]
                                         maxkey=i
                                        }
                         else if (max2<A[i])    {max2=A[i]; mx2key=i}
                         A[i]=""
                        }
         for (i=2; i<=NF; i++) $i=($i==mx2key)?-1:($i==maxkey)?1:0
        }
 
1
' file

would yield the correct

Code:

R1 1 1 1 1 0 0 0 0 -1 -1 -1
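
One thing to keep in mind: the order in which `for (i in A)` visits the keys is unspecified in awk, so when two duplets tie, which one ends up as 1 and which as -1 can differ between awk implementations. The OP said either result is fine, but if a reproducible result matters, one option is to walk the keys in a fixed order. A minimal sketch, assuming ties should be broken in the order AA, CC, GG, TT (that order and the function name `rank_duplets` are my own choices, not something from the thread):

```shell
# Hypothetical wrapper; the fixed key order AA CC GG TT is an assumption
# used only to make tie-breaking deterministic.
rank_duplets() {
  awk '
    BEGIN { n = split("AA CC GG TT", key, " ") }
    NR > 1 {
      split("", cnt)                           # clear the counts for this row
      for (i = 2; i <= NF; i++)
        for (k = 1; k <= n; k++)
          if ($i == key[k]) cnt[$i]++          # count valid duplets only
      maxkey = mx2key = ""; max = max2 = 0     # reset both keys for every row
      for (k = 1; k <= n; k++) {
        c = cnt[key[k]] + 0
        if (c > max)       { max2 = max; mx2key = maxkey; max = c; maxkey = key[k] }
        else if (c > max2) { max2 = c; mx2key = key[k] }
      }
      for (i = 2; i <= NF; i++)
        $i = ($i == maxkey) ? 1 : ($i == mx2key) ? -1 : 0
    }
    1
  ' "$@"
}

printf '%s\n' '- C1 C2 C3 C4 C5' 'R2 AC AA AA CC CC' 'R5 AT TT AA AA AA' | rank_duplets
# - C1 C2 C3 C4 C5
# R2 0 1 1 -1 -1
# R5 0 -1 1 1 1
```

Because a later key only replaces the current maximum when its count is strictly larger, the earlier key in the fixed order wins a tie, so R2 always comes out with AA as 1 and CC as -1.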


RudiC


04-11-2015


Hi RudiC, but I think the former output is correct, no? CC has the lowest frequency on the line (less than TT), so it should get -1.
--
OK, I see: the OP asked for the second most frequent duplet. I must have misread.
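
For anyone else tripping over the same distinction, here is a small sketch that prints both candidates for each row: the second most frequent duplet and the least frequent duplet among those that occur at all. The two only differ once a row contains three or more distinct duplets. The function name `second_vs_least` and the output format are invented for this illustration.

```shell
# Hypothetical helper: report the second most frequent and the least frequent
# duplet (among those present) for every data row.
second_vs_least() {
  awk '
    BEGIN { n = split("AA CC GG TT", key, " ") }
    NR > 1 {
      split("", cnt)
      for (i = 2; i <= NF; i++)
        for (k = 1; k <= n; k++)
          if ($i == key[k]) cnt[$i]++
      top = second = least = ""; max = max2 = min = 0
      for (k = 1; k <= n; k++) {
        c = cnt[key[k]] + 0
        if (c == 0) continue                   # duplet absent from this row
        if (c > max)       { max2 = max; second = top; max = c; top = key[k] }
        else if (c > max2) { max2 = c; second = key[k] }
        if (least == "" || c < min) { min = c; least = key[k] }
      }
      print $1, "second=" second, "least=" least
    }
  ' "$@"
}

printf '%s\n' '- header' 'R1 AA AA AA AA - - CC CC TT TT TT' | second_vs_least
# R1 second=TT least=CC
```

On that line TT (3 occurrences) is the second most frequent and CC (2 occurrences) is the least frequent, so only the first reading gives TT the -1.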


Scrutinizer
