Counting characters vertically

04-24-2012

Registered User

175, 0

Join Date: Oct 2009

Last Activity: 31 January 2014, 7:06 PM EST

Posts: 175

Thanks Given: 18

Thanked 0 Times in 0 Posts

I have a large file in the following format:

Code:

>A1
ATGCGG
>A2
TCATGC
>A3
-TGCTG

The number of characters is the same under each subheader, and the only possible characters are A, T, G, C and -.

I want to count the number of A's, T's, G's, C's and -'s vertically at every position, so that I get the following output:

Code:

1 A=1, T=1, C=0, G=0, -=1
2 A=0, T=2, C=1, G=0, -=0
3 A=1, T=0, C=0, G=2, -=0
4 A=0, T=1, C=2, G=0, -=0
5 A=0, T=1, C=0, G=2, -=0
6 A=0, T=0, C=1, G=2, -=0

Please let me know the best way to do this using awk.

Lucky Ali


04-24-2012

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

'Best', as always, is open to interpretation. Here's one way:

Code:

$ cat vpos.awk

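# Symbols to report, in output order; NSYM is how many there are.
BEGIN { NSYM=split("A T C G -", T); }

# Skip the ">" header lines.
/^>/ { next }

# Count each character by column position; LMAX tracks the longest line.
{
        if((!LMAX) || (LMAX<length($0))) LMAX=length($0);
        for(N=1; N<=length($0); N++)    A[N,substr($0,N,1)]++;
}

# Print one row per position, with a count for every symbol.
END {   for(N=1; N<=LMAX; N++)
        {
                printf("%d", N);
                for(M=1; M<=NSYM; M++)  printf("\t%s=%d", T[M], A[N,T[M]]);
                printf("\n");
        }
}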

$ awk -f vpos.awk data

1       A=1     T=1     C=0     G=0     -=1
2       A=0     T=2     C=1     G=0     -=0
3       A=1     T=0     C=0     G=2     -=0
4       A=0     T=1     C=2     G=0     -=0
5       A=0     T=1     C=0     G=2     -=0
6       A=0     T=0     C=1     G=2     -=0

$
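If the comma-separated layout from the question matters, a variant along these lines might do. This is only a sketch, not a definitive version: the wrapper name count_columns is made up for illustration, it reads the data on standard input, and it uses the return value of split() as the loop bound.

```shell
# Hypothetical wrapper (the name is an assumption); reads the data on stdin.
count_columns() {
    awk '
    BEGIN { n = split("A T C G -", sym, " ") }   # n = number of symbols
    /^>/  { next }                                # skip header lines
    {
        len = length($0)
        if (len > max) max = len                  # longest sequence seen so far
        for (i = 1; i <= len; i++) cnt[i, substr($0, i, 1)]++
    }
    END {
        for (i = 1; i <= max; i++) {
            line = i
            # "+ 0" turns a missing count into 0
            for (j = 1; j <= n; j++)
                line = line (j == 1 ? " " : ", ") sym[j] "=" (cnt[i, sym[j]] + 0)
            print line
        }
    }'
}
```

It would be called as count_columns < data, and prints lines such as "1 A=1, T=1, C=0, G=0, -=1".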

Corona688


04-24-2012


Thanks, it worked.

I have another problem as well, which came up after seeing this output.

This may be an entirely different question.
I have a file in the same format as above, but now with 4 positions. Each position can hold one of two expected characters, a type 1 character or a type 2 character:

position 1: T (type 1) or C (type 2)
position 2: C (type 1) or T (type 2)
position 3: A (type 1) or G (type 2)
position 4: T (type 1) or C (type 2)

Below is the input file:

Code:

>A1
TCAT
>A2
CTGC
>A3
TCGC
>A4
TTAT
>A5
TTTT

Based on this, I want to classify every sub-header (>A1, >A2, >A3, >A4, >A5) in the file above, so that I know which type each one is.

The desired output (the part after # is not needed; it is only there to make things clearer):

Code:

A1 1 # all type 1 characters
A2 2 # all type 2 characters
A3 mixed # contains at least one type 1 and at least one type 2 character across the 4 positions
A4 mixed # contains at least one type 1 and at least one type 2 character across the 4 positions
A5 NA # some position has a character that is neither type 1 nor type 2

Please let me know how to do this in awk.

Lucky Ali


04-24-2012


Code:

$ cat types

1       T       1
1       C       2
2       C       1
2       T       2
3       A       1
3       G       2
4       T       1
4       C       2

$ cat vpos2.awk

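BEGIN {
        # Load the lookup table: T[position, character] = type
        while((getline<"types")>0)      T[$1,$2]=$3
        # From here on, split on ">" so that $2 of a header line is the name
        FS=">"; OFS="\t"
}

# Header line: remember the name, in input order
/^>/ {  NAMES[++L]=$2; next     }

# Sequence line: count this name's characters by type.
# Characters missing from the table are counted under the empty type "".
L {
        for(N=1; N<=length($0); N++)
        {
                C=substr($0,N,1);
                A[NAMES[L],T[N,C]]++;
        }
}

END {
        for(N=1; N<=L; N++)
        {
                if(A[NAMES[N],""])
                        print NAMES[N],"NA";
                else if(A[NAMES[N], 1] && A[NAMES[N], 2])
                        print NAMES[N],"mixed";
                else if(A[NAMES[N], 1])
                        print NAMES[N], 1;
                else    print NAMES[N], 2;
        }
}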

$ awk -f vpos2.awk data2

A1      1
A2      2
A3      mixed
A4      mixed
A5      NA

$

Corona688


04-24-2012


Do I need to make a types file like you made?

Lucky Ali


04-24-2012


Yes, the script reads it.

If you really wanted, you could embed the table in the awk script itself, like

Code:

T[1,"T"]=1;
T[1,"C"]=2;
...

in the BEGIN section instead, but when there's more than three lines of it, I tend to put that in files. It's just better organization, with far less chance of typos than doing fiddly [] operations over and over.
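For anyone who does want the table embedded, one way to avoid writing out each T[...] assignment by hand is to build it in BEGIN from a single string. The following is a rough sketch under assumptions of my own: the wrapper name classify_types and the "position:char:type" encoding are invented for illustration, and the rest mirrors the logic of vpos2.awk.

```shell
# Hypothetical wrapper (the name is an assumption); reads the data on stdin.
classify_types() {
    awk '
    BEGIN {
        # One "position:char:type" triple per entry (encoding is an assumption),
        # carrying the same content as the types file.
        n = split("1:T:1 1:C:2 2:C:1 2:T:2 3:A:1 3:G:2 4:T:1 4:C:2", e, " ")
        for (i = 1; i <= n; i++) { split(e[i], f, ":"); T[f[1], f[2]] = f[3] }
        OFS = "\t"
    }
    /^>/ { names[++L] = substr($0, 2); next }     # header: keep name without ">"
    L {
        # Unknown characters look up as "" and are counted under that key.
        for (i = 1; i <= length($0); i++)
            A[names[L], T[i, substr($0, i, 1)]]++
    }
    END {
        for (k = 1; k <= L; k++) {
            nm = names[k]
            if (A[nm, ""])                 print nm, "NA"
            else if (A[nm, 1] && A[nm, 2]) print nm, "mixed"
            else if (A[nm, 1])             print nm, 1
            else                           print nm, 2
        }
    }'
}
```

It would be called as classify_types < data2. The trade-off is the one described above: the table is now buried in the script, so changing it means editing code rather than a data file.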

Corona688
