Counting characters vertically


 
# 1  
Old 04-24-2012

I have a big file in the following format:

Code:
>A1
ATGCGG
>A2
TCATGC
>A3
-TGCTG

The number of characters is the same under each subheader, and the only possible characters are A, T, G, C and -.

I want to count the number of A's, T's, G's, C's and -'s vertically at every position, so that I get the following output:
Code:
1 A=1, T=1, C=0, G=0, -=1
2 A=0, T=2, C=1, G=0, -=0
3 A=1, T=0, C=0, G=2, -=0
4 A=0, T=1, C=2, G=0, -=0
5 A=0, T=1, C=0, G=2, -=0
6 A=0, T=0, C=1, G=2, -=0

Please let me know the best way to do this using awk.
# 2  
Old 04-24-2012
'Best', as always, is open to interpretation. Here's one way:

Code:
$ cat vpos.awk

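# Characters to report, in output order.
BEGIN { NCHARS = split("A T C G -", T) }

# Skip the ">name" header lines.
/^>/ { next }

# Count each character by position: A[position, character].
{
        if (length($0) > LMAX) LMAX = length($0)
        for (N = 1; N <= length($0); N++)       A[N, substr($0, N, 1)]++
}

# Print one line per position with the count of every character.
END {
        for (N = 1; N <= LMAX; N++)
        {
                printf("%d", N)
                for (M = 1; M <= NCHARS; M++)   printf("\t%s=%d", T[M], A[N, T[M]])
                printf("\n")
        }
}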

$ awk -f vpos.awk data

1       A=1     T=1     C=0     G=0     -=1
2       A=0     T=2     C=1     G=0     -=0
3       A=1     T=0     C=0     G=2     -=0
4       A=0     T=1     C=2     G=0     -=0
5       A=0     T=1     C=0     G=2     -=0
6       A=0     T=0     C=1     G=2     -=0

$
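
The script above relies on what the question states: every sequence line has the same length. If that is in doubt, a quick check beforehand may help. This is only a sketch; `check_lengths` is a hypothetical helper name, not part of the script above, and it treats the first sequence line as the reference length.

```shell
#!/bin/sh
# Hypothetical helper: report every sequence whose length differs from
# the first sequence in the file; exit status 1 if any are found.
check_lengths() {
    awk '
        /^>/ { name = substr($0, 2); next }   # remember the current header name
        !len { len = length($0) }             # first sequence sets the reference
        length($0) != len {
            printf("%s: length %d, expected %d\n", name, length($0), len)
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Usage: the sample data from the question, with the last sequence cut short.
tmp=$(mktemp)
printf '>A1\nATGCGG\n>A2\nTCATGC\n>A3\n-TGC\n' > "$tmp"
status=0
report=$(check_lengths "$tmp") || status=$?
echo "$report"
rm -f "$tmp"
```

With that sample it should flag only A3. An empty report and a zero exit status mean the per-position counts all cover the same number of sequences.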

# 3  
Old 04-24-2012
Thanks, it worked.

I have another problem as well, which came up after seeing this output.

This could be an entirely different question.
I have a file in the same format as above, but now with 4 positions. Each position can hold one of two expected characters, a type 1 character or a type 2 character:

position 1: T (type 1) or C (type 2)
position 2: C (type 1) or T (type 2)
position 3: A (type 1) or G (type 2)
position 4: T (type 1) or C (type 2)

Below is the input file:
Code:
>A1
TCAT
>A2
CTGC
>A3
TCGC
>A4
TTAT
>A5
TTTT

Based on this, I want to classify each sub-header (>A1 to >A5) in the above file so that I know which type it is.

The desired output (the part after # is not needed; it is just there to make things clearer):
Code:
A1 1 # all type 1 characters
A2 2 # all type 2 characters
A3 mixed # contains at least one type 1 and at least one type 2 character across the 4 positions
A4 mixed # contains at least one type 1 and at least one type 2 character across the 4 positions
A5 NA # at least one position has a character that is neither type 1 nor type 2

Please let me know how to do this in awk.

# 4  
Old 04-24-2012
Code:
$ cat types

1       T       1
1       C       2
2       C       1
2       T       2
3       A       1
3       G       2
4       T       1
4       C       2

$ cat vpos2.awk

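BEGIN {
        # Load the lookup table: T[position, character] = type.
        # This is read with the default FS, before FS is changed below.
        while((getline<"types")>0)      T[$1,$2]=$3
        FS=">"; OFS="\t"
}

# Header line: with FS=">", $2 is the name after the ">".
/^>/ {  NAMES[++L]=$2; next     }

# Sequence line: count how many characters fall into each type.
# A character with no entry in T is counted under the empty type "".
L {
        for(N=1; N<=length($0); N++)
        {
                C=substr($0,N,1);
                A[NAMES[L],T[N,C]]++;
        }
}

END {
        for(N=1; N<=L; N++)
        {
                if(A[NAMES[N],""])
                        print NAMES[N],"NA";
                else if(A[NAMES[N], 1] && A[NAMES[N], 2])
                        print NAMES[N],"mixed";
                else if(A[NAMES[N], 1])
                        print NAMES[N], 1;
                else    print NAMES[N], 2;
        }
}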

$ awk -f vpos2.awk data2

A1      1
A2      2
A3      mixed
A4      mixed
A5      NA

$

# 5  
Old 04-24-2012
Do I need to make a types file like you made?
# 6  
Old 04-24-2012
Yes, the script reads it.

If you really wanted, you could embed it into the awk script itself, like this:

Code:
T[1,"T"]=1;
T[1,"C"]=2;
...

in the BEGIN section instead, but when there's more than three lines of it, I tend to put it in a file. It's just better organization, with far less chance of typos than doing fiddly [] operations over and over.
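
For reference, here is a rough sketch of what the fully embedded variant might look like, assuming the same eight entries as the types file above. The `classify` wrapper function is a hypothetical name used only to make the example self-contained; the rest follows vpos2.awk.

```shell
#!/bin/sh
# Sketch: same classification as vpos2.awk, but with the table assigned
# in BEGIN instead of being read from the "types" file.
classify() {
    awk '
        BEGIN {
            # T[position, character] = type (entries copied from the types file)
            T[1,"T"]=1; T[1,"C"]=2
            T[2,"C"]=1; T[2,"T"]=2
            T[3,"A"]=1; T[3,"G"]=2
            T[4,"T"]=1; T[4,"C"]=2
            FS=">"; OFS="\t"
        }
        /^>/ { NAMES[++L]=$2; next }
        L {
            # Characters with no entry in T are counted under the empty type "".
            for(N=1; N<=length($0); N++) A[NAMES[L], T[N, substr($0,N,1)]]++
        }
        END {
            for(N=1; N<=L; N++) {
                if(A[NAMES[N], ""])                        print NAMES[N], "NA"
                else if(A[NAMES[N], 1] && A[NAMES[N], 2])  print NAMES[N], "mixed"
                else if(A[NAMES[N], 1])                    print NAMES[N], 1
                else                                       print NAMES[N], 2
            }
        }
    ' "$1"
}

# Usage with the five sample sequences from the question.
tmp=$(mktemp)
printf '>A1\nTCAT\n>A2\nCTGC\n>A3\nTCGC\n>A4\nTTAT\n>A5\nTTTT\n' > "$tmp"
result=$(classify "$tmp")
echo "$result"
rm -f "$tmp"
```

Eight assignments are still manageable, but every extra position adds two more lines to keep in sync by hand, which is the argument for the separate file.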