AWK - number of specified characters in a string

12-17-2010

Hello,

I'm new to using AWK and would be grateful for some basic advice to get me started.

I have a file consisting of 10 fields. Initially I wish to calculate the number of . , ~ and ^ characters in the 9th field ($9) of each line. This particular string also contains alphabetical characters. Let's call this value "count". I presume the length function would be used here, but I am unsure how to specify multiple criteria.

I'd then like to create a new field at the end of each line that equals count/$8.

Finally I'd like to save this as a new file.

Many thanks,

Olly

12-17-2010

samples

Always post some sample data. A couple of input lines and then the desired output. So here's a guess....

Code:

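{
    # split() returns the number of pieces, which is one more than
    # the number of separator characters found in $9
    count = split($9, a, "[.,~^]");
    printf ("%s %.3f\n", $0,  (count - 1) / $8);
    #print "nine = " $9 ", count = " count;
}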

Input:

Code:

1 2 3 4 5 6 7 8 test 10 11
1 2 3 4 5 6 7 8 test. 10 11
1 2 3 4 5 6 7 8 ^test, 10 11
1 2 3 4 5 6 7 8 t~~e..s^t, 10 11
1 2 3 4 5 6 7 8 9 10 11

Output:

Code:

1 2 3 4 5 6 7 8 test 10 11 0.000
1 2 3 4 5 6 7 8 test. 10 11 0.125
1 2 3 4 5 6 7 8 ^test, 10 11 0.250
1 2 3 4 5 6 7 8 t~~e..s^t, 10 11 0.750
1 2 3 4 5 6 7 8 9 10 11 0.000

Code:

awk -f test18.awk input > output

Because $8 is always 8 in this sample, the results are multiples of 1/8. If $8 varies, you might want to check that it is a non-zero number before dividing, like...

Code:

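{
    count = split($9, a, "[.,~^]");
    # Divide only when $8 is a non-zero number; otherwise print 0.000
    # instead of aborting with a division-by-zero error.
    ratio = 0.0;
    if ($8 ~ /^[-+]?[0-9]*\.?[0-9]+$/ && $8 != 0) {
        ratio = (count - 1) / $8;
    }
    printf ("%s %.3f\n", $0, ratio);
    #print "nine = " $9 ", count = " count;
}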
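An alternative that avoids the "count - 1" adjustment is to let gsub() do the counting, since it returns the number of substitutions it made. The sketch below is only an illustration under the same assumptions as the sample data above (string in $9, divisor in $8). The function name add_ratio is made up for this example, and printing 0.000 for a zero or non-numeric $8 is a choice of the sketch, not something from the thread.

```sh
# add_ratio: hypothetical helper name, not from the thread.
# Appends (number of . , ~ ^ characters in $9) / $8 to every line.
add_ratio() {
    awk '{
        s = $9                         # copy, so the original line is left untouched
        n = gsub(/[.,~^]/, "", s)      # gsub() returns how many characters it removed
        ratio = ($8 + 0 != 0) ? n / $8 : 0   # assumption: report 0 when $8 is 0 or not a number
        printf "%s %.3f\n", $0, ratio
    }' "$@"
}

printf '1 2 3 4 5 6 7 8 t~~e..s^t, 10 11\n' | add_ratio
# prints: 1 2 3 4 5 6 7 8 t~~e..s^t, 10 11 0.750
```

With a file, the call would look like add_ratio input > output.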

m1xram

12-17-2010

Thank you.

Thank you m1xram, your example is spot on, and the code executes the manipulation perfectly!

Olly

01-13-2011

split function multi-character separators

Hello again,

A slightly more complicated query - Can the split function in gawk count instances of multiple characters that act as separators? For example ".$" or ".^~".

A typical line of data for me looks something like:

F1 F2 F3 F4 Aa$.c$A..,.$,.,$.^~.

In $5 I'd like to count instances of "A" "A$" "." ".$" "^~." etc.

Many Thanks.

Olly

01-14-2011

instances of delimiters

There are some issues with the match strings you supplied. For one, some match strings are substrings of others. If you wanted to know the occurrences of both ".$" and ".", you'd have to subtract the count of the longer string from the count of the shorter one to get the occurrences of the short string on its own. Order is important.

Code:

BEGIN {
        m = "^~. .$ . A$ A";
        acnt = split(m, astr, " ");
}
{
        print $0;
        for (i = 1; i <= acnt; ++i) {
                mtch = astr[i];
                c = split($5, a, mtch);
                print "Match " i ": " astr[i] " = " c
        }
}

Output:

Code:

Match 1: ^~. = 1
Match 2: .$ = 2
Match 3: . = 8
Match 4: A$ = 1
Match 5: A = 3

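Matches that have a count of 1 did not split anything. If you change "c" to "c - 1" in the program, it will show the number of matches that caused splits. Matches 1 and 4 found nothing, and not because of where they sit in the string: split() treats a separator longer than one character as a regular expression, so the "^" in "^~." anchors to the start of the string and the "$" in "A$" anchors to the end. For the same reason ".$" matched only as "any character at the end of the string". To match these strings literally, the metacharacters have to be escaped, for example "\\^~\\.".

If you take match 3 - match 2 you'd have the "." matches that are not part of ".$" (once ".$" is matched literally). The same is true for match 5 and match 4 relative to "A".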
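Because split() interprets a multi-character separator as a regular expression, a literal count needs a different tool. One option, sketched here for illustration, is plain string search with index() and substr(). The helper name count_literal and the pats variable are invented for this example; the match strings are the ones from the program above.

```sh
# count_literal: hypothetical helper name. Counts non-overlapping literal
# occurrences of each string in "pats" within field 5.
count_literal() {
    awk -v pats='^~. .$ . A$ A' '
    function count(hay, needle,    n, p) {
        n = 0
        while ((p = index(hay, needle)) > 0) {   # index() does plain string search
            n++
            hay = substr(hay, p + length(needle))
        }
        return n
    }
    BEGIN { np = split(pats, pat, " ") }
    {
        for (i = 1; i <= np; i++)
            printf "%s = %d\n", pat[i], count($5, pat[i])
    }' "$@"
}

printf 'F1 F2 F3 F4 Aa$.c$A..,.$,.,$.^~.\n' | count_literal
# prints:
# ^~. = 1
# .$ = 1
# . = 7
# A$ = 0
# A = 2
```

The 7 dots include one inside ".$" and one inside "^~.", so 5 of them stand alone.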

OK, enough of that; let's talk about actual delimiters.

Suppose you wrote a regular expression (RE) to split $5. That RE could be used in the AWK match() function, which returns the starting position and sets RSTART and RLENGTH. Those values could be used to pull substrings from $5, which include both the split values and the actual delimiters. From the delimiter strings you could build a frequency list per line and/or per file. Is that what you really had in mind to do?

For me it would be a bit easier to create it in Perl. Here it is in AWK.

Input:

Code:

F1 F2 F3 F4 Aa$.c$A..,.$,.,$.^~.
F1 F2 F3 F4 one^~.two.$threeA$four.five$six

Code:

BEGIN {
    myRE = "(\\^~\\.)|(A\\$)|(\\.\\$)|[\\.\\$]";
    m = "^~. .$ . A$ A";
    acnt = split(m, astr, " ");
}
{
    print $0;
    # First show the pieces split() produces (the delimiters are discarded)
    c = split($5, a, myRE);
    for (i = 1; i <= c; ++i) {
        print "SPLIT " i ": " a[i]
    }
    # Then walk the string with match() to recover each delimiter as well
    str = $5;
    p = match(str, myRE);    # two-argument form works in any POSIX awk
    i = 1;
    while (p) {
        strsplit = substr(str, 1, RSTART - 1);
        strdelim = substr(str, RSTART, RLENGTH);
        ++freq[strdelim];    # unset elements start at 0
        str = substr(str, RSTART + RLENGTH);
        print "Match " i ": /" strsplit "/ /" strdelim "/ /" str "/";
        ++i;
        p = match(str, myRE);
    }
    print "Match " i ": /" str "/";
}
END {
    for (x in freq) {
        xstr = "/" x "/";
        printf("%10s %3d\n", xstr, freq[x]);
    }
}

Output:

Code:

F1 F2 F3 F4 Aa$.c$A..,.$,.,$.^~.
SPLIT 1: Aa
SPLIT 2: 
SPLIT 3: c
SPLIT 4: A
SPLIT 5: 
SPLIT 6: ,
SPLIT 7: ,
SPLIT 8: ,
SPLIT 9: 
SPLIT 10: 
SPLIT 11: 
Match 1: /Aa/ /$/ /.c$A..,.$,.,$.^~./
Match 2: // /./ /c$A..,.$,.,$.^~./
Match 3: /c/ /$/ /A..,.$,.,$.^~./
Match 4: /A/ /./ /.,.$,.,$.^~./
Match 5: // /./ /,.$,.,$.^~./
Match 6: /,/ /.$/ /,.,$.^~./
Match 7: /,/ /./ /,$.^~./
Match 8: /,/ /$/ /.^~./
Match 9: // /./ /^~./
Match 10: // /^~./ //
Match 11: //
F1 F2 F3 F4 one^~.two.$threeA$four.five$six
SPLIT 1: one
SPLIT 2: two
SPLIT 3: three
SPLIT 4: four
SPLIT 5: five
SPLIT 6: six
Match 1: /one/ /^~./ /two.$threeA$four.five$six/
Match 2: /two/ /.$/ /threeA$four.five$six/
Match 3: /three/ /A$/ /four.five$six/
Match 4: /four/ /./ /five$six/
Match 5: /five/ /$/ /six/
Match 6: /six/
      /.$/   2
     /^~./   2
       /./   6
      /A$/   1
       /$/   4

Well, that should do it. You could sort the frequency list, of course, but don't do it inside AWK: gawk's asort() throws away the indices and asorti() sorts the indices rather than the counts, neither of which you want here.
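As a rough illustration of sorting outside AWK, the END block can print "count delimiter" pairs and hand them to sort. The function name delim_freq is invented for this sketch, and the regex is a lightly simplified form of myRE from the program above.

```sh
# delim_freq: hypothetical helper name. Tallies the delimiters found in
# field 5 over all input lines and lists them, highest count first.
delim_freq() {
    awk '
    BEGIN { re = "(\\^~\\.)|(A\\$)|(\\.\\$)|[.$]" }
    {
        str = $5
        while (match(str, re)) {
            ++freq[substr(str, RSTART, RLENGTH)]   # count this delimiter
            str = substr(str, RSTART + RLENGTH)    # continue after it
        }
    }
    END { for (d in freq) printf "%d %s\n", freq[d], d }
    ' "$@" | sort -k1,1nr
}

printf 'F1 F2 F3 F4 Aa$.c$A..,.$,.,$.^~.\nF1 F2 F3 F4 one^~.two.$threeA$four.five$six\n' | delim_freq
# the first two lines printed are "6 ." and "4 $"
```

The order among delimiters with equal counts is left to sort.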

m1xram

01-17-2011

Thank you once again, m1xram, for your detailed and insightful ideas.

Your code does a great job and actually outputs much more information than I had expected, but that's because of my superficial problem description. Nonetheless, it will be a valuable resource for me.

In the interim I did some tinkering with my crude code, and it seems to work too - though outputting much less. What I ultimately was trying to achieve was to insert a new field that calculates the frequency of "," or "." in my string relative to "[a-zA-Z]" - BUT once all those same characters with either $ or ~^ next to them had been removed (those extra characters act as indicators/modifiers of their adjacent character).

Despite what I'd thought, split does seem to search for groups of characters if you enclose them in parentheses.

This is my input data:

Code:

153-0	29	A	M	85	85	60	6	CC..,.	gggggg 0.667 
153-0	37	A	W	83	83	60	6	TT..,.	geggdg 0.667 
153-0	85	G	R	80	80	60	6	AA..,.	aggggg 0.667 
153-0	98	G	R	129	129	60	6	A$A$A.,.	`geggg 0.500 
176-0	48	A	W	82	82	60	7	.$TT,..,	ggggegg 0.714

$8 is the string of interest. $7 is the number of "characters" in $8, where a character with a modifier like A$ is treated as one single character. $9 is the frequency of . or , in the $8 string.

The code I used was:

Code:

{
    consends = split($8, a, "((\\.\\$)|(\\,\\$)|(\\^~\\.)|(\\^~,))");
    allends = split($8, a, "[\\$]|[\\^~]");
    consall = split($8, a, "[\\.,]");
    readnotend = $7 - (allends - 1);
    if (readnotend == 0.000) {
        printf ("%s %s %s\n", $0, "1.000", readnotend)
    } else {
        printf ("%s %3.3f %s\n", $0, ((consall - 1) - (consends - 1)) / ($7 - (allends - 1)), readnotend);
    }
}

Which gave me this output:

Code:

153-0	29	A	M	85	85	60	6	CC..,.	gggggg 0.667 0.667 6
153-0	37	A	W	83	83	60	6	TT..,.	geggdg 0.667 0.667 6
153-0	85	G	R	80	80	60	6	AA..,.	aggggg 0.667 0.667 6
153-0	98	G	R	129	129	60	6	A$A$A.,.	`geggg 0.500 0.750 4
176-0	48	A	W	82	82	60	7	.$TT,..,	ggggegg 0.714 0.667 6

This adds the new frequencies, and the number of "non-modified" characters in the $8 string. You can see that where there are $ in the string the new & old frequencies differ as does the number of "characters" in the string, but otherwise they remain unchanged.

Cheers,

Olly

01-18-2011

Tricky

So..

A char is RE [a-zA-Z][\^\$\~]? (a letter with an optional modifier)

A fill is RE [\.,]

Freq = fills / (chars + fills)

Input:

Code:

F1 F2 F3 F4 Aa$.c$A..,.$,.,$.^~.
F1 F2 F3 F4 one^~.two.a$threeA$four.five$six
F1 F2 F3 F4 a$,b~c.d^efg
F1 F2 F3 F4 abc
F1 F2 F3 F4 ,,,,
F1 F2 F3 F4 789790

Code:

BEGIN {
    myRE = "(\\^~\\.)|(A\\$)|(\\.\\$)|[\\.\\$]";
    myChar = "[a-zA-Z][\\$\\~\\^]?";
    myFill = "[\\.\\,]";
    m = "^~. .$ . A$ A";
    acnt = split(m, astr, " ");
}
{
    print $0;
    str = $5;
    print str;
    chars = gsub(myChar, "+", str);
    print "/" str "/, found " chars;
    fills = gsub(myFill, "x", str);
    print "/" str "/, found " fills;
    if (chars + fills) {
        printf("freq = %.1f%%\n", fills * 100.0 / (fills + chars));
    } else {
        print "freq = undef";
    }
}

Output:

Code:

F1 F2 F3 F4 Aa$.c$A..,.$,.,$.^~.
Aa$.c$A..,.$,.,$.^~.
/++.++..,.$,.,$.^~./, found 4
/++x++xxxx$xxx$x^~x/, found 10
freq = 71.4%
F1 F2 F3 F4 one^~.two.a$threeA$four.five$six
one^~.two.a$threeA$four.five$six
/+++~.+++.+++++++++++.+++++++/, found 24
/+++~x+++x+++++++++++x+++++++/, found 3
freq = 11.1%
F1 F2 F3 F4 a$,b~c.d^efg
a$,b~c.d^efg
/+,++.++++/, found 7
/+x++x++++/, found 2
freq = 22.2%
F1 F2 F3 F4 abc
abc
/+++/, found 3
/+++/, found 0
freq = 0.0%
F1 F2 F3 F4 ,,,,
,,,,
/,,,,/, found 0
/xxxx/, found 4
freq = 100.0%
F1 F2 F3 F4 789790
789790
/789790/, found 0
/789790/, found 0
freq = undef


//, found 0
//, found 0
freq = undef

There's some extra code in there, ignore it. The input has illegal numbers and a blank line at the end, just for fun.

Question: Can a character have more than one modifier? If so then alter the RE myChar '?' to be '*'.
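To see what that change would do, here is a small sketch (the helper name mark_chars is hypothetical) that runs the same substitution with either quantifier. Under these assumptions the letter count is the same in both cases, because every match still needs exactly one letter; what differs is whether a second modifier is absorbed into the match or left behind in the string.

```sh
# mark_chars: hypothetical helper. $1 is the quantifier placed after the
# modifier class ("?" or "*"); prints the match count and the marked string.
mark_chars() {
    awk -v q="$1" '{
        s = $0
        n = gsub("[a-zA-Z][$~^]" q, "+", s)   # letter plus optional modifier(s)
        print n, s
    }'
}

printf 'a$^b\n' | mark_chars '?'    # prints: 2 +^+
printf 'a$^b\n' | mark_chars '*'    # prints: 2 ++
```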

You'll have to use the REs above with the match loop from the previous example if you want the actual strings. But, function gsub() is a good hack for what you described.
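One possible way to combine the two, for illustration only: join the char and fill patterns into a single RE and pull each token out with the match() loop. The name list_tokens and the combined pattern are assumptions of this sketch; anything that is neither a char nor a fill (digits, stray modifiers) is skipped.

```sh
# list_tokens: hypothetical helper name. Prints the chars (letter plus
# optional modifier) and fills (. or ,) of field 5, separated by spaces.
list_tokens() {
    awk '
    BEGIN { tok = "[a-zA-Z][$~^]?|[.,]" }
    {
        str = $5; out = ""
        while (match(str, tok)) {
            out = out (out == "" ? "" : " ") substr(str, RSTART, RLENGTH)
            str = substr(str, RSTART + RLENGTH)   # continue after this token
        }
        print out
    }' "$@"
}

printf 'F1 F2 F3 F4 a$,b~c.d^efg\n' | list_tokens
# prints: a$ , b~ c . d^ e f g
```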

Output should have some break between records, it's hard to read. Toss in a print "--------------" somewhere for clarity.

m1xram
