Very big text file - Too slow!


 
# 8  
Old 04-19-2011
Quote:
Originally Posted by fedonMan
Thanks, that works just fine, but what about the random part?
As far as I can understand your code, it reads the whole text instead of random lines.
When I asked you why they had to be random, you said they didn't, that's just the only way you knew how to read lines. So I didn't use random lines.

If you've changed your mind, you can get 10,000 random non-duplicate lines with
Code:
sort -R <inputfile | head -n 10000 | awk -v FS='>' '{
        # Get stuff after second >, and convert it all to uppercase
        $0=toupper($3);
        # Substitute every non-alpha char with space
        gsub(/[^A-Z]/, " ");
        # Split on spaces, into array A
        split($0, A, " ");
        # count words
        for(key in A)   count[A[key]]++;
}
END {
        # print words
        for(key in count)
                printf("%s\t%d\n", key, count[key]);
}' < filename

# 9  
Old 04-19-2011
Quote:
Originally Posted by Corona688
When I asked you why they had to be random, you said they didn't, that's just the only way you knew how to read lines. So I didn't use random lines.

If you've changed your mind, you can get 10,000 random non-duplicate lines with [the code in the previous post].

Hehe, I didn't mean that I don't want randomness; I just don't know how to do it without sed.

Hm, it displays an error saying -R is an invalid option. Also, if inputfile is the input file... what is the purpose of filename?
# 10  
Old 04-19-2011
Quote:
Originally Posted by fedonMan
Hm, it displays an error as -R is an invalid option.
Ordering lines randomly isn't so easy to do without it... What is your system?
Quote:
Also, if inputfile is the input file... what is the purpose of filename?
Good catch; you can leave off the trailing < filename.
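For reference, the corrected pipeline is the same awk program, minus the stray redirect, so awk reads from the pipe:
Code:
sort -R <inputfile | head -n 10000 | awk -v FS='>' '{
        # Get stuff after second >, and convert it all to uppercase
        $0=toupper($3);
        # Substitute every non-alpha char with space
        gsub(/[^A-Z]/, " ");
        # Split on spaces, into array A
        split($0, A, " ");
        # count words
        for(key in A)   count[A[key]]++;
}
END {
        # print words
        for(key in count)
                printf("%s\t%d\n", key, count[key]);
}'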
# 11  
Old 04-19-2011
Quote:
Originally Posted by Corona688
Ordering lines randomly isn't so easy to do without it... What is your system? Good catch; you can leave off the trailing < filename.
Well, right now I run bash on Mac OS X, so I guess it's natural for there to be some differences.

May I guess that -R randomizes instead of sorting?

# 12  
Old 04-19-2011
Right, you have BSD sort, not GNU sort. This is why you should always say what your system is from the start.
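(An aside, for what it's worth: on sorts without -R, a portable stand-in is the decorate-sort-undecorate trick: prefix each line with a random key, sort numerically, then strip the key. A sketch, assuming a POSIX awk:)
Code:
# Decorate each line with a random key, sort on it, then strip it
awk 'BEGIN { srand() } { printf("%.8f\t%s\n", rand(), $0) }' inputfile |
        sort -n |
        cut -f2- |
        head -n 10000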

This won't be TRULY random, but I think it should be reasonable. When requesting 1,000 random lines from a file with 10,000 lines, it processes the file in chunks of 10 lines, selecting 1 random line from each.

If you request more samples than half the number of lines, I think it'll start giving you *all* the lines.

Code:
# Quote $LINES below: BSD wc pads its output with leading spaces
LINES=`wc -l < random.txt`

awk -v FS='>' -v LINES="$LINES" -v COUNT=10000 '
BEGIN {
        SKIP=(LINES/COUNT)-1;
        srand();
        R=rand()*SKIP;
}

{
        if(M++ > R)
        {
                # Get stuff after second >, and convert it all to uppercase
                $0=toupper($3);
                # Substitute every non-alpha char with space
                gsub(/[^A-Z]/, " ");
                # Split on spaces, into array A
                split($0, A, " ");
                # count words
                for(key in A)   count[A[key]]++;

                # Cheat out of the loop
                R=SKIP+1;
        }

        if(M > SKIP)
        {
                M=0
                R=rand()*SKIP;
        }
}

END {
        # print words
        for(key in count)
                printf("%s\t%d\n", key, count[key]);
}' <random.txt

# 13  
Old 04-20-2011
Finally solved...
sort -R was way too slow for such a file, and the second solution didn't work at all :/

But I found the rl command, which does the same thing faster.
So the complete code:
Code:
FILE=$1
COUNT=$2
rl -c "$COUNT" "$FILE" | awk -v FS='>' '{
        # Get stuff after the second >, lowercased
        $0=tolower($3);
        # Substitute every non-alpha char with space
        gsub(/[^a-z]/, " ");
        # Split on spaces, into array A
        split($0, A, " ");
        # count words
        for(key in A)   count[A[key]]++;
}
END {
        # total up occurrences and distinct words
        for(key in count) {
                words += count[key];
                uniq++;
        }
        printf("Total words: %d\tUnique words: %d\n", words, uniq);
}'
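Usage would be something like (assuming the script is saved as, say, countwords.sh; the name is just for illustration):
Code:
sh countwords.sh random.txt 10000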


Thank you for your help!

Do you know how to pass the variables inside awk (words, uniq) outside its scope, so I can use them in the rest of the script?
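(For completeness: awk runs as a child process, so it can't set variables in the parent shell directly; the usual trick is to have awk print the numbers and let the shell read them. A sketch, assuming bash:)
Code:
# Have awk print the two totals, then read them into shell variables
read words uniq <<< "$(rl -c "$COUNT" "$FILE" | awk -v FS='>' '{
        $0=tolower($3);
        gsub(/[^a-z]/, " ");
        split($0, A, " ");
        for(key in A)   count[A[key]]++;
}
END {
        for(key in count) { words += count[key]; uniq++; }
        print words, uniq;
}')"
echo "Total words: $words   Unique words: $uniq"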
# 14  
Old 04-20-2011
Quote:
Originally Posted by fedonMan
sort -R was way too slow for such file
Wait a second: you didn't have sort -R, so what were you doing?
Quote:
and the second solution didn't work at all :/
In what way did it "not work"?
Quote:
but i found the rl command which does the same thing faster...
Thanks, that's a new one on me.