Very big text file - Too slow!


 
# 8  
Old 04-19-2011
Quote:
Originally Posted by fedonMan
Thanks, that works just fine, but what about the random part?
As far as I can understand your code, it reads the whole text instead of random lines.
When I asked you why they had to be random, you said they didn't, that's just the only way you knew how to read lines. So I didn't use random lines.

If you've changed your mind, you can get 10,000 random non-duplicate lines with
Code:
sort -R <inputfile | head -n 10000 | awk -v FS='>' '{
        # Get stuff after second >, and convert it all to uppercase
        $0=toupper($3);
        # Substitute every non-alpha char with space
        gsub(/[^A-Z]/, " ");
        # Split on spaces, into array A
        split($0, A, " ");
        # count words
        for(key in A)   count[A[key]]++;
}
END {
        # print words
        for(key in count)
                printf("%s\t%d\n", key, count[key]);
}' < filename

# 9  
Old 04-19-2011
Quote:
Originally Posted by Corona688
When I asked you why they had to be random, you said they didn't, that's just the only way you knew how to read lines. So I didn't use random lines.

If you've changed your mind, you can get 10,000 random non-duplicate lines with [the code in the previous post].

Hehe, I didn't mean that I don't want randomness; I just don't know how to do it without sed.

Hm, it displays an error saying -R is an invalid option. Also, if inputfile is the input file... what is the purpose of filename?
# 10  
Old 04-19-2011
Quote:
Originally Posted by fedonMan
Hm, it displays an error as -R is an invalid option.
Ordering lines randomly isn't so easy to do without it... What is your system?
Quote:
Also, if inputfile is the input file... what is the purpose of filename?
Good catch; you can leave off the trailing < filename.
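For reference, the corrected pipeline is the same awk program, minus the stray redirect, so awk reads from the pipe:
Code:
sort -R <inputfile | head -n 10000 | awk -v FS='>' '{
        # Get stuff after second >, and convert it all to uppercase
        $0=toupper($3);
        # Substitute every non-alpha char with space
        gsub(/[^A-Z]/, " ");
        # Split on spaces, into array A
        split($0, A, " ");
        # count words
        for(key in A)   count[A[key]]++;
}
END {
        # print words
        for(key in count)
                printf("%s\t%d\n", key, count[key]);
}'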
# 11  
Old 04-19-2011
Quote:
Originally Posted by Corona688
Ordering lines randomly isn't so easy to do without it... What is your system? Good catch; you can leave off the trailing < filename.
Well, right now I run bash on Mac OS X, so I guess it's natural for there to be some differences.

May I guess that -R randomizes instead of sorting?

# 12  
Old 04-19-2011
Right, you have BSD sort, not GNU sort. This is why you should always say what your system is from the start.
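(An aside, for what it's worth: on sorts without -R, a portable stand-in is the decorate-sort-undecorate trick: prefix each line with a random key, sort numerically, then strip the key. A sketch, assuming a POSIX awk:)
Code:
# Decorate each line with a random key, sort on it, then strip it
awk 'BEGIN { srand() } { printf("%.8f\t%s\n", rand(), $0) }' inputfile |
        sort -n |
        cut -f2- |
        head -n 10000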

This won't be TRULY random, but I think it should be reasonable. When requesting 1,000 random lines from a file with 10,000 lines, it processes the file in chunks of 10 lines, selecting 1 random line from each.

If you request more samples than half the number of lines, I think it'll start giving you *all* the lines.

Code:
# Quote $LINES below: BSD wc pads its output with leading spaces
LINES=`wc -l < random.txt`

awk -v FS='>' -v LINES="$LINES" -v COUNT=10000 '
BEGIN {
        SKIP=(LINES/COUNT)-1;
        srand();
        R=rand()*SKIP;
}

{
        if(M++ > R)
        {
                # Get stuff after second >, and convert it all to uppercase
                $0=toupper($3);
                # Substitute every non-alpha char with space
                gsub(/[^A-Z]/, " ");
                # Split on spaces, into array A
                split($0, A, " ");
                # count words
                for(key in A)   count[A[key]]++;

                # Cheat out of the loop
                R=SKIP+1;
        }

        if(M > SKIP)
        {
                M=0
                R=rand()*SKIP;
        }
}

END {
        # print words
        for(key in count)
                printf("%s\t%d\n", key, count[key]);
}' <random.txt

# 13  
Old 04-20-2011
Finally solved...
sort -R was way too slow for such a file, and the second solution didn't work at all :/

But I found the rl command, which does the same thing faster.
So the complete code:
Code:
FILE=$1
COUNT=$2
rl -c "$COUNT" "$FILE" | awk -v FS='>' '{
        # Get stuff after the second >, lowercased
        $0=tolower($3);
        # Substitute every non-alpha char with space
        gsub(/[^a-z]/, " ");
        # Split on spaces, into array A
        split($0, A, " ");
        # count words
        for(key in A)   count[A[key]]++;
}
END {
        # total up occurrences and distinct words
        for(key in count) {
                words += count[key];
                uniq++;
        }
        printf("Total words: %d\tUnique words: %d\n", words, uniq);
}'
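Usage would be something like (assuming the script is saved as, say, countwords.sh; the name is just for illustration):
Code:
sh countwords.sh random.txt 10000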


Thank you for your help!

Do you know how to pass the variables inside awk (words, uniq) outside its scope, so I can use them in the rest of the script?
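(For completeness: awk runs as a child process, so it can't set variables in the parent shell directly; the usual trick is to have awk print the numbers and let the shell read them. A sketch, assuming bash:)
Code:
# Have awk print the two totals, then read them into shell variables
read words uniq <<< "$(rl -c "$COUNT" "$FILE" | awk -v FS='>' '{
        $0=tolower($3);
        gsub(/[^a-z]/, " ");
        split($0, A, " ");
        for(key in A)   count[A[key]]++;
}
END {
        for(key in count) { words += count[key]; uniq++; }
        print words, uniq;
}')"
echo "Total words: $words   Unique words: $uniq"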
# 14  
Old 04-20-2011
Quote:
Originally Posted by fedonMan
sort -R was way too slow for such file
Wait a second: you didn't have sort -R, so what were you doing?
Quote:
and the second solution didn't work at all :/
In what way did it "not work"?
Quote:
but i found the rl command which does the same thing faster...
Thanks, that's a new one on me.