Very big text file - Too slow!


 
# 1  
Old 04-19-2011

Hello everyone,
Suppose there is a very big text file (>800 MB) in which each line contains one article from Wikipedia. Each article begins with a tag (<..>) containing its URL. Currently there are 10^6 articles in the file.

I want to take N random articles, strip all non-alphanumeric characters, and count the total number of words and the number of unique words for further processing.

I have written the following script, which works fine except that it takes forever:
Code:
#!/bin/bash
FILENAME=$1
COUNT=0
DOCS=$(wc -l < $FILENAME)

if [ -f words.txt ] 
    then rm words.txt
fi
if [ -f words_uniq.txt ]
    then rm words_uniq.txt
fi

while [ $COUNT -le 1000 ] 
do
      let COUNT++
      if [ $(($COUNT % 100)) -eq 0 ]
            then echo "$COUNT texts processed"
      fi
      RAND=$((RANDOM % DOCS + 1))
      sed -n "$RAND{p;q;}" "$FILENAME" | sed "s/<.*>//g" | tr -cs "[:alpha:]" "[\n*]" | tr "[:upper:]" "[:lower:]" >> words.txt
done

n=$(wc -w < words.txt)
cat words.txt | sort | uniq > words_uniq.txt
V=$(wc -w < words_uniq.txt)
echo "V = $V, n = $n"

The above code (for 1000 random articles) takes around 2 minutes to run in my environment. Imagine the time needed for 10,000 articles or more.

Is there a way to make this run faster? I am completely new to shell scripting and don't have much experience with how sed, tr, awk and similar commands are supposed to be used efficiently.

Thank you
# 2  
Old 04-19-2011
sed | awk | tr | this | that | other is never going to be efficient. Every stage of the pipeline is a separate process, all of them run simultaneously, and your loop starts the whole set again for each article.

Use shell builtins whenever possible.

Could you show a line or two of the actual input data instead of just describing it?

What are you doing with sed "$RANDOM..."? What is RANDOM for?

---------- Post updated at 02:35 PM ---------- Previous update was at 02:25 PM ----------

How about:

Code:
awk -v FS='>' '{
        $2=toupper($2);
        split($2, A, " ");
        for(key in A)   count[A[key]]++;
}
END {
        for(key in count)
                printf("%s\t%d\n", key, count[key]);
}' < inputfile

# 3  
Old 04-19-2011
OK, here are the first 4 lines of the file. Very sorry about the ugliness.

Code:
<http://dbpedia.org/resource/Albedo> <http://dbpedia.org/ontology/abstract> "The albedo of an object is a measure of how strongly it reflects light from light sources such as the Sun. It is therefore a more specific form of the term reflectivity. Albedo is defined as the ratio of total-reflected to incident electromagnetic radiation. It is a unitless measure indicative of a surface's or body's diffuse reflectivity. The word is derived from Latin albedo \"whiteness\", in turn from albus \"white\", and was introduced into optics by Johann Heinrich Lambert in his 1760 work Photometria. The range of possible values is from 0 (dark) to 1 (bright). The albedo is an important concept in climatology and astronomy, as well as in computer graphics and computer vision. In climatology it is sometimes expressed as a percentage. Its value depends on the frequency of radiation considered: unqualified, it usually refers to some appropriate average across the spectrum of visible light. In general, the albedo depends on the direction and directional distribution of incoming radiation. Exceptions are Lambertian surfaces, which scatter radiation in all directions in a cosine function, so their albedo does not depend on the incoming distribution. In realistic cases, a bidirectional reflectance distribution function (BRDF) is required to characterize the scattering properties of a surface accurately, although albedos are a very useful first approximation."@en .
<http://dbpedia.org/resource/Anarchism> <http://dbpedia.org/ontology/abstract> "Anarchism is a political philosophy which considers the state undesirable, unnecessary and harmful, and instead promotes a stateless society, or anarchy. It seeks to diminish or even abolish authority in the conduct of human relations. Anarchists may widely disagree on what additional criteria are required in anarchism. The Oxford Companion to Philosophy says, \"there is no single defining position that all anarchists hold, and those considered anarchists at best share a certain family resemblance. \" There are many types and traditions of anarchism, not all of which are mutually exclusive. Strains of anarchism have been divided into the categories of social and individualist anarchism or similar dual classifications. Anarchism is often considered to be a radical left-wing ideology, and much of anarchist economics and anarchist legal philosophy reflect anti-statist interpretations of communism, collectivism, syndicalism or participatory economics. However, anarchism has always included an individualist strain supporting a market economy and private property, or unrestrained egoism that bases right on might. Others, such as panarchists and anarchists without adjectives, neither advocate nor object to any particular form of organization as long as it is not compulsory. Differing fundamentally, some anarchist schools of thought support anything from extreme individualism to complete collectivism. The central tendency of anarchism as a social movement have been represented by communist anarchism, with individualist anarchism being primarily a philosophical or literary phenomenon. Some anarchists fundamentally oppose all forms of aggression, supporting self-defense or non-violence, while others have supported the use of some coercive measures, including violent revolution and terrorism, on the path to an anarchist society."@en .
<http://dbpedia.org/resource/Achilles> <http://dbpedia.org/ontology/abstract> "In Greek mythology, Achilles was a Greek hero of the Trojan War, the central character and the greatest warrior of Homer's Iliad. Achilles also has the attributes of being the most handsome of the heroes assembled against Troy. Later legends (beginning with a poem by Statius in the first century AD) state that Achilles was invulnerable in all of his body except for his heel. Since he died due to an arrow shot into his heel, the \"Achilles' heel\" has come to mean a person's principal weakness."@en .
<http://dbpedia.org/resource/Abraham_Lincoln> <http://dbpedia.org/ontology/abstract> "Abraham Lincoln (February 12, 1809 \u2013 April 15, 1865) served as the 16th President of the United States from March 1861 until his assassination in April 1865. He successfully led his country through its greatest internal crisis, the American Civil War, preserving the Union and ending slavery. Before his election in 1860 as the first Republican president, Lincoln had been a country lawyer, an Illinois state legislator, a member of the United States House of Representatives, and twice an unsuccessful candidate for election to the U.S. Senate. As an outspoken opponent of the expansion of slavery in the United States,"@en .

The variable $RAND holds a random integer from 1 to the total number of lines in the file, generated with bash's $RANDOM variable. So with sed "$RAND..." I jump to that line.
# 4  
Old 04-19-2011
Quote:
The variable $RAND holds a random integer from 1 to the total number of lines in the file, generated with bash's $RANDOM variable. So with sed "$RAND..." I jump to that line.
I figured that much out, but I have no idea why you're using sed to snip out random lines. You could even get the same line more than once that way!
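
There is one more wrinkle with drawing line numbers this way: bash's $RANDOM only yields integers from 0 to 32767, so RANDOM % DOCS + 1 can never point past line 32768 of a 10^6-line file. A minimal sketch of the effect follows; the DOCS value is assumed from the description above, and big_random is a hypothetical helper that glues two draws together to widen the range (the final modulo still has a slight bias).

Code:
```shell
#!/bin/bash
# Sketch only: DOCS is an assumed line count (10^6 articles, as described).
DOCS=1000000

# Draw 2000 line numbers the same way the script does and track the largest.
max=0
for ((i = 0; i < 2000; i++)); do
    r=$((RANDOM % DOCS + 1))
    if ((r > max)); then
        max=$r
    fi
done
echo "largest line number drawn: $max"   # never exceeds 32768

# Hypothetical helper: two 15-bit draws combined into one 30-bit value,
# giving a range of 0..1073741823.
big_random() {
    echo $(( (RANDOM << 15) | RANDOM ))
}
echo "wide draw: $(( $(big_random) % DOCS + 1 ))"
```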
# 5  
Old 04-19-2011
Quote:
Originally Posted by Corona688
I figured that much out, but I have no idea why you're using sed to snip out random lines. You could even get the same line more than once that way!
Hm, to be honest, because I don't know any other way.
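
One other way, sketched here under assumptions rather than taken from the thread: choose all the line numbers first, then let a single awk process read the big file once and print only those lines. pick_lines is a hypothetical helper name, and the list file is assumed to hold one line number per line.

Code:
```shell
#!/bin/bash
# Sketch only: pick_lines is a hypothetical helper, not part of the original script.
# Usage: pick_lines NUMBERS_FILE BIG_FILE
pick_lines() {
    local nums=$1 file=$2
    awk -v nums="$nums" '
        BEGIN {
            # Load the wanted line numbers into an array before reading the data.
            while ((getline n < nums) > 0) want[n]
            close(nums)
        }
        # Print a record only if its line number was requested.
        NR in want
    ' "$file"
}
```

Lines come out in file order, and a number listed twice still prints its line only once, so this also avoids counting the same article twice.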
# 6  
Old 04-19-2011
Updated to fit your data:
Code:
awk -v FS='>' '{
        # Get stuff after second >, and convert it all to uppercase
        $0=toupper($3);
        # Substitute every non-alpha char with space
        gsub(/[^A-Z]/, " ");
        # Split on spaces, into array A
        split($0, A, " ");
        # count words
        for(key in A)   count[A[key]]++;
}
END {
        # print words
        for(key in count)
                printf("%s\t%d\n", key, count[key]);
}' < filename

Use nawk or gawk on non-linux systems.
# 7  
Old 04-19-2011
Thanks, that works just fine, but what about the random part?
As far as I can understand your code, it reads the whole text instead of random lines.
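
A possible way to get the random part back, as a sketch and not something posted in the thread: let shuf pick the lines and feed them straight into one awk process. This assumes GNU coreutils shuf is installed (it is not a POSIX tool). shuf -n picks lines without repetition, so no article is counted twice. count_sample is a hypothetical helper name; it lowercases like the original script and prints the same V/n summary.

Code:
```shell
#!/bin/bash
# Sketch only: count_sample is a hypothetical helper; assumes GNU shuf.
# Usage: count_sample FILE N
count_sample() {
    local file=$1 n=$2
    shuf -n "$n" "$file" |
    awk -v FS='>' '{
        # Keep the text after the second >, lowercased
        $0 = tolower($3)
        # Replace every non-letter with a space
        gsub(/[^a-z]/, " ")
        # Split on whitespace and count each word
        m = split($0, A, " ")
        for (i = 1; i <= m; i++) { count[A[i]]++; total++ }
    }
    END {
        # V = number of distinct words, n = total words
        for (w in count) V++
        printf("V = %d, n = %d\n", V, total)
    }'
}
```

For example, count_sample filename 1000 would sample 1000 articles. Note that the trailing @en language tag on each line is counted as the word "en", just as in the awk version above; strip it first if that matters.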