Top Forums Shell Programming and Scripting Very big text file - Too slow! Post 302515266 by fedonMan on Tuesday 19th of April 2011 04:20:15 PM
Very big text file - Too slow!

Hello everyone,
suppose there is a very big text file (>800 MB) in which each line contains one Wikipedia article. Each article begins with a tag (<..>) containing its URL. The file currently holds 10^6 articles.

I want to take N random articles, strip all non-alphabetic characters, and count the total number of words and the number of unique words for further processing.

I have written the following script, which works fine except that it takes forever...
Code:
#!/bin/bash
FILENAME=$1
COUNT=0
DOCS=$(wc -l < "$FILENAME")

# Start from clean output files
rm -f words.txt words_uniq.txt

while [ "$COUNT" -le 1000 ]
do
      let COUNT++
      if [ $((COUNT % 100)) -eq 0 ]
            then echo "$COUNT texts processed"
      fi
      # $RANDOM alone tops out at 32767, far below 10^6 lines,
      # so combine two draws to cover the whole file
      RAND=$(( (RANDOM * 32768 + RANDOM) % DOCS + 1 ))
      # Extract one random line, drop the <...> tag, and normalise it
      # to one lower-case word per line
      sed -n "$RAND{p;q;}" "$FILENAME" | sed "s/<[^>]*>//g" | tr -cs "[:alpha:]" "\n" | tr "[:upper:]" "[:lower:]" >> words.txt
done

n=$(wc -w < words.txt)
sort -u words.txt > words_uniq.txt
V=$(wc -w < words_uniq.txt)
echo "V = $V, n = $n"

The above code (for 1000 random articles) takes around 2 minutes to run in my environment. Imagine the time needed for 10,000+ articles.

Is there a way to make this run faster? I am completely new to shell scripting and don't have much experience with how sed, tr, awk and other similar commands are supposed to be used efficiently.
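For comparison, the main cost above is that sed rescans the file from the top for every random article, so the loop does roughly N full passes. One alternative is to pick all N random line numbers up front and extract the matching lines in a single awk scan. A rough sketch, assuming GNU shuf is available (note it samples without replacement, a slight behavioural difference from the $RANDOM loop):

```shell
#!/bin/bash
# Single-pass sampling sketch (assumes GNU shuf; a sketch, not a tested drop-in).
FILENAME=$1
N=${2:-1000}
DOCS=$(wc -l < "$FILENAME")

# shuf picks N distinct line numbers; awk first reads them from stdin ("-")
# into an array (NR==FNR), then prints only those lines of the big file.
shuf -i 1-"$DOCS" -n "$N" |
  awk 'NR==FNR { want[$1]; next } FNR in want' - "$FILENAME" |
  sed 's/<[^>]*>//g' |
  tr -cs '[:alpha:]' '\n' |
  tr '[:upper:]' '[:lower:]' > words.txt

n=$(wc -w < words.txt)
sort -u words.txt > words_uniq.txt
V=$(wc -w < words_uniq.txt)
echo "V = $V, n = $n"
```

Because the file is scanned once rather than N times, the runtime should grow with file size rather than with N, which matters most when N reaches 10,000+.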

Thank you
 

Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.