Script to sort large file with frequency Post: 302663281


Posted in Shell Programming and Scripting by gimley on Wednesday 27th of June 2012 11:34:42 PM

06-28-2012



Hello,
I have a very large file of around 2 million records which has the following structure:

Quote:

English characters#Hindi in Utf8 format
Mohit#मोहित
Shailesh#शैलेश
Bagde#बागडे
Mohit#मोहित
Shailesh#शैलेश
Goud#गौड
Mohit#मोहित
Shailesh#शैलेश
Ladava#लाडवा
Mohit#मोहित
Shailesh#शैलेश
Mehetre#मेहेत्रे
Mohit#मोहित

I have used the standard awk word-frequency program to count the entries:

Code:

# wordfreq.awk --- print list of word frequencies
{
    # remove punctuation
    #gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
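The awk program above prints the counts in whatever order awk walks the array, so the output is not sorted by frequency. A minimal sketch of one way to order it, assuming a POSIX shell with `awk`, `sort` and `mktemp` available (on Windows that would mean something like Cygwin); the sample file and the variable names `sample` and `result` are made up for illustration:

```shell
# Build a tiny sample in the same "English#Hindi" layout as the quoted data.
sample=$(mktemp)
printf '%s\n' 'Mohit#मोहित' 'Shailesh#शैलेश' 'Bagde#बागडे' \
              'Mohit#मोहित' 'Shailesh#शैलेश' 'Mohit#मोहित' > "$sample"

# Same counting logic as wordfreq.awk, followed by a sort on the count column:
# field 2 (tab-separated) numerically descending, ties broken by the key itself.
result=$(awk '{ for (i = 1; i <= NF; i++) freq[$i]++ }
              END { for (word in freq) printf "%s\t%d\n", word, freq[word] }' "$sample" |
         LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1)

printf '%s\n' "$result"
rm -f "$sample"
```

Because the sample lines contain no spaces, each whole `English#Hindi` line is counted as one field. This only changes the output order; the counts are still held in memory while awk runs.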

and a Perl program I found on the net:

Code:

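use strict;
use warnings;
use Data::Dumper;

# Count how often each whitespace-separated word occurs in the input.
my %seen = ();
while (<>)
{
    chomp;
    foreach my $word ( grep /\w/, split )
    {
        # $word =~ s/[. ,]*$//; # strip off punctuation, etc.
        $seen{$word}++;
    }
}

# Print the whole hash of counts as a Perl data structure.
$Data::Dumper::Terse = 1;
print Dumper \%seen;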

While both work beautifully for small files of around fifty thousand lines, they run out of memory when I execute them on the very large file.
I am working on a Windows Vista machine and have even tried increasing the paging file size to around 8 MB, but to no avail.
I believe there is a function in Perl where you can set a variable to 99999, which allows very large files to be processed. I have tried inserting that in the Perl program, but I still get an out-of-memory error.
Could anybody provide a solution that lets the program run on a very large file of around 9 MB?
Many thanks.

gimley
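For reference, one approach that avoids holding every distinct entry in an in-memory array is to let `sort` group identical lines and `uniq -c` count them. This is only a sketch under stated assumptions: it assumes the standard `sort`, `uniq` and `awk` utilities are available (for example through Cygwin on Windows), and relies on GNU `sort` spilling to temporary files on large input rather than keeping everything in RAM. The sample data and the variable names `input` and `ranked` are hypothetical.

```shell
# Tiny stand-in for the real 2-million-record file.
input=$(mktemp)
printf '%s\n' 'Mohit#मोहित' 'Shailesh#शैलेश' 'Bagde#बागडे' \
              'Mohit#मोहित' 'Shailesh#शैलेश' 'Mohit#मोहित' > "$input"

# 1. sort     : bring identical lines next to each other
# 2. uniq -c  : collapse each run of identical lines into "count line"
# 3. sort     : order by count, highest first
# 4. awk      : rewrite "count line" as "line<TAB>count"
# LC_ALL=C makes every tool compare raw bytes, so UTF-8 keys are never
# treated as equal by locale collation rules.
ranked=$(
  export LC_ALL=C
  sort "$input" |
    uniq -c |
    sort -k1,1nr -k2 |
    awk '{ count = $1; sub(/^ *[0-9]+ /, ""); printf "%s\t%d\n", $0, count }'
)

printf '%s\n' "$ranked"
rm -f "$input"
```

On the real data, `"$input"` would be replaced by the path to the large file. This counts whole lines, which matches the quoted sample where each record is a single `English#Hindi` pair.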



