Extremely Fast Text Feature Extraction for Classification and Indexing


08-22-2008



HPL-2008-91R1 Extremely Fast Text Feature Extraction for Classification and Indexing - Forman, George; Kirshenbaum, Evan
Keyword(s): text mining, text indexing, bag-of-words, feature engineering, feature extraction, document categorization, text tokenization
Abstract: Most research in speeding up text mining involves algorithmic improvements to induction algorithms, and yet for many large scale applications, such as classifying or indexing large document repositories, the time spent extracting word features from texts can itself greatly exceed the initial trainin ...
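To make the "bag-of-words" and "text tokenization" keywords concrete, here is a minimal Python sketch of single-pass feature extraction: it scans the text once, folds case, and hashes each word as its characters are read, so no intermediate token strings are built. This is only an illustration under stated assumptions, not the method from the report (the abstract above is truncated and does not describe it). The function name `hash_features`, the FNV-1a hash, and the `num_buckets` parameter are all choices made for this example.

```python
from collections import Counter

# 32-bit FNV-1a constants (a hash chosen for this sketch, not taken from the report).
FNV_OFFSET = 2166136261
FNV_PRIME = 16777619


def hash_features(text: str, num_buckets: int = 2**20) -> Counter:
    """Return a bag of hashed word features, built in one pass over the text.

    Words are maximal runs of alphanumeric characters, compared case-insensitively.
    Each word is hashed incrementally while scanning, then mapped to a bucket.
    """
    counts = Counter()
    h = FNV_OFFSET
    in_word = False
    for ch in text:
        if ch.isalnum():
            # Fold the lowercase form of this character into the running hash.
            for b in ch.lower().encode("utf-8"):
                h = ((h ^ b) * FNV_PRIME) & 0xFFFFFFFF
            in_word = True
        elif in_word:
            # A non-word character ends the current word: record its feature.
            counts[h % num_buckets] += 1
            h = FNV_OFFSET
            in_word = False
    if in_word:
        # Flush the final word if the text does not end with a separator.
        counts[h % num_buckets] += 1
    return counts


if __name__ == "__main__":
    features = hash_features("The cat, the CAT!")
    print(sum(features.values()), "word occurrences in", len(features), "buckets")
```

Because features are bucket numbers rather than strings, two different words can collide in the same bucket; a larger `num_buckets` makes that less likely at the cost of a wider feature space.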

Linux Bot




Text::PDF::TTFont(3pm)					User Contributed Perl Documentation				    Text::PDF::TTFont(3pm)

NAME

       Text::PDF::TTFont - Inherits from Text::PDF::Dict and represents a TrueType font within a PDF file.

DESCRIPTION

       A font consists of two primary parts in a PDF file: the header and the font descriptor. Whilst two fonts may share font descriptors, they
       will have their own header dictionaries including encoding and width information.

INSTANCE VARIABLES

       There are no instance variables beyond the variables which directly correspond to entries in the appropriate PDF dictionaries.

METHODS

       Text::PDF::TTFont->new($parent, $fontfname, $pdfname, %opts)

       Creates a new font resource for the given fontfile. This includes the font descriptor and the font stream. The $pdfname is the name by
       which this font resource will be known throughout a particular PDF file.

       All font resources are full PDF objects.

       $t->width($text)

       Measures the width of the given text according to the widths in the font.

       $t->trim($text, $len)

       Trims the given text to the given length (in per mille em), returning the trimmed text.

       $t->out_text($text)

       Indicates to the font that the text is to be output and returns the text to be output.

       $f->copy

       Copies the font object excluding the name, widths and encoding, etc.

TITLE

       Text::PDF::TTIOString - internal IO type handle for string output for font embedding. This code is ripped out of IO::Scalar, to avoid a
       direct dependency for so little. See IO::Scalar for details.

perl v5.8.8							    2006-09-09						    Text::PDF::TTFont(3pm)
