08-22-2008
Extremely Fast Text Feature Extraction for Classification and Indexing
HPL-2008-91R1
Extremely Fast Text Feature Extraction for Classification and Indexing - Forman, George; Kirshenbaum, Evan
Keyword(s): text mining, text indexing, bag-of-words, feature engineering, feature extraction, document categorization, text tokenization
Abstract: Most research in speeding up text mining involves algorithmic improvements to induction algorithms, and yet for many large scale applications, such as classifying or indexing large document repositories, the time spent extracting word features from texts can itself greatly exceed the initial trainin ...
6 More Discussions You Might Find Interesting
1. Shell Programming and Scripting
Hi All,
I have a file of the following format.
<?xml version='1.0' encoding='utf-8'?>
<tomcat-users>
<role rolename="tomcat"/>
<role rolename="role1"/>
<role rolename="manager"/>
<role rolename="admin"/>
<user username="tomcat" password="tomcat" roles="tomcat"/>
<user... (5 Replies)
Discussion started by: nua7
5 Replies
2. UNIX for Dummies Questions & Answers
The following script code works great for extracting 'postmaster' from a line of text stored in a variable named string:
string="PenaltyError:=554 5.7.1 Error, send your mail to postmaster@LOCALDOMAIN"
stuff=$( echo "$string" | cut -d@ -f1 | awk '{ print $NF }' )
echo $stuff
However, I need to be... (9 Replies)
Discussion started by: cleanden
9 Replies
3. Programming
Hi All,
I don't want any code for this problem, just suggestions:
I have a huge collection of text files (around 300,000) which look like this:
1.fil
orange
apple
dskjdsk
computer
skjks
The entire text collection (referenced above) has about 1 billion words.
I have created... (1 Reply)
Discussion started by: shoaibjameel123
1 Reply
4. UNIX for Dummies Questions & Answers
Hi everyone,
I have a large text file containing DNA sequences in fasta format as follows:
>someseq
GAACTTGAGATCCGGGGAGCAGTGGATCTC
CACCAGCGGCCAGAACTGGTGCACCTCCAG
GCCAGCCTCGTCCTGCGTGTC
>another seq
GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT
GACATTTTCATTACTACCATTTTGGAGTACA
>seq3450... (4 Replies)
Discussion started by: Fahmida
4 Replies
5. Shell Programming and Scripting
Hi everyone!
I'm writing a function in .bashrc to extract some text from a file. The file looks like this:
" random text
Begin CG step 1
random text
Begin CG step 2
...
Begin CG step 100
random text"
For a given number, let's say 70, I want all the text between "Begin CG... (4 Replies)
Discussion started by: radudownload
4 Replies
6. Shell Programming and Scripting
Dear All,
I am trying to extract text from a file containing cron entries.
cat /var/tmp/cron_backups/debmed_tmp
< * * * * * /bell
> * * * * * /belly
What I am trying to do is create two text files: one containing all entries that begin with < and another containing entries that begin with > .... (4 Replies)
Discussion started by: Junaid Subhani
4 Replies
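For the split described in the last thread, one possible approach is two anchored grep calls. This is only a sketch: the sample input is recreated in a temporary directory, and the output names removed.txt and added.txt are hypothetical, not taken from the original post.

```shell
# Recreate the two sample lines from the post in a scratch directory.
tmpdir=$(mktemp -d)
input="$tmpdir/debmed_tmp"
printf '%s\n' '< * * * * * /bell' '> * * * * * /belly' > "$input"

# Entries beginning with "<" go to one file, entries beginning with ">" to another.
# The output file names are hypothetical.
grep '^<' "$input" > "$tmpdir/removed.txt"
grep '^>' "$input" > "$tmpdir/added.txt"

cat "$tmpdir/removed.txt"
cat "$tmpdir/added.txt"
```

The `^` anchor restricts each match to the first character of the line, so a `<` or `>` appearing later in a cron command does not affect the split.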
xscreensaver-text
xscreensaver-text(1) XScreenSaver manual xscreensaver-text(1)
NAME
xscreensaver-text - prints some text to stdout, for use by screen savers.
SYNOPSIS
xscreensaver-text [--verbose] [--columns N] [--text STRING] [--file PATH] [--program CMD] [--url URL]
DESCRIPTION
The xscreensaver-text script prints out some text for use by various screensavers, according to the options set in the ~/.xscreensaver
file. This may dump the contents of a file, run a program, or load a URL.
OPTIONS
xscreensaver-text accepts the following options:
--columns N or --cols N
Where to wrap lines; default 72 columns.
--verbose or -v
Print diagnostics to stderr. Multiple -v switches increase the amount of output.
Command line options may be used to override the settings in the ~/.xscreensaver file:
--string STRING
Print the given string. It may contain % escape sequences as per strftime(3).
--file PATH
Print the contents of the given file. If --cols is specified, re-wrap the lines; otherwise, print them as-is.
--program CMD
Run the given program and print its output. If --cols is specified, re-wrap the output.
--url HTTP-URL
Download and print the contents of the HTTP document. If it contains HTML, RSS, or Atom, it will be converted to plain-text.
Note: this re-downloads the document every time it is run! It might be considered abusive for you to point this at a web server
that you do not control!
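
As an illustration of the options above, the following sketch calls the script with --file and --program, re-wrapping at 40 columns. The sample file and its contents are hypothetical, and the calls are guarded so the snippet does nothing harmful where xscreensaver-text is not installed.

```shell
# Hypothetical sample file; any readable text file would do.
sample=$(mktemp)
printf '%s\n' 'The quick brown fox jumps over the lazy dog, again and again and again.' > "$sample"

if command -v xscreensaver-text >/dev/null 2>&1; then
    # Print the file, re-wrapping its lines at 40 columns.
    xscreensaver-text --cols 40 --file "$sample"
    # Run a program and re-wrap its output at 40 columns.
    xscreensaver-text --cols 40 --program date
else
    echo 'xscreensaver-text is not installed; skipping the demo' >&2
fi
```

Without --cols, --file prints the lines as-is, as described above.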
ENVIRONMENT
HTTP_PROXY or http_proxy
to get the default HTTP proxy host and port.
BUGS
The RSS and Atom output is always ISO-8859-1, regardless of locale.
URLs should be cached, use "If-Modified-Since", and obey "Expires".
SEE ALSO
xscreensaver-demo(1), xscreensaver(1), fortune(1), phosphor(1), apple2(1), starwars(1), fontglide(1), dadadodo(1), webcollage(1),
http://www.livejournal.com/stats/latest-rss.bml,
http://twitter.com/statuses/public_timeline.atom,
driftnet(1), EtherPEG, EtherPeek
COPYRIGHT
Copyright (C) 2005 by Jamie Zawinski. Permission to use, copy, modify, distribute, and sell this software and its documentation for any
purpose is hereby granted without fee, provided that the above copyright notice appear in all copies and that both that copyright notice
and this permission notice appear in supporting documentation. No representations are made about the suitability of this software for any
purpose. It is provided "as is" without express or implied warranty.
AUTHOR
Jamie Zawinski <jwz@jwz.org>, 20-Mar-2005.
X Version 11 5.15 (28-Sep-2011) xscreensaver-text(1)