unihist(1) General Commands Manual unihist(1)
NAME
unihist - Generate a histogram of the characters in a Unicode file
SYNOPSIS
unihist ([option flags])
DESCRIPTION
unihist generates a histogram of the characters in its input, which must be encoded in UTF-8 Unicode. By default, for each character it
prints the frequency of the character as a percentage of the total, the absolute number of tokens in the input, the UTF-32 code in
hexadecimal, and, if the character is displayable, the glyph itself as UTF-8 Unicode. Command line flags allow unwanted information to be
suppressed. In particular, note that by suppressing the percentages and counts it is possible to generate a list of the unique characters in
the input.
Output is produced ordered by character code. To sort it in descending order of frequency, pipe the output into the command:
sort -k1 -n -r
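As a rough illustration of that pipeline, the sketch below runs the same sort command over a few made-up lines laid out in the column order described above (percentage, count, hexadecimal code, glyph). The sample values, the exact spacing and number formats, and the helper name sample_histogram are assumptions for the example, not actual unihist output.

```shell
# Hypothetical unihist-style lines: percentage, count, hex code, glyph.
# Real unihist output may use different spacing or number formats.
sample_histogram() {
    cat <<'EOF'
9.091 1 0x000061 a
45.455 5 0x000065 e
27.273 3 0x00006C l
18.182 2 0x000074 t
EOF
}

# Same command as in the text: numeric sort on the first column, reversed.
sample_histogram | sort -k1 -n -r
```

With these sample lines the "e" row is printed first and the "a" row last, because the percentage in the first column is compared numerically rather than as text.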
By default, unihist handles all of Unicode. To reduce memory usage and increase speed, it may be compiled so as to handle only the Basic
Multilingual Plane (plane 0) by defining BMPONLY.
COMMAND LINE FLAGS
-c Suppress printing of counts and percentages.
-g Suppress printing of glyphs.
-h Print usage information.
-u Suppress printing of the Unicode code as text.
-v Print version information.
SEE ALSO
uniname(1)
REFERENCES
Unicode Standard, version 5.0
AUTHOR
Bill Poser
billposer@alum.mit.edu
LICENSE
GNU General Public License
May, 2008 unihist(1)
unifuzz(1) General Commands Manual unifuzz(1)
NAME
unifuzz - Emit strings designed to test Unicode handling
SYNOPSIS
unifuzz ([option flags])
DESCRIPTION
unifuzz emits strings designed to test the ability of programs intended to accept Unicode input to handle unexpected input. These include:
characters from all Unicode ranges, Private Use characters, surrogates, undefined characters, non-characters, control characters, exotic
space characters, sequences violating normalization rules, unexpected sequences (e.g. a base character from one range followed by a
combining character from another range), and long sequences of combining characters. It can also generate very long lines, strings containing
embedded nulls, and ill-formed UTF-8.
COMMAND LINE FLAGS
-b Restrict the output to the Basic Multilingual Plane (Plane 0).
-g Do not emit specific characters.
-h Print usage information.
-l Emit very long lines.
-n Emit strings with embedded nulls.
-q Be quiet. Omit commentary.
-r <number>
Set the number of random characters to emit.
-S Scan ranges - emit a character from each range.
-s <seed>
Set the seed for the random number generator.
-u Emit ill-formed UTF-8.
-v Print version information.
The sequence of random characters is determined by a pseudorandom number generator, so the same sequence can be obtained by setting the
seed to the same value. If not set on the command line, a seed is chosen based on the time of execution. The seed used is included in the
output in a line of the form "Seed = NNNNNN" immediately preceding the random character sequence. Note that in order to obtain the same
sequence it is necessary to keep the same setting for restriction of output to the BMP.
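One way to act on the "Seed = NNNNNN" line is to pull the seed out of saved output so that a run can be repeated. The sketch below assumes the seed is printed as decimal digits and that the line begins with "Seed"; the function name extract_seed and the file name fuzz.log are hypothetical.

```shell
# Print the first seed recorded in a saved unifuzz output file.
# Assumes a line of the form "Seed = NNNNNN" with decimal digits.
# LC_ALL=C makes sed treat the file as bytes, since the output may
# contain ill-formed UTF-8.
extract_seed() {
    LC_ALL=C sed -n 's/^Seed = \([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Replay sketch (hypothetical file name and count); keep the -b setting
# and the -r value the same as in the original run:
#   unifuzz -s "$(extract_seed fuzz.log)" -r 100
```

Passing the recovered value back through -s should reproduce the random sequence only if the BMP restriction matches the original run, as noted above.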
REFERENCES
Unicode Standard, version 5.0
AUTHOR
Bill Poser
billposer@alum.mit.edu
LICENSE
GNU General Public License
April, 2008 unifuzz(1)