Importing R cosine similarity to UNIX? Post: 302767747

7 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

importing database from unix to winnt

I am a Unix super-beginner (swaddled and weaned on Windows) and am trying to import a database from a Unix directory into WinNT. Can someone help me, or am I a hopeless case?

2. UNIX for Dummies Questions & Answers

Importing a unix file dump into a PC capable database

My development team has been trying to figure out how to import a unix data dump into SQL Server or convert it into an intermediate file format for several days. The data dump in question looks like this: $RecordID: 1<eof> $Version: 1<eof> Category: 1<eof> Poster: John Doe<eof>...

3. UNIX for Advanced & Expert Users

building flat files in unix and importing them from windows

What is a flat file in Unix? I have to import Unix flat files from a Windows-based program. My question is not about exporting from Unix, only about importing from Windows. How do I build those flat files? How do I create an export to Windows? How do I import from Windows?

4. Shell Programming and Scripting

awk? create similarity matrix by calculating overlaps between sets comprising of individual parts

Hi everyone I am very new at awk and to me the task I need to get done is very very challenging... Nevertheless, after admiring how fast and elegant issues are being solved here I am sure this is my best chance. I have a 2D data file (input file is a plain tab-delimited text file). The first...

5. Shell Programming and Scripting

Help with merge data based on similarity

Input_file data1 USA 100 ASE data3 UK 20 GWQR data4 Brazil 40 QWE data2 Scotland 60 THWE data5 USA 40 QWERR Reference_file USA 12312 34532 1324 Brazil 23321 231 3421 Scotland 342 34235 UK 231 141 England...

6. Shell Programming and Scripting

Help with sort list of file based on similarity

Input file (long list of input file): s_1_1_AABCD.txt s_1_1_ABADA.txt s_1_1_DSCBA.txt s_1_1_DSCCA.txt s_1_1_EATTG.txt s_1_1_FADSD.txt s_1_1_TGACC.txt s_1_1_TTAGG.txt s_1_2_AABCD.txt s_1_2_ABADA.txt s_1_2_DSCBA.txt s_1_2_DSCCA.txt s_1_2_EATTG.txt s_1_2_FADSD.txt ...

7. UNIX for Advanced & Expert Users

Vector base Cosine Similarity for two Matrices -- R in UNIX

Dear All, I am facing a problem and would be thankful if you can help. I hope this is the right place to ask this question. I have two matrices (rows=10, cols=3) and I want to get the cosine similarity between the rows (vectors) of each file; the result should be a (10,1) vector of cosine measures. I...


simhash

SIMHASH(1)						      General Commands Manual							SIMHASH(1)

NAME

       simhash - file similarity hash tool

SYNOPSIS

       simhash [ -s nshingles ] [ -f nfeatures ] [ file ]
       simhash [ -s nshingles ] [ -f nfeatures ] -w file ...
       simhash [ -s nshingles ] [ -f nfeatures ] -m file ...
       simhash -c hashfile hashfile

DESCRIPTION

       This  program  is  used to compute and compare similarity hashes of files.  A similarity hash is a chunk of data that has the property that
       some distance metric between files is proportional to some distance metric between the hashes.  Typically the similarity hash will be  much
       smaller than the file itself.

       The algorithm used by simhash is Manasse's "shingleprinting" algorithm (see BIBLIOGRAPHY below): take a hash of every m-byte subsequence of
       the file, and retain the n of these hashes that are numerically smallest.  The size of the intersection of the hash sets of two files gives
       a statistically good estimate of the similarity of the files as a whole.
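
       The steps above can be sketched in a few lines of Python.  This is an illustration only, not the simhash source: the function names
       shingle_hashes and similarity are hypothetical, CRC32 is used because the BUGS section mentions it, and the score (intersection size
       divided by union size of the retained hash sets) is an assumption, since this page does not state the exact formula simhash uses.

```python
import zlib


def shingle_hashes(data: bytes, shingle_size: int = 8, feature_count: int = 128) -> set[int]:
    """Hash every shingle_size-byte window of data and keep the smallest hashes.

    Hypothetical helper: the defaults mirror the -s and -f defaults in the man page.
    """
    if shingle_size < 1 or feature_count < 1:
        raise ValueError("shingle_size and feature_count must be positive")
    # One CRC32 per overlapping window; inputs shorter than a shingle yield no hashes.
    hashes = {
        zlib.crc32(data[i:i + shingle_size])
        for i in range(len(data) - shingle_size + 1)
    }
    # Retain only the feature_count numerically smallest hashes.
    return set(sorted(hashes)[:feature_count])


def similarity(a: set[int], b: set[int]) -> float:
    """Assumed score: shared retained hashes over all retained hashes (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    first = shingle_hashes(b"the quick brown fox jumps over the lazy dog")
    second = shingle_hashes(b"the quick brown fox jumps over the lazy cat")
    print(f"estimated similarity: {similarity(first, second):.3f}")
```

       Keeping only the smallest hashes is what makes the stored hash much smaller than the file: two similar files share most of their
       shingles, so they also tend to share most of their smallest shingle hashes.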

       In its default mode, simhash will compute the similarity hash of its file argument (or stdin) and write this hash to its standard output.
       When invoked with the -w argument (see below), simhash will compute similarity hashes of all of its file arguments in "batch mode".  When
       invoked with the -m argument (see below), simhash will compare all the given files using similarity hashes in "match mode".  Finally, when
       invoked with the -c argument (see below), simhash will report the degree of similarity between two hashes.

OPTIONS

       -f feature-count
              When computing a similarity hash, retain at most feature-count significant hashes from the target file.  The default is 128 features.
              Larger feature counts will give higher resolution in differences between files, will increase the size of the similarity hash
              proportionally to the feature count, and will increase similarity hash computation time slightly.

       -s shingle-size
              When computing a similarity hash, use hashes of samples consisting of shingle-size consecutive bytes drawn from the target file.
              The default is 8 bytes, the minimum is 4 bytes.  Larger shingle sizes will emphasize the differences between files more and will
              slow the similarity hash computation proportionally to the shingle size.

       -c hashfile1 hashfile2
	      Display the distance (normalized to the range 0..1) between the similarity hash stored in hashfile1 and the similarity  hash  stored
	      in hashfile2.

       -w file ...
	      Write the similarity hash of each of the file arguments to file.sim.

       -m file ...
	      Compute the similarity hash of each of the file arguments, and output a similarity matrix for those files.
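
       As a rough picture of what a match-mode comparison involves, the Python sketch below builds a pairwise similarity matrix for a list of
       in-memory byte strings.  It is not the simhash implementation: shingle_hashes and similarity_matrix are hypothetical names, the score
       (intersection over union of the retained CRC32 hashes) is an assumption, and the real output format of -m is not described on this page.

```python
import zlib


def shingle_hashes(data: bytes, shingle_size: int = 8, feature_count: int = 128) -> frozenset[int]:
    """Hypothetical helper: CRC32 of every window, keeping the smallest feature_count."""
    hashes = {
        zlib.crc32(data[i:i + shingle_size])
        for i in range(len(data) - shingle_size + 1)
    }
    return frozenset(sorted(hashes)[:feature_count])


def similarity_matrix(blobs: list[bytes], shingle_size: int = 8,
                      feature_count: int = 128) -> list[list[float]]:
    """Return a symmetric matrix of assumed similarity scores in the range 0.0 to 1.0."""
    # Hash each input once, then compare only the small retained hash sets.
    sigs = [shingle_hashes(b, shingle_size, feature_count) for b in blobs]
    n = len(sigs)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            union = sigs[i] | sigs[j]
            score = len(sigs[i] & sigs[j]) / len(union) if union else 1.0
            matrix[i][j] = matrix[j][i] = score
    return matrix


if __name__ == "__main__":
    samples = [
        b"importing a database from a unix directory",
        b"importing a database from a unix directory into winnt",
        b"vector based cosine similarity for two matrices",
    ]
    for row in similarity_matrix(samples):
        print(" ".join(f"{value:.2f}" for value in row))
```

       Each file is hashed once and every pair is compared on the retained hashes alone, so the cost of the matrix grows with the feature
       count rather than with the file sizes.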

AUTHOR

       Bart Massey <bart@cs.pdx.edu>

BUGS

       This currently uses CRC32 for the hashing.  A Rabin Fingerprint should be offered as a slightly slower but more reliable alternative.

       The shingleprinting algorithm works for text files and fairly well for other sequential filetypes, but does not work well for image files.
       The latter are both 2D and often undergo odd transformations.

BIBLIOGRAPHY

       Mark Manasse, Microsoft Research Silicon Valley.  Finding similar things quickly in large collections.
       http://research.microsoft.com/research/sv/PageTurner/similarity.htm

       Andrei  Z.  Broder.   On  the  resemblance  and containment of documents.  In Compression and Complexity of Sequences (SEQUENCES'97), pages
       21-29. IEEE Computer Society, 1998.  ftp://ftp.digital.com/pub/DEC/SRC/publications/broder/positano-final-wpnums.pdf

       Andrei Z. Broder.  Some applications of Rabin's fingerprinting method.  Published in R. Capocelli, A. De Santis, U. Vaccaro eds., Sequences
       II: Methods in Communications, Security, and Computer Science, Springer-Verlag, 1993.  http://athos.rutgers.edu/~muthu/broder.ps

								  3 January 2007							SIMHASH(1)