compare the similar files Post: 302432360

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Compare directories then move similar ones

I would like to know how to compare a listing of directories that begin with the same four numbers, i.e. /1234cat, /1234tree, /1234fish, and move all these directories into one directory. Thanks in advance.

2. Shell Programming and Scripting

Comparing similar columns in two different files

Hi, I have two text files. The first and the second file have data in the same format. For example, the first file has table_name1 column1 sum(column1) max(column1) min(column1) table_name1 column2 sum(column2) max(column2) min(column2) table_name1 column3 sum(column3) max(column3) min(column3) ...

3. Shell Programming and Scripting

Require compare command to compare 4 files

I have four files that I need to compare together. I know the "sdiff" and "comm" commands, but they compare only 2 files at a time. If I use sdiff, I have to compare each file with every other one, which will increase the amount of code. Please suggest if you know some commands which can...

4. Shell Programming and Scripting

concatenating similar files in a directory

Hi, I am new in unix. I have below requirement: I have two files at the same directory location File1.txt and File2.txt (just an example, real scenario we might have File2 and File3 OR File6 and File7....) File1.txt has : header1 record1 trailer1 File2.txt has: header2 record2...

5. Shell Programming and Scripting

Looking to find files that are similar.

Hello all, I have a server that is running AIX, running a tool that converts various printstreams (AFP/Metadata) to PDF. This is done using a rexx script and an off the shelf utility. Each report (there's around 125) uses a certain script file, it's basically a text file. I am trying...

6. UNIX for Dummies Questions & Answers

Finding similar strings between two files

Hi, I have a file1 like this: ABAT ABCA1 ABCC1 ABCC5 ABCC8 ABCE1 ABHD2 ABL1 CAMTA1 ACBD3 ACCN1 And I have a second file like this: chr19 46118590 46119564 MACS_peak_1499 3100.00 chr19 46122009 46148405 CYP2B7P1 -2445 chr1 7430312 7430990...

7. Shell Programming and Scripting

Editing files with sed or something similar

{ "AFafa": "FAFA","AFafa": "FAFA" "baseball":"soccer","wrestling":"dancing" "rhinos":"crocodiles","roles":"foodchain" } I need to insert a new line before the closing brackets "}" so that the final output looks like this: { "AFafa": "FAFA","AFafa": "FAFA"...

8. Solaris

Getting similar lines in two files

Hi, I need to compare the /etc/passwd files from 2 servers, and extract the users that are similar in these two files. I sorted the 2 files based on the user IDs (UID) (3rd column). I first sorted the files using the username (1st column), however when I use comm to compare the files there is no...

9. UNIX for Beginners Questions & Answers

How to compare two files in UNIX using similar to vlookup?

Hi, I want to compare same column in two files, if values match then display the column or display "NA". Ex : File 1 : 123 abc xyz pqr File 2: 122 aab fdf pqr fff qqq rrr

10. UNIX for Beginners Questions & Answers

Bash selection of files with similar name

Hi all, This is my first day on Linux shell!!! So, I am trying to write a script that will pick up pairs of files with the same name (not the same content) but that differ in one character (one is *R1, the other is *R2)... Something like: look at the files, whenever they are the...

similarity-tester

SIM(1)							      General Commands Manual							    SIM(1)

NAME

       sim - find similarities in C, Java, Pascal, Modula-2, Lisp, Miranda, or text files

SYNOPSIS

       sim_c [ -[defFiMnpPRsST] -r N -t N -w N -o F ] file ... [ / [ file ... ] ]
       sim_c ...
       sim_java ...
       sim_pasc ...
       sim_m2 ...
       sim_lisp ...
       sim_mira ...
       sim_text ...

DESCRIPTION

       Sim_c  reads  the  C files file ...  and looks for segments of text that are similar; two segments of program text are similar if they only
       differ in layout, comment, identifiers and the contents of numbers, strings and characters.  If any runs of sufficient  length  are  found,
       they are reported on standard output; the number of significant tokens in the run is given between square brackets.

       Sim_java  does the same for Java, sim_pasc for Pascal, sim_m2 for Modula-2, sim_mira for Miranda, and sim_lisp for Lisp.  Sim_text works on
       arbitrary text; it is occasionally useful on shell scripts.

       The program can be used for finding copied pieces of code in purportedly unrelated programs (with -s or -S), or for finding accidentally
       duplicated code in larger projects (with -f).

       If a / is present between the input files, the latter are divided into a group of "new" files (before the /) and a group of "old" files; if
       there is no /, all files are "new".  Old files are never compared to each other.
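
       The grouping rule above can be sketched in Python.  This is only an illustration of which file pairs end up being compared, not
       sim's own code: the function names and the new_vs_old_only flag (mirroring -S) are assumptions made for this example, and
       self-comparison is left out, as it would be under -s.

```python
from itertools import combinations, product

def split_groups(args):
    """Split an argument list at the first "/" into (new, old) file groups."""
    if "/" in args:
        cut = args.index("/")
        return list(args[:cut]), list(args[cut + 1:])
    # No separator: every file counts as "new".
    return list(args), []

def comparison_pairs(args, new_vs_old_only=False):
    """List the file pairs that would be compared (hypothetical helper).

    new_vs_old_only=True mimics -S: new files are compared to old files only.
    Old files are never paired with each other.
    """
    new, old = split_groups(args)
    pairs = [] if new_vs_old_only else list(combinations(new, 2))
    pairs += list(product(new, old))
    return pairs

pairs = comparison_pairs(["a.c", "b.c", "/", "x.c", "y.c"])
for left, right in pairs:
    print(left, "<->", right)
```

       With two new and two old files this lists five pairs: one new-new pair and four new-old pairs; the old-old pair never appears.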

       Since the similarity tester reads the files several times, it cannot read from standard input.

       There are the following options:

       -d     The output is in a diff(1)-like format instead of the default 2-column format.

       -e     Each file is compared to each file in isolation; this will find all similarities between all texts involved, regardless of
              duplicates.

       -f     Runs are restricted to segments with balancing parentheses, to isolate potential routine bodies (not in text).

       -F     The names of routines in calls are required to match exactly (not in text).

       -i     The  names  of the files to be compared are read from standard input, including a possible /; the file names need to be separated by
	      layout.  This allows a very large number of file names to be specified; it differs from the @ facility provided by some compilers in
	      that it handles file names only, and does not recognize option arguments.

       -M     Memory usage information is displayed on standard error output.

       -n     Similarities found are only summarized, not displayed.

       -o F   The output is written to the file named F.

       -p     The output is given in similarity percentages; see below; implies -e and -s.

       -P     As -p but more extensive; implies -e and -s.

       -r N   The minimum run length is set to N units; the default is 24 tokens, except in sim_text, where it is 8 words.

       -R     Directories in the input list are entered recursively, and all files they contain are involved in the comparison.

       -s     The contents of a file are not compared to itself (-s for "not self").

       -S     The contents of the new files are compared to the old files only - not between themselves.

       -t N   In combination with the -p option, sets the threshold (in percent) below which similarities will not be reported; the default is 1,
              except in sim_text, where it is 20.

       -T     A more terse and uniform form of output is produced, which may be more suitable for postprocessing.

       -w N   The page width used is set to N columns; the default is 80.

       --     (A secret option, which prints the input as the similarity checker sees it, and then stops.)

       The -p option results in lines of the form
	       F consists for x % of G material
       meaning that x % of F's text can also be found in G.  Note that this relation is not symmetric; it is in fact quite possible for one file
       to consist for 100 % of text from another file, while the other file consists for only 1 % of text of the first file, if their lengths
       differ enough.  Each file is reported only once in the position of the F in the above line.  This simplifies the identification of a set of
       files A[1] ... A[n], where the concatenation of these files is also present.  This restriction can be lifted by using the -P option
       instead.  A threshold can be set using the -t option; this option is ignored under -P.  Note that the granularity of the recognized text is
       still governed by the -r option or its default.
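
       The asymmetry of the percentage can be illustrated with a small Python sketch.  It is not sim's algorithm: it simply counts the
       words of one list that lie inside a run of consecutive words also present in the other list.  The function name and the run
       parameter (a stand-in for the -r granularity) are assumptions made for this example.

```python
def percent_contained(f_words, g_words, run=3):
    """Percentage of f_words lying inside a run of `run` consecutive
    words that also occurs somewhere in g_words (illustrative only)."""
    if not f_words:
        return 0.0
    # All runs of `run` consecutive words present in G.
    g_runs = {tuple(g_words[i:i + run]) for i in range(len(g_words) - run + 1)}
    covered = [False] * len(f_words)
    for i in range(len(f_words) - run + 1):
        if tuple(f_words[i:i + run]) in g_runs:
            for k in range(i, i + run):
                covered[k] = True
    return 100.0 * sum(covered) / len(f_words)

small = "the quick brown fox".split()
large = small + [f"w{i}" for i in range(36)]  # 40 words, 36 of them unique filler

print(percent_contained(small, large))  # 100.0: all of small occurs in large
print(percent_contained(large, small))  # 10.0: only 4 of large's 40 words occur in small
```

       The same pair of inputs yields 100 % in one direction and 10 % in the other, purely because their lengths differ.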

       Sim_text accepts  s p a c e d   t e x t  as normal text.

       The  program can handle UNICODE file names under Windows.  This is relevant only under the -R option, since there is no way to give UNICODE
       file names from the command line.

       Care has been taken to keep all internal processes linear in the length of the input, with the exception of the matching process, which is
       almost linear, using a hash table; various other tables are used for speed-up.  If, however, there is not enough memory for the tables,
       they are discarded in order of unimportance, under which conditions the algorithms revert to their quadratic nature.

EXAMPLES

       The call
	       sim_c *.c
       highlights duplicate code in the directory.  (It is useful to remove generated files first.)  A call
	       sim_c -f -F *.c
       can pinpoint them further.

       A call
	       sim_text -e -p -s new/* / old/*
       compares each file in new/* to each subsequent file in new/* and old/*, and if any pair has more than 20% in common, that fact is reported.
       Usually a similarity of 30% or more is significant; lower than 20% is probably coincidence; and in between is doubtful.
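
       For postprocessing, lines of the -p form shown earlier ("F consists for x % of G material") can be parsed and sorted into the
       three bands just mentioned.  The following Python sketch assumes that exact line shape; the regular expression, the function
       names and the sample lines are made up for illustration and are not output captured from sim.

```python
import re

# Assumed line shape, taken from the -p description: "F consists for x % of G material"
LINE = re.compile(r"^(?P<f>.+?) consists for (?P<pct>\d+) % of (?P<g>.+) material$")

def classify(pct):
    """Apply the rule of thumb: >= 30 significant, < 20 coincidence, else doubtful."""
    if pct >= 30:
        return "significant"
    if pct < 20:
        return "probably coincidence"
    return "doubtful"

def parse_report(text):
    """Return (F, G, percentage, verdict) tuples for every matching line."""
    results = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            pct = int(m.group("pct"))
            results.append((m.group("f"), m.group("g"), pct, classify(pct)))
    return results

# Hypothetical sample lines, not real sim output.
sample = """\
new/a.txt consists for 84 % of old/b.txt material
new/c.txt consists for 23 % of new/a.txt material
"""
report = parse_report(sample)
for f, g, pct, verdict in report:
    print(f"{f} vs {g}: {pct} % ({verdict})")
```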

       A call
	       sim_text -e -n -s -r100 new/* / old/*
       compares the same files, and reports large common segments.  Both approaches are good for plagiarism detection.

LIMITATIONS

       Repetitive input is the bane of similarity checking.  If we have a file containing 4 copies of similar text,
	   A1 A2 A3 A4
       where  the  numbers  serve  only to distinguish the similar copies, there are 7 similarities: A1=A2, A1=A3, A1=A4, A2=A3, A2=A4, A3=A4, and
       A1A2=A3A4, even discarding the overlapping A1A2A3=A2A3A4.  Of these, only 3 are meaningful: A1=A2, A2=A3, and A3=A4.  And for a table  with
       20  lines  similar  to each other, not unusual in a program, there are 715 similarities, of which at most 19 are meaningful.  Reporting all
       715 of them is clearly unacceptable.
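
       The figures 7 and 715 can be checked by brute force.  The Python sketch below counts every pair of equal-length, non-overlapping
       runs among n mutually similar segments; the function name is an assumption made for this example.

```python
def count_similarities(n):
    """Count pairs of equal-length, non-overlapping runs among n similar segments."""
    total = 0
    for length in range(1, n // 2 + 1):
        # `first` and `second` are the start indices of the two runs.
        for first in range(n - length + 1):
            for second in range(first + length, n - length + 1):
                total += 1
    return total

print(count_similarities(4))   # 7
print(count_similarities(20))  # 715
```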

       To remedy this, finding the similarities is performed as follows: For each position in the text, the largest segment is found, of which a
       non-overlapping copy occurs in the text following it.  That segment and its copy are reported and scanning resumes at the position just
       after the segment.  For the above example this results in the similarities A1A2=A3A4 and A3=A4, which is quite satisfactory, and for N
       similar segments roughly log N messages are given.
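
       A minimal Python sketch of this scan, assuming the input has already been reduced to a list of comparable tokens.  It is a plain
       brute-force rendering of the description above, without the hash table sim is said to use; the function name and the min_run
       parameter (a stand-in for the -r option) are assumptions made for this example.

```python
def greedy_matches(tokens, min_run=1):
    """Return (start, copy_start, length) triples found by the greedy scan."""
    n = len(tokens)
    reports = []
    pos = 0
    while pos < n:
        best = None
        # Try the longest possible segment first; its copy must not overlap it,
        # so the copy starts at pos + length or later.
        for length in range((n - pos) // 2, min_run - 1, -1):
            segment = tokens[pos:pos + length]
            for copy in range(pos + length, n - length + 1):
                if tokens[copy:copy + length] == segment:
                    best = (pos, copy, length)
                    break
            if best:
                break
        if best:
            reports.append(best)
            pos += best[2]   # resume just after the reported segment
        else:
            pos += 1         # nothing found here; move on one token
    return reports

# Four mutually similar segments A1 A2 A3 A4, modelled as four equal tokens.
print(greedy_matches(["A"] * 4))  # [(0, 2, 2), (2, 3, 1)]
```

       The two triples correspond to A1A2=A3A4 and A3=A4.  For 16 equal tokens the same function gives 4 reports, in line with the
       "roughly log N" estimate.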

       A drawback of this heuristic is that the output is sensitive to the order of the input files.  If we have two files
	   file1 = A1, file2 = A2A3
       then the order "file1 file2" gives "A1=A2, A2=A3" and "file2 file1" gives "A2=A3, A3=A1"; but both reports convey the same information.

BUGS

       Since it uses lex(1) on some systems, it may crash on any weird construction that overflows lex's internal buffers.

AUTHOR

       Dick Grune, Vrije Universiteit, Amsterdam; dick@dickgrune.com.

								    2012/05/02								    SIM(1)
