Full Discussion: compare the similar files
Forum: Shell Programming and Scripting. Post 302432360 by drl on Thursday 24th of June 2010 10:12:44 PM
Hi.

I would normalize the files first: remove empty lines, squeeze multiple spaces to one, remove the unprintable characters, etc., and then compare the files.
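A minimal sketch of that normalize-then-compare idea, in Python. The language, the names `normalize` and `similarity`, and the use of `difflib` are assumptions made for illustration; nothing above prescribes a particular tool. Treating "unprintable" as "outside printable ASCII" is also an assumption, and it would drop accented letters.

```python
import difflib
import re
import string

# Assumption: "printable" means printable ASCII, as defined by string.printable.
PRINTABLE = set(string.printable)


def normalize(text):
    """Return the cleaned, non-empty lines of text as a list of strings."""
    lines = []
    for raw in text.splitlines():
        # Remove characters that are not printable ASCII.
        kept = "".join(ch for ch in raw if ch in PRINTABLE)
        # Squeeze runs of whitespace to a single space and trim both ends.
        squeezed = re.sub(r"\s+", " ", kept).strip()
        # Skip lines that are empty after cleaning.
        if squeezed:
            lines.append(squeezed)
    return lines


def similarity(a, b):
    """Line-based similarity ratio in [0, 1] between two normalized texts."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b))
    return matcher.ratio()
```

With this sketch, two files that differ only in blank lines, spacing, or stray control characters compare as identical, so whatever differences remain are differences in content.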

You may also be able to get some guidance from SIM, the software and text similarity tester.
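To give a feel for the kind of figure SIM reports with its -p option ("F consists for x % of G material", per the man page reproduced further down), here is a rough Python sketch. It is not SIM's algorithm: it simply measures what share of the n-word runs in one text also occur in another. The names `shingles` and `percent_of` are hypothetical, and the default of 8 words mirrors the minimum run length the man page gives for sim_text.

```python
def shingles(text, n=8):
    """Return the set of all runs of n consecutive words in text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def percent_of(f_text, g_text, n=8):
    """Percentage of F's n-word runs that can also be found in G."""
    f_runs = shingles(f_text, n)
    if not f_runs:
        # F is shorter than n words, so there is nothing to measure.
        return 0.0
    g_runs = shingles(g_text, n)
    return 100.0 * len(f_runs & g_runs) / len(f_runs)
```

The measure is not symmetric: a short file can consist entirely of material from a long one while the long file contains only a small share of the short one, so `percent_of(f, g)` and `percent_of(g, f)` generally differ.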

Good luck ... cheers, drl
 

SIM(1)							      General Commands Manual							    SIM(1)

NAME
sim - find similarities in C, Java, Pascal, Modula-2, Lisp, Miranda, or text files

SYNOPSIS
sim_c [ -[defFiMnpPRsST] -r N -t N -w N -o F ] file ... [ / [ file ... ] ]
sim_c ...
sim_java ...
sim_pasc ...
sim_m2 ...
sim_lisp ...
sim_mira ...
sim_text ...

DESCRIPTION
Sim_c reads the C files file ... and looks for segments of text that are similar; two segments of program text are similar if they only differ in layout, comment, identifiers and the contents of numbers, strings and characters. If any runs of sufficient length are found, they are reported on standard output; the number of significant tokens in the run is given between square brackets.

Sim_java does the same for Java, sim_pasc for Pascal, sim_m2 for Modula-2, sim_mira for Miranda, and sim_lisp for Lisp. Sim_text works on arbitrary text; it is occasionally useful on shell scripts.

The program can be used for finding copied pieces of code in purportedly unrelated programs (with -s or -S), or for finding accidentally duplicated code in larger projects (with -f).

If a / is present between the input files, the latter are divided into a group of "new" files (before the /) and a group of "old" files; if there is no /, all files are "new". Old files are never compared to each other.

Since the similarity tester reads the files several times, it cannot read from standard input.

There are the following options:

-d     The output is in a diff(1)-like format instead of the default 2-column format.

-e     Each file is compared to each file in isolation; this will find all similarities between all texts involved, regardless of duplicates.

-f     Runs are restricted to segments with balancing parentheses, to isolate potential routine bodies (not in text).

-F     The names of routines in calls are required to match exactly (not in text).

-i     The names of the files to be compared are read from standard input, including a possible /; the file names need to be separated by layout. This allows a very large number of file names to be specified; it differs from the @ facility provided by some compilers in that it handles file names only, and does not recognize option arguments.

-M     Memory usage information is displayed on standard error output.

-n     Similarities found are only summarized, not displayed.
-o F   The output is written to the file named F.

-p     The output is given in similarity percentages; see below; implies -e and -s.

-P     As -p but more extensive; implies -e and -s.

-r N   The minimum run length is set to N units; the default is 24 tokens, except in sim_text, where it is 8 words.

-R     Directories in the input list are entered recursively, and all files they contain are involved in the comparison.

-s     The contents of a file are not compared to itself (-s for "not self").

-S     The contents of the new files are compared to the old files only - not between themselves.

-t N   In combination with the -p option, sets the threshold (in percents) below which similarities will not be reported; the default is 1, except in sim_text, where it is 20.

-T     A more terse and uniform form of output is produced, which may be more suitable for postprocessing.

-w N   The page width used is set to N columns; the default is 80.

--     (A secret option, which prints the input as the similarity checker sees it, and then stops.)

The -p option results in lines of the form

    F consists for x % of G material

meaning that x % of F's text can also be found in G. Note that this relation is not symmetric; it is in fact quite possible for one file to consist for 100 % of text from another file, while the other file consists for only 1 % of text of the first file, if their lengths differ enough.

Each file is reported only once in the position of the F in the above line. This simplifies the identification of a set of files A[1] ... A[n], where the concatenation of these files is also present. This restriction can be lifted by using the -P option instead. A threshold can be set using the -t option; this option is ignored under -P. Note that the granularity of the recognized text is still governed by the -r option or its default.

Sim_text accepts s p a c e d t e x t as normal text.

The program can handle UNICODE file names under Windows. This is relevant only under the -R option, since there is no way to give UNICODE file names from the command line.

Care has been taken to keep all internal processes linear in the length of the input, with the exception of the matching process which is almost linear, using a hash table; various other tables are used for speed-up. If, however, there is not enough memory for the tables, they are discarded in order of unimportance, under which conditions the algorithms revert to their quadratic nature.

EXAMPLES
The call

    sim_c *.c

highlights duplicate code in the directory. (It is useful to remove generated files first.) A call

    sim_c -f -F *.c

can pinpoint them further.

A call

    sim_text -e -p -s new/* / old/*

compares each file in new/* to each subsequent file in new/* and old/*, and if any pair has more than 20% in common, that fact is reported. Usually a similarity of 30% or more is significant; lower than 20% is probably coincidence; and in between is doubtful.

A call

    sim_text -e -n -s -r100 new/* / old/*

compares the same files, and reports large common segments. Both approaches are good for plagiarism detection.

LIMITATIONS
Repetitive input is the bane of similarity checking. If we have a file containing 4 copies of similar text,

    A1 A2 A3 A4

where the numbers serve only to distinguish the similar copies, there are 7 similarities: A1=A2, A1=A3, A1=A4, A2=A3, A2=A4, A3=A4, and A1A2=A3A4, even discarding the overlapping A1A2A3=A2A3A4. Of these, only 3 are meaningful: A1=A2, A2=A3, and A3=A4. And for a table with 20 lines similar to each other, not unusual in a program, there are 715 similarities, of which at most 19 are meaningful. Reporting all 715 of them is clearly unacceptable.

To remedy this, finding the similarities is performed as follows: For each position in the text, the largest segment is found, of which a non-overlapping copy occurs in the text following it. That segment and its copy are reported and scanning resumes at the position just after the segment. For the above example this results in the similarities A1A2=A3A4 and A3=A4, which is quite satisfactory, and for N similar segments roughly log N messages are given.

A drawback of this heuristic is that the output is sensitive to the order of the input files. If we have two files

    file1 = A1, file2 = A2A3

then the order "file1 file2" gives "A1=A2, A2=A3" and "file2 file1" gives "A2=A3, A3=A1"; but both reports convey the same information.

BUGS
Since it uses lex(1) on some systems, it may crash on any weird construction that overflows lex's internal buffers.

AUTHOR
Dick Grune, Vrije Universiteit, Amsterdam; dick@dickgrune.com.

2012/05/02							      SIM(1)
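The figures in the LIMITATIONS section of the man page above (7 similarities for 4 copies, 715 for 20, and two reports from the scanning heuristic) can be checked with a small Python sketch. This is not SIM's implementation: `count_similarities` and `greedy_report` are hypothetical names, each list element stands for one whole similar copy, and the search is brute force, not the hash-table matching mentioned in DESCRIPTION.

```python
from math import comb


def count_similarities(n):
    """Count pairs of equal, non-overlapping runs among n adjacent similar copies."""
    # Two non-overlapping runs of k copies each can be placed in
    # comb(n - 2*k + 2, 2) ways; sum over every feasible run length k.
    return sum(comb(n - 2 * k + 2, 2) for k in range(1, n // 2 + 1))


def greedy_report(tokens):
    """Apply the scanning heuristic; return (start, copy_start, length) triples."""
    reports = []
    n = len(tokens)
    pos = 0
    while pos < n:
        found = None
        # Try the longest candidate segment first; a copy must not overlap it.
        for length in range((n - pos) // 2, 0, -1):
            segment = tokens[pos:pos + length]
            for copy in range(pos + length, n - length + 1):
                if tokens[copy:copy + length] == segment:
                    found = (pos, copy, length)
                    break
            if found:
                break
        if found:
            reports.append(found)
            pos += found[2]   # resume just after the reported segment
        else:
            pos += 1          # nothing matches here; move on one position
    return reports


print(count_similarities(4), count_similarities(20))
print(greedy_report(["A"] * 4))
```

For four copies the heuristic yields the triples (0, 2, 2) and (2, 3, 1), which correspond to A1A2=A3A4 and A3=A4 in the man page's notation.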