Hi All,
I have some 50,000 HTML files in a directory. The problem is that some of them are duplicate copies: wget crawled the same pages more than once and named the extra copies by appending .1, .2, .3, etc. after each crawl. For example, if index.html was crawled several times, the copies are named index.html.1, index.html.2, and so on, but all of them contain the same text.
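For example, something like this (just a rough sketch, assuming the suffixes really are plain numbers appended after .html) shows a sample of the numbered copies:

# list files whose names end in .html followed by a numeric suffix
find . -type f -name '*.html.[0-9]*' | sort | head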
I browsed through some posts here and found this:
https://www.unix.com/shell-programmin...directory.html
I then tried the script from that thread on 3 small "test" text files, each containing the same single word (this was just to test the code given there). It works for those 3 text files: it lists two of them as duplicates and leaves one out. I can then process the output file (duplicate.files) to get the file names and delete the duplicate files, i.e. the files with the same text.
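As I understand it, the idea behind that script is roughly the following checksum approach (this is only my own sketch, not the exact code from that thread; I am reusing its output name duplicate.files):

# checksum every file, sort so identical hashes sit next to each other, then
# keep only the groups of repeated hashes (the hash is the first 32 characters
# of each md5sum line), separated by blank lines
find . -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate > duplicate.files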
But when I run the same code on my HTML directory, it does not report any files with the same text, although duplicates really do exist there; I have checked some of them manually.
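For instance, this is how I would check whether two copies really are identical (cmp prints nothing and returns success when the files match byte for byte):

# no output from cmp means the two files have exactly the same content
cmp index.html index.html.1 && echo identical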
I am not sure where the problem is. I am using Linux with bash.