Removing files with same text but different file names
# 1  
Old 08-05-2011

Hi All,

I have about 50,000 HTML files in a directory. The problem is that some of them are duplicates: wget crawled the same pages more than once and named each new copy by appending 1, 2, 3, etc. For example, if index.html was crawled several times, the copies are named index.html.1, index.html.2, and so on. All of these index.html files contain the same text.

I browsed through some posts here and found this:
https://www.unix.com/shell-programmin...directory.html

I then tried the script from that thread on 3 identical "test" text files, each containing one word, just to test the code. It worked: it reported two of the files and left one out. I can then process the output file (duplicate.files) to get the file names and delete the duplicates, i.e. the files with the same text.

But when I run the same code on my HTML directory, it does not report any files with the same text, even though duplicates do exist: I have checked some of them manually.

I am not sure where the problem is. I am using Linux with Bash.
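
A quick check that may help narrow this down: compare one known pair byte for byte. This is only a sketch. It assumes the copies might differ in some small way that is easy to miss by eye (trailing whitespace, a line that changes between crawls), which is a guess and not something confirmed above. `same_content` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether two files are byte-for-byte identical.
# cmp -s prints nothing and signals the result through its exit status.
same_content() {
    if cmp -s -- "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

For example, `same_content index.html index.html.1`. If it prints "different", `diff index.html index.html.1` shows which lines vary, and a content-based duplicate finder would be right not to group the two files.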
# 2  
Old 08-05-2011
Finddup - Find duplicate files by content, name

Code:
./finddup
Displays the files in the current directory that have identical content.
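
If finddup is not available, the same idea can be sketched with standard GNU tools by grouping files on their checksum. This is not how finddup itself works (that is not shown here), and `list_duplicates` and `extra_copies` are hypothetical names. It assumes GNU coreutils (`md5sum`, `uniq -w`), file names without newlines or backslashes, and that an identical MD5 sum means identical content.

```shell
#!/usr/bin/env bash
# Sketch only: list_duplicates and extra_copies are hypothetical helpers,
# not part of finddup. Assumes GNU find, md5sum, sort, uniq and awk.

# Print groups of files in a directory (default: current directory)
# that share the same MD5 checksum, one blank line between groups.
list_duplicates() {
    local dir="${1:-.}"
    find "$dir" -maxdepth 1 -type f -exec md5sum {} + \
        | LC_ALL=C sort \
        | uniq -w32 --all-repeated=separate
}

# Print every file except the first one of each checksum group,
# i.e. the candidates for deletion. md5sum prints a 32-character
# checksum and two separator characters, so the name starts at column 35.
extra_copies() {
    local dir="${1:-.}"
    find "$dir" -maxdepth 1 -type f -exec md5sum {} + \
        | LC_ALL=C sort \
        | awk 'seen[$1]++ { print substr($0, 35) }'
}
```

Run `list_duplicates .` first and look over the groups. `extra_copies .` then prints the names that could be removed; it is worth confirming a few pairs with `cmp` before deleting anything.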

This User Gave Thanks to thegeek For This Post: