Sponsored Content
Top Forums Shell Programming and Scripting Removing files with same text but different file names Post 302544818 by shoaibjameel123 on Friday 5th of August 2011 06:07:00 AM
Old 08-05-2011
Removing files with same text but different file names

Hi All,

I have some 50,000 HTML files in a directory. The problem is; some HTML files are duplicate versions that is wget crawled them two times and gave them file names by appending 1, 2, 3 etc after each crawl. For example, if the file index.html has been crawled several times, it has been named as index.html.1, index.html.2 etc. But all index.html files contain the same text.

I browsed through some posts here and found this:
https://www.unix.com/shell-programmin...directory.html

I then tried the above script by creating 3 similar "test" text files containing one word (this was just to test the code given there). It works for the 3 text files where it gave me the information of two and discarded one. I can then process the output text (duplicate.files) file to get file names and delete the duplicate files or files with same text.

But when I apply the above code on my HTML directory, it does not show any files with same text. But in reality there are duplicate files or files with same text as I have manually checked it.

I am not sure where the problem is? I am using Linux with BASH.
 

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

processing file names using text files

Hi, I have to perform an iterative function on a set of 10 files. After the first round the output files are named differently than the input files. examples input file name = xxxx1.yyy output file name = xxxx1_0001.yyy I need to rename all of the output files to the original input... (5 Replies)
Discussion started by: ligander
5 Replies

2. Shell Programming and Scripting

matching names in 2 text files

I have 2 text files like ________________________________ Company Name:yada yada ADDRESS:some where, CITY,STATE CONTACT PEOPLE:first_name1.last_name1,first_name2.last_name2,first_name3.last_name3 LEAD:first_name.last_name ________________________________ & Data file2 ... (1 Reply)
Discussion started by: rider29
1 Replies

3. UNIX for Dummies Questions & Answers

Removing blank spaces from text files in UNIX

Hello, I am an super newbie, so forgive my sheer ignorance. I have a series of text files formatted as follows (just showing the header and first few lines): mean_geo mean_raw lat lon 0.000 0 -70.616 163.021 0.000 0 -70.620 163.073 0.000 ... (8 Replies)
Discussion started by: vtoniolo
8 Replies

4. Shell Programming and Scripting

Removing matching text from multiple files with a shell script

Hello all, I am in need of assistance in creating a script that will remove a specified block of text from multiple .htaccess files. (roughly 1000 files) I am attempting to help with a project to clean up a linux server that has a series of unwanted url rewrites in place, as well as some... (4 Replies)
Discussion started by: boxx
4 Replies

5. Shell Programming and Scripting

How to remove common file names from text files

I'm running on freebsd -- with a default shell of csh. I have two files named A and B. Each line of each file contains a file name. How can I write a script that removes all the file names in file B from A. I tried to use perl to create a huge regular expression with "|" separating the file... (2 Replies)
Discussion started by: siegfried
2 Replies

6. UNIX for Dummies Questions & Answers

Removing path name from list of file names

I have this piece of code printf '%s\n' $pth*.msf | tr ' ' '\n' | sort -t '-' -k7 -k6r \ | awk -F- '{c=($6$7!=p&&FNR!=1)?ORS:"";p=$6$7}{printf("%c%s\n",c,$0)}' When I run it I get /home/chrisd/tatsh/branches/terr0.50/darwin/n02-z30-dsr65-terr0.50-dc0.002-8x6drw-csq.msf... (8 Replies)
Discussion started by: kristinu
8 Replies

7. Shell Programming and Scripting

How to find empty files in a directory and write their file names in a text?

I need to find empty files in a directory and write them into a text file. Directory will contain old files as well, i need to get the empty files for the last one hour only. (1 Reply)
Discussion started by: vel4ever
1 Replies

8. Shell Programming and Scripting

Removing unknow chars from file names.

I'm trying to move a large folder to an external drive but some files have these weird chars that the external drive won't accept. Does anyone know any command of any bash script that will look through a given folder and remove any weird chars? (4 Replies)
Discussion started by: Joktaa
4 Replies

9. UNIX for Dummies Questions & Answers

[solved]removing characters from a mass of file names

I found a closed thread that helped quite a bit. I tried adding the URL, but I can't because I don't have enough points... ? Modifying the syntax to remove ! ~ find . -type f -name '*~\!]*' | while IFS= read -r; do mv -- "$REPLY" "${REPLY//~\!]}"; done These messages are... (2 Replies)
Discussion started by: rabidphilbrick
2 Replies

10. Shell Programming and Scripting

Exclude certain file names while selectingData files coming in different names in a file name called

Data files coming in different names in a file name called process.txt. 1. shipments_yyyymmdd.gz 2 Order_yyyymmdd.gz 3. Invoice_yyyymmdd.gz 4. globalorder_yyyymmdd.gz The process needs to discard all the below files and only process two of the 4 file names available ... (1 Reply)
Discussion started by: dsravanam
1 Replies
htmerge(1)						      General Commands Manual							htmerge(1)

NAME
htmerge - create document index and word database for the ht://Dig search engine SYNOPSIS
htmerge [options] DESCRIPTION
Htmerge is used to create a document index and word database from the files that were created by htdig. These databases are then used by htsearch to perform the actual searched. OPTIONS
-a Use alternate work files. Tells htdig to append .work to database files, causing a second copy of the database to be built. This allows the original files to be used by htsearch during the indexing run. -c configfile Use the specified configfile instead of the default. -d Prevent the document index from being created. -s Print statistics about the document and word databases after htmerge has finished. -v Run in verbose mode. This will provide some hints as to the progress of the merge. This can be useful when running htmerge interac- tively since some parts (especially the word database creation) can take a very long time. -w Prevent the word database from being created. ENVIRONMENT
TMPDIR In addition to the command line options, the environment variable TMPDIR will be used to designate the directory where intermediate files are stored during the sorting process. FILES
/etc/htdig/htdig.conf The default configuration file. SEE ALSO
Please refer to the HTML pages (in the htdig-doc package) /usr/share/doc/htdig-doc/html/index.html and the manual pages htdig(1) and htsearch(1) for a detailed description of ht://Dig and its commands. AUTHOR
This manual page was written by Christian Schwarz, modified by Stijn de Bekker, based on the HTML documentation of ht://Dig. 21 July 1997 htmerge(1)
All times are GMT -4. The time now is 03:33 AM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy