08-05-2011
Removing files with same text but different file names
Hi All,
I have some 50,000 HTML files in a directory. The problem is that some of them are duplicates: wget crawled the same pages more than once and named the later copies by appending .1, .2, .3 and so on. For example, index.html was crawled several times, and the extra copies were saved as index.html.1, index.html.2, etc. All of these index.html files contain the same text.
I browsed through some posts here and found this:
https://www.unix.com/shell-programmin...directory.html
I then tested the script from that thread on 3 identical "test" text files, each containing one word. It worked for those 3 files: it reported two of them and left out one. I can then process the output file (duplicate.files) to get the file names and delete the duplicates, i.e. the files with the same text.
But when I run the same script on my HTML directory, it does not report any files with the same text, even though there are duplicates; I have checked this manually.
I am not sure where the problem is. I am using Linux with Bash.
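For reference, the kind of check I am after could be sketched roughly like this. It is only a minimal sketch, not the script from the linked thread: it assumes GNU coreutils (md5sum, and uniq with -w and --all-repeated), and the function name list_duplicates is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print groups of files in one directory whose
# contents are byte-for-byte identical. Assumes GNU md5sum and GNU uniq.
list_duplicates() {
    local dir=${1:-.}
    # md5sum prints "<32 hex chars>  <path>" for each file.
    # sort places equal hashes on adjacent lines, and uniq -w32 compares
    # only the hash column, printing every member of each repeated group
    # with a blank line between groups.
    find "$dir" -maxdepth 1 -type f -exec md5sum {} + \
        | sort \
        | uniq -w32 --all-repeated=separate
}
```

It would be run as `list_duplicates /path/to/html > duplicate.files` (the path is a placeholder). Note that a checksum only groups files that are identical byte for byte. If two copies differ in even one byte, they will not be listed together, and `cmp index.html index.html.1` on a suspected pair shows whether that is the case. File names containing newlines or backslashes are not handled by this sketch.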