08-05-2011
Removing files with same text but different file names
Hi All,
I have about 50,000 HTML files in a directory. The problem is that some of them are duplicates: wget crawled the same pages more than once and named each extra copy by appending .1, .2, .3, and so on. For example, if index.html was crawled several times, the copies were saved as index.html.1, index.html.2, etc. All of these index.html files contain the same text.
I browsed through some posts here and found this:
https://www.unix.com/shell-programmin...directory.html
I then tried the above script on three "test" text files that I created, each containing the same single word (this was just to test the code given there). It works for the three text files: it reported two of them and left out the third. I can then process the output file (duplicate.files) to get the file names and delete the duplicates, that is, the files with the same text.
However, when I run the same code on my HTML directory, it does not report any files with the same text, even though I have manually checked that duplicates exist.
I am not sure where the problem is. I am using Linux with Bash.
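For reference, here is a minimal sketch of one way to find files with identical content by comparing checksums. It is not the script from the linked thread. It assumes GNU coreutils (md5sum, sort, uniq), file names without newlines or backslashes, and duplicates that are byte-for-byte identical. The function names list_duplicates and list_extra_copies are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: group files by MD5 checksum of their content.
# Assumes GNU md5sum/uniq and file names without newlines or backslashes
# (md5sum escapes such names, which would shift the columns used below).

# Print every file that shares its content with at least one other file,
# one group per block, groups separated by a blank line.
list_duplicates() {
    local dir=${1:-.}
    find "$dir" -type f -exec md5sum {} + |
        LC_ALL=C sort |
        uniq -w32 --all-repeated=separate   # compare only the 32-char hash
}

# Print only the extra copies: for each group of identical files, skip the
# first name in sort order (e.g. index.html) and print the rest
# (e.g. index.html.1, index.html.2).
list_extra_copies() {
    local dir=${1:-.}
    find "$dir" -type f -exec md5sum {} + |
        LC_ALL=C sort |
        awk 'seen[$1]++ { print substr($0, 35) }'   # name starts at column 35
}
```

For example, `list_extra_copies /path/to/html > duplicate.files` (the path is a placeholder) writes the candidate files to a list that can be reviewed before anything is deleted. If files that look the same are not reported, `cmp index.html index.html.1` shows whether the two are really byte-identical; a single differing byte is enough to give a different checksum.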
Test::HTML::W3C(3pm) User Contributed Perl Documentation Test::HTML::W3C(3pm)
NAME
Test::HTML::W3C - Perform W3C HTML validation testing
SYNOPSIS
use Test::HTML::W3C tests => $test_count;
# or
use Test::HTML::W3C 'show_detail';
# or when using both
use Test::HTML::W3C tests => $test_count, 'show_detail';
is_valid_markup($my_html_scalar);
is_valid_file("/path/to/my/file.html");
is_valid("http://example.com");
# Get the underlying WebService::Validator::HTML::W3C object
my $validator = validator();
DESCRIPTION
The purpose of this module is to provide a wrapper around the W3C HTML validation service that works with the Test::More testing framework.
ABUSE
Please keep in mind that the W3C validation pages and services are a shared resource. If you plan to run a large number of tests, please consider
using your own installation of the validation programs, and then use your local install by modifying the local validator:
my $v = validator();
$v->validator_uri($my_own_validator);
See the documentation for WebService::Validator::HTML::W3C and the W3C's site at http://validator.w3.org/ for details.
validator();
Description: Returns the underlying WebService::Validator::HTML::W3C object
Parameters: None.
Returns: $validator
plan();
Description: Access to the underlying "plan" method provided by Test::Builder.
Parameters: As per Test::Builder
is_valid_markup($markup[, $name]);
Description: is_valid_markup tests whether the text in the provided scalar value correctly validates according to the W3C
specifications. This is useful if you wish to test markup that is stored in a scalar, for example markup obtained with LWP or
WWW::Mechanize.
Parameters: $markup, a scalar containing the data to test, $name, an optional descriptive test name.
Returns: None.
is_valid_file($path[, $name]);
Description: is_valid_file works the same way as is_valid_markup, except that you specify the text to validate by giving the path to a
file. This is useful if you have pregenerated all your HTML files locally, and now wish to test them.
Parameters: $path, a scalar, $name, an optional descriptive test name.
Returns: None.
is_valid($url[, $name]);
Description: is_valid, again, works very similarly to is_valid_markup and is_valid_file, except you specify a document that is
already online with its URL. This can be useful if you wish to periodically test a website or webpage that dynamically changes over
time, for example a blog or a wiki, without first saving the HTML to a file using your browser or a utility such as wget.
Parameters: $url, a scalar, $name, an optional descriptive test name.
Returns: None.
diag_html($url);
Description: If you want to display the actual errors reported by the service for a particular test, you can use the diag_html
function. Please note that you must have imported 'show_detail' for this to work properly.
use Test::HTML::W3C 'show_detail';
is_valid_markup("<html></html>", "My simple test") or diag_html();
Parameters: $url, a scalar.
Returns: None.
SEE ALSO
Test::Builder::Module for creating your own testing modules.
Test::More for another popular testing framework, also based on Test::Builder.
Test::Harness for details about how test results are interpreted.
AUTHORS
Victor <victor73@gmail.com> with inspiration from the authors of the Test::More and WebService::Validator::HTML::W3C modules.
BUGS
See http://rt.cpan.org to report and view bugs.
COPYRIGHT
Copyright 2006 by Victor <victor73@gmail.com>.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
See http://www.perl.com/perl/misc/Artistic.html
perl v5.12.4 2011-08-22 Test::HTML::W3C(3pm)