Post 302544818 by shoaibjameel123 on Friday 5th of August 2011 06:07:00 AM
Removing files with same text but different file names

Hi All,

I have some 50,000 HTML files in a directory. The problem is that some of them are duplicates: wget crawled the same pages more than once and named each new copy by appending .1, .2, .3, etc. For example, if index.html was crawled several times, the copies are named index.html.1, index.html.2, etc. All of these index.html files contain the same text.

I browsed through some posts here and found this:
https://www.unix.com/shell-programmin...directory.html

I then tested that script on three identical "test" text files, each containing one word. It worked: it reported two of the three files and left one out. I can then process the output file (duplicate.files) to get the file names and delete the duplicates, that is, the files with the same text.

But when I run the same script on my HTML directory, it does not report any files with the same text, even though I have checked manually that duplicates exist.

I am not sure where the problem is. I am using Linux with Bash.
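
For reference, a minimal sketch of one way to list byte-for-byte duplicates with checksums. It is not the script from the linked thread, which is not reproduced here; `list_duplicates` is a hypothetical helper name, and the sketch assumes GNU coreutils (`sha256sum`) and file names without newlines or backslashes. If a check like this also finds nothing, the copies may not be byte-identical after all (crawled pages can differ in a few bytes while looking the same), which `cmp index.html index.html.1` would confirm for a pair checked by hand.

```shell
#!/usr/bin/env bash
# Sketch: print every file whose content duplicates an earlier file.
# list_duplicates is a hypothetical helper, not part of any existing tool.

list_duplicates() {
    local dir=${1:-.}
    # Hash every regular file under $dir. Sorting in the C locale puts
    # equal checksums next to each other and orders names bytewise, so
    # index.html sorts before index.html.1 and is the copy that is kept.
    find "$dir" -type f -exec sha256sum {} + |
        LC_ALL=C sort |
        awk '
            {
                hash = $1
                # sha256sum prints: <hash>, a space, a mode character, <name>
                name = substr($0, length(hash) + 3)
                # Same checksum as the previous line: this is a duplicate.
                if (hash == prev) print name
                prev = hash
            }'
}
```

Running `list_duplicates /path/to/html > duplicate.files` writes one redundant file per line and leaves the first copy of each group out of the list, so the result can be reviewed before anything is deleted.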
 
