Finding and removing patterns in a large list of URLs


 
# 1  
Old 02-19-2009

I have a list of URLs, for example:

Google
Google Base
Yahoo!
Yahoo!
Yahoo! Video - It's On
Google

The problem is that Google and Google are duplicates as are Yahoo! and Yahoo!.

I need to find these canonical www duplicates and put the text "DUP#" in front of both Google and Google, for delimited import into Excel, so that I can sort and review them by eye.

I have no idea how to begin... sed, awk, perl, cut, etc.?

Many thanks for any input.
# 2  
Old 02-20-2009
Code:
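#!/usr/bin/perl
use strict;
use warnings;

# Read every line of a.txt, keeping the input order and counting occurrences.
open my $fh, '<', 'a.txt' or die "Cannot open a.txt: $!";
my (@lines, %count);
while (my $line = <$fh>) {
    chomp $line;
    push @lines, $line;
    $count{$line}++;
}
close $fh;

# Prefix every line that occurs more than once with the requested "DUP#" marker.
for my $line (@lines) {
    $line = "DUP#$line" if $count{$line} > 1;
}
print join("\n", @lines), "\n";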

# 3  
Old 02-20-2009
Hi totus,

I hope this also does the job:

inputfile:
www.Google.com
www.Google Base.com
www.Yahoo!.com
www.Yahoo!.com
www.Yahoo! Video - It's On.com
www.Google.com

command:
sort inputfile | uniq -D | awk '{print $0 "_DUP#"}' > out.csv

output:
www.Google.com_DUP#
www.Google.com_DUP#
www.Yahoo!.com_DUP#
www.Yahoo!.com_DUP#

Thanks
Sha
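
Two things about the pipeline above are worth noting: `uniq -D` is not part of POSIX, and the output contains only the duplicated lines, in sorted order, with the marker at the end. If every line should be kept in its original order with "DUP#" in front, as the first post asked, a two-pass awk over the same file is one possible sketch. The function name `mark_exact_dups` and the temporary file are made up for illustration; the sample lines are copied from the post above.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): prefix exact duplicate lines.
# The file is read twice: pass 1 (NR == FNR) counts each line, pass 2 prints
# every line in input order and prefixes the ones counted more than once.
mark_exact_dups() {
    awk '
    NR == FNR { count[$0]++; next }
    {
        prefix = (count[$0] > 1) ? "DUP#" : ""
        print prefix $0
    }
    ' "$1" "$1"
}

# Demo with the sample input from the post above.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
www.Google.com
www.Google Base.com
www.Yahoo!.com
www.Yahoo!.com
www.Yahoo! Video - It's On.com
www.Google.com
EOF
mark_exact_dups "$tmp"
rm -f "$tmp"
```

Reading the file twice avoids holding the whole list in memory, which may matter for a large list; only the per-line counts are stored.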
# 4  
Old 02-20-2009
Hello both of you! Thanks for the tips! However, I made a mistake in representing my data, because vBulletin mucked it up. Here it is as a code snippet:

Code:
http://www.google.com
http://google.com
http://www.yahoo.com
http://video.yahoo.com
http://www.yahoo.com
http://knol.google.com

The issue is that www.domain.com and domain.com are duplicates. I need to identify these in a large list by prepending some delimiter to the matches, e.g.:

Code:
DUP#http://www.google.com
DUP#http://google.com
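
One way this could be approached, offered as a sketch rather than a tested answer from the thread: compare the URLs on a normalized key instead of the raw line. The key below drops the scheme and a leading "www.", so http://www.google.com and http://google.com collide. The function names `mark_www_dups` and `key` are hypothetical, and treating http and https as the same site is an assumption that goes beyond the post.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): flag URLs that differ only by a
# leading "www." after the scheme. Assumption: http:// and https:// are
# treated as the same site.
mark_www_dups() {
    awk '
    function key(url) {
        sub(/^https?:\/\//, "", url)   # drop the scheme
        sub(/^www\./, "", url)         # treat www.domain.com as domain.com
        return url
    }
    NR == FNR { count[key($0)]++; next }   # pass 1: count normalized keys
    {                                      # pass 2: tag and print in input order
        prefix = (count[key($0)] > 1) ? "DUP#" : ""
        print prefix $0
    }
    ' "$1" "$1"
}

# Demo with the six URLs from the post above.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
http://www.google.com
http://google.com
http://www.yahoo.com
http://video.yahoo.com
http://www.yahoo.com
http://knol.google.com
EOF
mark_www_dups "$tmp"
rm -f "$tmp"
```

With the sample data this tags both google.com lines and both www.yahoo.com lines, and leaves video.yahoo.com and knol.google.com untouched, since only a leading "www." is stripped. The original line is printed unchanged after the marker, so the list can still be sorted or filtered on "DUP#" after import.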
