Parsing a file which contains urls from different sites


 
# 1  
Old 09-30-2009

Hi

I have a file which contains millions of URLs from many different sites, one URL per line; the line count is 4,000,000.
Code:
http://www.chipchick.com/2009/09/usb_hand_grenade.html
http://www.engadget.com/page/5
http://www.mp3raid.com/search/download-mp3/20173/michael_jackson_fall_again_instrumental.html
http://www.myacrobatpdf.com/8713/canon-speedlite-430ex-manual.html
http://www.mobileheart.com/cell-phone-screensavers/1167-Sony-Ericsson-W200-Screensavers.aspx
http://www.india-forums.com/forum_posts.asp?TID=1256207&TPN=2
http://gallery.mobile9.com/f/923680
http://www.phoronix.com/scan.php?page=article&item=xorg_vdpau_vaapi&num=1
http://www.experts-exchange.com/Software/Photos_Graphics
http://www.jigzone.com/mpc/expired.php
http://ultimatetop200.com/
http://www.mp3raid.com/search/for/the_maine/4.html
http://gallery.mobile9.com/f/907594?view=download
http://gallery.mobile9.com/f/907594
http://www.imdb.com/title/tt0813715/board/thread/147969365
http://www.imdb.com/name/nm0002028

I want a command or script that can give me the count of URLs per site, e.g. imdb, experts-exchange, gallery.mobile9.

Last edited by radoulov; 09-30-2009 at 07:33 AM. Reason: please use code tags
# 2  
Old 09-30-2009
With GNU AWK you can do something like this:

Code:
gawk -F'http://(www\\.)?|/' '!_[$2]++{print $2}' infile

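Note that this prints only the list of distinct sites. If you actually need the count of URLs per site, a minimal variation on the same idea (same field separator, GNU awk assumed, untested sketch) would be:

Code:
gawk -F'http://(www\\.)?|/' '{cnt[$2]++} END {for (s in cnt) print cnt[s], s}' infile

Pipe the output through sort -rn if you want the sites ordered by count.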
Otherwise use Perl:

Code:
perl -nle'
  print $1 unless $_{(m|http://(?:www\.)?([^/]*)|)[0]}++
' infile

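If you would rather not depend on awk or Perl at all, a plain pipeline along the same lines should also do the job. This is just a sketch, assuming every line starts with http:// and stripping an optional leading www.:

Code:
sed -e 's|^http://||' -e 's|^www\.||' -e 's|/.*||' infile | sort | uniq -c | sort -rn

For a 4,000,000-line file the sort step is the slow part, but it is still quite manageable.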


# 3  
Old 10-05-2009
Hey, thanks a lot!