Need to find occurrences of email domains in all files in a directory


 
# 1  
Old 10-14-2009

Hello Everyone!

I trust you are off to a great week! I'm trying to output the name and count of each unique domain found in the files of the current directory, as part of a script I'm building.

Here's what I'm stuck on:

- Need to find UNIQUE occurrences of domains (*@domain.com) in ALL files in a directory.
- Need to output:
uniquedomain1.com = 1234 occurrences
uniquedomain2.com = 12345 occurrences

... etc

- Every file includes ONE domain per line, with the format of the surrounding text being inconsistent and random. What WILL remain consistent is that each line will have an email address with the following syntax somewhere in each: emailaddress@domain.com


Would someone be able to help me figure out how to do this?

Thanks so much

---------- Post updated at 05:30 PM ---------- Previous update was at 04:45 PM ----------

I can run the command below to output a list of UNIQUELY occurring domains:
perl -wne'while(/@[\w\.]+/g){print "$&\n"}' filename | sort -u

Now, how do I, for all files in a directory, display the count of each unique domain per file and then a final TOTAL count, per domain, across all files?

Thanks!
# 2  
Old 10-15-2009
file 'test':
Code:
@www.test.com
@www.test.org
@www.test.com
@www.test.org
@www.test.com
@www.test.com
@www.test.com
@www.test.com
@

command
Code:
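 perl -we 'my $domains = {}; open my $fh, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!\n"; while (<$fh>) { if (/\@([\w\.]+)/) { $domains->{$1} += 1; } } close $fh; foreach my $domain (sort keys %$domains) { print "$domain=$domains->{$domain}\n"; }' test

more easily read:
Code:
use strict;
use warnings;

my $domains = {};
# Three-argument open with an error check, so a missing file is reported
open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
while (<$fh>) {
  # Count the first @domain on each line; a bare "@" does not match
  if (/\@([\w\.]+)/) {
    $domains->{$1} += 1;
  }
}
close $fh;
# Print the counts sorted by domain name
foreach my $domain (sort keys %$domains) {
  print "$domain=$domains->{$domain}\n";
}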

result
Code:
www.test.com=6
www.test.org=2
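
To extend the single-file count to every file in a directory, with per-file counts followed by a total, a minimal shell sketch could look like the following. The function names (extract_domains, tally, count_domains) are made up for illustration, and the grep pattern also accepts hyphens in domain names, which the Perl regex above does not; treat both as assumptions and adjust to your data.

```shell
#!/bin/bash
# Sketch only: extract_domains, tally and count_domains are hypothetical helper names.

# Print every "@domain" match in a file, one per line, without the leading "@".
extract_domains() {
  grep -oE '@[[:alnum:]_.-]+' "$1" | cut -c2-
}

# Turn a stream of domains into "domain = N occurrences" lines, sorted by domain name.
tally() {
  sort | uniq -c | awk '{ print $2 " = " $1 " occurrences" }'
}

# count_domains DIR: per-file counts for each regular file in DIR, then the total.
count_domains() {
  local dir="${1:-.}" file all
  all=$(mktemp) || return 1          # collects every match for the final total
  for file in "$dir"/*; do
    [ -f "$file" ] || continue       # skip directories and other non-regular files
    echo "== $file"
    extract_domains "$file" | tee -a "$all" | tally
  done
  echo "== TOTAL"
  tally < "$all"
  rm -f "$all"
}

# Example call (uncomment to run against the current directory):
# count_domains .
```

Collecting the matches in a temporary file keeps the per-file and total passes identical, at the cost of one extra file; for very large inputs a single awk pass keyed on filename and domain may be preferable.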

# 3  
Old 10-15-2009
Error with that code

Here's what I'm seeing:

Code:
Use of uninitialized value $ARGV[0] in concatenation (.) or string at -e line 1, <> line 19273.
readline() on closed filehandle FH at -e line 1.

... once for every line.

Anyway, I made some headway on my own, so please take a look at my code below.

Code:
#!/bin/bash
# bash rather than sh: the script uses arrays and (( )) arithmetic
for file in *
do
  if [ -f "$file" ]
  then
    # FOR EACH FILE, OUTPUT THE FILENAME + LINE COUNT
    find $file -print0 | xargs -0 wc -l
    fileLineCount="`wc -w $file`"
    echo $fileLineCount

    #Output unique domains
    perl -wne'while(/@[\w\.]+/g){print "$&\n"}' $file | sort -u > uniques.txt   # TO FILE
    #perl -wne'while(/@[\w\.]+/g){print "$&\n"}' $file | sort -u                # TO SCREEN

    # Create structures based on individual files
    c=0; while read line; do arrayDomain[c]=`echo "$line"`; let c=$c+1; done < uniques.txt

    arrayDomain_size=${#arrayDomain[*]}

    #ASSIGN 'DOMAIN COUNT' TO THE RELATED ARRAY and OUTPUT COUNT, PER DOMAIN
    #i=0; while[$arrayDomain_size > $i]; do arrayUniqueNum[i]= $(grep -o ${arrayDomain[i]} $file | wc -w); let i=$i+1; done
    max=c
    position=0
    while (( position < max ))
    do
      arrayUniqueNum[position]=$(grep -o ${arrayDomain[position]} $file | wc -w)

      if [ ${arrayUniqueNum[position]} -ge 1000 ]
      then
        echo "${arrayDomain[position]}  :  ${arrayUniqueNum[position]}"
        #echo "\n$((${arrayUniqueNum[position]}/$fileLineCount)*100) %"
      fi
      (( position = position + 1 ))
    done
  fi
done

Everything works pretty much, except here are the items I'm COMPLETELY stuck on:

1) Only output the analysis lines IF the count is greater than 1000.
2) For some reason, some output looks like this:
@r : 1052
@s : 2704
@t : 1406
... when it should be showing the entire domain. The domains written to uniques.txt look fine, so I'm not sure why the lines aren't being read in or output properly from arrayDomain[].

3) Output the percentages as well. Here is my commented-out attempt:
Code:
#echo "\n$((${arrayUniqueNum[position]}/$fileLineCount)*100) %"

I'm not really sure how to format this so that it outputs what I need (the percentage of a file that a given domain makes up):

Domain.com : xxxx unique occurrences : 23%



Help would be GREATLY appreciated. Thanks for your assistance in advance, you all are a true asset to furthering knowledge and education in the Unix community! I'm sure we can come to a solution together. I'm here to learn from the best~!

Please let me know if this needs clarification at all.
# 4  
Old 10-15-2009
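You may be getting that error with my sample if you're still using the '-n' flag. -n wraps the code in a while (<>) loop, and that loop takes the filename off @ARGV before my code gets to read $ARGV[0].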

If I get a chance, I'll look at the shell script too.
# 5  
Old 10-15-2009
That line of code worked (I think I entered something incorrectly before). I've included similar functionality in the script (per my previous post). If you'd be so kind as to see what can be done to make the other items happen, that would be FANTASTIC.

I'm really stuck and I'd appreciate the opportunity to learn how to make these other functions happen (there are just a few).

Thanks so much in advance.
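
For the three open items in the script (the 1000 threshold, the truncated domains, and the percentages), a minimal sketch under a few assumptions might look like this. Some background that is certain: `wc -w $file` prints a word count followed by the filename, so fileLineCount is not a plain number, and shell `$(( ))` arithmetic is integer-only, so dividing the count by the total before multiplying by 100 gives 0. One plausible cause of the `@r` / `@s` / `@t` output, not confirmed anywhere in this thread, is that `[\w\.]+` stops at a hyphen, so a domain such as t-online.de would be captured as `@t`. The function name report_domains is hypothetical, and the pattern below accepts hyphens on that assumption.

```shell
#!/bin/bash
# Sketch only: report_domains is a hypothetical helper name.
# report_domains FILE [MIN]: list each domain seen at least MIN times in FILE
# (default 1000) with its share of the file's lines, assuming one address per line.
report_domains() {
  local file="$1" min="${2:-1000}" total
  total=$(awk 'END { print NR }' "$file")   # plain line count, no filename attached
  [ "$total" -gt 0 ] || return 0            # empty file: nothing to report
  grep -oE '@[[:alnum:]_.-]+' "$file" | cut -c2- | sort | uniq -c |
    awk -v min="$min" -v total="$total" '
      # $1 is the count from uniq -c, $2 the domain; multiply before dividing
      $1 >= min { printf "%s : %d occurrences : %d%%\n", $2, $1, $1 * 100 / total }'
}

# Example call (uncomment to run over every regular file in the current directory):
# for f in *; do [ -f "$f" ] && { echo "== $f"; report_domains "$f" 1000; }; done
```

Counting with sort and uniq -c in one pass also avoids running grep once per domain, which is where a short pattern like `@t` would match every domain starting with "t" and inflate the numbers.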