Perl- Finding average "frequency" of occurrence of duplicate lines


 
# 1  
Old 08-09-2011

Hello,

I am working with a Perl script that tries to find the average "frequency" with which lines are duplicated. So far I've only managed to count how many times each line is repeated; the code is as follows:

Code:
perl -e'
my $filename = $ENV{i};                # input file name is passed in via the environment variable "i"
open(FILE, "<", $filename) or die $!;

my %seen = ();

while (my $line = <FILE>) {
  my @fields  = split(/\s+/, $line);
  my @fields2 = @fields[3..16];        # ignore the first three fields (timestamp etc.)
  my $niin    = join("\t", @fields2);
  $seen{$niin}++;                      # count occurrences of each key
}

foreach my $key (sort { $seen{$b} <=> $seen{$a} } keys %seen) {
  print "$key = $seen{$key}\n";
}

close(FILE);
'

Which produces this type of output:

Code:
225    1    225    2    225    3    225    4    225    5    225    6    225    7 = 31789
225    10    225    11    225    12    225    13    225    14    225    15    225    0 = 31772
225    8    225    9    225    10    225    11    225    12    225    13    225    14 = 31714
225    3    225    4    225    5    225    6    225    7    225    8    225    9 = 31686

Now, what I want to do is find out, on average, every how many lines a certain line is repeated. So I was wondering if it's possible to keep some sort of record and then, at the end, just calculate the average?

I actually have another way to calculate this frequency. In the original file being read, the first field is a unix timestamp (which I "cut out" for the counting of the duplicate lines). So I thought it would also be possible to keep a record of the "time between repetitions" and then compute an average at the end. Of course this would mean keeping a record for each duplicate line, which seems like a rather intricate operation. An example of the lines is:

Code:
1301892853.870    1316    efc0696e        225    1    225    2    225    3    225    4    225    5    225    6    225    7

The first field is the unix timestamp. The first, second and third fields are ignored when comparing lines for duplicates.

Any help is deeply appreciated.

---------- Post updated 08-09-11 at 07:49 AM ---------- Previous update was 08-08-11 at 08:40 AM ----------

Is this really not achievable in Perl the way I asked? Is there any other way to do it? Any ideas, please?

Thanks again...

Last edited by pludi; 08-08-2011 at 03:24 AM..
# 2  
Old 08-09-2011
Do you want something like this?
Code:
% cat INPUTFILE
a
b
a
c
b
a
a
d
d
% perl -lne '
  $seen{$_}++;
  END {
    for $key (sort keys %seen) {
      printf "%s %.2f%%\n", $key, $seen{$key}/$. * 100;
    }
}' INPUTFILE
a 44.44%
b 22.22%
c 11.11%
d 22.22%

# 3  
Old 08-09-2011
I'm sorry I think I wasn't clear enough.

I'd like the average of "every how many lines a certain line is repeated". So say that the line

Code:
a b c d e



is first repeated after 2 lines, then the next time it appears after 10 lines, then after 2 lines again, then after 4, and so on.

Is it possible to keep a record of this and make an average? For each duplicate line, of course.

Anyway, if I'm still not being clear enough, please do ask.

Thanks!

I had also proposed using the first field in my file to keep a record of time (since it's a unix timestamp), i.e. to find the "inter-occurrence" time instead of the "every how many lines" count, but I don't know whether that would be more complicated.
# 4  
Old 08-09-2011
Quote:
Is it possible to keep a record of this and make an average?
I believe it is possible. But I'm not sure I understand the task (sorry, English is not my native language). Please give examples of your input and the desired output. Maybe it would be enough if you gave the desired output for my INPUTFILE:
All lines: 9
Lines between a: 1, 2, 0 (or maybe you need to remember line numbers - 1, 3, 6, 7?) so what output?
b: 2 - ?
c: ? (only one occurrence) - ?
d: 0 - ?
# 5  
Old 08-09-2011
Quote:
Originally Posted by yazu
I believe it is possible. But I'm not sure I understand the task (sorry, English is not my native language). Please give examples of your input and the desired output. Maybe it would be enough if you gave the desired output for my INPUTFILE:
All lines: 9
Lines between a: 1, 2, 0 (or maybe you need to remember line numbers - 1, 3, 6, 7?) so what output?
b: 2 - ?
c: ? (only one occurrence) - ?
d: 0 - ?


Thanks for your reply.
Yeah what I want is something like what you said. So, for your example input file, the output would be:

Code:
a- 4 2 
b- 2 3
c- 1 0
d- 2 1



The first field is the contents of the repeated line, the second field is the number of times it was found in the file, and the third field is the average of "every how many lines it is repeated". For example, 'a' first reappears after 2 lines, then after 3 lines, then after 1 line, so the average is 2 lines. For 'b' and 'd', since they are only duplicated once, there is no averaging to do. And since 'c' is never repeated, the average is just 0 (or it could be blank, it doesn't matter).
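The closest I can sketch myself is something like the following, keyed on the whole line just for this small example (untested, so I may well be doing it wrong):

Code:
perl -lne '
  # remember the line number of every occurrence of each line
  push @{$pos{$_}}, $.;
  END {
    for my $key (sort keys %pos) {
      my @p     = @{$pos{$key}};
      my $count = @p;
      # average gap between consecutive occurrences; 0 if the line never repeats
      my $avg   = $count > 1 ? ($p[-1] - $p[0]) / ($count - 1) : 0;
      printf "%s- %d %g\n", $key, $count, $avg;
    }
  }
' INPUTFILE

For the INPUTFILE above this should print exactly the four lines I listed. For my real file the key would have to be built from fields 3..16 instead of the whole line, which is where I get stuck.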

On the other hand, how about keeping track of the timestamps and subtracting them to get the "time between repetitions", and then taking an average? That was my original idea, but I don't know how to keep track of this time for each repeated line. The output in this case would be something like:

Code:
a- 4 0.05
b- 2 0.89
c- 1 0
d- 2 0.06



the last field being the average time in seconds.

Thanks!

# 6  
Old 08-09-2011
Ok. Is this algorithm right (there is a 1-second difference between lines)?
Code:
cat INPUTFILE 
1301892853.870 a
1301892854.870 b
1301892855.870 a
1301892856.870 c
1301892857.870 b
1301892858.870 a
1301892859.870 a
1301892860.870 d
1301892861.870 d
 
perl -ane '
  push @{$seen{$F[1]}}, $F[0];
  END {
    for $key (sort keys %seen) {
      @ts = @{$seen{$key}};
      $n = @ts;      
      $prev = $ts[0];
      $nt = 0;
      print "$key $n ";
      for $time (@ts) {
        $nt += $time - $prev;
      }
      print $nt/$n, "\n";
    }
  }
' INPUTFILE
a 4 3.25
b 2 1.5
c 1 0
d 2 0.5


Last edited by yazu; 08-09-2011 at 03:59 AM.. Reason: small improvements
# 7  
Old 08-09-2011
Quote:
Originally Posted by yazu
Ok. Is this algorithm right (there is a 1-second difference between lines)?
Code:
cat INPUTFILE 
1301892853.870 a
1301892854.870 b
1301892855.870 a
1301892856.870 c
1301892857.870 b
1301892858.870 a
1301892859.870 a
1301892860.870 d
1301892861.870 d
 
perl -ane '
  push @{$seen{$F[1]}}, $F[0];
  END {
    for $key (sort keys %seen) {
      @ts = @{$seen{$key}};
      $n = @ts;      
      $prev = $ts[0];
      $nt = 0;
      print "$key $n ";
      for $time (@ts) {
        $nt += $time - $prev;
      }
      print $nt/$n, "\n";
    }
  }
' INPUTFILE
a 4 3.25
b 2 1.5
c 1 0
d 2 0.5

I just tried the algorithm, and it works for the example input file, but for my actual file there are a couple of problems.

The input lines in my original file are of the form:
Code:
1301892853.870    1316    efc0696e        225    1    225    2    225    3    225    4    225    5    225    6    225    7

So for the comparison of duplicates I want to ignore fields 0, 1 and 2.
How can I adjust your code to do this?

I tried changing this part, which only uses the second field as the key:

Code:
 push @{$seen{$F[1]}}, $F[0];

then I changed it to

Code:
push @{$seen{$F[3..16]}}, $F[0];



but it doesn't seem to work, and I don't think I quite get what the code does. Could you please explain?
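My best guess is that the key has to be an array slice joined into a single string, so that your one-liner would become something like this (only the first line changed, and I haven't been able to confirm it):

Code:
perl -ane '
  # key = fields 3..16 joined with tabs; value = list of timestamps (field 0)
  push @{$seen{join("\t", @F[3..16])}}, $F[0];
  END {
    for $key (sort keys %seen) {
      @ts = @{$seen{$key}};
      $n = @ts;
      $prev = $ts[0];
      $nt = 0;
      print "$key $n ";
      for $time (@ts) {
        $nt += $time - $prev;
      }
      print $nt/$n, "\n";
    }
  }
' INPUTFILE

Thanks!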