Perl- Finding average "frequency" of occurrence of duplicate lines

08-09-2011

Registered User

28, 0

Join Date: Apr 2011

Last Activity: 17 August 2011, 4:10 AM EDT

Location: Helsinki, Finland

Posts: 28

Thanks Given: 14

Thanked 0 Times in 0 Posts

Hello,

I am working on a Perl script that tries to find the average "frequency" with which lines are duplicated. So far I have only managed to count how many times each line is repeated. The code is as follows:

Code:

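perl -e '
# The input file name is taken from the environment variable "i".
my $filename = $ENV{i};
open(my $fh, "<", $filename) or die "Cannot open $filename: $!";

my %seen;

while (my $line = <$fh>) {
  my @fields = split(/\s+/, $line);
  # Ignore the timestamp and the next two fields; compare fields 3..16 only.
  my $niin = join("\t", @fields[3..16]);
  $seen{$niin}++;
}
close($fh);

# Print each distinct line with its count, most frequent first.
foreach my $key (sort { $seen{$b} <=> $seen{$a} } keys %seen) {
  print "$key = $seen{$key}\n";
}
'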

Which produces this type of output:

Code:

225    1    225    2    225    3    225    4    225    5    225    6    225    7 = 31789
225    10    225    11    225    12    225    13    225    14    225    15    225    0 = 31772
225    8    225    9    225    10    225    11    225    12    225    13    225    14 = 31714
225    3    225    4    225    5    225    6    225    7    225    8    225    9 = 31686

Now, what I want to find is, on average, "every how many lines a certain line is repeated". So I was wondering whether it is possible to keep some sort of record and then calculate the average at the end.

I actually have another way to calculate this frequency. In the original file being read, the first field is a Unix timestamp (which I "cut out" when counting the duplicate lines). So I thought it would also be possible to keep a record of the "time between repetitions" and then take the average at the end. Of course, this would mean keeping a record for each duplicate line, which seems like a rather intricate operation. An example of the lines is:

Code:

1301892853.870    1316    efc0696e        225    1    225    2    225    3    225    4    225    5    225    6    225    7

The first field is the Unix timestamp. The first, second and third fields are ignored when comparing duplicate lines.

Any help is deeply appreciated.

---------- Post updated 08-09-11 at 07:49 AM ---------- Previous update was 08-08-11 at 08:40 AM ----------

Is this really not possible in Perl the way I asked? Is there any other way to do it? Any ideas, please?

Thanks again...

Last edited by pludi; 08-08-2011 at 03:24 AM..

acsg

08-09-2011

Registered User

1,000, 237

Join Date: Jun 2011

Last Activity: 2 August 2017, 9:27 AM EDT

Location: From far

Posts: 1,000

Thanks Given: 21

Thanked 237 Times in 231 Posts

Do you want something like this?

Code:

% cat INPUTFILE
a
b
a
c
b
a
a
d
d
% perl -lne '
  $seen{$_}++;
  END {
    for $key (sort keys %seen) {
      printf "%s %.2f%%\n", $key, $seen{$key}/$. * 100;
    }
}' INPUTFILE
a 44.44%
b 22.22%
c 11.11%
d 22.22%

yazu

08-09-2011

I'm sorry, I think I wasn't clear enough.

I'd like the average of "every how many lines a certain line is repeated". So say that the line

Code:

a b c d e

first reappears after 2 lines, then the next time after 10 lines, then after 2 again, then 4, and so on.

Is it possible to keep a record of this and make an average? For each duplicate line, of course.

Anyway, if I'm still not being clear enough, please do ask.

Thanks!

I had also proposed using the first field in my file (a Unix timestamp) to keep a record of time, that is, to find the "inter-occurrence" time instead of the "every how many lines" figure, but I don't know whether this would be more complicated.
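The record-keeping asked about here can be done in a single pass. The following is only a minimal sketch, not code from the thread: for each distinct line it remembers the line number of the last occurrence, the number of occurrences, and the running sum of gaps, then divides at the end. The helper name `gap_stats` and the generated filler lines are made up for illustration; a "gap" is assumed to mean the difference between the line numbers of two consecutive occurrences.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): for each distinct line, return
# how often it occurred and the mean number of lines between occurrences.
sub gap_stats {
    my @lines = @_;
    my (%count, %last, %sum);
    my $lineno = 0;
    for my $line (@lines) {
        $lineno++;
        # On a repeat, add the distance to the previous occurrence.
        $sum{$line} += $lineno - $last{$line} if exists $last{$line};
        $last{$line} = $lineno;
        $count{$line}++;
    }
    my %stats;
    for my $line (keys %count) {
        my $gaps = $count{$line} - 1;    # n occurrences give n-1 gaps
        $stats{$line} = {
            count => $count{$line},
            mean  => $gaps ? $sum{$line} / $gaps : 0,
        };
    }
    return \%stats;
}

# Made-up input: "a b c d e" recurs after 2, 10, 2 and 4 lines
# (line numbers 1, 3, 13, 15, 19); every other line is unique filler.
my %is_target = map { $_ => 1 } (1, 3, 13, 15, 19);
my @lines = map { $is_target{$_} ? "a b c d e" : "filler $_" } 1 .. 19;

my $stats = gap_stats(@lines);
printf "%s: seen %d times, every %.2f lines on average\n",
    "a b c d e", $stats->{"a b c d e"}{count}, $stats->{"a b c d e"}{mean};
# prints: a b c d e: seen 5 times, every 4.50 lines on average
```

Only three small hashes are kept, so memory grows with the number of distinct lines rather than with the number of occurrences.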

acsg

08-09-2011

Quote:

Is it possible to keep a record of this and make an average?

I believe it is possible. But I'm not sure I understand the task (sorry, English is not my native language). Please give examples of your input and the desired output. Maybe it would be enough if you give the desired output for my INPUTFILE:
All lines: 9
Lines between a: 1, 2, 0 (or maybe you need to remember line numbers - 1, 3, 6, 7?) so what output?
b: 2 - ?
c: ? (only one occurrence) - ?
d: 0 - ?

yazu

08-09-2011

Thanks for your reply.
Yes, what I want is something like what you described. So, for your example input file, the output would be:

Code:

a- 4 2 
b- 2 3
c- 1 0
d- 2 1

The first field is the content of the repeated line, the second field is the number of times it is found in the file, and the third field is the average of "every how many lines it is repeated". For example, 'a' reappears first after 2 lines, then after 3 lines, then after 1 line, so the average is 2 lines. Since 'b' and 'd' are each duplicated only once, there is no need to average: the single gap is the result. And since 'c' is never repeated, its average is just '0' (or it could be blank, it doesn't matter).
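The table described here can be reproduced by storing the line numbers of every occurrence. This is a hedged sketch rather than code from the thread: the helper name `report` is made up, and the sample input is fed from an in-memory string instead of INPUTFILE. Because successive gaps add up to "last line number minus first line number", the average needs only the two ends of each list.

```perl
use strict;
use warnings;

# Hypothetical helper: read lines from a filehandle and return rows of the
# form "<line>- <count> <average gap>", sorted by line content.
sub report {
    my ($fh) = @_;
    my %pos;                          # line content => list of line numbers
    while (my $line = <$fh>) {
        chomp $line;
        push @{ $pos{$line} }, $.;    # $. is the current input line number
    }
    my @rows;
    for my $key (sort keys %pos) {
        my @p = @{ $pos{$key} };
        # The gaps sum to last - first, and n occurrences have n-1 gaps.
        my $avg = @p > 1 ? ($p[-1] - $p[0]) / (@p - 1) : 0;
        push @rows, "$key- " . scalar(@p) . " $avg";
    }
    return @rows;
}

# The nine-line sample (a b a c b a a d d), supplied in memory.
my $sample = join "\n", qw(a b a c b a a d d), "";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
print "$_\n" for report($fh);
close $fh;
# prints:
# a- 4 2
# b- 2 3
# c- 1 0
# d- 2 1
```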

On the other hand, how about keeping track of the timestamps and subtracting them to get the "time between repetitions", and then averaging that? That was my original idea, but I don't know how to keep track of this time for each repeated line. The output in this case would be something like:

Code:

a- 4 0.05
b- 2 0.89
c- 1 0
d- 2 0.06

The last field is the average time between repetitions, in seconds.

Thanks!

acsg

08-09-2011

OK. Is this algorithm right (assuming a 1-second difference between lines)?

Code:

cat INPUTFILE 
1301892853.870 a
1301892854.870 b
1301892855.870 a
1301892856.870 c
1301892857.870 b
1301892858.870 a
1301892859.870 a
1301892860.870 d
1301892861.870 d
 
perl -ane '
  # Collect every timestamp (field 0) under its line content (field 1).
  push @{$seen{$F[1]}}, $F[0];
  END {
    for $key (sort keys %seen) {
      @ts = @{$seen{$key}};
      $n = @ts;
      $prev = $ts[0];
      $nt = 0;
      for $time (@ts) {
        $nt += $time - $prev;   # gap since the previous occurrence
        $prev = $time;
      }
      # n occurrences have n-1 gaps; report 0 for a line seen only once.
      print "$key $n ", ($n > 1 ? $nt/($n-1) : 0), "\n";
    }
  }
' INPUTFILE
a 4 2
b 2 3
c 1 0
d 2 1

Last edited by yazu; 08-09-2011 at 03:59 AM.. Reason: small improvements

yazu

08-09-2011

I just tried the algorithm and it works for the example input file but for my actual file, there are a couple of problems.

The input lines in my original file are of the form:

Code:

1301892853.870    1316    efc0696e        225    1    225    2    225    3    225    4    225    5    225    6    225    7

So for the comparison of duplicates I want to ignore the fields 0, 1 and 2.
How can I adjust your code to this?

I tried changing this part, which uses only a single field as the key:

Code:

 push @{$seen{$F[1]}}, $F[0];

I changed it to:

Code:

push @{$seen{$F[3..16]}}, $F[0];

but it doesn't seem to work. Also, I don't think I quite understand what the code does. Could you please explain?

thanks!
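For context on the one-liner above: `-a` splits each input line on whitespace into the array `@F`, and `push @{$seen{KEY}}, $F[0]` appends the timestamp to a list stored under KEY in the hash `%seen`. Writing `$F[3..16]` asks for a single element, so the index is evaluated in scalar context and does not select fields 3 to 16. An array slice, `@F[3..16]`, joined into one string, does give a usable hash key, for example `push @{ $seen{ join("\t", @F[3..16]) } }, $F[0];`. The script below is only a sketch of that idea under stated assumptions: the helper name `repeat_intervals` is made up, lines with fewer than 17 fields are skipped, and the second and third sample lines are invented for illustration (only the first comes from the thread).

```perl
use strict;
use warnings;

# Hypothetical helper: group timestamps (field 0) by fields 3..16 and return
# rows "key<TAB>count<TAB>mean seconds between repeats", most frequent first.
sub repeat_intervals {
    my ($fh) = @_;
    my %seen;
    while (my $line = <$fh>) {
        my @F = split ' ', $line;            # whitespace split, as perl -a does
        next if @F < 17;                     # assumption: skip short lines
        my $key = join "\t", @F[3 .. 16];    # array slice joined into one key
        push @{ $seen{$key} }, $F[0];
    }
    my @rows;
    for my $key (sort { @{ $seen{$b} } <=> @{ $seen{$a} } || $a cmp $b } keys %seen) {
        my @ts = @{ $seen{$key} };
        # Successive gaps sum to last - first; n occurrences have n-1 gaps.
        my $mean = @ts > 1 ? ($ts[-1] - $ts[0]) / (@ts - 1) : 0;
        push @rows, sprintf("%s\t%d\t%.3f", $key, scalar(@ts), $mean);
    }
    return @rows;
}

# First line is the example from the thread; the other two are invented.
my $sample = <<'END';
1301892853.870 1316 efc0696e 225 1 225 2 225 3 225 4 225 5 225 6 225 7
1301892854.870 1317 efc0696f 225 8 225 9 225 10 225 11 225 12 225 13 225 14
1301892855.870 1318 efc06970 225 1 225 2 225 3 225 4 225 5 225 6 225 7
END

open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
print "$_\n" for repeat_intervals($fh);
close $fh;
```

With this sample, the line ending in "225 7" is reported with a count of 2 and a mean of 2.000 seconds, and the other line with a count of 1 and a mean of 0.000.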

acsg
