Print number of lines for files in directory, also print number of unique lines

07-11-2019


Quote:

Originally Posted by vgersh99

how about this:

Code:

#!/bin/ksh

wc -l * | sed '$d' | sort | while read lines file junk
do
   echo $lines $(sort < $file | uniq -u |wc -l) $file
done

FYI - Just tried this, it printed correct counts but unique counts were off. I will check the others and update.

Quote:

Originally Posted by nezabudka

Code:

awk '{u[$0]; l++} ENDFILE {print length(u), l, FILENAME; delete u; l=0}' * | sort -k1,1n

Thanks nezabudka! This seems to work with gawk. Thanks also to vgersh99 for pointing out gawk; I tried your other gawk version, but the unique counts are still off, as in your original solution. Maybe uniq is not being applied in the correct order?

Quote:

Originally Posted by Don Cragun

Please always tell us what shell and operating system you're using when you start a new thread. Don't assume that everyone who wants to help you has read all of your previous threads.

Code:

#!/bin/bash
tmpf="/tmp/$$.result"

trap 'rm -f "$tmpf"' EXIT

awk '
function dump() {
	print linecount, distinct, lastfile
	linecount = distinct = 0
	split("", lines)
}

FILENAME != lastfile {
	if(lastfile)
		dump()
	lastfile = FILENAME
}

{	linecount++
	if(lines[$0]++ == 0)
		distinct++
}

END {	dump()
}' * > "$tmpf"

echo 'Sorted by increasing number of lines in files:'
sort -n "$tmpf"

echo 'Sorted by increasing number of distinct lines in files:'
sort -k2,2n "$tmpf"

Note that this should work with any version of awk (but on Solaris systems, you'll need to use nawk or /usr/xpg4/bin/awk).

Thanks Don Cragun -- this also works!

Quote:

Originally Posted by MadeInGermany

The following variant correctly handles filenames with special characters:

Code:

for f in *; do printf "%s/%s lines are unique in file %s\n" $(sort "$f" | uniq -u | wc -l) $(wc -l < "$f") "$f"; done

Post #3 has another perception of "unique":

Code:

for f in *; do printf "%s/%s unique lines in file %s\n" $(sort  -u "$f" | wc -l) $(wc -l < "$f") "$f"; done

Didn't see the "sort" requirement. Left as an exercise.

Thanks MadeInGermany. The first gives the same unique count as vgersh99's script; the second works for me. Maybe my perception of unique is incorrect.

I'm getting my unique count by:

Code:

sort filename | uniq | wc -l

The contents of my files are URLs if that makes a difference.
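
One way to see why the two counts can disagree: `uniq -u` prints only lines that occur exactly once, whereas `sort | uniq` (like `sort -u`) prints one copy of every distinct line. A minimal sketch on made-up data; the URLs and the temporary file are placeholders, not data from this thread:

```shell
#!/bin/sh
# Hypothetical sample: 6 lines, 4 distinct, 2 of them occurring only once.
tmp=$(mktemp)
printf '%s\n' \
  'http://a.example/' \
  'http://b.example/' \
  'http://a.example/' \
  'http://c.example/' \
  'http://b.example/' \
  'http://d.example/' > "$tmp"

# tr strips the leading blanks that some wc implementations (e.g. macOS) print.
total=$(wc -l < "$tmp" | tr -d ' ')
distinct=$(sort "$tmp" | uniq | wc -l | tr -d ' ')      # one copy of each distinct line
once_only=$(sort "$tmp" | uniq -u | wc -l | tr -d ' ')  # only lines that are never repeated

echo "total=$total distinct=$distinct once_only=$once_only"
rm -f "$tmp"
```

On this sample the script prints `total=6 distinct=4 once_only=2`, so a file whose lines repeat will always show a smaller count with `uniq -u` than with plain `uniq`.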

Last edited by spacegoose; 07-11-2019 at 07:13 PM..

spacegoose


07-11-2019


Quote:

Originally Posted by spacegoose

FYI - Just tried this, it printed correct counts but unique counts were off. I will check the others and update.

worked just fine with my test harness files quoted previously!

vgersh99


07-11-2019


For the fun of it:

Code:

for FN in *; do { sort "$FN" | tee >(uniq -u | wc -l >&3) | wc -l; echo "$FN"; } 3>&1; done | paste -s -d"\t\t\n" | sort -n

RudiC


07-11-2019


You might note that the suggestion in post #5 in this thread invokes awk (using only standard awk features) once and sort twice, producing both of the requested sorted outputs. Unlike some of the scripts in this thread, it doesn't need multiple invocations of sort or tr per file processed. And, the awk script processes one file at a time, keeping only unique lines from that file (rather than keeping unique lines in memory from all files being processed). When the files being processed contain tens of thousands of input lines and tens of thousands of lines from most of those files are unique, keeping the lines from all files in memory can chew up a lot of system resources.

And, although most of us corrected the use of sort without the n flag when sorting numeric values, none of us said why we did that. (If you use sort without the n flag, the sort performed is an alphanumeric sort; not a numeric sort. So, for example the string 9 is alphanumerically greater than the string 100000 because the leading digit 9 in the first string is greater than the leading digit 1 in the second string. When the n flag is given to sort, it performs a numeric sort instead of an alphanumeric sort for the key fields to which the flag is attached.)
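
The difference between the two sort modes can be checked with a short sketch. The three numbers below are made-up line counts chosen only to show the ordering:

```shell
#!/bin/sh
# Made-up line counts; only the resulting order matters here.
alpha=$(printf '%s\n' 9 100000 25 | sort | tr '\n' ' ')       # compared character by character
numeric=$(printf '%s\n' 9 100000 25 | sort -n | tr '\n' ' ')  # compared by numeric value

echo "sort:    $alpha"
echo "sort -n: $numeric"
```

Without the n flag the result is `100000 25 9`; with it the result is `9 25 100000`.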

Last edited by Don Cragun; 07-11-2019 at 06:55 PM.. Reason: Fix typo: s/sdoesn't/doesn't/

Don Cragun


07-11-2019


Quote:

Originally Posted by vgersh99

how about this:

Code:

#!/bin/ksh

wc -l * | sed '$d' | sort | while read lines file junk
do
   echo $lines $(sort < $file | uniq -u |wc -l) $file
done

Quote:

Originally Posted by vgersh99

worked just fine with my test harness files quoted previously!

With nezabudka's gawk I get:

Code:

gawk '{u[$0]; l++} ENDFILE {print length(u), l, FILENAME; delete u; l=0}' * | sort -k1,1n
4 6 file1
5 7 file2

With yours I get:

Code:

gawk '{l[$0]++} ENDFILE {for (i in l) {if (l[i]==1) u++;t+=l[i]} print t, u, FILENAME; delete l; u=t=0}' *
6 2 file1
7 3 file2

I'm on a Mac running GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.1.2).

Last edited by Scrutinizer; 07-11-2019 at 07:11 PM.. Reason: icode tags -> code tags

spacegoose


07-11-2019


Adaptation of post #6 that uses less memory by keeping only the unique lines of one file at a time (thanks Don):

Code:

awk '
  FNR==1 {
    filenr++
    Name[filenr]=FILENAME
    split("", Seen)
  }

  !Seen[$0]++ {
    Uniq[filenr]++
  } 

  {
    Total[filenr]++
  } 

  END {
    for(i in Name)
      print Total[i], Uniq[i], Name[i]
  }
' file* | sort -nk1,1 -nk2,2 -k3,3

Last edited by Scrutinizer; 07-11-2019 at 07:41 PM..

Scrutinizer


07-11-2019


If you change:

Code:

gawk '{l[$0]++} ENDFILE {for (i in l) {if (l[i]==1) u++;t+=l[i]} print t, u, FILENAME; delete l; u=t=0}' *

to:

Code:

gawk '{l[$0]++} ENDFILE {for (i in l) {u++;t+=l[i]} print t, u, FILENAME; delete l; u=t=0}' *

I think you'll get the results you want. (But, I don't have gawk installed on my system to verify that it works.)

Note that each subscript value represents a unique input line. So, there is no test needed to count the number of unique lines in a file. The test that is currently in that code counts a line only if it appears in the file exactly once.
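
The same point can be illustrated with standard awk features only (no ENDFILE), counting both kinds of "unique" in one pass. This is a minimal sketch on made-up input piped in with printf; the array and variable names are illustrative:

```shell
#!/bin/sh
# Hypothetical input: "a" and "b" occur twice, "c" and "d" occur once.
result=$(printf '%s\n' a b a c b d | awk '
    { count[$0]++ }                 # one subscript per distinct input line
    END {
        for (line in count) {
            distinct++              # every subscript is a distinct line
            if (count[line] == 1)
                once++              # lines that were never repeated
        }
        # "+ 0" forces a numeric 0 if a counter was never incremented
        print NR, distinct + 0, once + 0
    }')

echo "$result"    # total lines, distinct lines, lines occurring exactly once
```

For this input the output is `6 4 2`: dropping the `count[line] == 1` test is what turns the third number into the second.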

Don Cragun

