Print number of lines for files in directory, also print number of unique lines


# 8  
Quote:
Originally Posted by vgersh99
how about this:
Code:
#!/bin/ksh

wc -l * | sed '$d' | sort | while read lines file junk
do
   echo $lines $(sort < $file | uniq -u |wc -l) $file
done

FYI - Just tried this, it printed correct counts but unique counts were off. I will check the others and update.

Quote:
Originally Posted by nezabudka
Code:
awk '{u[$0]; l++} ENDFILE {print length(u), l, FILENAME; delete u; l=0}' * | sort -k1,1n

Thanks nezabudka! This seems to work with gawk. Thanks also to vgersh99 for pointing out gawk; I tried your other gawk one-liner, but the unique counts are still off, as in your original solution. Maybe uniq is not being applied in the correct order?



Quote:
Originally Posted by Don Cragun
Please always tell us what shell and operating system you're using when you start a new thread. Don't assume that everyone who wants to help you has read all of your previous threads.
Code:
#!/bin/bash
tmpf="/tmp/$$.result"

trap 'rm -f "$tmpf"' EXIT

awk '
function dump() {
	print linecount, distinct, lastfile
	linecount = distinct = 0
	split("", lines)
}

FILENAME != lastfile {
	if(lastfile)
		dump()
	lastfile = FILENAME
}

{	linecount++
	if(lines[$0]++ == 0)
		distinct++
}

END {	dump()
}' * > "$tmpf"

echo 'Sorted by increasing number of lines in files:'
sort -n "$tmpf"

echo 'Sorted by increasing number of distinct lines in files:'
sort -k2,2n "$tmpf"

Note that this should work with any version of awk (but on Solaris systems, you'll need to use nawk or /usr/xpg4/bin/awk).

Thanks Don Cragun -- this also works!


Quote:
Originally Posted by MadeInGermany
The following variant correctly handles filenames with special characters:
Code:
for f in *; do printf "%s/%s lines are unique in file %s\n" $(sort "$f" | uniq -u | wc -l) $(wc -l < "$f") "$f"; done

Post #3 has another perception of "unique":
Code:
for f in *; do printf "%s/%s unique lines in file %s\n" $(sort  -u "$f" | wc -l) $(wc -l < "$f") "$f"; done

Didn't see the "sort" requirement. Left as an exercise.
Thanks MadeInGermany. The first gives the same unique count as vgersh99's script; the second works for me. Maybe my perception of "unique" is incorrect :)

I'm getting my unique count by:

Code:
sort filename | uniq | wc -l

The contents of my files are URLs if that makes a difference.

Last edited by spacegoose; 6 Days Ago at 06:13 PM..
# 9  
Quote:
Originally Posted by spacegoose
FYI - Just tried this, it printed correct counts but unique counts were off. I will check the others and update.
worked just fine with my test harness files quoted previously!
# 10  
For the fun of it:

Code:
for FN in *; do { sort "$FN" | tee >(uniq -u | wc -l >&3) | wc -l; echo "$FN"; } 3>&1; done | paste -s -d"\t\t\n" | sort -n

# 11  
You might note that the suggestion in post #5 in this thread invokes awk (using only standard awk features) once and sort twice, producing both of the requested sorted outputs. Unlike some of the scripts in this thread, it doesn't need multiple invocations of sort or tr per file processed. And the awk script processes one file at a time, keeping only the unique lines from that file in memory (rather than keeping unique lines from all of the files being processed). When the files being processed contain tens of thousands of input lines and tens of thousands of lines in most of those files are unique, keeping the lines from every file in memory can chew up a lot of system resources.

And, although most of us corrected the use of sort without the n flag when sorting numeric values, none of us said why we did that. (If you use sort without the n flag, the sort performed is an alphanumeric sort; not a numeric sort. So, for example the string 9 is alphanumerically greater than the string 100000 because the leading digit 9 in the first string is greater than the leading digit 1 in the second string. When the n flag is given to sort, it performs a numeric sort instead of an alphanumeric sort for the key fields to which the flag is attached.)
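
As a quick illustration of that difference, here is a minimal sketch that sorts the same three made-up values both ways. The values and the variable names lexical and numeric are only assumptions for the demo; LC_ALL=C is set so the alphanumeric ordering does not depend on the local collation settings.

```shell
#!/bin/sh
# Three sample values (made up for this demo), one per line.
# paste -s joins the sorted lines into one space-separated line.

# Default sort: compares the strings character by character.
lexical=$(printf '%s\n' 9 100000 25 | LC_ALL=C sort | paste -sd' ' -)

# sort -n: compares the values as numbers.
numeric=$(printf '%s\n' 9 100000 25 | LC_ALL=C sort -n | paste -sd' ' -)

echo "sort:    $lexical"
echo "sort -n: $numeric"
```

Without -n, 100000 sorts first because its leading 1 compares lower than the 2 and the 9; with -n, 9 sorts first because it is the smallest number.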

Last edited by Don Cragun; 6 Days Ago at 05:55 PM.. Reason: Fix typo: s/sdoesn't/doesn't/
# 12  
Quote:
Originally Posted by vgersh99
how about this:
Code:
#!/bin/ksh

wc -l * | sed '$d' | sort | while read lines file junk
do
   echo $lines $(sort < $file | uniq -u |wc -l) $file
done

Quote:
Originally Posted by vgersh99
worked just fine with my test harness files quoted previously!
With nezabudka's gawk I get:

Code:
gawk '{u[$0]; l++} ENDFILE {print length(u), l, FILENAME; delete u; l=0}' * | sort -k1,1n
4 6 file1
5 7 file2

With yours I get:

Code:
gawk '{l[$0]++} ENDFILE {for (i in l) {if (l[i]==1) u++;t+=l[i]} print t, u, FILENAME; delete l; u=t=0}' *
6 2 file1
7 3 file2

I'm on a Mac running GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.1.2).

Last edited by Scrutinizer; 6 Days Ago at 06:11 PM.. Reason: icode tags -> code tags
# 13  
Adaptation of post #6 that uses less memory by keeping only the unique lines of one file at a time (thanks Don):

Code:
awk '
  FNR==1 {
    filenr++
    Name[filenr]=FILENAME
    split("", Seen)
  }

  !Seen[$0]++ {
    Uniq[filenr]++
  } 

  {
    Total[filenr]++
  } 

  END {
    for(i in Name)
      print Total[i], Uniq[i], Name[i]
  }
' file* | sort -nk1,1 -nk2,2 -k3,3


Last edited by Scrutinizer; 6 Days Ago at 06:41 PM..
# 14  
If you change:
Code:
gawk '{l[$0]++} ENDFILE {for (i in l) {if (l[i]==1) u++;t+=l[i]} print t, u, FILENAME; delete l; u=t=0}' *

to:
Code:
gawk '{l[$0]++} ENDFILE {for (i in l) {u++;t+=l[i]} print t, u, FILENAME; delete l; u=t=0}' *

I think you'll get the results you want. (But, I don't have gawk installed on my system to verify that it works.)

Note that each subscript value represents a unique input line. So, there is no test needed to count the number of unique lines in a file. The test that is currently in that code is only counting unique lines if they only appear in the file once.
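
To make the two meanings of "unique" concrete, here is a small sketch using only sort, uniq and wc on a made-up six-line sample. The sample contents and the variable names total, distinct and once are assumptions for illustration; they are not the files from this thread.

```shell
#!/bin/sh
# Hypothetical sample: "a" and "d" appear twice, "b" and "c" appear once.
tmp=$(mktemp)
printf '%s\n' a b a c d d > "$tmp"

# $(( ... )) strips the leading blanks that some wc versions print.
total=$(( $(wc -l < "$tmp") ))                     # all lines
distinct=$(( $(sort -u "$tmp" | wc -l) ))          # each different line counted once
once=$(( $(sort "$tmp" | uniq -u | wc -l) ))       # only lines that are never repeated

echo "total=$total distinct=$distinct once=$once"
rm -f "$tmp"
```

Here sort -u (equivalent to sort | uniq) counts every different line once, which is what the changed gawk command counts, while uniq -u drops every line that has a duplicate, which is what the l[i]==1 test was counting.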