Find missed numbers

03-01-2013

Registered User

Join Date: Dec 2009

Last Activity: 11 June 2014, 8:40 PM EDT

Posts: 3,231

Thanks Given: 179

Thanked 978 Times in 791 Posts

Quote:

Originally Posted by RudiC

What about

Code:

$ seq 1 1000000 | grep -vwf file

?

That's a nice solution from the point of view of the human who has to read it; it's simple and succinct. However, it's a massively inefficient algorithm. If you have a half-million random numbers out of a million, the number of comparisons will be on the order of 250 billion (500000^2). Yikes!

For kicks, I tried it on an old laptop (PII 350 MHz, 192 MB RAM, 256 MB swap) and it blew up. Well, not quite, but after 43 seconds grep (GNU grep 2.5.1) was killed by the kernel, having consumed all RAM and swap.

For comparison, a diff on both files took 3 seconds and change.

Regards,
Alister

---------- Post updated at 02:48 PM ---------- Previous update was at 02:38 PM ----------

The following should print the numbers missing from within the sequence, plus any missing between the last number in the file and n:

Code:

awk 'function fill(x) {while (NR+i < x) print NR+i++} NR+i < $0 {fill($0)} END {++i; fill(n+1)}' n=1000000 file
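
For anyone puzzling over the one-liner, here is a sketch of the same logic spread out with comments and run on a tiny sample. The function name find_missing, the sample values and the temp file are only illustrative, and it assumes the input is sorted ascending with one integer per line and no duplicates.

```shell
# Expanded form of the one-liner above, run on a tiny sample (1 2 5 7, n=10).
# Assumption: input is sorted ascending, one integer per line, no duplicates.
find_missing() {
  # $1 = upper bound n, $2 = input file (both names are illustrative)
  awk '
    # NR + i is the next number expected; i counts the gaps printed so far.
    function fill(x) {
      while (NR + i < x) {
        print NR + i
        i++
      }
    }
    NR + i < $0 { fill($0) }    # the current value is ahead of the expected one
    END { ++i; fill(n + 1) }    # print the tail, up to and including n
  ' n="$1" "$2"
}

sample=$(mktemp)
printf '%s\n' 1 2 5 7 > "$sample"
find_missing 10 "$sample"
rm -f "$sample"
```

On the sample 1 2 5 7 with n=10 this prints 3, 4, 6, 8, 9 and 10, one per line. The file is read once and nothing is stored, which is why memory use stays flat.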

Regards,
Alister

Last edited by alister; 03-01-2013 at 05:20 PM..

alister

03-01-2013

Registered User

Join Date: Oct 2010

Last Activity: 1 February 2016, 3:35 PM EST

Location: Southern NJ, USA (Nord)

Posts: 4,673

Thanks Given: 8

Thanked 588 Times in 561 Posts

Yes, but diff tends to throw human-friendly artifacts and search around a bit. So for data and control, the old sort-merge of time immemorial keeps the overhead minimized, and the pipes make it pipeline parallel: two sorts in their final merge phase feed comm, which feeds a sort in its initial blocking phase, so four processes run in parallel.

DGPickett

03-01-2013

Quote:

Originally Posted by DGPickett

Yes, but diff tends to throw human-friendly artifacts and search around a bit. So for data and control, the old sort-merge of time immemorial keeps the overhead minimized, and the pipes make it pipeline parallel: two sorts in their final merge phase feed comm, which feeds a sort in its initial blocking phase, so four processes run in parallel.

I mentioned diff only to demonstrate that the job could be handled more efficiently than the grep proposal. However, default diff output would be trivial to massage, and more efficient than comm and multiple sorts:

Code:

seq -f %.0f 1000000 | diff file - | sed -n '/^> /s///p'
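
A small sketch of the same diff/sed pipeline wrapped in a function, to show what it keeps from diff's output. The name missing_by_diff and the sample data are made up for illustration, and it assumes seq, diff and sed behave as they do on a typical GNU system.

```shell
# diff/sed approach on a tiny sample; names and data are illustrative.
missing_by_diff() {
  # $1 = upper bound n, $2 = sorted input file
  # diff marks lines found only in the full sequence (stdin) with "> ";
  # sed keeps just those lines and strips the marker.
  seq -f %.0f "$1" | diff "$2" - | sed -n '/^> /s///p'
}

sample=$(mktemp)
printf '%s\n' 1 2 5 7 > "$sample"
missing_by_diff 10 "$sample"
rm -f "$sample"
```

The empty pattern in s/// reuses the /^> / address, so only the marker is removed. diff's hunk headers (such as 2a3,4) never match and are dropped by -n.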

Regards,
Alister

---------- Post updated at 03:14 PM ---------- Previous update was at 03:05 PM ----------

On the ancient (single core, single threaded) laptop whose specs I mentioned in my first post in this thread:
67 seconds: comm-sort
11 seconds: diff-sed
03 seconds: awk (from post #8)

A multicore machine will no doubt close the gap, but nonetheless, all that sorting for a list of already sorted numbers is unnecessary work.

Regards,
Alister

alister

03-01-2013

Well, sort/comm is robust. I consistently look there for 'set' problems, even if I have to number input lines to reassemble them. I agree that some of diff's output formats might be pretty easy to massage, but what if the data gets less trivial?

Did you test with 500K numbers?

DGPickett

03-01-2013

Quote:

Originally Posted by DGPickett

Well, sort/comm is robust. I consistently look there for 'set' problems, even if I have to number input lines to reassemble them. I agree that some of diff's output formats might be pretty easy to massage, but what if the data gets less trivial?

Did you test with 500K numbers?

The times reported previously reflect a file with ~1M numbers. To be precise, from the sequence 1 .. 1000000, I removed 2 and 999,999.

I also tested with a file that had approximately 500,000 of 1,000,000 numbers. The missing numbers were removed "randomly" using:

Code:

seq -f %.0f 1000000 | tee complete | awk 'int(rand()*10)%2' > partial

Run time increased for the awk and diff-sed approaches. I assume that's due to the increased I/O (from 2 lines of output to ~500,000) outweighing the savings in reading the input file (which shrank from 1M to ~0.5M lines). Run time for comm-sort decreased. I assume that the sort overhead saved with the much smaller file outweighed the added output I/O.

47 seconds: comm -23 <(sort complete) <(sort partial) | sort -n
17 seconds: diff partial complete | sed -n '/^> /s///p'
05 seconds: awk (from post #8)
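
For reference, a sketch of the comm/sort pipeline written with temp files instead of process substitution, so it also runs in a plain POSIX sh. The function name missing_by_comm and the sample files are illustrative only.

```shell
# comm/sort approach without process substitution; names are illustrative.
missing_by_comm() {
  # $1 = file holding the complete sequence, $2 = file with gaps
  a=$(mktemp) b=$(mktemp)
  sort "$1" > "$a"               # comm needs lexically sorted input
  sort "$2" > "$b"
  comm -23 "$a" "$b" | sort -n   # lines only in the first file, back in numeric order
  rm -f "$a" "$b"
}

complete=$(mktemp) partial=$(mktemp)
seq 1 12 > "$complete"
printf '%s\n' 1 2 5 7 10 12 > "$partial"
missing_by_comm "$complete" "$partial"
rm -f "$complete" "$partial"
```

The two lexical sorts plus the final numeric sort are the extra work the timings above reflect; the input here was already in numeric order, just not in the order comm expects.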

I agree with you that comm/sort is robust. The behavior is more deterministic. However, I prefer an AWK one-liner for this task. It too is robust. And, at a minor cost in increased complexity, it's much more efficient.

Regards,
Alister

alister

03-02-2013

Registered User

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

Quote:

Originally Posted by alister

. . . However, it's a massively inefficient algorithm. If you have a half-million random numbers out of a million, the number of comparisons will be on the order of 250 billion (500000^2). Yikes!
. . .

I posted that proposal because it was so easy to read and understand, and on the assumption that the numbers to compare against were sparse. I did have some collywobbles/gripes when imagining grep having to run through all the numbers in the file for every single input line...

RudiC

03-05-2013

My brain cooked up another one on the back burner -- fix the comm comparison by zero-padding the numbers, so that lexical order matches numeric order and no sort is needed:

Code:

comm -23 <(
  seq -w 999999
 ) <(
  sed '
    s/^/00000/
    s/0*\([0-9]\{6\}\)/\1/
   ' file
 )
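
To see what the padding step does on its own, here is a sketch that wraps the same two sed commands in a function (pad6 is a made-up name) and feeds it a few sample values. It assumes every number has at most six digits, as in the command above.

```shell
# The padding step in isolation; pad6 is an illustrative name.
pad6() {
  # Prepend five zeros, then keep only the last six digits of the run.
  sed '
    s/^/00000/
    s/0*\([0-9]\{6\}\)/\1/
  '
}

printf '%s\n' 7 42 999999 | pad6

# Unpadded, "10" sorts before "9" as a string; padded, numeric order holds.
printf '%s\n' 9 10 | pad6 | sort -c && echo "padded input is in sort order"
```

The first command prints 000007, 000042 and 999999. Note that the output of the comm pipeline is zero-padded as well, so it may need the zeros stripped again if plain numbers are wanted.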

Last edited by DGPickett; 03-05-2013 at 04:04 PM.. Reason: seq printf is weak, -w is simpler

DGPickett
