Find missed numbers


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Find missed numbers
# 8  
Old 03-01-2013
Quote:
Originally Posted by RudiC
What about
Code:
$ seq 1 1000000 | grep -vwf file

?
That's a nice solution from the point of view of the human who has to read it; it's simple and succinct. However, it's a massively inefficient algorithm. If you have a half-million random numbers out of a million, the number of comparisons will be on the order of 250 billion (500000^2). Yikes!

For kicks, I tried it on an old laptop (PII 350 MHz, 192 MB ram, 256 MB swap) and it blew up. Well, not quite, but, after 43 seconds, grep (GNU grep 2.5.1) was killed by the kernel after having consumed all ram and swap.

For comparison, a diff on both files took 3 seconds and change.

Regards,
Alister

---------- Post updated at 02:48 PM ---------- Previous update was at 02:38 PM ----------

The following should print the missing numbers within the sequence and beyond up to n:
Code:
awk 'function fill(x) {while (NR+i < x) print NR+i++} NR+i < $0 {fill($0)} END {++i; fill(n+1)}' n=1000000 file

Regards,
Alister

Last edited by alister; 03-01-2013 at 05:20 PM..
These 2 Users Gave Thanks to alister For This Post:
# 9  
Old 03-01-2013
Yes, but diff tends to throw human-friendly artifacts and search around a bit, so for data and control, the old sort merge of time immemorial keeps the overhead minimized, and the pipes make it pipeline parallel, two sorts in final merge feeding comm feeding a sort in initial blocking is 4 way parallel.
# 10  
Old 03-01-2013
Quote:
Originally Posted by DGPickett
Yes, but diff tends to throw human-friendly artifacts and search around a bit, so for data and control, the old sort merge of time immemorial keeps the overhead minimized, and the pipes make it pipeline parallel, two sorts in final merge feeding comm feeding a sort in initial blocking is 4 way parallel.
I mentioned diff only to demonstrate that the job could be handled more efficiently than the grep proposal. However, default diff output would be trivial to massage, and more efficient than comm and multiple sorts:
Code:
seq -f %.0f 1000000 | diff file - | sed -n '/^> /s///p'

Regards,
Alister

---------- Post updated at 03:14 PM ---------- Previous update was at 03:05 PM ----------

On the ancient (single core, single threaded) laptop whose specs I mentioned in my first post in this thread:
67 seconds: comm-sort
11 seconds: diff-sed
03 seconds: awk (from post #8)

A multicore machine will no doubt close the gap, but nonetheless, all that sorting for a list of already sorted numbers is unnecessary work.

Regards,
Alister
# 11  
Old 03-01-2013
Well, sort/comm is robust. I consistently look there for 'set' problems, even if I have to number input lines to reassemble them. I agree that some of diff's output formats might be pretty easy to massage, but what if the data gets less trivial?

Did you test with 500K numbers ?
# 12  
Old 03-01-2013
Quote:
Originally Posted by DGPickett
Well, sort/comm is robust. I consistently look there for 'set' problems, even if I have to number input lines to reassemble them. I agree that some of diff's output formats might be pretty easy to massage, but what if the data gets less trivial?

Did you test with 500K numbers ?
The times reported previously reflect a file with ~1M numbers. To be precise, from the sequence 1 .. 1000000, I removed 2 and 999,999.

I also tested with a file that had approximately 500,000 of 1,000,000 numbers. The missing numbers were removed "randomly" using:
Code:
seq -f %.0f 1000000 | tee complete | awk 'int(rand()*10)%2' > partial

Run time increased for the awk and diff-sed approaches. I assume it's due to increased i/o (from 2 lines of output to ~500,000) outweighing savings in reading the input file (which shrank from 1M to ~0.5M lines). Run time for comm-sort decreased. I assume that the sort overhead saved with the much smaller file outweighed the output i/o.

47 seconds: comm -23 <(sort complete) <(sort partial) | sort -n
17 seconds: diff partial complete | sed -n '/^> /s///p'
05 seconds: awk (from post #8)

I agree with you that comm/sort is robust. The behavior is more deterministic. However, I prefer an AWK one-liner for this task. It too is robust. And, at a minor cost in increased complexity, it's much more efficient.

Regards,
Alister
# 13  
Old 03-02-2013
Quote:
Originally Posted by alister
. . . However, it's a massively inefficient algorithm. If you have a half-million random numbers out of a million, the number of comparisons will be on the order of 250 billion (500000^2). Yikes!
. . .
I posted that proposal because it was so easy to read and understand, and assuming the numbers to compare against were sparse. I had some collywobbles/gripes when imagining grep had to run through all numbers in the file for every single input line...
# 14  
Old 03-05-2013
My brain cooked another on the back burner -- fix the comm comparison by padding:
Code:
comm -23 <(
  seq -w 999999
 ) <(
  sed '
    s/^/00000/
    s/0*\([0-9]\{6\}\)/\1/
   ' file
 )


Last edited by DGPickett; 03-05-2013 at 04:04 PM.. Reason: seq printf is weak, -w is simpler
Login or Register to Ask a Question

Previous Thread | Next Thread

9 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Script That Find Numbers Less Than 10

We have an existing script called "slots.sh" that prints a few numbers. I want to have another shell script that looks if there is any number from this output that is less than 10. If it does find a number that is less than 10, it then emails to me the output of "slot. sh", if it is equal to 10 or... (7 Replies)
Discussion started by: forextrafun
7 Replies

2. Programming

RegEx find numbers above 25000

Hello, I am using the Sublime Plugin LogHighlight. I can use RegEx there to highlight some lines in sublime. Now I need to find every line, that has a number of above 25000. the lines look like this: smart_sdl.result: 8947 smart_sdm.result: 8947 smart_sdn.result: 25000 Currently I am... (3 Replies)
Discussion started by: blend_in
3 Replies

3. Linux

Apache folder missed

Hi, I am confused on apache service. Please advice me There is no apache directory in my linux box. Even I cant find out httpd.conf file also. But If browse my Hostname in Internet Explor, I can able to apache service there. Please How it is working .. This file is available in... (1 Reply)
Discussion started by: Mani_apr08
1 Replies

4. Shell Programming and Scripting

AWK regex to find only numbers

Hi guys I need to find both negative and positive numbers from the following text file. And i also dont need 0. 0 8 -7 -2268 007 -07 -00 -0a0 0a0 -07a0 7a00 0a0 Can someone please give a regex to filter out the values in red. I tried a few things in awk but it didnt work... (9 Replies)
Discussion started by: sridanu
9 Replies

5. Shell Programming and Scripting

find the last word with all numbers?

Hi, I was trying to extract the last word with all numbers using awk. like the below example. I am looking for a pattern using awk. desired result: (13 Replies)
Discussion started by: hitmansilentass
13 Replies

6. HP-UX

vgchgid missed one disk

Hi, Suppose I have 3 disks c0t0d0, c0t0d1, and c0t0d2 all with the *same* VGID. I then run: # vgchgid /dev/rdsk/c0t0d0 /dev/rdsk/c0t0d1 (notice I *accidentally* skipped the third disk!) How would I fix this so that all 3 disks have the same VGID again? I'm looking for step-by-step... (7 Replies)
Discussion started by: apra143
7 Replies

7. Shell Programming and Scripting

to find numbers in a string

I writing my script and got stuck in this function. Can someone help me? I need to extract out the numbers inside a string. Ex: INPUT -> OUTPUT abcdef123 -> 123 abc123def -> 123 123abcdef -> 123 a123bc45d -> 123 45 abcdefghi -> -1 Thank you! (12 Replies)
Discussion started by: fongthai
12 Replies

8. Shell Programming and Scripting

to find numbers using awk

suppose u have a file 23 33 44 66 55 88 Another file 49 34 49 90 So to find where these numbers lie between in the first file output shud be while using second file and search for the first file 44 66 - 23 33 44 66 - (1 Reply)
Discussion started by: cdfd123
1 Replies

9. Shell Programming and Scripting

asked question about script before missed ansewr..

linux fedora core2 :) i am trying to write a script to clear, date, pwd and tty a linux termnal or konsole.. when I test the tty against $0 i am, getting a premission denied on the terminal that I am trying to printf to.. I tried using an awk command, test condition, an if then fi clause, but... (6 Replies)
Discussion started by: moxxx68
6 Replies
Login or Register to Ask a Question