Remove lines that are subsets of other lines in File


 
# 1  
Old 04-22-2015

Hello everyone,


Although it seems easy, I've been stuck on this problem for a while now and I can't figure out a way to get it done.

My problem is the following:

I have a file where each line is a sequence of IP addresses, example :

Code:
10.0.0.1 10.0.0.2 
10.0.0.5 10.0.0.1 10.0.0.2
...

What I'd like to do, is to remove lines that are completely matched in other lines. In the previous example, "Line 1" would be deleted as it is contained in "Line 2".

So far, I've worked with python and set() objects to get the job done, but I've got more than 100K lines and the set lookups become increasingly time-consuming as the program runs :/

Thanks for your help

# 2  
Old 04-22-2015
Testing / set lookups WILL become increasingly lengthy with more and more data sets. If you already have a working solution, it might be very challenging to squeeze even a few percent of performance out of it.
# 3  
Old 04-22-2015
Hello RudiC,


Well, I managed to trim down the dataset with "sort -u input > output", but that only removes exact duplicates. Even on this filtered dataset, running my script will take ages :/

I presume 'sed' could help me, but I can't figure out what regex I should feed it.
# 4  
Old 04-22-2015
Assuming that the 1st line from:
Code:
10.0.0.2 10.0.0.1
10.0.0.5 10.0.0.1 10.0.0.2

should also be removed as a subset of the 2nd line and that the 1st line from:
Code:
10.0.0.2
10.0.0.21

should not be removed as a subset of the 2nd line; I can't see how sed would be a good tool for attacking this problem. The periods in IP addresses also need to be handled specially in sed since they have a special meaning in regular expressions. At first glance, awk would seem to me to be much more suited for this problem than sed. I haven't done enough with python to compare awk and python for a problem like this.

And, while sort -u will get rid of duplicate lines, it won't reduce the problem you need to solve with input like:
Code:
10.0.0.5 10.0.0.1 10.0.0.2
10.0.0.5 10.0.0.2 10.0.0.1
10.0.0.1 10.0.0.5 10.0.0.2
10.0.0.1 10.0.0.2 10.0.0.5
10.0.0.2 10.0.0.1 10.0.0.5
10.0.0.2 10.0.0.5 10.0.0.1
10.0.0.1 10.0.0.2
10.0.0.1 10.0.0.5
10.0.0.2 10.0.0.1
10.0.0.2 10.0.0.5
10.0.0.5 10.0.0.1
10.0.0.5 10.0.0.2
10.0.0.1
10.0.0.2
10.0.0.5

which I assume should be reduced to a single output line that matches one of the 1st six lines above.

But, if you have a lot of duplicate lines in your input, using sort -u as a preprocessing step may reduce your overall run time. Without a much deeper understanding of how many duplicates there are and how you process lines while looking for duplicates, we would just be making wild guesses as to whether using sort -u would increase or decrease run times.

Preprocessing your input so that all of the IP addresses in each line are in sorted order might also be an advantage. Whether it would help again depends on what your input looks like and how you are looking for duplicates. It might also speed up processing if your input was sorted with the longest sequences coming first.
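
That preprocessing idea could be sketched like this (the input file name ips.txt is a placeholder): sort the IPs within each line, prefix each line with its IP count, order the longest lines first, then drop the count and any now-identical duplicates.

```shell
# Sketch only: canonicalize each line, then order longest-first.
# "ips.txt" is a placeholder input file name.
awk '{
    n = split($0, a, " ")
    for (i = 2; i <= n; i++) {          # insertion sort of the IPs on this line
        v = a[i]
        for (j = i - 1; j >= 1 && a[j] > v; j--)
            a[j+1] = a[j]
        a[j+1] = v
    }
    line = a[1]
    for (i = 2; i <= n; i++)
        line = line " " a[i]
    print n, line                       # prefix the IP count for sorting
}' ips.txt | sort -k1,1nr | cut -d" " -f2- | awk '!seen[$0]++'
```

With the IPs in a fixed order, permutations of the same set collapse to a single line, and a later subset-removal pass can stop comparing as soon as it reaches lines no longer than the current one.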

Is this a one-time problem; or is this something you're going to be doing on a regular schedule? Getting a correct result is crucial. Getting a correct result as quickly as possible might not be important if you only need to do this once.

If you show us your code, we have a MUCH better chance of helping you correct or optimize what you're doing.

Giving us more details about your input could also make a huge difference in the way some of us would approach a problem like this. What are the minimum and maximum number of IP addresses in your sequences? Do all of the IP addresses in a single sequence have the same 1st component (i.e. 10.0.0.1, 10.0.0.2, and 10.0.0.5)? Everything you know about your data could provide hints that could optimize code used to solve this problem.

You have more than 100K input sequences. How much "more than"? What OS are you using? What hardware are you using? How much memory do you have?
# 5  
Old 04-22-2015
Given this input file:

Code:
10.0.0.1 10.0.0.2 10.0.0.3
10.0.0.3 10.0.0.4
10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4 10.0.0.5 10.0.0.6
10.0.0.1 10.0.0.2
10.0.0.4 10.0.0.5 10.0.0.6
10.0.0.5 10.0.0.6

Is solution 1 below acceptable, or must it be reduced further, as in solution 2?

Code:
Solution 1
10.0.0.1 10.0.0.2 10.0.0.3
10.0.0.3 10.0.0.4
10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4 10.0.0.5 10.0.0.6

Solution 2
10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4 10.0.0.5 10.0.0.6
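
For what it's worth, solution 1 can be produced by a naive quadratic pass. This is only a sketch to pin down the expected output; with 100K+ lines it would be far too slow, and ips.txt is a placeholder input file name.

```shell
# Naive exact subset filter, O(n^2) line comparisons -- a sketch only.
# A line is dropped iff all of its IPs appear in some strictly longer line.
awk '
!dup[$0]++ { lines[++n] = $0 }          # drop exact duplicate lines
END {
    for (i = 1; i <= n; i++) {
        na = split(lines[i], a)
        subset = 0
        for (j = 1; j <= n && !subset; j++) {
            if (i == j) continue
            nb = split(lines[j], b)
            if (nb <= na) continue      # a strict superset must be longer
            split("", inb)              # portable way to clear an array
            for (k = 1; k <= nb; k++) inb[b[k]] = 1
            subset = 1
            for (k = 1; k <= na; k++)
                if (!(a[k] in inb)) { subset = 0; break }
        }
        if (!subset) print lines[i]
    }
}' ips.txt
```

Note that equal-length permutations of the same IP set both survive this filter; canonicalizing the IP order within each line beforehand would merge them.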

# 6  
Old 04-22-2015
This solution processes the lines with the most IPs first; although not optimal, it should give noticeably more concise output than just processing the file top to bottom.

I tried it on a data file with 170K lines and it took approx 3 seconds to process.

Code:
awk '
# have(ln) returns 1 if every IP in line ln has been seen before;
# any new IPs are added to havelist as a side effect.
# ip, num, and ret are local variables (extra parameters, per awk convention).
function have(ln,ip,num,ret) {
  ret=1
  num=split(ln,ips)
  for(ip=1;ip<=num;ip++)
     if(!(ips[ip] in havelist)) {
         havelist[ips[ip]]   # record this IP as seen
         unique++
         ret=0
      }
  return ret
}
# Store each line, index line numbers by field count, track the longest line.
{ L[NR]=$0; D[NF]=D[NF] " " NR; max=NF>max?NF:max }
END {
   # Process longest lines first; print a line only if it contributes
   # at least one IP not already covered by a previously printed line.
   for(c=max;c;c--)
      if(D[c]) {
         lns=split(D[c],v)
         for(i=1;i<=lns;i++)
           if(!have(L[v[i]])) print L[v[i]]
      }
   printf "Data file contains %'\''d unique IPs\n", unique > "/dev/stderr"
}' infile


For anyone who wants to test possible solutions, this is the script I used to make my test file:

Code:
for ((i=0;i<170000;i++))
do
    printf "10.0.%d.%d" $((RANDOM%256)) $((RANDOM%256))
    ips=$((RANDOM%8))
    while ((--ips > 0))
    do
       printf " 10.0.%d.%d" $((RANDOM%256)) $((RANDOM%256))
    done
    printf "\n"
done

