I've got a Postgres database with nearly 200,000 entries of a network address type.
I'd like to detect whether any subnets overlap each other, for example detect 123.0.0.0/16, 123.2.0.0/24 and 123.3.4.128/30, and report them.
I'm already using a lot of Python scripts and the netaddr library.
Considering the number of entries, what would be the best approach/algorithm to detect overlaps?
I'm pretty sure there's a better way than comparing each entry to the whole database.
I think the following should be a fairly efficient approach:
import netaddr
import bisect
def subnets_overlap(subnets):
    # ranges will be a sorted list of alternating start and end addresses
    ranges = []
    for subnet in subnets:
        # find indices to insert start and end addresses
        first = bisect.bisect_left(ranges, subnet.first)
        last = bisect.bisect_right(ranges, subnet.last)
        # check the overlap conditions and return if one is met
        if first != last or first % 2 == 1:
            return True
        ranges[first:first] = [subnet.first, subnet.last]
    return False
Examples:
>>> subnets_overlap([netaddr.IPNetwork('1.0.0.0/24'), netaddr.IPNetwork('1.0.0.252/30')])
True
>>> subnets_overlap([netaddr.IPNetwork('1.0.0.0/24'), netaddr.IPNetwork('1.0.1.0/24')])
False
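If you need to report which entries overlap rather than just whether any do, a sort-and-sweep pass is one option. The following is only a sketch under a few assumptions: it uses the standard-library ipaddress module instead of netaddr, it expects IPv4 CIDR strings, and find_overlaps is a name made up for illustration. It relies on the fact that two CIDR blocks are either disjoint or nested, so after sorting by start address (wider block first) a stack of the currently "open" blocks is enough:

```python
import ipaddress

def find_overlaps(cidrs):
    """Return (enclosing, enclosed) pairs for every subnet that sits inside another.

    Hypothetical helper: assumes all entries are IPv4 CIDR strings.
    """
    nets = sorted(
        (ipaddress.ip_network(c) for c in cidrs),
        # same start address: the wider block (smaller prefix length) goes first
        key=lambda n: (int(n.network_address), n.prefixlen),
    )
    stack = []      # blocks that may still enclose the next subnet
    overlaps = []
    for net in nets:
        # drop blocks that end before this subnet starts
        while stack and stack[-1].broadcast_address < net.network_address:
            stack.pop()
        if stack:
            # the innermost remaining block encloses (or equals) this subnet
            overlaps.append((str(stack[-1]), str(net)))
        stack.append(net)
    return overlaps

pairs = find_overlaps(["10.1.2.0/24", "192.168.0.0/24", "10.0.0.0/8", "10.1.0.0/16"])
print(pairs)
```

Each subnet is paired only with its nearest enclosing block, and the sort dominates the cost, so the whole pass is O(n log n) instead of comparing every entry with every other one.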
import sys
from pprint import pprint
from netaddr import IPNetwork

def cidrsOverlap(cidr0):
    subnets_list = [IPNetwork('123.0.0.0/16'),
                    IPNetwork('123.2.0.0/24'),
                    IPNetwork('123.132.0.0/20'),
                    IPNetwork('123.142.0.0/20')]
    matching_subnets = []
    for subnet in subnets_list:
        # two ranges overlap when each one starts before the other one ends
        if subnet.first <= cidr0.last and subnet.last >= cidr0.first:
            matching_subnets.append(subnet)
    print("Matching subnets for given %s are %s" % (cidr0, matching_subnets))
    pprint(subnets_list)

cidrsOverlap(IPNetwork(sys.argv[1]))
From the ipaddress module, I know the address_exclude method. Below is an example from the documentation:
>>> n1 = ip_network('192.0.2.0/28')
>>> n2 = ip_network('192.0.2.1/32')
>>> list(n1.address_exclude(n2))
[IPv4Network('192.0.2.8/29'), IPv4Network('192.0.2.4/30'),
IPv4Network('192.0.2.2/31'), IPv4Network('192.0.2.0/32')]
But what if I want to remove two or more subnets from a network? For example, how can I remove the subnets 192.168.10.24/29 and 192.168.10.48/28 from 192.168.10.0/26? The result should be 192.168.10.0/28, 192.168.10.16/29 and 192.168.10.32/28.
I'm trying to express the algorithm I have in my head using the address_exclude method, but I can't. Is there a simple way to implement what I just explained?
When you exclude one network from another, the result can be multiple networks (original one got split) - so, for the rest of the networks to exclude, you need to first find which part they would fit into before excluding them as well.
Here's one possible solution:
from ipaddress import ip_network, collapse_addresses
complete = ip_network('192.168.10.0/26')
# I chose the larger subnet for exclusion first, can be automated with network comparison
subnets = list(complete.address_exclude(ip_network('192.168.10.48/28')))
# other network to exclude
other_exclude = ip_network('192.168.10.24/29')
result = []
# Find which subnet the other exclusion will happen in
for sub in subnets:
    # If found, exclude & add the result
    if other_exclude.subnet_of(sub):
        result.extend(list(sub.address_exclude(other_exclude)))
    else:
        # Other subnets can be added directly
        result.append(sub)
# Collapse in case of overlaps
print(list(collapse_addresses(result)))
Output:
[IPv4Network('192.168.10.0/28'), IPv4Network('192.168.10.16/29'), IPv4Network('192.168.10.32/28')]
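The same idea can be generalized to any number of exclusions by keeping a running list of the pieces that are still left. A minimal sketch, assuming Python 3.7+ (for subnet_of); exclude_all is a hypothetical helper name, not something the ipaddress module provides:

```python
from ipaddress import ip_network, collapse_addresses

def exclude_all(network, excluded):
    """Remove every network in `excluded` from `network` (illustrative helper)."""
    remaining = [network]
    for ex in excluded:
        next_remaining = []
        for piece in remaining:
            if ex.subnet_of(piece):
                # the exclusion falls inside this piece: split the piece around it
                next_remaining.extend(piece.address_exclude(ex))
            elif piece.subnet_of(ex):
                # the piece is completely covered by the exclusion: drop it
                continue
            else:
                # disjoint: keep the piece untouched
                next_remaining.append(piece)
        remaining = next_remaining
    # collapse_addresses merges adjacent pieces and yields them in sorted order
    return list(collapse_addresses(remaining))

left = exclude_all(
    ip_network("192.168.10.0/26"),
    [ip_network("192.168.10.24/29"), ip_network("192.168.10.48/28")],
)
print(left)
```

Because CIDR blocks never overlap partially, the three branches (inside, covering, disjoint) cover every case, so the order of the exclusions should not matter.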
Expanding on the idea I posted under @rdas's response, here is my solution.
It seems better to split the initial network into the smallest chunks you are asking for, and to do the same for all ranges to be removed. Then exclude them from the list and return the result.
from ipaddress import ip_network, collapse_addresses
def remove_ranges(mynetwork,l_of_ranges):
    # find smallest chunk (longest prefix); compare as ints, since '8' sorts after '16' as a string
    new_prefix = max(int(x.split('/')[1]) for x in l_of_ranges)
    l_mynetwork = list(ip_network(mynetwork).subnets(new_prefix=new_prefix))
    l_chunked_ranges = [ ]
    for nw in l_of_ranges:
        l_chunked_ranges.extend(list(ip_network(nw).subnets(new_prefix=new_prefix)))
    #l_removed_networks = [ ]
    #for mynw in l_mynetwork:
    #    if not mynw in l_chunked_ranges:
    #        l_removed_networks.append(mynw)
    #result = list(collapse_addresses(l_removed_networks))
    result = list(collapse_addresses(set(l_mynetwork) - set(l_chunked_ranges)))
    return [str(r) for r in result]

if __name__ == '__main__':
    mynetwork = "10.110.0.0/16"
    l_of_ranges = ["10.110.0.0/18","10.110.72.0/21","10.110.80.0/21","10.110.96.0/21"]
    print(f"My network: {mynetwork}, Existing: {l_of_ranges} ")
    a = remove_ranges(mynetwork,l_of_ranges)
    print(f"Remaining: {a}")
With the result:
My network: 10.110.0.0/16, Existing: ['10.110.0.0/18', '10.110.72.0/21', '10.110.80.0/21', '10.110.96.0/21']
Remaining: ['10.110.64.0/21', '10.110.88.0/21', '10.110.104.0/21', '10.110.112.0/20', '10.110.128.0/17']
Which seems to be valid.
I am cleaning a data set with fraudulent email addresses that I am removing.
I established multiple rules for catching duplicates and fraudulent domains. But there is one scenario where I can't think of how to code a rule in Python to flag them.
So I have for example rules like this:
#delete punctuation
df['email'].apply(lambda x:''.join([i for i in x if i not in string.punctuation]))
#flag yopmail
pattern = "yopmail"
match = df['email'].str.contains(pattern)
df['yopmail'] = np.where(match, 'Y', '0')
#flag duplicates
df['duplicate']=df.email.duplicated(keep=False)
This is the data for which I can't figure out a rule. Basically, I am looking for a way to flag addresses that start the same way but then have consecutive numbers at the end.
abc7020#gmail.com
abc7020.1#gmail.com
abc7020.10#gmail.com
abc7020.11#gmail.com
abc7020.12#gmail.com
abc7020.13#gmail.com
abc7020.14#gmail.com
abc7020.15#gmail.com
attn1#gmail.com
attn12#gmail.com
attn123#gmail.com
attn1234#gmail.com
attn12345#gmail.com
attn123456#gmail.com
attn1234567#gmail.com
attn12345678#gmail.com
My solution isn't efficient, nor pretty. But check it out and see if it works for you, @jeangelj. It definitely works for the examples you provided. Good luck!
import os
from random import shuffle
from difflib import SequenceMatcher

emails = [... ...] # for example the 16 email addresses you gave in your question
shuffle(emails) # everyday i'm shuffling
emails = sorted(emails) # sort them!
names = [email.split('#')[0] for email in emails]

T = 0.7 # <- set your string similarity threshold here!!

split_indices=[]
for i in range(1,len(emails)):
    if SequenceMatcher(None, emails[i], emails[i-1]).ratio() < T:
        split_indices.append(i) # we want to remember where dissimilar email address occurs

# slice between consecutive split points, so each group holds only its own run of addresses
grouped=[]
start = 0
for i in split_indices:
    grouped.append(emails[start:i])
    start = i
grouped.append(emails[start:])

# now we have similar email addresses grouped, we want to find the common prefix for each group
prefix_strings=[]
for group in grouped:
    prefix_strings.append(os.path.commonprefix(group))

# finally
ham=[]
spam=[]
true_ids = [names.index(p) for p in prefix_strings]
for i in range(len(emails)):
    if i in true_ids:
        ham.append(emails[i])
    else:
        spam.append(emails[i])
In [30]: ham
Out[30]: ['abc7020#gmail.com', 'attn1#gmail.com']
In [31]: spam
Out[31]:
['abc7020.10#gmail.com',
'abc7020.11#gmail.com',
'abc7020.12#gmail.com',
'abc7020.13#gmail.com',
'abc7020.14#gmail.com',
'abc7020.15#gmail.com',
'abc7020.1#gmail.com',
'attn12345678#gmail.com',
'attn1234567#gmail.com',
'attn123456#gmail.com',
'attn12345#gmail.com',
'attn1234#gmail.com',
'attn123#gmail.com',
'attn12#gmail.com']
# THE TRUTH YALL!
You can use a regular expression to do this; example below:
import re
a = "attn12345#gmail.comf"
b = "abc7020.14#gmail.com"
c = "abc7020#gmail.com"
d = "attn12345678#gmail.com"
# raw string, so that \. reaches the regex engine without an invalid-escape warning
pattern = re.compile(r"[0-9]{3,500}\.?[0-9]{0,500}?#")
if pattern.search(a):
    print("spam1")
if pattern.search(b):
    print("spam2")
if pattern.search(c):
    print("spam3")
if pattern.search(d):
    print("spam4")
If you run the code you will see:
$ python spam.py
spam1
spam2
spam3
spam4
The benefit of this method is that it's standardized (regular expressions) and that you can easily adjust the strength of the match by changing the values within {}, which means you can have a global configuration file where you set and adjust the values. You can also adjust the regular expression easily without having to rewrite code.
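A regular expression can also be used for the grouping itself, instead of only matching digit runs. The sketch below is one possible take, not a tested rule set: it strips trailing digits and dots from the part before the separator and flags every address whose stem and domain occur more than once. The function name, the sep and min_group parameters are made up for illustration, and the separator defaults to '#' only because that is how the sample data in this thread is written.

```python
import re
from collections import defaultdict

def flag_numbered_variants(emails, sep="#", min_group=2):
    """Flag addresses that share a stem once trailing digits/dots are removed."""
    groups = defaultdict(list)
    for email in emails:
        local, _, domain = email.partition(sep)
        # 'abc7020.14' -> 'abc', 'attn123' -> 'attn'
        stem = re.sub(r"[.\d]+$", "", local)
        groups[(stem, domain)].append(email)
    return {
        email
        for members in groups.values()
        if len(members) >= min_group
        for email in members
    }

sample = [
    "abc7020#gmail.com", "abc7020.1#gmail.com",
    "attn1#gmail.com", "attn12#gmail.com",
    "bob#gmail.com", "alice99#gmail.com",
]
flagged = flag_numbered_variants(sample)
print(sorted(flagged))
```

Raising min_group makes the rule stricter, since a single pair like name1/name2 can easily be two legitimate users.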
First, take a look at the regexp question here.
Second, try filtering the email addresses like this:
# Let's say the email is 'attn1234#gmail.com'
email = 'attn1234#gmail.com'
# split on the separator between the name and the domain
email_name = email.split('#', maxsplit=1)[0]
# Here you get email_name = 'attn1234'
import re
m = re.search(r'\d+$', email_name)
# if the string ends in digits m will be a Match object, or None otherwise.
if m is not None:
    print ('%s is good' % email)
else:
    print ('%s is BAD' % email)
You could pick a diff threshold using edit distance (aka Levenshtein distance). In python:
$ pip install editdistance
$ ipython2
>>> import editdistance
>>> threshold = 5 # This could be anything, really
>>> data = ["attn1#gmail.com...", ...]# set up data to be the set you gave
>>> # compare each address with every *other* address; compared with itself the distance is always 0
>>> fraudulent_emails = set([email for email in data for other in data if email != other and editdistance.eval(email, other) < threshold])
If you wanted to be smarter about it, you could run through the resulting list and, instead of turning it into a set, keep track of how many other email addresses each one was near, then use that as a 'weight' to determine fake-ness.
This catches not only the given cases (where the fraudulent addresses all share a common start and differ only in a numerical suffix), but also number or letter padding, e.g. at the beginning or in the middle of an email address.
import numpy as np

ids = [s.split('#')[0] for s in email_list]
# plain bool as dtype works across NumPy versions (np.bool is not available in all of them)
det = np.zeros((len(ids), len(ids)), dtype=bool)
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        mi = ids[i]
        mj = ids[j]
        # mj is mi plus exactly one extra trailing character
        if len(mj) == len(mi) + 1 and mj.startswith(mi):
            try:
                int(mj[-1])
                det[j,i] = True
                det[i,j] = True
            except ValueError:
                continue
spam_indices = np.where(np.sum(det, axis=0) != 0)[0].tolist()
Here's one way to approach it, that should be pretty efficient.
We do it by grouping the email addresses by length, so that we only need to check whether each email address matches one from the level below, using a slice and a set membership check.
The code:
First, read in the data:
import pandas as pd
import numpy as np
string = '''
abc7020#gmail.com
abc7020.1#gmail.com
abc7020.10#gmail.com
abc7020.11#gmail.com
abc7020.12#gmail.com
abc7020.13#gmail.com
abc7020.14#gmail.com
abc7020.15#gmail.com
attn1#gmail.com
attn12#gmail.com
attn123#gmail.com
attn1234#gmail.com
attn12345#gmail.com
attn123456#gmail.com
attn1234567#gmail.com
attn12345678#gmail.com
foo123#bar.com
foo1#bar.com
'''
x = pd.DataFrame({'x':string.split()})
#remove duplicates:
x = x[~x.x.duplicated()]
We strip off the #foo.bar part, then filter to only those that end with a number, and add a 'lengths' column:
#split on #, expand means into two columns
emails = x.x.str.split('#', expand = True)
#filter by last in string is a digit
emails = emails.loc[emails.loc[:,0].str[-1].str.isdigit(), :]
#add a length of email column for the next step
emails['lengths'] = emails.loc[:,0].str.len()
Now all we have to do is take each length and length - 1, and see whether each string of that length, with its last character dropped, appears in the set of strings of length n - 1 (and we have to check the opposite as well, in case it is the shortest one in the group):
#unique lengths to check
lengths = emails.lengths.unique()
#mask to hold results
mask = pd.Series([0]*len(emails), index = emails.index)
#for each length
for j in lengths:
    #we subset those of that length
    totest = emails['lengths'] == j
    #and those who might be the shorter version
    against = emails['lengths'] == j -1
    #we make a set of unique values, for a hashed lookup
    againstset = set([i for i in emails.loc[against,0]])
    #we cut off the last char of each in to test
    tests = emails.loc[totest,0].str[:-1]
    #we check matches, by checking the set
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0)
    #viceversa, otherwise we miss the smallest one in the group
    againstset = set([i for i in emails.loc[totest,0].str[:-1]])
    tests = emails.loc[against,0]
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value = 0)
The resulting mask can be converted to boolean, and used to subset the original (deduplicated) dataframe, and the indices should match the original indices to subset like that:
x.loc[~mask.astype(bool),:]
x
0 abc7020#gmail.com
16 foo123#bar.com
17 foo1#bar.com
You can see that we have not removed your first value, as the '.' means it did not match - you can remove the punctuation first.
I have an idea on how to solve this:
fuzzywuzzy
Create a set of unique emails, for-loop over them and compare them with fuzzywuzzy.
Example:
import re
from fuzzywuzzy import fuzz
for email in emailset:
    for row in data:
        emailcomp = re.search(pattern=r'(.+)#.+',string=email).groups()[0]
        rowemail = re.search(pattern=r'(.+)#.+',string=row['email']).groups()[0]
        if row['email']==email:
            continue
        elif fuzz.partial_ratio(emailcomp,rowemail)>80:
            'flagging operation'
I took some liberties with how the data is represented, but I feel the variable names are mnemonic enough for you to understand what I am getting at. It is a very rough piece of code, in that I have not thought through how to stop repetitive flagging.
Anyways, the elif part compares the two email addresses without #gmail.com (or any other email e.g. #yahoo.com), if the ratio is above 80 (play around with this number) use your flagging operation.
For example:
fuzz.partial_ratio("abc7020.1", "abc7020")
100
In my Python application I have an array of IP address strings which looks something like this:
[
"50.28.85.81-140", // Matches any IP address that matches the first 3 octets, and has its final octet somewhere between 81 and 140
"26.83.152.12-194" // Same idea: 26.83.152.12 would match, 26.83.152.120 would match, 26.83.152.195 would not match
]
I installed netaddr and although the documentation seems great, I can't wrap my head around it. This must be really simple - how do I check if a given IP address matches one of these ranges? Don't need to use netaddr in particular - any simple Python solution will do.
The idea is to split the IP and check every component separately.
mask = "26.83.152.12-192"
IP = "26.83.152.19"
def match(mask, IP):
    splitted_IP = IP.split('.')
    for index, current_range in enumerate(mask.split('.')):
        if '-' in current_range:
            mini, maxi = map(int,current_range.split('-'))
        else:
            mini = maxi = int(current_range)
        if not (mini <= int(splitted_IP[index]) <= maxi):
            return False
    return True
Not sure this is the most optimal, but this is base python, no need for extra packages.
Parse the ip_range, creating a one-element list for a simple value and a range for a range. That gives a list of 4 int-list/range objects.
Then zip it with a split version of your address and test whether each value is in the corresponding range.
Note: using range keeps the in test fast in Python 3 (see: Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3?)
ip_range = "50.28.85.81-140"
# + 1 goes outside int(), and makes the upper bound inclusive (range excludes its end)
toks = [[int(d)] if d.isdigit() else range(int(d.split("-")[0]), int(d.split("-")[1]) + 1) for d in ip_range.split(".")]
print(toks) # debug
for test_ip in ("50.28.85.86","50.284.85.200","1.2.3.4"):
    print (all(int(a) in b for a,b in zip(test_ip.split("."),toks)))
result (as expected):
[[50], [28], [85], range(81, 141)]
True
False
False
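Another way to look at it is to turn each "a.b.c.x-y" entry into a pair of integers and compare numerically. This is just a sketch using the standard-library ipaddress module; parse_range and ip_in_ranges are hypothetical helper names, and it assumes the dash only ever appears in the last octet, as in the question:

```python
import ipaddress

def parse_range(spec):
    """Turn '50.28.85.81-140' into (start, end) as integers, both inclusive."""
    base, _, last_octet = spec.partition("-")
    start = ipaddress.IPv4Address(base)
    if last_octet:
        prefix = base.rsplit(".", 1)[0]            # '50.28.85'
        end = ipaddress.IPv4Address(f"{prefix}.{last_octet}")
    else:
        end = start                                # plain address, no range
    return int(start), int(end)

def ip_in_ranges(ip, specs):
    """True if `ip` falls inside at least one of the range strings."""
    value = int(ipaddress.IPv4Address(ip))
    return any(lo <= value <= hi for lo, hi in map(parse_range, specs))

ranges = ["50.28.85.81-140", "26.83.152.12-194"]
for candidate in ("26.83.152.12", "26.83.152.120", "26.83.152.195"):
    print(candidate, ip_in_ranges(candidate, ranges))
```

A side effect of going through IPv4Address is that malformed input such as 50.284.85.200 raises a ValueError instead of silently not matching, which may or may not be what you want.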
The brute force approach:
from ipaddr import IPv4Network
n = IPv4Network('10.10.128.0/17')
all = list(n.iterhosts()) # will give me all hosts in network
first,last = all[0],all[-1] # first and last IP
I was wondering how I would get the first and last IP address from a CIDR without having to iterate over a potentially very large list to get the first and last element?
I want this so I can then generate a random ip address in this range using something like this:
socket.inet_ntoa(struct.pack('>I', random.randint(int(first),int(last))))
From Python 3.3, you can use the ipaddress module
You could use it like this:
import ipaddress
n = ipaddress.IPv4Network('10.10.128.0/17')
first, last = n[0], n[-1]
__getitem__ is implemented, so it won't generate any large lists.
https://github.com/python/cpython/blob/3.6/Lib/ipaddress.py#L634
Maybe try netaddr instead, in particular the indexing section.
https://pythonhosted.org/netaddr/tutorial_01.html#indexing
from netaddr import IPNetwork
ip = IPNetwork('10.10.128.0/17')
# print() with a single argument works on both Python 2 and Python 3
print("ip.cidr = %s" % ip.cidr)
print("ip.first.ip = %s" % ip[0])
print("ip.last.ip = %s" % ip[-1])
The Python 3 ipaddress module is the more elegant solution, imho. It works fine, but note that indexes [0] and [-1] do not return the first and last free IP addresses; they return the network address and the broadcast address respectively.
The first and last free, assignable addresses are rather:
import ipaddress
n = ipaddress.IPv4Network('10.10.128.0/17')
first, last = n[1], n[-2]
which returns 10.10.128.1 as first and 10.10.255.254 as last, instead of 10.10.128.0 and 10.10.255.255.
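For the random-address part of the question, indexing the network directly avoids both the list and the struct/socket round trip. A rough sketch, where random_host is a made-up name and the choice to skip the network and broadcast addresses (except for /31 and /32 networks, which have no spare addresses) is an assumption about what "usable" means here:

```python
import ipaddress
import random

def random_host(cidr, rng=random):
    """Pick a random address inside `cidr` without building a list of hosts."""
    net = ipaddress.ip_network(cidr)
    size = net.num_addresses
    # skip the network and broadcast addresses when there is room to do so
    low, high = (1, size - 2) if size > 2 else (0, size - 1)
    # indexing uses __getitem__, so only one address object is created
    return net[rng.randint(low, high)]

addr = random_host("10.10.128.0/17")
print(addr)
```

Passing a seeded random.Random instance as rng makes the choice reproducible in tests.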
I want to take an IP address string, e.g. 192.168.1.23, keep only the first three bytes of the address, and then append 0-255. In other words, I want to turn that IP address into a range of IP addresses I can pass to Nmap to conduct a sweep scan.
The easiest solution of course is to simply trim off the last two characters of the string, but of course this won't work if the IP is 192.168.1.1 or 192.168.1.123
Here is the solution I came up with:
lhost = "192.168.1.23"
# Split the lhost on each '.' then re-assemble the first three parts
lip = lhost.split('.')
trange = ""
for i, val in enumerate(lip):
    if (i < len(lip) - 1):
        trange += val + "."
# Append "0-255" at the end, we now have target range trange = "XX.XX.XX.0-255"
trange += "0-255"
It works fine but feels ugly and not efficient to me. What is a better way to do this?
You could use the rfind function of string object.
>>> lhost = "192.168.1.23"
>>> lhost[:lhost.rfind(".")] + ".0-255"
'192.168.1.0-255'
The rfind function is similar to find(), but it searches from the end.
rfind(...)
S.rfind(sub [,start [,end]]) -> int
Return the highest index in S where substring sub is found,
such that sub is contained within S[start:end]. Optional
arguments start and end are interpreted as in slice notation.
Return -1 on failure.
A more complicated solution could use a regular expression:
>>> import re
>>> re.sub(r"\d{1,3}$","0-255",lhost)
'192.168.1.0-255'
Hope this helps!
You could split and get the first three values, join by a '.', and then add ".0-255"
>>> lhost = "192.168.1.23"
>>> '.'.join(lhost.split('.')[0:-1]) + ".0-255"
'192.168.1.0-255'
>>>
Not all IPs belong to class C. I think the code must be flexible enough to accommodate various IP ranges and their masks.
I had previously written a tiny Python module to calculate the network ID and broadcast ID for a given IP address with any network mask.
The code can be found here: https://github.com/brownbytes/tamepython/blob/master/subnet_calculator.py
I think networkSubnet() and hostRange() are the functions that can be of some help to you.
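If arbitrary masks matter, the standard-library ipaddress module (Python 3.3+) can do the same calculation, and since Nmap also accepts CIDR notation as a target, the 0-255 string may not be needed at all. A small sketch; sweep_target is a hypothetical name and the default /24 prefix is only an assumption matching the question:

```python
import ipaddress

def sweep_target(host, prefix=24):
    """Return the network containing `host` in CIDR form, e.g. for an Nmap sweep."""
    # ip_interface keeps the host bits; .network masks them off
    return str(ipaddress.ip_interface(f"{host}/{prefix}").network)

print(sweep_target("192.168.1.23"))
print(sweep_target("10.20.30.40", prefix=16))
```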
I like this:
#!/usr/bin/python3
ip_address = '128.200.34.1'
list_ = ip_address.split('.')
assert len(list_) == 4
list_[3] = '0-255'
print('.'.join(list_))