I want to create a simple table (using python) in which I can store/search IP packet header fields i.e.,
source IP, Destination IP, Source Port, Destination port, count
I want to achieve the following when I get new packet header fields:
Lookup in the table to see if a packet with these fields is already added, if true then update the count.
If the packet is not already present in the table create a new entry and so on.
Through my search so far I have two options:
Create a list of dictionaries, with each dictionary having the five fields mentioned above.
(Python list of dictionaries search)
Use SQLite.
I want to ask what is an optimal approach (or best option) for creating a packet/flow lookup table. The expected size of the table is 100-500 entries.
You could use defaultdict(list) from collections to store your data. I assume you would want to search based on the source IP, so you would keep the source IP as the key.
from collections import defaultdict

testDictionary = defaultdict(list)
testDictionary["192.168.0.1"] = ["10.10.10.1", 22, 8080, 0]

if sourceIP in testDictionary:  # a plain key test avoids creating an empty entry via the default factory
    testDictionary[sourceIP][-1] += 1
Since you say you only have a table with 100-500 entries, you could also search by destination IP using
for sourceIP, otherHeader in testDictionary.items():
    if otherHeader[0] == destinationIP:
        testDictionary[sourceIP][-1] += 1
I do not know whether both the source IP and the destination IP would be unique in all cases; you can decide what to choose based on that. The advantage of defaultdict(list) is that you can also append things without overwriting the previous values.
for sourceIP, otherHeader in testDictionary.items():
    if otherHeader[0] != destinationIP:
        testDictionary[sourceIP].append(["10.10.10.2", 22, 8080, 1])
    else:
        testDictionary[sourceIP][-1] += 1
I am not sure whether this is exactly what you are looking for, but I have tried to model your data type from the description.
Hope that helps.
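For completeness: since lookups here are on all five fields at once, the (source IP, destination IP, source port, destination port) tuple itself can also serve as the dictionary key, which makes the lookup-or-create step a single operation. A minimal sketch (function and variable names are just for illustration):

```python
# Sketch: counting flows keyed by the full header tuple.
from collections import defaultdict

flow_table = defaultdict(int)  # missing keys start at count 0

def add_packet(src_ip, dst_ip, src_port, dst_port):
    # One lookup handles both "already present" and "new entry".
    flow_table[(src_ip, dst_ip, src_port, dst_port)] += 1

add_packet("192.168.0.1", "10.10.10.1", 22, 8080)
add_packet("192.168.0.1", "10.10.10.1", 22, 8080)
add_packet("192.168.0.2", "10.10.10.1", 443, 8080)
print(flow_table[("192.168.0.1", "10.10.10.1", 22, 8080)])  # 2
```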
Well, first of all, I will explain what this should do. I'm trying to store Discord users and servers in a list (users that use the bot and servers the bot is in), keyed by id.
for example:
class User(object):
    name = ""
    uid = 0
Now, all Discord ids are very long and I want to store lots of users and servers in my lists (one list for each). But suppose I get 10,000 users in my list and I want to get the last one (without knowing it's the last one); this would take a lot of time. Instead, I thought I could make a directory system for storing users in the list and finding them quickly. This is how it works:
I can get the id easily so imagine my id is 12345.
Now I convert it into a string using Python's str(id) function and store it in a variable, strId.
For each digit of the string, I use it as an index into the users list, like this:
The User() is where the user is stored
users_list = [[[], [[], [], [[], [], [], [User()]]]]]

actual_dir = users_list
for digit in strId:
    actual_dir = actual_dir[int(digit)]
user = actual_dir[0]
And that's how I reach the user (or something like that)
Now, here is where my problem is. I know I can get the user easily by id, but when I want to save the changes, I should do something like users_list[1][2][3][4][5] = changed_user_variable. But as far as I know, I cannot do something like list[1] += [2].
Is there any way to reach the user and save the changes?
Thanks in advance
You can use a Python dictionary with the user id as the key and the user object as the value. I ran a test on my own computer and found that looking up 100,000 random users in a dictionary with 10 million entries only took 0.3s. This method is much simpler, and I would guess it's just as fast, if not faster.
You can create a dictionary and add users with:
users = {}
users[userID] = some_user
(many other ways of doing this)
By using a dictionary you can easily change a user's field:
users[userID].some_field = "Some value"
or overwrite the same way you add users in the first place.
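Putting it together, a minimal sketch (the User class is adapted from the question; the id and field values are illustrative):

```python
# Sketch: storing and updating users by id in a plain dict.
class User(object):
    def __init__(self, name="", uid=0):
        self.name = name
        self.uid = uid

users = {}
users[12345] = User("Alice", 12345)   # add a user under its id
users[12345].name = "Alicia"          # change a field in place
print(users[12345].name)
```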
I have a csv with 70 columns. The 60th column contains a value which decides whether the record is valid or invalid. If the 60th column has 0, 1, 6 or 7 it's valid; if it contains any other value, it's invalid.
I realised that this functionality wasn't possible relying entirely on changing properties of processors in Apache NiFi. Therefore I decided to use the ExecuteScript processor and added this Python code as the script body.
import csv

valid = 0
invalid = 0
total = 0

file1 = open("valid.csv", "w")
file2 = open("invalid.csv", "w")
with open('/Users/himsaragallage/Desktop/redder/Regexo_2019101812750.dat.csv') as f:
    r = csv.reader(f)
    w_valid = csv.writer(file1)
    w_invalid = csv.writer(file2)
    for row in r:  # iterate over the parsed rows, not the raw file handle
        total += 1
        if row[59] in ("0", "1", "6", "7"):  # 60th column (index 59)
            valid += 1
            w_valid.writerow(row)
        else:
            invalid += 1
            w_invalid.writerow(row)
file1.close()
file2.close()
print("Total : " + str(total))
print("Valid : " + str(valid))
print("Invalid : " + str(invalid))
I have no idea how to use a session and code within the ExecuteScript processor as shown in this question, so I just wrote a simple Python script and directed the valid and invalid data to different files. The approach I have used has many limitations:
I want to be able to dynamically process csv's with different filenames.
The csv which the invalid data is sent to, must also have the same filename as the input csv.
There would be around 20 csv's in my redder folder. All of them must be processed in one go.
I hope you can suggest a method for me to do the above. Feel free to provide a solution by editing the Python code I have used, or even by using a completely different set of processors and excluding the ExecuteScript processor entirely.
Here are complete step-by-step instructions on how to use the QueryRecord processor.
Basically, you need to set up the highlighted properties.
You want to route records based on values from one column. There are various ways to make this happen in NiFi. I can think of the following:
Use QueryRecord processor to partition records by column values
Use RouteOnContent processor to route using a regular expression
Use ExecuteScript processor to create a custom routing logic
Use PartitionRecord processor to route based on RecordPaths
I will show you how to solve your problem using the PartitionRecord processor. Since you did not provide any example data, I created an example use case: I want to distinguish cities in Europe from cities elsewhere. The following data is given:
id,city,country
1,Berlin,Germany
2,Paris,France
3,New York,USA
4,Frankfurt,Germany
Flow:
GenerateFlowFile:
PartitionRecord:
CSVReader should be set up to infer the schema and CSVRecordSetWriter to inherit it. PartitionRecord will group records by country and pass them on together with an attribute country that holds the country value. You will see the following groups of records:
id,city,country
1,Berlin,Germany
4,Frankfurt,Germany
id,city,country
2,Paris,France
id,city,country
3,New York,USA
Each group is a flowfile and will have the country attribute, which you will use to route the groups.
RouteOnAttribute:
All countries from Europe will be routed to the is_europe relationship. Now you can apply the same strategy to your use case.
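For the RouteOnAttribute step, the is_europe property could be an expression along these lines (NiFi Expression Language; the country list is just for this example and would need to be extended):

```
${country:equals('Germany'):or(${country:equals('France')})}
```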
I have managed to compile two lists of IP addresses, used and unused IPs, as such:
unused_ips = ['172.16.100.0/32', '172.16.100.1/32', '172.16.100.2/32', '172.16.100.3/32', '172.16.100.4/32', '172.16.100.5/32', '172.16.100.6/32', '172.16.100.7/32', '172.16.100.8/32', '172.16.100.9/32'...]
used_ips = ['172.16.100.1/32','172.16.100.33/32']
What I want to be able to do now is compare these lists and return the next free IP. In the above example the next IP would be 172.16.100.2/32; once it had handed out all of those from 1 to 32, it would hand out 34.
I'm not sure where to begin with this. I can convert these to IPv4Network objects if there is something built in for this, but I couldn't find anything in the documentation.
Thanks
I'd keep a set of ipaddress objects and manipulate them to allocate and de-allocate the addresses, like so:
import ipaddress

def massage_ip_lists():
    global unused_ips, used_ips
    unused_ips = set(ipaddress.ip_address(ip.replace('/32', ''))
                     for ip in unused_ips)
    used_ips = set(ipaddress.ip_address(ip.replace('/32', ''))
                   for ip in used_ips)

def allocate_next_ip():
    new_ip = min(unused_ips - used_ips)
    used_ips.add(new_ip)
    return new_ip
unused_ips = [
'172.16.100.0/32',
'172.16.100.1/32',
'172.16.100.2/32',
'172.16.100.3/32',
'172.16.100.4/32',
'172.16.100.5/32',
'172.16.100.6/32',
'172.16.100.7/32',
'172.16.100.8/32',
'172.16.100.9/32']
used_ips = ['172.16.100.1/32', '172.16.100.33/32']
massage_ip_lists()
print(allocate_next_ip())
print(allocate_next_ip())
Note:
/32 is nomenclature for IP networks, not IP hosts.
ipaddress objects are comparable, so functions like min() work on them.
172.16.100.0 is a perfectly valid IP address, depending upon the netmask. If you don't want to allocate it, either keep it out of unused_ips, or make the program aware of the netmask in use.
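On that last point: if you later want the allocator to be netmask-aware, ipaddress can enumerate the usable hosts of a network for you, which avoids hard-coding the candidate list. A small sketch (the /27 network is just an example mask over the question's range):

```python
import ipaddress

# All usable host addresses in 172.16.100.0/27 (.1 through .30;
# the network and broadcast addresses are excluded by hosts()).
network = ipaddress.ip_network('172.16.100.0/27')
hosts = list(network.hosts())
print(hosts[0], hosts[-1])
```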
You want IPs that are in unused but not used:
available_ips = [ip for ip in unused_ips if ip not in used_ips]
You want to sort them to get the one closest to zero. Naive sorting will not work since you have strings; 172.16.xxx.xxx sorts higher than 172.100.xxx.xxx, for example. You can convert the IPs into lists of numbers to sort them correctly (a list, not a generator, since sort keys must be comparable):
import re
available_ips = sorted(available_ips, key=lambda ip: [int(n) for n in re.split(r'[./]', ip)])
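Putting both steps together with (a shortened version of) the sample lists from the question:

```python
import re

unused_ips = ['172.16.100.0/32', '172.16.100.1/32', '172.16.100.2/32',
              '172.16.100.3/32', '172.16.100.4/32']
used_ips = ['172.16.100.1/32', '172.16.100.33/32']

# Keep only the IPs not already handed out.
available_ips = [ip for ip in unused_ips if ip not in used_ips]
# Sort numerically by octet so '172.16...' orders correctly.
available_ips = sorted(available_ips,
                       key=lambda ip: [int(n) for n in re.split(r'[./]', ip)])
print(available_ips[0])  # 172.16.100.0/32
```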
If you're just trying to iterate through the available IPs, you could do something like this:
# Filter unavailable ips from the list of all ips
available_ips = set(unused_ips) - set(used_ips)
# Iterate through list of available ips
for ip in available_ips:
    print(ip)  # Or whatever you want to do with the next available ip
I am writing a simple file with stats on IP addresses.
I use this code:
line = '%s %12g %12g %12g' % (IP, STAT1, STAT2, THSD)
with open(ficresul, 'a+') as fico:
    if not any(line == x.rstrip('\r\n') for x in fico):
        fico.write(line + '\n')
# the with block closes the file; no explicit close() needed
and output is something like this:
192.168.0.10 15.8121 15.4317 18
192.168.0.20 18.625 12.5085 18
192.168.0.24 20.8323 23.252 18
192.168.0.17 17.6208 15.9218 18
It works perfectly for a new IP address, but I would like to update the stats if the IP address is already in the file instead of writing a new line.
How can this be done?
The simplest thing is to read the entire file, update it in memory, then write the whole thing back out (if it has changes). Trying to update lines in-place will only work if you don't change the length of any lines, and is probably more error-prone.
When storing the contents in memory, use an OrderedDict to store them by the key you want to do lookups by. OrderedDict will help avoid spurious changes to the ordering of the lines, which might be nice to have. Otherwise you can use a regular dict.
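A minimal sketch of that read-update-write cycle (the line layout is taken from the sample output; the function name and path handling are illustrative):

```python
from collections import OrderedDict

def update_stats(path, ip, stat1, stat2, thsd):
    # Read the whole file into an OrderedDict keyed by IP,
    # preserving the original line order.
    stats = OrderedDict()
    try:
        with open(path) as f:
            for raw in f:
                fields = raw.split()
                if fields:
                    stats[fields[0]] = raw.rstrip('\r\n')
    except IOError:  # file does not exist yet
        pass
    # Update in memory (overwrites the old line for a known IP).
    stats[ip] = '%s %12g %12g %12g' % (ip, stat1, stat2, thsd)
    # Write the whole thing back out.
    with open(path, 'w') as f:
        for line in stats.values():
            f.write(line + '\n')
```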
I have a sysadmin-type CLI app that reads info from a config file. I cannot change the format of the config file below.
TYPE_A = "value1,value2,value2"
TYPE_A = "value3,value4,value5"
TYPE_B = "valuex,valuey,valuez"
Based on the TYPE, I'll need to do some initial processing with each one. After I'm done with that step for all, I need to do some additional processing and depending on the options chosen either print the state and intended action(s) or execute those action(s).
I'd like to do the initial parsing of the config into a dict of lists of dicts, and update every instance of TYPE_A, TYPE_B, TYPE_C, etc. with all the pertinent info about it. Then either print the full state or execute the actions (or fail if the state of something was incorrect).
My thought is it would look something like:
dict
    TYPE_A_list
        dict_A[0] key:value, key:value, key:value
        dict_A[1] key:value, key:value, key:value
    TYPE_B_list
        dict_B[0] key:value, key:value, key:value
        dict_B[1] key:value, key:value, key:value
I think I'd want to read the config into that and then add keys and values or update values as the app progresses and reprocesses each TYPE.
Finally, my questions:
I'm not sure how to iterate over each list of dicts, or how to add list elements and add or update key:value pairs.
Is what I describe above the best way to go about this?
I'm fairly new to Python, so I'm open to any advice. FWIW, this will be Python 2.6.
A little clarification on the config file lines:
CAR_TYPE = "Ford,Mustang,Blue,2005"
CAR_TYPE = "Honda,Accord,Green,2009"
BIKE_TYPE = "Honda,VTX,Black,2006"
BIKE_TYPE = "Harley,Sportster,Red,2010"
TIRE_TYPE = "170R15,whitewall"
Each type will have the same order and number of values.
No need to "remember" that there are two different TYPE_A assignments - you can combine them.
TYPE_A = "value1,value2,value2"
TYPE_A = "value3,value4,value5"
would be parsed as only one of them, or both, depending on the implementation of your sysadmin CLI app.
Then the data model should be:
dict
    TYPE_A: list(value1, value2, value3)
    TYPE_B: list(valuex, valuey, valuez)
That way, you can iterate through dict.items() pretty easily:
for _type, values in dict.items():
    for value in values:
        print "%s: %s" % (_type, value)
        # or whatever you wish to do
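As a concrete sketch of that model, the config lines from the question can be parsed into a dict of lists in a few lines (shown in Python 3 syntax; on 2.6 only the print call differs):

```python
# Sketch: parse TYPE = "a,b,c" lines into {TYPE: [[a, b, c], ...]}.
from collections import defaultdict

config_text = '''CAR_TYPE = "Ford,Mustang,Blue,2005"
CAR_TYPE = "Honda,Accord,Green,2009"
BIKE_TYPE = "Honda,VTX,Black,2006"
TIRE_TYPE = "170R15,whitewall"'''

parsed = defaultdict(list)
for line in config_text.splitlines():
    key, _, value = line.partition('=')
    # Strip whitespace and quotes, then split the comma-separated values.
    parsed[key.strip()].append(value.strip().strip('"').split(','))

for _type, entries in parsed.items():
    for entry in entries:
        print('%s: %s' % (_type, entry))
```

Repeated keys accumulate in the list instead of overwriting each other, which matches the two CAR_TYPE lines in the config.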