Hashable container for millions of items with modifications in each iteration - python

I'm currently dealing with a dictionary that starts with a dozen items and grows to a dozen million items after a few iterations. Fundamentally, each item is defined by several IDs, a value, and some characteristics. I build my dict from JSON data I gather from a SQL server.
The operations I execute are, for example:
get SQL results as JSON
search for items where 'id1' and/or 'id2' are identical
merge all items with the same 'id1' by summing float('value')
Here is an example of what my data looks like:
[
    {'id1': '01234-01234-01234',
     'value': '10',
     'category': 'K'},
    ...
    {'id1': '01234-01234-01234',
     'value': '5',
     'category': 'K'},
    ...
]
What I would like to get looks like:
[
    ...
    {'id1': '01234-01234-01234',
     'value': '15',
     'category': 'K'},
    ...
]
I could use a dict of dicts instead:
{
    '01234-01234-01234': {'value': '10',
                          'category': 'K'},
    ...
    '01234-01234-01234': {'value': '5',
                          'category': 'K'},
    ...
}
and get:
{
    '01234-01234-01234': {'value': '15',
                          'category': 'K'},
    ...
}
I have 4 GB of dedicated RAM and millions of dicts in one dictionary on a 64-bit architecture, and I would like to optimise my code and my operations in both time and RAM. Are there tricks, or containers better suited than a dictionary of dictionaries, for this kind of operation? Is it better to create a new object that replaces the previous one on each iteration, or to modify the hashable object itself?
I'm using Python 3.4.
EDIT: simplified the question to a single question about the value.
The question is similar to How to sum dict elements or Fastest way to merge n-dictionaries and add values on 2.6, but in my case the values in my dicts are strings.
EDIT2: for the moment, the best performance I have obtained is with this method:
from copy import deepcopy

def merge_similar_dict(input_list):
    i = 0
    # Sort the list of exchanges by id so identical ids end up adjacent.
    try:
        merge_list = sorted(deepcopy(input_list), key=lambda k: k['id'])
        while (i + 1) <= (len(merge_list) - 1):
            while merge_list[i]['id'] == merge_list[i + 1]['id']:
                merge_list[i]['amount'] = str(float(merge_list[i]['amount'])
                                              + float(merge_list[i + 1]['amount']))
                merge_list.remove(merge_list[i + 1])
                if i + 1 >= len(merge_list):
                    break
            i += 1
    except Exception as error:
        print('The merge of similar dicts has failed')
        print(error)
        raise
    return merge_list
Once I have tens of thousands of dicts in the list, it starts to take a very long time (several minutes).
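For reference, a linear-time sketch of the same merge (a sketch only, reusing the 'id' and 'amount' keys from the snippet above): accumulating into a dict keyed by id avoids both the sort and the repeated list.remove calls that make the loop above quadratic.

def merge_similar_dict_linear(input_list):
    # One pass: accumulate the sums in a dict keyed by id.
    merged = {}
    for item in input_list:
        key = item['id']
        if key in merged:
            merged[key]['amount'] = str(float(merged[key]['amount'])
                                        + float(item['amount']))
        else:
            merged[key] = dict(item)  # shallow copy so the input list is untouched
    return list(merged.values())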

Related

accelerate comparing dictionary keys and values to strings in list in python

Sorry if this is trivial; I'm still learning, but I have a list of dictionaries that looks as follows:
[{'1102': ['00576', '00577', '00578', '00579', '00580', '00581']},
{'1102': ['00582', '00583', '00584', '00585', '00586', '00587']},
{'1102': ['00588', '00589', '00590', '00591', '00592', '00593']},
{'1102': ['00594', '00595', '00596', '00597', '00598', '00599']},
{'1102': ['00600', '00601', '00602', '00603', '00604', '00605']}
...]
It contains ~89000 dictionaries. And I have a list containing 4473208 paths. Example:
['/****/**/******_1102/00575***...**0CT.csv',
'/****/**/******_1102/00575***...**1CT.csv',
'/****/**/******_1102/00575***...**2CT.csv',
'/****/**/******_1102/00575***...**3CT.csv',
'/****/**/******_1102/00575***...**4CT.csv',
'/****/**/******_1102/00578***...**1CT.csv',
'/****/**/******_1102/00578***...**2CT.csv',
'/****/**/******_1102/00578***...**3CT.csv',
...]
What I want to do is group together the paths whose folder contains the key and whose filename contains one of the grouped values from that dict.
I tried using for loops like this:
grpd_cts = []
for elem in tqdm(dict_list):
    temp1 = []
    for file in ct_paths:
        for key, val in elem.items():
            if (file[16:20] == key) and (any(x in file[21:26] for x in val)):
                temp1.append(file)
    grpd_cts.append(temp1)
But this takes around 30 hours. Is there a way to make it more efficient? Any itertools function or something?
Thanks a lot!
ct_paths is iterated repeatedly in your inner loop, and you're only interested in a little bit of it for testing purposes; pull that out and use it to index the rest of your data, as a dictionary.
What does make your problem complicated is that you want to end up with the original list of filenames, so you need to construct a two-level dictionary where the values are lists of all the originals grouped under those two keys.
ct_path_index = {}
for f in ct_paths:
    ct_path_index.setdefault(f[16:20], {}).setdefault(f[21:26], []).append(f)

grpd_cts = []
for elem in tqdm(dict_list):
    temp1 = []
    for key, val in elem.items():
        d2 = ct_path_index.get(key)
        if d2:
            for v in val:
                v2 = d2.get(v)
                if v2:
                    temp1 += v2
    grpd_cts.append(temp1)
ct_path_index looks like this, using your data:
{'1102': {'00575': ['/****/**/******_1102/00575***...**0CT.csv',
'/****/**/******_1102/00575***...**1CT.csv',
'/****/**/******_1102/00575***...**2CT.csv',
'/****/**/******_1102/00575***...**3CT.csv',
'/****/**/******_1102/00575***...**4CT.csv'],
'00578': ['/****/**/******_1102/00578***...**1CT.csv',
'/****/**/******_1102/00578***...**2CT.csv',
'/****/**/******_1102/00578***...**3CT.csv']}}
The use of setdefault (which can be a little hard to understand the first time you see it) is important when building up collections of collections, and is very common in these kinds of cases: it makes sure that the sub-collections are created on demand and then re-used for a given key.
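For comparison, here is a sketch of the same index built with collections.defaultdict, which some readers find easier to follow than chained setdefault calls:

from collections import defaultdict

# A missing key materializes a nested defaultdict of lists on first access.
ct_path_index = defaultdict(lambda: defaultdict(list))
for f in ct_paths:
    ct_path_index[f[16:20]][f[21:26]].append(f)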
Now, you've only got two nested loops; the inner checks are done using dictionary lookups, which are close to O(1).
Other optimizations would include turning the lists in dict_list into sets, which would be worthwhile if you made more than one pass through dict_list.
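If you did make repeated passes, the conversion could look like this sketch (assuming dict_list as shown above):

# Membership tests against a set are O(1) instead of O(len(list)).
dict_list = [{key: set(val) for key, val in d.items()} for d in dict_list]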

map() with partial arguments: save up space

I have a very large list of dictionaries whose keys are triples of (string, float, string) and whose values are again lists.
cols_to_aggr is basically a list(defaultdict(list))
I wish I could pass to my function _compute_aggregation not the list index i but only the data contained at that index, namely cols_to_aggr[i], instead of the whole data structure cols_to_aggr and having to extract the smaller chunk inside my parallelized function.
This matters because passing the whole data structure causes my Pool to eat up all my memory with no efficiency at all.
with multiprocessing.Pool(processes=n_workers, maxtasksperchild=500) as pool:
    results = pool.map(
        partial(_compute_aggregation, cols_to_aggr=cols_to_aggr,
                aggregations=aggregations, pivot_ladetag_pos=pivot_ladetag_pos,
                to_ix=to_ix), cols_to_aggr)

def _compute_aggregation(index, cols_to_aggr, aggregations, pivot_ladetag_pos, to_ix):
    data_to_process = cols_to_aggr[index]
To patch my memory issue I tried setting maxtasksperchild, but without success; I have no clue how to set it optimally.
Using dict.values(), you can iterate over the values of a dictionary.
So you could change your code to:
with multiprocessing.Pool(processes=n_workers, maxtasksperchild=500) as pool:
    results = pool.map(
        partial(_compute_aggregation,
                aggregations=aggregations, pivot_ladetag_pos=pivot_ladetag_pos,
                to_ix=to_ix), cols_to_aggr.values())

def _compute_aggregation(value, aggregations, pivot_ladetag_pos, to_ix):
    data_to_process = value
If you still need the keys in your _compute_aggregation function, use dict.items() instead.
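That items() variant could look like this sketch (same hypothetical names as above); each worker then receives a (key, value) pair:

with multiprocessing.Pool(processes=n_workers, maxtasksperchild=500) as pool:
    results = pool.map(
        partial(_compute_aggregation,
                aggregations=aggregations, pivot_ladetag_pos=pivot_ladetag_pos,
                to_ix=to_ix), cols_to_aggr.items())

def _compute_aggregation(item, aggregations, pivot_ladetag_pos, to_ix):
    key, data_to_process = item  # unpack the (key, value) pair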

Most pythonic way of iterating list items into a nested dict

I have a problem and I want to determine whether my approach is sound. Here is the idea:
I would create a primary dict called zip_codes, in which each zipcode (from a list) is the key of a nested dict. Each nested dict would have keys for "members", "offices", and "membersperoffice".
It would look like this:
zips = {
    90219: {
        "members": 120,
        "offices": 18,
        "membersperoffice": 28
    },
    90220: {
        "members": 423,
        "offices": 37,
        "membersperoffice": 16
    }
}
and so on and so forth.
I think I need to build the nested dicts, and then process several lists against conditionals, passing resulting values into the corresponding dicts on the fly (i.e. based on how many times a zip code exists in the list).
Is using nested dictionaries the most pythonic way of doing this? Is it cumbersome? Is there a better way?
Can someone drop me a hint about how to push key values into nested dicts from a loop? I've not been able to find a good resource describing what I'm trying to do (if this is, indeed, the best path).
Thanks.
:edit: a more specific example:
determine how many instances of a zipcode are in list called membersperzip
find corresponding nested dict with same name as zipcode, inside dict called zips
pass value to corresponding key, called "members" (or whatever key)
:edit 2:
MadPhysicist requested I give code examples. I don't even know where to start with this one and I can't find examples. All I've been able to do thus far is:
area_dict = {}
area_dict = dict.fromkeys(all_areas, 0)  # make all of the zipcodes keys, each with an initial value of 0
dictkeys = list(area_dict.keys())
That gets me a dict with a bunch of zip codes as keys. I've discovered no way to iterate through a list and create nested dicts (yet). Hence the actual question.
Please don't dogpile me and do the usual stack overflow thing. This is not me asking anyone to do my homework. This is merely me asking someone to drop me a HINT.
:edit 3:
Ok. This is convoluted (my fault). Allow me to clarify further:
So, I have an example of what the nested dicts should look like. They'll start out empty, but I need to iterate through one of the zip code lists to create all the nested dicts... inside of zips.
This is a sample of the list that I want to use to create the nested dicts inside of the zips dict:
zips = [90272, 90049, 90401, 90402, 90403, 90404, 90291, 90292, 90290, 90094, 90066, 90025, 90064, 90073]
And this is what I want it to look like
zips = {
    90272: {
        "members": ,
        "offices": ,
        "membersperoffice":
    },
    90049: {
        "members": ,
        "offices": ,
        "membersperoffice":
    }
}
....
etc., etc. (creating a corresponding nested dict for each zipcode in the list)
After I achieve this, I have to iterate through several more zip code lists... and those would spit out the number of times a zip code appears in a given list, and then find the dict corresponding to the zip code in question, and append that value to the relevant key.
Once I figure out the first part, I can figure the second part out on my own.
Thanks again. Sorry for any confusion.
You can do something like this:
def code_members(zipcode):
    if zipcode == 90219:
        return dict(members=120, offices=18, membersperoffice=28)
    return dict(members=423, offices=37, membersperoffice=16)

all_areas = [90219, 90220]
zips = {zipcode: code_members(zipcode) for zipcode in all_areas}
I think I need to build the nested dicts, and then process several
lists against conditionals, passing resulting values into the
corresponding dicts on the fly (i.e. based on how many times a zip
code exists in the list).
Using the above approach, if a zipcode appears multiple times in the all_areas list, the resulting zip dictionary will only contain one instance of the zipcode.
Is using nested dictionaries the most pythonic way of doing this? Is
it cumbersome? Is there a better way?
May I suggest making a simple object that represents the value of each zipcode? Something simple like:
Using a dataclass:
import dataclasses

@dataclasses.dataclass
class ZipProperties:
    members: int
    offices: int
    membersperoffice: int
Using a named tuple:
import collections

ZipProperties = collections.namedtuple('ZipProperties', ['members', 'offices', 'membersperoffice'])
You can then change the code_members function to this:
def code_members(zipcode):
    if zipcode == 90219:
        return ZipProperties(120, 18, 28)
    return ZipProperties(423, 37, 16)
Addressing your concrete example:
determine how many instances of a zipcode are in list called membersperzip
find corresponding nested dict with same name as zipcode, inside dict called zips
pass value to corresponding key, called "members" (or whatever key)
from typing import List, Tuple

membersperzip: List[Tuple[int, int]] = [(90219, 54)]

for zip_code, members in membersperzip:  # zip_code, not zip: don't shadow the builtin
    for zipcode, props in zips.items():
        if zipcode == zip_code:
            props.members = members  # works with the dataclass; namedtuples are immutable
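Since zips is keyed by zipcode, a direct lookup avoids scanning every item; a sketch:

for zip_code, members in membersperzip:
    props = zips.get(zip_code)
    if props is not None:
        props.members = members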
I would suggest adding each entry once you have the actual value, instead of initializing the dictionary with empty values for every key. You have a list of keys, and I do not see why you would put all of them into the dictionary without having values in the first place.
zips = [90272, 90049, 90401, 90402, 90403, 90404, 90291, 90292, 90290, 90094, 90066, 90025, 90064, 90073]
zips_dict = {}
for a_zip in zips:
    if a_zip not in zips_dict:
        # Initialize the proper value here for members etc.
        zips_dict[a_zip] = proper_value
If you insist on initializing the dict with empty values for each key, you could use this, which also iterates through the list, but as a dict comprehension.
zips = [90272, 90049, 90401, 90402, 90403, 90404, 90291, 90292, 90290, 90094, 90066, 90025, 90064, 90073]
zips_dict = {
    x: {
        "members": None,
        "offices": None,
        "membersperoffice": None,
    } for x in zips
}
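As a further hint for the counting step described in the question (a sketch, assuming membersperzip is a flat list of zipcodes): collections.Counter counts occurrences in one pass, and the counts can then be written into the nested dicts.

from collections import Counter

membersperzip = [90272, 90049, 90272, 90272]  # hypothetical sample data
counts = Counter(membersperzip)               # Counter({90272: 3, 90049: 1})

for zipcode, count in counts.items():
    if zipcode in zips_dict:
        zips_dict[zipcode]["members"] = count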
Hope this helps

Python: Sorting ip ranges which are dictionary keys

I have a dictionary which has IP address ranges as keys (used to de-duplicate in a previous step) and certain objects as values. Here's an example.
Part of the dictionary sresult:
10.102.152.64-10.102.152.95 object1:object3
10.102.158.0-10.102.158.255 object2:object5:object4
10.102.158.0-10.102.158.31 object3:object4
10.102.159.0-10.102.255.255 object6
There are tens of thousands of lines, and I want to sort them (correctly) by the IP addresses in the keys.
I tried splitting each key on the range separator '-' to get a single IP address that can be sorted, as follows:
ips = {}
for key in sresult:
    if '-' in key:
        l = key.split('-')[0]
        ips[l] = key
    else:
        ips[key] = key
And then using code found on another post, sorting by IP address and then looking up the values in the original dictionary:
sips = sorted(ipaddress.ip_address(line.strip()) for line in ips)
for x in sips:
    print("SRC: " + ips[str(x)], "OBJECT: " + " :".join(list(set(sresult[ips[str(x)]]))), sep=",")
The problem I have encountered is that when I split the original range and add the sorted first IPs as new keys in another dictionary, I de-duplicate again, losing lines of data (lines 2 & 3 in the example):
line 1 10.102.152.64 -10.102.152.95
line 2 10.102.158.0 -10.102.158.255
line 3 10.102.158.0 -10.102.158.31
line 4 10.102.159.0 -10.102.255.25
becomes
line 1 10.102.152.64 -10.102.152.95
line 3 10.102.158.0 -10.102.158.31
line 4 10.102.159.0 -10.102.255.25
So upon rebuilding the original dictionary using the IP address sorted keys, I have lost data
Can anyone help please?
EDIT: This post now consists of three parts:
1) A bit of information about dictionaries that you will need in order to understand the rest.
2) An analysis of your code, and how you could fix it without using any other Python features.
3) What I would consider the best solution to the problem, in detail.
1) Dictionaries
Python dictionaries are not ordered (at least in Python versions before 3.7, where insertion order became guaranteed). If I have a dictionary like this:
dictionary = {"one": 1, "two": 2}
And I loop through dictionary.items(), I could get "one": 1 first, or I could get "two": 2 first. I don't know.
Every Python dictionary implicitly has two lists associated with it: a list of its keys and a list of its values. You can get them like this:
print(list(dictionary.keys()))
print(list(dictionary.values()))
These lists do have an ordering. So they can be sorted. Of course, doing so won't change the original dictionary, however.
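A minimal illustration of that point:

dictionary = {"two": 2, "one": 1}
for key in sorted(dictionary.keys()):
    print(key, dictionary[key])  # prints: one 1, then two 2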
2) Your code
What you realised is that, in your case, you only want to sort according to the first IP address in your dictionary's keys. Therefore, the strategy that you adopted is roughly as follows:
1) Build a new dictionary, where the keys are only this first part.
2) Get that list of keys from the dictionary.
3) Sort that list of keys.
4) Query the original dictionary for the values.
This approach will, as you noticed, fail at step 1, because as soon as you make the new dictionary with truncated keys, you lose the ability to differentiate between keys that were only different at the end. Every dictionary key must be unique.
A better strategy would be:
1) Build a function which can represent your "full" IP addresses as an ip_address object.
2) Sort the list of dictionary keys (original dictionary, don't make a new one).
3) Query the dictionary in order.
Let's look at how we could change your code to implement step 1.
def represent(full_ip):
    # Stylistic note: never use o or l as variable names.
    # They look just like 0 and 1.
    first_part = full_ip.split('-')[0]  # also correct when there is no '-'
    return ipaddress.ip_address(first_part.strip())
Now that we have a way to represent the full IP addresses, we can sort them according to this shortened version, without having to actually change the keys at all. All we have to do is tell Python's sorted method how we want the keys to be represented, using the key parameter (NB: this key parameter has nothing to do with a key in a dictionary; they just both happen to be called key):
# Another stylistic note: always use .keys() when looping over dictionary keys. Explicit is better than implicit.
sips = sorted(sresult.keys(), key=represent)
And if this ipaddress library works, there should be no problems up to here. The remainder of your code you can use as is.
3) The best solution
Whenever you are dealing with sorting something, it's always easiest to think about a much simpler problem: given two items, how would I compare them? Python gives us a way to do this. What we have to do is implement two data model methods: __lt__ (the comparison that sort and sorted actually use) and __eq__. Let's try doing that:
class IPAddress:
    def __init__(self, ip_address):
        self.ip_address = ip_address  # This will be the full IP address

    def __lt__(self, other):
        """Is this object less than the other one?"""
        # First, let's find the first parts of the ip addresses
        this_first_ip = self.ip_address.split("-")[0]
        other_first_ip = other.ip_address.split("-")[0]
        # Now let's put them into the external library
        this_object = ipaddress.ip_address(this_first_ip)
        other_object = ipaddress.ip_address(other_first_ip)
        return this_object < other_object

    def __eq__(self, other):
        """Are the two objects equal?"""
        return self.ip_address == other.ip_address
Cool, we have a class. Now, the data model methods will automatically be invoked any time I use "<" or "==". Let's check that it is working:
test_ip_1 = IPAddress("10.102.152.64-10.102.152.95")
test_ip_2 = IPAddress("10.102.158.0-10.102.158.255")
print(test_ip_1 < test_ip_2)
Now, the beauty of these data model methods is that Python's sort and sorted will use them as well:
dictionary_keys = sresult.keys()
dictionary_key_objects = [IPAddress(key) for key in dictionary_keys]
sorted_dictionary_key_objects = sorted(dictionary_key_objects)
# According to your latest comment, the line below is what you are missing
sorted_dictionary_keys = [obj.ip_address for obj in sorted_dictionary_key_objects]
And now you can do:
for key in sorted_dictionary_keys:
    print(key)
    print(sresult[key])
The Python data model is almost the defining feature of Python. I'd recommend reading about it.
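A related trick: functools.total_ordering can derive the remaining comparison methods (<=, >, >=) from __lt__ and __eq__. A sketch of the same class using it:

import functools
import ipaddress

@functools.total_ordering
class IPAddress:
    def __init__(self, ip_address):
        self.ip_address = ip_address  # the full "start-end" range string

    def __lt__(self, other):
        # Compare on the first address of each range.
        return (ipaddress.ip_address(self.ip_address.split("-")[0])
                < ipaddress.ip_address(other.ip_address.split("-")[0]))

    def __eq__(self, other):
        return self.ip_address == other.ip_address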

python: badly behaving dict inside a function- erroneous TypeError

I have dicts that I need to clean, e.g.
dict = {
    'sensor1': [list of numbers from sensor 1 pertaining to measurements on different days],
    'sensor2': [list of numbers from sensor 2 pertaining to measurements from different days],
    etc. }
Some days have bad values, and I would like to generate a new dict with all the sensor values from each bad day erased, using an upper limit on the values of one of the keys:
def clean_high(dict_name, key_string, limit):
    '''clean all the keys to eliminate the bad values from the arrays'''
    new_dict = dict_name
    for key in new_dict:
        new_dict[key] = new_dict[key][new_dict[key_string] < limit]
    return new_dict
If I run all the lines separately in IPython, it works. The bad days are eliminated, and the good ones are kept. These are both of type numpy.ndarray: new_dict[key] and new_dict[key][new_dict[key_string] < limit].
But, when I run clean_high(), I get the error:
TypeError: only integer arrays with one element can be converted to an index
What?
Inside of clean_high(), the type for new_dict[key] is a string, not an array.
Why would the type change? Is there a better way to modify my dictionary?
Do not modify a dictionary while iterating over it. According to the Python documentation: "Iterating views while adding or deleting entries in the dictionary may raise a RuntimeError or fail to iterate over all entries." Instead, create a new dictionary and fill it while iterating over the old one.
def clean_high(dict_name, key_string, limit):
    '''clean all the keys to eliminate the bad values from the arrays'''
    new_dict = {}
    for key in dict_name:
        new_dict[key] = dict_name[key][dict_name[key_string] < limit]
    return new_dict
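A quick usage sketch with made-up numbers (assuming the values are numpy arrays, as in the question):

import numpy as np

data = {
    'sensor1': np.array([1.0, 2.0, 50.0, 3.0, 4.0]),     # day 3 is bad
    'sensor2': np.array([10.0, 20.0, 30.0, 40.0, 50.0]),
}

# Drop every day where sensor1 reached the limit of 10.
cleaned = clean_high(data, 'sensor1', 10)
print(cleaned['sensor1'])  # [1. 2. 3. 4.]
print(cleaned['sensor2'])  # [10. 20. 40. 50.]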
