I have a very large dictionary whose keys are triples of (string, float, string) and whose values are again lists.
cols_to_aggr is basically a list(defaultdict(list))
I wish I could pass to my function _compute_aggregation not only the index i but also only the data contained at that index, namely cols_to_aggr[i], instead of the whole data structure cols_to_aggr and then having to extract the smaller chunk inside my parallelized function.
The problem is that passing the whole data structure causes my Pool to eat up all my memory with no efficiency at all.
with multiprocessing.Pool(processes=n_workers, maxtasksperchild=500) as pool:
    results = pool.map(
        partial(_compute_aggregation, cols_to_aggr=cols_to_aggr,
                aggregations=aggregations, pivot_ladetag_pos=pivot_ladetag_pos,
                to_ix=to_ix), cols_to_aggr)

def _compute_aggregation(index, cols_to_aggr, aggregations, pivot_ladetag_pos, to_ix):
    data_to_process = cols_to_aggr[index]
To patch my memory issue I tried setting maxtasksperchild, but without success; I have no clue how to set it optimally.
Using dict.values(), you can iterate over the values of a dictionary.
So you could change your code to:
with multiprocessing.Pool(processes=n_workers, maxtasksperchild=500) as pool:
    results = pool.map(
        partial(_compute_aggregation,
                aggregations=aggregations, pivot_ladetag_pos=pivot_ladetag_pos,
                to_ix=to_ix), cols_to_aggr.values())

def _compute_aggregation(value, aggregations, pivot_ladetag_pos, to_ix):
    data_to_process = value
If you still need the keys in your _compute_aggregation function, use dict.items() instead.
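If the keys are needed, a minimal sketch of that dict.items() variant could look like the following (same names as above; the body of _compute_aggregation is elided):

# Minimal sketch: with dict.items(), each worker receives a single
# (key, value) pair instead of the whole cols_to_aggr mapping.
with multiprocessing.Pool(processes=n_workers, maxtasksperchild=500) as pool:
    results = pool.map(
        partial(_compute_aggregation,
                aggregations=aggregations, pivot_ladetag_pos=pivot_ladetag_pos,
                to_ix=to_ix), cols_to_aggr.items())

def _compute_aggregation(item, aggregations, pivot_ladetag_pos, to_ix):
    key, data_to_process = item
    ...  # aggregate data_to_process here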
In my application I run the following code:
key = lambda r: (r["rule"], r["diameter"])
sorted_rules = sorted(raw_rules, key = key)
grouped_rules = itertools.groupby(sorted_rules, key = key)
This works as expected, I then have a function that does further processing on each of the resulting groups, like so:
def do_stuff(group):
    key, values = group
    # do stuff here
Now, because this processing is done on a very large list (~500k elements), I'd like to make use of some multiprocessing and I tried using concurrent.futures.ProcessPoolExecutor, like so:
# above is the code with grouped_rules variable
with ProcessPoolExecutor() as executor:
    res = executor.map(do_stuff, grouped_rules)
However, when I run this, the values iterable inside do_stuff is empty. This does not happen if I replace executor.map with the regular single-threaded map. Any idea why this might be happening?
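For reference, itertools.groupby yields lazy groups that are views over the single shared, sorted iterator rather than independent containers, which is consistent with the groups arriving empty in the worker processes. A hedged workaround sketch (assuming do_stuff as above) is to materialize each group into a (key, list) pair before mapping:

# Sketch: materialize each lazy group into a (key, list) pair so the group
# contents can be pickled and sent to the worker processes.
materialized = ((k, list(v)) for k, v in grouped_rules)
with ProcessPoolExecutor() as executor:
    res = executor.map(do_stuff, materialized)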
I have a function that makes REST requests. The function requests data from a remote Confluence server in chunks; each chunk requests 25 results from the REST API.
for result in response.json()['results']:
    user = dict()
    user['group'] = result['name']
    groups.append(user)
size = response.json()['size']
start += chunk
return groups
The result variable holds one of the 25 results returned by a REST request. Results are returned as dictionaries, so I take the values I need from each returned dictionary and store them in a new user dictionary.
Then I make the next query and turn the following 25 results into new user dictionaries in the same way.
Finally, I append all the dictionaries to the groups list, so the function returns a list consisting of multiple dictionaries.
More on this:
I use this function to collect data into a dataframe and write it to a file. Because my function retrieves a list of dictionaries, I expand the result into a single list as follows.
groups = list()

print('Retrieving groups:')
groups.extend(get_groups())

df = pd.DataFrame(
    groups,
    columns=['group']
)
The question is: how do I track the progress of how the groups list gets filled with dictionaries? Currently the step after print('Retrieving groups:') takes a lot of time, and I want a progress bar to track it.
I thought of
tqdm.pandas(desc="desc")
for user in tqdm(groups):
    pass
Within the get_groups() function, but this doesn't seem like the right way to do it. I'd appreciate your help. TIA.
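One possible approach, sketched under assumptions about the parts of get_groups() not shown (the endpoint URL, the paging fields, and the stopping condition are hypothetical here), is to drive a manually updated tqdm bar from inside the pagination loop, advancing it by the number of results in each chunk:

from tqdm import tqdm
import requests

# Hypothetical endpoint; the real URL and authentication are not shown above.
URL = 'https://confluence.example.com/rest/api/group'

def get_groups():
    groups = []
    start, chunk = 0, 25
    # Manually updated bar: the total is unknown until the last page,
    # so the bar simply counts the groups retrieved so far.
    with tqdm(desc='Retrieving groups', unit=' groups') as pbar:
        while True:
            response = requests.get(URL, params={'start': start, 'limit': chunk})
            results = response.json()['results']
            for result in results:
                groups.append({'group': result['name']})
            pbar.update(len(results))
            if len(results) < chunk:  # assumed stopping condition: short last page
                break
            start += chunk
    return groups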
I have a function which accepts two inputs provided by itertools.combinations and outputs a solution. The two inputs should be stored as a tuple forming the key in the dict, while the result is the value.
I can pool this and get all of the results as a list, which I can then insert into a dictionary one-by-one, but this seems inefficient. Is there a way to get the results as each job finishes, and directly add it to the dict?
Essentially, I have the code below:
all_solutions = {}
for start, goal in itertools.combinations(graph, 2):
    all_solutions[(start, goal)] = search(graph, start, goal)
I am trying to parallelize it as follows:
all_solutions = {}
manager = multiprocessing.Manager()
graph_pool = manager.dict(graph)
pool = multiprocessing.Pool()
results = pool.starmap(search, zip(itertools.repeat(graph_pool),
                                   itertools.combinations(graph, 2)))
for i, start_goal in enumerate(itertools.combinations(graph, 2)):
    start, goal = start_goal[0], start_goal[1]
    all_solutions[(start, goal)] = results[i]
Which actually works, but iterates twice, once in the pool, and once to write to a dict (not to mention the clunky tuple unpacking).
This is possible, you just need to switch to using a lazy mapping function (not map or starmap, which have to finish computing all the results before you can begin using any of them):
import itertools
import multiprocessing
from functools import partial
from itertools import tee
manager = multiprocessing.Manager()
graph_pool = manager.dict(graph)
pool = multiprocessing.Pool()
# Since you're processing in order and in parallel, tee might help a little
# by only generating the dict keys/search arguments once. That said,
# combinations of n choose 2 are fairly cheap; the overhead of tee's caching
# might overwhelm the cost of just generating the combinations twice
startgoals1, startgoals2 = tee(itertools.combinations(graph, 2))
# Use partial binding of search with graph_pool to be able to use imap
# without a wrapper function; using imap lets us consume results as they become
# available, so the tee-ed generators don't store too many temporaries
results = pool.imap(partial(search, graph_pool), startgoals2)
# Efficiently create the dict from the start/goal pairs and the results of the search
# This line is eager, so it won't complete until all the results are generated, but
# it will be consuming the results as they become available in parallel with
# calculating the results
all_solutions = dict(zip(startgoals1, results))
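As a variation (a sketch, not part of the original answer): if result ordering is not needed, a small module-level wrapper that returns its own key alongside the result lets imap_unordered feed dict() directly, at the cost of one extra function:

# Sketch: the worker returns ((start, goal), result) pairs, so even
# out-of-order completion can be fed straight into dict().
def keyed_search(graph, start_goal):
    start, goal = start_goal
    return (start, goal), search(graph, start, goal)

with multiprocessing.Pool() as pool:
    all_solutions = dict(
        pool.imap_unordered(partial(keyed_search, graph_pool),
                            itertools.combinations(graph, 2)))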
The code below should query two databases at the same time. I tried to do it with
ThreadPool but ran into some difficulties. pool.apply_async doesn't seem to allow multiple parameters, so I put them into a tuple and then try to unpack them. Is this the right approach, or is there a better solution?
The list of tuples is defined in params=... and each tuple has 3 entries. I would expect the function to be called twice, each time with 3 parameters.
def get_sql(self, *params):  # run with risk
    self.logger.info(len(params))
    sql = params[0]
    schema = params[1]
    db = params[2]
    self.logger.info("Running SQL with schema: {0}".format(schema))
    df = pd.read_sql(sql, db)
    return df
def compare_prod_uat(self):
    self.connect_dbrs_prod_db()
    self.connect_dbrs_uat_db()
    self.logger.info("connected to UAT and PROD database")

    sql = """ SELECT * FROM TABLE """

    params = [(sql, "DF_RISK_PRD_OWNER", self.db_dbrs_prod), (sql, "DF_RISK_CUAT_OWNER", self.db_dbrs_uat)]

    pool = ThreadPool(processes=2)
    self.logger.info("Calling Pool")
    result_prod = pool.apply_async(self.get_sql, (sql, "DF_RISK_PRD_OWNER", self.db_dbrs_prod))
    result_uat = pool.apply_async(self.get_sql, (sql, "DF_RISK_CUAT_OWNER", self.db_dbrs_uat))

    # df_prod = self.get_sql(sql, "DF_RISK_PRD_OWNER", self.db_dbrs_prod)
    # df_cuat = self.get_sql(sql, "DF_RISK_CUAT_OWNER", self.db_dbrs_uat)

    self.logger.info("Get return from uat")
    df1 = result_uat.get()  # get return value from the database call
    self.logger.info("Get return from prod")
    df2 = result_prod.get()  # get second return value from the database call

    return df1, df2
There may be many things wrong, but if you add
print(params)
as the first line of your get_sql, you'll see that you send in a tuple (sql, [(sql, "DF_RISK_PRD_OWNER", self.db_dbrs_prod), (sql, .....)])
So yes, the length of params is always two: the first parameter is "sql", whatever that is in your implementation, and the second is an array of tuples of length three. I don't understand why you are sending (sql, params) instead of just (params,), since "sql" already seems to be there in the array elements. If it needs to be there, your array is in params[1].
However, I don't understand how your worker function would traverse this array. It seems built to execute only one SQL statement, as it doesn't have a for loop. Maybe you intended to do the for loop in your compare_prod_uat function and spawn as many workers as you have elements in your array? I don't know, but it currently doesn't make much sense.
The parameter issue can be fixed by this, though.
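For illustration only (a sketch built on the names in the question, not code from the answer): inside compare_prod_uat, one way to keep a single params list and still call get_sql once per 3-tuple is to submit one apply_async per tuple and collect the results afterwards:

from multiprocessing.pool import ThreadPool

# Sketch: one apply_async per (sql, schema, db) tuple in params, so get_sql
# receives exactly three positional parameters on every call.
pool = ThreadPool(processes=len(params))
async_results = [pool.apply_async(self.get_sql, p) for p in params]
df_prod, df_uat = [r.get() for r in async_results]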
I'm currently dealing with a structure of dictionaries that starts with a dozen items and grows to a dozen million items after a few iterations. Fundamentally, each item is defined by several IDs, a value, and characteristics. I build my dict from JSON data I gather from a SQL server.
The operations I execute are, for example:
get SQL results in JSON
search for items where 'id1' and/or 'id2' are identical
merge all items with the same 'id1' by summing float('value')
Here is an example of what my dict looks like:
[
    {'id1': '01234-01234-01234',
     'value': '10',
     'category': 'K'}
    ...
    {'id1': '01234-01234-01234',
     'value': '5',
     'category': 'K'}
    ...
]
What I would like to get looks like:
[
    ...
    {'id1': '01234-01234-01234',
     'value': '15',
     'category': 'K'}
    ...
]
I could use a dict of dicts instead:
{
    '01234-01234-01234': {'value': '10',
                          'categorie': 'K'}
    ...
    '01234-01234-01234': {'value': '5',
                          'categorie': 'K'}
    ...
}
and get:
{'01234-01234-01234': {'value': '15',
                       'categorie': 'K'}
 ...
}
I have just 4 GB of dedicated RAM and millions of dicts in one structure on a 64-bit architecture. I would like to optimise my code and my operations in time and RAM. Are there tricks or containers better than a dictionary of dictionaries for this kind of operation? Is it better to create a new object that replaces the previous one at each iteration, or to modify the existing object in place?
I'm using Python 3.4.
EDIT: simplified the question to a single question about the value.
The question is similar to How to sum dict elements or Fastest way to merge n-dictionaries and add values on 2.6, but in my case I have strings in my dicts.
EDIT2: for the moment, the best performance is obtained with this method:
from copy import deepcopy

def merge_similar_dict(input_list):
    i = 0
    # Sort the list of exchange dicts by id so duplicates end up adjacent.
    try:
        merge_list = sorted(deepcopy(input_list), key=lambda k: k['id'])
        while (i + 1) <= (len(merge_list) - 1):
            while merge_list[i]['id'] == merge_list[i + 1]['id']:
                merge_list[i]['amount'] = str(float(merge_list[i]['amount'])
                                              + float(merge_list[i + 1]['amount']))
                merge_list.remove(merge_list[i + 1])
                if i + 1 >= len(merge_list):
                    break
            i += 1
    except Exception as error:
        print('The merge of similar dicts has failed')
        print(error)
        raise
    return merge_list
When I get tens of thousands of dicts in the list, it starts to take a very long time (several minutes).
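For comparison, here is a minimal sketch of the dict-keyed alternative already described above; it assumes the same 'id'/'amount' keys as merge_similar_dict and, for brevity, drops the other fields such as the category:

from collections import defaultdict

def merge_by_id(input_list):
    # Single O(n) pass: accumulate float totals keyed by 'id' instead of the
    # sort-and-remove loop above; values go back to strings at the end.
    totals = defaultdict(float)
    for item in input_list:
        totals[item['id']] += float(item['amount'])
    return [{'id': key, 'amount': str(total)} for key, total in totals.items()]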