pool.apply_async with multiple parameters - python

The code below should query two databases at the same time. I tried to do it with
ThreadPool but ran into some difficulties. pool.apply_async doesn't seem to allow multiple parameters, so I put them into a tuple and then tried to unpack them. Is this the right approach, or is there a better solution?
The list of tuples is defined in params=... and each tuple has 3 entries. I would expect the function to be called twice, each time with 3 parameters.
def get_sql(self, *params):  # run with risk
    self.logger.info(len(params))
    sql = params[0]
    schema = params[1]
    db = params[2]
    self.logger.info("Running SQL with schema: {0}".format(schema))
    df = pd.read_sql(sql, db)
    return df
def compare_prod_uat(self):
    self.connect_dbrs_prod_db()
    self.connect_dbrs_uat_db()
    self.logger.info("connected to UAT and PROD database")

    sql = """ SELECT * FROM TABLE """
    params = [(sql, "DF_RISK_PRD_OWNER", self.db_dbrs_prod), (sql, "DF_RISK_CUAT_OWNER", self.db_dbrs_uat)]

    pool = ThreadPool(processes=2)
    self.logger.info("Calling Pool")
    result_prod = pool.apply_async(self.get_sql, (sql, "DF_RISK_PRD_OWNER", self.db_dbrs_prod))
    result_uat = pool.apply_async(self.get_sql, (sql, "DF_RISK_CUAT_OWNER", self.db_dbrs_uat))

    # df_prod = self.get_sql(sql, "DF_RISK_PRD_OWNER", self.db_dbrs_prod)
    # df_cuat = self.get_sql(sql, "DF_RISK_CUAT_OWNER", self.db_dbrs_uat)

    self.logger.info("Get return from uat")
    df1 = result_uat.get()  # get return value from the database call
    self.logger.info("Get return from prod")
    df2 = result_prod.get()  # get second return value from the database call

    return df1, df2

There may be many things wrong, but if you add
print(params)
as the first line of your get_sql, you'll see that you are sending in a tuple (sql, [(sql, "DF_RISK_PRD_OWNER", self.db_dbrs_prod), (sql, .....)]).
So yes, the length of params is always two: the first parameter is "sql", whatever that is in your implementation, and the second is a list of tuples of length three. I don't understand why you are sending (sql, params) instead of just (params,), as "sql" already appears in the list elements. If it needs to be there, your list is in params[1].
However, I don't understand how your worker function would traverse this list. It seems to be built to execute only one SQL statement, as it doesn't have a for loop. Maybe you intended to do the for loop in your compare_prod_uat function and spawn as many workers as there are elements in your list? I don't know, but as it stands it doesn't make much sense.
The parameter issue can be fixed as described above, though.
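For reference, a minimal sketch of the fixed calling pattern, with a stand-in worker in place of the real database call (the "SELECT 1" query, the schema strings, and the None db argument are placeholders, not the asker's values): apply_async takes an args tuple and unpacks it into the worker's positional parameters, so each call receives three separate arguments.
from multiprocessing.pool import ThreadPool

def get_sql(sql, schema, db):
    # each element of the args tuple arrives as its own parameter
    return "ran {0} against schema {1}".format(sql, schema)

pool = ThreadPool(processes=2)
# one async call per parameter tuple; nothing is wrapped in an extra tuple
result_prod = pool.apply_async(get_sql, ("SELECT 1", "PROD_SCHEMA", None))
result_uat = pool.apply_async(get_sql, ("SELECT 1", "UAT_SCHEMA", None))
print(result_prod.get())
print(result_uat.get())
pool.close()
pool.join()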

Related

map() with partial arguments: save up space

I have a very large dictionary whose keys are triples of (string, float, string) and whose values are again lists.
cols_to_aggr is basically a list(defaultdict(list)).
I wish I could pass to my function _compute_aggregation not only the index i but also only the data contained at that index, namely cols_to_aggr[i], instead of the whole data structure cols_to_aggr, so the parallelized function doesn't have to extract the smaller chunk itself.
The problem is that passing the whole data structure causes my Pool to eat up all my memory with no efficiency at all.
with multiprocessing.Pool(processes=n_workers, maxtasksperchild=500) as pool:
    results = pool.map(
        partial(_compute_aggregation, cols_to_aggr=cols_to_aggr,
                aggregations=aggregations, pivot_ladetag_pos=pivot_ladetag_pos,
                to_ix=to_ix), cols_to_aggr)

def _compute_aggregation(index, cols_to_aggr, aggregations, pivot_ladetag_pos, to_ix):
    data_to_process = cols_to_aggr[index]
As a patch for my memory issue I tried to set maxtasksperchild, but without success; I have no clue how to set it optimally.
Using dict.values(), you can iterate over the values of a dictionary.
So you could change your code to:
with multiprocessing.Pool(processes=n_workers, maxtasksperchild=500) as pool:
    results = pool.map(
        partial(_compute_aggregation,
                aggregations=aggregations, pivot_ladetag_pos=pivot_ladetag_pos,
                to_ix=to_ix), cols_to_aggr.values())

def _compute_aggregation(value, aggregations, pivot_ladetag_pos, to_ix):
    data_to_process = value
If you still need the keys in your _compute_aggregation function, use dict.items() instead.
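If the keys are needed, a sketch of that items() variant (same names as above; the worker now receives one (key, value) pair and unpacks it):
with multiprocessing.Pool(processes=n_workers, maxtasksperchild=500) as pool:
    results = pool.map(
        partial(_compute_aggregation,
                aggregations=aggregations, pivot_ladetag_pos=pivot_ladetag_pos,
                to_ix=to_ix), cols_to_aggr.items())

def _compute_aggregation(item, aggregations, pivot_ladetag_pos, to_ix):
    key, data_to_process = item  # each worker gets one (key, value) pair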

Creating Generator object from record list within a function

I'm trying to create a generator object for a list of records built from data in a MySQL database, so I'm passing the MySQL cursor object to the function as a parameter.
My issue is that if the "if" block containing yield records is commented out, the cust_records function works perfectly fine, but if I uncomment that line the function stops working.
I'm not sure whether this is just not the way to yield a list object in Python 3.
My code so far:
def cust_records(result_set):
    block_id = None
    records = []
    i = 0
    for row in result_set:
        records.append((row['id'], row, smaller_ids))
    if records:
        yield records
The point of generators is lazy evaluation, so storing all records in a list and yielding the list makes no sense at all. If you want to retain lazy evaluation (which is IMHO preferable, especially if you have to work on arbitrary datasets that might get huge), you want to yield each record, i.e.:
def cust_records(result_set):
    for row in result_set:
        yield (row['id'], row, smaller_ids)

# then
def example():
    cursor.execute(<your_sql_query_here>)
    for record in cust_records(cursor):
        print(record)
Otherwise (if you really want to consume as much memory as possible), just make cust_records a plain function:
def cust_records(result_set):
    records = []
    for row in result_set:
        records.append((row['id'], row, smaller_ids))
    return records
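To make the memory point concrete, a small usage sketch (assuming the generator version above and a cursor on which execute has already been called, as in example()): with lazy evaluation you can stop early without ever materializing the full result set.
from itertools import islice

# consume only the first 100 records; the rest of result_set is never built
for record in islice(cust_records(cursor), 100):
    print(record)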

sqlalchemy print results instead of objects

I am trying to print the results of my query to the console with the code below, but it keeps printing the object's location instead.
test = connection.execute('SELECT EXISTS(SELECT 1 FROM "my_table" WHERE Code = 08001)')
print(test)
Result that I get is - <sqlalchemy.engine.result.ResultProxy object at 0x000002108508B278>
How do I print out the value instead of the object?
Reason I am asking is because I want to use the value to do comparison tests.
EG:
if each_item not in test:
# do stuffs
As written, test is a result object wrapping the rows returned by your query, not the values themselves.
If you want something you can iterate over, you can use fetchall for example. Like this:
test = connection.execute('SELECT EXISTS(SELECT 1 FROM "my_table" WHERE Code = 08001)').fetchall()
Then you can iterate over a list of rows, where each row contains all the fields. In your example you access fields by their position; the query returns a single field, at position 0, so that's how you access it:
for row in test:
    print(row[0])
test is an object containing the row values, so if a column's name is value, you can access it as test.value.
If you're looking for a more "convenient" way of doing this (like iterating through each column of test), you'd have to define those functions explicitly (either as methods of test or as separate functions designed to iterate through rows of that type).
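Since the query wraps everything in EXISTS and returns exactly one value, one more option worth noting, sketched here assuming SQLAlchemy 1.x string-style execution as in the question: scalar() returns the first column of the first row directly, so no iteration is needed.
test = connection.execute(
    'SELECT EXISTS(SELECT 1 FROM "my_table" WHERE Code = 08001)'
).scalar()
if test:
    # do stuffs
    pass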

Python/SQL - How the cursor functions

In my code I have a function that needs to return either a string or None depending on what is present in the database. However, at the moment the result is a list with the string answer inside, or None. Is there any change I could make so that just a string or None is returned, rather than having to index the list?
Here is the code:
def retrieve_player_name(username):
    param = [(username)]
    command = ("""
        SELECT username FROM players
        WHERE username = ?
        """)
    result = cur.execute(command, param).fetchone()
    if result is not None:
        return result[0]
Thanks in advance.
A database cursor fetches entire rows, not values.
Even a row with a single value inside is still a row.
If you don't want to write row[0] multiple times, create a helper function execute_and_return_a_single_value_from_query().
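A sketch of such a helper; the name fetch_single_value is made up for illustration, and the module-level cur is assumed from the question's code:
def fetch_single_value(command, params=()):
    '''Run a query and return the first column of the first row, or None.'''
    row = cur.execute(command, params).fetchone()
    return row[0] if row is not None else None

def retrieve_player_name(username):
    return fetch_single_value(
        "SELECT username FROM players WHERE username = ?", (username,))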

pysqlite, query for duplicate entries with swapped columns

Currently I have a pysqlite db that I am using to store a list of road conditions. The source this list is generated from, however, is buggy and sometimes generates duplicates. Some of these duplicates have the start and end points swapped but everything else the same.
The method I currently have looks like this:
def getDupes(self):
    '''This method is used to return a list of duplicate entries
    '''
    self.__curs.execute('SELECT * FROM roadCond GROUP BY road, start, end, cond, reason, updated, county, timestmp HAVING count(*)>1')
    result = self.__curs.fetchall()

    def getSwaps():
        '''This method is used to grab the duplicates with swapped columns
        '''
        self.__curs.execute('SELECT * FROM roadCond WHERE ')
        extra = self.__curs.fetchall()
        return extra

    result.extend(getSwaps())
    return result
The initial query works, but I am suspicious of it (I think there is a better way, I just don't know it), and I am not at all sure how to make the inner method work.
Thank you ahead of time. :-D
Instead of the first query, you could use
SELECT DISTINCT * FROM roadCond
which will retrieve all the records from the table, removing any duplicates.
As for the inner method, this query will return all the records which have "duplicates" with start and end swapped. Note that, for each record with "duplicates", this query will return both the "original" and the "copy".
SELECT DISTINCT * FROM roadCond WHERE EXISTS (
    SELECT * FROM roadCond rc2 WHERE
        roadCond.road = rc2.road AND
        roadCond.end = rc2.start AND roadCond.start = rc2.end AND
        roadCond.cond = rc2.cond AND
        ... AND
        roadCond.timestmp = rc2.timestmp)
Edit: To detect and remove "duplicates" with start and end swapped, you could make sure that your data always stores these two values in the same order:
UPDATE roadCond SET start = end, end = start WHERE end < start;
But this approach only works if it doesn't matter which point is which.
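If that holds, a sketch of the full normalize-then-dedup flow (table and column names taken from the question; rowid is SQLite's built-in row identifier, and the function name is made up for illustration):
def removeSwappedDupes(curs):
    # normalize: SQL evaluates the right-hand side against the old row
    # values, so this really swaps start and end where they are reversed
    curs.execute('UPDATE roadCond SET start = end, end = start WHERE end < start')
    # swapped duplicates are now exact duplicates; keep one copy of each
    curs.execute('''DELETE FROM roadCond
                    WHERE rowid NOT IN (
                        SELECT MIN(rowid) FROM roadCond
                        GROUP BY road, start, end, cond, reason,
                                 updated, county, timestmp)''')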
