Easiest way to implement multithreading in this function [Python] - python

So I have data known as id_list coming into the function in this format: [(u'SGP-3630', 1202), (u'MTSCR-534', 1244)]. The format is two values paired together, and there could be one pair or a hundred pairs.
This is the function:
def ListParser(id_list):
    list_length = len(id_list)
    count = 0
    table = ""
    while count < list_length:
        jira = id_list[count][0]
        stash = id_list[count][1]
        count = count + 1
        table = table + RetrieveFromAPI(stash, jira)
    table = TableFormatter(table)
    table = TableColouriser(table)
    return table
What this function does is go through the list, extract each pair, and pass it to a function called RetrieveFromAPI(), which fetches information from a URL.
Anyone have an idea on how to implement multithreading here? I've had a shot at splitting the pairs into two separate lists and getting the pool to iterate through each list, but it hasn't quite worked.
def ListParser(id_list):
    pool = ThreadPool(4)
    list_length = len(id_list)
    count = 0
    table = ""
    jira_list = list()
    stash_list = list()
    while count < list_length:
        jira_list = jira_list.extend(id_list[count][0])
        print jira_list
        stash_list = stash_list.extend(id_list[count][1])
        print stash_list
        count = count + 1
    table = table + pool.map(RetrieveFromAPI, stash_list, jira_list)
    table = TableFormatter(table)
    table = TableColouriser(table)
    return table
The error I'm getting for this attempt is TypeError: 'int' object is not iterable
EDIT 2: Okay so I've managed to get the first list with tuples split up into two different lists, but I'm unsure how to get multithreading working with it.
jira, stash = map(list, zip(*id_list))

You're working too hard! From help(multiprocessing.pool.ThreadPool):
map(self, func, iterable, chunksize=None)
    Apply `func` to each element in `iterable`, collecting the results
    in a list that is returned.
The second argument is an iterable of the arguments you want to pass to the worker threads. You have a list of tuples (or lists) and you want the first two items of each inner sequence for each call. id_list is already iterable, so we're close. A small function (in this case implemented as a lambda) bridges the gap.
I worked up a full mock solution just to make sure it works, so here it is. As an aside, you can benefit from a fairly large pool size since the threads spend much of their time waiting on I/O.
from multiprocessing.pool import ThreadPool

def RetrieveFromAPI(stash, jira):
    # boring mock of api
    return '{}-{}.'.format(stash, jira)

def TableFormatter(table):
    # mock
    return table

def TableColouriser(table):
    # mock
    return table

def ListParser(id_list):
    if id_list:
        pool = ThreadPool(min(12, len(id_list)))
        table = ''.join(pool.map(lambda item: RetrieveFromAPI(item[1], item[0]),
                                 id_list, chunksize=1))
        pool.close()
        pool.join()
    else:
        table = ''
    table = TableFormatter(table)
    table = TableColouriser(table)
    return table

id_list = [[0, 1, 'foo'], [2, 3, 'bar'], [4, 5, 'baz']]
print(ListParser(id_list))
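On Python 3, a variation of the same idea using concurrent.futures (a sketch I'm adding, not part of the answer above) avoids the lambda, because executor.map accepts one iterable per parameter of the worker function. It assumes the question's original two-element (jira, stash) pairs rather than the three-element mock list above:

from concurrent.futures import ThreadPoolExecutor

def ListParser(id_list):
    if not id_list:
        return TableColouriser(TableFormatter(''))
    # Unzip [(jira, stash), ...] into two parallel sequences.
    jiras, stashes = zip(*id_list)
    with ThreadPoolExecutor(max_workers=min(12, len(id_list))) as executor:
        # executor.map calls RetrieveFromAPI(stash, jira) once per pair
        # and returns the results in input order.
        table = ''.join(executor.map(RetrieveFromAPI, stashes, jiras))
    return TableColouriser(TableFormatter(table))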

Related

FastAPI Endpoint Pagination with Generator

I converted the path operation to a generator, as shown below, so that a generator is produced when a call is made to the endpoint and I can loop over pages of data instead of getting all the data at once. What I actually observe is that I still get all the data at once: instead of a list of dictionaries, I get a two-dimensional list containing several lists whose length equals my page size and whose elements are the dictionaries. Why did the generator behave this way, and what should I use to implement pagination? Does storing the data in a global variable and returning a chunk of it on each call to the endpoint make sense?
@overlaps_router.get("/query")
def query_overlaps(query: str,
                   testdrive_id_list: List[str] = Query(None),
                   predecessor: bool = False,
                   db: Session = Depends(get_db)):
    try:
        data_to_paginate = []
        if not data_to_paginate:
            logger.info("No data to paginate. Data will be calculated!")
            results = crud.query_overlaps(query, db, testdrive_id_list)
            logger.info(f"First row of results: {results[0]}")
            if predecessor:
                result_with_predecessor = calculate_predecessor(results)
                data_to_paginate = result_with_predecessor
            else:
                data_to_paginate = process_result(results)
        logger.info(f"Length of data to paginate:{len(data_to_paginate)}")
        limit = environment.env_variables["PAGINATION_OFFSET"]
        page = []
        for datum in data_to_paginate:
            if len(page) < limit:
                page.append(datum)
            else:
                yield page
                page = []
        yield page
    except Exception as e:
        logger.exception(f"Exception: {str(e)}")
The endpoint is called somewhere else:
# This one returns three lists in another list because
# data size is 2125 records and PAGINATION_OFFSET env
# variable is 1000, so it still returns all data at once
# in three different lists.
result_text = requests.get(request_string, params=params).text
print(json.loads(result_text))
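A common way to get page-by-page behaviour here is to return one page per request via limit/offset query parameters instead of yielding from the path operation: as the output above shows, the generator is consumed in full and every yielded page ends up as one inner list in a single JSON response. A minimal sketch of that approach, where the route path, load_data and the parameter bounds are illustrative assumptions rather than parts of the question:

from typing import List
from fastapi import FastAPI, Query

app = FastAPI()

def load_data() -> List[dict]:
    # Placeholder for the real query / predecessor / processing step.
    return [{"n": i} for i in range(2125)]

@app.get("/query")
def query_overlaps(offset: int = Query(0, ge=0),
                   limit: int = Query(1000, ge=1, le=1000)):
    data = load_data()
    # Return only the requested slice; the client fetches the next page
    # by increasing offset on its next request.
    return data[offset:offset + limit]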

how to iterate array of dictionary without loop using django?

This is my scenario: I have 30 records in an array of dictionaries in Django. Iterating over it works fine, but it takes around one minute. How can I reduce the iteration time? I tried the map function but it is not working. How do I fix this? I will share my example code.
Example Code
def find_places():
    data = [{'a':1},{'a':2},{'a':3},{'a':4},{'a':5},{'a':6},{'a':7},{'a':8}]
    places = []
    for p in range(1, len(data)):
        a = p.a
        try:
            s1 = sample.object.filter(a=a)
        except:
            s1 = sample(a=a)
            s1.save()
        plac = {id: s1.id,
                a: s1.a}
        places.append(plac)
    return places
find_places()
I need an efficient way to iterate the array of objects in python without a loop.
You can filter outside the loop and use get_or_create instead of falling back to creating an object when the filter doesn't match.
data_a = [d['a'] for d in data]
samples = sample.objects.filter(a__in=data_a)
places = []
for a in data_a:
    s1, created = samples.get_or_create(
        a=a
    )
    place = {'id': s1.id, 'a': s1.a}
    places.append(place)
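A further refinement along the same lines (my addition, not part of the answer above): build a lookup dict from the prefetched queryset so rows that already exist cost no extra queries, and only the missing ones trigger a create:

data_a = [d['a'] for d in data]
# One query for everything that already exists, keyed by the 'a' value.
existing = {s.a: s for s in sample.objects.filter(a__in=data_a)}

places = []
for a in data_a:
    s1 = existing.get(a) or sample.objects.create(a=a)
    places.append({'id': s1.id, 'a': s1.a})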
You can try this: create the list first and then save it all at once:
def find_places():
    data = [{'a':1},{'a':2},{'a':3},{'a':4},{'a':5},{'a':6},{'a':7},{'a':8}]
    places = []
    lst = []
    for p in data:
        a = p['a']
        lst.append(a)  # store it at once
Then store the list in the database in one go. You can search for: How to store a list into a Model in Django.
I only made changes to the loop in the code; if the database side also fails, let me know.
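A minimal sketch of the "save it all at once" idea using Django's bulk_create, assuming a model named sample with a field a as in the question (duplicate handling kept deliberately simple):

def find_places():
    data = [{'a': 1}, {'a': 2}, {'a': 3}, {'a': 4}]
    values = [d['a'] for d in data]
    # One query to find what already exists, one INSERT for everything missing.
    existing = set(sample.objects.filter(a__in=values).values_list('a', flat=True))
    sample.objects.bulk_create([sample(a=v) for v in values if v not in existing])
    # Re-query so every row (old and new) comes back with a primary key.
    return [{'id': s.id, 'a': s.a} for s in sample.objects.filter(a__in=values)]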

How to reset cursor after iterating SQLAlchemy query resultset [duplicate]

I queried two databases to get two relations. I iterate over those relations once to form maps, and then again to perform some calculations. However, when I attempt to iterate over the same relations a second time, I find that no iteration actually occurs. Here is the code:
dev_connect = dev_engine.connect()
prod_connect = prod_engine.connect()  # from a different database

Relation1 = dev_engine.execute(sqlquery1)
Relation2 = prod_engine.execute(sqlquery)

before_map = {}
after_map = {}
for row in Relation1:
    before_map[row['instrument_id']] = row
for row2 in Relation2:
    after_map[row2['instrument_id']] = row2

update_count = insert_count = delete_count = 0
change_list = []
count = 0
for prod_row in Relation2:
    count += 1
    result = list(prod_row)
    ...
    change_list.append(result)

count2 = 0
for before_row in Relation1:
    count2 += 1
    result = before_row
    ...

print count, count2  # prints 0
before_map and after_map are not empty, so Relation1 and Relation2 definitely have tuples in them. Yet count and count2 are 0, so the prod_row and before_row for loops aren't actually executing. Why can't I iterate over Relation1 and Relation2 a second time?
When you call execute on a SQLAlchemy engine, you get back a ResultProxy, which is a facade over a DBAPI cursor for the rows your query returns.
Once you iterate over all the results of the ResultProxy, it automatically closes the underlying cursor so you can't use the results again by just iterating over it, as documented on the SQLAlchemy page:
The returned result is an instance of ResultProxy, which references a DBAPI cursor and provides a largely compatible interface with that of the DBAPI cursor. The DBAPI cursor will be closed by the ResultProxy when all of its result rows (if any) are exhausted.
You can solve your problem in a couple of ways:
Store the results in a list. Just use a list comprehension on the rows returned:
Relation1 = dev_engine.execute(sqlquery1)
relation1_items = [r for r in Relation1]
# ...
# now you can iterate over relation1_items as much as you want
Do everything you need to in one pass through each row set returned. I don't know if this option is feasible for you, since I don't know whether the full extent of your calculations requires cross-referencing between your before_map and after_map objects.
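As a variation on the first option (my addition, not from the answer): with the legacy engine.execute API used above, the ResultProxy also has a fetchall() method that materializes the rows into ordinary lists up front, after which both passes work:

Relation1 = dev_engine.execute(sqlquery1).fetchall()
Relation2 = prod_engine.execute(sqlquery).fetchall()

# The rows now live in plain Python lists, so building the maps and the
# later counting loops can iterate them as many times as needed.
before_map = {row['instrument_id']: row for row in Relation1}
after_map = {row2['instrument_id']: row2 for row2 in Relation2}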

replace for loop to parallel process in pyspark

I am using a for loop in my script to call a function for each element of size_DF (a data frame), but it is taking a lot of time. I tried replacing the for loop with map, but I am not getting any output.
size_DF is a list of around 300 elements which I am fetching from a table.
Using For:
import call_functions

newObject = call_functions.call_functions_class()
size_RDD = sc.parallelize(size_DF)

if len(size_DF) == 0:
    print "No record present in the truncated list"
else:
    for row in size_DF:
        length = row[0]
        print "length: ", length
        insertDF = newObject.full_item(sc, dataBase, length, end_date)
Using Map:
if len(size_DF) == 0:
    print "No record present in the list"
else:
    size_RDD.mapPartition(lambda l: newObject.full_item(sc, dataBase, len(l[0]), end_date))
In full_item() I am doing some select operations, joining 2 tables, and inserting the data into a table.
Please help me and let me know what I am doing wrong.
The pyspark.rdd.RDD.mapPartitions method is lazily evaluated.
Usually, to force an evaluation, you call a method that returns a value on the lazy RDD instance that is returned.
There are higher-level functions that take care of forcing an evaluation of the RDD values, e.g. pyspark.rdd.RDD.foreach.
Since you don't really care about the results of the operation, you can use pyspark.rdd.RDD.foreach instead of pyspark.rdd.RDD.mapPartitions.
def first_of(it):
    for first in it:
        return first
    return []

def insert_first(it):
    first = first_of(it)
    item_count = len(first)
    newObject.full_item(sc, dataBase, item_count, end_date)

if len(size_DF) == 0:
    print('No record present in the truncated list')
else:
    size_RDD.foreach(insert_first)
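One caveat of my own, not from the answer: insert_first takes an iterator argument, which is the shape RDD.foreachPartition passes (the whole partition's iterator), whereas RDD.foreach passes one row at a time. A per-row helper closer to the original for loop would look like this, under the answer's own assumption that full_item (and its sc argument) can be called from the worker function:

def insert_row(row):
    # Mirror the original loop: the first field of the row is the length.
    newObject.full_item(sc, dataBase, row[0], end_date)

if len(size_DF) == 0:
    print('No record present in the truncated list')
else:
    size_RDD.foreach(insert_row)        # one call per row
    # or, to hand insert_first a whole partition iterator instead:
    # size_RDD.foreachPartition(insert_first)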

Creating Generator object from record list within a function

I'm trying to create a generator object for a list of records with data from a MySQL database, so I'm passing the MySQL cursor object to the function as a parameter.
My issue here is that if the "if block" containing yield records is commented out, the cust_records function works perfectly fine, but if I uncomment that line the function does not work.
I'm not sure whether this is the right way to yield a list object in Python 3.
My code so far:
def cust_records(result_set):
    block_id = None
    records = []
    i = 0
    for row in result_set:
        records.append((row['id'], row, smaller_ids))
    if records:
        yield records
The point of generators is lazy evaluation, so storing all records in a list and yielding the list makes no sense at all. If you want to retain lazy evaluation (which is IMHO preferable, especially if you have to work on arbitrary datasets that might get huge), you want to yield each record, i.e.:
def cust_records(result_set):
    for row in result_set:
        yield (row['id'], row, smaller_ids)

# then
def example():
    cursor.execute(<your_sql_query_here>)
    for record in cust_records(cursor):
        print(record)
Otherwise (if you really want to consume as much memory as possible), just make cust_records a plain function:
def cust_records(result_set):
    records = []
    for row in result_set:
        records.append((row['id'], row, smaller_ids))
    return records
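If the goal really is to hand out records in batches (as the records list in the question suggests) while keeping lazy evaluation, a middle-ground sketch could yield fixed-size chunks, keeping the same (row['id'], row, smaller_ids) tuple shape from the question:

def cust_record_chunks(result_set, chunk_size=100):
    chunk = []
    for row in result_set:
        chunk.append((row['id'], row, smaller_ids))
        if len(chunk) == chunk_size:
            yield chunk      # hand out one full batch at a time
            chunk = []
    if chunk:                # final, possibly shorter, batch
        yield chunk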
