Python multiprocessing on same method that deals with data - python

I have a Django application where I am trying to clean up the data for one of my models, since the number of rows in its table has gotten quite unruly.
I have a model staticmethod that gets all the 'Vehicle' objects in my database, goes through them, checks if their image URL is still active, and if not deletes it.
There are 1.2 million records (and some records have more than one image to check), so it's going to take a pretty long time to go through all of them.
I know that you can use threading to run work in parallel, but I also know that for a method that deals with data, each thread has to be aware of the other threads. Is there a way to use multithreading to cut down the time it takes to go through a queryset and make those checks? For example, if thread 1 looks at queryset items 1 and 2, thread 2 would start on item 3, and then thread 1 would skip to item 4 if it is done and thread 2 hasn't finished with item 3.
Method
@staticmethod
def image_existence_check():
    import threading
    import requests
    from dealer.models import Dealer
    vehicles = Vehicle.objects.all()
    for index, veh in enumerate(vehicles):
        images = veh.images.all()
        if images.count() == 1:
            image = images[0]
            response = requests.get(image.image_url)
            if response.status_code == 200:
                veh.has_image = True
            else:
                veh.has_image = False
        elif images.count() > 1:
            has_image = True
            for img in images:
                response = requests.get(img.image_url)
                # Compare the status code, not the response object itself
                if response.status_code != 200:
                    has_image = False
            veh.has_image = has_image
        else:
            veh.has_image = False
        veh.save()
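
The work split described in the question (whichever thread is free takes the next item) is what a thread pool does on its own. Below is a minimal sketch using the standard library's concurrent.futures, not a tested drop-in for the method above. The names `all_urls_alive` and `check_vehicles`, the `{vehicle_id: [urls]}` input shape, and the injected `is_alive` callable are all assumptions made for illustration; the sketch keeps the HTTP check separate so the database writes can stay in the main thread.

```python
from concurrent.futures import ThreadPoolExecutor


def all_urls_alive(urls, is_alive):
    """Return True only if `urls` is non-empty and every URL passes `is_alive`."""
    return bool(urls) and all(is_alive(url) for url in urls)


def check_vehicles(vehicle_urls, is_alive, max_workers=16):
    """Map each vehicle id to its has_image flag, checking vehicles concurrently.

    `vehicle_urls` is a dict {vehicle_id: [image_url, ...]} (a hypothetical
    shape; in Django it could be built from the queryset beforehand).
    """
    ids = list(vehicle_urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each idle worker pulls the next pending id, so no thread needs
        # to know what the other threads are working on.
        flags = pool.map(lambda vid: all_urls_alive(vehicle_urls[vid], is_alive), ids)
        return dict(zip(ids, flags))


# In real code `is_alive` could be something like:
#   lambda url: requests.get(url, timeout=10).status_code == 200
```

Threads fit here because the time is spent waiting on the network rather than on the CPU. The returned dict can then be applied to the Vehicle rows in one place, which avoids sharing model instances between threads.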

Related

How to use other def values?

I want to use values from one function inside another.
For example, I added a 'pt' parameter to the 'clean_beds_process' definition and passed 'Patients' in the 'run' definition.
I want to access patient information when the 'clean_beds_process' function is called.
However, this raises the error 'AttributeError: type object 'Patients' has no attribute 'id''.
I don't know why this happens.
Maybe I have misunderstood how simpy works.
Please let me know how I can use patient information when the 'clean_beds_process' function is called.
Thank you.
import simpy
import random

class Patients:
    def __init__(self, p_id):
        self.id = p_id
        self.bed_name = ""
        self.admission_decision = ""

    def admin_decision(self):
        admin_decision_prob = random.uniform(0, 1)
        if admin_decision_prob <= 0.7:
            self.admission_decision = "DIS"
        else:
            self.admission_decision = "IU"
        return self.admission_decision

class Model:
    def __init__(self, run_number):
        self.env = simpy.Environment()
        self.pt_ed_q = simpy.Store(self.env )
        self.pt_counter = 0
        self.tg = simpy.Resource(self.env, capacity = 4)
        self.physician = simpy.Resource(self.env, capacity = 4)
        self.bed_clean = simpy.Store(self.env)
        self.bed_dirty = simpy.Store(self.env)
        self.IU_bed = simpy.Resource(self.env, capacity = 50)

    def generate_beds(self):
        for i in range(77):
            yield self.env.timeout(0)
            yield self.bed_clean.put(f'bed{i}')

    def generate_pt_arrivals(self):
        while True:
            self.pt_counter += 1
            pt = Patients(self.pt_counter)
            yield self.env.timeout(5)
            self.env.process(self.process(pt))

    def clean_beds_process(self, cleaner_id, pt):
        while True:
            print(pt.id)
            bed = yield self.bed_dirty.get()
            yield self.env.timeout(50)
            yield self.bed_clean.put(bed)

    def process(self, pt):
        with self.tg.request() as req:
            yield req
            yield self.env.timeout(10)
        bed = yield self.bed_clean.get()
        pt.bed_name = bed
        pt.admin_decision()
        if pt.admission_decision == "DIS":
            with self.IU_bed.request() as req:
                dirty_bed_name = pt.bed_name
                yield self.bed_dirty.put(dirty_bed_name)
                yield self.env.timeout(600)
        else:
            dirty_bed_name = pt.bed_name
            yield self.bed_dirty.put(dirty_bed_name)

    def run(self):
        self.env.process(self.generate_pt_arrivals())
        self.env.process(self.generate_beds())
        for i in range(2):
            self.env.process(self.clean_beds_process(i+1, Patients))
        self.env.run(until = 650)

run_model = Model(0)
run_model.run()
So if a patient can use either a clean bed or a dirty bed, then the patient needs to make two requests (one for each type of bed) and use env.any_of to wait for the first request to fire. You also need to deal with the case where both events fire at the same time. Don't forget to cancel the request you do not use. If the request that fires is for a clean bed, things stay mostly the same. But if the request is for a dirty bed, then you need to add a step to clean the bed. For this I would make the cleaners Resources instead of processes. So the patient would request a cleaner, do a timeout for the cleaning time, and release the cleaner.
To collect patient data I would create a log with the patient id, key event, and time, and crunch it post sim to get the stats I need. To process the log I often create a dataframe that filters the log for the first event and a second dataframe that filters for the second event, then join the two dataframes on patient id. Now both events for a patient are on one row, so I can get the delta. Once I have the delta I can do a sum and count. For example, if my two events are when a patient arrives and when a patient gets a bed, I get the sum of deltas, divide by the count, and I have the average time to bed.
If you remember, one of the first answers I gave you a while ago had an example of getting the first available bed from two different queues.
I do not have a lot of time right now, but I hope this dissertation helps a bit.
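
The log crunching described above can be sketched without dataframes. This is only an illustration of the idea (pair two events per patient, take the delta, then sum and count), using plain dicts in place of the dataframe join. The function name `average_delta`, the event names "arrive" and "bed", and the sample log are all made up for the example.

```python
from collections import defaultdict


def average_delta(log, first_event, second_event):
    """Average time between two events per patient, or None if no patient has both.

    `log` is an iterable of (patient_id, event, time) tuples.
    """
    times = defaultdict(dict)
    for patient_id, event, time in log:
        # Keep the first occurrence of each event for each patient.
        times[patient_id].setdefault(event, time)
    # The "join on patient id": only patients that have both events contribute.
    deltas = [
        events[second_event] - events[first_event]
        for events in times.values()
        if first_event in events and second_event in events
    ]
    return sum(deltas) / len(deltas) if deltas else None


log = [
    (1, "arrive", 0), (2, "arrive", 5), (1, "bed", 10),
    (2, "bed", 25), (3, "arrive", 30),  # patient 3 never got a bed
]
print(average_delta(log, "arrive", "bed"))  # 15.0
```

Inside the simulation, each process would append a tuple such as (pt.id, "arrive", env.now) to a shared list at the key points, and the function would run once after env.run returns.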

multiprocessing a function with parameters that are iterated through

I'm trying to improve the speed of my program and I decided to use multiprocessing!
The problem is that I can't seem to find any way to use the Pool function (I think this is what I need) with my function.
Here is the code that I am dealing with:
import json
import os
import requests

def dataLoading(output):
    name = ""
    link = ""
    upCheck = ""
    isSuccess = ""
    for i in os.listdir():
        with open(i) as currentFile:
            data = json.loads(currentFile.read())
        try:
            name = data["name"]
            link = data["link"]
            upCheck = data["upCheck"]
            isSuccess = data["isSuccess"]
        except:
            print("error in loading data from config: improper naming or formating used")
        output[name] = [link, upCheck, isSuccess]

#working
def userCheck(link, user, isSuccess):
    link = link.replace("<USERNAME>", user)
    isSuccess = isSuccess.replace("<USERNAME>", user)
    # headers is defined elsewhere in the program
    html = requests.get(link, headers=headers)
    page_source = html.text
    count = page_source.count(isSuccess)
    if count > 0:
        return True
    else:
        return False
I have a parent function to run these two together, but I don't think I need to show the whole thing, just the part that gets the data iteratively:
for i in configData:
    data = configData[i]
    link = data[0]
    print(link)
    upCheck = data[1] #just for future use
    isSuccess = data[2]
    if userCheck(link, username, isSuccess) == True:
        good.append(i)
You can see how I enter all of the data there. How would I be able to use multiprocessing to do this when I am iterating through the dictionary to collect multiple parameters?
I like to use mp.Pool().map. I think it is the easiest and most straightforward option, and it handles most multiprocessing cases. So how does map work? For starters, keep in mind that mp creates workers, each worker receives a copy of the namespace (yes, the whole thing), and then each worker works on what it is assigned and returns. Hence, something like "updating a global variable" while they work doesn't work, since each worker receives its own copy of the global variable and none of the workers communicate. (If you want communicating workers you need to use mp.Queue and the like, and it gets complicated.) Anyway, here is how to use map:
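from multiprocessing import Pool
t = 'abcd'
def func(s):
    return t[int(s)]
if __name__ == '__main__':  # needed where workers are spawned (e.g. Windows)
    results = Pool().map(func, range(4))  # ['a', 'b', 'c', 'd']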
Each worker received a copy of t, func, and the portion of range(4) they were assigned. They are then automatically tracked and everything is cleaned up in the end by Pool.
Something like your dataLoading won't work very well, we need to modify it. I also cleaned the code a little.
def loadfromfile(file):
    # Close the file promptly instead of leaving the handle open
    with open(file) as f:
        data = json.load(f)
    items = [data.get(k, "") for k in ['name', 'link', 'upCheck', 'isSuccess']]
    return items[0], items[1:]
output = dict(Pool().map(loadfromfile, os.listdir()))
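
For the part of the question about passing several parameters per item, one option is starmap, which unpacks a tuple of arguments for each call. The sketch below is an assumption-laden illustration rather than the answer's own code: it uses multiprocessing.pool.ThreadPool (same interface as Pool, chosen here because the check is network-bound), and `user_check`, `find_good`, and the injected `fetch` callable are hypothetical names standing in for userCheck and requests.get.

```python
from multiprocessing.pool import ThreadPool


def user_check(link, user, is_success, fetch):
    """Return True if the page for `user` contains the expected marker text."""
    page_source = fetch(link.replace("<USERNAME>", user))
    return is_success.replace("<USERNAME>", user) in page_source


def find_good(config_data, username, fetch, workers=8):
    """Run user_check for every config entry in parallel; return matching names.

    `config_data` maps name -> [link, upCheck, isSuccess], as in the question.
    """
    names = list(config_data)
    # One argument tuple per entry; starmap unpacks each tuple into the call.
    args = [(config_data[n][0], username, config_data[n][2], fetch) for n in names]
    with ThreadPool(workers) as pool:
        results = pool.starmap(user_check, args)
    # starmap keeps input order, so results line up with names.
    return [name for name, ok in zip(names, results) if ok]
```

With a process-based Pool the call shape is the same, but every argument (including `fetch`) must be picklable, so it would need to be a module-level function rather than a lambda.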

How to update a Wagtail Page

I am using a wagtail_hook to update a Page object and am running into trouble. More specifically, when an end-user hits the "Save draft" button from the browser I want the following code to fire. The purpose of this code is to change the knowledge_base_id dynamically, based on the results of the conditional statements listed below.
def sync_knowledge_base_page_with_zendesk2(request, page):
    if isinstance(page, ContentPage):
        page_instance = ContentPage.objects.get(pk=page.pk)
        pageJsonHolder = page_instance.get_latest_revision().content_json
        content = json.loads(pageJsonHolder)
        print("content at the start = ")
        print(content['knowledge_base_id'])
        kb_id_revision = content['knowledge_base_id']
        kb_active_revision = content['kb_active']
        if kb_active_revision == "T":
            if kb_id_revision == 0:
                print("create the article")
                content['knowledge_base_id'] = 456
                #page_instance.knowledge_base_id = 456 # change this API call
            else:
                print("update the article")
        elif kb_id_revision != 0:
            print("delete the article")
            content['knowledge_base_id'] = 0
            #page_instance.knowledge_base_id = 0
        print("content at the end = ")
        print(content['knowledge_base_id'])
        #page_instance.save_revision().publish
So when the hook code fires, it updates the draft with all the info EXCEPT the knowledge_base_id.
However when I change the knowledge_base_id like this (seen commented out above)
page_instance.knowledge_base_id = 0
And save it like this (also seen commented out above)
page_instance.save_revision().publish()
It saves the updated knowledge_base_id BUT skips over the other revisions. In short, what the heck am I doing wrong? Thanks in advance for the assist. Take care and have a good day.
So I figured out the problem. Instead of trying to use the Page method save_revision(), I opted to use revisions.create(). Inside revisions.create(), you pass a JSON string with your updated values. In addition to that, you pass an instance of the user, and values for submitted_for_moderation and approved_go_live_at. Listed below is my updated code sample, with comments. Please let me know if you have any questions for me. I hope this post helps others avoid frustrations with updating revisions. Thanks for reading. Take care and have a good day.
from wagtail.wagtailcore import hooks
from .models import ContentPage
import json

# Allows the latest page revision JSON to be updated based on conditionals
def sync_kb_page_with_zendesk(request, page):
    # Sanity check to ensure page is an instance of ContentPage
    if isinstance(page, ContentPage):
        # this sets the user variable
        user_var = request.user
        # sets the Content Page
        page_instance = ContentPage.objects.get(pk=page.pk)
        # this retrieves JSON str w/ latest revisions
        pageJsonHolder = page_instance.get_latest_revision().content_json
        # this takes the JSON string and converts it into a Python dict
        content = json.loads(pageJsonHolder)
        # this sets the kb id variable for use in the code
        kb_id_revision = content['knowledge_base_id']
        # this sets the kb active variable for use in the code
        kb_active_revision = content['kb_active']
        # this is the conditional logic
        if kb_active_revision == "T":
            if kb_id_revision == 0:
                print("create the article")
                # updates the kb id value in the dict
                content['knowledge_base_id'] = 456
            else:
                print("update the article")
        elif kb_id_revision != 0:
            print("delete the article")
            # updates the kb id value in the dict
            content['knowledge_base_id'] = 0
        # this takes the dict and converts it back to a JSON str
        revisionPageJsonHolder = json.dumps(content)
        # this takes your JSON str and creates the latest revision of Page
        revision = page_instance.revisions.create(
            content_json=revisionPageJsonHolder,
            user=user_var,
            submitted_for_moderation=False,
            approved_go_live_at=None,
        )

# registers the function to fire after page edit
hooks.register('after_edit_page', sync_kb_page_with_zendesk)
# registers the function to fire after page creation
hooks.register('after_create_page', sync_kb_page_with_zendesk)
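
The create/update/delete decision in the hook can also be pulled into a plain function, which makes it possible to check the branches without a Wagtail page or a database. This is only a sketch: `plan_kb_sync` and its return values are hypothetical names, and `new_article_id` stands in for whatever id the external API call would return (456 is the placeholder from the code above).

```python
import json


def plan_kb_sync(kb_active, kb_id, new_article_id=456):
    """Return (action, knowledge_base_id) for a page revision.

    `new_article_id` is a stand-in for the id an external API would
    return when an article is created.
    """
    if kb_active == "T":
        if kb_id == 0:
            return "create", new_article_id
        return "update", kb_id
    if kb_id != 0:
        return "delete", 0
    return "none", kb_id


def updated_content_json(content_json, new_article_id=456):
    """Apply plan_kb_sync to a revision's JSON string and return the new JSON string."""
    content = json.loads(content_json)
    _, content["knowledge_base_id"] = plan_kb_sync(
        content["kb_active"], content["knowledge_base_id"], new_article_id
    )
    return json.dumps(content)
```

The string returned by `updated_content_json` is the kind of value the hook passes as content_json when it creates the new revision.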

Django database query round trips

I have the query below. As you can see, in my loop I create each message one at a time. I want to reduce the total number of round trips I have to make to the DB. Is there a way I can process the message creation in batches of, say, 20 at a time? Will this help with speed? Any suggestions welcome.
class ProcessRequests(Task):
    """
    Celery Task to start request to process that are not scheduled.
    """
    name = "Request to Process"
    max_retries = 1
    default_retry_delay = 3

    def run(self, batch):
        # Only run this task on non-scheduled tasks
        if batch.status != "Scheduled":
            q = Contact.objects.filter(contact_owner=batch.user, subscribed=True)
            if batch.group == None:
                q = q.filter(id=batch.contact_id)
            else:
                q = q.filter(group=batch.group)
            for e in q:
                msg = Message.objects.create(
                    recipient_number=e.mobile,
                    content=batch.content,
                    sender=e.contact_owner,
                    billee=batch.user,
                    sender_name=batch.sender_name
                )
                gateway = Gateway.objects.get(pk=2)
                msg.send(gateway)
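You can use bulk_create, which inserts a whole list of objects in generally one query and accepts a batch_size argument if you want to cap how many go into each insert.
Also note you're getting the same gateway object each time through the loop. It would be better to get it once outside of the loop and use the same one each time.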
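
The batching pattern itself can be sketched without Django. In the sketch below, `batched` and `create_and_send` are hypothetical helpers, and the `build`, `bulk_create`, and `send` callables are stand-ins: in the task above they would roughly correspond to constructing an unsaved Message, calling Message.objects.bulk_create, and calling msg.send(gateway) with a gateway fetched once before the loop. Treat it as an outline of the control flow under those assumptions, not a tested replacement for the task.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def create_and_send(contacts, build, bulk_create, send, size=20):
    """Create messages in batches of `size`, then send each one.

    Returns the number of messages created.
    """
    created_total = 0
    for chunk in batched(contacts, size):
        # One insert per chunk instead of one insert per contact.
        messages = bulk_create([build(contact) for contact in chunk])
        for msg in messages:
            send(msg)
        created_total += len(messages)
    return created_total
```

One thing to keep in mind with Django's bulk_create is that it does not call each model's save() method and does not send the pre_save/post_save signals, so any logic living there will not run for these messages.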

Making a python program wait until Twisted deferred returns a value

I have a program that fetches info from other pages and parses it using BeautifulSoup and Twisted's getPage. Later on in the program I print info that the deferred process creates. Currently my program tries to print it before the deferred returns the info. How can I make it wait?
def twisAmaz(contents): #This parses the page (amazon api xml file)
    stonesoup = BeautifulStoneSoup(contents)
    if stonesoup.find("mediumimage") == None:
        imageurl.append("/images/notfound.png")
    else:
        imageurl.append(stonesoup.find("mediumimage").url.contents[0])
    usedPdata = stonesoup.find("lowestusedprice")
    newPdata = stonesoup.find("lowestnewprice")
    titledata = stonesoup.find("title")
    reviewdata = stonesoup.find("editorialreview")
    if stonesoup.find("asin") != None:
        asin.append(stonesoup.find("asin").contents[0])
    else:
        asin.append("None")
    reactor.stop()

deferred = dict()
for tmpISBN in isbn: #Go through ISBN numbers and get Amazon API information for each
    deferred[(tmpISBN)] = getPage(fetchInfo(tmpISBN))
    deferred[(tmpISBN)].addCallback(twisAmaz)
reactor.run()
.....print info on each ISBN
It seems like you're trying to make and run multiple reactors. Everything gets attached to the same reactor. Here's how to use a DeferredList to wait for all of your callbacks to finish.
Also note that twisAmazon returns a value. That value is passed through the DeferredList (named callbacks below) and comes out as value in printResult. Since a DeferredList keeps the order of the things that are put into it, you can cross-reference the index of the results with the index of your ISBNs.
from twisted.internet import defer, reactor

def twisAmazon(contents):
    stonesoup = BeautifulStoneSoup(contents)
    ret = {}
    if stonesoup.find("mediumimage") is None:
        ret['imageurl'] = "/images/notfound.png"
    else:
        ret['imageurl'] = stonesoup.find("mediumimage").url.contents[0]
    ret['usedPdata'] = stonesoup.find("lowestusedprice")
    ret['newPdata'] = stonesoup.find("lowestnewprice")
    ret['titledata'] = stonesoup.find("title")
    ret['reviewdata'] = stonesoup.find("editorialreview")
    if stonesoup.find("asin") is not None:
        ret['asin'] = stonesoup.find("asin").contents[0]
    else:
        ret['asin'] = 'None'
    return ret

callbacks = []
for tmpISBN in isbn: #Go through ISBN numbers and get Amazon API information for each
    callbacks.append(getPage(fetchInfo(tmpISBN)).addCallback(twisAmazon))

def printResult(result):
    for e, (success, value) in enumerate(result):
        print ('[%r]:' % isbn[e]),
        if success:
            print 'Success:', value
        else:
            print 'Failure:', value.getErrorMessage()

callbacks = defer.DeferredList(callbacks)
callbacks.addCallback(printResult)
reactor.run()
Another cool way to do this is with @defer.inlineCallbacks. It lets you write asynchronous code like a regular sequential function: http://twistedmatrix.com/documents/8.1.0/api/twisted.internet.defer.html#inlineCallbacks
First, you shouldn't put a reactor.stop() in your deferred method, as it kills everything.
Now, in Twisted, "waiting" is not allowed. To print the results of your callback, just add another callback after the first one.
