How to Remove All Session Objects after H2O AutoML? - python

I am trying to create an ML application in which a front end takes user information and data, cleans it, and passes it to h2o AutoML for modeling, then recovers and visualizes the results. Since the back end will be a stand-alone / always-on service that gets called many times, I want to ensure that all objects created in each session are removed, so that h2o doesn't get cluttered and run out of resources. The problem is that many objects are being created, and I am unsure how to identify/track them, so that I can remove them before disconnecting each session.
Note that I would like the ability to run more than one analysis concurrently, which means I cannot just call remove_all(), since this may remove objects still needed by another session. Instead, it seems I need a list of session objects, which I can pass to the remove() method. Does anyone know how to generate this list?
Here's a simple example:
import h2o
from h2o.automl import H2OAutoML
import pandas as pd

h2o.init()
df = pd.read_csv(r"C:\iris.csv")  # raw string so the backslash is not treated as an escape
my_frame = h2o.H2OFrame(df, destination_frame="my_frame")
aml = H2OAutoML(max_runtime_secs=100)
aml.train(y='class', training_frame=my_frame)
Looking in the Flow UI shows that this simple example generated 5 new frames, and 74 models. Is there a session ID tag or something similar that I can use to identify these separately from any objects created in another session, so I can remove them?

The recommended way to clean up only your own work is h2o.remove(aml).
This deletes the AutoML instance on the backend and cascades to all of its submodels and attached objects such as metrics.
It won't delete the frames that you provided, though (e.g. training_frame).
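For the example above, a full per-session cleanup might look like the sketch below; h2o.remove accepts an object or its key string, and aml and my_frame are the names from the question's example:

h2o.remove(aml)       # removes the AutoML run, its models, and attached metrics
h2o.remove(my_frame)  # input frames are not cascaded, so remove them explicitly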

You can use h2o.ls() to list the H2O objects. Then you can use h2o.remove('YOUR_key') to remove ones you don't want to keep.
For example:
# List all objects currently in the H2O cluster
h_objects = h2o.ls()
# Filter for the keys of one AutoML run
filtered_objects = h_objects[h_objects['key'].str.contains('AutoML_YYYYMMDD_xxxxxx')]
for key in filtered_objects['key']:
    h2o.remove(key)
Alternatively, you can remove all AutoML objects using the filter below instead.
filtered_objects = h_objects[h_objects['key'].str.lower().str.contains('automl')]
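If you would rather not depend on key naming conventions, another option is to snapshot the keys before a session and remove only the new ones afterwards. This is a sketch, not an official session API, and note the caveat: with truly concurrent sessions the diff can pick up keys created by another session, so the prefix filter above is safer in that case.

keys_before = set(h2o.ls()['key'])

# ... run the AutoML session here ...

keys_after = set(h2o.ls()['key'])
for key in keys_after - keys_before:
    h2o.remove(key)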

Related

More efficient way to update multiple model objects each with unique values

I am looking for a more efficient way to update a bunch of model objects. Every night I have background jobs creating 'NCAABGame' objects from an API once the scores are final.
In the morning I have to update all the fields in the model with the stats that the API did not provide.
As of right now I get the stats formatted from an Excel file, and I copy and paste each update and run it like this:
NCAABGame.objects.filter(
    name__name='San Francisco', updated=False,
).update(
    field_goals=38,
    field_goal_attempts=55,
    three_points=11,
    three_point_attempts=24,
    # ...
)
The other day there were 183 games, and most days there are between 20 and 30, so doing it this way is very time-consuming. I've looked into bulk_update and a few other things, but I can't really find a solution. I'm sure there is something simple that I'm just not seeing.
I appreciate any ideas or solutions you can offer.
If you need to manually update each object that the API creates anyway, I would not bother going through Django at all. Load your games from the API directly into Excel, make your edits there, and save the result as a CSV file. Then import the CSV directly into the database table, unless there is a specific reason the objects must be created via Django. If there is, you can do it with something like the snippet below, which could also be adapted to your current update-based method, although then you would first need to retrieve the correct pk of each object you want to update.
import csv

with open("my_data.csv", 'r') as my_data_file:
    reader = csv.reader(my_data_file)
    for row in reader:
        # get_or_create returns a tuple. 'created' is a boolean that indicates
        # whether a new object was created, and 'game' holds the object that
        # was either retrieved or created
        game, created = NCAABGame.objects.get_or_create(
            name=row[0],
            field_goals=row[1],
            field_goal_attempts=row[2],
            # ...
        )
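For the update-based workflow from the question, here is a hedged sketch along the same lines; the file name and column layout are illustrative, and it assumes column 0 holds the team name used in the question's name__name filter and that this identifies a single unfinished game:

import csv

with open("my_stats.csv", 'r') as my_data_file:
    reader = csv.reader(my_data_file)
    for row in reader:
        # Mirrors the question's filter; marks the game as updated afterwards
        NCAABGame.objects.filter(name__name=row[0], updated=False).update(
            field_goals=row[1],
            field_goal_attempts=row[2],
            updated=True,
        )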

Is it a good idea to store copies of documents from a mongodb collection in a dictionary list, and use this data instead of querying the database?

I am currently developing a Python Discord bot that uses a Mongo database to store user data.
As this data is continually changed, the database would be subjected to a massive number of queries to both extract and update the data; so I'm trying to find ways to minimize client-server communication and reduce bot response times.
In this sense, is it a good idea to create a copy of a Mongo collection as a dictionary list as soon as the script is run, and manipulate the data offline instead of continually querying the database?
In particular, every time data would be searched with the collection.find() method, it is instead extracted from the list. On the other hand, every time a document needs to be updated with collection.update(), both the list and the database are updated.
I'll give an example to better explain what I'm trying to do. Let's say that my collection contains documents with the following structure:
{"user_id": id_of_the_user, "experience": current_amount_of_experience}
and the experience value must be continually increased.
Here's how I'm implementing it at the moment:
online_collection = db["collection_name"]  # pymongo Collection object
offline_collection = list(online_collection.find())  # in-memory copy of the documents

def updateExperience(user_id):
    online_collection.update_one({"user_id": user_id}, {"$inc": {"experience": 1}})
    mydocument = next(document for document in offline_collection if document["user_id"] == user_id)
    mydocument["experience"] += 1

def findExperience(user_id):
    mydocument = next(document for document in offline_collection if document["user_id"] == user_id)
    return mydocument["experience"]
As you can see, the database is involved only for the update function.
Is this a valid approach?
For very large collections (millions of documents), does the next() function keep the same execution time, or would there still be some slowdown?
Also, while not explicitly asked in the question, I'd be more than happy to get any advice on how to improve the performance of a Discord bot, as long as it doesn't involve using a VPS or sharding, since I'm already using those options.
I don't really see why not, as long as you're aware of the following:
You will need the system resources to load the entire database into memory.
It is your responsibility to keep the actual db and your local store in sync.
You need to be the only person/system updating the database.
Eventually this pattern will fail, e.g. the db gets too large or more than one process needs to write, so it isn't future-proof.
In essence you're talking about a caching solution, so there's no need to reinvent the wheel: there are many existing products/solutions you could use.
It's probably not the traditional way of doing things, but if it works then why not.
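If you do keep the hand-rolled cache, one easy improvement over the list-plus-next() scan in the question is to key the offline copy by user_id, so lookups are O(1) dict hits instead of a linear scan. This directly addresses the millions-of-documents concern; a sketch based on the question's code:

online_collection = db["collection_name"]

# Index the offline copy by user_id for constant-time lookups
offline_by_id = {doc["user_id"]: doc for doc in online_collection.find()}

def updateExperience(user_id):
    online_collection.update_one({"user_id": user_id}, {"$inc": {"experience": 1}})
    offline_by_id[user_id]["experience"] += 1

def findExperience(user_id):
    return offline_by_id[user_id]["experience"]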

How to load kedro DataSet object dynamically

I am currently using the yaml api to create all of my datasets with kedro==0.15.5. I would like to be able to peer into this information from time to time dynamically. It appears that I can get to it through io.datasets, which is a _FrozenDatasets object, but I cannot loop over it or access it programmatically.
Specific Use Case
Specifically, I would like to add a test that loops over the datasets to check that there are not multiple catalog entries using the same filepath. Is this possible without using eval? Currently I think I would need to do something like this:
filepaths = {}
for entry_name in io.list():
    filepaths[entry_name] = eval(f'io.datasets.{entry_name}').filepath
Unfortunately, I don't think AbstractDataSet (from which they are all derived) has a public property for the filepath or the config that built it. You can read the ProjectContext config, but that won't cover datasets that were dynamically built.
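That said, you can at least avoid eval with plain getattr. Whether a filepath is recoverable still depends on the dataset type; the _filepath attribute below is a private implementation detail, an assumption rather than a public API:

from collections import Counter

filepaths = {}
for entry_name in io.list():
    dataset = getattr(io.datasets, entry_name, None)
    filepath = getattr(dataset, '_filepath', None)  # private attribute; may be absent
    if filepath is not None:
        filepaths[entry_name] = str(filepath)

# Flag catalog entries that point at the same file
duplicates = [fp for fp, count in Counter(filepaths.values()).items() if count > 1]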

Iterating over controls in wxPython in order to save session data

I have a GUI written in wxPython (with Boa Constructor).
I would like to save a user's session to a file, to be loaded the next time the application starts.
I would like to avoid saving each value 'by hand' by iterating over the controls and saving their values to a dictionary.
Is there a way to get a hold of all the wxIDs used in the application, and their corresponding widgets?
You don't need the IDs at all, just start from the top level window and recursively enumerate all the children using wxWindow::GetChildren() method. Then, for each child, you will need to dynamically determine its type (this is simpler if you only use controls of a few types) and save its value. You may also find it useful to specify the names (not labels) for your controls when creating them to have a more convenient unique identifier for each of them than a numeric ID.
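A minimal sketch of that recursive walk, assuming the controls were given unique names at creation time; only a few common control types are handled here:

import wx

def save_control_values(window, values=None):
    # Recursively walk the window hierarchy and record values keyed by control name
    if values is None:
        values = {}
    for child in window.GetChildren():
        if isinstance(child, (wx.TextCtrl, wx.CheckBox)):
            values[child.GetName()] = child.GetValue()
        elif isinstance(child, wx.Choice):
            values[child.GetName()] = child.GetSelection()
        save_control_values(child, values)  # descend into nested panels
    return values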
IMHO you are going about this wrong. The state of a user's session is best not stored in the values of the controls. The state should be stored in a 'model'. The 'view' should query the model when it needs to display the state of the model, and when it wants to save that state to a file. See http://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller.
This makes lots of things easier, even trivial, including your problem.
I would look at the PersistenceManager mechanism in wx.lib.agw. Here are the original docs for it: http://xoomer.virgilio.it/infinity77/AGW_Docs/persist.persistencemanager.PersistenceManager.html
And here are the newer docs:
https://docs.wxpython.org/wx.lib.agw.persist.persistencemanager.PersistenceManager.html#wx.lib.agw.persist.persistencemanager.PersistenceManager
Alternatively, you can probably use the frame or panel's GetChildren() method to grab all the widgets and pull the values from them, but I think the PersistenceManager would make more sense.
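A hedged sketch of the PersistenceManager route; the frame class, file name, and control names are made up, and controls must be given unique names for the manager to track them:

import wx
import wx.lib.agw.persist as PM

class MyFrame(wx.Frame):
    def __init__(self):
        super().__init__(None, title="Session demo", name="main_frame")
        wx.TextCtrl(self, name="notes_field")  # unique name, so it can be persisted

        self._pm = PM.PersistenceManager.Get()
        self._pm.SetPersistenceFile("session.ini")
        # Register every named child and restore its saved value once the
        # frame is fully constructed
        wx.CallAfter(self._pm.RegisterAndRestoreAll, self)
        self.Bind(wx.EVT_CLOSE, self._on_close)

    def _on_close(self, event):
        self._pm.SaveAndUnregister()  # write current values back to the file
        event.Skip()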

SQLAlchemy Event interface

I'm using SQLAlchemy 0.7. I would like some 'post-processing' to occur after a session.flush(), namely, I need to access the instances involved in the flush() and iterate through them. The flush() call will update the database, but the instances involved also store some data in an LDAP database, I would like SQLAlchemy to trigger an update to that LDAP database by calling an instance method.
I figured I'd be using the after_flush(session, flush_context) event, detailed here, but how do I get a list of update()'d instances?
On a side note, how can I determine which columns have changed (or are 'dirty') on an instance. I've been able to find out if an instance as a whole is dirty, but not individual properties.
According to the link you provided:
Note that the session’s state is still in pre-flush, i.e. ‘new’, ‘dirty’, and ‘deleted’ lists still show pre-flush state as well as the history settings on instance attributes.
This means that you should be able to access all the dirty objects through the session.dirty list. You'll note that the first parameter of the event callback is the current session object.
As for the second part, you can use the sqlalchemy.orm.attributes.get_history function to figure out which columns have been changed. It returns a History object for a given attribute which contains a has_changes() method.
If you're trying to listen for changes on specific class attributes, consider using Attribute Events instead.
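Putting both parts together, a minimal sketch; the "email" attribute and the update_ldap method are hypothetical stand-ins for your mapped attribute and LDAP-sync logic:

from sqlalchemy import event
from sqlalchemy.orm import Session
from sqlalchemy.orm.attributes import get_history

@event.listens_for(Session, "after_flush")
def sync_ldap(session, flush_context):
    for instance in session.dirty:
        # get_history reports per-attribute changes for this flush
        history = get_history(instance, "email")  # hypothetical attribute
        if history.has_changes():
            instance.update_ldap()  # hypothetical instance method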
