How to create a variable whose value persists across file reload? - python

Common Lisp has defvar, which
creates a global variable but only sets it if it is new: if it already
exists, it is not reset. This is useful when reloading a file into a long-running interactive process, because it keeps the data.
I want the same in Python.
I have file foo.py which contains something like this:
cache = {}

def expensive(x):
    try:
        return cache[x]
    except KeyError:
        # do a lot of work
        cache[x] = res
        return res
When I do imp.reload(foo), the value of cache is lost, which I want
to avoid.
How do I keep cache across reload?
PS. I guess I can follow How do I check if a variable exists? :
if 'cache' not in globals():
    cache = {}
but it does not look "Pythonic" for some reason...
If it is TRT, please tell me so!
Answering comments:
I am not interested in cross-invocation persistence; I am already handling that.
I am painfully aware that reloading changes class meta-objects and I am already handling that.
The values in cache are huge, I cannot go to disk every time I need them.

Here are a couple of options. One is to use a temporary file as persistent storage for your cache, and try to load it every time you load the module:
# foo.py
import tempfile
import pathlib
import pickle

_CACHE_TEMP_FILE_NAME = '__foo_cache__.pickle'
_CACHE = {}

def expensive(x):
    try:
        return _CACHE[x]
    except KeyError:
        # do a lot of work
        _CACHE[x] = res
        _save_cache()
        return res

def _save_cache():
    tmp = pathlib.Path(tempfile.gettempdir(), _CACHE_TEMP_FILE_NAME)
    with tmp.open('wb') as f:
        pickle.dump(_CACHE, f)

def _load_cache():
    global _CACHE
    tmp = pathlib.Path(tempfile.gettempdir(), _CACHE_TEMP_FILE_NAME)
    if not tmp.is_file():
        return
    try:
        with tmp.open('rb') as f:
            _CACHE = pickle.load(f)
    except pickle.UnpicklingError:
        pass

_load_cache()
The only issue with this is that you need to trust the environment not to write anything malicious in place of the temporary file (the pickle module is not secure against erroneous or maliciously constructed data).
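If that worries you, one possible mitigation (a sketch of my own, following the restricted-unpickler pattern from the pickle documentation, not part of the original answer) is to refuse to resolve arbitrary globals when loading the cache file:
import pickle

class _RestrictedUnpickler(pickle.Unpickler):
    # Refuse to resolve any global except a small allow-list of builtins.
    _ALLOWED = {'set', 'frozenset', 'range', 'complex'}

    def find_class(self, module, name):
        if module == 'builtins' and name in self._ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(
            "global '%s.%s' is forbidden" % (module, name))

def _restricted_load(f):
    # Drop-in replacement for pickle.load(f) inside _load_cache().
    return _RestrictedUnpickler(f).load()
Since _load_cache() already swallows pickle.UnpicklingError, a rejected file just means the cache starts out empty.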
Another option is to use another module for the cache, one that does not get reloaded:
# foo_cache.py
Cache = {}
And then:
# foo.py
import foo_cache

def expensive(x):
    try:
        return foo_cache.Cache[x]
    except KeyError:
        # do a lot of work
        foo_cache.Cache[x] = res
        return res

Since the whole point of a reload is to ensure that the executed module's code is run a second time, there is essentially no way to avoid some kind of "reload detection."
The code you use appears to be the best answer from those given in the question you reference.
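If you want the check to read a bit more like defvar, one possible sketch (my own, reusing the names from the question) is to wrap globals().setdefault, which only binds the name when it is not already there:
def defvar(name, value):
    # Bind a module-global only if it does not exist yet, like Common Lisp's defvar.
    return globals().setdefault(name, value)

cache = defvar('cache', {})  # on reload, the existing dict is returned unchanged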


How to save all python objects from working directory to a file [duplicate]

I need to find a way to save all objects (or at least dataframes) to one place, outside of the working directory. I assume Python keeps all objects in memory, not on disk, so I'm looking for a way of exporting all objects from the current session. It can be pickle; the format does not matter as long as it can be read into a different Python session.
If I understood your need correctly, you want to back up your session.
If that is the case, here is a solution using pickle.
The first solution is:
import pickle

def is_picklable(obj):
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True

bk = {}
for k in dir():
    obj = globals()[k]
    if is_picklable(obj):
        try:
            bk.update({k: obj})
        except TypeError:
            pass

# to save session
with open('./your_bk.pkl', 'wb') as f:
    pickle.dump(bk, f)

# to load your session
with open('./your_bk.pkl', 'rb') as f:
    bk_restore = pickle.load(f)
The second solution uses dill. You might get an error if there are some unpicklable objects in your workspace:
import dill

dill.dump_session('./your_bk_dill.pkl')

# to restore session:
dill.load_session('./your_bk_dill.pkl')
The third option is to go with the shelve package:
import shelve

bk = shelve.open('./your_bk_shelve.pkl', 'n')
for k in dir():
    try:
        bk[k] = globals()[k]
    except Exception:
        pass
bk.close()

# to restore
bk_restore = shelve.open('./your_bk_shelve.pkl')
for k in bk_restore:
    globals()[k] = bk_restore[k]
bk_restore.close()
Give these a try and let us know how it goes.
Credits: the second and third solutions are nearly a shameless copy/paste from the two links below. I adapted the error handling, since the original answers raise an error when trying to pickle modules.
dill solution
shelve solution
If you want to save a dataframe in another directory, try:
df.to_csv('foldername/filename.csv')

TypeError while using Pool from multiprocessing (python 3.7)

I'm trying to sum up the sizes of all files in a directory, including recursive subdirectories. The relevant function (self._count) works totally fine if I just call it once. But for large amounts of files I want to use multiprocessing to make the program faster. Here are the relevant parts of the code.
self._sum_dicts sums up the values of the same keys of the given dicts.
self._get_file_type returns the category (key for stats) the file shall be placed in.
self._categories holds a list of all possible categories.
number_of_threats specifies the number of workers that shall be used.
path holds the path to the directory mentioned in the first sentence.
import os
from multiprocessing import Pool

def _count(self, path):
    stats = dict.fromkeys(self._categories, 0)
    try:
        dir_list = os.listdir(path)
    except:
        # I do some warning here, but removed it for SSCCE
        return stats
    for element in dir_list:
        new_path = os.path.join(path, element)
        if os.path.isdir(new_path):
            add_stats = self._count(new_path)
            stats = self._sum_dicts([stats, add_stats])
        else:
            file_type = self._get_file_type(element)
            try:
                size = os.path.getsize(new_path)
            except Exception as e:
                # I do some warning here, but removed it for SSCCE
                continue
            stats[file_type] += size
    return stats

files = []
dirs = []
for e in dir_list:
    new_name = os.path.join(path, e)
    if os.path.isdir(new_name):
        dirs.append(new_name)
    else:
        files.append(new_name)

with Pool(processes=number_of_threats) as pool:
    res = pool.map(self._count, dirs)
self._stats = self._sum_dicts(res)
I know that this code won't consider files directly in path, but that is something I can easily add. When executing the code I get the following exception.
Exception has occurred: TypeError
cannot serialize '_io.TextIOWrapper' object
...
line ... in ...
res = pool.map(self._count, dirs)
I found out that this exception can occur when sharing resources between processes, which - as far as I can see - I only do with stats = dict.fromkeys(self._categories, 0). But replacing this line with hardcoded values doesn't fix the problem. Even placing a breakpoint at this line doesn't help me, because it isn't reached.
Does anybody have an idea what the reason for this problem is and how I can fix this?
The problem is that you try to transmit "self". If self holds a stream object (such as an open file), it can't be serialized.
Try to move the multiprocessed code outside the class.
Python multiprocessing launches a new interpreter, and if you try to access shared state that can't be pickled (or serialized) it fails. Usually it doesn't crash where you think it crashed, but when trying to receive the object.
I converted code using threads to multiprocessing and I had a lot of weird errors even though I didn't send or use those objects directly, because I used their parent (self).
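A minimal sketch of that suggestion (my own illustration, not the asker's code; the starting directory and worker count are assumptions): the per-directory work lives in a module-level function, so pickling it for the worker processes never drags self along:
import os
from multiprocessing import Pool

def count_dir(path):
    # Sum file sizes under path; no reference to self, so it pickles cleanly.
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                continue
    return total

if __name__ == '__main__':
    top = '.'  # hypothetical starting directory
    dirs = [os.path.join(top, d) for d in os.listdir(top)
            if os.path.isdir(os.path.join(top, d))]
    with Pool(processes=4) as pool:
        sizes = pool.map(count_dir, dirs)
    print(dict(zip(dirs, sizes)))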

Multiple scripts access the same module with the same data in python?

Recently I have been trying to make a makeshift "disk space" reader. I made a library that stores values in a list (the "disk"), but when I subprocess a new script to write to the "disk" and check whether the values change on the display, nothing happens. I realized that any time you import a module, the module sort of clones itself for only that script.
I want to be able to have scripts import the same module and so that if 1 script changes a value another script can see that value.
Here is my code for the "disk" system
import time
ram = []
space = 256000
lastspace = 0
for i in range(0,space + 1):
ram.append('')
def read(location):
try:
if ram[int(location)] == '':
return "ERR_NO_VALUE"
else:
return ram[int(location)]
except:
return "ERR_OUT_OF_RANGE"
def write(location, value):
try:
ram[int(location)] = value
except:
return "ERR_OUT_OF_RANGE"
def getcontents():
contents = []
for i in range(0, 256001):
contents.append([str(i)+ '- ', ram[i]])
return contents
def getrawcontents():
contents = []
for i in range(0, 256001):
contents.append(ram[i])
return contents
def erasechunk(beg, end):
try:
for i in range(int(beg), int(end) + 1):
ram[i] = ''
except:
return "ERR_OUT_OF_RANGE"
def erase(location):
ram[int(location)] = ''
def reset():
ram = []
times = space/51200
tc = 0
for i in range(0,round(times)):
for x in range(0,51201):
ram.append('')
tc += 1
print("Byte " + str(tc) + " of " + " Bytes")
for a in range(0,100):
print('\a', end='')
return [len(ram), ' bytes']
def wipe():
for i in range(0,256001):
ram[i] = ''
return "WIPED"
def getspace():
x = 0
for i in range(0,len(ram)):
if ram[i] != "":
x += 1
return [x,256000]
The shortest answer to your question, which I'm understanding as "if I import the same function into two (or more) Python namespaces, can they interact with each other?", is no. What actually happens when you import a module is that Python uses the source script to 'build' those functions in the namespace you're importing them to; there is no sense of permanence in "where the module came from" since that original module isn't actually running in a Python process anywhere! When you import those functions into multiple scripts, it's just going to create those pseudo-global variables (in your case ram) with the function you're importing.
Python import docs: https://docs.python.org/3/reference/import.html
The whole page on Python's data model, including what __globals__ means for functions and modules: https://docs.python.org/3/reference/datamodel.html
Explanation:
To go into a bit more depth, when you import any of the functions from this script (let's assume it's called 'disk.py'), you'll get an object in that function's __globals__ dict called ram, which will indeed work as you expect for these functions in your current namespace:
from disk import read,write
write(13,'thing')
print(read(13)) #prints 'thing'
We might assume, since these functions are accurately accessing our ram object, that the ram object is being modified somehow in the namespace of the original script, which could then be accessed by a different script (a different Python process). Looking at the namespace of our current script using dir() might support that notion, since we only see read and write, and not ram. But the secret is that ram is hidden in those functions' __globals__ dict (mentioned above), which is how the functions are interacting with ram:
from disk import read,write
print(type(write.__globals__['ram'])) #<class 'list'>
print(write.__globals__['ram'] is read.__globals__['ram']) #True
write(13,'thing')
print(read(13)) #'thing'
print(read.__globals__['ram'][13]) #'thing'
As you can see, ram actually is a variable defined in the namespace of our current Python process, hidden in the functions' __globals__ dict, which is actually the exact same dictionary for any function imported from the same module; read.__globals__ is write.__globals__ evaluates to True (even if you don't import them at the same time!).
So, to wrap it all up, ram is contained in the __globals__ dict for the disk module, which is created separately in the namespace of each process you import into:
Python interpreter #1:
from disk import read,write
print(id(read.__globals__),id(write.__globals__)) #139775502955080 139775502955080
Python interpreter #2:
from disk import read,write
print(id(read.__globals__),id(write.__globals__)) #139797009773128 139797009773128
Solution hint:
There are many approaches on how to do this practically that are beyond the scope of this answer, but I will suggest that pickle is the standard way to send objects between Python interpreters using files, and has a really standard interface. You can just write, read, etc your ram object using a pickle file. To write:
import pickle

with open('./my_ram_file.pkl', 'wb') as ram_f:
    pickle.dump(ram, ram_f)
To read:
import pickle

with open('./my_ram_file.pkl', 'rb') as ram_f:
    ram = pickle.load(ram_f)

Beautifulsoup memory leak

I am experiencing an ugly case of a memory leak. I am creating an object with BeautifulSoup, then processing it via its own methods. I am doing this with ~2000 XML files. After processing about half of them, the program stops working due to a MemoryError, and the performance is constantly degrading. I tried to solve it via a soup.decompose call in __del__ and a forced gc.collect() after processing each file.
from bs4 import BeautifulSoup, SoupStrainer

class FloorSoup:
    def __init__(self, f_id):
        only_needed = SoupStrainer(["beacons", 'hint'])
        try:
            self.f_soup = BeautifulSoup(open("Data/xmls/floors/floors_" + f_id + ".xml", encoding='utf8'), "lxml", parse_only=only_needed)
        except FileNotFoundError:
            print("File: Data/xmls/floors/floors_" + f_id + ".xml not found")

    def __del__(self):
        self.f_soup.decompose()

    def find_floor_npcs(self):
        found_npcs = set()
        for npc in self.f_soup.find_all(text="npc"):
            found_npcs.add(npc.parent.parent.values.string)
        return found_npcs

    def find_floor_hints(self):
        hint_ids = set()
        print("Finding hints in file")
        for hint in self.f_soup.find_all('hint'):
            hint_ids.add(hint.localization.string)
        return hint_ids
Relevant part of the code I am using to create the object and call the methods:
for q in questSoup.find_all('quest'):
    gc.collect()
    ql = find_q_line(q)
    floors = set()
    for f in set(q.find_all('location_id')):
        if f.string not in skip_loc:
            floor_soup = FloorSoup(f.string)
            join_dict(string_by_ql, ql, floor_soup.find_floor_npcs())
            join_dict(string_by_ql, ql, floor_soup.find_floor_hints())
            del floor_soup
        else:
            print("Skipping location " + f.string)
By putting the find_floor_hints method out of use, I was able to remove the memory leak almost entirely (or to the point where its effects are negligible). Thus I suspect that the problem might lie in that particular method.
Any help would be greatly appreciated!
Referencing this answer, I was able to remove the leak on the find_floor_hints method by using
hint_ids.add(str(hint.localization.contents))
It seems like the former returned a NavigableString, which seems to keep some (read: an awful lot of) references alive even after the FloorSoup object is deleted. I am not exactly sure if it is a bug or a feature, but it works.
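Assuming the same issue applies to the other accessor (an assumption on my part, not something the answer states), converting the results to plain built-in strings before storing them should keep the sets from holding onto the parse tree:
def find_floor_npcs(self):
    found_npcs = set()
    for npc in self.f_soup.find_all(text="npc"):
        # str(...) copies the text into a plain str, dropping the
        # NavigableString's back-references into the soup tree.
        found_npcs.add(str(npc.parent.parent.values.string))
    return found_npcs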

Unable to load pickle file

I was previously able to load a pickle file. I saved a new file under a different name. Now I am unable to load either the old or the new file. That is a bummer, as it contains data I have worked hard to scrub.
Here is the code that I use to save:
def pickleStore():
    pickle.dump(store, open("...shelf3.p", "wb"))
Here is the code that I use to re-load:
def pickleLoad():
    store = pickle.load(open(".../shelf3.p", "rb"))
The created file exists, and when I run pickleLoad() no errors come up, but no variables show up in the Variable Explorer panel either. If I try to load a non-existent file, I get an error message.
I am running spyder, python 3.5.
Any suggestions?
If you want to write to a module-level variable from a function, you need to use the global keyword:
store = None

def pickleLoad():
    global store
    store = pickle.load(open(".../shelf3.p", "rb"))
...or return the value and perform the assignment from module-level code:
store = None

def pickleLoad():
    return pickle.load(open(".../shelf3.p", "rb"))

store = pickleLoad()
As a general and more versatile approach I would suggest something like this:
import pickle

def load(file_name):
    with open(file_name, 'rb') as pickle_file:
        return pickle.load(pickle_file)

def save(file_name, data):
    with open(file_name, 'wb') as f:
        pickle.dump(data, f)
I have added this snippet to several projects to avoid rewriting the same code over and over.
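Usage then looks like this (the file names are just placeholders):
store = load('shelf3.p')        # hypothetical path
save('shelf3_backup.p', store)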
