BeautifulSoup memory leak - Python

I am experiencing an ugly memory leak. I create an object that wraps a BeautifulSoup soup, then process it via the object's own methods. I am doing this with ~2000 XML files. After processing about half of them, the program stops with a MemoryError, and performance degrades steadily the whole time. I tried to solve it by calling soup.decompose() in __del__ and forcing gc.collect() after processing each file.
from bs4 import BeautifulSoup, SoupStrainer

class FloorSoup:
    def __init__(self, f_id):
        only_needed = SoupStrainer(["beacons", 'hint'])
        try:
            self.f_soup = BeautifulSoup(open("Data/xmls/floors/floors_" + f_id + ".xml", encoding='utf8'), "lxml", parse_only=only_needed)
        except FileNotFoundError:
            print("File: Data/xmls/floors/floors_" + f_id + ".xml not found")

    def __del__(self):
        self.f_soup.decompose()

    def find_floor_npcs(self):
        found_npcs = set()
        for npc in self.f_soup.find_all(text="npc"):
            found_npcs.add(npc.parent.parent.values.string)
        return found_npcs

    def find_floor_hints(self):
        hint_ids = set()
        print("Finding hints in file")
        for hint in self.f_soup.find_all('hint'):
            hint_ids.add(hint.localization.string)
        return hint_ids
Relevant part of the code I am using to create the object and call the methods:
for q in questSoup.find_all('quest'):
    gc.collect()
    ql = find_q_line(q)
    floors = set()
    for f in set(q.find_all('location_id')):
        if f.string not in skip_loc:
            floor_soup = FloorSoup(f.string)
            join_dict(string_by_ql, ql, floor_soup.find_floor_npcs())
            join_dict(string_by_ql, ql, floor_soup.find_floor_hints())
            del floor_soup
        else:
            print("Skipping location " + f.string)
By taking the find_floor_hints method out of use, I was able to remove the memory leak almost entirely (or at least to the point where its effects are negligible), so I suspect the problem lies in that particular method.
Any help would be greatly appreciated!

Referencing this answer, I was able to remove the leak on the find_floor_hints method by using
hint_ids.add(str(hint.localization.contents))
It seems the former returned a NavigableString, which appears to keep some (read: an awful lot of) references alive even after the FloorSoup object is deleted. I am not exactly sure whether that is a bug or a feature, but it works.
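For illustration, a minimal sketch of the underlying pattern, using a tiny made-up XML string rather than the question's files: convert every NavigableString to a plain str before letting the soup go, so nothing you keep still points back into the parse tree.

from bs4 import BeautifulSoup

# Hypothetical stand-in for one of the floor XML files.
xml = "<hints><hint><localization>hint_001</localization></hint></hints>"
soup = BeautifulSoup(xml, "lxml")

hint_ids = set()
for hint in soup.find_all("hint"):
    value = hint.localization.string   # NavigableString: still tied to the tree
    hint_ids.add(str(value))           # plain str: no reference back into the soup

soup.decompose()                       # the whole tree can now be reclaimed
print(hint_ids)                        # prints {'hint_001'}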

Related

Python Pandas: how to correctly speed up the __init__ function of a class with Numba?

I have a class that performs various mathematical calculations in a loop.
I want to speed up its processing with Numba,
and I am now trying to apply Numba to the __init__ function.
The class itself and its __init__ function look like this:
class Generic(object):
    # @numba.vectorize
    def __init__(self, N, N1, S1, S2):
        A = [['Time'], ['Time', 'Price'], ["Time", 'Qty'], ['Time', 'Open_interest'], ['Time', 'Operation', 'Quantity']]
        self.table = pd.DataFrame(pd.read_csv('Datasets\\RobotMath\\table_OI.csv'))
        self.Reader()
        for a in A:
            if 'Time' in a:
                self.df = pd.DataFrame(pd.read_csv(Ex2_Csv, usecols=a, parse_dates=[0]))
                self.df['Time'] = self.df['Time'].dt.floor("S", 0)
                self.df['Time'] = pd.to_datetime(self.df['Time']).dt.time
                if a == ['Time']:
                    self.Tik()
                elif a == ['Time', 'Price']:
                    self.Poc()
                    self.Pmm()
                    self.SredPrice()
                    self.Delta_Ema(N, N1)
                    self.Comulative(S1, S2)
                    self.M()
                elif a == ["Time", 'Qty']:
                    self.Volume()
                elif a == ['Time', 'Open_interest']:
                    self.Open_intrest()
                elif a == ['Time', 'Operation', 'Quantity']:
                    self.D()
                    # self.Dataset()
            else:
                print('Something went wrong', f"Set Error: {0} ".format(a))
All functions of the class are ordinary column calculations using Pandas.
Here are two of them for example:
def Tik(self):
    df2 = self.df.groupby('Time').value_counts(ascending=False)
    df2.to_csv('Datasets\\RobotMath\\Tik.csv')

def Poc(self):
    g = self.df.groupby('Time', sort=False)
    out = (g.last() - g.first()).reset_index()
    out.to_csv('Datasets\\RobotMath\\Poc.csv', index=False)
I tried to use Numba in different ways, but I got an error every time.
Is it possible to speed up the __init__ function specifically? Or do I need to look for another way, which I can't manage without rewriting the class?
There's nothing special or magical about the __init__ function; at run time it's just another function like any other.
In terms of performance, your class is doing quite a lot here - it might be worth breaking down the timings of each component first to establish where your performance hang-ups lie.
For example, just reading in the files might be responsible for a fair amount of that time, and for Ex2_Csv you repeatedly do that within a loop, which is likely to be sub-optimal depending on the volume of data you're dealing with (a sketch of hoisting that read out of the loop follows below). Before targeting a resolution (like Numba, for example), it'd be prudent to identify which aspects of the code are not performing within expectations.
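As an aside, a minimal sketch of reading Ex2_Csv a single time and slicing out the columns each branch needs, assuming all of those columns exist in that one file (Ex2_Csv and the downstream methods are the question's names; the path value is a placeholder):

import pandas as pd

Ex2_Csv = 'Datasets\\RobotMath\\Ex2.csv'   # placeholder; use the real path
all_cols = ['Time', 'Price', 'Qty', 'Open_interest', 'Operation', 'Quantity']

# One read with every needed column instead of five reads inside the loop.
full_df = pd.read_csv(Ex2_Csv, usecols=all_cols, parse_dates=['Time'])
full_df['Time'] = full_df['Time'].dt.floor('S').dt.time

A = [['Time'], ['Time', 'Price'], ['Time', 'Qty'],
     ['Time', 'Open_interest'], ['Time', 'Operation', 'Quantity']]
for a in A:
    df = full_df[a].copy()   # cheap in-memory column slice
    # ... dispatch to Tik() / Poc() / etc. exactly as before, using df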
To find out where the time actually goes, you can gather timing information in a number of ways, but the simplest might be to add in some tagged print statements that emit the elapsed time since the last print statement.
e.g.
import datetime

start = datetime.datetime.now()
print("starting", start)

# some block of code that does XXX
XXX_finish = datetime.datetime.now()
print("XXX_finished at", XXX_finish, "taking", XXX_finish - start)

# ... and repeat to generate a runtime report showing timings for each block of code
Then you can break the runtime of your program down into feature-aligned chunks, and when tweaking you'll be able to see the effect directly. Profiling the runtime of your code like this can really help when making performance tweaks, and it sharpens the focus on which tweaks are benefiting or harming specific areas of your code.
For example, in the section that performs the group-bys, with some timestamp outputs before and after, you can compare a run with Numba turned on against a run with it turned off.
From there, with a little careful debugging, sensible logging of times, and a bit of old-fashioned sleuthing (I tend to jot these kinds of things down with paper and pencil), you (and if you share your findings, we) ought to be in a better position to answer your question more precisely.
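If the print-based timing gets unwieldy, the standard library's cProfile can produce a similar breakdown automatically. A minimal sketch (Python 3.8+ for the context-manager form), assuming the Generic class from the question is importable and using placeholder constructor arguments:

import cProfile
import pstats

# Profile a single construction of the class and list the slowest calls.
with cProfile.Profile() as profiler:
    Generic(N=10, N1=20, S1=5, S2=15)   # placeholder arguments

pstats.Stats(profiler).sort_stats('cumulative').print_stats(15)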

Python xml.parsers.expat results differ based on buffer_text = True or False and/or buffer_size

I'm using Python's xml.parsers.expat to parse some XML data, but sometimes the tags don't seem to be parsed correctly. Does anyone know why setting buffer_text and/or buffer_size might actually change whether the data body gets parsed correctly?
A sanitized example, used to parse an Azure blob list:
from xml.parsers.expat import ParserCreate, ExpatError, errors

class MyParser:
    def __init__(self, xmlstring):
        self.current_blob_name = None
        self.current_blob_hash = None
        self.get_hash = False
        self.get_filename = False
        self.xmlstring = xmlstring
        self.p = ParserCreate()
        #### LINE A ####
        self.p.buffer_text = True
        #### LINE B ####
        # self.p.buffer_size = 24
        self.p.StartElementHandler = self._start_element
        self.p.EndElementHandler = self._end_element
        self.p.CharacterDataHandler = self._char_data

    def _reset_current_blob(self):
        self.current_blob_name = None
        self.current_blob_hash = None

    # 3 handler functions
    def _start_element(self, name, attrs):
        if name == 'Blob':
            self._reset_current_blob()
        elif name == 'Name':
            self.get_filename = True
        elif name == 'Content-MD5':
            self.get_hash = True

    def _end_element(self, name):
        if name == 'Blob':
            ...
        elif name == 'Name':
            self.get_filename = False
        elif name == 'Content-MD5':
            self.get_hash = False

    def _char_data(self, data):
        if self.get_hash is True:
            self.current_blob_hash = data
        if self.get_filename is True:
            try:
                ...
            except Exception as e:
                print('Error parsing blob name from Azure.')
                print(e)

    def run(self):
        self.p.Parse(self.xmlstring)
Most of the time things work just fine. However, sometimes after I upload a new item to Azure and the returned XML changes, the "Error parsing blob name from Azure" message gets triggered, because some operations in the try block receive unexpected data. Upon further inspection, I observe that one failure happens when data is supposed to be abcdefg_123/hijk/test.jpg: only abcdefg gets passed into one _char_data call, while _123/hijk/test.jpg gets passed into another. There is nothing wrong with the XML returned from Azure, though.
I've experimented with a few things:
Added LINE A (buffer_text = True). It fixed the issue this time but I do not know if it is just luck (e.g. whether it would break again if I upload a new item to Azure);
If both LINE A and LINE B are added - things break again (I've also experimented with different buffer sizes; all seem to break the parser);
The way it breaks is deterministic: I can rerun the code all I want and it is always the exact element that gets parsed incorrectly;
If I remove that element from the original XML data, then depending on how much other data I remove along the way, it can lead to another element being parsed incorrectly, or it can lead to a "fix" / no error. It does look like it has something to do with the total length of the data returned from Azure. My data is quite long (3537860 characters at this moment), and if I remove a fixed length from anywhere in the XML data, once the length reaches a certain value, no more errors occur.
Unfortunately I am not able to share the raw xml I am parsing since it would take too much effort to sanitize, but hopefully this contains enough information! Let me know if I can add anything else.
Thanks!
Thanks to Walter:
https://bugs.python.org/msg385104
Just a guess, but the buffer size might be so small that the text that you expect gets passed via two calls to _char_data(). You should refactor your code to simply collect all the text in _char_data() and act on it in the _end_element() handler.
So this probably isn't a bug in xml.parsers.expat.
This fix is working for me so far. What I still do not understand: I thought the default buffer size was 24, yet if I set it to, say, 48 (a larger buffer) before this fix, I would again see the issue. In any case, collecting all data in _char_data() does make sense to me.
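A minimal sketch of that refactor, reusing the Name and Content-MD5 elements from the question (the class is a simplified stand-in, not the original code): character data is accumulated in a list and only joined and acted on when the element ends, so it no longer matters how expat splits the text across calls.

from xml.parsers.expat import ParserCreate

class BufferedParser:
    def __init__(self, xmlstring):
        self.xmlstring = xmlstring
        self.blobs = []            # collected (name, md5) pairs
        self._chunks = None        # text chunks for the element currently open
        self._current = {}
        self.p = ParserCreate()
        self.p.StartElementHandler = self._start_element
        self.p.EndElementHandler = self._end_element
        self.p.CharacterDataHandler = self._char_data

    def _start_element(self, name, attrs):
        if name == 'Blob':
            self._current = {}
        elif name in ('Name', 'Content-MD5'):
            self._chunks = []      # start buffering text for this element

    def _char_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)   # may fire several times per element

    def _end_element(self, name):
        if name in ('Name', 'Content-MD5'):
            self._current[name] = ''.join(self._chunks)   # complete text, act here
            self._chunks = None
        elif name == 'Blob':
            self.blobs.append((self._current.get('Name'),
                               self._current.get('Content-MD5')))

    def run(self):
        self.p.Parse(self.xmlstring, True)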

TypeError while using Pool from multiprocessing (Python 3.7)

I'm trying to sum up the sizes of all files in a directory, including recursive subdirectories. The relevant function (self._count) works totally fine if I just call it once. But for large amounts of files I want to use multiprocessing to make the program faster. Here are the relevant parts of the code.
self._sum_dicts sums up the values of the same keys of the given dicts.
self._get_file_type returns the category (key for stats) the file shall be placed in.
self._categories holds a list of all possible categories.
number_of_threats specifies the number of workers that shall be used.
path holds the path to the directory mentioned in the first sentence.
import os
from multiprocessing import Pool

def _count(self, path):
    stats = dict.fromkeys(self._categories, 0)
    try:
        dir_list = os.listdir(path)
    except:
        # I do some warning here, but removed it for SSCCE
        return stats
    for element in dir_list:
        new_path = os.path.join(path, element)
        if os.path.isdir(new_path):
            add_stats = self._count(new_path)
            stats = self._sum_dicts([stats, add_stats])
        else:
            file_type = self._get_file_type(element)
            try:
                size = os.path.getsize(new_path)
            except Exception as e:
                # I do some warning here, but removed it for SSCCE
                continue
            stats[file_type] += size
    return stats

files = []
dirs = []
for e in dir_list:
    new_name = os.path.join(path, e)
    if os.path.isdir(new_name):
        dirs.append(new_name)
    else:
        files.append(new_name)

with Pool(processes=number_of_threats) as pool:
    res = pool.map(self._count, dirs)
self._stats = self._sum_dicts(res)
I know that this code won't consider files directly in path, but that is something I can easily add. When executing the code I get the following exception.
Exception has occurred: TypeError
cannot serialize '_io.TextIOWrapper' object
...
line ... in ...
res = pool.map(self._count, dirs)
I found out that this exception can occur when sharing resources between processes, which, as far as I can see, I only do with stats = dict.fromkeys(self._categories, 0). But replacing this line with hardcoded values doesn't fix the problem. Even placing a breakpoint at this line doesn't help me, because it isn't reached.
Does anybody have an idea what the reason for this problem is and how I can fix this?
The problem is that you try to transmit self. If self holds a stream object, it can't be serialized.
Try to move the multiprocessed code outside the class.
Python multiprocessing launches a new interpreter, and if you try to access shared state that can't be pickled (or serialized), it fails. Usually it doesn't crash where you think it crashed, but when trying to receive the object.
I converted some code using threads to multiprocessing and I had a lot of weird errors even though I didn't send or use those objects; I only used their parent (self).
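A minimal sketch of that restructuring, assuming the per-directory work can be expressed as a module-level function (count_dir and sum_dicts here are hypothetical stand-ins for the class methods): the workers receive only plain path strings, so nothing unpicklable like self travels to the child processes.

import os
from multiprocessing import Pool

def count_dir(path):
    """Recursively sum file sizes below path, keyed by file extension."""
    stats = {}
    for root, _dirs, files in os.walk(path):
        for name in files:
            ext = os.path.splitext(name)[1].lower() or '<none>'
            try:
                size = os.path.getsize(os.path.join(root, name))
            except OSError:
                continue                      # unreadable file, skip it
            stats[ext] = stats.get(ext, 0) + size
    return stats

def sum_dicts(dicts):
    total = {}
    for d in dicts:
        for key, value in d.items():
            total[key] = total.get(key, 0) + value
    return total

if __name__ == '__main__':
    top = '.'
    subdirs = [os.path.join(top, e) for e in os.listdir(top)
               if os.path.isdir(os.path.join(top, e))]
    with Pool(processes=4) as pool:
        results = pool.map(count_dir, subdirs)   # only path strings get pickled
    print(sum_dicts(results))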

Multiple scripts access the same module with the same data in python?

Recently I have been trying to make a makeshift "disk space" reader. I made a library that stores values in a list (the "disk"), and when I launch a new script as a subprocess to write to the "disk", to see whether the values change on the display, nothing happens. I realized that any time you import a module, the module is effectively cloned into just that script.
I want multiple scripts to be able to import the same module, so that if one script changes a value, another script can see that change.
Here is my code for the "disk" system
import time

ram = []
space = 256000
lastspace = 0
for i in range(0, space + 1):
    ram.append('')

def read(location):
    try:
        if ram[int(location)] == '':
            return "ERR_NO_VALUE"
        else:
            return ram[int(location)]
    except:
        return "ERR_OUT_OF_RANGE"

def write(location, value):
    try:
        ram[int(location)] = value
    except:
        return "ERR_OUT_OF_RANGE"

def getcontents():
    contents = []
    for i in range(0, 256001):
        contents.append([str(i) + '- ', ram[i]])
    return contents

def getrawcontents():
    contents = []
    for i in range(0, 256001):
        contents.append(ram[i])
    return contents

def erasechunk(beg, end):
    try:
        for i in range(int(beg), int(end) + 1):
            ram[i] = ''
    except:
        return "ERR_OUT_OF_RANGE"

def erase(location):
    ram[int(location)] = ''

def reset():
    ram = []
    times = space / 51200
    tc = 0
    for i in range(0, round(times)):
        for x in range(0, 51201):
            ram.append('')
        tc += 1
        print("Byte " + str(tc) + " of " + " Bytes")
    for a in range(0, 100):
        print('\a', end='')
    return [len(ram), ' bytes']

def wipe():
    for i in range(0, 256001):
        ram[i] = ''
    return "WIPED"

def getspace():
    x = 0
    for i in range(0, len(ram)):
        if ram[i] != "":
            x += 1
    return [x, 256000]
The shortest answer to your question, which I'm understanding as "if I import the same function into two (or more) Python namespaces, can they interact with each other?", is no. What actually happens when you import a module is that Python uses the source script to 'build' those functions in the namespace you're importing them to; there is no sense of permanence in "where the module came from" since that original module isn't actually running in a Python process anywhere! When you import those functions into multiple scripts, it's just going to create those pseudo-global variables (in your case ram) with the function you're importing.
Python import docs: https://docs.python.org/3/reference/import.html
The whole page on Python's data model, including what __globals__ means for functions and modules: https://docs.python.org/3/reference/datamodel.html
Explanation:
To go into a bit more depth, when you import any of the functions from this script (let's assume it's called 'disk.py'), you'll get an object in that function's __globals__ dict called ram, which will indeed work as you expect for these functions in your current namespace:
from disk import read,write
write(13,'thing')
print(read(13)) #prints 'thing'
We might assume, since these functions are accurately accessing our ram object, that the ram object is being modified somehow in the namespace of the original script, which could then be accessed by a different script (a different Python process). Looking at the namespace of our current script using dir() might support that notion, since we only see read and write, and not ram. But the secret is that ram is hidden in those functions' __globals__ dict (mentioned above), which is how the functions are interacting with ram:
from disk import read,write
print(type(write.__globals__['ram'])) #<class 'list'>
print(write.__globals__['ram'] is read.__globals__['ram']) #True
write(13,'thing')
print(read(13)) #'thing'
print(read.__globals__['ram'][13]) #'thing'
As you can see, ram actually is a variable defined in the namespace of our current Python process, hidden in the functions' __globals__ dict, which is actually the exact same dictionary for any function imported from the same module; read.__globals__ is write.__globals__ evaluates to True (even if you don't import them at the same time!).
So, to wrap it all up, ram is contained in the __globals__ dict for the disk module, which is created separately in the namespace of each process you import into:
Python interpreter #1:
from disk import read,write
print(id(read.__globals__),id(write.__globals__)) #139775502955080 139775502955080
Python interpreter #2:
from disk import read,write
print(id(read.__globals__),id(write.__globals__)) #139797009773128 139797009773128
Solution hint:
There are many approaches on how to do this practically that are beyond the scope of this answer, but I will suggest that pickle is the standard way to send objects between Python interpreters using files, and has a really standard interface. You can just write, read, etc your ram object using a pickle file. To write:
import pickle

with open('./my_ram_file.pkl', 'wb') as ram_f:
    pickle.dump(ram, ram_f)
To read:
import pickle

with open('./my_ram_file.pkl', 'rb') as ram_f:
    ram = pickle.load(ram_f)

How to create a variable whose value persists across file reload?

Common Lisp has defvar which
creates a global variable but only sets it if it is new: if it already
exists, it is not reset. This is useful when reloading a file from a long running interactive process, because it keeps the data.
I want the same in Python.
I have file foo.py which contains something like this:
cache = {}

def expensive(x):
    try:
        return cache[x]
    except KeyError:
        # do a lot of work
        cache[x] = res
        return res
When I do imp.reload(foo), the value of cache is lost, which I want to avoid.
How do I keep cache across reload?
PS. I guess I can follow How do I check if a variable exists? :
if 'cache' not in globals():
    cache = {}
but it does not look "Pythonic" for some reason...
If it is TRT, please tell me so!
Answering comments:
I am not interested in cross-invocation persistence; I am already handling that.
I am painfully aware that reloading changes class meta-objects and I am already handling that.
The values in cache are huge, I cannot go to disk every time I need them.
Here are a couple of options. One is to use a temporary file as persistent storage for your cache, and try to load it every time you load the module:
# foo.py
import tempfile
import pathlib
import pickle

_CACHE_TEMP_FILE_NAME = '__foo_cache__.pickle'
_CACHE = {}

def expensive(x):
    try:
        return _CACHE[x]
    except KeyError:
        # do a lot of work
        _CACHE[x] = res
        _save_cache()
        return res

def _save_cache():
    tmp = pathlib.Path(tempfile.gettempdir(), _CACHE_TEMP_FILE_NAME)
    with tmp.open('wb') as f:
        pickle.dump(_CACHE, f)

def _load_cache():
    global _CACHE
    tmp = pathlib.Path(tempfile.gettempdir(), _CACHE_TEMP_FILE_NAME)
    if not tmp.is_file():
        return
    try:
        with tmp.open('rb') as f:
            _CACHE = pickle.load(f)
    except pickle.UnpicklingError:
        pass

_load_cache()
The only issue with this is that you need to trust the environment not to write anything malicious in place of the temporary file (the pickle module is not secure against erroneous or maliciously constructed data).
Another option is to use another module for the cache, one that does not get reloaded:
# foo_cache.py
Cache = {}
And then:
# foo.py
import foo_cache

def expensive(x):
    try:
        return foo_cache.Cache[x]
    except KeyError:
        # do a lot of work
        foo_cache.Cache[x] = res
        return res
Since the whole point of a reload is to ensure that the executed module's code is run a second time, there is essentially no way to avoid some kind of "reload detection."
The code you use appears to be the best answer from those given in the question you reference.
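For completeness, a sketch of an equivalent reload-detection idiom that probes the name directly instead of going through globals() (purely a stylistic alternative, under the same assumption that foo.py is being reloaded in place):

try:
    cache              # does the name already exist in this module?
except NameError:      # first load: create the cache
    cache = {}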
