Before the question, I want to apologize: I'm not a native English speaker and my English is very poor.
I'm writing a Python program that reads a file into a string,
then analyzes it,
and then passes it to some other programs. (They can't work with a stream.)
Simple code like this:
content = open("file").read()
if passItToA(content):
    A.portal(content)
del content
The problem is that the string I read is not released. Usually it lives until the end of the process.
I know this is a feature of dynamic languages,
but it wastes a lot of memory when I run 1000 copies of the process at the same time.
Can I release it on demand?
Python relies heavily on garbage collection. To mark a value as garbage (and let the collector do its work on it), just overwrite the variable:
content = ''
You also can delete the whole variable from the dictionary of variables:
del content
But concerning the string value, both work the same.
Just make sure that no other variable still holds a reference to that string. In your case, A.portal() and passItToA() must not keep long-lived references to the same string, otherwise it cannot be freed.
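To illustrate the point about lingering references, here is a minimal sketch. It relies on CPython's reference counting, which is an implementation detail, and the names `size` and `holder` are made up for the illustration:

```python
import sys

size = 1_000_000
content = "x" * size      # stands in for open("file").read()
holder = [content]        # a second, long-lived reference to the same string

before = sys.getrefcount(content)
del holder                # drop the extra reference
after = sys.getrefcount(content)

# One reference fewer, but the string is still alive because `content` exists.
print(before - after)

del content               # last reference gone: CPython can free the string now
```

As long as something like `holder` survives, neither `del content` nor `content = ''` lets the interpreter reclaim the string.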
I have some data stored in a Redis cache that will be read by my application in Rust. The data is written by Python. Whenever I store a string or an array, it is stored in a weird form that I was not able to read in Rust. Conversely, I want to write from Rust and be able to read it in Python.
Using django shell:
In [0]: cache.set("test","abc")
In [1]: cache.get("test")
Out[1]:'abc'
Using redis-cli:
127.0.0.1:6379> GET :1:test
"\x80\x04\x95\a\x00\x00\x00\x00\x00\x00\x00\x8c\x03abc\x94."
Output from Rust:
Err(Invalid UTF-8)
Rust code reading the data with the redis-rs library:
let client = redis::Client::open("redis://127.0.0.1:6379")?;
let mut con = client.get_connection()?;
let q:Result<String, redis::RedisError> = con.get(":1:test");
println!("{:?}",q);
I want to be able to read a string or array in Rust as it was written in Python, and vice versa.
Also, the data in any one key will only ever be written by either Rust or Python, not both.
This question is not a duplicate of this one, as that deals specifically with accent encoding, whereas I want to solve my problem for arrays as well. Moreover, the value that django sets in Redis for a string is not simply the UTF-8 encoding of the string.
Ah, the joys of trying to throw data across environments. The thing you're being bitten by right now is called Pickle and is the default serializer of django-redis. What a serializer does in this case (in python) is the transformation of your data between python and redis so you can store it, regardless of the type, but more importantly so you can retrieve it with the type it came in.
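As a small sketch of the difference, the snippet below serializes the same value with pickle and with JSON. Protocol 4 is an assumption, chosen because the bytes shown by redis-cli start with \x80\x04:

```python
import json
import pickle

value = "abc"

# Assumption: protocol 4, matching the \x80\x04 prefix seen in redis-cli.
pickled = pickle.dumps(value, protocol=4)
as_json = json.dumps(value).encode("utf-8")

print(pickled)   # binary pickle framing, not valid UTF-8
print(as_json)   # plain UTF-8 text that any JSON library can parse

# Lists survive the JSON round trip as well.
assert json.loads(json.dumps(["a", "b", 1])) == ["a", "b", 1]
```

The pickled bytes are what a Rust client receives today, which is why decoding them as a UTF-8 String fails.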
The python side
Obviously, if you had infinite time and effort, you could rewrite pickle in rust and you'd then be able to read this format. I'm pretty sure you have neither, and depending on the data you're storing, you might not even want to do so.
Instead, what I'm going to suggest is to change the serializer from pickle to JSON. The description of what to change in the config is located at https://django-redis-cache.readthedocs.io/en/latest/advanced_configuration.html#pluggable-serializers , and in particular, I'm pretty sure the class you want to use in django-redis is django_redis.serializers.json.JSONSerializer.
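A configuration sketch, assuming the django-redis package (not django-redis-cache) is the backend in use. To my knowledge the dotted path of its JSON serializer is the one shown, but check it against the version you have installed; the LOCATION value is simply the address from the question:

```python
# settings.py (fragment) -- a sketch assuming the django-redis package
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://127.0.0.1:6379",
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            # Replace the default pickle serializer with JSON.
            "SERIALIZER": "django_redis.serializers.json.JSONSerializer",
        },
    }
}
```

With this in place a stored string is a JSON document (including its quotes), so the Rust side should parse it as JSON rather than read it as a bare string.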
This comes with drawbacks. In particular, there will be some object types you will no longer be able to store on the python side, but if you do really intend to read data on the rust side, this should not concern you.
Sven Marnach mentioned in one of the comments that the serde-pickle crate exists. I have not used it myself, but it does look promising and might save you a ton of interop work if it does function.
The rust side
To read stuff, now that every key is going to be JSON, you'll be decoding values with either serde (via serde_json) or miniserde. This should be pretty straightforward; do bear in mind that if you decode into the generic value type you will not get native types out of this; instead, you'll get members of the serde_json::Value enum (Bool, Number, Object, etc.), which you will then have to filter through.
Edit your question to indicate what you are trying to store, and I'll happily expand on how to do this on here!
In perl there was this idea of the tie operator, where writing to or modifying a variable can run arbitrary code (such as updating some underlying Berkeley database file). I'm quite sure there is this concept of overloading in python too.
I'm interested to know what the most idiomatic way is to basically consider a local JSON file as the canonical source of needed hierarchical information throughout the running of a python script, so that changes in a local dictionary are automatically reflected in the JSON file. I'll leave it to the OS to optimise writes and cache (I don't mind if the file is basically updated dozens of times throughout the running of the script), but ultimately this is just about a kilobyte of metadata that I'd like to keep around. It's not necessary to address concurrent access to this. I'd just like to be able to access a hierarchical structure (like nested dictionary) within the python process and have reads (and writes to) that structure automatically result in reads from (and changes to) a local JSON file.
Well, since Python itself has no signals and slots, I guess you can instead make your own dictionary class by inheriting from the Python dict. The class behaves exactly like dict, except that every method that can change the dict's values also dumps your JSON.
You could also use something like PyQt4's QAbstractItemModel, which has signals. When its dataChanged signal is emitted, do your dumping; it will then happen in only one place, which is nice.
I know these two are sort of stupid ways, probably, yeah. :) If anyone knows better, go ahead and tell!
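The first suggestion (a dict subclass that dumps on every change) could be sketched roughly as follows. The class name `JsonBackedDict` and the file name are hypothetical. Only `__setitem__`, `__delitem__` and `update` are covered here, and changes made inside nested lists or dicts are not noticed:

```python
import json
import os
import tempfile


class JsonBackedDict(dict):
    """A dict that rewrites a JSON file after every top-level change (sketch)."""

    def __init__(self, path):
        super().__init__()
        self._path = path
        if os.path.exists(path):
            with open(path) as f:
                # Load existing data without triggering a dump.
                super().update(json.load(f))

    def _dump(self):
        with open(self._path, "w") as f:
            json.dump(self, f)

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self._dump()

    def __delitem__(self, key):
        super().__delitem__(key)
        self._dump()

    def update(self, *args, **kwargs):
        # dict.update does not go through __setitem__, so dump explicitly.
        super().update(*args, **kwargs)
        self._dump()


path = os.path.join(tempfile.mkdtemp(), "meta.json")  # throwaway location
meta = JsonBackedDict(path)
meta["title"] = "demo"
meta.update(version=2)
del meta["title"]
```

Other mutators such as pop, setdefault and clear would need the same treatment.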
This is a development of aspect_mkn8rd's answer taking into account Gerrat's comments, but it is too long for a true comment.
You will need two special container classes emulating a list and a dictionary. In both, you add a reference to a top-level object and override the following methods:
__setitem__(self, key, value)
__delitem__(self, key)
the other in-place mutators (for example append, insert, pop, reverse and sort on the list; update, pop and clear on the dictionary); note that __reversed__(self) only returns an iterator and modifies nothing
All those methods are called on modification and should cause the top-level object to be written to disk.
In addition, __setitem__(self, key, value) should check whether value is a list and wrap it in the special list object, or whether it is a dictionary and wrap it in the special dictionary object. In both cases, the method should store the top-level object in the new container. If value is neither and defines __setitem__, the method should raise an exception saying the object is not supported. Of course, if you later add support for such a class, you should modify the method to take it into account.
Of course, there is a good deal of code to write and test, but it should work - left to the reader as an exercise :-)
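One possible shape for that exercise is sketched below. It swaps in collections.abc.MutableMapping and MutableSequence, which derive append, pop, update and friends from the few methods defined here, instead of overriding each mutator by hand. All class names are hypothetical, and slice assignment on the list is not handled properly:

```python
import json
import os
import tempfile
from collections.abc import MutableMapping, MutableSequence


class _Tracked:
    """Mixin: wrap nested containers and hand them the top-level object."""

    def _wrap(self, value):
        if isinstance(value, _Tracked):
            return value
        if isinstance(value, dict):
            return TrackedDict(value, self._root)
        if isinstance(value, list):
            return TrackedList(value, self._root)
        if hasattr(value, "__setitem__"):
            raise TypeError(f"unsupported container: {type(value).__name__}")
        return value


class TrackedDict(_Tracked, MutableMapping):
    def __init__(self, data, root):
        self._root = root
        self._data = {k: self._wrap(v) for k, v in data.items()}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = self._wrap(value)
        self._root.save()

    def __delitem__(self, key):
        del self._data[key]
        self._root.save()

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


class TrackedList(_Tracked, MutableSequence):
    def __init__(self, data, root):
        self._root = root
        self._data = [self._wrap(v) for v in data]

    def __getitem__(self, index):
        return self._data[index]

    def __setitem__(self, index, value):
        self._data[index] = self._wrap(value)
        self._root.save()

    def __delitem__(self, index):
        del self._data[index]
        self._root.save()

    def __len__(self):
        return len(self._data)

    def insert(self, index, value):
        self._data.insert(index, self._wrap(value))
        self._root.save()


class JsonStore:
    """Top-level object: owns the file and the root dictionary."""

    def __init__(self, path):
        self.path = path
        raw = {}
        if os.path.exists(path):
            with open(path) as f:
                raw = json.load(f)
        self.data = TrackedDict(raw, self)

    def save(self):
        with open(self.path, "w") as f:
            # default= unwraps tracked containers into plain dict/list.
            json.dump(self.data, f, default=lambda obj: obj._data)


path = os.path.join(tempfile.mkdtemp(), "meta.json")  # throwaway location
store = JsonStore(path)
store.data["user"] = {"tags": ["a"]}
store.data["user"]["tags"].append("b")   # nested change, still written to disk
```

Every mutation rewrites the whole file, which seems acceptable for about a kilobyte of metadata.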
If concurrency is not required, maybe consider writing two functions to read and write the data to a shelf file? Or is the idea to have the dictionary "aware" of changes so that it updates the file without this kind of thing?
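A minimal sketch of the two-function idea using the standard shelve module. Note that shelve stores pickles in a dbm file rather than JSON; the function names and the path are made up for the example:

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "metadata")  # throwaway location


def write_meta(key, value):
    """Store one value under `key` in the shelf file."""
    with shelve.open(path) as db:
        db[key] = value


def read_meta(key, default=None):
    """Fetch the value stored under `key`, or `default` if absent."""
    with shelve.open(path) as db:
        return db.get(key, default)


write_meta("settings", {"depth": 2, "names": ["a", "b"]})
print(read_meta("settings"))
```

The caller has to remember to call write_meta after changing anything, which is exactly the bookkeeping the question hopes to avoid.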
Say I store a password in plain text in a variable called passWd, as a string.
How does Python release this variable once I discard it (for instance, with del passWd or passWd = 'new random data')?
Is the string stored as a byte array, meaning it can be overwritten in the memory location where it originally lived? Or is it a fixed, unmodifiable region of memory, so that assigning a new value creates a new memory area while the old one is discarded but not overwritten with nulls?
I'm questioning how Python implements the safety of memory areas and would like to know more about it, mainly because I'm curious :)
From what I've gathered so far, using del (or __del__) does not make the interpreter release the variable's memory automatically, which can cause issues, and I'm also not sure that del is thorough about deleting the values. But that's just what I've gathered, nothing black and white :)
The main reason I'm asking is that I intend to write a hand-over application that gets a string, does some I/O, and passes it along to another subsystem (a bootloader for a Raspberry Pi, for instance), with the interface written in Python (however odd that must sound to some people's ears). I'm not worried that the data is compromised during the I/O calculations, but that a memory dump might occur between the two subsystem hand-overs, or that the system is frozen (say, hibernated) 20 minutes after boot, and although I removed the variable as fast as I could, it is somehow still in memory despite my del passWd :)
(P.S. I asked on Super User and they referred me here, and I'm sorry for the poor grammar!)
Unless you use custom-coded input methods to get the password, it will be in many more places than just your immutable string. So don't worry too much.
The OS should take care that any data from your process is cleared before the memory is allocated to another process. This may of course fail if the page is copied to disk (swapped out or hibernated).
Secure password entry is not easy. Maybe you can find a special library or module that handles this.
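For contrast with the immutable str, here is a sketch of overwriting a mutable bytearray in place. This is not a security guarantee: copies made earlier (by the input routine, by conversions to str, by swapping) are untouched. The helper name `wipe` and the sample value are made up:

```python
def wipe(buf: bytearray) -> None:
    """Overwrite a mutable buffer in place with zero bytes."""
    for i in range(len(buf)):
        buf[i] = 0


secret = bytearray(b"correct horse")   # hypothetical password bytes
alias = secret                          # another name for the same buffer
wipe(secret)
print(alias)                            # the zeroed buffer, seen through the alias
```

A str offers no equivalent, because rebinding or deleting the name never modifies the old string object.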
I finally went with two solutions.
LD_PRELOAD, to replace Python's string handling at a lower level.
One other option, which is a bit easier, was to develop my own C library that has more functionality than what Python offers through its standard string handling.
Mainly, the C code has a shread() function that writes over the memory area where the string "was" stored, plus some other error checks.
However, @Ber gave me a good enough answer to start developing my own solution since (as he pointed out) there is no secure method in Python, and Python stores strings in way too many places and relies on the OS (which, on its own, isn't a bad thing, except when you don't trust the OS you are installing your relatively secure application on).
A couple of my Python programs aim to
format some information from a "source" text file into a hash table (hence, I'm a dict() addict ;-) ), and
use that table to modify a "target" file.
My concern is that the "source" files I usually process can be very large (several GB), so they take more than 10 seconds to parse, and I need to run the program a bunch of times. In short, I feel it's a waste to reload the same large file each time I need to modify a new "target".
My thought is: if it were possible to write the dict() built from the "source" file once, in a form that Python could read and process much faster (I'm thinking of a format close to the one Python uses in RAM), that would be great.
Is there a possibility to achieve that?
Thank you.
Yea, you can marshal the dict, or you can use pickle. For the difference between the two, especially as regards to speed, see this question.
pickle is the usual solution to such things, but if you see any value in being able to edit the saved data, and if the dictionary uses only simple types such as strings and numbers (nested dictionaries or lists are also OK), you can simply write the repr() of the dictionary to a text file, then parse it back into a Python dictionary using eval() (or, better yet, ast.literal_eval()).
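Both options can be sketched in a few lines. The `table` value and file names are placeholders for the real parsed data, and whether loading is actually faster than re-parsing should be measured on the real files:

```python
import ast
import os
import pickle
import tempfile

table = {"alpha": [1, 2, 3], "beta": {"x": 1.5}}   # stand-in for the parsed "source"
cache_dir = tempfile.mkdtemp()                      # throwaway location

# Option 1: binary pickle.
pkl_path = os.path.join(cache_dir, "table.pkl")
with open(pkl_path, "wb") as f:
    pickle.dump(table, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(pkl_path, "rb") as f:
    from_pickle = pickle.load(f)

# Option 2: editable text via repr() and ast.literal_eval().
txt_path = os.path.join(cache_dir, "table.txt")
with open(txt_path, "w") as f:
    f.write(repr(table))
with open(txt_path) as f:
    from_text = ast.literal_eval(f.read())

print(from_pickle == table, from_text == table)
```

Only unpickle files you wrote yourself, since loading a pickle can execute arbitrary code.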
I've got a few questions about best practices in Python. Not too long ago I would do something like this with my code:
...
junk_block = "".join(open("foo.txt","rb").read().split())
...
I don't do this anymore because I can see that it makes code harder to read, but would the code run slower if I split the statements up like so:
f_obj = open("foo.txt", "rb")
f_data = f_obj.read()
f_data_list = f_data.split()
junk_block = "".join(f_data_list)
I also noticed that there's nothing keeping you from doing an 'import' within a function block, is there any reason why I should do that?
As long as you're inside a function (not at module top level), assigning intermediate results to local barenames has an essentially-negligible cost (at module top level, assigning to the "local" barenames implies churning on a dict -- the module's __dict__ -- and is measurably costlier than it would be within a function; the remedy is never to have "substantial" code at module top level... always stash substantial code within a function!-).
Python's general philosophy includes "flat is better than nested" -- and that includes highly "nested" expressions. Looking at your original example...:
junk_block = "".join(open("foo.txt","rb").read().split())
presents another important issue: when is that file getting closed? In CPython today, you need not worry -- reference counting in practice does ensure timely closure. But most other Python implementations (Jython on the JVM, IronPython on .NET, PyPy on all sorts of backends, pynie on Parrot, Unladen Swallow on LLVM if and when it matures per its published roadmap, ...) do not guarantee the use of reference counting -- many garbage collection strategies may be involved, with all sorts of other advantages.
Without any guarantee of reference counting (and even in CPython it's always been deemed an implementation artifact, not part of the language semantics!), you might be exhausting resources, by executing such "open but no close" code in a tight loop -- garbage collection is triggered by scarcity of memory, and does not consider other limited resources such as file descriptors. Since 2.6 (and 2.5, with an "import from the future"), Python has a great solution via the RAII ("resource acquisition is initialization") approach supported by the with statement:
with open("foo.txt","rb") as f:
    junk_block = "".join(f.read().split())
is the least-"unnested" way that will ensure timely closure of the file across all compliant versions of Python. The stronger semantics make it preferable.
Beyond ensuring the correct, and prudent;-), semantics, there's not that much to choose between nested and flattened versions of an expression such as this. Given the task "remove all runs of whitespace from the file's contents", I would be tempted to benchmark alternative approaches based on re and on the .translate method of strings (the latter, esp. in Python 2.*, is often the fastest way to delete all characters from a certain set!), before settling on the "split and rejoin" approach if it proves to be faster -- but that's really a rather different issue;-).
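Such a benchmark might look like the sketch below. It uses Python 3 str values and a made-up sample text, and the translate variant deletes only the ASCII whitespace listed in string.whitespace. The timings depend on the machine and on the input, so none are quoted here:

```python
import re
import string
import timeit

text = "a b\tc\nd  e " * 1000   # made-up sample input


def via_split(s):
    return "".join(s.split())


def via_re(s):
    return re.sub(r"\s+", "", s)


# Translation table that deletes every ASCII whitespace character.
_DELETE_WS = str.maketrans("", "", string.whitespace)


def via_translate(s):
    return s.translate(_DELETE_WS)


# All three must agree before their speed is worth comparing.
assert via_split(text) == via_re(text) == via_translate(text)

for fn in (via_split, via_re, via_translate):
    print(fn.__name__, timeit.timeit(lambda: fn(text), number=100))
```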
First of all, there's not really a reason you shouldn't use the first example - it's quite readable in that it's concise about what it does. No reason to break it up since it's just a linear combination of calls.
Second, import within a function block is useful if there's a particular library function that you only need within that function - since the scope of an imported symbol is only the block within which it is imported, if you only ever use something once, you can just import it where you need it and not have to worry about name conflicts in other functions. This is especially handy with from X import Y statements, since Y won't be qualified by its containing module name and thus might conflict with a similarly named function in a different module being used elsewhere.
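A small sketch of that scoping behaviour; the function name and the module-level `pi` are invented for the example:

```python
def circle_area(radius):
    # `pi` is bound only in this function's local scope.
    from math import pi
    return pi * radius ** 2


pi = "apple"   # an unrelated module-level name, untouched by the import above

print(circle_area(1.0))
print(pi)
```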
from PEP 8 (which is worth reading anyway)
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants
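That line has almost the same result as this (split() with no arguments also removes tabs and newlines, whereas this removes only spaces):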
junk_block = open("foo.txt","rb").read().replace(' ', '')
In your example you are splitting the words of the text into a list of words, and then you are joining them back together with no spaces. The above example instead uses the str.replace() method.
The differences:
Yours builds a file object into memory, builds a string into memory by reading it, builds a list into memory by splitting the string, builds a new string by joining the list.
Mine builds a file object into memory, builds a string into memory by reading it, builds a new string into memory by replacing spaces.
You can see a bit less RAM is used in the new variation but more processor is used. RAM is more valuable in some cases and so memory waste is frowned upon when it can be avoided.
Most of the memory will be garbage collected immediately but multiple users at the same time will hog RAM.
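One caveat worth checking before swapping one form for the other: split() with no argument splits on any whitespace, while replace(' ', '') removes only the space character. A quick sketch with made-up text (in Python 3 a file opened with "rb" yields bytes, so the literals would need a b prefix there):

```python
data = "one two\tthree\nfour"

joined = "".join(data.split())      # drops spaces, tabs and newlines
replaced = data.replace(" ", "")    # drops only the space character

print(repr(joined))
print(repr(replaced))
```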
If you want to know if your second code fragment is slower, the quick way to find out would be to just use timeit. I wouldn't expect there to be that much difference though, since they seem pretty equivalent.
You should also ask if a performance difference actually matters in the code in question. Often readability is of more value than performance.
I can't think of any good reasons for importing a module in a function, but sometimes you just don't know you'll need to do something until you see the problem. I'll have to leave it to others to point out a constructive example of that, if it exists.
I think both versions are readable. I (and that's just a question of personal style) would probably use the first, adding a comment line, something like: "Open the file and convert the data inside into a list"
Also, there are times when I use the second, maybe not so separated, but something like
f_data = open("foo.txt", "rb").read()
f_data_list = f_data.split()
junk_block = "".join(f_data_list)
But then I'm giving more weight to each operation, which could be important in the flow of the code. I think what matters is that you are comfortable with it and won't find the code difficult to understand in the future.
Definitely, the code will not be slower (at least, not by much), as the only overhead you're adding is assigning the results to names.