Python memory limit [duplicate]

This question already has answers here:
In-memory size of a Python structure
(7 answers)
Closed 8 years ago.
Clearly there cannot be unlimited memory available to Python. I am writing a script that creates lists of dictionaries, and each list holds between 250K and 6M objects (there are 6 lists).
Is there a way to actually calculate (possibly based on the RAM of the machine) the maximum memory available and the memory required to run the script?
The actual issue I came across:
While running one of the scripts that populates a list with 1.25-1.5 million dictionaries, when it hits 1.227... it simply stops, but returns no error, not even a MemoryError. So I am not even sure this is a memory limit. I have print statements so I can watch what is going on, and it seems to hang forever: nothing prints, even though up until that specific section the code was running through a couple thousand lines per second. Any ideas as to what is making it stop? Is this memory or something else?

If you have that many objects you need to store, you need to store them on disk, not in memory. Consider using a database.
If you import the sys module you will have access to sys.getsizeof(). You will have to look at each object in the list, and for each dictionary add up the size of every key and value. For more on this, see the previous question In-memory size of a Python structure.
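As a rough illustration, here is a minimal sketch of that idea (the record layout is made up); note that sys.getsizeof does not follow references, so this walks just one level deep and still undercounts shared or deeply nested objects:
import sys

def rough_size(records):
    # Add up the list itself, each dict, and each dict's keys and values.
    total = sys.getsizeof(records)
    for record in records:
        total += sys.getsizeof(record)
        for key, value in record.items():
            total += sys.getsizeof(key) + sys.getsizeof(value)
    return total

records = [{"id": i, "name": "item %d" % i} for i in range(250000)]
print("approx. %.1f MB" % (rough_size(records) / (1024.0 * 1024.0)))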

Related

Iterating through files is faster after the first time [duplicate]

This question already has answers here:
How did Python read this binary faster the second time?
(2 answers)
Closed 1 year ago.
This is something I came across while working on a project, and I'm kind of confused. I have a .txt file with ~15000 lines. When I run the program once, it takes around 4-5 seconds to go through all the lines. Then I wrapped the whole thing in a while True, opening the file and calling file.close() on each pass, so that it continuously opens, goes through all the lines, and then closes.
After the first pass, I noticed it takes only around 1 second to complete. I made sure to close the file each time, so what might be causing it to be so much faster?
It's called "file caching" or "warming the cache". All of the major operating systems allocate a goodly portion of your RAM to a file cache. When you read a file, those buffers are retained for a while instead of being released right away. If you read the same file again, it can often pull the data from RAM instead of going to disk.

How to check disk space when running python script? [duplicate]

This question already has answers here:
Find free disk space in python on OS/X
(7 answers)
Closed 3 years ago.
I have a script that is going to download a lot of data from the internet, but I have no idea how long it will take or how big the data will be.
To be more precise, I want to analyze some live videos, and for that I will download the content using youtube-dl. Since I want to leave it running for a week or two, is there a way to avoid running out of disk space, i.e. have the computer check at a specific interval how much free space is left and stop execution if it drops below a certain value?
Thanks
You can use shutil.disk_usage(path). From the docs:
shutil.disk_usage(path)
Return disk usage statistics about the given path as a named tuple with the attributes total, used and free, which are the amount of total, used and free space, in bytes. On Windows, path must be a directory; on Unix, it can be a file or directory.
For example:
import shutil
total, used, free = shutil.disk_usage("/")
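shutil.disk_usage needs Python 3.3+. A minimal watchdog sketch along those lines follows; the threshold, path, and check interval are assumptions, and you would adapt the loop to stop your own download instead of just exiting:
import shutil
import sys
import time

MIN_FREE_BYTES = 5 * 1024 ** 3   # assumed threshold: stop below 5 GB free
CHECK_INTERVAL = 600             # assumed interval: check every 10 minutes

def free_bytes(path="/"):
    # disk_usage returns a named tuple (total, used, free), all in bytes.
    return shutil.disk_usage(path).free

while True:
    if free_bytes("/") < MIN_FREE_BYTES:
        print("Free space below threshold, stopping.")
        sys.exit(1)
    time.sleep(CHECK_INTERVAL)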

How can I make a program remember variables every time it is run in python? [duplicate]

This question already has answers here:
Saving an Object (Data persistence)
(5 answers)
easy save/load of data in python
(7 answers)
Closed 5 years ago.
When I run a program for the second time, how can I make it remember the variables that have changed?
So for the first time it is run:
>>> foo = ["bar", "foo"]
>>> foo[2] = raw_input()
And then the second time it is run it remembers what foo[2] is without the user having to put it in again.
You can write data to a file, and read that data back in on subsequent runs. (If the file doesn't exist, or if it is zero length, let the program start from scratch.) This approach is quite practical if you only need to save one or two items of data.
Slightly more generally, persistence is a term commonly used to refer to maintaining data across multiple executions of a program.
See, for example, persistence for Python 2.7 and persistence for Python 3. There, pickling or marshalling data are described. Use functions like dumps() to serialize data for storage, and loads() to de-serialize it. This approach is practical if you save lots of data or if you save complicated data structures.
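A small sketch of the pickle approach applied to the example above; the file name is arbitrary, and on Python 2 you would use raw_input() instead of input():
import os
import pickle

STATE_FILE = "state.pickle"

# Load the previously saved list if it exists and is non-empty,
# otherwise start from scratch.
if os.path.exists(STATE_FILE) and os.path.getsize(STATE_FILE) > 0:
    with open(STATE_FILE, "rb") as f:
        foo = pickle.load(f)
else:
    foo = ["bar", "foo"]

foo.append(input("new value: "))

# Save the updated list so the next run sees it.
with open(STATE_FILE, "wb") as f:
    pickle.dump(foo, f)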
I don't really understand what you are asking, but if I had to guess: variables only hold their values while the program is running, so the second time you run it they start fresh. If you want a value remembered, save it to a file and read it back in.

Keeping large files loaded in Flask [duplicate]

This question already has answers here:
Store large data or a service connection per Flask session
(1 answer)
Are global variables thread-safe in Flask? How do I share data between requests?
(4 answers)
Closed 4 years ago.
I'm writing a "webapp" for my own personal use, meant to be run with my own computer as a server. It's basically a nice interface for data visualization. This app needs to manipulate large matrices - about 100MB - in Python, and return the computation results to the browser for visualization. Currently I'm just storing the data in a csv file and loading it into pandas every time I want to use it, but this is very slow (about 15 seconds). Is there a way to have this object (a pandas.DataFrame) persist in memory, or does this not make sense?
I tried memcached and I do not think it is appropriate here. I also tried using Redis but reading from the cache is actually the same speed as reading the file if I store each matrix row under its own key, and if I store it all in a string under the same key then reconstructing the array from the string is just as slow as reading it from the csv file. So no gains either way.
Considering the app is meant to run on your own computer, there are two options you could try:
Use Werkzeug's FileStorage data structure http://werkzeug.pocoo.org/docs/0.11/datastructures/#werkzeug.datastructures.FileStorage
Get SQLite running in some kind of a RAM filesystem. This will allow data to also be changed faster.
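Another option, described in the linked duplicates, is simply to load the DataFrame once at startup into a module-level variable so every request reuses it. A rough sketch, with the file name and route made up, and reasonable for a single-user local app:
import pandas as pd
from flask import Flask

app = Flask(__name__)

# Read the CSV once when the process starts; requests then reuse the
# same in-memory DataFrame instead of re-parsing the file each time.
MATRIX = pd.read_csv("matrix.csv")   # placeholder file name

@app.route("/summary")
def summary():
    # Example computation whose result goes back to the browser as JSON.
    return MATRIX.describe().to_json()

if __name__ == "__main__":
    app.run()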

Is it safe to use MD5 checksums to search for duplicate files across multiple hard drives? [duplicate]

This question already has answers here:
Finding duplicate files and removing them
(10 answers)
Closed 8 years ago.
I've been tasked with consolidating about 15 years of records from a laboratory, most of which is either student work or raw data. We're talking 100,000+ human-generated files.
My plan is to write a Python 2.7 script that will map the entire directory structure, create a checksum for each file, and then flag duplicates for deletion. I'm expecting probably 10-25% duplicates.
My understanding is that MD5 collisions are possible, theoretically, but so unlikely that this is essentially a safe procedure (let's say that if 1 collision happened, my job would be safe).
Is this a safe assumption? In case implementation matters, the only Python libraries I intend to use are:
hashlib for the checksum;
sqlite for databasing the results;
os for directory mapping
The probability of finding an md5 collision between two files by accident is:
0.000000000000000000000000000000000000002938735877055718769921841343055614194546663891
For comparison, the probability of getting hit by a 15 km asteroid is 0.00000002. So I'd say yes, it is safe.
Backing up the files and testing the script well remains good advice; human mistakes and bugs are far more likely to happen.
Recent research on MD5 collisions may have worried you, because in 2013 algorithms were published that can generate an MD5 collision in about a second on an ordinary computer. However, I assure you that this does not nullify the use of MD5 for checking file integrity and duplication. It is highly unlikely that you'll get two normal-use files with the same hash unless, obviously, you're deliberately looking for trouble and craft binary files with the same hash. If you're still paranoid, I advise you to use a hash function with a larger output, such as SHA-512.
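For reference, a rough sketch of the workflow the question describes (os for walking the tree, hashlib for MD5, sqlite3 for the results); the paths, table layout, and chunk size are assumptions, and the snippet is written for Python 3 but adapts easily to 2.7:
import hashlib
import os
import sqlite3

ROOT = "/path/to/records"     # placeholder root directory to scan
DB_PATH = "checksums.db"      # placeholder database file

def md5_of_file(path, chunk_size=1 << 20):
    # Hash in 1 MB chunks so large files never need to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

conn = sqlite3.connect(DB_PATH)
conn.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, md5 TEXT)")

for dirpath, dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        full_path = os.path.join(dirpath, name)
        conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?)",
                     (full_path, md5_of_file(full_path)))
conn.commit()

# Any checksum that appears more than once flags a set of duplicate candidates.
for md5, count in conn.execute(
        "SELECT md5, COUNT(*) FROM files GROUP BY md5 HAVING COUNT(*) > 1"):
    print(md5, count)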
