This question already has answers here:
Store large data or a service connection per Flask session (1 answer)
Are global variables thread-safe in Flask? How do I share data between requests? (4 answers)
Closed 4 years ago.
I'm writing a "webapp" for my own personal use, meant to be run with my own computer as a server. It's basically a nice interface for data visualization. This app needs to manipulate large matrices - about 100MB - in Python, and return the computation results to the browser for visualization. Currently I'm just storing the data in a csv file and loading it into pandas every time I want to use it, but this is very slow (about 15 seconds). Is there a way to have this object (a pandas.DataFrame) persist in memory, or does this not make sense?
I tried memcached and I do not think it is appropriate here. I also tried Redis, but reading from the cache is the same speed as reading the file if I store each matrix row under its own key, and if I store it all as one string under a single key, then reconstructing the array from the string is just as slow as reading it from the csv file. So no gains either way.
Considering the app is supposed to be run on your own computer, there are two options you could try:
Use Werkzeug's FileStorage data structure: http://werkzeug.pocoo.org/docs/0.11/datastructures/#werkzeug.datastructures.FileStorage
Get SQLite running on some kind of RAM filesystem, which also lets the data be changed faster (a sketch of that option follows).
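For the second option, a minimal sketch, assuming a Linux tmpfs mount such as /dev/shm (already RAM-backed on most distributions); the database path and table name are hypothetical:

import sqlite3
import pandas as pd

DB_PATH = "/dev/shm/matrices.db"   # the file lives in RAM, so reads skip the disk

def save_matrix(df):
    # Write the DataFrame into a table, replacing any previous copy.
    with sqlite3.connect(DB_PATH) as con:
        df.to_sql("matrix", con, if_exists="replace", index=False)

def load_matrix():
    # Read the whole table back into a DataFrame.
    with sqlite3.connect(DB_PATH) as con:
        return pd.read_sql("SELECT * FROM matrix", con)

Note that sqlite3.connect(':memory:') only lives as long as a single connection, which is why a file on a RAM-backed mount is closer to what this answer suggests.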
Related
I have a situation where multiple sources will need to read from the same (small) data source, possibly at the same time. For example, multiple different computers calling a function that needs to read from an external data source (e.g. an Excel file). Since multiple different sources are involved, I cannot simply read from the data source once and pass it into the function; it must be loaded inside the function.
Is there a data source that can handle this effectively? A pandas DataFrame was an acceptable format for the information that needs to be read, so I tried storing that DataFrame in an sqlite3 database, since according to the sqlite3 website, sqlite3 databases can handle concurrent reads. Unfortunately, it fails too often; I tried multiple different iterations and simply could not get it to work.
Is there another data format/source that would work/be effective? I scoured the internet for whether something as simple as an Excel file plus the pandas read_excel function could handle this kind of concurrency, but I could not find any information. I did try an experiment of using a multiprocessing pool to simultaneously load the same very large (roughly one minute to load) Excel file, and it did not crash (a sketch of that experiment is below). But of course, that is not exactly a perfect experiment.
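For reference, the experiment was along these lines (a hedged sketch only; the path and worker count are hypothetical):

import pandas as pd
from multiprocessing import Pool

PATH = "shared_data.xlsx"

def load(_):
    # Each worker process opens and parses the same file independently.
    return pd.read_excel(PATH).shape

if __name__ == "__main__":
    with Pool(4) as pool:
        # Four simultaneous reads of the same workbook; no crash was observed.
        print(pool.map(load, range(4)))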
Thanks!
You can try using openpyxl's read-only mode. It uses a generator instead of loading the whole file.
Also take a look at processing large xlsx file in python
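A minimal sketch of the read-only mode mentioned above (the workbook path is hypothetical):

from openpyxl import load_workbook

wb = load_workbook("shared_data.xlsx", read_only=True)  # does not parse the whole file up front
ws = wb.active
for row in ws.iter_rows(values_only=True):              # rows are yielded lazily, one at a time
    print(row)
wb.close()  # read-only workbooks keep the file handle open until closed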
This question already has answers here:
Saving an Object (Data persistence) (5 answers)
easy save/load of data in python (7 answers)
Closed 5 years ago.
When I run a program for the second time, how can I make it remember the variables that have changed?
So for the first time it is run:
>>> foo = ["bar", "foo"]
>>> foo.append(raw_input())
And then the second time it is run it remembers what foo[2] is without the user having to put it in again.
You can write data to a file, and read that data back in on subsequent runs. (If the file doesn't exist, or if it is zero length, let the program start from scratch.) This approach is quite practical if you only need to save one or two items of data.
Slightly more generally, persistence is a term commonly used to refer to maintaining data across multiple executions of a program.
See, for example, persistence for Python 2.7 and persistence for Python 3; both describe pickling and marshalling data. Use functions like dumps() to serialize data for storage, and loads() to de-serialize it. This approach is practical if you save lots of data or complicated data structures.
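A minimal sketch of that approach, reusing the foo list from the question (the file name is hypothetical):

import os
import pickle

STATE_FILE = "state.pickle"

# Start from scratch if the file is missing or empty.
if os.path.exists(STATE_FILE) and os.path.getsize(STATE_FILE) > 0:
    with open(STATE_FILE, "rb") as f:
        foo = pickle.loads(f.read())   # de-serialize the saved list
else:
    foo = ["bar", "foo"]

foo.append(input())                    # raw_input() on Python 2

with open(STATE_FILE, "wb") as f:
    f.write(pickle.dumps(foo))         # serialize and store for the next run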
I don't really understand what you are asking. But if I had to guess, you have to bind the values you want to certain variables, like you do; the second time you run it, it remembers the variable's value. Or you can save it to a file and read it back from there.
I have previously saved a dictionary which maps image_name -> list of feature vectors, with the file being ~32 GB. I have been using cPickle to load the dictionary, but since I only have 8 GB of RAM, this process takes forever. Someone suggested using a database to store all the info and reading from that, but would that be a faster/better solution than reading the file from disk? Why?
Use a database, because it lets you query only the entries you need instead of loading the whole structure into RAM, so lookups are faster. I've done this before. I would suggest against using cPickle. What specific implementation are you using?
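As a hedged sketch of what that could look like with SQLite and NumPy float32 vectors (table, column and file names are hypothetical):

import sqlite3
import numpy as np

con = sqlite3.connect("features.db")
con.execute("CREATE TABLE IF NOT EXISTS features (image_name TEXT, vector BLOB)")
con.execute("CREATE INDEX IF NOT EXISTS idx_image ON features (image_name)")

def add_vectors(image_name, vectors):
    # One row per feature vector; only the raw bytes go into the BLOB column.
    con.executemany(
        "INSERT INTO features VALUES (?, ?)",
        [(image_name, np.asarray(v, dtype=np.float32).tobytes()) for v in vectors],
    )
    con.commit()

def get_vectors(image_name):
    # Pulls back only the rows for one image, so the full 32 GB never has to fit in RAM.
    rows = con.execute("SELECT vector FROM features WHERE image_name = ?", (image_name,))
    return [np.frombuffer(blob, dtype=np.float32) for (blob,) in rows]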
My web app asks users 3 questions and simply writes the answers to a file: a1, a2, a3. I also have a real-time visualization of the average of the data (read in real time from the file).
Must I use a database to ensure that no (or minimal) information is lost? Is it possible to produce a queue of reads/writes? (Since the files are small, I am not too worried about the execution time of each call.) Does Python/Flask already take care of this?
I am quite experienced in Python itself, but not in this area (with Flask).
I see a few solutions:
read /dev/urandom a few times, compute the SHA-256 of the bytes and use the digest as a file name; a collision is extremely improbable
use Redis and a command like LPUSH (using it from Python is very easy); then RPOP from the right end of the list, and there's your queue (a sketch follows)
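A minimal sketch of the Redis queue idea, assuming the redis-py client and a local Redis server (the key name is hypothetical):

import json
import redis

r = redis.Redis()          # connects to localhost:6379 by default
QUEUE_KEY = "answers"

def enqueue(a1, a2, a3):
    # Push onto the left end of the list.
    r.lpush(QUEUE_KEY, json.dumps([a1, a2, a3]))

def dequeue():
    # Pop from the right end: oldest item first, i.e. a FIFO queue.
    item = r.rpop(QUEUE_KEY)
    return json.loads(item) if item is not None else None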
This question already has answers here:
In-memory size of a Python structure (7 answers)
Closed 8 years ago.
So clearly there cannot be unlimited memory in Python. I am writing a script that creates lists of dictionaries. But each list has between 250K and 6M objects (there are 6 lists).
Is there a way to actually calculate (possibly based on the RAM of the machine) the maximum memory and the memory required to run the script?
The actual issue I came across:
While running one of the scripts, which populates a list with 1.25-1.5 million dictionaries, it simply stops when it hits 1.227..., but returns no error, let alone a MemoryError. So I am not even sure this is a memory limit. I have print statements so I can watch what is going on, and it seems to buffer forever, as nothing is printing, even though up until that specific section the code was running a couple of thousand lines per second. Any ideas as to what is making it stop? Is this memory or something else?
If you have that many objects to store, you need to store them on disk, not in memory. Consider using a database.
If you import the sys module you will have access to the function sys.getsizeof(). You will have to look at each object in the list and, for each dictionary, add up the size of every key and value. For more on this, see this previous question: In-memory size of a Python structure.
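A minimal sketch of that bookkeeping (a rough lower bound, since shared and nested objects are not handled):

import sys

def approximate_size(list_of_dicts):
    total = sys.getsizeof(list_of_dicts)          # the list object itself
    for d in list_of_dicts:
        total += sys.getsizeof(d)                 # each dict object
        for key, value in d.items():
            total += sys.getsizeof(key) + sys.getsizeof(value)
    return total

records = [{"id": i, "name": "row %d" % i} for i in range(1000)]
print(approximate_size(records), "bytes, roughly")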