I've noticed that Python handles memory in a way I didn't expect. I have a huge dataset stored in a 70 GB file. I usually load this file with np.loadtxt() and do some math on it. I have 32 GB of RAM, and I've noticed that when the data is loaded into memory, around 25 GB of RAM is used. But apparently this value can change. For example, once while I was processing the data I got a memory error. After the error the dataset was still in memory, and I verified that I could access it, but only around 5 GB of RAM was used. How is this possible? And how can I force Python to use as little memory as possible with my data, so that I can run other applications simultaneously?
Moreover, sometimes I do calculations that return a new dataset as large as the original, so that in the end I have a number of large datasets in memory, yet the total RAM used does not change. Are these variables written to the hard disk in some way? If so, why do I sometimes get memory errors?
(BTW, I use Spyder as my IDE, if it matters.)
Related
I'm trying to load a large CSV file into a pandas dataframe. The CSV is rather large: a few GB.
The code works, but rather slowly; even slower than I would expect. If I take only 1/10th of the CSV, the job is done in about 10 seconds; if I try to load the whole file, it takes more than 15 minutes. I would expect this to take roughly 10 times as long, not ~100 times.
The amount of RAM used by Python never rises above exactly 1,930.8 MB (there is 16 GB in my system).
It seems to be capped at this, making me think there is some sort of limit on how much RAM Python is allowed to use. However, I never set such a limit, and everyone online says "Python has no RAM limit".
Could it be that the amount of RAM Python is allowed to use is limited somewhere? And if so, how do I remove that limit?
The problem is not just how much RAM it can use, but also how fast your CPU is. Loading a very large CSV file is very time-consuming if you just use plain pandas. Here are a few options:
You can try other libraries that are made to work with big data. This tutorial shows some of them. I like dask; its API is like pandas', as in the sketch below.
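A minimal sketch of what that looks like, assuming dask is installed; the file name and column name are placeholders, not from the question:

import dask.dataframe as dd

# Lazily read the large CSV; nothing is loaded until .compute() is called.
ddf = dd.read_csv("big_file.csv")          # placeholder path

# Pandas-like API, evaluated chunk-by-chunk and in parallel.
result = ddf["some_value"].mean().compute()
print(result)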
If you have a GPU, you can use rapids (also mentioned in the link). Rapids is really a game changer: any computation on the GPU is significantly faster. One drawback is that not all pandas features are implemented yet, but that only matters if you actually need them.
The last option, although not recommended, is to process your file in batches: e.g., use a loop, load only the first 100K rows, process them, save the result, then continue until the file ends (see the sketch below). This is still very time-consuming, but it is the most straightforward way.
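With plain pandas, that batching approach can use the chunksize argument of read_csv; a rough sketch with a placeholder file name and a trivial stand-in for the real processing:

import pandas as pd

results = []
# Read the CSV 100,000 rows at a time instead of all at once.
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    # Stand-in for the real per-batch processing and saving.
    results.append(chunk["some_value"].sum())

print(sum(results))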
I hope it helps.
I am working with a 1.7 GB dataset in a Python Jupyter (IPython) notebook. I read in the .csv that I am working with using pd.read_csv, and my RAM usage shoots up to about 7 GB.
When I tried to plot the time series of one of my columns from the dataset, my RAM shot up to nearly 16GB. I was worried about the performance of my laptop, so I decided to interrupt the kernel.
My question is two-fold:
If I let the cell run, would the plot eventually have shown up? Or is it unable to plot my chart because it reached its RAM limit?
My data is a time series of second-by-second data over the course of a month, and it contains mostly zeroes. Should I remove these zeroes from the data, and would that make it easier to plot?
It will eventually show up. Even if it uses more than 16 GB, the process will use the pagefile to get more virtual memory; this is called paging. Paging is an important part of memory management in modern operating systems: it uses secondary storage to let programs exceed the size of available physical memory. When a computer runs out of RAM, the operating system (OS) moves pages of memory over to the hard disk to free up RAM for other processes, so the system does not simply run out of memory and crash.
For more information: http://searchservervirtualization.techtarget.com/definition/memory-paging
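If you want to watch this happen, one way (assuming the third-party psutil package is installed) is to compare the process's resident size with its virtual size while the plot is being built:

import os
import psutil  # third-party: pip install psutil

proc = psutil.Process(os.getpid())
mem = proc.memory_info()
print(f"resident (physical RAM):      {mem.rss / 2**30:.2f} GiB")
print(f"virtual (includes paged-out): {mem.vms / 2**30:.2f} GiB")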
It depends on whether or not you need the zeroes. Dropping or thinning them may make the data easier to plot or visualize, depending on how much they add.
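If most of the points are zero, thinning the series before plotting avoids handing millions of points to the plotting library. A sketch with synthetic data, since the real column layout is unknown:

import numpy as np
import pandas as pd

# Synthetic stand-in: one month of per-second values, mostly zeroes.
idx = pd.date_range("2024-01-01", periods=30 * 24 * 3600, freq="s")
values = np.zeros(len(idx))
values[::3600] = np.random.rand(len(idx) // 3600)   # a few non-zero points
s = pd.Series(values, index=idx)

# Downsample to one point per minute before plotting (~43k points instead of ~2.6M).
s.resample("1min").max().plot()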
I have some data stored in a tree in memory, and I regularly store the tree to disk using pickle.
Recently I noticed that the program was using a lot of memory, so I checked the saved pickle file; it is around 600 MB. Then I wrote another small test program loading the tree back into memory, and I found that it takes nearly 10 times as much memory (5 GB) as the size on disk. Is that normal? And what's the best way to avoid it?
No, it's not normal. I suspect your tree is bigger than you think. Write some code to walk it and add up all the space used, and count the nodes (see the sketch below).
See memory size of Python data structure
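A rough sketch of such a walk, assuming each node exposes a children iterable and a value payload; the attribute names are guesses, so adapt them to your tree class, and note that getsizeof does not follow nested containers:

import sys

def tree_size(node):
    """Return (node_count, approximate_bytes) for the subtree rooted at node."""
    count = 1
    size = sys.getsizeof(node) + sys.getsizeof(node.value)
    for child in node.children:
        child_count, child_size = tree_size(child)
        count += child_count
        size += child_size
    return count, size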
Also, what exactly are you asking? Are you surprised that a 600 MB data structure on disk is 5 GB in memory? That's not particularly surprising. Pickle stores a compact serialized form of the data, so you expect it to be smaller on disk. It's smaller by a factor of roughly 10, which is pretty good.
If you're surprised by the size of your own data that's another thing.
I have a CSV file that is only about 100 MB, and I have plenty of memory, about 8 GB. At runtime I don't have more than, at a conservative guess, 10 pandas DataFrames that contain the whole CSV file, so surely no more than 2 GB of memory should be needed. getsizeof(dataframe) also does not return a huge number.

Then, in a function, I do the following: find an interesting value, say an outlier in the motor current, and plot 10 seconds (about 300 data points) around this point with bokeh, plus 4 other graphs for motor voltage, motor speed and so on. This function plots about 50 graphs the same way in a for loop. Variables are defined locally, so they are overwritten on each iteration.

Now the big question: why does my memory fill up further on each iteration? Sometimes it reaches about 7 GB and I get a memory error. I don't see how my data gets that big internally. The same thing even happens with CSV files of 10 MB.
Python did not free the memory even after the function returned, and del variablename didn't free it either.
This did the trick:
import gc
gc.collect()
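In context, that means collecting at the end of every iteration of the plotting loop; a rough sketch, where the file name, the window list and the bokeh plotting itself are placeholders for the real code:

import gc

import pandas as pd

df = pd.read_csv("measurements.csv")                           # placeholder file name
windows = [slice(i, i + 300) for i in range(0, 15_000, 300)]   # stand-in for the ~50 regions

for window in windows:
    df_slice = df.iloc[window]        # local data for this plot
    # ... build and save the bokeh figures for this slice here ...
    del df_slice                      # drop the local reference
    gc.collect()                      # reclaim memory before the next iteration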
I have around 60 files, each containing around 900,000 lines, where each line is 17 tab-separated float numbers. For each line I need to do some calculation using the corresponding lines from all 60 files, but because of their huge size (each file is about 400 MB) and my limited computing resources, it takes a very long time. Is there any way to do this faster?
It depends on how you process them. If you have enough memory, you can read all the files first and convert them to Python data structures, then do your calculations.
If your files don't fit into memory, probably the easiest way is to use some distributed computing framework (Hadoop or other, lighter alternatives).
Another, smaller improvement could be to use the fadvise Linux system call to tell the operating system how you will be using the file (sequential reading or random access), so it can optimize file access accordingly.
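In Python 3 on Linux this is exposed as os.posix_fadvise; a minimal sketch with a placeholder file name:

import os

# Tell the kernel we will read this file sequentially (Linux only).
fd = os.open("data_file_00.txt", os.O_RDONLY)          # placeholder file name
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)   # length 0 means "the whole file"

with os.fdopen(fd) as f:
    for line in f:
        values = [float(x) for x in line.split("\t")]
        # ... per-line calculation on the 17 floats goes here ...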
If the calculations fit into common libraries such as numpy or numexpr, which contain a lot of optimizations, you can use them (this helps if your computations currently use unoptimized code to process the data).
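For instance, numexpr evaluates whole-array expressions in one multi-threaded pass without creating large temporaries; a quick sketch with random stand-in data of the same shape as one of your files:

import numexpr as ne  # third-party: pip install numexpr
import numpy as np

# Stand-in arrays with the shape of one file (900,000 rows of 17 floats).
a = np.random.rand(900_000, 17).astype(np.float32)
b = np.random.rand(900_000, 17).astype(np.float32)

# The whole expression is compiled and run in parallel, element-wise.
result = ne.evaluate("a * b + 2 * a")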
If "corresponding lines" means "first lines of all files, then second lines of all files etc", you can use `itertools.izip:
# cat f1.txt
1.1
1.2
1.3
# cat f2.txt
2.1
2.2
2.3
# python
>>> from itertools import izip
>>> files = map(open, ("f1.txt", "f2.txt"))
>>> lines_iterator = izip(*files)
>>> for lines in lines_iterator:
... print lines
...
('1.1\n', '2.1\n')
('1.2\n', '2.2\n')
('1.3\n', '2.3\n')
>>>
A few options:
1. Just use the memory
You have 17 x 900,000 = 15.3 M floats per file. Storing these as doubles (as numpy usually does) will take roughly 120 MB of memory per file. You can halve this by storing the floats as float32, so that each file takes roughly 60 MB. With 60 files at 60 MB each, you have 3.6 GB of data.
This amount is not unreasonable if you use 64-bit python. If you have less than, say, 6 GB of RAM in your machine, it will result in a lot of virtual memory swapping. Whether or not that is a problem depends on the way you access data.
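For reference, loading one file directly into float32 with numpy could look like this (the file name is a placeholder):

import numpy as np

# Parse one tab-separated file straight into float32 (~60 MB instead of ~120 MB).
data = np.loadtxt("data_file_00.txt", dtype=np.float32, delimiter="\t")
print(data.shape)   # expected: (900000, 17)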
2. Do it row-by-row
If you can do it row by row, just read each file one row at a time. It is quite easy to keep 60 files open; that won't cause any problems. This is probably the most efficient method if you process the files sequentially: memory usage is next to nothing, and the operating system takes the trouble of reading the files.
The operating system and the underlying file system try very hard to be efficient in sequential disk reads and writes.
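A sketch of the row-by-row approach in Python 3, assuming the 60 files match a placeholder glob pattern:

import glob
from contextlib import ExitStack

paths = sorted(glob.glob("data_file_*.txt"))   # placeholder pattern for the 60 files

with ExitStack() as stack:
    files = [stack.enter_context(open(p)) for p in paths]
    for lines in zip(*files):                  # one corresponding line from each file
        rows = [[float(x) for x in line.split("\t")] for line in lines]
        # ... calculation using the 60 rows of 17 floats goes here ...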
3. Preprocess your files and use mmap
You may also preprocess your files so that they are not CSV but a binary format. That way each row will take exactly 17 x 8 = 136 or 17 x 4 = 68 bytes in the file. Then you can use numpy.memmap to map each file into an array of shape [N, 17]. You can handle the arrays as usual arrays, and numpy plus the operating system will take care of optimal memory management.
The preprocessing is required because the record length (number of characters on a row) in a text file is not fixed.
This is probably the best solution if your data access is not sequential. Then mmap is the fastest method, as it reads only the required blocks from disk when they are needed. It also caches the data, so that it uses the optimal amount of memory.
Behind the scenes this is a close relative of solution #1, with the exception that nothing is loaded into memory until it is required. The same limitation about 32-bit Python applies: it cannot do this, because it runs out of memory addresses.
The file conversion into binary is relatively fast and easy, almost a one-liner.
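A sketch of the conversion and of reading the result back with numpy.memmap (file names are placeholders):

import numpy as np

# One-off conversion: text -> raw native-endian float32 binary.
arr = np.loadtxt("data_file_00.txt", dtype=np.float32, delimiter="\t")
arr.tofile("data_file_00.bin")                 # 68 bytes per row, no separators

# Later: map the binary file without reading it all into RAM.
n_rows = arr.shape[0]                          # 900,000 in the question
mm = np.memmap("data_file_00.bin", dtype=np.float32, mode="r", shape=(n_rows, 17))
print(mm[1000])                                # touches only the blocks holding row 1000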