Import and work with a large dataset (Python beginners)

Since I couldn't find the best way to deal with my issue, I came here to ask.
I'm a beginner with Python, but I have to handle a large dataset.
However, I don't know the best way to handle the "Memory Error" problem.
I already have a 64-bit Python 3.7.3 installation.
I saw that we can use TensorFlow, specify chunks in the pandas read instruction, or use the Dask library, but I don't know which one best fits my problem, and as a beginner it's not very clear to me.
I have a huge dataset (over 100M observations), and I don't think reducing the dataset would lower the memory usage by much.
What I want to do is test multiple ML algorithms with train and test samples. I don't know how to deal with the problem.
Thanks!

This question is high level, so I'll provide some broad approaches for reducing memory usage in Dask (a minimal sketch follows this list):
Use a columnar file format like Parquet so you can leverage column pruning
Use column dtypes that require less memory, e.g. int8 instead of int64
Strategically persist in memory, where appropriate
Use a cluster that's sized well for your data (analyzing 2GB of data requires a different amount of memory than 2TB)
Split the data into multiple files so it's easier to process in parallel
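For illustration, here is a minimal sketch (not part of the original answer) of the first three points; the file path and column names are hypothetical:

import dask.dataframe as dd

# Column pruning: read only the columns you actually need from a Parquet file.
ddf = dd.read_parquet(
    "observations.parquet",
    columns=["user_id", "score"],
)

# Downcast to smaller dtypes to shrink the memory footprint.
ddf = ddf.astype({"user_id": "int32", "score": "float32"})

# Check how much memory the DataFrame actually requires.
print(ddf.memory_usage(deep=True).sum().compute())

# Persist in memory only if it fits comfortably on your machine or cluster.
ddf = ddf.persist()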
Your data has 100 million rows, which isn't that big (unless it has thousands of columns). Big data typically has billions or trillions of rows.
Feel free to ask more specific questions and I can give more targeted advice. You can provide the specs of your machine / cluster, the memory requirements of the DataFrame (via ddf.memory_usage(deep=True)), and the actual code you're trying to run.

Related

tsfresh efficient feature set extraction is stuck at 0%

I am using the tsfresh library to extract features from my time-series dataset.
My problem is that I can only use the setting for the minimal feature set (MinimalFCParameters) because even the efficient one (EfficientFCParameters) is always stuck at 0% and never calculates any features.
The data is pretty large (over 40 million rows, 100k windows), but this is even the smallest dataset I am going to use. I am using a compute cluster, so computing resources should not be the issue. I also tried the n_jobs parameter of the extract_features method (n_jobs=32). Finally, as suggested by the tsfresh website, I used dask instead of pandas for the input data frame, but with no success either.
My questions are: Is there anything else I can try? Or are there any other libraries I could use?
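For reference, a minimal, hypothetical sketch of the call pattern described above (tiny made-up data, only to illustrate the setup, not a fix):

import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters

# Tiny stand-in for the real long-format time-series table.
timeseries = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2],
    "time":  [0, 1, 2, 0, 1, 2],
    "value": [0.1, 0.4, 0.2, 0.9, 0.7, 0.8],
})

features = extract_features(
    timeseries,
    column_id="id",
    column_sort="time",
    column_value="value",
    default_fc_parameters=MinimalFCParameters(),  # the only setting that completes for the author
    n_jobs=0,  # the question uses n_jobs=32; 0 disables multiprocessing for this tiny example
)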

How to figure out if a modin dataframe is going to fit in RAM?

I'm learning how to work with large datasets, so I'm using modin.pandas.
I'm doing some aggregation, after which a 50GB dataset is hopefully going to become closer to 5GB in size - and now I need to check: if the df is small enough to fit in RAM, I want to cast it to pandas and enjoy a bug-free, reliable library.
So, naturally, the question is: how do I check that? .memory_usage(deep=True).sum() tells me how much the whole df uses, but I can't possibly know from that one number how much of it is in RAM and how much is in swap - in other words, how much space I need for casting the df to pandas. Are there other ways? Am I even right to assume that some partitions live in RAM while others live in swap? How do I calculate how much data will flood the RAM when I call ._to_pandas()? Is there a hidden .__memory_usage_in_swap_that_needs_to_fit_in_ram() of some sort?
Am I even right to assume that some partitions live in RAM while others live in swap?
Modin doesn't specify whether data should be in RAM or swap.
On Ray, it uses ray.put to store partitions. ray.put doesn't give any guarantees about where the data will go. Note that Ray spills data blocks to disk when they are too large for its in-memory object store. You can use ray memory to get a summary of how much of each storage Ray is using.
On Dask, modin uses dask.Client.scatter to store partition data, and that also doesn't give any guarantees about where the data will go. I don't know of any way to figure out how much of the stored data is really in RAM.
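As a rough heuristic (an assumption on my part, not an exact answer to the swap question), you can compare the frame's total footprint with the RAM currently available before materializing it; the file name and the 0.8 safety margin below are arbitrary:

import modin.pandas as mpd
import psutil

# Hypothetical aggregated result standing in for the ~5GB frame.
df = mpd.read_parquet("aggregated.parquet")

needed = int(df.memory_usage(deep=True).sum())   # bytes the materialized frame would need
available = psutil.virtual_memory().available    # bytes of RAM free right now

if needed < 0.8 * available:                     # keep a safety margin
    pdf = df._to_pandas()
else:
    print("Not enough headroom to materialize the frame in pandas.")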

How to store a set of arrays for deep learning not consuming too much memory (Python)?

I'm trying to do research in which the observations of my dataset are represented by matrices (arrays composed of numbers, similar to how images for deep learning are represented, but mine are not images) of different shapes.
What I've already tried is to write those arrays as lists in one column of a pandas dataframe and then save this as a csv/excel file. After that I planned just to load such a file, convert those lists back to arrays of appropriate shapes, and then convert the set of such arrays to a tensor which I would finally use for training the deep model in keras.
But it seems this method is extremely inefficient, because only 1/6 of my dataset already occupies about 6 GB of memory (pandas saved as csv), which is huge, and I won't be able to load it into RAM (I'm using google colab to run my experiments).
So my question is: is there any other way of storing a set of arrays of different shapes which won't occupy so much memory? Maybe I can store tensors directly somehow? Or maybe there are some ways to store pandas dataframes in some compressed types of files which are not so heavy?
Yes. Avoid using csv/excel for big datasets; there are plenty of data formats out there. For this case I would recommend a compressed format such as pd.DataFrame.to_hdf, pd.DataFrame.to_parquet, or pd.DataFrame.to_pickle (a short sketch follows below).
There are even more formats to choose from, and compression options within these functions (for example, to_hdf takes a complevel argument that you can set to 9).
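A minimal sketch of these writers, assuming pyarrow (or fastparquet) and pytables are installed; the data and file names are made up:

import numpy as np
import pandas as pd

# Small stand-in frame; the real data would be the flattened arrays.
df = pd.DataFrame(
    np.random.rand(1_000, 50).astype("float32"),
    columns=[f"f{i}" for i in range(50)],
)

# Parquet with compression (needs pyarrow or fastparquet).
df.to_parquet("arrays.parquet", compression="snappy")

# HDF5 with a high compression level (needs pytables).
df.to_hdf("arrays.h5", key="data", complevel=9)

# Pickle, gzip-compressed.
df.to_pickle("arrays.pkl.gz", compression="gzip")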
Are you storing purely (or mostly) continuous variables? If so, maybe you could reduce the precision of these variables (e.g., from float64 to float32) if you don't need such an accurate value per data point (a tiny example follows below).
There are many ways to reduce the size of the data held in memory, and what's written above is only one of them. Maybe you could break the process you've described into smaller chunks (i.e., storage of data, extraction of data) and work on each chunk/stage individually, which hopefully will reduce the overall size of your data!
Otherwise, you could perhaps take advantage of database management systems (SQL or NoSQL, depending on which fits best), which might be better, though querying that amount of data might constitute yet another issue.
I'm by no means an expert in this; I'm just describing how I've dealt with excessively large datasets (similar to what you're currently experiencing) in the past, and I'm pretty sure someone here will give you a more definitive answer than my 'a little of everything' one. All the best!
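For completeness, a tiny, hypothetical example of the precision reduction mentioned above:

import numpy as np
import pandas as pd

# Hypothetical frame with float64 columns.
df = pd.DataFrame({"a": np.random.rand(1_000), "b": np.random.rand(1_000)})
print(df.memory_usage(deep=True).sum())   # footprint with float64

# Roughly halve the footprint of continuous columns if float32 precision is enough.
float_cols = df.select_dtypes(include="float64").columns
df[float_cols] = df[float_cols].astype("float32")
print(df.memory_usage(deep=True).sum())   # footprint after downcasting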

Most efficient way to store pandas Dataframe on disk for frequent access?

I am working on an application which generates a couple of hundred datasets every ten minutes. These datasets consist of a timestamp, and some corresponding values from an ongoing measurement.
(Almost) Naturally, I use pandas dataframes to manage the data in memory.
Now I need to do some work with historical data (e.g. averaging or summation over days/weeks/months etc., but not limited to that), and I need to update those accumulated values rather frequently (ideally also every ten minutes), so I am wondering which would be the most access-efficient way to store the data on disk.
So far I have been storing the data for every ten-minute interval in a separate csv file and then reading the relevant files into a new dataframe as needed. But I feel that there must be a more efficient way, especially when it comes to working with a larger number of datasets. Computation cost and memory are not the central issue, as I am running the code on a comparatively powerful machine, but I still don't want to (and most likely can't afford to) read all the data into memory every time.
It seems to me that the answer should lie within the built-in serialization functions of pandas, but from the docs and my google findings I honestly can't really tell which would fit my needs best.
Any ideas how I could manage my data better?
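One possible direction (an assumption on my part, not a definitive answer) is to append each ten-minute batch to a single table-format HDF5 store and then read back only the rows needed for a given aggregation window; the names and paths below are made up:

import numpy as np
import pandas as pd

# Hypothetical ten-minute batch of measurements with a timestamp index.
batch = pd.DataFrame(
    {"value": np.random.rand(600)},
    index=pd.date_range("2024-01-01 00:00", periods=600, freq="s"),
)

# Append each batch to one table-format store instead of writing one csv per interval.
with pd.HDFStore("measurements.h5") as store:
    store.append("data", batch)

# Later, read back only the time window needed for an aggregation.
recent = pd.read_hdf("measurements.h5", "data", where="index >= '2024-01-01'")
print(recent["value"].mean())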

(Py)Spark: df.sample(0.1) doesn't affect runtime of df.toPandas()

I'm working on a dataset of ~100k lines in PySpark, and I want to convert it to pandas. The data on web clicks contains string variables and is read from a .snappy.orc file in an Amazon S3 bucket via spark.read.orc(...).
The conversion is running too slow for my application (for reasons very well explained here on stackoverflow), and thus I've tried to downsample my Spark DataFrame to one tenth - the dataset is so large that the statistical analysis I need to do is probably still valid. However, I need to repeat the analysis for 5000 similar datasets, which is why speed is a concern.
What surprised me is that the running time of df.sample(False, 0.1).toPandas() is exactly the same as df.toPandas() (approx. 180s), so I don't get the reduction in running time I was hoping for.
I'm suspecting it may be a question of putting in a .cache() or .collect(), but I can't figure out a proper way to fit it in.
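A hedged sketch of one thing to try: sample() is lazy, so the full scan still happens when toPandas() triggers execution; if the sampled frame is reused across analyses, caching it once (at the cost of one materializing pass) may help. The path below is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input matching the question's setup.
df = spark.read.orc("s3://bucket/path/clicks.snappy.orc")

# Sample once, cache the result, and force materialization with count().
sampled = df.sample(withReplacement=False, fraction=0.1).cache()
sampled.count()

# Subsequent actions, including the conversion, reuse the cached sample.
pdf = sampled.toPandas()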
