I am doing exploratory data analysis on my numeric data and I tried to run pandas-profiling, but I got an error while generating the report structure.
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_csv('mydatadata.csv')
print(df)
profile = ProfileReport(df)
profile.to_file(output_file="mydata.html")
and the error log looks like this
Summarize dataset: 99%|███████████████████████████████████████████████████████████████████████▌| 1144/1150 [46:07<24:03, 240.60s/it, Calculate cramers correlation]
C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas_profiling\model\correlations.py:101: UserWarning: There was an attempt to calculate the cramers correlation, but this failed. To hide this warning, disable the calculation (using df.profile_report(correlations={"cramers": {"calculate": False}}) If this is problematic for your use case, please report this as an issue: https://github.com/pandas-profiling/pandas-profiling/issues (include the error message: 'No data; observed has size 0.')
  warnings.warn(
Summarize dataset: 100%|██████████████████████████████████████████████████████████████████████████████████▋| 1145/1150 [46:19<17:32, 210.49s/it, Get scatter matrix]
Fail to allocate bitmap
Reason your code may have failed
If your code failed for the same reason as mine, you either:
tried making multiple profiles at the same time, or
tried making a profile of a dataset with a large number of variables.
Possible fix in your code
There is a workaround documented on the GitHub page for pandas-profiling under large datasets. It includes this example:
from pandas_profiling import ProfileReport
profile = ProfileReport(large_dataset, minimal=True)
profile.to_file("output.html")
Possible fix in the pandas-profiling source?
I got the exact same error. I tried looking it up, and it seemed to be coming from a memory leak in Matplotlib: the plots were not being properly released after they were created. I tried adding the following to the utils.py file within the visualization folder of pandas_profiling:
plt.clf()
plt.cla()
plt.close('all')
The code there originally had plt.close(), which I have found in the past to not be enough when making multiple plots back to back. However, I still got this error, which makes me think it may not be Matplotlib (or I missed it somewhere).
The minimal=True fix above may work sometimes. It still fails for me occasionally when my datasets are too big.
I, too, have run into that particular error case. Personal experimentation with configuration settings narrowed things down to the quantity of plots generated, or requested for generation, up to that point in execution. What varied in my experimentation was how many ProfileReport() profile generations required interaction graphs: the fewer interaction graphs plotted overall, the less frequently the error occurred.
Isolating processes via multiprocessing Pool.map() calls might help if you're only trying to generate a single profile and you absolutely need all of your interaction graphs, but that's a time- and/or RAM-greedy option that would require getting creative with instanced joins for pairing column values for interaction graphs from smaller DataFrames, i.e. more reports overall.
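To make that isolation idea concrete, here's a rough sketch (not anything from the library docs) that builds each report in its own worker process, borrowing the minimal=True option from the other answer; the CSV name is the asker's, and one worker per report is an arbitrary choice:
from multiprocessing import Pool

import pandas as pd
from pandas_profiling import ProfileReport

def build_report(csv_path):
    # each report is built in a separate worker process, so whatever memory
    # matplotlib / pandas-profiling holds on to is released when the worker exits
    df = pd.read_csv(csv_path)
    profile = ProfileReport(df, minimal=True)
    profile.to_file(csv_path.replace('.csv', '.html'))
    return csv_path

if __name__ == '__main__':
    with Pool(processes=1) as pool:
        pool.map(build_report, ['mydatadata.csv'])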
Regardless, since the documentation for ProfileReport configuration settings is terrible, here are the ones I've stumbled into figuring out and which you should probably investigate (a sketch of passing them follows the list):
vars.cat.words -- Boolean, triggers word cloud graph generation for string/categorical variables, probably don't need it.
missing_diagrams.bar -- Boolean, turn off if graph is unnecessary
missing_diagrams.matrix -- Boolean, turn off if graph is unnecessary
missing_diagrams.heatmap -- Boolean, turn off if graph is unnecessary
missing_diagrams.dendrogram -- Boolean, turn off if graph is unnecessary
correlations.* (everything that isn't pearson, if you don't need it) -- Boolean, turn off if the graph is unnecessary.
interactions.targets -- Python list of strings. Specifying one or more column names here will limit interaction graphs to just those involving these variables. This is probably what you're looking for if you just can't bear to drop columns prior to report generation.
interactions.continuous -- Boolean, turn off if you just don't want interactions anyway.
plot.image_format -- string, 'svg' or 'png'. I guess there's a difference in the size of the underlying data structures, which may or may not be related to the apparent memory leak the other answer was getting at, but go figure.
plot.dpi -- integer, specifies dpi of plotted images, and likewise may be related to an apparent memory leak.
plot.pie.max_unique -- integer, set to 0 to disable occasional pie charts being graphed, likewise may be related to apparent memory leak from graph plotting.
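For what it's worth, here's a hedged sketch of passing those settings as nested keyword overrides to ProfileReport, following the same pattern the cramers warning shows; exact option names can differ between pandas-profiling versions, and the target column name is just a placeholder:
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv('mydatadata.csv')  # the asker's dataframe

profile = ProfileReport(
    df,
    vars={"cat": {"words": False}},                   # skip word cloud generation
    missing_diagrams={"bar": False, "matrix": False,
                      "heatmap": False, "dendrogram": False},
    correlations={"cramers": {"calculate": False}},   # disable other methods the same way
    interactions={"targets": ["my_target_column"],    # placeholder column name
                  "continuous": True},
    plot={"image_format": "png", "dpi": 72},
)
profile.to_file("report.html")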
Good luck, and don't be afraid to try out other options like DataPrep. I wish I had before going down this rabbit hole.
Related
I have a significantly large dataset consisting of several thousand files spread among different directories. These files all have different formats and come from different sensors with different sampling rates. Basically, a mess. I created a Python module that is able to enter these folders, make sense of all this data, reformat it, and get it into a pandas DataFrame that I can use for effective and easy resampling, and that in general makes it easier to work with.
The problem is that the resulting DataFrame is big and takes a large amount of RAM. Loading several of these datasets leaves not enough memory available to actually train an ML model. And it is painfully slow to read the data.
So my solution is a two-part approach. First, I read the dataset into a big variable (a dict with nested pandas DataFrames), then compute a reduced, derived DataFrame with the information I actually need to train my model, and remove the dict variable from memory. Not ideal, but it works. However, further computations sometimes need re-reading the data, and as stated previously, that is slow.
Enter the second part. Before removing the dict from memory, I pickle it into a file. sklearn actually recommends using joblib, so that's what I use. Once the single file for the dataset is stored in the working directory, the reading stage is about 90% faster than reading the scattered data, most likely because it loads a single large file directly into memory rather than reading and reformatting thousands of files across different directories.
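In code, the pattern looks roughly like this (the reader function and column names here are stand-ins for my actual module):
from os.path import exists

import pandas as pd
from joblib import dump, load

CACHE = 'dataset.joblib'

def read_scattered_files():
    # stand-in for the module that walks the directories and builds the dict of DataFrames
    return {'sensor_a': pd.DataFrame({'speed': [1.0, 2.0], 'temp': [20.0, 21.0]})}

if exists(CACHE):
    data = load(CACHE)             # fast path: one large file
else:
    data = read_scattered_files()  # slow path: thousands of scattered files
    dump(data, CACHE)              # cache it for later runs

reduced = {name: df[['speed']] for name, df in data.items()}  # keep only what the model needs
del data                           # drop the big dict from memory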
Here's my problem: the same code, when reading the data from the scattered files, ends up using about 70% less RAM than when reading the pickled data. So, although reading the pickled file is faster, it ends up using much more memory. Has anybody experienced something like this?
Given that there are some access issues to the data (it is located on a network drive with some weird restrictions on user access) and the fact that I need to make it as user-friendly as possible for other people, I'm using a Jupyter notebook. My IT department provides a web tool with all the packages required to read the network drive out of the box and run Jupyter there, whereas running from a VM would require manual configuration of the network drive to access the data, and that part is not user-friendly. The Jupyter tool requires only login information, while the VM requires basic knowledge of Linux sysadmin.
I'm using Python 3.9.6. I'll keep trying to get an MWE that reproduces the situation. So far I have one that shows the opposite behaviour (loading the pickled dataset consumes less memory than reading it directly). That might be because of the particular structure of the dict with nested DataFrames.
MWE (warning: running this code will create a 4 GB file on your hard drive):
import numpy as np
import psutil
from os.path import exists
from os import getpid
from joblib import dump, load
## WARNING. THIS CODE SAVES A LARGE FILE INTO YOUR HARD DRIVE
def read_compute():
    if exists('df.joblib'):
        df = load('df.joblib')
        print('==== df loaded from .joblib')
    else:
        df = np.random.rand(1000000, 500)
        dump(df, 'df.joblib')
        print('=== df created and dumped')
    tab = df[:100, :10]
    del df
    return tab
table = read_compute()
print(f'{psutil.Process(getpid()).memory_info().rss / 1024 ** 2} MB')
With this, running without the df.joblib file in the pwd, I get
=== df created and dumped
3899.62890625 MB
And then, after that file is created, I restart the kernel and run the same code again, getting
==== df loaded from .joblib
1588.5234375 MB
In my actual case, with the format of my data, I have the opposite effect.
Hello dear Community,
I haven't found anything similar during my search and hope I haven't overlooked anything. I have the following issue:
I have a big dataset whose shape is 1352x121797 (1352 samples and 121797 time points). Now I have clustered these and would like to generate one plot for each cluster, in which every time series belonging to that cluster is plotted.
However, when using the matplotlib syntax it is extremely slow (and I'm not exactly sure where that comes from). Even after 5-10 minutes it hasn't finished.
import matplotlib.pyplot as plt
import pandas as pd
fig, ax = plt.subplots()
for index, values in subset_cluster.iterrows(): # One Cluster subset, dataframe of shape (11x121797)
ax.plot(values)
fig.savefig('test.png')
Even when inserting a break after ax.plot(values), it still doesn't finish. I'm using Spyder and thought that it might be because Spyder always renders the plot inline in the console.
However, when simply using the pandas Series method values.plot() instead of ax.plot(values), the plot appears and is saved within 1-2 seconds.
As I need the customization options of matplotlib to standardize all the plots and make them look a little bit prettier, I would love to use the matplotlib syntax. Does anyone have any ideas?
Thanks in advance
Edit: after trying around a little bit, it seems that the rendering is the time-consuming part. When run with the Agg backend (matplotlib.use('Agg')), the plot command runs through more quickly (if using plt.plot() instead of ax.plot()), but plt.savefig() then takes forever. Still, it should finish in a reasonable amount of time, right? Even for 121xxx data points.
Posting as an answer as it may help the OP or someone else: I had the same problem and found out that it was because the data I was using as the x-axis had dtype object, while the y-axis data was float64. After explicitly converting the object column to DateTime, plotting with Matplotlib went as fast as pandas' df.plot(). I guess pandas does a better job of understanding the data type when plotting.
OP, you might want to check whether the values you are plotting are of the right type, or whether, like me, you had some problems when loading the dataframe from file.
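A minimal sketch of that check and fix; the file and column names are hypothetical:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')                        # hypothetical file
print(df.dtypes)                                    # watch for an unexpected 'object' dtype

df['timestamp'] = pd.to_datetime(df['timestamp'])   # convert explicitly before plotting

fig, ax = plt.subplots()
ax.plot(df['timestamp'], df['value'])
fig.savefig('test.png')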
I'm learning how to plot things (CSV files) in Python, using import matplotlib.pyplot as plt.
Column1;Column2;Column3;
1;4;6;
2;2;6;
3;3;8;
4;1;1;
5;4;2;
I can plot the one above with plt.plotfile('test0.csv', (0, 1), delimiter=';'), which gives me the figure below.
Do you see the axis labels, column1 and column2? They are in lower case in the figure, but in the data file they begin with upper case.
Also, I tried plt.plotfile('test0.csv', ('Column1', 'Column2'), delimiter=';'), which does not work.
So it seems matplotlib.pyplot works only with lowercase names :(
Adding this issue to this other one, I guess it's time to try something else.
As I am pretty new to plotting in Python, I would like to ask: Where should I go from here, to get a little more than what matplotlib.pyplot provides?
Should I go to pandas?
You are mixing up two things here.
Matplotlib is designed for plotting data. It is not designed for managing data.
Pandas is designed for data analysis. Even if you were using pandas, you would still need to plot the data. How? Well, probably using matplotlib!
Independently of what you're doing, think of it as a three step process:
Data acquisition, data read-in
Data processing
Data representation / plotting
plt.plotfile() is a convenience function, which you can use if you don't need step 2 at all. But it surely has its limitations.
Methods to read in data (not a complete list, of course) include pure Python open, the csv module's reader or similar, numpy/scipy, pandas, etc.
Depending on what you want to do with your data, you can already choose a suitable input method: numpy for large numerical datasets, pandas for datasets which include qualitative data or rely heavily on cross-correlations, etc.
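As a minimal sketch of those three steps with the sample file from the question (assuming it is saved as test0.csv):
import pandas as pd
import matplotlib.pyplot as plt

# 1. Data read-in: pandas keeps the original column names, capitalization included
df = pd.read_csv('test0.csv', sep=';')

# 2. Data processing: nothing needed for this toy example, but it would go here

# 3. Data representation: plot with matplotlib, labelling axes from the real column names
fig, ax = plt.subplots()
ax.plot(df['Column1'], df['Column2'])
ax.set_xlabel('Column1')
ax.set_ylabel('Column2')
fig.savefig('test0.png')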
So I'm trying to store Pandas DataFrames in HDF5 and getting strange errors, rather inconsistently. At least half the time, some part of the read-process-move-write cycle fails, often with no clearer explanation than "HDF5 Read Error". Even worse, sometimes the table ends up with nonsense/corrupted data that doesn't stop things until downstream -- either values that are off by orders of magnitude (and not even correlated with the correct ones) or dates that don't make sense (recent data mismarked as being dated in the 1750s...etc).
I thought I'd go through the current process and then the things that I suspect might be causing problems, in case that helps. Here's what it looks like:
Read some of the tables (call them "QUERY1" and "QUERY2") to see if they're up to date, and if they aren't:
Take the table that had been in the HDF5 store as "QUERY1" and store it as "QUERY1_YYYY_MM_DD" in the HDF5 store instead
Run the associated query on the external database for that table. Each one is between 100 and 1500 columns of daily data back to 1980.
Store the result of query 1 as the new "QUERY1" in the HDF5 store
Compute several transformations of one or more of QUERY1, QUERY2, ... QUERYn, which will have hierarchical (pandas MultiIndex) columns. Overwrite item "Derived_Frame1", etc. with its update/replacement in the HDF5 store
Multiple people with access to the relevant .h5 file on a Windows network drive run this routine -- potentially sometimes, but not usually, at the same time.
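For concreteness, here is a stripped-down, runnable stand-in for one archive-and-replace cycle; the frames, the file path, and the date suffix are placeholders for the real ones:
import pandas as pd

store_path = 'main_store.h5'  # placeholder; the real file sits on a Windows network drive

new_query1 = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})  # stand-in for the database query result

with pd.HDFStore(store_path) as store:
    if 'QUERY1' in store:
        # archive the table currently stored as "QUERY1" under a dated key
        store.put('QUERY1_YYYY_MM_DD', store['QUERY1'])
    # overwrite "QUERY1" with the fresh query result (default, i.e. fixed, format)
    store.put('QUERY1', new_query1)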
Some things I suspect could be part of the problem:
Using the default format (df.to_hdf(store, key)) instead of insisting on "table" format with df.to_hdf(store, key, format='table'). I do this because the default format is between 2 and 5x faster on both the read and the write, according to %timeit.
Using a network drive to allow several users to run this routine and access at least the derived frames. Not much I can do about this requirement, especially for read access to the derived dataframes at any time.
From the docs, it sounds like repeatedly deleting and re-writing an item in the HDF5 store can do weird things (at least gradually increasing the file size, not sure what else). Maybe I should be storing query archives in another file? Maybe I should be nuking and replacing the whole main file upon update?
Storing dataframes with MultiIndex columns in HDF5 in the first place -- this seems to be what gets me a "warning" under the default format, although it seems like the warning goes away if I use format='table'.
Edit: it is also possible/likely that different users running the routine above are using different versions of Pandas and different versions of PyTables.
Any ideas?
I'm transitioning all of my data analysis from MATLAB to Python and I've finally hit a block where I've been unable to quickly find a turnkey solution. I have time series data from many instruments, including an ADV (acoustic Doppler velocimeter), that require despiking. Previously I've used this function in MATLAB, which works quite well:
http://www.mathworks.com/matlabcentral/fileexchange/15361-despiking
Is anybody aware of a similar function available in Python?
I'd use a median filter; there are plenty of options depending on your data, for example:
import scipy.ndimage as im
m = 5  # filter window size, to tune for your data
x = im.median_filter(x, size=m)  # x is your velocity time series
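If it helps, a minimal self-contained sketch of median-filter despiking on a synthetic 1-D velocity series; the window size of 5 and the synthetic signal are arbitrary choices to adapt to your data:
import numpy as np
import scipy.ndimage as im

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 2000)
x = np.sin(t) + 0.05 * rng.standard_normal(t.size)  # noisy but well-behaved signal
x[rng.integers(0, t.size, 20)] += 5.0               # inject some artificial spikes

x_despiked = im.median_filter(x, size=5)            # spikes are suppressed by the median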