I am handling datasets of several GB, which I process in parallel with the multiprocessing library.
It takes a lot of time, but that makes sense.
Once I have the resultant dataset, I need to plot it.
In this particular case, through matplotlib, I generate my stacked bar chart with:
plot = df.plot(kind='bar',stacked=True)
fig = plot.get_figure()
fig.savefig('plot.pdf', bbox_inches='tight')
At this point, for large datasets, this is simply unmanageable. The plotting step runs sequentially, so it doesn't matter how many cores you have.
The generated plot is saved as a PDF which, in turn, is also really heavy and slow to open.
Is there any alternative workflow to generate lighter plots?
So far, I've tried dropping alternate rows from the original dataset (this process may be repeated several times, until reaching a more manageable dataset). This is done with:
df = df.iloc[::2]
Let's say that it sort of works. However, I'd like to know if there are other approaches for this.
How do you approach this type of large-scale visualization?
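For reference, here is a minimal sketch of one such alternative (an illustration only; it assumes every plotted column is numeric and uses an arbitrary bin_size): instead of discarding rows, average fixed-size groups of consecutive rows so the chart ends up with far fewer bars.
import numpy as np

# Average every bin_size consecutive rows instead of dropping them.
# bin_size is an arbitrary, made-up value; tune it to the dataset.
bin_size = 100
binned = df.groupby(np.arange(len(df)) // bin_size).mean()

plot = binned.plot(kind='bar', stacked=True)
fig = plot.get_figure()
fig.savefig('plot_binned.pdf', bbox_inches='tight')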
I have a fairly large pandas dataframe (4000, 103), and for smaller dataframes I love using pairplot to visually see patterns in my data. But for my larger dataset the same command runs for an hour+ with no output.
Is there an alternative tool to get the same outcome, or a way to speed up the command? I tried using the sample option in pandas to reduce the dataset, but it still takes over an hour with no output.
dfSample = myData.sample(100) # make dataset smaller
sns.pairplot(dfSample, diag_kind="hist")
You should sample from the columns, so replace your first line with
dfSample = myData.sample(10, axis=1)
And live happy.
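Putting the row sampling and the column sampling together, a minimal sketch (the sample sizes are arbitrary) could look like:
import seaborn as sns

# Sample both rows and columns before calling pairplot; its cost grows with
# the square of the number of columns, so trimming columns helps the most.
dfSample = myData.sample(100)            # fewer rows (arbitrary size)
dfSample = dfSample.sample(10, axis=1)   # fewer columns (arbitrary size)
sns.pairplot(dfSample, diag_kind="hist")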
Say I have two pandas dataframes, one containing data for general population and one containing the same data for a target group.
I assume this is a very common use case of population segmentation. My first idea for exploring the data would be to perform some visualization using e.g. seaborn's FacetGrid, barplot & scatterplot, or something like that, to get a general idea of the trends and differences.
However, I found out that this operation is not as straightforward as I thought, as seaborn is made to analyze one dataset rather than to compare two datasets.
I found this SO answer which provides a solution. But I am wondering how people would go about it if the dataframe were huge and a concat operation were not possible.
Datashader does not seem to provide such features, as far as I have seen.
Thanks for any ideas on how to go about such a task.
I would use the library Dask when data is too big for pandas. Dask comes out of the same ecosystem as pandas and is a little more advanced, because it is a big-data tool, but it has some of the same features, including concat. I found Dask easy enough to use and am using it for a couple of projects where I have dozens of columns and tens of millions of rows.
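A rough sketch of what that could look like with Dask (the file names, the 'group' label, and 'some_feature' are placeholders, not from the question):
import dask.dataframe as dd

# Read both datasets lazily instead of loading them into pandas first.
general = dd.read_csv('population_*.csv')
target = dd.read_csv('target_group_*.csv')

# Tag each frame so the two populations stay distinguishable after concat.
general['group'] = 'population'
target['group'] = 'target'

combined = dd.concat([general, target])

# Aggregate down to something small enough to hand to seaborn/matplotlib.
summary = combined.groupby(['group', 'some_feature']).size().compute()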
Hello dear Community,
I haven't found anything similar during my search and hope I haven't overlooked anything. I have the following issue:
I have a big dataset whose shape is 1352x121797 (1352 samples and 121797 time points). I have now clustered these and would like to generate one plot per cluster, in which every time series belonging to that cluster is plotted.
However, when using the matplotlib syntax it is extremely slow (and I'm not exactly sure where that comes from). Even after 5-10 minutes it hasn't finished.
import matplotlib.pyplot as plt
import pandas as pd
fig, ax = plt.subplots()
for index, values in subset_cluster.iterrows():  # one cluster subset, a dataframe of shape (11, 121797)
    ax.plot(values)
fig.savefig('test.png')
Even when inserting a break after ax.plot(values), it still doesn't finish. I'm using Spyder and thought that it might be due to Spyder always rendering the plot inline in the console.
However, when simply using the pandas Series method values.plot() instead of ax.plot(values), the plot appears and is saved in 1-2 seconds.
As I need the customization options of matplotlib for standardizing all the plots and making them look a little prettier, I would love to use the matplotlib syntax. Does anyone have any ideas?
Thanks in advance
Edit: while trying around a little bit, it seems that the rendering is the time-consuming part. When run with the Agg backend (matplotlib.use('Agg')), the plot command runs through more quickly (if using plt.plot() instead of ax.plot()), but plt.savefig() then takes forever. Still, it should finish in a reasonable amount of time, right? Even for 121xxx data points.
Posting as an answer as it may help the OP or someone else: I had the same problem and found out that it was because the data I was using for the x-axis was an Object, while the y-axis data was float64. After explicitly converting the object to DateTime, plotting with matplotlib went as fast as pandas' df.plot(). I guess that pandas does a better job of understanding the data type when plotting.
OP, you might want to check whether the values you are plotting are of the right type, or whether, like me, you had some problems when loading the dataframe from file.
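As a rough illustration of that check (names reused from the question; whether to_datetime or to_numeric is right depends on what the labels actually hold):
import pandas as pd
import matplotlib.pyplot as plt

# An 'object' dtype makes matplotlib treat every point as a categorical
# label, which is what slows plotting to a crawl.
print(subset_cluster.dtypes.value_counts())
print(subset_cluster.columns.dtype)

# Hypothetical fix: convert explicitly before plotting. Use pd.to_datetime
# if the time points are timestamps, pd.to_numeric if they are plain numbers.
subset_cluster.columns = pd.to_numeric(subset_cluster.columns)
subset_cluster = subset_cluster.astype('float64')

fig, ax = plt.subplots()
for index, values in subset_cluster.iterrows():
    ax.plot(values.index, values.to_numpy())
fig.savefig('test.png')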
I'm learning how to plot things (CSV files) in Python, using import matplotlib.pyplot as plt.
Column1;Column2;Column3;
1;4;6;
2;2;6;
3;3;8;
4;1;1;
5;4;2;
I can plot the one above with plt.plotfile('test0.csv', (0, 1), delimiter=';'), which gives me the figure below.
Do you see the axis labels, column1 and column2? They are in lower case in the figure, but in the data file they begin with an upper-case letter.
Also, I tried plt.plotfile('test0.csv', ('Column1', 'Column2'), delimiter=';'), which does not work.
So it seems matplotlib.pyplot works only with lowercase names :(
Combining this issue with this other one, I guess it's time to try something else.
As I am pretty new to plotting in Python, I would like to ask: Where should I go from here, to get a little more than what matplotlib.pyplot provides?
Should I go to pandas?
You are mixing up two things here.
Matplotlib is designed for plotting data. It is not designed for managing data.
Pandas is designed for data analysis. Even if you were using pandas, you would still need to plot the data. How? Well, probably using matplotlib!
Independently of what you're doing, think of it as a three-step process:
Data acquisition, data read-in
Data processing
Data representation / plotting
plt.plotfile() is a convenience function which you can use if you don't need step 2 at all. But it surely has its limitations.
Methods to read in data (not a complete list, of course) include the pure-Python open(), Python's csv reader or similar, numpy/scipy, pandas, etc.
Depending on what you want to do with your data, you can already choose a suitable input method: numpy for large numerical datasets, pandas for datasets which include qualitative data or rely heavily on cross-correlations, etc.
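As a sketch of those three steps for the small file above (file name reused from the question; the trailing ';' simply produces an extra empty column that can be ignored):
import pandas as pd
import matplotlib.pyplot as plt

# Step 1: read-in (pandas handles the ';' delimiter and keeps the header as-is)
df = pd.read_csv('test0.csv', sep=';')

# Step 2: processing would go here (none needed for this tiny example)

# Step 3: plotting, with the original, capitalized column names as labels
fig, ax = plt.subplots()
ax.plot(df['Column1'], df['Column2'])
ax.set_xlabel('Column1')
ax.set_ylabel('Column2')
fig.savefig('test0.png')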
I want to create a bunch of histograms from grouped data in pandas dataframe. Here's a link to a similar question. To generate some toy data that is very similar to what I am working with you can use the following code:
from pandas import DataFrame
import numpy as np
x = ['A']*300 + ['B']*400 + ['C']*300
y = np.random.randn(1000)
df = DataFrame({'Letter':x, 'N':y})
I want to put those histograms (read: the binned data) in a new dataframe and save that for later processing. Here's the real kicker: my file is 6 GB, with 400k+ groups and just 2 columns.
I've thought about using a simple for loop to do the work:
data = []
for group in df['Letter'].unique():
    data.append(np.histogram(df[df['Letter'] == group]['N'], range=(-2000, 2000), bins=50, density=True)[0])
df2 = DataFrame(data)
Note that the bins, range, and density keywords are all necessary for my purposes, so that the histograms are consistent and normalized across the rows in my new dataframe df2 (the parameter values come from my real dataset, so they are overkill on the toy dataset). And the for loop works great: on the toy dataset it generates a pandas dataframe of 3 rows and 50 columns, as expected. On my real dataset I've estimated that the time to completion would be around 9 days. Is there any better/faster way to do what I'm looking for?
P.S. I've thought about multiprocessing, but I think the overhead of creating processes and slicing data would be slower than just running this serially (I may be wrong and wouldn't mind being corrected on this one).
For the type of problem you describe here, I personally usually do the following, which is basically to delegate the whole thing to multithreaded Cython/C++. It's a bit of work, but not impossible, and I'm not sure there's really a viable alternative at the moment.
Here are the building blocks:
First, your df.x.values, df.y.values are just numpy arrays. This link shows how to get C-pointers from such arrays.
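For example, a minimal sketch of that first step in plain Python (the column name 'N' is taken from the question above):
import ctypes
import numpy as np

# df['N'].values is a plain numpy array; expose its underlying buffer as a C pointer.
values = np.ascontiguousarray(df['N'].values, dtype=np.float64)
c_ptr = values.ctypes.data_as(ctypes.POINTER(ctypes.c_double))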
Now that you have pointers, you can write a true multithreaded program using Cython's prange and foregoing any Python from this point (you're now in C++ territory). So say you have k threads scanning your 6GB arrays, and thread i handles groups whose keys have a hash that is i modulo k.
For a C program (which is what your code really is now) the GNU Scientific Library has a nice histogram module.
When the prange is done, you need to convert the C++ structures back to numpy arrays, and from there back to a DataFrame. Wrap the whole thing up in Cython, and use it like a normal Python function.