Plot specifying column by name, upper case issue - python

I'm learning how to plot things (CSV files) in Python, using matplotlib (import matplotlib.pyplot as plt).
Column1;Column2;Column3;
1;4;6;
2;2;6;
3;3;8;
4;1;1;
5;4;2;
I can plot the one above with plt.plotfile('test0.csv', (0, 1), delimiter=';'), which gives me the figure below.
Do you see the axis labels, column1 and column2? They are lower case in the figure, but in the data file they begin with upper case.
Also, I tried plt.plotfile('test0.csv', ('Column1', 'Column2'), delimiter=';'), which does not work.
So it seems matplotlib.pyplot works only with lowercase names :(
Combining this issue with this other one, I guess it's time to try something else.
As I am pretty new to plotting in Python, I would like to ask: Where should I go from here, to get a little more than what matplotlib.pyplot provides?
Should I go to pandas?

You are mixing up two things here.
Matplotlib is designed for plotting data. It is not designed for managing data.
Pandas is designed for data analysis. Even if you were using pandas, you would still need to plot the data. How? Well, probably using matplotlib!
Independently of what you're doing, think of it as a three step process:
Data acquisition / read-in
Data processing
Data representation / plotting
plt.plotfile() is a convenience function, which you can use if you don't need step 2 at all. But it surely has its limitations.
Methods to read in data (not a complete list, of course): pure Python open(), Python's csv reader or similar, numpy / scipy, pandas, etc.
Depending on what you want to do with your data, you can already choose a suitable input method: numpy for large numerical data sets, pandas for data sets which include qualitative data or heavily rely on cross-correlations, etc.
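As a side note on the original question: pandas preserves the case of column names, so selecting "Column1" by its exact name just works. A minimal sketch of the three steps above, with the question's CSV text inlined (via StringIO) so the example is self-contained:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe without a display
import matplotlib.pyplot as plt
from io import StringIO

# Inlined stand-in for test0.csv
csv_text = "Column1;Column2;Column3;\n1;4;6;\n2;2;6;\n3;3;8;\n4;1;1;\n5;4;2;\n"

# Step 1: read-in (column-name case is preserved)
df = pd.read_csv(StringIO(csv_text), sep=";")

# Step 2: processing (none needed here)

# Step 3: plotting, selecting columns by their exact names
fig, ax = plt.subplots()
ax.plot(df["Column1"], df["Column2"])
ax.set_xlabel("Column1")
ax.set_ylabel("Column2")
```

The trailing semicolon in each row produces one extra empty column, which pandas reads in as an unnamed all-NaN column; it can simply be ignored.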

Related

Using Matplotlib with Dask

Let's say we have a pandas dataframe pd and a dask dataframe dd. When I want to plot the pandas one with matplotlib, I can easily do it:
fig, ax = plt.subplots()
ax.bar(pd["series1"], pd["series2"])
fig.savefig(path)
However, when I am trying to do the same with dask dataframe I am getting Type Errors such as:
TypeError: Cannot interpret 'string[python]' as a data type
string[python] is just an example; whatever dtype your dd["series1"] has will appear here.
So my question is: What is the proper way to use matplotlib with dask, and is this even a good idea to combine the two libraries?
One motivation to use dask instead of pandas is the size of the data. As such, swapping pandas DataFrame with dask DataFrame might not be feasible. Imagine a scatter plot, this might work well with 10K points, but if the dask dataframe is a billion rows, a plain matplotlib scatter is probably a bad idea (datashader is a more appropriate tool).
Some graphical representations are less sensitive to the size of the data: e.g. a normalized bar chart should work well, as long as the number of categories does not scale with the data. In this case the easiest solution is to use dask to compute the statistics of interest before plotting them using pandas.
To summarise: I would consider the nature of the chart, figure out the best tool/representation, and if it's something that can/should be done with matplotlib, then I would run computations on the dask DataFrame to get the reduced result as a pandas dataframe and proceed with matplotlib.
SultanOrazbayev's answer is still spot on; here is an answer elaborating on the datashader option (which hvplot calls under the hood).
Don't use Matplotlib, use hvPlot!
If you wish to plot the data while it's still large, I recommend using hvPlot, as it can natively handle dask dataframes. It also automatically provides interactivity.
Example
import numpy as np
import dask
import hvplot.dask
# Create Dask DataFrame with normally distributed data
df = dask.datasets.timeseries()
df['x'] = df['x'].map_partitions(lambda x: np.random.randn(len(x)))
df['y'] = df['y'].map_partitions(lambda x: np.random.randn(len(x)))
# Plot
df.hvplot.scatter(x='x', y='y', rasterize=True)

Preprocessing data for Time-Series prediction

Okay, so I am doing research on how to do time-series prediction. As always, it's preprocessing the data that's the difficult part. I understand I have to convert the "time-stamp" in a data file into a "datetime" or "timestep", and I did that:
df = pd.read_csv("airpassengers.csv")
month = pd.to_datetime(df['Month'])
(I may have parsed the datetime incorrectly; I have seen people use pd.read_csv() arguments instead to parse the dates. If I did it wrong, please advise on how to do it properly.)
I also understand the part where I scale my data. (Could someone explain to me how the scaling works? I know that it maps all my data into the range I give it, but would the output of my prediction also be scaled?)
Lastly, once I have scaled and parsed the data and timestamps, how would I actually predict with the trained model? I don't know what to enter into (for example) model.predict().
From my research it seems like I have to shift my dataset or something, but I don't really understand what the documentation is saying, and the example isn't directly related to time-series prediction.
I know this is a lot, and you might not be able to answer all the questions. I am fairly new to this, so just help with whatever you can. Thank you!
So, because you're working with airpassengers.csv and asking about predictive modeling I'm going to assume you're working through this github
There's a couple of things I want to make sure you know before I dive into the answer to your questions.
There are lots of different types of predictive models used in forecasting. You can find out all about them here.
You're asking a lot of broad questions, but I'll break the main questions down into two steps and describe what's happening using the example that I believe you're trying to replicate.
Let's break it down
Loading and parsing the data
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6
air_passengers = pd.read_csv("./data/AirPassengers.csv", header = 0, parse_dates = [0], names = ['Month', 'Passengers'], index_col = 0)
This section of code loads the data from a .csv (comma-separated values) file into the data frame air_passengers. Inside the call that reads the csv we also state that there is a header in the first row, that the first column holds dates, what our columns are named, and that the first column should be used as the index.
Scaling the data
log_air_passengers = np.log(air_passengers.Passengers)
This is done to make the math make sense. Logarithms are the inverse of exponentiation (if y = 2^x, then x = log2(y)). numpy's log function gives us the natural log (base e). Your predicted values will actually be so close to a percent change that you can treat them as such.
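A quick numeric illustration of that last claim, with made-up numbers:

```python
import numpy as np

# A log difference approximates a percent change for small changes
a, b = 100.0, 103.0
log_diff = np.log(b) - np.log(a)   # close to 0.0296
pct_change = (b - a) / a           # exactly 0.03
```

The two values agree to within a fraction of a percentage point, which is why differenced log values are often read directly as growth rates.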
Now that the data has been scaled, we can prep it for statistical modeling
log_air_passengers_diff = log_air_passengers - log_air_passengers.shift()
log_air_passengers_diff.dropna(inplace=True)
This changes the data frame to hold the differences between consecutive data points instead of the log values themselves.
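To see that the log-differenced series loses no information, and that predictions made on the transformed scale can be mapped back to passenger counts, here is a small sketch with made-up numbers: a cumulative sum undoes the differencing and np.exp undoes the log:

```python
import numpy as np
import pandas as pd

# Made-up monthly passenger counts
passengers = pd.Series([112.0, 118.0, 132.0, 129.0, 121.0])

log_p = np.log(passengers)
log_diff = (log_p - log_p.shift()).dropna()

# Invert the transform: cumsum undoes the differencing, adding the
# first log value anchors the level, and np.exp undoes the log
restored = np.exp(log_diff.cumsum() + log_p.iloc[0])
```

restored matches the original series from the second point onward; the same inversion applies to model output on the transformed scale.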
The last part of your question contains too many steps to cover here. It is also not as simple as calling a single function. I encourage you to learn more from here

Why is matplotlib plotting so much slower than pd.DataFrame.plot()?

Hello dear Community,
I haven't found anything similar during my search and hope I haven't overlooked anything. I have the following issue:
I have a big dataset whose shape is 1352x121797 (1352 samples and 121797 time points). Now I have clustered these and would like to generate one plot per cluster in which every time series of that cluster is plotted.
However, when using the matplotlib syntax it is extremely slow (and I'm not exactly sure why). Even after 5-10 minutes it hasn't finished.
import matplotlib.pyplot as plt
import pandas as pd

fig, ax = plt.subplots()
for index, values in subset_cluster.iterrows():  # one cluster subset, a dataframe of shape (11, 121797)
    ax.plot(values)
fig.savefig('test.png')
Even when inserting a break after ax.plot(values), it still doesn't finish. I'm using Spyder and thought that it might be due to Spyder always rendering the plot inline in the console.
However, when simply using the pandas method of the Series values.plot() instead of ax.plot(values) the plot appears and is saved in like 1-2 seconds.
As I need the customization options of matplotlib for standardizing all the plots and make them look a little bit prettier I would love to use the matplotlib syntax. Anyone has any ideas?
Thanks in advance
Edit: so while trying around a little bit, it seems that the rendering is the time-consuming part. When run with the backend matplotlib.use('Agg'), the plot command runs through more quickly (if using plt.plot() instead of ax.plot()), but plt.savefig() then takes forever. Still, it should finish in a reasonable amount of time, right? Even for 121xxx data points.
Posting as an answer as it may help OP or someone else: I had the same problem and found out that it was because the data I was using as the x-axis was of Object dtype, while the y-axis data was float64. After explicitly converting the object to DateTime, plotting with matplotlib went as fast as pandas' df.plot(). I guess pandas does a better job of inferring the data type when plotting.
OP, you might want to check whether the values you are plotting are of the right type, or if, like me, you had some problems when loading the dataframe from file.
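A minimal sketch of that fix, with a hypothetical frame whose x column was read in as plain strings (object dtype):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# Hypothetical frame whose date column came out of read_csv as strings
df = pd.DataFrame({"when": ["2021-01-01", "2021-01-02", "2021-01-03"],
                   "value": [1.0, 3.0, 2.0]})
assert df["when"].dtype == object  # the slow case: object dtype

# Convert explicitly before handing the data to matplotlib
df["when"] = pd.to_datetime(df["when"])

fig, ax = plt.subplots()
ax.plot(df["when"], df["value"])
```

With object dtype, matplotlib treats every value as an opaque category; after the explicit conversion it can work with the numeric datetime representation directly.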

Efficiently creating lots of Histograms from grouped data held in pandas dataframe

I want to create a bunch of histograms from grouped data in pandas dataframe. Here's a link to a similar question. To generate some toy data that is very similar to what I am working with you can use the following code:
from pandas import DataFrame
import numpy as np
x = ['A']*300 + ['B']*400 + ['C']*300
y = np.random.randn(1000)
df = DataFrame({'Letter':x, 'N':y})
I want to put those histograms (read the binned data) in a new dataframe and save that for later processing. Here's the real kicker, my file is 6 GB, with 400k+ groups, just 2 columns.
I've thought about using a simple for loop to do the work:
data = []
for group in df['Letter'].unique():
    data.append(np.histogram(df[df['Letter'] == group]['N'], range=(-2000, 2000), bins=50, density=True)[0])
df2 = DataFrame(data)
Note that the bins, range, and density keywords are all necessary for my purposes, so that the histograms are consistent and normalized across the rows in my new dataframe df2 (the parameter values are from my real dataset, so they're overkill on the toy dataset). The for loop works great: on the toy dataset it generates a pandas dataframe of 3 rows and 50 columns as expected. On my real dataset I've estimated the time to completion would be around 9 days. Is there any better/faster way to do what I'm looking for?
P.S. I've thought about multiprocessing, but I think the overhead of creating processes and slicing data would be slower than just running this serially (I may be wrong and wouldn't mind being corrected on this).
For the type of problem you describe here, I personally usually do the following, which is basically to delegate the whole thing to multithreaded Cython/C++. It's a bit of work, but not impossible, and I'm not sure there's really a viable alternative at the moment.
Here are the building blocks:
First, your df.x.values, df.y.values are just numpy arrays. This link shows how to get C-pointers from such arrays.
Now that you have pointers, you can write a true multithreaded program using Cython's prange and foregoing any Python from this point (you're now in C++ territory). So say you have k threads scanning your 6GB arrays, and thread i handles groups whose keys have a hash that is i modulo k.
For a C program (which is what your code really is now) the GNU Scientific Library has a nice histogram module.
When the prange is done, you need to convert the C++ structures back to numpy arrays, and from there back to a DataFrame. Wrap the whole thing up in Cython, and use it like a normal Python function.

How to set x lim to 99.5 percentile of my data series for matplotlib histogram?

I'm currently pumping out some histograms with matplotlib. The issue is that because of one or two outliers my whole graph is incredibly small and almost impossible to read due to having two separate histograms being plotted. The solution I am having problems with is dropping the outliers at around a 99/99.5 percentile. I have tried using:
plt.xlim([np.percentile(df,0), np.percentile(df,99.5)])
plt.xlim([df.min(),np.percentile(df,99.5)])
Seems like it should be a simple fix, but I'm missing some key information to make it happen. Any input would be much appreciated, thanks in advance.
To restrict focus to just the middle 99% of the values, you could do something like this:
trimmed_data = df[(df.Column > df.Column.quantile(0.005)) & (df.Column < df.Column.quantile(0.995))]
Then you could do your histogram on trimmed_data. Exactly how to exclude outliers is more of a stats question than a Python question, but basically the idea I was suggesting in a comment is to clean up the data set using whatever methods you can defend, and then do everything (plots, stats, etc.) on only the cleaned dataset, rather than trying to tweak each individual plot to make it look right while still having the outlier data in there.
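For completeness, the question's xlim approach does work when the data is a single numeric column; a self-contained sketch with made-up data, clipping only the view rather than the data:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# Made-up data: mostly normal values plus one extreme outlier
rng = np.random.default_rng(42)
s = pd.Series(np.concatenate([rng.standard_normal(1000), [50.0]]))

fig, ax = plt.subplots()
ax.hist(s, bins=100)
# Restrict the visible x-range to the 0th-99.5th percentile of the data;
# the outlier is still binned, just off-screen
ax.set_xlim(s.min(), s.quantile(0.995))
```

This keeps the outlier in the underlying statistics while making the bulk of the distribution readable, whereas the trimmed_data approach above removes it from the plot entirely.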
