matplotlib graphs with different size variables - python

I'm quite new to python and am working on an assignment. I am using pandas to read a csv into a dataframe. One of the fields in that dataframe is Car-Type. I want to get a total of each of the different Car-Types (sedan, hatch-back, wagon, etc.) in the data frame, then use matplotlib to make graphs of the Car-Types vs. the type-totals. Depending on the type of graph I try to make, I get different errors about the x and y variables.
import pandas as pd
from matplotlib import pyplot as plt
path_to_csv = r'C:\Automobile_data.csv'
# use pandas to read the csv and assign it to a variable, df
df = pd.read_csv(path_to_csv, encoding='iso-8859-1')
# Where the car-type value is null, set the value to Not Available
df.loc[df['Car-Type'].isnull(), 'Car-Type'] = 'Not_Available'
# create new dataframe with just the car-type counts
countType = df['Car-Type'].value_counts().astype('int64')
print(countType)
# use matplotlib to create a graph
# assign values to the x and y variables
x = df['Car-Type']
y = countType
# create a graph with x and y
plt.plot(x, y)
# # display the graph
plt.show()
line: (plt.plot) - ValueError: x and y must have same first dimension, but have shapes (205,) and (6,)
bar: (plt.bar) - ValueError: shape mismatch: objects cannot be broadcast to a single shape
scatter: (plt.scatter) - ValueError: x and y must be the same size
I understand that the error concerns the size and/or shape of the data I'm assigning to the x and y variables, but I'm not experienced enough, and haven't been able to work out from reading, how to fix it.
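The shapes in the error message, (205,) and (6,), suggest that x holds one entry per row of the dataframe while y holds one entry per category. A minimal sketch of one way to make them match is to take both x and y from the value_counts() result. The small dataframe below is a hypothetical stand-in for Automobile_data.csv, and the Agg backend line is only an assumption so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this and call plt.show() interactively
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the 'Car-Type' column of the CSV
df = pd.DataFrame(
    {"Car-Type": ["sedan", "hatchback", "sedan", None, "wagon", "sedan", "hatchback"]}
)

# Replace missing values, then count each category
df["Car-Type"] = df["Car-Type"].fillna("Not_Available")
count_type = df["Car-Type"].value_counts()

# Take BOTH axes from the counts so they have the same length
x = list(count_type.index)   # one label per category
y = list(count_type.values)  # one count per category

fig, ax = plt.subplots()
ax.bar(x, y)
ax.set_xlabel("Car-Type")
ax.set_ylabel("Count")
```

The same x and y should also work with ax.plot or ax.scatter, since both now have one entry per category.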

Related

ValueError: If using all scalar values, you must pass an index, when trying to create a seaborn jointplot

I am trying to make a jointplot out of some sample data, but I keep getting the following error and do not understand why. I am probably missing something obvious, but I do not see why it thinks the values are scalar.
The code to create the plot is below.
x_fake = np.random.randn(1, 100)
y_fake = -3*x_fake + 2*np.random.randn(1, 100)
sns.jointplot(x = x_fake, y = y_fake, kind = "reg")
ValueError: If using all scalar values, you must pass an index
The problem is that np.random.randn(1, 100) creates a 2D array (of shape 1x100), while 1D arrays are needed. You could, e.g., "squeeze" them (sns.jointplot(x=x_fake.squeeze(), y=y_fake.squeeze(), kind="reg")), or just create them as 1D arrays (x_fake = np.random.randn(100)). The cryptic error message comes from pandas. Note that the "standard" way of using Seaborn is via dataframes (which do have 1D columns).
import seaborn as sns
import numpy as np
x_fake = np.random.randn(1, 100)
y_fake = -3 * x_fake + 2 * np.random.randn(1, 100)
sns.jointplot(x=x_fake.squeeze(), y=y_fake.squeeze(), kind="reg")
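To make the shape difference described in the answer visible, here is a small NumPy-only sketch. The variable names and the seeded generator are illustrative choices, not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, an illustrative choice

x_2d = rng.standard_normal((1, 100))  # one row, 100 columns: a 2D array
x_1d = x_2d.squeeze()                 # drops the length-1 axis
x_direct = rng.standard_normal(100)   # created as 1D from the start

print(x_2d.shape)      # (1, 100)
print(x_1d.shape)      # (100,)
print(x_direct.shape)  # (100,)
```

Either of the two 1D forms has the shape the answer says the plotting call needs.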

Converting pandas dataset to matplotlib plot: is there a way to append a value to the list of y values?

To convert a pandas data set into something matplotlib can plot, I turn the y-values into a list. When I try to append my own value to that list, I get an error that I am unsure how to approach.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# pd.set_option('display.max_columns', None)
# Basic Data Set Cleanse
data_BM = pd.read_csv("C:\\Users\\user\\Desktop\\matplot and pand\\bigmart_data.csv")
data_BM = data_BM.dropna(how='any')
data_BM = data_BM.reset_index(drop=True)
# Graphed Mean Sales for Each Sized Outlet -
sales_by_outlet_size = data_BM.groupby('Outlet_Size').Item_Outlet_Sales.mean()[:10]
sales_by_outlet_size.sort_values(inplace=True)
print(sales_by_outlet_size.head())
x = sales_by_outlet_size.index.tolist()
y = sales_by_outlet_size.values.tolist()
# # # # #
# This is where I am confused. If I wanted to do something like -
y.append(3000)
# or
y += [3000]
# It doesn't go well
# # # # #
plt.xlabel('Outlet Size')
plt.ylabel('Mean Sales')
plt.title('Mean Sales by Store Size')
plt.xticks(labels=x, ticks=np.arange(len(x)))
plt.bar(x, y, color=['red', 'orange', 'magenta'])
plt.show()
Is there a rule for this that I am unaware of?
Error for y.append(3000):
ValueError: shape mismatch: objects cannot be broadcast to a single shape
Since you're using y.append(3000) or y += [3000] you've got more Y values than X. Add one more X value and you should be fine.
The error does not come from y.append(3000) or y += [3000] but from generating the plot. After adding one Y value, you have one X value fewer than Y values. Append one X value as well and check for yourself.
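As a rough sketch of what both answers suggest, the lists below get one new entry each, so plt.bar receives the same number of labels and heights. The labels and sales figures are hypothetical placeholders, not values from bigmart_data.csv, and the Agg backend is only an assumption so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this and call plt.show() interactively
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the grouped means
x = ["High", "Medium", "Small"]
y = [2100.0, 2400.0, 1900.0]

# Append to BOTH lists so they stay the same length
x.append("Extra")
y.append(3000)

fig, ax = plt.subplots()
ax.bar(x, y, color=["red", "orange", "magenta", "gray"])
ax.set_xlabel("Outlet Size")
ax.set_ylabel("Mean Sales")
ax.set_title("Mean Sales by Store Size")
```

Appending only to y leaves three labels for four heights, which is the shape mismatch reported in the error.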

Missing axis values in plotting of NetCDF variable in group with xarray

I am constructing a NetCDF file that will be used with xarray. It will consist of many groups that use dimensions that are defined in the root group. In my current example, xarray's plot function is unable to put the proper values on the axes. Tools like panoply or ncview produce plots that do put the proper values of the dimensions at the axes. The script below creates a file which allows me to reproduce the problem. Do I construct the NetCDF file in an incorrect way, or is this a bug in xarray?
import numpy as np
import netCDF4 as nc
import xarray as xr
import matplotlib.pyplot as plt
# Three series, two variables that contain the axis values and the 2D field.
z = np.arange(0., 1000., 50.)
time = np.arange(0., 86400., 3600.)
a = np.random.rand(time.size, z.size)
# The two dimensions are stored in the root group.
nc_file = nc.Dataset("test.nc", mode="w", datamodel="NETCDF4", clobber=False)
nc_file.createDimension("z" , z.size )
nc_file.createDimension("time", time.size)
nc_z = nc_file.createVariable("z" , "f8", ("z") )
nc_time = nc_file.createVariable("time", "f8", ("time"))
nc_z [:] = z [:]
nc_time[:] = time[:]
# The 2D field is created and stored in a group called test_group.
nc_group = nc_file.createGroup("test_group")
nc_a = nc_group.createVariable("a", "f8", ("time", "z"))
nc_a[:,:] = a[:,:]
nc_file.close()
# Opening the file in xarray shows a plot that misses the axis values.
xr_file = xr.open_dataset("test.nc", group="test_group")
xr_a = xr_file['a']
xr_a.plot()
plt.show()
The resulting figure has just the index count, rather than the dimension values, on the axes.
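One possible explanation, stated here as an assumption and not a confirmed diagnosis, is that opening only test_group gives xarray the variable a without the coordinate variables z and time that live in the root group, so the plot falls back to positional indices. The sketch below uses in-memory arrays in place of the file to show how attaching coordinates with assign_coords changes what xarray knows about the axes. Reading z and time from the root group of the real file would be the analogous step.

```python
import numpy as np
import xarray as xr

z = np.arange(0., 1000., 50.)
time = np.arange(0., 86400., 3600.)

# A 2D field with named dimensions but no coordinate values,
# mimicking a variable read from a group that lacks them
a = xr.DataArray(np.random.rand(time.size, z.size), dims=("time", "z"), name="a")
print(list(a.coords))  # []

# Attach the coordinate values explicitly
a_with_coords = a.assign_coords(time=time, z=z)
print(list(a_with_coords.coords))
```

Calling a_with_coords.plot() should then label the axes with the z and time values instead of counts.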

How to retrieve all data from seaborn distribution plot with multiple distributions?

The post Get data points from Seaborn distplot describes how you can get data elements using sns.distplot(x).get_lines()[0].get_data(), sns.distplot(x).patches and [h.get_height() for h in sns.distplot(x).patches]
But how can you do this if you've used multiple layers by plotting the data in a loop, such as:
Snippet 1
for var in list(df):
    print(var)
    distplot = sns.distplot(df[var])
Plot
Is there a way to retrieve the X and Y values for both linecharts and the bars?
Here's the whole setup for an easy copy&paste:
#%%
# imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pylab
pylab.rcParams['figure.figsize'] = (8, 4)
import seaborn as sns
from collections import OrderedDict
# Function to build synthetic data
def sample(rSeed, periodLength, colNames):
    np.random.seed(rSeed)
    date = pd.to_datetime("1st of Dec, 1999")
    cols = OrderedDict()
    for col in colNames:
        cols[col] = np.random.normal(loc=0.0, scale=1.0, size=periodLength)
    dates = date+pd.to_timedelta(np.arange(periodLength), 'D')
    df = pd.DataFrame(cols, index = dates)
    return(df)
# Dataframe with synthetic data
df = sample(rSeed = 123, colNames = ['X1', 'X2'], periodLength = 50)
# sns.distplot with multiple layers
for var in list(df):
    myPlot = sns.distplot(df[var])
Here's what I've tried:
Y-values for histogram:
If I run:
barX = [h.get_height() for h in myPlot.patches]
Then I get the following list of length 11:
[0.046234272703757885,
0.1387028181112736,
0.346757045278184,
0.25428849987066837,
0.2542884998706682,
0.11558568175939472,
0.11875881712519201,
0.3087729245254993,
0.3087729245254993,
0.28502116110046083,
0.1662623439752689]
And this seems reasonable since there seems to be 6 values for the blue bars and 5 values for the red bars. But how do I tell which values belong to which variable?
Y-values for line:
This seems a bit easier than the histogram part since you can use myPlot.get_lines()[0].get_data() AND myPlot.get_lines()[1].get_data() to get:
Out[678]:
(array([-4.54448949, -4.47612134, -4.40775319, -4.33938504, -4.27101689,
...
3.65968859, 3.72805675, 3.7964249 , 3.86479305, 3.9331612 ,
4.00152935, 4.0698975 , 4.13826565]),
array([0.00042479, 0.00042363, 0.000473 , 0.00057404, 0.00073097,
0.00095075, 0.00124272, 0.00161819, 0.00208994, 0.00267162,
...
0.0033384 , 0.00252219, 0.00188591, 0.00139919, 0.00103544,
0.00077219, 0.00059125, 0.00047871]))
myPlot.get_lines()[1].get_data()
Out[679]:
(array([-3.68337423, -3.6256517 , -3.56792917, -3.51020664, -3.4524841 ,
-3.39476157, -3.33703904, -3.27931651, -3.22159398, -3.16387145,
...
3.24332952, 3.30105205, 3.35877458, 3.41649711, 3.47421965,
3.53194218, 3.58966471, 3.64738724]),
array([0.00035842, 0.00038018, 0.00044152, 0.00054508, 0.00069579,
0.00090076, 0.00116922, 0.00151242, 0.0019436 , 0.00247792,
...
0.00215912, 0.00163627, 0.00123281, 0.00092711, 0.00070127,
0.00054097, 0.00043517, 0.00037599]))
But the whole thing still seems a bit cumbersome. So does anyone know of a more direct approach to perhaps retrieve all data to a dictionary or dataframe?
I had the same need to retrieve data from a seaborn distribution plot. What worked for me was to call the .findobj() method on each iteration's graph. The matplotlib.lines.Line2D objects it returns have a get_data() method, which is similar to what you mentioned for myPlot.get_lines()[1].get_data().
Following your example code
data = []
for idx, var in enumerate(list(df)):
    myPlot = sns.distplot(df[var])
    # Find Line2D objects
    lines2D = [obj for obj in myPlot.findobj() if str(type(obj)) == "<class 'matplotlib.lines.Line2D'>"]
    # Retrieving x, y data
    x, y = lines2D[idx].get_data()[0], lines2D[idx].get_data()[1]
    # Store as dataframe
    data.append(pd.DataFrame({'x':x, 'y':y}))
Notice here that the data for the first sns.distplot plot is stored at the first index of lines2D and the data for the second sns.distplot plot at the second index. I'm not really sure why it works this way, but if you were to consider more than two plots, you would access each sns.distplot's data by indexing lines2D at its respective index.
Finally, to verify one can plot each distplot
plt.plot(data[0].x, data[0].y)
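For the histogram bars, a different technique from the one in the answer is to skip the plot objects and compute the bar heights per column with np.histogram, so each set of values is keyed by its variable name. This is only a sketch: bins="auto" is an assumption and will not necessarily reproduce the bins seaborn chose, and the dataframe is a synthetic stand-in built the same way as in the question.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's dataframe
rng = np.random.default_rng(123)
df = pd.DataFrame({"X1": rng.normal(size=50), "X2": rng.normal(size=50)})

hist_data = {}
for col in df.columns:
    # density=True normalises the bars so their total area is 1
    heights, edges = np.histogram(df[col], bins="auto", density=True)
    hist_data[col] = pd.DataFrame(
        {"left": edges[:-1], "right": edges[1:], "height": heights}
    )

print(hist_data["X1"].head())
```

Each entry of hist_data is a dataframe with one row per bar, so there is no ambiguity about which heights belong to which variable.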

Python : 2d contour plot with fixed x and y for 6 series of fractional data (z)

I'm trying to use a contour plot to show an array of fractional data (between 0 and 1) at 6 heights (5, 10, 15, 20, 25, and 30) with a fixed x-axis (the "WN" series, 1 to 2300). y (height) is different for each series and discontinuous so I need to interpolate between heights.
WN,5,10,15,20,25,30
1,0.9984898,0.99698234,0.99547797,0.99397725,0.99247956,0.99098486
2,0.99814528,0.99629492,0.9944489,0.99260795,0.99077147,0.98893934
3,0.99765164,0.99530965,0.99297464,0.99064702,0.98832631,0.98601222
4,0.99705136,0.99411237,0.99118394,0.98826683,0.98535997,0.9824633
5,0.99606526,0.99214685,0.98824716,0.98436642,0.98050326,0.97665751
6,0.98111153,0.96281821,0.94508928,0.92790776,0.91125059,0.89509743
7,0.99266499,0.98539108,0.97816986,0.97100824,0.96390355,0.95685524
...
Any ideas? Thank you!
Using matplotlib, you need your X (row), Y (column), and Z values. The matplotlib function expects data in a certain format. Below, you'll see the meshgrid helps us get that format.
Here, I use pandas to import your data that I saved to a csv file. You can load your data any way you'd like though. The key is prepping your data for the plotting function.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#import the data from a csv file
data = pd.read_csv('C:/book1.csv')
#here, I let the x values be the column headers (could switch if you'd like)
#[1:] don't want the 'WN' as a value; convert the header strings to numbers
X = data.columns.values[1:].astype(float)
#Here I take the 'WN' column as the Y values
Y = data['WN']
#don't want this column in your data though
del data['WN']
#need to shape your data in preparation for plotting
X, Y = np.meshgrid(X, Y)
#see http://matplotlib.org/examples/pylab_examples/contour_demo.html
plt.contourf(X,Y,data)
plt.show()
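To show what the meshgrid step produces, here is a self-contained sketch that uses only the first three rows from the question, read from an inline string instead of a CSV file. It follows the answer's orientation (heights on x, WN on y). The Agg backend is an assumption so the sketch runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this and call plt.show() interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

csv_text = """WN,5,10,15,20,25,30
1,0.9984898,0.99698234,0.99547797,0.99397725,0.99247956,0.99098486
2,0.99814528,0.99629492,0.9944489,0.99260795,0.99077147,0.98893934
3,0.99765164,0.99530965,0.99297464,0.99064702,0.98832631,0.98601222
"""

# Use WN as the index so only the height columns remain as data
data = pd.read_csv(io.StringIO(csv_text), index_col="WN")

heights = data.columns.astype(float).to_numpy()  # column headers as numbers
wn = data.index.to_numpy()

# meshgrid expands the two 1D axes into 2D grids matching the data
X, Y = np.meshgrid(heights, wn)
Z = data.to_numpy()
print(X.shape, Y.shape, Z.shape)  # (3, 6) (3, 6) (3, 6)

fig, ax = plt.subplots()
contours = ax.contourf(X, Y, Z)
fig.colorbar(contours, ax=ax)
ax.set_xlabel("Height")
ax.set_ylabel("WN")
```

To put WN on the x-axis instead, as the question describes, passing Y, X, Z to contourf and swapping the axis labels should be enough, since all three arrays keep the same shape.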
