Having trouble with the seaborn module in Python

I am trying to draw some basic plots using seaborn's jointplot() function.
My pandas data frame looks like this:
Out[250]:
YEAR Yields avgSumPcpn avgMaxSumTemp avgMinSumTemp
1970 5000 133.924981 30.437124 19.026974
1971 5560 107.691316 31.161974 19.278186
1972 5196 116.830066 31.454192 19.443712
1973 4233 181.550733 30.373581 19.097679
1975 5093 112.137538 30.428966 18.863224
I am trying to plot 'Yields' against 'YEAR' to see how 'Yields' varies over time. A simple plot.
But when I do this:
sns.jointplot(x='YEAR',y='Yeilds', data = summer_pcpn_temp_yeild, kind = 'reg', size = 10)
I am getting the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-251-587582a746b8> in <module>()
3 #ax = plt.axes()
4 #sns_sum_reg_min_temp_pcpn = sns.regplot(x='avgSumPcpn',y='avgMaxSumTemp', data = df_sum_temp_pcpn)
----> 5 sns.jointplot(x='Yeilds',y='YEAR', data = summer_pcpn_temp_yeild, kind = 'reg', size = 10)
6 plt.title('Avg Summer Precipitation vs Yields of Wharton TX', fontsize = 10)
7
//anaconda/lib/python2.7/site-packages/seaborn/distributions.pyc in jointplot(x, y, data, kind, stat_func, color, size, ratio, space, dropna, xlim, ylim, joint_kws, marginal_kws, annot_kws, **kwargs)
793 grid = JointGrid(x, y, data, dropna=dropna,
794 size=size, ratio=ratio, space=space,
--> 795 xlim=xlim, ylim=ylim)
796
797 # Plot the data using the grid
//anaconda/lib/python2.7/site-packages/seaborn/axisgrid.pyc in __init__(self, x, y, data, size, ratio, space, dropna, xlim, ylim)
1637 if dropna:
1638 not_na = pd.notnull(x) & pd.notnull(y)
-> 1639 x = x[not_na]
1640 y = y[not_na]
1641
TypeError: string indices must be integers, not Series
So I printed out the types of each column. Here is how:
for i in summer_pcpn_temp_yeild.columns.values.tolist():
    print type(summer_pcpn_temp_yeild[[i]])
print type(summer_pcpn_temp_yeild.index.values)
which gives me:
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<type 'numpy.ndarray'>
So I can't work out how to fix it.
Any help would be greatly appreciated.
Thanks

Check that the YEAR and Yields columns hold numeric (not string) values.

Try changing 'Yeilds' to 'Yields' in your call to jointplot:
sns.jointplot(x='YEAR', y='Yields', data = summer_pcpn_temp_yeild, kind = 'reg', size = 10)
The error message is misleading. Seaborn can't find a column named "Yeilds" in your summer_pcpn_temp_yeild dataframe, because the dataframe column is spelled "Yields". It then falls back to treating the string 'Yeilds' itself as the data, which is where the "string indices must be integers" complaint comes from.
I had the same problem, and fixed it by correcting the column name passed to sns.jointplot().
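Since the traceback does not mention the missing column, it can help to check the names before plotting. Below is a minimal sketch using only pandas; the helper name missing_columns is hypothetical (it is not part of seaborn or pandas), and the frame holds just two columns copied from the question.

```python
import pandas as pd


def missing_columns(df, *names):
    """Return the requested column names that are not present in df."""
    return [name for name in names if name not in df.columns]


# Two columns copied from the data frame shown in the question.
summer_pcpn_temp_yeild = pd.DataFrame({
    "YEAR": [1970, 1971, 1972, 1973, 1975],
    "Yields": [5000, 5560, 5196, 4233, 5093],
})

# The misspelled name is reported, the correct one is not.
print(missing_columns(summer_pcpn_temp_yeild, "YEAR", "Yeilds"))
print(missing_columns(summer_pcpn_temp_yeild, "YEAR", "Yields"))
```

An empty list means every name passed to the plotting call exists in the frame.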

Related

How to transform a pandas DataFrame with irregular coordinates into an xarray Dataset

I'm working with a pandas DataFrame in Python, but in order to plot my data as a map I have to transform it into an xarray Dataset, since the library I'm using to plot (salem) works best with that class. The problem is that the grid of my data isn't regular, so I can't seem to create the Dataset.
My Dataframe has the latitude and longitude, as well as the value in each point:
lon lat value
0 -104.936302 -51.339233 7.908411
1 -104.827377 -51.127686 7.969049
2 -104.719154 -50.915470 8.036676
3 -104.611641 -50.702595 8.096765
4 -104.504814 -50.489056 8.163690
... ... ... ...
65995 -32.911377 15.359591 25.475702
65996 -32.957718 15.579139 25.443994
65997 -33.004040 15.798100 25.429346
65998 -33.050335 16.016472 25.408105
65999 -33.096611 16.234255 25.383844
[66000 rows x 3 columns]
In order to create the Dataset using lat and lon as coordinates and fill all of the missing values with NaN, I was trying the following:
ds = xr.Dataset({
    'ts': xr.DataArray(
        data = value, # enter data here
        dims = ['lon','lat'],
        coords = {'lon': lon, 'lat':lat},
        attrs = {
            '_FillValue': np.nan,
            'units' : 'K'
        }
    )},
    attrs = {'attr': 'RegCM output'}
)
ds
But I got the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [41], in <cell line: 1>()
1 ds = xr.Dataset({
----> 2 'ts': xr.DataArray(
3 data = value, # enter data here
4 dims = ['lon','lat'],
5 coords = {'lon': lon, 'lat':lat},
6 attrs = {
7 '_FillValue': np.nan,
8 'units' : 'K'
9 }
10 )},
11 attrs = {'example_attr': 'this is a global attribute'}
12 )
14 # ds = xr.Dataset(
15 # data_vars=dict(
16 # variable=(["lon", "lat"], value)
(...)
25 # }
26 # )
27 ds
File ~\anaconda3\lib\site-packages\xarray\core\dataarray.py:406, in DataArray.__init__(self, data, coords, dims, name, attrs, indexes, fastpath)
404 data = _check_data_shape(data, coords, dims)
405 data = as_compatible_data(data)
--> 406 coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
407 variable = Variable(dims, data, attrs, fastpath=True)
408 indexes = dict(
409 _extract_indexes_from_coords(coords)
410 ) # needed for to_dataset
File ~\anaconda3\lib\site-packages\xarray\core\dataarray.py:123, in _infer_coords_and_dims(shape, coords, dims)
121 dims = tuple(dims)
122 elif len(dims) != len(shape):
--> 123 raise ValueError(
124 "different number of dimensions on data "
125 f"and dims: {len(shape)} vs {len(dims)}"
126 )
127 else:
128 for d in dims:
ValueError: different number of dimensions on data and dims: 1 vs 2
I would really appreciate any insights to solve this.
If you really require a rectangularly gridded dataset, you need to resample your data onto a regular grid (rasterio, pyresample etc. provide useful functionality for that). However, if you just want to plot the data, this is not necessary!
I'm not sure about salem (I have never used it), but I've tried my best to simplify plotting of irregularly sampled data in EOmaps, the visualization library I'm developing.
You can get a "contour-plot"-like appearance if you use a Delaunay triangulation to visualize the data:
import pandas as pd
df = pd.read_csv("... path-to df.csv ...", index_col=0)
from eomaps import Maps
m = Maps()
m.add_feature.preset.coastline()
m.set_data(df, x="lon", y="lat", crs=4326, parameter="value")
m.set_shape.delaunay_triangulation()
m.plot_map()
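The ValueError in the question comes from passing a 1-D column (one value per point) while declaring two dimensions. As a rough illustration using pandas only, with the first three rows of the question's frame, pivoting shows what a true lat x lon grid would look like: every (lat, lon) pair that was not observed becomes NaN. This is a sketch, not a recommendation for the full data set; with 66000 points whose coordinates are all distinct, such a grid would be enormous and almost entirely NaN, which is why resampling or a triangulation-based plot is usually more practical.

```python
import pandas as pd

# First three rows of the frame shown in the question.
df = pd.DataFrame({
    "lon": [-104.936302, -104.827377, -104.719154],
    "lat": [-51.339233, -51.127686, -50.915470],
    "value": [7.908411, 7.969049, 8.036676],
})

# One value per point: this array is 1-D, so it cannot fill dims ['lon', 'lat'].
values_1d = df["value"].to_numpy()
print(values_1d.ndim)

# Pivoting puts lat on the rows and lon on the columns;
# pairs that never occur in the data are filled with NaN.
grid = df.pivot(index="lat", columns="lon", values="value")
print(grid.shape)
print(int(grid.isna().sum().sum()))
```

Three scattered points already produce a 3 x 3 grid in which six of the nine cells are empty.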

Annotate csv column in scatter plot

I have two datasets in csv format:
df2
type prediction 100000 155000
0 0 2.60994 3.40305
1 1 10.82100 34.68900
0 0 4.29470 3.74023
0 0 7.81339 9.92839
0 0 28.37480 33.58000
df
TIMESTEP id type y z v_acc
100000 8054 1 -0.317192 -0.315662 15.54430
100000 669 0 0.352031 -0.008087 2.60994
100000 520 0 0.437786 0.000325 5.28670
100000 2303 1 0.263105 0.132615 7.81339
105000 8055 1 0.113863 0.036407 5.94311
I am trying to match the values of df2['100000'] to df['v_acc']. Where a value matches, I make a scatter plot from df with columns y and z. After that I want to annotate the scatter point with the matched value.
What I want is:
(I want all annotations in the same plot.)
I tried to code this in Python, but I am not getting all annotation points in a single plot. Instead I am getting multiple plots with a single annotation each.
I am also getting this error:
TypeError Traceback (most recent call last)
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/IPython/core/formatters.py:339, in BaseFormatter.__call__(self, obj)
337 pass
338 else:
--> 339 return printer(obj)
340 # Finally look for special method names
341 method = get_real_method(obj, self.print_method)
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/IPython/core/pylabtools.py:151, in print_figure(fig, fmt, bbox_inches, base64, **kwargs)
148 from matplotlib.backend_bases import FigureCanvasBase
149 FigureCanvasBase(fig)
--> 151 fig.canvas.print_figure(bytes_io, **kw)
152 data = bytes_io.getvalue()
153 if fmt == 'svg':
File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/matplotlib/backend_bases.py:2295, in FigureCanvasBase.print_figure(self, filename, dpi, facecolor, edgecolor, orientation, format, bbox_inches, pad_inches, bbox_extra_artists, backend, **kwargs)
2289 renderer = _get_renderer(
2290 self.figure,
2291 functools.partial(
2292 print_method, orientation=orientation)
2293 )
2294 with getattr(renderer, "_draw_disabled", nullcontext)():
-> 2295 self.figure.draw(renderer)
2297 if bbox_inches:
...
189 if len(self) == 1:
190 return converter(self.iloc[0])
--> 191 raise TypeError(f"cannot convert the series to {converter}")
TypeError: cannot convert the series to <class 'float'>
Can I get some help to make a plot as I want?
Thank you.
My code is here:
df2 = pd.read_csv('./result.csv')
print(df2.columns)
#print(df2.head(10))
df = pd.read_csv('./main.csv')
df = df[df['TIMESTEP'] == 100000]
for i in df['v_acc']:
    for j in df2['100000']:
        # sometimes numbers are long and differ after the decimals, so matching on 0.2f only
        if "{0:0.2f}".format(i) == "{0:0.2f}".format(j):
            plt.figure(figsize = (10,8))
            sns.scatterplot(data = df, x = "y", y = "z", hue = "type", palette=['red','dodgerblue'], legend='full')
            plt.annotate(i, (df['y'][df['v_acc'] == i], df['z'][df['v_acc'] == i]))
            plt.grid(False)
            plt.show()
            break
The reason for the multiple plots is that you are calling plt.figure() inside the loop, which creates a new figure every time a value matches. Create the figure once, before the loop, and keep only the scatter and annotate calls inside it. Here is the updated code, which ran on the data you provided. Other than that, I think your code is fine.
fig, ax=plt.subplots(figsize = (7,7)) ### Keep this before the loop and call it as subplot
for i in df['v_acc']:
    for j in df2[100000]:
        # sometimes numbers are long and differ after the decimals, so matching on 0.2f only
        if "{0:0.2f}".format(i) == "{0:0.2f}".format(j):
            #plt.figure(figsize = (10,8))
            ax=sns.scatterplot(data = df, x = "y", y = "z", hue = "type", palette=['red','dodgerblue'], legend='full')
            ax.annotate(i, (df['y'][df['v_acc'] == i], df['z'][df['v_acc'] == i]))
            break
plt.grid(False) ### Keep these two after the loop, just one show for one plot
plt.show()
Output plot
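A possible variation, sketched under a few assumptions: the matching is done once with pandas instead of two nested loops, rounding to two decimals stands in for the "{0:0.2f}" string comparison, and each annotation receives plain floats rather than a Series (a Series with more than one element is what triggers "cannot convert the series to <class 'float'>"). The helper names matching_rows and annotate_matches are hypothetical. The column is addressed as the string "100000" because read_csv returns header names as strings.

```python
import pandas as pd


def matching_rows(df, df2, column="100000", decimals=2):
    """Return rows of df whose v_acc equals a value in df2[column] after rounding."""
    targets = df2[column].round(decimals)
    return df[df["v_acc"].round(decimals).isin(targets)]


def annotate_matches(ax, rows):
    """Label every matched point on one existing Axes, passing scalars to annotate."""
    for row in rows.itertuples(index=False):
        ax.annotate(f"{row.v_acc:.2f}", (float(row.y), float(row.z)))


# Rows with TIMESTEP == 100000, copied from the question.
df = pd.DataFrame({
    "TIMESTEP": [100000, 100000, 100000, 100000],
    "id": [8054, 669, 520, 2303],
    "type": [1, 0, 0, 1],
    "y": [-0.317192, 0.352031, 0.437786, 0.263105],
    "z": [-0.315662, -0.008087, 0.000325, 0.132615],
    "v_acc": [15.54430, 2.60994, 5.28670, 7.81339],
})
df2 = pd.DataFrame({"100000": [2.60994, 10.82100, 4.29470, 7.81339, 28.37480]})

matches = matching_rows(df, df2)
print(matches[["id", "y", "z", "v_acc"]])
```

With matplotlib, the idea would be to create the axes once with fig, ax = plt.subplots(), draw the scatter once with sns.scatterplot(..., ax=ax), call annotate_matches(ax, matches), and call plt.show() a single time.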

Error when trying to use seasonal_decompose from Statsmodels

I have the following dataframe called 'data':
Month       Revenue Index
1920-01-01  1.72
1920-02-01  1.83
1920-03-01  1.94
...         ...
2021-10-01  114.20
2021-11-01  115.94
2021-12-01  116.01
This is essentially a monthly revenue index on which I am trying to use seasonal_decompose with the following code:
result = seasonal_decompose(data['Revenue Index'], model='multiplicative')
But unfortunately I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-39-08e3139bbf77> in <module>()
----> 1 result = seasonal_decompose(data['Consumptieprijsindex'], model='multiplicative')
2 rcParams['figure.figsize'] = 12, 6
3 plt.rc('lines', linewidth=1, color='r')
4
5 fig = result.plot()
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/seasonal.py in seasonal_decompose(x, model, filt, freq, two_sided, extrapolate_trend)
125 freq = pfreq
126 else:
--> 127 raise ValueError("You must specify a freq or x must be a "
128 "pandas object with a timeseries index with "
129 "a freq not set to None")
ValueError: You must specify a freq or x must be a pandas object with a timeseries index with a freq not set to None
Does anyone know how to solve this issue? Thanks!
The following code in the comments answered my question:
result = seasonal_decompose(data['Revenue Index'], model='multiplicative', period=12)
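The error message points at the index: a DatetimeIndex built from a column of dates carries no frequency unless one is set. The sketch below uses only pandas and the first three rows from the question to show this, along with one way to attach a monthly frequency via asfreq("MS"). It assumes the dates are consecutive month starts, as they appear to be; asfreq would insert NaN rows for any missing month. Passing period=12 is the other route and leaves the index untouched.

```python
import pandas as pd

# First three rows of the question's table, with Month as the index.
data = pd.DataFrame(
    {"Revenue Index": [1.72, 1.83, 1.94]},
    index=pd.to_datetime(["1920-01-01", "1920-02-01", "1920-03-01"]),
)
data.index.name = "Month"

# No frequency is attached when the index is parsed from date strings.
print(data.index.freq)

# Conform the index to month-start frequency so it carries freq="MS".
monthly = data.asfreq("MS")
print(monthly.index.freqstr)
```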

List becoming nonetype in python

x = (long list of data)
mymap = map(int, x.split())
box = []
mylist = list(mymap)
while len(mylist)>0:
    box.append([str(mylist[1])]*mylist[0])
    mylist = mylist[2:]
box.sort()
print(type(box))
type(box)
p=sns.displot(data = box)
p.set(xlabel = "Waiting time", ylabel = "Eruptions")
This is my code to create a histogram in sage from an extremely long list of data. The data is all like "3600 79 2800 58" etc with value and then frequency. Everything works well except the histogram generation itself. I've already tried outputting the list and it prints out perfectly fine.
This is the output when I run it:
<class 'list'>
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-26-87f016146a7f> in <module>
9 print(type(box))
10 type(box)
---> 11 p=sns.displot(data = box)
12 p.set(xlabel = "Waiting time", ylabel = "Eruptions")
/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py in displot(data, x, y, hue, row, col, weights, kind, rug, rug_kws, log_scale, legend, palette, hue_order, hue_norm, color, col_wrap, row_order, col_order, height, aspect, facet_kws, **kwargs)
2225
2226 _assign_default_kwargs(hist_kws, p.plot_univariate_histogram, histplot)
-> 2227 p.plot_univariate_histogram(**hist_kws)
2228
2229 else:
/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py in plot_univariate_histogram(self, multiple, element, fill, common_norm, common_bins, shrink, kde, kde_kws, color, legend, line_kws, estimate_kws, **plot_kws)
422
423 # First pass through the data to compute the histograms
--> 424 for sub_vars, sub_data in self.iter_data("hue", from_comp_data=True):
425
426 # Prepare the relevant data
/usr/local/lib/python3.8/dist-packages/seaborn/_core.py in iter_data(self, grouping_vars, reverse, from_comp_data)
994 grouping_keys.append(self.var_levels.get(var, []))
995
--> 996 iter_keys = itertools.product(*grouping_keys)
997 if reverse:
998 iter_keys = reversed(list(iter_keys))
TypeError: 'NoneType' object is not iterable
Clearly box is a list since type(box) returns list, so what am I missing here? What is making its type become none?
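Instead of while, use a for loop that walks through the list two items at a time, and use extend so that box stays a flat list of numbers rather than a list of lists (the nested list is most likely what displot trips over). Something like this:
x = (long list of data)
mymap = map(int, x.split())
box = []
mylist = list(mymap)
for i in range(0, len(mylist), 2):
    # repeat the second number of each pair as many times as the first one says
    box.extend([mylist[i + 1]] * mylist[i])
box.sort()
print(type(box))
p = sns.displot(x = box)
p.set(xlabel = "Waiting time", ylabel = "Eruptions")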
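To see the shape of the list that reaches the plotting call, here is a small self-contained sketch without seaborn. It assumes, as the question's code does, that each pair in the text is (count, value), that is, the second number is repeated as many times as the first number says. The helper name expand_counts and the sample string are made up for illustration.

```python
def expand_counts(text):
    """Turn 'count value count value ...' into a sorted flat list of values."""
    numbers = [int(token) for token in text.split()]
    flat = []
    # Even positions hold the counts, odd positions hold the values.
    for count, value in zip(numbers[0::2], numbers[1::2]):
        flat.extend([value] * count)
    return sorted(flat)


# Made-up sample: 79 appears three times, 58 appears twice.
box = expand_counts("3 79 2 58")
print(box)
print(any(isinstance(item, list) for item in box))
```

Every element of the result is a plain number, which is the form a histogram function expects for a single variable.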

Binning a series returns a seemingly unrelated TypeError

I am trying to slice a dataframe I created into bins:
picture of dataframe in case it's relevant
# create bins and labels
bins = [575, 600, 625, 650]
labels = [
    "$575-$599",
    "$600-$624",
    "$625-$649",
    "$650-$675"
]
schoolSummary["Spending Range"] = pd.cut(schoolSummary["Per Student Budget"], bins, labels = labels)
For some reason, I receive this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-73-b938397739fa> in <module>()
9
10 #schoolSummary["Spending Range"] =
---> 11 pd.cut(schoolSummary["Per Student Budget"], bins, labels = labels)
~\Anaconda3\envs\py36\lib\site-packages\pandas\core\reshape\tile.py in cut(x, bins, right, labels, retbins, precision, include_lowest, duplicates)
232 include_lowest=include_lowest,
233 dtype=dtype,
--> 234 duplicates=duplicates)
235
236 return _postprocess_for_cut(fac, bins, retbins, x_is_series,
~\Anaconda3\envs\py36\lib\site-packages\pandas\core\reshape\tile.py in _bins_to_cuts(x, bins, right, labels, precision, include_lowest, dtype, duplicates)
335
336 side = 'left' if right else 'right'
--> 337 ids = _ensure_int64(bins.searchsorted(x, side=side))
338
339 if include_lowest:
TypeError: '<' not supported between instances of 'int' and 'str'
I'm confused, because I did not use '<' in the code at all. I also used
print(type(schoolSummary["Per Student Budget"]))
and it is a series object, so I don't know what 'int' and 'str' it's referring to. Is it a problem with my bins or labels?
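Due to low rep I can't comment on your question, so I'm posting this as an answer.
The '<' comes from pandas itself: pd.cut compares every value in the column against the bin edges, and the error says one side of that comparison is a str. So check schoolSummary["Per Student Budget"].dtype; if it is object, the column holds text and has to be converted to numbers first. Also note that four bin edges define only three bins, so four labels need a fifth edge. You could try the following:
bins = [575, 600, 625, 650, 675]
labels = [
    "$575-$599",
    "$600-$624",
    "$625-$649",
    "$650-$675"
]
# assumes the values convert cleanly; strip any "$" or "," from the strings first
budget = pd.to_numeric(schoolSummary["Per Student Budget"])
schoolSummary["Spending Range"] = pd.cut(budget, bins, labels = labels)
bins can be a list of edges or a single int (the number of equal-width bins); either is fine as long as the data being binned is numeric.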
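For reference, a self-contained sketch of binning a budget column that was read in as text. The three budget strings are made up, and the "$" prefix is an assumption about how the column might be formatted. The fifth edge, 675, is inferred from the last label in the question. pd.cut needs one more edge than labels.

```python
import pandas as pd

# Hypothetical data: budgets stored as text, which is what triggers the TypeError.
school_summary = pd.DataFrame({"Per Student Budget": ["$578.00", "$612.50", "$651.25"]})

# Strip the currency sign and convert to numbers before binning.
budget = pd.to_numeric(
    school_summary["Per Student Budget"].str.replace("$", "", regex=False)
)

bins = [575, 600, 625, 650, 675]   # five edges define four bins
labels = ["$575-$599", "$600-$624", "$625-$649", "$650-$675"]

school_summary["Spending Range"] = pd.cut(budget, bins, labels=labels)
print(school_summary)
```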
