Given some categorical data like:
import pandas as pd
data = pd.Series(["NY", "NY", "CL", "TX", "CL", "FL", "NY", "FL"])
In the original data, this is a column in a DataFrame. I want to plot it via sns.catplot() like so:
import seaborn as sns
import matplotlib.pyplot as plt
sns.catplot(x=data, kind="count")
But I get this error:
Traceback (most recent call last):
File "C:\Users\%USERNAME%\PycharmProjects\Troubleshooting\temp.py", line 6, in <module>
sns.catplot(x=my_data, kind="count")
File "C:\Users\%USERNAME%\Troubleshooting\lib\site-packages\seaborn\categorical.py", line 3241, in catplot
g = FacetGrid(**facet_kws)
File "C:\Users\%USERNAME%\Troubleshooting\lib\site-packages\seaborn\axisgrid.py", line 403, in __init__
none_na = np.zeros(len(data), bool)
TypeError: object of type 'NoneType' has no len()
The Series / Data Frame has a shape, length etc. so I don't understand where the error message comes from. What is wrong, and how do I fix it?
I know that sns.countplot() will work with this input, but I need to use catplot in order to create the countplot.
It doesn't really make sense to use a catplot with a Series, as this higher level function is relevant when multiple columns with categories are present to automatically generate a FacetGrid.
Anyway, if you really want to use catplot, you'll have to convert to DataFrame and pass the data to data, not x (that is for the column name in data):
sns.catplot(data=data.to_frame('x-label'), x='x-label', kind="count")
You should use sns.countplot instead:
data = pd.Series(["NY", "NY", "CL", "TX", "CL", "FL", "NY", "FL"])
sns.countplot(x=data)
I figured it out. Thank you guys.
The issue is that catplot (at least for DataFrames) needs the DataFrame passed explicitly via the "data" parameter, and "x" must then be only the column name. It isn't enough to pass "x=df["column_name"]".
import seaborn as sns
import pandas as pd
# my_data is any DataFrame you have
sns.catplot(data=my_data, x="Column 1", kind="count")
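To make the conversion concrete, here is a minimal sketch using the Series from the question. The column name "state" is only an illustrative choice, and the tally is done with pandas alone so the expected bar heights can be checked without drawing anything:

```python
import pandas as pd

data = pd.Series(["NY", "NY", "CL", "TX", "CL", "FL", "NY", "FL"])

# Turn the Series into a one-column DataFrame; "state" is a hypothetical name.
frame = data.to_frame("state")

# These are the heights a count plot of that column should show.
counts = frame["state"].value_counts()
print(counts)

# With the frame in hand, the plotting call would then be:
# sns.catplot(data=frame, x="state", kind="count")
```

The point is only that catplot receives a DataFrame through data and a column name through x.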
I have yearly average closing values for an asset in a DataFrame, and I need to find the structural breaks in the time series. I intended to do this using the statsmodels 'seasonal_decompose' function, but I am having trouble implementing it.
Example data below
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
data = {'Year':['1991','1992','1993','1994','1995','1996','1997','1998','1999','2000','2001','2002','2003','2004'],'Close':[11,22,55,34,447,85,99,86,83,82,81,34,33,36]}
df = pd.DataFrame(data)
df['Year'] = df['Year'].astype(str)
sd = seasonal_decompose(df)
plt.show()
ValueError: You must specify a period or x must be a pandas object with a DatetimeIndex with a freq not set to None
When I change the 'Year' column to date time, I get the following issue:
TypeError: float() argument must be a string or a number, not 'Timestamp'
I do not know what the issue is. I have no missing values? Secondary to this, does anybody know a more efficient method to identify structural breaks in time series data?
Thanks
The problem is that you need to set column Year as the index after converting the string values to datetime (from the ValueError message: a pandas object with a DatetimeIndex).
So, e.g.:
from statsmodels.tsa.seasonal import seasonal_decompose
import pandas as pd
data = {'Year':['1991','1992','1993','1994','1995','1996','1997','1998','1999','2000','2001','2002','2003','2004'],'Close':[11,22,55,34,447,85,99,86,83,82,81,34,33,36]}
df = pd.DataFrame(data)
df['Year'] = pd.to_datetime(df['Year'])
df.set_index('Year', drop=True, inplace=True)
sd = seasonal_decompose(df)
sd.plot()
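If it helps to check the index before calling seasonal_decompose, here is a small pandas-only sketch with the data from the question. Passing format="%Y" is my own addition to make the parsing explicit; it is not required by the answer above:

```python
import pandas as pd

data = {
    "Year": ["1991", "1992", "1993", "1994", "1995", "1996", "1997",
             "1998", "1999", "2000", "2001", "2002", "2003", "2004"],
    "Close": [11, 22, 55, 34, 447, 85, 99, 86, 83, 82, 81, 34, 33, 36],
}
df = pd.DataFrame(data)

# Parse the year strings explicitly, then move them into the index.
df["Year"] = pd.to_datetime(df["Year"], format="%Y")
df = df.set_index("Year")

# The index should now be a DatetimeIndex with a regular yearly spacing.
print(type(df.index).__name__)
print(df.index.inferred_freq)
```

As the ValueError itself suggests, the other route is to specify a period explicitly instead of relying on the index frequency.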
I want to use pandas together with uncertainties. I am getting a strange error; below is an MWE:
from uncertainties import ufloat
import pandas
number_with_uncertainty = ufloat(2,1)
df = pandas.DataFrame({'a': [number_with_uncertainty]}) # This line works fine.
df.loc[0,'b'] = ufloat(3,1) # This line fails.
I have noticed that if I try to add the ufloats "on the fly", as I usually do with a float or some other stuff, it fails. If I first create a Series then it works:
from uncertainties import ufloat
import pandas
number_with_uncertainty = ufloat(2,1)
df = pandas.DataFrame({'a': [number_with_uncertainty]}) # This line works fine.
df['b'] = pandas.Series([ufloat(3,1)]) # Now it works.
print(df)
This makes it more cumbersome when calculating values on the fly within a loop as I have to create a temporary Series and after the loop add it as a column into my data frame.
Is this a problem of Pandas, a problem of Uncertainties, or am I doing something that is not supposed to be done?
The problem arises because when pandas tries to create a new column it checks the dtype of the new value so that it knows what dtype to assign to that column. For some reason, the dtype check on the ufloat value fails. I believe this is a bug that will have to be fixed in uncertainties.
A workaround in the interim is to manually create the new column with dtype set to object, for example in your case above:
from uncertainties import ufloat
import pandas
import numpy
number_with_uncertainty = ufloat(2,1)
df = pandas.DataFrame({'a': [number_with_uncertainty]}) # This line works fine.
# create a new column with the correct dtype
df.loc[:, 'b'] = numpy.zeros(len(df), dtype=object)
df.loc[0,'b'] = ufloat(3,1) # This line now works.
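For the loop case mentioned in the question, the same idea can be sketched as follows. To keep the snippet runnable without extra packages I use fractions.Fraction as a stand-in for ufloat (that substitution is an assumption of this sketch; any Python object stored in an object column should behave the same way):

```python
from fractions import Fraction

import pandas as pd

# Fraction stands in for ufloat here; both are plain Python objects to pandas.
df = pd.DataFrame({"a": [Fraction(2, 1), Fraction(5, 2)]})

# Pre-create the column with object dtype so pandas does not infer a dtype later.
df["b"] = pd.Series([None] * len(df), index=df.index, dtype=object)

# Values can now be filled in one at a time inside a loop.
for row in df.index:
    df.at[row, "b"] = df.at[row, "a"] + Fraction(1, 3)

print(df)
```

Creating the object column once up front avoids building a temporary Series after the loop.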
My Goal
Display a bar chart showing the names and durations of the first 30 Netflix shows from a .CSV file
Relevant Code after Trial & Error
names = pd.read_csv("netflix_titles.csv", nrows=31, usecols=[2])
durations = pd.read_csv("netflix_titles.csv", nrows=31, usecols=[9])
durations[['duration']] = durations[['duration']].astype(int)
Then I plot it.
plt.bar(names,durations)
plt.title("Show Durations")
plt.xlabel("Name of Shows")
plt.ylabel("Durations (In Minutes)")
plt.show()
31 rows are read because the first row is the header. durations is converted to integers because the values in that column are read as strings (or some other type) and wouldn't work with matplotlib.
Error Message
TypeError: unhashable type: 'numpy.ndarray'
I don't think Numpy applies with what I'm trying to do, so I'm at a dead end here.
This was able to print out a bar chart for the first 31 values
dataset = pd.read_csv("netflix_titles.csv")
names = dataset['title'].head(31)
durations = dataset['duration'].head(31)
plt.bar(names,durations)
plt.title("Show Durations")
plt.xlabel("Name of Shows")
plt.ylabel("Durations (In Minutes)")
plt.show()
The problem is that you are making two different DataFrames from the csv file and trying to plot them against each other. While this is possible, a much simpler approach is to create a single DataFrame from the selected columns and rows of the csv file and then plot it, as demonstrated below:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv("netflix_titles.csv", nrows=31, usecols=[2,9])
df.columns = ['name', 'duration']
df['duration'] = df['duration'].astype(int)
df.set_index('name', inplace=True)
df.plot(kind = 'bar')
plt.title("Show Durations")
plt.xlabel("Name of Shows")
plt.ylabel("Durations (In Minutes)")
plt.show()
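To see why the original attempt hands plt.bar the wrong kind of object, here is a small sketch with an in-memory CSV. The three rows and the column names are made up for illustration and are not the real netflix_titles.csv:

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file.
csv_text = "show_id,title,duration\ns1,Alpha,90\ns2,Beta,45\ns3,Gamma,120\n"

# read_csv with usecols returns a DataFrame (2-D), even for a single column.
names_df = pd.read_csv(io.StringIO(csv_text), usecols=[1])
print(names_df.shape)

# Selecting a column by name gives a 1-D Series, which is what plt.bar expects.
df = pd.read_csv(io.StringIO(csv_text), usecols=["title", "duration"])
names = df["title"]
durations = df["duration"].astype(int)
print(names.shape, durations.shape)

# nrows counts data rows only; the header line is not included in the count.
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(len(first_two))
```

So for the first 30 shows, nrows=30 should be enough, and plt.bar(names, durations) can be called with the two Series.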
I created a random dataFrame simulating the dataset tips from seaborn:
import numpy as np
import pandas as pd
time = ['day','night']
sex = ['female','male']
smoker = ['yes','no']
for t in range(0,len(time)):
    for s in range(0,len(sex)):
        for sm in range(0,len(smoker)):
            randomarray = np.random.rand(10)*10
            if t == 0 and s == 0 and sm == 0:
                df = pd.DataFrame(index=np.arange(0,len(randomarray)),columns=["total_bill","time","sex","smoker"])
                L = 0
                for i in range(0,len(randomarray)):
                    df.loc[i] = [randomarray[i], time[t], sex[s], smoker[sm]]
                    L = L + 1
            else:
                for i in range(0,len(randomarray)):
                    df.loc[i+L] = [randomarray[i], time[t], sex[s], smoker[sm]]
                    L = L + 1
My dataFrame df has, for each column, the same type of class as the dataFrame tips from seaborn's dataset:
tips = sns.load_dataset("tips")
type(tips["total_bill"][0])
type(tips["time"][0])
numpy.float64
str
And so on for the other columns. Same as my dataFrame:
type(df["total_bill"][0])
type(df["time"][0])
numpy.float64
str
However, when I try to use seaborn's violinplot or factorplot following the documentation:
g = sns.factorplot(x="sex", y="total_bill", hue="smoker", col="time", data=df, kind="violin", split=True, size=4, aspect=.7);
I have no problems if I use the dataFrame tips, but when I use my dataFrame I get:
AttributeError: 'float' object has no attribute 'shape'
I imagine this is an issue with the way I pass the array into the DataFrame, but I couldn't find the problem: every report I found with the same AttributeError says it happens because the values are not of the same type, and as shown above my DataFrame has the same types as the one in seaborn's documentation.
Any suggestions?
I had the same problem and could not find the answer I was looking for, so I guess providing one here may help people like me.
The problem here is that the dtype of df.total_bill is object instead of float.
So the solution is to change it to float before passing the DataFrame to seaborn:
df.total_bill = df.total_bill.astype(float)
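Here is a short sketch of how the object dtype comes about, assuming (as in the question) that the frame is created empty and then filled row by row. Each individual value is a float, but the column itself keeps the dtype it was created with:

```python
import numpy as np
import pandas as pd

# An empty frame like the one in the question: every column starts as object.
df = pd.DataFrame(index=np.arange(3), columns=["total_bill", "time"])
for i, value in enumerate(np.random.rand(3) * 10):
    df.loc[i] = [value, "day"]

# Checking one element is misleading; the column dtype is what matters.
before = df["total_bill"].dtype
print(before)

# Convert the column explicitly before handing the frame to seaborn.
df["total_bill"] = df["total_bill"].astype(float)
print(df["total_bill"].dtype)
```

Checking df.dtypes rather than type(df["total_bill"][0]) is the quickest way to spot this.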
This is a rather unusual way of creating a dataframe. The resulting dataframe also has some very strange properties, e.g. it has a length of 50 but the last index is 88. I'm not going into debugging these nested loops. Instead, I would propose to create the dataframe from some numpy array, e.g. like
import numpy as np
import pandas as pd
time = ['day','night']
sex = ['female','male']
smoker = ['yes','no']
data = np.repeat(np.stack(np.meshgrid(time, sex, smoker), -1).reshape(-1,3), 10, axis=0)
df = pd.DataFrame(data, columns=["time","sex","smoker"])
df["total_bill"] = np.random.rand(len(df))*10
Then also plotting works fine:
g = sns.factorplot(x="sex", y="total_bill", hue="smoker", col="time", data=df,
                   kind="violin", size=4, aspect=.7)
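If the meshgrid one-liner is hard to read, roughly the same frame can be sketched with itertools.product. This is just an alternative I find easier to follow, not what the answer above uses, and the row order differs from the meshgrid version:

```python
import itertools

import numpy as np
import pandas as pd

time = ["day", "night"]
sex = ["female", "male"]
smoker = ["yes", "no"]

# All 8 (time, sex, smoker) combinations, each appearing 10 times.
combos = list(itertools.product(time, sex, smoker))
df = pd.DataFrame(combos * 10, columns=["time", "sex", "smoker"])

# Assigning a whole NumPy float array keeps the column as float64.
df["total_bill"] = np.random.rand(len(df)) * 10

print(len(df), df.index[-1])
print(df["total_bill"].dtype)
```

The index runs from 0 to 79 without gaps, and total_bill is a float column, so the object dtype problem does not come up.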
Convert the data type of your variable from object to say float/int.
I had a different issue in my code that produced a similar error:
'str' object has no attribute 'get'
In my seaborn call I had ...data='df'..., but df is an object and should not be in quotes. Once I removed the quotes, my program worked perfectly. I made the mistake, as someone else might, because the x= and y= arguments are in quotes (they are column names in the DataFrame).
In extension to my previous question
I can plot the heat map with seaborn and, following the suggestion there, get annotations. But I see a new problem now.
Input File
Nos,Place,Way,Name,00:00:00,12:00:00
123,London,Air,Apollo,342,972
123,London,Rail,Beta,2352,342
123,Paris,Bus,Beta,545,353
345,Paris,Bus,Rava,652,974
345,Rome,Bus,Rava,2325,56
345,London,Air,Rava,2532,9853
567,Paris,Air,Apollo,545,544
567,Rome,Rail,Apollo,5454,5
876,Japan,Rail,Apollo,644,54
876,Japan,Bus,Beta,45,57
876,Japan,Bus,Beta,40,57
876,Japan,Bus,Beta,40,57
Program:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
df = pd.read_csv('heat_map_data.csv')
df3 = df.copy()
for c in ['Place','Name']:
    df3[c] = df3[c].astype('category')
sns.heatmap(df3.pivot_table(index='Place', columns='Name', values='00:00:00' ),annot=True, fmt='.1f' )
plt.show()
If I use fmt='d', I get an error about float values, so I switched to a float format, and the heat map shows the values of the desired column.
But when the same axis values repeat, it does not add up the values from the desired column. Is there any solution for that?
As it is seen in the input file
876,Japan,Bus,Beta,45,57
876,Japan,Bus,Beta,40,57
876,Japan,Bus,Beta,40,57
These 3 rows repeat the same Place and Name, and their values should be shown as a sum:
the cell for Japan and Beta should be annotated with 125, but instead it shows 41.7. How do I achieve that? Also, is it possible to show two values as the annotation?
My second question: in the pivot I currently hard-code values='00:00:00', but I need it to dynamically read the last column from the file.
You can use the aggfunc keyword passing in a dict:
aggfunc :
function, default numpy.mean, or list of functions
If list of functions passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves)
sns.heatmap(df3.pivot_table(index='Place', columns='Name',
values='00:00:00',aggfunc={'00:00:00':np.sum}), annot=True, fmt='.1f')
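Here is a pandas-only sketch of both points, using the input file from the question loaded from a string. Passing aggfunc="sum" as a string and picking the last column with df.columns[-1] are my own choices for this sketch:

```python
import io

import pandas as pd

csv_text = """Nos,Place,Way,Name,00:00:00,12:00:00
123,London,Air,Apollo,342,972
123,London,Rail,Beta,2352,342
123,Paris,Bus,Beta,545,353
345,Paris,Bus,Rava,652,974
345,Rome,Bus,Rava,2325,56
345,London,Air,Rava,2532,9853
567,Paris,Air,Apollo,545,544
567,Rome,Rail,Apollo,5454,5
876,Japan,Rail,Apollo,644,54
876,Japan,Bus,Beta,45,57
876,Japan,Bus,Beta,40,57
876,Japan,Bus,Beta,40,57
"""
df = pd.read_csv(io.StringIO(csv_text))

# Sum repeated (Place, Name) pairs instead of averaging them (the default).
sums = df.pivot_table(index="Place", columns="Name",
                      values="00:00:00", aggfunc="sum")
print(sums.loc["Japan", "Beta"])

# Pick the last column of the file dynamically instead of hard-coding it.
last_col = df.columns[-1]
last_sums = df.pivot_table(index="Place", columns="Name",
                           values=last_col, aggfunc="sum")
print(last_col, last_sums.loc["Japan", "Beta"])

# Either table can then be passed to the heat map, e.g.:
# sns.heatmap(sums, annot=True, fmt=".0f")
```

With the sum, the Japan/Beta cell holds 125 (45 + 40 + 40) rather than the mean of 41.7.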