AttributeError: 'float' object has no attribute 'shape' when using seaborn

I created a random dataFrame simulating the dataset tips from seaborn:
import numpy as np
import pandas as pd
time = ['day','night']
sex = ['female','male']
smoker = ['yes','no']
for t in range(0,len(time)):
    for s in range(0,len(sex)):
        for sm in range(0,len(smoker)):
            randomarray = np.random.rand(10)*10
            if t == 0 and s == 0 and sm == 0:
                df = pd.DataFrame(index=np.arange(0,len(randomarray)),columns=["total_bill","time","sex","smoker"])
                L = 0
                for i in range(0,len(randomarray)):
                    df.loc[i] = [randomarray[i], time[t], sex[s], smoker[sm]]
                    L = L + 1
            else:
                for i in range(0,len(randomarray)):
                    df.loc[i+L] = [randomarray[i], time[t], sex[s], smoker[sm]]
                    L = L + 1
My dataFrame df has, for each column, the same type of class as the dataFrame tips from seaborn's dataset:
import seaborn as sns
tips = sns.load_dataset("tips")
type(tips["total_bill"][0])
type(tips["time"][0])
numpy.float64
str
And so on for the other columns. Same as my dataFrame:
type(df["total_bill"][0])
type(df["time"][0])
numpy.float64
str
However, when I try to use seaborn's violinplot or factorplot following the documentation:
g = sns.factorplot(x="sex", y="total_bill", hue="smoker", col="time", data=df, kind="violin", split=True, size=4, aspect=.7);
I have no problems if I use the dataFrame tips, but when I use my dataFrame I get:
AttributeError: 'float' object has no attribute 'shape'
I imagine this is an issue with the way I pass the array into the dataFrame, but I couldn't find the problem: every report I found on the internet with the same AttributeError says it happens because the values are not of the same class, and as shown above my dataFrame has the same classes as the one in seaborn's documentation.
Any suggestions?

I had the same problem and was trying to find a solution, but did not see the answer I was looking for. So I guess providing an answer here may help people like me.
The problem here is that the type of df.total_bill is object instead of float.
So the solution is to change it to float before passing the dataframe to seaborn:
df.total_bill = df.total_bill.astype(float)
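To see where the object dtype comes from, here is a minimal sketch (the three-row frame and its values are made up for illustration). A DataFrame created with only an index and column names starts out with object columns, and filling it row by row with .loc does not change that, even though each stored value is a float:

```python
import numpy as np
import pandas as pd

# An empty frame with no data and no dtype: every column is object.
df = pd.DataFrame(index=np.arange(3), columns=["total_bill", "sex"])

# Fill it row by row, as in the question (illustrative values).
for i in range(3):
    df.loc[i] = [float(i) * 1.5, "female"]

dtype_before = df["total_bill"].dtype   # still object
df["total_bill"] = df["total_bill"].astype(float)
dtype_after = df["total_bill"].dtype    # now float64

print(dtype_before, dtype_after)
```

Checking df.dtypes (rather than the type of a single element) is what reveals the difference from the tips dataset.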

This is a rather unusual way of creating a dataframe. The resulting dataframe also has some very strange properties, e.g. it has a length of 50 but the last index is 88. I'm not going into debugging these nested loops. Instead, I would propose to create the dataframe from some numpy array, e.g. like
import numpy as np
import pandas as pd
time = ['day','night']
sex = ['female','male']
smoker = ['yes','no']
data = np.repeat(np.stack(np.meshgrid(time, sex, smoker), -1).reshape(-1,3), 10, axis=0)
df = pd.DataFrame(data, columns=["time","sex","smoker"])
df["total_bill"] = np.random.rand(len(df))*10
Then also plotting works fine:
g = sns.factorplot(x="sex", y="total_bill", hue="smoker", col="time", data=df,
                   kind="violin", size=4, aspect=.7)
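If the meshgrid one-liner is hard to read, the same kind of frame can be sketched with itertools.product. This is only an alternative construction under the same assumptions (10 random values per combination), not something the answer above prescribes:

```python
import itertools

import numpy as np
import pandas as pd

time = ["day", "night"]
sex = ["female", "male"]
smoker = ["yes", "no"]

# Every (time, sex, smoker) combination, each repeated 10 times.
rows = [combo
        for combo in itertools.product(time, sex, smoker)
        for _ in range(10)]

df = pd.DataFrame(rows, columns=["time", "sex", "smoker"])
# Assigning a NumPy float array gives the column a float64 dtype directly.
df["total_bill"] = np.random.rand(len(df)) * 10

print(len(df), df.index[-1], df["total_bill"].dtype)
```

Either way the frame has 80 rows with a contiguous index, and total_bill is numeric from the start.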

Convert the data type of your variable from object to a numeric type such as float or int.

I had a different issue in my code that produced a similar error:
'str' object has no attribute 'get'
In my seaborn call I had written ...data='df'..., but df is an object and should not be in quotes. Once I removed the quotes, my program worked perfectly. I made the mistake, as someone else might, because the x= and y= parameters are in quotes (they are column names in the dataframe).

Related

Pandas+Uncertainties producing AttributeError: type object 'dtype' has no attribute 'kind'

I want to use Pandas + Uncertainties. I am getting a strange error, below a MWE:
from uncertainties import ufloat
import pandas
number_with_uncertainty = ufloat(2,1)
df = pandas.DataFrame({'a': [number_with_uncertainty]}) # This line works fine.
df.loc[0,'b'] = ufloat(3,1) # This line fails.
I have noticed that if I try to add the ufloats "on the fly", as I usually do with a float or some other stuff, it fails. If I first create a Series then it works:
from uncertainties import ufloat
import pandas
number_with_uncertainty = ufloat(2,1)
df = pandas.DataFrame({'a': [number_with_uncertainty]}) # This line works fine.
df['b'] = pandas.Series([ufloat(3,1)]) # Now it works.
print(df)
This makes it more cumbersome when calculating values on the fly within a loop as I have to create a temporary Series and after the loop add it as a column into my data frame.
Is this a problem of Pandas, a problem of Uncertainties, or am I doing something that is not supposed to be done?
The problem arises because when pandas tries to create a new column it checks the dtype of the new value so that it knows what dtype to assign to that column. For some reason, the dtype check on the ufloat value fails. I believe this is a bug that will have to be fixed in uncertainties.
A workaround in the interim is to manually create the new column with dtype set to object, for example in your case above:
from uncertainties import ufloat
import pandas
import numpy
number_with_uncertainty = ufloat(2,1)
df = pandas.DataFrame({'a': [number_with_uncertainty]}) # This line works fine.
# create a new column with the correct dtype
df.loc[:, 'b'] = numpy.zeros(len(df), dtype=object)
df.loc[0,'b'] = ufloat(3,1) # This line now works.
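For the loop use case mentioned in the question, one option is to collect the values in a plain list and attach them as a single object column afterwards. The sketch below uses a hypothetical Measurement dataclass as a stand-in for ufloat so that it runs without the uncertainties package; whether real ufloat values behave identically is an assumption:

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class Measurement:
    """Hypothetical stand-in for a ufloat: nominal value plus uncertainty."""
    nominal: float
    std_dev: float


df = pd.DataFrame({"a": [Measurement(2, 1), Measurement(5, 2)]})

# Compute the new values inside the loop, storing them in a list...
computed = []
for m in df["a"]:
    computed.append(Measurement(m.nominal + 1, m.std_dev))

# ...then add them as one object-dtype column after the loop.
df["b"] = pd.Series(computed, index=df.index, dtype=object)

print(df)
```

This avoids setting individual cells in a column that does not exist yet, which is the step that triggers the dtype check.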

Why can data read from a .CSV file with Pandas not be plotted using matplotlib after turning it into integers?

My Goal
Display a bar chart showing the names and durations of the first 30 Netflix shows from a .CSV file
Relevant Code after Trial & Error
names = pd.read_csv("netflix_titles.csv", nrows=31, usecols=[2])
durations = pd.read_csv("netflix_titles.csv", nrows=31, usecols=[9])
durations[['duration']] = durations[['duration']].astype(int)
Then I plot it.
plt.bar(names,durations)
plt.title("Show Durations")
plt.xlabel("Name of Shows")
plt.ylabel("Durations (In Minutes)")
plt.show()
31 rows are read because the first row is the header. durations is converted to integers because the numbers in the column are read as strings (or something else) and wouldn't work with matplotlib.
Error Message
TypeError: unhashable type: 'numpy.ndarray'
I don't think Numpy applies with what I'm trying to do, so I'm at a dead end here.
This was able to print out a bar chart for the first 31 values
dataset = pd.read_csv("netflix_titles.csv")
names = dataset['title'].head(31)
durations = dataset['duration'].head(31)
plt.bar(names,durations)
plt.title("Show Durations")
plt.xlabel("Name of Shows")
plt.ylabel("Durations (In Minutes)")
plt.show()
The problem is that you are making two different DataFrames from the csv file and trying to plot them against each other. While this is possible, a much simpler approach is to create a single DataFrame from the selected columns and rows of the csv file and then plot it as demonstrated below:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv("netflix_titles.csv", nrows=31, usecols=[2,9])
df.columns = ['name', 'duration']
df['duration'] = df['duration'].astype(int)
df.set_index('name', inplace=True)
df.plot(kind = 'bar')
plt.title("Show Durations")
plt.xlabel("Name of Shows")
plt.ylabel("Durations (In Minutes)")
plt.show()
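The likely reason the original attempt failed is that read_csv returns a two-dimensional DataFrame even when usecols selects a single column, while plt.bar expects one-dimensional inputs. A small sketch of the difference, using a made-up in-memory CSV instead of netflix_titles.csv (the titles and durations are illustrative only):

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the real file.
csv_text = "title,duration\nShow A,90\nShow B,45\nShow C,120\n"
df = pd.read_csv(io.StringIO(csv_text))

names = df["title"]                      # Series: 1-D, fine for plt.bar
durations = df["duration"].astype(int)   # Series: 1-D
names_2d = df[["title"]]                 # DataFrame: 2-D, like usecols=[2]

fig, ax = plt.subplots()
bars = ax.bar(names, durations)
ax.set_title("Show Durations")
ax.set_xlabel("Name of Shows")
ax.set_ylabel("Durations (In Minutes)")

print(names.ndim, names_2d.ndim, len(bars))
```

Selecting columns with single brackets, as in dataset['title'], is what turns the data into the 1-D Series that the working version passes to plt.bar.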

statsmodels has trouble predicting on formulas using functions like log on rows of heterogeneous type

I have a pandas DataFrame whose rows contain data of multiple types. I want to fit a model based on this data using statsmodels.formula.api and then make some predictions. For my application I want to make predictions a single row at a time. If I do this naively I get AttributeError: 'numpy.float64' object has no attribute 'log' for the reason described in this answer. Here's some sample code:
import string
import random
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
# Generate an example DataFrame
N = 100
z = np.random.normal(size=N)
u = np.random.normal(size=N)
w = np.exp(1 + u + 2*z)
x = np.exp(z)
y = np.log(w)
names = ["".join(random.sample(string.ascii_lowercase, 4)) for lv in range(N)]
df = pd.DataFrame({"x": x, "y": y, "names": names})
reg_spec = "y ~ np.log(x)"
fit = smf.ols(reg_spec, data=df).fit()
series = df.iloc[0] # In reality it would be `apply` extracting the rows one at a time
print(series.dtype) # gives `object` if `names` is in the DataFrame
print(fit.predict(series)) # AttributeError: 'numpy.float64' object has no attribute 'log'
The problem is that apply feeds me rows as Series, not DataFrames, and because I'm working with multiple types, the Series have type object. Sadly np.log doesn't like Series of objects even if all the objects are in fact floats. Swapping apply for transform doesn't help. I could create an intermediate DataFrame with only numeric columns or change my regression specification to y ~ np.log(x.astype('float64')). In the context of a larger program with a more complicated formula these are both pretty ugly. Is there a cleaner approach I'm missing?
Although you said you don't want to create an intermediate DataFrame with only numeric columns because it's pretty ugly, I think using select_dtypes to create a numbers-only subset of your Series on the fly is quite elegant and doesn't involve large code modifications:
series = df.select_dtypes(include='number').iloc[0]
Another solution that dawned on me as I was doing some other work is to convert the Series that apply gives me into a DataFrame consisting of a single row. This works:
row_df = pd.DataFrame([series])
print(fit.predict(row_df))
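The dtype behaviour behind both suggestions can be checked without statsmodels. In this sketch (a two-row frame with made-up values), a row taken from a mixed-type frame is an object Series, while the numbers-only subset and the one-row DataFrame both give x a float dtype again, so np.log works on them:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.e], "names": ["abcd", "efgh"]})

series = df.iloc[0]                                  # mixed row -> object
numeric_row = df.select_dtypes(include="number").iloc[0]
row_df = pd.DataFrame([series])                      # one-row DataFrame

print(series.dtype)        # object
print(numeric_row.dtype)   # float64
print(row_df.dtypes)       # x is inferred as float64 again

log_x = np.log(row_df["x"])
```

The one-row DataFrame keeps the names column as well, which matters if the formula refers to non-numeric columns.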

How do I replace set_value with at[] in a pandas Series

I'm trying to construct a pandas Series to concatenate onto a dataframe.
import numpy as np
import pandas as pd
rawData = pd.read_csv(input, header=1) # the DataFrame
strikes = pd.Series() # the empty Series
for i, row in rawData.iterrows():
    sym = rawData.loc[i,'Symbol']
    strike = float(sym[-6:])/1000
    strikes = strikes.set_value(i, strike)
print("at26: ",strikes.values)
This program works, but I get the error message:
"line 25: FutureWarning: set_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead."
Every way I have tried to substitute .at, I get a syntax error. Many of the suggestions posted relate to DataFrames, not Series. Append requires another series, and complains when I give it a scalar.
What is the proper way to do it?
Replace strikes.set_value(i, strike) with strikes.at[i] = strike.
Note that assignment back to a series is not necessary with set_value:
s = pd.Series()
s.set_value(0, 10)
s.at[1] = 20
print(s)
0    10
1    20
dtype: int64
For the algorithm you are looking to run, you can simply use assignment:
strikes = rawData['Symbol'].str[-6:].astype(float) / 1000
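A short sketch comparing the loop with .at[] against the vectorised expression. The two symbols are invented for illustration; the only assumption carried over from the question is that the last six characters encode the strike times 1000:

```python
import pandas as pd

# Hypothetical symbols standing in for the CSV contents.
rawData = pd.DataFrame({"Symbol": ["ABC180119C00150000",
                                   "ABC180119P00152500"]})

# Loop version: .at[] assigns in place and enlarges the Series as needed.
strikes = pd.Series(dtype=float)
for i, row in rawData.iterrows():
    strikes.at[i] = float(row["Symbol"][-6:]) / 1000

# Vectorised version: no loop and no empty Series to grow.
vectorised = rawData["Symbol"].str[-6:].astype(float) / 1000

print(strikes.tolist())
print(vectorised.tolist())
```

Both produce the same values; the vectorised form is usually preferable because growing a Series one element at a time reallocates on every step.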

Heat Map Seaborn fmt='d' error

In extension to my previous question
I can plot the Heat map with Seaborn very well and with suggestion can get annotation. But I see a new problem now.
Input File
Nos,Place,Way,Name,00:00:00,12:00:00
123,London,Air,Apollo,342,972
123,London,Rail,Beta,2352,342
123,Paris,Bus,Beta,545,353
345,Paris,Bus,Rava,652,974
345,Rome,Bus,Rava,2325,56
345,London,Air,Rava,2532,9853
567,Paris,Air,Apollo,545,544
567,Rome,Rail,Apollo,5454,5
876,Japan,Rail,Apollo,644,54
876,Japan,Bus,Beta,45,57
876,Japan,Bus,Beta,40,57
876,Japan,Bus,Beta,40,57
Program:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
df = pd.read_csv('heat_map_data.csv')
df3 = df.copy()
for c in ['Place','Name']:
    df3[c] = df3[c].astype('category')
sns.heatmap(df3.pivot_table(index='Place', columns='Name', values='00:00:00' ),annot=True, fmt='.1f' )
plt.show()
If I use fmt='d' I get an error about a float value, so I changed it to fmt='f', and then I get the count from the desired column.
But when the same axis value repeats, it does not add up the counts from the desired column. Is there a solution for that?
As it is seen in the input file
876,Japan,Bus,Beta,45,57
876,Japan,Bus,Beta,40,57
876,Japan,Bus,Beta,40,57
These 3 rows repeat, and their values should be shown as a sum:
the cell which represents Japan and Beta should be annotated with 125, but instead it shows 41.7. How do I achieve that? Also, is it possible to give two values as annotation?
My second question: in the pivot I am passing values='00:00:00', but I need it to dynamically read the last column from the file.
You can use the aggfunc keyword passing in a dict:
aggfunc :
function, default numpy.mean, or list of functions
If list of functions passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves)
sns.heatmap(df3.pivot_table(index='Place', columns='Name',
                            values='00:00:00', aggfunc={'00:00:00': np.sum}),
            annot=True, fmt='.1f')
Which outputs a heatmap in which the Japan/Beta cell is annotated with the sum, 125.0, instead of the mean, 41.7.
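The aggregation itself can be checked without plotting. The sketch below rebuilds the input file in memory and also shows one possible way to pick the last column dynamically, df.columns[-1]; that part is an assumption about what the question means by "dynamically", since the answer above does not address it:

```python
import io

import pandas as pd

csv_text = """Nos,Place,Way,Name,00:00:00,12:00:00
123,London,Air,Apollo,342,972
123,London,Rail,Beta,2352,342
123,Paris,Bus,Beta,545,353
345,Paris,Bus,Rava,652,974
345,Rome,Bus,Rava,2325,56
345,London,Air,Rava,2532,9853
567,Paris,Air,Apollo,545,544
567,Rome,Rail,Apollo,5454,5
876,Japan,Rail,Apollo,644,54
876,Japan,Bus,Beta,45,57
876,Japan,Bus,Beta,40,57
876,Japan,Bus,Beta,40,57
"""
df = pd.read_csv(io.StringIO(csv_text))

# Sum instead of the default mean for repeated (Place, Name) pairs.
summed = df.pivot_table(index="Place", columns="Name",
                        values="00:00:00", aggfunc="sum")

# Use whatever the last column of the file happens to be.
last_col = df.columns[-1]
summed_last = df.pivot_table(index="Place", columns="Name",
                             values=last_col, aggfunc="sum")

print(summed)
print(summed_last)
```

Either table can then be passed to sns.heatmap(..., annot=True, fmt='.1f') as in the snippet above. Combinations that never occur, such as Japan/Rava, come out as NaN, which makes the table floating point.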
