I want to use Pandas + Uncertainties. I am getting a strange error; below is a MWE:
from uncertainties import ufloat
import pandas
number_with_uncertainty = ufloat(2,1)
df = pandas.DataFrame({'a': [number_with_uncertainty]}) # This line works fine.
df.loc[0,'b'] = ufloat(3,1) # This line fails.
I have noticed that if I try to add the ufloats "on the fly", as I usually do with floats and other values, it fails. If I first create a Series, then it works:
from uncertainties import ufloat
import pandas
number_with_uncertainty = ufloat(2,1)
df = pandas.DataFrame({'a': [number_with_uncertainty]}) # This line works fine.
df['b'] = pandas.Series([ufloat(3,1)]) # Now it works.
print(df)
This makes things more cumbersome when calculating values on the fly within a loop, as I have to create a temporary Series and, after the loop, add it as a column to my data frame.
Is this a problem of Pandas, a problem of Uncertainties, or am I doing something that is not supposed to be done?
The problem arises because when pandas tries to create a new column it checks the dtype of the new value so that it knows what dtype to assign to that column. For some reason, the dtype check on the ufloat value fails. I believe this is a bug that will have to be fixed in uncertainties.
A workaround in the interim is to manually create the new column with dtype set to object, for example in your case above:
from uncertainties import ufloat
import pandas
import numpy
number_with_uncertainty = ufloat(2,1)
df = pandas.DataFrame({'a': [number_with_uncertainty]}) # This line works fine.
# create a new column with the correct dtype
df.loc[:, 'b'] = numpy.zeros(len(df), dtype=object)
df.loc[0,'b'] = ufloat(3,1) # This line now works.
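For the loop case mentioned in the question, the same idea applies: pre-allocate the column as object once, then assign each element inside the loop. A minimal sketch (the per-row computation is made up for illustration):
from uncertainties import ufloat
import pandas
import numpy
df = pandas.DataFrame({'a': [ufloat(2, 1), ufloat(5, 2)]})
df['b'] = numpy.zeros(len(df), dtype=object)  # pre-allocate with dtype=object
for i in range(len(df)):
    df.loc[i, 'b'] = df.loc[i, 'a'] * 2  # illustrative per-row value
print(df)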
My Goal
Display a bar chart showing the names and durations of the first 30 Netflix shows from a .CSV file.
Relevant Code after Trial & Error
names = pd.read_csv("netflix_titles.csv", nrows=31, usecols=[2])
durations = pd.read_csv("netflix_titles.csv", nrows=31, usecols=[9])
durations[['duration']] = durations[['duration']].astype(int)
Then I plot it.
plt.bar(names,durations)
plt.title("Show Durations")
plt.xlabel("Name of Shows")
plt.ylabel("Durations (In Minutes)")
plt.show()
31 rows are read because the first row is a header. durations is converted to integers because the numbers in that column are read as strings (or some other non-numeric type) and wouldn't work with matplotlib otherwise.
Error Message
TypeError: unhashable type: 'numpy.ndarray'
I don't think NumPy applies to what I'm trying to do, so I'm at a dead end here.
This was able to print out a bar chart for the first 31 values:
dataset = pd.read_csv("netflix_titles.csv")
names = dataset['title'].head(31)
durations = dataset['duration'].head(31)
plt.bar(names,durations)
plt.title("Show Durations")
plt.xlabel("Name of Shows")
plt.ylabel("Durations (In Minutes)")
plt.show()
The problem is that you are making two different DataFrames from the csv file and trying to plot them against each other. While this is possible, a much simpler approach is to create a single DataFrame from the selected columns and rows of the csv file and then plot it, as demonstrated below:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv("netflix_titles.csv", nrows=31, usecols=[2,9])
df.columns = ['name', 'duration']
df['duration'] = df['duration'].astype(int)
df.set_index('name', inplace=True)
df.plot(kind='bar')
plt.title("Show Durations")
plt.xlabel("Name of Shows")
plt.ylabel("Durations (In Minutes)")
plt.show()
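As a side note, if you'd rather not hard-code the column positions, usecols also accepts column names, which should be a bit more robust if the csv layout changes (assuming the columns are actually named 'title' and 'duration' in the file):
df = pd.read_csv("netflix_titles.csv", nrows=31, usecols=['title', 'duration'])
# the remaining cleaning and plotting steps are unchanged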
I have a pandas DataFrame whose rows contain data of multiple types. I want to fit a model based on this data using statsmodels.formula.api and then make some predictions. For my application I want to make predictions a single row at a time. If I do this naively I get AttributeError: 'numpy.float64' object has no attribute 'log' for the reason described in this answer. Here's some sample code:
import string
import random
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
# Generate an example DataFrame
N = 100
z = np.random.normal(size=N)
u = np.random.normal(size=N)
w = np.exp(1 + u + 2*z)
x = np.exp(z)
y = np.log(w)
names = ["".join(random.sample(string.lowercase, 4)) for lv in range(N)]
df = pd.DataFrame({"x": x, "y": y, "names": names})
reg_spec = "y ~ np.log(x)"
fit = smf.ols(reg_spec, data=df).fit()
series = df.iloc[0] # In reality it would be `apply` extracting the rows one at a time
print(series.dtype) # gives `object` if `names` is in the DataFrame
print(fit.predict(series)) # AttributeError: 'numpy.float64' object has no attribute 'log'
The problem is that apply feeds me rows as Series, not DataFrames, and because I'm working with multiple types, the Series have type object. Sadly np.log doesn't like Series of objects even if all the objects are in fact floats. Swapping apply for transform doesn't help. I could create an intermediate DataFrame with only numeric columns or change my regression specification to y ~ np.log(x.astype('float64')). In the context of a larger program with a more complicated formula these are both pretty ugly. Is there a cleaner approach I'm missing?
Although you said you don't want to create an intermediate DataFrame with only numeric columns because it's pretty ugly, I think using select_dtypes to create a numbers-only subset of your Series on the fly is quite elegant and doesn't involve large code modifications:
series = df.select_dtypes(include='number').iloc[0]
Another solution that dawned on me as I was doing some other work is to convert the Series that apply gives me into a DataFrame consisting of a single row. This works:
row_df = pd.DataFrame([series])
print(fit.predict(row_df))
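If you are extracting rows with apply, the same trick can live inside the row function. A minimal sketch, assuming one prediction per row is what you want:
predictions = df.apply(lambda row: fit.predict(pd.DataFrame([row])).iloc[0], axis=1)
print(predictions.head())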
I'm trying to construct a pandas Series to concatenate onto a DataFrame.
import numpy as np
import pandas as pd
rawData = pd.read_csv(input, header=1) # the DataFrame
strikes = pd.Series() # the empty Series
for i, row in rawData.iterrows():
    sym = rawData.loc[i, 'Symbol']
    strike = float(sym[-6:]) / 1000
    strikes = strikes.set_value(i, strike)
print("at26: ", strikes.values)
This program works, but I get the error message:
"line 25: FutureWarning: set_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead."
Every way I have tried to substitute .at, I get a syntax error. Many of the suggestions posted relate to DataFrames, not Series. append requires another Series, and complains when I give it a scalar.
What is the proper way to do it?
Replace strikes.set_value(i, strike) with strikes.at[i] = strike.
Note that assigning the result back to the Series is not necessary; set_value modifies it in place (as does .at):
s = pd.Series()
s.set_value(0, 10)
s.at[1] = 20
print(s)
0 10
1 20
dtype: int64
For the algorithm you are looking to run, you can simply use assignment:
strikes = rawData['Symbol'].str[-6:].astype(float) / 1000
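As a quick illustration with made-up symbols (assuming, as in your loop, that the last six characters encode the strike in thousandths):
rawData = pd.DataFrame({'Symbol': ['XYZ123C150000', 'XYZ123P072500']})
strikes = rawData['Symbol'].str[-6:].astype(float) / 1000
print(strikes.values)  # [150.   72.5]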
As an extension of my previous question: I can plot the heat map with Seaborn just fine and, with the earlier suggestion, can get annotations. But now I see a new problem.
Input File
Nos,Place,Way,Name,00:00:00,12:00:00
123,London,Air,Apollo,342,972
123,London,Rail,Beta,2352,342
123,Paris,Bus,Beta,545,353
345,Paris,Bus,Rava,652,974
345,Rome,Bus,Rava,2325,56
345,London,Air,Rava,2532,9853
567,Paris,Air,Apollo,545,544
567,Rome,Rail,Apollo,5454,5
876,Japan,Rail,Apollo,644,54
876,Japan,Bus,Beta,45,57
876,Japan,Bus,Beta,40,57
876,Japan,Bus,Beta,40,57
Program:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
df = pd.read_csv('heat_map_data.csv')
df3 = df.copy()
for c in ['Place', 'Name']:
    df3[c] = df3[c].astype('category')
sns.heatmap(df3.pivot_table(index='Place', columns='Name', values='00:00:00' ),annot=True, fmt='.1f' )
plt.show()
If I use fmt='d' I get an error about float values, so I changed it to fmt='f', and I get a value annotated for the desired column. But when the same axis values repeat, it does not add up the values from the desired column. Is there a solution for that?
As it is seen in the input file
876,Japan,Bus,Beta,45,57
876,Japan,Bus,Beta,40,57
876,Japan,Bus,Beta,40,57
It has 3 repeated rows whose values should be shown as a sum: the cell representing Japan and Beta should be annotated with 125, but instead it shows 41.7. How do I achieve that? Also, is it possible to give two values as the annotation?
My second question: in the pivot I am passing values='00:00:00', but I need it to dynamically read the last column from the file.
You can use the aggfunc keyword passing in a dict:
aggfunc : function, default numpy.mean, or list of functions
    If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves)
sns.heatmap(df3.pivot_table(index='Place', columns='Name', values='00:00:00',
                            aggfunc={'00:00:00': np.sum}),
            annot=True, fmt='.1f')
Which outputs a heatmap whose Japan/Beta cell is annotated with the summed value, 125.0.
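For your second question, you don't need to hard-code '00:00:00': the name of the last column is available as df3.columns[-1], so you can pass that instead. A minimal sketch:
last_col = df3.columns[-1]  # '12:00:00' for this input file
sns.heatmap(df3.pivot_table(index='Place', columns='Name',
                            values=last_col, aggfunc={last_col: np.sum}),
            annot=True, fmt='.1f')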