KeyError: when making new column in pandas

I am working on a dataset, and when I try to create a new column after finding the difference I get KeyError: 'filtered'.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
d = {'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'col2': [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]}
df = pd.DataFrame(data=d)
fig, ax = plt.subplots(2, figsize=(8,8))
df['col2'].diff().plot(ax=ax[0])
cutoff = 3
df['filtered'] = df.loc[df['col2'].diff().abs() > cutoff]
df.plot(ax=ax[1])
I usually create a new column like this (df['filtered'] = some operation), but in this situation it raises KeyError: 'filtered'. Thank you for the help.

You need to replace the second-to-last line with:
df['filtered'] = df.loc[df['col2'].diff().abs() > cutoff, 'col2']
assuming that you want a filtered version of 'col2'. As @RafaelC mentioned, your current .loc[] operation returns all columns (two in this case) for the rows that pass the filter, and a two-column DataFrame cannot be assigned to a single new column, hence the error.
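To make the difference concrete, here is a minimal sketch using the same data as the question. The name `mask` is only illustrative and does not appear in the original code. Selecting a single column with `.loc[mask, 'col2']` yields a Series, and pandas aligns it on the index when assigning, so rows that fail the filter end up as NaN.

```python
import pandas as pd

d = {'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'col2': [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]}
df = pd.DataFrame(data=d)
cutoff = 3

# Boolean row filter; the first row is NaN after diff() and compares as False.
mask = df['col2'].diff().abs() > cutoff

# .loc[mask] alone would return a two-column DataFrame.
# Naming the column returns a Series, which fits into a single new column.
df['filtered'] = df.loc[mask, 'col2']
print(df)
```

Rows whose difference does not exceed the cutoff keep their place in the frame but hold NaN in `filtered`.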

Related

pandas boxplot contains content of plot saved before

I'm plotting some columns of a dataframe as a boxplot. So far, no problem: as seen below, I wrote some code and it works. BUT the second plot also contains the contents of the first plot. As you can see, I tried "= None" and "del value", but neither works. Moving the plot function outside does not solve the problem either.
What's wrong with my code?
Here is an executable example
import pandas as pd
d1 = {'ff_opt_time': [10, 20, 11, 5, 15 , 13, 19, 25 ], 'ff_count_opt': [30, 40, 45, 29, 35,38,32,41]}
df1 = pd.DataFrame(data=d1)
d2 = {'ff_opt_time': [1, 2, 1, 5, 1 , 1, 4, 5 ], 'ff_count_opt': [3, 4, 4, 9, 5,3, 2,4]}
df2 = pd.DataFrame(data=d2)
def evaluate2(df1, df2):
    def plot(df, output):
        boxplot = df.boxplot(rot=45, fontsize=5)
        fig = boxplot.get_figure()
        fig.savefig(output + ".pdf")
    df_ot = pd.DataFrame(columns=['opt_time1', 'opt_time2'])
    df_ot['opt_time1'] = df1['ff_opt_time']
    df_ot['opt_time2'] = df2['ff_opt_time']
    plot(df_ot, "bp_opt_time")
    df_op = pd.DataFrame(columns=['count_opt1', 'count_opt2'])
    df_op['count_opt1'] = df1['ff_count_opt']
    df_op['count_opt2'] = df2['ff_count_opt']
    plot(df_op, "bp_count_opt_perm")
evaluate2(df1, df2)
Here is another executable example. I even used other variable names.
import pandas as pd
d1 = {'ff_opt_time': [10, 20, 11, 5, 15 , 13, 19, 25 ], 'ff_count_opt': [30, 40, 45, 29, 35,38,32,41]}
df1 = pd.DataFrame(data=d1)
d2 = {'ff_opt_time': [1, 2, 1, 5, 1 , 1, 4, 5 ], 'ff_count_opt': [3, 4, 4, 9, 5,3, 2,4]}
df2 = pd.DataFrame(data=d2)
def evaluate2(df1, df2):
    df_ot = pd.DataFrame(columns=['opt_time1', 'opt_time2'])
    df_ot['opt_time1'] = df1['ff_opt_time']
    df_ot['opt_time2'] = df2['ff_opt_time']
    boxplot1 = df_ot.boxplot(rot=45, fontsize=5)
    fig1 = boxplot1.get_figure()
    fig1.savefig("bp_opt_time.pdf")
    df_op = pd.DataFrame(columns=['count_opt1', 'count_opt2'])
    df_op['count_opt1'] = df1['ff_count_opt']
    df_op['count_opt2'] = df2['ff_count_opt']
    boxplot2 = df_op.boxplot(rot=45, fontsize=5)
    fig2 = boxplot2.get_figure()
    fig2.savefig("bp_count_opt_perm.pdf")
evaluate2(df1, df2)

I can see from your code that both boxplots, boxplot1 and boxplot2, are drawn on the same figure. You need to tell matplotlib that there are going to be two plots.
This can be achieved in either of two ways:
Create a separate figure for each boxplot using pyplot in matplotlib. Calling fig1, ax1 = plt.subplots() before the first boxplot and fig2, ax2 = plt.subplots() before the second gives each boxplot its own figure and axes to draw on.
Dissolve the evaluate2 function and execute each boxplot in a different cell of the Jupyter notebook.
Solution 1 : Two subplots using pyplot
import pandas as pd
import matplotlib.pyplot as plt
d1 = {'ff_opt_time': [10, 20, 11, 5, 15 , 13, 19, 25 ], 'ff_count_opt': [30, 40, 45, 29, 35,38,32,41]}
df1 = pd.DataFrame(data=d1)
d2 = {'ff_opt_time': [1, 2, 1, 5, 1 , 1, 4, 5 ], 'ff_count_opt': [3, 4, 4, 9, 5,3, 2,4]}
df2 = pd.DataFrame(data=d2)
def evaluate2(df1, df2):
    df_ot = pd.DataFrame(columns=['opt_time1', 'opt_time2'])
    df_ot['opt_time1'] = df1['ff_opt_time']
    df_ot['opt_time2'] = df2['ff_opt_time']
    fig1, ax1 = plt.subplots()
    boxplot1 = df_ot.boxplot(rot=45, fontsize=5)
    ax1 = boxplot1
    fig1 = boxplot1.get_figure()
    fig1.savefig("bp_opt_time.pdf")
    df_op = pd.DataFrame(columns=['count_opt1', 'count_opt2'])
    df_op['count_opt1'] = df1['ff_count_opt']
    df_op['count_opt2'] = df2['ff_count_opt']
    fig2, ax2 = plt.subplots()
    boxplot2 = df_op.boxplot(rot=45, fontsize=5)
    fig2 = boxplot2.get_figure()
    ax2 = boxplot2
    fig2.savefig("bp_count_opt_perm.pdf")
    plt.show()
evaluate2(df1, df2)
Solution 2: Executing boxplot in different cell
Update based on comments: clearing plots
There are two ways to clear a plot:
Clear the figure itself using clf()
The matplotlib.pyplot.clf() function clears the current Figure’s state without closing it.
Clear the axes using cla()
The matplotlib.pyplot.cla() function clears the current Axes' state without closing the Axes.
Simply call plt.clf() after calling fig.savefig().
See the matplotlib documentation on how to clear a plot in Python.
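The clearing approach might look like the sketch below. The helper name `save_boxplot`, the shortened data, and the in-memory buffers (used here instead of real PDF files) are assumptions made for illustration. The `matplotlib.use("Agg")` line only makes the sketch run without a display and can be dropped in a notebook.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def save_boxplot(df, target):
    """Draw a boxplot of df on the current figure, save it, then clear the figure."""
    ax = df.boxplot(rot=45, fontsize=5)
    fig = ax.get_figure()
    fig.savefig(target, format="pdf")
    plt.clf()  # wipe the figure so the next boxplot starts from an empty canvas
    return fig


first = pd.DataFrame({'opt_time1': [10, 20, 11, 5], 'opt_time2': [1, 2, 1, 5]})
second = pd.DataFrame({'count_opt1': [30, 40, 45, 29], 'count_opt2': [3, 4, 4, 9]})

buf1, buf2 = io.BytesIO(), io.BytesIO()  # in-memory stand-ins for the PDF files
fig_a = save_boxplot(first, buf1)
fig_b = save_boxplot(second, buf2)
```

Both calls reuse the same Figure object; because it is cleared after each save, the second PDF no longer carries the boxes from the first.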

Just grab the code from Archana David and put it in your plot function: the goal is to call "fig, ax = plt.subplots()" to create a new graph.
import pandas as pd
import matplotlib.pyplot as plt
d1 = {'ff_opt_time': [10, 20, 11, 5, 15, 13, 19, 25],
'ff_count_opt': [30, 40, 45, 29, 35, 38, 32, 41]}
df1 = pd.DataFrame(data=d1)
d2 = {'ff_opt_time': [1, 2, 1, 5, 1, 1, 4, 5],
'ff_count_opt': [3, 4, 4, 9, 5, 3, 2, 4]}
df2 = pd.DataFrame(data=d2)
def evaluate2(df1, df2):
    def plot(df, output):
        fig, ax = plt.subplots()
        boxplot = df.boxplot(rot=45, fontsize=5)
        ax = boxplot
        fig = boxplot.get_figure()
        fig.savefig(output + ".pdf")
    df_ot = pd.DataFrame(columns=['opt_time1', 'opt_time2'])
    df_ot['opt_time1'] = df1['ff_opt_time']
    df_ot['opt_time2'] = df2['ff_opt_time']
    plot(df_ot, "bp_opt_time")
    df_op = pd.DataFrame(columns=['count_opt1', 'count_opt2'])
    df_op['count_opt1'] = df1['ff_count_opt']
    df_op['count_opt2'] = df2['ff_count_opt']
    plot(df_op, "bp_count_opt_perm")
evaluate2(df1, df2)
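A possible variation, not taken from either answer, is to hand the new axes to `DataFrame.boxplot` through its `ax` parameter and close the figure once it is saved. This avoids relying on pyplot's notion of the "current" figure. The temporary directory, the shortened data, and the returned path are assumptions made so that the sketch is self-contained.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot(df, output):
    fig, ax = plt.subplots()               # fresh figure for every call
    df.boxplot(ax=ax, rot=45, fontsize=5)  # draw explicitly onto that figure's axes
    path = output + ".pdf"
    fig.savefig(path)
    plt.close(fig)                         # release the figure once it is on disk
    return path


out_dir = tempfile.mkdtemp()  # throwaway directory used instead of the working directory
df_ot = pd.DataFrame({'opt_time1': [10, 20, 11, 5], 'opt_time2': [1, 2, 1, 5]})
df_op = pd.DataFrame({'count_opt1': [30, 40, 45, 29], 'count_opt2': [3, 4, 4, 9]})

path_ot = plot(df_ot, os.path.join(out_dir, "bp_opt_time"))
path_op = plot(df_op, os.path.join(out_dir, "bp_count_opt_perm"))
```

Closing each figure also keeps memory use flat when many plots are written in a loop.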

boxplot for all data in dataframe: error "'numpy.ndarray' object has no attribute 'boxplot'"

I am trying to display, in a grid of subplots, the boxplots corresponding to each column in my dataframe df.
I have read this question:
Subplot for seaborn boxplot
and tried to implement the given solution:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
d = {'col1': [1, 2, 5.5, 100], 'col2': [3, 4, 0.2, 3], 'col3': [1, 4, 6, 30], 'col4': [2, 24, 0.2, 13], 'col5': [9, 84, 0.9, 3]}
df = pd.DataFrame(data=d)
names = list(df.columns)
f, axes = plt.subplots(round(len(names)/3), 3)
y = 0;
for name in names:
    sns.boxplot(x= df[name], ax=axes[y])
    y = y + 1
Unfortunately I get an error
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-111-489a538377fc> in <module>
3 y = 0;
4 for name in names:
----> 5 sns.boxplot(x= df[name], ax=axes[y])
6 y = y + 1
AttributeError: 'numpy.ndarray' object has no attribute 'boxplot'
I understand there is a problem with df[name] but I can't see how to fix it.
Would someone be able to point me in the right direction?
Thank you very much.

The problem comes from passing ax=axes[y] to boxplot. axes is a 2-d numpy array with shape (2, 3) that contains the grid of Matplotlib axes that you requested. So axes[y] is a 1-d numpy array that contains three Matplotlib AxesSubplot objects. I suspect boxplot is attempting to dispatch to this argument, and it expects it to be an object with a boxplot method. You can fix this by indexing axes with the row and column that you want to use.
Here's your script, with a small change to do that:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
d = {'col1': [1, 2, 5.5, 100], 'col2': [3, 4, 0.2, 3], 'col3': [1, 4, 6, 30], 'col4': [2, 24, 0.2, 13], 'col5': [9, 84, 0.9, 3]}
df = pd.DataFrame(data=d)
names = list(df.columns)
f, axes = plt.subplots(round(len(names)/3), 3)
y = 0;
for name in names:
    i, j = divmod(y, 3)
    sns.boxplot(x=df[name], ax=axes[i, j])
    y = y + 1
plt.tight_layout()
plt.show()
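An alternative sketch, under a few assumptions, iterates over `axes.flat` so that no row and column arithmetic is needed. It uses Matplotlib's own `Axes.boxplot` in place of seaborn so that it runs without seaborn installed; `sns.boxplot(x=df[name], ax=ax)` should slot into the same loop. It also swaps `round` for `math.ceil`: `round(len(names)/3)` happens to give 2 rows for five columns, but it would give only 1 row for four columns.

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

d = {'col1': [1, 2, 5.5, 100], 'col2': [3, 4, 0.2, 3], 'col3': [1, 4, 6, 30],
     'col4': [2, 24, 0.2, 13], 'col5': [9, 84, 0.9, 3]}
df = pd.DataFrame(data=d)
names = list(df.columns)

ncols = 3
nrows = math.ceil(len(names) / ncols)  # ceil never rounds down to too few rows
f, axes = plt.subplots(nrows, ncols, squeeze=False)  # squeeze=False keeps axes 2-d

# axes.flat walks the grid row by row, pairing each column with one Axes.
for name, ax in zip(names, axes.flat):
    ax.boxplot(df[name].to_numpy())
    ax.set_title(name)

# Hide the grid cells left over when the grid has more cells than columns.
for ax in axes.flat[len(names):]:
    ax.set_visible(False)

plt.tight_layout()
```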

How to keep the index when using pd.melt and merge to create a DataFrame for Seaborn and matplotlib

I am trying to draw subplots from two DataFrames (predicted and observed) that have exactly the same structure; the first column is the index.
The code below creates a new index when they are combined using pd.melt and pd.concat.
As you can see in the figure, the index of the orange line changes from 1-5 to 6-10.
I was wondering if someone could fix the code below to keep the same index for the orange line:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
actual = pd.DataFrame({'a': [5, 8, 9, 6, 7, 2],
'b': [89, 22, 44, 6, 44, 1]})
predicted = pd.DataFrame({'a': [7, 2, 13, 18, 20, 2],
'b': [9, 20, 4, 16, 40, 11]})
# Creating a tidy-dataframe to input under seaborn
merged = pd.concat([pd.melt(actual), pd.melt(predicted)]).reset_index()
merged['category'] = ''
merged.loc[:len(actual)*2,'category'] = 'actual'
merged.loc[len(actual)*2:,'category'] = 'predicted'
g = sns.FacetGrid(merged, col="category", hue="variable")
g.map(plt.plot, "index", "value", alpha=.7)
g.add_legend();

The orange line ('variable' == 'b') doesn't have an index of 0-5 because of how you used melt. If you look at pd.melt(actual), the index doesn't match what you are expecting, IIUC.
Here is how I would rearrange the dataframe:
merged = pd.concat([actual, predicted], keys=['actual', 'predicted'])
merged.index.names = ['category', 'index']
merged = merged.reset_index()
merged = pd.melt(merged, id_vars=['category', 'index'], value_vars=['a', 'b'])
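A quick way to check that this rearrangement keeps the original row positions is to look at the index range per category and variable. The sketch below reuses the frames from the question; the `spans` summary is an addition for illustration only.

```python
import pandas as pd

actual = pd.DataFrame({'a': [5, 8, 9, 6, 7, 2],
                       'b': [89, 22, 44, 6, 44, 1]})
predicted = pd.DataFrame({'a': [7, 2, 13, 18, 20, 2],
                          'b': [9, 20, 4, 16, 40, 11]})

# Stack the two frames, labelling each block with its category.
merged = pd.concat([actual, predicted], keys=['actual', 'predicted'])
merged.index.names = ['category', 'index']
merged = merged.reset_index()

# Melt only the value columns; category and index travel along as identifiers.
merged = pd.melt(merged, id_vars=['category', 'index'], value_vars=['a', 'b'])

# Every (category, variable) pair should cover the original index 0..5.
spans = merged.groupby(['category', 'variable'])['index'].agg(['min', 'max'])
print(spans)
```

With the index preserved as a regular column, the FacetGrid call from the question can keep plotting "index" against "value".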

Set the ignore_index parameter to False to preserve the index (available in pandas 1.1 and later), e.g.
df = df.melt(var_name='species', value_name='height', ignore_index=False)
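As a small sketch of what that parameter changes, assuming pandas 1.1 or later and reusing the `actual` frame from the question (the names `default`, `kept`, and `tidy` are illustrative):

```python
import pandas as pd

actual = pd.DataFrame({'a': [5, 8, 9, 6, 7, 2],
                       'b': [89, 22, 44, 6, 44, 1]})

default = pd.melt(actual)                   # index is renumbered 0..11
kept = pd.melt(actual, ignore_index=False)  # index repeats 0..5 for each variable

# reset_index() turns the preserved index into a regular column named "index".
tidy = kept.reset_index()
print(tidy)
```

The `tidy` frame then has an "index" column that restarts at 0 for variable 'b', which is what the plot in the question needs.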

I have written a code to calculate the correlation between two Pandas Series. Can you tell me what is wrong with my code?

Below is the code:
import numpy as np
import pandas as pd
def correlation(x, y):
    std_x = (x - x.mean())/x.std(ddof = 0)
    std_y = (y - y.mean())/y.std(ddof = 0)
    return (std_x * std_y).mean
a = pd.Series([2, 4, 5, 7, 9])
b = pd.Series([12, 10, 9, 7, 3])
ca = correlation(a, b)
print(ca)
It does not return the value of the correlation; instead it returns a Series with keys 0, 1, 2, 3, 4 and values -1.747504, -0.340844, -0.043282, -0.259691, -2.531987.
Please help me understand the problem behind this.

You need to call mean() with parentheses:
return (std_x * std_y).mean()
and not just:
return (std_x * std_y).mean
which returns the method itself. Full code:
import numpy as np
import pandas as pd
def correlation(x, y):
    std_x = (x - x.mean())/x.std(ddof = 0)
    std_y = (y - y.mean())/y.std(ddof = 0)
    return (std_x * std_y).mean()
a = pd.Series([2, 4, 5, 7, 9])
b = pd.Series([12, 10, 9, 7, 3])
ca = correlation(a, b)
print(ca)
Output:
-0.984661667628
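The difference between the attribute and the call can be seen directly in a short sketch. The variable names `without_call` and `with_call` are illustrative, and the comparison against `Series.corr` is added here only as a sanity check.

```python
import pandas as pd

a = pd.Series([2, 4, 5, 7, 9])
b = pd.Series([12, 10, 9, 7, 3])

# Standardise both series with the population standard deviation (ddof=0).
std_a = (a - a.mean()) / a.std(ddof=0)
std_b = (b - b.mean()) / b.std(ddof=0)
product = std_a * std_b

without_call = product.mean  # a bound method object; nothing is computed yet
with_call = product.mean()   # the actual number

print(callable(without_call))
print(with_call)
print(a.corr(b))             # pandas' built-in Pearson correlation, for comparison
```

Printing the bound method shows the Series it is attached to, which is why the original output looked like a Series rather than a single value.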

You can also use scipy.stats to calculate a Pearson correlation. At a minimum, you can use this as a quick check that your algorithm is correct.
from scipy.stats import pearsonr
import pandas as pd
a = pd.Series([2, 4, 5, 7, 9])
b = pd.Series([12, 10, 9, 7, 3])
pearsonr(a, b)[0] # -0.98466166762781315

It might be worth mentioning that you can also ask pandas directly to calculate the correlation between two series using corr which also allows you to specify the type of correlation:
a = pd.Series([2, 4, 5, 7, 9])
b = pd.Series([12, 10, 9, 7, 3])
a.corr(b)
will then return
-0.98466166762781315
You can apply corr also on a dataframe which calculates all pairwise correlations between your columns (as each column is perfectly correlated with itself, you see 1s on the diagonal):
pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 8]}).corr()
          a         b
a  1.000000  0.960769
b  0.960769  1.000000

Drop all data in a pandas dataframe

I would like to drop all data in a pandas dataframe, but am getting TypeError: drop() takes at least 2 arguments (3 given). I essentially want a blank dataframe with just my columns headers.
import pandas as pd
web_stats = {'Day': [1, 2, 3, 4, 2, 6],
'Visitors': [43, 43, 34, 23, 43, 23],
'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)
df.drop(axis=0, inplace=True)
print df

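You need to pass the labels to be dropped.
df.drop(df.index, inplace=True)
By default, it operates on axis=0.
You can achieve the same with
df.iloc[0:0]
which is much more efficient. Note that this returns a new, empty DataFrame rather than modifying df in place, so assign the result back (df = df.iloc[0:0]).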
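Both approaches can be compared side by side in a short sketch. It uses the data from the question; the names `dropped` and `sliced` are illustrative, and the non-inplace form of `drop` is used so that the original frame stays available for comparison.

```python
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 2, 6],
             'Visitors': [43, 43, 34, 23, 43, 23],
             'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)

dropped = df.drop(df.index)  # drop every row label; returns a new DataFrame
sliced = df.iloc[0:0]        # empty positional slice; also a new DataFrame

print(list(sliced.columns))       # the column headers survive
print(len(dropped), len(sliced))  # both results have no rows
```

In both cases the column headers and dtypes are kept, which matches the "blank dataframe with just my column headers" asked for.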

My favorite:
df = df.iloc[0:0]
But be aware that df.index.max() will be nan for the emptied DataFrame.
To add items I use (this needs import math):
df.loc[0 if math.isnan(df.index.max()) else df.index.max() + 1] = data

My favorite way is:
df = df[0:0]

Overwrite the dataframe with something like this:
import pandas as pd
df = pd.DataFrame(None)
or if you want to keep columns in place
df = pd.DataFrame(columns=df.columns)

If your goal is to drop everything in the dataframe, you can pass all of its columns to drop. For me, the best way is to pass a list comprehension to the columns kwarg. This works regardless of which columns the df has. Note that drop returns a new DataFrame, so assign the result back, and that this removes the column headers too, leaving only the index.
import pandas as pd
web_stats = {'Day': [1, 2, 3, 4, 2, 6],
'Visitors': [43, 43, 34, 23, 43, 23],
'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)
df = df.drop(columns=[i for i in df.columns])

This code creates a clean (empty) DataFrame:
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
#clean
df = pd.DataFrame()
