SettingWithCopyWarning in pandas: how to set the first value in a column?

When running my code I get the following message:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df['detect'][df.index[0]] = df['event'][df.index[0]]
What is the correct way of setting the first value of a column equal to the first value of another column?

Use ix to perform the index label selection in a single call (in modern pandas, where .ix has been removed, use .loc the same way):
In [102]:
df = pd.DataFrame({'detect':np.random.randn(5), 'event':np.arange(5)})
df
Out[102]:
detect event
0 -0.815105 0
1 -0.656923 1
2 -1.417722 2
3 0.210070 3
4 0.211728 4
In [103]:
df.ix[0,'detect'] = df.ix[0,'event']
df
Out[103]:
detect event
0 0.000000 0
1 -0.656923 1
2 -1.417722 2
3 0.210070 3
4 0.211728 4
What you are doing is called chained indexing, which may or may not operate on a copy of the data, hence the warning.
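In current pandas versions the same single-step assignment is written with .loc; a minimal runnable sketch of the fix:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'detect': np.random.randn(5), 'event': np.arange(5)})

# Chained indexing (df['detect'][df.index[0]] = ...) may write to a
# temporary copy. A single .loc call addresses the underlying frame
# directly, so no SettingWithCopyWarning is raised.
df.loc[df.index[0], 'detect'] = df.loc[df.index[0], 'event']
print(df.head(1))
```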

Related

How can I assign a new column to a slice of a pandas DataFrame with a multiindex?

I have a pandas DataFrame with a multi-index like this:
import pandas as pd
import numpy as np
arr = [1]*3 + [2]*3
arr2 = list(range(3)) + list(range(3))
mux = pd.MultiIndex.from_arrays([
    arr,
    arr2
], names=['one', 'two'])
df = pd.DataFrame({'a': np.arange(len(mux))}, mux)
df
a
one two
1 0 0
1 1 1
1 2 2
2 0 3
2 1 4
2 2 5
I have a function that takes a slice of a DataFrame and needs to assign a new column to the rows that have been sliced:
def work(df):
    b = df.copy()
    # do some work on the slice and create values for a new column of the slice
    b['b'] = b['a']*2
    # assign the new values back to the slice in a new column
    df['b'] = b['b']

# pass in a slice of the df with only records that have the last value for 'two'
work(df.loc[df.index.isin(df.index.get_level_values('two')[-1:], level=1)])
However calling the function results in the error:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
How can I create a new column 'b' in the original DataFrame and assign its values for only the rows that were passed to the function, leaving the rest of the rows nan?
The desired output is:
a b
one two
1 0 0 nan
1 1 1 nan
1 2 2 4
2 0 3 nan
2 1 4 nan
2 2 5 10
NOTE: In the work function I'm actually doing a bunch of complex operations involving calling other functions to generate the values for the new column so I don't think this will work. Multiplying by 2 in my example is just for illustrative purposes.
You actually don't have an error, but just a warning. Try this:
def work(df):
    b = df.copy()
    # do some work on the slice and create values for a new column of the slice
    b['b'] = b['a']*2
    # assign the new values back to the slice in a new column
    df['b'] = b['b']
    return df

# pass in a slice of the df with only records that have the last value for 'two'
new_df = work(df.loc[df.index.isin(df.index.get_level_values('two')[-1:], level=1)])
Then:
df.reset_index().merge(new_df, how="left").set_index(["one","two"])
Output:
         a     b
one two
1   0    0   NaN
    1    1   NaN
    2    2   4.0
2   0    3   NaN
    1    4   NaN
    2    5  10.0
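A runnable sketch of this approach, with one hedged variation: the helper returns the modified copy instead of mutating the slice (which avoids the warning entirely), and the merge is done on the index levels rather than on whatever data columns happen to overlap (names `mask`, `result` are mine):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
mux = pd.MultiIndex.from_arrays([[1]*3 + [2]*3, list(range(3))*2],
                                names=['one', 'two'])
df = pd.DataFrame({'a': np.arange(len(mux))}, index=mux)

def work(df):
    b = df.copy()        # work on an explicit copy of the slice
    b['b'] = b['a'] * 2
    return b

mask = df.index.isin(df.index.get_level_values('two')[-1:], level=1)
new_df = work(df.loc[mask])

# Merging on the index levels (plus 'a') rather than relying on column
# overlap keeps the join unambiguous even if 'a' had duplicate values.
result = (df.reset_index()
            .merge(new_df.reset_index(), how='left', on=['one', 'two', 'a'])
            .set_index(['one', 'two']))
print(result)
```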
I don't think you need a separate function at all. Try this...
df['b'] = df['a'].where(df.index.isin(df.index.get_level_values('two')[-1:], level=1))*2
The Series.where() call on df['a'] returns a series that keeps the original values for the rows matched by your query and NaN everywhere else, so multiplying by 2 only affects the kept rows.
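A self-contained version of this one-liner on the question's frame (the name `mask` is mine, pulled out for readability):

```python
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays([[1]*3 + [2]*3, list(range(3))*2],
                                names=['one', 'two'])
df = pd.DataFrame({'a': np.arange(len(mux))}, index=mux)

mask = df.index.isin(df.index.get_level_values('two')[-1:], level=1)
# .where keeps 'a' where the mask is True and inserts NaN elsewhere,
# so only the selected rows get a computed value in 'b'.
df['b'] = df['a'].where(mask) * 2
print(df)
```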

Delete zeros from a pandas dataframe

This question was asked in multiple other posts but I could not get any of the methods to work. This is my dataframe:
df = pd.DataFrame([[1,2,3,4,5],[1,2,0,4,5]])
I would like to know how I can either:
1) Delete rows that contain any/all zeros
2) Delete columns that contain any/all zeros
In order to delete rows that contain any zeros, this worked:
df2 = df[~(df == 0).any(axis=1)]  # drop rows containing any zero
df2 = df[~(df == 0).all(axis=1)]  # drop rows where every value is zero
But I cannot get this to work column-wise. I tried to set axis=0, but that gives me this warning:
__main__:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
Any suggestions?
You're going to need loc for this:
df
0 1 2 3 4
0 1 2 3 4 5
1 1 2 0 4 5
df.loc[:, ~(df == 0).any(0)] # notice the :, this means we are indexing on the columns now, not the rows
0 1 3 4
0 1 2 4 5
1 1 2 4 5
Direct indexing defaults to indexing on the rows. Your boolean mask is indexed by the five column labels (True at [0, 1, 3, 4]), but the dataframe only has two rows, so pandas warns that the boolean key will be reindexed to match the row index.
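Putting the row-wise and column-wise variants side by side as a runnable sketch (variable names are mine):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [1, 2, 0, 4, 5]])

rows_any = df[~(df == 0).any(axis=1)]         # drop rows containing any zero
cols_any = df.loc[:, ~(df == 0).any(axis=0)]  # drop columns containing any zero
cols_all = df.loc[:, ~(df == 0).all(axis=0)]  # drop only all-zero columns
print(cols_any)
```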

Why does pandas df.loc + lambda not work?

I have created a pandas DataFrame from a CSV file, and I want to select rows using a lambda, following this pandas manual, but it does not work.
The exception:
What is the problem?
Thanks.
As @BrenBam said in the comments, this syntax was added in pandas 0.18.1 and won't work in earlier versions.
Selection By Callable:
.loc, .iloc, .ix and also [] indexing can accept a callable as
indexer. The callable must be a function with one argument (the
calling Series, DataFrame or Panel) and that returns valid output for
indexing.
Example (version 0.18.1):
In [10]: df
Out[10]:
a b c
0 1 4 2
1 2 2 4
2 3 4 0
3 0 2 3
4 3 0 4
In [11]: df.loc[lambda df: df.a == 3]
Out[11]:
a b c
2 3 4 0
4 3 0 4
For versions <= 0.18.0 you can't use selection by callable; do it this way instead:
df.loc[df['Date'] == '2003-01-01 00:00:00', ['Date']]
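A self-contained sketch of selection by callable on a frame like the answer's (data values are mine):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 0, 3],
                   'b': [4, 2, 4, 2, 0],
                   'c': [2, 4, 0, 3, 4]})

# Since pandas 0.18.1, .loc accepts a callable that receives the frame
# and must return a valid indexer (here, a boolean mask).
selected = df.loc[lambda d: d.a == 3]
print(selected)
```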

Grouping data in Python with pandas yields a blank first row

I have this nice pandas dataframe:
And I want to group it by the column "0" (which represents the year) and calculate the mean of the other columns for each year. I do such thing with this code:
df.groupby(0)[2,3,4].mean()
And that successfully calculates the mean of every column. The problem here being the empty row that appears on top:
That's just a display thing: the grouped column becomes the index, and this is simply how a named index is displayed. Note that even when you set pd.set_option('display.notebook_repr_html', False) you still get this line; it has no effect on operations on the grouped df:
In [30]:
df = pd.DataFrame({'a':np.random.randn(5), 'b':np.random.randn(5), 'c':np.arange(5)})
df
Out[30]:
a b c
0 0.766706 -0.575700 0
1 0.594797 -0.966856 1
2 1.852405 1.003855 2
3 -0.919870 -1.089215 3
4 -0.647769 -0.541440 4
In [31]:
df.groupby('c')[['a','b']].mean()
Out[31]:
a b
c
0 0.766706 -0.575700
1 0.594797 -0.966856
2 1.852405 1.003855
3 -0.919870 -1.089215
4 -0.647769 -0.541440
Technically speaking, it has assigned the name attribute:
In [32]:
df.groupby('c')[['a','b']].mean().index.name
Out[32]:
'c'
By default there is no name if one has not been assigned:
In [34]:
print(df.index.name)
None
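If a flat column is preferred over the named index, one common follow-up is reset_index(); a sketch (names `grouped`, `flat` are mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(5),
                   'b': np.random.randn(5),
                   'c': np.arange(5)})

grouped = df.groupby('c')[['a', 'b']].mean()
# 'c' is now the index; its name is what prints as the extra header line.
flat = grouped.reset_index()  # turn the index back into an ordinary column
print(flat.columns.tolist())
```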

Change values over specific indexes in DataFrame

Let's say I have a Series of flags in a DataFrame:
a=pd.DataFrame({'flag':[0,1,0,0,1]})
and I want to change the values of the flags at specific indexes:
lind=[0,1,3]
This is a simple solution:
from functools import partial

def chnflg(series, ind):
    if series.ix[ind] == 0:
        series.ix[ind] = 1
    else:
        series.ix[ind] = 0

map(partial(chnflg, a), lind)  # note: under Python 3, wrap in list() to force evaluation
It works fine but there are two issues: the first is that it makes the changes in-place, while I would like a new series in DataFrame. This is not a big deal after all.
The second point is that it does not seem pythonic enough. Is it possible to do better?
An easier way to describe your function is as x -> 1 - x; this will be more efficient than apply/map.
In [11]: 1 - a.iloc[lind]
Out[11]:
flag
0 1
1 0
3 1
Note: I like to use iloc here as it's less ambiguous.
If you wanted to assign these inplace then do the explicit assignment:
In [12]: a.iloc[lind] = 1 - a.iloc[lind]
In [13]: a
Out[13]:
flag
0 1
1 0
2 0
3 1
4 1
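The same flip as a runnable sketch that also keeps the original intact, which addresses the question's not-in-place wish (the name `flipped` is mine):

```python
import pandas as pd

a = pd.DataFrame({'flag': [0, 1, 0, 0, 1]})
lind = [0, 1, 3]

# Flip 0 <-> 1 arithmetically; assign back only at the chosen positions.
flipped = a.copy()  # leave the original frame untouched
flipped.iloc[lind] = 1 - flipped.iloc[lind]
print(flipped['flag'].tolist())
```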
You could create a dict that flips the values and call map, this would return a series and you can create a new dataframe and leave the original intact:
In [6]:
temp = {0: 1, 1: 0}
pd.DataFrame(a.loc[lind, 'flag'].map(temp))  # .loc replaces the deprecated .ix
Out[6]:
flag
0 1
1 0
3 1
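A self-contained version of the dict-map approach, using .loc in place of the deprecated .ix (the name `flipped` is mine):

```python
import pandas as pd

a = pd.DataFrame({'flag': [0, 1, 0, 0, 1]})
lind = [0, 1, 3]

# Map each selected flag through a flipping dict; rows not in lind are
# simply absent from the new frame, and the original is left unchanged.
flipped = pd.DataFrame(a.loc[lind, 'flag'].map({0: 1, 1: 0}))
print(flipped)
```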
