How to avoid warning in Pandas? [duplicate] - python

I'm trying to select a subset of a subset of a dataframe, selecting only some columns, and filtering on the rows.
df.loc[df.a.isin(['Apple', 'Pear', 'Mango']), ['a', 'b', 'f', 'g']]
However, I'm getting the error:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
What's the correct way to slice and filter now?

TL;DR: There is likely a typo or spelling error in the column header names.
This is a change introduced in v0.21.1, and has been explained in the docs at length -
Previously, selecting with a list of labels, where one or more labels
were missing would always succeed, returning NaN for missing labels.
This will now show a FutureWarning. In the future this will raise a
KeyError (GH15747). This warning will trigger on a DataFrame or a
Series for using .loc[] or [[]] when passing a list-of-labels with at
least 1 missing label.
For example,
df
A B C
0 7.0 NaN 8
1 3.0 3.0 5
2 8.0 1.0 7
3 NaN 0.0 3
4 8.0 2.0 7
Try some kind of slicing as you're doing -
df.loc[df.A.gt(6), ['A', 'C']]
A C
0 7.0 8
2 8.0 7
4 8.0 7
No problem. Now, try replacing C with a non-existent column label -
df.loc[df.A.gt(6), ['A', 'D']]
FutureWarning: Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
A D
0 7.0 NaN
2 8.0 NaN
4 8.0 NaN
So, in your case, the error is because of the column labels you pass to loc. Take another look at them.
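If you actually do want to select labels that might be missing (and get NaN columns back), the warning itself points at the fix. A minimal sketch using .reindex() on the example above:
df.loc[df.A.gt(6)].reindex(columns=['A', 'D'])  # missing 'D' becomes a NaN column, no warning
This keeps the old behaviour explicitly instead of relying on .loc to fill in missing labels.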

This warning also occurs with an .append call when the list contains new columns. To avoid it, use:
df = df.append(pd.Series({'A': i, 'M': j}), ignore_index=True)
instead of:
df = df.append([{'A': i, 'M': j}], ignore_index=True)
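Note that DataFrame.append itself is deprecated in later pandas versions (and removed in 2.0). A hedged equivalent with pd.concat, assuming i and j are scalar values as above:
new_row = pd.DataFrame([{'A': i, 'M': j}])
df = pd.concat([df, new_row], ignore_index=True)  # appends the row, creating column 'M' if needed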
Full error message:
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py:1472:
FutureWarning: Passing list-likes to .loc or [] with any missing label
will raise KeyError in the future, you can use .reindex() as an
alternative.
Thanks to https://stackoverflow.com/a/50230080/207661

If you want to retain the index, you can pass a list comprehension instead of a column list:
loan_data_inputs_train.loc[:,[i for i in List_col_without_reference_cat]]
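If the goal is simply to avoid the warning when some labels might not exist, one hedged alternative (assuming List_col_without_reference_cat is a plain list of column names) is to keep only the labels that are actually present:
cols = [c for c in List_col_without_reference_cat if c in loan_data_inputs_train.columns]
loan_data_inputs_train.loc[:, cols]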

Sorry, I'm not sure I understood you correctly, but it seems the following approach could work for you:
df[df['a'].isin(['Apple', 'Pear', 'Mango'])][['a', 'b', 'f', 'g']]
Snippet description:
df['a'].isin(['Apple', 'Pear', 'Mango'])  # row filter: keeps rows whose value in column *a* is in the list
df[['a', 'b', 'f', 'g']]                  # column filter: selects a specific set of columns
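For the original question, a hedged sketch that keeps a single .loc call but restricts the column list to labels that actually exist (the warning only fires for missing labels):
cols = pd.Index(['a', 'b', 'f', 'g']).intersection(df.columns)
df.loc[df['a'].isin(['Apple', 'Pear', 'Mango']), cols]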

Related

Append to only one column in a dataframe python

I have an empty pandas DataFrame. I want to append a value to one column at a time. I am trying to iterate through the columns using a for loop and append a value (5, for example). I wrote the code below but it does not work. Any idea?
example:
df: ['a', 'b', 'c']
for column in df:
    df.append({column: 5}, ignore_index=True)
I want to implement this by iterating through the columns. The result should be:
df: ['a', 'b', 'c']
5 5 5
This sounds like a bad idea, as it becomes extremely inefficient as your df grows in size, and I'm almost certain there is a much better way to do this if you give more context. But for the sake of answering the question: you can use the shape of the df to figure out the row, use the column name as the column, and use .at to manually assign the value.
Here we assign 3 values to the df, one column at a time.
import pandas as pd
df = pd.DataFrame({'a': [], 'b': [], 'c': []})
values_to_add = [3, 4, 5]
for v in values_to_add:
    row = df.shape[0]
    for column in df.columns:
        df.at[row, column] = v
Output
a b c
0 3.0 3.0 3.0
1 4.0 4.0 4.0
2 5.0 5.0 5.0
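As noted above, growing a DataFrame cell by cell gets slow as it grows. If all the values are known up front, a hedged, more idiomatic sketch builds the whole frame at once:
import pandas as pd
values_to_add = [3, 4, 5]
df = pd.DataFrame({col: values_to_add for col in ['a', 'b', 'c']}, dtype=float)  # same result as the loop above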

Get column name with last valid value for each index

I have a dataframe like this -
A B C
0 1 NaN 3.0
1 2 3.0 NaN
2 2 NaN NaN
3 NaN NaN 53
I need to find the column name with the last valid value for each index. For example, for the above dataframe, I want to get output like this:
['C', 'B', 'A', 'C']
I did try to get the column names but was only able to grab the values, by using iteritems() on the transpose of the dataframe. Also, since it loops through the dataframe, I don't find it very optimal. Please find my approach below:
l_val = []
for idx, row in df.T.iteritems():
    last_val = None
    for x in row:
        if not pd.isna(x):
            last_val = x
    l_val.append(last_val)
Returns -
[3.0, 3.0, 2.0]
I have searched a lot, but most answers refer to the last_valid_index method, which returns the last valid index in a column, and I don't see how to use it for my problem. Can someone please suggest a fast way to do this?
You can do:
df.idxmax(axis=1).to_list()
Output:
['C', 'B', 'A', 'C']
EDIT:
The solution above returns the column of the maximum value in each row, which happens to match here. If values in earlier columns can be greater than values in later columns, use the solution below instead, which returns the column of the last valid value:
df.T.apply(pd.Series.last_valid_index).to_list()
Output:
['C', 'B', 'A', 'C']
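The transpose can also be avoided by applying last_valid_index across rows directly; a hedged one-liner assuming the same frame:
df.apply(pd.Series.last_valid_index, axis=1).to_list()  # ['C', 'B', 'A', 'C']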

pandas transform with NaN values in grouped columns [duplicate]

I have a DataFrame with many missing values in columns which I wish to groupby:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.NaN, '6']})
In [4]: df.groupby('b').groups
Out[4]: {'4': [0], '6': [2]}
See that pandas has dropped the rows with NaN target values. (I want to include these rows!)
Since I need many such operations (many columns have missing values), and use more complicated functions than just medians (typically random forests), I want to avoid writing overly complicated code.
Any suggestions? Should I write a function for this or is there a simple solution?
pandas >= 1.1
From pandas 1.1 you have better control over this behavior: NA values are now allowed in the grouper using dropna=False:
pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'
# Example from the docs
df
a b c
0 1 2.0 3
1 1 NaN 4
2 2 1.0 3
3 1 2.0 2
# without NA (the default)
df.groupby('b').sum()
a c
b
1.0 2 3
2.0 2 5
# with NA
df.groupby('b', dropna=False).sum()
a c
b
1.0 2 3
2.0 2 5
NaN 1 4
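dropna=False works the same way when grouping on several keys; a short hedged example with the frame above:
df.groupby(['a', 'b'], dropna=False).sum()  # keeps the (1, NaN) group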
This is mentioned in the Missing Data section of the docs:
NA groups in GroupBy are automatically excluded. This behavior is consistent with R
One workaround is to use a placeholder before doing the groupby (e.g. -1):
In [11]: df.fillna(-1)
Out[11]:
a b
0 1 4
1 2 -1
2 3 6
In [12]: df.fillna(-1).groupby('b').sum()
Out[12]:
a
b
-1 2
4 1
6 3
That said, this feels like a pretty awful hack... perhaps there should be an option to include NaN in groupby (see this github issue, which uses the same placeholder hack).
However, as described in another answer, "from pandas 1.1 you have better control over this behavior, NA values are now allowed in the grouper using dropna=False"
Ancient topic, but in case someone still stumbles over this: another workaround is to convert to string via .astype(str) before grouping. That will preserve the NaNs.
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.NaN, '6']})
df['b'] = df['b'].astype(str)
df.groupby(['b']).sum()
a
b
4 1
6 3
nan 2
I am not able to add a comment to M. Kiewisch's answer since I do not have enough reputation points (I only have 41 but need more than 50 to comment).
Anyway, I just want to point out that M. Kiewisch's solution does not work as-is and may need more tweaking. Consider, for example:
>>> df = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [4, np.NaN, 6, 4]})
>>> df
a b
0 1 4.0
1 2 NaN
2 3 6.0
3 5 4.0
>>> df.groupby(['b']).sum()
a
b
4.0 6
6.0 3
>>> df.astype(str).groupby(['b']).sum()
a
b
4.0 15
6.0 3
nan 2
which shows that for group b=4.0, the corresponding value is 15 instead of 6. Here it is just concatenating 1 and 5 as strings instead of adding them as numbers.
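A hedged tweak that avoids this problem: cast only the grouping column to string (or group on the casted Series directly), so the aggregated columns stay numeric:
df.groupby(df['b'].astype(str))['a'].sum()  # '4.0' -> 6, '6.0' -> 3, 'nan' -> 2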
All answers provided thus far result in potentially dangerous behavior as it is quite possible you select a dummy value that is actually part of the dataset. This is increasingly likely as you create groups with many attributes. Simply put, the approach doesn't always generalize well.
A less hacky solution is to use DataFrame.drop_duplicates() to create a unique index of value combinations, each with its own ID, and then group on that ID. It is more verbose but it does get the job done:
def safe_groupby(df, group_cols, agg_dict):
    # set name of group col to unique value
    group_id = 'group_id'
    while group_id in df.columns:
        group_id += 'x'
    # get final order of columns
    agg_col_order = (group_cols + list(agg_dict.keys()))
    # create unique index of grouped values
    group_idx = df[group_cols].drop_duplicates()
    group_idx[group_id] = np.arange(group_idx.shape[0])
    # merge unique index on dataframe
    df = df.merge(group_idx, on=group_cols)
    # group dataframe on group id and aggregate values
    df_agg = df.groupby(group_id, as_index=True)\
               .agg(agg_dict)
    # merge grouped value index to results of aggregation
    df_agg = group_idx.set_index(group_id).join(df_agg)
    # rename index
    df_agg.index.name = None
    # return reordered columns
    return df_agg[agg_col_order]
Note that you can now simply do the following:
data_block = [np.tile([None, 'A'], 3),
              np.repeat(['B', 'C'], 3),
              [1] * (2 * 3)]
col_names = ['col_a', 'col_b', 'value']
test_df = pd.DataFrame(data_block, index=col_names).T
grouped_df = safe_groupby(test_df, ['col_a', 'col_b'],
                          OrderedDict([('value', 'sum')]))
This will return the successful result without having to worry about overwriting real data that is mistaken as a dummy value.
One small point about Andy Hayden's solution: it doesn't work (anymore?) because np.nan == np.nan yields False, so the replace function doesn't actually do anything.
What worked for me was this:
df['b'] = df['b'].apply(lambda x: x if not np.isnan(x) else -1)
(At least that's the behavior for Pandas 0.19.2. Sorry to add it as a different answer, I do not have enough reputation to comment.)
I answered this already, but for some reason the answer was converted to a comment. Nevertheless, this is the most efficient solution:
Not being able to include (and propagate) NaNs in groups is quite aggravating. Citing R is not convincing, as this behavior is not consistent with a lot of other things. Anyway, the dummy hack is also pretty bad. However, the size (includes NaNs) and the count (ignores NaNs) of a group will differ if there are NaNs.
dfgrouped = df.groupby(['b']).a.agg(['sum','size','count'])
dfgrouped['sum'][dfgrouped['size']!=dfgrouped['count']] = None
When these differ, you can set the value back to None for the result of the aggregation function for that group.
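The same idea written with .loc (avoiding the chained assignment above, which can trigger SettingWithCopyWarning); a hedged sketch:
dfgrouped = df.groupby('b')['a'].agg(['sum', 'size', 'count'])
dfgrouped.loc[dfgrouped['size'] != dfgrouped['count'], 'sum'] = None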

Unable to slice pandas dataframe (with date as key) using date as string

I'm generating an empty dataframe with a series of dates as the index. Data will be added to the dataframe at a later point.
cbd=pd.date_range(start=pd.datetime(2017,01,02),end=pd.datetime(2017,01,30),period=1)
df = pd.DataFrame(data=None,columns=['Test1','Test2'],index=cbd)
df.head()
Test1 Test2
2017-01-02 NaN NaN
2017-01-03 NaN NaN
2017-01-04 NaN NaN
2017-01-05 NaN NaN
2017-01-06 NaN NaN
A few slicing methods don't seem to work. The following returns a KeyError:
df['2017-01-02']
However any of the following work:
df['2017-01-02':'2017-01-02']
df.loc['2017-01-02']
What am I missing here? Why doesn't the first slice return a result?
Dual behavior of [] in df[]
When you don't use : inside [], then the value(s) inside it will be considered as column(s).
And when you use : inside [], then the value(s) inside it will be considered as row(s).
Why the dual nature?
Because most of the time people want to slice the rows instead of slicing the columns.
So they decided that x and y in df[x:y] should correspond to rows,
and that x in df[x] or x, y in df[[x, y]] should correspond to column(s).
Example:
df = pd.DataFrame(data=[[1, 2, 3], [1, 2, 3], [1, 2, 3]],
                  index=['A', 'B', 'C'], columns=['A', 'B', 'C'])
print(df)
Output:
A B C
A 1 2 3
B 1 2 3
C 1 2 3
Now when you do df['B'], it can mean 2 things:
Take the 2nd index B and give you the 2nd row 1 2 3
OR
Take the 2nd column B and give you the 2nd column 2 2 2.
So in order to resolve this conflict and keep it unambiguous df['B'] will always mean that you want the column 'B', if there is no such column then it will throw an Error.
Why does df['2017-01-02'] fail?
It will search for a column '2017-01-02'. Because there is no such column, it throws an error.
Why does df.loc['2017-01-02'] work then?
Because .loc[] has the syntax df.loc[row, column], and you can leave out the column; as in your case, it simply means df.loc[row].
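For illustration on the question's frame (a sketch, assuming the date-indexed df built above):
df.loc['2017-01-02']             # the whole row for that date
df.loc['2017-01-02', 'Test1']    # a single cell: row label, then column label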
There is a difference, because the two use different approaches.
To select one row, loc is necessary, since the following raises a KeyError:
df['2017-01-02']
Docs - partial string indexing:
Warning
The following selection will raise a KeyError; otherwise this selection methodology would be inconsistent with other selection methods in pandas (as this is not a slice, nor does it resolve to one):
dft['2013-1-15 12:30:00']
To select a single row, use .loc
In [74]: dft.loc['2013-1-15 12:30:00']
Out[74]:
A 0.193284
Name: 2013-01-15 12:30:00, dtype: float64
df['2017-01-02':'2017-01-02']
This is pure partial string indexing:
This type of slicing will work on a DataFrame with a DateTimeIndex as well. Since the partial string selection is a form of label slicing, the endpoints will be included. This would include matching times on an included date.
First I have updated your test data (just for info), as the original returns an 'invalid token' error: pd.datetime(2017,01,02) uses leading zeros, which are not valid integer literals in Python 3. Please see the changes here:
cbd=pd.date_range(start='2017-01-02',end='2017-01-30',period=1)
df = pd.DataFrame(data=None,columns=['Test1','Test2'],index=cbd)
Now looking at the first row:
In[1]:
df.head(1)
Out[1]:
Test1 Test2
2017-01-02 NaN NaN
And then trying the initial slicing approach yields this error:
In[2]:
df['2017-01-02']
Out[2]:
KeyError: '2017-01-02'
Now try this using the column name:
In[3]:
df.columns
Out[3]:
Index(['Test1', 'Test2'], dtype='object')
In[4]:
df['Test1']
We try 'Test1' and get the NaN output for this column:
Out[4]:
2017-01-02 NaN
2017-01-03 NaN
2017-01-04 NaN
2017-01-05 NaN
So the format you are using is designed to be used on the column name unless you use this format df['2017-01-02':'2017-01-02'].
The Pandas docs state "The following selection will raise a KeyError; otherwise this selection methodology would be inconsistent with other selection methods in pandas (as this is not a slice, nor does it resolve to one)".
So as you correctly identified, DataFrame.loc is a label based indexer which yields the output you are looking for:
In[5]:
df.loc['2017-01-02']
Out[5]:
Test1 NaN
Test2 NaN
Name: 2017-01-02 00:00:00, dtype: object
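To summarise the behaviour discussed above in one hedged snippet (using the same date-indexed frame):
df.loc['2017-01-02']              # row by label
df['2017-01-02':'2017-01-02']     # partial-string slice of rows (endpoints included)
df['Test1']                       # plain [] with a single label selects a column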
