pandas transform with NaN values in grouped columns [duplicate] - python

I have a DataFrame with many missing values in columns which I wish to groupby:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.NaN, '6']})
In [4]: df.groupby('b').groups
Out[4]: {'4': [0], '6': [2]}
See that Pandas has dropped the rows with NaN target values. (I want to include these rows!)
Since I need many such operations (many columns have missing values), and use more complicated functions than just medians (typically random forests), I want to avoid writing overly complicated code.
Any suggestions? Should I write a function for this or is there a simple solution?

pandas >= 1.1
From pandas 1.1 onwards you have better control over this behavior: NA values are now allowed in the grouper using dropna=False:
pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'
# Example from the docs
df
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2
# without NA (the default)
df.groupby('b').sum()
     a  c
b
1.0  2  3
2.0  2  5
# with NA
df.groupby('b', dropna=False).sum()
     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4
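Applied to the frame from the question, the same flag keeps the NaN row as its own group. A quick sketch (requires pandas >= 1.1; the exact repr of the group keys may differ):
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.nan, '6']})
df.groupby('b', dropna=False).groups
# roughly: {'4': [0], '6': [2], nan: [1]}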

This is mentioned in the Missing Data section of the docs:
NA groups in GroupBy are automatically excluded. This behavior is consistent with R
One workaround is to use a placeholder before doing the groupby (e.g. -1):
In [11]: df.fillna(-1)
Out[11]:
   a   b
0  1   4
1  2  -1
2  3   6
In [12]: df.fillna(-1).groupby('b').sum()
Out[12]:
    a
b
-1  2
4   1
6   3
That said, this feels like a pretty awful hack... perhaps there should be an option to include NaN in groupby (see this github issue, which uses the same placeholder hack).
However, as described in another answer, "from pandas 1.1 you have better control over this behavior, NA values are now allowed in the grouper using dropna=False"

Ancient topic, but if someone still stumbles over this: another workaround is to convert to string via .astype(str) before grouping. That will preserve the NaNs.
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.NaN, '6']})
df['b'] = df['b'].astype(str)
df.groupby(['b']).sum()
     a
b
4    1
6    3
nan  2

I am not able to add a comment to M. Kiewisch's answer since I do not have enough reputation points (I only have 41 but need more than 50 to comment).
Anyway, I just want to point out that M. Kiewisch's solution does not work as is and may need more tweaking. Consider for example
>>> df = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [4, np.NaN, 6, 4]})
>>> df
   a    b
0  1  4.0
1  2  NaN
2  3  6.0
3  5  4.0
>>> df.groupby(['b']).sum()
     a
b
4.0  6
6.0  3
>>> df.astype(str).groupby(['b']).sum()
      a
b
4.0  15
6.0   3
nan   2
which shows that for group b=4.0, the corresponding value is 15 instead of 6. Here it is just concatenating 1 and 5 as strings instead of adding them as numbers.
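To keep the string trick while leaving the aggregated columns numeric, one safer variant (a sketch) is to cast only the grouping column:
df = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [4, np.nan, 6, 4]})
df.assign(b=df['b'].astype(str)).groupby('b').sum()
#      a
# b
# 4.0  6
# 6.0  3
# nan  2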

All answers provided thus far result in potentially dangerous behavior as it is quite possible you select a dummy value that is actually part of the dataset. This is increasingly likely as you create groups with many attributes. Simply put, the approach doesn't always generalize well.
A less hacky solution is to use DataFrame.drop_duplicates() to create a unique index of value combinations, each with its own ID, and then group on that ID. It is more verbose but does get the job done:
from collections import OrderedDict

import numpy as np
import pandas as pd

def safe_groupby(df, group_cols, agg_dict):
    # set name of group col to a name not already in df
    group_id = 'group_id'
    while group_id in df.columns:
        group_id += 'x'
    # get final order of columns
    agg_col_order = (group_cols + list(agg_dict.keys()))
    # create unique index of grouped values
    group_idx = df[group_cols].drop_duplicates()
    group_idx[group_id] = np.arange(group_idx.shape[0])
    # merge unique index on dataframe
    df = df.merge(group_idx, on=group_cols)
    # group dataframe on group id and aggregate values
    df_agg = df.groupby(group_id, as_index=True).agg(agg_dict)
    # merge grouped value index to results of aggregation
    df_agg = group_idx.set_index(group_id).join(df_agg)
    # rename index
    df_agg.index.name = None
    # return reordered columns
    return df_agg[agg_col_order]
Note that you can now simply do the following:
data_block = [np.tile([None, 'A'], 3),
              np.repeat(['B', 'C'], 3),
              [1] * (2 * 3)]
col_names = ['col_a', 'col_b', 'value']
test_df = pd.DataFrame(data_block, index=col_names).T
grouped_df = safe_groupby(test_df, ['col_a', 'col_b'],
                          OrderedDict([('value', 'sum')]))
This will return the successful result without having to worry about overwriting real data that is mistaken as a dummy value.
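For reference, on the test frame above the result should look roughly like this (None is kept as its own group):
  col_a col_b value
0  None     B     2
1     A     B     1
2     A     C     2
3  None     C     1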

One small point to Andy Hayden's solution – it doesn't work (anymore?) because np.nan == np.nan yields False, so the replace function doesn't actually do anything.
What worked for me was this:
df['b'] = df['b'].apply(lambda x: x if not np.isnan(x) else -1)
(At least that's the behavior for Pandas 0.19.2. Sorry to add it as a different answer, I do not have enough reputation to comment.)

I answered this already, but for some reason the answer was converted to a comment. Nevertheless, this is the most efficient solution:
Not being able to include (and propagate) NaNs in groups is quite aggravating. Citing R is not convincing, as this behavior is not consistent with a lot of other things. Anyway, the dummy hack is also pretty bad. However, the size of a group (which includes NaNs) and the count (which ignores NaNs) will differ if there are NaNs.
dfgrouped = df.groupby(['b']).a.agg(['sum', 'size', 'count'])
dfgrouped.loc[dfgrouped['size'] != dfgrouped['count'], 'sum'] = None
When these differ, you can set the aggregated value back to None for that group.
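As a minimal sketch of the idea (hypothetical frame, with the NaN sitting in the aggregated column a rather than the grouper):
df = pd.DataFrame({'b': [4, 4, 6], 'a': [1, np.nan, 3]})
dfgrouped = df.groupby(['b']).a.agg(['sum', 'size', 'count'])
# group 4 has size 2 but count 1, so its sum (1.0) gets masked to None
dfgrouped.loc[dfgrouped['size'] != dfgrouped['count'], 'sum'] = None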

Related

How to iterate over rows and multiple columns in pandas?

I have a dataframe (df1) and I want to replace the values for the columns V2 and V3 if they have the same value as V1.
import pandas as pd
import numpy as np
df_start= pd.DataFrame({"ID":[1, 2 , 3 ,4, 5], "V1":[10,5,15,20,20], "V2":[10,5,20,17,15], "V3":[10, 25, 15, 10, 20]})
df_end = pd.DataFrame({"ID":[1, 2 , 3 ,4, 5], "V1":[10,5,15,20,20], "V2":[np.nan,np.nan,20,17,15], "V3":[np.nan, 25, np.nan, 10, np.nan]})
I know iterrows is not recommended but I don't know what I should do.
You can use mask:
For a separate dataframe use assign:
df_end = df_start.assign(**df_start[['V2','V3']]
                         .mask(df_start[['V2','V3']].eq(df_start['V1'], axis=0)))
For modifying the input dataframe just assign in place:
df_start[['V2','V3']] = (df_start[['V2','V3']]
                         .mask(df_start[['V2','V3']].eq(df_start['V1'], axis=0)))
   ID  V1    V2    V3
0   1  10   NaN   NaN
1   2   5   NaN  25.0
2   3  15  20.0   NaN
3   4  20  17.0  10.0
4   5  20  15.0   NaN
You'll still use a regular loop to go through the columns, but the apply function is your best friend for this kind of row-wise operation. If you're going to use info from more than one column (here you're comparing some column and "V1"), you use apply on the DataFrame and specify the axis. If you were only looking at info from one column (like making a column that doubles values from V1 if they're even), you can use apply with just a Series.
For both versions of the function, the argument you're going to pass is a lambda expression. If you apply it to a DataFrame like you are here, the x represents the values in a row that can be indexed by a column. Finally, you assign the result back to a new or existing column in your DataFrame.
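For instance, the single-column case mentioned above could look like this (a sketch; V1_doubled is a hypothetical column name):
df_start['V1_doubled'] = df_start['V1'].apply(lambda v: v * 2 if v % 2 == 0 else v)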
Assuming that df_start and df_end represent your planned input and output:
cols = ["V2","V3"]
for col in cols:
df_start[col] = df.apply(lambda x[col] if x[col] != x["V1"] else np.nan, axis=1]

How to convert a Pandas series into a Dataframe for merging [duplicate]

If you came here looking for information on how to merge a DataFrame and Series on the index, please look at this answer.
The OP's original intention was to ask how to assign series elements as columns to another DataFrame. If you are interested in knowing the answer to this, look at the accepted answer by EdChum.
Best I can come up with is
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]}) # see EDIT below
s = pd.Series({'s1':5, 's2':6})
for name in s.index:
    df[name] = s[name]

   a  b  s1  s2
0  1  3   5   6
1  2  4   5   6
Can anybody suggest better syntax / faster method?
My attempts:
df.merge(s)
AttributeError: 'Series' object has no attribute 'columns'
and
df.join(s)
ValueError: Other Series must have a name
EDIT The first two answers posted highlighted a problem with my question, so please use the following to construct df:
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
with the final result
     a  b  s1  s2
3  NaN  4   5   6
5    2  5   5   6
6    3  6   5   6
Update
From v0.24.0 onwards, you can merge a DataFrame and a Series as long as the Series is named.
df.merge(s.rename('new'), left_index=True, right_index=True)
# If series is already named,
# df.merge(s, left_index=True, right_index=True)
Nowadays, you can simply convert the Series to a DataFrame with to_frame(). So (if joining on index):
df.merge(s.to_frame(), left_index=True, right_index=True)
You could construct a dataframe from the series and then merge it with the dataframe.
So you specify the data as the series values repeated len(s) times, set the columns to the series index, and set the left_index and right_index params to True:
In [27]:
df.merge(pd.DataFrame(data = [s.values] * len(s), columns = s.index), left_index=True, right_index=True)
Out[27]:
   a  b  s1  s2
0  1  3   5   6
1  2  4   5   6
EDIT: for the situation where you want the index of the df constructed from the series to use the index of the df, you can do the following:
df.merge(pd.DataFrame(data = [s.values] * len(df), columns = s.index, index=df.index), left_index=True, right_index=True)
This assumes that the indices match in length.
Here's one way:
df.join(pd.DataFrame(s).T).fillna(method='ffill')
To break down what happens here...
pd.DataFrame(s).T creates a one-row DataFrame from s which looks like this:
   s1  s2
0   5   6
Next, join concatenates this new frame with df:
   a  b   s1   s2
0  1  3    5    6
1  2  4  NaN  NaN
Lastly, the NaN values at index 1 are filled with the previous values in the column using fillna with the forward-fill (ffill) argument:
   a  b  s1  s2
0  1  3   5   6
1  2  4   5   6
To avoid using fillna, it's possible to use pd.concat to repeat the rows of the DataFrame constructed from s. In this case, the general solution is:
df.join(pd.concat([pd.DataFrame(s).T] * len(df), ignore_index=True))
Here's another solution to address the indexing challenge posed in the edited question:
df.join(pd.DataFrame(s.repeat(len(df)).values.reshape((len(df), -1), order='F'),
                     columns=s.index,
                     index=df.index))
s is transformed into a DataFrame by repeating the values and reshaping (specifying 'Fortran' order), and also passing in the appropriate column names and index. This new DataFrame is then joined to df.
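For intuition, the intermediate arrays look like this (a sketch with len(df) == 3):
s.repeat(3).values
# array([5, 5, 5, 6, 6, 6])
s.repeat(3).values.reshape((3, -1), order='F')
# array([[5, 6],
#        [5, 6],
#        [5, 6]])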
Nowadays, a much simpler and more concise solution can achieve the same task. Leveraging the capability of DataFrame.apply() to turn a Series into columns of its belonging DataFrame, we can use:
df.join(df.apply(lambda x: s, axis=1))
Result:
     a  b  s1  s2
3  NaN  4   5   6
5  2.0  5   5   6
6  3.0  6   5   6
Here, we used DataFrame.apply() with a simple lambda function as the applied function on axis=1. The applied lambda function simply returns the Series s:
df.apply(lambda x: s, axis=1)
Result:
   s1  s2
3   5   6
5   5   6
6   5   6
The result has already inherited the row index of the original DataFrame df. Consequently, we can simply join df with this interim result by DataFrame.join() to get the desired final result (since they have the same row index).
This capability of DataFrame.apply() to turn a Series into columns of its belonging DataFrame is well documented in the official documentation as follows:
By default (result_type=None), the final return type is inferred from
the return type of the applied function.
The default behaviour (result_type=None) depends on the return value of the
applied function: list-like results will be returned as a Series of
those. However if the apply function returns a Series these are
expanded to columns.
The official documentation also includes an example of such usage:
Returning a Series inside the function is similar to passing
result_type='expand'. The resulting column names will be the Series
index.
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
   foo  bar
0    1    2
1    1    2
2    1    2
If I could suggest setting up your dataframes like this (auto-indexing):
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
then you can set up your s1 and s2 values thus (using shape[0] to return the number of rows from df):
s = pd.DataFrame({'s1':[5]*df.shape[0], 's2':[6]*df.shape[0]})
then the result you want is easy:
display (df.merge(s, left_index=True, right_index=True))
Alternatively, just add the new values to your dataframe df:
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
df['s1']=5
df['s2']=6
display(df)
Both return:
     a  b  s1  s2
0  NaN  4   5   6
1  1.0  5   5   6
2  2.0  6   5   6
If you have another list of data (instead of just a single value to apply), and you know it is in the same sequence as df, eg:
s1=['a','b','c']
then you can attach this in the same way:
df['s1']=s1
returns:
     a  b s1
0  NaN  4  a
1  1.0  5  b
2  2.0  6  c
You can easily set a pandas.DataFrame column to a constant. This constant can be an int such as in your example. If the column you specify isn't in the df, then pandas will create a new column with the name you specify. So after your dataframe is constructed (from your question):
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
You can just run:
df['s1'], df['s2'] = 5, 6
You could write a loop or comprehension to make it do this for all the elements in a list of tuples, or keys and values in a dictionary, depending on how you have your real data stored.
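For example, a minimal sketch with a hypothetical dict of constant values:
const_cols = {'s1': 5, 's2': 6}
for name, value in const_cols.items():
    df[name] = value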
If df is a pandas.DataFrame, then df['new_col'] = list_object (a list or Series of length len(df)) will add the list or Series as a column named 'new_col'. df['new_col'] = scalar (such as 5 or 6 in your case) also works and is equivalent to df['new_col'] = [scalar]*len(df).
So a couple of lines of code serve the purpose:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
s = pd.Series({'s1':5, 's2':6})
for x in s.index:
    df[x] = s[x]
Output:
   a  b  s1  s2
0  1  3   5   6
1  2  4   5   6

What's the difference between x.iloc[1]['x'] and x['x'].iloc[1]

I cannot change the value of np.nan to 16 with x.iloc[1]['x']=16, but I can change it with x['x'].iloc[1]=16. Why? And what's the difference between these two expressions?
x = pd.DataFrame({'x': [1, np.nan, 3], 'y': [3, 4, 5]})
x.iloc[1]['x']=16
print(x.iloc[1]['x'])
nan
x['x'].iloc[1]=16
print(x.iloc[1]['x'])
16.0
Avoid chained indexing
As noted in the comments, neither of your alternatives is guaranteed to work. The documentation explains the reasoning and rationale.
The fact that one works and the other doesn't isn't worthy of investigation, as these are implementation details liable to change.
For scalars, you should use iat to set values by integer position or at by label.
iat for scalar setting by integer position
x.iat[1, x.columns.get_loc('x')] = 16
at for scalar setting by label
x.at[x.index[1], 'x'] = 16
Where your dataframe index is a regular pd.RangeIndex, the last assignment can be simplified:
x.at[1, 'x'] = 16
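Putting both together, a minimal runnable sketch (the extra edit to 'y' is hypothetical, just to show at in action):
import numpy as np
import pandas as pd

x = pd.DataFrame({'x': [1, np.nan, 3], 'y': [3, 4, 5]})
x.iat[1, x.columns.get_loc('x')] = 16  # scalar set by integer position
x.at[x.index[1], 'y'] = 40             # scalar set by label
print(x['x'].tolist())  # [1.0, 16.0, 3.0]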
Welcome to Stack Overflow. The answer provided in the comments is clear and sufficient.
iloc is a great tool; I would add that if you want to use it this way, you first have to select the column in which you want to change a row's value. Example with a loop over a dataframe to change a value:
import pandas as pd
d = {'col1': [1, 2,'np.nan',4,5], 'col2': ['A','B','C','D','E']}
df = pd.DataFrame(data=d)
     col1 col2
0       1    A
1       2    B
2  np.nan    C
3       4    D
4       5    E
for i in range(len(df)):
    if df['col1'].iloc[i] == "np.nan":
        df['col1'].iloc[i] = 16
print(df)
  col1 col2
0    1    A
1    2    B
2   16    C
3    4    D
4    5    E
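If you just need the substitution, a vectorised, non-chained alternative (a sketch) avoids the loop entirely:
df.loc[df['col1'] == 'np.nan', 'col1'] = 16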

Replace values in dataframe from another dataframe with Pandas

I have 3 dataframes: df1, df2, df3. I am trying to fill the NaN values of df1 with some values contained in df2. The values selected from df2 are also selected according to the output of a simple function (mul_val) which processes some data stored in df3.
I was able to get such a result, but I would like to find a simpler, easier way and more readable code.
Here is what I have so far:
import pandas as pd
import numpy as np

# simple function
def mul_val(a, b):
    return a * b

# dataframe 1
data = {'Name': ['PINO', 'PALO', 'TNCO', 'TNTO', 'CUCO', 'FIGO', 'ONGF', 'LABO'],
        'Id':   [10, 9, np.nan, 14, 3, np.nan, 7, np.nan]}
df1 = pd.DataFrame(data)

# dataframe 2
infos = {'Info_a': [10, 20, 30, 40, 70, 80, 90, 50, 60, 80, 40, 50, 20, 30, 15, 11],
         'Info_b': [10, 30, 30, 60, 10, 85, 99, 50, 70, 20, 30, 50, 20, 40, 16, 17]}
df2 = pd.DataFrame(infos)

# dataframe 3
dic = {'Name':  {0: 'FIGO', 1: 'TNCO'},
       'index': {0: [5, 6], 1: [11, 12, 13]}}
df3 = pd.DataFrame(dic)
#---------------Modify from here in the most efficient way!-----------------
for idx, row in df3.iterrows():
    store_val = []
    print(row['Name'])
    for j in row['index']:
        store_val.append([mul_val(df2['Info_a'][j], df2['Info_b'][j]), j])
    store_val = np.asarray(store_val)
    # - identify the index of the minimum value in the first column
    indx_min_val = np.argmin(store_val[:, 0])
    # - get the relative row number contained in the second column
    col_value = row['index'][indx_min_val]
    # - identify the value to be replaced in df1
    value_to_be_replaced = df1['Id'][df1['Name'] == row['Name']]
    # - replace that value in df1 for the row with the same row['Name']
    df1['Id'].replace(to_replace=value_to_be_replaced, value=col_value, inplace=True)
By printing store_val at every iteration I get:
FIGO
[[6800    5]
 [8910    6]]
TNCO
[[2500   11]
 [ 400   12]
 [1200   13]]
Let's do a simple example: considering FIGO, I identify 6800 as the minimum between 6800 and 8910. Therefore I select the number 5, which is placed in df1. Repeating such an operation for the remaining rows of df3 (in this case I have only 2 rows but they could be a lot more), the final result should be like this:
In[0]: before                    In[0]: after
Out[0]:                          Out[0]:
     Id  Name                         Id  Name
0  10.0  PINO                    0  10.0  PINO
1   9.0  PALO                    1   9.0  PALO
2   NaN  TNCO   ----->           2  12.0  TNCO
3  14.0  TNTO                    3  14.0  TNTO
4   3.0  CUCO                    4   3.0  CUCO
5   NaN  FIGO   ----->           5   5.0  FIGO
6   7.0  ONGF                    6   7.0  ONGF
7   NaN  LABO                    7   NaN  LABO
Note: you can also remove the for loops if needed and use different formats to store the data (lists, arrays, ...); the important thing is that the final result is still a dataframe.
I can offer two similar options that achieve the same result as your loop in a couple of lines:
1. Using apply and fillna() (fillna is faster than combine_first by a factor of two):
df3['Id'] = df3.apply(lambda row: (df2.Info_a*df2.Info_b).loc[row['index']].argmin(), axis=1)
df1 = df1.set_index('Name').fillna(df3.set_index('Name')).reset_index()
2. Using a function (lambda doesn't support assignment, so you have to apply a func):
def f(row):
    df1.ix[df1.Name == row['Name'], 'Id'] = (df2.Info_a * df2.Info_b).loc[row['index']].argmin()

df3.apply(f, axis=1)
or a slight variant not relying on global definitions:
def f(row, df1, df2):
    df1.ix[df1.Name == row['Name'], 'Id'] = (df2.Info_a * df2.Info_b).loc[row['index']].argmin()

df3.apply(f, args=(df1, df2), axis=1)
Note that your solution, even though much more verbose, will take the least amount of time with this small dataset (7.5 ms versus 9.5 ms for both of mine). It makes sense that the speeds would be similar, since in both cases it's a matter of looping over the rows of df3.

Can I get a trimmed mean of all columns in a dataframe with nan values?

The problem is that I want to get the trimmed mean of all the columns in a pandas dataframe (i.e. the mean of the values in a given column, excluding the max and the min values). It's likely that some columns will have nan values. Basically, I want to get the exact same functionality as the pandas.DataFrame.mean function, except that it's the trimmed mean.
The obvious solution is to use the scipy tmean function, and iterate over the df columns. So I did:
import scipy as sp
trim_mean = []
for i in data_clean3.columns:
trim_mean.append(sp.tmean(data_clean3[i]))
This worked great, until I encountered nan values, which caused tmean to choke. Worse, when I dropped the nan values in the dataframe, there were some datasets that were wiped out completely, as they had a NaN value in every column. This means that when I amalgamate all my datasets into a master set, there'll be holes in the master set where the trimmed mean should be.
Does anyone know of a way around this? As in, is there a way to get tmean to behave like the standard scipy stats functions and ignore nan values?
(Note that my code is calculating a big number of descriptive statistics on large datasets with limited hardware; highly involved or inefficient workarounds might not be optimal. Hopefully, though, I'm just missing something simple.)
(EDIT: Someone suggested in a comment (that has since vanished?) that I should use the trim_mean scipy function, which allows you to top and tail a specific proportion of the data. This is just to say that this solution won't work for me, as my datasets are of unequal sizes, so I cannot specify a fixed proportion of data that will be OK to remove in every case; it must always just be the max and the min values.)
Consider the df
np.random.seed()
data = np.random.choice((0, 25, 35, 100, np.nan),
                        (1000, 2),
                        p=(.01, .39, .39, .01, .2))
df = pd.DataFrame(data, columns=list('AB'))
Construct your mean using sums and divide by relevant normalizer.
(df.sum() - df.min() - df.max()) / (df.notnull().sum() - 2)
A 29.707674
B 30.402228
dtype: float64
df.mean()
A 29.756987
B 30.450617
dtype: float64
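The same arithmetic can be wrapped as a small helper (a sketch of the identical formula, assuming all columns are numeric):
def trimmed_mean(frame):
    # subtract one max and one min per column; NaNs are already ignored by sum/min/max
    return (frame.sum() - frame.min() - frame.max()) / (frame.notnull().sum() - 2)

trimmed_mean(df)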
You could use df.mean(skipna=True); see DataFrame.mean.
df1 = pd.DataFrame([[5, 1, 'a'], [6, 2, 'b'], [7, 3, 'd'], [np.nan, 4, 'e'], [9, 5, 'f'], [5, 1, 'g']],
                   columns=["A", "B", "C"])
print(df1)
df1 = df1[df1.A != df1.A.max()]  # remove max values
df1 = df1[df1.A != df1.A.min()]  # remove min values
print("\nDataframe after removing max and min\n")
print(df1)
print("\nMean of A\n")
print(df1["A"].mean(skipna=True))
Output:
     A  B  C
0  5.0  1  a
1  6.0  2  b
2  7.0  3  d
3  NaN  4  e
4  9.0  5  f
5  5.0  1  g

Dataframe after removing max and min

     A  B  C
1  6.0  2  b
2  7.0  3  d
3  NaN  4  e

Mean of A
6.5
