a = pd.DataFrame({'A':[None,2,None,None,3,4],'B':[1,2,3,4,5,6]})
A B
0 NaN 1
1 2.0 2
2 NaN 3
3 NaN 4
4 3.0 5
5 4.0 6
How do I add a column C that takes the value from column A if it's not NaN, and otherwise column B's value?
A B C
0 NaN 1 1
1 2 2 2
2 NaN 3 3
3 NaN 4 4
4 3 5 3
5 4 6 4
try combine_first() (the trailing astype(int) is only needed because A's NaNs make it a float column, so the combined values come back as float):
In [184]: a['C'] = a['A'].combine_first(a['B']).astype(int)
In [185]: a
Out[185]:
A B C
0 NaN 1 1
1 2.0 2 2
2 NaN 3 3
3 NaN 4 4
4 3.0 5 3
5 4.0 6 4
you could also try fillna() (note that C stays float here, since A is a float column; chain .astype(int) if you want integers):
In [26]: a['C'] = a['A'].fillna(a['B'])
In [27]: a
Out[27]:
A B C
0 NaN 1 1.0
1 2.0 2 2.0
2 NaN 3 3.0
3 NaN 4 4.0
4 3.0 5 3.0
5 4.0 6 4.0
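For completeness, numpy's where handles the same pick-A-otherwise-B pattern; a minimal sketch (not from the answers above, just a common alternative):
import numpy as np
# Take A where it is not NaN, otherwise fall back to B.
a['C'] = np.where(a['A'].notna(), a['A'], a['B'])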
I have a DataFrame that looks like the following:
a b c
0 NaN 8 NaN
1 NaN 7 NaN
2 NaN 5 NaN
3 7.0 3 NaN
4 3.0 5 NaN
5 5.0 4 NaN
6 7.0 1 NaN
7 8.0 9 3.0
8 NaN 5 5.0
9 NaN 6 4.0
What I want to create is a new DataFrame where each value contains the count of all non-NaN values at or before it in the same column (a running count of non-NaN values). The resulting new DataFrame would look like this:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
I have achieved it with the following loop (writing into a copy so the running counts don't overwrite the values still being counted):
out = df.copy()
for i in range(len(df)):
    out.iloc[i] = df.iloc[0:i+1].notna().sum()
However, this row-by-row loop is far too slow. My real DataFrame contains thousands of columns and rows, so iterating like this is impossible due to the low processing speed. What can I do? Maybe it should be something related to using the pandas .apply() function.
There's no need for apply. It can be done much more efficiently using notna + cumsum (notna for the non-NaN mask, cumsum for the running counts):
out = df.notna().cumsum()
Output:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
Check notna combined with cumsum:
out = df.notna().cumsum()
Out[220]:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
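As an aside, if you wanted the count of non-NaN values strictly before each row (excluding the current one), a small variant of the same idea is to shift the running counts down by one row:
# Each row now only counts non-NaN values in earlier rows.
strictly_before = df.notna().cumsum().shift(1, fill_value=0)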
I have a dataframe with three columns. Two of them are group and subgroup, and the third one is a value. I have some NaN values in the value column. I need to fill them with median values, according to group and subgroup.
I made a pivot table with a double index and the median of the target column. But I don't understand how to get these values and put them into the original dataframe.
import pandas as pd
df=pd.DataFrame(data=[
[1,1,'A',1],
[2,1,'A',3],
[3,3,'B',8],
[4,2,'C',1],
[5,3,'A',3],
[6,2,'C',6],
[7,1,'B',2],
[8,1,'C',3],
[9,2,'A',7],
[10,3,'C',4],
[11,2,'B',6],
[12,1,'A'],
[13,1,'C'],
[14,2,'B'],
[15,3,'A']],columns=['id','group','subgroup','value'])
print(df)
id group subgroup value
0 1 1 A 1.0
1 2 1 A 3.0
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
11 12 1 A NaN
12 13 1 C NaN
13 14 2 B NaN
14 15 3 A NaN
df_struct=df.pivot_table(index=['group','subgroup'],values='value',aggfunc='median')
print(df_struct)
value
group subgroup
1 A 2.0
B 2.0
C 3.0
2 A 7.0
B 6.0
C 3.5
3 A 3.0
B 8.0
C 4.0
I will be thankful for any help.
Use pandas.DataFrame.groupby.transform, then fillna. For example, starting from a variant of the data where one of the grouped values is NaN:
id group subgroup value
0 1 1 A 1.0
1 2 1 A NaN # <- value with NaN to fill
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
df['value'] = df['value'].fillna(df.groupby(['group', 'subgroup'])['value'].transform('median'))
print(df)
Output:
id group subgroup value
0 1 1 A 1.0
1 2 1 A 1.0
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
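If you would rather reuse the pivot table you already built (df_struct from the question), one way is to merge the medians back on the two key columns and then fillna; a minimal sketch, where value_median is just an illustrative helper name:
# Flatten the pivot table so group/subgroup become ordinary columns.
medians = df_struct.rename(columns={'value': 'value_median'}).reset_index()
# Attach each row's group/subgroup median, fill the gaps, drop the helper.
df = df.merge(medians, on=['group', 'subgroup'], how='left')
df['value'] = df['value'].fillna(df.pop('value_median'))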
I have a dataframe similar to the one below:
id A B C D E
1 2 3 4 5 5
1 NaN 4 NaN 6 7
2 3 4 5 6 6
2 NaN NaN 5 4 1
I want to impute the null values in columns A, B, C by forward filling, but per group. That is, I want the forward fill applied within each id. How can I do that?
Use GroupBy.ffill to forward fill per group across those columns. If the first values of a group are NaN there is nothing to fill them from, so you can chain fillna and finally cast to integers:
print (df)
id A B C D E
0 1 2.0 3.0 4.0 5 NaN
1 1 NaN 4.0 NaN 6 NaN
2 2 3.0 4.0 5.0 6 6.0
3 2 NaN NaN 5.0 4 1.0
cols = ['A','B','C']
df[cols] = df.groupby('id')[cols].ffill().fillna(0).astype(int)
print (df)
id A B C D E
0 1 2 3 4 5 NaN
1 1 2 4 4 6 NaN
2 2 3 4 5 6 6.0
3 2 3 4 5 4 1.0
Detail:
print (df.groupby('id')[cols].ffill().fillna(0).astype(int))
A B C
0 2 3 4
1 2 4 4
2 3 4 5
3 3 4 5
Or, using update (which modifies df in place and keeps the original float dtypes):
cols = ['A','B','C']
df.update(df.groupby('id')[cols].ffill().fillna(0))
print (df)
id A B C D E
0 1 2.0 3.0 4.0 5 NaN
1 1 2.0 4.0 4.0 6 NaN
2 2 3.0 4.0 5.0 6 6.0
3 2 3.0 4.0 5.0 4 1.0
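If padding the leading NaNs with 0 is not what you want, another option (an assumption about intent, not part of the answer above) is to back fill within each group after the forward fill:
cols = ['A','B','C']
df[cols] = df.groupby('id')[cols].ffill()  # forward fill within each id
df[cols] = df.groupby('id')[cols].bfill()  # back fill remaining leading NaNs, still per id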
First, I did
a = [[6,5,4,3,2],[1,2,3,4,5,6],[3,4,5,6]]
b = pd.DataFrame(a)
print(b)
The output is
0 1 2 3 4 5
0 6 5 4 3 2.0 NaN
1 1 2 3 4 5.0 6.0
2 3 4 5 6 NaN NaN
So I did
a = [[6,5,4,3,2],[1,2,3,4,5,6],[3,4,5,6]]
b = pd.DataFrame(a).fillna(-1).astype(int)
print(b)
The output becomes
0 1 2 3 4 5
0 6 5 4 3 2 -1
1 1 2 3 4 5 6
2 3 4 5 6 -1 -1
But I don't want those -1, so I did
import numpy as np

a = [[6,5,4,3,2],[1,2,3,4,5,6],[3,4,5,6]]
b = pd.DataFrame(a).fillna(-1).astype(int)
b = b.replace(-1, np.nan)
print(b)
The output is again the same as the first time:
0 1 2 3 4 5
0 6 5 4 3 2.0 NaN
1 1 2 3 4 5.0 6.0
2 3 4 5 6 NaN NaN
Because of this:
type(np.nan)
# float
If a column contains NaN, the whole column is automatically upcast to float for efficient computation.
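A quick way to see the upcast happen (a minimal demonstration, not part of the question):
import numpy as np
import pandas as pd

print(pd.Series([1, 2, 3]).dtype)       # int64
print(pd.Series([1, 2, np.nan]).dtype)  # float64 -- a single NaN upcasts the column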
pandas 0.24+
We can use the nullable integer dtype, which allows integers to coexist with NaNs:
b = b.astype('Int32')
b
0 1 2 3 4 5
0 6 5 4 3 2 NaN
1 1 2 3 4 5 6
2 3 4 5 6 NaN NaN
b.dtypes
0 Int32
1 Int32
2 Int32
3 Int32
4 Int32
5 Int32
dtype: object
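The nullable dtype can also be requested up front when building the data; a small sketch:
# None becomes pd.NA instead of forcing a float upcast.
s = pd.Series([1, None, 3], dtype='Int64')
print(s.dtype)  # Int64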
<= 0.23
To get around that, convert the dtype to object, which I don't recommend unless it's only for display purposes (you kill efficiency this way).
u = b.select_dtypes(include='float')
b[u.columns] = u.astype(object)
b
0 1 2 3 4 5
0 6 5 4 3 2 NaN
1 1 2 3 4 5 6
2 3 4 5 6 NaN NaN
print(b.dtypes)
0 int64
1 int64
2 int64
3 int64
4 object
5 object
dtype: object
Consider the dataframes
A:
g N a
1 3 5
2 4 6
and B:
g N a e
3 3 4 7
4 9 1 8
Is there some way to merge these such that the resultant dataframe is:
g N a e
1 3 5 NaN
2 4 6 NaN
3 3 4 7
4 9 1 8
In other words, is there some way to preserve the column order rather than re-sort lexicographically?
Use reindex:
pd.concat([A,B]).reindex(columns=B.columns)
Output:
g N a e
0 1 3 5 NaN
1 2 4 6 NaN
0 3 3 4 7.0
1 4 9 1 8.0
When merging, specify sort=False so the join keys keep their order of appearance (an outer merge may otherwise sort them lexicographically). The g N a e column order is preserved because merge lists the left frame's columns first, followed by the columns that appear only in the right frame.
In [1251]: A.merge(B, how='outer', sort=False)
Out[1251]:
g N a e
0 1 3 5 NaN
1 2 4 6 NaN
2 3 3 4 7.0
3 4 9 1 8.0
The following should do the trick: pd.concat([a, b])[b.columns]
Full test code:
import pandas as pd
from io import StringIO
a = pd.read_csv(StringIO("""
g N a
1 3 5
2 4 6
"""), sep=r"\s*")
b = pd.read_csv(StringIO("""
g N a e
3 3 4 7
4 9 1 8
"""), sep=r"\s*")
pd.concat([a, b])[b.columns]
This produces:
g N a e
0 1 3 5 NaN
1 2 4 6 NaN
0 3 3 4 7.0
1 4 9 1 8.0
You might also want to reset the index:
pd.concat([a, b])[b.columns].reset_index(drop=True)
... in order to remove index duplicates. This gives:
g N a e
0 1 3 5 NaN
1 2 4 6 NaN
2 3 3 4 7.0
3 4 9 1 8.0
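Equivalently, concat can produce the fresh 0..n-1 index itself via ignore_index=True (a minor variant of the same approach):
pd.concat([a, b], ignore_index=True)[b.columns]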