a = pd.DataFrame({'A':[None,2,None,None,3,4],'B':[1,2,3,4,5,6]})
A B
0 NaN 1
1 2.0 2
2 NaN 3
3 NaN 4
4 3.0 5
5 4.0 6
How do I add a column C that takes the value from column A if it's not NaN, and otherwise column B's value?
A B C
0 NaN 1 1
1 2 2 2
2 NaN 3 3
3 NaN 4 4
4 3 5 3
5 4 6 4
try combine_first() (the trailing astype(int) is only needed because A's NaNs make it a float column, so the combined values come back as float):
In [184]: a['C'] = a['A'].combine_first(a['B']).astype(int)
In [185]: a
Out[185]:
A B C
0 NaN 1 1
1 2.0 2 2
2 NaN 3 3
3 NaN 4 4
4 3.0 5 3
5 4.0 6 4
you could also try fillna() (note that C stays float here, since A is a float column; chain .astype(int) if you want integers):
In [26]: a['C'] = a['A'].fillna(a['B'])
In [27]: a
Out[27]:
A B C
0 NaN 1 1.0
1 2.0 2 2.0
2 NaN 3 3.0
3 NaN 4 4.0
4 3.0 5 3.0
5 4.0 6 4.0
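For completeness, numpy's where handles the same pick-A-otherwise-B pattern; a minimal sketch (not from the answers above, just a common alternative):
import numpy as np
# Take A where it is not NaN, otherwise fall back to B.
a['C'] = np.where(a['A'].notna(), a['A'], a['B'])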
I have a DataFrame that looks like the following:
a b c
0 NaN 8 NaN
1 NaN 7 NaN
2 NaN 5 NaN
3 7.0 3 NaN
4 3.0 5 NaN
5 5.0 4 NaN
6 7.0 1 NaN
7 8.0 9 3.0
8 NaN 5 5.0
9 NaN 6 4.0
What I want to create is a new DataFrame where each value contains the count of all non-NaN values at or before it in the same column (a running count of non-NaN values). The resulting new DataFrame would look like this:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
I have achieved it with the following loop (writing into a copy so the running counts don't overwrite the values still being counted):
out = df.copy()
for i in range(len(df)):
    out.iloc[i] = df.iloc[0:i+1].notna().sum()
However, this row-by-row loop is far too slow. My real DataFrame contains thousands of columns and rows, so iterating like this is impossible due to the low processing speed. What can I do? Maybe it should be something related to using the pandas .apply() function.
There's no need for apply. It can be done much more efficiently using notna + cumsum (notna for the non-NaN mask, cumsum for the running counts):
out = df.notna().cumsum()
Output:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
Check notna combined with cumsum:
out = df.notna().cumsum()
Out[220]:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
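As an aside, if you wanted the count of non-NaN values strictly before each row (excluding the current one), a small variant of the same idea is to shift the running counts down by one row:
# Each row now only counts non-NaN values in earlier rows.
strictly_before = df.notna().cumsum().shift(1, fill_value=0)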
I have a dataframe with three columns. Two of them are group and subgroup, and the third one is a value. I have some NaN values in the value column. I need to fill them with median values, according to group and subgroup.
I made a pivot table with a double index and the median of the target column. But I don't understand how to get these values and put them into the original dataframe.
import pandas as pd
df=pd.DataFrame(data=[
[1,1,'A',1],
[2,1,'A',3],
[3,3,'B',8],
[4,2,'C',1],
[5,3,'A',3],
[6,2,'C',6],
[7,1,'B',2],
[8,1,'C',3],
[9,2,'A',7],
[10,3,'C',4],
[11,2,'B',6],
[12,1,'A'],
[13,1,'C'],
[14,2,'B'],
[15,3,'A']],columns=['id','group','subgroup','value'])
print(df)
id group subgroup value
0 1 1 A 1.0
1 2 1 A 3.0
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
11 12 1 A NaN
12 13 1 C NaN
13 14 2 B NaN
14 15 3 A NaN
df_struct=df.pivot_table(index=['group','subgroup'],values='value',aggfunc='median')
print(df_struct)
value
group subgroup
1 A 2.0
B 2.0
C 3.0
2 A 7.0
B 6.0
C 3.5
3 A 3.0
B 8.0
C 4.0
I will be thankful for any help.
Use pandas.DataFrame.groupby.transform, then fillna. For example, starting from a variant of the data where one of the grouped values is NaN:
id group subgroup value
0 1 1 A 1.0
1 2 1 A NaN # <- value with NaN to fill
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
df['value'] = df['value'].fillna(df.groupby(['group', 'subgroup'])['value'].transform('median'))
print(df)
Output:
id group subgroup value
0 1 1 A 1.0
1 2 1 A 1.0
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
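If you would rather reuse the pivot table you already built (df_struct from the question), one way is to merge the medians back on the two key columns and then fillna; a minimal sketch, where value_median is just an illustrative helper name:
# Flatten the pivot table so group/subgroup become ordinary columns.
medians = df_struct.rename(columns={'value': 'value_median'}).reset_index()
# Attach each row's group/subgroup median, fill the gaps, drop the helper.
df = df.merge(medians, on=['group', 'subgroup'], how='left')
df['value'] = df['value'].fillna(df.pop('value_median'))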
I have a dataframe similar to the one below:
id A B C D E
1 2 3 4 5 5
1 NaN 4 NaN 6 7
2 3 4 5 6 6
2 NaN NaN 5 4 1
I want to impute the null values in columns A, B, C by forward filling, but per group. That is, I want the forward fill applied within each id. How can I do that?
Use GroupBy.ffill to forward fill per group across those columns. If the first values of a group are NaN there is nothing to fill them from, so you can chain fillna and finally cast to integers:
print (df)
id A B C D E
0 1 2.0 3.0 4.0 5 NaN
1 1 NaN 4.0 NaN 6 NaN
2 2 3.0 4.0 5.0 6 6.0
3 2 NaN NaN 5.0 4 1.0
cols = ['A','B','C']
df[cols] = df.groupby('id')[cols].ffill().fillna(0).astype(int)
print (df)
id A B C D E
0 1 2 3 4 5 NaN
1 1 2 4 4 6 NaN
2 2 3 4 5 6 6.0
3 2 3 4 5 4 1.0
Detail:
print (df.groupby('id')[cols].ffill().fillna(0).astype(int))
A B C
0 2 3 4
1 2 4 4
2 3 4 5
3 3 4 5
Or, using update (which modifies df in place and keeps the original float dtypes):
cols = ['A','B','C']
df.update(df.groupby('id')[cols].ffill().fillna(0))
print (df)
id A B C D E
0 1 2.0 3.0 4.0 5 NaN
1 1 2.0 4.0 4.0 6 NaN
2 2 3.0 4.0 5.0 6 6.0
3 2 3.0 4.0 5.0 4 1.0
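If padding the leading NaNs with 0 is not what you want, another option (an assumption about intent, not part of the answer above) is to back fill within each group after the forward fill:
cols = ['A','B','C']
df[cols] = df.groupby('id')[cols].ffill()  # forward fill within each id
df[cols] = df.groupby('id')[cols].bfill()  # back fill remaining leading NaNs, still per id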
First, I did
a = [[6,5,4,3,2],[1,2,3,4,5,6],[3,4,5,6]]
b = pd.DataFrame(a)
print(b)
The output is
0 1 2 3 4 5
0 6 5 4 3 2.0 NaN
1 1 2 3 4 5.0 6.0
2 3 4 5 6 NaN NaN
So I did
a = [[6,5,4,3,2],[1,2,3,4,5,6],[3,4,5,6]]
b = pd.DataFrame(a).fillna(-1).astype(int)
print(b)
The output becomes
0 1 2 3 4 5
0 6 5 4 3 2 -1
1 1 2 3 4 5 6
2 3 4 5 6 -1 -1
But I don't want those -1, so I did
import numpy as np

a = [[6,5,4,3,2],[1,2,3,4,5,6],[3,4,5,6]]
b = pd.DataFrame(a).fillna(-1).astype(int)
b = b.replace(-1, np.nan)
print(b)
The output is again the same as the first time:
0 1 2 3 4 5
0 6 5 4 3 2.0 NaN
1 1 2 3 4 5.0 6.0
2 3 4 5 6 NaN NaN
Because of this:
type(np.nan)
# float
If a column contains NaN, the whole column is automatically upcast to float for efficient computation.
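A quick way to see the upcast happen (a minimal demonstration, not part of the question):
import numpy as np
import pandas as pd

print(pd.Series([1, 2, 3]).dtype)       # int64
print(pd.Series([1, 2, np.nan]).dtype)  # float64 -- a single NaN upcasts the column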
pandas 0.24+
We can use the nullable integer dtype, which allows integers to coexist with NaNs:
b = b.astype('Int32')
b
0 1 2 3 4 5
0 6 5 4 3 2 NaN
1 1 2 3 4 5 6
2 3 4 5 6 NaN NaN
b.dtypes
0 Int32
1 Int32
2 Int32
3 Int32
4 Int32
5 Int32
dtype: object
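The nullable dtype can also be requested up front when building the data; a small sketch:
# None becomes pd.NA instead of forcing a float upcast.
s = pd.Series([1, None, 3], dtype='Int64')
print(s.dtype)  # Int64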
<= 0.23
To get around that, convert the dtype to object, which I don't recommend unless it's only for display purposes (you kill efficiency this way).
u = b.select_dtypes(include='float')
b[u.columns] = u.astype(object)
b
0 1 2 3 4 5
0 6 5 4 3 2 NaN
1 1 2 3 4 5 6
2 3 4 5 6 NaN NaN
print(b.dtypes)
0 int64
1 int64
2 int64
3 int64
4 object
5 object
dtype: object
Consider the dataframes
A:
g N a
1 3 5
2 4 6
and B:
g N a e
3 3 4 7
4 9 1 8
Is there some way to merge these such that the resultant dataframe is:
g N a e
1 3 5 NaN
2 4 6 NaN
3 3 4 7
4 9 1 8
In other words, is there some way to preserve the column order rather than re-sort lexicographically?
Use reindex:
pd.concat([A,B]).reindex(columns=B.columns)
Output:
g N a e
0 1 3 5 NaN
1 2 4 6 NaN
0 3 3 4 7.0
1 4 9 1 8.0
When merging, specify sort=False so the join keys keep their order of appearance (an outer merge may otherwise sort them lexicographically). The g N a e column order is preserved because merge lists the left frame's columns first, followed by the columns that appear only in the right frame.
In [1251]: A.merge(B, how='outer', sort=False)
Out[1251]:
g N a e
0 1 3 5 NaN
1 2 4 6 NaN
2 3 3 4 7.0
3 4 9 1 8.0
The following should do the trick: pd.concat([a, b])[b.columns]
Full test code:
import pandas as pd
from io import StringIO
a = pd.read_csv(StringIO("""
g N a
1 3 5
2 4 6
"""), sep=r"\s*")
b = pd.read_csv(StringIO("""
g N a e
3 3 4 7
4 9 1 8
"""), sep=r"\s*")
pd.concat([a, b])[b.columns]
This produces:
g N a e
0 1 3 5 NaN
1 2 4 6 NaN
0 3 3 4 7.0
1 4 9 1 8.0
You might also want to reset the index:
pd.concat([a, b])[b.columns].reset_index(drop=True)
... in order to remove index duplicates. This gives:
g N a e
0 1 3 5 NaN
1 2 4 6 NaN
2 3 3 4 7.0
3 4 9 1 8.0
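Equivalently, concat can produce the fresh 0..n-1 index itself via ignore_index=True (a minor variant of the same approach):
pd.concat([a, b], ignore_index=True)[b.columns]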