Efficiently apply several different operations to a dataset - python

I have to apply several different operations to many columns of a DataFrame. I did it, but not in a very efficient way...
As an example, I have this table:
| A | B | C | D | E |
|------|------|------|------|------|
| 1.0 | 1.0 | 1.0 | 2.0 | a |
| 2.0 | 1.0 | 1.5 | 5.0 | a |
| 3.0 | 1.0 | 2.0 | 3.0 | b |
| 1.0 | 2.0 | 2.0 | 6.0 | a |
| 2.0 | 2.0 | 3.0 | 4.0 | b |
| 3.0 | 2.0 | 4.0 | 2.0 | b |
| 1.0 | 3.0 | 5.0 | 5.0 | b |
| 2.0 | 3.0 | 6.0 | 1.0 | a |
| 3.0 | 3.0 | 10.0 | 2.0 | a |
And I need to get the following result:
# I don't need the A column; the grouping key is the B column. Apply the mean
# to C, the sum to D, and the most frequent value to E
| B | C | D | E |
|------|------|------|------|
| 1.0 | 1.5 | 10.0 | a |
| 2.0 | 3.0 | 12.0 | b |
| 3.0 | 7.0 | 8.0 | a |
Here is my attempt, but it is extremely slow. My original dataset has 2,000,000 rows. Reducing it to 130,000 rows takes more than 30 minutes, and I have to do this three times... this is why I need something more efficient.
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
                   "B": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0],
                   "C": [1.0, 1.5, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0],
                   "D": [2.0, 5.0, 3.0, 6.0, 4.0, 2.0, 5.0, 1.0, 2.0],
                   "E": ['a', 'a', 'b', 'a', 'b', 'b', 'b', 'a', 'a']})
print(df)

dict_ds = {'B': [], 'C': [], 'D': [], 'E': []}
df2 = pd.DataFrame(dict_ds)
df = df.groupby('B')
for n in df.first().index:
    data = df.get_group(n)
    partial = data.mean()
    new_C = partial['C']
    partial = data.sum()
    new_D = partial['D']
    new_E = data['E'].mode()[0]
    df2.loc[len(df2)] = (n, new_C, new_D, new_E)
print(df2)
This second part came up after getting the solution below.
If I add the 'unique' operation to the agg:
df.groupby('B').agg({
    'A': 'unique',
    'C': 'mean',
    'D': 'sum',
    'E': lambda x: x.mode()
}).reset_index()
I get the following result:
B A C D E
0 1.0 [1.0, 2.0, 3.0] 1.5 10.0 a
1 2.0 [1.0, 2.0, 3.0] 3.0 12.0 b
2 3.0 [1.0, 2.0, 3.0] 7.0 8.0 a
But I need to have it in this other way:
A B C D E
0 1.0 1.0 1.5 10.0 a
1 2.0 1.0 1.5 10.0 a
2 3.0 1.0 1.5 10.0 a
3 1.0 2.0 3.0 12.0 b
4 2.0 2.0 3.0 12.0 b
5 3.0 2.0 3.0 12.0 b
6 1.0 3.0 7.0 8.0 a
7 2.0 3.0 7.0 8.0 a
8 3.0 3.0 7.0 8.0 a
Is it possible to get something like this in a very efficient way?

new_df = df.groupby('B').agg({
    'C': 'mean',
    'D': 'sum',
    'E': lambda x: x.mode()
})
>>> new_df
B C D E
1.0 1.5 10.0 a
2.0 3.0 12.0 b
3.0 7.0 8.0 a
EDIT: For your 2nd question...
I can't guarantee that this will be efficient but it gets what you want done:
df_1 = new_df['A'].apply(pd.Series).unstack().reset_index(level = 0, drop = True)
df_1.name = 'A'
df_2 = new_df[[col for col in df.columns if col != 'A']]
df_2.name = 'others'
pd.merge(df_1, df_2, left_index = True, right_index = True).reset_index(drop = True)
>>> output
A B C D E
0 1.0 1.0 1.5 10.0 a
0 2.0 1.0 1.5 10.0 a
0 3.0 1.0 1.5 10.0 a
1 1.0 2.0 3.0 12.0 b
1 2.0 2.0 3.0 12.0 b
1 3.0 2.0 3.0 12.0 b
2 1.0 3.0 7.0 8.0 a
2 2.0 3.0 7.0 8.0 a
2 3.0 3.0 7.0 8.0 a

Related

Fillna with mode column by column

I have a DataFrame like this (x: users, y: ratings), which shows that user 1 rated movie 1 with 4.0, did not rate movie 2, rated movie 3 with 1.0, and so on:
rating
movieId 1 2 3 4 5 .....
userID
1 4.0 NaN 1.0 4.1 NaN
2 NaN 2 5.1 NaN NaN
3 3.0 2.0 NaN NaN NaN
4 5.0 NaN 2.8 NaN NaN
How can I fill the NaN values with the mode per movie?
For example, movieId 1 has ratings 4.0, NaN, 3.0, 5.0, ..., so its NaNs should be filled with 4.0 (the mode). I tried to use fillna:
rating.apply(lambda x: x.fillna(x.mode().item()))
Try
rating.apply(lambda x: x.fillna(x.mode()), axis=0)
specify axis=0
Alternatively,
import numpy as np
import pandas as pd

def fillna_mode(df, cols_to_fill):
    for col in cols_to_fill:
        df[col].fillna(df[col].mode()[0], inplace=True)

sample = {1: [4.0, np.nan, 1.0, 4.1, np.nan],
          2: [np.nan, 2, 5.1, np.nan, np.nan]}
rating = pd.DataFrame(sample)
print(rating)
1 2
0 4.0 NaN
1 NaN 2.0
2 1.0 5.1
3 4.1 NaN
4 NaN NaN
fillna_mode(rating, [1, 2])
Output
1 2
0 4.0 2.0
1 1.0 2.0
2 1.0 5.1
3 4.1 2.0
4 1.0 2.0
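For reference, the same per-column fill can also be written as a one-liner (a sketch, equivalent to the loop above): DataFrame.mode() returns the modes per column, .iloc[0] takes the first mode of each column, and fillna then broadcasts it column-wise.

rating = rating.fillna(rating.mode().iloc[0])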

Construct new columns based on all previous pairs information using Pandas

I am trying to create 3 new columns in a dataframe, based on information from previous pairs.
You can think of the dataframe as the results of competitions ('xx' column) between different types ('type' column) at different dates ('date' column).
The idea is to create the following new columns:
(i) numb_comp_past: the number of times each type faced its competitors in the past.
(ii) win_comp_past: the sum of the wins (+1), ties (0), and losses (-1) of the previous competitions that the types competing with each other had in the past.
(iii) win_comp_past_difs: the sum of the score differences of the previous competitions that the types competing with each other had in the past.
The original dataframe (df) is the following:
idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18','Mar-18', 'Mar-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18','Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18','Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'D', 'E', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'xx': 1}, {'xx': 5}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3},{'xx': 1}, {'xx': 6}, {'xx': 3}, {'xx': 5}, {'xx': 2}, {'xx': 3},{'xx': 1}, {'xx': 9}, {'xx': 3}, {'xx': 2}, {'xx': 7}, {'xx': 3}, {'xx': 6}, {'xx': 8}, {'xx': 2}, {'xx': 7}, {'xx': 9}]
df = pd.DataFrame(data, index=idx, columns=['xx'])
df.index.names=['date','type']
df=df.reset_index()
df['date'] = pd.to_datetime(df['date'],format = '%b-%y')
df=df.set_index(['date','type'])
df['xx'] = df.xx.astype('float')
And it looks like this:
xx
date type
2018-01-01 A 1.0
B 5.0
2018-02-01 B 3.0
2018-03-01 A 2.0
B 7.0
C 3.0
D 1.0
E 6.0
2018-05-01 B 3.0
2018-06-01 A 5.0
B 2.0
C 3.0
2018-07-01 A 1.0
2018-08-01 B 9.0
C 3.0
2018-09-01 A 2.0
B 7.0
2018-10-01 C 3.0
A 6.0
B 8.0
2018-11-01 A 2.0
2018-12-01 B 7.0
C 9.0
The 3 new columns I need to add to the dataframe are shown below (expected output of the Pandas code):
xx numb_comp_past win_comp_past win_comp_past_difs
date type
2018-01-01 A 1.0 0.0 0.0 0.0
B 5.0 0.0 0.0 0.0
2018-02-01 B 3.0 0.0 0.0 0.0
2018-03-01 A 2.0 1.0 -1.0 -4.0
B 7.0 1.0 1.0 4.0
C 3.0 0.0 0.0 0.0
D 1.0 0.0 0.0 0.0
E 6.0 0.0 0.0 0.0
2018-05-01 B 3.0 0.0 0.0 0.0
2018-06-01 A 5.0 3.0 -3.0 -10.0
B 2.0 3.0 3.0 13.0
C 3.0 2.0 0.0 -3.0
2018-07-01 A 1.0 0.0 0.0 0.0
2018-08-01 B 9.0 2.0 0.0 3.0
C 3.0 2.0 0.0 -3.0
2018-09-01 A 2.0 3.0 -1.0 -6.0
B 7.0 3.0 1.0 6.0
2018-10-01 C 3.0 5.0 -1.0 -10.0
A 6.0 6.0 -2.0 -10.0
B 8.0 7.0 3.0 20.0
2018-11-01 A 2.0 0.0 0.0 0.0
2018-12-01 B 7.0 4.0 2.0 14.0
C 9.0 4.0 -2.0 -14.0
Note that:
(i) for numb_comp_past, if there are no previous competitions I assign a value of 0. On 2018-06-01, for example, type A has a value of 3 given that it previously competed with type B on 2018-01-01 and 2018-03-01 and with type C on 2018-03-01.
(ii) for win_comp_past, if there are no previous competitions I assign a value of 0. On 2018-06-01, for example, type A has a value of -3 given that it previously lost to type B on 2018-01-01 (-1) and 2018-03-01 (-1) and to type C on 2018-03-01 (-1). Thus adding -1-1-1 = -3.
(iii) for win_comp_past_difs, if there are no previous competitions I assign a value of 0. On 2018-06-01, for example, type A has a value of -10 given that it previously lost to type B on 2018-01-01 by a difference of -4 (= 1-5) and on 2018-03-01 by a difference of -5 (= 2-7), and to type C on 2018-03-01 by -1 (= 2-3). Thus adding -4-5-1 = -10.
I really don't know how to start solving this problem. Any ideas of how to compute the new columns described in (i), (ii) and (iii) are very welcome.
Here's my take:
# get differences of pairs, useful for win counts and win_difs
def get_diff(x):
    teams = x.index.get_level_values(1)
    tmp = pd.DataFrame(x[:, None] - x[None, :],
                       columns=teams.values,
                       index=teams.values).stack()
    return tmp[tmp.index.get_level_values(0) != tmp.index.get_level_values(1)]

new_df = df.groupby('date').xx.apply(get_diff).to_frame()

# win matches
new_df['win'] = new_df.xx.ge(0).astype(int) - new_df.xx.le(0).astype(int)

# group by players
groups = new_df.groupby(level=[1, 2])

# sum function
def cumsum_shift(x):
    return x.cumsum().shift()

# assign new values
df['num_comp_past'] = groups.xx.cumcount().sum(level=[0, 1])
df['win_comp_past'] = groups.win.apply(cumsum_shift).sum(level=[0, 1])
df['win_comp_past_difs'] = groups.xx.apply(cumsum_shift).sum(level=[0, 1])
Output:
+------------+------+-----+---------------+---------------+--------------------+
| | | xx | num_comp_past | win_comp_past | win_comp_past_difs |
+------------+------+-----+---------------+---------------+--------------------+
| date | type | | | | |
+------------+------+-----+---------------+---------------+--------------------+
| 2018-01-01 | A | 1.0 | 0.0 | 0.0 | 0.0 |
| | B | 5.0 | 0.0 | 0.0 | 0.0 |
| 2018-02-01 | B | 3.0 | NaN | NaN | NaN |
| 2018-03-01 | A | 2.0 | 1.0 | -1.0 | -4.0 |
| | B | 7.0 | 1.0 | 1.0 | 4.0 |
| | C | 3.0 | 0.0 | 0.0 | 0.0 |
| | D | 1.0 | 0.0 | 0.0 | 0.0 |
| | E | 6.0 | 0.0 | 0.0 | 0.0 |
| 2018-05-01 | B | 3.0 | NaN | NaN | NaN |
| 2018-06-01 | A | 5.0 | 3.0 | -3.0 | -10.0 |
| | B | 2.0 | 3.0 | 3.0 | 13.0 |
| | C | 3.0 | 2.0 | 0.0 | -3.0 |
| 2018-07-01 | A | 1.0 | NaN | NaN | NaN |
| 2018-08-01 | B | 9.0 | 2.0 | 0.0 | 3.0 |
| | C | 3.0 | 2.0 | 0.0 | -3.0 |
| 2018-09-01 | A | 2.0 | 3.0 | -1.0 | -6.0 |
| | B | 7.0 | 3.0 | 1.0 | 6.0 |
| 2018-10-01 | C | 3.0 | 5.0 | -1.0 | -10.0 |
| | A | 6.0 | 6.0 | -2.0 | -10.0 |
| | B | 8.0 | 7.0 | 3.0 | 20.0 |
| 2018-11-01 | A | 2.0 | NaN | NaN | NaN |
| 2018-12-01 | B | 7.0 | 4.0 | 2.0 | 14.0 |
| | C | 9.0 | 4.0 | -2.0 | -14.0 |
+------------+------+-----+---------------+---------------+--------------------+

Generate New DataFrame without NaN Values

I've the following Dataframe:
a b c d e
0 NaN 2.0 NaN 4.0 5.0
1 NaN 2.0 3.0 NaN 5.0
2 1.0 NaN 3.0 4.0 NaN
3 1.0 2.0 NaN 4.0 NaN
4 NaN 2.0 NaN 4.0 5.0
What I am trying to do is generate a new DataFrame without the NaN values.
Every row always contains the same number of NaN values.
The final Dataframe should look like this:
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
Does someone know an easy way to do this?
Any help is appreciated.
Using array indexing:
pd.DataFrame(df.values[df.notnull().values].reshape(df.shape[0],3),
columns=list('xyz'),dtype=int)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
If the dataframe has an inconsistent number of values across rows, e.g. the 1st row has 4 values while the remaining rows have 3, then this will do:
a b c d e g
0 NaN 2.0 NaN 4.0 5.0 6.0
1 NaN 2.0 3.0 NaN 5.0 NaN
2 1.0 NaN 3.0 4.0 NaN NaN
3 1.0 2.0 NaN 4.0 NaN NaN
4 NaN 2.0 NaN 4.0 5.0 NaN
pd.DataFrame(df.apply(lambda x: x.values[x.notnull()],axis=1).tolist())
0 1 2 3
0 2.0 4.0 5.0 6.0
1 2.0 3.0 5.0 NaN
2 1.0 3.0 4.0 NaN
3 1.0 2.0 4.0 NaN
4 2.0 4.0 5.0 NaN
Here we cannot remove the NaNs in the last column.
Use the justify function and select the first 3 columns:
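(justify is not a pandas built-in; a minimal sketch of such a helper, following the common NumPy-based approach of pushing valid values to one side of a 2D array, could look like this.)

import numpy as np

def justify(a, invalid_val=0, axis=1, side='left'):
    # mark valid entries
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    # sort the boolean mask so valid entries are grouped, then flip for left/up
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    # place the valid values into the justified positions
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out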
df = pd.DataFrame(justify(df.values,invalid_val=np.nan)[:, :3].astype(int),
columns=list('xyz'),
index=df.index)
print (df)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
If, as in your example, values increase across columns, you can sort over axis=1:
res = pd.DataFrame(np.sort(df.values, 1)[:, :3],
columns=list('xyz'), dtype=int)
print(res)
x y z
0 2 4 5
1 2 3 5
2 1 3 4
3 1 2 4
4 2 4 5
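If the values do not increase across columns, a similar one-liner (a sketch, again assuming exactly three non-null values per row as in the question) uses stack(), which drops the NaNs while preserving row order:

res = pd.DataFrame(df.stack().values.reshape(len(df), 3),
                   columns=list('xyz'), dtype=int)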
You can use pandas' DataFrame method df.fillna().
This method replaces NaN or NA values with the parameter you pass in:
df.fillna(value_to_replace_NaN)
import numpy as np
import pandas as pd
data = {
    'A': [np.nan, 2.0, np.nan, 4.0, 5.0],
    'B': [np.nan, 2.0, 3.0, np.nan, 5.0],
    'C': [1.0, np.nan, 3.0, 4.0, np.nan],
    'D': [1.0, 2.0, np.nan, 4.0, np.nan],
    'E': [np.nan, 2.0, np.nan, 4.0, 5.0]
}
df = pd.DataFrame(data)
print(df)
A B C D E
0 NaN NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 NaN 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0
df = df.fillna(0) # Applying the method with parameter 0
print(df)
A B C D E
0 0.0 0.0 1.0 1.0 0.0
1 2.0 2.0 0.0 2.0 2.0
2 0.0 3.0 3.0 0.0 0.0
3 4.0 0.0 4.0 4.0 4.0
4 5.0 5.0 0.0 0.0 5.0
If you want to apply this method to a particular column, the syntax looks like this:
df[column_name] = df[column_name].fillna(value)
df['A'] = df['A'].fillna(0)
print(df)
A B C D E
0 0.0 NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 0.0 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0
You can also use pandas' replace() method to replace np.nan:
df = df.replace(np.nan,0)
print(df)
A B C D E
0 0.0 0.0 1.0 1.0 0.0
1 2.0 2.0 0.0 2.0 2.0
2 0.0 3.0 3.0 0.0 0.0
3 4.0 0.0 4.0 4.0 4.0
4 5.0 5.0 0.0 0.0 5.0
df['A'] = df['A'].replace(np.nan, 0) # Replacing only column A
print(df)
A B C D E
0 0.0 NaN 1.0 1.0 NaN
1 2.0 2.0 NaN 2.0 2.0
2 0.0 3.0 3.0 NaN NaN
3 4.0 NaN 4.0 4.0 4.0
4 5.0 5.0 NaN NaN 5.0

Combine 2 series pandas - overwriting the NANs [duplicate]

I'm looking for a method that behaves similarly to coalesce in T-SQL. I have 2 columns (column A and B) that are sparsely populated in a pandas dataframe. I'd like to create a new column using the following rules:
If the value in column A is not null, use that value for the new column C
If the value in column A is null, use the value in column B for the new column C
Like I mentioned, this can be accomplished in MS SQL Server via the coalesce function. I haven't found a good pythonic method for this; does one exist?
Use combine_first():
In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))
In [17]: df.loc[::2, 'a'] = np.nan
In [18]: df
Out[18]:
a b
0 NaN 0
1 5.0 5
2 NaN 8
3 2.0 8
4 NaN 3
5 9.0 4
6 NaN 7
7 2.0 0
8 NaN 6
9 2.0 5
In [19]: df['c'] = df.a.combine_first(df.b)
In [20]: df
Out[20]:
a b c
0 NaN 0 0.0
1 5.0 5 5.0
2 NaN 8 8.0
3 2.0 8 2.0
4 NaN 3 3.0
5 9.0 4 9.0
6 NaN 7 7.0
7 2.0 0 2.0
8 NaN 6 6.0
9 2.0 5 2.0
Coalesce for multiple columns with DataFrame.bfill
All these methods work for two columns and are fine with maybe three columns, but they all require method chaining if you have n columns when n > 2:
example dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1':[np.NaN, 2, 4, 5, np.NaN],
'col2':[np.NaN, 5, 1, 0, np.NaN],
'col3':[2, np.NaN, 9, 1, np.NaN],
'col4':[np.NaN, 10, 11, 4, 8]})
print(df)
col1 col2 col3 col4
0 NaN NaN 2.0 NaN
1 2.0 5.0 NaN 10.0
2 4.0 1.0 9.0 11.0
3 5.0 0.0 1.0 4.0
4 NaN NaN NaN 8.0
Using DataFrame.bfill over the columns axis (axis=1), we can get the first non-null value in a generalized way, even for a large number of columns.
Plus, this also works for string-type columns!
df['coalesce'] = df.bfill(axis=1).iloc[:, 0]
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0
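If you only want to coalesce across a subset of the columns, the same idea applies after selecting those columns first (a sketch; the column name coalesce_123 is just an example):

# first non-null value among col1, col2 and col3 only
df['coalesce_123'] = df[['col1', 'col2', 'col3']].bfill(axis=1).iloc[:, 0]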
Using Series.combine_first (the accepted answer) can get quite cumbersome and eventually becomes unmanageable as the number of columns grows:
df['coalesce'] = (
df['col1'].combine_first(df['col2'])
.combine_first(df['col3'])
.combine_first(df['col4'])
)
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0
Try this also; it's easier to remember:
df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
This is slightly faster: df['c'] = np.where(df["a"].isnull(), df["b"], df["a"])
%timeit df['d'] = df.a.combine_first(df.b)
1000 loops, best of 3: 472 µs per loop
%timeit df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
1000 loops, best of 3: 291 µs per loop
combine_first is the most straightforward option, but there are a couple of others, some applicable to different cases, which I outline below.
Case #1: Non-mutually Exclusive NaNs
Not all rows have NaNs, and these NaNs are not mutually exclusive between columns.
df = pd.DataFrame({
'a': [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
'b': [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 5.0
1 2.0 3.0
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 7.0 6.0
6 NaN 7.0
Let's combine first on a.
Series.mask
df['a'].mask(pd.isnull, df['b'])
# df['a'].mask(df['a'].isnull(), df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
Series.where
df['a'].where(pd.notnull, df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
You can use similar syntax using np.where.
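For example (a sketch; np.where returns a plain ndarray, so wrap it back into a Series if you need the index):

pd.Series(np.where(df['a'].isnull(), df['b'], df['a']), index=df.index)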
Alternatively, to combine first on b, switch the conditions around.
Case #2: Mutually Exclusive Positioned NaNs
All rows have NaNs which are mutually exclusive between columns.
df = pd.DataFrame({
'a': [1.0, 2.0, 3.0, np.nan, 5.0, np.nan, np.nan],
'b': [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 NaN 6.0
6 NaN 7.0
Series.update
This method works in-place, modifying the original DataFrame. This is an efficient option for this use case.
df['b'].update(df['a'])
# Or, to update "a" in-place,
# df['a'].update(df['b'])
df
a b
0 1.0 1.0
1 2.0 2.0
2 3.0 3.0
3 NaN 4.0
4 5.0 5.0
5 NaN 6.0
6 NaN 7.0
Series.add
df['a'].add(df['b'], fill_value=0)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
DataFrame.fillna + DataFrame.sum
df.fillna(0).sum(1)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
I encountered this problem but wanted to coalesce multiple columns, picking the first non-null value from several columns. I found the following helpful:
Build dummy data
import pandas as pd
df = pd.DataFrame({'a1': [None, 2, 3, None],
'a2': [2, None, 4, None],
'a3': [4, 5, None, None],
'a4': [None, None, None, None],
'b1': [9, 9, 9, 999]})
df
a1 a2 a3 a4 b1
0 NaN 2.0 4.0 None 9
1 2.0 NaN 5.0 None 9
2 3.0 4.0 NaN None 9
3 NaN NaN NaN None 999
Coalesce a1, a2, a3 into a new column A:
def get_first_non_null(dfrow, columns_to_search):
    for c in columns_to_search:
        if pd.notnull(dfrow[c]):
            return dfrow[c]
    return None

# sample usage:
cols_to_search = ['a1', 'a2', 'a3']
df['A'] = df.apply(lambda x: get_first_non_null(x, cols_to_search), axis=1)
print(df)
a1 a2 a3 a4 b1 A
0 NaN 2.0 4.0 None 9 2.0
1 2.0 NaN 5.0 None 9 2.0
2 3.0 4.0 NaN None 9 3.0
3 NaN NaN NaN None 999 NaN
I'm thinking of a solution like this:
def coalesce(s: pd.Series, *series: List[pd.Series]):
    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)
    return s
because, given a DataFrame with columns ['a', 'b', 'c'], you can use it like a SQL coalesce:
df['d'] = coalesce(df.a, df.b, df.c)
For a more general case, where there are no NaNs but you want the same behavior, see: Merge 'left', but override 'right' values where possible
Good code, but you have a typo for Python 3; the corrected version looks like this:
def coalesce(s: pd.Series, *series: pd.Series):
    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)
    return s
Consider using DuckDB for efficient SQL on Pandas. It's performant, simple, and feature-packed. https://duckdb.org/2021/05/14/sql-on-pandas.html
Sample Dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,np.NaN, 3, 4, 5],
'B':[np.NaN, 2, 3, 4, np.NaN]})
Coalesce using DuckDB:
import duckdb
out_df = duckdb.query("""SELECT A,B,coalesce(A,B) as C from df""").to_df()
print(out_df)
Output:
A B c
0 1.0 NaN 1.0
1 NaN 2.0 2.0
2 3.0 3.0 3.0
3 4.0 4.0 4.0
4 5.0 NaN 5.0

better or more efficient solution for combining DataFrames, fillna and thinning

I already have a solution that gets to the desired result, but it seems far from optimal to me.
Now, describing the situation:
Given two different Pandas DataFrames, each having timestamps as indices (from synchronized clocks). For further description, consider these visualizations and descriptors:
Table 1
+-----+------+------+-----+------+
| ts1 | m1 | m2 | ... | mi |
+-----+------+------+-----+------+
| t_1 | m1_1 | m2_1 | ... | mi_1 |
| ... | ... | ... | ... | ... |
| t_k | m1_k | m2_k | ... | mi_k |
+-----+------+------+-----+------+
Table 2
+-----+------+------+-----+------+
| ts2 | s1 | s2 | ... | sn |
+-----+------+------+-----+------+
| s_1 | s1_1 | s2_1 | ... | si_1 |
| ... | ... | ... | ... | ... |
| s_k | s1_p | s2_p | ... | si_p |
+-----+------+------+-----+------+
The timestamps ts1 and ts2 are most likely different, but their ranges overlap.
I need to construct a result table of the form
Result Table
+-----+------+------+-----+------+------+------+-----+------+
| ts1 | m1 | m2 | ... | mi | s1 | s2 | ... | si |
+-----+------+------+-----+------+------+------+-----+------+
| t_1 | m1_1 | m2_1 | ... | mi_1 | z1_1 | z2_1 | ... | zi_1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| t_k | m1_k | m2_k | ... | mi_k | z1_k | z2_k | ... | zi_k |
+-----+------+------+-----+------+------+------+-----+------+
and each value z in the table should be the last valid entry (over time, i.e. by timestamp) of the corresponding s column at or before the timestamp of the given row. (I hope that is understandable.)
My solution reads:
# Combining data
ResultTable=pandas.concat([Table1, Table2]).sort_index()
# retrieving last valid entries for s
ResultTable.s1.fillna(method='pad', inplace=True)
ResultTable.s2.fillna(method='pad', inplace=True)
...
ResultTable.si.fillna(method='pad', inplace=True)
# removing unneeded timestamps `s_1 ... s_k` in result
# many ideas howto do that (deleting rows with NaN in m columns for example)
# please tell me, what would be most efficient?
Regarding the question of efficiency, some details on the sizes:
In my example, Table 1 has 4,000,000 rows and 8 columns (which might grow to 50).
Table 2 consists of about 1,000,000 rows and 85 columns.
WOW - jezrael solved that quickly with his hint about merge_asof, leading to a solution in just one line of code:
test2=pandas.merge_asof(Table1.sort_index(), Table2.sort_index(),
left_index=True, right_index=True)
The other code can be simplified:
#if ts2 is column
cols2 = Table2.columns.difference(['ts2'])
#if ts2 is index
#cols2 = Table2.columns
ResultTable[cols2] = ResultTable[cols2].ffill()
instead of:
ResultTable.s1.fillna(method='pad',inplace=True)
ResultTable.s2.fillna(method='pad',inplace=True)
...
ResultTable.si.fillna(method='pad',inplace=True)
If you want to delete rows with NaNs in the m columns, use notnull to identify non-NaN values, check that all values per row are non-null, and filter by boolean indexing:
#if ts2 is column
cols1 = Table1.columns.difference(['ts1'])
#if ts1 is index
#cols1 = Table1.columns
m = ResultTable[cols1].notnull().all(axis=1)
ResultTable = ResultTable[m]
Sample:
np.random.seed(45)
rng = (pd.date_range('2017-03-26', periods=3).tolist() +
pd.date_range('2017-04-01', periods=2).tolist() +
pd.date_range('2017-04-08', periods=3).tolist() +
pd.date_range('2017-04-13', periods=2).tolist())
Table1 = pd.DataFrame(np.random.randint(10, size=(10, 10)), index=rng).add_prefix('m')
Table1.index.name = 'ts1'
print (Table1)
m0 m1 m2 m3 m4 m5 m6 m7 m8 m9
ts1
2017-03-26 3 0 5 3 4 9 8 1 5 9
2017-03-27 6 8 7 8 5 2 8 1 6 4
2017-03-28 8 4 6 4 9 1 6 8 8 1
2017-04-01 6 0 4 9 8 0 9 2 6 7
2017-04-02 0 0 2 9 2 6 0 9 6 0
2017-04-08 8 8 0 6 7 8 5 1 3 7
2017-04-09 5 9 3 2 7 7 4 9 9 9
2017-04-10 9 7 2 7 9 4 5 7 9 7
2017-04-13 6 2 7 7 6 6 3 6 0 7
2017-04-14 4 9 3 5 7 3 5 5 7 1
rng = (pd.date_range('2017-03-27', periods=3).tolist() +
pd.date_range('2017-04-03', periods=2).tolist() +
pd.date_range('2017-04-06', periods=3).tolist() +
pd.date_range('2017-04-10', periods=2).tolist())
Table2 = pd.DataFrame(np.random.randint(10, size=(10, 10)), index=rng).add_prefix('s')
Table2.index.name = 'ts2'
print (Table2)
s0 s1 s2 s3 s4 s5 s6 s7 s8 s9
ts2
2017-03-27 0 2 1 9 2 3 9 6 3 6
2017-03-28 1 9 1 7 4 0 2 1 1 4
2017-03-29 2 2 2 5 3 6 7 5 6 5
2017-04-03 2 8 7 1 2 7 9 6 4 5
2017-04-04 4 5 4 1 3 7 0 5 0 6
2017-04-06 5 8 0 1 9 9 2 4 4 0
2017-04-07 8 2 8 9 7 5 4 3 2 5
2017-04-08 7 9 2 5 8 0 8 9 4 0
2017-04-10 2 5 1 2 1 4 2 3 7 0
2017-04-11 2 0 8 8 6 8 7 5 2 9
ResultTable=pd.concat([Table1, Table2]).sort_index()
cols2 = Table2.columns
ResultTable[cols2] = ResultTable[cols2].ffill()
cols1 = Table1.columns
m = ResultTable[cols1].notnull().all(1)
ResultTable = ResultTable[m]
print (ResultTable)
m0 m1 m2 m3 m4 m5 m6 m7 m8 m9 s0 s1 s2 \
2017-03-26 3.0 0.0 5.0 3.0 4.0 9.0 8.0 1.0 5.0 9.0 NaN NaN NaN
2017-03-27 6.0 8.0 7.0 8.0 5.0 2.0 8.0 1.0 6.0 4.0 NaN NaN NaN
2017-03-28 8.0 4.0 6.0 4.0 9.0 1.0 6.0 8.0 8.0 1.0 0.0 2.0 1.0
2017-04-01 6.0 0.0 4.0 9.0 8.0 0.0 9.0 2.0 6.0 7.0 2.0 2.0 2.0
2017-04-02 0.0 0.0 2.0 9.0 2.0 6.0 0.0 9.0 6.0 0.0 2.0 2.0 2.0
2017-04-08 8.0 8.0 0.0 6.0 7.0 8.0 5.0 1.0 3.0 7.0 8.0 2.0 8.0
2017-04-09 5.0 9.0 3.0 2.0 7.0 7.0 4.0 9.0 9.0 9.0 7.0 9.0 2.0
2017-04-10 9.0 7.0 2.0 7.0 9.0 4.0 5.0 7.0 9.0 7.0 7.0 9.0 2.0
2017-04-13 6.0 2.0 7.0 7.0 6.0 6.0 3.0 6.0 0.0 7.0 2.0 0.0 8.0
2017-04-14 4.0 9.0 3.0 5.0 7.0 3.0 5.0 5.0 7.0 1.0 2.0 0.0 8.0
s3 s4 s5 s6 s7 s8 s9
2017-03-26 NaN NaN NaN NaN NaN NaN NaN
2017-03-27 NaN NaN NaN NaN NaN NaN NaN
2017-03-28 9.0 2.0 3.0 9.0 6.0 3.0 6.0
2017-04-01 5.0 3.0 6.0 7.0 5.0 6.0 5.0
2017-04-02 5.0 3.0 6.0 7.0 5.0 6.0 5.0
2017-04-08 9.0 7.0 5.0 4.0 3.0 2.0 5.0
2017-04-09 5.0 8.0 0.0 8.0 9.0 4.0 0.0
2017-04-10 5.0 8.0 0.0 8.0 9.0 4.0 0.0
2017-04-13 8.0 6.0 8.0 7.0 5.0 2.0 9.0
2017-04-14 8.0 6.0 8.0 7.0 5.0 2.0 9.0
Another solution would be merge_asof.
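Applied to the sample frames above, that would be something like (a sketch, matching the one-liner already noted in the question's edit):

ResultTable = pd.merge_asof(Table1.sort_index(), Table2.sort_index(),
                            left_index=True, right_index=True)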
