I have a pandas dataframe with three levels of row indexing; the last level is a datetime index. There are NaN values, and I am trying to fill them with the average of each row at the datetime level. How can I go about doing this?
data_df
Level 0 | Level 1 | Level 2             |     |     |
A       | 123     | 2019-01-28 17:00:00 | 3   | 1   | NaN
        |         | 2019-01-28 18:00:00 | 2   | NaN | 1
        |         | 2019-01-28 19:00:00 | NaN | NaN | 5
        | 234     | 2019-01-28 05:00:00 | 1   | 1   | 3
        |         | 2019-01-28 06:00:00 | NaN | NaN | NaN
Some rows may be all NaN values; in that case I want to fill the row with 0's. Some rows may have all values filled in, so imputing with the average isn't needed.
I want the following result:
Level 0 | Level 1 | Level 2             |     |     |
A       | 123     | 2019-01-28 17:00:00 | 3   | 1   | 2
        |         | 2019-01-28 18:00:00 | 2   | 1.5 | 1
        |         | 2019-01-28 19:00:00 | 5   | 5   | 5
        | 234     | 2019-01-28 05:00:00 | 1   | 1   | 3
        |         | 2019-01-28 06:00:00 | 0   | 0   | 0
Use DataFrame.mask to replace the NaN cells with the per-row mean, then convert the remaining all-NaN rows (whose mean is itself NaN) with DataFrame.fillna:
df = df.mask(df.isna(), df.mean(axis=1), axis=0).fillna(0)
print(df)
a b c
Level 0 Level 1 Level 2
A 123 2019-01-28 17:00:00 3.0 1.0 2.0
2019-01-28 18:00:00 2.0 1.5 1.0
2019-01-28 19:00:00 5.0 5.0 5.0
234 2019-01-28 05:00:00 1.0 1.0 3.0
2019-01-28 06:00:00 0.0 0.0 0.0
Another solution is to use DataFrame.fillna for the replacement, but because df.fillna(df.mean(axis=1), axis=1) is not implemented, a double transpose is necessary:
df = df.T.fillna(df.mean(axis=1)).fillna(0).T
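For reference, a self-contained reproduction of the example (the column names a/b/c are assumed from the printed output above, and the datetime level is kept as strings for brevity):

```python
import pandas as pd
import numpy as np

# Rebuild the example frame with a three-level row index
idx = pd.MultiIndex.from_tuples([
    ('A', 123, '2019-01-28 17:00:00'),
    ('A', 123, '2019-01-28 18:00:00'),
    ('A', 123, '2019-01-28 19:00:00'),
    ('A', 234, '2019-01-28 05:00:00'),
    ('A', 234, '2019-01-28 06:00:00'),
], names=['Level 0', 'Level 1', 'Level 2'])
df = pd.DataFrame({'a': [3, 2, np.nan, 1, np.nan],
                   'b': [1, np.nan, np.nan, 1, np.nan],
                   'c': [np.nan, 1, 5, 3, np.nan]}, index=idx)

# Row means ignore NaNs; mask() fills the NaN cells with them, and the
# trailing fillna(0) handles rows that were entirely NaN
filled = df.mask(df.isna(), df.mean(axis=1), axis=0).fillna(0)
print(filled)
```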
Related
I have a dataframe like this (the actual data has 70 columns with timestamps), with column names A_Timestamp, BC_Timestamp, DA_Timestamp, CA_Timestamp, B_Values, C_Values, D_Values, Q_Values:
A_Timestamp         | B_Values
2020-11-08 11:15:00 | 1
2020-11-10 15:34:00 | 2

BC_Timestamp        | C_Values
2020-11-11 12:13:00 | 8
2020-11-15 02:47:00 | 4

DA_Timestamp        | D_Values
2020-1-13 14:47:00  | 3
2020-11-9 5:34:00   | 5

CA_Timestamp        | Q_Values
2020-7-18 01:04:00  | 7
2020-04-10 16:34:00 | 6
And I want it like this:
| Timestamp           | B_Values | C_Values | D_Values | Q_Values |
| 2020-11-08 11:15:00 | 1        | NaN      | NaN      | NaN      |
| 2020-11-10 15:34:00 | 2        | NaN      | NaN      | NaN      |
| 2020-11-11 12:13:00 | NaN      | 8        | NaN      | NaN      |
| 2020-11-15 02:47:00 | NaN      | 4        | NaN      | NaN      |
| 2020-1-13 14:47:00  | NaN      | NaN      | 3        | NaN      |
| 2020-11-9 05:34:00  | NaN      | NaN      | 5        | NaN      |
| 2020-7-18 01:04:00  | NaN      | NaN      | NaN      | 7        |
| 2020-04-10 16:34:00 | NaN      | NaN      | NaN      | 6        |
I want to merge all the columns ending with 'Timestamp' into one single column, with each timestamp keeping its value in the respective value column.
You can use a renamer for the Timestamp columns:
dfs = [df1, df2, df3, df4]
renamer = lambda x: 'Timestamp' if x.endswith('Timestamp') else x
out = pd.concat([d.rename(renamer, axis=1) for d in dfs])
Output:
Timestamp B_Values C_Values D_Values Q_Values
0 2020-11-08 11:15:00 1.0 NaN NaN NaN
1 2020-11-10 15:34:00 2.0 NaN NaN NaN
0 2020-11-11 12:13:00 NaN 8.0 NaN NaN
1 2020-11-15 02:47:00 NaN 4.0 NaN NaN
0 2020-1-13 14:47:00 NaN NaN 3.0 NaN
1 2020-11-9 5:34:00 NaN NaN 5.0 NaN
0 2020-7-18 01:04:00 NaN NaN NaN 7.0
1 2020-04-10 16:34:00 NaN NaN NaN 6.0
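To make the snippet runnable end to end, here is a hypothetical reconstruction of the four input frames:

```python
import pandas as pd

# Hypothetical stand-ins for the four separate input frames
df1 = pd.DataFrame({'A_Timestamp': ['2020-11-08 11:15:00', '2020-11-10 15:34:00'],
                    'B_Values': [1, 2]})
df2 = pd.DataFrame({'BC_Timestamp': ['2020-11-11 12:13:00', '2020-11-15 02:47:00'],
                    'C_Values': [8, 4]})
df3 = pd.DataFrame({'DA_Timestamp': ['2020-1-13 14:47:00', '2020-11-9 5:34:00'],
                    'D_Values': [3, 5]})
df4 = pd.DataFrame({'CA_Timestamp': ['2020-7-18 01:04:00', '2020-04-10 16:34:00'],
                    'Q_Values': [7, 6]})

# Rename every *_Timestamp column to the same name, then concatenate;
# concat aligns on column names and fills the gaps with NaN
renamer = lambda x: 'Timestamp' if x.endswith('Timestamp') else x
out = pd.concat([d.rename(renamer, axis=1) for d in (df1, df2, df3, df4)])
print(out)
```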
alternative
Assuming you have a single DataFrame as input:
A_Timestamp B_Values BC_Timestamp C_Values DA_Timestamp D_Values CA_Timestamp Q_Values
0 2020-11-08 11:15:00 1 2020-11-11 12:13:00 8 2020-1-13 14:47:00 3 2020-7-18 01:04:00 7
1 2020-11-10 15:34:00 2 2020-11-15 02:47:00 4 2020-11-9 5:34:00 5 2020-04-10 16:34:00 6
You can then reshape with a MultiIndex:
m = df.columns.str.endswith('Timestamp')
s = df.columns.to_series().mask(m)
out = (df
       .set_axis(pd.MultiIndex.from_arrays(
           [s.bfill(), s.fillna('Timestamp')]), axis=1)
       .T.stack().unstack(-2).droplevel(0)
       )
Output:
B_Values C_Values D_Values Q_Values Timestamp
0 1 NaN NaN NaN 2020-11-08 11:15:00
1 2 NaN NaN NaN 2020-11-10 15:34:00
0 NaN 8 NaN NaN 2020-11-11 12:13:00
1 NaN 4 NaN NaN 2020-11-15 02:47:00
0 NaN NaN 3 NaN 2020-1-13 14:47:00
1 NaN NaN 5 NaN 2020-11-9 5:34:00
0 NaN NaN NaN 7 2020-7-18 01:04:00
1 NaN NaN NaN 6 2020-04-10 16:34:00
Or, if order of the rows doesn't matter:
m = df.columns.str.endswith('Timestamp')
s = df.columns.to_series().mask(m)
(df.set_axis(pd.MultiIndex.from_arrays(
     [s.fillna('Timestamp'), s.bfill()]), axis=1)
   .stack()
 )
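To make the reshape concrete, here is the first pipeline run against a hypothetical reconstruction of the wide single-frame input:

```python
import pandas as pd

# Hypothetical reconstruction of the wide input frame
df = pd.DataFrame({
    'A_Timestamp': ['2020-11-08 11:15:00', '2020-11-10 15:34:00'], 'B_Values': [1, 2],
    'BC_Timestamp': ['2020-11-11 12:13:00', '2020-11-15 02:47:00'], 'C_Values': [8, 4],
    'DA_Timestamp': ['2020-1-13 14:47:00', '2020-11-9 5:34:00'], 'D_Values': [3, 5],
    'CA_Timestamp': ['2020-7-18 01:04:00', '2020-04-10 16:34:00'], 'Q_Values': [7, 6],
})

m = df.columns.str.endswith('Timestamp')
s = df.columns.to_series().mask(m)   # NaN for the *_Timestamp columns

# Pair each timestamp column with its value column via a two-level header,
# then reshape so each (timestamp, value) pair becomes its own row
out = (df
       .set_axis(pd.MultiIndex.from_arrays(
           [s.bfill(), s.fillna('Timestamp')]), axis=1)
       .T.stack().unstack(-2).droplevel(0)
       )
print(out)
```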
I have a DataFrame as follows. This DataFrame contains NaN values. I want to replace each NaN with the earlier non-NaN value from the same day of the previous month(s):
date (y-d-m) | value
2022-01-01 | 1
2022-02-01 | 2
2022-03-01 | 3
2022-04-01 | 4
...
2022-01-02 | nan
2022-02-02 | nan
2022-03-02 | nan
2022-04-02 | nan
...
2022-01-03 | nan
2022-02-03 | nan
2022-03-03 | nan
2022-04-03 | nan
Desired outcome
date (y-d-m) | value
2022-01-01 | 1
2022-02-01 | 2
2022-03-01 | 3
2022-04-01 | 4
...
2022-01-02 | 1
2022-02-02 | 2
2022-03-02 | 3
2022-04-02 | 4
...
2022-01-03 | 1
2022-02-03 | 2
2022-03-03 | 3
2022-04-03 | 4
Data:
{'date (y-d-m)': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01',
'2022-01-02', '2022-02-02', '2022-03-02', '2022-04-02',
'2022-01-03', '2022-02-03', '2022-03-03', '2022-04-03'],
'value': [1.0, 2.0, 3.0, 4.0, nan, nan, nan, nan, nan, nan, nan, nan]}
You could convert "date (y-d-m)" column to datetime; then groupby "day" and forward fill with ffill (values from previous months' same day):
df['date (y-d-m)'] = pd.to_datetime(df['date (y-d-m)'], format='%Y-%d-%m')
df['value'] = df.groupby(df['date (y-d-m)'].dt.day)['value'].ffill()
Output:
date (y-d-m) value
0 2022-01-01 1.0
1 2022-01-02 2.0
2 2022-01-03 3.0
3 2022-01-04 4.0
4 2022-02-01 1.0
5 2022-02-02 2.0
6 2022-02-03 3.0
7 2022-02-04 4.0
8 2022-03-01 1.0
9 2022-03-02 2.0
10 2022-03-03 3.0
11 2022-03-04 4.0
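The whole answer can be reproduced end to end with the posted data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'date (y-d-m)': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01',
                                    '2022-01-02', '2022-02-02', '2022-03-02', '2022-04-02',
                                    '2022-01-03', '2022-02-03', '2022-03-03', '2022-04-03'],
                   'value': [1.0, 2.0, 3.0, 4.0] + [np.nan] * 8})

# With format='%Y-%d-%m' the middle field is the day-of-month, so grouping
# by .dt.day aligns the same day across months and ffill pulls each value
# forward from the earlier month
df['date (y-d-m)'] = pd.to_datetime(df['date (y-d-m)'], format='%Y-%d-%m')
df['value'] = df.groupby(df['date (y-d-m)'].dt.day)['value'].ffill()
print(df)
```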
Let's say I have the following model:
class DateRange(models.Model):
start = models.DateTimeField()
end = models.DateTimeField()
Is there a way to get all pairs of DateRange instances with any overlap in their start to end range? For example, if I had:
id | start | end
-----+---------------------+---------------------
1 | 2020-01-02 12:00:00 | 2020-01-02 16:00:00 # overlap with 2, 3 and 5
2 | 2020-01-02 13:00:00 | 2020-01-02 14:00:00 # overlap with 1 and 3
3 | 2020-01-02 13:30:00 | 2020-01-02 17:00:00 # overlap with 1 and 2
4 | 2020-01-02 10:00:00 | 2020-01-02 10:30:00 # no overlap
5 | 2020-01-02 12:00:00 | 2020-01-02 12:30:00 # overlap with 1
I'd want:
id_1 | id_2
------+-----
1 | 2
1 | 3
1 | 5
2 | 3
Any thoughts on the best way to do this? The order of id_1 and id_2 doesn't matter, but I do need the pairs to be distinct (e.g. id_1=1, id_2=2 is the same as id_1=2, id_2=1 and should not be repeated).
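One hedged sketch: materialize the instances and test the overlap condition pairwise in plain Python. Two ranges overlap exactly when each starts before the other ends, and the same condition maps to Django lookups such as start__lt / end__gt if you prefer one query per instance. The data below is hypothetical, mirroring the table above:

```python
from datetime import datetime
from itertools import combinations

# Hypothetical in-memory stand-ins for the DateRange rows: id -> (start, end)
ranges = {
    1: (datetime(2020, 1, 2, 12, 0), datetime(2020, 1, 2, 16, 0)),
    2: (datetime(2020, 1, 2, 13, 0), datetime(2020, 1, 2, 14, 0)),
    3: (datetime(2020, 1, 2, 13, 30), datetime(2020, 1, 2, 17, 0)),
    4: (datetime(2020, 1, 2, 10, 0), datetime(2020, 1, 2, 10, 30)),
    5: (datetime(2020, 1, 2, 12, 0), datetime(2020, 1, 2, 12, 30)),
}

# combinations() yields each unordered pair exactly once, so (1, 2) and
# (2, 1) can never both appear; the filter is the standard interval
# overlap test: each range starts before the other ends
pairs = [
    (id1, id2)
    for (id1, (s1, e1)), (id2, (s2, e2)) in combinations(sorted(ranges.items()), 2)
    if s1 < e2 and s2 < e1
]
print(pairs)  # [(1, 2), (1, 3), (1, 5), (2, 3)]
```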
Let's say I have purchase records with two fields, Buy and Time.
What I want is a third column with the time elapsed since the first not-buy, so it looks like:
buy| time | time difference
1 | 8:00 | NULL
0 | 9:01 | NULL
0 | 9:10 | NULL
0 | 9:21 | NULL
1 | 9:31 | 0:30
0 | 9:41 | NULL
0 | 9:42 | NULL
1 | 9:53 | 0:12
How can I achieve this? It seems to me that it's a mix of pd.groupby() and pd.shift(), but I can't seem to work that out in my head.
IIUC
df.time = pd.to_datetime(df.time)
df.loc[df.buy == 1, 'DIFF'] = df.groupby(df.buy.cumsum().shift().fillna(0)).time.transform(lambda x: x.iloc[-1] - x.iloc[0])
df
Out[19]:
buy time timedifference DIFF
0 1 2018-02-26 08:00:00 NaN 00:00:00
1 0 2018-02-26 09:01:00 NaN NaT
2 0 2018-02-26 09:10:00 NaN NaT
3 0 2018-02-26 09:21:00 NaN NaT
4 1 2018-02-26 09:31:00 0:30 00:30:00
5 0 2018-02-26 09:41:00 NaN NaT
6 0 2018-02-26 09:42:00 NaN NaT
7 1 2018-02-26 09:53:00 0:12 00:12:00
# df.buy.cumsum().shift().fillna(0) creates the key for groupby
# .time.transform(lambda x: x.iloc[-1] - x.iloc[0]) computes the difference within each group
# df.loc[df.buy == 1, 'DIFF'] fills the value only at the positions where buy equals 1
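Putting the answer together as a runnable script with the sample data:

```python
import pandas as pd

df = pd.DataFrame({'buy': [1, 0, 0, 0, 1, 0, 0, 1],
                   'time': ['8:00', '9:01', '9:10', '9:21',
                            '9:31', '9:41', '9:42', '9:53']})
df['time'] = pd.to_datetime(df['time'])

# Group key: cumulative buy count shifted by one, so each group runs from
# the first not-buy after a purchase through the next purchase
key = df['buy'].cumsum().shift().fillna(0)
df.loc[df['buy'] == 1, 'DIFF'] = (
    df.groupby(key)['time'].transform(lambda x: x.iloc[-1] - x.iloc[0])
)
print(df)
```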
I have a pandas dataframe, shown below, with a Month-Year column. I need to make it continuous, with a count of 0 for any month that has no rows. The expected output is shown below.
Input dataframe
Month | Count
--------------
Jan-15 | 10
Feb-15 | 100
Mar-15 | 20
Jul-15 | 10
Sep-15 | 11
Oct-15 | 1
Dec-15 | 15
Expected Output
Month | Count
--------------
Jan-15 | 10
Feb-15 | 100
Mar-15 | 20
Apr-15 | 0
May-15 | 0
Jun-15 | 0
Jul-15 | 10
Aug-15 | 0
Sep-15 | 11
Oct-15 | 1
Nov-15 | 0
Dec-15 | 15
You can set the Month column as the index. If the input comes from Excel, the months will already be parsed as dates (Jan-15 as 2015-01-01), so you can resample directly:
df.set_index('Month').resample('MS').asfreq().fillna(0)
Out:
Count
Month
2015-01-01 10.0
2015-02-01 100.0
2015-03-01 20.0
2015-04-01 0.0
2015-05-01 0.0
2015-06-01 0.0
2015-07-01 10.0
2015-08-01 0.0
2015-09-01 11.0
2015-10-01 1.0
2015-11-01 0.0
2015-12-01 15.0
If the month column is not recognized as date, you need to convert it first:
df['Month'] = pd.to_datetime(df['Month'], format='%b-%y')
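A minimal end-to-end sketch, assuming the Month column starts as 'Jan-15'-style strings:

```python
import pandas as pd

df = pd.DataFrame({'Month': ['Jan-15', 'Feb-15', 'Mar-15', 'Jul-15',
                             'Sep-15', 'Oct-15', 'Dec-15'],
                   'Count': [10, 100, 20, 10, 11, 1, 15]})

# Parse 'Jan-15' as 2015-01-01, then resample at month-start frequency;
# asfreq() inserts the missing months as NaN, which fillna(0) zeroes out
df['Month'] = pd.to_datetime(df['Month'], format='%b-%y')
out = df.set_index('Month').resample('MS').asfreq().fillna(0)
print(out)
```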