I have a dataframe with a Category column (which we will group by) and a Value column. I want to add a new column, LastCleanValue, which shows the most recent non-null value for that group. If there have not been any non-nulls yet in the group, we just take null. For example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Category': ['a','a','a','b','b','a','a','b','a','a','b'],
                   'Value': [np.nan, np.nan, 34, 40, 42, 25, np.nan, np.nan, 31, 33, np.nan]})
And the function should add a new column:
| | Category | Value | LastCleanValue |
|---:|:-----------|--------:|-----------------:|
| 0 | a | nan | nan |
| 1 | a | nan | nan |
| 2 | a | 34 | 34 |
| 3 | b | 40 | 40 |
| 4 | b | 42 | 42 |
| 5 | a | 25 | 25 |
| 6 | a | nan | 25 |
| 7 | b | nan | 42 |
| 8 | a | 31 | 31 |
| 9 | a | 33 | 33 |
| 10 | b | nan | 42 |
How can I do this in Pandas? I was attempting something like df.groupby('Category')['Value'].dropna().last()
This is more like ffill within each group:
df['new'] = df.groupby('Category')['Value'].ffill()
Out[430]:
0 NaN
1 NaN
2 34.0
3 40.0
4 42.0
5 25.0
6 25.0
7 42.0
8 31.0
9 33.0
10 42.0
Name: Value, dtype: float64
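For completeness, a minimal sketch (reusing the question's data and the LastCleanValue column name) to confirm that the groupwise ffill reproduces the expected table:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Category': ['a','a','a','b','b','a','a','b','a','a','b'],
                   'Value': [np.nan, np.nan, 34, 40, 42, 25, np.nan, np.nan, 31, 33, np.nan]})

# ffill within each Category carries the last non-null Value forward;
# rows before the first non-null in a group stay NaN, as requested
df['LastCleanValue'] = df.groupby('Category')['Value'].ffill()
print(df)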
Please help.
I have a dataframe like
| | ID | Result | measurement_1 | measurement_2 | measurement_3 | measurement_4 | measurement_5 | start_time | end-time |
|----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------|
| 0 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-20 21:24:03.390000 | 2020-10-20 23:46:36.990000 |
| 1 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-21 04:36:03.390000 | 2020-10-21 06:58:36.990000 |
| 2 | 12345 | nan | 49584 | 2827 | nan | nan | nan | 2020-10-21 09:24:03.390000 | 2020-10-21 11:46:36.990000 |
| 3 | 12345 | nan | nan | nan | 3940 | nan | nan | 2020-10-21 14:12:03.390000 | 2020-10-21 16:34:36.990000 |
| 4 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-21 21:24:03.390000 | 2020-10-21 23:46:36.990000 |
| 5 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-22 02:40:51.390000 | 2020-10-22 05:03:24.990000 |
| 6 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-22 08:26:27.390000 | 2020-10-22 10:49:00.990000 |
| 7 | 12345 | Pass | nan | nan | nan | 392 | 304 | 2020-10-22 14:12:03.390000 | 2020-10-22 16:34:36.990000 |
| 8 | 12346 | nan | nan | nan | nan | nan | nan | 2020-10-22 19:57:39.390000 | 2020-10-22 22:20:12.990000 |
| 9 | 12346 | nan | 22839 | 4059 | nan | nan | nan | 2020-10-23 01:43:15.390000 | 2020-10-23 04:05:48.990000 |
| 10 | 12346 | nan | nan | nan | 4059 | nan | nan | 2020-10-23 07:28:51.390000 | 2020-10-23 09:51:24.990000 |
| 11 | 12346 | nan | nan | nan | nan | nan | nan | 2020-10-23 13:14:27.390000 | 2020-10-23 15:37:00.990000 |
| 12 | 12346 | nan | nan | nan | nan | nan | nan | 2020-10-23 19:00:03.390000 | 2020-10-23 21:22:36.990000 |
| 13 | 12346 | nan | nan | nan | nan | nan | nan | 2020-10-24 00:45:39.390000 | 2020-10-24 03:08:12.990000 |
| 14 | 12346 | Fail | nan | nan | nan | 2938 | 495 | 2020-10-24 06:31:15.390000 | 2020-10-24 08:53:48.990000 |
| 15 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-24 12:16:51.390000 | 2020-10-24 14:39:24.990000 |
| 16 | 12345 | nan | 62839 | 1827 | nan | nan | nan | 2020-10-24 18:02:27.390000 | 2020-10-24 20:25:00.990000 |
| 17 | 12345 | nan | nan | nan | 2726 | nan | nan | 2020-10-24 23:48:03.390000 | 2020-10-25 02:10:36.990000 |
| 18 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-25 05:33:39.390000 | 2020-10-25 07:56:12.990000 |
| 19 | 12345 | Fail | nan | nan | nan | nan | 1827 | 2020-10-25 11:19:15.390000 | 2020-10-25 13:41:48.990000 |
+----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------+
and want my output to look like
+----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------+
| | ID | Result | measurement_1 | measurement_2 | measurement_3 | measurement_4 | measurement_5 | start_time | end-time |
|----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------|
| 0 | 12345 | Pass | 49584 | 2827 | 3940 | 392 | 304 | 2020-10-20 21:24:03.390000 | 2020-10-22 16:34:36.990000 |
| 1 | 12346 | Fail | 22839 | 4059 | 4059 | 2938 | 495 | 2020-10-22 19:57:39.390000 | 2020-10-24 08:53:48.990000 |
| 2 | 12345 | Fail | 62839 | 1827 | 2726 | nan | 1827 | 2020-10-24 12:16:51.390000 | 2020-10-25 13:41:48.990000 |
+----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------+
So far I am able to group the columns on `ID` and `Result`; now I want to apply a coalesce to the grouped frame (newDf):
df = pd.read_excel("Test_Coalesce.xlsx")
newDf = df.groupby(['ID','Result'])
newDf.all().reset_index()
It looks like you want to group by consecutive blocks of ID. If so:
# a new block starts whenever ID changes from the previous row
blocks = df['ID'].ne(df['ID'].shift()).cumsum()

# take the first non-null value of every column, except end-time where we want the last
agg_dict = {k: 'first' if k != 'end-time' else 'last' for k in df.columns}

df.groupby(blocks).agg(agg_dict)
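A self-contained sketch of the same idea on a cut-down, made-up frame (the column names follow the question, the toy values and reduced set of rows do not); groupwise 'first' skips NaNs within each block, which is what gives the coalesce effect:

import numpy as np
import pandas as pd

# Cut-down stand-in for the question's frame (toy values, fewer columns)
df = pd.DataFrame({
    'ID':            [12345, 12345, 12345, 12346, 12346, 12345],
    'Result':        [np.nan, np.nan, 'Pass', np.nan, 'Fail', 'Fail'],
    'measurement_1': [np.nan, 49584, np.nan, 22839, np.nan, 62839],
    'start_time':    pd.to_datetime(['2020-10-20 21:24', '2020-10-21 09:24', '2020-10-22 14:12',
                                     '2020-10-22 19:57', '2020-10-24 06:31', '2020-10-24 12:16']),
    'end-time':      pd.to_datetime(['2020-10-20 23:46', '2020-10-21 11:46', '2020-10-22 16:34',
                                     '2020-10-22 22:20', '2020-10-24 08:53', '2020-10-25 13:41']),
})

# A new block starts every time ID changes from the previous row
blocks = df['ID'].ne(df['ID'].shift()).cumsum()

# 'first' picks the first non-null value per block (the coalesce);
# only end-time keeps the block's last value
agg_dict = {k: 'last' if k == 'end-time' else 'first' for k in df.columns}
print(df.groupby(blocks).agg(agg_dict).reset_index(drop=True))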
I have a dataframe consisting of several medical measurements taken at different hours (from 1 to 12) and from different patients.
The data is organised by two indices, one corresponding to the patient number (pid) and one to the time of the measurements.
The measurements themselves are in the columns.
The dataframe looks like this:
            | Measurement1   | ... | Measurement35
pid   Time  |                |     |
-------------------------------------------------------
1     1     | Meas1#T1,pid1  |     | Meas35#T1,pid1
      2     | Meas1#T2,pid1  |     | Meas35#T2,pid1
      3     | ...            |     | ...
      ...   |                |     |
      12    |                |     |
2     1     | Meas1#T1,pid2  |     | ...
      2     |                |     |
      3     |                |     |
      ...   | ...            |     |
      12    |                |     |
...         |                |     |
9999  1     |                |     | ...
      2     |                |     |
      3     |                |     |
      ...   |                |     | ...
      12    |                |     |
And what I would like to get is one row for each patient and one column for each combination of Time and measurement (so that a pid's row contains all the data relative to that patient):
      | Measurement1           | ... | Measurement35          |
pid   | T1 | T2 | ... | T12    |     | T1 | T2 | ... | T12    |
---------------------------------------------------------------
1     |    |    |     |        |     |    |    |     |        |
2     |    |    |     |        |     |    |    |     |        |
...   |    |    |     |        |     |    |    |     |        |
9999  |    |    |     |        |     |    |    |     |        |
What I tried is DF.pivot(index='pid', columns='Time'), but I get 35 columns for each Measurement instead of the 12 columns that I need (and the values in these 35 columns are sometimes shifted). Something similar happens with DF.unstack(1).
What am I missing?
You're missing the values argument in df.pivot.
# Example df: 3 patients (pid 1-3), 12 time points each, two measurement columns
import numpy as np
import pandas as pd

vals = ['val_time1', np.nan, 'val_time3', np.nan, np.nan, np.nan,
        'val_time7', 'val_time8', 'val_time9', np.nan, np.nan, 'val_time12']
df = pd.DataFrame({'pid': [1]*12 + [2]*12 + [3]*12,
                   'Time': list(range(1, 13)) * 3,
                   'Measurement1': vals * 3, 'Measurement2': vals * 3})
Out:
pid Time Measurement1 Measurement2
0 1 1 val_time1 val_time1
1 1 2 NaN NaN
2 1 3 val_time3 val_time3
3 1 4 NaN NaN
4 1 5 NaN NaN
5 1 6 NaN NaN
6 1 7 val_time7 val_time7
7 1 8 val_time8 val_time8
8 1 9 val_time9 val_time9
9 1 10 NaN NaN
10 1 11 NaN NaN
11 1 12 val_time12 val_time12
12 2 1 val_time1 val_time1
13 2 2 NaN NaN
14 2 3 val_time3 val_time3
15 2 4 NaN NaN
Pivot, specifying that we want the values from both columns, Measurement1 and Measurement2:
df_pivoted = df.pivot(index='pid', columns='Time', values=['Measurement1','Measurement2'])
Out:
Measurement1 ... Measurement2
Time 1 2 3 4 ... 9 10 11 12
pid ...
1 val_time1 NaN val_time3 NaN ... val_time9 NaN NaN val_time12
2 val_time1 NaN val_time3 NaN ... val_time9 NaN NaN val_time12
3 val_time1 NaN val_time3 NaN ... val_time9 NaN NaN val_time12
Check to see if we have 12 sub columns for each Measurement group:
print(df_pivoted.columns.levels)
Out:
[['Measurement1', 'Measurement2'], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]]
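If you also want flat column names closer to the layout sketched in the question, the two-level columns can be collapsed; a small sketch continuing from df_pivoted above (the Measurement1_T1 naming scheme is just an assumption, not from the question):

# Optional: collapse the (measurement, Time) column MultiIndex into flat labels
# such as 'Measurement1_T1'
df_pivoted.columns = [f'{meas}_T{t}' for meas, t in df_pivoted.columns]
print(list(df_pivoted.columns[:3]))   # ['Measurement1_T1', 'Measurement1_T2', 'Measurement1_T3']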
I have a DataFrame like this:
import pandas as pd

df = pd.DataFrame({'id': [111, 222],
                   'CycleOfRepricingAnchorTime': ['27.04.2018', '09.06.2018'],
                   'CycleOfRepricing': ['3M', '5M']})
df['CycleOfRepricingAnchorTime'] = pd.to_datetime(df['CycleOfRepricingAnchorTime'])
df
I need to get a DataFrame where the first column is id and the second column is a Date generated at the frequency given by that id's CycleOfRepricing, up to a maximum date of 31.12.2019.
I have tried to solve this with apply, map, etc., but without success, since all I get back are objects:
df.apply(lambda x: pd.date_range(start=x.CycleOfRepricingAnchorTime,
                                 end=pd.to_datetime('31.12.2019'),
                                 freq=x.CycleOfRepricing),
         axis=1)
I will be grateful for the help.
Update: to make each generated date fall on the same day of the month as the anchor date for each period:
df.assign(ReportingTime=df.apply(
        lambda x: pd.date_range(start=x.CycleOfRepricingAnchorTime,
                                end=pd.to_datetime('31.12.2019'),
                                freq=x.CycleOfRepricing + 'S')                # month-start frequency, e.g. '3MS'
                  + pd.Timedelta(days=x.CycleOfRepricingAnchorTime.day - 1),  # shift to the anchor's day of month
        axis=1)).explode('ReportingTime').to_markdown()
Output:
| | id | CycleOfRepricingAnchorTime | CycleOfRepricing | ReportingTime |
|---:|-----:|:-----------------------------|:-------------------|:--------------------|
| 0 | 111 | 2018-04-27 00:00:00 | 3M | 2018-05-27 00:00:00 |
| 0 | 111 | 2018-04-27 00:00:00 | 3M | 2018-08-27 00:00:00 |
| 0 | 111 | 2018-04-27 00:00:00 | 3M | 2018-11-27 00:00:00 |
| 0 | 111 | 2018-04-27 00:00:00 | 3M | 2019-02-27 00:00:00 |
| 0 | 111 | 2018-04-27 00:00:00 | 3M | 2019-05-27 00:00:00 |
| 0 | 111 | 2018-04-27 00:00:00 | 3M | 2019-08-27 00:00:00 |
| 0 | 111 | 2018-04-27 00:00:00 | 3M | 2019-11-27 00:00:00 |
| 1 | 222 | 2018-09-06 00:00:00 | 5M | 2018-10-06 00:00:00 |
| 1 | 222 | 2018-09-06 00:00:00 | 5M | 2019-03-06 00:00:00 |
| 1 | 222 | 2018-09-06 00:00:00 | 5M | 2019-08-06 00:00:00 |
Try this using pandas version 0.25.0+ (explode was added in 0.25.0):
df.assign(ReportingTime=df.apply(
        lambda x: pd.date_range(start=x.CycleOfRepricingAnchorTime,
                                end=pd.to_datetime('31.12.2019'),
                                freq=x.CycleOfRepricing),
        axis=1)).explode('ReportingTime')
Output:
| | id | CycleOfRepricingAnchorTime | CycleOfRepricing | ReportingTime |
|---:|-----:|:-----------------------------|:-------------------|:--------------------|
| 0 | 111 | 2018-04-27 00:00:00 | 3M | 2018-04-30 00:00:00 |
| 0 | 111 | 2018-04-27 00:00:00 | 3M | 2018-07-31 00:00:00 |
| 0 | 111 | 2018-04-27 00:00:00 | 3M | 2018-10-31 00:00:00 |
| 0 | 111 | 2018-04-27 00:00:00 | 3M | 2019-01-31 00:00:00 |
| 0 | 111 | 2018-04-27 00:00:00 | 3M | 2019-04-30 00:00:00 |
| 0 | 111 | 2018-04-27 00:00:00 | 3M | 2019-07-31 00:00:00 |
| 0 | 111 | 2018-04-27 00:00:00 | 3M | 2019-10-31 00:00:00 |
| 1 | 222 | 2018-09-06 00:00:00 | 5M | 2018-09-30 00:00:00 |
| 1 | 222 | 2018-09-06 00:00:00 | 5M | 2019-02-28 00:00:00 |
| 1 | 222 | 2018-09-06 00:00:00 | 5M | 2019-07-31 00:00:00 |
| 1 | 222 | 2018-09-06 00:00:00 | 5M | 2019-12-31 00:00:00 |
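If you only need the two-column result described in the question (id plus the generated dates), you can trim the exploded frame; a sketch reusing df from the question, with the final Date column name assumed:

# Reuse the generated ReportingTime column, keep only id + date, and rename
result = (df.assign(ReportingTime=df.apply(
                lambda x: pd.date_range(start=x.CycleOfRepricingAnchorTime,
                                        end=pd.to_datetime('31.12.2019'),
                                        freq=x.CycleOfRepricing),
                axis=1))
            .explode('ReportingTime')[['id', 'ReportingTime']]
            .rename(columns={'ReportingTime': 'Date'})
            .reset_index(drop=True))
print(result)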
Here's my solution:
def convert_my_df(dataframe, end):
    date, month, _id = ([] for _ in range(3))
    x = list(dataframe['CycleOfRepricingAnchorTime'])  # anchor dates
    y = list(dataframe['CycleOfRepricing'])             # frequencies, e.g. '3M'
    z = list(dataframe['id'])
    end = pd.to_datetime(end)
    for i in range(dataframe.shape[0]):
        while x[i] < end:
            _id.append(z[i])
            month.append(y[i])
            n_months = int(y[i][:-1])  # '3M' -> 3 (strip the trailing 'M')
            x[i] += pd.DateOffset(months=n_months)
            date.append(x[i])
    new_df = pd.DataFrame({'id': _id, 'CycleOfRepricingAnchorTime': date, 'CycleOfRepricing': month})
    new_df = new_df[new_df['CycleOfRepricingAnchorTime'] <= end]
    new_df = pd.concat([new_df, dataframe]).sort_values(['id', 'CycleOfRepricingAnchorTime'])
    return new_df
print(convert_my_df(df, '2019-12-31').to_markdown()) #to_markdown() added in pandas 1.0.0
| | id | CycleOfRepricingAnchorTime | CycleOfRepricing |
|---:|-----:|:-----------------------------|:-------------------|
| 0 | 111 | 2018-04-27 00:00:00 | 3M |
| 0 | 111 | 2018-07-27 00:00:00 | 3M |
| 1 | 111 | 2018-10-27 00:00:00 | 3M |
| 2 | 111 | 2019-01-27 00:00:00 | 3M |
| 3 | 111 | 2019-04-27 00:00:00 | 3M |
| 4 | 111 | 2019-07-27 00:00:00 | 3M |
| 5 | 111 | 2019-10-27 00:00:00 | 3M |
| 1 | 222 | 2018-09-06 00:00:00 | 5M |
| 7 | 222 | 2019-02-06 00:00:00 | 5M |
| 8 | 222 | 2019-07-06 00:00:00 | 5M |
| 9 | 222 | 2019-12-06 00:00:00 | 5M |
I have two data frames that are both multi-indexed on 'Date' and 'name', and want to do a SQL style JOIN to combine them. I've tried
pd.merge(df1.reset_index(), df2.reset_index(), on=['name', 'Date'], how='inner')
which then results in an empty DataFrame.
If I inspect the data frames I can see that the index of one is represented as '2015-01-01' and the other is represented as '2015-01-01 00:00:00' which explains my issues with joining.
Is there a way to 'recast' the index to a specific format within pandas?
I've included the tables to see what data I'm working with.
df1=
+-------------+------+------+------+
| Date | name | col1 | col2 |
+-------------+------+------+------+
| 2015-01-01 | mary | 12 | 123 |
| 2015-01-02 | mary | 23 | 33 |
| 2015-01-03 | mary | 34 | 45 |
| 2015-01-01 | john | 65 | 76 |
| 2015-01-02 | john | 67 | 78 |
| 2015-01-03 | john | 25 | 86 |
+-------------+------+------+------+
df2=
+------------+------+-------+-------+
| Date | name | col3 | col4 |
+------------+------+-------+-------+
| 2015-01-01 | mary | 80809 | 09885 |
| 2015-01-02 | mary | 53879 | 58972 |
| 2015-01-03 | mary | 23887 | 3908 |
| 2015-01-01 | john | 9238 | 2348 |
| 2015-01-02 | john | 234 | 234 |
| 2015-01-03 | john | 5325 | 6436 |
+------------+------+-------+-------+
Desired result:
+-------------+------+------+-------+-------+-------+
| Date | name | col1 | col2 | col3 | col4 |
+-------------+------+------+-------+-------+-------+
| 2015-01-01 | mary | 12 | 123 | 80809 | 09885 |
| 2015-01-02 | mary | 23 | 33 | 53879 | 58972 |
| 2015-01-03 | mary | 34 | 45 | 23887 | 3908 |
| 2015-01-01 | john | 65 | 76 | 9238 | 2348 |
| 2015-01-02 | john | 67 | 78 | 234 | 234 |
| 2015-01-03 | john | 25 | 86 | 5325 | 6436 |
+-------------+------+------+-------+-------+-------+
The reason you cannot join is that you have different dtypes on the indices. Pandas fails silently (you simply get no matches) if the indices have different dtypes.
You can easily change your indices from string representations of dates to proper pandas datetimes like this:
df = pd.DataFrame({"data":range(1,30)}, index=['2015-04-{}'.format(d) for d in range(1,30)])
df.index.dtype
dtype('O')
df.index = df.index.to_series().apply(pd.to_datetime)
df.index.dtype
dtype('<M8[ns]')
Now you can merge the dataframes on their index:
pd.merge(left=df, left_index=True,
right=df2, right_index=True)
Assuming you have a df2, which my example is omitting...
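For the multi-indexed frames in the question, a hedged end-to-end sketch (df1/df2 below are stand-ins built to mimic the tables above): convert the Date level of the string-dated frame with pd.to_datetime, then join on the shared index:

import pandas as pd

# Stand-ins shaped like the question's df1/df2: one with string dates, one
# already datetime, to reproduce the dtype mismatch
df1 = pd.DataFrame({'Date': ['2015-01-01', '2015-01-02', '2015-01-03'] * 2,
                    'name': ['mary'] * 3 + ['john'] * 3,
                    'col1': [12, 23, 34, 65, 67, 25],
                    'col2': [123, 33, 45, 76, 78, 86]}).set_index(['Date', 'name'])
df2 = pd.DataFrame({'Date': pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03'] * 2),
                    'name': ['mary'] * 3 + ['john'] * 3,
                    'col3': [80809, 53879, 23887, 9238, 234, 5325],
                    'col4': [9885, 58972, 3908, 2348, 234, 6436]}).set_index(['Date', 'name'])

# Rebuild df1's Date level as real datetimes so both indices share dtypes
df1.index = df1.index.set_levels(pd.to_datetime(df1.index.levels[0]), level='Date')

merged = df1.join(df2, how='inner')   # index-aligned inner join on (Date, name)
print(merged)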