Pandas Series - groupby and take cumulative most recent non-null - python

I have a dataframe with a Category column (which we will group by) and a Value column. I want to add a new column LastCleanValue which shows the most recent non-null value for the group, up to and including the current row. If there have not been any non-nulls yet in the group, we just take null. For example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Category':['a','a','a','b','b','a','a','b','a','a','b'],
                   'Value':[np.nan, np.nan, 34, 40, 42, 25, np.nan, np.nan, 31, 33, np.nan]})
And the function should add a new column:
| | Category | Value | LastCleanValue |
|---:|:-----------|--------:|-----------------:|
| 0 | a | nan | nan |
| 1 | a | nan | nan |
| 2 | a | 34 | 34 |
| 3 | b | 40 | 40 |
| 4 | b | 42 | 42 |
| 5 | a | 25 | 25 |
| 6 | a | nan | 25 |
| 7 | b | nan | 42 |
| 8 | a | 31 | 31 |
| 9 | a | 33 | 33 |
| 10 | b | nan | 42 |
How can I do this in Pandas? I was attempting something like df.groupby('Category')['Value'].dropna().last()

This is more like ffill within each group:
df['new'] = df.groupby('Category')['Value'].ffill()
Out[430]:
0 NaN
1 NaN
2 34.0
3 40.0
4 42.0
5 25.0
6 25.0
7 42.0
8 31.0
9 33.0
10 42.0
Name: Value, dtype: float64
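
To get exactly the column asked for, the same groupby ffill can be assigned under the name LastCleanValue (a minimal sketch using the df built above):
# forward-fill within each Category group and store it under the requested name
df['LastCleanValue'] = df.groupby('Category')['Value'].ffill()
print(df)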

Related

Averaging five rows above the value in the target column

The challenge that I have, and don't know how to approach, is to average the five, ten, or however many rows above each target value, plus the target row itself.
Dataset
target | A | B |
----------------------
nan | 6 | 4 |
nan | 2 | 7 |
nan | 4 | 9 |
nan | 7 | 3 |
nan | 3 | 7 |
nan | 6 | 8 |
nan | 7 | 6 |
53 | 4 | 5 |
nan | 6 | 4 |
nan | 2 | 7 |
nan | 3 | 3 |
nan | 4 | 9 |
nan | 7 | 3 |
nan | 3 | 7 |
51 | 1 | 3 |
Desired format:
target |    A |    B |
----------------------
    53 | 5.16 | 6.33 |
    51 | 3.33 | 5.33 |
Try this: [::-1] reverses the elements so the dataframe is ordered bottom to top, which lets us group the values "above" each valid target:
df.groupby(df['target'].notna()[::-1].cumsum()[::-1]).apply(lambda x: x.tail(6).mean())
Output:
target A B
target
1 51.0 3.333333 5.333333
2 53.0 5.166667 6.333333
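
If the window size needs to vary (five, ten, or however many rows above the target), the same idea can be wrapped in a small helper. A sketch, where mean_above and n are illustrative names, not part of the original answer:
# group each valid target with the rows above it, then average the last n rows above it plus the target row
def mean_above(df, n=5):
    groups = df['target'].notna()[::-1].cumsum()[::-1]
    return df.groupby(groups).apply(lambda g: g.tail(n + 1).mean())

print(mean_above(df, n=5))   # reproduces the output above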

arranging data by date (month/day format)

After appending 4 different dataframes to:
list_1 = []
I have the following data stored in list_1:
| date | 16/17 |
| -------- | ------|
| 2016-12-29 | 50 |
| 2016-12-30 | 52 |
| 2017-01-01 | 53 |
| 2017-01-02 | 51 |
| date | 17/18 |
| -------- | ------|
| 2017-12-29 | 60 |
| 2017-12-31 | 62 |
| 2018-01-01 | 64 |
| 2018-01-03 | 65 |
| date | 18/19 |
| -------- | ------|
| 2018-12-30 | 54 |
| 2018-12-31 | 53 |
| 2019-01-02 | 52 |
| 2019-01-03 | 51 |
| date | 19/20 |
| -------- | ------|
| 2019-12-29 | 62 |
| 2019-12-30 | 63 |
| 2020-01-01 | 62 |
| 2020-01-02 | 60 |
For changing the date format to month/day I use the following code:
pd.to_datetime(df['date']).dt.strftime('%m/%d')
But the problem is when I want to arrange the data by month/day like this:
| date | 16/17 | 17/18 | 18/19 | 19/20 |
| -------- | ------| ------| ------| ------|
| 12/29 | 50 | 60 | NaN | 62 |
| 12/30 | 52 | NaN | 54 | 63 |
| 12/31 | NaN | 62 | 53 | NaN |
| 01/01 | 53 | 64 | NaN | 62 |
| 01/02 | 51 | NaN | 52 | 60 |
| 01/03 | NaN | 65 | 51 | NaN |
I've tried the following:
df = pd.concat(list_1,axis=1)
also:
df = pd.concat(list_1)
df.reset_index(inplace=True)
df = df.groupby(['date']).first()
also:
df = pd.concat(list_1)
df.reset_index(inplace=True)
df = df.groupby(['date'], sort=False).first()
but still cannot achieve the desired result.
You can use sort=False in groupby and create a new column holding each date's offset from the first value of its DatetimeIndex, then use it for sorting:
def f(x):
    x.index = pd.to_datetime(x.index)
    return x.assign(new=x.index - x.index.min())

L = [x.pipe(f) for x in list_1]
df = pd.concat(L, axis=0).sort_values('new', kind='mergesort')
df = df.groupby(df.index.strftime('%m/%d'), sort=False).first().drop('new', axis=1)
print(df)
16/17 17/18 18/19 19/20
date
12/29 50.0 60.0 NaN 62.0
12/30 52.0 NaN 54.0 63.0
12/31 NaN 62.0 53.0 NaN
01/01 53.0 64.0 NaN 62.0
01/02 51.0 NaN 52.0 60.0
01/03 NaN 65.0 51.0 NaN
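
For reference, list_1 can be reconstructed from the tables in the question so the snippet above runs end to end (a sketch; the real list_1 comes from the OP's own append loop):
import pandas as pd

list_1 = [
    pd.DataFrame({'16/17': [50, 52, 53, 51]},
                 index=pd.Index(['2016-12-29', '2016-12-30', '2017-01-01', '2017-01-02'], name='date')),
    pd.DataFrame({'17/18': [60, 62, 64, 65]},
                 index=pd.Index(['2017-12-29', '2017-12-31', '2018-01-01', '2018-01-03'], name='date')),
    pd.DataFrame({'18/19': [54, 53, 52, 51]},
                 index=pd.Index(['2018-12-30', '2018-12-31', '2019-01-02', '2019-01-03'], name='date')),
    pd.DataFrame({'19/20': [62, 63, 62, 60]},
                 index=pd.Index(['2019-12-29', '2019-12-30', '2020-01-01', '2020-01-02'], name='date')),
]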

Getting first/second/third... value in row of numpy array after nan using vectorization

I have the following pandas df:
| Date | GB | US | CA | AU | SG | DE | FR |
| ---- | -- | -- | -- | -- | -- | -- | -- |
| 1 | 25 | | | | | | |
| 2 | 29 | | | | | | |
| 3 | 33 | | | | | | |
| 4 | 31 | 35 | | | | | |
| 5 | 30 | 34 | | | | | |
| 6 | | 35 | 34 | | | | |
| 7 | | 31 | 26 | | | | |
| 8 | | 33 | 25 | 31 | | | |
| 9 | | | 26 | 31 | | | |
| 10 | | | 27 | 26 | 28 | | |
| 11 | | | 35 | 25 | 29 | | |
| 12 | | | | 33 | 35 | 28 | |
| 13 | | | | 28 | 25 | 35 | |
| 14 | | | | 25 | 25 | 28 | |
| 15 | | | | 25 | 26 | 31 | 25 |
| 16 | | | | | 26 | 31 | 27 |
| 17 | | | | | 34 | 29 | 25 |
| 18 | | | | | 28 | 29 | 31 |
| 19 | | | | | | 34 | 26 |
| 20 | | | | | | 28 | 30 |
I have partly accomplished what I am trying to do here using Pandas alone, but the process takes ages, so I am having to use numpy (see Getting the nearest values to the left in a pandas column) and that is where I am struggling.
Essentially, I want my function f, which takes an integer argument offset, to capture the first non-NaN value for each row from the left and return the whole thing as a numpy array/vector, so that:
f(offset=0)
| 0 | 1 |
| -- | -- |
| 1 | 25 |
| 2 | 29 |
| 3 | 33 |
| 4 | 31 |
| 5 | 30 |
| 6 | 35 |
| 7 | 31 |
| 8 | 33 |
| 9 | 26 |
| 10 | 27 |
| 11 | 35 |
| 12 | 33 |
| 13 | 28 |
| 14 | 25 |
| 15 | 25 |
| 16 | 26 |
| 17 | 34 |
| 18 | 28 |
| 19 | 34 |
| 20 | 28 |
As I have described in the other post, it's best to imagine a horizontal line being drawn from the left for each row, and returning the values intersected by that line as an array. offset=0 then returns the first value in that array, offset=1 returns the second value intersected, and so on.
Therefore:
f(offset=1)
| 0 | 1 |
| -- | --- |
| 1 | nan |
| 2 | nan |
| 3 | nan |
| 4 | 35 |
| 5 | 34 |
| 6 | 34 |
| 7 | 26 |
| 8 | 25 |
| 9 | 31 |
| 10 | 26 |
| 11 | 25 |
| 12 | 35 |
| 13 | 25 |
| 14 | 25 |
| 15 | 26 |
| 16 | 31 |
| 17 | 29 |
| 18 | 29 |
| 19 | 26 |
| 20 | 30 |
The pandas solution proposed in the post above is very effective:
def f(df, offset=0):
    x = df.iloc[:, 0:].apply(lambda x: sorted(x, key=pd.isna)[offset], axis=1)
    return x

print(f(df, 1))
However, this is very slow with larger inputs. I have tried this with np.apply_along_axis and it's even slower!
Is there a faster way with numpy vectorization?
Many thanks.
Numpy approach
We can define a function first_valid which takes a 2D array and an offset n as input arguments and returns a 1D array. For each row it returns the value n positions after the first non-NaN value (n=0 gives that first non-NaN value itself).
import numpy as np

def first_valid(arr, offset=0):
    m = ~np.isnan(arr)                                 # mask of valid (non-NaN) entries
    i = m.argmax(axis=1) + offset                      # column of the first valid value, shifted by offset
    iy = np.clip(i, 0, arr.shape[1] - 1)               # keep the index within bounds
    vals = arr[np.r_[:arr.shape[0]], iy]
    vals[(~m.any(1)) | (i >= arr.shape[1])] = np.nan   # rows with no valid value, or offset past the end
    return vals
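Assuming Date is the index (as in the sample run further below), the numpy version is called on the underlying float array, for example:
# wrap the result back into a Series aligned with the original index
out = pd.Series(first_valid(df.to_numpy(dtype=float), offset=1), index=df.index)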
Pandas approach
We can stack the dataframe to reshape it (stack drops the NaNs), then group on level=0, aggregate using nth, and reindex so the aggregated result conforms to the original frame's index:
def first_valid(df, offset=0):
    return (df.stack().groupby(level=0)
              .nth(offset).reindex(df.index))
Sample run
>>> first_valid(df, 0)
Date
1 25.0
2 29.0
3 33.0
4 31.0
5 30.0
6 35.0
7 31.0
8 33.0
9 26.0
10 27.0
11 35.0
12 33.0
13 28.0
14 25.0
15 25.0
16 26.0
17 34.0
18 28.0
19 34.0
20 28.0
dtype: float64
>>> first_valid(df, 1)
Date
1 NaN
2 NaN
3 NaN
4 35.0
5 34.0
6 34.0
7 26.0
8 25.0
9 31.0
10 26.0
11 25.0
12 35.0
13 25.0
14 25.0
15 26.0
16 31.0
17 29.0
18 29.0
19 26.0
20 30.0
dtype: float64
>>> first_valid(df, 2)
Date
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 31.0
9 NaN
10 28.0
11 29.0
12 28.0
13 35.0
14 28.0
15 31.0
16 27.0
17 25.0
18 31.0
19 NaN
20 NaN
dtype: float64
Performance
# Sample dataframe for testing purposes
df_test = pd.concat([df] * 10000, ignore_index=True)
%%timeit # Numpy approach
_ = first_valid(df_test.to_numpy(), 1)
# 6.9 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit # Pandas approach
_ = first_valid(df_test, 1)
# 90 ms ± 867 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit # OP's approach
_ = f(df_test, 1)
# 2.03 s ± 183 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The numpy-based approach is approximately 300x faster than the OP's approach, while the pandas-based approach is approximately 22x faster.
This will do it for offset=0, and should be efficient:
df.unstack().groupby('Date').first()
I thought .nth(1) would work for offset=1 and so on, but it doesn't. This is left as an exercise for the reader.
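A likely reason is that, unlike first(), nth() counts NaN rows too, and unstack keeps them; dropping the NaNs before grouping (which is effectively what the stack-based answer above does) makes the offset work. A sketch following that answer's nth-then-reindex pattern, with nth_valid as an illustrative name:
def nth_valid(df, offset=0):
    s = df.unstack().dropna()      # drop NaNs so nth() counts only valid values
    return (s.groupby('Date')
             .nth(offset)
             .reindex(df.index))   # restore Dates with too few valid values as NaN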

Pandas.Unstack Python

I have a dataframe consisting of several medical measurements taken at different hours (from 1 to 12) and from different patients.
The data is organised by two indices, one corresponding to the patient number (pid) and one to the time of the measurements.
The measurements themselves are in the columns.
The dataframe looks like this:
            | Measurement1   | ... | Measurement35
pid  | Time |                |     |
-------------------------------------------------------
1    | 1    | Meas1#T1,pid1  |     | Meas35#T1,pid1
     | 2    | Meas1#T2,pid1  |     | Meas35#T2,pid1
     | 3    | ...            |     | ...
     | ...  |                |     |
     | 12   |                |     |
2    | 1    | Meas1#T1,pid2  |     | ...
     | 2    |                |     |
     | ...  |                |     |
     | 12   |                |     |
...  |      |                |     |
9999 | 1    |                |     | ...
     | 2    |                |     |
     | ...  |                |     |
     | 12   |                |     |
And what I would like to get is one row for each patient and one column for each combination of Time and measurement (so each pid row contains all the data for that patient):
     | Measurement1           | ... | Measurement35          |
pid  | T1 | T2 | ... | T12    |     | T1 | T2 | ... | T12    |
---------------------------------------------------------------
1    |    |    |     |        |     |    |    |     |        |
2    |    |    |     |        |     |    |    |     |        |
...  |    |    |     |        |     |    |    |     |        |
9999 |    |    |     |        |     |    |    |     |        |
What I tried is DF.pivot(index='pid', columns='Time'), but I get 35 columns for each Measurement instead of the 12 columns I need (and the values in these 35 columns are sometimes shifted). Something similar happens with DF.unstack(1).
What am I missing?
You're missing the argument 'values' inside df.pivot
# df example (built as a DataFrame so df.pivot below works)
df = pd.DataFrame({
    'pid': [1 for _ in range(12)] + [2 for _ in range(12)] + [3 for _ in range(12)],
    'Time': [x + 1 for x in range(12)] * 3,
    'Measurement1': ['val_time1', np.nan, 'val_time3', np.nan, np.nan, np.nan,
                     'val_time7', 'val_time8', 'val_time9', np.nan, np.nan, 'val_time12'] * 3,
    'Measurement2': ['val_time1', np.nan, 'val_time3', np.nan, np.nan, np.nan,
                     'val_time7', 'val_time8', 'val_time9', np.nan, np.nan, 'val_time12'] * 3,
})
Out:
pid Time Measurement1 Measurement2
0 1 1 val_time1 val_time1
1 1 2 NaN NaN
2 1 3 val_time3 val_time3
3 1 4 NaN NaN
4 1 5 NaN NaN
5 1 6 NaN NaN
6 1 7 val_time7 val_time7
7 1 8 val_time8 val_time8
8 1 9 val_time9 val_time9
9 1 10 NaN NaN
10 1 11 NaN NaN
11 1 12 val_time12 val_time12
12 2 1 val_time1 val_time1
13 2 2 NaN NaN
14 2 3 val_time3 val_time3
15 2 4 NaN NaN
Pivot, specifying that we want to use the values of both columns, Measurement1 and Measurement2:
df_pivoted = df.pivot(index='pid', columns='Time', values=['Measurement1','Measurement2'])
Out:
Measurement1 ... Measurement2
Time 1 2 3 4 ... 9 10 11 12
pid ...
1 val_time1 NaN val_time3 NaN ... val_time9 NaN NaN val_time12
2 val_time1 NaN val_time3 NaN ... val_time9 NaN NaN val_time12
3 val_time1 NaN val_time3 NaN ... val_time9 NaN NaN val_time12
Check to see if we have 12 sub columns for each Measurement group:
print(df_pivoted.columns.levels)
Out:
[['Measurement1', 'Measurement2'], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]]
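If a completely flat layout is needed (one plain column per Measurement/Time combination rather than the two-level header), the MultiIndex columns can be flattened afterwards, for example:
# join the (Measurement, Time) levels into single labels such as 'Measurement1_T3'
df_pivoted.columns = [f'{meas}_T{t}' for meas, t in df_pivoted.columns]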

How do you specify pandas groupby operations that operate on previous records?

I have a Pandas dataframe as follows, which has to be sorted by Col_2:
+----+-------+-------+
| id | Col_1 | Col_2 |
+----+-------+-------+
| 1 | 0 | 21 |
| 1 | 1 | 24 |
| 1 | 1 | 32 |
| 1 | 0 | 35 |
| 1 | 1 | 37 |
| 2 | 0 | 2 |
| 2 | 0 | 5 |
+----+-------+-------+
How can I create two new columns:
Col_1_Sum: the sum of Col_1 over the previous rows, for each id.
Col_2_Max: the max value of Col_2 over the previous rows in which Col_1 was 1, for each id.
For example for above dataframe the result should be:
+----+-------+-------+-----------+-----------+
| id | Col_1 | Col_2 | Col_1_Sum | Col_2_Max |
+----+-------+-------+-----------+-----------+
| 1 | 0 | 21 | 0 | 0 |
| 1 | 1 | 24 | 0 | 0 |
| 1 | 1 | 32 | 1 | 24 |
| 1 | 0 | 35 | 2 | 32 |
| 1 | 1 | 37 | 2 | 32 |
| 2 | 0 | 2 | 0 | 0 |
| 2 | 0 | 5 | 0 | 0 |
+----+-------+-------+-----------+-----------+
You have two questions. One at a time.
Your first question is answered with groupby, shift, and cumsum:
df.groupby('id').Col_1.apply(lambda x: x.shift().cumsum())
0 NaN
1 0.0
2 1.0
3 2.0
4 2.0
5 NaN
6 0.0
Name: Col_1, dtype: float64
Or, if you prefer cleaner output,
df.groupby('id').Col_1.apply(lambda x: x.shift().cumsum()).fillna(0).astype(int)
0 0
1 0
2 1
3 2
4 2
5 0
6 0
Name: Col_1, dtype: int64
Your second is similar, using groupby, shift, cummax, and ffill:
df.Col_2.where(df.Col_1.eq(1)).groupby(df.id).apply(
    lambda x: x.shift().cummax().ffill()
)
0 NaN
1 NaN
2 24.0
3 32.0
4 32.0
5 NaN
6 NaN
Name: Col_2, dtype: float64
In both cases, the essential ingredient is a groupby followed by a shift call. Note that these problems are difficult to solve without apply because multiple operations have to be carried out on each sub-group.
Consider taking the lambda out by defining a custom function. You'll save a few cycles on larger data.
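A sketch of that refactor, mirroring the apply calls above (the function names are illustrative, and fillna(0) matches the zeros in the expected output):
def shifted_cumsum(s):
    # sum of the previous rows within the group
    return s.shift().cumsum()

def shifted_cummax(s):
    # running max of the previous rows, carried forward over gaps
    return s.shift().cummax().ffill()

df['Col_1_Sum'] = (df.groupby('id').Col_1
                     .apply(shifted_cumsum)
                     .fillna(0).astype(int))
df['Col_2_Max'] = (df.Col_2.where(df.Col_1.eq(1))
                     .groupby(df.id)
                     .apply(shifted_cummax)
                     .fillna(0))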
