Pandas unstack

I have a dataframe consisting of several medical measurements taken at different hours (from 1 to 12) and from different patients.
The data is organised by two indices, one corresponding to the patient number (pid) and one to the time of the measurements.
The measurements themselves are in the columns.
The dataframe looks like this:
             | Measurement1  | ... | Measurement35
pid  | Time  |               |     |
---------------------------------------------------
1    |  1    | Meas1#T1,pid1 |     | Meas35#T1,pid1
     |  2    | Meas1#T2,pid1 |     | Meas35#T2,pid1
     |  3    | ...           |     | ...
     | ...   |               |     |
     | 12    |               |     |
2    |  1    | Meas1#T1,pid2 |     | ...
     |  2    |               |     |
     |  3    |               |     |
     | ...   | ...           |     |
     | 12    |               |     |
...  |       |               |     |
9999 |  1    |               |     | ...
     |  2    |               |     |
     |  3    |               |     |
     | ...   |               |     | ...
     | 12    |               |     |
What I would like to get is one row per patient and one column for each combination of Time and measurement (so each pid row contains all the data for that patient):
     | Measurement1         | ... | Measurement35        |
pid  | T1 | T2 | ... | T12  |     | T1 | T2 | ... | T12  |
----------------------------------------------------------
1    |    |    |     |      |     |    |    |     |      |
2    |    |    |     |      |     |    |    |     |      |
...  |    |    |     |      |     |    |    |     |      |
9999 |    |    |     |      |     |    |    |     |      |
What I tried is DF.pivot(index='pid', columns='Time'), but I get 35 columns for each Measurement instead of the 12 I need (and the values in these 35 columns are sometimes shifted). The same happens with DF.unstack(1).
What am I missing?

You're missing the 'values' argument inside df.pivot.
# df example
import numpy as np
import pandas as pd

# one patient's measurements across the 12 time points
meas = ['val_time1', np.nan, 'val_time3', np.nan, np.nan, np.nan,
        'val_time7', 'val_time8', 'val_time9', np.nan, np.nan, 'val_time12']

df = pd.DataFrame({'pid': [1] * 12 + [2] * 12 + [3] * 12,
                   'Time': list(range(1, 13)) * 3,
                   'Measurement1': meas * 3,
                   'Measurement2': meas * 3})
Out:
pid Time Measurement1 Measurement2
0 1 1 val_time1 val_time1
1 1 2 NaN NaN
2 1 3 val_time3 val_time3
3 1 4 NaN NaN
4 1 5 NaN NaN
5 1 6 NaN NaN
6 1 7 val_time7 val_time7
7 1 8 val_time8 val_time8
8 1 9 val_time9 val_time9
9 1 10 NaN NaN
10 1 11 NaN NaN
11 1 12 val_time12 val_time12
12 2 1 val_time1 val_time1
13 2 2 NaN NaN
14 2 3 val_time3 val_time3
15 2 4 NaN NaN
Now pivot, specifying that we want the values from both columns, Measurement1 and Measurement2:
df_pivoted = df.pivot(index='pid', columns='Time', values=['Measurement1','Measurement2'])
Out:
Measurement1 ... Measurement2
Time 1 2 3 4 ... 9 10 11 12
pid ...
1 val_time1 NaN val_time3 NaN ... val_time9 NaN NaN val_time12
2 val_time1 NaN val_time3 NaN ... val_time9 NaN NaN val_time12
3 val_time1 NaN val_time3 NaN ... val_time9 NaN NaN val_time12
Check that there are 12 sub-columns under each Measurement group:
print(df_pivoted.columns.levels)
Out:
[['Measurement1', 'Measurement2'], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]]
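If you'd rather end up with one flat column per combination of measurement and time, the pivoted MultiIndex columns can also be flattened. A minimal sketch, assuming the df_pivoted from above (the Measurement1_T1 naming scheme is my own choice, not something the question specifies):
# flatten ('Measurement1', 1) -> 'Measurement1_T1'; the naming is illustrative
df_pivoted.columns = [f'{m}_T{t}' for m, t in df_pivoted.columns]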

Related

How to unmerge the features of a dataframe from one column into several columns separated by "\" via pandas?

More visually, I would like to move from this dataframe:
|    | A\B\C\D | Unnamed:1 | Unnamed:2 | Unnamed:3 | Unnamed:4 |
|---:|:--------|:----------|:----------|:----------|:----------|
| 0  | 1\2\3\4 | NaN       | NaN       | NaN       | NaN       |
| 1  | 1\2\3\4 | NaN       | NaN       | NaN       | NaN       |
| 2  | a\2\7\C | NaN       | NaN       | NaN       | NaN       |
| 3  | d\2\u\4 | NaN       | NaN       | NaN       | NaN       |
to this one:
|    | A | B | C | D |
|---:|:--|:--|:--|:--|
| 0  | 1 | 2 | 3 | 4 |
| 1  | 1 | 2 | 3 | 4 |
| 2  | a | 2 | 7 | C |
| 3  | d | 2 | u | 4 |
Thanks!
Try splitting the values first and then split the column name:
# split the packed column on the literal backslash
df2 = df.iloc[:, 0].str.split('\\', expand=True)
# reuse the packed header 'A\B\C\D' for the new column names
df2.columns = df.columns[0].split('\\')
df2
result:
A B C D
0 1 2 3 4
1 1 2 3 4
2 a 2 7 C
3 d 2 u 4
You can use DataFrame constructor:
out = pd.DataFrame(df.iloc[:, 0].str.split('\\').tolist(),
                   columns=df.columns[0].split('\\'))
print(out)
# Output
A B C D
0 1 2 3 4
1 1 2 3 4
2 a 2 7 C
3 d 2 u 4
The real question is: why do you have such an input in the first place? Are you reading your data from a CSV file without using the right separator?
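If that's the case, passing the separator straight to read_csv should avoid the problem entirely. A hedged sketch, assuming the raw file really is backslash-separated; the filename data.txt is a placeholder:
import pandas as pd
# r'\\' is a regex matching a single literal backslash; the python engine accepts regex separators
df = pd.read_csv('data.txt', sep=r'\\', engine='python')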

Averaging five rows above the value in the target column

The challenge I have, and don't know how to approach, is averaging the five, ten, or however many rows above the target value, plus the target row itself.
Dataset
target | A | B |
----------------------
nan | 6 | 4 |
nan | 2 | 7 |
nan | 4 | 9 |
nan | 7 | 3 |
nan | 3 | 7 |
nan | 6 | 8 |
nan | 7 | 6 |
53 | 4 | 5 |
nan | 6 | 4 |
nan | 2 | 7 |
nan | 3 | 3 |
nan | 4 | 9 |
nan | 7 | 3 |
nan | 3 | 7 |
51 | 1 | 3 |
Desired format:
target | A | B |
----------------------
53     | 5.16 | 6.33 |
51     | 3.33 | 5.33 |
Try this: the [::-1] reverses the row order so the dataframe runs bottom to top, which lets us group each valid target with the rows above it:
df.groupby(df['target'].notna()[::-1].cumsum()[::-1]).apply(lambda x: x.tail(6).mean())
Output:
target A B
target
1 51.0 3.333333 5.333333
2 53.0 5.166667 6.333333
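To average a different number of rows above each target, only the tail size changes. A sketch on the same df; the names n and grp are mine, not from the question:
n = 5  # rows above the target to include
grp = df['target'].notna()[::-1].cumsum()[::-1]
out = df.groupby(grp).apply(lambda g: g.tail(n + 1).mean())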

Fill duplicates with missing value after grouping with some logic

I have a dataframe. I need to drop the duplicates of ticket_id if the owner_type is the same; if not, pick 'm' over 's'. If no value is picked, a NaN is returned:
data = pd.DataFrame({'owner_type':['m','m','m','s','s','m','s','s'],'ticket_id':[1,1,2,2,3,3,4,4]})
|    | owner_type | ticket_id |
|---:|:-----------|----------:|
|  0 | m          |         1 |
|  1 | m          |         1 |
|  2 | m          |         2 |
|  3 | s          |         2 |
|  4 | s          |         3 |
|  5 | m          |         3 |
|  6 | s          |         4 |
|  7 | s          |         4 |
Should give back:
|    | owner_type | ticket_id |
|---:|:-----------|----------:|
|  0 | m          |       NaN |
|  1 | m          |       NaN |
|  2 | m          |         2 |
|  3 | s          |       NaN |
|  4 | s          |       NaN |
|  5 | m          |         3 |
|  6 | s          |       NaN |
|  7 | s          |       NaN |
Pseudo code would be like: if ticket_id is duplicated, look at owner_type; if owner_type has more than one value, return the value for 'm' and NaN for 's'.
My attempt
data.groupby('ticket_id').apply(lambda x: x['owner_type'] if len(x) < 2 else NaN)
Not working
Try this ('m' sorts before 's', so transform('min') picks 'm' whenever both owner types are present):
df['ticket_id'].where(
    ~df.duplicated(['owner_type', 'ticket_id'], keep=False) &
    df['owner_type'].eq(df.groupby('ticket_id')['owner_type'].transform('min')))
Old answer:
m = ~df.duplicated(keep=False) & df['owner_type'].eq('m')
df['ticket_id'].where(m)
Output:
0 NaN
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
6 NaN
7 NaN
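To reproduce the full frame from the question, the masked series can be assigned back in place; a minimal sketch reusing the expression above:
df['ticket_id'] = df['ticket_id'].where(
    ~df.duplicated(['owner_type', 'ticket_id'], keep=False) &
    df['owner_type'].eq(df.groupby('ticket_id')['owner_type'].transform('min')))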

Pandas Series - groupby and take cumulative most recent non-null

I have a dataframe with a Category column (which we will group by) and a Value column. I want to add a new column LastCleanValue which shows the most recent non-null value for the group. If there have not been any non-nulls yet in the group, we just take null. For example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Category': ['a','a','a','b','b','a','a','b','a','a','b'],
                   'Value': [np.nan, np.nan, 34, 40, 42, 25,
                             np.nan, np.nan, 31, 33, np.nan]})
And the function should add a new column:
| | Category | Value | LastCleanValue |
|---:|:-----------|--------:|-----------------:|
| 0 | a | nan | nan |
| 1 | a | nan | nan |
| 2 | a | 34 | 34 |
| 3 | b | 40 | 40 |
| 4 | b | 42 | 42 |
| 5 | a | 25 | 25 |
| 6 | a | nan | 25 |
| 7 | b | nan | 42 |
| 8 | a | 31 | 31 |
| 9 | a | 33 | 33 |
| 10 | b | nan | 42 |
How can I do this in Pandas? I was attempting something like df.groupby('Category')['Value'].dropna().last()
This is more like ffill:
df['new'] = df.groupby('Category')['Value'].ffill()
Out:
0 NaN
1 NaN
2 34.0
3 40.0
4 42.0
5 25.0
6 25.0
7 42.0
8 31.0
9 33.0
10 42.0
Name: Value, dtype: float64
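To match the column name asked for in the question, the same ffill result can be assigned directly:
df['LastCleanValue'] = df.groupby('Category')['Value'].ffill()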

append pandas dataframe to column

I'm stuck and need some help. I have the following dataframe:
+-----+---+---+
|     | A | B |
+-----+---+---+
| 288 | 1 | 4 |
+-----+---+---+
| 245 | 2 | 3 |
+-----+---+---+
| 543 | 3 | 6 |
+-----+---+---+
| 867 | 1 | 9 |
+-----+---+---+
| 345 | 2 | 7 |
+-----+---+---+
| 122 | 3 | 8 |
+-----+---+---+
| 233 | 1 | 1 |
+-----+---+---+
| 346 | 2 | 6 |
+-----+---+---+
| 765 | 3 | 3 |
+-----+---+---+
Column A has repeating values as shown. Every time a value repeats in column A, I want to append a new column C containing the column B value from that value's next occurrence, as shown below:
+-----+---+---+-----+
| | A | B | C |
+-----+---+---+-----+
| 288 | 1 | 4 | 9 |
+-----+---+---+-----+
| 245 | 2 | 3 | 7 |
+-----+---+---+-----+
| 543 | 3 | 6 | 8 |
+-----+---+---+-----+
| 867 | 1 | 9 | 1 |
+-----+---+---+-----+
| 345 | 2 | 7 | 6 |
+-----+---+---+-----+
| 122 | 3 | 8 | 3 |
+-----+---+---+-----+
| 233 | 1 | 1 | NaN |
+-----+---+---+-----+
| 346 | 2 | 6 | NaN |
+-----+---+---+-----+
| 765 | 3 | 3 | NaN |
+-----+---+---+-----+
Thanks.
Assuming that val is one of the repeated values,
shifted = df.loc[df.A == val, 'B'].shift(-1)
will create a Series of those B values shifted up one position within the slice, still indexed by the original row labels.
Since the slices' index labels never overlap, you can use pandas.concat to stitch them together without fear of losing data. Then just attach the result as a new column:
df['C'] = pd.concat([df.loc[df['A'] == x, 'B'].shift(-1) for x in [1, 2, 3]])
When the column is assigned, the index values will make everything line up:
A B C
0 1 4 9.0
1 2 3 7.0
2 3 6 8.0
3 1 9 1.0
4 2 7 6.0
5 3 8 3.0
6 1 1 NaN
7 2 6 NaN
8 3 3 NaN
Reverse the dataframe order, apply a groupby transform with shift, and reverse it back:
df = df[::-1]
df['C'] = df.groupby(df.columns[0]).transform('shift')
df = df[::-1]
df
A B C
0 1 4 9.0
1 2 3 7.0
2 3 6 8.0
3 1 9 1.0
4 2 7 6.0
5 3 8 3.0
6 1 1 NaN
7 2 6 NaN
8 3 3 NaN
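For what it's worth, the double reversal can be avoided by shifting upward within each group; a minimal sketch on the same frame:
# shift(-1) pulls the next same-A row's B value up into C
df['C'] = df.groupby('A')['B'].shift(-1)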
