How to create an average distribution of nested groupby data in pandas?

I have a dataframe, df1, from which I am trying to extract and average distributions:
ID site timestamp tracking_value
0 03 AMF 2018-01-01 1.0
1 08 AMF 2018-01-01 1.0
2 09 AMF 2018-01-01 1.0
3 14 ARR 2018-01-01 0.0
4 16 ARR 2018-01-01 0.0
5 21 AZM 2018-01-01 0.0
6 22 BII 2018-01-01 0.0
7 23 ARR 2018-01-01 0.0
8 26 AZM 2018-01-01 1.0
9 27 AMF 2018-01-01 1.0
...
...
For each ID group, and for each site within that ID group, I want to get the distribution of lengths of consecutive tracking values. Then I want to average those per-site distributions for the ID, to produce the distribution of lengths of time that tracking_value was a dropout (0.0).
I have this working without the second group by (group by site), for only one ID:
import more_itertools as mit
import seaborn as sns
id = '03'
# Get the tracking_value data for ID 03
data = df1[df1['ID'] == id]['tracking_value']
# Get the "run length" for each value in data
distribution_of_run_lengths = list(mit.run_length.encode(data))
# Keep the distribution of run lengths for only the 0.0 values
distribution_of_run_lengths_for_zero = [x[1] for x in distribution_of_run_lengths if x[0] == 0.0]
# Plot the counts of run_lengths for each run_length value
sns.countplot(distribution_of_run_lengths_for_zero)
This works, but only for a single ID. The plot shows the number of times (y-axis) each dropout length (x-axis) occurred for ID 03.
However, I need to extend this as described above. I have started by grouping by ID and then site, but I am stuck on where to go from there:
data = df1.groupby(['ID','site'])['tracking_value']
Any suggestions on a way forward would be helpful. Thanks.

The following should do what you're looking for. Setup:
import numpy as np
import pandas as pd

dates = np.repeat(pd.date_range('2018-01-01', '2018-01-31'), 4)
np.random.seed(100)
test_df = pd.DataFrame({
    'date': dates,
    'site': ['A', 'A', 'B', 'B'] * (dates.shape[0] // 4),
    'id': [1, 2, 1, 2] * (dates.shape[0] // 4),
    'tracking_val': np.random.choice([0, 1], p=[0.4, 0.6], size=dates.shape)
})
Now perform the (many) groupby aggregations necessary to get what you want:
run_length_dict = {}  # place to hold results
for name, group in test_df.groupby(['site', 'id']):
    # Number all consecutive runs of 1s or 0s
    runs = (group['tracking_val']
            .ne(group['tracking_val'].shift())
            .cumsum()
            .rename('run_number'))
    # Group by the run number *and* the tracking value, then take the size of each run
    run_lengths = runs.groupby([runs, group['tracking_val']]).agg('size')
    # One final groupby (this time on the tracking_val, level=1) to count how often each
    # run length occurs, keyed by the name of the original group - a ("site", "id") tuple
    run_length_dict[name] = run_lengths.groupby(level=1).value_counts()
Result:
{('A', 1): tracking_val
0 1 2
2 1
3 1
4 1
5 1
1 1 3
2 3
6 1
dtype: int64, ('A', 2): tracking_val
0 1 5
2 2
3 1
4 1
1 1 6
2 1
3 1
4 1
dtype: int64, ('B', 1): tracking_val
0 1 6
2 2
1 2 4
1 2
4 2
3 1
dtype: int64, ('B', 2): tracking_val
0 1 5
2 2
3 2
1 1 5
2 2
3 1
4 1
dtype: int64}
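To finish the original question (averaging the per-site distributions for each id), one possible follow-up, as a sketch; here "average" is interpreted as the mean count of each (tracking_val, run length) pair across that id's sites, with lengths a site never produced counted as 0:
# Combine the per-(site, id) Series into one long Series
all_counts = pd.concat(run_length_dict, names=['site', 'id'])
# Index levels are now (site, id, tracking_val, run_length); put the last
# two on the columns so that each row is one (site, id) pair
wide = all_counts.unstack(level=[2, 3], fill_value=0)
# Average over sites for each id
avg_per_id = wide.groupby(level='id').mean()
# e.g. the averaged distribution of dropout (0) run lengths for id 1
print(avg_per_id.loc[1].xs(0))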

Related

Count votes of a survey by the answer

I'm working with a survey about income. My data looks like this:
form Survey1 Survey2 Country
0 1 1 1 1
1 2 1 2 5
2 3 2 2 4
3 4 2 1 1
4 5 2 2 4
I want to group by the answer and by the Country. For example, if Survey2 refers to the number of cars owned by the respondent, I want to know the number of people who own one car in a given country.
The expected output is as follows:
Country Survey1_1 Survey1_2 Survey2_1 Survey2_2
0 1 1 1 2 0
1 4 0 2 0 2
2 5 1 0 0 1
Here I added '_#' where # is the answer to count.
So far I've written code that finds the distinct answers for each column and counts the answers equal to a given value, say 1, but I haven't found a way to count the answers for a specific country.
number_unic = df.iloc[:, j + ci].nunique()      # number of unique answers
vals_unic = list(df.iloc[:, j + ci].unique())   # the unique answers
for i in range(len(vals_unic)):
    name = str(df.columns[j + ci]) + '_' + str(vals_unic[i])   # name of the new column
    count = (df.iloc[:, j + ci] == vals_unic[i]).sum()         # count the values equal to this answer
    df.insert(len(df.columns.values), name, count)             # insert the new column
I would do this with a pivot_table:
In [11]: df.pivot_table(["Survey1", "Survey2"], ["Country"], df.groupby("Country").cumcount())
Out[11]:
Survey1 Survey2
0 1 0 1
Country
1 1.0 2.0 1.0 1.0
4 2.0 2.0 2.0 2.0
5 1.0 NaN 2.0 NaN
To get the output you wanted you could do something like:
In [21]: res = df.pivot_table(["Survey1", "Survey2"], ["Country"], df.groupby("Country").cumcount())
In [22]: res.columns = [s + "_" + str(n + 1) for s, n in res.columns.values]
In [23]: res
Out[23]:
Survey1_1 Survey1_2 Survey2_1 Survey2_2
Country
1 1.0 2.0 1.0 1.0
4 2.0 2.0 2.0 2.0
5 1.0 NaN 2.0 NaN
But, generally it's better to use the MultiIndex here...
To count the number of each response, you can do this somewhat more complicated groupby and value_counts:
In [31]: df1 = df.set_index("Country")[["Survey1", "Survey2"]] # more columns work fine here
In [32]: df1.unstack().groupby(level=[0, 1]).value_counts().unstack(level=0, fill_value=0).unstack(fill_value=0)
Out[32]:
Survey1 Survey2
1 2 1 2
Country
1 1 1 2 0
4 0 2 0 2
5 1 0 0 1

Filtering a DataFrame for IDs whose values are decreasing over time

I have a large time series dataset of patient results. A single patient has one ID with various result values. The data is sorted by date and ID. I want to look only at patients whose values are strictly descending over time. For example, a patient with result values 5, 3, 2, 1 would qualify, whereas 5, 3, 6, 7, 1 would not.
Example data:
import pandas as pd
df = pd.read_excel(...)
print(df.head())
PSA PSAdate PatientID ... datefirstinject ADTkey RT_PSAbin
0 2.40 2007-06-26 11448 ... 2006-08-05 00:00:00 1 14
1 0.04 2007-09-26 11448 ... 2006-08-05 00:00:00 1 15
2 2.30 2008-01-14 11448 ... 2006-08-05 00:00:00 1 17
3 4.03 2008-04-16 11448 ... 2006-08-05 00:00:00 1 18
4 6.70 2008-07-01 11448 ... 2006-08-05 00:00:00 1 19
So for this example, I want to only see lines with PatientIDs for which the PSA Value is decreasing over time.
groupID = df.groupby('PatientID')

def is_desc(d):
    for i in range(len(d) - 1):
        if d[i] > d[i+1]:
            return False
    return True

x = groupID.PSA.apply(is_desc)
df['is_desc'] = groupID.PSA.transform(is_desc)
# patients whose PSA values are decreasing over time
df1 = df[df['is_desc']]
I get:
KeyError: 0
I suppose the loop can't make its way through the grouped values, as it requires an array to compute the range.
Any ideas for editing the loop?
TL;DR
# (see is_desc function definition below)
df['is_desc'] = df.groupby('PatientID').PSA.transform(is_desc)
df[df['is_desc']]
Explanation
Let's use a very simple data set:
df = pd.DataFrame({'id': [1,2,1,3,3,1], 'res': [3,1,2,1,5,1]})
It only contains the id and one value column (and it has an index automatically assigned from pandas).
So if you just want to get a list of all ids whose values are descending, we can group the values by the id, then check if the values in the group are descending, then filter the list for just ids with descending values.
So first let's define a function that checks if the values are descending:
def is_desc(d):
    first = True
    for i in d:
        if first:
            first = False
        else:
            if i >= last:
                return False
        last = i
    return True
(yes, this could probably be done more elegantly, you can search online for a better implementation)
now we group by the id:
gb = df.groupby('id')
and apply the function:
x = gb.res.apply(is_desc)
x now holds this Series:
id
1 True
2 True
3 False
dtype: bool
so now if you want to filter this you can just do this:
x[x].index
which you can of course convert to a normal list like that:
list(x[x].index)
which would give you a list of all ids whose values are descending. In this case:
[1, 2]
But if you want to also keep all the original data for those chosen ids, do it like this:
df['is_desc'] = gb.res.transform(is_desc)
Now df has all the original data it had in the beginning, plus a column that tells, for each line, whether its id's values are descending:
id res is_desc
0 1 3 True
1 2 1 True
2 1 2 True
3 3 1 False
4 3 5 False
5 1 1 True
Now you can very easily filter this like that:
df[df['is_desc']]
which is:
id res is_desc
0 1 3 True
1 2 1 True
2 1 2 True
5 1 1 True
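As an aside (not part of the answer above): pandas has a built-in Series.is_monotonic_decreasing property that can replace the hand-written is_desc. Note it is non-strict (it allows equal consecutive values), whereas is_desc above rejects them; a strict check could additionally require s.is_unique. A minimal sketch on the same toy df:
# Use the built-in (non-strict) monotonicity check per group
df['is_desc'] = df.groupby('id')['res'].transform(
    lambda s: s.is_monotonic_decreasing)
df[df['is_desc']]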
Selecting and sorting your data is quite easy and objective. However, deciding whether or not a patient's data is declining can be subjective, so it is best to decide on a criterion beforehand for judging whether their data is declining.
To sort and select:
import pandas as pd
data = [['pat_1', 10, 1],
['pat_1', 9, 2],
['pat_2', 11, 2],
['pat_1', 4, 5],
['pat_1', 2, 6],
['pat_2', 10, 1],
['pat_1', 7, 3],
['pat_1', 5, 4],
['pat_2', 20, 3]]
df = pd.DataFrame(data).rename(columns={0:'Patient', 1:'Result', 2:'Day'})
print(df)
df_pat1 = df[df['Patient'] == 'pat_1']
print(df_pat1)
df_pat1_sorted = df_pat1.sort_values(['Day']).reset_index(drop=True)
print(df_pat1_sorted)
returns:
df:
Patient Result Day
0 pat_1 10 1
1 pat_1 9 2
2 pat_2 11 2
3 pat_1 4 5
4 pat_1 2 6
5 pat_2 10 1
6 pat_1 7 3
7 pat_1 5 4
8 pat_2 20 3
df_pat1
Patient Result Day
0 pat_1 10 1
1 pat_1 9 2
3 pat_1 4 5
4 pat_1 2 6
6 pat_1 7 3
7 pat_1 5 4
df_pat1_sorted
Patient Result Day
0 pat_1 10 1
1 pat_1 9 2
2 pat_1 7 3
3 pat_1 5 4
4 pat_1 4 5
5 pat_1 2 6
For the purposes of this answer, I am going to say that if the first value of the new DataFrame is larger than the last, then their values are declining:
if df_pat1_sorted['Result'].values[0] > df_pat1_sorted['Result'].values[-1]:
    print("Patient 1's values are declining")
This returns:
Patient 1's values are declining
If you have many unique IDs (as I'm sure you do), there is a better way of iterating through your patients. I shall present an example using integers; however, you may need to use regex if your patient IDs include characters.
import pandas as pd
import numpy as np
min_ID = 1003
max_ID = 1005
patients = np.random.randint(min_ID, max_ID, size=10)
df = pd.DataFrame(patients).rename(columns={0:'Patients'})
print(df)
s = pd.Series(df['Patients']).unique()
print(s)
for i in range(len(s)):
    print(df[df['Patients'] == s[i]])
returns:
Patients
0 1004
1 1004
2 1004
3 1003
4 1003
5 1003
6 1003
7 1004
8 1003
9 1003
[1004 1003] # s (the unique values in the df['Patients'])
Patients
3 1003
4 1003
5 1003
6 1003
8 1003
9 1003
Patients
0 1004
1 1004
2 1004
7 1004
I hope this has helped!
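As an aside (not from the answer above), groupby can replace the manual loop over unique IDs; a minimal sketch on the same df of random patient IDs:
# One iteration per unique patient, without building the unique list by hand
for patient_id, patient_rows in df.groupby('Patients'):
    print(patient_id)
    print(patient_rows)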
This should solve your question, interpreting 'decreasing' as monotonic decreasing:
import pandas as pd
d = {"PatientID": [1,1,1,1,2,2,2,2],
"PSAdate": [2010,2011,2012,2013,2010,2011,2012,2013],
"PSA": [5,3,2,1,5,3,4,5]}
# Sorts by id and date
df = pd.DataFrame(data=d).sort_values(['PatientID', 'PSAdate'])
# Computes change and max(change) between sequential PSA's
df["change"] = df.groupby('PatientID')["PSA"].diff()
df["max_change"] = df.groupby('PatientID')['change'].transform('max')
# Considers only patients whose PSA are monotonic decreasing
df = df.loc[df["max_change"] <= 0]
print(df)
PatientID PSAdate PSA change max_change
0 1 2010 5 NaN -1.0
1 1 2011 3 -2.0 -1.0
2 1 2012 2 -1.0 -1.0
3 1 2013 1 -1.0 -1.0
Note: to consider only strictly monotonic decreasing PSA, change the final loc condition to < 0
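Writing that note out as code, a minimal sketch on the example data above:
# Keep only patients whose PSA is strictly monotonic decreasing
df_strict = df.loc[df["max_change"] < 0]
print(df_strict["PatientID"].unique())  # [1] for this example data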

Pandas: per individual, find number of records that are near the current observation. Apply vs transform

Suppose I have several records for each person, each with a certain date. I want to construct a column that indicates, per person, the number of other records that are less than 2 months old. That is, I focus just on the records of, say, individual 'A', and I loop over his/her records to see whether there are other records of individual 'A' that are less than two months old (compared to the current row/record).
Let's see some test data to make it clearer
import pandas as pd
testdf = pd.DataFrame({
'id_indiv': [1, 1, 1, 2, 2, 2],
'id_record': [12, 13, 14, 19, 20, 23],
'date': ['2017-04-28', '2017-04-05', '2017-08-05',
'2016-02-01', '2016-02-05', '2017-10-05'] })
testdf.date = pd.to_datetime(testdf.date)
I'll add the expected column of counts
testdf['expected'] = [1, 0, 0, 0, 1, 0]
#Gives:
date id_indiv id_record expected
0 2017-04-28 1 12 1
1 2017-04-05 1 13 0
2 2017-08-05 1 14 0
3 2016-02-01 2 19 0
4 2016-02-05 2 20 1
5 2017-10-05 2 23 0
My first thought was to group by id_indiv and then use apply or transform with a custom function. To make things easier, I'll first add a variable that subtracts two months from the record date, and then I'll write the count_months custom function to use with apply or transform:
import numpy as np

testdf['2M_before'] = testdf['date'] - pd.Timedelta('{0}D'.format(30 * 2))

def count_months(chunk, month_var='2M_before'):
    counts = np.empty(len(chunk))
    for i, (ind, row) in enumerate(chunk.iterrows()):
        # Count records within the last two months,
        # but strictly older than the current one
        counts[i] = ((chunk.date > row[month_var])
                     & (chunk.date < row.date)).sum()
    return counts
I tried first with transform:
testdf.groupby('id_indiv').transform(count_months)
but it gives an AttributeError: ("'Series' object has no attribute 'iterrows'", 'occurred at index date') which I guess means that transform passes a Series object to the custom function, but I don't know how to fix that.
Then I tried with apply
testdf.groupby('id_indiv').apply(count_months)
#Gives
id_indiv
1 [1.0, 0.0, 0.0]
2 [0.0, 1.0, 0.0]
dtype: object
This almost works, but it gives the result as a list. To "unstack" that list, I followed an answer on this question:
#First sort, just in case the order gets messed up when pasting back:
testdf = testdf.sort_values(['id_indiv', 'id_record'])
counts = (testdf.groupby('id_indiv').apply(count_months)
.apply(pd.Series).stack()
.reset_index(level=1, drop=True))
#Now create the new column
testdf.set_index('id_indiv', inplace=True)
testdf['mycount'] = counts.astype('int')
assert (testdf.expected == testdf.mycount).all()
#df looks now likes this
date id_record expected 2M_before mycount
id_indiv
1 2017-04-28 12 1 2017-02-27 1
1 2017-04-05 13 0 2017-02-04 0
1 2017-08-05 14 0 2017-06-06 0
2 2016-02-01 19 0 2015-12-03 0
2 2016-02-05 20 1 2015-12-07 1
2 2017-10-05 23 0 2017-08-06 0
This seems to work, but it seems like there should be a much easier way (maybe using transform?). Besides, pasting back the column like that doesn't seem very robust.
Thanks for your time!
Edited to count recent records per person
Here's one way to count, for each person, the records that are less than 2 months older than each record, using a lookback window of exactly two calendar months minus 1 day (rather than an approximate 2-month window of 60 days or so).
# imports and setup
import pandas as pd
testdf = pd.DataFrame({
'id_indiv': [1, 1, 1, 2, 2, 2],
'id_record': [12, 13, 14, 19, 20, 23],
'date': ['2017-04-28', '2017-04-05', '2017-08-05',
'2016-02-01', '2016-02-05', '2017-10-05'] })
# more setup
testdf['date'] = pd.to_datetime(testdf['date'])
testdf.set_index('date', inplace=True)
testdf.sort_index(inplace=True) # required for the index-slicing below
# solution
count_recent_records = lambda x: [x.loc[d - pd.DateOffset(months=2, days=-1):d].count() - 1 for d in x.index]
testdf['mycount'] = testdf.groupby('id_indiv').transform(count_recent_records)
# output
testdf
id_indiv id_record mycount
date
2016-02-01 2 19 0
2016-02-05 2 20 1
2017-04-05 1 13 0
2017-04-28 1 12 1
2017-08-05 1 14 0
2017-10-05 2 23 0
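If the regular date column and the question's row order are wanted back afterwards (the solution above keeps date as the index), a small follow-up sketch:
# Restore 'date' as a column and return to the original ordering
testdf = testdf.reset_index().sort_values(['id_indiv', 'id_record']).reset_index(drop=True)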
testdf = testdf.sort_values('date')
out_df = pd.DataFrame()
for i in testdf.id_indiv.unique():
    for d in testdf.date:
        date_diff = (d - testdf.loc[testdf.id_indiv == i, 'date']).dt.days
        out_dict = {'person': i,
                    'entry_date': d,
                    'count': sum((date_diff > 0) & (date_diff <= 60))}
        out_df = out_df.append(out_dict, ignore_index=True)
out_df
count entry_date person
0 0.0 2016-02-01 2.0
1 1.0 2016-02-05 2.0
2 0.0 2017-04-05 2.0
3 0.0 2017-04-28 2.0
4 0.0 2017-08-05 2.0
5 0.0 2017-10-05 2.0
6 0.0 2016-02-01 1.0
7 0.0 2016-02-05 1.0
8 0.0 2017-04-05 1.0
9 1.0 2017-04-28 1.0
10 0.0 2017-08-05 1.0
11 0.0 2017-10-05 1.0
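Note that the loop above pairs every person with every date in the table, which is why each person also gets zero-count rows for the other person's dates. A sketch of the same idea restricted to each person's own dates, collecting the rows in a list instead of using the deprecated DataFrame.append:
rows = []
for i in testdf.id_indiv.unique():
    person_dates = testdf.loc[testdf.id_indiv == i, 'date']
    for d in person_dates:
        date_diff = (d - person_dates).dt.days
        rows.append({'person': i,
                     'entry_date': d,
                     'count': int(((date_diff > 0) & (date_diff <= 60)).sum())})
out_df = pd.DataFrame(rows)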

Function to turn single Pandas dataframe into multi-year dataframe

I have this Pandas dataframe which is a single year snapshot:
data = pd.DataFrame({'ID' : (1, 2),
'area': (2, 3),
'population' : (100, 200),
'demand' : (100, 200)})
I want to make this into a time series where population grows by 10% per year and demand grows by 20% per year. In this example I do this for two extra years.
This should be the output (note: it includes an added 'year' column):
output = pd.DataFrame({'ID': (1,2,1,2,1,2),
'year': (1,1,2,2,3,3),
'area': (2,3,2,3,2,3),
'population': (100,200,110,220,121,242),
'demand': (100,200,120,240,144,288)})
Setup variables:
k = 5 #Number of years to forecast
a = 1.20 #Demand Growth
b = 1.10 #Population Growth
Forecast dataframe:
df_out = (data[['ID', 'area']]
          .merge(pd.concat([data[['demand', 'population']]
                            .mul([pow(a, i), pow(b, i)])
                            .assign(year=i + 1)
                            for i in range(k)]),
                 left_index=True, right_index=True)
          .sort_values(by='year'))
print(df_out)
Output:
ID area demand population year
0 1 2 100.00 100.00 1
1 2 3 200.00 200.00 1
0 1 2 120.00 110.00 2
1 2 3 240.00 220.00 2
0 1 2 144.00 121.00 3
1 2 3 288.00 242.00 3
0 1 2 172.80 133.10 4
1 2 3 345.60 266.20 4
0 1 2 207.36 146.41 5
1 2 3 414.72 292.82 5
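An alternative sketch of the same forecast, using the k, a and b defined above: build one frame per year with assign and concatenate them.
df_out2 = pd.concat(
    [data.assign(year=i + 1,
                 demand=data['demand'] * a**i,
                 population=data['population'] * b**i)
     for i in range(k)],
    ignore_index=True)
print(df_out2)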
Another approach:
- create a numpy array with [1.2, 1.1] that is repeated and cumprod'ed
- prepend a row of ones [1.0, 1.0] to account for the initial condition
- multiply by the values of a conveniently stacked pd.Series
- feed the result into a pd.DataFrame constructor
- clean up the indices and whatnot
import numpy as np

k = 5
cols = ['ID', 'area']
cum_ret = np.vstack([np.ones((1, 2)),
                     np.array([[1.2, 1.1]])[[0] * k].cumprod(0)])[:, [0, 0, 1, 1]]
s = data.set_index(cols).unstack(cols)
pd.DataFrame(
    cum_ret * s.values,
    columns=s.index
).stack(cols).reset_index(cols).reset_index(drop=True)
ID area demand population
0 1 2 100.000 100.000
1 2 3 200.000 200.000
2 1 2 120.000 110.000
3 2 3 240.000 220.000
4 1 2 144.000 121.000
5 2 3 288.000 242.000
6 1 2 172.800 133.100
7 2 3 345.600 266.200
8 1 2 207.360 146.410
9 2 3 414.720 292.820
10 1 2 248.832 161.051
11 2 3 497.664 322.102

How to use `apply()` or other vectorized approach when previous value matters

Assume I have a DataFrame of the following form, where the first column is a random number and the other columns will be based on the value in the previous column.
For ease of use, let's say I want each number to be the previous one squared, so each column is the square of the column before it.
I know I can write a pretty simple loop to do this, but I also know looping is not usually the most efficient in python/pandas. How could this be done with apply() or rolling_apply()? Or, otherwise be done more efficiently?
My (failed) attempts below:
In [12]: a = pandas.DataFrame({0:[1,2,3,4,5],1:0,2:0,3:0})
In [13]: a
Out[13]:
0 1 2 3
0 1 0 0 0
1 2 0 0 0
2 3 0 0 0
3 4 0 0 0
4 5 0 0 0
In [14]: a = a.apply(lambda x: x**2)
In [15]: a
Out[15]:
0 1 2 3
0 1 0 0 0
1 4 0 0 0
2 9 0 0 0
3 16 0 0 0
4 25 0 0 0
In [16]: a = pandas.DataFrame({0:[1,2,3,4,5],1:0,2:0,3:0})
In [17]: pandas.rolling_apply(a,1,lambda x: x**2)
C:\WinPython64bit\python-3.5.2.amd64\lib\site-packages\spyderlib\widgets\externalshell\start_ipython_kernel.py:1: FutureWarning: pd.rolling_apply is deprecated for DataFrame and will be removed in a future version, replace with
DataFrame.rolling(center=False,window=1).apply(args=<tuple>,kwargs=<dict>,func=<function>)
# -*- coding: utf-8 -*-
Out[17]:
0 1 2 3
0 1.0 0.0 0.0 0.0
1 4.0 0.0 0.0 0.0
2 9.0 0.0 0.0 0.0
3 16.0 0.0 0.0 0.0
4 25.0 0.0 0.0 0.0
In [18]: a = pandas.DataFrame({0:[1,2,3,4,5],1:0,2:0,3:0})
In [19]: a = a[:-1]**2
In [20]: a
Out[20]:
0 1 2 3
0 1 0 0 0
1 4 0 0 0
2 9 0 0 0
3 16 0 0 0
In [21]:
So, my issue is mostly how to refer to the previous column value in my DataFrame calculations.
What you're describing is a recurrence relation, and I don't think there is currently any non-loop way to do that. Things like apply and rolling_apply still rely on having all the needed data available before they begin, and outputting all the result data at once at the end. That is, they don't allow you to compute the next value using earlier values of the same series. See this question and this one as well as this pandas issue.
In practical terms, for your example, you only have three columns you want to fill in, so doing a three-pass loop (as shown in some of the other answers) will probably not be a major performance hit.
Unfortunately there isn't a way of doing this with no loops, as far as I know. However, you don't have to loop through every value, just each column. You can just call apply on the previous column and set the next one to the returned value:
import pandas as pd

a = pd.DataFrame({0: [1, 2, 3, 4, 5], 1: 0, 2: 0, 3: 0})
for i in range(3):
    a[i + 1] = a[i].apply(lambda x: x**2)
Or, written out explicitly:
a[1] = a[0].apply(lambda x: x**2)
a[2] = a[1].apply(lambda x: x**2)
a[3] = a[2].apply(lambda x: x**2)
will give you
0 1 2 3
0 1 1 1 1
1 2 4 16 256
2 3 9 81 6561
3 4 16 256 65536
4 5 25 625 390625
In this special case, we know the following about the columns:
- column 0 will be whatever is there, to the power of 1
- column 1 will be whatever is in column 0, to the power of 2
- column 2 will be whatever is in column 1 to the power of 2, or whatever is in column 0 to the power of 4
- column 3 will be whatever is in column 2 to the power of 2, or whatever is in column 1 to the power of 4, or whatever is in column 0 to the power of 8
So we can indeed vectorize your example (here df is the DataFrame called a in the question) with:
import numpy as np
np.power(df.values[:, [0]], np.power(2, np.arange(4)))
array([[ 1, 1, 1, 1],
[ 2, 4, 16, 256],
[ 3, 9, 81, 6561],
[ 4, 16, 256, 65536],
[ 5, 25, 625, 390625]])
Wrap this in a pretty dataframe
pd.DataFrame(
np.power(df.values[:, [0]], np.power(2, np.arange(4))),
df.index, df.columns)
0 1 2 3
0 1 1 1 1
1 2 4 16 256
2 3 9 81 6561
3 4 16 256 65536
4 5 25 625 390625
