Pythonic way to map a yyyymm-formatted column into a numeric column? - python

Sorry if it's not totally clear in the title, but the point is I have a Pandas DataFrame with the following Date column:
Date
201611
201612
201701
And I want to map that so I have a period column that takes value 1 for the first period, and then starts counting one by one until the last period, like this:
Date Period
201611 1
201612 2
201701 3
I achieved what I want doing this:
dic_t = {}
for n, t in enumerate(sorted(df.Date.unique())):
    dic_t[t] = n + 1
df['Period'] = df.Date.map(dic_t)
But it doesn't seem too pythonic. I guess I could achieve something similar using dictionary comprehensions, but I'm not good at them yet.
Any ideas?

pd.factorize can sort a list of items and return unique integer labels:
In [209]: pd.factorize(['201611','201612','201701','201702','201704','201612'], sort=True)[0]+1
Out[209]: array([1, 2, 3, 4, 5, 2])
Therefore you could use
df['Period'] = pd.factorize(df['Date'], sort=True)[0] + 1
pd.factorize returns both an array of labels and an array of unique values:
In [210]: pd.factorize(['201611','201612','201701','201702','201704','201612'], sort=True)
Out[210]:
(array([0, 1, 2, 3, 4, 1]),
array(['201611', '201612', '201701', '201702', '201704'], dtype=object))
Since, in this question, it appears you only want the labels, I used pd.factorize(...)[0] to obtain just the labels.
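As a minimal sketch of this on a small frame (the sample values, including the repeated 201612, are made up for illustration), the factorize labels can be cross-checked against a dense rank, which numbers the sorted unique values the same way:

```python
import pandas as pd

# Hypothetical sample data: yyyymm values, with one month repeated
df = pd.DataFrame({"Date": [201611, 201612, 201701, 201612]})

# factorize(sort=True) labels the sorted unique values 0, 1, 2, ...
# so adding 1 gives a 1-based period number
df["Period"] = pd.factorize(df["Date"], sort=True)[0] + 1

# Cross-check: a dense rank produces the same 1-based numbering
df["Period_rank"] = df["Date"].rank(method="dense").astype(int)

print(df)
```

Both columns should agree row by row; the repeated month gets the same period number both times.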

So, based on the info from the question and the comments, the enumeration of the periods (combinations of year and month) should start at the first period that is present in the dataframe.
For that purpose, your code works just fine. If you think that dict comprehensions look "more pythonic", you could express that as:
period_dict = {
    period: i+1
    for i, period in enumerate(sorted(df.Date.unique()))}
df['Period'] = df.Date.map(period_dict)
Just note: with this method, if there is no data for some month after the start month, that month gets no period number of its own and the following months are numbered as if it did not exist.
For example, if you have no data for March 2017, then:
Date Period
201611 1
201612 2
201701 3
201702 4
201704 5 <== April is period 5 and not 6
If you need to generate the full enumeration for all possible periods, use something like this:
start_year = 2016
end_year = 2018
period_list = [
    y*100 + m
    for y in range(start_year, end_year+1)
    for m in range(1, 13)]
period_dict = {
    period: i+1
    for i, period in enumerate(period_list)}
df['Period'] = df.Date.map(period_dict)
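If the Date column holds integers, the full enumeration can also be computed arithmetically instead of through a lookup dict. This is only a sketch under that assumption; the helper name month_number is made up for illustration. It places every yyyymm value on one continuous month scale and subtracts the first month present in the data:

```python
import pandas as pd

# Same example as above: March 2017 (201703) is missing
df = pd.DataFrame({"Date": [201611, 201612, 201701, 201702, 201704]})

def month_number(yyyymm):
    """Put a yyyymm integer on one continuous month scale: year * 12 + month."""
    return (yyyymm // 100) * 12 + (yyyymm % 100)

# Period 1 is the earliest month in the data; gaps keep their own number
df["Period"] = month_number(df["Date"]) - month_number(df["Date"].min()) + 1

print(df)
```

Here April 2017 comes out as period 6, because the missing March still occupies period 5.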

Related

Calculate average based on available data points

Imagine I have the following data frame:
Product   Month 1   Month 2   Month 3   Month 4   Total
Stuff A         5         0         3         3      11
Stuff B        10        11         4         8      33
Stuff C         0         0        23        30      53
that can be constructed from:
df = pd.DataFrame({'Product': ['Stuff A', 'Stuff B', 'Stuff C'],
'Month 1': [5, 10, 0],
'Month 2': [0, 11, 0],
'Month 3': [3, 4, 23],
'Month 4': [3, 8, 30],
'Total': [11, 33, 53]})
This data frame shows the amount of units sold per product, per month.
Now, what I want to do is to create a new column called "Average" that calculates the average units sold per month. HOWEVER, notice in this example that Stuff C's values for months 1 and 2 are 0. This product was probably introduced in Month 3, so its average should be calculated based on months 3 and 4 only. Also notice that Stuff A's units sold in Month 2 were 0, but that does not mean the product was introduced in Month 3 since 5 units were sold in Month 1. That is, its average should be calculated based on all four months. Assume that the provided data frame may contain any number of months.
Based on these conditions, I have come up with the following solution in pseudo-code:
months = ["list of index names of months to calculate"]
x = len(months)
if df["Month 1"] != 0:
    df["Average"] = df["Total"] / x
elif df["Month 2"] != 0:
    df["Average"] = df["Total"] / (x - 1)
...
elif df["Month " + str(x)] != 0:
    df["Average"] = df["Total"] / 1
else:
    df["Average"] = 0
That way, the average would be calculated starting from the first month where units sold are different from 0. However, I haven't been able to translate this logical abstraction into actual working code. I couldn't manage to iterate over len(months) while maintaining the elif conditions. Or maybe there is a better, more practical approach.
I would appreciate any help, since I've been trying to crack this problem for a while with no success.
NumPy has a function, np.trim_zeros, that trims leading and/or trailing zeros from a one-dimensional array. Using a list comprehension, you can iterate over the relevant DataFrame rows, trim the leading zeros and take the average of what remains for each row.
Note that since 'Month 1' to 'Month 4' are consecutive, you can slice the columns between them using .loc.
import numpy as np
df['Average Sales'] = [np.trim_zeros(row, trim='f').mean() for row in df.loc[:, 'Month 1':'Month 4'].to_numpy()]
Output:
Product Month 1 Month 2 Month 3 Month 4 Total Average Sales
0 Stuff A 5 0 3 3 11 2.75
1 Stuff B 10 11 4 8 33 8.25
2 Stuff C 0 0 23 30 53 26.50
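To see why only leading zeros are dropped, here is a small sketch applying np.trim_zeros with trim='f' (front only) to two of the rows from the question; the variable names are made up for illustration:

```python
import numpy as np

stuff_a = np.array([5, 0, 3, 3])    # zero in the middle: must be kept
stuff_c = np.array([0, 0, 23, 30])  # leading zeros: must be dropped

# trim='f' removes zeros from the front of the array only
trimmed_a = np.trim_zeros(stuff_a, trim='f')
trimmed_c = np.trim_zeros(stuff_c, trim='f')

print(trimmed_a, trimmed_a.mean())
print(trimmed_c, trimmed_c.mean())
```

Stuff A keeps all four months, so its inner zero still counts toward the average, while Stuff C is averaged over months 3 and 4 only.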
Try:
df = df.set_index(['Product','Total'])
df['Average'] = df.where(df.ne(0).cummax(axis=1)).mean(axis=1)
df_out=df.reset_index()
print(df_out)
Output:
Product Total Month 1 Month 2 Month 3 Month 4 Average
0 Stuff A 11 5 0 3 3 2.75
1 Stuff B 33 10 11 4 8 8.25
2 Stuff C 53 0 0 23 30 26.50
Details:
Move Product and Total into the dataframe index, so we can do the calculation on the rest of the dataframe.
First create a boolean matrix by comparing with zero using ne. Then use cummax along the rows: once a row has a non-zero value, the mask stays True until the end of the row. If a row starts with zeros, the mask stays False until the first non-zero value, then turns True and remains True.
Next, use pd.DataFrame.where to keep only the values where that boolean matrix is True; the other values (the leading zeros) become NaN and are not used in the calculation of the mean.
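The steps above can be sketched with the intermediate mask made visible. This variant selects the month columns with filter(like='Month') instead of moving Product and Total into the index; that selection is an assumption about the column naming, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'Product': ['Stuff A', 'Stuff B', 'Stuff C'],
                   'Month 1': [5, 10, 0],
                   'Month 2': [0, 11, 0],
                   'Month 3': [3, 4, 23],
                   'Month 4': [3, 8, 30],
                   'Total': [11, 33, 53]})

# Assumption: every month column has "Month" in its name
months = df.filter(like='Month')

# True from the first non-zero month onward, False for leading zeros
started = months.ne(0).cummax(axis=1)

# Leading zeros become NaN, which mean() skips by default
masked = months.where(started)
df['Average'] = masked.mean(axis=1)

print(started)
print(df)
```

For Stuff C the mask is False, False, True, True, so only months 3 and 4 enter the mean.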
If you don't mind it being a little memory inefficient, you could put your dataframe into a NumPy array. np.nonzero returns the indices of the non-zero entries, so indexing with it drops the zeros, and you can then use mean to calculate the average of what is left. Note that this drops every zero, not only the leading ones, so a row like Stuff A (5, 0, 3, 3) would be averaged over three months rather than four. It could look something like this:
import numpy as np
arr = np.array(Stuff_A_DF)
mean = arr[np.nonzero(arr)].mean()
Alternatively, you could manually extract the row to a list, then loop through it to remove the zeroes.

Grouping data based on time interval

I have to group a dataset with multiple participants. The participants work a specific time on a specific tablet. If rows are the same tablet, and the time difference between consecutive rows is no more than 10 minutes, the rows belong to one participant. I would like to create a new column ("Participant") that numbers the participants. I know some python but this goes over my head. Thanks a lot!
Dataframe:
ID, Time, Tablet
1, 9:12, a
2, 9:14, a
3, 9:17, a
4, 9:45, a
5, 9:49, a
6, 9:51, a
7, 9:13, b
8, 9:15, b
...
Goal:
ID, Time, Tablet, Participant
1, 9:12, a, 1
2, 9:14, a, 1
3, 9:17, a, 1
4, 9:45, a, 2
5, 9:49, a, 2
6, 9:51, a, 2
7, 9:13, b, 3
8, 9:15, b, 3
...
You can group by tablet first and then take a cumulative sum to get the participant column the way you want. Make sure the time column is in datetime format and sorted before you do this.
import numpy as np
import pandas as pd
df['Time'] = pd.to_datetime(df['Time'])
df = df.sort_values(['Tablet', 'Time'])
df['time_diff'] = df.groupby('Tablet')['Time'].diff().dt.total_seconds() / 60
df['Participant'] = np.where(df['time_diff'].isnull() | (df['time_diff'] > 10), 1, 0).cumsum()
I've done something similar before, using a combination of groupby and a per-group difference between consecutive rows.
df = df.sort_values(["Tablet", "Time"])
# Gap to the previous row on the same tablet (NaT on each tablet's first row)
df["Time_Period"] = df.groupby("Tablet")["Time"].diff()
df["Time_Period"] = df["Time_Period"].dt.total_seconds()
# A new participant starts on a tablet's first row or after a gap of more than 10 minutes
df["New_Participant"] = df["Time_Period"].isna() | (df["Time_Period"] > 10*60)
df["Participant_ID"] = df["New_Participant"].cumsum()
Basically I flag every row that starts a new session (the first row of a tablet, or a gap of over 10 minutes since the previous row), then take a cumulative sum of the flags to give each participant a unique ID.
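A runnable sketch of the gap-then-cumsum idea on the sample rows from the question. The "%H:%M" format string is an assumption about how the Time column is stored, and the variable names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Time": ["9:12", "9:14", "9:17", "9:45", "9:49", "9:51", "9:13", "9:15"],
    "Tablet": ["a", "a", "a", "a", "a", "a", "b", "b"],
})

# Assumption: times are hour:minute strings on the same day
df["Time"] = pd.to_datetime(df["Time"], format="%H:%M")
df = df.sort_values(["Tablet", "Time"])

# Gap to the previous row on the same tablet; NaT on each tablet's first row
gap = df.groupby("Tablet")["Time"].diff()

# A participant starts at a tablet's first row or after more than 10 minutes
new_participant = gap.isna() | (gap > pd.Timedelta(minutes=10))

# Each True bumps the running count, giving consecutive participant numbers
df["Participant"] = new_participant.cumsum()

print(df[["ID", "Time", "Tablet", "Participant"]])
```

Flagging the first row of every tablet is what makes the numbering continue correctly when the data moves from tablet a to tablet b.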

Sum based on date range in two separate columns

I want to sum all the values in one column based on a date range given by two other columns:
Start_Date Value_to_sum End_date
2017-12-13 2 2017-12-13
2017-12-13 3 2017-12-16
2017-12-14 4 2017-12-15
2017-12-15 2 2017-12-15
A simple groupby won't do it since it would only add the value for a specific date.
We could use a nested for loop, but it would take forever to run:
from tqdm import tqdm
unique_date = data.Start_Date.unique()
carry = pd.DataFrame({'Date': unique_date})
carry['total'] = 0
for n in tqdm(range(len(carry))):
    # rows whose range has started on or before this date
    tr = data.loc[data['Start_Date'] <= carry['Date'][n]]
    for i in tr.index:
        # ...and has not ended before it
        if carry['Date'][n] <= tr['End_date'][i]:
            carry.loc[n, 'total'] += tr['Value_to_sum'][i]
Something like that would work but, like I said, it would take forever.
The expected output is unique date with the total for each day.
Here it would be
2017-12-13 = 5, 2017-12-14 = 7, 2017-12-15 = 9.
How do I compute the sum based on the date ranges?
First, group by ["Start_Date", "End_date"] to save some operations.
from collections import Counter
c = Counter()
df_g = df.groupby(["Start_Date", "End_date"]).sum().reset_index()
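def my_counter(row):
    s, v, e = row.Start_Date, row.Value_to_sum, row.End_date
    if s == e:
        # single-day range: add the value to that one date
        # (the freq argument is accepted by older pandas only; pandas 2.0 removed it)
        c[pd.Timestamp(s, freq="D")] += v
    else:
        # multi-day range: add the value to every date from s to e inclusive
        c.update({date: v for date in pd.date_range(s, e)})
df_g.apply(my_counter, axis=1)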
print(c)
"""
Counter({Timestamp('2017-12-15 00:00:00', freq='D'): 9,
Timestamp('2017-12-14 00:00:00', freq='D'): 7,
Timestamp('2017-12-13 00:00:00', freq='D'): 5,
Timestamp('2017-12-16 00:00:00', freq='D'): 3})
"""
Tools used:
Counter.update([iterable-or-mapping]):
Elements are counted from an iterable or added-in from another mapping (or counter). Like dict.update() but adds counts instead of replacing them. Also, the iterable is expected to be a sequence of elements, not a sequence of (key, value) pairs. -- Cited from Python 3 Documentation
pandas.date_range
Unfortunately, I don't believe there's a way to do this without involving at least one loop. You are trying to see whether a date falls between your start and end dates. If it does, you want to sum the Value_to_sum column. We can make your loop more efficient.
You can create a mask for each unique date and find all rows that match your criteria. You then apply that mask and take the sum of all matching rows. This should be much faster than iterating over each row individually and determining what date counters to increase.
unique_date = df.Start_Date.unique()
for d in unique_date:
    # create a mask which will give us all the rows
    # that we want to sum over
    # then apply the mask and take the sum of the Value_to_sum column
    m = (df.Start_Date <= d) & (df.End_date >= d)
    print(d, df[m].Value_to_sum.sum())
This gives you the output you want:
2017-12-13 5
2017-12-14 7
2017-12-15 9
Someone else might be able to come up with a clever way to vectorize the entire thing, but I'm not seeing a way to do it.
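One possible way to vectorize the mask idea is NumPy broadcasting: build a dates-by-rows boolean matrix in one step and sum the values along each row of it. This is only a sketch, not a claim about what is fastest; it holds the whole matrix in memory, so it suits moderate sizes. The variable names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2017-12-13", "2017-12-13", "2017-12-14", "2017-12-15"]),
    "Value_to_sum": [2, 3, 4, 2],
    "End_date": pd.to_datetime(["2017-12-13", "2017-12-16", "2017-12-15", "2017-12-15"]),
})

# Sorted unique start dates as a plain NumPy datetime64 array
dates = df["Start_Date"].drop_duplicates().sort_values().to_numpy()
start = df["Start_Date"].to_numpy()
end = df["End_date"].to_numpy()
values = df["Value_to_sum"].to_numpy()

# active[i, j] is True when dates[i] lies inside row j's [start, end] range
active = (start[None, :] <= dates[:, None]) & (end[None, :] >= dates[:, None])

# Sum the values of the active rows for each date
totals = pd.Series((active * values).sum(axis=1), index=dates)

print(totals)
```

Each row of the matrix is the same mask the loop builds for one date, so the totals match the loop's output for 2017-12-13, 2017-12-14 and 2017-12-15.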
If you want the sum to be part of the original dataframe, you can use apply to iterate over each row (though this might not be the most optimized code, as the sum is recalculated for every row):
carry['total'] = carry.apply(lambda current_row: carry.loc[(carry['Start_Date'] <= current_row.Start_Date) & (carry['End_date'] >= current_row.Start_Date)].Value_to_sum.sum(),axis=1)
The above results in:
>>> print(carry)
End_date Start_Date Value_to_sum total
0 2017-12-13 2017-12-13 2 5
1 2017-12-16 2017-12-13 3 5
2 2017-12-15 2017-12-14 4 7
3 2017-12-15 2017-12-15 2 9

Pandas conditions across multiple series

Let's say I have some data like this:
category = pd.Series(np.ones(4))
job1_days = pd.Series([1, 2, 1, 2])
job1_time = pd.Series([30, 35, 50, 10])
job2_days = pd.Series([1, 3, 1, 3])
job2_time = pd.Series([10, 40, 60, 10])
job3_days = pd.Series([1, 2, 1, 3])
job3_time = pd.Series([30, 15, 50, 15])
Each entry represents an individual (so 4 people total). xxx_days is the number of days an individual did a job, and xxx_time is the number of minutes spent on that job on a single day.
I want to assign a 2 to category for an individual, if across all jobs they spent at least 3 days of 20 minutes each. So for example, person 1 does not meet the criteria because they only spent 2 total days with at least 20 minutes (their job 2 day count does not count toward the total because time is < 20). Person 2 does meet the criteria as they spent 5 total days (jobs 1 and 2).
After replacement, category should look like this:
[1, 2, 2, 1]
My current attempt requires a for loop that manually indexes into each series and calculates the total days where time is at least 20. However, this approach doesn't scale well to my actual dataset. I haven't included the code here, as I'd like to approach it from a Pandas perspective instead.
What's the most efficient way to do this in Pandas? The thing that stumps me is checking conditions across multiple series and acting accordingly after summing the days.
Put days and time into two data frames whose columns line up position by position, then do the calculation in a vectorized way:
import pandas as pd
time = pd.concat([job1_time, job2_time, job3_time], axis = 1)
days = pd.concat([job1_days, job2_days, job3_days], axis = 1)
((days * (time >= 20)).sum(1) >= 3) + 1
#0 1
#1 2
#2 2
#3 1
#dtype: int64
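For completeness, a self-contained sketch that starts from the series in the question and writes the result back into category. The name qualifying_days is made up for illustration:

```python
import numpy as np
import pandas as pd

category = pd.Series(np.ones(4))
job1_days = pd.Series([1, 2, 1, 2])
job1_time = pd.Series([30, 35, 50, 10])
job2_days = pd.Series([1, 3, 1, 3])
job2_time = pd.Series([10, 40, 60, 10])
job3_days = pd.Series([1, 2, 1, 3])
job3_time = pd.Series([30, 15, 50, 15])

# Column k of each frame belongs to job k+1, so the two frames align
time = pd.concat([job1_time, job2_time, job3_time], axis=1)
days = pd.concat([job1_days, job2_days, job3_days], axis=1)

# Multiplying by the boolean mask zeroes out days from jobs under 20 minutes
qualifying_days = (days * (time >= 20)).sum(axis=1)

# Individuals with at least 3 qualifying days move to category 2
category[qualifying_days >= 3] = 2

print(qualifying_days.tolist())
print(category.tolist())
```

The intermediate qualifying_days makes the rule easy to check by hand: person 1 has 2 such days and person 2 has 5, as described in the question.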

Is there a quick function for look-back calculations in a pandas DataFrame?

I want to implement a calculation like this simple scenario:
each new value is computed as the sum of the daily data over the previous N days (N = 3 in the following example)
Dataframe df: (df.index is 'date')
date value
20140718 1
20140721 2
20140722 3
20140723 4
20140724 5
20140725 6
20140728 7
......
The calculation should give:
date value new
20140718 1 0
20140721 2 0
20140722 3 0
20140723 4 6 (3+2+1)
20140724 5 9 (4+3+2)
20140725 6 12 (5+4+3)
20140728 7 15 (6+5+4)
......
For now I have done this with a for loop like:
df['new'] = 0
for idx in df.index:
    loc = df.index.get_loc(idx)
    if loc - N >= 0:
        # the N rows just before the current one
        tmp = df.iloc[loc-N:loc]
        total = tmp['value'].sum()
    else:
        total = 0
    df.loc[idx, 'new'] = total
But when the dataframe is very long or N is very large, this calculation is very slow. How can I implement it faster, using a built-in function or some other way?
Besides, what if the scenario is more complex? Thanks.
Since you want the sum of the previous three values excluding the current one, you can apply a function over a rolling window of four and sum up all but the last value (older pandas exposed this as rolling_apply; current versions use the .rolling() method):
new = df.rolling(4, min_periods=4).apply(lambda x: x[:-1].sum(), raw=True)
This is the same as summing over a window of three and shifting afterwards:
new = df.rolling(3, min_periods=3).sum().shift()
Then
df["new"] = new["value"].fillna(0)
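A runnable sketch of the sum-then-shift variant on the sample data from the question, written with the .rolling() method of current pandas and generalized to any N. Storing the dates as an integer index is an assumption made to mirror the question's table:

```python
import pandas as pd

N = 3  # number of previous rows to sum

df = pd.DataFrame(
    {"value": [1, 2, 3, 4, 5, 6, 7]},
    index=pd.Index([20140718, 20140721, 20140722, 20140723,
                    20140724, 20140725, 20140728], name="date"),
)

# Sum over a window of N rows ending at the current row, then shift by one
# so each row sees only the N rows before it; incomplete windows become 0
df["new"] = df["value"].rolling(N, min_periods=N).sum().shift().fillna(0)

print(df)
```

Note that the window counts rows, not calendar days, which matches the expected output in the question, where weekends are simply absent from the index.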
