What's the equivalent of cut/qcut for pandas date fields? - python

Update: starting with version 0.20.0, pandas cut/qcut DOES handle date fields. See What's New for more.
pd.cut and pd.qcut now support datetime64 and timedelta64 dtypes (GH14714, GH14798)
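For example, a minimal hedged sketch assuming pandas >= 0.20 (the sample dates below are illustrative, not from the question):
import pandas as pd
# cut/qcut now accept datetime64 data directly
dates = pd.Series(pd.to_datetime(['2013-01-15', '2013-02-20', '2013-03-05', '2013-04-10']))
print(pd.cut(dates, bins=3))    # three equal-width datetime bins
print(pd.qcut(dates, q=2))      # two quantile-based datetime bins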
Original question: Pandas cut and qcut functions are great for 'bucketing' continuous data for use in pivot tables and so forth, but I can't see an easy way to get datetime axes in the mix. Frustrating since pandas is so great at all the time-related stuff!
Here's a simple example:
import numpy as np
import pandas as pd

def randomDates(size, start=134e7, end=137e7):
    return np.array(np.random.randint(start, end, size), dtype='datetime64[s]')

df = pd.DataFrame({'ship': randomDates(10), 'recd': randomDates(10),
                   'qty': np.random.randint(0, 10, 10), 'price': 100 * np.random.random(10)})
df
price qty recd ship
0 14.723510 3 2012-11-30 19:32:27 2013-03-08 23:10:12
1 53.535143 2 2012-07-25 14:26:45 2012-10-01 11:06:39
2 85.278743 7 2012-12-07 22:24:20 2013-02-26 10:23:20
3 35.940935 8 2013-04-18 13:49:43 2013-03-29 21:19:26
4 54.218896 8 2013-01-03 09:00:15 2012-08-08 12:50:41
5 61.404931 9 2013-02-10 19:36:54 2013-02-23 13:14:42
6 28.917693 1 2012-12-13 02:56:40 2012-09-08 21:14:45
7 88.440408 8 2013-04-04 22:54:55 2012-07-31 18:11:35
8 77.329931 7 2012-11-23 00:49:26 2012-12-09 19:27:40
9 46.540859 5 2013-03-13 11:37:59 2013-03-17 20:09:09
To bin by groups of price or quantity, I can use cut/qcut to bucket them:
df.groupby([pd.cut(df['qty'], bins=[0,1,5,10]), pd.qcut(df['price'],q=3)]).count()
price qty recd ship
qty price
(0, 1] [14.724, 46.541] 1 1 1 1
(1, 5] [14.724, 46.541] 2 2 2 2
(46.541, 61.405] 1 1 1 1
(5, 10] [14.724, 46.541] 1 1 1 1
(46.541, 61.405] 2 2 2 2
(61.405, 88.44] 3 3 3 3
But I can't see any easy way of doing the same thing with my 'recd' or 'ship' date fields, for example generating a similar table of counts broken down by (say) monthly buckets of recd and ship. It seems like resample() has all of the machinery to bucket into periods, but I can't figure out how to apply it here. The buckets (or levels) in the 'date cut' would be equivalent to a pandas.PeriodIndex, and then I want to label each value of df['recd'] with the period it falls into.
So the kind of output I'm looking for would be something like:
ship recv count
2011-01 2011-01 1
2011-02 3
... ...
2011-02 2011-01 2
2011-02 6
... ... ...
More generally, I'd like to be able to mix and match continuous or categorical variables in the output. Imagine df also contains a 'status' column with red/yellow/green values, then maybe I want to summarize counts by status, price bucket, ship and recd buckets, so:
ship recv price status count
2011-01 2011-01 [0-10) green 1
red 4
[10-20) yellow 2
... ... ...
2011-02 [0-10) yellow 3
... ... ... ...
As a bonus question, what's the simplest way to modify the groupby() result above to just contain a single output column called 'count'?
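A hedged aside on the monthly-bucket part: on newer pandas, pd.Grouper exposes the same resampling machinery inside groupby without setting an index, for example:
# count rows in calendar-month buckets of 'recd', no set_index needed
df.groupby(pd.Grouper(key='recd', freq='M')).size()
# Grouper objects can be mixed with other keys, e.g. a price cut
df.groupby([pd.Grouper(key='recd', freq='M'), pd.cut(df['price'], bins=3)]).size()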

Here's a solution using pandas.PeriodIndex (caveat: PeriodIndex doesn't
seem to support time rules with a multiple > 1, such as '4M'). I think
the answer to your bonus question is .size().
In [49]: df.groupby([pd.PeriodIndex(df.recd, freq='Q'),
....: pd.PeriodIndex(df.ship, freq='Q'),
....: pd.cut(df['qty'], bins=[0,5,10]),
....: pd.qcut(df['price'],q=2),
....: ]).size()
Out[49]:
qty price
2012Q2 2013Q1 (0, 5] [2, 5] 1
2012Q3 2013Q1 (5, 10] [2, 5] 1
2012Q4 2012Q3 (5, 10] [2, 5] 1
2013Q1 (0, 5] [2, 5] 1
(5, 10] [2, 5] 1
2013Q1 2012Q3 (0, 5] (5, 8] 1
2013Q1 (5, 10] (5, 8] 2
2013Q2 2012Q4 (0, 5] (5, 8] 1
2013Q2 (0, 5] [2, 5] 1
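On the bonus question: .size() already returns a single Series of counts, and a hedged follow-up is reset_index(name='count') to turn it into a frame whose only data column is literally called 'count':
counts = df.groupby([pd.PeriodIndex(df.recd, freq='Q'),
                     pd.PeriodIndex(df.ship, freq='Q')]).size()
counts = counts.reset_index(name='count')   # one output column named 'count'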

Just set the index to the field you'd like to resample by; here are some examples:
In [36]: df.set_index('recd').resample('1M',how='sum')
Out[36]:
price qty
recd
2012-07-31 64.151194 9
2012-08-31 93.476665 7
2012-09-30 94.193027 7
2012-10-31 NaN NaN
2012-11-30 NaN NaN
2012-12-31 12.353405 6
2013-01-31 NaN NaN
2013-02-28 129.586697 7
2013-03-31 NaN NaN
2013-04-30 NaN NaN
2013-05-31 211.979583 13
In [37]: df.set_index('recd').resample('1M',how='count')
Out[37]:
2012-07-31 price 1
qty 1
ship 1
2012-08-31 price 1
qty 1
ship 1
2012-09-30 price 2
qty 2
ship 2
2012-10-31 price 0
qty 0
ship 0
2012-11-30 price 0
qty 0
ship 0
2012-12-31 price 1
qty 1
ship 1
2013-01-31 price 0
qty 0
ship 0
2013-02-28 price 2
qty 2
ship 2
2013-03-31 price 0
qty 0
ship 0
2013-04-30 price 0
qty 0
ship 0
2013-05-31 price 3
qty 3
ship 3
dtype: int64
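Note that the how= argument was later removed from resample; a hedged equivalent on recent pandas versions would be:
# same monthly bucketing with the method-chaining resample API
df.set_index('recd').resample('M').sum(numeric_only=True)
df.set_index('recd').resample('M').count()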

I came up with an idea that relies on the underlying storage format of datetime64[ns]. If you define dcut() like this
def dcut(dts, freq='d', right=True):
    hi = pd.Period(dts.max(), freq=freq) + 1  # get first period past end of data
    periods = pd.PeriodIndex(start=dts.min(), end=hi, freq=freq)
    # get a list of integer bin boundaries representing ns-since-epoch
    # note the extra period gives us the extra right-hand bin boundary we need
    bounds = np.array(periods.to_timestamp(how='start'), dtype='int')
    # bin our time field as integers
    cut = pd.cut(np.array(dts, dtype='int'), bins=bounds, right=right)
    # relabel the bins using the periods, omitting the extra one at the end
    cut.levels = periods[:-1].format()
    return cut
Then we can do what I wanted:
df.groupby([dcut(df.recd, freq='m', right=False),dcut(df.ship, freq='m', right=False)]).count()
To get:
price qty recd ship
2012-07 2012-10 1 1 1 1
2012-11 2012-12 1 1 1 1
2013-03 1 1 1 1
2012-12 2012-09 1 1 1 1
2013-02 1 1 1 1
2013-01 2012-08 1 1 1 1
2013-02 2013-02 1 1 1 1
2013-03 2013-03 1 1 1 1
2013-04 2012-07 1 1 1 1
2013-03 1 1 1 1
I guess you could similarly define dqcut() which first "rounds" each datetime value to the integer representing the start of its containing period (at your specified frequency), and then uses qcut() to choose amongst those boundaries. Or do qcut() first on the raw integer values and round the resulting bins based on your chosen frequency?
No joy on the bonus question yet? :)
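On newer pandas, a simpler hedged sketch of the same monthly labelling uses Series.dt.to_period instead of the integer tricks above:
# label each timestamp with its containing monthly period, then group on the labels
df.groupby([df['recd'].dt.to_period('M'),
            df['ship'].dt.to_period('M')]).size()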

How about using a Series, putting the parts of the DataFrame that you're interested in into that, and then calling cut on the Series object?
price_series = pd.Series(df.price.tolist(), index=df.recd)
and then
pd.qcut(price_series, q=3)
and so on. (Though I think @Jeff's answer is best.)

Related

Condition on all rows of a groupby

Concerning this type of dataframe:
import pandas as pd
import datetime

df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 2, 3],
                   'Time': [datetime.date(2019, 12, 1), datetime.date(2019, 12, 5),
                            datetime.date(2019, 12, 8), datetime.date(2019, 8, 4),
                            datetime.date(2019, 11, 4), datetime.date(2019, 11, 4),
                            datetime.date(2019, 11, 3), datetime.date(2019, 12, 20)],
                   'Value': [2, 2, 2, 50, 7, 100, 7, 5]})
ID Time Value
0 1 2019-12-01 2
1 1 2019-12-05 2
2 1 2019-12-08 2
3 1 2019-08-04 50
4 2 2019-11-04 7
5 2 2019-11-04 100
6 2 2019-11-03 7
7 3 2019-12-20 5
I am interested only in the 3 latest values (by time), and I would like to keep only the IDs where these 3 values are all < 10.
So my desired output will look like this:
ID
0 1
Indeed, the value 50 for the first ID is only the fourth most recent value, so it doesn't count.
You could use a combination of query and groupby+size:
ids = df.query('Value < 10').groupby('ID')['Time'].size().ge(3)
ids[ids].reset_index().drop('Time', axis=1)
output:
ID
0 1
Alternative using filter (slower):
df.groupby('ID').filter(lambda g: len(g[g['Value'].lt(10)]['Time'].nlargest(3))>2)
output:
ID Time Value
0 1 2019-12-01 2
1 1 2019-12-05 2
2 1 2019-12-08 2
3 1 2019-08-04 50
and to get only the ID: add ['ID'].unique()
Within a groupby, I sort the group by time, use a boolean to flag whether the condition < 10 is satisfied, take the last 3 values only and sum that boolean, then check whether the sum is exactly 3.
grp = df.groupby("ID") \
        .apply(lambda x:
               x.sort_values("Time")["Value"].lt(10)[-3:].sum() == 3)
grp[grp]
ID
1 True
dtype: bool
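To pull just the qualifying IDs out of that boolean Series, a small hedged follow-up:
# keep the True entries and list their index values (the IDs)
qualifying_ids = grp[grp].index.tolist()   # [1] for the sample data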

How to check if there is a row with same value combinations in a dataframe?

I have a dataframe and want to create a new column based on other rows of the dataframe. My dataframe looks like
MitarbeiterID ProjektID Jahr Monat Week mean freq last
0 583 83224 2020 1 2 3.875 4 0
1 373 17364 2020 1 3 5.00 0 4
2 923 19234 2020 1 4 5.00 3 3
3 643 17364 2020 1 3 4.00 2 2
Now I want to check: if the freq of a row is zero, then I check whether there is another row with the same ProjektID, Jahr, and Week where the freq is not 0. If there is, I want a new column "other" with the value 1, and 0 otherwise.
So, the output should be
MitarbeiterID ProjektID Jahr Monat Week mean freq last other
0 583 83224 2020 1 2 3.875 4 0 0
1 373 17364 2020 1 3 5.00 0 4 1
2 923 19234 2020 1 4 5.00 3 3 0
3 643 17364 2020 1 3 4.00 2 2 0
This time I have no approach, can anyone help?
Thanks!
The following solution tests if the required conditions are True.
import io
import pandas as pd
Data
df = pd.read_csv(io.StringIO("""
MitarbeiterID ProjektID Jahr Monat Week mean freq last
0 583 83224 2020 1 2 3.875 4 0
1 373 17364 2020 1 3 5.00 0 4
2 923 19234 2020 1 4 5.00 3 3
3 643 17364 2020 1 3 4.00 2 2
"""), sep=r"\s\s+", engine="python")
Make a column other with all values zero.
df['other'] = 0
If ProjektID, Jahr and Week are duplicated and any of the freq values is larger than zero, then the rows that are duplicated (keep=False also captures the original duplicated row) and where freq is zero have their other column set to 1. Change any() to all() if you need all values to be larger than zero.
if (df.loc[df[['ProjektID', 'Jahr', 'Week']].duplicated(), 'freq'] > 0).any():
    df.loc[(df[['ProjektID', 'Jahr', 'Week']].duplicated(keep=False)) & (df['freq'] == 0), ['other']] = 1
else:
    print("Other stays zero")
Output: df now matches the desired output in the question, with other equal to 1 only for the row where ProjektID 17364 has freq 0.
I think the best way to solve this is not to use pandas too much :-) converting things to sets and tuples should make it fast enough.
The idea is to make a dictionary of all the triples (ProjektID, Jahr, Week) that appear in the dataset with freq != 0 and then check for all lines with freq == 0 if their triple belongs to this dictionary or not. In code, I'm creating a dummy dataset with:
x = pd.DataFrame(np.random.randint(0, 2, (8, 4)), columns=['id', 'year', 'week', 'freq'])
which in my case randomly gave:
>>> x
id year week freq
0 1 0 0 0
1 0 0 0 1
2 0 1 0 1
3 0 0 1 0
4 0 1 0 0
5 1 0 0 1
6 0 0 1 1
7 0 1 1 0
Now, we want triplets only where freq != 0, so we use
x1 = x.loc[x['freq'] != 0]
triplets = {tuple(row) for row in x1[['id', 'year', 'week']].values}
Note that I'm using x1[...].values, which is not a pandas DataFrame but rather a numpy array, so each row in there can be converted to a tuple. This is necessary because DataFrame rows, numpy arrays, and lists are mutable objects and cannot be hashed in a set otherwise. Using a set instead of e.g. a list (which doesn't have this restriction) is for efficiency.
Next, we define a boolean variable which is True if a triplet (id, year, week) belongs to the above set:
belongs = x[['id', 'year', 'week']].apply(lambda x: tuple(x) in triplets, axis=1)
We are basically done, this is the further column you want, except for also needing to force freq == 0:
x['other'] = np.logical_and(belongs, x['freq'] == 0).astype(int)
(the final .astype(int) is to have it values 0 and 1, as you were asking, instead of False and True). Final result in my case:
>>> x
id year week freq other
0 1 0 0 0 1
1 0 0 0 1 0
2 0 1 0 1 0
3 0 0 1 0 1
4 0 1 0 0 1
5 1 0 0 1 0
6 0 0 1 1 0
7 0 1 1 0 0
Looks like I am too late ...:
df.set_index(['ProjektID', 'Jahr', 'Week'], drop=True, inplace=True)
df['other'] = 0
df.other.mask(df.freq == 0,
df.freq[df.freq == 0].index.isin(df.freq[df.freq != 0].index),
inplace=True)
df.other = df.other.astype('int')
df.reset_index(drop=False, inplace=True)
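A hedged alternative, not from the answers above, that expresses the same check with one groupby/transform on the reset frame:
# for each (ProjektID, Jahr, Week) group, check whether any row has freq > 0,
# then flag only the rows of such groups where freq == 0
has_nonzero = df.groupby(['ProjektID', 'Jahr', 'Week'])['freq'].transform(lambda s: (s > 0).any())
df['other'] = ((df['freq'] == 0) & has_nonzero).astype(int)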

How to create average distribution of nested group by data in pandas?

I have a dataframe, df1, I am trying to extract and average distributions from:
ID site timestamp tracking_value
0 03 AMF 2018-01-01 1.0
1 08 AMF 2018-01-01 1.0
2 09 AMF 2018-01-01 1.0
3 14 ARR 2018-01-01 0.0
4 16 ARR 2018-01-01 0.0
5 21 AZM 2018-01-01 0.0
6 22 BII 2018-01-01 0.0
7 23 ARR 2018-01-01 0.0
8 26 AZM 2018-01-01 1.0
9 27 AMF 2018-01-01 1.0
...
...
For each ID group, for each site for that ID group, I want to get the distribution of lengths of consecutive tracking values. Then I want to average those site distributions for the ID, to produce the distribution of lengths of time that tracking_value was a dropout (0.0).
I have this working without the second group by (group by site), for only one ID:
import more_itertools as mit
import seaborn as sns
id = '03'
# Get the tracking_value data for ID 03
data = df1[df1['ID'] == id]['tracking_value']
# Get the "run length" for each value in data
distribution_of_run_lengths = list(mit.run_length.encode(data))
# Keep the distribution of run lengths for only the 0.0 values
distribution_of_run_lengths_for_zero = [x[1] for x in distribution_of_run_lengths if x[0] == 0.0]
# Plot the counts of run_lengths for each run_length value
sns.countplot(distribution_of_run_lengths_for_zero)
Which is fine for only one ID. Plot shows the number of times (yaxis) we had the dropout lengths (xaxis) for the ID 03:
However I need to extend this as mentioned above and have started by grouping by ID then site, but have been stuck on where to go from there:
data = df1.groupby(['ID','site'])['tracking_value']
Any suggestions on a way forward would be helpful. Thanks.
The following should do what you're looking for. Setup:
import numpy as np
import pandas as pd

dates = np.repeat(pd.date_range('2018-01-01', '2018-01-31'), 4)
np.random.seed(100)
test_df = pd.DataFrame({
    'date': dates,
    'site': ['A', 'A', 'B', 'B'] * (dates.shape[0] // 4),
    'id': [1, 2, 1, 2] * (dates.shape[0] // 4),
    'tracking_val': np.random.choice([0, 1], p=[0.4, 0.6], size=dates.shape)
})
Now perform the (many) groupby aggregations necessary to get what you want:
run_length_dict = {}  # place to hold results
for name, group in test_df.groupby(['site', 'id']):
    # Number all consecutive runs of 1 or 0
    runs = group['tracking_val'].ne(group['tracking_val'].shift()) \
                                .cumsum() \
                                .rename('run_number')   # Series.rename takes a name, not columns=
    # Group each run by its number *and* the tracking value, then find the length of that run
    run_lengths = runs.groupby([runs, group['tracking_val']]).agg('size')
    # One final groupby (this time on the tracking_val/level=1) to get the count of lengths
    # and push it into the dict with the name of the original group - a ("site", "id") tuple
    run_length_dict[name] = run_lengths.groupby(level=1).value_counts()
Result:
{('A', 1): tracking_val
0 1 2
2 1
3 1
4 1
5 1
1 1 3
2 3
6 1
dtype: int64, ('A', 2): tracking_val
0 1 5
2 2
3 1
4 1
1 1 6
2 1
3 1
4 1
dtype: int64, ('B', 1): tracking_val
0 1 6
2 2
1 2 4
1 2
4 2
3 1
dtype: int64, ('B', 2): tracking_val
0 1 5
2 2
3 2
1 1 5
2 2
3 1
4 1
dtype: int64}
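If you then want to average those per-site distributions for each ID (the second half of the question), one hedged sketch building on run_length_dict above:
# stack the per-(site, id) Series into one Series with a 4-level index:
# site, id, tracking_val, run length
combined = pd.concat(run_length_dict, names=['site', 'id'])
# average the counts over sites for each (id, tracking_val, run length)
avg_per_id = combined.groupby(level=[1, 2, 3]).mean()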

Filtering a DataFrame for data of id's of which the values are decreasing over time

I have a large time series dataset of patient results. A single patient has one ID with various result values. The data is sorted by date and ID. I want to look only at patients whose values are strictly descending over time. For example, a patient with result values 5, 3, 2, 1 would be true; however, 5, 3, 6, 7, 1 would be false.
Example data:
import pandas as pd
df = pd.read_excel(...)
print(df.head())
PSA PSAdate PatientID ... datefirstinject ADTkey RT_PSAbin
0 2.40 2007-06-26 11448 ... 2006-08-05 00:00:00 1 14
1 0.04 2007-09-26 11448 ... 2006-08-05 00:00:00 1 15
2 2.30 2008-01-14 11448 ... 2006-08-05 00:00:00 1 17
3 4.03 2008-04-16 11448 ... 2006-08-05 00:00:00 1 18
4 6.70 2008-07-01 11448 ... 2006-08-05 00:00:00 1 19
So for this example, I want to only see lines with PatientIDs for which the PSA Value is decreasing over time.
groupID = df.groupby('PatientID')

def is_desc(d):
    for i in range(len(d) - 1):
        if d[i] > d[i+1]:
            return False
    return True

x = groupID.PSA.apply(is_desc)
df['is_desc'] = groupID.PSA.transform(is_desc)

# patients whose PSA values are decreasing over time
df1 = df[df['is_desc']]
I get:
KeyError: 0
I suppose the loop can't make its way through the grouped values, as it requires an array to find the 'range'.
Any ideas for editing the loop?
TL;DR
# (see is_desc function definition below)
df['is_desc'] = df.groupby('PatientID').PSA.transform(is_desc)
df[df['is_desc']]
Explanation
Let's use a very simple data set:
df = pd.DataFrame({'id': [1,2,1,3,3,1], 'res': [3,1,2,1,5,1]})
It only contains the id and one value column (and it has an index automatically assigned from pandas).
So if you just want to get a list of all ids whose values are descending, we can group the values by the id, then check if the values in the group are descending, then filter the list for just ids with descending values.
So first let's define a function that checks if the values are descending:
def is_desc(d):
first = True
for i in d:
if first:
first = False
else:
if i >= last:
return False
last = i
return True
(yes, this could probably be done more elegantly, you can search online for a better implementation)
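For instance, a hedged vectorised sketch using diff (strictly descending means every consecutive difference is negative):
def is_desc_v2(s):
    # the NaN from the first diff is dropped; all() on an empty result is True
    return s.diff().dropna().lt(0).all()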
now we group by the id:
gb = df.groupby('id')
and apply the function:
x = gb.res.apply(is_desc)
x now holds this Series:
id
1 True
2 True
3 False
dtype: bool
so now if you want to filter this you can just do this:
x[x].index
which you can of course convert to a normal list like that:
list(x[x].index)
which would give you a list of all ids of which the values are descending. in this case:
[1, 2]
But if you want to also have all the original data for all those chosen ids do it like this:
df['is_desc'] = gb.res.transform(is_desc)
so now df has all the original data it had in the beginning, plus a column that tells for each line whether its id's values are descending:
id res is_desc
0 1 3 True
1 2 1 True
2 1 2 True
3 3 1 False
4 3 5 False
5 1 1 True
Now you can very easily filter this like that:
df[df['is_desc']]
which is:
id res is_desc
0 1 3 True
1 2 1 True
2 1 2 True
5 1 1 True
Selecting and sorting your data is quite easy and objective. However, deciding whether or not a patient's data is declining can be subjective, so it is best to decide on a criterion beforehand for what counts as declining.
To sort and select:
import pandas as pd

data = [['pat_1', 10, 1],
        ['pat_1', 9, 2],
        ['pat_2', 11, 2],
        ['pat_1', 4, 5],
        ['pat_1', 2, 6],
        ['pat_2', 10, 1],
        ['pat_1', 7, 3],
        ['pat_1', 5, 4],
        ['pat_2', 20, 3]]

df = pd.DataFrame(data).rename(columns={0: 'Patient', 1: 'Result', 2: 'Day'})
print(df)

df_pat1 = df[df['Patient'] == 'pat_1']
print(df_pat1)

df_pat1_sorted = df_pat1.sort_values(['Day']).reset_index(drop=True)
print(df_pat1_sorted)
returns:
df:
Patient Result Day
0 pat_1 10 1
1 pat_1 9 2
2 pat_2 11 2
3 pat_1 4 5
4 pat_1 2 6
5 pat_2 10 1
6 pat_1 7 3
7 pat_1 5 4
8 pat_2 20 3
df_pat1
Patient Result Day
0 pat_1 10 1
1 pat_1 9 2
3 pat_1 4 5
4 pat_1 2 6
6 pat_1 7 3
7 pat_1 5 4
df_pat1_sorted
Patient Result Day
0 pat_1 10 1
1 pat_1 9 2
2 pat_1 7 3
3 pat_1 5 4
4 pat_1 4 5
5 pat_1 2 6
For the purposes of this answer, I am going to say that if the first value of the new DataFrame is larger than the last, then their values are declining:
if df_pat1_sorted['Result'].values[0] > df_pat1_sorted['Result'].values[-1]:
    print("Patient 1's values are declining")
This returns:
Patient 1's values are declining
There is a better way if you have many unique IDs (as I'm sure you do) of iterating through your patients. I shall present an example using integers, however you may need to use regex if your patient IDs include characters.
import pandas as pd
import numpy as np

min_ID = 1003
max_ID = 1005
patients = np.random.randint(min_ID, max_ID, size=10)
df = pd.DataFrame(patients).rename(columns={0: 'Patients'})
print(df)

s = pd.Series(df['Patients']).unique()
print(s)

for i in range(len(s)):
    print(df[df['Patients'] == s[i]])
returns:
Patients
0 1004
1 1004
2 1004
3 1003
4 1003
5 1003
6 1003
7 1004
8 1003
9 1003
[1004 1003] # s (the unique values in the df['Patients'])
Patients
3 1003
4 1003
5 1003
6 1003
8 1003
9 1003
Patients
0 1004
1 1004
2 1004
7 1004
I hope this has helped!
This should solve your question, interpreting 'decreasing' as monotonic decreasing:
import pandas as pd
d = {"PatientID": [1,1,1,1,2,2,2,2],
"PSAdate": [2010,2011,2012,2013,2010,2011,2012,2013],
"PSA": [5,3,2,1,5,3,4,5]}
# Sorts by id and date
df = pd.DataFrame(data=d).sort_values(['PatientID', 'PSAdate'])
# Computes change and max(change) between sequential PSA's
df["change"] = df.groupby('PatientID')["PSA"].diff()
df["max_change"] = df.groupby('PatientID')['change'].transform('max')
# Considers only patients whose PSA are monotonic decreasing
df = df.loc[df["max_change"] <= 0]
print(df)
PatientID PSAdate PSA change max_change
0 1 2010 5 NaN -1.0
1 1 2011 3 -2.0 -1.0
2 1 2012 2 -1.0 -1.0
3 1 2013 1 -1.0 -1.0
Note: to consider only strictly monotonic decreasing PSA, change the final loc condition to < 0
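A hedged aside, not from the answers above: pandas also exposes Series.is_monotonic_decreasing, which states the (non-strict) check directly:
# keep only patients whose PSA never increases over time
# (non-strict: equal consecutive values still pass; use a diff() < 0 check for strict)
df_sorted = df.sort_values(['PatientID', 'PSAdate'])
mask = df_sorted.groupby('PatientID')['PSA'].transform(lambda s: s.is_monotonic_decreasing)
print(df_sorted[mask])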

Add Multiple Columns to Pandas Dataframe from Function

I have a pandas data frame mydf that has two columns, and both columns are datetime datatypes: mydate and mytime. I want to add three more columns: hour, weekday, and weeknum.
def getH(t):  # gives the hour
    return t.hour

def getW(d):  # gives the week number
    return d.isocalendar()[1]

def getD(d):  # gives the weekday
    return d.weekday()  # 0 for Monday, 6 for Sunday

mydf["hour"] = mydf.apply(lambda row: getH(row["mytime"]), axis=1)
mydf["weekday"] = mydf.apply(lambda row: getD(row["mydate"]), axis=1)
mydf["weeknum"] = mydf.apply(lambda row: getW(row["mydate"]), axis=1)
The snippet works, but it's not computationally efficient as it loops through the data frame at least three times. I would just like to know if there's a faster and/or more optimal way to do this. For example, using zip or merge? If, for example, I just create one function that returns three elements, how should I implement this? To illustrate, the function would be:
def getHWd(d,t):
return t.hour, d.isocalendar()[1], d.weekday()
Here's one approach to do it using one apply.
Say, df is like
In [64]: df
Out[64]:
mydate mytime
0 2011-01-01 2011-11-14
1 2011-01-02 2011-11-15
2 2011-01-03 2011-11-16
3 2011-01-04 2011-11-17
4 2011-01-05 2011-11-18
5 2011-01-06 2011-11-19
6 2011-01-07 2011-11-20
7 2011-01-08 2011-11-21
8 2011-01-09 2011-11-22
9 2011-01-10 2011-11-23
10 2011-01-11 2011-11-24
11 2011-01-12 2011-11-25
We'll take the lambda function out to separate line for readability and define it like
In [65]: lambdafunc = lambda x: pd.Series([x['mytime'].hour,
x['mydate'].isocalendar()[1],
x['mydate'].weekday()])
And, apply and store the result to df[['hour', 'weeknum', 'weekday']] (the column order here matches the order of the values in the Series returned by lambdafunc)
In [66]: df[['hour', 'weeknum', 'weekday']] = df.apply(lambdafunc, axis=1)
And, the output is like
In [67]: df
Out[67]:
mydate mytime hour weeknum weekday
0 2011-01-01 2011-11-14 0 52 5
1 2011-01-02 2011-11-15 0 52 6
2 2011-01-03 2011-11-16 0 1 0
3 2011-01-04 2011-11-17 0 1 1
4 2011-01-05 2011-11-18 0 1 2
5 2011-01-06 2011-11-19 0 1 3
6 2011-01-07 2011-11-20 0 1 4
7 2011-01-08 2011-11-21 0 1 5
8 2011-01-09 2011-11-22 0 1 6
9 2011-01-10 2011-11-23 0 2 0
10 2011-01-11 2011-11-24 0 2 1
11 2011-01-12 2011-11-25 0 2 2
To complement John Galt's answer:
Depending on the task that is performed by lambdafunc, you may experience some speedup by storing the result of apply in a new DataFrame and then joining with the original:
lambdafunc = lambda x: pd.Series([x['mytime'].hour,
                                  x['mydate'].isocalendar()[1],
                                  x['mydate'].weekday()])
newcols = df.apply(lambdafunc, axis=1)
newcols.columns = ['hour', 'weeknum', 'weekday']   # match the order returned by lambdafunc
newdf = df.join(newcols)
Even if you do not see a speed improvement, I would recommend using the join. You will be able to avoid the (always annoying) SettingWithCopyWarning that may pop up when assigning directly on the columns:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
You can do this in a somewhat cleaner method by having the function you apply return a pd.Series with named elements:
def process(row):
return pd.Series(dict(b=row["a"] * 2, c=row["a"] + 2))
my_df = pd.DataFrame(dict(a=range(10)))
new_df = my_df.join(my_df.apply(process, axis="columns"))
The result is:
a b c
0 0 0 2
1 1 2 3
2 2 4 4
3 3 6 5
4 4 8 6
5 5 10 7
6 6 12 8
7 7 14 9
8 8 16 10
9 9 18 11
def getWd(d):
    # return a (weekday, weeknum) tuple so it can be unpacked below
    return d.weekday(), d.isocalendar()[1]

def getH(t):
    return t.hour

mydf["hour"] = mydf["mytime"].map(getH)
mydf["weekday"], mydf["weeknum"] = zip(*mydf["mydate"].map(getWd))
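Beyond the answers above, a hedged aside: if mydate and mytime are true datetime64 columns, the vectorised .dt accessor avoids apply/map entirely (the isocalendar() form assumes pandas >= 1.1):
mydf["hour"] = mydf["mytime"].dt.hour
mydf["weekday"] = mydf["mydate"].dt.weekday              # 0 = Monday, 6 = Sunday
mydf["weeknum"] = mydf["mydate"].dt.isocalendar().week   # ISO week number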
