python: what is wrong with my date index?

I have a dataframe that uses dates as index. Although I can read the index values from series.index, I fail to get the corresponding record.
series = pd.DataFrame([[datetime.date(2019,1,1), 'A', 4], [datetime.date(2019,1,2), 'B', 6]], columns = ('Date', 'Index', 'Value'))
series2 = series.pivot(index='Date', columns='Index', values='Value')
index = series2.index[0]
So far, everything works.
But this line of code fails:
row = series[index]
The error message is
KeyError: datetime.date(2019, 1, 1)
Why does it fail, and how can I fix it?

Use DataFrame.loc for selection, but on series2, because series has a RangeIndex, not dates:
row = series2.loc[index]
print (row)
Index
A 4.0
B NaN
Name: 2019-01-01, dtype: float64
Details:
print (series)
Date Index Value
0 2019-01-01 A 4
1 2019-01-02 B 6
print (series.index)
RangeIndex(start=0, stop=2, step=1)
print (series2)
Index A B
Date
2019-01-01 4.0 NaN
2019-01-02 NaN 6.0
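As a side note, series[index] raises KeyError because indexing a DataFrame with a single label, df[label], looks the label up in the columns, not the rows. A minimal illustration using the same frame as above:
import datetime
import pandas as pd

series = pd.DataFrame([[datetime.date(2019, 1, 1), 'A', 4],
                       [datetime.date(2019, 1, 2), 'B', 6]],
                      columns=('Date', 'Index', 'Value'))

print(series['Value'])                  # column lookup works
# series[datetime.date(2019, 1, 1)]     # KeyError: the date is not a column label
# series.loc[0]                         # row lookup goes through .loc / .iloc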

Add this part after your three lines:
series.set_index('Date', inplace=True)
So, the whole thing is:
import pandas as pd
import datetime
series = pd.DataFrame([[datetime.date(2019,1,1), 'A', 4],
[datetime.date(2019,1,2), 'B', 6]],
columns = ('Date', 'Index', 'Value'))
series2 = series.pivot(index='Date', columns='Index',
values='Value')
index = series2.index[0]
series.set_index('Date', inplace=True) # this part was added
series.loc[index]
Out[57]:
Index A
Value 4
Name: 2019-01-01, dtype: object

Related

How to prevent data from being recycled when using pd.merge_asof in Python

I am looking to join two data frames using the pd.merge_asof function. This function allows me to match data on a unique id and/or a nearest key. In this example, I am matching on the id as well as the nearest date that is less than or equal to the date in df1.
Is there a way to prevent the data from df2 being recycled when joining?
This is the code that I currently have that recycles the values in df2.
import pandas as pd
import datetime as dt
df1 = pd.DataFrame({'date': [dt.datetime(2020, 1, 2), dt.datetime(2020, 2, 2), dt.datetime(2020, 3, 2)],
'id': ['a', 'a', 'a']})
df2 = pd.DataFrame({'date': [dt.datetime(2020, 1, 1)],
'id': ['a'],
'value': ['1']})
pd.merge_asof(df1,
df2,
on='date',
by='id',
direction='backward',
allow_exact_matches=True)
This is the output that I would like to see instead, where only the first match is successful.
Since your merge direction is backward, you can mask on duplicated id and df2's date after the merge_asof:
out = pd.merge_asof(df1,
df2.rename(columns={'date':'date1'}), # rename df2's date
left_on='date',
right_on='date1', # so we can work on it later
by='id',
direction='backward',
allow_exact_matches=True)
# mask the value
out['value'] = out['value'].mask(out.duplicated(['id','date1']))
# equivalently (requires import numpy as np)
# out.loc[out.duplicated(['id', 'date1']), 'value'] = np.nan
Output:
date id date1 value
0 2020-01-02 a 2020-01-01 1
1 2020-02-02 a 2020-01-01 NaN
2 2020-03-02 a 2020-01-01 NaN
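For reference, here is the same approach assembled into one self-contained snippet using the question's frames (nothing new, just the pieces above put together):
import datetime as dt
import pandas as pd

df1 = pd.DataFrame({'date': [dt.datetime(2020, 1, 2), dt.datetime(2020, 2, 2), dt.datetime(2020, 3, 2)],
                    'id': ['a', 'a', 'a']})
df2 = pd.DataFrame({'date': [dt.datetime(2020, 1, 1)],
                    'id': ['a'],
                    'value': ['1']})

out = pd.merge_asof(df1,
                    df2.rename(columns={'date': 'date1'}),  # keep df2's date so it can be checked for duplicates
                    left_on='date',
                    right_on='date1',
                    by='id',
                    direction='backward',
                    allow_exact_matches=True)

# keep only the first match per (id, date1); later rows reuse the same df2 row, so blank them out
out['value'] = out['value'].mask(out.duplicated(['id', 'date1']))
print(out)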

Converting multiple columns to datetime using iloc or loc

I am unsure if this is the expected behavior, but below is an example dataframe.
df = pd.DataFrame([['2020-01-01','2020-06-30','A'],
['2020-07-01','2020-12-31','B']],
columns = ['start_date', 'end_date', 'field1'])
Before I upgraded to pandas version 1.3.4, I believe I was able to convert column dtypes like this:
df.iloc[:,0:2] = df.iloc[:,0:2].apply(pd.to_datetime)
Although it appears to have converted the columns to datetime,
start_date end_date field1
0 2020-01-01 00:00:00 2020-06-30 00:00:00 A
1 2020-07-01 00:00:00 2020-12-31 00:00:00 B
The dtypes appear to still be objects:
start_date object
end_date object
field1 object
I know I am able to do the same thing using the code below; I am just wondering if this is the intended behavior of both loc and iloc.
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(pd.to_datetime)
start_date datetime64[ns]
end_date datetime64[ns]
field1 object
This behaviour is part of the changes introduced in 1.3.0.
Try operating inplace when setting values with loc and iloc
When setting an entire column using loc or iloc, pandas will try to insert the values into the existing data rather than create an entirely new array.
This means that iloc and loc will try not to change the dtype of the existing array if the new values fit into the existing type:
import pandas as pd
df = pd.DataFrame({'A': [1.2, 2.3], 'B': [3.4, 4.5]})
print(df)
print(df.dtypes)
df.loc[:, 'A':'B'] = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df)
print(df.dtypes)
Output:
A B
0 1.2 3.4
1 2.3 4.5
A float64
B float64
dtype: object
A B
0 1.0 3.0
1 2.0 4.0
A float64
B float64
dtype: object
Conversely:
Never operate inplace when setting frame[keys] = values:
When setting multiple columns using frame[keys] = values new arrays will replace pre-existing arrays for these keys, which will not be over-written (GH39510). As a result, the columns will retain the dtype(s) of values, never casting to the dtypes of the existing arrays.
import pandas as pd
df = pd.DataFrame({'A': [1.2, 2.3], 'B': [3.4, 4.5]})
print(df)
print(df.dtypes)
df[['A', 'B']] = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df)
print(df.dtypes)
Output:
A B
0 1.2 3.4
1 2.3 4.5
A float64
B float64
dtype: object
A B
0 1 3
1 2 4
A int64
B int64
dtype: object
With these changes in mind, we now have to do something like:
import pandas as pd
df = pd.DataFrame([['2020-01-01', '2020-06-30', 'A'],
['2020-07-01', '2020-12-31', 'B']],
columns=['start_date', 'end_date', 'field1'])
cols = df.columns[0:2]
df[cols] = df[cols].apply(pd.to_datetime)
# or
# df[df.columns[0:2]] = df.iloc[:, 0:2].apply(pd.to_datetime)
print(df)
print(df.dtypes)
Output:
start_date end_date field1
0 2020-01-01 2020-06-30 A
1 2020-07-01 2020-12-31 B
start_date datetime64[ns]
end_date datetime64[ns]
field1 object
dtype: object
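If you prefer to avoid the positional slice on the left-hand side entirely, a per-column loop works as well, since assigning with df[col] = ... always replaces the column's array. A small sketch on the same data:
import pandas as pd

df = pd.DataFrame([['2020-01-01', '2020-06-30', 'A'],
                   ['2020-07-01', '2020-12-31', 'B']],
                  columns=['start_date', 'end_date', 'field1'])

# df[col] = ... replaces the underlying array, so the new datetime64 dtype is kept
for col in df.columns[:2]:
    df[col] = pd.to_datetime(df[col])

print(df.dtypes)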

Counting the number of events for each user across two dataframes

I'm attempting to count, for each user in a table, the number of events that occurred in the past. I have two dataframes: one with a row per user at a specific point 'T' in time, and one with a row per event, which also occurs in time.
This is an example of the user table:
ID_CLIENT START_DATE
0 A 2015-12-31
1 A 2016-12-31
2 A 2017-12-31
3 B 2016-12-31
This is an example of the event table:
ID_CLIENT DATE_EVENT
0 A 2017-01-01
1 A 2017-05-01
2 A 2018-02-01
3 A 2016-05-02
4 B 2015-01-01
The idea is that, for each line in the "user" table, I want the count of events that occurred before the date registered in "START_DATE".
Example of the final result:
ID_CLIENT START_DATE nb_event_tot
0 A 2015-12-31 0
1 A 2016-12-31 1
2 A 2017-12-31 3
3 B 2016-12-31 1
I have created a function which leverages pandas' ".apply", but it's too slow. If anyone has an idea on how to speed it up, it would be gladly appreciated. I have 800k lines of users and 200k lines of events, which takes up to 3 hours with the apply method.
Here is my code to reproduce:
import pandas as pd
def check_below_df(row, df_events, col_event):
    # Select the ids
    id_c = row['ID_CLIENT']
    date = row['START_DATE']
    # Select subset of events df
    sub_df_events = df_events.loc[df_events['ID_CLIENT'] == id_c, :]
    sub_df_events = sub_df_events.loc[sub_df_events[col_event] <= date, :]
    count = len(sub_df_events)
    return count

def count_events(df_clients: pd.DataFrame, df_event: pd.DataFrame, col_event_date: str = 'DATE_EVENEMENT',
                 col_start_date: str = 'START_DATE', col_end_date: str = 'END_DATE',
                 col_event: str = 'nb_sin', events=['compensation']):
    df_clients_cp = df_clients[["ID_CLIENT", col_start_date]].copy()
    df_event_cp = df_event.copy()
    df_event_cp[col_event] = 1
    # TOTAL
    df_clients_cp[f'{col_event}_tot'] = df_clients_cp.apply(
        lambda row: check_below_df(row, df_event_cp, col_event_date), axis=1)
    return df_clients_cp
# ------------------------------------------------------------------
# ------------------------------------------------------------------
df_users = pd.DataFrame(data={
'ID_CLIENT': ['A', 'A', 'A', 'B'],
'START_DATE': ['2015-12-31', '2016-12-31', '2017-12-31', '2016-12-31'],
})
df_users["START_DATE"] = pd.to_datetime(df_users["START_DATE"])
df_events = pd.DataFrame(data={
'ID_CLIENT': ['A', 'A', 'A', 'A', 'B'],
'DATE_EVENT': ['2017-01-01', '2017-05-01', '2018-02-01', '2016-05-02', '2015-01-01']
})
df_events["DATE_EVENT"] = pd.to_datetime(df_events["DATE_EVENT"])
tmp = count_events(df_users, df_events, col_event_date='DATE_EVENT', col_event='nb_event')
tmp
Thanks for your help.
I guess the slow execution is caused by pd.apply(axis=1), which is explained here.
I expect you can improve the execution time by using functions that are not applied row-wise, for instance merge and groupby.
First we merge the frames:
df_merged = pd.merge(df_users, df_events, on='ID_CLIENT', how='left')
Then we check where DATE_EVENT <= START_DATE for the entire frame:
df_merged.loc[:, 'before'] = df_merged['DATE_EVENT'] <= df_merged['START_DATE']
Then we group by CLIENT_ID and START_DATE, and sum the 'before' column:
df_grouped = df_merged.groupby(by=['ID_CLIENT', 'START_DATE'])
df_out = df_grouped['before'].sum() # returns a series
Finally we convert df_out (a series) back to a dataframe, renaming the new column to 'nb_event_tot', and subsequently reset the index to get your desired output:
df_out = df_out.to_frame('nb_event_tot')
df_out = df_out.reset_index()
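Putting those steps together on the example frames from the question (this is just the pieces above assembled into one runnable snippet):
import pandas as pd

df_users = pd.DataFrame({
    'ID_CLIENT': ['A', 'A', 'A', 'B'],
    'START_DATE': pd.to_datetime(['2015-12-31', '2016-12-31', '2017-12-31', '2016-12-31']),
})
df_events = pd.DataFrame({
    'ID_CLIENT': ['A', 'A', 'A', 'A', 'B'],
    'DATE_EVENT': pd.to_datetime(['2017-01-01', '2017-05-01', '2018-02-01', '2016-05-02', '2015-01-01']),
})

# one row per (user row, event of that user), then flag events at or before START_DATE
df_merged = pd.merge(df_users, df_events, on='ID_CLIENT', how='left')
df_merged['before'] = df_merged['DATE_EVENT'] <= df_merged['START_DATE']

# count the flagged events per user row
df_out = (df_merged.groupby(['ID_CLIENT', 'START_DATE'])['before']
                   .sum()
                   .to_frame('nb_event_tot')
                   .reset_index())
print(df_out)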

pandas: populate df column with values matching index and column in another df

I am facing a problem that I am incapable of finding a way around.
I also find it very difficult to explain what I am trying to do, so hopefully a small example will help.
I have df1 as such:
Id product_1 product_2
Date
1 0.1855672 0.8855672
2 0.1356667 0.0356667
3 1.1336686 1.7336686
4 0.9566671 0.6566671
and I have df2 as such:
product_1 Month
Date
2018-03-30 11.0 3
2018-04-30 18.0 4
2019-01-29 14.0 1
2019-02-28 22.0 2
and what I am trying to achieve is this in df2:
product_1 Month seasonal_index
Date
2018-03-30 11.0 3 1.1336686
2018-04-30 18.0 4 0.9566671
2019-01-29 14.0 1 0.1855672
2019-02-28 22.0 2 0.1356667
So what I am trying to do is match the product name in df2 with the corresponding column in df1, and then get the value for each index value that matches the month number in df2.
I have tried doing things like:
for i in df1:
df2['seasonal_index'] = df1.loc[df1.iloc[:,i] == df2['Month']]
but with no success. Hopefully someone has a clue on how to unblock the situation.
Here you are, my friend; this produces exactly the output you specified.
import pandas as pd
# replicate df1
data1 = [[0.1855672, 0.8855672],
[0.1356667, 0.0356667],
[1.1336686, 1.7336686],
[0.9566671, 0.6566671]]
index1 = [1, 2, 3, 4]
df = pd.DataFrame(data=data1,
index= index1,
columns=['product_1', 'product_2'])
df.columns.name = 'Id'
df.index.name = 'Date'
# replicate df2
data2 = [[11.0, 3],
[18.0, 4],
[14.0, 1],
[22.0, 2]]
index2 = [pd.Timestamp('2018-03-30'),
pd.Timestamp('2018-04-30'),
pd.Timestamp('2019-01-29'),
pd.Timestamp('2019-02-28')]
df2 = pd.DataFrame(data=data2, index=index2,
columns=['product_1', 'Month'])
df2.index.name = 'Date'
# Merge your data
df3 = pd.merge(left=df2, right=df[['product_1']],
left_on='Month',
right_index=True,
how='outer',
suffixes=('', '_df2'))
df3 = df3.rename(columns={'product_1_df2': 'seasonal_index'})
print(df3)
If you are interested in learning why this works, take a look at this link explaining the pandas.merge function. Notice specifically that for your dataframes, the key for df2 is one of its columns (so we use the left_on parameter in pd.merge) and the key for df is its index (so we use the right_index parameter in pd.merge).
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
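As a shorter alternative (not from the answer above): since df's index is the month number, you can also map the Month column directly against the product_1 column with Series.map. A sketch assuming the same df and df2 as constructed above:
# df holds the seasonal factors indexed by month number, df2 the monthly data
df2['seasonal_index'] = df2['Month'].map(df['product_1'])
print(df2)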

Business days between two columns of dates with Pandas Groupby

I have a Dataframe in Pandas with a letter and two dates as columns. I would like to calculate the difference between the two date columns for the previous row, using shift(1), provided that the Letter value is the same (using a groupby). The complex part is that I would like to calculate business days, not just elapsed days. The best way I have found to do that is numpy.busday_count, which takes two lists as arguments. I am essentially trying to use .apply to make each row its own list. Not sure if this is the best way to do it, but I am running into some problems, which are ambiguous.
import pandas as pd
from datetime import datetime
import numpy as np
# create dataframe
df = pd.DataFrame(data=[['A', datetime(2016,01,07), datetime(2016,01,09)],
['A', datetime(2016,03,01), datetime(2016,03,8)],
['B', datetime(2016,05,01), datetime(2016,05,10)],
['B', datetime(2016,06,05), datetime(2016,06,07)]],
columns=['Letter', 'First Day', 'Last Day'])
# convert to dates since pandas reads them in as time series
df['First Day'] = df['First Day'].apply(lambda x: x.to_datetime().date())
df['Last Day'] = df['Last Day'].apply(lambda x: x.to_datetime().date())
df['Gap'] = (df.groupby('Letter')
.apply(
lambda x: (
np.busday_count(x['First Day'].shift(1).tolist(),
x['Last Day'].shift(1).tolist())))
.reset_index(drop=True))
print df
I get the following error on the lambda function. I'm not sure what object it's having problems with as the two passed arguments should be dates:
ValueError: Could not convert object to NumPy datetime
Desired Output:
Letter First Day Last Day Gap
0 A 2016-01-07 2016-01-09 NAN
1 A 2016-03-01 2016-03-08 1
2 B 2016-05-01 2016-05-10 NAN
3 B 2016-06-05 2016-06-07 7
The following should work (first removing the leading zeros from the date digits):
df = pd.DataFrame(data=[['A', datetime(2016, 1, 7), datetime(2016, 1, 9)],
['A', datetime(2016, 3, 1), datetime(2016, 3, 8)],
['B', datetime(2016, 5, 1), datetime(2016, 5, 10)],
['B', datetime(2016, 6, 5), datetime(2016, 6, 7)]],
columns=['Letter', 'First Day', 'Last Day'])
df['Gap'] = (df.groupby('Letter')
               .apply(lambda x: pd.DataFrame(
                   np.busday_count(x['First Day'].tolist(),
                                   x['Last Day'].tolist())).shift())
               .reset_index(drop=True))
Letter First Day Last Day Gap
0 A 2016-01-07 2016-01-09 NaN
1 A 2016-03-01 2016-03-08 2.0
2 B 2016-05-01 2016-05-10 NaN
3 B 2016-06-05 2016-06-07 6.0
I don't think you need the .date() conversion.
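For completeness, the same shifted business-day calculation can be written without apply by shifting the date columns within each group first. A sketch (it reproduces the 2.0 / 6.0 values shown above, not the question's hand-written Gap column):
import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame(data=[['A', datetime(2016, 1, 7), datetime(2016, 1, 9)],
                        ['A', datetime(2016, 3, 1), datetime(2016, 3, 8)],
                        ['B', datetime(2016, 5, 1), datetime(2016, 5, 10)],
                        ['B', datetime(2016, 6, 5), datetime(2016, 6, 7)]],
                  columns=['Letter', 'First Day', 'Last Day'])

# previous row's dates within each Letter group (NaT for the first row of each group)
prev_first = df.groupby('Letter')['First Day'].shift(1)
prev_last = df.groupby('Letter')['Last Day'].shift(1)

# busday_count cannot handle NaT, so only compute where a previous row exists
mask = prev_first.notna()
df['Gap'] = np.nan
df.loc[mask, 'Gap'] = np.busday_count(
    prev_first[mask].values.astype('datetime64[D]'),
    prev_last[mask].values.astype('datetime64[D]'))
print(df)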
