Interval-matching PeriodIndex with DatetimeIndex during merge

Given the below DataFrames, I'd like to add series 'bar' from df_other2 into df2, so that the period of df2 (which I understand as an interval) "matches" the datetime index (not an interval) of df_other2 (also called "period", but is really a datetime). The matching criteria should be that df_other2.period is within df2's period (i.e. date is within the interval).
I was hoping that defining the target index as PeriodIndex and the source index as DatetimeIndex would be sufficient to do the matching, but that doesn't seem to be the case. What alternatives do I have to get this to work?
>>> df = pd.DataFrame({'period': pd.PeriodIndex(['2012-01', '2012-02', '2012-03'], freq='M'), 'foo': [1, 2, 3]})
>>> df2 = df.set_index('period')
>>> df2
foo
period
2012-01 1
2012-02 2
2012-03 3
>>> df_other = pd.DataFrame({'period': [datetime.datetime(2012, 1, 1), datetime.datetime(2012, 2, 3), datetime.datetime(2012, 3, 10)], 'bar': ['a', 'b', 'c']})
>>> df_other2 = df_other.set_index('period')
>>> df_other2
bar
period
2012-01-01 a
2012-02-03 b
2012-03-10 c
>>> df2['x'] = df_other2['bar']
>>> df2
foo x
period
2012-01 1 NaN # expected x='a' as '2012-1-1' is part of this period
2012-02 2 NaN # expected x='b'
2012-03 3 NaN # expected x='c'

I decided to align all df_other2.period with df2.period (and make them DatetimeIndex) and merge as usual.
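A minimal sketch of that alignment (assuming monthly frequency; here I go the other way and convert df_other2's dates to monthly periods rather than converting df2 to timestamps, which is just one way to line the two indexes up):
import pandas as pd
df2 = pd.DataFrame({'foo': [1, 2, 3]},
                   index=pd.PeriodIndex(['2012-01', '2012-02', '2012-03'], freq='M', name='period'))
df_other2 = pd.DataFrame({'bar': ['a', 'b', 'c']},
                         index=pd.DatetimeIndex(['2012-01-01', '2012-02-03', '2012-03-10'], name='period'))
# Collapse each timestamp to the monthly period that contains it;
# the two indexes then match exactly and a plain join works.
aligned = df_other2.copy()
aligned.index = aligned.index.to_period('M')
print(df2.join(aligned))
#          foo bar
# period
# 2012-01    1   a
# 2012-02    2   b
# 2012-03    3   c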
I'll wait for better support in future releases:
For regular time spans, pandas uses Period objects for scalar values
and PeriodIndex for sequences of spans. Better support for irregular
intervals with arbitrary start and end points are forth-coming in
future releases.
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-stamps-vs-time-spans

Related

pandas: how to merge columns irrespective of index

I have two dataframes with meaningless indexes, but a carefully curated order, and I want to merge them while preserving that order. So, for example:
>>> df1
First
a 1
b 3
and
>>> df2
Second
c 2
d 4
After merging, what I want to obtain is this:
>>> Desired_output
First Second
AnythingAtAll 1 2 # <--- Row Names are meaningless.
SeriouslyIDontCare 3 4 # <--- But the ORDER of the rows is critical and must be preserved.
The fact that I've got row-indices "a/b" and "c/d" is irrelevant, but what is crucial is the order in which the rows appear. Every version of "join" I've seen requires me to manually reset indices, which seems really awkward, and I don't trust that it won't screw up the ordering. I thought concat would work, but I get this:
>>> pd.concat([df1, df2], axis=1, ignore_index=True)
0 1
a 1.0 NaN
b 3.0 NaN
c NaN 2.0
d NaN 4.0
# ^ obviously not what I want.
Even when I explicitly declare ignore_index.
How do I "overrule" the indexing and force the columns to be merged with the rows kept in the exact order that I supply them?
Edit:
Note that if I assign another column, the results are all "NaN".
>>> df1["second"]=df2["Second"]
>>> df1
First second
a 1 NaN
b 3 NaN
This was screwing me up but thanks to the suggestion from jsmart and topsail, you can dereference the indices by directly accessing the values in the column:
df1["second"]=df2["Second"].values
>>> df1
First second
a 1 2
b 3 4
^ Solution
This should also work I think:
df1["second"] = df2["second"].values
It would keep the index from the first dataframe, but since you have values in there such as "AnyThingAtAll" and "SeriouslyIdontCare" I guess any index values whatsoever are acceptable.
Basically, we are just adding a the values from your series as a new column to the first dataframe.
Here's a test example similar to your described problem:
import pandas as pd

# -----------
# sample data
# -----------
df1 = pd.DataFrame(
    {
        'x': ['a', 'b'],
        'First': [1, 3],
    })
df1.set_index("x", drop=True, inplace=True)

df2 = pd.DataFrame(
    {
        'x': ['c', 'd'],
        'Second': [2, 4],
    })
df2.set_index("x", drop=True, inplace=True)
# ---------------------------------------------
# Add series as a new column to first dataframe
# ---------------------------------------------
df1["Second"] = df2["Second"].values
Result is:
   First  Second
x
a      1       2
b      3       4
The goal is to combine data based on position (not by Index). Here is one way to do it:
import pandas as pd
# create data frames df1 and df2
df1 = pd.DataFrame(data = {'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame(data = {'Second': [2, 4]}, index = ['c', 'd'])
# add a column to df1 -- add by position, not by Index
df1['Second'] = df2['Second'].values
print(df1)
First Second
a 1 2
b 3 4
And you could create a completely new data frame like this:
data = {'1st': df1['First'].values, '2nd': df1['Second'].values}
print(pd.DataFrame(data))
1st 2nd
0 1 2
1 3 4
ignore_index controls whether the output keeps the original labels along the concatenation axis. If it is True, the labels along that axis are discarded and replaced by 0 to n-1, which is exactly the 0, 1 column headers shown in your result.
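As a small illustration (assuming the df1/df2 from the question): with axis=0 it is the row labels that get renumbered, while with axis=1 (your call) only the column labels are replaced, so the NaN-producing row alignment still happens.
import pandas as pd
df1 = pd.DataFrame({'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame({'Second': [2, 4]}, index=['c', 'd'])
# axis=0 with ignore_index=True: row labels a, b, c, d are dropped and replaced by 0..3
print(pd.concat([df1, df2], axis=0, ignore_index=True))
#    First  Second
# 0    1.0     NaN
# 1    3.0     NaN
# 2    NaN     2.0
# 3    NaN     4.0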
You can try
out = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
print(out)
First Second
0 1 2
1 3 4

How to prevent data from being recycled when using pd.merge_asof in Python

I am looking to join two data frames using the pd.merge_asof function. This function allows me to match data on a unique id and/or a nearest key. In this example, I am matching on the id as well as the nearest date that is less than or equal to the date in df1.
Is there a way to prevent the data from df2 being recycled when joining?
This is the code that I currently have that recycles the values in df2.
import pandas as pd
import datetime as dt

df1 = pd.DataFrame({'date': [dt.datetime(2020, 1, 2), dt.datetime(2020, 2, 2), dt.datetime(2020, 3, 2)],
                    'id': ['a', 'a', 'a']})
df2 = pd.DataFrame({'date': [dt.datetime(2020, 1, 1)],
                    'id': ['a'],
                    'value': ['1']})

pd.merge_asof(df1,
              df2,
              on='date',
              by='id',
              direction='backward',
              allow_exact_matches=True)
Instead, I would like only the first row of df1 to pick up the value from df2; the later rows should get NaN rather than the recycled value.
Given your merge direction is backward, you can do a mask on duplicated id and df2's date after merge_asof:
out = pd.merge_asof(df1,
                    df2.rename(columns={'date': 'date1'}),  # rename df2's date
                    left_on='date',
                    right_on='date1',  # so we can work on it later
                    by='id',
                    direction='backward',
                    allow_exact_matches=True)
# mask the value
out['value'] = out['value'].mask(out.duplicated(['id','date1']))
# equivalently
# out.loc[out.duplicated(['id', 'date1']), 'value'] = np.nan
Output:
date id date1 value
0 2020-01-02 a 2020-01-01 1
1 2020-02-02 a 2020-01-01 NaN
2 2020-03-02 a 2020-01-01 NaN
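If you don't want the helper column in the final result, a small follow-up (assuming df2's original date isn't needed downstream) is to drop it:
out = out.drop(columns='date1')
print(out)
#         date id value
# 0 2020-01-02  a     1
# 1 2020-02-02  a   NaN
# 2 2020-03-02  a   NaN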

python: what is wrong with my date index?

I have a dataframe that uses dates as index. Although I can read the index values from series.index, I fail to get the corresponding record.
series = pd.DataFrame([[datetime.date(2019,1,1), 'A', 4], [datetime.date(2019,1,2), 'B', 6]], columns = ('Date', 'Index', 'Value'))
series2 = series.pivot(index='Date', columns='Index', values='Value')
index = series2.index[0]
This far, everything works.
But this line of code fails:
row = series[index]
The error message is
KeyError: datetime.date(2019, 1, 1)
Why does it fail, and how can I fix it?
Use .loc for selecting, but on series2, because series[index] looks the date up among the column labels (and there is no column named datetime.date(2019, 1, 1)); series itself still has the default RangeIndex, not dates:
row = series2.loc[index]
print (row)
Index
A 4.0
B NaN
Name: 2019-01-01, dtype: float64
Details:
print (series)
Date Index Value
0 2019-01-01 A 4
1 2019-01-02 B 6
print (series.index)
RangeIndex(start=0, stop=2, step=1)
print (series2)
Index A B
Date
2019-01-01 4.0 NaN
2019-01-02 NaN 6.0
Add this part after your three lines:
series.set_index('Date', inplace=True)
So, the whole thing is:
import pandas as pd
import datetime
series = pd.DataFrame([[datetime.date(2019, 1, 1), 'A', 4],
                       [datetime.date(2019, 1, 2), 'B', 6]],
                      columns=('Date', 'Index', 'Value'))
series2 = series.pivot(index='Date', columns='Index',
                       values='Value')
index = series2.index[0]
series.set_index('Date', inplace=True)  # this part was added
series.loc[index]
Out[57]:
Index A
Value 4
Name: 2019-01-01, dtype: object

How to get pd.Grouper() to include empty groups

I have a dataset that I want to groupby a column AND every month of data in the dataset. I'm using pd.Grouper() for the groupby date per month part of it.
df.groupby(['A',pd.Grouper(key='date', freq='M')]).agg({'B':list})
But this returns only the months for each A,B that actually have data. I also want every month where there was no data for that A,B combo. I don't see this option in the pd.Grouper() documentation.
Given this DataFrame:
date A B
2018-01-01 1 3
2018-03-01 2 4
After the groupby you can fill in the missing months, but unfortunately you need to create the full MultiIndex yourself and reindex:
In [11]: res = df.groupby(['A',pd.Grouper(key='date', freq='M')]).agg({'B':list})
In [12]: m = pd.MultiIndex.from_product([df.A.unique(), pd.date_range(df.date.min(), df.date.max() + pd.offsets.MonthEnd(1), freq='M')])
In [13]: m
Out[13]:
MultiIndex(levels=[[1, 2], [2018-01-31 00:00:00, 2018-02-28 00:00:00, 2018-03-31 00:00:00]],
labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
In [14]: res.reindex(m)
Out[14]:
B
1 2018-01-31 [3]
2018-02-28 NaN
2018-03-31 NaN
2 2018-01-31 NaN
2018-02-28 NaN
2018-03-31 [4]
Note: to fillna with [] is a little tricky, ideally you'd be able to work around this (in general having lists inside a DataFrame is not recommended).
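If you do want [] instead of NaN, one possible post-processing sketch (my own, not part of the answer above):
res2 = res.reindex(m)
# Anything that is not already a list came from the reindex and had no data
# for that (A, month) pair, so replace it with an empty list.
res2['B'] = res2['B'].apply(lambda v: v if isinstance(v, list) else [])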

Business days between two columns of dates with Pandas Groupby

I have a Dataframe in Pandas with a letter and two dates as columns. I would like to calculate the difference between the two date columns for the previous row using shift(1) provided that the Lettervalue is the same (using a groupby). The complex part is I would like to calculate business days, not just elapsed days. The best way I have found to do that is using a numpy.busday_count, which takes two lists as an argument. I am essentially trying to use .apply to make each row it's own list. Not sure if this is the best way to do it, but running into some problems, which are ambiguous.
import pandas as pd
from datetime import datetime
import numpy as np
# create dataframe
df = pd.DataFrame(data=[['A', datetime(2016,01,07), datetime(2016,01,09)],
['A', datetime(2016,03,01), datetime(2016,03,8)],
['B', datetime(2016,05,01), datetime(2016,05,10)],
['B', datetime(2016,06,05), datetime(2016,06,07)]],
columns=['Letter', 'First Day', 'Last Day'])
# convert to dates since pandas reads them in as time series
df['First Day'] = df['First Day'].apply(lambda x: x.to_datetime().date())
df['Last Day'] = df['Last Day'].apply(lambda x: x.to_datetime().date())
df['Gap'] = (df.groupby('Letter')
.apply(
lambda x: (
np.busday_count(x['First Day'].shift(1).tolist(),
x['Last Day'].shift(1).tolist())))
.reset_index(drop=True))
print df
I get the following error on the lambda function. I'm not sure what object it's having problems with as the two passed arguments should be dates:
ValueError: Could not convert object to NumPy datetime
Desired Output:
Letter First Day Last Day Gap
0 A 2016-01-07 2016-01-09 NAN
1 A 2016-03-01 2016-03-08 1
2 B 2016-05-01 2016-05-10 NAN
3 B 2016-06-05 2016-06-07 7
The following should work (after first removing the leading zeros from the date digits):
df = pd.DataFrame(data=[['A', datetime(2016, 1, 7), datetime(2016, 1, 9)],
['A', datetime(2016, 3, 1), datetime(2016, 3, 8)],
['B', datetime(2016, 5, 1), datetime(2016, 5, 10)],
['B', datetime(2016, 6, 5), datetime(2016, 6, 7)]],
columns=['Letter', 'First Day', 'Last Day'])
df['Gap'] = (df.groupby('Letter')
               .apply(lambda x: pd.DataFrame(
                   np.busday_count(x['First Day'].tolist(),
                                   x['Last Day'].tolist())).shift())
               .reset_index(drop=True))
Letter First Day Last Day Gap
0 A 2016-01-07 2016-01-09 NaN
1 A 2016-03-01 2016-03-08 2.0
2 B 2016-05-01 2016-05-10 NaN
3 B 2016-06-05 2016-06-07 6.0
I don't think you need the .date() conversion.
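An alternative sketch (mine, assuming the df and imports defined just above, with plain datetime columns): shift both date columns within each Letter group first, then call np.busday_count only on the rows that actually have a previous entry.
prev_first = df.groupby('Letter')['First Day'].shift(1)
prev_last = df.groupby('Letter')['Last Day'].shift(1)
mask = prev_first.notna()
df['Gap'] = np.nan
# busday_count cannot handle the NaT values introduced by shift, so only feed it the masked rows
df.loc[mask, 'Gap'] = np.busday_count(
    prev_first[mask].values.astype('datetime64[D]'),
    prev_last[mask].values.astype('datetime64[D]'))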
