Pandas select columns with boolean date condition - python

I'd like to use a boolean index to select columns from a pandas dataframe with a datetime index as the column header:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(4, 6), index=list('ABCD'), columns=dates)
returns:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 0.173096 0.344348 1.059990 -1.246944 1.624399 -0.276052
B 0.277148 0.965226 -1.301612 -1.264500 -0.124489 1.704485
C -0.375106 0.103812 0.939749 -2.826329 -0.275420 0.664325
D 0.039756 0.631373 0.643565 -1.516543 -0.654626 -1.544038
I'd like to return only the first three columns.

I might do
>>> df.loc[:, df.columns <= datetime(2013, 1, 3)]
2013-01-01 2013-01-02 2013-01-03
A 1.058112 0.883429 -1.939846
B 0.753125 1.664276 -0.619355
C 0.014437 1.125824 -1.421609
D 1.879229 1.594623 -1.499875
You can do vectorized comparisons on the column index directly without using the map/lambda combination.
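To illustrate the vectorized comparison, here is a minimal sketch with deterministic data instead of random numbers. The `cutoff` name and its value are assumptions chosen for illustration; the cutoff is deliberately not one of the column labels, to show that the boolean mask does not need an exact match.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=6)
df = pd.DataFrame(np.arange(24).reshape(4, 6), index=list("ABCD"), columns=dates)

# Hypothetical cutoff that is NOT itself a column label.
cutoff = pd.Timestamp("2013-01-03 12:00")

# Comparing the DatetimeIndex to a scalar yields one boolean per column.
mask = df.columns <= cutoff

# Keep every row, and only the columns where the mask is True.
subset = df.loc[:, mask]
print(subset.shape)
```

The mask is an ordinary boolean array, so it can be combined with other conditions using `&` and `|` before being passed to `.loc`.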

I had a nice long chat with the duck, and finally realised it was as simple as this:
print(df.loc[:, :datetime(2013, 1, 3, 0, 0)])
I love Pandas.
EDIT:
Well, in fact that wasn't exactly what I wanted, because it relies on the 'query' date being present in the column headers. This is actually what I needed:
print(df.loc[:, df.columns.map(lambda col: col <= datetime(2013, 1, 3, 0, 0))])

Related

Selecting datetime range in MultiIndexed dataframe in Pandas

Here is the problem:
I want to select the rows of df1 (call the result df3) whose index (index1) falls in the range between d_reach and d_start in df2.
Below is the code to generate sample data:
import numpy as np
import pandas as pd
import datetime
from datetime import timedelta
index1 = pd.date_range(datetime.datetime(2021, 1, 1, 1, 1), periods = 1000, freq = "3min")
df1 = pd.DataFrame(np.random.random(1000), index = index1, columns = ['r'])
d_start = pd.date_range(datetime.datetime(2021, 1, 1, 1, 1), periods = 500, freq = "5min")
d_reach = d_start + timedelta(seconds = int(np.random.randint(low = 4, high = 6)))
value = {'id3': np.tile([0,1], 250)}
df2 = pd.DataFrame(value, index = [d_start,d_reach])
df2.index.names = ['d_start','d_reach']
df2 is MultiIndexed.
The expected output of df3 should be:
2021-01-01 01:07:00 0.011026
2021-01-01 01:10:00 0.423813
...
Here the df1 index 2021-01-01 01:07:00 is >= 2021-01-01 01:06:05, which is one of the d_reach values in df2,
and the next df1 index 2021-01-01 01:10:00 is < 2021-01-01 01:11:00, which is the next d_start in df2.
Below is the code I tried but failed:
df = pd.DataFrame()
for i in df1.index:
df = df.append(df1.loc[i])
for idx1, idx2 in zip(df2.index.get_level_values(0).tolist(),
df2.index.get_level_values(1).tolist())
if i >= idx1 and i <= idx2
I'd really appreciate any advice on how to find df3 in Python. Thanks!
Here is one way: cross join the two frames, then keep only the rows where the df1 timestamp falls between d_start and d_reach:
mdf = pd.merge(df1.reset_index(), df2.reset_index() , how='cross', on=None)
result = mdf.loc[mdf['index'].between(mdf['d_start'], mdf['d_reach']),['index','r']].set_index('index')
print(result.head())
output:
>>>
r
index
2021-01-01 01:01:00 0.415163
2021-01-01 01:16:00 0.729592
2021-01-01 01:31:00 0.411244
2021-01-01 01:46:00 0.524753
2021-01-01 02:01:00 0.105035
That is going to be memory intensive, though. Another way is to load both dataframes into an in-memory database, join them on the range condition, and load the result back into a dataframe; you will find many examples of that method online.
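As a lighter-weight alternative to the cross join, the following is a sketch using `pd.merge_asof`. It assumes the ranges are sorted and do not overlap, and it uses the same inclusive `[d_start, d_reach]` test as the cross join above. The sample data is deterministic, and the `ranges`, `left`, `matched`, and `inside` names are illustrative, not from the original post.

```python
import numpy as np
import pandas as pd

# Deterministic sample data shaped like the question's frames.
index1 = pd.date_range("2021-01-01 01:01", periods=1000, freq="3min")
df1 = pd.DataFrame({"r": np.arange(1000, dtype=float)}, index=index1)

d_start = pd.date_range("2021-01-01 01:01", periods=500, freq="5min")
d_reach = d_start + pd.Timedelta(seconds=5)
ranges = pd.DataFrame({"d_start": d_start, "d_reach": d_reach})

# merge_asof needs both sides sorted on their keys; here they already are.
left = df1.rename_axis("index").reset_index()

# For each df1 timestamp, attach the latest range whose d_start <= timestamp.
matched = pd.merge_asof(left, ranges, left_on="index", right_on="d_start")

# Keep the row only if the timestamp is also no later than that range's d_reach.
inside = matched["index"] <= matched["d_reach"]
result = matched.loc[inside, ["index", "r"]].set_index("index")
print(result.head())
```

Because each left row is paired with at most one range, the intermediate frame has as many rows as df1 rather than len(df1) * len(df2).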

plot dataframe columns via for loop

I have the following code:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 6), index=dates, columns=["a","b","c","a_x","b_x","c_x"])
which results in the following:
a b c a_x b_x c_x
2013-01-01 -0.871681 0.938965 -0.804039 0.329384 -1.211573 0.160477
2013-01-02 1.673895 2.017654 2.181771 0.336220 0.389709 0.246264
2013-01-03 -0.670211 -0.561792 -0.747824 -0.837123 0.129040 1.044153
2013-01-04 -0.571023 -0.430249 0.024393 1.017622 1.072909 0.816249
2013-01-05 0.074952 -0.119953 0.245248 2.658196 -1.525059 1.131054
2013-01-06 0.203816 0.379939 -0.162919 -0.674444 -0.650636 0.415143
I want to generate simple line plot charts - a total of three, each plotting the couples:
a and a_x, b and b_x and c and c_x
I know how to generate charts, but since the table is big and the column names follow the same pattern, I was wondering whether this can be done with a for loop. For example, the original table would also have a column d and a column d_x, a column e and a column e_x, and so on.
You could use groupby along axis=1, grouping by the first element obtained from splitting each column name on '_':
for _, data in df.groupby(df.columns.str.split('_').str[0], axis=1):
    data.plot()
[out]
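Since `groupby(..., axis=1)` is deprecated in recent pandas releases, here is a sketch of the same pairing done with plain Python. The helper names `column_pairs` and `plot_pairs` are hypothetical, and the data is deterministic rather than random.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=6)
df = pd.DataFrame(np.arange(36.0).reshape(6, 6), index=dates,
                  columns=["a", "b", "c", "a_x", "b_x", "c_x"])


def column_pairs(frame):
    """Map each name prefix (text before the first '_') to its columns."""
    groups = defaultdict(list)
    for col in frame.columns:
        groups[col.split("_")[0]].append(col)
    return dict(groups)


def plot_pairs(frame):
    """Draw one line chart per prefix group (needs matplotlib installed)."""
    for prefix, cols in column_pairs(frame).items():
        frame[cols].plot(title=prefix)


pairs = column_pairs(df)
print(pairs)
```

Calling `plot_pairs(df)` would then produce one chart per pair; columns such as d and d_x are picked up automatically because the grouping depends only on the prefix.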

python: what is wrong with my date index?

I have a dataframe that uses dates as index. Although I can read the index values from series.index, I fail to get the corresponding record.
series = pd.DataFrame([[datetime.date(2019,1,1), 'A', 4], [datetime.date(2019,1,2), 'B', 6]], columns = ('Date', 'Index', 'Value'))
series2 = series.pivot(index='Date', columns='Index', values='Value')
index = series2.index[0]
This far, everything works.
But this line of code fails:
row = series[index]
The error message is
KeyError: datetime.date(2019, 1, 1)
Why does it fail, and how can I fix it?
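series[index] looks for a column named index, which is why it raises a KeyError. Use DataFrame.loc to select a row by label, and do it on series2: series still has the default RangeIndex, not dates.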
row = series2.loc[index]
print (row)
Index
A 4.0
B NaN
Name: 2019-01-01, dtype: float64
Details:
print (series)
Date Index Value
0 2019-01-01 A 4
1 2019-01-02 B 6
print (series.index)
RangeIndex(start=0, stop=2, step=1)
print (series2)
Index A B
Date
2019-01-01 4.0 NaN
2019-01-02 NaN 6.0
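The difference between the two lookups can be checked with a short sketch. The `key` and `lookup_failed` names are illustrative, and the frames are rebuilt here so the snippet runs on its own.

```python
import datetime

import pandas as pd

series = pd.DataFrame(
    [[datetime.date(2019, 1, 1), "A", 4], [datetime.date(2019, 1, 2), "B", 6]],
    columns=["Date", "Index", "Value"],
)
series2 = series.pivot(index="Date", columns="Index", values="Value")
key = series2.index[0]

# frame[key] is a column lookup, and no column is named after the date.
try:
    series[key]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# .loc[key] is a row lookup by index label, which series2 does have.
row = series2.loc[key]
print(lookup_failed, row["A"])
```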
Add this part after your three lines:
series.set_index('Date', inplace=True)
So, the whole thing is:
import pandas as pd
import datetime
series = pd.DataFrame([[datetime.date(2019,1,1), 'A', 4],
[datetime.date(2019,1,2), 'B', 6]],
columns = ('Date', 'Index', 'Value'))
series2 = series.pivot(index='Date', columns='Index',
values='Value')
index = series2.index[0]
series.set_index('Date', inplace=True) # this part was added
series.loc[index]
Out[57]:
Index A
Value 4
Name: 2019-01-01, dtype: object

Business days between two columns of dates with Pandas Groupby

I have a DataFrame in pandas with a letter and two dates as columns. I would like to calculate the difference between the two date columns for the previous row using shift(1), provided that the Letter value is the same (using a groupby). The complex part is that I would like to calculate business days, not just elapsed days. The best way I have found to do that is numpy.busday_count, which takes two lists as arguments. I am essentially trying to use .apply to make each row its own list. I'm not sure this is the best way to do it, and I'm running into some problems whose cause is unclear to me.
import pandas as pd
from datetime import datetime
import numpy as np
# create dataframe
df = pd.DataFrame(data=[['A', datetime(2016,01,07), datetime(2016,01,09)],
['A', datetime(2016,03,01), datetime(2016,03,8)],
['B', datetime(2016,05,01), datetime(2016,05,10)],
['B', datetime(2016,06,05), datetime(2016,06,07)]],
columns=['Letter', 'First Day', 'Last Day'])
# convert to dates since pandas reads them in as time series
df['First Day'] = df['First Day'].apply(lambda x: x.to_datetime().date())
df['Last Day'] = df['Last Day'].apply(lambda x: x.to_datetime().date())
df['Gap'] = (df.groupby('Letter')
.apply(
lambda x: (
np.busday_count(x['First Day'].shift(1).tolist(),
x['Last Day'].shift(1).tolist())))
.reset_index(drop=True))
print df
I get the following error on the lambda function. I'm not sure what object it's having problems with as the two passed arguments should be dates:
ValueError: Could not convert object to NumPy datetime
Desired Output:
Letter First Day Last Day Gap
0 A 2016-01-07 2016-01-09 NAN
1 A 2016-03-01 2016-03-08 1
2 B 2016-05-01 2016-05-10 NAN
3 B 2016-06-05 2016-06-07 7
The following should work, after first removing the leading zeros from the date digits:
df = pd.DataFrame(data=[['A', datetime(2016, 1, 7), datetime(2016, 1, 9)],
['A', datetime(2016, 3, 1), datetime(2016, 3, 8)],
['B', datetime(2016, 5, 1), datetime(2016, 5, 10)],
['B', datetime(2016, 6, 5), datetime(2016, 6, 7)]],
columns=['Letter', 'First Day', 'Last Day'])
# the outer parentheses allow the method chain to span several lines
df['Gap'] = (df.groupby('Letter')
             .apply(
                 lambda x:
                 pd.DataFrame(
                     np.busday_count(x['First Day'].tolist(), x['Last Day'].tolist())).shift())
             .reset_index(drop=True))
Letter First Day Last Day Gap
0 A 2016-01-07 2016-01-09 NaN
1 A 2016-03-01 2016-03-08 2.0
2 B 2016-05-01 2016-05-10 NaN
3 B 2016-06-05 2016-06-07 6.0
I don't think you need the .date() conversion.
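A variant that avoids `.apply` altogether is sketched below. It assumes the date columns are proper datetime columns, and it casts them to NumPy's day resolution before calling `np.busday_count`. The intermediate `Busdays` column is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Letter": ["A", "A", "B", "B"],
        "First Day": pd.to_datetime(["2016-01-07", "2016-03-01", "2016-05-01", "2016-06-05"]),
        "Last Day": pd.to_datetime(["2016-01-09", "2016-03-08", "2016-05-10", "2016-06-07"]),
    }
)

# np.busday_count works on day-resolution datetime64 arrays.
first = df["First Day"].to_numpy().astype("datetime64[D]")
last = df["Last Day"].to_numpy().astype("datetime64[D]")

# Business days in each row's own [First Day, Last Day) interval.
df["Busdays"] = np.busday_count(first, last)

# Shift within each Letter so every row sees the previous row's count.
df["Gap"] = df.groupby("Letter")["Busdays"].shift()
print(df)
```

The first row of each letter has no predecessor, so its Gap is NaN, matching the output shown above.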

Interval-matching PeriodIndex with DatetimeIndex during merge

Given the below DataFrames, I'd like to add series 'bar' from df_other2 into df2, so that the period of df2 (which I understand as an interval) "matches" the datetime index (not an interval) of df_other2 (also called "period", but is really a datetime). The matching criteria should be that df_other2.period is within df2's period (i.e. date is within the interval).
I was hoping that defining the target index as PeriodIndex and the source index as DatetimeIndex would be sufficient to do the matching, but that doesn't seem to be the case. What alternatives do I have to get this to work?
>>> df = pd.DataFrame({'period': pd.PeriodIndex(['2012-01', '2012-02', '2012-03'], freq='M'), 'foo': [1, 2, 3]})
>>> df2 = df.set_index('period')
>>> df2
foo x
period
2012-01 1 NaN
2012-02 2 NaN
2012-03 3 NaN
>>> df_other = pd.DataFrame({'period': [datetime.datetime(2012, 1, 1), datetime.datetime(2012, 2, 3), datetime.datetime(2012, 3, 10)], 'bar': ['a', 'b', 'c']})
>>> df_other2 = df_other.set_index('period')
>>> df_other2
bar
period
2012-01-01 a
2012-02-03 b
2012-03-10 c
>>> df2['x'] = df_other['bar']
>>> df2
foo x
period
2012-01 1 NaN # expected x='a' as '2012-1-1' is part of this period
2012-02 2 NaN # expected x='b'
2012-03 3 NaN # expected x='c'
I decided to align all df_other2.period with df2.period (and make them DatetimeIndex) and merge as usual.
I'll wait for better support in future releases:
For regular time spans, pandas uses Period objects for scalar values
and PeriodIndex for sequences of spans. Better support for irregular
intervals with arbitrary start and end points are forth-coming in
future releases.
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-stamps-vs-time-spans
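One possible way to do the alignment is sketched below. Note that it goes in the opposite direction from the answer above: instead of turning both sides into a DatetimeIndex, it converts each timestamp to the monthly period that contains it. That choice, and the `bar_by_period` name, are assumptions made for illustration.

```python
import pandas as pd

df2 = pd.DataFrame(
    {"foo": [1, 2, 3]},
    index=pd.PeriodIndex(["2012-01", "2012-02", "2012-03"], freq="M", name="period"),
)
df_other2 = pd.DataFrame(
    {"bar": ["a", "b", "c"]},
    index=pd.DatetimeIndex(["2012-01-01", "2012-02-03", "2012-03-10"], name="period"),
)

# Convert each timestamp to the monthly period that contains it.
bar_by_period = df_other2["bar"].copy()
bar_by_period.index = df_other2.index.to_period("M")

# Both indexes are now monthly PeriodIndexes, so assignment aligns on them.
df2["x"] = bar_by_period
print(df2)
```

This only works cleanly when at most one timestamp falls in each period; with several per month, the series would need to be aggregated first.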
