I have a dataset that I want to group by a column AND by every month of data in the dataset. I'm using pd.Grouper() for the per-month part of the groupby.
df.groupby(['A',pd.Grouper(key='date', freq='M')]).agg({'B':list})
But this returns only the months that actually have data for each A. I also want every month where there was no data for that A/month combination. I don't see this option in the pd.Grouper() documentation.
Given this DataFrame:
date A B
2018-01-01 1 3
2018-03-01 2 4
After the groupby you could reach for resample, but to get every month for every A you unfortunately need to create the full MultiIndex yourself and reindex:
In [11]: res = df.groupby(['A',pd.Grouper(key='date', freq='M')]).agg({'B':list})
In [12]: m = pd.MultiIndex.from_product([df.A.unique(), pd.date_range(df.date.min(), df.date.max() + pd.offsets.MonthEnd(1), freq='M')])
In [13]: m
Out[13]:
MultiIndex(levels=[[1, 2], [2018-01-31 00:00:00, 2018-02-28 00:00:00, 2018-03-31 00:00:00]],
labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
In [14]: res.reindex(m)
Out[14]:
B
1 2018-01-31 [3]
2018-02-28 NaN
2018-03-31 NaN
2 2018-01-31 NaN
2018-02-28 NaN
2018-03-31 [4]
Note: filling the NaNs with [] is a little tricky; ideally you'd be able to work around this (in general, having lists inside a DataFrame is not recommended).
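One workaround (a sketch, reusing the res and m objects from above) is to replace the NaN placeholders with empty lists after the reindex:
out = res.reindex(m)
out['B'] = out['B'].apply(lambda x: x if isinstance(x, list) else [])  # NaN -> []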
This is my first question on Stack Overflow, so the formatting might be a bit off. I have a problem that I know how to solve with a for loop in Python. However, I don't know if there is a way in pandas itself that does the same thing faster.
Problem:
Suppose I have a pandas Series 'in' indexed by date, where every date has an integer value. There is also a Series 'out' with the same structure.
Ex:
in
date val
2022-12-01 5
2022-12-02 8
2022-12-03 19
out
date val
2022-12-01 3
2022-12-02 7
2022-12-03 21
If I want to make a Series of the number of events being processed each day, I could do it with a for loop where the value for every day is open.iloc[i] = open.iloc[i-1] + in.iloc[i] - out.iloc[i]. The result should be
open
date val
2022-12-01 2 #5-3
2022-12-02 3 #2+8-7
2022-12-03 1 #3+19-21
Is there a way to do this in pandas itself, without the need for a for loop?
new answer
Use cumsum:
ser_open = ser_in.sub(ser_out).cumsum()
Output:
2022-12-01 2
2022-12-02 3
2022-12-03 1
dtype: int64
Used input:
ser_in = pd.Series([5, 8, 19], index=['2022-12-01', '2022-12-02', '2022-12-03'])
ser_out = pd.Series([3, 7, 21], index=['2022-12-01', '2022-12-02', '2022-12-03'])
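As a quick check (my addition, not part of the original answer), the cumsum result matches the explicit loop described in the question:
open_loop = ser_in.copy()                  # same index, values overwritten below
prev = 0
for i in range(len(ser_in)):
    prev = prev + ser_in.iloc[i] - ser_out.iloc[i]
    open_loop.iloc[i] = prev
assert open_loop.equals(ser_open)          # both give 2, 3, 1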
initial answer
Use shift after setting date as index:
out = (df_open
       .set_index('date').shift()
       .add(df_in.set_index('date') - df_out.set_index('date'),
            fill_value=0)
       .reset_index()
       )
Or, for assignment, use a variant with map:
df_open['val'] = df_open['date'].map(
    df_open
    .set_index('date').shift()
    .add(df_in.set_index('date') - df_out.set_index('date'),
         fill_value=0)
    ['val']
)
Output:
date val
0 2022-12-01 2.0
1 2022-12-02 3.0
2 2022-12-03 1.0
Used inputs:
df_in = pd.DataFrame({'date': ['2022-12-01', '2022-12-02', '2022-12-03'], 'val': [5, 8, 19]})
df_out = pd.DataFrame({'date': ['2022-12-01', '2022-12-02', '2022-12-03'], 'val': [3, 7, 21]})
df_open = pd.DataFrame({'date': ['2022-12-01', '2022-12-02', '2022-12-03'], 'val': [2.0, 3.0, 1.0]})
I have a DataFrame df1 and, for a specific date, for example 2022-01-04, I want to get all the column names of df1 with a non-zero value as a list, which would be: 01G, 02G, 04G. So far I was only able to get the count of non-zero values per row, but not the column names.
This would be a simple example:
df1:
01G 02G 03G 04G
Dates
2022-01-01 0 1 0 1
2022-01-02 1 1 1 0
2022-01-03 0 1 1 1
2022-01-04 1 1 0 1
For reproducibility:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
    'Dates': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04'],
    '01G': [0, 1, 0, 1],
    '02G': [1, 1, 1, 1],
    '03G': [0, 1, 1, 0],
    '04G': [1, 0, 1, 1]})
df1 = df1.set_index('Dates')
np.count_nonzero(df1, axis=1)
Thanks a lot!
Use DataFrame.loc to filter the row by date, compare for greater than 0, and filter the column names:
print (df1.columns[df1.loc['2022-01-04'].gt(0)].tolist())
['01G', '02G', '04G']
For your special case, it seems we can also filter using the row values directly after changing dtype to bool:
out = df1.columns[df1.loc['2022-01-04'].astype(bool)].tolist()
Output:
['01G', '02G', '04G']
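If you need such a list for every date at once, a small variant (my addition, not from the original answers) is to apply the boolean mask row-wise:
lists_per_date = df1.gt(0).apply(lambda row: df1.columns[row].tolist(), axis=1)
print(lists_per_date.loc['2022-01-04'])   # ['01G', '02G', '04G']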
I have a dataframe that uses dates as index. Although I can read the index values from series.index, I fail to get the corresponding record.
series = pd.DataFrame([[datetime.date(2019,1,1), 'A', 4], [datetime.date(2019,1,2), 'B', 6]], columns = ('Date', 'Index', 'Value'))
series2 = series.pivot(index='Date', columns='Index', values='Value')
index = series2.index[0]
This far, everything works.
But this line of code fails:
row = series[index]
The error message is
KeyError: datetime.date(2019, 1, 1)
Why does it fail, and how can I fix it?
Use DataFrame.loc for selecting rows by label, but on series2, because series still has the default RangeIndex, not dates (and plain series[index] looks the key up among the columns, not the rows):
row = series2.loc[index]
print (row)
Index
A 4.0
B NaN
Name: 2019-01-01, dtype: float64
Details:
print (series)
Date Index Value
0 2019-01-01 A 4
1 2019-01-02 B 6
print (series.index)
RangeIndex(start=0, stop=2, step=1)
print (series2)
Index A B
Date
2019-01-01 4.0 NaN
2019-01-02 NaN 6.0
Add this part after your three lines:
series.set_index('Date', inplace=True)
So, the whole thing is:
import pandas as pd
import datetime
series = pd.DataFrame([[datetime.date(2019,1,1), 'A', 4],
[datetime.date(2019,1,2), 'B', 6]],
columns = ('Date', 'Index', 'Value'))
series2 = series.pivot(index='Date', columns='Index',
values='Value')
index = series2.index[0]
series.set_index('Date', inplace=True) # this part was added
series.loc[index]
Out[57]:
Index A
Value 4
Name: 2019-01-01, dtype: object
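As a small follow-up (my note, not part of the answer): once 'Date' is the index, you can also look the row up with the date literal itself:
series.loc[datetime.date(2019, 1, 1)]   # same row as series.loc[index]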
Given the DataFrames below, I'd like to add the series 'bar' from df_other2 into df2, so that the period of df2 (which I understand as an interval) "matches" the datetime index (not an interval) of df_other2 (also called "period", but really a datetime). The matching criterion should be that df_other2's date falls within df2's period (i.e. the date is within the interval).
I was hoping that defining the target index as PeriodIndex and the source index as DatetimeIndex would be sufficient to do the matching, but that doesn't seem to be the case. What alternatives do I have to get this to work?
>>> df = pd.DataFrame({'period': pd.PeriodIndex(['2012-01', '2012-02', '2012-03'], freq='M'), 'foo': [1, 2, 3]})
>>> df2 = df.set_index('period')
>>> df2
foo
period
2012-01 1
2012-02 2
2012-03 3
>>> df_other = pd.DataFrame({'period': [datetime.datetime(2012, 1, 1), datetime.datetime(2012, 2, 3), datetime.datetime(2012, 3, 10)], 'bar': ['a', 'b', 'c']})
>>> df_other2 = df_other.set_index('period')
>>> df_other2
bar
period
2012-01-01 a
2012-02-03 b
2012-03-10 c
>>> df2['x'] = df_other2['bar']
>>> df2
foo x
period
2012-01 1 NaN # expected x='a' as '2012-1-1' is part of this period
2012-02 2 NaN # expected x='b'
2012-03 3 NaN # expected x='c'
I decided to align all df_other2.period with df2.period (and make them DatetimeIndex) and merge as usual.
I'll wait for better support in future releases:
For regular time spans, pandas uses Period objects for scalar values
and PeriodIndex for sequences of spans. Better support for irregular
intervals with arbitrary start and end points are forth-coming in
future releases.
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-stamps-vs-time-spans
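For reference, a minimal sketch of that alignment (my own, using the names from the question and assuming at most one date per month): convert the datetime index of df_other2 to monthly periods, after which the assignment aligns on the PeriodIndex:
tmp = df_other2.copy()
tmp.index = tmp.index.to_period('M')   # 2012-01-01 -> 2012-01, etc.
df2['x'] = tmp['bar']                  # aligns by period: a, b, c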
I'm recoding multiple columns in a dataframe and have come across a strange result that I can't quite figure out. I'm probably not recoding in the most efficient manner possible, but it's mostly the error that I'm hoping someone can explain.
s1 = pd.DataFrame([np.nan, '1', '2', '3', '4', '5'], columns=['col1'])
s2 = pd.DataFrame([np.nan, 1, 2, 3, 4, 5], columns=['col1'])
s1_dic = {np.nan: np.nan, '1': 1, '2':2, '3':3, '4':3, '5':3}
s2_dic = {np.nan: np.nan, 1: 1, 2:2, 3:3, 4:3, 5:3}
s1['col1'].apply(lambda x: s1_dic[x])
s2['col1'].apply(lambda x: s2_dic[x])
s1 works fine, but when I try to do the same thing with a list of integers and np.nan, I get KeyError: nan, which is confusing. Any help would be appreciated.
A workaround is to use the dict's get method, rather than the lambda:
In [11]: s1['col1'].apply(s1_dic.get)
Out[11]:
0 NaN
1 1
2 2
3 3
4 3
5 3
Name: col1, dtype: float64
In [12]: s2['col1'].apply(s2_dic.get)
Out[12]:
0 NaN
1 1
2 2
3 3
4 3
5 3
Name: col1, dtype: float64
It's not clear to me right now why this is different...
Note: the dicts can be accessed by nan:
In [21]: s1_dic[np.nan]
Out[21]: nan
In [22]: s2_dic[np.nan]
Out[22]: nan
and hash(np.nan) == 0 so it's not that...
Update: Apparently the issue is np.nan vs. np.float64(np.nan): for the former, np.nan is np.nan holds (because np.nan is bound to one specific nan object), whilst float('nan') is not float('nan'):
This means that get won't find float('nan'):
In [21]: nans = [float('nan') for _ in range(5)]
In [22]: {f: 1 for f in nans}
Out[22]: {nan: 1, nan: 1, nan: 1, nan: 1, nan: 1}
This means you can actually retrieve nans from a dict, but any such retrieval relies on object identity, so it is effectively implementation specific! In fact, since the dict lookup falls back on the identity of these nans, the entire behaviour above may be implementation specific (e.g. if the nans happened to be the same object, as they may be in a REPL/IPython session).
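A small illustration of that identity-based lookup (my addition, not from the original answer):
d = {float('nan'): 1}
d.get(float('nan'))      # None - a different nan object, and nan != nan
np.nan in {np.nan: 1}    # True - the identical object, identity check succeeds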
You can catch the nullness beforehand:
In [31]: s2['col1'].apply(lambda x: s2_dic[x] if pd.notnull(x) else x)
Out[31]:
0 NaN
1 1
2 2
3 3
4 3
5 3
Name: col1, dtype: float64
But I think the original suggestion of using .get is a better option.
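As an aside (my suggestion, not in the original answer), Series.map with a dict returns NaN for any key it cannot find, so it also sidesteps the nan lookup here:
s2['col1'].map(s2_dic)   # NaN, 1, 2, 3, 3, 3 - missing keys become NaN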