Pandas get_group method on DatetimeIndexResamplerGroupby - python

Question: Does the get_group method work on a DataFrame with a DatetimeIndexResamplerGroupby index? If so, what is the appropriate syntax?
Sample data:
data = [[2, 4, 1, datetime.datetime(2017, 1, 1)],
        [2, 4, 2, datetime.datetime(2017, 1, 5)],
        [3, 4, 1, datetime.datetime(2017, 1, 7)]]
df1 = pd.DataFrame(data, columns=list('abc') + ['dates'])
gb3 = df1.set_index('dates').groupby('a').resample('D')

gb3
DatetimeIndexResamplerGroupby [freq=<Day>, axis=0, closed=left, label=left, convention=e, base=0]
gb3.sum()
                a    b    c
a dates
2 2017-01-01  2.0  4.0  1.0
  2017-01-02  NaN  NaN  NaN
  2017-01-03  NaN  NaN  NaN
  2017-01-04  NaN  NaN  NaN
  2017-01-05  2.0  4.0  2.0
3 2017-01-07  3.0  4.0  1.0
The get_group method is working for me on a pandas.core.groupby.DataFrameGroupBy object.
I've tried various approaches; the typical error is TypeError: Cannot convert input [(0, 1)] of type <class 'tuple'> to Timestamp.
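For reference, a minimal sketch of the call that already works on a plain DataFrameGroupBy with the same sample data (no resampling), to show the contrast:

df1.groupby('a').get_group(2)
   a  b  c      dates
0  2  4  1 2017-01-01
1  2  4  2 2017-01-05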

The below should be what you're looking for (if I understand the question correctly):
import pandas as pd
import datetime

data = [[2, 4, 1, datetime.datetime(2017, 1, 1)],
        [2, 4, 2, datetime.datetime(2017, 1, 5)],
        [3, 4, 1, datetime.datetime(2017, 1, 7)]]
df1 = pd.DataFrame(data, columns=list('abc') + ['dates'])
gb3 = df1.groupby(['a', pd.Grouper('dates')])
gb3.get_group((2, '2017-01-01'))

Out[14]:
   a  b  c      dates
0  2  4  1 2017-01-01
I believe resample/pd.Grouper can be used interchangeably in this case (someone correct me if I'm wrong). Let me know if this works for you.
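Note that pd.Grouper('dates') above groups on the raw date values rather than on daily bins, so it is not an exact replacement for resample('D'). If daily buckets per group are what is needed, one possible variant (just a sketch, not an exact equivalent of the resampled output above) is to give the Grouper a frequency:

# hypothetical variant: explicit daily frequency on the Grouper
gb4 = df1.groupby(['a', pd.Grouper(key='dates', freq='D')])
gb4.get_group((2, pd.Timestamp('2017-01-01')))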

Yes, it does. The following code returns the monthly value sums for the year 2015:
df.resample('MS').sum().resample('Y').get_group('2015-12-31')

Related

DataFrame: How to vectorize this for loop?

I need help vectorizing this for loop; I couldn't come up with my own solution.
The general idea is that I want to calculate the number of bars since the last time the condition was true.
I have a DataFrame with initial values 0 and 1, where 0 is the anchor point for counting to start and stop (0 means that the condition was met for the index in this cell).
For example, the initial DataFrame would look like this (I am typing only the series' raw values and omitting column names etc.):
NaN, NaN, NaN, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0
The output should look like this:
NaN, NaN, NaN, 0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 2, 3, 4, 5, 0
My current code:
cond_count = pd.DataFrame(index=range(cond.shape[0]), columns=range(1))
cond_count.rename(columns={0: 'Bars Since'})
cond_count['Bars Since'] = 'NaN'
cond_count['Bars Since'].iloc[indices_cond_met] = 0
cond_count['Bars Since'].iloc[indices_cond_cut] = 1
for i in range(cond_count.shape[0]):
    if cond_count['Bars Since'].iloc[i] == 'NaN':
        pass
    elif cond_count['Bars Since'].iloc[i] == 0:
        while cond_count['Bars Since'].iloc[j] != 0:
            cond_count['Bars Since'].iloc[j] = cond_count['Bars Since'].shift(1).iloc[j] + 1
    else:
        pass
import numpy as np
import pandas as pd
df = pd.DataFrame({'data': [np.nan, np.nan, np.nan, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0]})
df['cs'] = df['data'].le(0).cumsum()
aaa = df.groupby(['data', 'cs'])['data'].apply(lambda x: x.cumsum())
df.loc[aaa.index[0]:, 'data'] = aaa
df = df.drop(['cs'], axis=1)  # if you need to remove the auxiliary column
print(df)
Output
data
0 NaN
1 NaN
2 NaN
3 0.0
4 1.0
5 2.0
6 3.0
7 4.0
8 0.0
9 1.0
10 2.0
11 0.0
12 1.0
13 2.0
14 3.0
15 4.0
16 5.0
17 0.0
Here le is used to get True where the value is 0 (NaN compares as False).
Then cumsum() is applied, turning those flags into group labels in 'cs'.
In 'aaa', the frame is grouped by the 'data' and 'cs' columns and the data is passed to apply, where cumsum() produces the running count within each group (the NaN rows drop out of the grouping).
Using the slice df.loc[aaa.index[0]:, 'data'], loc overwrites the rows from the first non-NaN index onward.
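A possible alternative formulation of the same idea (a sketch, starting again from the original df and writing to a new, made-up column name 'bars_since'):

df = pd.DataFrame({'data': [np.nan, np.nan, np.nan, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0]})
# every 0 starts a new stretch; count positions within each stretch
groups = df['data'].eq(0).cumsum()
df['bars_since'] = df.groupby(groups).cumcount().where(df['data'].notna())
print(df)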

pandas groupby only aggregating rows that are common between two consecutive fields that are grouped

I am trying to calculate a sum for each date field; however, I only want to calculate the sum of IDs that are in both the current and next date, so a rolling comparison of IDs and then a groupby sum. Currently I have to loop over the dataframe, which is very slow.
For example my df:
df = pd.DataFrame({
    'Date':  [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    'ID':    [1, 2, 3, 4, 2, 3, 4, 2, 3, 4, 5, 1, 2, 3, 4],
    'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
})
Ideally I want to group the dataframe by Date and only sum the IDs that are common between two dates, for example below. However this is very slow.
tmpL = df.groupby('Date')['ID'].apply(list)
tmpV = df.groupby('Date')['Value'].sum()
for i in range(1, tmpL.shape[0]):
    res = list(set(tmpL.iloc[i]) - set(tmpL.iloc[i - 1]))
    v = df.loc[df.ID.isin(res) & (df.Date == tmpL.index[i]), 'Value'].sum()
    tmpV.iloc[i] = tmpV.iloc[i] - v
tmpV
Date
1 10
2 18
3 27
4 42
Name: Value, dtype: int64
Is there a way to do this in pandas without looping over the dataframe?
Use DataFrame.pivot_table with the sum aggregation, detect where an ID newly appears with DataFrame.notna and DataFrame.diff, and pass that to DataFrame.mask before summing:
df1 = df.pivot_table(index='Date', columns='ID', values='Value', aggfunc='sum')
s = df1.mask(df1.notna().diff().fillna(False)).sum(axis=1)
print (s)
Date
1 10.0
2 18.0
3 27.0
4 42.0
dtype: float64
First solution, which I think is slower:
You can get the IDs not shared with the previous date by converting each group to a set and applying Series.diff and Series.explode, pull their values from the original with DataFrame.merge, and finally aggregate the sum and subtract:
tmpL = (df.groupby('Date')['ID'].apply(set)
          .diff()
          .explode()
          .reset_index()
          .merge(df)
          .groupby('Date')['Value']
          .sum())
tmpV = df.groupby('Date')['Value'].sum()
out = tmpV.sub(tmpL, fill_value=0)
print (out)
Date
1 10.0
2 18.0
3 27.0
4 42.0
Try:
df = df.pivot_table(index='Date', columns='ID', values='Value')#.reset_index()
condition = df.notna() & df.notna().shift(1)
condition.iloc[0,:]=True
print(df[condition].sum(axis=1))
Output:
Date
1 10.0
2 18.0
3 27.0
4 42.0
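Another possible approach (only a sketch; it assumes each ID appears at most once per date and that the Date values are consecutive integers, as in the sample) is a self-merge against the previous date:

# IDs present on the previous date, shifted forward one step
prev = df[['Date', 'ID']].assign(Date=lambda d: d['Date'] + 1)
# keep only rows whose ID also appeared on the previous date
kept = df.merge(prev, on=['Date', 'ID'])
# the first date has no previous date, so it keeps its full sum
first = df[df['Date'] == df['Date'].min()]
print(pd.concat([first, kept]).groupby('Date')['Value'].sum())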

Keep the last n real values of uneven rows in a dataframe?

I am collecting heart rate values over the course of time. Each subject varies in the length of time that data was collected. I would like to make a table of the last 2 seconds of collected data.
import pandas as pd
import numpy as np
#example data
example_s = [["4/20/21 4:20", 302, 0, 0, 1, 2, 3, np.NaN, np.NaN],
             ["2/17/21 9:20", 135, 1, 1.4, 8, 10, np.NaN, np.NaN, np.NaN],
             ["2/17/21 9:20", 111, 5, 5, 1, np.NaN, np.NaN, np.NaN, np.NaN]]
example_s_table = pd.DataFrame(example_s, columns=['Date_Time', 'CID', 0, 1, 2, 3, 4, 5, 6])
desired_outcome = [["4/20/21 4:20", 302, 1, 2, 3],
                   ["2/17/21 9:20", 135, 1.4, 8, 10],
                   ["2/17/21 9:20", 111, 5, 5, 1]]
desired_outcome_table = pd.DataFrame(desired_outcome, columns=['Date_Time', 'CID', "Second 1", "Second 2", "Second 3"])
I can see how to collect a single instance of the data from the example shown here, but would like to know how to quickly add multiple values to my table:
desired_outcome_table["Last Second"]=example_s_table.iloc[:,1:].ffill(axis=1).iloc[:, -1]
Python Dataframe Get Value of Last Non Null Column for Each Row
Try:
df = example_s_table.copy()
df = df.set_index(['Date_Time', 'CID'])
df_out = df.mask(df.eq(0))\
           .apply(lambda x: pd.Series(x.dropna().tail(3).values), axis=1)\
           .rename(columns=lambda x: f'Second {x+1}')
df_out['Last Second'] = df_out['Second 3']
print(df_out.reset_index())
Output:
Date_Time CID Second 1 Second 2 Second 3 Last Second
0 4/20/21 4:20 302 1.0 2.0 3.0 3.0
1 2/17/21 9:20 135 1.4 8.0 10.0 10.0
2 2/17/21 9:20 111 5.0 5.0 1.0 1.0
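If the row-wise apply becomes slow on a larger table, a NumPy-based sketch (assuming every row has at least three non-null readings, as in the sample; unlike the answer above it does not mask zeros, which happens to give the same result here) could look like this:

import numpy as np

vals = example_s_table.iloc[:, 2:].to_numpy(dtype=float)
# take the last three non-NaN readings of each row
last3 = np.vstack([row[~np.isnan(row)][-3:] for row in vals])
out = pd.DataFrame(last3, columns=['Second 1', 'Second 2', 'Second 3'])
out.insert(0, 'CID', example_s_table['CID'])
out.insert(0, 'Date_Time', example_s_table['Date_Time'])
print(out)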

python pandas query date variable in quarterly format

I need to query a pandas DataFrame handed to me which contains dates in a quarterly format.
Data
import pandas as pd
import datetime
table = [[datetime.datetime(2015, 1, 1), 1, 0.5],
         [datetime.datetime(2015, 1, 27), 1, 0.5],
         [datetime.datetime(2015, 1, 31), 1, 0.5],
         [datetime.datetime(2015, 4, 1), 1, 2],
         [datetime.datetime(2015, 4, 3), 1, 2],
         [datetime.datetime(2015, 4, 15), 1, 2],
         [datetime.datetime(2015, 5, 28), 1, 2],
         [datetime.datetime(2015, 5, 1), 1, 3],
         [datetime.datetime(2015, 5, 17), 1, 3],
         [datetime.datetime(2015, 8, 31), 1, 3]]
df = pd.DataFrame(table, columns=['Date', 'Id', 'Value'])
df = df.assign(Date = lambda x: x.Date.dt.to_period('Q'))
Code
df.query("Date == '2015Q2'")
results in an empty dataframe.
For me it works if you compare against a quarter Period:
df = df.query("Date == #pd.Period('2015Q2', 'Q')")
print (df)
Date Id Value
3 2015Q2 1 2.0
4 2015Q2 1 2.0
5 2015Q2 1 2.0
6 2015Q2 1 2.0
7 2015Q2 1 3.0
8 2015Q2 1 3.0
If you use boolean indexing, it works correctly:
df = df[df["Date"] == '2015Q2']
print (df)
Date Id Value
3 2015Q2 1 2.0
4 2015Q2 1 2.0
5 2015Q2 1 2.0
6 2015Q2 1 2.0
7 2015Q2 1 3.0
8 2015Q2 1 3.0

Pandas: Find first occurrences of elements that appear in a certain column

Let's assume that I have the following data-frame:
df_raw = pd.DataFrame({"id": [102, 102, 103, 103, 103], "val1": [9,2,4,7,6], "val2": [np.nan, 3, np.nan, 4, 5], "val3": [4, np.nan, np.nan, 5, 1], "date": [pd.Timestamp(2002, 1, 1), pd.Timestamp(2002, 3, 3), pd.Timestamp(2003, 4, 4), pd.Timestamp(2003, 8, 9), pd.Timestamp(2005, 2, 3)]})
I want to have access to the rows where the first occurrence of each id is. So these rows would be:
df_first = pd.DataFrame({"id": [102, 103], "val1": [9, 4], "val2": [np.nan, np.nan], "val3": [4, np.nan], "date": [pd.Timestamp(2002, 1, 1), pd.Timestamp(2003, 4, 4)]})
Basically, what I would like to achieve at the end is to fill in the NaNs that appear in the first occurrence of each id. So the final data frame might be:
df_processed = pd.DataFrame({"id": [102, 102, 103, 103, 103], "val1": [9,2,4,7,6], "val2": [-1, 3, -1, 4, 5], "val3": [4, np.nan, -1, 5, 1], "date": [pd.Timestamp(2002, 1, 1), pd.Timestamp(2002, 3, 3), pd.Timestamp(2003, 4, 4), pd.Timestamp(2003, 8, 9), pd.Timestamp(2005, 2, 3)]})
An important note is that the rows are already grouped by id and date and sorted in an ascending manner. So they appear exactly as in the provided example.
IIUC, use drop_duplicates then concat:
df1=df_raw.drop_duplicates('id').fillna(-1)
target=pd.concat([df1,df_raw.loc[~df_raw.index.isin(df1.index)]]).sort_index()
target
date id val1 val2 val3
0 2002-01-01 102 9 -1.0 4.0
1 2002-03-03 102 2 3.0 NaN
2 2003-04-04 103 4 -1.0 -1.0
3 2003-08-09 103 7 4.0 5.0
4 2005-02-03 103 6 5.0 1.0
You can use pd.Series.duplicated with Boolean row indexing:
mask = ~df_raw['id'].duplicated()
val_cols = ['val2', 'val3']
df_raw.loc[mask, val_cols] = df_raw.loc[mask, val_cols].fillna(-1)
print(df_raw)
id val1 val2 val3 date
0 102 9 -1.0 4.0 2002-01-01
1 102 2 3.0 NaN 2002-03-03
2 103 4 -1.0 -1.0 2003-04-04
3 103 7 4.0 5.0 2003-08-09
4 103 6 5.0 1.0 2005-02-03
