pandas group by chunks - python

I have a data set:
from datetime import datetime
import pandas as pd

df = pd.DataFrame({
'service': ['a', 'a', 'a', 'b', 'c', 'a', 'a'],
'status': ['problem', 'problem', 'ok', 'problem', 'ok', 'problem', 'ok'],
'created': [
datetime(2019, 1, 1, 1, 1, 0),
datetime(2019, 1, 1, 1, 1, 10),
datetime(2019, 1, 1, 1, 2, 0),
datetime(2019, 1, 1, 1, 3, 0),
datetime(2019, 1, 1, 1, 5, 0),
datetime(2019, 1, 1, 1, 10, 0),
datetime(2019, 1, 1, 1, 20, 0),
],
})
print(df.head(10))
  service   status             created
0       a  problem 2019-01-01 01:01:00  # -\
1       a  problem 2019-01-01 01:01:10  # --> one group
2       a       ok 2019-01-01 01:02:00  # -/
3       b  problem 2019-01-01 01:03:00
4       c       ok 2019-01-01 01:05:00
5       a  problem 2019-01-01 01:10:00  # -\
6       a       ok 2019-01-01 01:20:00  # --> one group
As you can see, service a changed status from problem to ok twice (items 0-2 and items 5-6). Items 3 and 4 have no status change (only one record each, so they form no group/chunk). I need to create the following data set:
  service  downtime_seconds
0       a                60  # `created` difference between 2 and 0
1       a               600  # `created` difference between 6 and 5
I can do it through iteration:
for i in range(len(df.index)):
# if df.loc[i]['status'] blablabla...
Is it possible to do it using pandas without iteration? Maybe there is a more elegant method?
Thank you.

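In your case we need to build the groupby key by reversing the row order and taking the cumsum of status == 'ok', so that each 'ok' row gets the same label as the 'problem' rows before it. Before grouping, filter out the chunks that mix several services, using nunique with transform: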
s=df.status.eq('ok').iloc[::-1].cumsum()
con=df.service.groupby(s).transform('nunique')==1
df_g=df[con].groupby(s).agg({'service':'first','created':lambda x : (x.iloc[-1]-x.iloc[0]).seconds})
Out[124]:
       service  created
status
1            a      600
3            a       60
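The same idea can be written out step by step. This is only a sketch under a few assumptions: the names chunk, same_service, start and end are made up for illustration, the result is sorted chronologically to match the layout asked for in the question, and total_seconds() is used instead of .seconds so that an outage longer than a day is not truncated.

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame({
    'service': ['a', 'a', 'a', 'b', 'c', 'a', 'a'],
    'status': ['problem', 'problem', 'ok', 'problem', 'ok', 'problem', 'ok'],
    'created': [
        datetime(2019, 1, 1, 1, 1, 0),
        datetime(2019, 1, 1, 1, 1, 10),
        datetime(2019, 1, 1, 1, 2, 0),
        datetime(2019, 1, 1, 1, 3, 0),
        datetime(2019, 1, 1, 1, 5, 0),
        datetime(2019, 1, 1, 1, 10, 0),
        datetime(2019, 1, 1, 1, 20, 0),
    ],
})

# Count 'ok' rows from the bottom up, so each 'ok' row shares a label
# with the 'problem' rows that precede it (hypothetical name: chunk).
chunk = df['status'].eq('ok')[::-1].cumsum()[::-1].rename('chunk')

# Keep only the chunks whose rows all belong to one service.
same_service = df.groupby(chunk)['service'].transform('nunique').eq(1)

result = (
    df[same_service]
    .groupby(chunk[same_service])
    .agg(service=('service', 'first'),
         start=('created', 'first'),
         end=('created', 'last'))
    .sort_values('start')
    .reset_index(drop=True)
)
# total_seconds() keeps the day component, unlike the .seconds attribute.
result['downtime_seconds'] = (result['end'] - result['start']).dt.total_seconds()
result = result[['service', 'downtime_seconds']]
print(result)
```

Note that a chunk consisting of a single 'ok' row for one service would pass the filter and show up with a downtime of 0; whether that is wanted depends on the data.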

Related

DataFrame How to vectorize this for loop?

I need help vectorizing this for loop.
I couldn't come up with my own solution.
The general idea is that I want to calculate the number of bars since the last time the condition was true.
I have a DataFrame with initial values 0 and 1, where 0 is the anchor point at which counting starts and stops (0 means that the condition was met for the index in this cell).
For example, the initial DataFrame would look like this (I am typing only the raw series values and omitting column names etc.):
NaN, NaN, NaN, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0
The output should look like this:
NaN, NaN, NaN, 0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 2, 3, 4, 5, 0
My current code:
cond_count = pd.DataFrame(index=range(cond.shape[0]), columns=range(1))
cond_count.rename(columns={0: 'Bars Since'})
cond_count['Bars Since'] = 'NaN'
cond_count['Bars Since'].iloc[indices_cond_met] = 0
cond_count['Bars Since'].iloc[indices_cond_cut] = 1
for i in range(cond_count.shape[0]):
    if cond_count['Bars Since'].iloc[i] == 'NaN':
        pass
    elif cond_count['Bars Since'].iloc[i] == 0:
        while cond_count['Bars Since'].iloc[j] != 0:
            cond_count['Bars Since'].iloc[j] = cond_count['Bars Since'].shift(1).iloc[j] + 1
    else:
        pass
import numpy as np
import pandas as pd
df = pd.DataFrame({'data': [np.nan, np.nan, np.nan, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0]})
df['cs'] = df['data'].le(0).cumsum()
aaa = df.groupby(['data', 'cs'])['data'].apply(lambda x: x.cumsum())
df.loc[aaa.index[0]:, 'data'] = aaa
df = df.drop(['cs'], axis=1)#if you need to remove the auxiliary column
print(df)
Output
data
0 NaN
1 NaN
2 NaN
3 0.0
4 1.0
5 2.0
6 3.0
7 4.0
8 0.0
9 1.0
10 2.0
11 0.0
12 1.0
13 2.0
14 3.0
15 4.0
16 5.0
17 0.0
Here le(0) is used to get True wherever the value is 0 (NaN compares as False).
cumsum() on that mask then labels the rows with a group number that increases at every 0.
The Series aaa is built by grouping on the columns 'data' and 'cs' and applying cumsum() to each group, so every run of 1s becomes 1, 2, 3, ...
Finally, df.loc[aaa.index[0]:, 'data'] overwrites the rows starting from the first index label in aaa, which leaves the leading NaN rows untouched.
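A shorter variant of the same trick, as a sketch: group only on the running count of zeros and call the groupby's own cumsum(), which avoids apply and the index slicing. The names block and bars_since are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'data': [np.nan, np.nan, np.nan,
                            0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0]})

# Every 0 starts a new block; the leading NaN rows fall into block 0.
block = df['data'].eq(0).cumsum()

# Cumulative sum inside each block counts the 1s since the last 0.
# NaN inputs stay NaN in the output.
df['bars_since'] = df.groupby(block)['data'].cumsum()
print(df)
```

Because the group key is passed as a Series, no auxiliary column has to be added and dropped again.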

How to concatenate multiple rows into a single row and repeate this operation over a big dataframe?

I'm working with a data frame containing 582,260 rows and 24 columns. Each row is a time series of 24 hourly values, and 20 rows (days) belong to id_1, 20 to id_2, and so on up to id_N. I would like to concatenate all 20 rows of id_1 into a single row, so that the concatenated time series has length 480 (20 days * 24 hrs/day), and repeat this operation for id_1 through id_N.
A much reduced and reproducible version of my data frame is shown below (the ID column should be an index, but I reset it for iteration purposes):
df = pd.DataFrame([['id1', 1, 1, 3, 4, 1], ['id1', 0, 1, 5, 2, 1], ['id1', 3, 4, 5, 0, 0],
['id2', 1, 1, 8, 0, 6], ['id2', 5, 3, 1, 1, 2], ['id2', 5, 4, 5, 2, 7]],
columns = ['ID', 'h0', 'h1', 'h2', 'h3', 'h4'] )
I tried the following function to iterate over the rows of the data frame, but it doesn't give me the expected output.
def concatenation(df):
    for i, row in df.iterrows():
        if df.ix[i]['ID'] == df.ix[i+1]['ID']:
            pd.concat([df], axis = 1)
    return(df)
concatenation(df)
The expected output should look like this:
df = pd.DataFrame([['id1', 1, 1, 3, 4, 1, 0, 1, 5, 2, 1, 3, 4, 5, 0, 0],
['id2', 1, 1, 8, 0, 6, 5, 3, 1, 1, 2, 5, 4, 5, 2, 7]],
columns = ['ID', 'h0', 'h1', 'h2', 'h3', 'h4',
'h0', 'h1', 'h2', 'h3', 'h4',
'h0', 'h1', 'h2', 'h3', 'h4'])
Is there a compact and elegant way of programming this task with pandas tools?
Thank you in advance for your help.
First add a column day, then create a hierarchical index of ID and day that then gets unstacked:
df['day'] = df.groupby('ID').cumcount()
df = df.set_index(['ID','day'])
res = df.unstack()
Intermediate result:
    h0       h1       h2       h3       h4
day  0  1  2  0  1  2  0  1  2  0  1  2  0  1  2
ID
id1  1  0  3  1  1  4  3  5  5  4  2  0  1  1  0
id2  1  5  5  1  3  4  8  1  5  0  1  2  6  2  7
Now we flatten the index and re-order the columns as requested:
res.columns = [f"{y}{x}" for x, y in res.columns]  # set_axis(..., inplace=True) is no longer available in pandas 2.x
res = res.reindex(sorted(res.columns), axis=1)
Final result:
     0h0  0h1  0h2  0h3  0h4  1h0  1h1  1h2  1h3  1h4  2h0  2h1  2h2  2h3  2h4
ID
id1    1    1    3    4    1    0    1    5    2    1    3    4    5    0    0
id2    1    1    8    0    6    5    3    1    1    2    5    4    5    2    7
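One caveat when scaling this to 20 days and 24 hours: sorting the flattened names as strings puts "10h0" before "2h0". A possible way around that, sketched here, is to fix the column order while the columns are still a MultiIndex and only then flatten them. The d{day}_{hour} naming and the variable names are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [['id1', 1, 1, 3, 4, 1], ['id1', 0, 1, 5, 2, 1], ['id1', 3, 4, 5, 0, 0],
     ['id2', 1, 1, 8, 0, 6], ['id2', 5, 3, 1, 1, 2], ['id2', 5, 4, 5, 2, 7]],
    columns=['ID', 'h0', 'h1', 'h2', 'h3', 'h4'],
)

# Hour columns in their original left-to-right order.
hours = [c for c in df.columns if c != 'ID']

wide = (
    df.assign(day=df.groupby('ID').cumcount())  # 0, 1, 2, ... within each ID
      .set_index(['ID', 'day'])
      .unstack('day')                           # columns become (hour, day)
)

# Reorder day-major using the numeric day values, not their string form.
days = wide.columns.get_level_values('day').unique()
order = pd.MultiIndex.from_tuples([(h, d) for d in days for h in hours])
wide = wide.reindex(columns=order)

# Flatten to single-level names such as d0_h0, d0_h1, ..., d2_h4.
wide.columns = [f"d{d}_{h}" for h, d in wide.columns]
print(wide)
```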
Alternatively, you can use a defaultdict(list) and its .extend() method to store all the values in their original order and produce exactly the output you described.
This requires an explicit loop over the rows, though, which is not recommended for large dataframes.
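For completeness, the defaultdict idea could look roughly like this. It is a sketch only: the names rows and result are made up, and the resulting columns are plain positions 0..14 rather than repeated h0..h4 labels.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame(
    [['id1', 1, 1, 3, 4, 1], ['id1', 0, 1, 5, 2, 1], ['id1', 3, 4, 5, 0, 0],
     ['id2', 1, 1, 8, 0, 6], ['id2', 5, 3, 1, 1, 2], ['id2', 5, 4, 5, 2, 7]],
    columns=['ID', 'h0', 'h1', 'h2', 'h3', 'h4'],
)

rows = defaultdict(list)
for record in df.itertuples(index=False):
    # record[0] is the ID, the remaining fields are the hourly values.
    rows[record.ID].extend(record[1:])

# One row per ID; this assumes every ID has the same number of days.
result = pd.DataFrame.from_dict(rows, orient='index')
print(result)
```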

Random selection of one value among different columns?

Suppose I have the following data frame
from pandas import DataFrame
Cars = { 'value': [10, 31, 661, 1, 51, 61, 551],
'action1': [1, 1, 1, 1, 1, 1, 1],
'price1': [ 12,0, 15,3, 0, 12,0],
'action2': [2, 2, 2, 2, 2, 2, 2],
'price2': [ 0, 16, 19, 0, 1, 10,0],
'action3': [3, 3, 3, 3, 3, 3, 3],
'price3': [ 14, 36, 9, 0, 0, 0,0]
}
df = DataFrame(Cars,columns= ['value', 'action1', 'price1', 'action2', 'price2', 'action3', 'price3'])
print (df)
How can I randomly select one (action, price) pair per row from the three column pairs? As a result I want a dataframe that looks something like this:
RandCars = {'value': [10, 31, 661, 1, 51, 61, 551],
'action': [1, 3, 1, 3, 1, 2, 2],
'price': [ 12, 36, 15, 0, 3, 10, 0]
}
df2 = DataFrame(RandCars, columns = ['value','action', 'price'])
print(df2)
Use:
import numpy as np
import pandas as pd

#get column names that do not start with action or price
cols = df.columns[~df.columns.str.startswith(('action','price'))]
print (cols)
Index(['value'], dtype='object')
#convert filtered columns to 2 numpy arrays
arr1 = df.filter(regex='^action').values
arr2 = df.filter(regex='^price').values
#pandas 0.24+
#arr1 = df.filter(regex='^action').to_numpy()
#arr2 = df.filter(regex='^price').to_numpy()
i, c = arr1.shape
#pick one random column position per row and take action and price from that position
idx = np.random.choice(np.arange(c), i)
df3 = pd.DataFrame({'action': arr1[np.arange(len(df)), idx],
'price': arr2[np.arange(len(df)), idx]},
index=df.index)
print (df3)
action price
0 2 0
1 3 36
2 3 9
3 1 3
4 3 0
5 1 12
6 1 0
#add all another columns by join
df4 = df[cols].join(df3)
print (df4)
value action price
0 10 2 0
1 31 3 36
2 661 3 9
3 1 1 3
4 51 3 0
5 61 1 12
6 551 1 0
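If the random picks need to be reproducible, the same indexing can be driven by a seeded NumPy Generator. This is a sketch: the seed value 0 and the names rng, pick and result are arbitrary choices for the example, and it assumes the action and price columns appear in matching order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'value': [10, 31, 661, 1, 51, 61, 551],
    'action1': [1, 1, 1, 1, 1, 1, 1],
    'price1': [12, 0, 15, 3, 0, 12, 0],
    'action2': [2, 2, 2, 2, 2, 2, 2],
    'price2': [0, 16, 19, 0, 1, 10, 0],
    'action3': [3, 3, 3, 3, 3, 3, 3],
    'price3': [14, 36, 9, 0, 0, 0, 0],
})

rng = np.random.default_rng(0)  # fixed seed so repeated runs give the same picks

actions = df.filter(regex=r'^action').to_numpy()
prices = df.filter(regex=r'^price').to_numpy()
n_rows, n_choices = actions.shape

# One random column position per row, shared by the action and price arrays
# so that each picked price belongs to the picked action.
pick = rng.integers(n_choices, size=n_rows)
rows = np.arange(n_rows)

result = df[['value']].assign(action=actions[rows, pick],
                              price=prices[rows, pick])
print(result)
```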

Pandas: Last time when a column had a non-nan value

Let's assume that I have the following data-frame:
df = pd.DataFrame({"id": [1, 1, 1, 2, 2], "nominal": [1, np.nan, 1, 1, np.nan], "numeric1": [3, np.nan, np.nan, 7, np.nan], "numeric2": [2, 3, np.nan, 2, np.nan], "numeric3": [np.nan, 2, np.nan, np.nan, 3], "date":[pd.Timestamp(2005, 6, 22), pd.Timestamp(2006, 2, 11), pd.Timestamp(2008, 9, 13), pd.Timestamp(2009, 5, 12), pd.Timestamp(2010, 5, 9)]})
As output, I want to get a data-frame that indicates the number of days that have passed since a non-nan value was last seen for that column, for that id. If a column has a value for the corresponding date, or if a column doesn't have a value at the start of a new id, the value should be 0. In addition, this should be computed only for the numeric columns. With that said, the output data-frame should be:
output_df = pd.DataFrame({"numeric1_delta": [0, 234, 1179, 0, 362], "numeric2_delta": [0, 0, 945, 0, 362], "numeric3_delta": [0, 0, 945, 0, 0]})
Looking forward to your answers!
You can group by the cumsum of the non-null mask and then subtract the first date of each group:
In [11]: df.numeric1.notnull().cumsum()
Out[11]:
0 1
1 1
2 1
3 2
4 2
Name: numeric1, dtype: int64
In [12]: df.groupby(df.numeric1.notnull().cumsum()).date.transform(lambda x: x.iloc[0])
Out[12]:
0 2005-06-22
1 2005-06-22
2 2005-06-22
3 2009-05-12
4 2009-05-12
Name: date, dtype: datetime64[ns]
In [13]: df.date - df.groupby(df.numeric1.notnull().cumsum()).date.transform(lambda x: x.iloc[0])
Out[13]:
0 0 days
1 234 days
2 1179 days
3 0 days
4 362 days
Name: date, dtype: timedelta64[ns]
For multiple columns:
ncols = [col for col in df.columns if col.startswith("numeric")]
for c in ncols:
    df[c + "_delta"] = df.date - df.groupby(df[c].notnull().cumsum()).date.transform('first')
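The loop above groups only on the run counter, so a run can continue across an id boundary; for numeric3 that gives a non-zero delta on the first row of id 2, where the expected output has 0. A sketch that also groups by id, and converts the result to whole days, is shown below. It assumes the rows are already sorted by date within each id; the names deltas, run and start are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "nominal": [1, np.nan, 1, 1, np.nan],
    "numeric1": [3, np.nan, np.nan, 7, np.nan],
    "numeric2": [2, 3, np.nan, 2, np.nan],
    "numeric3": [np.nan, 2, np.nan, np.nan, 3],
    "date": [pd.Timestamp(2005, 6, 22), pd.Timestamp(2006, 2, 11),
             pd.Timestamp(2008, 9, 13), pd.Timestamp(2009, 5, 12),
             pd.Timestamp(2010, 5, 9)],
})

numeric_cols = [c for c in df.columns if c.startswith("numeric")]
deltas = pd.DataFrame(index=df.index)

for col in numeric_cols:
    # Run counter: increases by one at every non-NaN value in this column.
    run = df[col].notna().cumsum()
    # Grouping on (id, run) restarts the run whenever the id changes.
    start = df.groupby([df["id"], run])["date"].transform("first")
    deltas[col + "_delta"] = (df["date"] - start).dt.days

print(deltas)
```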

Pandas get_group method on DatetimeIndexResamplerGroupby

Question: Does the get_group method work on a DataFrame with a DatetimeIndexResamplerGroupby index? If so, what is the appropriate syntax?
Sample data:
data = [[2, 4, 1, datetime.datetime(2017, 1, 1)],
[2, 4, 2, datetime.datetime(2017, 1, 5)],
[3, 4, 1, datetime.datetime(2017, 1, 7)]]
df1 = pd.DataFrame(data, columns=list('abc') + ['dates'])
gb3 = df1.set_index('dates').groupby('a').resample('D')
DatetimeIndexResamplerGroupby [freq=<Day>, axis=0, closed=left, label=left, convention=e, base=0]
gb3.sum()
                a    b    c
a dates
2 2017-01-01  2.0  4.0  1.0
  2017-01-02  NaN  NaN  NaN
  2017-01-03  NaN  NaN  NaN
  2017-01-04  NaN  NaN  NaN
  2017-01-05  2.0  4.0  2.0
3 2017-01-07  3.0  4.0  1.0
The get_group method is working for me on a pandas.core.groupby.DataFrameGroupBy object.
I've tried various approaches; the typical error is TypeError: Cannot convert input [(0, 1)] of type <class 'tuple'> to Timestamp
The below should be what you're looking for (if I understand the question correctly):
import pandas as pd
import datetime

data = [[2, 4, 1, datetime.datetime(2017, 1, 1)],
[2, 4, 2, datetime.datetime(2017, 1, 5)],
[3, 4, 1, datetime.datetime(2017, 1, 7)]]
df1 = pd.DataFrame(data, columns=list('abc') + ['dates'])
gb3 = df1.groupby(['a',pd.Grouper(key='dates')])
gb3.get_group((2, '2017-01-01'))

Out[14]:
a b c dates
0 2 4 1 2017-01-01
I believe resample/pd.Grouper can be used interchangeably in this case (someone correct me if I'm wrong). Let me know if this works for you.
Yes, it does. The following code returns the sum of the monthly values for the year 2015:
df.resample('MS').sum().resample('Y').get_group('2015-12-31')
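If get_group on the resampler or on a Grouper behaves differently across pandas versions, one fallback is to bin the timestamps into an ordinary column first and group on plain columns, so that get_group takes a regular tuple key. This is a sketch of that workaround, not a statement about how the resampler itself handles get_group; the column name day is made up for the example.

```python
import datetime

import pandas as pd

data = [[2, 4, 1, datetime.datetime(2017, 1, 1)],
        [2, 4, 2, datetime.datetime(2017, 1, 5)],
        [3, 4, 1, datetime.datetime(2017, 1, 7)]]
df1 = pd.DataFrame(data, columns=list('abc') + ['dates'])

# Truncate each timestamp to midnight, which plays the role of the 'D' bin.
df1['day'] = df1['dates'].dt.normalize()

gb = df1.groupby(['a', 'day'])

# The key is a plain (a, day) tuple.
group = gb.get_group((2, pd.Timestamp('2017-01-01')))
print(group)
```

Unlike resample('D'), this does not create rows for days without data; it only selects groups that actually occur.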
